pyspark.RDD.groupBy

RDD.groupBy(f, numPartitions=None, partitionFunc=<function portable_hash>)

Return an RDD of grouped items, where elements are grouped by the key computed by f.

New in version 0.7.0.

Parameters
f : function

a function to compute the key

numPartitions : int, optional

the number of partitions in the new RDD

partitionFunc : function, optional, default portable_hash

a function to compute the partition index

Returns
RDD

a new RDD of (key, grouped items) pairs, where the grouped items are exposed as an iterable

Examples

>>> rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
>>> result = rdd.groupBy(lambda x: x % 2).collect()
>>> sorted([(x, sorted(y)) for (x, y) in result])
[(0, [2, 8]), (1, [1, 1, 3, 5])]
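
The numPartitions argument can be used to control the parallelism of the resulting RDD. The following is a minimal sketch in the same doctest style, assuming the same running SparkContext bound to sc as above; the word data is illustrative only. Note that each group's values arrive as an iterable, so they are materialized here with sorted() (or list()) before inspection.

>>> rdd = sc.parallelize(["apple", "avocado", "banana", "blueberry", "cherry"])
>>> grouped = rdd.groupBy(lambda w: w[0], numPartitions=2)
>>> sorted((k, sorted(v)) for k, v in grouped.collect())
[('a', ['apple', 'avocado']), ('b', ['banana', 'blueberry']), ('c', ['cherry'])]
>>> grouped.getNumPartitions()
2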