    misleading task number of groupByKey · 9c40b9ea
    Chen Chao authored
    The statement "By default, this uses only 8 parallel tasks to do the grouping." is very misleading. Please refer to https://github.com/apache/spark/pull/389
    
    The details are in the following code:
    
      def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
        val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
        for (r <- bySize if r.partitioner.isDefined) {
          return r.partitioner.get
        }
        if (rdd.context.conf.contains("spark.default.parallelism")) {
          new HashPartitioner(rdd.context.defaultParallelism)
        } else {
          new HashPartitioner(bySize.head.partitions.size)
        }
      }
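
The selection order in `defaultPartitioner` can be tried out without a Spark cluster. The following is only a minimal sketch: `FakeRdd`, `FakePartitioner`, and `choose` are hypothetical stand-ins, not Spark APIs, and the `Option[Int]` argument is an assumed way to model whether `spark.default.parallelism` is set.

```scala
object DefaultPartitionerSketch {
  // Hypothetical stand-ins for Spark's HashPartitioner and RDD (illustration only).
  final case class FakePartitioner(numPartitions: Int)
  final case class FakeRdd(numPartitions: Int, partitioner: Option[FakePartitioner] = None)

  // Mirrors the three cases of defaultPartitioner:
  //   1. reuse the partitioner of the largest RDD that already has one;
  //   2. otherwise use spark.default.parallelism, if it is set;
  //   3. otherwise use the partition count of the largest RDD.
  def choose(rdds: Seq[FakeRdd], defaultParallelism: Option[Int]): FakePartitioner = {
    val bySize = rdds.sortBy(_.numPartitions).reverse
    bySize.flatMap(_.partitioner).headOption
      .orElse(defaultParallelism.map(n => FakePartitioner(n)))
      .getOrElse(FakePartitioner(bySize.head.numPartitions))
  }
}
```

Under these assumptions, `choose(Seq(FakeRdd(4), FakeRdd(16)), None)` yields a partitioner with 16 partitions, so the number of tasks follows the largest parent RDD rather than a fixed value of 8.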
    
    Author: Chen Chao <crazyjvm@gmail.com>
    
    Closes #403 from CrazyJvm/patch-4 and squashes the following commits:
    
    42f6c9e [Chen Chao] fix format
    829a995 [Chen Chao] fix format
    1568336 [Chen Chao] misleading task number of groupByKey