Skip to content
Snippets Groups Projects
  • Michael Armbrust's avatar
    accd0999
    [SQL] SPARK-1371 Hash Aggregation Improvements · accd0999
    Michael Armbrust authored
    Given:
    ```scala
    case class Data(a: Int, b: Int)
    val rdd =
      sparkContext
        .parallelize(1 to 200)
        .flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
    rdd.registerAsTable("data")
    cacheTable("data")
    ```
    Before:
    ```
    SELECT COUNT(*) FROM data:[10000000]
    16795.567ms
    SELECT a, SUM(b) FROM data GROUP BY a
    7536.436ms
    SELECT SUM(b) FROM data
    10954.1ms
    ```
    
    After:
    ```
    SELECT COUNT(*) FROM data:[10000000]
    1372.175ms
    SELECT a, SUM(b) FROM data GROUP BY a
    2070.446ms
    SELECT SUM(b) FROM data
    958.969ms
    ```
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #295 from marmbrus/hashAgg and squashes the following commits:
    
    ec63575 [Michael Armbrust] Add comment.
    d0495a9 [Michael Armbrust] Use scaladoc instead.
    b4a6887 [Michael Armbrust] Address review comments.
    a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization.
    7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections.  Remove unused local RDD functions implicits.
    5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable.
    6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically.
    92cca08 [Michael Armbrust] Always include serialization debug info when running tests.
    1279df2 [Michael Armbrust] Increase default number of partitions.
    accd0999
    History
    [SQL] SPARK-1371 Hash Aggregation Improvements
    Michael Armbrust authored
    Given:
    ```scala
    case class Data(a: Int, b: Int)
    val rdd =
      sparkContext
        .parallelize(1 to 200)
        .flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
    rdd.registerAsTable("data")
    cacheTable("data")
    ```
    Before:
    ```
    SELECT COUNT(*) FROM data:[10000000]
    16795.567ms
    SELECT a, SUM(b) FROM data GROUP BY a
    7536.436ms
    SELECT SUM(b) FROM data
    10954.1ms
    ```
    
    After:
    ```
    SELECT COUNT(*) FROM data:[10000000]
    1372.175ms
    SELECT a, SUM(b) FROM data GROUP BY a
    2070.446ms
    SELECT SUM(b) FROM data
    958.969ms
    ```
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #295 from marmbrus/hashAgg and squashes the following commits:
    
    ec63575 [Michael Armbrust] Add comment.
    d0495a9 [Michael Armbrust] Use scaladoc instead.
    b4a6887 [Michael Armbrust] Address review comments.
    a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization.
    7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections.  Remove unused local RDD functions implicits.
    5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable.
    6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically.
    92cca08 [Michael Armbrust] Always include serialization debug info when running tests.
    1279df2 [Michael Armbrust] Increase default number of partitions.
hercules_cg NaN GiB