    [SPARK-2538] [PySpark] Hash based disk spilling aggregation · 14174abd
    Davies Liu authored
    During aggregation in the Python worker, if the memory usage is above spark.executor.memory, it will do disk-spilling aggregation.
    
    It will split the aggregation into multiple stages; in each stage, it will partition the aggregated data by hash and dump the partitions to disk. After all the data are aggregated, it will merge all the stages together (partition by partition).
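
    The spill-then-merge scheme described above can be sketched as follows. This is a minimal illustration, not the actual PySpark ExternalMerger: the class name, the memory_limit counter (item count instead of real memory measurement), and the pickle-per-partition file layout are all simplifying assumptions.

    ```python
    import gc
    import os
    import pickle
    import tempfile
    from collections import defaultdict

    class ExternalMergerSketch:
        """Hypothetical sketch of hash-based disk-spilling aggregation."""

        def __init__(self, combine, num_partitions=4, memory_limit=1000):
            self.combine = combine            # (old_value, new_value) -> merged value
            self.num_partitions = num_partitions
            self.memory_limit = memory_limit  # max in-memory keys before spilling (stand-in for a real memory check)
            self.data = {}                    # in-memory aggregation map
            self.spills = []                  # one spill directory per stage

        def merge(self, items):
            for key, value in items:
                self.data[key] = self.combine(self.data[key], value) if key in self.data else value
                if len(self.data) >= self.memory_limit:
                    self._spill()

        def _spill(self):
            # One stage: partition the aggregated data by hash, dump each partition to disk.
            path = tempfile.mkdtemp()
            partitions = defaultdict(dict)
            for key, value in self.data.items():
                partitions[hash(key) % self.num_partitions][key] = value
            for i, part in partitions.items():
                with open(os.path.join(path, str(i)), "wb") as f:
                    pickle.dump(part, f)
            self.spills.append(path)
            self.data.clear()
            gc.collect()  # release memory as much as possible after clearing

        def items(self):
            # No spill happened: everything is still in memory.
            if not self.spills:
                yield from self.data.items()
                return
            self._spill()  # flush the remainder so every stage is on disk
            # Merge all stages together, partition by partition.
            for i in range(self.num_partitions):
                merged = {}
                for path in self.spills:
                    file_path = os.path.join(path, str(i))
                    if not os.path.exists(file_path):
                        continue
                    with open(file_path, "rb") as f:
                        for key, value in pickle.load(f).items():
                            merged[key] = self.combine(merged[key], value) if key in merged else value
                yield from merged.items()
    ```

    For example, a word count with `ExternalMergerSketch(lambda a, b: a + b)` would spill a stage to disk whenever the in-memory map grows past the limit, then reconcile the counts one hash partition at a time, so only one partition's keys are ever held in memory during the merge.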
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #1460 from davies/spill and squashes the following commits:
    
    cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
    37d71f7 [Davies Liu] balance the partitions
    902f036 [Davies Liu] add shuffle.py into run-tests
    dcf03a9 [Davies Liu] fix memory_info() of psutil
    67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
    f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
    e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
    400be01 [Davies Liu] address all the comments
    6178844 [Davies Liu] refactor and improve docs
    fdd0a49 [Davies Liu] add long doc string for ExternalMerger
    1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
    e6cc7f9 [Davies Liu] Merge branch 'master' into spill
    3652583 [Davies Liu] address comments
    e78a0a0 [Davies Liu] fix style
    24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
    57ee7ef [Davies Liu] update docs
    286aaff [Davies Liu] let spilled aggregation in Python configurable
    e9a40f6 [Davies Liu] recursive merger
    6edbd1f [Davies Liu] Hash based disk spilling aggregation