    [SPARK-1777] Prevent OOMs from single partitions · ecf30ee7
    Andrew Or authored
    **Problem.** When caching, we currently unroll the entire RDD partition before making sure we have enough free memory. This is a common cause of OOMs, especially when (1) the BlockManager has little free space left in memory, and (2) the partition is large.
    
    **Solution.** We maintain a global memory pool of `M` bytes shared across all threads, similar to the way we currently manage memory for shuffle aggregation. Then, while we unroll each partition, we periodically check whether there is enough space to continue. If not, we drop enough RDD blocks to ensure we have at least `M` bytes to work with, then try again. If we still don't have enough space to unroll the partition, we give up and drop the block to disk directly, if applicable.
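
The periodic check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `UnrollPool`, `UnrollSketch`, `estimateSize`, and `checkPeriod` are hypothetical names, and the step that drops existing RDD blocks to free space before retrying is omitted for brevity.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the shared pool of M bytes; not a Spark class.
final class UnrollPool(val capacityBytes: Long) {
  private var usedBytes = 0L

  // Reserve `bytes` if the pool can cover the request; otherwise leave it unchanged.
  def tryReserve(bytes: Long): Boolean = synchronized {
    if (usedBytes + bytes <= capacityBytes) { usedBytes += bytes; true } else false
  }

  def release(bytes: Long): Unit = synchronized {
    usedBytes = math.max(0L, usedBytes - bytes)
  }

  def used: Long = synchronized { usedBytes }
}

object UnrollSketch {
  // Left(values): the whole partition fit and was unrolled.
  // Right(iterator): we gave up; the iterator yields every element (buffered
  // ones first), so the caller could stream it to disk instead.
  def unrollSafely[T](
      values: Iterator[T],
      pool: UnrollPool,
      estimateSize: T => Long,
      checkPeriod: Int = 16): Either[Vector[T], Iterator[T]] = {
    val buffer = ArrayBuffer.empty[T]
    var reserved = 0L   // bytes this call currently holds in the pool
    var pending = 0L    // bytes buffered since the last successful reservation
    var keepUnrolling = true
    try {
      while (keepUnrolling && values.hasNext) {
        val v = values.next()
        buffer += v
        pending += estimateSize(v)
        // Only consult the shared pool every `checkPeriod` elements.
        if (buffer.size % checkPeriod == 0) {
          if (pool.tryReserve(pending)) { reserved += pending; pending = 0L }
          else keepUnrolling = false
        }
      }
      // Account for the tail that arrived after the last periodic check.
      if (keepUnrolling && pending > 0L) {
        if (pool.tryReserve(pending)) reserved += pending else keepUnrolling = false
      }
      if (keepUnrolling) Left(buffer.toVector) else Right(buffer.iterator ++ values)
    } finally {
      // Always hand the unroll reservation back; a real store would instead
      // move it into the accounting for the cached block.
      pool.release(reserved)
    }
  }
}
```

Returning an `Either` keeps the "give up" path explicit: the caller decides whether the remaining iterator goes to disk or is simply handed back to the computation.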
    
    **New configurations.**
    - `spark.storage.bufferFraction` - the value of `M` as a fraction of the storage memory (default: 0.2). Note that this is renamed to `unrollFraction` by commit 4f18a3d in the list below.
    - `spark.storage.safetyFraction` - a margin of safety in case size estimation is slightly off. This is the equivalent of the existing `spark.shuffle.safetyFraction`. (default: 0.9)
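
As a rough illustration of how the two fractions combine, the sketch below derives `M` from a heap size. It assumes that storage memory is the heap scaled by `spark.storage.memoryFraction` (an existing setting, default 0.6, not mentioned in this message) and then by the safety fraction; `UnrollBudget` and its method names are hypothetical.

```scala
object UnrollBudget {
  // Assumed formula: storage memory = heap * memoryFraction * safetyFraction.
  def storageMemory(
      heapBytes: Long,
      memoryFraction: Double = 0.6,
      safetyFraction: Double = 0.9): Long =
    (heapBytes * memoryFraction * safetyFraction).toLong

  // M = storage memory * bufferFraction, as described in the list above.
  def unrollPoolBytes(storageBytes: Long, bufferFraction: Double = 0.2): Long =
    (storageBytes * bufferFraction).toLong
}

object UnrollBudgetDemo {
  def main(args: Array[String]): Unit = {
    val heap = 1000L * 1024 * 1024 // a 1000 MiB executor heap, chosen for illustration
    val storage = UnrollBudget.storageMemory(heap)
    val m = UnrollBudget.unrollPoolBytes(storage)
    println(s"storage memory: $storage bytes, unroll pool M: $m bytes")
  }
}
```

With the defaults, `M` works out to roughly 11% of the heap (0.6 × 0.9 × 0.2), shared by all threads that are unrolling partitions at the same time.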
    
    For more detail, see the [design document](https://issues.apache.org/jira/secure/attachment/12651793/spark-1777-design-doc.pdf). Tests pending for performance and memory usage patterns.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #1165 from andrewor14/them-rdd-memories and squashes the following commits:
    
    e77f451 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    c7c8832 [Andrew Or] Simplify logic + update a few comments
    269d07b [Andrew Or] Very minor changes to tests
    6645a8a [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    b7e165c [Andrew Or] Add new tests for unrolling blocks
    f12916d [Andrew Or] Slightly clean up tests
    71672a7 [Andrew Or] Update unrollSafely tests
    369ad07 [Andrew Or] Correct ensureFreeSpace and requestMemory behavior
    f4d035c [Andrew Or] Allow one thread to unroll multiple blocks
    a66fbd2 [Andrew Or] Rename a few things + update comments
    68730b3 [Andrew Or] Fix weird scalatest behavior
    e40c60d [Andrew Or] Fix MIMA excludes
    ff77aa1 [Andrew Or] Fix tests
    1a43c06 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    b9a6eee [Andrew Or] Simplify locking behavior on unrollMemoryMap
    ed6cda4 [Andrew Or] Formatting fix (super minor)
    f9ff82e [Andrew Or] putValues -> putIterator + putArray
    beb368f [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    8448c9b [Andrew Or] Fix tests
    a49ba4d [Andrew Or] Do not expose unroll memory check period
    69bc0a5 [Andrew Or] Always synchronize on putLock before unrollMemoryMap
    3f5a083 [Andrew Or] Simplify signature of ensureFreeSpace
    dce55c8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    8288228 [Andrew Or] Synchronize put and unroll properly
    4f18a3d [Andrew Or] bufferFraction -> unrollFraction
    28edfa3 [Andrew Or] Update a few comments / log messages
    728323b [Andrew Or] Do not synchronize every 1000 elements
    5ab2329 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    129c441 [Andrew Or] Fix bug: Use toArray rather than array
    9a65245 [Andrew Or] Update a few comments + minor control flow changes
    57f8d85 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    abeae4f [Andrew Or] Add comment clarifying the MEMORY_AND_DISK case
    3dd96aa [Andrew Or] AppendOnlyBuffer -> Vector (+ a few small changes)
    f920531 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    0871835 [Andrew Or] Add an effective storage level interface to BlockManager
    64e7d4c [Andrew Or] Add/modify a few comments (minor)
    8af2f35 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    4f4834e [Andrew Or] Use original storage level for blocks dropped to disk
    ecc8c2d [Andrew Or] Fix binary incompatibility
    24185ea [Andrew Or] Avoid dropping a block back to disk if reading from disk
    2b7ee66 [Andrew Or] Fix bug in SizeTracking*
    9b9a273 [Andrew Or] Fix tests
    20eb3e5 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    649bdb3 [Andrew Or] Document spark.storage.bufferFraction
    a10b0e7 [Andrew Or] Add initial memory request threshold + rename a few things
    e9c3cb0 [Andrew Or] cacheMemoryMap -> unrollMemoryMap
    198e374 [Andrew Or] Unfold -> unroll
    0d50155 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    d9d02a8 [Andrew Or] Remove unused param in unfoldSafely
    ec728d8 [Andrew Or] Add tests for safe unfolding of blocks
    22b2209 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    078eb83 [Andrew Or] Add check for hasNext in PrimitiveVector.iterator
    0871535 [Andrew Or] Fix tests in BlockManagerSuite
    d68f31e [Andrew Or] Safely unfold blocks for all memory puts
    5961f50 [Andrew Or] Fix tests
    195abd7 [Andrew Or] Refactor: move unfold logic to MemoryStore
    1e82d00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    3ce413e [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    d5dd3b4 [Andrew Or] Free buffer memory in finally
    ea02eec [Andrew Or] Fix tests
    b8e1d9c [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    a8704c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    e1b8b25 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    87aa75c [Andrew Or] Fix mima excludes again (typo)
    11eb921 [Andrew Or] Clarify comment (minor)
    50cae44 [Andrew Or] Remove now duplicate mima exclude
    7de5ef9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    df47265 [Andrew Or] Fix binary incompatibility
    6d05a81 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
    f94f5af [Andrew Or] Update a few comments (minor)
    776aec9 [Andrew Or] Prevent OOM if a single RDD partition is too large
    bbd3eea [Andrew Or] Fix CacheManagerSuite to use Array
    97ea499 [Andrew Or] Change BlockManager interface to use Arrays
    c12f093 [Andrew Or] Add SizeTrackingAppendOnlyBuffer and tests
MimaExcludes.scala 6.74 KiB