[SPARK-11419][STREAMING] Parallel recovery for FileBasedWriteAheadLog + minor recovery tweaks
Support for closing WriteAheadLog files after writes was recently merged. Closing every file after a write is an expensive operation because it creates many small files on S3 (it is not necessary to enable it on HDFS). With many small files on S3, recovery takes a very long time. In addition, files stack up quickly and deletes may not be able to keep up, so deletes can also be parallelized. This PR parallelizes both recovery reads and deletes, and fixes a couple of other failures encountered during recovery.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9373 from brkyvz/par-recovery.
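The core idea of the recovery change is to read the many small log segment files concurrently instead of sequentially. Below is a minimal, hypothetical sketch of that pattern using a fixed thread pool; `readSegment`, `recoverAll`, and the paths are illustrative stand-ins, not the actual `FileBasedWriteAheadLog` API.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelRecoverySketch {
  // Hypothetical stand-in for reading one write-ahead log segment file.
  def readSegment(path: String): Seq[String] = Seq(s"record-from-$path")

  // Recover records from all segment files, reading up to `parallelism`
  // files at once. Sequential reads dominate recovery time when there
  // are many small files (e.g. on S3), so this is the parallelized step.
  def recoverAll(paths: Seq[String], parallelism: Int = 8): Seq[String] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = paths.map(p => Future(readSegment(p)))
      // Preserve file order in the result even though reads run in parallel.
      Await.result(Future.sequence(futures), Duration.Inf).flatten
    } finally {
      pool.shutdown()
    }
  }
}
```

The same thread pool can be reused to fan out segment-file deletions, which addresses the second bottleneck the PR mentions.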
Showing 7 changed files with 268 additions and 37 deletions:
- streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala (4 additions, 2 deletions)
- streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala (54 additions, 24 deletions)
- streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLogRandomReader.scala (1 addition, 1 deletion)
- streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLogReader.scala (15 additions, 2 deletions)
- streaming/src/main/scala/org/apache/spark/streaming/util/HdfsUtils.scala (22 additions, 2 deletions)
- streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockTrackerSuite.scala (90 additions, 1 deletion)
- streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala (82 additions, 5 deletions)