Skip to content
Snippets Groups Projects
  • Hari Shreedharan's avatar
    800ecff4
    [STREAMING] SPARK-1729. Make Flume pull data from source, rather than the current pu... · 800ecff4
    Hari Shreedharan authored
    ...sh model
    
    Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the
    receiver fails, it currently has to be restarted on the same node to be able to receive data.
    
    This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new
    DStream that is also included in this commit. This model ensures that data can be pulled into Spark from
    Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on
    multiple threads for better performance.
    
    Author: Hari Shreedharan <harishreedharan@gmail.com>
    Author: Hari Shreedharan <hshreedharan@apache.org>
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Author: harishreedharan <hshreedharan@cloudera.com>
    
    Closes #807 from harishreedharan/master and squashes the following commits:
    
    e7f70a3 [Hari Shreedharan] Merge remote-tracking branch 'asf-git/master'
    96cfb6f [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    e48d785 [Hari Shreedharan] Documenting flume-sink being ignored for Mima checks.
    5f212ce [Hari Shreedharan] Ignore Spark Sink from mima.
    981bf62 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    7a1bc6e [Hari Shreedharan] Fix SparkBuild.scala
    a082eb3 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    1f47364 [Hari Shreedharan] Minor fixes.
    73d6f6d [Hari Shreedharan] Cleaned up tests a bit. Added some docs in multiple places.
    65b76b4 [Hari Shreedharan] Fixing the unit test.
    e59cc20 [Hari Shreedharan] Use SparkFlumeEvent instead of the new type. Also, Flume Polling Receiver now uses the store(ArrayBuffer) method.
    f3c99d1 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    3572180 [Hari Shreedharan] Adding a license header, making Jenkins happy.
    799509f [Hari Shreedharan] Fix a compile issue.
    3c5194c [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    d248d22 [harishreedharan] Merge pull request #1 from tdas/flume-polling
    10b6214 [Tathagata Das] Changed public API, changed sink package, and added java unit test to make sure Java API is callable from Java.
    1edc806 [Hari Shreedharan] SPARK-1729. Update logging in Spark Sink.
    8c00289 [Hari Shreedharan] More debug messages
    393bd94 [Hari Shreedharan] SPARK-1729. Use LinkedBlockingQueue instead of ArrayBuffer to keep track of connections.
    120e2a1 [Hari Shreedharan] SPARK-1729. Some test changes and changes to utils classes.
    9fd0da7 [Hari Shreedharan] SPARK-1729. Use foreach instead of map for all Options.
    8136aa6 [Hari Shreedharan] Adding TransactionProcessor to map on returning batch of data
    86aa274 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    205034d [Hari Shreedharan] Merging master in
    4b0c7fc [Hari Shreedharan] FLUME-1729. New Flume-Spark integration.
    bda01fc [Hari Shreedharan] FLUME-1729. Flume-Spark integration.
    0d69604 [Hari Shreedharan] FLUME-1729. Better Flume-Spark integration.
    3c23c18 [Hari Shreedharan] SPARK-1729. New Spark-Flume integration.
    70bcc2a [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
    d6fa3aa [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
    e7da512 [Hari Shreedharan] SPARK-1729. Fixing import order
    9741683 [Hari Shreedharan] SPARK-1729. Fixes based on review.
    c604a3c [Hari Shreedharan] SPARK-1729. Optimize imports.
    0f10788 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    87775aa [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    8df37e4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    03d6c1c [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    08176ad [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    d24d9d4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    6d6776a [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    800ecff4
    History
    [STREAMING] SPARK-1729. Make Flume pull data from source, rather than the current pu...
    Hari Shreedharan authored
    ...sh model
    
    Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the
    receiver fails, it currently has to be restarted on the same node to be able to receive data.
    
    This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new
    DStream that is also included in this commit. This model ensures that data can be pulled into Spark from
    Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on
    multiple threads for better performance.
    
    Author: Hari Shreedharan <harishreedharan@gmail.com>
    Author: Hari Shreedharan <hshreedharan@apache.org>
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Author: harishreedharan <hshreedharan@cloudera.com>
    
    Closes #807 from harishreedharan/master and squashes the following commits:
    
    e7f70a3 [Hari Shreedharan] Merge remote-tracking branch 'asf-git/master'
    96cfb6f [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    e48d785 [Hari Shreedharan] Documenting flume-sink being ignored for Mima checks.
    5f212ce [Hari Shreedharan] Ignore Spark Sink from mima.
    981bf62 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    7a1bc6e [Hari Shreedharan] Fix SparkBuild.scala
    a082eb3 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    1f47364 [Hari Shreedharan] Minor fixes.
    73d6f6d [Hari Shreedharan] Cleaned up tests a bit. Added some docs in multiple places.
    65b76b4 [Hari Shreedharan] Fixing the unit test.
    e59cc20 [Hari Shreedharan] Use SparkFlumeEvent instead of the new type. Also, Flume Polling Receiver now uses the store(ArrayBuffer) method.
    f3c99d1 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    3572180 [Hari Shreedharan] Adding a license header, making Jenkins happy.
    799509f [Hari Shreedharan] Fix a compile issue.
    3c5194c [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    d248d22 [harishreedharan] Merge pull request #1 from tdas/flume-polling
    10b6214 [Tathagata Das] Changed public API, changed sink package, and added java unit test to make sure Java API is callable from Java.
    1edc806 [Hari Shreedharan] SPARK-1729. Update logging in Spark Sink.
    8c00289 [Hari Shreedharan] More debug messages
    393bd94 [Hari Shreedharan] SPARK-1729. Use LinkedBlockingQueue instead of ArrayBuffer to keep track of connections.
    120e2a1 [Hari Shreedharan] SPARK-1729. Some test changes and changes to utils classes.
    9fd0da7 [Hari Shreedharan] SPARK-1729. Use foreach instead of map for all Options.
    8136aa6 [Hari Shreedharan] Adding TransactionProcessor to map on returning batch of data
    86aa274 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
    205034d [Hari Shreedharan] Merging master in
    4b0c7fc [Hari Shreedharan] FLUME-1729. New Flume-Spark integration.
    bda01fc [Hari Shreedharan] FLUME-1729. Flume-Spark integration.
    0d69604 [Hari Shreedharan] FLUME-1729. Better Flume-Spark integration.
    3c23c18 [Hari Shreedharan] SPARK-1729. New Spark-Flume integration.
    70bcc2a [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
    d6fa3aa [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
    e7da512 [Hari Shreedharan] SPARK-1729. Fixing import order
    9741683 [Hari Shreedharan] SPARK-1729. Fixes based on review.
    c604a3c [Hari Shreedharan] SPARK-1729. Optimize imports.
    0f10788 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    87775aa [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    8df37e4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    03d6c1c [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    08176ad [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    d24d9d4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
    6d6776a [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model