diff --git a/docs/plugin-custom-receiver.md b/docs/streaming-custom-receivers.md
similarity index 95%
rename from docs/plugin-custom-receiver.md
rename to docs/streaming-custom-receivers.md
index 0eb4246158e535e0ce0eeef0f92ec4cfbf2a9b41..5476c00d020cb2cfece90853d7e010d08055c00d 100644
--- a/docs/plugin-custom-receiver.md
+++ b/docs/streaming-custom-receivers.md
@@ -1,9 +1,9 @@
 ---
 layout: global
-title: Tutorial - Spark streaming, Plugging in a custom receiver.
+title: Tutorial - Spark Streaming, Plugging in a custom receiver.
 ---

-A "Spark streaming" receiver can be a simple network stream, streams of messages from a message queue, files etc. A receiver can also assume roles more than just receiving data like filtering, preprocessing, to name a few of the possibilities. The api to plug-in any user defined custom receiver is thus provided to encourage development of receivers which may be well suited to ones specific need.
+A "Spark Streaming" receiver can be a simple network stream, a stream of messages from a message queue, files, etc. A receiver can also take on roles beyond just receiving data, such as filtering and preprocessing, to name a few of the possibilities. The API to plug in any user-defined custom receiver is thus provided to encourage the development of receivers well suited to one's specific needs.

 This guide shows the programming model and features by walking through a simple sample receiver and corresponding Spark Streaming application.
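The intro above notes that a receiver can filter or preprocess data as well as receive it. The sketch below is purely an illustration of that idea, with hypothetical names; it is not the Spark Streaming receiver API, which the renamed tutorial documents.

{% highlight scala %}
// Hypothetical sketch only: a receiver-like component that preprocesses and
// filters raw records before handing them to a downstream sink. The real
// plug-in interface is described in docs/streaming-custom-receivers.md.
class PreprocessingReceiver(sink: String => Unit) {
  def onRecord(raw: String): Unit = {
    val cleaned = raw.trim.toLowerCase   // preprocessing step
    if (cleaned.nonEmpty) sink(cleaned)  // filter out blank records, then hand off
  }
}
{% endhighlight %}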
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index ded43e67cd707c411d4fefb7da80d9b9cc2a7953..0e618a06c796fcb821e4f4aa4d46e274e5f48494 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -365,14 +365,14 @@ There are two failure behaviors based on which input sources are used.
 Since all data is modeled as RDDs with their lineage of deterministic operations, any recomputation always leads to the same result. As a result, all DStream transformations are guaranteed to have _exactly-once_ semantics. That is, the final transformed result will be same even if there were was a worker node failure. However, output operations (like `foreach`) have _at-least once_ semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. While this is acceptable for saving to HDFS using the `saveAs*Files` operations (as the file will simply get over-written by the same data), additional transactions-like mechanisms may be necessary to achieve exactly-once semantics for output operations.

-## Failure of a Driver Node
-A system that is required to operate 24/7 needs to be able tolerate the failure of the drive node as well. Spark Streaming does this by saving the state of the DStream computation periodically to a HDFS file, that can be used to restart the streaming computation in the event of a failure of the driver node. To elaborate, the following state is periodically saved to a file.
+## Failure of the Driver Node
+A system that is required to operate 24/7 needs to be able to tolerate the failure of the driver node as well. Spark Streaming does this by periodically saving the state of the DStream computation to an HDFS file, which can be used to restart the streaming computation in the event of a driver node failure. This checkpointing is enabled by setting an HDFS directory for checkpointing using `ssc.checkpoint(<checkpoint directory>)` as described [earlier](#rdd-checkpointing-within-dstreams). To elaborate, the following state is periodically saved to a file.

 1. The DStream operator graph (input streams, output streams, etc.)
 1. The configuration of each DStream (checkpoint interval, etc.)
 1. The RDD checkpoint files of each DStream

-All this is periodically saved in the file `<checkpoint directory>/graph` where `<checkpoint directory>` is the HDFS path set using `ssc.checkpoint(...)` as described earlier. To recover, a new Streaming Context can be created with this directory by using
+All this is periodically saved in the file `<checkpoint directory>/graph`. To recover, a new Streaming Context can be created with this directory by using

 {% highlight scala %}
 val ssc = new StreamingContext(checkpointDirectory)
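Putting the guide's pieces together, the checkpoint-and-recover flow amounts to the sketch below. Only `ssc.checkpoint(...)` and the `new StreamingContext(checkpointDirectory)` constructor come from the guide text; the package name, the three-argument constructor, the batch interval, and the directory are illustrative assumptions about the 0.7-era API.

{% highlight scala %}
import spark.streaming.{Seconds, StreamingContext}  // pre-Apache package name, assumed

// Normal startup: build the DStream graph and enable periodic checkpointing to HDFS.
val checkpointDirectory = "hdfs://namenode:9000/checkpoints/myApp"  // placeholder path
val ssc = new StreamingContext("local[2]", "CheckpointedApp", Seconds(1))  // constructor form assumed
ssc.checkpoint(checkpointDirectory)
// ... define input DStreams, transformations, and output operations here ...
ssc.start()

// After a driver node failure: rebuild the context from the saved operator graph,
// configuration, and RDD checkpoint files instead of redefining the computation.
val recoveredSsc = new StreamingContext(checkpointDirectory)
recoveredSsc.start()
{% endhighlight %}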
diff --git a/pom.xml b/pom.xml
index 7e06cae052b58d6ebc1e5240f4810601287d9ca2..99eb17856a813a42c47cc5388637e196b6b5d4ae 100644
--- a/pom.xml
+++ b/pom.xml
@@ -84,9 +84,9 @@
       </snapshots>
     </repository>
     <repository>
-      <id>typesafe-repo</id>
-      <name>Typesafe Repository</name>
-      <url>http://repo.typesafe.com/typesafe/releases/</url>
+      <id>akka-repo</id>
+      <name>Akka Repository</name>
+      <url>http://repo.akka.io/releases/</url>
       <releases>
         <enabled>true</enabled>
       </releases>
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 22bdc93602ea55992e64c91cc00e691d7d1ea8f2..b0b6e21681b29b5d9742a8b744bee8d89de3abfb 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -114,7 +114,7 @@ object SparkBuild extends Build {
   def coreSettings = sharedSettings ++ Seq(
     name := "spark-core",
     resolvers ++= Seq(
-      "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/",
+      "Akka Repository" at "http://repo.akka.io/releases/",
       "JBoss Repository" at "http://repository.jboss.org/nexus/content/repositories/releases/",
       "Spray Repository" at "http://repo.spray.cc/",
       "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
@@ -162,9 +162,6 @@ object SparkBuild extends Build {
   def streamingSettings = sharedSettings ++ Seq(
     name := "spark-streaming",
-    resolvers ++= Seq(
-      "Akka Repository" at "http://repo.akka.io/releases"
-    ),
     libraryDependencies ++= Seq(
       "org.apache.flume" % "flume-ng-sdk" % "1.2.0" % "compile",
       "com.github.sgroschupf" % "zkclient" % "0.1",
diff --git a/run b/run
index 6b2d84d48dbf034613c34c0b089fe4f380a450ba..ecbf7673c660f76c7294f76e9f28104b51453602 100755
--- a/run
+++ b/run
@@ -111,14 +111,13 @@ CLASSPATH+=":$FWDIR/conf"
 CLASSPATH+=":$CORE_DIR/target/scala-$SCALA_VERSION/classes"
 if [ -n "$SPARK_TESTING" ] ; then
   CLASSPATH+=":$CORE_DIR/target/scala-$SCALA_VERSION/test-classes"
+  CLASSPATH+=":$STREAMING_DIR/target/scala-$SCALA_VERSION/test-classes"
 fi
 CLASSPATH+=":$CORE_DIR/src/main/resources"
 CLASSPATH+=":$REPL_DIR/target/scala-$SCALA_VERSION/classes"
 CLASSPATH+=":$EXAMPLES_DIR/target/scala-$SCALA_VERSION/classes"
 CLASSPATH+=":$STREAMING_DIR/target/scala-$SCALA_VERSION/classes"
-for jar in `find "$STREAMING_DIR/lib" -name '*jar'`; do
-  CLASSPATH+=":$jar"
-done
+CLASSPATH+=":$STREAMING_DIR/lib/org/apache/kafka/kafka/0.7.2-spark/*" # <-- our in-project Kafka Jar
 if [ -e "$FWDIR/lib_managed" ]; then
   CLASSPATH+=":$FWDIR/lib_managed/jars/*"
   CLASSPATH+=":$FWDIR/lib_managed/bundles/*"
diff --git a/run2.cmd b/run2.cmd
index c913a5195ef95f83845dffcd1e6cdc497b9747c8..705a4d1ff6830af502631da58eb1b0ee105c2274 100644
--- a/run2.cmd
+++ b/run2.cmd
@@ -47,11 +47,14 @@ set CORE_DIR=%FWDIR%core
 set REPL_DIR=%FWDIR%repl
 set EXAMPLES_DIR=%FWDIR%examples
 set BAGEL_DIR=%FWDIR%bagel
+set STREAMING_DIR=%FWDIR%streaming
 set PYSPARK_DIR=%FWDIR%python

 rem Build up classpath
 set CLASSPATH=%SPARK_CLASSPATH%;%MESOS_CLASSPATH%;%FWDIR%conf;%CORE_DIR%\target\scala-%SCALA_VERSION%\classes
 set CLASSPATH=%CLASSPATH%;%CORE_DIR%\target\scala-%SCALA_VERSION%\test-classes;%CORE_DIR%\src\main\resources
+set CLASSPATH=%CLASSPATH%;%STREAMING_DIR%\target\scala-%SCALA_VERSION%\classes;%STREAMING_DIR%\target\scala-%SCALA_VERSION%\test-classes
+set CLASSPATH=%CLASSPATH%;%STREAMING_DIR%\lib\org\apache\kafka\kafka\0.7.2-spark\*
 set CLASSPATH=%CLASSPATH%;%REPL_DIR%\target\scala-%SCALA_VERSION%\classes;%EXAMPLES_DIR%\target\scala-%SCALA_VERSION%\classes
 set CLASSPATH=%CLASSPATH%;%FWDIR%lib_managed\jars\*
 set CLASSPATH=%CLASSPATH%;%FWDIR%lib_managed\bundles\*
diff --git a/streaming/pom.xml b/streaming/pom.xml
index 92b17fc3af6f4c711515c491b3f5634a173fa489..15523eadcb78198d84a5a5d5e21518e514c3dc78 100644
--- a/streaming/pom.xml
+++ b/streaming/pom.xml
@@ -20,17 +20,6 @@
       <id>lib</id>
       <url>file://${project.basedir}/lib</url>
     </repository>
-    <repository>
-      <id>akka-repo</id>
-      <name>Akka Repository</name>
-      <url>http://repo.akka.io/releases</url>
-      <releases>
-        <enabled>true</enabled>
-      </releases>
-      <snapshots>
-        <enabled>false</enabled>
-      </snapshots>
-    </repository>
   </repositories>

   <dependencies>
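The repository change also affects downstream builds: a project depending on spark-streaming may need the same Akka repository on its resolver list. A minimal sbt sketch follows; only the resolver line mirrors this patch, while the spark-streaming coordinates and version are illustrative assumptions.

{% highlight scala %}
// Downstream sbt settings (sketch). Only the resolver is taken from this patch;
// the artifact coordinates and version below are assumed for illustration.
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

libraryDependencies += "org.spark-project" %% "spark-streaming" % "0.7.0"
{% endhighlight %}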