diff --git a/docs/css/main.css b/docs/css/main.css
index 8c2dc740294d28bf7aa7e39c60755d8871f10200..c8aaa8ad22f7631d85b71098c4d8cff4c9a96cce 100755
--- a/docs/css/main.css
+++ b/docs/css/main.css
@@ -48,6 +48,10 @@ code {
   color: #902000;
 }
 
+a code {
+  color: #0088cc;
+}
+
 pre {
   font-family: "Menlo", "Lucida Console", monospace;
 }
diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md
index 546d69bfe5528c5a3eb7c5e286181414ccabe095..2411e078494f5363c0fce2de28bcc452982ae86d 100644
--- a/docs/java-programming-guide.md
+++ b/docs/java-programming-guide.md
@@ -3,22 +3,22 @@ layout: global
 title: Java Programming Guide
 ---
 
-The Spark Java API
-([spark.api.java]({{HOME_PATH}}api/core/index.html#spark.api.java.package)) defines
-[`JavaSparkContext`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaSparkContext) and
-[`JavaRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaRDD) classes,
-which support
-the same methods as their Scala counterparts but take Java functions and return
-Java data and collection types.
-
-Because Java API is similar to the Scala API, this programming guide only
-covers Java-specific features;
-the [Scala Programming Guide]({{HOME_PATH}}scala-programming-guide.html)
-provides a more general introduction to Spark concepts and should be read
-first.
-
-
-# Key differences in the Java API
+The Spark Java API exposes all of the Spark features available in the Scala version to Java.
+To learn the basics of Spark, we recommend reading through the
+[Scala Programming Guide]({{HOME_PATH}}scala-programming-guide.html) first; it should be
+easy to follow even if you don't know Scala.
+This guide will show how to use the Spark features described there in Java.
+
+The Spark Java API is defined in the
+[`spark.api.java`]({{HOME_PATH}}api/core/index.html#spark.api.java.package) package, and includes
+a [`JavaSparkContext`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaSparkContext) for
+initializing Spark and [`JavaRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaRDD) classes,
+which support the same methods as their Scala counterparts but take Java functions and return
+Java data and collection types. The main differences have to do with passing functions to RDD
+operations (e.g. `map`) and handling RDDs of different types, as discussed next.
+
+# Key Differences in the Java API
+
 There are a few key differences between the Java and Scala APIs:
 
 * Java does not support anonymous or first-class functions, so functions must
@@ -27,21 +27,25 @@ There are a few key differences between the Java and Scala APIs:
   [`Function2`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function2), etc.
   classes.
 * To maintain type safety, the Java API defines specialized Function and RDD
-  classes for key-value pairs and doubles.
-* RDD methods like `collect` and `countByKey` return Java collections types,
+  classes for key-value pairs and doubles. For example,
+  [`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD)
+  stores key-value pairs.
+* RDD methods like `collect()` and `countByKey()` return Java collection types,
   such as `java.util.List` and `java.util.Map`.
-
+* Key-value pairs, which are simply written as `(key, value)` in Scala, are represented
+  by the `scala.Tuple2` class, and need to be created using `new Tuple2<K, V>(key, value)`,
+  as in the sketch below.
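+
+To make these differences concrete, here is a minimal sketch. It is only an illustration: it
+assumes an existing `JavaSparkContext` named `sc`, imports from `spark.api.java`,
+`spark.api.java.function`, `java.util.Arrays`, and `scala.Tuple2`, and made-up data and names.
+
+{% highlight java %}
+// Create an RDD from a local collection (sc is an assumed, already-initialized JavaSparkContext).
+JavaRDD<String> words = sc.parallelize(Arrays.asList("apple", "banana", "cherry"));
+
+// Java has no anonymous functions here, so we pass a PairFunction subclass to map;
+// key-value pairs are built explicitly with scala.Tuple2.
+JavaPairRDD<String, Integer> lengths = words.map(
+  new PairFunction<String, String, Integer>() {
+    public Tuple2<String, Integer> call(String word) {
+      return new Tuple2<String, Integer>(word, word.length());
+    }
+  });
+
+// Results come back as Java collection types.
+java.util.List<Tuple2<String, Integer>> result = lengths.collect();
+{% endhighlight %}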
 
 ## RDD Classes
 
-Spark defines additional operations on RDDs of doubles and key-value pairs, such
-as `stdev` and `join`.
+
+Spark defines additional operations on RDDs of key-value pairs and doubles, such
+as `reduceByKey`, `join`, and `stdev`.
 In the Scala API, these methods are automatically added using Scala's
 [implicit conversions](http://www.scala-lang.org/node/130) mechanism.
-In the Java API, the extra methods are defined in
-[`JavaDoubleRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaDoubleRDD) and
+In the Java API, the extra methods are defined in the
 [`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD)
+and [`JavaDoubleRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaDoubleRDD)
 classes. RDD methods like `map` are overloaded by specialized `PairFunction`
 and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
 types. Common methods like `filter` and `sample` are implemented by
@@ -57,22 +61,25 @@ class has a single abstract method, `call()`, that must be implemented.
 
 <table class="table">
 <tr><th>Class</th><th>Function Type</th></tr>
 
-<tr><td>Function<T, R></td><td>T -> R </td></tr>
-<tr><td>DoubleFunction<T></td><td>T -> Double </td></tr>
-<tr><td>PairFunction<T, K, V></td><td>T -> Tuple2<K, V> </td></tr>
+<tr><td>Function<T, R></td><td>T => R </td></tr>
+<tr><td>DoubleFunction<T></td><td>T => Double </td></tr>
+<tr><td>PairFunction<T, K, V></td><td>T => Tuple2<K, V> </td></tr>
 
-<tr><td>FlatMapFunction<T, R></td><td>T -> Iterable<R> </td></tr>
-<tr><td>DoubleFlatMapFunction<T></td><td>T -> Iterable<Double> </td></tr>
-<tr><td>PairFlatMapFunction<T, K, V></td><td>T -> Iterable<Tuple2<K, V>> </td></tr>
+<tr><td>FlatMapFunction<T, R></td><td>T => Iterable<R> </td></tr>
+<tr><td>DoubleFlatMapFunction<T></td><td>T => Iterable<Double> </td></tr>
+<tr><td>PairFlatMapFunction<T, K, V></td><td>T => Iterable<Tuple2<K, V>> </td></tr>
 
-<tr><td>Function2<T1, T2, R></td><td>T1, T2 -> R (function of two arguments)</td></tr>
+<tr><td>Function2<T1, T2, R></td><td>T1, T2 => R (function of two arguments)</td></tr>
 </table>
+
 # Other Features
+
 The Java API supports other Spark features, including
 [accumulators]({{HOME_PATH}}scala-programming-guide.html#accumulators),
-[broadcast variables]({{HOME_PATH}}scala-programming-guide.html#broadcast_variables), and
-[caching]({{HOME_PATH}}scala-programming-guide.html#caching).
+[broadcast variables]({{HOME_PATH}}scala-programming-guide.html#broadcast-variables), and
+[caching]({{HOME_PATH}}scala-programming-guide.html#rdd-persistence).
+
 
 # Example
 
@@ -130,8 +137,6 @@ JavaPairRDD<String, Integer> ones = words.map(
 
 Note that `map` was passed a `PairFunction<String, String, Integer>` and
 returned a `JavaPairRDD<String, Integer>`.
-
-
 To finish the word count program, we will use `reduceByKey` to count the
 occurrences of each word:
 
@@ -161,12 +166,23 @@ JavaPairRDD<String, Integer> counts = lines.flatMap(
   ...
 );
 {% endhighlight %}
+
 There is no performance difference between these approaches; the choice is
-a matter of style.
+just a matter of style.
+
+# Javadoc
+
+We currently provide documentation for the Java API as Scaladoc, in the
+[`spark.api.java` package]({{HOME_PATH}}api/core/index.html#spark.api.java.package), because
+some of the classes are implemented in Scala. The main downside is that the types and function
+definitions show Scala syntax (for example, `def reduce(func: Function2[T, T, T]): T` instead of
+`T reduce(Function2<T, T, T> func)`).
+We hope to generate documentation with Java-style syntax in the future.
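+
+As a rough guide to reading those Scala-style signatures from Java, a call to `reduce` might look
+like the following sketch (this is only an illustration; it assumes an existing `JavaRDD<Integer>`
+named `numbers`, and the variable names are made up):
+
+{% highlight java %}
+// Function2<Integer, Integer, Integer> is what the Scaladoc writes as
+// Function2[Integer, Integer, Integer]; call() is its single abstract method.
+Integer sum = numbers.reduce(new Function2<Integer, Integer, Integer>() {
+  public Integer call(Integer a, Integer b) {
+    return a + b;
+  }
+});
+{% endhighlight %}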
+
+# Where to Go from Here
 
-# Where to go from here
-Spark includes several sample jobs using the Java API in
+Spark includes several sample programs using the Java API in
 `examples/src/main/java`. You can run them by passing the class name to the
 `run` script included in Spark -- for example, `./run
 spark.examples.JavaWordCount`. Each example program prints usage help when run
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 1936c1969d81ce4938fd7289278c91ba3bceeaed..9a97736b6b9101641ac82de8ac347ef030086fff 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -205,6 +205,10 @@ The following tables list the transformations and actions currently supported (s
   <td> <b>saveAsSequenceFile</b>(<i>path</i>) </td>
   <td> Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). </td>
 </tr>
+<tr>
+  <td> <b>countByKey</b>() </td>
+  <td> Only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key. </td>
+</tr>
 <tr>
   <td> <b>foreach</b>(<i>func</i>) </td>
   <td> Run a function <i>func</i> on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems. </td>
 </tr>
@@ -273,6 +277,7 @@ In addition, each RDD can be stored using a different *storage level*, allowing
 
 As you can see, Spark supports a variety of storage levels that give different tradeoffs between memory usage and CPU efficiency. We recommend going through the following process to select one:
+
 * If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY_DESER`), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
 * If not, try using `MEMORY_ONLY` and [selecting a fast serialization library]({{HOME_PATH}}tuning.html) to make the objects
@@ -329,4 +334,4 @@ res2: Int = 10
 
 You can see some [example Spark programs](http://www.spark-project.org/examples.html) on the Spark website.
 
-In addition, Spark includes several sample jobs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them using by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.
+In addition, Spark includes several sample programs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.