Commit c5754bb9 authored by Matei Zaharia

Fixes to Java guide

parent f1246cc7
@@ -48,6 +48,10 @@ code {
color: #902000;
}
a code {
color: #0088cc;
}
pre {
font-family: "Menlo", "Lucida Console", monospace;
}
@@ -3,22 +3,22 @@ layout: global
title: Java Programming Guide
---
The Spark Java API
([spark.api.java]({{HOME_PATH}}api/core/index.html#spark.api.java.package)) defines
[`JavaSparkContext`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaSparkContext) and
[`JavaRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaRDD) classes,
which support
the same methods as their Scala counterparts but take Java functions and return
Java data and collection types.
Because the Java API is similar to the Scala API, this programming guide only
covers Java-specific features;
the [Scala Programming Guide]({{HOME_PATH}}scala-programming-guide.html)
provides a more general introduction to Spark concepts and should be read
first.
# Key differences in the Java API
The Spark Java API exposes all the Spark features available in the Scala version to Java.
To learn the basics of Spark, we recommend reading through the
[Scala Programming Guide]({{HOME_PATH}}scala-programming-guide.html) first; it should be
easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Java.
The Spark Java API is defined in the
[`spark.api.java`]({{HOME_PATH}}api/core/index.html#spark.api.java.package) package, and includes
a [`JavaSparkContext`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaSparkContext) for
initializing Spark and [`JavaRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaRDD) classes,
which support the same methods as their Scala counterparts but take Java functions and return
Java data and collection types. The main differences have to do with passing functions to RDD
operations (e.g. map) and handling RDDs of different types, as discussed next.
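For a quick illustration, here is a minimal sketch of initializing Spark and creating an RDD from Java (the application name, the `"local"` master URL, and the `data.txt` path are placeholders, not part of the guide):

{% highlight java %}
import spark.api.java.JavaRDD;
import spark.api.java.JavaSparkContext;

// "local" runs Spark in-process; a cluster URL could be used instead.
JavaSparkContext sc = new JavaSparkContext("local", "MyJavaApp");

// Build an RDD of the lines of a (placeholder) text file and count them.
JavaRDD<String> lines = sc.textFile("data.txt");
System.out.println("Number of lines: " + lines.count());
{% endhighlight %}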
# Key Differences in the Java API
There are a few key differences between the Java and Scala APIs:
* Java does not support anonymous or first-class functions, so functions must
@@ -27,21 +27,25 @@ There are a few key differences between the Java and Scala APIs:
[`Function2`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function2), etc.
classes.
* To maintain type safety, the Java API defines specialized Function and RDD
classes for key-value pairs and doubles.
* RDD methods like `collect` and `countByKey` return Java collection types,
classes for key-value pairs and doubles. For example,
[`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD)
stores key-value pairs.
* RDD methods like `collect()` and `countByKey()` return Java collection types,
such as `java.util.List` and `java.util.Map`.
* Key-value pairs, which are simply written as `(key, value)` in Scala, are represented
by the `scala.Tuple2` class, and need to be created using `new Tuple2<K, V>(key, value)` (see the sketch below).
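To make these differences concrete, here is a minimal sketch (assuming an existing `JavaSparkContext` called `sc`; the sample data is made up) of passing an anonymous function class to `map`, collecting results into a Java collection, and constructing a `Tuple2`:

{% highlight java %}
import java.util.Arrays;
import java.util.List;
import scala.Tuple2;
import spark.api.java.JavaRDD;
import spark.api.java.function.Function;

JavaRDD<String> words = sc.parallelize(Arrays.asList("apple", "banana", "cherry"));

// Functions are passed as instances of the Function classes rather than as closures.
JavaRDD<Integer> lengths = words.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});

// collect() returns a Java collection type.
List<Integer> collected = lengths.collect();

// Key-value pairs are scala.Tuple2 objects.
Tuple2<String, Integer> pair = new Tuple2<String, Integer>("apple", 5);
System.out.println(pair._1() + " has length " + pair._2());
{% endhighlight %}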
## RDD Classes
Spark defines additional operations on RDDs of doubles and key-value pairs, such
as `stdev` and `join`.
Spark defines additional operations on RDDs of key-value pairs and doubles, such
as `reduceByKey`, `join`, and `stdev`.
In the Scala API, these methods are automatically added using Scala's
[implicit conversions](http://www.scala-lang.org/node/130) mechanism.
In the Java API, the extra methods are defined in
[`JavaDoubleRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaDoubleRDD) and
In the Java API, the extra methods are defined in the
[`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD)
and [`JavaDoubleRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaDoubleRDD)
classes. RDD methods like `map` are overloaded by specialized `PairFunction`
and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
types. Common methods like `filter` and `sample` are implemented by
@@ -57,22 +61,25 @@ class has a single abstract method, `call()`, that must be implemented.
<table class="table">
<tr><th>Class</th><th>Function Type</th></tr>
<tr><td>Function&lt;T, R&gt;</td><td>T -&gt; R </td></tr>
<tr><td>DoubleFunction&lt;T&gt;</td><td>T -&gt; Double </td></tr>
<tr><td>PairFunction&lt;T, K, V&gt;</td><td>T -&gt; Tuple2&lt;K, V&gt; </td></tr>
<tr><td>Function&lt;T, R&gt;</td><td>T =&gt; R </td></tr>
<tr><td>DoubleFunction&lt;T&gt;</td><td>T =&gt; Double </td></tr>
<tr><td>PairFunction&lt;T, K, V&gt;</td><td>T =&gt; Tuple2&lt;K, V&gt; </td></tr>
<tr><td>FlatMapFunction&lt;T, R&gt;</td><td>T -&gt; Iterable&lt;R&gt; </td></tr>
<tr><td>DoubleFlatMapFunction&lt;T&gt;</td><td>T -&gt; Iterable&lt;Double&gt; </td></tr>
<tr><td>PairFlatMapFunction&lt;T, K, V&gt;</td><td>T -&gt; Iterable&lt;Tuple2&lt;K, V&gt;&gt; </td></tr>
<tr><td>FlatMapFunction&lt;T, R&gt;</td><td>T =&gt; Iterable&lt;R&gt; </td></tr>
<tr><td>DoubleFlatMapFunction&lt;T&gt;</td><td>T =&gt; Iterable&lt;Double&gt; </td></tr>
<tr><td>PairFlatMapFunction&lt;T, K, V&gt;</td><td>T =&gt; Iterable&lt;Tuple2&lt;K, V&gt;&gt; </td></tr>
<tr><td>Function2&lt;T1, T2, R&gt;</td><td>T1, T2 -&gt; R (function of two arguments)</td></tr>
<tr><td>Function2&lt;T1, T2, R&gt;</td><td>T1, T2 =&gt; R (function of two arguments)</td></tr>
</table>
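To illustrate how these function classes interact with the overloaded `map` method, here is a rough sketch (it assumes an existing `JavaRDD<String>` named `lines`; the variable names and data are illustrative):

{% highlight java %}
import scala.Tuple2;
import spark.api.java.JavaDoubleRDD;
import spark.api.java.JavaPairRDD;
import spark.api.java.JavaRDD;
import spark.api.java.function.DoubleFunction;
import spark.api.java.function.PairFunction;

// Passing a PairFunction to map yields a JavaPairRDD, which adds key-value
// operations such as reduceByKey and join.
JavaPairRDD<String, Integer> byLength = lines.map(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, s.length());
    }
  });

// Passing a DoubleFunction to map yields a JavaDoubleRDD, which adds numeric
// operations such as mean() and stdev().
JavaDoubleRDD lengths = lines.map(new DoubleFunction<String>() {
  public Double call(String s) { return (double) s.length(); }
});
System.out.println("mean: " + lengths.mean() + ", stdev: " + lengths.stdev());
{% endhighlight %}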
# Other Features
The Java API supports other Spark features, including
[accumulators]({{HOME_PATH}}scala-programming-guide.html#accumulators),
[broadcast variables]({{HOME_PATH}}scala-programming-guide.html#broadcast_variables), and
[caching]({{HOME_PATH}}scala-programming-guide.html#caching).
[broadcast variables]({{HOME_PATH}}scala-programming-guide.html#broadcast-variables), and
[caching]({{HOME_PATH}}scala-programming-guide.html#rdd-persistence).
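As a rough sketch of what these features look like from Java (this assumes an existing `JavaSparkContext` named `sc`; the method names `intAccumulator`, `add`, and `value` reflect our understanding of the API and should be treated as assumptions rather than a definitive reference):

{% highlight java %}
import java.util.Arrays;
import spark.Accumulator;
import spark.api.java.JavaRDD;
import spark.api.java.function.VoidFunction;
import spark.broadcast.Broadcast;

// A broadcast variable ships a read-only value to each node once.
final Broadcast<int[]> table = sc.broadcast(new int[] {1, 2, 3});

// An accumulator can only be added to from tasks and read back on the driver.
final Accumulator<Integer> counter = sc.intAccumulator(0);

JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));
lines.cache();  // ask Spark to keep this RDD in memory for reuse

lines.foreach(new VoidFunction<String>() {
  public void call(String s) {
    // Read the broadcast value and bump the accumulator.
    counter.add(table.value().length);
  }
});
System.out.println("Accumulated: " + counter.value());
{% endhighlight %}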
# Example
@@ -130,8 +137,6 @@ JavaPairRDD<String, Integer> ones = words.map(
Note that `map` was passed a `PairFunction<String, String, Integer>` and
returned a `JavaPairRDD<String, Integer>`.
To finish the word count program, we will use `reduceByKey` to count the
occurrences of each word:
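The unchanged code block itself is collapsed in this diff. As an illustrative sketch (not the guide's exact listing), the `reduceByKey` step typically looks something like the following, reusing the `ones` pair RDD from the hunk context above:

{% highlight java %}
// Uses spark.api.java.JavaPairRDD and spark.api.java.function.Function2,
// imported earlier in the full example.
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
  });
{% endhighlight %}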
@@ -161,12 +166,23 @@ JavaPairRDD<String, Integer> counts = lines.flatMap(
...
);
{% endhighlight %}
There is no performance difference between these approaches; the choice is
a matter of style.
just a matter of style.
# Javadoc
We currently provide documentation for the Java API as Scaladoc, in the
[`spark.api.java` package]({{HOME_PATH}}api/core/index.html#spark.api.java.package), because
some of the classes are implemented in Scala. The main downside is that the types and function
definitions show Scala syntax (for example, `def reduce(func: Function2[T, T, T]): T` instead of
`T reduce(Function2<T, T, T> func)`).
We hope to generate documentation with Java-style syntax in the future.
# Where to Go from Here
# Where to go from here
Spark includes several sample jobs using the Java API in
Spark includes several sample programs using the Java API in
`examples/src/main/java`. You can run them by passing the class name to the
`run` script included in Spark -- for example, `./run
spark.examples.JavaWordCount`. Each example program prints usage help when run
@@ -205,6 +205,10 @@ The following tables list the transformations and actions currently supported (s
<td> <b>saveAsSequenceFile</b>(<i>path</i>) </td>
<td> Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). </td>
</tr>
<tr>
<td> <b>countByKey</b>() </td>
<td> Only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key. </td>
</tr>
<tr>
<td> <b>foreach</b>(<i>func</i>) </td>
<td> Run a function <i>func</i> on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems. </td>
@@ -273,6 +277,7 @@ In addition, each RDD can be stored using a different *storage level*, allowing
As you can see, Spark supports a variety of storage levels that give different tradeoffs between memory usage
and CPU efficiency. We recommend going through the following process to select one:
* If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY_DESER`), leave them that way. This is the most
CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
* If not, try using `MEMORY_ONLY` and [selecting a fast serialization library]({{HOME_PATH}}tuning.html) to make the objects
@@ -329,4 +334,4 @@ res2: Int = 10
You can see some [example Spark programs](http://www.spark-project.org/examples.html) on the Spark website.
In addition, Spark includes several sample jobs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.
In addition, Spark includes several sample programs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.