Matei Zaharia authored
layout: global
title: Java Programming Guide
The Spark Java API exposes all the Spark features available in the Scala version to Java. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don't know Scala. This guide will show how to use the Spark features described there in Java.
The Spark Java API is defined in the org.apache.spark.api.java package, and includes a JavaSparkContext for initializing Spark and JavaRDD classes, which support the same methods as their Scala counterparts but take Java functions and return Java data and collection types. The main differences have to do with passing functions to RDD operations (e.g. map) and handling RDDs of different types, as discussed next.
Key Differences in the Java API
There are a few key differences between the Java and Scala APIs:
- Java versions before Java 8 do not support anonymous or first-class functions, so functions must be implemented by extending the org.apache.spark.api.java.function.Function, Function2, etc. classes.
- To maintain type safety, the Java API defines specialized Function and RDD classes for key-value pairs and doubles. For example, JavaPairRDD stores key-value pairs.
- RDD methods like collect() and countByKey() return Java collection types, such as java.util.List and java.util.Map.
- Key-value pairs, which are simply written as (key, value) in Scala, are represented by the scala.Tuple2 class, and need to be created using new Tuple2<K, V>(key, value).
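The differences listed above can be illustrated without a Spark installation. The following is a minimal plain-Java sketch, not Spark code: Fn and Pair are hypothetical stand-ins for Spark's Function class and scala.Tuple2, and the map and countByKey helpers are toy list-based methods that only mimic the shape of the corresponding RDD operations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Spark-style function class (not Spark's own Function).
abstract class Fn<T, R> {
    abstract R call(T input);
}

// Hypothetical stand-in for scala.Tuple2: an immutable key-value pair.
final class Pair<K, V> {
    final K key;
    final V value;

    Pair(K key, V value) {
        this.key = key;
        this.value = value;
    }
}

class FunctionObjectSketch {
    // Applies a function object to every element and returns a java.util.List.
    static <T, R> List<R> map(List<T> input, Fn<T, R> fn) {
        List<R> out = new ArrayList<R>();
        for (T item : input) {
            out.add(fn.call(item));
        }
        return out;
    }

    // Counts how many pairs share each key and returns a java.util.Map.
    static <K, V> Map<K, Long> countByKey(List<Pair<K, V>> pairs) {
        Map<K, Long> counts = new HashMap<K, Long>();
        for (Pair<K, V> pair : pairs) {
            Long seen = counts.get(pair.key);
            counts.put(pair.key, seen == null ? 1L : seen + 1L);
        }
        return counts;
    }

    // The function is written by extending the function class, here as an
    // anonymous subclass, and each result is an explicitly constructed pair.
    static List<Pair<String, Integer>> wordLengths(List<String> words) {
        return map(words, new Fn<String, Pair<String, Integer>>() {
            @Override
            Pair<String, Integer> call(String word) {
                return new Pair<String, Integer>(word, word.length());
            }
        });
    }

    public static void main(String[] args) {
        List<Pair<String, Integer>> pairs =
            wordLengths(Arrays.asList("spark", "java", "spark"));
        System.out.println(countByKey(pairs));
    }
}
```

With the real Spark classes the same shape would apply: a function object built by extending a function class, pairs built with new Tuple2<K, V>(key, value), and results handed back as java.util collections.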
RDD Classes
Spark defines additional operations on RDDs of key-value pairs and doubles, such as reduceByKey, join, and stdev.
In the Scala API, these methods are automatically added using Scala's implicit conversions mechanism. In the Java API, the extra methods are defined in the JavaPairRDD and JavaDoubleRDD classes. RDD methods like map are overloaded to accept the specialized PairFunction and DoubleFunction classes, allowing them to return RDDs of the appropriate types. Common methods like filter and sample are implemented by each specialized RDD class, so filtering a PairRDD returns a new PairRDD, etc. (this achieves the "same-result-type" principle used by the Scala collections framework).
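The two ideas in this section, an overloaded map that selects the result type from the function class it is given, and a filter that each specialized class re-implements to return its own type, can be sketched in plain Java. This is an illustration under stated assumptions, not Spark's implementation: SimpleRDD, SimpleDoubleRDD, ElementFn, and ToDoubleFn are hypothetical in-memory stand-ins, and the subclass relationship between the two dataset classes is a simplification chosen to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a general one-argument function class.
abstract class ElementFn<T, R> {
    abstract R call(T value);
}

// Hypothetical stand-in for a function class specialized to return double.
abstract class ToDoubleFn<T> {
    abstract double call(T value);
}

// Toy in-memory dataset standing in for a generic RDD class.
class SimpleRDD<T> {
    protected final List<T> items;

    SimpleRDD(List<T> items) {
        this.items = new ArrayList<T>(items);
    }

    // General map: the result is another generic dataset.
    <R> SimpleRDD<R> map(ElementFn<T, R> fn) {
        List<R> out = new ArrayList<R>();
        for (T item : items) {
            out.add(fn.call(item));
        }
        return new SimpleRDD<R>(out);
    }

    // Overload selected when a ToDoubleFn is passed: returns the specialized class.
    SimpleDoubleRDD map(ToDoubleFn<T> fn) {
        List<Double> out = new ArrayList<Double>();
        for (T item : items) {
            out.add(fn.call(item));
        }
        return new SimpleDoubleRDD(out);
    }

    SimpleRDD<T> filter(ElementFn<T, Boolean> keep) {
        return new SimpleRDD<T>(select(keep));
    }

    List<T> collect() {
        return new ArrayList<T>(items);
    }

    // Shared helper: the elements for which the predicate returns true.
    protected List<T> select(ElementFn<T, Boolean> keep) {
        List<T> out = new ArrayList<T>();
        for (T item : items) {
            if (keep.call(item)) {
                out.add(item);
            }
        }
        return out;
    }
}

// Specialized dataset of doubles with an extra numeric operation.
class SimpleDoubleRDD extends SimpleRDD<Double> {
    SimpleDoubleRDD(List<Double> items) {
        super(items);
    }

    // Re-implemented so that filtering keeps the specialized type.
    @Override
    SimpleDoubleRDD filter(ElementFn<Double, Boolean> keep) {
        return new SimpleDoubleRDD(select(keep));
    }

    // Population standard deviation of the elements (NaN when empty).
    double stdev() {
        double mean = 0.0;
        for (double v : items) {
            mean += v / items.size();
        }
        double sumSq = 0.0;
        for (double v : items) {
            sumSq += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSq / items.size());
    }
}

class SameResultTypeSketch {
    static double stdevOfLongWordLengths(List<String> words) {
        // Passing a ToDoubleFn picks the overload that returns SimpleDoubleRDD.
        SimpleDoubleRDD lengths = new SimpleRDD<String>(words).map(new ToDoubleFn<String>() {
            @Override
            double call(String word) {
                return word.length();
            }
        });
        // filter returns SimpleDoubleRDD again, so stdev() is still available.
        SimpleDoubleRDD atLeastThree = lengths.filter(new ElementFn<Double, Boolean>() {
            @Override
            Boolean call(Double length) {
                return length >= 3.0;
            }
        });
        return atLeastThree.stdev();
    }
}
```

If filter were declared only on the generic class, the call to stdev() in the last step would not compile, because the filtered result would have lost its specialized type. Re-implementing the common methods in each specialized class is what avoids that.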