---
layout: global
title: Spark Programming Guide
---
- This will become a table of contents (this text will be scraped).
{:toc}
# Overview
At a high level, every Spark application consists of a driver program that runs the user's main
function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only "added" to, such as counters and sums.
This guide shows each of these features and walks through some samples. It assumes some familiarity with Scala, especially with the syntax for closures. Note that you can also run Spark interactively using the `bin/spark-shell` script. We highly recommend doing that to follow along!
# Linking with Spark
Spark {{site.SPARK_VERSION}} uses Scala {{site.SCALA_BINARY_VERSION}}. If you write applications in Scala, you will need to use a compatible Scala version (e.g. {{site.SCALA_BINARY_VERSION}}.X) -- newer major versions may not work.
To write a Spark application, you need to add a dependency on Spark. If you use SBT or Maven, Spark is available through Maven Central at:
    groupId = org.apache.spark
    artifactId = spark-core_{{site.SCALA_BINARY_VERSION}}
    version = {{site.SPARK_VERSION}}
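With SBT, for example, these coordinates can be expressed as a single line in your build definition. This is only a sketch; the `%%` operator appends the Scala binary version to the artifact name, matching `spark-core_{{site.SCALA_BINARY_VERSION}}` above.

{% highlight scala %}
// build.sbt (sketch): depend on Spark Core, cross-built for your Scala version
libraryDependencies += "org.apache.spark" %% "spark-core" % "{{site.SPARK_VERSION}}"
{% endhighlight %}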
In addition, if you wish to access an HDFS cluster, you need to add a dependency on `hadoop-client` for your version of HDFS:
    groupId = org.apache.hadoop
    artifactId = hadoop-client
    version = <your-hdfs-version>
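Again as an SBT sketch (note the single `%` here, since `hadoop-client` is a plain Java artifact and is not cross-built against Scala versions):

{% highlight scala %}
// build.sbt (sketch): depend on hadoop-client; replace the version with your HDFS version
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"
{% endhighlight %}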
For other build systems, you can run `sbt/sbt assembly` to pack Spark and its dependencies into one JAR (`assembly/target/scala-{{site.SCALA_BINARY_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop*.jar`), then add this to your CLASSPATH. Set the HDFS version as described here.
Finally, you need to import some Spark classes and implicit conversions into your program. Add the following lines:
{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
{% endhighlight %}
# Initializing Spark
The first thing a Spark program must do is to create a `SparkContext` object, which tells Spark how to access a cluster. This is done through the following constructor:
{% highlight scala %}
new SparkContext(master, appName, [sparkHome], [jars])
{% endhighlight %}
or through `new SparkContext(conf)`, which takes a `SparkConf` object for more advanced configuration.
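As a brief sketch of this second form (the master URL and application name below are placeholders):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: build a SparkConf and pass it to the SparkContext constructor
val conf = new SparkConf()
  .setMaster("local[4]")   // placeholder: run locally with 4 worker threads
  .setAppName("My App")    // placeholder application name
val sc = new SparkContext(conf)
{% endhighlight %}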
The `master` parameter is a string specifying a Spark or Mesos cluster URL to connect to, or a special "local" string to run in local mode, as described below. `appName` is a name for your application, which will be shown in the cluster web UI. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described later.
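For example, a minimal context for local testing might look like this (the application name is a placeholder):

{% highlight scala %}
// Sketch: run locally with 4 worker threads; sparkHome and jars are not needed in local mode
val sc = new SparkContext("local[4]", "My App")
{% endhighlight %}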