Commit 42935330 authored by Matei Zaharia

Update docs about HDFS versions

parent f3a96484
---
layout: global
title: Spark Overview
---
Apache Spark is a cluster computing system that aims to make data analytics faster to run and faster to write.
It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html), and a general execution engine that supports rich operator graphs.
Spark can run on the Apache Mesos cluster manager, Hadoop YARN, Amazon EC2, or without an independent resource manager ("standalone mode").
# Downloading
Get Spark by visiting the [downloads page](http://spark.incubator.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}.
# Building
Spark uses [Simple Build Tool](http://www.scala-sbt.org), which is bundled with it. To compile the code, go into the top-level Spark directory and run
    sbt/sbt assembly
Spark also supports building with Maven; see the [instructions for building Spark with Maven](building-with-maven.html).
For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_VERSION}}. If you write applications in Scala, you will need to use this same version of Scala in your own program -- newer major versions may not work. You can get the right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/).
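For example, if you build with sbt, this just means pinning your project's Scala version to the one Spark depends on (a minimal sketch of the relevant setting):

{% highlight scala %}
// Keep your project's Scala version in line with the version Spark was built against
scalaVersion := "{{site.SCALA_VERSION}}"
{% endhighlight %}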
# Testing the Build
Spark comes with several sample programs in the `examples` directory.
To run one of the samples, use `./run-example <class> <params>` in the top-level Spark directory
(the `run-example` script sets up the appropriate paths and launches that program).
For example, `./run-example spark.examples.SparkPi` will run a sample program that estimates Pi. Each
example prints usage help if no params are given.
Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing.
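The same master strings apply when you create a `SparkContext` in your own program; the application name below is just an illustrative placeholder:

{% highlight scala %}
import spark.SparkContext

// "local" runs with one thread, "local[4]" with four threads; a standalone
// cluster URL such as "spark://host:7077" connects to a distributed cluster instead.
val sc = new SparkContext("local[4]", "MasterUrlExample")
{% endhighlight %}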
Finally, Spark can be used interactively through modified versions of the Scala shell (`./spark-shell`) or
Python interpreter (`./pyspark`). These are a great way to learn Spark.
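For example, a first interactive session might look like the following (a sketch assuming you launch `./spark-shell` from the Spark directory, which contains a `README.md`; the shell creates a `SparkContext` for you as `sc`):

{% highlight scala %}
scala> val lines = sc.textFile("README.md")      // build an RDD from a local text file
scala> lines.filter(_.contains("Spark")).count() // count the lines mentioning Spark
{% endhighlight %}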
# A Note About Hadoop Versions
By default, Spark links against a particular version of HDFS (currently Hadoop 1.0.4). You can change this by setting the `SPARK_HADOOP_VERSION` variable when compiling:
    SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
In addition, if you wish to run Spark on [YARN](running-on-yarn.html), you should also set `SPARK_YARN` to `true`:
    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
resolvers ++= Seq(
"Spray Repository" at "http://repo.spray.cc/")
{% endhighlight %}
If you wish to read data from Hadoop's HDFS, you will also need to add a dependency on `hadoop-client` for your version of HDFS:
{% highlight scala %}
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"
{% endhighlight %}
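Putting it together, a `simple.sbt` with both dependencies might look roughly like this (a sketch; the project name, version numbers, and resolver are illustrative and should be adjusted to your setup):

{% highlight scala %}
name := "Simple Project"

version := "1.0"

scalaVersion := "{{site.SCALA_VERSION}}"

libraryDependencies += "org.spark-project" %% "spark-core" % "{{site.SPARK_VERSION}}"

// Match the hadoop-client version to the HDFS cluster you will read from (illustrative value)
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.2.1"

resolvers += "Spray Repository" at "http://repo.spray.cc/"
{% endhighlight %}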
Finally, for sbt to work correctly, we'll need to lay out `SimpleJob.scala` and `simple.sbt` according to the typical directory structure. Once that is in place, we can create a JAR package containing the job's code, then use `sbt run` to execute our example job.
{% highlight bash %}
$ find .
To build the job, we also write a Maven `pom.xml` file that lists Spark as a dependency:
</project>
{% endhighlight %}
If you wish to read data from Hadoop's HDFS, you will also need to add a dependency on `hadoop-client` for your version of HDFS:
{% highlight xml %}
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>...</version>
</dependency>
{% endhighlight %}
We lay out these files according to the canonical Maven directory structure:
{% highlight bash %}
$ find .
Lines with a: 46, Lines with b: 23
{% endhighlight python %}
This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
Also, this example links against the default version of HDFS that Spark builds with (1.0.4). You can run it against other HDFS versions by [building Spark with another HDFS version](index.html#a-note-about-hadoop-versions).
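For example, in the Scala API, switching the job to a distributed input source is just a matter of passing an HDFS URI to `textFile` (the host, port, and path below are placeholders; `sc` is your `SparkContext`):

{% highlight scala %}
// Read input from HDFS instead of the local filesystem (placeholder URI)
val logData = sc.textFile("hdfs://namenode:9000/user/you/README.md")
{% endhighlight %}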
This guide shows each of these features and walks through some samples. It assumes some familiarity with Scala.
# Linking with Spark
Spark {{site.SPARK_VERSION}} uses Scala {{site.SCALA_VERSION}}. If you write applications in Scala, you'll need to use this same version of Scala in your program -- newer major versions may not work.
To write a Spark application, you need to add a dependency on Spark. If you use SBT or Maven, Spark is available through Maven Central at:
    groupId = org.spark-project
    artifactId = spark-core_{{site.SCALA_VERSION}}
    version = {{site.SPARK_VERSION}}
In addition, if you wish to access an HDFS cluster, you need to add a dependency on `hadoop-client` for your version of HDFS:
    groupId = org.apache.hadoop
    artifactId = hadoop-client
    version = <your-hdfs-version>
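In sbt, for instance, these coordinates correspond to lines like the following (a sketch; substitute the HDFS version your cluster actually runs):

{% highlight scala %}
libraryDependencies += "org.spark-project" %% "spark-core" % "{{site.SPARK_VERSION}}"

// Only needed if your application reads from or writes to HDFS
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"
{% endhighlight %}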
For other build systems, you can run `sbt/sbt assembly` to pack Spark and its dependencies into one JAR (`assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop*.jar`), then add this to your CLASSPATH. Set the HDFS version as described [here](index.html#a-note-about-hadoop-versions).
Finally, you need to import some Spark classes and implicit conversions into your program. Add the following lines:
{% highlight scala %}
import spark.SparkContext
import SparkContext._
{% endhighlight %}