# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.

<http://spark.apache.org/>

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.

## Building Spark

Spark is built using [Apache Maven](http://maven.apache.org/).
To build Spark and its example programs, run:

    build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

You can build Spark using more than one thread by using the `-T` option with Maven; see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).
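
For example, to build with four threads (an illustrative thread count; tune it to your machine):

    build/mvn -T 4 -DskipTests clean package
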
More detailed documentation is available from the project site, at
["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).

For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](http://spark.apache.org/developer-tools.html).

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1000:

    scala> sc.parallelize(1 to 1000).count()
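
As a further (illustrative) experiment in the same shell, the following counts only the even numbers and should return 500:

    scala> sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()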

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1000:

    >>> sc.parallelize(range(1000)).count()

## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, "local" to run locally with one thread,
or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi
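
To run the same example locally with four threads instead, you could use:

    MASTER=local[4] ./bin/run-example SparkPi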

Many of the example programs print usage help if no params are given.

## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](http://spark.apache.org/developer-tools.html#individual-tests).
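
For instance, one possible invocation, assuming the ScalaTest Maven plugin's `wildcardSuites` property and the `core` module path (the suite name is just an example; the linked page is authoritative):

    build/mvn test -pl core -DwildcardSuites=org.apache.spark.scheduler.DAGSchedulerSuite -Dtest=none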

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
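
For example, to target a (hypothetical) Hadoop 2.7.x cluster, the build accepts a `hadoop.version` property, along the lines of:

    build/mvn -Pyarn -Dhadoop.version=2.7.3 -DskipTests clean package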

## Configuration

Please refer to the [Configuration Guide](http://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview of how to configure Spark.
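
As a minimal sketch, properties can also be set in `conf/spark-defaults.conf` (the host and memory values below are only placeholders):

    spark.master            spark://master-host:7077
    spark.executor.memory   4g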

## Contributing

Please review the [Contribution to Spark guide](http://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.