From 9ddad0dcb47e3326151a53e270448b5135805ae5 Mon Sep 17 00:00:00 2001
From: Matei Zaharia <matei@eecs.berkeley.edu>
Date: Sat, 31 Aug 2013 17:40:33 -0700
Subject: [PATCH] Fixes suggested by Patrick

---
 conf/spark-env.sh.template    |  2 +-
 docs/hardware-provisioning.md |  1 -
 docs/index.md                 |  9 +++++----
 docs/quick-start.md           | 10 ++--------
 4 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/conf/spark-env.sh.template b/conf/spark-env.sh.template
index a367d59d64..d92d2e2ae3 100755
--- a/conf/spark-env.sh.template
+++ b/conf/spark-env.sh.template
@@ -4,7 +4,7 @@
 # spark-env.sh and edit that to configure Spark for your site.
 #
 # The following variables can be set in this file:
-# - SPARK_LOCAL_IP, to override the IP address binds to
+# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
 # - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos
 # - SPARK_JAVA_OPTS, to set node-specific JVM options for Spark. Note that
 #   we recommend setting app-wide options in the application's driver program.
diff --git a/docs/hardware-provisioning.md b/docs/hardware-provisioning.md
index d21e2a3d70..e5f054cb14 100644
--- a/docs/hardware-provisioning.md
+++ b/docs/hardware-provisioning.md
@@ -21,7 +21,6 @@ Hadoop and Spark on a common cluster manager like [Mesos](running-on-mesos.html)
 [Hadoop YARN](running-on-yarn.html).
 
 * If this is not possible, run Spark on different nodes in the same local-area network as HDFS.
-If your cluster spans multiple racks, include some Spark nodes on each rack.
 
 * For low-latency data stores like HBase, it may be preferrable to run computing jobs on
 different nodes than the storage system to avoid interference.
diff --git a/docs/index.md b/docs/index.md
index bcd7dad6ae..0ea0e103e4 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -40,12 +40,13 @@ Python interpreter (`./pyspark`). These are a great way to learn Spark.
 Spark uses the Hadoop-client library to talk to HDFS and other Hadoop-supported
 storage systems. Because the HDFS protocol has changed in different versions of
 Hadoop, you must build Spark against the same version that your cluster uses.
-You can do this by setting the `SPARK_HADOOP_VERSION` variable when compiling:
+By default, Spark links to Hadoop 1.0.4. You can change this by setting the
+`SPARK_HADOOP_VERSION` variable when compiling:
 
     SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
 
-In addition, if you wish to run Spark on [YARN](running-on-yarn.md), you should also
-set `SPARK_YARN`:
+In addition, if you wish to run Spark on [YARN](running-on-yarn.md), set
+`SPARK_YARN` to `true`:
 
     SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
 
@@ -94,7 +95,7 @@ set `SPARK_YARN`:
 exercises about Spark, Shark, Mesos, and more. [Videos](http://ampcamp.berkeley.edu/agenda-2012),
 [slides](http://ampcamp.berkeley.edu/agenda-2012) and [exercises](http://ampcamp.berkeley.edu/exercises-2012) are
 available online for free.
-* [Code Examples](http://spark.incubator.apache.org/examples.html): more are also available in the [examples subfolder](https://github.com/mesos/spark/tree/master/examples/src/main/scala/spark/examples) of Spark
+* [Code Examples](http://spark.incubator.apache.org/examples.html): more are also available in the [examples subfolder](https://github.com/mesos/spark/tree/master/examples/src/main/scala/) of Spark
 * [Paper Describing Spark](http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)
 * [Paper Describing Spark Streaming](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf)
 
diff --git a/docs/quick-start.md b/docs/quick-start.md
index bac5d690a6..11d4370a1d 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -126,7 +126,7 @@ object SimpleJob {
 
 This job simply counts the number of lines containing 'a' and the number containing 'b' in the Spark README. Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, we initialize a SparkContext as part of the job. We pass the SparkContext constructor four arguments, the type of scheduler we want to use (in this case, a local scheduler), a name for the job, the directory where Spark is installed, and a name for the jar file containing the job's sources. The final two arguments are needed in a distributed setting, where Spark is running across several nodes, so we include them for completeness. Spark will automatically ship the jar files you list to slave nodes.
 
-This file depends on the Spark API, so we'll also include an sbt configuration file, `simple.sbt` which explains that Spark is a dependency. This file also adds two repositories which host Spark dependencies:
+This file depends on the Spark API, so we'll also include an sbt configuration file, `simple.sbt` which explains that Spark is a dependency. This file also adds a repository that Spark depends on:
 
 {% highlight scala %}
 name := "Simple Project"
@@ -137,9 +137,7 @@ scalaVersion := "{{site.SCALA_VERSION}}"
 
 libraryDependencies += "org.spark-project" %% "spark-core" % "{{site.SPARK_VERSION}}"
 
-resolvers ++= Seq(
-  "Akka Repository" at "http://repo.akka.io/releases/",
-  "Spray Repository" at "http://repo.spray.cc/")
+resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
 {% endhighlight %}
 
 If you also wish to read data from Hadoop's HDFS, you will also need to add a dependency on `hadoop-client` for your version of HDFS:
@@ -210,10 +208,6 @@ To build the job, we also write a Maven `pom.xml` file that lists Spark as a dep
   <packaging>jar</packaging>
   <version>1.0</version>
   <repositories>
-    <repository>
-      <id>Spray.cc repository</id>
-      <url>http://repo.spray.cc</url>
-    </repository>
     <repository>
       <id>Akka repository</id>
       <url>http://repo.akka.io/releases</url>
--
GitLab
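
For reference, a minimal sketch of the quick-start `simple.sbt` as it reads once these hunks are applied, assembled only from lines visible in the patch above; any `simple.sbt` lines outside the hunks are omitted, and the `{{site.*}}` tokens are Jekyll placeholders substituted when the docs are built:

    name := "Simple Project"

    scalaVersion := "{{site.SCALA_VERSION}}"

    libraryDependencies += "org.spark-project" %% "spark-core" % "{{site.SPARK_VERSION}}"

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"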