diff --git a/README.md b/README.md
index 2c08a4ac638c351b0feeda7a0240e21c882e784f..b91e4cf86713b6d207d5a7e16895d7d2cd3b2b83 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,9 @@ This README file only contains basic setup instructions.
 ## Building

 Spark requires Scala 2.10. The project is built using Simple Build Tool (SBT),
-which can be obtained [here](http://www.scala-sbt.org). To build Spark and its example programs, run:
+which can be obtained [here](http://www.scala-sbt.org). If SBT is installed, we
+will use the system version of sbt; otherwise, we will attempt to download it
+automatically. To build Spark and its example programs, run:

     ./sbt/sbt assembly

@@ -55,22 +57,22 @@ For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop versions
 without YARN, use:

     # Apache Hadoop 1.2.1
-    $ SPARK_HADOOP_VERSION=1.2.1 sbt assembly
+    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

     # Cloudera CDH 4.2.0 with MapReduce v1
-    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt assembly
+    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

 For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
 with YARN, also set `SPARK_YARN=true`:

     # Apache Hadoop 2.0.5-alpha
-    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly
+    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

     # Cloudera CDH 4.2.0 with MapReduce v2
-    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt assembly
+    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

     # Apache Hadoop 2.2.X and newer
-    $ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt assembly
+    $ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

 When developing a Spark application, specify the Hadoop version by adding the
 "hadoop-client" artifact to your project's dependencies. For example, if you're
diff --git a/docs/README.md b/docs/README.md
index e3d6c9a5bc211d567dbdad358dd7a94fe7c4cfda..dfcf7535538f06c99ec145d03a18a8a75fda9c8b 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -27,10 +27,10 @@ To mark a block of code in your markdown to be syntax highlighted by jekyll duri

 ## API Docs (Scaladoc and Epydoc)

-You can build just the Spark scaladoc by running `sbt doc` from the SPARK_PROJECT_ROOT directory.
+You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PROJECT_ROOT directory.

 Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory.

-When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
+When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).

 NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.
diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb
index ef9912c8082593cc926770e1ce1b4a3ebbb5e056..431de909cbf4b0807ac772cc69a6b62c0330bf8a 100644
--- a/docs/_plugins/copy_api_dirs.rb
+++ b/docs/_plugins/copy_api_dirs.rb
@@ -26,8 +26,8 @@ if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1')
   curr_dir = pwd
   cd("..")

-  puts "Running sbt doc from " + pwd + "; this may take a few minutes..."
-  puts `sbt doc`
+  puts "Running sbt/sbt doc from " + pwd + "; this may take a few minutes..."
+  puts `sbt/sbt doc`

   puts "Moving back into docs dir."
   cd("docs")
diff --git a/docs/api.md b/docs/api.md
index 11e2c15324ef06e28ce87e124ff1d1019a442707..e86d07770a80be61dd0831606421e3ee7d73d284 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -3,7 +3,7 @@ layout: global
 title: Spark API documentation (Scaladoc)
 ---

-Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt doc` from the Spark project home directory.
+Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt/sbt doc` from the Spark project home directory.

 - [Spark](api/core/index.html)
 - [Spark Examples](api/examples/index.html)
diff --git a/docs/hadoop-third-party-distributions.md b/docs/hadoop-third-party-distributions.md
index 141d475ba6610a875390e88de2a0854875de8490..de6a2b0a43bd5ee4de7dcd1c22a75991c50f2581 100644
--- a/docs/hadoop-third-party-distributions.md
+++ b/docs/hadoop-third-party-distributions.md
@@ -12,7 +12,7 @@ with these distributions:
 When compiling Spark, you'll need to
 [set the SPARK_HADOOP_VERSION flag](index.html#a-note-about-hadoop-versions):

-    SPARK_HADOOP_VERSION=1.0.4 sbt assembly
+    SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly

 The table below lists the corresponding `SPARK_HADOOP_VERSION` code for each CDH/HDP release. Note that
 some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
diff --git a/docs/index.md b/docs/index.md
index bf8d1c3375e8526e1f1df14e22cf35918346eff3..86d574daaab4a8fc530f5a7f80a179133c4553a6 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -17,7 +17,7 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). All you n
 Spark uses [Simple Build Tool](http://www.scala-sbt.org), which is bundled with
 it. To compile the code, go into the top-level Spark directory and run

-    sbt assembly
+    sbt/sbt assembly

 For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_VERSION}}. If you write applications in Scala, you will need to use this same version of Scala in your own program -- newer major versions may not work. You can get the right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/).
@@ -56,12 +56,12 @@ Hadoop, you must build Spark against the same version that your cluster uses.
 By default, Spark links to Hadoop 1.0.4. You can change this by setting the
 `SPARK_HADOOP_VERSION` variable when compiling:

-    SPARK_HADOOP_VERSION=2.2.0 sbt assembly
+    SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly

 In addition, if you wish to run Spark on [YARN](running-on-yarn.html), set
 `SPARK_YARN` to `true`:

-    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly
+    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

 Note that on Windows, you need to set the environment variables on separate lines, e.g., `set
 SPARK_HADOOP_VERSION=1.2.1`.
diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md
index 5d48cb676a1422480811a76b7079b52745e99a70..dc187b3efec9b7b7b8f6075275a4faf9da40b6fe 100644
--- a/docs/python-programming-guide.md
+++ b/docs/python-programming-guide.md
@@ -69,7 +69,7 @@ The script automatically adds the `bin/pyspark` package to the `PYTHONPATH`.
 The `bin/pyspark` script launches a Python interpreter that is configured to run PySpark applications. To use `pyspark` interactively, first build Spark, then launch it directly from the command line without any options:

 {% highlight bash %}
-$ sbt assembly
+$ sbt/sbt assembly
 $ ./bin/pyspark
 {% endhighlight %}

diff --git a/docs/quick-start.md b/docs/quick-start.md
index 9b9261cfff8eabdf9083449cdae08cd821663f1f..153081bdaa2863d732956e28dc6e9ad588769f27 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -12,7 +12,7 @@ See the [programming guide](scala-programming-guide.html) for a more complete re
 To follow along with this guide, you only need to have successfully built Spark on one machine. Simply go into your Spark directory and run:

 {% highlight bash %}
-$ sbt assembly
+$ sbt/sbt assembly
 {% endhighlight %}

 # Interactive Analysis with the Spark Shell
@@ -146,7 +146,7 @@ If you also wish to read data from Hadoop's HDFS, you will also need to add a de
 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"
 {% endhighlight %}

-Finally, for sbt to work correctly, we'll need to layout `SimpleApp.scala` and `simple.sbt` according to the typical directory structure. Once that is in place, we can create a JAR package containing the application's code, then use `sbt run` to execute our program.
+Finally, for sbt to work correctly, we'll need to lay out `SimpleApp.scala` and `simple.sbt` according to the typical directory structure. Once that is in place, we can create a JAR package containing the application's code, then use `sbt/sbt run` to execute our program.

 {% highlight bash %}
 $ find .
@@ -157,8 +157,8 @@ $ find .
 ./src/main/scala
 ./src/main/scala/SimpleApp.scala

-$ sbt package
-$ sbt run
+$ sbt/sbt package
+$ sbt/sbt run
 ...
 Lines with a: 46, Lines with b: 23
 {% endhighlight %}
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index a35e003cdc1ee5130f0e472a520a98499f40b716..717071d72c9b9ef93eb567a45cda24c82fd22bd1 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -12,7 +12,7 @@ was added to Spark in version 0.6.0, and improved in 0.7.0 and 0.8.0.
 We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster. This can be built by setting the Hadoop version and `SPARK_YARN` environment variable, as follows:

-    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly
+    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

 The assembled JAR will be something like this:
 `./assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly_{{site.SPARK_VERSION}}-hadoop2.0.5.jar`.

@@ -25,7 +25,7 @@ The build process now also supports new YARN versions (2.2.x). See below.
 - The assembled jar can be installed into HDFS or used locally.
 - Your application code must be packaged into a separate JAR file.

-If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt assembly`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.
+If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt/sbt assembly`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.

 # Configuration

@@ -72,7 +72,7 @@ The command to launch the YARN Client is as follows:
 For example:

     # Build the Spark assembly JAR and the Spark examples JAR
-    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly
+    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

     # Configure logging
     $ cp conf/log4j.properties.template conf/log4j.properties
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 3d0e8923d54279f9e319d55211d4f379c67ed971..c1ef46a1cded743d59f7d1e333d3936e001bd3df 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -31,7 +31,7 @@ In addition, if you wish to access an HDFS cluster, you need to add a dependency
     artifactId = hadoop-client
     version = <your-hdfs-version>

-For other build systems, you can run `sbt assembly` to pack Spark and its dependencies into one JAR (`assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop*.jar`), then add this to your CLASSPATH. Set the HDFS version as described [here](index.html#a-note-about-hadoop-versions).
+For other build systems, you can run `sbt/sbt assembly` to pack Spark and its dependencies into one JAR (`assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop*.jar`), then add this to your CLASSPATH. Set the HDFS version as described [here](index.html#a-note-about-hadoop-versions).

 Finally, you need to import some Spark classes and implicit conversions into your program. Add the following lines:

diff --git a/make-distribution.sh b/make-distribution.sh
index 6c466c8a06a6da151a1ddd07265ca086366c852c..61e6654dcb70b529df9336d6fde846dcb8bf9742 100755
--- a/make-distribution.sh
+++ b/make-distribution.sh
@@ -44,13 +44,16 @@ DISTDIR="$FWDIR/dist"

 # Get version from SBT
 export TERM=dumb # Prevents color codes in SBT output
-if ! test `which sbt` ;then
+# Capture sbt's output so the version can be parsed from it below.
+VERSIONSTRING=$($FWDIR/sbt/sbt "show version")
+
+if [ $? -ne 0 ]; then
     echo -e "You need sbt installed and available on your path."
echo -e "Download sbt from http://www.scala-sbt.org/" exit -1; fi -VERSION=$(sbt "show version" | tail -1 | cut -f 2 | sed 's/^\([a-zA-Z0-9.-]*\).*/\1/') +VERSION=$(echo "${VERSIONSTRING}" | tail -1 | cut -f 2 | sed 's/^\([a-zA-Z0-9.-]*\).*/\1/') +echo "Version is ${VERSION}" # Initialize defaults SPARK_HADOOP_VERSION=1.0.4 diff --git a/sbt/sbt b/sbt/sbt index 6d2caca120ca93386acbafe5650c35ca94b124ed..09cc5a0b4ac48891eb600aea95dec844a68ba8f7 100755 --- a/sbt/sbt +++ b/sbt/sbt @@ -27,22 +27,17 @@ else wget --progress=bar ${URL1} -O ${JAR} || wget --progress=bar ${URL2} -O ${JAR} else printf "You do not have curl or wget installed, please install sbt manually from http://www.scala-sbt.org/\n" - exit + exit -1 fi fi if [ ! -f ${JAR} ]; then # We failed to download - printf "Our attempt to download sbt locally to {$JAR} failed. Please install sbt manually from http://www.scala-sbt.org/\n" - exit + printf "Our attempt to download sbt locally to ${JAR} failed. Please install sbt manually from http://www.scala-sbt.org/\n" + exit -1 fi printf "Launching sbt from .sbtlib\n" java \ - -Duser.timezone=UTC \ - -Djava.awt.headless=true \ - -Dfile.encoding=UTF-8 \ - -XX:MaxPermSize=256m \ - -Xmx1g \ - -noverify \ + -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \ -jar ${JAR} \ "$@" fi