diff --git a/README.md b/README.md index 1550a8b5512d976b8a22cba2fe4293211886ae79..22e7ab824577ae8d5f0b445a9d1c57670c4055bb 100644 --- a/README.md +++ b/README.md @@ -13,9 +13,9 @@ This README file only contains basic setup instructions. ## Building Spark requires Scala 2.10. The project is built using Simple Build Tool (SBT), -which is packaged with it. To build Spark and its example programs, run: +which can be obtained from [here](http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html). To build Spark and its example programs, run: - sbt/sbt assembly + sbt assembly Once you've built Spark, the easiest way to start using it is the shell: @@ -36,6 +36,22 @@ All of the Spark samples take a `<master>` parameter that is the cluster URL to connect to. This can be a mesos:// or spark:// URL, or "local" to run locally with one thread, or "local[N]" to run locally with N threads. +## Running tests + +### With sbt (requires sbt to be installed) +Once you have built Spark with `sbt assembly` as described in the [Building](#building) section, the test suites can be run as follows on *nix systems: + +`SPARK_HOME=$(pwd) SPARK_TESTING=1 sbt test` + +TODO: figure out instructions for Windows. + +### With Maven + +1. Build the assembly: +`mvn package -DskipTests` + +2. Run the tests: +`mvn test` ## A Note About Hadoop Versions @@ -49,22 +65,22 @@ For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop versions without YARN, use: # Apache Hadoop 1.2.1 - $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly + $ SPARK_HADOOP_VERSION=1.2.1 sbt assembly # Cloudera CDH 4.2.0 with MapReduce v1 - $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly + $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt assembly For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, also set `SPARK_YARN=true`: # Apache Hadoop 2.0.5-alpha - $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly + $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly # Cloudera CDH 4.2.0 with MapReduce v2 - $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly + $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt assembly # Apache Hadoop 2.2.X and newer - $ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly + $ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt assembly When developing a Spark application, specify the Hadoop version by adding the "hadoop-client" artifact to your project's dependencies. For example, if you're diff --git a/docs/README.md b/docs/README.md index dfcf7535538f06c99ec145d03a18a8a75fda9c8b..e3d6c9a5bc211d567dbdad358dd7a94fe7c4cfda 100644 --- a/docs/README.md +++ b/docs/README.md @@ -27,10 +27,10 @@ To mark a block of code in your markdown to be syntax highlighted by jekyll duri ## API Docs (Scaladoc and Epydoc) -You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PROJECT_ROOT directory. +You can build just the Spark scaladoc by running `sbt doc` from the SPARK_PROJECT_ROOT directory. Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. -When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. 
The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/). +When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/). NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`. diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb index 431de909cbf4b0807ac772cc69a6b62c0330bf8a..ef9912c8082593cc926770e1ce1b4a3ebbb5e056 100644 --- a/docs/_plugins/copy_api_dirs.rb +++ b/docs/_plugins/copy_api_dirs.rb @@ -26,8 +26,8 @@ if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1') curr_dir = pwd cd("..") - puts "Running sbt/sbt doc from " + pwd + "; this may take a few minutes..." - puts `sbt/sbt doc` + puts "Running sbt doc from " + pwd + "; this may take a few minutes..." + puts `sbt doc` puts "Moving back into docs dir." cd("docs") diff --git a/docs/api.md b/docs/api.md index e86d07770a80be61dd0831606421e3ee7d73d284..11e2c15324ef06e28ce87e124ff1d1019a442707 100644 --- a/docs/api.md +++ b/docs/api.md @@ -3,7 +3,7 @@ layout: global title: Spark API documentation (Scaladoc) --- -Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt/sbt doc` from the Spark project home directory. +Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt doc` from the Spark project home directory. - [Spark](api/core/index.html) - [Spark Examples](api/examples/index.html) diff --git a/docs/hadoop-third-party-distributions.md b/docs/hadoop-third-party-distributions.md index de6a2b0a43bd5ee4de7dcd1c22a75991c50f2581..141d475ba6610a875390e88de2a0854875de8490 100644 --- a/docs/hadoop-third-party-distributions.md +++ b/docs/hadoop-third-party-distributions.md @@ -12,7 +12,7 @@ with these distributions: When compiling Spark, you'll need to [set the SPARK_HADOOP_VERSION flag](index.html#a-note-about-hadoop-versions): - SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly + SPARK_HADOOP_VERSION=1.0.4 sbt assembly The table below lists the corresponding `SPARK_HADOOP_VERSION` code for each CDH/HDP release. Note that some Hadoop releases are binary compatible across client versions. This means the pre-built Spark diff --git a/docs/index.md b/docs/index.md index d3ac696d1e818a0dee8587756a6f8974276ac289..5278e33e1c054459568d3b57b5c2a92dc6c4505b 100644 --- a/docs/index.md +++ b/docs/index.md @@ -17,7 +17,7 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). All you n -Spark uses [Simple Build Tool](http://www.scala-sbt.org), which is bundled with it. To compile the code, go into the top-level Spark directory and run +Spark uses [Simple Build Tool](http://www.scala-sbt.org), which can be obtained from [here](http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html). To compile the code, go into the top-level Spark directory and run - sbt/sbt assembly + sbt assembly For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_VERSION}}. If you write applications in Scala, you will need to use this same version of Scala in your own program -- newer major versions may not work. You can get the right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/). @@ -56,12 +56,12 @@ Hadoop, you must build Spark against the same version that your cluster uses. 
By default, Spark links to Hadoop 1.0.4. You can change this by setting the `SPARK_HADOOP_VERSION` variable when compiling: - SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly + SPARK_HADOOP_VERSION=2.2.0 sbt assembly In addition, if you wish to run Spark on [YARN](running-on-yarn.html), set `SPARK_YARN` to `true`: - SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly + SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly Note that on Windows, you need to set the environment variables on separate lines, e.g., `set SPARK_HADOOP_VERSION=1.2.1`. diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md index 55e39b1de17a06c4c4a9e025769c66b0977cffb8..a33977ed82859ae33c7f9c6af276871bc99f1d2b 100644 --- a/docs/python-programming-guide.md +++ b/docs/python-programming-guide.md @@ -69,7 +69,7 @@ The script automatically adds the `pyspark` package to the `PYTHONPATH`. The `pyspark` script launches a Python interpreter that is configured to run PySpark applications. To use `pyspark` interactively, first build Spark, then launch it directly from the command line without any options: {% highlight bash %} -$ sbt/sbt assembly +$ sbt assembly $ ./pyspark {% endhighlight %} diff --git a/docs/quick-start.md b/docs/quick-start.md index 8f782db5b822b5a60a22c94be84b01cbb5bf9eeb..5c55def3985c6c351e9f7fc2f1aa5d347fc4445a 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -12,7 +12,7 @@ See the [programming guide](scala-programming-guide.html) for a more complete re To follow along with this guide, you only need to have successfully built Spark on one machine. Simply go into your Spark directory and run: {% highlight bash %} -$ sbt/sbt assembly +$ sbt assembly {% endhighlight %} # Interactive Analysis with the Spark Shell diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index aa75ca43241fb5fa831895a8c1871d36bcebbccc..13d5fd3685bfbaec238155320575f87ab0822469 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -12,7 +12,7 @@ was added to Spark in version 0.6.0, and improved in 0.7.0 and 0.8.0. We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster. This can be built by setting the Hadoop version and `SPARK_YARN` environment variable, as follows: - SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true ./sbt/sbt assembly + SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly The assembled JAR will be something like this: `./assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly_{{site.SPARK_VERSION}}-hadoop2.0.5.jar`. @@ -25,7 +25,7 @@ The build process now also supports new YARN versions (2.2.x). See below. - The assembled jar can be installed into HDFS or used locally. - Your application code must be packaged into a separate JAR file. -If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt/sbt assembly`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different. +If you want to test out the YARN deployment mode, you can use the current Spark examples. 
A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt assembly`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different. # Configuration @@ -72,7 +72,7 @@ The command to launch the YARN Client is as follows: For example: # Build the Spark assembly JAR and the Spark examples JAR - $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true ./sbt/sbt assembly + $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly # Configure logging $ cp conf/log4j.properties.template conf/log4j.properties diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md index 56d2a3a4a020282b935b9563dc43455367b6e6b8..3e7075c38203a58bd7d5518ad98e6880ab56c0df 100644 --- a/docs/scala-programming-guide.md +++ b/docs/scala-programming-guide.md @@ -31,7 +31,7 @@ In addition, if you wish to access an HDFS cluster, you need to add a dependency artifactId = hadoop-client version = <your-hdfs-version> -For other build systems, you can run `sbt/sbt assembly` to pack Spark and its dependencies into one JAR (`assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop*.jar`), then add this to your CLASSPATH. Set the HDFS version as described [here](index.html#a-note-about-hadoop-versions). +For other build systems, you can run `sbt assembly` to pack Spark and its dependencies into one JAR (`assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop*.jar`), then add this to your CLASSPATH. Set the HDFS version as described [here](index.html#a-note-about-hadoop-versions). Finally, you need to import some Spark classes and implicit conversions into your program. Add the following lines: diff --git a/make-distribution.sh b/make-distribution.sh index 32bbdb90a5bdad25237147e366cf02f1c6bb9613..a2c8e645971432f75251efa665c82ee7deb7c124 100755 --- a/make-distribution.sh +++ b/make-distribution.sh @@ -43,7 +43,13 @@ DISTDIR="$FWDIR/dist" # Get version from SBT export TERM=dumb # Prevents color codes in SBT output -VERSION=$($FWDIR/sbt/sbt "show version" | tail -1 | cut -f 2 | sed 's/^\([a-zA-Z0-9.-]*\).*/\1/') + +if ! which sbt >/dev/null 2>&1; then + echo "You need sbt installed and available on your PATH; please follow the instructions here: http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html" >&2 + exit 1 +fi + +VERSION=$(sbt "show version" | tail -1 | cut -f 2 | sed 's/^\([a-zA-Z0-9.-]*\).*/\1/') # Initialize defaults SPARK_HADOOP_VERSION=1.0.4 @@ -83,7 +89,9 @@ fi # Build fat JAR export SPARK_HADOOP_VERSION export SPARK_YARN -"$FWDIR/sbt/sbt" "assembly/assembly" +cd "$FWDIR" + +sbt "assembly/assembly" # Make directories rm -rf "$DISTDIR" diff --git a/pyspark b/pyspark index 12cc926ddafa588425f06f8c8da8bc9f64e0dc3d..1d003e2a008feca6c4044ffde8b688907e6ffff4 100755 --- a/pyspark +++ b/pyspark @@ -31,7 +31,7 @@ if [ ! -f "$FWDIR/RELEASE" ]; then ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*.jar >& /dev/null if [[ $? 
!= 0 ]]; then echo "Failed to find Spark assembly in $FWDIR/assembly/target" >&2 - echo "You need to build Spark with sbt/sbt assembly before running this program" >&2 + echo "You need to build Spark with sbt assembly before running this program" >&2 exit 1 fi fi diff --git a/run-example b/run-example index a78192d31d384194f757ee213e5f916652d9ea80..fbd81fe6f331fb70da03cc67236f2b4c3b5a6a72 100755 --- a/run-example +++ b/run-example @@ -55,7 +55,7 @@ if [ -e "$EXAMPLES_DIR"/target/spark-examples*[0-9Tg].jar ]; then fi if [[ -z $SPARK_EXAMPLES_JAR ]]; then echo "Failed to find Spark examples assembly in $FWDIR/examples/target" >&2 - echo "You need to build Spark with sbt/sbt assembly before running this program" >&2 + echo "You need to build Spark with sbt assembly before running this program" >&2 exit 1 fi diff --git a/sbt/sbt b/sbt/sbt deleted file mode 100755 index 5942280585ba6e94fe0e4b9d4978412a472444ef..0000000000000000000000000000000000000000 --- a/sbt/sbt +++ /dev/null @@ -1,43 +0,0 @@ -#!/usr/bin/env bash - -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -cygwin=false -case "`uname`" in - CYGWIN*) cygwin=true;; -esac - -EXTRA_ARGS="-Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m" -if [ "$MESOS_HOME" != "" ]; then - EXTRA_ARGS="$EXTRA_ARGS -Djava.library.path=$MESOS_HOME/lib/java" -fi - -export SPARK_HOME=$(cd "$(dirname $0)/.." 2>&1 >/dev/null ; pwd) -export SPARK_TESTING=1 # To put test classes on classpath - -SBT_JAR="$SPARK_HOME"/sbt/sbt-launch-*.jar -if $cygwin; then - SBT_JAR=`cygpath -w $SBT_JAR` - export SPARK_HOME=`cygpath -w $SPARK_HOME` - EXTRA_ARGS="$EXTRA_ARGS -Djline.terminal=jline.UnixTerminal -Dsbt.cygwin=true" - stty -icanon min 1 -echo > /dev/null 2>&1 - java $EXTRA_ARGS $SBT_OPTS -jar $SBT_JAR "$@" - stty icanon echo > /dev/null 2>&1 -else - java $EXTRA_ARGS $SBT_OPTS -jar $SBT_JAR "$@" -fi \ No newline at end of file diff --git a/sbt/sbt-launch-0.11.3-2.jar b/sbt/sbt-launch-0.11.3-2.jar deleted file mode 100644 index 23e5c3f31149bbf2bddbf1ae8d1fd02aba7910ad..0000000000000000000000000000000000000000 Binary files a/sbt/sbt-launch-0.11.3-2.jar and /dev/null differ diff --git a/sbt/sbt.cmd b/sbt/sbt.cmd deleted file mode 100644 index 681fe00f9210818221e94afcc11db5256ae54ac3..0000000000000000000000000000000000000000 --- a/sbt/sbt.cmd +++ /dev/null @@ -1,25 +0,0 @@ -@echo off - -rem -rem Licensed to the Apache Software Foundation (ASF) under one or more -rem contributor license agreements. See the NOTICE file distributed with -rem this work for additional information regarding copyright ownership. -rem The ASF licenses this file to You under the Apache License, Version 2.0 -rem (the "License"); you may not use this file except in compliance with -rem the License. 
You may obtain a copy of the License at -rem -rem http://www.apache.org/licenses/LICENSE-2.0 -rem -rem Unless required by applicable law or agreed to in writing, software -rem distributed under the License is distributed on an "AS IS" BASIS, -rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -rem See the License for the specific language governing permissions and -rem limitations under the License. -rem - -set EXTRA_ARGS= -if not "%MESOS_HOME%x"=="x" set EXTRA_ARGS=-Djava.library.path=%MESOS_HOME%\lib\java - -set SPARK_HOME=%~dp0.. - -java -Xmx1200M -XX:MaxPermSize=200m -XX:ReservedCodeCacheSize=256m %EXTRA_ARGS% -jar %SPARK_HOME%\sbt\sbt-launch-0.11.3-2.jar "%*" diff --git a/spark-class b/spark-class index 1858ea62476d9216aa5cf31563962408983ae279..254ddee04ae7bf50349ed676eb806df3ee825fb5 100755 --- a/spark-class +++ b/spark-class @@ -104,7 +104,7 @@ if [ ! -f "$FWDIR/RELEASE" ]; then jars_list=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar") if [ "$num_jars" -eq "0" ]; then echo "Failed to find Spark assembly in $FWDIR/assembly/target/scala-$SCALA_VERSION/" >&2 - echo "You need to build Spark with 'sbt/sbt assembly' before running this program." >&2 + echo "You need to build Spark with 'sbt assembly' before running this program." >&2 exit 1 fi if [ "$num_jars" -gt "1" ]; then
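Taken together, the changes above assume a system-wide sbt rather than the deleted `sbt/sbt` launcher. A minimal sketch of the resulting build-and-test flow, assuming sbt is already installed and on your PATH and that you run it from the Spark project root:

    # Check that sbt is available, as make-distribution.sh now does
    which sbt >/dev/null 2>&1 || {
        echo "sbt not found; see http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html" >&2
        exit 1
    }

    # Build Spark and its examples against a chosen Hadoop version
    SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt assembly

    # Run the test suites (SPARK_TESTING puts the test classes on the classpath)
    SPARK_HOME=$(pwd) SPARK_TESTING=1 sbt test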