diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 5034111ecbc016cc0a6353e14bf6dad36b4a9992..238ad26de0212f427d59eef1fcebc0c7744f7274 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -51,7 +51,7 @@
             <div class="navbar-inner">
                 <div class="container">
                     <div class="brand"><a href="index.html">
-                      <img src="img/spark-logo-77x50px-hd.png" /></a><span class="version">{{site.SPARK_VERSION_SHORT}}</span>
+                      <img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">{{site.SPARK_VERSION_SHORT}}</span>
                     </div>
                     <ul class="nav">
                         <!--TODO(andyk): Add class="active" attribute to li some how.-->
@@ -103,6 +103,7 @@
                             <li><a href="hadoop-third-party-distributions.html">Running with CDH/HDP</a></li>
                             <li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
                             <li><a href="job-scheduling.html">Job Scheduling</a></li>
+                            <li class="divider"></li>
                             <li><a href="building-with-maven.html">Building Spark with Maven</a></li>
                             <li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
                         </ul>
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index 9e781bbf1fa7c76bfbb3b6793d740a1186855e98..143f93171fde9db695e51999cc19629ec5e9cc7e 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -3,3 +3,68 @@ layout: global
 title: Cluster Mode Overview
 ---
 
+This document gives a short overview of how Spark runs on clusters, to make it easier to understand
+the components involved.
+
+# Components
+
+Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext
+object in your main program (called the _driver program_).
+Specifically, to run on a cluster, the SparkContext can connect to several types of _cluster managers_
+(either Spark's own standalone cluster manager or Mesos/YARN), which allocate resources across
+applications. Once connected, Spark acquires *executors* on nodes in the cluster, which are
+worker processes that run computations and store data for your application.
+Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to
+the executors. Finally, SparkContext sends *tasks* for the executors to run.
+
+<p style="text-align: center;">
+  <img src="img/cluster-overview.png" title="Spark cluster components" alt="Spark cluster components" />
+</p>
+
+There are several useful things to note about this architecture:
+
+1. Each application gets its own executor processes, which stay up for the duration of the whole
+   application and run tasks in multiple threads. This has the benefit of isolating applications
+   from each other, on both the scheduling side (each driver schedules its own tasks) and executor
+   side (tasks from different applications run in different JVMs). However, it also means that
+   data cannot be shared across different Spark applications (instances of SparkContext) without
+   writing it to an external storage system.
+2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor
+   processes, and these communicate with each other, it is relatively easy to run it even on a
+   cluster manager that also supports other applications (e.g. Mesos/YARN).
+3. Because the driver schedules tasks on the cluster, it should be run close to the worker
+   nodes, preferably on the same local area network.
+   If you'd like to send requests to the cluster remotely, it's better to open an RPC to the
+   driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
+
+# Cluster Manager Types
+
+The system currently supports three cluster managers:
+
+* [Standalone](spark-standalone.html) -- a simple cluster manager included with Spark that makes it
+  easy to set up a cluster.
+* [Apache Mesos](running-on-mesos.html) -- a general cluster manager that can also run Hadoop MapReduce
+  and service applications.
+* [Hadoop YARN](running-on-yarn.html) -- the resource manager in Hadoop 2.0.
+
+In addition, Spark's [EC2 launch scripts](ec2-scripts.html) make it easy to launch a standalone
+cluster on Amazon EC2.
+
+# Shipping Code to the Cluster
+
+The recommended way to ship your code to the cluster is to pass it through SparkContext's constructor,
+which takes a list of JAR files (Java/Scala) or .egg and .zip libraries (Python) to disseminate to
+worker nodes. You can also dynamically add new files to be sent to executors with `SparkContext.addJar`
+and `addFile`.
+
+# Monitoring
+
+Each driver program has a web UI, typically on port 3030, that displays information about running
+tasks, executors, and storage usage. Simply go to `http://<driver-node>:3030` in a web browser to
+access this UI. The [monitoring guide](monitoring.html) also describes other monitoring options.
+
+# Job Scheduling
+
+Spark gives control over resource allocation both _across_ applications (at the level of the cluster
+manager) and _within_ applications (if multiple computations are happening on the same SparkContext).
+The [job scheduling overview](job-scheduling.html) describes this in more detail.
diff --git a/docs/img/cluster-overview.png b/docs/img/cluster-overview.png
new file mode 100644
index 0000000000000000000000000000000000000000..2a1cf02fcfe9215a5059766755dc52440ab5dd92
Binary files /dev/null and b/docs/img/cluster-overview.png differ
diff --git a/docs/img/cluster-overview.pptx b/docs/img/cluster-overview.pptx
new file mode 100644
index 0000000000000000000000000000000000000000..2a61db352dd712d207bca1ac4a2da17cc4ea7497
Binary files /dev/null and b/docs/img/cluster-overview.pptx differ
diff --git a/docs/img/spark-logo-hd.png b/docs/img/spark-logo-hd.png
new file mode 100644
index 0000000000000000000000000000000000000000..1381e3004dca9fd2a5fada59fafa9a30843a2475
Binary files /dev/null and b/docs/img/spark-logo-hd.png differ
diff --git a/docs/index.md b/docs/index.md
index 1814cb19c8d72e47976352ba4055d1a86318749b..bd386a8a8fdb6579fc7c7641c7a4a753f394e279 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -48,11 +48,6 @@ options for deployment:
 * [Apache Mesos](running-on-mesos.html)
 * [Hadoop YARN](running-on-yarn.html)
 
-There is a script, `./make-distribution.sh`, which will create a binary distribution of Spark for deployment
-to any machine with only the Java runtime as a necessary dependency.
-Running the script creates a distribution directory in `dist/`, or the `-tgz` option to create a .tgz file.
-Check the script for additional options.
-
 # A Note About Hadoop Versions
 
 Spark uses the Hadoop-client library to talk to HDFS and other Hadoop-supported
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 0ec987107c7183f349c74d471e07dd5b0d00dacc..e9832e04669dcc3e79b593cf51585b78beede75c 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -3,19 +3,30 @@ layout: global
 title: Monitoring and Instrumentation
 ---
 
-There are several ways to monitor the progress of Spark jobs.
+There are several ways to monitor Spark applications.
 
 # Web Interfaces
 
-When a SparkContext is initialized, it launches a web server (by default at port 3030) which
-displays useful information. This includes a list of active and completed scheduler stages,
-a summary of RDD blocks and partitions, and environmental information. If multiple SparkContexts
-are running on the same host, they will bind to succesive ports beginning with 3030 (3031, 3032,
-etc).
-Spark's Standlone Mode scheduler also has its own
-[web interface](spark-standalone.html#monitoring-and-logging).
+Every SparkContext launches a web UI, by default on port 3030, that
+displays useful information about the application. This includes:
+
+* A list of scheduler stages and tasks
+* A summary of RDD sizes and memory usage
+* Information about the running executors
+* Environmental information
+
+You can access this interface by simply opening `http://<driver-node>:3030` in a web browser.
+If multiple SparkContexts are running on the same host, they will bind to successive ports
+beginning with 3030 (3031, 3032, etc).
+
+Spark's Standalone Mode cluster manager also has its own
+[web UI](spark-standalone.html#monitoring-and-logging).
+
+Note that in both of these UIs, the tables are sortable by clicking their headers,
+making it easy to identify slow tasks, data skew, etc.
+
+# Metrics
 
-# Spark Metrics
 Spark has a configurable metrics system based on the
 [Coda Hale Metrics Library](http://metrics.codahale.com/).
 This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV
@@ -35,6 +46,7 @@ The syntax of the metrics configuration file is defined in an example configurat
 `$SPARK_HOME/conf/metrics.conf.template`.
 
 # Advanced Instrumentation
+
 Several external tools can be used to help profile the performance of Spark jobs:
 
 * Cluster-wide monitoring tools, such as [Ganglia](http://ganglia.sourceforge.net/), can provide
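
To make the new "Shipping Code to the Cluster" section concrete, here is a minimal Scala sketch of the constructor-based approach it describes. The master URL, application name, Spark home, and JAR/file paths below are illustrative placeholders, not values taken from this patch:

```scala
import org.apache.spark.SparkContext

object ShippingCodeExample {
  def main(args: Array[String]) {
    // Connect to a cluster manager and list the JARs to disseminate to worker nodes.
    // All four arguments are placeholder values for this sketch.
    val sc = new SparkContext(
      "spark://master:7077",      // cluster manager to connect to (standalone master here)
      "Shipping Code Example",    // application name, shown in the web UI
      "/path/to/spark",           // Spark installation directory on the worker nodes
      Seq("target/my-app.jar"))   // application JARs sent to executors at startup

    // JARs and data files can also be added dynamically after the context is created.
    sc.addJar("lib/extra-dependency.jar")
    sc.addFile("data/lookup-table.txt")

    // ... run jobs against sc here ...

    sc.stop()
  }
}
```

Python applications follow the same constructor-first pattern, passing .egg and .zip libraries instead of JAR files.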