diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index 143f93171fde9db695e51999cc19629ec5e9cc7e..cf6b48c05eb5fdf19ae078b085b817fde9928081 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -68,3 +68,50 @@ access this UI. The [monitoring guide](monitoring.html) also describes other mon
 Spark gives control over resource allocation both _across_ applications (at the level of the cluster
 manager) and _within_ applications (if multiple computations are happening on the same SparkContext).
 The [job scheduling overview](job-scheduling.html) describes this in more detail.
+
+# Glossary
+
+The following table summarizes terms you'll see used to refer to cluster concepts:
+
+<table class="table">
+  <thead>
+    <tr><th style="width: 130px;">Term</th><th>Meaning</th></tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Application</td>
+      <td>Any user program invoking Spark</td>
+    </tr>
+    <tr>
+      <td>Driver program</td>
+      <td>The process running the main() function of the application and creating the SparkContext</td>
+    </tr>
+    <tr>
+      <td>Cluster manager</td>
+      <td>An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)</td>
+    </tr>
+    <tr>
+      <td>Worker node</td>
+      <td>Any node that can run application code in the cluster</td>
+    </tr>
+    <tr>
+      <td>Executor</td>
+      <td>A process launched for an application on a worker node, that runs tasks and keeps data in memory
+        or disk storage across them. Each application has its own executors.</td>
+    </tr>
+    <tr>
+      <td>Task</td>
+      <td>A unit of work that will be sent to one executor</td>
+    </tr>
+    <tr>
+      <td>Job</td>
+      <td>A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action
+        (e.g. <code>save</code>, <code>collect</code>); you'll see this term used in the driver's logs.</td>
+    </tr>
+    <tr>
+      <td>Stage</td>
+      <td>Each job gets divided into smaller sets of tasks called <em>stages</em> that depend on each other
+        (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.</td>
+    </tr>
+  </tbody>
+</table>
diff --git a/docs/job-scheduling.md b/docs/job-scheduling.md
index 11b733137d5ef6d0c1f642e61fc72a8bff117f79..d304c5497bdb35ef0a48294f79fa1c3c000d2f79 100644
--- a/docs/job-scheduling.md
+++ b/docs/job-scheduling.md
@@ -25,7 +25,7 @@ different options to manage allocation, depending on the cluster manager.
 
 The simplest option, available on all cluster managers, is _static partitioning_ of resources. With
 this approach, each application is given a maximum amount of resources it can use, and holds onto them
-for its whole duration. This is the only approach available in Spark's [standalone](spark-standalone.html)
+for its whole duration. This is the approach used in Spark's [standalone](spark-standalone.html)
 and [YARN](running-on-yarn.html) modes, as well as the
 [coarse-grained Mesos mode](running-on-mesos.html#mesos-run-modes).
 Resource allocation can be configured as follows, based on the cluster type:
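
For readers new to the terms in the glossary hunk above, the following minimal sketch (not part of the patch; the app name and input path are placeholders) ties them together: the driver program runs `main()` and creates the SparkContext, and the `collect()` action spawns a job that the scheduler splits into stages of tasks, which run on the executors launched for this application on the cluster's worker nodes.

```scala
// Illustrative sketch only -- maps code to the glossary terms above.
// The input path and application name are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {                                  // the "application"
  def main(args: Array[String]): Unit = {           // run by the "driver program"
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)                 // driver creates the SparkContext

    // Transformations are lazy; the collect() action below spawns a "job",
    // which is divided into "stages" whose "tasks" are sent to the
    // "executors" launched for this application on the "worker nodes".
    val counts = sc.textFile("hdfs://path/to/input.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```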
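Similarly, the job-scheduling.md hunk describes static partitioning, where each application declares an upper bound on resources and holds them for its whole duration. As an illustrative sketch (the values are placeholders, not defaults, and this assumes the standalone or coarse-grained Mesos mode), such caps are typically set through configuration properties like `spark.cores.max` and `spark.executor.memory`:

```scala
// Illustrative sketch of static partitioning; the values below are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object StaticAllocationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StaticAllocationExample")
      .set("spark.cores.max", "8")        // cap on total cores (standalone, coarse-grained Mesos)
      .set("spark.executor.memory", "4g") // memory requested per executor

    // Resources are acquired up front and held until the context is stopped.
    val sc = new SparkContext(conf)
    // ... run jobs against sc ...
    sc.stop()
  }
}
```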