Commit 23762efd, authored 11 years ago by Matei Zaharia
New hardware provisioning doc, and updates to menus
Parent: 1b0f69c6
Showing 2 changed files with 83 additions and 9 deletions:

  docs/_layouts/global.html      +13  -9
  docs/hardware-provisioning.md  +70  -0
docs/_layouts/global.html  (+13, -9)

@@ -61,20 +61,24 @@
   <a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
   <ul class="dropdown-menu">
     <li><a href="quick-start.html">Quick Start</a></li>
-    <li><a href="scala-programming-guide.html">Scala</a></li>
-    <li><a href="java-programming-guide.html">Java</a></li>
-    <li><a href="python-programming-guide.html">Python</a></li>
+    <li><a href="scala-programming-guide.html">Spark in Scala</a></li>
+    <li><a href="java-programming-guide.html">Spark in Java</a></li>
+    <li><a href="python-programming-guide.html">Spark in Python</a></li>
+    <li class="divider"></li>
     <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
+    <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
   </ul>
 </li>

 <li class="dropdown">
   <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
   <ul class="dropdown-menu">
-    <li><a href="api/core/index.html">Spark Java/Scala (Scaladoc)</a></li>
-    <li><a href="api/pyspark/index.html">Spark Python (Epydoc)</a></li>
-    <li><a href="api/streaming/index.html">Spark Streaming Java/Scala (Scaladoc)</a></li>
-    <li><a href="api/mllib/index.html">Spark ML Library (Scaladoc)</a></li>
+    <li><a href="api/core/index.html">Spark Core for Java/Scala</a></li>
+    <li><a href="api/pyspark/index.html">Spark Core for Python</a></li>
+    <li class="divider"></li>
+    <li><a href="api/streaming/index.html">Spark Streaming</a></li>
+    <li><a href="api/bagel/index.html">Bagel (Pregel on Spark)</a></li>
+    <li><a href="api/mllib/index.html">MLlib (Machine Learning)</a></li>
   </ul>
 </li>
@@ -91,10 +95,10 @@
 <li class="dropdown">
   <a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
   <ul class="dropdown-menu">
-    <li><a href="building-with-maven.html">Building Spark with Maven</a></li>
     <li><a href="configuration.html">Configuration</a></li>
     <li><a href="tuning.html">Tuning Guide</a></li>
-    <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
+    <li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
+    <li><a href="building-with-maven.html">Building Spark with Maven</a></li>
     <li><a href="contributing-to-spark.html">Contributing to Spark</a></li>
   </ul>
 </li>
docs/hardware-provisioning.md  (new file, mode 100644)  (+70, -0)
---
layout: global
title: Hardware Provisioning
---

A common question received by Spark developers is how to configure hardware for it. While the right
hardware will depend on the situation, we make the following recommendations.
# Storage Systems

Because most Spark jobs will likely have to read input data from an external storage system (e.g.
the Hadoop File System, or HBase), it is important to place Spark **as close to this system as
possible**. We recommend the following:

* If at all possible, run Spark on the same nodes as HDFS. The simplest way is to set up a Spark
  [standalone mode cluster](spark-standalone.html) on the same nodes, and configure Spark and
  Hadoop's memory and CPU usage to avoid interference (for Hadoop, the relevant options are
  `mapred.child.java.opts` for the per-task memory and `mapred.tasktracker.map.tasks.maximum`
  and `mapred.tasktracker.reduce.tasks.maximum` for the number of tasks). Alternatively, you can
  run Hadoop and Spark on a common cluster manager like [Mesos](running-on-mesos.html) or
  [Hadoop YARN](running-on-yarn.html). A brief sketch of reading HDFS data from a co-located
  standalone cluster follows this list.

* If this is not possible, run Spark on different nodes in the same local-area network as HDFS.
  If your cluster spans multiple racks, include some Spark nodes on each rack.

* For low-latency data stores like HBase, it may be preferable to run computing jobs on different
  nodes than the storage system to avoid interference.
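
For example (a sketch only; the host names and input path below are placeholders), you can launch
the Spark shell against a standalone master running alongside the HDFS NameNode and read HDFS
data directly:

```scala
// Launch the shell against the co-located standalone cluster (hypothetical hosts):
//   MASTER=spark://<master-host>:7077 ./spark-shell
// The shell creates the SparkContext `sc` for you.
val data = sc.textFile("hdfs://<namenode-host>:9000/data/input")
data.count()  // tasks are scheduled near the HDFS blocks they read
```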
# Local Disks

While Spark can perform a lot of its computation in memory, it still uses local disks to store
data that doesn't fit in RAM, as well as to preserve intermediate output between stages. We
recommend having **4-8 disks** per node, configured _without_ RAID (just as separate mount points).
In Linux, mount the disks with the
[`noatime` option](http://www.centos.org/docs/5/html/Global_File_System/s2-manage-mountnoatime.html)
to reduce unnecessary writes. In Spark, [configure](configuration.html) the `spark.local.dir`
variable to be a comma-separated list of the local disks. If you are running HDFS, it's fine to
use the same disks as HDFS.
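
For instance, in a standalone program this could be set as a Java system property before the
SparkContext is created (a sketch: the mount points, master URL, and import path are assumptions
that should be adapted to your cluster and Spark version):

```scala
import spark.SparkContext  // adjust the package name to match your Spark version

object LocalDirExample {
  def main(args: Array[String]) {
    // One scratch directory per local disk, separated by commas (hypothetical mount points).
    System.setProperty("spark.local.dir",
      "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark,/mnt/disk4/spark")
    val sc = new SparkContext("spark://<master-host>:7077", "LocalDirExample")
    // ... shuffle files and data spilled from memory now go to the directories above
  }
}
```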
# Memory

In general, Spark can run well with anywhere from **8 GB to hundreds of gigabytes** of memory per
machine. In all cases, we recommend allocating at most 75% of the memory for Spark; leave the
rest for the operating system and buffer cache.

How much memory you will need depends on your application. To determine how much your
application uses for a certain dataset size, load part of your dataset in a Spark RDD and use the
Storage tab of Spark's monitoring UI (`http://<driver-node>:3030`) to see its size in memory.
Note that memory usage is greatly affected by storage level and serialization format -- see
the [tuning guide](tuning.html) for tips on how to reduce it.
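
For example, from the Spark shell (where `sc` is already created for you), cache one part of the
dataset and then open the Storage tab; the HDFS path below is a placeholder:

```scala
// Cache a slice of the dataset and force it to be computed, then check the
// Storage tab at http://<driver-node>:3030 for its in-memory size.
val sample = sc.textFile("hdfs://<namenode-host>:9000/data/part-00000").cache()
sample.count()  // the count materializes the RDD so it is stored in memory
```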
Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you
purchase machines with more RAM than this, you can run _multiple worker JVMs per node_. In
Spark's [standalone mode](spark-standalone.html), you can set the number of workers per node
with the `SPARK_WORKER_INSTANCES` variable in `conf/spark-env.sh`, and the number of cores
per worker with `SPARK_WORKER_CORES`.
# Network

In our experience, when the data is in memory, a lot of Spark applications are network-bound.
Using a **10 Gigabit** or higher network is the best way to make these applications faster.
This is especially true for "distributed reduce" applications such as group-bys, reduce-bys, and
SQL joins. In any given application, you can see how much data Spark shuffles across the network
from the application's monitoring UI (`http://<driver-node>:3030`).
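
As a rough illustration (the input path is a placeholder), a reduce-by run from the Spark shell,
like the word count below, shuffles its intermediate pairs across the network, and the shuffled
bytes show up in the monitoring UI:

```scala
// A "distributed reduce": intermediate (key, count) pairs are shuffled across
// the network before being combined per key.
val pairs = sc.textFile("hdfs://<namenode-host>:9000/logs")
              .map(line => (line.split(" ")(0), 1))
val counts = pairs.reduceByKey(_ + _)
counts.count()
```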
# CPU Cores

Spark scales well to tens of CPU cores per machine because it performs minimal sharing between
threads. You should likely provision at least **8-16 cores** per machine. Depending on the CPU
cost of your workload, you may also need more: once data is in memory, most applications are
either CPU- or network-bound.