Commit 3d2b900b
authored 12 years ago by Shivaram Venkataraman

First cut at adding documentation for GC tuning

parent 97cbd699
docs/tuning.md: 63 additions, 5 deletions
@@ -67,13 +67,14 @@ object you will serialize.

Finally, if you don't register your classes, Kryo will still work, but it will have to store the
full class name with each object, which is wasteful.

# Memory Tuning

There are three considerations in tuning memory usage: the *amount* of memory used by your objects
(you likely want your entire dataset to fit in memory), the *cost* of accessing those objects, and the
overhead of *garbage collection* (if you have high turnover in terms of objects).

## Efficient Data Structures

By default, Java objects are fast to access, but can easily consume a factor of 2-5x more space
than the "raw" data inside their fields. This is due to several reasons:
@@ -119,10 +120,67 @@ need to trace through all your Java objects and find the unused ones. The main p

that *the cost of garbage collection is proportional to the number of Java objects*, so using data
structures with fewer objects (e.g. an array of `Int`s instead of a `LinkedList`) greatly reduces
this cost. An even better method is to persist objects in serialized form, as described above: now
there will be only *one* object (a byte array) per RDD partition. Before trying other advanced
techniques, the first thing to try if GC is a problem is to use serialized caching.
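For example, a minimal sketch of serialized caching (assuming the `spark.storage.StorageLevel` API of this era;
the master URL and RDD contents are placeholders):

```scala
import spark.SparkContext
import spark.storage.StorageLevel

object SerializedCachingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "SerializedCachingExample")

    val data = sc.parallelize(1 to 1000000)

    // MEMORY_ONLY_SER stores each cached partition as a single serialized
    // byte array instead of many deserialized objects, keeping the number
    // of long-lived Java objects (and hence GC cost) low.
    val cached = data.persist(StorageLevel.MEMORY_ONLY_SER)

    println(cached.count())
  }
}
```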
## Cache Size Tuning

One of the important configuration parameters passed to Spark is the amount of memory that should be used for
caching RDDs. By default, Spark uses 66% of the configured memory (`SPARK_MEM`) to cache RDDs. This means that
around 33% of memory is available for any objects created during task execution.

In case your tasks slow down and you find that your JVM is using almost all of its allocated memory, lowering
this value will help reduce memory consumption. To change this to, say, 50%, you can call
`System.setProperty("spark.storage.memoryFraction", "0.5")`. Combined with the use of serialized caching,
using a smaller cache should be sufficient to mitigate most garbage collection problems.
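As a rough sketch (the class name and master URL are placeholders), the property is set before the
`SparkContext` is created so that it takes effect for the job:

```scala
import spark.SparkContext

object CacheSizeTuningExample {
  def main(args: Array[String]): Unit = {
    // Lower the fraction of SPARK_MEM used for caching RDDs from the
    // default 66% to 50%, leaving more room for objects created during
    // task execution. Set this before constructing the SparkContext.
    System.setProperty("spark.storage.memoryFraction", "0.5")

    val sc = new SparkContext("local", "CacheSizeTuningExample")
    // ... run the job as usual ...
  }
}
```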
In case you are interested in further tuning the Java GC, continue reading below.
## GC Tuning

The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of
time spent on GC. This can be done by adding `-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps` to the
`SPARK_JAVA_OPTS` environment variable. The next time your Spark job is run, you will see messages printed on the
console whenever a JVM garbage collection takes place. Note that garbage collections that occur on an executor can be
found in the executor logs and not in the `spark-shell`.
Some basic information about memory management in the JVM:

* Java heap space is divided into two regions, Young and Old. The Young generation is meant to hold short-lived objects
  while the Old generation is intended for objects with longer lifetimes.
* The Young generation is further divided into three regions [Eden, Survivor1, Survivor2].
* A simplified description of the garbage collection procedure: When Eden is full, a minor GC is run on Eden and objects
  that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old
  enough or Survivor2 is full, it is moved to Old. Finally, when Old is close to full, a full GC is invoked.
The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation and that
the Young generation is sufficiently sized to store short-lived objects. This will help avoid full GCs that collect
temporary objects created during task execution. Some steps which may be useful are:
* Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times
  before a task completes, it means that there isn't enough memory available for executing tasks.
* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching.
  This can be done using the `spark.storage.memoryFraction` property. It is better to cache fewer objects than to slow
  down task execution!
* If there are too many minor collections but not too many major GCs, allocating more memory for Eden would help. You
  can approximate the size of Eden to be an over-estimate of how much memory each task will need. If the size of Eden
  is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling
  up by 4/3 is to account for space used by the survivor regions as well.)
* As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using
  the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the
  size of the block. So if we wish to have 3 or 4 tasks' worth of working space, and the HDFS block size is 64 MB,
  we can estimate the size of Eden to be `4*3*64MB` (see the sketch after this list).
* Monitor how the frequency and time taken by garbage collection changes with the new settings.
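To make the arithmetic in the HDFS example above concrete, here is a small sketch; the block size,
decompression factor, and number of tasks are the assumed figures from that example:

```scala
object EdenSizeEstimate {
  def main(args: Array[String]): Unit = {
    val hdfsBlockMB = 64         // HDFS block size read by each task
    val decompressionFactor = 3  // a decompressed block is ~2-3x the block size
    val concurrentTasks = 4      // tasks' worth of working space we want

    // Estimated Eden size: 4 * 3 * 64MB = 768MB
    val edenMB = concurrentTasks * decompressionFactor * hdfsBlockMB

    // Young generation ~= 4/3 * E, leaving room for the survivor regions,
    // i.e. -Xmn of roughly 1024m in this example.
    val youngGenMB = edenMB * 4 / 3

    println("Eden estimate: " + edenMB + "MB, Young generation (-Xmn): about " + youngGenMB + "m")
  }
}
```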
Our experience suggests that the effect of GC tuning depends on your application and the amount of memory available.
There are [many more tuning options](http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html) described online,
but at a high level, managing how frequently full GC takes place can help in reducing the overhead.
# Other Considerations
@@ -165,4 +223,4 @@ This has been a quick guide to point out the main concerns you should know about

Spark application -- most importantly, data serialization and memory tuning. For most programs,
switching to Kryo serialization and persisting data in serialized form will solve most common
performance issues. Feel free to ask on the
[Spark mailing list](http://groups.google.com/group/spark-users) about other tuning best practices.
\ No newline at end of file