- Nov 04, 2013
-
tgravescs authored
Allow Spark on YARN to be run from HDFS. Allows the spark.jar, app.jar, and log4j.properties to be placed in HDFS.
-
- Nov 03, 2013
-
Reynold Xin authored
Fast, memory-efficient hash set and hash table implementations optimized for primitive data types.

This pull request adds two hash table implementations optimized for primitive data types. For primitive types, the new hash tables are much faster than the current Spark AppendOnlyMap (3X faster; note that the current AppendOnlyMap is already much better than the Java map) while using much less space (1/4 of the space).

Details: This PR first adds an open hash set implementation (OpenHashSet) optimized for primitive types (using Scala's specialization feature). This OpenHashSet is designed to serve as a building block for more advanced structures. It is currently used to build the following two hash tables, but can be used in the future to build multi-valued hash tables as well (GraphX has this use case). Note that there are some peculiarities in the code for working around some Scala compiler bugs.

Building on top of OpenHashSet, this PR adds two hash table implementations:
1. OpenHashMap: for nullable keys, with optional specialization for primitive values
2. PrimitiveKeyOpenHashMap: for non-nullable primitive keys, with optional specialization for primitive values

I tested the update speed of these two implementations using the changeValue function (which is what Aggregator and cogroup would use). Runtime relative to AppendOnlyMap for inserting 10 million items:
- Int to Int: ~30%
- java.lang.Integer to java.lang.Integer: ~100%
- Int to java.lang.Integer: ~50%
- java.lang.Integer to Int: ~85%
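The core idea behind an open hash set for primitives — flat arrays, power-of-two capacity, and linear probing instead of per-entry objects — can be sketched as follows. This is a minimal illustration of the technique the commit describes, not Spark's actual OpenHashSet code; all names here are illustrative.

```java
// Minimal open-addressing hash set for primitive ints (illustrative sketch,
// not Spark's implementation). Flat arrays and linear probing avoid the
// per-entry wrapper objects that make java.util.HashSet<Integer> expensive.
public class IntOpenHashSet {
    private int[] data;
    private boolean[] occupied;
    private int mask;
    private int size = 0;

    public IntOpenHashSet(int capacity) {
        int cap = Integer.highestOneBit(Math.max(capacity - 1, 1)) << 1; // next power of two
        data = new int[cap];
        occupied = new boolean[cap];
        mask = cap - 1;
    }

    public void add(int k) {
        if (size * 2 >= data.length) grow();        // keep load factor under 0.5
        int pos = hash(k) & mask;
        while (occupied[pos]) {
            if (data[pos] == k) return;             // already present
            pos = (pos + 1) & mask;                 // linear probing
        }
        data[pos] = k;
        occupied[pos] = true;
        size++;
    }

    public boolean contains(int k) {
        int pos = hash(k) & mask;
        while (occupied[pos]) {
            if (data[pos] == k) return true;
            pos = (pos + 1) & mask;
        }
        return false;
    }

    public int size() { return size; }

    private static int hash(int k) {
        // Murmur-style finalizer so consecutive keys don't all collide.
        k ^= k >>> 16;
        k *= 0x85ebca6b;
        k ^= k >>> 13;
        return k;
    }

    private void grow() {
        int[] oldData = data;
        boolean[] oldOcc = occupied;
        data = new int[oldData.length * 2];
        occupied = new boolean[oldData.length * 2];
        mask = data.length - 1;
        size = 0;
        for (int i = 0; i < oldData.length; i++) {
            if (oldOcc[i]) add(oldData[i]);         // rehash into the larger table
        }
    }
}
```

A map (the OpenHashMap / PrimitiveKeyOpenHashMap layer) can be built on the same structure by adding a parallel values array indexed by the same probe positions.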
-
Reynold Xin authored
-
Reynold Xin authored
Also addressed Matei's code review comment.
-
Reynold Xin authored
-
- Nov 02, 2013
-
Reynold Xin authored
Update default GitHub
-
Reynold Xin authored
Fixed a typo in Hadoop version in README.
-
Reynold Xin authored
-
- Nov 01, 2013
-
Fabrizio (Misto) Milo authored
-
Reynold Xin authored
Fix persistent-hdfs
-
Fabrizio (Misto) Milo authored
-
Matei Zaharia authored
Document & finish support for local: URIs. Documents all the supported URI schemes for addJar / addFile on the Cluster Overview page, and adds support for the local: URI scheme to addFile.
-
Evan Chan authored
-
Evan Chan authored
-
- Oct 30, 2013
-
Matei Zaharia authored
Handle ConcurrentModificationExceptions in SparkContext init. System.getProperties.toMap will fail fast when concurrently modified, and it seems that some other thread started by SparkContext calls System.setProperty during its initialization. Handle this by simply looping on ConcurrentModificationException, which seems safest, since the non-fail-fast methods (Hashtable.entrySet) have undefined behavior under concurrent modification.
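The retry pattern this fix describes can be sketched as follows. This is a hedged illustration of the approach (not Spark's exact code): keep retrying the copy until no concurrent System.setProperty call interrupts it.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Sketch of the retry-on-CME pattern described above (illustrative, not
// Spark's exact code). Copying the system properties can throw
// ConcurrentModificationException if another thread calls System.setProperty
// mid-iteration, so we simply loop until a consistent snapshot succeeds.
public class PropsSnapshot {
    public static Map<String, String> snapshot() {
        while (true) {
            try {
                Map<String, String> copy = new HashMap<>();
                for (String name : System.getProperties().stringPropertyNames()) {
                    copy.put(name, System.getProperty(name));
                }
                return copy;
            } catch (ConcurrentModificationException e) {
                // Another thread modified the properties mid-copy; retry.
            }
        }
    }
}
```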
-
Matei Zaharia authored
Fixed incorrect log message in local scheduler. This change is especially relevant at the moment, because some users are seeing this failure, and the log message is misleading/incorrect (for the tests, the maximum number of failures is set to 0, not 4).
-
Matei Zaharia authored
Pull SparkHadoopUtil out of SparkEnv (JIRA SPARK-886). Having the logic to initialize the correct SparkHadoopUtil in SparkEnv prevents it from being used until after the SparkContext is initialized. This causes issues like https://spark-project.atlassian.net/browse/SPARK-886. It also makes it hard to use in singleton objects; for instance, I want to use it in the security code.
-
Kay Ousterhout authored
-
Matei Zaharia authored
Add support for local:// URI scheme for addJars(). This PR adds support for a new URI scheme for SparkContext.addJars(): `local://file/path`. The *local* scheme indicates that the `/file/path` exists on every worker node. The reason for its existence is big library JARs, which would be really expensive to serve using the standard HTTP fileserver distribution method, especially for big clusters. Today the only inexpensive method of doing this (assuming such a file is on every host via, say, NFS or rsync) is to add the JAR to the SPARK_CLASSPATH, but we want a method where the user does not need to modify the Spark configuration.

I would add something to the docs, but it's not obvious where. Oh, and it would be great if this could be merged in time for 0.8.1.
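Dispatching on the URI scheme, as this PR describes, might look roughly like the sketch below. The class and return values here are hypothetical illustrations, not Spark's actual internals: the point is only that `local:` paths are assumed to already exist on every worker, so no distribution step is needed.

```java
import java.net.URI;

// Illustrative sketch (hypothetical names, not Spark's internals) of
// dispatching on the URI scheme for an addJars-style call. A "local:" jar
// is assumed to be present on every worker node already, so it is used
// as-is; other schemes need some distribution or fetch mechanism.
public class JarPathResolver {
    public static String resolve(String path) {
        String scheme = URI.create(path).getScheme();
        if (scheme == null) {
            return "serve-over-http";      // bare path: serve via the driver's file server
        }
        switch (scheme) {
            case "local":
                return "use-as-is";        // already on every worker; no copying needed
            case "file":
                return "serve-over-http";  // distribute from the driver
            default:
                return "fetch-remotely";   // e.g. hdfs:, http:
        }
    }
}
```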
-
Stephen Haberman authored
-
tgravescs authored
-
Evan Chan authored
This indicates that a jar is available locally on each worker node.
-
tgravescs authored
-
Matei Zaharia authored
Reduce the memory footprint of BlockInfo objects This pull request reduces the memory footprint of all BlockInfo objects and makes additional optimizations for shuffle blocks. For all BlockInfo objects, these changes remove two boolean fields and one Object field. For shuffle blocks, we additionally remove an Object field and a boolean field. When storing tens of thousands of these objects, this may add up to significant memory savings. A ShuffleBlockInfo now only needs to wrap a single long. This was motivated by a [report of high blockInfo memory usage during shuffles](https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C20131026134353.202b2b9b%40sh9%3E). I haven't run benchmarks to measure the exact memory savings. /cc @aarondav
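The "wrap a single long" idea from this PR — collapsing several small identifier fields into one primitive — can be sketched with bit packing. The field widths below are illustrative assumptions, not Spark's actual ShuffleBlockInfo layout.

```java
// Sketch of packing several small fields into one long, the kind of trick
// used above to shrink per-block bookkeeping. The 24/20/20-bit layout is an
// illustrative assumption, not Spark's actual encoding.
public class PackedShuffleBlock {
    // Layout: | shuffleId: 24 bits | mapId: 20 bits | reduceId: 20 bits |
    public static long pack(int shuffleId, int mapId, int reduceId) {
        return ((long) (shuffleId & 0xFFFFFF) << 40)
             | ((long) (mapId & 0xFFFFF) << 20)
             | (long) (reduceId & 0xFFFFF);
    }

    public static int shuffleId(long packed) { return (int) (packed >>> 40) & 0xFFFFFF; }
    public static int mapId(long packed)     { return (int) (packed >>> 20) & 0xFFFFF; }
    public static int reduceId(long packed)  { return (int) packed & 0xFFFFF; }
}
```

Storing one long per block instead of an object with several boolean and reference fields avoids object headers and pointer overhead, which is where the savings come from when tens of thousands of blocks are tracked.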
-
- Oct 29, 2013
-
tgravescs authored
-
Josh Rosen authored
This saves space, since the inner classes needed to keep a reference to the enclosing BlockManager.
-
Stephen Haberman authored
-
Josh Rosen authored
-
tgravescs authored
-
Reynold Xin authored
A small revision to the document
-
- Oct 28, 2013
-
soulmachine authored
-
Josh Rosen authored
-
- Oct 27, 2013
-
Matei Zaharia authored
Display both task ID and task attempt ID in the UI, and rename taskId to taskAttemptId. Previously only the task attempt ID was shown in the UI; this was confusing because a job can be shown as complete while tasks are still running. Showing the task ID in addition to the attempt ID makes it clear which tasks are redundant. This commit also renames taskId to taskAttemptId in TaskInfo and in the local/cluster schedulers. This identifier was used to uniquely identify attempts, not tasks, so the previous naming was confusing. The new naming is also more consistent with MapReduce.
-
Reynold Xin authored
Eliminate extra memory usage when shuffle file consolidation is disabled. Otherwise, we see SPARK-946 even when shuffle file consolidation is disabled. Fixing SPARK-946 is still forthcoming.
-
Stephen Haberman authored
System.getProperties.toMap will fail fast when concurrently modified, and it seems that some other thread started by SparkContext calls System.setProperty during its initialization. Handle this by simply looping on ConcurrentModificationException, which seems safest, since the non-fail-fast methods (Hashtable.entrySet) have undefined behavior under concurrent modification.
-
Aaron Davidson authored
-
Aaron Davidson authored
Otherwise, we see SPARK-946 even when shuffle file consolidation is disabled. Fixing SPARK-946 is still forthcoming.
-
Kay Ousterhout authored
-
- Oct 26, 2013
-
Patrick Wendell authored
Improve error message when multiple assembly jars are present. This can easily happen when building for different Hadoop versions. Right now it gives a ClassNotFoundException.
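A check of the kind this commit describes could be sketched as below. The class name, jar-name pattern, and error text are illustrative assumptions, not the actual Spark launcher code; the point is failing early with a clear message rather than a confusing ClassNotFoundException later.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the check described above (names and messages are
// illustrative): scan a directory for assembly jars and fail with a clear
// error when more than one is found, e.g. builds for different Hadoop
// versions left side by side.
public class AssemblyJarCheck {
    public static File findSingleAssemblyJar(File dir) {
        List<File> matches = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                String name = f.getName();
                if (name.startsWith("spark-assembly") && name.endsWith(".jar")) {
                    matches.add(f);
                }
            }
        }
        if (matches.isEmpty()) {
            throw new IllegalStateException("No assembly jar found in " + dir);
        }
        if (matches.size() > 1) {
            throw new IllegalStateException(
                "Found multiple assembly jars in " + dir + ": " + matches
                + ". Remove the ones built for other Hadoop versions.");
        }
        return matches.get(0);
    }
}
```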
-
Reynold Xin authored
A small revision to the document
-