- Nov 04, 2013
-
tgravescs authored
Allow Spark on YARN to be run from HDFS. Allows the spark.jar, app.jar, and log4j.properties to be placed in HDFS.
-
- Nov 03, 2013
-
Reynold Xin authored
Fast, memory-efficient hash set and hash table implementations optimized for primitive data types.

This pull request adds two hash table implementations optimized for primitive data types. For primitive types, the new hash tables are much faster than the current Spark AppendOnlyMap (3X faster; note that the current AppendOnlyMap is already much better than the Java map) while using much less space (1/4 of the space).

Details: This PR first adds an open hash set implementation (OpenHashSet) optimized for primitive types (using Scala's specialization feature). This OpenHashSet is designed to serve as a building block for more advanced structures. It is currently used to build the following two hash tables, but can be used in the future to build multi-valued hash tables as well (GraphX has this use case). Note that there are some peculiarities in the code for working around some Scala compiler bugs.

Building on top of OpenHashSet, this PR adds two hash table implementations:
1. OpenHashMap: for nullable keys, with optional specialization for primitive values
2. PrimitiveKeyOpenHashMap: for non-nullable primitive keys, with optional specialization for primitive values

I tested the update speed of these two implementations using the changeValue function (which is what Aggregator and cogroup would use). Runtime relative to AppendOnlyMap for inserting 10 million items:
- Int to Int: ~30%
- java.lang.Integer to java.lang.Integer: ~100%
- Int to java.lang.Integer: ~50%
- java.lang.Integer to Int: ~85%
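The core idea behind an open hash set for primitives — flat arrays, power-of-two capacity, and linear probing instead of per-entry objects — can be sketched as follows. This is a minimal illustration of the technique the commit describes, not Spark's actual OpenHashSet code; all names here are illustrative.

```java
// Minimal open-addressing hash set for primitive ints (illustrative sketch,
// not Spark's implementation). Flat arrays and linear probing avoid the
// per-entry wrapper objects that make java.util.HashSet<Integer> expensive.
public class IntOpenHashSet {
    private int[] data;
    private boolean[] occupied;
    private int mask;
    private int size = 0;

    public IntOpenHashSet(int capacity) {
        int cap = Integer.highestOneBit(Math.max(capacity - 1, 1)) << 1; // next power of two
        data = new int[cap];
        occupied = new boolean[cap];
        mask = cap - 1;
    }

    public void add(int k) {
        if (size * 2 >= data.length) grow();        // keep load factor under 0.5
        int pos = hash(k) & mask;
        while (occupied[pos]) {
            if (data[pos] == k) return;             // already present
            pos = (pos + 1) & mask;                 // linear probing
        }
        data[pos] = k;
        occupied[pos] = true;
        size++;
    }

    public boolean contains(int k) {
        int pos = hash(k) & mask;
        while (occupied[pos]) {
            if (data[pos] == k) return true;
            pos = (pos + 1) & mask;
        }
        return false;
    }

    public int size() { return size; }

    private static int hash(int k) {
        // Murmur-style finalizer so consecutive keys don't all collide.
        k ^= k >>> 16;
        k *= 0x85ebca6b;
        k ^= k >>> 13;
        return k;
    }

    private void grow() {
        int[] oldData = data;
        boolean[] oldOcc = occupied;
        data = new int[oldData.length * 2];
        occupied = new boolean[oldData.length * 2];
        mask = data.length - 1;
        size = 0;
        for (int i = 0; i < oldData.length; i++) {
            if (oldOcc[i]) add(oldData[i]);         // rehash into the larger table
        }
    }
}
```

A map (the OpenHashMap / PrimitiveKeyOpenHashMap layer) can be built on the same structure by adding a parallel values array indexed by the same probe positions.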
-
Reynold Xin authored
-
Reynold Xin authored
Also addressed Matei's code review comment.
-
Reynold Xin authored
-
- Nov 02, 2013
-
Reynold Xin authored
Update default GitHub
-
Reynold Xin authored
Fixed a typo in Hadoop version in README.
-
Reynold Xin authored
-
- Nov 01, 2013
-
Fabrizio (Misto) Milo authored
-
Reynold Xin authored
Fix persistent-hdfs
-
Fabrizio (Misto) Milo authored
-
Matei Zaharia authored
Document & finish support for local: URIs. Documents all the supported URI schemes for addJar / addFile on the Cluster Overview page, and adds support for the local: URI scheme to addFile.
-
Evan Chan authored
-
Evan Chan authored
-
- Oct 30, 2013
-
Matei Zaharia authored
Handle ConcurrentModificationExceptions in SparkContext init. System.getProperties.toMap will fail fast when concurrently modified, and it seems that some other thread started by SparkContext calls System.setProperty during its initialization. Handle this by simply looping on ConcurrentModificationException, which seems safest, since the non-fail-fast methods (Hashtable.entrySet) have undefined behavior under concurrent modification.
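The retry pattern this fix describes can be sketched as follows. This is a hedged illustration of the approach (not Spark's exact code): keep retrying the copy until no concurrent System.setProperty call interrupts it.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Sketch of the retry-on-CME pattern described above (illustrative, not
// Spark's exact code). Copying the system properties can throw
// ConcurrentModificationException if another thread calls System.setProperty
// mid-iteration, so we simply loop until a consistent snapshot succeeds.
public class PropsSnapshot {
    public static Map<String, String> snapshot() {
        while (true) {
            try {
                Map<String, String> copy = new HashMap<>();
                for (String name : System.getProperties().stringPropertyNames()) {
                    copy.put(name, System.getProperty(name));
                }
                return copy;
            } catch (ConcurrentModificationException e) {
                // Another thread modified the properties mid-copy; retry.
            }
        }
    }
}
```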
-
Matei Zaharia authored
Fixed incorrect log message in local scheduler. This change is especially relevant at the moment, because some users are seeing this failure, and the log message is misleading/incorrect (for the tests, the maximum number of failures is set to 0, not 4).
-
Matei Zaharia authored
Pull SparkHadoopUtil out of SparkEnv (JIRA SPARK-886). Having the logic to initialize the correct SparkHadoopUtil in SparkEnv prevents it from being used until after the SparkContext is initialized. This causes issues like https://spark-project.atlassian.net/browse/SPARK-886. It also makes it hard to use in singleton objects; for instance, I want to use it in the security code.
-
Kay Ousterhout authored
-
Matei Zaharia authored
Add support for local:// URI scheme for addJars(). This PR adds support for a new URI scheme for SparkContext.addJars(): `local://file/path`. The *local* scheme indicates that the `/file/path` exists on every worker node. The reason for its existence is big library JARs, which would be really expensive to serve using the standard HTTP fileserver distribution method, especially for big clusters. Today the only inexpensive method of doing this (assuming such a file is on every host via, say, NFS or rsync) is to add the JAR to the SPARK_CLASSPATH, but we want a method where the user does not need to modify the Spark configuration.

I would add something to the docs, but it's not obvious where. Oh, and it would be great if this could be merged in time for 0.8.1.
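Dispatching on the URI scheme, as this PR describes, might look roughly like the sketch below. The class and return values here are hypothetical illustrations, not Spark's actual internals: the point is only that `local:` paths are assumed to already exist on every worker, so no distribution step is needed.

```java
import java.net.URI;

// Illustrative sketch (hypothetical names, not Spark's internals) of
// dispatching on the URI scheme for an addJars-style call. A "local:" jar
// is assumed to be present on every worker node already, so it is used
// as-is; other schemes need some distribution or fetch mechanism.
public class JarPathResolver {
    public static String resolve(String path) {
        String scheme = URI.create(path).getScheme();
        if (scheme == null) {
            return "serve-over-http";      // bare path: serve via the driver's file server
        }
        switch (scheme) {
            case "local":
                return "use-as-is";        // already on every worker; no copying needed
            case "file":
                return "serve-over-http";  // distribute from the driver
            default:
                return "fetch-remotely";   // e.g. hdfs:, http:
        }
    }
}
```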
-
Stephen Haberman authored
-
tgravescs authored
-
Evan Chan authored
This indicates that a jar is available locally on each worker node.
-
tgravescs authored
-
Matei Zaharia authored
Reduce the memory footprint of BlockInfo objects This pull request reduces the memory footprint of all BlockInfo objects and makes additional optimizations for shuffle blocks. For all BlockInfo objects, these changes remove two boolean fields and one Object field. For shuffle blocks, we additionally remove an Object field and a boolean field. When storing tens of thousands of these objects, this may add up to significant memory savings. A ShuffleBlockInfo now only needs to wrap a single long. This was motivated by a [report of high blockInfo memory usage during shuffles](https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C20131026134353.202b2b9b%40sh9%3E). I haven't run benchmarks to measure the exact memory savings. /cc @aarondav
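The "wrap a single long" idea from this PR — collapsing several small identifier fields into one primitive — can be sketched with bit packing. The field widths below are illustrative assumptions, not Spark's actual ShuffleBlockInfo layout.

```java
// Sketch of packing several small fields into one long, the kind of trick
// used above to shrink per-block bookkeeping. The 24/20/20-bit layout is an
// illustrative assumption, not Spark's actual encoding.
public class PackedShuffleBlock {
    // Layout: | shuffleId: 24 bits | mapId: 20 bits | reduceId: 20 bits |
    public static long pack(int shuffleId, int mapId, int reduceId) {
        return ((long) (shuffleId & 0xFFFFFF) << 40)
             | ((long) (mapId & 0xFFFFF) << 20)
             | (long) (reduceId & 0xFFFFF);
    }

    public static int shuffleId(long packed) { return (int) (packed >>> 40) & 0xFFFFFF; }
    public static int mapId(long packed)     { return (int) (packed >>> 20) & 0xFFFFF; }
    public static int reduceId(long packed)  { return (int) packed & 0xFFFFF; }
}
```

Storing one long per block instead of an object with several boolean and reference fields avoids object headers and pointer overhead, which is where the savings come from when tens of thousands of blocks are tracked.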
-
- Oct 29, 2013
-
tgravescs authored
-
Josh Rosen authored
This saves space, since the inner classes needed to keep a reference to the enclosing BlockManager.
-
Stephen Haberman authored
-
Josh Rosen authored
-
tgravescs authored
-
Reynold Xin authored
A small revision to the document
-
- Oct 28, 2013
-
soulmachine authored
-
Josh Rosen authored
-
- Oct 27, 2013
-
Matei Zaharia authored
Display both task ID and task attempt ID in the UI, and rename taskId to taskAttemptId. Previously only the task attempt ID was shown in the UI; this was confusing because a job can be shown as complete while tasks are still running. Showing the task ID in addition to the attempt ID makes it clear which tasks are redundant. This commit also renames taskId to taskAttemptId in TaskInfo and in the local/cluster schedulers. This identifier was used to uniquely identify attempts, not tasks, so the previous naming was confusing. The new naming is also more consistent with MapReduce.
-
Reynold Xin authored
Eliminate extra memory usage when shuffle file consolidation is disabled. Otherwise, we see SPARK-946 even when shuffle file consolidation is disabled. Fixing SPARK-946 is still forthcoming.
-
Stephen Haberman authored
System.getProperties.toMap will fail fast when concurrently modified, and it seems that some other thread started by SparkContext calls System.setProperty during its initialization. Handle this by simply looping on ConcurrentModificationException, which seems safest, since the non-fail-fast methods (Hashtable.entrySet) have undefined behavior under concurrent modification.
-
Aaron Davidson authored
-
Aaron Davidson authored
Otherwise, we see SPARK-946 even when shuffle file consolidation is disabled. Fixing SPARK-946 is still forthcoming.
-
Kay Ousterhout authored
-
- Oct 26, 2013
-
Patrick Wendell authored
Improve error message when multiple assembly jars are present. This can easily happen when building for different Hadoop versions. Right now it gives a ClassNotFoundException.
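A check of the kind this commit describes could be sketched as below. The class name, jar-name pattern, and error text are illustrative assumptions, not the actual Spark launcher code; the point is failing early with a clear message rather than a confusing ClassNotFoundException later.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the check described above (names and messages are
// illustrative): scan a directory for assembly jars and fail with a clear
// error when more than one is found, e.g. builds for different Hadoop
// versions left side by side.
public class AssemblyJarCheck {
    public static File findSingleAssemblyJar(File dir) {
        List<File> matches = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                String name = f.getName();
                if (name.startsWith("spark-assembly") && name.endsWith(".jar")) {
                    matches.add(f);
                }
            }
        }
        if (matches.isEmpty()) {
            throw new IllegalStateException("No assembly jar found in " + dir);
        }
        if (matches.size() > 1) {
            throw new IllegalStateException(
                "Found multiple assembly jars in " + dir + ": " + matches
                + ". Remove the ones built for other Hadoop versions.");
        }
        return matches.get(0);
    }
}
```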
-
Reynold Xin authored
A small revision to the document
-