- Nov 24, 2013

Reynold Xin authored
Also changed the semantics of the index parameter in mapPartitionsWithIndex from the partition index of the output partition to the partition index in the current RDD.

- Nov 23, 2013

Reynold Xin authored
AppendOnlyMap fixes - Chose a more random bit-reshuffling step for the values returned by Object.hashCode, to avoid the long chaining that was happening for consecutive integers (e.g. `sc.makeRDD(1 to 100000000, 100).map(t => (t, t)).reduceByKey(_ + _).count`) - Some other small optimizations throughout (see commit comments)

Matei Zaharia authored
- Don't check keys for equality when re-inserting due to growing the table; the keys will already be unique - Remember the grow threshold instead of recomputing it on each insert

Matei Zaharia authored
- Use the Murmur Hash 3 finalization step to scramble the bits of hashCode instead of the simpler version in java.util.HashMap; the latter had trouble with ranges of consecutive integers. The Murmur Hash 3 finalizer is also used by fastutil. - Use Object.equals() instead of Scala's == to compare keys, because the latter does extra casts for numeric types (see the equals method in https://github.com/scala/scala/blob/master/src/library/scala/runtime/BoxesRunTime.java)
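The finalization step referred to here is the 32-bit Murmur3 "fmix" routine. A standalone sketch (constants and shifts are from the public MurmurHash3 reference; the class and method names are illustrative, not Spark's):

```java
// Sketch of the 32-bit Murmur3 finalization ("fmix") step described above.
// Each operation (XOR-shift, multiply by an odd constant) is invertible
// mod 2^32, so the whole mix is a bijection: it scatters consecutive
// inputs without ever introducing new collisions.
public class Murmur3Fmix {
    public static int rehash(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        // An Integer's hashCode is its value, so consecutive keys would
        // otherwise land in consecutive buckets and chain; after mixing
        // they are well scattered.
        for (int i = 1; i <= 4; i++) {
            System.out.println(i + " -> " + rehash(i));
        }
    }
}
```

Because the mix is a bijection, distinct inputs are guaranteed to produce distinct outputs; only the bucket distribution changes.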

Reynold Xin authored
Support preservesPartitioning in RDD.zipPartitions. In `RDD.zipPartitions`, add support for a `preservesPartitioning` option (similar to `RDD.mapPartitions`) that reuses the first RDD's partitioner.

Ankur Dave authored

- Nov 21, 2013

Reynold Xin authored
Fix Kryo Serializer buffer documentation inconsistency. The documentation here is inconsistent with the coded default and other documentation.

Reynold Xin authored
TimeTrackingOutputStream should pass on calls to close() and flush(). Without this fix you get a huge number of open files when running shuffles.

Patrick Wendell authored
Without this fix you get a huge number of open files after running shuffles.

- Nov 20, 2013

Neal Wiggins authored
The documentation here is inconsistent with the coded default and other documentation.

Reynold Xin authored
PartitionPruningRDD is using index from parent. I was getting an ArrayIndexOutOfBoundsException after calling union on a pruned RDD. The index it was using for the partition was the index in the original RDD, not in the new pruned RDD.
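A minimal illustration of the indexing issue (all names here are hypothetical, not Spark's code): after pruning, the child must expose contiguous indices 0..n-1 to downstream consumers such as union, and keep the parent's partition index only for its own internal lookups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch of the fix described above. Using the parent's index
// as the child's own index is exactly the out-of-bounds bug the commit
// fixes: a consumer seeing 3 partitions would ask for index 0..2, but the
// surviving parent indices might be e.g. {5, 7, 9}.
public class PrunedIndex {
    private final int[] parentIndices; // position = child index, value = parent index

    public PrunedIndex(int numParentPartitions, IntPredicate keep) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < numParentPartitions; i++) {
            if (keep.test(i)) kept.add(i);
        }
        parentIndices = kept.stream().mapToInt(Integer::intValue).toArray();
    }

    // Contiguous 0..n-1 view: what union and friends expect.
    public int numPartitions() { return parentIndices.length; }

    // Internal mapping back to the parent's partition numbering.
    public int parentIndex(int childIndex) { return parentIndices[childIndex]; }
}
```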

Matei Zaharia authored
Cleanup to remove semicolons (;) from Scala code -) The main reason for this PR is to remove semicolons from single statements of Scala code -) Remove unused imports as I see them -) Fix the ASF comment header in some files (a bad copy/paste, I suppose)

- Nov 19, 2013

Henry Saputra authored

Henry Saputra authored
Passed the sbt/sbt compile and test

Matei Zaharia authored
Correct the number of tasks in ExecutorsUI. Index `a` is not `execId` here.

Matei Zaharia authored
Improve Spark on YARN error handling. Improve CLI error handling and only allow a certain number of worker failures before failing the application; this will help prevent users' jobs from running forever after a foolish mistake, for instance using 32-bit Java but trying to allocate 8G containers. Without this change that loops forever; now it errors out after a configurable number of retries. Also increase the frequency with which we ping the RM, to increase the speed at which we get containers if they die. The YARN MR app defaults to pinging the RM every 1 second, so the default of 5 seconds here is fine, but that is configurable as well in case people want to change it. I do want to make sure there aren't any cases where calling stopExecutors in CoarseGrainedSchedulerBackend would cause problems; I couldn't think of any, and tested on a standalone cluster as well as on YARN.

Matei Zaharia authored
Enable the Broadcast examples to work in a cluster setting. Since they rely on println to display results, we need to first collect those results to the driver to have them actually display locally. This issue came up on the mailing lists [here](http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3C2013111909591557147628%40ict.ac.cn%3E).

tgravescs authored

Henry Saputra authored
Also remove unused imports as I found them along the way. Remove return statements when returning a value in the Scala code. Passes compile and tests.

Matthew Taylor authored

Aaron Davidson authored
Since they rely on println to display results, we need to first collect those results to the driver to have them actually display locally.

Matthew Taylor authored

- Nov 17, 2013

shiyun.wxm authored

Reynold Xin authored
Slightly enhanced PrimitiveVector: 1. Added a trim() method. 2. Added a size() method. 3. Renamed getUnderlyingArray to array. 4. Minor documentation update.
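A minimal sketch of what such a primitive-specialized growable vector looks like with this interface. This is an illustration only, not Spark's PrimitiveVector (which is a Scala class specialized over primitive element types); method names follow the commit's description.

```java
import java.util.Arrays;

// Illustrative sketch (not Spark's actual PrimitiveVector) of a growable
// primitive array with the interface the commit describes: size(), array()
// (formerly getUnderlyingArray), and the newly added trim().
public class IntVector {
    private int[] data = new int[8]; // assumed initial capacity
    private int count = 0;

    public void add(int v) {
        if (count == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // double on resize
        }
        data[count++] = v;
    }

    public int size() { return count; }

    // Backing array; may be longer than size() until trim() is called.
    public int[] array() { return data; }

    // Shrink the backing array to exactly size() elements.
    public void trim() { data = Arrays.copyOf(data, count); }
}
```

Exposing the backing array directly (rather than copying) is the point of a primitive vector: callers avoid boxing and extra allocation, at the cost of having to respect size() or call trim() first.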

Reynold Xin authored
Add PrimitiveVectorSuite and fix a bug in resize()

Aaron Davidson authored

Reynold Xin authored

BlackNiuza authored

Reynold Xin authored
1. Added a trim() method. 2. Added a size() method. 3. Renamed getUnderlyingArray to array. 4. Minor documentation update.

BlackNiuza authored

- Nov 16, 2013

Matei Zaharia authored
Simple cleanup on Spark's Scala code while testing some modules: -) Remove some unused imports as I found them -) Remove ";" in the import statements -) Remove () at the end of method calls like size that have no side effects

- Nov 15, 2013

Henry Saputra authored
-) Remove some unused imports as I found them -) Remove ";" in the import statements -) Remove () at the end of method calls like size that have no side effects

Matei Zaharia authored
Fix bug where the scheduler could hang after a task failure. When a task fails, we need to call reviveOffers() so that the task can be rescheduled on a different machine. In the current code, the state in ClusterTaskSetManager indicating which tasks are pending may be updated after reviveOffers is called (there's a race condition here), so when reviveOffers is called, the task set manager does not yet realize that there are failed tasks that need to be relaunched. This isn't currently unit tested, but will be once my pull request for merging the cluster and local schedulers goes in -- at which point many more of the unit tests will exercise the code paths through the cluster scheduler (currently the failure test suite uses the local scheduler, which is why we didn't see this bug before).

- Nov 14, 2013

Matei Zaharia authored
Don't retry tasks when they fail due to a NotSerializableException. As with my previous pull request, this will be unit tested once the Cluster and Local schedulers get merged.

Matei Zaharia authored
Write the Spark UI URL to the driver file on HDFS. This makes the SIMR code path simpler.

Kay Ousterhout authored

Kay Ousterhout authored
When a task fails, we need to call reviveOffers() so that the task can be rescheduled on a different machine. In the current code, the state in ClusterTaskSetManager indicating which tasks are pending may be updated after reviveOffers is called (there's a race condition here), so when reviveOffers is called, the task set manager does not yet realize that there are failed tasks that need to be relaunched.

Reynold Xin authored
Don't ignore spark.cores.max when using Mesos coarse mode. totalCoresAcquired is decremented but never incremented, causing Spark to effectively ignore spark.cores.max in coarse-grained Mesos mode.
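The bug is one-sided accounting. A hedged sketch of the invariant the fix restores (class and method names are illustrative, not the Mesos backend's actual code):

```java
// Illustrative sketch of the accounting described above: a cores-in-use
// counter must be incremented when cores are granted and decremented when
// they are released. If it is only ever decremented (the bug), the cap
// check always passes and a spark.cores.max-style limit is never enforced.
public class CoreLimiter {
    private final int maxCores;     // the configured cap
    private int coresAcquired = 0;  // cores currently in use

    public CoreLimiter(int maxCores) { this.maxCores = maxCores; }

    public boolean tryAcquire(int cores) {
        if (coresAcquired + cores > maxCores) {
            return false; // over the cap: decline the offer
        }
        coresAcquired += cores; // omitting this line reproduces the bug
        return true;
    }

    public void release(int cores) { coresAcquired -= cores; }
}
```

With the increment missing, `coresAcquired` only ever goes negative, so `coresAcquired + cores > maxCores` is never true and every offer is accepted.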

Reynold Xin authored
Fixed a scaladoc typo in HadoopRDD.scala

Reynold Xin authored
Fixed typos in the CDH4 distribution version codes. Nothing important, but annoying when doing a copy/paste...