- Dec 10, 2013
Prashant Sharma authored
- Dec 09, 2013
Prashant Sharma authored
- Dec 07, 2013
Prashant Sharma authored
Incorporated Patrick's feedback comment on #211 and made the Maven build/dependency resolution at least a bit faster.
- Dec 06, 2013
Aaron Davidson authored
Aaron Davidson authored
Prashant Sharma authored
Prashant Sharma authored
- Dec 02, 2013
Prashant Sharma authored
Aaron Davidson authored
Aaron Davidson authored
- Dec 01, 2013
Prashant Sharma authored
- Nov 29, 2013
Prashant Sharma authored
Prashant Sharma authored
- Nov 28, 2013
Prashant Sharma authored
- Nov 27, 2013
Matei Zaharia authored
Add an HTTP timeout for HttpBroadcast. While pulling task bytecode from the HttpBroadcast server, no timeout value is set. This may cause Spark executor code to hang while other tasks in the same executor process wait for the lock. I have encountered this issue in my cluster. Here's the stack trace I captured: https://gist.github.com/haitaoyao/7655830 So add a timeout value to ensure the task fails fast.
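The fix above amounts to putting a timeout on the HTTP fetch so that a hung server produces a fast failure instead of a blocked executor. A minimal Python sketch of the same fail-fast idea (the function name, URL, and 60-second default here are illustrative, not Spark's actual code or configuration):

```python
import socket
import urllib.error
import urllib.request

def fetch_broadcast_block(url: str, timeout_secs: float = 60.0) -> bytes:
    """Fetch a broadcast block over HTTP, failing fast if the server hangs.

    Without a timeout, a stuck server blocks this call forever; the task
    hangs, and other tasks in the same process wait on its locks.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as resp:
            return resp.read()
    except (socket.timeout, urllib.error.URLError) as exc:
        # Fail fast so the scheduler can retry the task elsewhere.
        raise IOError(f"broadcast fetch from {url} failed or timed out") from exc
```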
Prashant Sharma authored
Conflicts:
    core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
    core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
    core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
    core/src/main/scala/org/apache/spark/rdd/RDD.scala
    python/pyspark/rdd.py
Prashant Sharma authored
- Nov 26, 2013
Matei Zaharia authored
Custom Serializers for PySpark. This pull request adds support for custom serializers to PySpark. For now, all Python-transformed (or parallelize()d) RDDs are serialized with the serializer that was specified when creating the SparkContext. PySpark includes `PickleSerDe` and `MarshalSerDe` classes that use Python's `pickle` and `marshal` serializers. It's pretty easy to add support for other serializers, although instructions on this still need to be added. A few notable changes:
- The Scala `PythonRDD` class no longer manipulates pickled objects; data from `textFile` is written to Python as MUTF-8 strings. The Python code performs the appropriate bookkeeping to track which deserializer should be used when reading an underlying JavaRDD. This mechanism could also be used to support other data exchange formats, such as MsgPack.
- Several magic numbers were refactored into constants.
- Batching is implemented by wrapping / decorating an unbatched SerDe.
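The wrap/decorate design in the last bullet can be sketched in plain Python; the class names below are illustrative stand-ins, not PySpark's actual serializer classes:

```python
import marshal
import pickle
from itertools import islice

class PickleSerDe:
    """Unbatched SerDe backed by Python's pickle module."""
    def dumps(self, obj):
        return pickle.dumps(obj)
    def loads(self, data):
        return pickle.loads(data)

class MarshalSerDe:
    """Unbatched SerDe backed by Python's marshal module (faster, less general)."""
    def dumps(self, obj):
        return marshal.dumps(obj)
    def loads(self, data):
        return marshal.loads(data)

class BatchedSerDe:
    """Decorates an unbatched SerDe so items are (de)serialized in groups."""
    def __init__(self, serde, batch_size=1024):
        self.serde = serde
        self.batch_size = batch_size

    def dump_stream(self, iterator):
        # Serialize the stream one batch at a time.
        it = iter(iterator)
        while True:
            batch = list(islice(it, self.batch_size))
            if not batch:
                return
            yield self.serde.dumps(batch)

    def load_stream(self, chunks):
        # Deserialize each chunk and flatten back into single items.
        for chunk in chunks:
            yield from self.serde.loads(chunk)
```

Because batching is a wrapper, any unbatched SerDe gains batching for free, which is the point of the decorator approach.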
Matei Zaharia authored
Log a warning if a task's serialized size is very big. As per Reynold's instructions, we now create a warning-level log entry if a task's serialized size is too big, where "too big" is currently defined as 100 KB. This warning message is generated at most once per stage.
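The "warn at most once per stage" behavior can be sketched as follows; the function name, threshold constant, and use of pickle for serialization are illustrative, not Spark's actual scheduler code:

```python
import logging
import pickle

log = logging.getLogger("scheduler")

TASK_SIZE_WARN_BYTES = 100 * 1024  # "too big" threshold: 100 KB

_warned_stages = set()  # stages we have already warned about

def serialize_task(stage_id: int, task) -> bytes:
    """Serialize a task, warning at most once per stage if it is large."""
    data = pickle.dumps(task)
    if len(data) > TASK_SIZE_WARN_BYTES and stage_id not in _warned_stages:
        _warned_stages.add(stage_id)  # suppress repeat warnings for this stage
        log.warning("Stage %d task is %d bytes, which exceeds %d bytes",
                    stage_id, len(data), TASK_SIZE_WARN_BYTES)
    return data
```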
Matei Zaharia authored
[SPARK-963] Fixed races in JobLoggerSuite
Josh Rosen authored
hhd authored
Mark Hamstra authored
Reynold Xin authored
Improve docs for shuffle instrumentation
Prashant Sharma authored
Prashant Sharma authored
haitao.yao authored
Prashant Sharma authored
Matei Zaharia authored
Add histogram functionality to DoubleRDDFunctions. This pull request adds histogram functionality to DoubleRDDFunctions.
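The core of a histogram over doubles is bucketing values into evenly spaced ranges over [min, max], with the last bucket closed on the right so the maximum is counted. A plain-Python sketch of that evenly-spaced case (the function name and return shape are illustrative, not the DoubleRDDFunctions API):

```python
def histogram(values, num_buckets):
    """Count values into `num_buckets` evenly spaced buckets over [min, max].

    Returns (bucket_boundaries, counts), where boundaries has
    num_buckets + 1 entries. Assumes values is non-empty and not all equal.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    bounds = [lo + i * width for i in range(num_buckets + 1)]
    counts = [0] * num_buckets
    for v in values:
        if v == hi:
            counts[-1] += 1  # the maximum falls in the last (closed) bucket
        else:
            counts[int((v - lo) / width)] += 1
    return bounds, counts
```

On an RDD, the counting step would be done per partition and the per-partition count arrays summed, since the bucket boundaries are the same everywhere.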
Patrick Wendell authored
- Nov 25, 2013
Holden Karau authored
Matei Zaharia authored
OpenHashSet fixes. Incorporated ideas from pull request #200.
- Use the MurmurHash3 finalization step to scramble the bits of the hash code instead of the simpler version in java.util.HashMap; the latter had trouble with ranges of consecutive integers. MurmurHash3 is used by fastutil.
- Don't check keys for equality when re-inserting due to growing the table; the keys will already be unique.
- Remember the grow threshold instead of recomputing it on each insert.
Also added unit tests for size estimation for the specialized hash sets and maps.
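The MurmurHash3 finalization step referred to above is its 32-bit "fmix" avalanche function, which spreads nearby inputs (such as consecutive integers) across the table far better than HashMap-style bit-mixing. In Python, with explicit 32-bit masking, it looks like:

```python
def murmur3_fmix32(h: int) -> int:
    """MurmurHash3's 32-bit finalization ("avalanche") step.

    Scrambles the bits of a hash code so that small input differences
    flip roughly half the output bits. The two multiply constants are
    MurmurHash3's standard fmix32 constants.
    """
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
```

Because fmix32 is a bijection on 32-bit integers, it never introduces new collisions; it only redistributes hash codes before the table's modulo step.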
Matei Zaharia authored
Update tuning.md: clarify when the serializer is used, based on a recent user@ mailing list discussion.
Matei Zaharia authored
Use the proper partition index in mapPartitionsWithIndex. mapPartitionsWithIndex uses TaskContext.partitionId as the partition index. TaskContext.partitionId used to be identical to the partition index in an RDD. However, pull request #186 introduced a scenario (with partition pruning) in which the two can differ. This pull request uses the right partition index in all mapPartitionsWithIndex-related calls. Also removed the extra MapPartitionsWithContextRDD and put all the mapPartitions-related functionality in MapPartitionsRDD.
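The distinction matters because after partition pruning, the surviving partitions keep their original indices rather than being renumbered 0..n-1. A toy sketch of the intended semantics (not Spark's implementation; the data layout is a stand-in for a pruned RDD):

```python
def map_partitions_with_index(partitions, f):
    """Apply f(index, iterator) per partition, passing each partition's
    own index rather than its 0-based position in the (possibly pruned) list.

    `partitions` is a list of (original_index, items) pairs, mimicking an
    RDD whose partitions may have been pruned.
    """
    return [list(f(idx, iter(items))) for idx, items in partitions]

# After pruning, only partitions 2 and 5 remain; f must see indices 2 and 5,
# not positions 0 and 1.
pruned = [(2, [10, 11]), (5, [20])]
```

Using the list position instead of the stored index is exactly the bug this commit fixes: f would see 0 and 1 for the pruned RDD above.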
Andrew Ash authored
Clarify when the serializer is used, based on a recent user@ mailing list discussion.
Matei Zaharia authored
For SPARK-527, support spark-shell when running on YARN (synced to trunk and resubmitted here). In the current YARN mode, the application runs in the ApplicationMaster as a user program, so the whole SparkContext is remote. This approach cannot support applications that involve local interaction and need to run where they are launched. So in this pull request I have added a YarnClientClusterScheduler and backend. With this scheduler, the user application is launched locally, while the executors are launched by YARN on remote nodes with a thin AM that only launches the executors and monitors the Driver actor's status, so that when the client app is done, it can finish the YARN application as well. This enables spark-shell to run on YARN. It also enables other Spark applications to run their SparkContext locally with the master URL "yarn-client". Thus, e.g., SparkPi can print its result locally on the console instead of in the log on the remote machine where the AM runs. Docs are also updated to show how to use this yarn-client mode.
Prashant Sharma authored
Conflicts:
    core/src/main/scala/org/apache/spark/rdd/RDD.scala
    project/SparkBuild.scala
Reynold Xin authored
- Use the MurmurHash3 finalization step to scramble the bits of the hash code instead of the simpler version in java.util.HashMap; the latter had trouble with ranges of consecutive integers. MurmurHash3 is used by fastutil.
- Don't check keys for equality when re-inserting due to growing the table; the keys will already be unique.
- Remember the grow threshold instead of recomputing it on each insert.
Reynold Xin authored