- Dec 10, 2013
Prashant Sharma authored
- Dec 09, 2013
Prashant Sharma authored
- Dec 07, 2013
Prashant Sharma authored
Incorporated Patrick's feedback comment on #211 and made the Maven build/dependency resolution at least a bit faster.
- Dec 06, 2013
Aaron Davidson authored
Aaron Davidson authored
Prashant Sharma authored
Prashant Sharma authored
- Dec 02, 2013
Prashant Sharma authored
Aaron Davidson authored
Aaron Davidson authored
- Dec 01, 2013
Prashant Sharma authored
- Nov 29, 2013
Prashant Sharma authored
Prashant Sharma authored
- Nov 28, 2013
Prashant Sharma authored
- Nov 27, 2013
Matei Zaharia authored
Add an HTTP timeout for HttpBroadcast. While pulling task bytecode from the HttpBroadcast server, no timeout value is set. This may cause Spark executor code to hang while other tasks in the same executor process wait for the lock. I have encountered this issue in my cluster. Here's the stack trace I captured: https://gist.github.com/haitaoyao/7655830 So add a timeout value to ensure the task fails fast.
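The fix above amounts to putting a timeout on the HTTP fetch so that a hung server produces a fast failure instead of a blocked executor. A minimal Python sketch of the same fail-fast idea (the function name, URL, and 60-second default here are illustrative, not Spark's actual code or configuration):

```python
import socket
import urllib.error
import urllib.request

def fetch_broadcast_block(url: str, timeout_secs: float = 60.0) -> bytes:
    """Fetch a broadcast block over HTTP, failing fast if the server hangs.

    Without a timeout, a stuck server blocks this call forever; the task
    hangs, and other tasks in the same process wait on its locks.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as resp:
            return resp.read()
    except (socket.timeout, urllib.error.URLError) as exc:
        # Fail fast so the scheduler can retry the task elsewhere.
        raise IOError(f"broadcast fetch from {url} failed or timed out") from exc
```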
Prashant Sharma authored
Conflicts:
    core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
    core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
    core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
    core/src/main/scala/org/apache/spark/rdd/RDD.scala
    python/pyspark/rdd.py
Prashant Sharma authored
- Nov 26, 2013
Matei Zaharia authored
Custom Serializers for PySpark. This pull request adds support for custom serializers to PySpark. For now, all Python-transformed (or parallelize()d) RDDs are serialized with the serializer that was specified when creating the SparkContext. PySpark includes `PickleSerDe` and `MarshalSerDe` classes that use Python's `pickle` and `marshal` serializers. It's pretty easy to add support for other serializers, although instructions on this still need to be added. A few notable changes:
- The Scala `PythonRDD` class no longer manipulates pickled objects; data from `textFile` is written to Python as MUTF-8 strings. The Python code performs the appropriate bookkeeping to track which deserializer should be used when reading an underlying JavaRDD. This mechanism could also be used to support other data exchange formats, such as MsgPack.
- Several magic numbers were refactored into constants.
- Batching is implemented by wrapping / decorating an unbatched SerDe.
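The wrap/decorate design in the last bullet can be sketched in plain Python; the class names below are illustrative stand-ins, not PySpark's actual serializer classes:

```python
import marshal
import pickle
from itertools import islice

class PickleSerDe:
    """Unbatched SerDe backed by Python's pickle module."""
    def dumps(self, obj):
        return pickle.dumps(obj)
    def loads(self, data):
        return pickle.loads(data)

class MarshalSerDe:
    """Unbatched SerDe backed by Python's marshal module (faster, less general)."""
    def dumps(self, obj):
        return marshal.dumps(obj)
    def loads(self, data):
        return marshal.loads(data)

class BatchedSerDe:
    """Decorates an unbatched SerDe so items are (de)serialized in groups."""
    def __init__(self, serde, batch_size=1024):
        self.serde = serde
        self.batch_size = batch_size

    def dump_stream(self, iterator):
        # Serialize the stream one batch at a time.
        it = iter(iterator)
        while True:
            batch = list(islice(it, self.batch_size))
            if not batch:
                return
            yield self.serde.dumps(batch)

    def load_stream(self, chunks):
        # Deserialize each chunk and flatten back into single items.
        for chunk in chunks:
            yield from self.serde.loads(chunk)
```

Because batching is a wrapper, any unbatched SerDe gains batching for free, which is the point of the decorator approach.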
Matei Zaharia authored
Log a warning if a task's serialized size is very big. As per Reynold's instructions, we now create a warning-level log entry if a task's serialized size is too big, where "too big" is currently defined as 100 KB. This warning message is generated at most once per stage.
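The "warn at most once per stage" behavior can be sketched as follows; the function name, threshold constant, and use of pickle for serialization are illustrative, not Spark's actual scheduler code:

```python
import logging
import pickle

log = logging.getLogger("scheduler")

TASK_SIZE_WARN_BYTES = 100 * 1024  # "too big" threshold: 100 KB

_warned_stages = set()  # stages we have already warned about

def serialize_task(stage_id: int, task) -> bytes:
    """Serialize a task, warning at most once per stage if it is large."""
    data = pickle.dumps(task)
    if len(data) > TASK_SIZE_WARN_BYTES and stage_id not in _warned_stages:
        _warned_stages.add(stage_id)  # suppress repeat warnings for this stage
        log.warning("Stage %d task is %d bytes, which exceeds %d bytes",
                    stage_id, len(data), TASK_SIZE_WARN_BYTES)
    return data
```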
Matei Zaharia authored
[SPARK-963] Fixed races in JobLoggerSuite
Josh Rosen authored
hhd authored
Mark Hamstra authored
Reynold Xin authored
Improve docs for shuffle instrumentation
Prashant Sharma authored
Prashant Sharma authored
haitao.yao authored
Prashant Sharma authored
Matei Zaharia authored
Add histogram functionality to DoubleRDDFunctions. This pull request adds histogram functionality to DoubleRDDFunctions.
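The core of a histogram over doubles is bucketing values into evenly spaced ranges over [min, max], with the last bucket closed on the right so the maximum is counted. A plain-Python sketch of that evenly-spaced case (the function name and return shape are illustrative, not the DoubleRDDFunctions API):

```python
def histogram(values, num_buckets):
    """Count values into `num_buckets` evenly spaced buckets over [min, max].

    Returns (bucket_boundaries, counts), where boundaries has
    num_buckets + 1 entries. Assumes values is non-empty and not all equal.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    bounds = [lo + i * width for i in range(num_buckets + 1)]
    counts = [0] * num_buckets
    for v in values:
        if v == hi:
            counts[-1] += 1  # the maximum falls in the last (closed) bucket
        else:
            counts[int((v - lo) / width)] += 1
    return bounds, counts
```

On an RDD, the counting step would be done per partition and the per-partition count arrays summed, since the bucket boundaries are the same everywhere.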
Patrick Wendell authored
- Nov 25, 2013
Holden Karau authored
Matei Zaharia authored
OpenHashSet fixes. Incorporated ideas from pull request #200.
- Use the MurmurHash3 finalization step to scramble the bits of the hash code instead of the simpler version in java.util.HashMap; the latter had trouble with ranges of consecutive integers. MurmurHash3 is used by fastutil.
- Don't check keys for equality when re-inserting due to growing the table; the keys will already be unique.
- Remember the grow threshold instead of recomputing it on each insert.
Also added unit tests for size estimation for the specialized hash sets and maps.
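The MurmurHash3 finalization step referred to above is its 32-bit "fmix" avalanche function, which spreads nearby inputs (such as consecutive integers) across the table far better than HashMap-style bit-mixing. In Python, with explicit 32-bit masking, it looks like:

```python
def murmur3_fmix32(h: int) -> int:
    """MurmurHash3's 32-bit finalization ("avalanche") step.

    Scrambles the bits of a hash code so that small input differences
    flip roughly half the output bits. The two multiply constants are
    MurmurHash3's standard fmix32 constants.
    """
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
```

Because fmix32 is a bijection on 32-bit integers, it never introduces new collisions; it only redistributes hash codes before the table's modulo step.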
Matei Zaharia authored
Update tuning.md: clarify when the serializer is used, based on a recent user@ mailing list discussion.
Matei Zaharia authored
Use the proper partition index in mapPartitionsWithIndex. mapPartitionsWithIndex uses TaskContext.partitionId as the partition index. TaskContext.partitionId used to be identical to the partition index in an RDD. However, pull request #186 introduced a scenario (with partition pruning) in which the two can differ. This pull request uses the right partition index in all mapPartitionsWithIndex-related calls. Also removed the extra MapPartitionsWithContextRDD and put all the mapPartitions-related functionality in MapPartitionsRDD.
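The distinction matters because after partition pruning, the surviving partitions keep their original indices rather than being renumbered 0..n-1. A toy sketch of the intended semantics (not Spark's implementation; the data layout is a stand-in for a pruned RDD):

```python
def map_partitions_with_index(partitions, f):
    """Apply f(index, iterator) per partition, passing each partition's
    own index rather than its 0-based position in the (possibly pruned) list.

    `partitions` is a list of (original_index, items) pairs, mimicking an
    RDD whose partitions may have been pruned.
    """
    return [list(f(idx, iter(items))) for idx, items in partitions]

# After pruning, only partitions 2 and 5 remain; f must see indices 2 and 5,
# not positions 0 and 1.
pruned = [(2, [10, 11]), (5, [20])]
```

Using the list position instead of the stored index is exactly the bug this commit fixes: f would see 0 and 1 for the pruned RDD above.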
Andrew Ash authored
Clarify when the serializer is used, based on a recent user@ mailing list discussion.
Matei Zaharia authored
For SPARK-527, support spark-shell when running on YARN (synced to trunk and resubmitted here). In the current YARN mode, the application runs in the ApplicationMaster as a user program, so the whole SparkContext is remote. This approach cannot support applications that involve local interaction and need to run where they are launched. So in this pull request I have added a YarnClientClusterScheduler and backend. With this scheduler, the user application is launched locally, while the executors are launched by YARN on remote nodes with a thin AM that only launches the executors and monitors the Driver actor's status, so that when the client app is done, it can finish the YARN application as well. This enables spark-shell to run on YARN. It also enables other Spark applications to run their SparkContext locally with the master URL "yarn-client". Thus, e.g., SparkPi can print its result locally on the console instead of in the log on the remote machine where the AM runs. Docs are also updated to show how to use this yarn-client mode.
Prashant Sharma authored
Conflicts:
    core/src/main/scala/org/apache/spark/rdd/RDD.scala
    project/SparkBuild.scala
Reynold Xin authored
- Use the MurmurHash3 finalization step to scramble the bits of the hash code instead of the simpler version in java.util.HashMap; the latter had trouble with ranges of consecutive integers. MurmurHash3 is used by fastutil.
- Don't check keys for equality when re-inserting due to growing the table; the keys will already be unique.
- Remember the grow threshold instead of recomputing it on each insert.
Reynold Xin authored