Commits · f18fd05b513b136363c94adb3e5b841f8bf48134 · cs525-sp18-g07 / spark

May 21, 2014

[SPARK-1519] Support minPartitions param of wholeTextFiles() in PySpark · f18fd05b

Kan Zhang authored 11 years ago

Author: Kan Zhang <kzhang@apache.org>

Closes #697 from kanzhang/SPARK-1519 and squashes the following commits:

4f8d1ed [Kan Zhang] [SPARK-1519] Support minPartitions param of wholeTextFiles() in PySpark

f18fd05b

May 07, 2014

SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions · 3308722c

Aaron Davidson authored 11 years ago

This patch includes several cleanups to PythonRDD, focused around fixing [SPARK-1579](https://issues.apache.org/jira/browse/SPARK-1579) cleanly. Listed in order of approximate importance:

- The Python daemon waits for Spark to close the socket before exiting,
  in order to avoid causing spurious IOExceptions in Spark's
  `PythonRDD::WriterThread`.
- Removes the Python Monitor Thread, which polled for task cancellations
  in order to kill the Python worker. Instead, we do this in the
  onCompleteCallback, since this is guaranteed to be called during
  cancellation.
- Adds a "completed" variable to TaskContext to avoid the issue noted in
  [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), where onCompleteCallbacks may be execution-order dependent.
  Along with this, I removed the "context.interrupted = true" flag in
  the onCompleteCallback.
- Extracts PythonRDD::WriterThread to its own class.
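
The "completed" flag described in the list above can be sketched as follows. This is a minimal Python illustration, not the Scala code from the patch: the `TaskContext` class here and its method names are hypothetical stand-ins. It assumes the flag is set before any callback runs, so a callback does not depend on the order in which the others were registered.

```python
class TaskContext:
    """Hypothetical stand-in for Spark's Scala TaskContext."""

    def __init__(self):
        self.completed = False
        self._on_complete_callbacks = []

    def add_on_complete_callback(self, callback):
        self._on_complete_callbacks.append(callback)

    def execute_on_complete_callbacks(self):
        # Set the flag first so every callback observes completed == True,
        # whatever order the callbacks were registered in.
        self.completed = True
        for callback in self._on_complete_callbacks:
            callback()


observed = []
context = TaskContext()
context.add_on_complete_callback(lambda: observed.append(context.completed))
context.add_on_complete_callback(lambda: observed.append(context.completed))
context.execute_on_complete_callbacks()
```

Both callbacks see the same state because the flag is written once, up front, rather than by one of the callbacks.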

Since this patch provides an alternative solution to [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), I did test it with

```
sc.textFile("latlon.tsv").take(5)
```

many times without error.

Additionally, in order to test the unswallowed exceptions, I performed

```
sc.textFile("s3n://<big file>").count()
```

and cut my internet during execution. Prior to this patch, we got the "stdin writer exited early" message, which was unhelpful. Now, we get the SocketExceptions propagated through Spark to the user and get proper (though unsuccessful) task retries.

Author: Aaron Davidson <aaron@databricks.com>

Closes #640 from aarondav/pyspark-io and squashes the following commits:

b391ff8 [Aaron Davidson] Detect "clean socket shutdowns" and stop waiting on the socket
c0c49da [Aaron Davidson] SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions

3308722c

May 06, 2014

[SPARK-1549] Add Python support to spark-submit · 951a5d93

Matei Zaharia authored 11 years ago

This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.

This patch also updates the Python worker manager to run python with -u (unbuffered output), so worker output reaches our logs right away instead of sitting in a buffer after it was written; this should simplify debugging.
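
As a rough illustration of the `-u` flag, the sketch below launches a child interpreter with it from Python. It uses the standard `subprocess` module and is not the worker manager code from this patch; the printed message is made up for the example.

```python
import subprocess
import sys

# Start a child interpreter with -u so its stdout and stderr are unbuffered.
# The command string is only a placeholder for a real worker entry point.
result = subprocess.run(
    [sys.executable, "-u", "-c", "print('worker started')"],
    capture_output=True,
    text=True,
    check=True,
)
worker_output = result.stdout
```

With `-u`, a long-running child writes each message through immediately, which matters most when its output is redirected to a log file or pipe.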

In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.

In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.

Author: Matei Zaharia <matei@databricks.com>

Closes #664 from mateiz/py-submit and squashes the following commits:

15e9669 [Matei Zaharia] Fix some uses of path.separator property
051278c [Matei Zaharia] Small style fixes
0afe886 [Matei Zaharia] Add license headers
4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones

951a5d93

Apr 24, 2014

[SPARK-986]: Job cancelation for PySpark · e53eb4f0

Ahir Reddy authored 11 years ago

* Additions to the PySpark API to cancel jobs
* Monitor Thread in PythonRDD to kill Python workers if a task is interrupted

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #541 from ahirreddy/python-cancel and squashes the following commits:

dfdf447 [Ahir Reddy] Changed success -> completed and made logging message clearer
6c860ab [Ahir Reddy] PR Comments
4b4100a [Ahir Reddy] Success flag
adba6ed [Ahir Reddy] Destroy python workers
27a2f8f [Ahir Reddy] Start the writer thread...
d422f7b [Ahir Reddy] Remove unnecessary vals
adda337 [Ahir Reddy] Busy wait on the context.interrupted flag, and then kill the python worker
d9e472f [Ahir Reddy] Revert "removed unnecessary vals"
5b9cae5 [Ahir Reddy] removed unnecessary vals
07b54d9 [Ahir Reddy] Fix canceling unit test
8ae9681 [Ahir Reddy] Don't interrupt worker
7722342 [Ahir Reddy] Monitor Thread for python workers
db04e16 [Ahir Reddy] Added canceling api to PySpark

e53eb4f0

Apr 18, 2014

SPARK-1483: Rename minSplits to minPartitions in public APIs · e31c8ffc

CodingCat authored 11 years ago

https://issues.apache.org/jira/browse/SPARK-1483

From the original JIRA: "The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz

Author: CodingCat <zhunansjtu@gmail.com>

Closes #430 from CodingCat/SPARK-1483 and squashes the following commits:

4b60541 [CodingCat] deprecate defaultMinSplits
ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs

e31c8ffc

Apr 04, 2014

SPARK-1305: Support persisting RDD's directly to Tachyon · b50ddfde

Haoyuan Li authored 11 years ago

Moves PR #468 from apache-incubator-spark to apache-spark:
"Adding an option to persist Spark RDD blocks into Tachyon."

Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
Author: RongGu <gurongwalker@gmail.com>

Closes #158 from RongGu/master and squashes the following commits:

72b7768 [Haoyuan Li] merge master
9f7fa1b [Haoyuan Li] fix code style
ae7834b [Haoyuan Li] minor cleanup
a8b3ec6 [Haoyuan Li] merge master branch
e0f4891 [Haoyuan Li] better check offheap.
55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
7cd4600 [RongGu] remove some logic code for tachyonstore's replication
51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
8adfcfa [RongGu] address aaron's comment on inTachyonSize
120e48a [RongGu] changed the root-level dir name in Tachyon
5cc041c [Haoyuan Li] address aaron's comments
9b97935 [Haoyuan Li] address aaron's comments
d9a6438 [Haoyuan Li] fix for pyspark
77d2703 [Haoyuan Li] change python api.git status
3dcace4 [Haoyuan Li] address matei's comments
91fa09d [Haoyuan Li] address patrick's comments
589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
64348b2 [Haoyuan Li] update conf docs.
ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
be79d77 [RongGu] find a way to clean up some unnecessary methods and classes to make the code simpler
49cc724 [Haoyuan Li] update docs with off_heap option
4572f9f [RongGu] reserving the old apply function API of StorageLevel
04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
c9aeabf [RongGu] rename the StorageLevel.TACHYON as StorageLevel.OFF_HEAP
76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
939e467 [Haoyuan Li] 0.4.1-thrift from maven central
86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
d827250 [RongGu] fix JsonProtocolSuite test failure
716e93b [Haoyuan Li] revert the version
ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
2825a13 [RongGu] up-merging to the current master branch of the apache spark
6a22c1a [Haoyuan Li] fix scalastyle
8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
1dcadf9 [Haoyuan Li] typo
bf278fa [Haoyuan Li] fix python tests
e82909c [Haoyuan Li] minor cleanup
776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
8859371 [Haoyuan Li] various minor fixes and clean up
e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
fcaeab2 [Haoyuan Li] address Aaron's comment
e554b1e [Haoyuan Li] add python code
47304b3 [Haoyuan Li] make tachyonStore in BlockManager lazy val; add more comments StorageLevels.
dc8ef24 [Haoyuan Li] add old storelevel constructor
e01a271 [Haoyuan Li] update tachyon 0.4.1
8011a96 [RongGu] fix a brought-in mistake in StorageLevel
70ca182 [RongGu] a bit change in comment
556978b [RongGu] fix the scalastyle errors
791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark

b50ddfde

SPARK-1414. Python API for SparkContext.wholeTextFiles · 60e18ce7

Matei Zaharia authored 11 years ago

Also clarified comment on each file having to fit in memory

Author: Matei Zaharia <matei@databricks.com>

Closes #327 from mateiz/py-whole-files and squashes the following commits:

9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles

60e18ce7

Mar 10, 2014

[SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted) · f5518989

jyotiska authored 11 years ago

Author: jyotiska <jyotiska123@gmail.com>

Closes #34 from jyotiska/pyspark_code and squashes the following commits:

c9439be [jyotiska] replaced dict with namedtuple
a6bf4cd [jyotiska] added callsite info for context.py

f5518989

Mar 06, 2014

SPARK-1187, Added missing Python APIs · 3d3acef0

Prabin Banka authored 11 years ago

The following Python APIs are added:

- RDD.id()
- SparkContext.setJobGroup()
- SparkContext.setLocalProperty()
- SparkContext.getLocalProperty()
- SparkContext.sparkUser()

This was raised earlier as part of apache/incubator-spark#486.

Author: Prabin Banka <prabin.banka@imaginea.com>

Closes #75 from prabinb/python-api-backup and squashes the following commits:

cc3c6cd [Prabin Banka] Added missing Python APIs

3d3acef0

Feb 20, 2014

SPARK-1114: Allow PySpark to use existing JVM and Gateway · 59b13795

Ahir Reddy authored 11 years ago

Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits:

a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.

59b13795

Jan 28, 2014

Switch from MUTF8 to UTF8 in PySpark serializers. · 1381fc72

Josh Rosen authored 11 years ago

This fixes SPARK-1043, a bug introduced in 0.9.0
where PySpark couldn't serialize strings > 64kB.

This fix was written by @tyro89 and @bouk in #512.
This commit squashes and rebases their pull request
in order to fix some merge conflicts.
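
To make the size limit concrete, here is a small Python sketch. It assumes the 64 kB ceiling comes from a two-byte length prefix, as used by Java's `DataOutputStream.writeUTF`, and that the replacement framing is a four-byte length followed by plain UTF-8 bytes. The function names are hypothetical and this is not the serializer code from the patch.

```python
import struct


def write_utf_short(text):
    """Frame with a 2-byte unsigned length; fails above 65535 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">H", len(data)) + data


def write_utf_int(text):
    """Frame with a 4-byte signed length followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">i", len(data)) + data


def read_utf_int(frame):
    """Decode a frame produced by write_utf_int."""
    (length,) = struct.unpack(">i", frame[:4])
    return frame[4:4 + length].decode("utf-8")


big = "x" * 70000  # larger than a 2-byte length can describe
try:
    write_utf_short(big)
    short_frame_ok = True
except struct.error:
    short_frame_ok = False

round_tripped = read_utf_int(write_utf_int(big))
```

The two-byte variant cannot even describe the payload length, while the four-byte variant round-trips the same string.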

1381fc72

Jan 01, 2014
- Fix Python code after change of getOrElse · 7e8d2e8a
  Matei Zaharia authored 11 years ago
  
  7e8d2e8a
Dec 30, 2013
- Updated docs for SparkConf and handled review comments · 0fa58097
  Matei Zaharia authored 11 years ago
  
  0fa58097
Dec 29, 2013
- Properly show Spark properties on web UI, and change app name property · 994f080f
  Matei Zaharia authored 11 years ago
  
  994f080f
- Fix some Python docs and make sure to unset SPARK_TESTING in Python · eaa8a68f
  Matei Zaharia authored 11 years ago
  
  tests so we don't get the test spark.conf on the classpath.
  eaa8a68f
- Add Python docs about SparkConf · 58c6fa20
  Matei Zaharia authored 11 years ago
  
  58c6fa20
- Fix some other Python tests due to initializing JVM in a different way · 615fb649
  Matei Zaharia authored 11 years ago
  
  The test in context.py created two different instances of the SparkContext class by copying "globals", so that some tests can have a global "sc" object and others can try initializing their own contexts. This led to two JVM gateways being created since SparkConf also looked at pyspark.context.SparkContext to get the JVM.
  615fb649
- Add SparkConf support in Python · cd00225d
  Matei Zaharia authored 11 years ago
  
  cd00225d
Dec 28, 2013
- Fix Python use of getLocalDir · 1c11f54a
  Matei Zaharia authored 11 years ago
  
  1c11f54a
Dec 24, 2013
- Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289. · d4dfab50
  Tathagata Das authored 11 years ago
  
  d4dfab50
Dec 18, 2013
- Add collectPartition to JavaRDD interface. · af0cd6bd
  Shivaram Venkataraman authored 11 years ago
  
  Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
  af0cd6bd
Nov 10, 2013

FramedSerializer: _dumps => dumps, _loads => loads. · 13122ceb
Josh Rosen authored 11 years ago

13122ceb

Add custom serializer support to PySpark. · cbb7f04a

Josh Rosen authored 11 years ago

For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers.  Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.

This also fixes a bug in SparkContext.union().
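
The idea of a pluggable serializer can be sketched as below. `MarshalSerializer` and `FramedSerializer` are names taken from this commit history, but the method bodies, the `dump_stream`/`load_stream` names, and the four-byte length framing are assumptions made for illustration rather than the code in this commit.

```python
import io
import marshal
import struct


class FramedSerializer:
    """Illustrative base class: writes each object as a length-prefixed frame."""

    def dumps(self, obj):
        raise NotImplementedError

    def loads(self, data):
        raise NotImplementedError

    def dump_stream(self, iterator, stream):
        for obj in iterator:
            payload = self.dumps(obj)
            stream.write(struct.pack(">i", len(payload)))
            stream.write(payload)

    def load_stream(self, stream):
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return  # end of stream
            (length,) = struct.unpack(">i", header)
            yield self.loads(stream.read(length))


class MarshalSerializer(FramedSerializer):
    """Subclass that only decides how a single object is encoded."""

    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, data):
        return marshal.loads(data)


buffer = io.BytesIO()
serializer = MarshalSerializer()
serializer.dump_stream([1, "two", (3, 4)], buffer)
buffer.seek(0)
decoded = list(serializer.load_stream(buffer))
```

Keeping the framing in the base class means a new format only has to supply `dumps` and `loads`.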

cbb7f04a

Nov 03, 2013

Remove Pickle-wrapping of Java objects in PySpark. · 7d68a81a

Josh Rosen authored 11 years ago

If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.

7d68a81a

Oct 22, 2013

Pass self to SparkContext._ensure_initialized. · 317a9eb1

Ewen Cheslack-Postava authored 11 years ago

The constructor for SparkContext should pass in self so that we track
the current context and produce errors if another one is created. Add
a doctest to make sure creating multiple contexts triggers the
exception.

317a9eb1

Add classmethod to SparkContext to set system properties. · 56d230e6

Ewen Cheslack-Postava authored 11 years ago

Add a new classmethod to SparkContext to set system properties, as is
possible in Scala/Java. Unlike the Java/Scala implementations, there's
no access to System until the JVM bridge is created. Since
SparkContext handles that, move the initialization of the JVM
connection to a separate classmethod that can safely be called
repeatedly as long as the same instance (or no instance) is provided.
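
The behaviour described above, an initializer that is safe to call repeatedly with the same instance or with no instance, might look roughly like this. The `Context` class, its attributes, and the error message are hypothetical; no JVM is started, and a plain `object()` stands in for the bridge.

```python
class Context:
    """Hypothetical stand-in for SparkContext; no JVM is launched here."""

    _gateway = None
    _active_context = None
    _properties = {}

    def __init__(self):
        Context._ensure_initialized(self)

    @classmethod
    def _ensure_initialized(cls, instance=None):
        if cls._gateway is None:
            cls._gateway = object()  # placeholder for creating the JVM bridge
        if instance is not None:
            if cls._active_context is not None and cls._active_context is not instance:
                raise ValueError("Cannot run multiple contexts at once")
            cls._active_context = instance

    @classmethod
    def set_system_property(cls, key, value):
        # Works before any instance exists because it initializes on demand.
        cls._ensure_initialized()
        cls._properties[key] = value


Context.set_system_property("spark.example.key", "value")
first = Context()
Context._ensure_initialized(first)  # same instance again: allowed
try:
    Context()
    second_allowed = True
except ValueError:
    second_allowed = False
```

Passing the instance in is what lets the class distinguish a repeated call from an attempt to create a second context.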

56d230e6

Sep 08, 2013
- Whoopsy daisy · a3868544
  Aaron Davidson authored 11 years ago
  
  a3868544
Sep 07, 2013

Export StorageLevel and refactor · c1cc8c4d
Aaron Davidson authored 11 years ago

c1cc8c4d

Remove reflection, hard-code StorageLevels · 8001687a

Aaron Davidson authored 11 years ago

The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise
the shell would have to call a private method of SparkContext. Having
StorageLevel available in sc also doesn't seem like the end of the world.
There may be a better solution, though.

As for creating the StorageLevel object itself, this seems to be the best
way in Python 2 for creating singleton, enum-like objects:
http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
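
One common way to get singleton, enum-like objects without an enum type is to attach pre-built instances to the class itself. The sketch below shows that pattern only; the constructor arguments and the flag values given to each level are illustrative assumptions, not the values hard-coded by this commit.

```python
class StorageLevel:
    """Illustrative enum-like class; flag values below are assumptions."""

    def __init__(self, use_disk, use_memory, deserialized, replication=1):
        self.use_disk = use_disk
        self.use_memory = use_memory
        self.deserialized = deserialized
        self.replication = replication

    def __repr__(self):
        return "StorageLevel(%s, %s, %s, %s)" % (
            self.use_disk, self.use_memory, self.deserialized, self.replication)


# Each level is created once and shared, so identity comparison works.
StorageLevel.DISK_ONLY = StorageLevel(True, False, False)
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, True)
StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, True)

chosen = StorageLevel.MEMORY_ONLY
```

Callers refer to `StorageLevel.MEMORY_ONLY` by name and never construct levels themselves.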

8001687a

Sep 06, 2013
- Memoize StorageLevels read from JVM · b8a0b6ea
  Aaron Davidson authored 11 years ago
  
  b8a0b6ea
- SPARK-660: Add StorageLevel support in Python · a63d4c7d
  Aaron Davidson authored 11 years ago
  
  It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
  a63d4c7d
Sep 01, 2013
- Move some classes to more appropriate packages: · 0a8cc309
  Matei Zaharia authored 12 years ago
  
  * RDD, *RDDFunctions -> org.apache.spark.rdd
  * Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
  * JavaSerializer, KryoSerializer -> org.apache.spark.serializer
  0a8cc309
- Initial work to rename package to org.apache.spark · 46eecd11
  Matei Zaharia authored 12 years ago
  
  46eecd11
Aug 16, 2013

Implementing SPARK-878 for PySpark: adding zip and egg files to context and... · c7e348fa

Andre Schumacher authored 12 years ago

Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path
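
The worker-side half of this, adding an archive to `sys.path` so its modules become importable, relies on standard Python zip imports. The sketch below shows only that mechanism with made-up file and module names; how the archive is shipped to workers is outside its scope.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a small archive containing one module (names are placeholders).
archive_dir = tempfile.mkdtemp()
archive_path = os.path.join(archive_dir, "deps_example.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("shipped_module_example.py", "GREETING = 'hello from zip'\n")

# Putting the archive itself on sys.path makes its modules importable.
sys.path.insert(0, archive_path)
importlib.invalidate_caches()
shipped = importlib.import_module("shipped_module_example")
greeting = shipped.GREETING
```

Egg files are zip archives as well, so the same `sys.path` entry approach applies to them.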

c7e348fa

Jul 29, 2013

SPARK-815. Python parallelize() should split lists before batching · feba7ee5

Matei Zaharia authored 12 years ago

One unfortunate consequence of this fix is that we materialize any
collections that are given to us as generators, but this seems necessary
to get reasonable behavior on small collections. We could add a
batchSize parameter later to bypass auto-computation of batch size if
this becomes a problem (e.g. if users really want to parallelize big
generators nicely)
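
A simplified sketch of "split before batching" follows. It is not the patch itself: the function name and the slicing arithmetic are assumptions chosen to show why the input must be materialized and why splitting first keeps small collections spread across slices.

```python
def split_then_batch(data, num_slices, batch_size):
    """Split items into num_slices chunks, then batch within each chunk."""
    items = list(data)  # materializes generators so the length is known
    slices = []
    for i in range(num_slices):
        start = i * len(items) // num_slices
        end = (i + 1) * len(items) // num_slices
        chunk = items[start:end]
        batches = [chunk[j:j + batch_size] for j in range(0, len(chunk), batch_size)]
        slices.append(batches)
    return slices


# Five items, two slices, and a batch size larger than the whole input.
result = split_then_batch((x for x in range(5)), num_slices=2, batch_size=10)
```

If batching happened first, a batch size of 10 would pack all five items into a single batch and leave the second slice empty; splitting first gives each slice its share.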

feba7ee5

Jul 16, 2013
- Add Apache license headers and LICENSE and NOTICE files · af3c9d50
  Matei Zaharia authored 12 years ago
  
  af3c9d50
Feb 03, 2013
- Fix reporting of PySpark doctest failures. · 2415c18f
  Josh Rosen authored 12 years ago
  
  2415c18f
Feb 01, 2013

Use spark.local.dir for PySpark temp files (SPARK-580). · e211f405
Josh Rosen authored 12 years ago

e211f405

Do not launch JavaGateways on workers (SPARK-674). · 9cc6ff9c

Josh Rosen authored 12 years ago

The problem was that the gateway was being initialized whenever the
pyspark.context module was loaded.  The fix uses lazy initialization
that occurs only when SparkContext instances are actually constructed.

I also made the gateway and jvm variables private.

This change results in ~3-4x performance improvement when running the
PySpark unit tests.

9cc6ff9c

Jan 23, 2013
- Allow PySpark's SparkFiles to be used from driver · ae2ed294
  Josh Rosen authored 12 years ago
  
  Fix minor documentation formatting issues.
  ae2ed294