Commits · 3d3acef0474b6dc21f1b470ea96079a491e58b75 · cs525-sp18-g07 / spark

Mar 06, 2014

SPARK-1187, Added missing Python APIs · 3d3acef0

Prabin Banka authored 11 years ago

The following Python APIs are added:
RDD.id()
SparkContext.setJobGroup()
SparkContext.setLocalProperty()
SparkContext.getLocalProperty()
SparkContext.sparkUser()

This was raised earlier as part of apache/incubator-spark#486.

Author: Prabin Banka <prabin.banka@imaginea.com>

Closes #75 from prabinb/python-api-backup and squashes the following commits:

cc3c6cd [Prabin Banka] Added missing Python APIs

Feb 20, 2014

SPARK-1114: Allow PySpark to use existing JVM and Gateway · 59b13795

Ahir Reddy authored 11 years ago

Patch to allow PySpark to use an existing JVM and Gateway. Changes the PySpark implementation of SparkConf to accept an existing SparkConf JVM handle, and changes the PySpark SparkContext to allow subclass-specific context initialization.

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits:

a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.

Jan 28, 2014

Switch from MUTF8 to UTF8 in PySpark serializers. · 1381fc72

Josh Rosen authored 11 years ago

This fixes SPARK-1043, a bug introduced in 0.9.0
where PySpark couldn't serialize strings > 64kB.

This fix was written by @tyro89 and @bouk in #512.
This commit squashes and rebases their pull request
in order to fix some merge conflicts.
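
For context, a 64 kB ceiling is what one would expect from a 2-byte length prefix, such as the one Java's `DataOutputStream.writeUTF` puts in front of its modified UTF-8 payload. The sketch below, which uses only Python's `struct` module, shows why such a prefix caps the payload at 65535 bytes and how a 4-byte prefix followed by standard UTF-8 bytes avoids the cap. The function names are hypothetical and are not PySpark's serializer API.

```python
import struct


def write_short_framed(s):
    """Frame a string with a 2-byte unsigned length prefix (at most 65535 bytes)."""
    data = s.encode("utf-8")
    # struct raises struct.error when the length does not fit in 2 bytes.
    return struct.pack(">H", len(data)) + data


def write_int_framed(s):
    """Frame a string with a 4-byte signed length prefix, then plain UTF-8 bytes."""
    data = s.encode("utf-8")
    return struct.pack(">i", len(data)) + data


def read_int_framed(buf):
    """Inverse of write_int_framed: read the length, then decode that many bytes."""
    (length,) = struct.unpack(">i", buf[:4])
    return buf[4:4 + length].decode("utf-8")


big = "x" * 70000  # larger than 64 kB once encoded

try:
    write_short_framed(big)
    short_ok = True
except struct.error:
    short_ok = False  # the 2-byte prefix cannot describe this payload

round_trip = read_int_framed(write_int_framed(big))
```

With the 4-byte prefix the 70000-character string round-trips unchanged, while the 2-byte variant fails before writing anything.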

Jan 01, 2014
- Fix Python code after change of getOrElse · 7e8d2e8a
  Matei Zaharia authored 11 years ago
Dec 30, 2013
- Updated docs for SparkConf and handled review comments · 0fa58097
  Matei Zaharia authored 11 years ago
Dec 29, 2013
- Properly show Spark properties on web UI, and change app name property · 994f080f
  Matei Zaharia authored 11 years ago
- Fix some Python docs and make sure to unset SPARK_TESTING in Python · eaa8a68f
  Matei Zaharia authored 11 years ago
  
  tests so we don't get the test spark.conf on the classpath.
- Add Python docs about SparkConf · 58c6fa20
  Matei Zaharia authored 11 years ago
- Fix some other Python tests due to initializing JVM in a different way · 615fb649
  Matei Zaharia authored 11 years ago
  
  The test in context.py created two different instances of the SparkContext class by copying "globals", so that some tests can have a global "sc" object and others can try initializing their own contexts. This led to two JVM gateways being created since SparkConf also looked at pyspark.context.SparkContext to get the JVM.
- Add SparkConf support in Python · cd00225d
  Matei Zaharia authored 11 years ago
Dec 28, 2013
- Fix Python use of getLocalDir · 1c11f54a
  Matei Zaharia authored 11 years ago
Dec 24, 2013
- Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289. · d4dfab50
  Tathagata Das authored 11 years ago
Dec 18, 2013
- Add collectPartition to JavaRDD interface. · af0cd6bd
  Shivaram Venkataraman authored 11 years ago
  
  Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
Nov 10, 2013

FramedSerializer: _dumps => dumps, _loads => loads. · 13122ceb
Josh Rosen authored 11 years ago

Add custom serializer support to PySpark. · cbb7f04a

Josh Rosen authored 11 years ago

For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers.  Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.

This also fixes a bug in SparkContext.union().
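
As a rough illustration of the idea, the sketch below shows a length-framed serializer base class with a `marshal`-backed subclass. The class names echo the ones mentioned in these commits, but the method names `dump_stream` and `load_stream`, the 4-byte length header, and the overall layout are assumptions made for this example, not a copy of PySpark's code.

```python
import io
import marshal
import struct


class FramedSerializer(object):
    """Writes each object as a 4-byte big-endian length followed by its payload."""

    def dump_stream(self, objects, stream):
        for obj in objects:
            payload = self.dumps(obj)
            stream.write(struct.pack(">i", len(payload)))
            stream.write(payload)

    def load_stream(self, stream):
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return  # no complete header left: end of stream
            (length,) = struct.unpack(">i", header)
            yield self.loads(stream.read(length))

    def dumps(self, obj):
        raise NotImplementedError

    def loads(self, payload):
        raise NotImplementedError


class MarshalSerializer(FramedSerializer):
    """Subclasses only choose the byte encoding; framing lives in the base class."""

    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, payload):
        return marshal.loads(payload)


buf = io.BytesIO()
ser = MarshalSerializer()
ser.dump_stream([1, "two", (3, 4.0)], buf)
buf.seek(0)
decoded = list(ser.load_stream(buf))
```

Because framing is separated from encoding, another format could plausibly be supported by overriding only `dumps` and `loads`.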

Nov 03, 2013

Remove Pickle-wrapping of Java objects in PySpark. · 7d68a81a

Josh Rosen authored 11 years ago

If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.

Oct 22, 2013

Pass self to SparkContext._ensure_initialized. · 317a9eb1

Ewen Cheslack-Postava authored 11 years ago

The constructor for SparkContext should pass in self so that we track
the current context and produce errors if another one is created. Add
a doctest to make sure creating multiple contexts triggers the
exception.
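
The tracking behaviour described above can be sketched as follows. This is a simplified stand-in, not the SparkContext source: the class `Context`, the attribute `_active`, the `stop` method, and the error message are hypothetical, and only the method name `_ensure_initialized` comes from the commit title.

```python
import threading


class Context(object):
    _active = None             # the one context currently considered live
    _lock = threading.Lock()

    def __init__(self, name):
        # Passing self lets the class record which instance is active.
        Context._ensure_initialized(self)
        self.name = name

    @classmethod
    def _ensure_initialized(cls, instance=None):
        with cls._lock:
            if instance is not None:
                if cls._active is not None and cls._active is not instance:
                    raise ValueError("Cannot run multiple contexts at once")
                cls._active = instance

    def stop(self):
        with Context._lock:
            if Context._active is self:
                Context._active = None


first = Context("first")
try:
    Context("second")
    second_rejected = False
except ValueError:
    second_rejected = True   # a second live context is refused

first.stop()
third = Context("third")     # allowed once the first context is stopped
```

Calling `_ensure_initialized()` with no instance, or again with the already-active instance, is a no-op in this sketch, so it can be invoked repeatedly.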

Add classmethod to SparkContext to set system properties. · 56d230e6

Ewen Cheslack-Postava authored 11 years ago

Add a new classmethod to SparkContext to set system properties, as is
possible in Scala/Java. Unlike the Java/Scala implementations, there's
no access to System until the JVM bridge is created. Since
SparkContext handles that, move the initialization of the JVM
connection to a separate classmethod that can safely be called
repeatedly as long as the same instance (or no instance) is provided.

Sep 08, 2013
- Whoopsy daisy · a3868544
  Aaron Davidson authored 12 years ago
Sep 07, 2013

Export StorageLevel and refactor · c1cc8c4d
Aaron Davidson authored 12 years ago

Remove reflection, hard-code StorageLevels · 8001687a

Aaron Davidson authored 12 years ago

The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise
the shell would have to call a private method of SparkContext. Having
StorageLevel available in sc also doesn't seem like the end of the world.
There may be a better solution, though.

As for creating the StorageLevel object itself, this seems to be the best
way in Python 2 for creating singleton, enum-like objects:
http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
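
A minimal sketch of that enum-like pattern: a plain class whose named instances are attached as class attributes after the class body, so every lookup returns the same object. The constructor fields and the flag values shown here are illustrative assumptions rather than the values in the commit.

```python
class StorageLevel(object):
    """Enum-like holder: instances are created once and hung off the class."""

    def __init__(self, use_disk, use_memory, deserialized, replication=1):
        self.use_disk = use_disk
        self.use_memory = use_memory
        self.deserialized = deserialized
        self.replication = replication

    def __repr__(self):
        return "StorageLevel(%s, %s, %s, %d)" % (
            self.use_disk, self.use_memory, self.deserialized, self.replication)


# Hard-coded singletons; the class must exist before instances can be attached.
StorageLevel.DISK_ONLY = StorageLevel(True, False, False)
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, True)
StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, True, 2)

level = StorageLevel.MEMORY_ONLY_2
```

This works on Python 2 as well as Python 3, since it does not rely on the later `enum` module.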

Sep 06, 2013
- Memoize StorageLevels read from JVM · b8a0b6ea
  Aaron Davidson authored 12 years ago
- SPARK-660: Add StorageLevel support in Python · a63d4c7d
  Aaron Davidson authored 12 years ago
  
  It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
Sep 01, 2013
- Move some classes to more appropriate packages: · 0a8cc309
  Matei Zaharia authored 12 years ago
  
  * RDD, *RDDFunctions -> org.apache.spark.rdd
  * Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
  * JavaSerializer, KryoSerializer -> org.apache.spark.serializer
- Initial work to rename package to org.apache.spark · 46eecd11
  Matei Zaharia authored 12 years ago
Aug 16, 2013

Implementing SPARK-878 for PySpark: adding zip and egg files to context and... · c7e348fa

Andre Schumacher authored 12 years ago

Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path
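
The worker-side half of this relies on a standard Python feature: a zip archive placed on `sys.path` is importable like a directory (zipped eggs use the same archive format). The sketch below demonstrates that mechanism with the standard library only. The helper `add_archive_to_path`, the archive name, and the module `shipped_helper` are hypothetical, and no Spark API is involved.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def add_archive_to_path(archive_path):
    """Make modules inside a zip (or zipped egg) importable in this process."""
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)


# Build a small archive containing one module, standing in for a shipped dependency.
tmp_dir = tempfile.mkdtemp()
archive = os.path.join(tmp_dir, "deps.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("shipped_helper.py", "def double(x):\n    return 2 * x\n")

add_archive_to_path(archive)
importlib.invalidate_caches()  # make sure the import system notices the new entry

shipped_helper = importlib.import_module("shipped_helper")
result = shipped_helper.double(21)
```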

Jul 29, 2013

SPARK-815. Python parallelize() should split lists before batching · feba7ee5

Matei Zaharia authored 12 years ago

One unfortunate consequence of this fix is that we materialize any
collections that are given to us as generators, but this seems necessary
to get reasonable behavior on small collections. We could add a
batchSize parameter later to bypass auto-computation of batch size if
this becomes a problem (e.g. if users really want to parallelize big
generators nicely).
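
The ordering can be illustrated with a small sketch: materialize the input, cut it into slices first, and only then group each slice into batches. The function `split_then_batch`, its parameters, and the even-split arithmetic are assumptions for illustration, not the code from this commit.

```python
def split_then_batch(collection, num_slices, batch_size):
    """Split into num_slices contiguous slices, then batch within each slice."""
    items = list(collection)  # generators are materialized here
    n = len(items)
    slices = []
    for i in range(num_slices):
        # Integer arithmetic spreads any remainder across the slices.
        start = i * n // num_slices
        end = (i + 1) * n // num_slices
        part = items[start:end]
        batches = [part[j:j + batch_size] for j in range(0, len(part), batch_size)]
        slices.append(batches)
    return slices


result = split_then_batch((x for x in range(10)), num_slices=3, batch_size=2)
small = split_then_batch([0, 1], num_slices=2, batch_size=10)
```

In the `small` case each slice still receives one element. Batching first would have produced a single batch of two elements, leaving one of the two slices empty.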

Jul 16, 2013
- Add Apache license headers and LICENSE and NOTICE files · af3c9d50
  Matei Zaharia authored 12 years ago
Feb 03, 2013
- Fix reporting of PySpark doctest failures. · 2415c18f
  Josh Rosen authored 12 years ago
Feb 01, 2013

Use spark.local.dir for PySpark temp files (SPARK-580). · e211f405
Josh Rosen authored 12 years ago

Do not launch JavaGateways on workers (SPARK-674). · 9cc6ff9c

Josh Rosen authored 12 years ago

The problem was that the gateway was being initialized whenever the
pyspark.context module was loaded.  The fix uses lazy initialization
that occurs only when SparkContext instances are actually constructed.

I also made the gateway and jvm variables private.

This change results in ~3-4x performance improvement when running the
PySpark unit tests.
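
The difference between import-time and lazy initialization can be sketched as below. `_launch_gateway` is a hypothetical stand-in for starting the JVM gateway, and `LazyContext` is an illustrative class, not the SparkContext implementation.

```python
import threading

launch_count = 0


def _launch_gateway():
    """Hypothetical stand-in for the expensive gateway start-up."""
    global launch_count
    launch_count += 1
    return object()


class LazyContext(object):
    _gateway = None            # private, shared by all instances
    _lock = threading.Lock()

    def __init__(self):
        # The gateway is created on first construction, not when the module loads.
        with LazyContext._lock:
            if LazyContext._gateway is None:
                LazyContext._gateway = _launch_gateway()
        self.gateway = LazyContext._gateway


count_after_definition = launch_count  # nothing has been launched yet
a = LazyContext()
b = LazyContext()                      # reuses the gateway created for a
```

A process that merely imports the module, as a worker would, never pays the launch cost in this arrangement.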

Jan 23, 2013
- Allow PySpark's SparkFiles to be used from driver · ae2ed294
  Josh Rosen authored 12 years ago
  
  Fix minor documentation formatting issues.
Jan 22, 2013
- Fix sys.path bug in PySpark SparkContext.addPyFile · 35168d9c
  Josh Rosen authored 12 years ago
- Make AccumulatorParam an abstract base class. · c75ae362
  Josh Rosen authored 12 years ago
Jan 21, 2013

Don't download files to master's working directory. · ef711902

Josh Rosen authored 12 years ago

This should avoid exceptions caused by existing
files with different contents.

I also removed some unused code.

Jan 20, 2013
- Update checkpointing API docs in Python/Java. · 5b6ea9e9
  Josh Rosen authored 12 years ago
- Add checkpointFile() and more tests to PySpark. · d0ba80dc
  Josh Rosen authored 12 years ago
- Add RDD checkpointing to Python API. · 7ed1bf4b
  Josh Rosen authored 12 years ago
- Added accumulators to PySpark · 8e7f098a
  Matei Zaharia authored 12 years ago
Jan 10, 2013
- Change PYSPARK_PYTHON_EXEC to PYSPARK_PYTHON. · 49c74ba2
  Josh Rosen authored 12 years ago