- Jan 01, 2014

Matei Zaharia authored
- Dec 30, 2013

Matei Zaharia authored
- Dec 29, 2013

Matei Zaharia authored

Matei Zaharia authored
tests so we don't get the test spark.conf on the classpath.

Matei Zaharia authored

Matei Zaharia authored
The test in context.py created two different instances of the SparkContext class by copying "globals", so that some tests could have a global "sc" object while others tried initializing their own contexts. This led to two JVM gateways being created, since SparkConf also looked at pyspark.context.SparkContext to get the JVM.

Matei Zaharia authored
- Dec 28, 2013

Matei Zaharia authored
- Dec 24, 2013

Tathagata Das authored
- Dec 18, 2013

Shivaram Venkataraman authored
Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
- Nov 10, 2013

Josh Rosen authored

Josh Rosen authored
For now, this only adds MarshalSerializer, but it lays the groundwork for supporting other custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().
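
To illustrate the resulting API, here is a minimal sketch of opting into the marshal-based serializer (the `local` master and app name are placeholders):

```python
from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# marshal is faster than pickle but supports fewer Python types
sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())
print(sc.parallelize(range(10)).map(lambda x: x * 2).collect())
```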
- Nov 03, 2013

Josh Rosen authored
If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.
- Oct 22, 2013

Ewen Cheslack-Postava authored
The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.
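
A minimal sketch of the enforced behavior (the master and app names are placeholders, and the exact error message may differ):

```python
from pyspark import SparkContext

sc = SparkContext("local", "first app")
try:
    # Constructing a second context now fails fast instead of
    # silently creating another JVM gateway
    sc2 = SparkContext("local", "second app")
except ValueError as e:
    print(e)  # e.g. "Cannot run multiple SparkContexts at once"
```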

Ewen Cheslack-Postava authored
Add a new classmethod to SparkContext to set system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there is no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly, as long as the same instance (or no instance) is provided.
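
The classmethod can then be called before any context exists; a short sketch (the property value is a placeholder):

```python
from pyspark import SparkContext

# Must run before the SparkContext is created, since the property
# is read when the JVM-side context starts up
SparkContext.setSystemProperty("spark.executor.memory", "2g")

sc = SparkContext("local", "props demo")
```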
- Sep 08, 2013

Aaron Davidson authored
- Sep 07, 2013

Aaron Davidson authored

Aaron Davidson authored
The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
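
A minimal sketch of that enum-like singleton pattern (field names mirror PySpark's StorageLevel, but this is illustrative rather than the exact implementation):

```python
class StorageLevel(object):
    """Flags controlling how an RDD should be cached."""

    def __init__(self, useDisk, useMemory, deserialized, replication=1):
        self.useDisk = useDisk
        self.useMemory = useMemory
        self.deserialized = deserialized
        self.replication = replication

# Singleton, enum-like instances hung off the class itself
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, True)
StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, True)
StorageLevel.DISK_ONLY = StorageLevel(True, False, False)
```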
- Sep 06, 2013

Aaron Davidson authored

Aaron Davidson authored
It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
- Sep 01, 2013

Matei Zaharia authored
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer

Matei Zaharia authored
- Aug 16, 2013

Andre Schumacher authored
Implementing SPARK-878 for PySpark: adding zip and egg files to the context and passing them down to workers, which add these to their sys.path.
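
From the user's side, the feature looks roughly like this (the archive names are placeholders):

```python
from pyspark import SparkContext

# Ship dependency archives at startup; workers add them to sys.path
sc = SparkContext("local", "deps demo", pyFiles=["mylib.zip", "helpers.egg"])

# Dependencies can also be added after the context is created
sc.addPyFile("extra.zip")
```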
- Jul 29, 2013

Matei Zaharia authored
One unfortunate consequence of this fix is that we materialize any collections that are given to us as generators, but this seems necessary to get reasonable behavior on small collections. We could add a batchSize parameter later to bypass auto-computation of the batch size if this becomes a problem (e.g. if users really want to parallelize big generators nicely).
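
After the fix, parallelizing a generator behaves like parallelizing a list; a small sketch:

```python
from pyspark import SparkContext

sc = SparkContext("local", "generator demo")

# The generator is materialized up front so a batch size can be computed
squares = (x * x for x in range(100))
rdd = sc.parallelize(squares, 4)
print(rdd.take(5))  # [0, 1, 4, 9, 16]
```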
- Jul 16, 2013

Matei Zaharia authored
- Feb 03, 2013

Josh Rosen authored
- Feb 01, 2013

Josh Rosen authored

Josh Rosen authored
The problem was that the gateway was being initialized whenever the pyspark.context module was loaded. The fix uses lazy initialization that occurs only when SparkContext instances are actually constructed. I also made the gateway and jvm variables private. This change results in ~3-4x performance improvement when running the PySpark unit tests.
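
A self-contained sketch of the lazy-initialization pattern (the names are illustrative, not PySpark's actual internals):

```python
class Gateway(object):
    """Stand-in for the expensive Py4J gateway launch."""
    def __init__(self):
        print("launching JVM gateway...")

class SparkContext(object):
    _gateway = None  # private, shared across instances, created lazily

    def __init__(self):
        # The gateway is launched on first construction,
        # not at module import time
        if SparkContext._gateway is None:
            SparkContext._gateway = Gateway()

SparkContext()  # prints "launching JVM gateway..."
SparkContext()  # reuses the existing gateway; prints nothing
```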
- Jan 23, 2013

Josh Rosen authored
Fix minor documentation formatting issues.
- Jan 22, 2013

Josh Rosen authored

Josh Rosen authored
- Jan 21, 2013

Josh Rosen authored
This should avoid exceptions caused by existing files with different contents. I also removed some unused code.
- Jan 20, 2013

Josh Rosen authored

Josh Rosen authored

Josh Rosen authored

Matei Zaharia authored
- Jan 10, 2013

Josh Rosen authored
- Jan 03, 2013

Josh Rosen authored
- Jan 01, 2013

Josh Rosen authored
- Dec 29, 2012

Josh Rosen authored