- Oct 22, 2013
-
-
Ewen Cheslack-Postava authored
Have the SparkContext constructor register self so that we track the current context and raise an error if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.
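A minimal sketch of the guard this describes, assuming a class-level attribute (the name below is illustrative, not PySpark's exact code) that records the active context:

    # Illustrative sketch of tracking the single active context.
    class SparkContext(object):
        _active_spark_context = None  # assumed name for the class-level tracker

        def __init__(self, *args, **kwargs):
            if SparkContext._active_spark_context is not None:
                raise ValueError("Cannot run multiple SparkContexts at once")
            SparkContext._active_spark_context = self
            # ... the rest of the normal initialization would follow ...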
-
Ewen Cheslack-Postava authored
Add a new classmethod to SparkContext for setting system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there is no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection into a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.
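A usage sketch, assuming the classmethod is exposed as SparkContext.setSystemProperty; the property name below is just an example:

    from pyspark import SparkContext

    # Must be called before the SparkContext is constructed, since the value is
    # pushed through the JVM bridge that SparkContext initializes.
    SparkContext.setSystemProperty("spark.executor.memory", "2g")
    sc = SparkContext("local", "example")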
-
- Oct 19, 2013
-
-
Ewen Cheslack-Postava authored
Add a regular add() method for adding a term to accumulators in PySpark. Currently, if you have a non-global accumulator, adding to it is awkward: the += operator can't be used for non-global accumulators captured via closure because it involves an assignment, so the only way to update them is to call __iadd__ directly. Adding this method lets you write code like this:

    def main():
        sc = SparkContext()
        accum = sc.accumulator(0)
        rdd = sc.parallelize([1, 2, 3])

        def f(x):
            accum.add(x)

        rdd.foreach(f)
        print accum.value

Using accum += x here instead would have caused UnboundLocalError exceptions in the workers; without add(), the update would have to be written as accum.__iadd__(x).
-
- Oct 09, 2013
-
-
Matei Zaharia authored
-
- Oct 07, 2013
-
-
Andre Schumacher authored
-
- Oct 04, 2013
-
-
Andre Schumacher authored
Currently, PythonPartitioner determines the partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required, e.g., for sorting via PySpark.
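A small sketch of the sorting path this enables, assuming the usual sortByKey API on pair RDDs:

    from pyspark import SparkContext

    sc = SparkContext("local", "sort-example")
    pairs = sc.parallelize([("b", 2), ("c", 3), ("a", 1)])
    # Range partitioning assigns each key to a specific partition ID; that ID
    # must be honored on the JVM side for the collected output to be ordered.
    print(pairs.sortByKey().collect())  # [('a', 1), ('b', 2), ('c', 3)]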
-
- Sep 24, 2013
-
-
Patrick Wendell authored
-
- Sep 08, 2013
-
-
Aaron Davidson authored
-
- Sep 07, 2013
-
-
Aaron Davidson authored
-
Aaron Davidson authored
The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
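A minimal sketch of the enum-like singleton pattern referenced in the link above; the constructor fields are illustrative, not necessarily PySpark's exact StorageLevel definition:

    class StorageLevel(object):
        def __init__(self, use_disk, use_memory, deserialized, replication=1):
            self.use_disk = use_disk
            self.use_memory = use_memory
            self.deserialized = deserialized
            self.replication = replication

    # Named instances attached as class attributes act like enum members.
    StorageLevel.DISK_ONLY = StorageLevel(True, False, False)
    StorageLevel.MEMORY_ONLY = StorageLevel(False, True, True)
    StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, True)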
-
- Sep 06, 2013
-
-
Aaron Davidson authored
-
Aaron Davidson authored
It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
-
- Sep 02, 2013
-
-
Matei Zaharia authored
-
Matei Zaharia authored
-
- Sep 01, 2013
-
-
Matei Zaharia authored
-
Matei Zaharia authored
-
Matei Zaharia authored
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer
-
Matei Zaharia authored
-
Matei Zaharia authored
-
- Aug 29, 2013
-
-
Matei Zaharia authored
-
Matei Zaharia authored
This commit makes Spark invocation saner by using an assembly JAR to find all of Spark's dependencies instead of adding all the JARs in lib_managed. It also packages the examples into an assembly and uses that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script with two better-named scripts: "run-examples" for examples, and "spark-class" for Spark internal classes (e.g. REPL, master, etc). This is also designed to minimize the confusion people have in trying to use "run" to run their own classes; it's not meant to do that, but now at least if they look at it, they can modify run-examples to do a decent job for them. As part of this, Bagel's examples are also now properly moved to the examples package instead of bagel.
-
- Aug 28, 2013
-
-
Andre Schumacher authored
-
Josh Rosen authored
This addresses SPARK-885, a usability issue where PySpark's Java gateway process would be killed if the user hit ctrl-c. Note that SIGINT still won't cancel the running job. This fix is based on http://stackoverflow.com/questions/5045771
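A general sketch of the technique from the linked Stack Overflow answer (not PySpark's exact code): start the child process so that it ignores SIGINT, so a Ctrl-C in the interactive shell does not kill it:

    import signal
    import subprocess

    def ignore_sigint():
        # Runs in the child between fork() and exec(); SIG_IGN dispositions are
        # preserved across exec, so keyboard interrupts won't reach the child.
        signal.signal(signal.SIGINT, signal.SIG_IGN)

    # "sleep 60" stands in for the Java gateway process here.
    child = subprocess.Popen(["sleep", "60"], preexec_fn=ignore_sigint)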
-
Andre Schumacher authored
-
- Aug 21, 2013
-
-
Andre Schumacher authored
-
- Aug 16, 2013
-
-
Andre Schumacher authored
Implementing SPARK-878 for PySpark: adding zip and egg files to the context and passing them down to the workers, which add them to their sys.path.
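A usage sketch of shipping dependencies this way; the file names below are placeholders:

    from pyspark import SparkContext

    sc = SparkContext("local", "deps-example", pyFiles=["mylib.zip"])
    sc.addPyFile("extra_helpers.egg")  # can also be added after creation
    # Workers append the shipped archives to their sys.path, so tasks can
    # import modules packaged inside them.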
-
- Aug 14, 2013
-
-
Josh Rosen authored
-
- Aug 12, 2013
-
-
Andre Schumacher authored
Now ADD_FILES uses a comma as the file-name separator.
-
- Aug 11, 2013
-
-
stayhf authored
-
- Aug 10, 2013
-
-
stayhf authored
-
- Aug 01, 2013
-
-
Matei Zaharia authored
-
- Jul 30, 2013
-
-
Josh Rosen authored
This fixes SPARK-832, an issue where PySpark would not work when the master and workers used different SPARK_HOME paths. This change may potentially break code that relied on the master's PYTHONPATH being used on workers. To have custom PYTHONPATH additions used on the workers, users should set a custom PYTHONPATH in spark-env.sh rather than setting it in the shell.
-
- Jul 29, 2013
-
-
Matei Zaharia authored
Batch input records for more efficient NumPy computations.
-
Matei Zaharia authored
One unfortunate consequence of this fix is that we materialize any collections that are given to us as generators, but this seems necessary to get reasonable behavior on small collections. We could add a batchSize parameter later to bypass auto-computation of batch size if this becomes a problem (e.g. if users really want to parallelize big generators nicely).
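A small sketch of the behavior described above, assuming parallelize() accepts any iterable: a generator argument is materialized so its length can drive the batch-size computation:

    from pyspark import SparkContext

    sc = SparkContext("local", "batch-example")
    gen = (x * x for x in range(1000))
    # The generator is consumed into a list internally so that a reasonable
    # batch size can be computed before the records are serialized.
    rdd = sc.parallelize(gen)
    print(rdd.count())  # 1000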
-
Matei Zaharia authored
-
Matei Zaharia authored
-
Matei Zaharia authored
-
Matei Zaharia authored
-
- Jul 27, 2013
-
-
Matei Zaharia authored
-
- Jul 16, 2013
-
-
Matei Zaharia authored
-