- Dec 20, 2013
Tor Myklebust authored
Tor Myklebust authored
Tor Myklebust authored
Tor Myklebust authored
- Dec 19, 2013
Tor Myklebust authored
Tor Myklebust authored
Tor Myklebust authored
Incorporate most of Josh's style suggestions. I don't want to deal with the type and length checking errors until we've got at least one working stub that we're all happy with.
Tor Myklebust authored
Tor Myklebust authored
- Dec 09, 2013
Patrick Wendell authored
- Nov 29, 2013
Josh Rosen authored
Fixes SPARK-970.
- Nov 26, 2013
Josh Rosen authored
- Nov 10, 2013
Josh Rosen authored
Josh Rosen authored
Josh Rosen authored
For now, this only adds MarshalSerializer, but it lays the groundwork for supporting other custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().
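A short usage sketch of the new hook, assuming the SparkContext constructor accepts a serializer argument as described above; the master and app name are placeholders:

    from pyspark import SparkContext
    from pyspark.serializers import MarshalSerializer

    # marshal is faster than pickle but handles fewer Python types,
    # so it only suits RDDs of simple values.
    sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())
    print(sc.parallelize(range(10)).map(lambda x: x * 2).collect())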
- Nov 03, 2013
Josh Rosen authored
If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.
Josh Rosen authored
Write the length of the accumulators section up-front rather than terminating it with a negative length. I find this easier to read.
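An illustrative sketch of the framing choice, not the actual worker protocol code; the function names and byte order are assumptions. Writing the count up-front lets the reader loop a known number of times instead of watching for a negative-length sentinel:

    import struct

    def write_accumulator_updates(stream, updates):
        # Length up-front: the reader knows exactly how many entries follow.
        stream.write(struct.pack(">i", len(updates)))
        for blob in updates:
            stream.write(struct.pack(">i", len(blob)))
            stream.write(blob)

    def read_accumulator_updates(stream):
        (count,) = struct.unpack(">i", stream.read(4))
        return [stream.read(struct.unpack(">i", stream.read(4))[0])
                for _ in range(count)]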
- Oct 22, 2013
Ewen Cheslack-Postava authored
The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.
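A simplified sketch of the guard being described; the attribute name and error message are assumptions rather than the exact PySpark code. The doctest mentioned above would then assert that constructing a second context raises this error:

    class SparkContext(object):
        _active_context = None  # the one context allowed per process

        def __init__(self):
            if SparkContext._active_context is not None:
                raise ValueError("Cannot run multiple SparkContexts at once")
            SparkContext._active_context = self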
Ewen Cheslack-Postava authored
Add a new classmethod to SparkContext to set system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.
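A brief usage sketch; the property name, master, and app name are examples only:

    from pyspark import SparkContext

    # The classmethod brings up the Py4J gateway on demand, so it can be
    # called before any SparkContext exists.
    SparkContext.setSystemProperty("spark.executor.memory", "2g")
    sc = SparkContext("local", "sysprops-demo")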
- Oct 19, 2013
Ewen Cheslack-Postava authored
Add a regular method for adding a term to accumulators in pyspark. Currently, if you have a non-global accumulator, adding to it is awkward. The += operator can't be used for non-global accumulators captured via closure because it involves an assignment; the only way to do it is to call __iadd__ directly. Adding this method lets you write code like this:

    def main():
        sc = SparkContext()
        accum = sc.accumulator(0)
        rdd = sc.parallelize([1, 2, 3])

        def f(x):
            accum.add(x)

        rdd.foreach(f)
        print accum.value

where using accum += x instead would have caused UnboundLocalError exceptions in workers. Currently it would have to be written as accum.__iadd__(x).
- Oct 09, 2013
Matei Zaharia authored
- Oct 07, 2013
Andre Schumacher authored
- Oct 04, 2013
Andre Schumacher authored
Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.
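For illustration, a sort that depends on this behavior; this assumes a local SparkContext and is not part of the change itself:

    from pyspark import SparkContext

    sc = SparkContext("local", "sort-demo")
    pairs = sc.parallelize([(3, "c"), (1, "a"), (2, "b")])
    # Range partitioning only works if the partitioner sees real partition
    # IDs rather than hashes of the pickled keys.
    print(pairs.sortByKey().collect())  # [(1, 'a'), (2, 'b'), (3, 'c')]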
- Sep 24, 2013
Patrick Wendell authored
- Sep 08, 2013
Aaron Davidson authored
- Sep 07, 2013
Aaron Davidson authored
Aaron Davidson authored
The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
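A minimal sketch of the enum-like pattern referenced above; the flag names and values are illustrative assumptions, not the exact PySpark definitions:

    class StorageLevel(object):
        """Describes how an RDD is cached: disk, memory, and replication."""
        def __init__(self, use_disk, use_memory, deserialized, replication=1):
            self.use_disk = use_disk
            self.use_memory = use_memory
            self.deserialized = deserialized
            self.replication = replication

    # Module-level singletons give enum-like behavior in Python 2.
    StorageLevel.DISK_ONLY = StorageLevel(True, False, False)
    StorageLevel.MEMORY_ONLY = StorageLevel(False, True, True)
    StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, True)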
- Sep 06, 2013
Aaron Davidson authored
Aaron Davidson authored
It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
- Sep 02, 2013
Matei Zaharia authored
Matei Zaharia authored
- Sep 01, 2013
Matei Zaharia authored
Matei Zaharia authored
Matei Zaharia authored
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer
Matei Zaharia authored
Matei Zaharia authored
- Aug 30, 2013
Andre Schumacher authored
- Aug 29, 2013
Matei Zaharia authored
Matei Zaharia authored
This commit makes Spark invocation saner by using an assembly JAR to find all of Spark's dependencies instead of adding all the JARs in lib_managed. It also packages the examples into an assembly and uses that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script with two better-named scripts: "run-examples" for examples, and "spark-class" for Spark internal classes (e.g. REPL, master, etc). This is also designed to minimize the confusion people have in trying to use "run" to run their own classes; it's not meant to do that, but now at least if they look at it, they can modify run-examples to do a decent job for them. As part of this, Bagel's examples are also now properly moved to the examples package instead of bagel.
- Aug 28, 2013
Andre Schumacher authored