- Dec 20, 2013
Tor Myklebust authored
Tor Myklebust authored
Tor Myklebust authored
Tor Myklebust authored
- Dec 19, 2013
Tor Myklebust authored
Tor Myklebust authored
Tor Myklebust authored
Incorporate most of Josh's style suggestions. I don't want to deal with the type and length checking errors until we've got at least one working stub that we're all happy with.
Tor Myklebust authored
Tor Myklebust authored
- Dec 09, 2013
Patrick Wendell authored
- Nov 29, 2013
Josh Rosen authored
Fixes SPARK-970.
- Nov 26, 2013
Josh Rosen authored
- Nov 10, 2013
Josh Rosen authored
Josh Rosen authored
Josh Rosen authored
For now, this only adds MarshalSerializer, but it lays the groundwork for supporting other custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().
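A short usage sketch of the new hook, assuming the SparkContext constructor accepts a serializer argument as described above; the master and app name are placeholders:

    from pyspark import SparkContext
    from pyspark.serializers import MarshalSerializer

    # marshal is faster than pickle but handles fewer Python types,
    # so it only suits RDDs of simple values.
    sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())
    print(sc.parallelize(range(10)).map(lambda x: x * 2).collect())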
- Nov 03, 2013
Josh Rosen authored
If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.
Josh Rosen authored
Write the length of the accumulators section up-front rather than terminating it with a negative length. I find this easier to read.
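An illustrative sketch of the framing choice, not the actual worker protocol code; the function names and byte order are assumptions. Writing the count up-front lets the reader loop a known number of times instead of watching for a negative-length sentinel:

    import struct

    def write_accumulator_updates(stream, updates):
        # Length up-front: the reader knows exactly how many entries follow.
        stream.write(struct.pack(">i", len(updates)))
        for blob in updates:
            stream.write(struct.pack(">i", len(blob)))
            stream.write(blob)

    def read_accumulator_updates(stream):
        (count,) = struct.unpack(">i", stream.read(4))
        return [stream.read(struct.unpack(">i", stream.read(4))[0])
                for _ in range(count)]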
- Oct 22, 2013
Ewen Cheslack-Postava authored
The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.
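A simplified sketch of the guard being described; the attribute name and error message are assumptions rather than the exact PySpark code. The doctest mentioned above would then assert that constructing a second context raises this error:

    class SparkContext(object):
        _active_context = None  # the one context allowed per process

        def __init__(self):
            if SparkContext._active_context is not None:
                raise ValueError("Cannot run multiple SparkContexts at once")
            SparkContext._active_context = self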
Ewen Cheslack-Postava authored
Add a new classmethod to SparkContext to set system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.
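A brief usage sketch; the property name, master, and app name are examples only:

    from pyspark import SparkContext

    # The classmethod brings up the Py4J gateway on demand, so it can be
    # called before any SparkContext exists.
    SparkContext.setSystemProperty("spark.executor.memory", "2g")
    sc = SparkContext("local", "sysprops-demo")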
- Oct 19, 2013
Ewen Cheslack-Postava authored
Add a regular method for adding a term to accumulators in pyspark. Currently, if you have a non-global accumulator, adding to it is awkward. The += operator can't be used for non-global accumulators captured via closure because it involves an assignment; the only way to do it is to call __iadd__ directly. Adding this method lets you write code like this:

    def main():
        sc = SparkContext()
        accum = sc.accumulator(0)
        rdd = sc.parallelize([1, 2, 3])

        def f(x):
            accum.add(x)

        rdd.foreach(f)
        print accum.value

where using accum += x instead would have caused UnboundLocalError exceptions in workers. Currently it would have to be written as accum.__iadd__(x).
- Oct 09, 2013
Matei Zaharia authored
- Oct 07, 2013
Andre Schumacher authored
- Oct 04, 2013
Andre Schumacher authored
Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.
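For illustration, a sort that depends on this behavior; this assumes a local SparkContext and is not part of the change itself:

    from pyspark import SparkContext

    sc = SparkContext("local", "sort-demo")
    pairs = sc.parallelize([(3, "c"), (1, "a"), (2, "b")])
    # Range partitioning only works if the partitioner sees real partition
    # IDs rather than hashes of the pickled keys.
    print(pairs.sortByKey().collect())  # [(1, 'a'), (2, 'b'), (3, 'c')]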
- Sep 24, 2013
Patrick Wendell authored
- Sep 08, 2013
Aaron Davidson authored
- Sep 07, 2013
Aaron Davidson authored
Aaron Davidson authored
The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
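A minimal sketch of the enum-like pattern referenced above; the flag names and values are illustrative assumptions, not the exact PySpark definitions:

    class StorageLevel(object):
        """Describes how an RDD is cached: disk, memory, and replication."""
        def __init__(self, use_disk, use_memory, deserialized, replication=1):
            self.use_disk = use_disk
            self.use_memory = use_memory
            self.deserialized = deserialized
            self.replication = replication

    # Module-level singletons give enum-like behavior in Python 2.
    StorageLevel.DISK_ONLY = StorageLevel(True, False, False)
    StorageLevel.MEMORY_ONLY = StorageLevel(False, True, True)
    StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, True)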
- Sep 06, 2013
Aaron Davidson authored
Aaron Davidson authored
It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
- Sep 02, 2013
Matei Zaharia authored
Matei Zaharia authored
- Sep 01, 2013
Matei Zaharia authored
Matei Zaharia authored
Matei Zaharia authored
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer
Matei Zaharia authored
Matei Zaharia authored
- Aug 30, 2013
Andre Schumacher authored
- Aug 29, 2013
Matei Zaharia authored
Matei Zaharia authored
This commit makes Spark invocation saner by using an assembly JAR to find all of Spark's dependencies instead of adding all the JARs in lib_managed. It also packages the examples into an assembly and uses that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script with two better-named scripts: "run-examples" for examples, and "spark-class" for Spark internal classes (e.g. REPL, master, etc). This is also designed to minimize the confusion people have in trying to use "run" to run their own classes; it's not meant to do that, but now at least if they look at it, they can modify run-examples to do a decent job for them. As part of this, Bagel's examples are also now properly moved to the examples package instead of bagel.
- Aug 28, 2013
Andre Schumacher authored