- Oct 22, 2013
-
-
Ewen Cheslack-Postava authored
Have the SparkContext constructor register self so that we track the current context and raise an error if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.
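A minimal sketch of the guard this describes, assuming a class-level attribute (the name below is illustrative, not PySpark's exact code) that records the active context:

    # Illustrative sketch of tracking the single active context.
    class SparkContext(object):
        _active_spark_context = None  # assumed name for the class-level tracker

        def __init__(self, *args, **kwargs):
            if SparkContext._active_spark_context is not None:
                raise ValueError("Cannot run multiple SparkContexts at once")
            SparkContext._active_spark_context = self
            # ... the rest of the normal initialization would follow ...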
-
Ewen Cheslack-Postava authored
Add a new classmethod to SparkContext for setting system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there is no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection into a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.
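A usage sketch, assuming the classmethod is exposed as SparkContext.setSystemProperty; the property name below is just an example:

    from pyspark import SparkContext

    # Must be called before the SparkContext is constructed, since the value is
    # pushed through the JVM bridge that SparkContext initializes.
    SparkContext.setSystemProperty("spark.executor.memory", "2g")
    sc = SparkContext("local", "example")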
-
- Oct 19, 2013
-
-
Ewen Cheslack-Postava authored
Add a regular add() method for adding a term to accumulators in PySpark. Currently, if you have a non-global accumulator, adding to it is awkward: the += operator can't be used for non-global accumulators captured via closure because it involves an assignment, so the only way to update them is to call __iadd__ directly. Adding this method lets you write code like this:

    def main():
        sc = SparkContext()
        accum = sc.accumulator(0)
        rdd = sc.parallelize([1, 2, 3])

        def f(x):
            accum.add(x)

        rdd.foreach(f)
        print accum.value

Using accum += x here instead would have caused UnboundLocalError exceptions in the workers; without add(), the update would have to be written as accum.__iadd__(x).
-
- Oct 09, 2013
-
-
Matei Zaharia authored
-
- Oct 07, 2013
-
-
Andre Schumacher authored
-
- Oct 04, 2013
-
-
Andre Schumacher authored
Currently, PythonPartitioner determines the partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required, e.g., for sorting via PySpark.
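A small sketch of the sorting path this enables, assuming the usual sortByKey API on pair RDDs:

    from pyspark import SparkContext

    sc = SparkContext("local", "sort-example")
    pairs = sc.parallelize([("b", 2), ("c", 3), ("a", 1)])
    # Range partitioning assigns each key to a specific partition ID; that ID
    # must be honored on the JVM side for the collected output to be ordered.
    print(pairs.sortByKey().collect())  # [('a', 1), ('b', 2), ('c', 3)]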
-
- Sep 24, 2013
-
-
Patrick Wendell authored
-
- Sep 08, 2013
-
-
Aaron Davidson authored
-
- Sep 07, 2013
-
-
Aaron Davidson authored
-
Aaron Davidson authored
The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
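A minimal sketch of the enum-like singleton pattern referenced in the link above; the constructor fields are illustrative, not necessarily PySpark's exact StorageLevel definition:

    class StorageLevel(object):
        def __init__(self, use_disk, use_memory, deserialized, replication=1):
            self.use_disk = use_disk
            self.use_memory = use_memory
            self.deserialized = deserialized
            self.replication = replication

    # Named instances attached as class attributes act like enum members.
    StorageLevel.DISK_ONLY = StorageLevel(True, False, False)
    StorageLevel.MEMORY_ONLY = StorageLevel(False, True, True)
    StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, True)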
-
- Sep 06, 2013
-
-
Aaron Davidson authored
-
Aaron Davidson authored
It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
-
- Sep 02, 2013
-
-
Matei Zaharia authored
-
Matei Zaharia authored
-
- Sep 01, 2013
-
-
Matei Zaharia authored
-
Matei Zaharia authored
-
Matei Zaharia authored
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer
-
Matei Zaharia authored
-
Matei Zaharia authored
-
- Aug 29, 2013
-
-
Matei Zaharia authored
-
Matei Zaharia authored
This commit makes Spark invocation saner by using an assembly JAR to find all of Spark's dependencies instead of adding all the JARs in lib_managed. It also packages the examples into an assembly and uses that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script with two better-named scripts: "run-examples" for examples, and "spark-class" for Spark internal classes (e.g. REPL, master, etc). This is also designed to minimize the confusion people have in trying to use "run" to run their own classes; it's not meant to do that, but now at least if they look at it, they can modify run-examples to do a decent job for them. As part of this, Bagel's examples are also now properly moved to the examples package instead of bagel.
-
- Aug 28, 2013
-
-
Andre Schumacher authored
-
Josh Rosen authored
This addresses SPARK-885, a usability issue where PySpark's Java gateway process would be killed if the user hit ctrl-c. Note that SIGINT still won't cancel the running job. This fix is based on http://stackoverflow.com/questions/5045771
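A general sketch of the technique from the linked Stack Overflow answer (not PySpark's exact code): start the child process so that it ignores SIGINT, so a Ctrl-C in the interactive shell does not kill it:

    import signal
    import subprocess

    def ignore_sigint():
        # Runs in the child between fork() and exec(); SIG_IGN dispositions are
        # preserved across exec, so keyboard interrupts won't reach the child.
        signal.signal(signal.SIGINT, signal.SIG_IGN)

    # "sleep 60" stands in for the Java gateway process here.
    child = subprocess.Popen(["sleep", "60"], preexec_fn=ignore_sigint)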
-
Andre Schumacher authored
-
- Aug 21, 2013
-
-
Andre Schumacher authored
-
- Aug 16, 2013
-
-
Andre Schumacher authored
Implementing SPARK-878 for PySpark: adding zip and egg files to the context and passing them down to the workers, which add them to their sys.path.
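A usage sketch of shipping dependencies this way; the file names below are placeholders:

    from pyspark import SparkContext

    sc = SparkContext("local", "deps-example", pyFiles=["mylib.zip"])
    sc.addPyFile("extra_helpers.egg")  # can also be added after creation
    # Workers append the shipped archives to their sys.path, so tasks can
    # import modules packaged inside them.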
-
- Aug 14, 2013
-
-
Josh Rosen authored
-
- Aug 12, 2013
-
-
Andre Schumacher authored
Now ADD_FILES uses a comma as the file-name separator.
-
- Aug 11, 2013
-
-
stayhf authored
-
- Aug 10, 2013
-
-
stayhf authored
-
- Aug 01, 2013
-
-
Matei Zaharia authored
-
- Jul 30, 2013
-
-
Josh Rosen authored
This fixes SPARK-832, an issue where PySpark would not work when the master and workers used different SPARK_HOME paths. This change may potentially break code that relied on the master's PYTHONPATH being used on workers. To have custom PYTHONPATH additions used on the workers, users should set a custom PYTHONPATH in spark-env.sh rather than setting it in the shell.
-
- Jul 29, 2013
-
-
Matei Zaharia authored
Batch input records for more efficient NumPy computations.
-
Matei Zaharia authored
One unfortunate consequence of this fix is that we materialize any collections that are given to us as generators, but this seems necessary to get reasonable behavior on small collections. We could add a batchSize parameter later to bypass auto-computation of batch size if this becomes a problem (e.g. if users really want to parallelize big generators nicely).
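A small sketch of the behavior described above, assuming parallelize() accepts any iterable: a generator argument is materialized so its length can drive the batch-size computation:

    from pyspark import SparkContext

    sc = SparkContext("local", "batch-example")
    gen = (x * x for x in range(1000))
    # The generator is consumed into a list internally so that a reasonable
    # batch size can be computed before the records are serialized.
    rdd = sc.parallelize(gen)
    print(rdd.count())  # 1000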
-
Matei Zaharia authored
-
Matei Zaharia authored
-
Matei Zaharia authored
-
Matei Zaharia authored
-
- Jul 27, 2013
-
-
Matei Zaharia authored
-
- Jul 16, 2013
-
-
Matei Zaharia authored
-