- Aug 29, 2013
Matei Zaharia authored
Matei Zaharia authored
This commit makes Spark invocation saner by using an assembly JAR to find all of Spark's dependencies instead of adding all the JARs in lib_managed. It also packages the examples into an assembly and uses that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script with two better-named scripts: "run-examples" for examples, and "spark-class" for Spark internal classes (e.g. REPL, master, etc).

This is also designed to minimize the confusion people have in trying to use "run" to run their own classes; it's not meant to do that, but now at least if they look at it, they can modify run-examples to do a decent job for them. As part of this, Bagel's examples are also now properly moved to the examples package instead of bagel.
- Aug 28, 2013
Andre Schumacher authored
Josh Rosen authored
This addresses SPARK-885, a usability issue where PySpark's Java gateway process would be killed if the user hit ctrl-c. Note that SIGINT still won't cancel the running job. This fix is based on http://stackoverflow.com/questions/5045771
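A minimal sketch of the kind of fix this describes, assuming the gateway is started through Python's subprocess module (the command and helper below are illustrative, not the actual PySpark launch code): putting the child process in its own session keeps a ctrl-c typed in the interactive shell from reaching it.

    import os
    import signal
    import subprocess

    def launch_gateway(command):
        """Start a child that survives ctrl-c typed in the parent's terminal."""
        def preexec():
            os.setsid()  # new session: the shell's SIGINT no longer reaches the child
            signal.signal(signal.SIGINT, signal.SIG_IGN)  # extra safety
        return subprocess.Popen(command, preexec_fn=preexec)

    # Hypothetical usage:
    # gateway = launch_gateway(["java", "-cp", "spark-assembly.jar", "py4j.GatewayServer"])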
Andre Schumacher authored
- Aug 21, 2013
Andre Schumacher authored
- Aug 16, 2013
Andre Schumacher authored
Implements SPARK-878 for PySpark: zip and egg files can be added to the context and are passed down to the workers, which add them to their sys.path.
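A rough sketch of the worker-side half of this, assuming the shipped files arrive as local paths (names here are illustrative): zip and egg archives become importable simply by putting them on sys.path, since Python's import machinery can read from zip files directly.

    import os
    import sys

    def add_shipped_archives(paths):
        """Make shipped .zip and .egg files importable on a worker (sketch)."""
        for path in paths:
            if path.endswith((".zip", ".egg")) and os.path.exists(path):
                if path not in sys.path:
                    sys.path.insert(1, path)  # after '' so local modules still take precedence

    # add_shipped_archives(["/tmp/spark-files/deps.zip", "/tmp/spark-files/tool.egg"])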
- Aug 14, 2013
Josh Rosen authored
- Aug 12, 2013
Andre Schumacher authored
Now ADD_FILES uses a comma as the file name separator.
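For illustration only, the parsing this implies if ADD_FILES is read from the environment (a sketch, not the shipped code):

    import os

    # ADD_FILES is a comma-separated list, e.g. ADD_FILES=/path/a.py,/path/b.zip
    add_files = os.environ.get("ADD_FILES")
    file_list = add_files.split(",") if add_files else []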
- Aug 11, 2013
stayhf authored
- Aug 10, 2013
stayhf authored
- Aug 01, 2013
Matei Zaharia authored
- Jul 30, 2013
Josh Rosen authored
This fixes SPARK-832, an issue where PySpark would not work when the master and workers used different SPARK_HOME paths. This change may potentially break code that relied on the master's PYTHONPATH being used on workers. To have custom PYTHONPATH additions used on the workers, users should set a custom PYTHONPATH in spark-env.sh rather than setting it in the shell.
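A sketch of the idea, assuming each worker knows its own SPARK_HOME (the directory layout and variable names below are assumptions): derive the Python path from the local installation instead of reusing whatever the master exported.

    import os

    def worker_pythonpath(spark_home, user_additions=""):
        """Build PYTHONPATH from the worker's own SPARK_HOME (illustrative layout)."""
        paths = [os.path.join(spark_home, "python")]
        if user_additions:  # e.g. a custom PYTHONPATH set in spark-env.sh
            paths.append(user_additions)
        return os.pathsep.join(paths)

    # os.environ["PYTHONPATH"] = worker_pythonpath(os.environ["SPARK_HOME"],
    #                                              os.environ.get("PYTHONPATH", ""))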
- Jul 29, 2013
Matei Zaharia authored
Batch input records for more efficient NumPy computations.
Matei Zaharia authored
One unfortunate consequence of this fix is that we materialize any collections that are given to us as generators, but this seems necessary to get reasonable behavior on small collections. We could add a batchSize parameter later to bypass auto-computation of batch size if this becomes a problem (e.g. if users really want to parallelize big generators nicely).
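As a sketch of the batching behavior these two commits describe (the helper names and the batch-size heuristic are assumptions, not the actual implementation): records are grouped into fixed-size batches before serialization, and a generator has to be materialized into a list first so its length can be used to choose a batch size.

    def batched(iterable, batch_size):
        """Yield lists of up to batch_size items (illustrative batching helper)."""
        batch = []
        for item in iterable:
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    def parallelize_sketch(data, num_slices):
        data = list(data)  # materializes generators so a batch size can be computed
        batch_size = max(1, len(data) // (10 * num_slices))  # assumed heuristic
        return list(batched(data, batch_size))

    # parallelize_sketch((x * x for x in range(100)), num_slices=4)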
Matei Zaharia authored
Matei Zaharia authored
Matei Zaharia authored
Matei Zaharia authored
- Jul 27, 2013
Matei Zaharia authored
- Jul 16, 2013
Matei Zaharia authored
- Jul 01, 2013
root authored
Improves debuggability by letting "print" statements show up in the executor's stderr.
Conflicts: core/src/main/scala/spark/api/python/PythonRDD.scala
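One plausible mechanism behind this, shown purely as an assumption (the real change is on the JVM side in PythonRDD.scala): launch the Python worker with its stdout pointed at the executor's stderr, so print output ends up in the executor log.

    import subprocess
    import sys

    # "worker.py" is a stand-in for the worker entry point.
    worker = subprocess.Popen(
        [sys.executable, "worker.py"],
        stdin=subprocess.PIPE,
        stdout=sys.stderr,  # anything the worker print()s lands in the executor's stderr
    )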
- Jun 21, 2013
Jey Kottalam authored
Jey Kottalam authored
Jey Kottalam authored
Jey Kottalam authored
Jey Kottalam authored
- Apr 02, 2013
Jey Kottalam authored
Jey Kottalam authored
- Feb 24, 2013
Josh Rosen authored
- Feb 09, 2013
Mark Hamstra authored
- Feb 03, 2013
Josh Rosen authored
Josh Rosen authored
Josh Rosen authored
- Feb 01, 2013
Josh Rosen authored
Josh Rosen authored
The problem was that the gateway was being initialized whenever the pyspark.context module was loaded. The fix uses lazy initialization that occurs only when SparkContext instances are actually constructed. I also made the gateway and jvm variables private. This change results in a ~3-4x performance improvement when running the PySpark unit tests.
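A minimal sketch of the lazy-initialization pattern described here (class and attribute names are illustrative): the expensive gateway launch moves from module import time into the constructor, and the handles are kept on underscore-prefixed attributes.

    class LazyContext:
        _gateway = None  # shared handle, created on first construction rather than at import

        def __init__(self, master):
            self.master = master
            if LazyContext._gateway is None:
                LazyContext._gateway = self._launch_gateway()
            self._jvm = LazyContext._gateway

        @staticmethod
        def _launch_gateway():
            # Stand-in for the expensive JVM gateway startup.
            return object()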
Josh Rosen authored
- Jan 31, 2013
Patrick Wendell authored
This patch alters the Python <-> executor protocol to pass on exception data when exceptions occur in user Python code.
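Roughly what passing exception data over the worker protocol can look like, written as a hedged sketch (the sentinel value and length-prefixed framing are assumptions, not the actual wire format):

    import struct
    import traceback

    PYTHON_EXCEPTION_MARKER = -2  # assumed sentinel telling the JVM an error record follows

    def write_int(stream, value):
        stream.write(struct.pack("!i", value))  # 4-byte big-endian int framing

    def run_and_report(task, out_stream):
        """Run a task; on failure, ship the traceback instead of more data (sketch)."""
        try:
            for record in task():
                data = record if isinstance(record, bytes) else str(record).encode("utf-8")
                write_int(out_stream, len(data))
                out_stream.write(data)
        except Exception:
            err = traceback.format_exc().encode("utf-8")
            write_int(out_stream, PYTHON_EXCEPTION_MARKER)
            write_int(out_stream, len(err))
            out_stream.write(err)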
- Jan 30, 2013
Patrick Wendell authored
Also adds a line to the docs explaining how to use it.
- Jan 25, 2013
Stephen Haberman authored