Commits · a000b5c3b0438c17e9973df4832c320210c29c27 · cs525-sp18-g07 / spark

May 06, 2014

SPARK-1637: Clean up examples for 1.0 · a000b5c3

Sandeep authored 11 years ago

- [x] Move all of them into subpackages of org.apache.spark.examples (right now some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
- [x] Move Python examples into examples/src/main/python
- [x] Update docs to reflect these changes

Author: Sandeep <sandeep@techaddict.me>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #571 from techaddict/SPARK-1637 and squashes the following commits:

47ef86c [Sandeep] Changes based on Discussions on PR, removing use of RawTextHelper from examples
8ed2d3f [Sandeep] Docs Updated for changes, Change for java examples
5f96121 [Sandeep] Move Python examples into examples/src/main/python
0a8dd77 [Sandeep] Move all Scala Examples to org.apache.spark.examples (some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)

a000b5c3

[SPARK-1549] Add Python support to spark-submit · 951a5d93

Matei Zaharia authored 11 years ago

This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.

This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.

In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.

In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.

Author: Matei Zaharia <matei@databricks.com>

Closes #664 from mateiz/py-submit and squashes the following commits:

15e9669 [Matei Zaharia] Fix some uses of path.separator property
051278c [Matei Zaharia] Small style fixes
0afe886 [Matei Zaharia] Add license headers
4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones

951a5d93

Apr 30, 2014

SPARK-1004. PySpark on YARN · ff5be9a4

Sandy Ryza authored 11 years ago

This reopens https://github.com/apache/incubator-spark/pull/640 against the new repo

Author: Sandy Ryza <sandy@cloudera.com>

Closes #30 from sryza/sandy-spark-1004 and squashes the following commits:

89889d4 [Sandy Ryza] Move unzipping py4j to the generate-resources phase so that it gets included in the jar the first time
5165a02 [Sandy Ryza] Fix docs
fd0df79 [Sandy Ryza] PySpark on YARN

ff5be9a4

Apr 21, 2014

[SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs · fc783847

Matei Zaharia authored 11 years ago

I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is a SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one.

Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/

Author: Matei Zaharia <matei@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <pwendell@gmail.com>

Closes #457 from mateiz/better-docs and squashes the following commits:

a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package
5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load
f05abc0 [Matei Zaharia] Don't include java.lang package names
995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc
a14a93c [Matei Zaharia] typo
76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java
ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs
acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced

fc783847

Apr 15, 2014

SPARK-1426: Make MLlib work with NumPy versions older than 1.7 · df360917

Sandeep authored 11 years ago

Currently it requires NumPy 1.7 due to using the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for extracting data out of an array.
Replace it with a fallback

Author: Sandeep <sandeep@techaddict.me>

Closes #391 from techaddict/1426 and squashes the following commits:

d365962 [Sandeep] SPARK-1426: Make MLlib work with NumPy versions older than 1.7 Currently it requires NumPy 1.7 due to using the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for extracting data out of an array. Replace it with a fallback

df360917

Apr 07, 2014

SPARK-1099: Introduce local[*] mode to infer number of cores · 0307db0f

Aaron Davidson authored 11 years ago

This is the default mode for running spark-shell and pyspark, intended to allow users running spark for the first time to see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly 1 core.

Author: Aaron Davidson <aaron@databricks.com>

Closes #182 from aarondav/110 and squashes the following commits:

a88294c [Aaron Davidson] Rebased changes for new spark-shell
a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores

0307db0f

Apr 05, 2014

SPARK-1421. Make MLlib work on Python 2.6 · 0b855167

Matei Zaharia authored 11 years ago

The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java). Now we just convert those bytearrays to strings of bytes, which preserves nonprintable characters as well.

Author: Matei Zaharia <matei@databricks.com>

Closes #335 from mateiz/mllib-python-2.6 and squashes the following commits:

f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6

0b855167

Mar 13, 2014

SPARK-1183. Don't use "worker" to mean executor · 69837321

Sandy Ryza authored 11 years ago

Author: Sandy Ryza <sandy@cloudera.com>

Closes #120 from sryza/sandy-spark-1183 and squashes the following commits:

5066a4a [Sandy Ryza] Remove "worker" in a couple comments
0bd1e46 [Sandy Ryza] Remove --am-class from usage
bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha
607539f [Sandy Ryza] Address review comments
74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor

69837321

Feb 26, 2014

Updated link for pyspark examples in docs · 26450351

Jyotiska NK authored 11 years ago

Author: Jyotiska NK <jyotiska123@gmail.com>

Closes #22 from jyotiska/pyspark_docs and squashes the following commits:

426136c [Jyotiska NK] Updated link for pyspark examples

26450351

Jan 15, 2014
- Clarify that Python 2.7 is only needed for MLlib · 2ffdaefb
  Matei Zaharia authored 11 years ago
  
  2ffdaefb
Jan 12, 2014
- Update Python required version to 2.7, and mention MLlib support · 224f1a75
  Matei Zaharia authored 11 years ago
  
  224f1a75
Jan 07, 2014

Simplify and fix pyspark script. · 82a1d38a

Patrick Wendell authored 11 years ago

This patch removes compatibility for IPython < 1.0 but fixes the launch
script and makes it much simpler.

I tested this using the three commands in the PySpark documentation page:

1. IPYTHON=1 ./pyspark
2. IPYTHON_OPTS="notebook" ./pyspark
3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark

There are two changes:
- We rely on PYTHONSTARTUP env var to start PySpark
- Removed the quotes around $IPYTHON_OPTS... having quotes
  gloms them together as a single argument passed to `exec` which
  seemed to cause ipython to fail (it instead expects them as
  multiple arguments).

82a1d38a

Jan 06, 2014
- Code review feedback · d86dc74d
  Holden Karau authored 11 years ago
  
  d86dc74d
Jan 02, 2014
- pyspark -> bin/pyspark · a3f90a2e
  Prashant Sharma authored 11 years ago
  
  a3f90a2e
- Removed sbt folder and changed docs accordingly · 6be4c111
  Prashant Sharma authored 11 years ago
  
  6be4c111
Dec 30, 2013
- Updated docs for SparkConf and handled review comments · 0fa58097
  Matei Zaharia authored 11 years ago
  
  0fa58097
Oct 22, 2013
- Add notes to python documentation about using SparkContext.setSystemProperty. · c8748c25
  Ewen Cheslack-Postava authored 11 years ago
  
  c8748c25
Oct 09, 2013
- Fix PySpark docs and an overly long line of code after fdbae41e · 478b2b7e
  Matei Zaharia authored 11 years ago
  
  478b2b7e
Sep 10, 2013
- Update Python API features · 2425eb85
  Matei Zaharia authored 11 years ago
  
  2425eb85
Sep 08, 2013

More fair scheduler docs and property names. · 651a96ad

Matei Zaharia authored 11 years ago

Also changed uses of "job" terminology to "application" when they
referred to an entire Spark program, to avoid confusion.

651a96ad

Sep 02, 2013
- Doc improvements · 9ee1e9db
  Matei Zaharia authored 11 years ago
  
  9ee1e9db
Sep 01, 2013
- Fix more URLs in docs · d27cd03f
  Matei Zaharia authored 11 years ago
  
  d27cd03f
Aug 31, 2013
- More updates, describing changes to recommended use of environment vars · 4819baa6
  Matei Zaharia authored 11 years ago
  
  and new Python stuff
  4819baa6
Aug 29, 2013
- Update some build instructions because only sbt assembly and mvn package · 2de756ff
  Matei Zaharia authored 11 years ago
  
  are now needed
  2de756ff
Jul 29, 2013
- Add docs about ipython · 497f5575
  Matei Zaharia authored 11 years ago
  
  497f5575
Jul 01, 2013
- Clarify that PySpark is not supported on Windows · 7cd490ef
  root authored 12 years ago
  
  7cd490ef
Jun 26, 2013
- Simplify Python docs a little to do substring search · aea727f6
  Matei Zaharia authored 12 years ago
  
  aea727f6
Feb 25, 2013
- Some tweaks to docs · 351ac523
  Matei Zaharia authored 12 years ago
  
  351ac523
Feb 18, 2013

Added checkpointing and fault-tolerance semantics to the programming guide.... · 8ad561dc

Tathagata Das authored 12 years ago

Added checkpointing and fault-tolerance semantics to the programming guide. Fixed default checkpoint interval to being a multiple of slide duration. Fixed visibility of some classes and objects to clean up docs.

8ad561dc

Jan 30, 2013
- Make module help available in python shell. · 3f945e3b
  Patrick Wendell authored 12 years ago
  
  Also, adds a line in doc explaining how to use.
  3f945e3b
- Inclue packaging and launching pyspark in guide. · 58a7d320
  Patrick Wendell authored 12 years ago
  
  It's nicer if all the commands you need are made explicit.
  58a7d320
Jan 20, 2013
- Fix Python guide to say accumulators are available · ee5a0795
  Matei Zaharia authored 12 years ago
  
  ee5a0795
Jan 08, 2013
- Add mapPartitionsWithSplit() to PySpark. · b57dd0f1
  Josh Rosen authored 12 years ago
  
  b57dd0f1
Jan 01, 2013
- Add `pyspark` script to replace the other scripts. · ce9f1bbe
  Josh Rosen authored 12 years ago
  
  Expand the PySpark programming guide.
  ce9f1bbe
- Minor documentation and style fixes for PySpark. · 170e451f
  Josh Rosen authored 12 years ago
  
  170e451f
Dec 29, 2012
- Add documentation for Python API. · c2b105af
  Josh Rosen authored 12 years ago
  
  c2b105af