  1. May 17, 2017
    • 
      [SPARK-20769][DOC] Incorrect documentation for using Jupyter notebook · 19954176
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      SPARK-13973 incorrectly removed the required `PYSPARK_DRIVER_PYTHON_OPTS=notebook` from the documentation for using pyspark with a Jupyter notebook. This patch corrects the documentation error.
      
      ## How was this patch tested?
      
      Tested invocation locally with
      ```bash
      PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark
      ```
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18001 from aray/patch-1.
  2. May 07, 2017
    • 
      [SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access. · 2cf83c47
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
      Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.
      
      It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.
      
      There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores. It refers the reader to the supplier's own documentation, with specific warnings on security and on the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when using an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.
      
      (this is the successor to #12004; I can't re-open it)
      
      ## How was this patch tested?
      
      Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)
      
      Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.
      
      Manually did a clean build and verified that the assembly contains the relevant aws-* and hadoop-* artifacts on Hadoop 2.6, and the azure artifacts with the hadoop-2.7 profile.
      
      SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
      Maven build: `mvn install -Phadoop-cloud -Phadoop-2.7`
      
      This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.
      
      Author: Steve Loughran <stevel@apache.org>
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #17834 from steveloughran/cloud/SPARK-7481-current.
  3. Mar 10, 2017
  4. Mar 07, 2017
  5. Feb 16, 2017
    • 
      [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support · 0e240549
      Sean Owen authored
      - Move external/java8-tests tests into core, streaming, and sql, and remove the module
      - Remove MaxPermGen and related options
      - Fix some reflection / TODOs around Java 8+ methods
      - Update doc references to 1.7/1.8 differences
      - Remove Java 7/8 related build profiles
      - Update some plugins for better Java 8 compatibility
      - Fix a few Java-related warnings
      
      For the future:
      
      - Update Java 8 examples to fully use Java 8
      - Update Java tests to use lambdas for simplicity
      - Update Java internal implementations to use lambdas
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16871 from srowen/SPARK-19493.
  6. Jan 24, 2017
    • 
      [DOCS] Fix typo in docs · 7c61c2a1
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Fix typo in docs
      
      ## How was this patch tested?
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16658 from uncleGen/typo-issue.
  7. Dec 16, 2016
    • 
      [SPARK-18723][DOC] Expanded programming guide information on wholeTex… · 836c95b1
      Michal Senkyr authored
      ## What changes were proposed in this pull request?
      
      Add additional information on wholeTextFiles to the Programming Guide, and explain how its partitioning policy differs from that of textFile and the impact this has on performance.
      
      Also added a reference to the underlying CombineFileInputFormat.
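
A hedged illustration of the shape difference the expanded docs describe: `textFile` yields one record per line with file boundaries discarded, while `wholeTextFiles` yields one `(filename, content)` pair per file. The helpers below are pure-Python stand-ins (plain lists instead of RDDs), not the real SparkContext methods.

```python
# Pure-Python sketch of the record shapes described above; no Spark required.
# text_file / whole_text_files are hypothetical stand-ins for
# SparkContext.textFile / SparkContext.wholeTextFiles.

def text_file(files):
    """One record per line, file boundaries discarded (like sc.textFile)."""
    return [line for _, content in files for line in content.splitlines()]

def whole_text_files(files):
    """One (filename, content) record per file (like sc.wholeTextFiles)."""
    return [(name, content) for name, content in files]

files = [("a.txt", "one\ntwo"), ("b.txt", "three")]
print(text_file(files))         # line-oriented records
print(whole_text_files(files))  # file-oriented records
```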
      
      ## How was this patch tested?
      
      Manual build of documentation and inspection in browser
      
      ```
      cd docs
      jekyll serve --watch
      ```
      
      Author: Michal Senkyr <mike.senkyr@gmail.com>
      
      Closes #16157 from michalsenkyr/wholeTextFilesExpandedDocs.
  8. Dec 12, 2016
  9. Nov 29, 2016
    • 
      [MINOR][DOCS] Updates to the Accumulator example in the programming guide.... · f045d9da
      aokolnychyi authored
      [MINOR][DOCS] Updates to the Accumulator example in the programming guide. Fixed typos, AccumulatorV2 in Java
      
      ## What changes were proposed in this pull request?
      
      This pull request contains updates to Scala and Java Accumulator code snippets in the programming guide.
      
      - For Scala, the pull request fixes the signature of the 'add()' method in the custom Accumulator, which contained two params (as the old AccumulatorParam) instead of one (as in AccumulatorV2).
      
      - The Java example was updated to use the AccumulatorV2 class since AccumulatorParam is marked as deprecated.
      
      - Scala and Java examples are more consistent now.
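
As a hedged sketch of the single-parameter `add` contract the fix restores (a pure-Python stand-in for illustration, not the real `org.apache.spark.util.AccumulatorV2`):

```python
# Minimal pure-Python sketch of the AccumulatorV2 contract described above.
# It mirrors the shape of the Scala API (isZero/reset/add/merge/value); the
# class name is a hypothetical stand-in.

class LongAccumulatorSketch:
    def __init__(self):
        self._sum = 0

    def is_zero(self):
        return self._sum == 0

    def reset(self):
        self._sum = 0

    def add(self, v):           # a single parameter, unlike the old
        self._sum += v          # AccumulatorParam.addInPlace(v1, v2)

    def merge(self, other):     # combine per-task copies on the driver
        self._sum += other._sum

    @property
    def value(self):
        return self._sum

acc = LongAccumulatorSketch()
for v in [1, 2, 3]:
    acc.add(v)
other = LongAccumulatorSketch()
other.add(4)
acc.merge(other)
print(acc.value)  # 10
```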
      
      ## How was this patch tested?
      
      This patch was tested manually by building the docs locally.
      
      ![image](https://cloud.githubusercontent.com/assets/6235869/20652099/77d98d18-b4f3-11e6-8565-a995fe8cf8e5.png)
      
      Author: aokolnychyi <okolnychyyanton@gmail.com>
      
      Closes #16024 from aokolnychyi/fixed_accumulator_example.
  10. Nov 14, 2016
  11. Nov 03, 2016
    • 
      [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6... · dc4c6009
      Sean Owen authored
      [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0
      
      ## What changes were proposed in this pull request?
      
      Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it.
      
      ## How was this patch tested?
      
      Doc build
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15733 from srowen/SPARK-18138.
  12. Oct 22, 2016
    • 
      [SPARK-17898][DOCS] repositories needs username and password · 01b26a06
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Document `user:password` syntax as possible means of specifying credentials for password-protected `--repositories`
      
      ## How was this patch tested?
      
      Doc build
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15584 from srowen/SPARK-17898.
  13. Oct 12, 2016
    • 
      [SPARK-17880][DOC] The url linking to `AccumulatorV2` in the document is incorrect. · b512f04f
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      In `programming-guide.md`, the url which links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2` but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct.
      
      ## How was this patch tested?
      manual test.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #15439 from sarutak/SPARK-17880.
  14. Aug 30, 2016
    • 
      [MINOR][DOCS] Fix minor typos in python example code · d4eee993
      Dmitriy Sokolov authored
      ## What changes were proposed in this pull request?
      
      Fix minor typos python example code in streaming programming guide
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dmitriy Sokolov <silentsokolov@gmail.com>
      
      Closes #14805 from silentsokolov/fix-typos.
  15. Aug 12, 2016
  16. Aug 07, 2016
    • 
      [SPARK-16911] Fix the links in the programming guide · 6c1ecb19
      Shivansh authored
      ## What changes were proposed in this pull request?
      
      Fix the broken links in the programming guide to the GraphX migration guide and to the understanding-closures section.
      
      ## How was this patch tested?
      
      By running the test cases and checking the links.
      
      Author: Shivansh <shiv4nsh@gmail.com>
      
      Closes #14503 from shiv4nsh/SPARK-16911.
    • 
      [SPARK-16932][DOCS] Changed programming guide to not reference old accumulator API in Scala · b1ebe182
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      In the programming guide, the accumulator section mixes up both the old and new APIs causing it to be confusing.  This is not necessary for Scala, so all references to the old API are removed.  For Java, it is somewhat fixed up except for the example of a custom accumulator because I don't think an API exists yet.  Python has not currently implemented the new API.
      
      ## How was this patch tested?
      built doc locally
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #14516 from BryanCutler/fixup-accumulator-programming-guide-SPARK-15702.
  17. Jul 15, 2016
    • 
      [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide · 5ffd5d38
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Made DataFrame-based API primary
      * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
      * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
      * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
        * **Reviewers: please check this carefully**
      * (minor) Titles for DF API no longer include "- spark.ml" suffix.  Titles for RDD API have "- RDD-based API" suffix
      * Moved migration guide to ml-guide from mllib-guide
        * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
        * **Reviewers**: I did not change any of the content of the migration guides.
      
      Reorganized DataFrame-based guide:
      * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
      * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
        * **Reviewers**: I did not change the content of these guides, except some intro text.
      * Sidebar remains the same, but with pipeline and tuning sections added
      
      Other:
      * ml-classification-regression.html: Moved text about linear methods to new section in page
      
      ## How was this patch tested?
      
      Generated docs locally
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14213 from jkbradley/ml-guide-2.0.
  18. Jul 13, 2016
    • 
      [SPARK-16438] Add Asynchronous Actions documentation · bf107f1e
      sandy authored
      ## What changes were proposed in this pull request?
      
      Add asynchronous actions documentation inside the actions section of the programming guide.
      
      ## How was this patch tested?
      
      Checked the documentation indentation and formatting with a Markdown preview.
      
      Author: sandy <phalodi@gmail.com>
      
      Closes #14104 from phalodi/SPARK-16438.
  19. Jun 20, 2016
  20. Jun 14, 2016
  21. Jun 12, 2016
    • 
      [SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API · f51dfe61
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Deprecate old Java accumulator API; should use Scala now
      - Update Java tests and examples
      - Don't bother testing old accumulator API in Java 8 (too)
      - (fix a misspelling too)
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13606 from srowen/SPARK-15086.
  22. Jun 01, 2016
    • 
      [SPARK-15702][DOCUMENTATION] Update document programming-guide accumulator section · 2402b914
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Update the accumulator section of the programming guide (Scala only).
      The Java and Python versions are not modified because the corresponding API work is not yet done.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13441 from WeichenXu123/update_doc_accumulatorV2_clean.
  23. May 03, 2016
  24. Apr 30, 2016
    • 
      [SPARK-13973][PYSPARK] Make pyspark fail noisily if IPYTHON or IPYTHON_OPTS are set · 0368ff30
      pshearer authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13973
      
      Following discussion with srowen, the IPYTHON and IPYTHON_OPTS variables are removed. If they are set in the user's environment, pyspark will not execute and will print an error message. Failing noisily forces users to remove these options and learn the new configuration scheme, which is much more sustainable and less confusing.
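
The actual check lives in the `bin/pyspark` launcher script; the following is a hedged pure-Python sketch of the same fail-noisily behavior (the function name and message text are illustrative stand-ins, not Spark's):

```python
import os

def check_removed_vars(environ=os.environ):
    """Fail noisily if the removed IPYTHON / IPYTHON_OPTS variables are set.

    A sketch of the behavior described above; the real check is done in the
    bin/pyspark shell script and its exact wording differs.
    """
    for var in ("IPYTHON", "IPYTHON_OPTS"):
        if environ.get(var):
            raise SystemExit(
                f"{var} is removed in Spark 2.0; use "
                "PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead."
            )

# With neither variable set, the check is a no-op:
check_removed_vars({})
```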
      
      ## How was this patch tested?
      
      Manual testing; set IPYTHON=1 and verified that the error message prints.
      
      Author: pshearer <pshearer@massmutual.com>
      Author: shearerp <shearerp@umich.edu>
      
      Closes #12528 from shearerp/master.
  25. Apr 28, 2016
  26. Apr 24, 2016
  27. Feb 27, 2016
    • 
      [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts · 59e3e10b
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We provide a very limited set of cluster management scripts in Spark for Tachyon, although Tachyon itself provides a much better version of them. Given that Spark users can now simply use Tachyon as a normal file system without extensive configuration, we can remove these management capabilities to simplify Spark's bash scripts.
      
      Note that this also reduces coupling between a 3rd-party external system and Spark's release scripts, and eliminates the possibility of failures such as Tachyon being renamed or the tarballs being relocated.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11400 from rxin/release-script.
  28. Feb 22, 2016
  29. Feb 19, 2016
  30. Jan 23, 2016
  31. Dec 22, 2015
  32. Dec 18, 2015
    • 
      [SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels · 499ac3e6
      gatorsmile authored
      The current default storage level of Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs.
      
      davies Is this inconsistency intentional? Thanks!
      
      Updates: Since the data is always serialized on the Python side, the JAVA-specific deserialized storage levels, such as MEMORY_ONLY, are not removed.
      
      Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
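
Because stored objects are always pickled on the Python side, the serialized/deserialized distinction is moot there; a small illustration of that round-trip (plain `pickle`, no Spark involved):

```python
import pickle

# Sketch of why a "deserialized" storage level buys nothing in Python: objects
# are pickled either way, so what gets stored is always a byte string.
record = {"word": "spark", "count": 3}
serialized = pickle.dumps(record)      # what is actually stored
assert isinstance(serialized, bytes)

restored = pickle.loads(serialized)    # what the user gets back
print(restored == record)  # True
```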
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10092 from gatorsmile/persistStorageLevel.
  33. Nov 01, 2015
  34. Sep 28, 2015
  35. Sep 21, 2015
  36. Aug 22, 2015
    • 
      Update programming-guide.md · 46fcb9e0
      Keiji Yoshida authored
      Update `lineLengths.persist();` to `lineLengths.persist(StorageLevel.MEMORY_ONLY());` because `JavaRDD#persist` requires a `StorageLevel` parameter.
      
      Author: Keiji Yoshida <yoshida.keiji.84@gmail.com>
      
      Closes #8372 from yosssi/patch-1.
  37. Aug 19, 2015
  38. Jun 19, 2015
    • 
      [SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps... · 4be53d03
      Sean Owen authored
      [SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files
      
      Clarify what may cause long-running Spark apps to preserve shuffle files
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #6901 from srowen/SPARK-5836 and squashes the following commits:
      
      a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files