- May 17, 2017
-
-
Andrew Ray authored
## What changes were proposed in this pull request? SPARK-13973 incorrectly removed the required PYSPARK_DRIVER_PYTHON_OPTS=notebook from documentation to use pyspark with Jupyter notebook. This patch corrects the documentation error. ## How was this patch tested? Tested invocation locally with ```bash PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark ``` Author: Andrew Ray <ray.andrew@gmail.com> Closes #18001 from aray/patch-1.
-
- May 07, 2017
-
-
Steve Loughran authored
## What changes were proposed in this pull request? Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson. It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`. There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector. (this is the successor to #12004; I can't re-open it) ## How was this patch tested? Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples) Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well. Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile. SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package` maven build `mvn install -Phadoop-cloud -Phadoop-2.7` This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible. Author: Steve Loughran <stevel@apache.org> Author: Steve Loughran <stevel@hortonworks.com> Closes #17834 from steveloughran/cloud/SPARK-7481-current.
-
- Mar 10, 2017
-
-
Yong Tang authored
This fix removes deprecated support for config `SPARK_YARN_USER_ENV`, as is mentioned in SPARK-17979. This fix also removes deprecated support for the following: ``` SPARK_YARN_USER_ENV SPARK_JAVA_OPTS SPARK_CLASSPATH SPARK_WORKER_INSTANCES ``` Related JIRA: [SPARK-14453]: https://issues.apache.org/jira/browse/SPARK-14453 [SPARK-12344]: https://issues.apache.org/jira/browse/SPARK-12344 [SPARK-15781]: https://issues.apache.org/jira/browse/SPARK-15781 Existing tests should pass. Author: Yong Tang <yong.tang.github@outlook.com> Closes #17212 from yongtang/SPARK-17979.
-
- Mar 07, 2017
-
-
Wenchen Fan authored
## What changes were proposed in this pull request? After Spark 2.0, `SparkSession` becomes the new entry point for Spark applications. We should update the public documents to reflect this. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #16856 from cloud-fan/doc.
-
- Feb 16, 2017
-
-
Sean Owen authored
- Move external/java8-tests tests into core, streaming, sql and remove - Remove MaxPermGen and related options - Fix some reflection / TODOs around Java 8+ methods - Update doc references to 1.7/1.8 differences - Remove Java 7/8 related build profiles - Update some plugins for better Java 8 compatibility - Fix a few Java-related warnings For the future: - Update Java 8 examples to fully use Java 8 - Update Java tests to use lambdas for simplicity - Update Java internal implementations to use lambdas ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #16871 from srowen/SPARK-19493.
-
- Jan 24, 2017
-
-
uncleGen authored
## What changes were proposed in this pull request? Fix typo in docs ## How was this patch tested? Author: uncleGen <hustyugm@gmail.com> Closes #16658 from uncleGen/typo-issue.
-
- Dec 16, 2016
-
-
Michal Senkyr authored
## What changes were proposed in this pull request? Add additional information to wholeTextFiles in the Programming Guide. Also explain partitioning policy difference in relation to textFile and its impact on performance. Also added reference to the underlying CombineFileInputFormat ## How was this patch tested? Manual build of documentation and inspection in browser ``` cd docs jekyll serve --watch ``` Author: Michal Senkyr <mike.senkyr@gmail.com> Closes #16157 from michalsenkyr/wholeTextFilesExpandedDocs.
-
- Dec 12, 2016
-
-
Bill Chambers authored
## What changes were proposed in this pull request? This PR clarifies where accumulators will be displayed. ## How was this patch tested? No testing. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bill Chambers <bill@databricks.com> Author: anabranch <wac.chambers@gmail.com> Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #16180 from anabranch/improve-acc-docs.
-
- Nov 29, 2016
-
-
aokolnychyi authored
[MINOR][DOCS] Updates to the Accumulator example in the programming guide. Fixed typos, AccumulatorV2 in Java ## What changes were proposed in this pull request? This pull request contains updates to Scala and Java Accumulator code snippets in the programming guide. - For Scala, the pull request fixes the signature of the 'add()' method in the custom Accumulator, which contained two params (as the old AccumulatorParam) instead of one (as in AccumulatorV2). - The Java example was updated to use the AccumulatorV2 class since AccumulatorParam is marked as deprecated. - Scala and Java examples are more consistent now. ## How was this patch tested? This patch was tested manually by building the docs locally.  Author: aokolnychyi <okolnychyyanton@gmail.com> Closes #16024 from aokolnychyi/fixed_accumulator_example.
-
- Nov 14, 2016
-
-
Noritaka Sekiyama authored
Changed HDFS default block size from 64MB to 128MB. https://issues.apache.org/jira/browse/SPARK-18432 Author: Noritaka Sekiyama <moomindani@gmail.com> Closes #15879 from moomindani/SPARK-18432.
-
- Nov 03, 2016
-
-
Sean Owen authored
[SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0 ## What changes were proposed in this pull request? Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it. ## How was this patch tested? Doc build Author: Sean Owen <sowen@cloudera.com> Closes #15733 from srowen/SPARK-18138.
-
- Oct 22, 2016
-
-
Sean Owen authored
## What changes were proposed in this pull request? Document `user:password` syntax as possible means of specifying credentials for password-protected `--repositories` ## How was this patch tested? Doc build Author: Sean Owen <sowen@cloudera.com> Closes #15584 from srowen/SPARK-17898.
-
- Oct 12, 2016
-
-
Kousuke Saruta authored
## What changes were proposed in this pull request? In `programming-guide.md`, the url which links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2` but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct. ## How was this patch tested? manual test. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #15439 from sarutak/SPARK-17880.
-
- Aug 30, 2016
-
-
Dmitriy Sokolov authored
## What changes were proposed in this pull request? Fix minor typos python example code in streaming programming guide ## How was this patch tested? N/A Author: Dmitriy Sokolov <silentsokolov@gmail.com> Closes #14805 from silentsokolov/fix-typos.
-
- Aug 12, 2016
-
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR fixes the documentation as below: - Python has 4 spaces and Java and Scala has 2 spaces (See https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide). - Avoid excessive parentheses and curly braces for anonymous functions. (See https://github.com/databricks/scala-style-guide#anonymous) ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #14593 from HyukjinKwon/minor-documentation.
-
- Aug 07, 2016
-
-
Shivansh authored
## What changes were proposed in this pull request? Fix the broken links in the programming guide of the Graphx Migration and understanding closures ## How was this patch tested? By running the test cases and checking the links. Author: Shivansh <shiv4nsh@gmail.com> Closes #14503 from shiv4nsh/SPARK-16911.
-
Bryan Cutler authored
## What changes were proposed in this pull request? In the programming guide, the accumulator section mixes up both the old and new APIs causing it to be confusing. This is not necessary for Scala, so all references to the old API are removed. For Java, it is somewhat fixed up except for the example of a custom accumulator because I don't think an API exists yet. Python has not currently implemented the new API. ## How was this patch tested? built doc locally Author: Bryan Cutler <cutlerb@gmail.com> Closes #14516 from BryanCutler/fixup-accumulator-programming-guide-SPARK-15702.
-
- Jul 15, 2016
-
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Made DataFrame-based API primary * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API * **Reviewers: please check this carefully** * (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix * Moved migration guide to ml-guide from mllib-guide * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides * **Reviewers**: I did not change any of the content of the migration guides. Reorganized DataFrame-based guide: * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc. * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html * **Reviewers**: I did not change the content of these guides, except some intro text. * Sidebar remains the same, but with pipeline and tuning sections added Other: * ml-classification-regression.html: Moved text about linear methods to new section in page ## How was this patch tested? Generated docs locally Author: Joseph K. Bradley <joseph@databricks.com> Closes #14213 from jkbradley/ml-guide-2.0.
-
- Jul 13, 2016
-
-
sandy authored
## What changes were proposed in this pull request? Add Asynchronous Actions documentation inside action of programming guide ## How was this patch tested? check the documentation indentation and formatting with md preview. Author: sandy <phalodi@gmail.com> Closes #14104 from phalodi/SPARK-16438.
-
- Jun 20, 2016
-
-
Eric Liang authored
This has changed from 1.6, and now stores memory off-heap using spark's off-heap support instead of in tachyon. Author: Eric Liang <ekl@databricks.com> Closes #13744 from ericl/spark-16025.
-
- Jun 14, 2016
-
-
Mortada Mehyar authored
## What changes were proposed in this pull request? minor typo ## How was this patch tested? minor typo in the doc, should be self explanatory Author: Mortada Mehyar <mortada.mehyar@gmail.com> Closes #13639 from mortada/typo.
-
- Jun 12, 2016
-
-
Sean Owen authored
## What changes were proposed in this pull request? - Deprecate old Java accumulator API; should use Scala now - Update Java tests and examples - Don't bother testing old accumulator API in Java 8 (too) - (fix a misspelling too) ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #13606 from srowen/SPARK-15086.
-
- Jun 01, 2016
-
-
WeichenXu authored
## What changes were proposed in this pull request? Update document programming-guide accumulator section (scala language) java and python version, because the API haven't done, so I do not modify them. ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #13441 from WeichenXu123/update_doc_accumulatorV2_clean.
-
- May 03, 2016
-
-
Sandeep Singh authored
Author: Sandeep Singh <sandeep@techaddict.me> Closes #12841 from techaddict/improve_docs_1.
-
- Apr 30, 2016
-
-
pshearer authored
## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13973 Following discussion with srowen the IPYTHON and IPYTHON_OPTS variables are removed. If they are set in the user's environment, pyspark will not execute and prints an error message. Failing noisily will force users to remove these options and learn the new configuration scheme, which is much more sustainable and less confusing. ## How was this patch tested? Manual testing; set IPYTHON=1 and verified that the error message prints. Author: pshearer <pshearer@massmutual.com> Author: shearerp <shearerp@umich.edu> Closes #12528 from shearerp/master.
-
- Apr 28, 2016
-
-
Sean Owen authored
## What changes were proposed in this pull request? Add simple clarification that Spark can be cross-built for other Scala versions. ## How was this patch tested? Automated doc build Author: Sean Owen <sowen@cloudera.com> Closes #12757 from srowen/SPARK-14882.
-
- Apr 24, 2016
-
-
Jacek Laskowski authored
## What changes were proposed in this pull request? Added screenshot + minor fixes to improve reading ## How was this patch tested? Manual Author: Jacek Laskowski <jacek@japila.pl> Closes #12569 from jaceklaskowski/docs-accumulators.
-
- Feb 27, 2016
-
-
Reynold Xin authored
## What changes were proposed in this pull request? We provide a very limited set of cluster management script in Spark for Tachyon, although Tachyon itself provides a much better version of it. Given now Spark users can simply use Tachyon as a normal file system and does not require extensive configurations, we can remove this management capabilities to simplify Spark bash scripts. Note that this also reduces coupling between a 3rd party external system and Spark's release scripts, and would eliminate possibility for failures such as Tachyon being renamed or the tar balls being relocated. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #11400 from rxin/release-script.
-
- Feb 22, 2016
-
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? This PR tries to fix all typos in all markdown files under `docs` module, and fixes similar typos in other comments, too. ## How was the this patch tested? manual tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11300 from dongjoon-hyun/minor_fix_typos.
-
- Feb 19, 2016
-
-
Sean Owen authored
Clarify that reduce functions need to be commutative, and fold functions do not See https://github.com/apache/spark/pull/11091 Author: Sean Owen <sowen@cloudera.com> Closes #11217 from srowen/SPARK-13339.
-
- Jan 23, 2016
-
-
Sean Owen authored
[SPARK-12760][DOCS] inaccurate description for difference between local vs cluster mode in closure handling Clarify that modifying a driver local variable won't have the desired effect in cluster modes, and may or may not work as intended in local mode Author: Sean Owen <sowen@cloudera.com> Closes #10866 from srowen/SPARK-12760.
-
Mortada Mehyar authored
…local vs cluster srowen thanks for the PR at https://github.com/apache/spark/pull/10866! sorry it took me a while. This is related to https://github.com/apache/spark/pull/10866, basically the assignment in the lambda expression in the python example is actually invalid ``` In [1]: data = [1, 2, 3, 4, 5] In [2]: counter = 0 In [3]: rdd = sc.parallelize(data) In [4]: rdd.foreach(lambda x: counter += x) File "<ipython-input-4-fcb86c182bad>", line 1 rdd.foreach(lambda x: counter += x) ^ SyntaxError: invalid syntax ``` Author: Mortada Mehyar <mortada.mehyar@gmail.com> Closes #10867 from mortada/doc_python_fix.
-
- Dec 22, 2015
-
-
Shixiong Zhu authored
This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10385 from zsxwing/accumulator-broadcast-example.
-
- Dec 18, 2015
-
-
gatorsmile authored
The current default storage level of Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs. davies Is this inconsistency intentional? Thanks! Updates: Since the data is always serialized on the Python side, the storage levels of JAVA-specific deserialization are not removed, such as MEMORY_ONLY. Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`. Author: gatorsmile <gatorsmile@gmail.com> Closes #10092 from gatorsmile/persistStorageLevel.
-
- Nov 01, 2015
-
-
Sean Owen authored
Remove Hadoop third party distro page, and move Hadoop cluster config info to configuration page CC pwendell Author: Sean Owen <sowen@cloudera.com> Closes #9298 from srowen/SPARK-11305.
-
- Sep 28, 2015
-
-
David Martin authored
seperate -> separate sees -> see Author: David Martin <dmartinpro@users.noreply.github.com> Closes #8928 from dmartinpro/patch-1.
-
- Sep 21, 2015
-
-
Jacek Laskowski authored
* Backticks are processed properly in Spark Properties table * Removed unnecessary spaces * See http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #8795 from jaceklaskowski/docs-yarn-formatting.
-
- Aug 22, 2015
-
-
Keiji Yoshida authored
Update `lineLengths.persist();` to `lineLengths.persist(StorageLevel.MEMORY_ONLY());` because `JavaRDD#persist` needs a parameter of `StorageLevel`. Author: Keiji Yoshida <yoshida.keiji.84@gmail.com> Closes #8372 from yosssi/patch-1.
-
- Aug 19, 2015
-
-
Davies Liu authored
cc JoshRosen Author: Davies Liu <davies@databricks.com> Closes #8245 from davies/python_doc.
-
- Jun 19, 2015
-
-
Sean Owen authored
[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files Clarify what may cause long-running Spark apps to preserve shuffle files Author: Sean Owen <sowen@cloudera.com> Closes #6901 from srowen/SPARK-5836 and squashes the following commits: a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files
-