- Nov 01, 2013
  - Evan Chan authored
- Oct 24, 2013
  - Patrick Wendell authored
    This patch adds an operator called `repartition` with more straightforward semantics than the current `coalesce` operator. There are a few use cases where this operator is useful:
    1. A user wants to increase the number of partitions in the RDD. This is more common now with streaming, e.g. a user is ingesting data on one node but wants more partitions to ensure parallelism of subsequent operations across threads or the cluster. Right now they have to call rdd.coalesce(numSplits, shuffle=true), which is super confusing.
    2. A user has input data where the number of partitions is not known, e.g. sc.textFile("some file").coalesce(50). This is not only semantically vague (am I growing or shrinking this RDD?) but may also not work correctly if the base RDD has fewer than 50 partitions.
    The new operator forces a shuffle every time, so it always produces exactly the requested number of partitions. It also throws an exception rather than silently not working when given bad input. I am currently adding streaming tests (this requires refactoring some of the test suite to allow testing at partition granularity), so this is not ready for merge yet, but feedback is welcome.
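For reference, a minimal Scala sketch of the contrast described in this commit message, assuming an existing SparkContext named `sc`; the input path is a placeholder:

```scala
// Sketch only: contrasts coalesce with the new repartition operator.
// Assumes a SparkContext `sc`; "some file" is a placeholder path.
val rdd = sc.textFile("some file")   // partition count depends on the input splits

// coalesce only increases the partition count when shuffle = true;
// without the flag it can only shrink the RDD's partitioning.
val viaCoalesce = rdd.coalesce(50, shuffle = true)

// The new operator always shuffles, so the result has exactly 50 partitions.
val viaRepartition = rdd.repartition(50)
```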
- Oct 23, 2013
  - Patrick Wendell authored
- Oct 22, 2013
  - Ewen Cheslack-Postava authored
  - Aaron Davidson authored
- Oct 19, 2013
  - Patrick Wendell authored
    Clarifies that this governs compression of internal data, not input data or output data.
- Oct 17, 2013
  - Mosharaf Chowdhury authored
- Oct 10, 2013
  - Aaron Davidson authored
  - Aaron Davidson authored
    Updates the documentation and changes some logError()s to logWarning()s.
- Oct 09, 2013
  - Matei Zaharia authored
- Oct 08, 2013
  - Aaron Davidson authored
    Also fixes a couple of HTML/Markdown issues in other files.
- Oct 06, 2013
  - Patrick Wendell authored
- Oct 04, 2013
  - Nick Pentreath authored
- Oct 03, 2013
  - tgravescs authored
    the classpaths
- Oct 02, 2013
  - tgravescs authored
- Sep 24, 2013
  - Patrick Wendell authored
- Sep 23, 2013
  - Y.CORP.YAHOO.COM\tgraves authored
    Support distributed cache files and archives on Spark on YARN, and attempt to clean up the staging directory on exit.
- Sep 15, 2013
  - Jey Kottalam authored
  - Patrick Wendell authored
  - Patrick Wendell authored
- Sep 11, 2013
  - Benjamin Hindman authored
  - Benjamin Hindman authored
  - Patrick Wendell authored
- Sep 10, 2013
  - Matei Zaharia authored
- Sep 09, 2013
  - Patrick Wendell authored
- Sep 08, 2013
  - Matei Zaharia authored
  - Matei Zaharia authored
  - Ameet Talwalkar authored
  - Ameet Talwalkar authored
  - Matei Zaharia authored
  - Patrick Wendell authored
  - Matei Zaharia authored
  - Matei Zaharia authored
  - Matei Zaharia authored
    details on monitoring
  - Matei Zaharia authored
    Also changed uses of "job" terminology to "application" when they referred to an entire Spark program, to avoid confusion.
  - Matei Zaharia authored
    - Add job scheduling docs
    - Rename some fair scheduler properties
    - Organize intro page better
    - Link to Apache wiki for "contributing to Spark"
- Sep 07, 2013
  - Patrick Wendell authored
  - Patrick Wendell authored
  - Evan Chan authored
- Sep 06, 2013
  - Evan Chan authored