  1. Apr 21, 2014
    • Clean up and simplify Spark configuration · fb98488f
      Patrick Wendell authored
      Over time, as we've added more deployment modes, user-facing configuration options in Spark have gotten a bit unwieldy. Going forward we'll advise all users to run `spark-submit` to launch applications. This is a WIP patch, but it makes the following improvements:
      
      1. Improved `spark-env.sh.template` which was missing a lot of things users now set in that file.
      2. Removes the shipping of SPARK_CLASSPATH, SPARK_JAVA_OPTS, and SPARK_LIBRARY_PATH to the executors on the cluster. This was an ugly hack. Instead it introduces config variables spark.executor.extraJavaOpts, spark.executor.extraLibraryPath, and spark.executor.extraClassPath.
      3. Adds ability to set these same variables for the driver using `spark-submit`.
      4. Allows you to load system properties from a `spark-defaults.conf` file when running `spark-submit`. This lets you set both SparkConf options and other system properties used by `spark-submit` (see the sketch after this list).
      5. Made `SPARK_LOCAL_IP` an environment variable rather than a SparkConf property. This is more consistent with it being set on each node.
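
      For illustration, a minimal sketch of such a `spark-defaults.conf` and a matching launch command. The values, application jar, and main class are hypothetical, and the property names are as spelled in this commit (the finalized names in released Spark may differ slightly):

        # conf/spark-defaults.conf
        spark.master                     spark://master:7077
        spark.executor.extraJavaOpts     -XX:+PrintGCDetails
        spark.executor.extraClassPath    /opt/libs/extra.jar
        spark.executor.extraLibraryPath  /opt/native/lib

        # Settings from spark-defaults.conf are picked up automatically:
        ./bin/spark-submit --class org.example.MyApp my-app.jar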
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #299 from pwendell/config-cleanup and squashes the following commits:
      
      127f301 [Patrick Wendell] Improvements to testing
      a006464 [Patrick Wendell] Moving properties file template.
      b4b496c [Patrick Wendell] spark-defaults.properties -> spark-defaults.conf
      0086939 [Patrick Wendell] Minor style fixes
      af09e3e [Patrick Wendell] Mention config file in docs and clean-up docs
      b16e6a2 [Patrick Wendell] Cleanup of spark-submit script and Scala quick start guide
      af0adf7 [Patrick Wendell] Automatically add user jar
      a56b125 [Patrick Wendell] Responses to Tom's review
      d50c388 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
      a762901 [Patrick Wendell] Fixing test failures
      ffa00fe [Patrick Wendell] Review feedback
      fda0301 [Patrick Wendell] Note
      308f1f6 [Patrick Wendell] Properly escape quotes and other clean-up for YARN
      e83cd8f [Patrick Wendell] Changes to allow re-use of test applications
      be42f35 [Patrick Wendell] Handle case where SPARK_HOME is not set
      c2a2909 [Patrick Wendell] Test compile fixes
      4ee6f9d [Patrick Wendell] Making YARN doc changes consistent
      afc9ed8 [Patrick Wendell] Cleaning up line limits and two compile errors.
      b08893b [Patrick Wendell] Additional improvements.
      ace4ead [Patrick Wendell] Responses to review feedback.
      b72d183 [Patrick Wendell] Review feedback for spark env file
      46555c1 [Patrick Wendell] Review feedback and import clean-ups
      437aed1 [Patrick Wendell] Small fix
      761ebcd [Patrick Wendell] Library path and classpath for drivers
      7cc70e4 [Patrick Wendell] Clean up terminology inside of spark-env script
      5b0ba8e [Patrick Wendell] Don't ship executor envs
      84cc5e5 [Patrick Wendell] Small clean-up
      1f75238 [Patrick Wendell] SPARK_JAVA_OPTS --> SPARK_MASTER_OPTS for master settings
      4982331 [Patrick Wendell] Remove SPARK_LIBRARY_PATH
      6eaf7d0 [Patrick Wendell] executorJavaOpts
      0faa3b6 [Patrick Wendell] Stash of adding config options in submit script and YARN
      ac2d65e [Patrick Wendell] Change spark.local.dir -> SPARK_LOCAL_DIRS
      fb98488f
  2. Apr 16, 2014
    • misleading task number of groupByKey · 9c40b9ea
      Chen Chao authored
      "By default, this uses only 8 parallel tasks to do the grouping." is a big misleading. Please refer to https://github.com/apache/spark/pull/389
      
      The detail is in the following code:
      
        def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
          // Prefer an existing partitioner, checking the RDD with the most partitions first.
          val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
          for (r <- bySize if r.partitioner.isDefined) {
            return r.partitioner.get
          }
          // Otherwise use spark.default.parallelism if set; else fall back to the
          // largest upstream partition count -- not a fixed 8.
          if (rdd.context.conf.contains("spark.default.parallelism")) {
            new HashPartitioner(rdd.context.defaultParallelism)
          } else {
            new HashPartitioner(bySize.head.partitions.size)
          }
        }
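
      A minimal sketch of the resulting behavior (master, app name, and sizes are hypothetical): with no upstream partitioner and spark.default.parallelism unset, groupByKey inherits the largest parent partition count rather than a fixed 8.

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.SparkContext._ // pair-RDD operations

        val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("demo"))
        // 16 input partitions; no partitioner is set on `pairs`.
        val pairs = sc.parallelize(1 to 100, 16).map(x => (x % 10, x))
        println(pairs.groupByKey().partitions.size) // 16, from bySize.head.partitions.size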
      
      Author: Chen Chao <crazyjvm@gmail.com>
      
      Closes #403 from CrazyJvm/patch-4 and squashes the following commits:
      
      42f6c9e [Chen Chao] fix format
      829a995 [Chen Chao] fix format
      1568336 [Chen Chao] misleading task number of groupByKey
      9c40b9ea
  3. Apr 07, 2014
    • SPARK-1099: Introduce local[*] mode to infer number of cores · 0307db0f
      Aaron Davidson authored
      This is the default mode for running spark-shell and pyspark. It is intended to let users running Spark for the first time see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly 1 core.
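
      A brief sketch of the difference (app name hypothetical; behavior as described above):

        import org.apache.spark.{SparkConf, SparkContext}

        // "local[*]" sizes the local scheduler to every available core;
        // plain "local" keeps the old behavior of exactly one core.
        val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("demo"))
        println(sc.defaultParallelism) // typically Runtime.getRuntime.availableProcessors()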
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #182 from aarondav/110 and squashes the following commits:
      
      a88294c [Aaron Davidson] Rebased changes for new spark-shell
      a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores
      0307db0f
  4. Apr 04, 2014
    • SPARK-1305: Support persisting RDD's directly to Tachyon · b50ddfde
      Haoyuan Li authored
      Moves PR #468 of apache-incubator-spark to apache-spark:
      "Adding an option to persist Spark RDD blocks into Tachyon."
      
      Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
      Author: RongGu <gurongwalker@gmail.com>
      
      Closes #158 from RongGu/master and squashes the following commits:
      
      72b7768 [Haoyuan Li] merge master
      9f7fa1b [Haoyuan Li] fix code style
      ae7834b [Haoyuan Li] minor cleanup
      a8b3ec6 [Haoyuan Li] merge master branch
      e0f4891 [Haoyuan Li] better check offheap.
      55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
      7cd4600 [RongGu] remove some logic code for tachyonstore's replication
      51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
      8adfcfa [RongGu] address arron's comment on inTachyonSize
      120e48a [RongGu] changed the root-level dir name in Tachyon
      5cc041c [Haoyuan Li] address aaron's comments
      9b97935 [Haoyuan Li] address aaron's comments
      d9a6438 [Haoyuan Li] fix for pspark
      77d2703 [Haoyuan Li] change python api.git status
      3dcace4 [Haoyuan Li] address matei's comments
      91fa09d [Haoyuan Li] address patrick's comments
      589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
      64348b2 [Haoyuan Li] update conf docs.
      ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
      619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
      be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
      49cc724 [Haoyuan Li] update docs with off_headp option
      4572f9f [RongGu] reserving the old apply function API of StorageLevel
      04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
      c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
      76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
      e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
      fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
      939e467 [Haoyuan Li] 0.4.1-thrift from maven central
      86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
      16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
      eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
      6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      d827250 [RongGu] fix JsonProtocolSuie test failure
      716e93b [Haoyuan Li] revert the version
      ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
      2825a13 [RongGu] up-merging to the current master branch of the apache spark
      6a22c1a [Haoyuan Li] fix scalastyle
      8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
      77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
      1dcadf9 [Haoyuan Li] typo
      bf278fa [Haoyuan Li] fix python tests
      e82909c [Haoyuan Li] minor cleanup
      776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
      8859371 [Haoyuan Li] various minor fixes and clean up
      e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
      fcaeab2 [Haoyuan Li] address Aaron's comment
      e554b1e [Haoyuan Li] add python code
      47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
      dc8ef24 [Haoyuan Li] add old storelevel constructor
      e01a271 [Haoyuan Li] update tachyon 0.4.1
      8011a96 [RongGu] fix a brought-in mistake in StorageLevel
      70ca182 [RongGu] a bit change in comment
      556978b [RongGu] fix the scalastyle errors
      791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
      b50ddfde
  5. Aug 29, 2013
    • Change build and run instructions to use assemblies · 53cd50c0
      Matei Zaharia authored
      This commit makes Spark invocation saner by using an assembly JAR to
      find all of Spark's dependencies instead of adding all the JARs in
      lib_managed. It also packages the examples into an assembly and uses
      that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script
      with two better-named scripts: "run-examples" for examples, and
      "spark-class" for Spark internal classes (e.g. REPL, master, etc). This
      is also designed to minimize the confusion people have in trying to use
      "run" to run their own classes; it's not meant to do that, but now at
      least if they look at it, they can modify run-examples to do a decent
      job for them.
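
      A sketch of the two new entry points (example class and arguments are hypothetical, for the package layout of this era):

        # Run a bundled example against a local master.
        ./run-examples spark.examples.SparkPi local[2]

        # Launch a Spark internal class, e.g. the standalone master.
        ./spark-class spark.deploy.master.Master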
      
      As part of this, Bagel's examples are also now properly moved to the
      examples package instead of bagel.
      53cd50c0
  6. Oct 09, 2012
    • Updates to documentation: · bc0bc672
      Matei Zaharia authored
      - Edited quick start and tuning guide to simplify them a little
      - Simplified top menu bar
      - Made private a SparkContext constructor parameter that was left as
        public
      - Various small fixes
      bc0bc672