  1. Aug 09, 2014
    • [SPARK-2635] Fix race condition at SchedulerBackend.isReady in standalone mode · 28dbae85
      li-zhihui authored
      In SPARK-1946 (PR #900), the configuration `spark.scheduler.minRegisteredExecutorsRatio` was introduced. However, in standalone mode there is a race condition where isReady() can return true because totalExpectedExecutors has not been correctly set.
      
      Because the number of expected executors is uncertain in standalone mode, this PR uses CPU cores (`--total-executor-cores`) as the expected resources to judge whether the SchedulerBackend is ready.
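
      A minimal sketch of the readiness check this implies (field names are illustrative, not the patch's actual identifiers):

      ```
      // Consider the backend ready once registered cores reach the configured
      // fraction of the cores requested via --total-executor-cores.
      def sufficientResourcesRegistered(registeredCores: Int,
                                        totalExpectedCores: Int,
                                        minRegisteredRatio: Double): Boolean =
        registeredCores >= totalExpectedCores * minRegisteredRatio
      ```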
      
      Author: li-zhihui <zhihui.li@intel.com>
      Author: Li Zhihui <zhihui.li@intel.com>
      
      Closes #1525 from li-zhihui/fixre4s and squashes the following commits:
      
      e9a630b [Li Zhihui] Rename variable totalExecutors and clean codes
      abf4860 [Li Zhihui] Push down variable totalExpectedResources to children classes
      ca54bd9 [li-zhihui] Format log with String interpolation
      88c7dc6 [li-zhihui] Few codes and docs refactor
      41cf47e [li-zhihui] Fix race condition at SchedulerBackend.isReady in standalone mode
      28dbae85
  2. Aug 07, 2014
    • SPARK-2787: Make sort-based shuffle write files directly when there's no... · 6906b69c
      Matei Zaharia authored
      SPARK-2787: Make sort-based shuffle write files directly when there's no sorting/aggregation and # partitions is small
      
      As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the # of output partitions is small, and concatenate them at the end to produce a sorted file.
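
      The bypass condition described above reduces to a small predicate; a hedged sketch (names are illustrative, not the patch's actual identifiers):

      ```
      // Skip the merge-sort path only when the map side does no aggregation or
      // sorting and the number of output partitions is below a small threshold.
      def shouldBypassMergeSort(numPartitions: Int,
                                hasAggregator: Boolean,
                                hasOrdering: Boolean,
                                bypassThreshold: Int): Boolean =
        !hasAggregator && !hasOrdering && numPartitions <= bypassThreshold
      ```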
      
      On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1799 from mateiz/SPARK-2787 and squashes the following commits:
      
      88cf26a [Matei Zaharia] Fix rebase
      10233af [Matei Zaharia] Review comments
      398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf
      ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them
      d0ae3c5 [Matei Zaharia] Fix some comments
      90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests
      31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter
      6906b69c
  3. Aug 06, 2014
    • [SPARK-2157] Enable tight firewall rules for Spark · 09f7e458
      Andrew Or authored
      The goal of this PR is to allow users of Spark to write tight firewall rules for their clusters. This is currently not possible because Spark uses random ports in many places, notably the communication between executors and drivers. The changes in this PR are based on top of ash211's changes in #1107.
      
      The list covered here may or may not be the complete set of ports needed for Spark to operate perfectly. However, as of the latest commit there are no known sources of random ports (except in tests). I have not documented a few of the more obscure configs.
      
      My spark-env.sh looks like this:
      ```
      export SPARK_MASTER_PORT=6060
      export SPARK_WORKER_PORT=7070
      export SPARK_MASTER_WEBUI_PORT=9090
      export SPARK_WORKER_WEBUI_PORT=9091
      ```
      and my spark-defaults.conf looks like this:
      ```
      spark.master spark://andrews-mbp:6060
      spark.driver.port 5001
      spark.fileserver.port 5011
      spark.broadcast.port 5021
      spark.replClassServer.port 5031
      spark.blockManager.port 5041
      spark.executor.port 5051
      ```
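
      For illustration, the port-fallback behavior referenced in the commits below (`spark.ports.maxRetries`, treating port 0 specially) might look roughly like this sketch with assumed names, not the code merged in this PR:

      ```
      import java.net.BindException

      // Try startPort, then increment on bind collisions, up to maxRetries attempts.
      // Port 0 asks the OS for a random free port, so it is never incremented.
      def startServiceOnPort[T](startPort: Int, startService: Int => T, maxRetries: Int): T = {
        for (offset <- 0 to maxRetries) {
          val tryPort = if (startPort == 0) 0 else startPort + offset
          try {
            return startService(tryPort)
          } catch {
            case _: BindException if offset < maxRetries => // port taken; try the next one
          }
        }
        throw new BindException(s"Failed to bind a service starting at port $startPort")
      }
      ```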
      
      Author: Andrew Or <andrewor14@gmail.com>
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #1777 from andrewor14/configure-ports and squashes the following commits:
      
      621267b [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      8a6b820 [Andrew Or] Use a random UI port during tests
      7da0493 [Andrew Or] Fix tests
      523c30e [Andrew Or] Add test for isBindCollision
      b97b02a [Andrew Or] Minor fixes
      c22ad00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      93d359f [Andrew Or] Executors connect to wrong port when collision occurs
      d502e5f [Andrew Or] Handle port collisions when creating Akka systems
      a2dd05c [Andrew Or] Patrick's comment nit
      86461e2 [Andrew Or] Remove spark.executor.env.port and spark.standalone.client.port
      1d2d5c6 [Andrew Or] Fix ports for standalone cluster mode
      cb3be88 [Andrew Or] Various doc fixes (broken link, format etc.)
      e837cde [Andrew Or] Remove outdated TODOs
      bfbab28 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      de1b207 [Andrew Or] Update docs to reflect new ports
      b565079 [Andrew Or] Add spark.ports.maxRetries
      2551eb2 [Andrew Or] Remove spark.worker.watcher.port
      151327a [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      9868358 [Andrew Or] Add a few miscellaneous ports
      6016e77 [Andrew Or] Add spark.executor.port
      8d836e6 [Andrew Or] Also document SPARK_{MASTER/WORKER}_WEBUI_PORT
      4d9e6f3 [Andrew Or] Fix super subtle bug
      3f8e51b [Andrew Or] Correct erroneous docs...
      e111d08 [Andrew Or] Add names for UI services
      470f38c [Andrew Or] Special case non-"Address already in use" exceptions
      1d7e408 [Andrew Or] Treat 0 ports specially + return correct ConnectionManager port
      ba32280 [Andrew Or] Minor fixes
      6b550b0 [Andrew Or] Assorted fixes
      73fbe89 [Andrew Or] Move start service logic to Utils
      ec676f4 [Andrew Or] Merge branch 'SPARK-2157' of github.com:ash211/spark into configure-ports
      038a579 [Andrew Ash] Trust the server start function to report the port the service started on
      7c5bdc4 [Andrew Ash] Fix style issue
      0347aef [Andrew Ash] Unify port fallback logic to a single place
      24a4c32 [Andrew Ash] Remove type on val to match surrounding style
      9e4ad96 [Andrew Ash] Reformat for style checker
      5d84e0e [Andrew Ash] Document new port configuration options
      066dc7a [Andrew Ash] Fix up HttpServer port increments
      cad16da [Andrew Ash] Add fallover increment logic for HttpServer
      c5a0568 [Andrew Ash] Fix ConnectionManager to retry with increment
      b80d2fd [Andrew Ash] Make Spark's block manager port configurable
      17c79bb [Andrew Ash] Add a configuration option for spark-shell's class server
      f34115d [Andrew Ash] SPARK-1176 Add port configuration for HttpBroadcast
      49ee29b [Andrew Ash] SPARK-1174 Add port configuration for HttpFileServer
      1c0981a [Andrew Ash] Make port in HttpServer configurable
      09f7e458
  4. Aug 05, 2014
    • [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB. · acff9a7f
      Reynold Xin authored
      This can substantially reduce memory usage during shuffle.
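
      The saving compounds across concurrently open shuffle files. As an illustrative (not measured) example, assuming the prior default of 100 KB: 16 cores each writing to 1000 reduce partitions would hold roughly 16 x 1000 x 100 KB ~= 1.6 GB of buffers, versus about 0.5 GB at 32 KB. The matching spark-defaults.conf line is:

      ```
      spark.shuffle.file.buffer.kb 32
      ```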
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1781 from rxin/SPARK-2503-spark.shuffle.file.buffer.kb and squashes the following commits:
      
      104b8d8 [Reynold Xin] [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB.
      acff9a7f
    • SPARK-1680: use configs for specifying environment variables on YARN · 41e0a21b
      Thomas Graves authored
      Note that this also documents spark.executorEnv.*, which to me means it's public. If we don't want that, please speak up.
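
      A hedged example of the documented form (variable names and values are illustrative):

      ```
      spark.executorEnv.JAVA_HOME /usr/lib/jvm/java-7
      spark.executorEnv.MY_APP_SETTING some-value
      ```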
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1512 from tgravescs/SPARK-1680 and squashes the following commits:
      
      11525df [Thomas Graves] more doc changes
      553bad0 [Thomas Graves] fix documentation
      152bf7c [Thomas Graves] fix docs
      5382326 [Thomas Graves] try fix docs
      32f86a4 [Thomas Graves] use configs for specifying environment variables on YARN
      41e0a21b
    • SPARK-1890 and SPARK-1891- add admin and modify acls · 1c5555a2
      Thomas Graves authored
      It was easier to combine these 2 JIRAs since they touch many of the same places. This PR adds the following:
      
      - adds modify acls
      - adds admin acls (list of admins/users that get added to both view and modify acls)
      - modify Kill button on UI to take modify acls into account
      - changes the config name from spark.ui.acls.enable to spark.acls.enable, since I chose poorly with the original name. We keep backwards compatibility so people can still use spark.ui.acls.enable. The acls should apply to any web UI as well as any CLI interfaces (see the example after this list).
      - send view and modify acls information on to YARN so that YARN interfaces can use it (the YARN CLI for killing applications, for example).
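
      A sketch of the resulting spark-defaults.conf entries (usernames illustrative; `spark.admin.acls` and `spark.modify.acls` are the property names implied by the description, so treat them as assumptions):

      ```
      spark.acls.enable true
      spark.admin.acls alice,bob
      spark.modify.acls carol
      ```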
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1196 from tgravescs/SPARK-1890 and squashes the following commits:
      
      8292eb1 [Thomas Graves] review comments
      b92ec89 [Thomas Graves] remove unneeded variable from applistener
      4c765f4 [Thomas Graves] Add in admin acls
      72eb0ac [Thomas Graves] Add modify acls
      1c5555a2
    • [SPARK-2856] Decrease initial buffer size for Kryo to 64KB. · 184048f8
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1780 from rxin/kryo-init-size and squashes the following commits:
      
      551b935 [Reynold Xin] [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.
      184048f8
  5. Aug 04, 2014
    • SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter · 8e7d5ba1
      Matei Zaharia authored
      All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in ExternalAppendOnlyMap: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup, to make sure files are closed.
      
      In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1722 from mateiz/spark-2792 and squashes the following commits:
      
      5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
      18fe865 [Matei Zaharia] Update docs on objectStreamReset
      576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
      0374217 [Matei Zaharia] Remove super paranoid code to close file handles
      bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
      0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
      9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
      8e7d5ba1
  6. Aug 01, 2014
    • SPARK-2099. Report progress while task is running. · 8d338f64
      Sandy Ryza authored
      This is a sketch of a patch that allows the UI to show metrics for tasks that have not yet completed.  It adds a heartbeat every 2 seconds from the executors to the driver, reporting metrics for all of the executor's tasks.
      
      It still needs unit tests, polish, and cluster testing, but I wanted to put it up to get feedback on the approach.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1056 from sryza/sandy-spark-2099 and squashes the following commits:
      
      93b9fdb [Sandy Ryza] Up heartbeat interval to 10 seconds and other tidying
      132aec7 [Sandy Ryza] Heartbeat and HeartbeatResponse are already Serializable as case classes
      38dffde [Sandy Ryza] Additional review feedback and restore test that was removed in BlockManagerSuite
      51fa396 [Sandy Ryza] Remove hostname race, add better comments about threading, and some stylistic improvements
      3084f10 [Sandy Ryza] Make TaskUIData a case class again
      3bda974 [Sandy Ryza] Stylistic fixes
      0dae734 [Sandy Ryza] SPARK-2099. Report progress while task is running.
      8d338f64
  7. Jul 30, 2014
    • SPARK-2543: Allow user to set maximum Kryo buffer size · 7c5fc28a
      Koert Kuipers authored
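
      The description is terse, but per commit 15f6d81 in the squash list below, the cap became a configurable property; a minimal spark-defaults.conf line matching that default (name as given in the commit):

      ```
      spark.kryoserializer.buffer.max.mb 64
      ```
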
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #735 from koertkuipers/feat-kryo-max-buffersize and squashes the following commits:
      
      15f6d81 [Koert Kuipers] change default for spark.kryoserializer.buffer.max.mb to 64mb and add some documentation
      1bcc22c [Koert Kuipers] Merge branch 'master' into feat-kryo-max-buffersize
      0c9f8eb [Koert Kuipers] make default for kryo max buffer size 16MB
      143ec4d [Koert Kuipers] test resizable buffer in kryo Output
      0732445 [Koert Kuipers] support setting maxCapacity to something different than capacity in kryo Output
      7c5fc28a
  8. Jul 27, 2014
    • [SPARK-1777] Prevent OOMs from single partitions · ecf30ee7
      Andrew Or authored
      **Problem.** When caching, we currently unroll the entire RDD partition before making sure we have enough free memory. This is a common cause for OOMs especially when (1) the BlockManager has little free space left in memory, and (2) the partition is large.
      
      **Solution.** We maintain a global memory pool of `M` bytes shared across all threads, similar to the way we currently manage memory for shuffle aggregation. Then, while we unroll each partition, periodically check if there is enough space to continue. If not, drop enough RDD blocks to ensure we have at least `M` bytes to work with, then try again. If we still don't have enough space to unroll the partition, give up and drop the block to disk directly if applicable.
      
      **New configurations.**
      - `spark.storage.bufferFraction` - the value of `M` as a fraction of the storage memory. (default: 0.2)
      - `spark.storage.safetyFraction` - a margin of safety in case size estimation is slightly off. This is the equivalent of the existing `spark.shuffle.safetyFraction`. (default 0.9)
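
      A hedged sketch of the periodic check described under **Solution** (names and the size estimator are stand-ins, not the PR's code):

      ```
      // Unroll an iterator into memory, periodically asking the shared pool of M
      // bytes for more unroll memory; give up when refused, so the caller can
      // drop the block to disk if applicable.
      def unrollSafely[T](values: Iterator[T],
                          requestMemory: Long => Boolean,
                          checkPeriod: Int = 16): Option[Vector[T]] = {
        var buffer = Vector.empty[T]
        var count = 0L
        var reserved = 0L
        while (values.hasNext) {
          buffer = buffer :+ values.next()
          count += 1
          if (count % checkPeriod == 0) {
            val estimate = count * 1024L // crude stand-in for real size estimation
            if (estimate > reserved) {
              if (!requestMemory(estimate - reserved)) return None // not enough space
              reserved = estimate
            }
          }
        }
        Some(buffer)
      }
      ```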
      
      For more detail, see the [design document](https://issues.apache.org/jira/secure/attachment/12651793/spark-1777-design-doc.pdf). Tests pending for performance and memory usage patterns.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1165 from andrewor14/them-rdd-memories and squashes the following commits:
      
      e77f451 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      c7c8832 [Andrew Or] Simplify logic + update a few comments
      269d07b [Andrew Or] Very minor changes to tests
      6645a8a [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      b7e165c [Andrew Or] Add new tests for unrolling blocks
      f12916d [Andrew Or] Slightly clean up tests
      71672a7 [Andrew Or] Update unrollSafely tests
      369ad07 [Andrew Or] Correct ensureFreeSpace and requestMemory behavior
      f4d035c [Andrew Or] Allow one thread to unroll multiple blocks
      a66fbd2 [Andrew Or] Rename a few things + update comments
      68730b3 [Andrew Or] Fix weird scalatest behavior
      e40c60d [Andrew Or] Fix MIMA excludes
      ff77aa1 [Andrew Or] Fix tests
      1a43c06 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      b9a6eee [Andrew Or] Simplify locking behavior on unrollMemoryMap
      ed6cda4 [Andrew Or] Formatting fix (super minor)
      f9ff82e [Andrew Or] putValues -> putIterator + putArray
      beb368f [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      8448c9b [Andrew Or] Fix tests
      a49ba4d [Andrew Or] Do not expose unroll memory check period
      69bc0a5 [Andrew Or] Always synchronize on putLock before unrollMemoryMap
      3f5a083 [Andrew Or] Simplify signature of ensureFreeSpace
      dce55c8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      8288228 [Andrew Or] Synchronize put and unroll properly
      4f18a3d [Andrew Or] bufferFraction -> unrollFraction
      28edfa3 [Andrew Or] Update a few comments / log messages
      728323b [Andrew Or] Do not synchronize every 1000 elements
      5ab2329 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      129c441 [Andrew Or] Fix bug: Use toArray rather than array
      9a65245 [Andrew Or] Update a few comments + minor control flow changes
      57f8d85 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      abeae4f [Andrew Or] Add comment clarifying the MEMORY_AND_DISK case
      3dd96aa [Andrew Or] AppendOnlyBuffer -> Vector (+ a few small changes)
      f920531 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      0871835 [Andrew Or] Add an effective storage level interface to BlockManager
      64e7d4c [Andrew Or] Add/modify a few comments (minor)
      8af2f35 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      4f4834e [Andrew Or] Use original storage level for blocks dropped to disk
      ecc8c2d [Andrew Or] Fix binary incompatibility
      24185ea [Andrew Or] Avoid dropping a block back to disk if reading from disk
      2b7ee66 [Andrew Or] Fix bug in SizeTracking*
      9b9a273 [Andrew Or] Fix tests
      20eb3e5 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      649bdb3 [Andrew Or] Document spark.storage.bufferFraction
      a10b0e7 [Andrew Or] Add initial memory request threshold + rename a few things
      e9c3cb0 [Andrew Or] cacheMemoryMap -> unrollMemoryMap
      198e374 [Andrew Or] Unfold -> unroll
      0d50155 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      d9d02a8 [Andrew Or] Remove unused param in unfoldSafely
      ec728d8 [Andrew Or] Add tests for safe unfolding of blocks
      22b2209 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      078eb83 [Andrew Or] Add check for hasNext in PrimitiveVector.iterator
      0871535 [Andrew Or] Fix tests in BlockManagerSuite
      d68f31e [Andrew Or] Safely unfold blocks for all memory puts
      5961f50 [Andrew Or] Fix tests
      195abd7 [Andrew Or] Refactor: move unfold logic to MemoryStore
      1e82d00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      3ce413e [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      d5dd3b4 [Andrew Or] Free buffer memory in finally
      ea02eec [Andrew Or] Fix tests
      b8e1d9c [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      a8704c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      e1b8b25 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      87aa75c [Andrew Or] Fix mima excludes again (typo)
      11eb921 [Andrew Or] Clarify comment (minor)
      50cae44 [Andrew Or] Remove now duplicate mima exclude
      7de5ef9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      df47265 [Andrew Or] Fix binary incompatibility
      6d05a81 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      f94f5af [Andrew Or] Update a few comments (minor)
      776aec9 [Andrew Or] Prevent OOM if a single RDD partition is too large
      bbd3eea [Andrew Or] Fix CacheManagerSuite to use Array
      97ea499 [Andrew Or] Change BlockManager interface to use Arrays
      c12f093 [Andrew Or] Add SizeTrackingAppendOnlyBuffer and tests
      ecf30ee7
    • SPARK-2680: Lower spark.shuffle.memoryFraction to 0.2 by default · b547f69b
      Matei Zaharia authored
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1593 from mateiz/spark-2680 and squashes the following commits:
      
      3c949c4 [Matei Zaharia] Lower spark.shuffle.memoryFraction to 0.2 by default
      b547f69b
  9. Jul 26, 2014
    • [SPARK-2696] Reduce default value of spark.serializer.objectStreamReset · 66f26a46
      Hossein authored
      The current default value of spark.serializer.objectStreamReset is 10,000.
      When trying to re-partition a large file (e.g., 500MB) into, say, 64 partitions, with 1MB records, the serializer will cache 10000 x 1MB x 64 ~= 640 GB, which will cause out-of-memory errors.
      
      This patch sets the default to a more reasonable value (100).
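
      A one-line spark-defaults.conf override matching the new default (illustrative):

      ```
      spark.serializer.objectStreamReset 100
      ```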
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #1595 from falaki/objectStreamReset and squashes the following commits:
      
      650a935 [Hossein] Updated documentation
      1aa0df8 [Hossein] Reduce default value of spark.serializer.objectStreamReset
      66f26a46
  10. Jul 25, 2014
    • [SPARK-2538] [PySpark] Hash based disk spilling aggregation · 14174abd
      Davies Liu authored
      During aggregation in the Python worker, if the memory usage is above spark.executor.memory, it will do disk-spilling aggregation.

      It will split the aggregation into multiple stages; in each stage, it will partition the aggregated data by hash and dump it to disk. After all the data has been aggregated, it will merge all the stages together (partition by partition).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1460 from davies/spill and squashes the following commits:
      
      cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
      37d71f7 [Davies Liu] balance the partitions
      902f036 [Davies Liu] add shuffle.py into run-tests
      dcf03a9 [Davies Liu] fix memory_info() of psutil
      67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
      f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
      e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
      400be01 [Davies Liu] address all the comments
      6178844 [Davies Liu] refactor and improve docs
      fdd0a49 [Davies Liu] add long doc string for ExternalMerger
      1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
      e6cc7f9 [Davies Liu] Merge branch 'master' into spill
      3652583 [Davies Liu] address comments
      e78a0a0 [Davies Liu] fix style
      24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
      57ee7ef [Davies Liu] update docs
      286aaff [Davies Liu] let spilled aggregation in Python configurable
      e9a40f6 [Davies Liu] recursive merger
      6edbd1f [Davies Liu] Hash based disk spilling aggregation
      14174abd
  11. Jul 24, 2014
    • SPARK-2310. Support arbitrary Spark properties on the command line with ... · e34922a2
      Sandy Ryza authored
      ...spark-submit
      
      The PR allows invocations like
        spark-submit --class org.MyClass --spark.shuffle.spill false myjar.jar
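
      Note that commit 91b244a below changes the flag format, so the merged syntax presumably reads:

      ```
      spark-submit --class org.MyClass --conf spark.shuffle.spill=false myjar.jar
      ```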
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1253 from sryza/sandy-spark-2310 and squashes the following commits:
      
      1dc9855 [Sandy Ryza] More doc and cleanup
      00edfb9 [Sandy Ryza] Review comments
      91b244a [Sandy Ryza] Change format to --conf PROP=VALUE
      8fabe77 [Sandy Ryza] SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit
      e34922a2
  12. Jul 23, 2014
    • [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a... · efdaeb11
      Ian O Connell authored
      [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a resource pool in Spark SQL for Kryo instances.
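
      A hedged usage sketch of the new option (the registrator class name is hypothetical; only `spark.kryo.registrationRequired` is confirmed by this PR):

      ```
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrationRequired", "true") // fail fast on unregistered classes
        .set("spark.kryo.registrator", "com.example.MyKryoRegistrator") // hypothetical registrator
      ```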
      
      Author: Ian O Connell <ioconnell@twitter.com>
      
      Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:
      
      5498566 [Ian O Connell] Docs update suggested by Patrick
      20e8555 [Ian O Connell] Slight style change
      f92c294 [Ian O Connell] Add docs for new KryoSerializer option
      f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
      4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
      665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer
      efdaeb11
  13. Jul 16, 2014
    • [SPARK-2522] set default broadcast factory to torrent · 96f28c97
      Xiangrui Meng authored
      HttpBroadcastFactory is the current default broadcast factory. It sends the broadcast data to each worker one by one, which is slow when the cluster is big. TorrentBroadcastFactory scales much better than HTTP. Maybe we should make torrent the default broadcast method.
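
      For reference, the switch corresponds to this property (a minimal spark-defaults.conf line; the factory class name is the one shipped with Spark):

      ```
      spark.broadcast.factory org.apache.spark.broadcast.TorrentBroadcastFactory
      ```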
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1437 from mengxr/bt-broadcast and squashes the following commits:
      
      ed492fe [Xiangrui Meng] set default broadcast factory to torrent
      96f28c97
  14. Jul 15, 2014
  15. Jul 14, 2014
    • [SPARK-1946] Submit tasks after (configured ratio) executors have been registered · 3dd8af7a
      li-zhihui authored
      Because submitting tasks and registering executors are asynchronous, in most situations early stages' tasks run without preferred locality.

      A simple workaround is sleeping a few seconds in the application, so that executors have enough time to register.

      The PR adds 2 configuration properties to make the TaskScheduler submit tasks only after enough executors have been registered:

      ```
      # Submit tasks only after (registered executors / total executors) reaches this ratio; default 0
      spark.scheduler.minRegisteredExecutorsRatio = 0.8

      # Regardless of whether minRegisteredExecutorsRatio has been reached, submit tasks
      # after maxRegisteredWaitingTime milliseconds; default 30000
      spark.scheduler.maxRegisteredExecutorsWaitingTime = 5000
      ```
      
      Author: li-zhihui <zhihui.li@intel.com>
      
      Closes #900 from li-zhihui/master and squashes the following commits:
      
      b9f8326 [li-zhihui] Add logs & edit docs
      1ac08b1 [li-zhihui] Add new configs to user docs
      22ead12 [li-zhihui] Move waitBackendReady to postStartHook
      c6f0522 [li-zhihui] Bug fix: numExecutors wasn't set & use constant DEFAULT_NUMBER_EXECUTORS
      4d6d847 [li-zhihui] Move waitBackendReady to TaskSchedulerImpl.start & some code refactor
      0ecee9a [li-zhihui] Move waitBackendReady from DAGScheduler.submitStage to TaskSchedulerImpl.submitTasks
      4261454 [li-zhihui] Add docs for new configs & code style
      ce0868a [li-zhihui] Code style, rename configuration property name of minRegisteredRatio & maxRegisteredWaitingTime
      6cfb9ec [li-zhihui] Code style, revert default minRegisteredRatio of yarn to 0, driver get --num-executors in yarn/alpha
      812c33c [li-zhihui] Fix driver lost --num-executors option in yarn-cluster mode
      e7b6272 [li-zhihui] support yarn-cluster
      37f7dc2 [li-zhihui] support yarn mode(percentage style)
      3f8c941 [li-zhihui] submit stage after (configured ratio of) executors have been registered
      3dd8af7a
  16. Jul 10, 2014
    • [SPARK-1341] [Streaming] Throttle BlockGenerator to limit rate of data consumption. · 2dd67248
      Issac Buenrostro authored
      Author: Issac Buenrostro <buenrostro@ooyala.com>
      
      Closes #945 from ibuenros/SPARK-1341-throttle and squashes the following commits:
      
      5514916 [Issac Buenrostro] Formatting changes, added documentation for streaming throttling, stricter unit tests for throttling.
      62f395f [Issac Buenrostro] Add comments and license to streaming RateLimiter.scala
      7066438 [Issac Buenrostro] Moved throttle code to RateLimiter class, smoother pushing when throttling active
      ccafe09 [Issac Buenrostro] Throttle BlockGenerator to limit rate of data consumption.
      2dd67248
  17. Jun 10, 2014
    • [SPARK-1940] Enabling rolling of executor logs, and automatic cleanup of old executor logs · 4823bf47
      Tathagata Das authored
      Currently, in the default log4j configuration, all the executor logs get sent to the file `[executor-working-dir]/stderr`. This does not allow the log files to be rolled, so old logs cannot be removed.

      Using the log4j RollingFileAppender allows log4j logs to be rolled, but all the logs get sent to a different set of files, other than the files `stdout` and `stderr`. So the logs are no longer visible in the Spark web UI, as the Spark web UI only reads the files `stdout` and `stderr`. Furthermore, it still does not allow `stdout` and `stderr` to be cleared periodically in case a large amount of stuff gets written to them (e.g. by an explicit `println` inside a map function).

      This PR solves this by implementing a simple `RollingFileAppender` within Spark (disabled by default). When enabled (using the configuration parameter `spark.executor.rollingLogs.enabled`), the logs can get rolled over either by time interval (set with `spark.executor.rollingLogs.interval`, daily by default), or by size of logs (set with `spark.executor.rollingLogs.size`). Finally, old logs can be automatically deleted by specifying how many of the latest log files to keep (set with `spark.executor.rollingLogs.keepLastN`). The web UI has also been modified to show the logs across the rolled-over files.

      You can test this locally (without waiting a whole day) by setting the configuration `spark.executor.rollingLogs.enabled=true` and `spark.executor.rollingLogs.interval=minutely`. Continuously generate logs by running Spark jobs, and the generated log files will look like this (`stderr` and `stdout` are the most current log files being written to):
      
      ```
      stderr
      stderr--2014-05-27--14-37
      stderr--2014-05-27--14-47
      stderr--2014-05-27--15-05
      stdout
      stdout--2014-05-27--14-47
      ```
      
      The web UI should show logs across these files.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #895 from tdas/rolling-logs and squashes the following commits:
      
      fd8f87f [Tathagata Das] Minor change.
      d326aee [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      ad956c1 [Tathagata Das] Scala style fix.
      1f0a6ec [Tathagata Das] Some more changes based on Patrick's PR comments.
      c8bfe4e [Tathagata Das] Refactore FileAppender to a package spark.util.logging and broke up the file into multiple files. Changed configuration parameter names.
      4224409 [Tathagata Das] Style fix.
      108a9f8 [Tathagata Das] Added better constraint handling for rolling policies.
      f7da977 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      9134495 [Tathagata Das] Simplified rolling logs by removing Daily/Hourly/MinutelyRollingFileAppender, and removing the setting rollingLogs.enabled
      312d874 [Tathagata Das] Minor fixes based on PR comments.
      8a67d83 [Tathagata Das] Fixed comments.
      b36cfd6 [Tathagata Das] Implemented RollingPolicy, TimeBasedRollingPolicy and SizeBasedRollingPolicy, and changed RollingFileAppender accordingly.
      b7e8272 [Tathagata Das] Style fix,
      374c9a9 [Tathagata Das] Added missing license.
      24354ea [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      6cc09c7 [Tathagata Das] Fixed bugs in rolling logs, and added more debug statements.
      adf4910 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      931f8fb [Tathagata Das] Changed log viewer in Spark web UI to handle rolling log files.
      cb4fb6d [Tathagata Das] Added FileAppender and RollingFileAppender to generate rolling executor logs.
      4823bf47
  18. Jun 05, 2014
    • SPARK-1677: allow user to disable output dir existence checking · 89cdbb08
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-1677
      
      For compatibility with older versions of Spark, it would be nice to have an option `spark.hadoop.validateOutputSpecs` (default: true) for the user to disable the output directory existence checking.
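
      A minimal spark-defaults.conf line to opt out of the check, per the description above:

      ```
      spark.hadoop.validateOutputSpecs false
      ```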
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #947 from CodingCat/SPARK-1677 and squashes the following commits:
      
      7930f83 [CodingCat] miao
      c0c0e03 [CodingCat] bug fix and doc update
      5318562 [CodingCat] bug fix
      13219b5 [CodingCat] allow user to disable output dir existence checking
      89cdbb08
  19. May 31, 2014
    • Typo: and -> an · 9c1f204d
      Andrew Ash authored
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #927 from ash211/patch-5 and squashes the following commits:
      
      79b577d [Andrew Ash] Typo: and -> an
      9c1f204d
  20. May 30, 2014
    • [SPARK-1566] consolidate programming guide, and general doc updates · c8bf4131
      Matei Zaharia authored
      This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:
      
      * A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
      * New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
      * Spark-submit guide moved to a separate page and expanded slightly
      * Various cleanups of the menu system, security docs, and others
      * Updated look of title bar to differentiate the docs from previous Spark versions
      
      You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #896 from mateiz/1.0-docs and squashes the following commits:
      
      03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
      0779508 [Matei Zaharia] tweak
      ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
      1bf4112 [Matei Zaharia] Review comments
      4414f88 [Matei Zaharia] tweaks
      d04e979 [Matei Zaharia] Fix some old links to Java guide
      a34ed33 [Matei Zaharia] tweak
      541bb3b [Matei Zaharia] miscellaneous changes
      fcefdec [Matei Zaharia] Moved submitting apps to separate doc
      61d72b4 [Matei Zaharia] stuff
      181f217 [Matei Zaharia] migration guide, remove old language guides
      e11a0da [Matei Zaharia] Add more API functions
      6a030a9 [Matei Zaharia] tweaks
      8db0ae3 [Matei Zaharia] Added key-value pairs section
      318d2c9 [Matei Zaharia] tweaks
      1c81477 [Matei Zaharia] New section on basics and function syntax
      e38f559 [Matei Zaharia] Actually added programming guide to Git
      a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
      3b6a876 [Matei Zaharia] More CSS tweaks
      01ec8bf [Matei Zaharia] More CSS tweaks
      e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
      c8bf4131
  21. May 28, 2014
    • Organize configuration docs · 7801d44f
      Patrick Wendell authored
      This PR improves and organizes the config option page
      and makes a few other changes to config docs. See a preview here:
      http://people.apache.org/~pwendell/config-improvements/configuration.html
      
      The biggest changes are:
      1. The configs for the standalone master/workers were moved to the
      standalone page and out of the general config doc.
      2. SPARK_LOCAL_DIRS was missing from the standalone docs.
      3. Expanded discussion of injecting configs with spark-submit, including an
      example.
      4. Config options were organized into the following categories:
      - Runtime Environment
      - Shuffle Behavior
      - Spark UI
      - Compression and Serialization
      - Execution Behavior
      - Networking
      - Scheduling
      - Security
      - Spark Streaming
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #880 from pwendell/config-cleanup and squashes the following commits:
      
      93f56c3 [Patrick Wendell] Feedback from Matei
      6f66efc [Patrick Wendell] More feedback
      16ae776 [Patrick Wendell] Adding back header section
      d9c264f [Patrick Wendell] Small fix
      e0c1728 [Patrick Wendell] Response to Matei's review
      27d57db [Patrick Wendell] Reverting changes to index.html (covered in #896)
      e230ef9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
      a374369 [Patrick Wendell] Line wrapping fixes
      fdff7fc [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
      3289ea4 [Patrick Wendell] Pulling in changes from #856
      106ee31 [Patrick Wendell] Small link fix
      f7e79bc [Patrick Wendell] Re-organizing config options.
      54b184d [Patrick Wendell] Adding standalone configs to the standalone page
      592e94a [Patrick Wendell] Stash
      29b5446 [Patrick Wendell] Better discussion of spark-submit in configuration docs
      2d719ef [Patrick Wendell] Small fix
      4af9e07 [Patrick Wendell] Adding SPARK_LOCAL_DIRS docs
      204b248 [Patrick Wendell] Small fixes
      7801d44f
  22. May 25, 2014
    • SPARK-1903 Document Spark's network connections · 06595296
      Andrew Ash authored
      https://issues.apache.org/jira/browse/SPARK-1903
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #856 from ash211/SPARK-1903 and squashes the following commits:
      
      6e7782a [Andrew Ash] Add the technology used on each port
      1d9b5d3 [Andrew Ash] Document port for history server
      56193ee [Andrew Ash] spark.ui.port becomes worker.ui.port and master.ui.port
      a774c07 [Andrew Ash] Wording in network section
      90e8237 [Andrew Ash] Use real :toc instead of the hand-written one
      edaa337 [Andrew Ash] Master -> Standalone Cluster Master
      57e8869 [Andrew Ash] Port -> Default Port
      3d4d289 [Andrew Ash] Title to title case
      c7d42d9 [Andrew Ash] [WIP] SPARK-1903 Add initial port listing for documentation
      a416ae9 [Andrew Ash] Word wrap to 100 lines
      06595296
  23. May 21, 2014
    • Configuration documentation updates · 2a948e7e
      Reynold Xin authored
      
      1. Add `<code>` tags to configuration options
      2. List env variables in tabular format to be consistent with other pages.
      3. Moved Viewing Spark Properties section up.
      
      This is against branch-1.0, but should be cherry picked into master as well.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #851 from rxin/doc-config and squashes the following commits:
      
      28ac0d3 [Reynold Xin] Add <code> to configuration options, and list env variables in a table.
      
      (cherry picked from commit 75af8bd3)
      Signed-off-by: Reynold Xin <rxin@apache.org>
      2a948e7e
    • [Docs] Correct example of creating a new SparkConf · 1014668f
      Andrew Or authored
      The example code on the configuration page currently does not compile.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #842 from andrewor14/conf-docs and squashes the following commits:
      
      aabff57 [Andrew Or] Correct example of creating a new SparkConf
      1014668f
  24. May 15, 2014
    • SPARK-1860: Do not cleanup application work/ directories by default · bb98ecaf
      Aaron Davidson authored
      This causes an unrecoverable error for applications that run for longer than 7 days and have jars added to the SparkContext, as the jars are cleaned up even though the application is still running.
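
      For reference, if you do want periodic cleanup of work/ directories, the standalone toggle is of the form below (the property name is taken from Spark's standalone docs, not from this message, so treat it as an assumption):

      ```
      spark.worker.cleanup.enabled true
      ```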
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #800 from aarondav/shitty-defaults and squashes the following commits:
      
      a573fbb [Aaron Davidson] SPARK-1860: Do not cleanup application work/ directories by default
      bb98ecaf
  25. May 12, 2014
    • [SPARK-1753 / 1773 / 1814] Update outdated docs for spark-submit, YARN, standalone etc. · 2ffd1eaf
      Andrew Or authored
      YARN
      - SparkPi was updated to not take in master as an argument; we should update the docs to reflect that.
      - The default YARN build guide should be in Maven, not SBT.
      - This PR also adds a paragraph on steps to debug a YARN application.
      
      Standalone
      - Emphasize spark-submit more. Right now it's one small paragraph preceding the legacy way of launching through `org.apache.spark.deploy.Client`.
      - The way we set configurations / environment variables according to the old docs is outdated. This needs to reflect changes introduced by the Spark configuration changes we made.
      
      In general, this PR also adds a little more documentation on the new spark-shell, spark-submit, spark-defaults.conf etc here and there.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #701 from andrewor14/yarn-docs and squashes the following commits:
      
      e2c2312 [Andrew Or] Merge in changes in #752 (SPARK-1814)
      25cfe7b [Andrew Or] Merge in the warning from SPARK-1753
      a8c39c5 [Andrew Or] Minor changes
      336bbd9 [Andrew Or] Tabs -> spaces
      4d9d8f7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
      041017a [Andrew Or] Abstract Spark submit documentation to cluster-overview.html
      3cc0649 [Andrew Or] Detail how to set configurations + remove legacy instructions
      5b7140a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
      85a51fc [Andrew Or] Update run-example, spark-shell, configuration etc.
      c10e8c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
      381fe32 [Andrew Or] Update docs for standalone mode
      757c184 [Andrew Or] Add a note about the requirements for the debugging trick
      f8ca990 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
      924f04c [Andrew Or] Revert addition of --deploy-mode
      d5fe17b [Andrew Or] Update the YARN docs
      2ffd1eaf
  26. May 06, 2014
    • SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs · 25ad8f93
      Sean Owen authored
      While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs.
      
      Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #653 from srowen/SPARK-1727 and squashes the following commits:
      
      6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
      8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
      99966a9 [Sean Owen] Update issue tracker URL in docs
      23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
      8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs
      25ad8f93
  27. May 05, 2014
    • [SPARK-1504], [SPARK-1505], [SPARK-1558] Updated Spark Streaming guide · a975a19f
      Tathagata Das authored
      - SPARK-1558: Updated custom receiver guide to match it with the new API
      - SPARK-1504: Added deployment and monitoring subsection to streaming
      - SPARK-1505: Added migration guide for migrating from 0.9.x and below to Spark 1.0
      - Updated various Java streaming examples to use JavaReceiverInputDStream to highlight the API change.
      - Removed the requirement for cleaner ttl from streaming guide
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #652 from tdas/doc-fix and squashes the following commits:
      
      cb4f4b7 [Tathagata Das] Possible fix for flaky graceful shutdown test.
      ab71f7f [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into doc-fix
      8d6ff9b [Tathagata Das] Addded migration guide to Spark Streaming.
      7d171df [Tathagata Das] Added reference to JavaReceiverInputStream in examples and streaming guide.
      49edd7c [Tathagata Das] Change java doc links to use Java docs.
      11528d7 [Tathagata Das] Updated links on index page.
      ff80970 [Tathagata Das] More updates to streaming guide.
      4dc42e9 [Tathagata Das] Added monitoring and other documentation in the streaming guide.
      14c6564 [Tathagata Das] Updated custom receiver guide.
      a975a19f
    • Updated doc for spark.closure.serializer to indicate that only the Java serializer works. · f2eb070a
      Reynold Xin authored
      See discussion from http://apache-spark-developers-list.1001551.n3.nabble.com/bug-using-kryo-as-closure-serializer-td6473.html
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #642 from rxin/docs-ser and squashes the following commits:
      
      a507db5 [Reynold Xin] Use "Java" instead of default.
      5eb8cdd [Reynold Xin] Updated doc for spark.closure.serializer to indicate only the default serializer work.
      f2eb070a
  28. Apr 27, 2014
    • SPARK-1145: Memory mapping with many small blocks can cause JVM allocation failures · 6b3c6e5d
      Patrick Wendell authored
      This includes some minor code clean-up as well. The main change is that small files are not memory mapped. There is a nicer way to write that code block using Scala's `Try` but to make it easy to back port and as simple as possible, I opted for the more explicit but less pretty format.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #43 from pwendell/block-iter-logging and squashes the following commits:
      
      1cff512 [Patrick Wendell] Small issue from merge.
      49f6c269 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into block-iter-logging
      4943351 [Patrick Wendell] Added a test and feedback on mateis review
      a637a18 [Patrick Wendell] Review feedback and adding rewind() when reading byte buffers.
      b76b95f [Patrick Wendell] Review feedback
      4e1514e [Patrick Wendell] Don't memory map for small files
      d238b88 [Patrick Wendell] Some logging and clean-up
      6b3c6e5d
  29. Apr 24, 2014
    • [SPARK-1592][streaming] Automatically remove streaming input blocks · 526a518b
      Tathagata Das authored
      The raw input data is stored as blocks in BlockManagers. Earlier they were cleared by the cleaner TTL. Now, since streaming does not require the cleaner TTL to be set, the blocks do not get cleared. This increases Spark's memory usage, which is not even accounted for and shown in the Spark storage UI. It may cause the data blocks to spill over to disk, which eventually slows down the receiving of data (persisting to memory becomes bottlenecked by writing to disk).
      
      The solution in this PR is to automatically remove those blocks. The mechanism to keep track of which BlockRDDs (which present the raw data blocks as an RDD) can be safely cleared already exists; this PR uses it to explicitly remove blocks from BlockRDDs.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #512 from tdas/block-rdd-unpersist and squashes the following commits:
      
      d25e610 [Tathagata Das] Merge remote-tracking branch 'apache/master' into block-rdd-unpersist
      5f46d69 [Tathagata Das] Merge remote-tracking branch 'apache/master' into block-rdd-unpersist
      2c320cd [Tathagata Das] Updated configuration with spark.streaming.unpersist setting.
      2d4b2fd [Tathagata Das] Automatically removed input blocks
      526a518b
  30. Apr 21, 2014
    • [SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs · fc783847
      Matei Zaharia authored
      I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is an SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one.
      
      Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/
      
      Author: Matei Zaharia <matei@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Patrick Wendell <pwendell@gmail.com>
      
      Closes #457 from mateiz/better-docs and squashes the following commits:
      
      a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package
      5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load
      f05abc0 [Matei Zaharia] Don't include java.lang package names
      995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc
      a14a93c [Matei Zaharia] typo
      76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java
      ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs
      acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced
      fc783847
    • Clean up and simplify Spark configuration · fb98488f
      Patrick Wendell authored
      Over time, as we've added more deployment modes, things have gotten a bit unwieldy with user-facing configuration options in Spark. Going forward we'll advise all users to run `spark-submit` to launch applications. This is a WIP patch but it makes the following improvements:
      
      1. Improved `spark-env.sh.template` which was missing a lot of things users now set in that file.
      2. Removes the shipping of SPARK_CLASSPATH, SPARK_JAVA_OPTS, and SPARK_LIBRARY_PATH to the executors on the cluster. This was an ugly hack. Instead it introduces the config variables spark.executor.extraJavaOpts, spark.executor.extraLibraryPath, and spark.executor.extraClassPath (see the example after this list).
      3. Adds ability to set these same variables for the driver using `spark-submit`.
      4. Allows you to load system properties from a `spark-defaults.conf` file when running `spark-submit`. This will allow setting both SparkConf options and other system properties utilized by `spark-submit`.
      5. Made `SPARK_LOCAL_IP` an environment variable rather than a SparkConf property. This is more consistent with it being set on each node.
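
      A hedged spark-defaults.conf sketch combining items 2 and 4 above (values are illustrative; property names as given in the list):

      ```
      spark.executor.extraJavaOpts -XX:+PrintGCDetails
      spark.executor.extraClassPath /opt/extras/extra.jar
      spark.executor.extraLibraryPath /opt/native/lib
      ```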
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #299 from pwendell/config-cleanup and squashes the following commits:
      
      127f301 [Patrick Wendell] Improvements to testing
      a006464 [Patrick Wendell] Moving properties file template.
      b4b496c [Patrick Wendell] spark-defaults.properties -> spark-defaults.conf
      0086939 [Patrick Wendell] Minor style fixes
      af09e3e [Patrick Wendell] Mention config file in docs and clean-up docs
      b16e6a2 [Patrick Wendell] Cleanup of spark-submit script and Scala quick start guide
      af0adf7 [Patrick Wendell] Automatically add user jar
      a56b125 [Patrick Wendell] Responses to Tom's review
      d50c388 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
      a762901 [Patrick Wendell] Fixing test failures
      ffa00fe [Patrick Wendell] Review feedback
      fda0301 [Patrick Wendell] Note
      308f1f6 [Patrick Wendell] Properly escape quotes and other clean-up for YARN
      e83cd8f [Patrick Wendell] Changes to allow re-use of test applications
      be42f35 [Patrick Wendell] Handle case where SPARK_HOME is not set
      c2a2909 [Patrick Wendell] Test compile fixes
      4ee6f9d [Patrick Wendell] Making YARN doc changes consistent
      afc9ed8 [Patrick Wendell] Cleaning up line limits and two compile errors.
      b08893b [Patrick Wendell] Additional improvements.
      ace4ead [Patrick Wendell] Responses to review feedback.
      b72d183 [Patrick Wendell] Review feedback for spark env file
      46555c1 [Patrick Wendell] Review feedback and import clean-ups
      437aed1 [Patrick Wendell] Small fix
      761ebcd [Patrick Wendell] Library path and classpath for drivers
      7cc70e4 [Patrick Wendell] Clean up terminology inside of spark-env script
      5b0ba8e [Patrick Wendell] Don't ship executor envs
      84cc5e5 [Patrick Wendell] Small clean-up
      1f75238 [Patrick Wendell] SPARK_JAVA_OPTS --> SPARK_MASTER_OPTS for master settings
      4982331 [Patrick Wendell] Remove SPARK_LIBRARY_PATH
      6eaf7d0 [Patrick Wendell] executorJavaOpts
      0faa3b6 [Patrick Wendell] Stash of adding config options in submit script and YARN
      ac2d65e [Patrick Wendell] Change spark.local.dir -> SPARK_LOCAL_DIRS
      fb98488f
  31. Apr 16, 2014
    • update spark.default.parallelism · 9edd8878
      Chen Chao authored
      Actually, the value 8 is only valid in Mesos fine-grained mode:

      ```
      override def defaultParallelism() = sc.conf.getInt("spark.default.parallelism", 8)
      ```

      while in coarse-grained mode, including Mesos coarse-grained, the value of the property depends on the number of cores:

      ```
      override def defaultParallelism(): Int = {
        conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
      }
      ```
      
      Author: Chen Chao <crazyjvm@gmail.com>
      
      Closes #389 from CrazyJvm/patch-2 and squashes the following commits:
      
      84a7fe4 [Chen Chao] miss </li> at the end of every single line
      04a9796 [Chen Chao] change format
      ee0fae0 [Chen Chao] update spark.default.parallelism
      9edd8878
  32. Apr 10, 2014
    • SPARK-1202 - Add a "cancel" button in the UI for stages · 2c557837
      Sundeep Narravula authored
      Author: Sundeep Narravula <sundeepn@superduel.local>
      Author: Sundeep Narravula <sundeepn@dhcpx-204-110.corp.yahoo.com>
      
      Closes #246 from sundeepn/uikilljob and squashes the following commits:
      
      5fdd0e2 [Sundeep Narravula] Fix test string
      f6fdff1 [Sundeep Narravula] Format fix; reduced line size to less than 100 chars
      d1daeb9 [Sundeep Narravula] Incorporating review comments.
      8d97923 [Sundeep Narravula] Ability to kill jobs thru the UI. This behavior can be turned on by setting the following variable: spark.ui.killEnabled=true (default=false). Adding DAGScheduler event StageCancelled and corresponding handlers. Added cancellation reason to handlers.
      2c557837