  1. Aug 20, 2014
    • SPARK-3092 [SQL]: Always include the thriftserver when -Phive is enabled. · f2f26c2a
      Patrick Wendell authored
Currently we have a separate profile called hive-thriftserver. I originally suggested this in case users did not want to bundle the thriftserver, but it's ultimately led to a lot of confusion. Since the thriftserver is only a few classes, I don't see a really good reason to isolate it from the rest of Hive. So let's go ahead and just include it in the same profile to simplify things.
      
      This has been suggested in the past by liancheng.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2006 from pwendell/hiveserver and squashes the following commits:
      
      742ea40 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into hiveserver
      034ad47 [Patrick Wendell] SPARK-3092: Always include the thriftserver when -Phive is enabled.
    • [DOCS] Fixed wrong links · 8a74e4b2
      Ken Takagiwa authored
      Author: Ken Takagiwa <ugw.gi.world@gmail.com>
      
      Closes #2042 from giwa/patch-1 and squashes the following commits:
      
      216fe0e [Ken Takagiwa] Fixed wrong links
  2. Aug 19, 2014
    • [SPARK-3130][MLLIB] detect negative values in naive Bayes · 068b6fe6
      Xiangrui Meng authored
Detect and reject negative feature values during training, because NB treats feature values as term frequencies. jkbradley
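
A hedged sketch of the behavior this adds (the check is described above; treating the failure as an error raised at training time is an assumption):

```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Multinomial NB treats features as term frequencies, so negative values
// are meaningless; after this change training should fail fast on them.
def example(sc: SparkContext): Unit = {
  val bad = LabeledPoint(1.0, Vectors.dense(1.0, -2.0, 3.0))
  NaiveBayes.train(sc.parallelize(Seq(bad))) // expected to raise an error now
}
```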
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2038 from mengxr/nb-neg and squashes the following commits:
      
      52c37c3 [Xiangrui Meng] address comments
      65f892d [Xiangrui Meng] detect negative values in nb
    • [SPARK-3112][MLLIB] Add documentation and example for StreamingLR · c7252b00
      freeman authored
Added a documentation section on StreamingLR to the ``MLlib - Linear Methods`` guide, including a worked example.
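
For context, a minimal sketch of the documented usage pattern (paths and the feature count are hypothetical; assumes a StreamingContext `ssc` and text streams of `LabeledPoint.parse`-able lines):

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.StreamingContext

def example(ssc: StreamingContext): Unit = {
  val trainingData = ssc.textFileStream("/training/dir").map(LabeledPoint.parse)
  val testData = ssc.textFileStream("/test/dir").map(LabeledPoint.parse)

  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(3)) // weights sized to the feature count

  model.trainOn(trainingData)      // update the model on each training batch
  model.predictOn(testData).print()
}
```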
      
      mengxr tdas
      
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #2047 from freeman-lab/streaming-lr-docs and squashes the following commits:
      
      568d250 [freeman] Tweaks to wording / formatting
      05a1139 [freeman] Added documentation and example for StreamingLR
    • [SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs · 825d4fe4
      Xiangrui Meng authored
Though we don't use default arguments for methods in RandomRDDs, they are still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users should expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in the Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation. brkyvz
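
A sketch of the Scala side (assuming a live SparkContext `sc`); Java callers would use the dedicated Java-friendly variants, such as a `normalJavaRDD` returning `JavaDoubleRDD`:

```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs

def example(sc: SparkContext): Unit = {
  // 1,000,000 samples from N(0, 1) in 10 partitions; RDD[Double] in Scala
  val data = RandomRDDs.normalRDD(sc, 1000000L, 10)
  println(data.take(5).mkString(", "))
}
```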
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2041 from mengxr/stat-doc and squashes the following commits:
      
      fc5eedf [Xiangrui Meng] add missing comma
      ffde810 [Xiangrui Meng] address comments
      aef6d07 [Xiangrui Meng] add doc for random data generation
      b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs
    • SPARK-2333 - spark_ec2 script should allow option for existing security group · 94053a7b
      Vida Ha authored
          - Uses the name tag to identify machines in a cluster.
          - Allows overriding the security group name so it doesn't need to coincide with the cluster name.
    - Outputs the request IDs of up to 10 pending spot instance requests.
      
      Author: Vida Ha <vida@databricks.com>
      
      Closes #1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:
      
      c80d5c3 [Vida Ha] wrap retries in a try catch block
      b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group
  3. Aug 18, 2014
    • Fix typo in decision tree docs · cd0720ca
      Matt Forbes authored
      Candidate splits were inconsistent with the example.
      
      Author: Matt Forbes <matt@tellapart.com>
      
      Closes #1837 from emef/tree-doc and squashes the following commits:
      
      3be14a1 [Matt Forbes] Fix typo in decision tree docs
    • SPARK-3025 [SQL]: Allow JDBC clients to set a fair scheduler pool · 6bca8898
      Patrick Wendell authored
      This definitely needs review as I am not familiar with this part of Spark.
      I tested this locally and it did seem to work.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1937 from pwendell/scheduler and squashes the following commits:
      
      b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair scheduler pool
    • [SPARK-2842][MLlib] Word2Vec documentation · eef779b8
      Liquan Pei authored
      mengxr
      Documentation for Word2Vec
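
A minimal sketch of the documented API (assuming a live SparkContext `sc`; the corpus path and query word are hypothetical):

```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.Word2Vec

def example(sc: SparkContext): Unit = {
  // A corpus of tokenized sentences: RDD[Seq[String]]
  val corpus = sc.textFile("/path/to/corpus.txt").map(_.split(" ").toSeq)
  val model = new Word2Vec().fit(corpus)
  // Nearest neighbors of a word in the learned vector space
  model.findSynonyms("spark", 5).foreach { case (word, sim) => println(s"$word: $sim") }
}
```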
      
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #2003 from Ishiihara/Word2Vec-doc and squashes the following commits:
      
      4ff11d4 [Liquan Pei] minor fix
      8d7458f [Liquan Pei] code reformat
      6df0dcb [Liquan Pei] add Word2Vec documentation
  4. Aug 17, 2014
    • [SPARK-1981] updated streaming-kinesis.md · 99243288
      Chris Fregly authored
fixed markup, separated out sections more clearly, and added more thorough explanations
      
      Author: Chris Fregly <chris@fregly.com>
      
      Closes #1757 from cfregly/master and squashes the following commits:
      
      9b1c71a [Chris Fregly] better explained why spark checkpoints are disabled in the example (due to no stateful operations being used)
      0f37061 [Chris Fregly] SPARK-1981:  (Kinesis streaming support) updated streaming-kinesis.md
      862df67 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      8e1ae2e [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
      0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
      691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
      0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
      e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
      d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
      912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
      db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
      338997e [Chris Fregly] improve build docs for kinesis
      828f8ae [Chris Fregly] more cleanup
      e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      cd68c0d [Chris Fregly] fixed typos and backward compatibility
      d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
  5. Aug 16, 2014
    • [SPARK-2677] BasicBlockFetchIterator#next can wait forever · 76fa0eaf
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1632 from sarutak/SPARK-2677 and squashes the following commits:
      
      cddbc7b [Kousuke Saruta] Removed Exception throwing when ConnectionManager#handleMessage receives ack for non-referenced message
      d3bd2a8 [Kousuke Saruta] Modified configuration.md for spark.core.connection.ack.timeout
      e85f88b [Kousuke Saruta] Removed useless synchronized blocks
      7ed48be [Kousuke Saruta] Modified ConnectionManager to use ackTimeoutMonitor ConnectionManager-wide
      9b620a6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
      0dd9ad3 [Kousuke Saruta] Modified typo in ConnectionManagerSuite.scala
      7cbb8ca [Kousuke Saruta] Modified to match with scalastyle
      8a73974 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
      ade279a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
      0174d6a [Kousuke Saruta] Modified ConnectionManager.scala to handle the case remote Executor cannot ack
      a454239 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
      9b7b7c1 [Kousuke Saruta] (WIP) Modifying ConnectionManager.scala
  6. Aug 14, 2014
    • [SPARK-3029] Disable local execution of Spark jobs by default · d069c5d9
      Aaron Davidson authored
Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst-case scenarios occur if the RDD is cached (guaranteed to load the whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead.
      
Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track and reason about.
      
This PR adds a flag that gates local execution, which is turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.
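
A hedged sketch of opting back into the old behavior; the property name here is an assumption based on this description, not confirmed by the text:

```
import org.apache.spark.SparkConf

// Assumed flag name: local execution of take()/first() is OFF by default
// after this change and must be enabled explicitly.
val conf = new SparkConf().set("spark.localExecution.enabled", "true")
```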
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1321 from aarondav/allowlocal and squashes the following commits:
      
      136b253 [Aaron Davidson] Fix DAGSchedulerSuite
      5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default
    • [Docs] Add missing <code> tags (minor) · e4245656
      Andrew Or authored
      These configs looked inconsistent from the rest.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1936 from andrewor14/docs-code and squashes the following commits:
      
      15f578a [Andrew Or] Add <code> tag
  7. Aug 13, 2014
    • [SPARK-2963] [SQL] There is no documentation about building to use HiveServer and CLI for SparkSQL · 869f06c7
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1885 from sarutak/SPARK-2963 and squashes the following commits:
      
      ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
      07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
      6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
      c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
    • [SPARK-2953] Allow using short names for io compression codecs · 676f9828
      Reynold Xin authored
      Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".
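
A small sketch of the shorthand this enables:

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.io.compression.codec", "lz4") // short name; "lzf" and "snappy" also work
// previously required:
// .set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
```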
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1873 from rxin/compressionCodecShortForm and squashes the following commits:
      
      9f50962 [Reynold Xin] Specify short-form compression codec names first.
      63f78ee [Reynold Xin] Updated configuration documentation.
      47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs
  8. Aug 12, 2014
    • SPARK-2830 [MLlib]: re-organize mllib documentation · c235b83e
      Ameet Talwalkar authored
      As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.
      
      Author: Ameet Talwalkar <atalwalkar@gmail.com>
      
      Closes #1908 from atalwalkar/master and squashes the following commits:
      
      fe6938a [Ameet Talwalkar] made xiangruis suggested changes
      840028b [Ameet Talwalkar] made xiangruis suggested changes
      7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
  9. Aug 09, 2014
    • [SPARK-2635] Fix race condition at SchedulerBackend.isReady in standalone mode · 28dbae85
      li-zhihui authored
In SPARK-1946 (PR #900), the configuration <code>spark.scheduler.minRegisteredExecutorsRatio</code> was introduced. However, in standalone mode, there is a race condition where isReady() can return true prematurely because totalExpectedExecutors has not yet been set correctly.

Because the number of expected executors is uncertain in standalone mode, this PR tries to use CPU cores (<code>--total-executor-cores</code>) as the expected resources to judge whether SchedulerBackend is ready.
      
      Author: li-zhihui <zhihui.li@intel.com>
      Author: Li Zhihui <zhihui.li@intel.com>
      
      Closes #1525 from li-zhihui/fixre4s and squashes the following commits:
      
      e9a630b [Li Zhihui] Rename variable totalExecutors and clean codes
      abf4860 [Li Zhihui] Push down variable totalExpectedResources to children classes
      ca54bd9 [li-zhihui] Format log with String interpolation
      88c7dc6 [li-zhihui] Few codes and docs refactor
      41cf47e [li-zhihui] Fix race condition at SchedulerBackend.isReady in standalone mode
  10. Aug 07, 2014
    • SPARK-2787: Make sort-based shuffle write files directly when there's no sorting/aggregation and # partitions is small · 6906b69c
      Matei Zaharia authored
      SPARK-2787: Make sort-based shuffle write files directly when there's no sorting/aggregation and # partitions is small
      
      As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the # of output partitions is small, and concatenate them at the end to produce a sorted file.
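
A hedged configuration sketch; the short names come from a commit in this PR's squash list ("allow short names for them"):

```
import org.apache.spark.SparkConf

// With the sort manager, shuffles that do no aggregation or ordering and
// have few output partitions take the new direct-file bypass path.
val conf = new SparkConf().set("spark.shuffle.manager", "sort") // or "hash"
```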
      
      On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1799 from mateiz/SPARK-2787 and squashes the following commits:
      
      88cf26a [Matei Zaharia] Fix rebase
      10233af [Matei Zaharia] Review comments
      398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf
      ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them
      d0ae3c5 [Matei Zaharia] Fix some comments
      90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests
      31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter
  11. Aug 06, 2014
    • [SPARK-2157] Enable tight firewall rules for Spark · 09f7e458
      Andrew Or authored
      The goal of this PR is to allow users of Spark to write tight firewall rules for their clusters. This is currently not possible because Spark uses random ports in many places, notably the communication between executors and drivers. The changes in this PR are based on top of ash211's changes in #1107.
      
The list covered here may or may not be the complete set of ports needed for Spark to operate perfectly. However, as of the latest commit there are no known sources of random ports (except in tests). I have not documented a few of the more obscure configs.
      
      My spark-env.sh looks like this:
      ```
      export SPARK_MASTER_PORT=6060
      export SPARK_WORKER_PORT=7070
      export SPARK_MASTER_WEBUI_PORT=9090
      export SPARK_WORKER_WEBUI_PORT=9091
      ```
      and my spark-defaults.conf looks like this:
      ```
      spark.master spark://andrews-mbp:6060
      spark.driver.port 5001
      spark.fileserver.port 5011
      spark.broadcast.port 5021
      spark.replClassServer.port 5031
      spark.blockManager.port 5041
      spark.executor.port 5051
      ```
      
      Author: Andrew Or <andrewor14@gmail.com>
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #1777 from andrewor14/configure-ports and squashes the following commits:
      
      621267b [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      8a6b820 [Andrew Or] Use a random UI port during tests
      7da0493 [Andrew Or] Fix tests
      523c30e [Andrew Or] Add test for isBindCollision
      b97b02a [Andrew Or] Minor fixes
      c22ad00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      93d359f [Andrew Or] Executors connect to wrong port when collision occurs
      d502e5f [Andrew Or] Handle port collisions when creating Akka systems
      a2dd05c [Andrew Or] Patrick's comment nit
      86461e2 [Andrew Or] Remove spark.executor.env.port and spark.standalone.client.port
      1d2d5c6 [Andrew Or] Fix ports for standalone cluster mode
      cb3be88 [Andrew Or] Various doc fixes (broken link, format etc.)
      e837cde [Andrew Or] Remove outdated TODOs
      bfbab28 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      de1b207 [Andrew Or] Update docs to reflect new ports
      b565079 [Andrew Or] Add spark.ports.maxRetries
      2551eb2 [Andrew Or] Remove spark.worker.watcher.port
      151327a [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      9868358 [Andrew Or] Add a few miscellaneous ports
      6016e77 [Andrew Or] Add spark.executor.port
      8d836e6 [Andrew Or] Also document SPARK_{MASTER/WORKER}_WEBUI_PORT
      4d9e6f3 [Andrew Or] Fix super subtle bug
      3f8e51b [Andrew Or] Correct erroneous docs...
      e111d08 [Andrew Or] Add names for UI services
      470f38c [Andrew Or] Special case non-"Address already in use" exceptions
      1d7e408 [Andrew Or] Treat 0 ports specially + return correct ConnectionManager port
      ba32280 [Andrew Or] Minor fixes
      6b550b0 [Andrew Or] Assorted fixes
      73fbe89 [Andrew Or] Move start service logic to Utils
      ec676f4 [Andrew Or] Merge branch 'SPARK-2157' of github.com:ash211/spark into configure-ports
      038a579 [Andrew Ash] Trust the server start function to report the port the service started on
      7c5bdc4 [Andrew Ash] Fix style issue
      0347aef [Andrew Ash] Unify port fallback logic to a single place
      24a4c32 [Andrew Ash] Remove type on val to match surrounding style
      9e4ad96 [Andrew Ash] Reformat for style checker
      5d84e0e [Andrew Ash] Document new port configuration options
      066dc7a [Andrew Ash] Fix up HttpServer port increments
      cad16da [Andrew Ash] Add fallover increment logic for HttpServer
      c5a0568 [Andrew Ash] Fix ConnectionManager to retry with increment
      b80d2fd [Andrew Ash] Make Spark's block manager port configurable
      17c79bb [Andrew Ash] Add a configuration option for spark-shell's class server
      f34115d [Andrew Ash] SPARK-1176 Add port configuration for HttpBroadcast
      49ee29b [Andrew Ash] SPARK-1174 Add port configuration for HttpFileServer
      1c0981a [Andrew Ash] Make port in HttpServer configurable
  12. Aug 05, 2014
    • [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB. · acff9a7f
      Reynold Xin authored
      This can substantially reduce memory usage during shuffle.
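
A sketch for jobs that prefer the previous behavior (property name from the title; the old default of 100 is an assumption):

```
import org.apache.spark.SparkConf

// Buffer size in KB per shuffle file writer; larger values trade driver-side
// memory for fewer disk writes.
val conf = new SparkConf().set("spark.shuffle.file.buffer.kb", "100")
```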
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1781 from rxin/SPARK-2503-spark.shuffle.file.buffer.kb and squashes the following commits:
      
      104b8d8 [Reynold Xin] [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB.
    • SPARK-1680: use configs for specifying environment variables on YARN · 41e0a21b
      Thomas Graves authored
Note that this also documents spark.executorEnv.*, which to me means it's public. If we don't want that, please speak up.
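
A hedged sketch of the pattern being documented (the variable names and paths are hypothetical):

```
import org.apache.spark.SparkConf

// spark.executorEnv.[Name] sets environment variable Name on the executors.
val conf = new SparkConf()
  .set("spark.executorEnv.JAVA_HOME", "/usr/lib/jvm/java-7")
  .set("spark.executorEnv.PYTHONPATH", "/opt/libs/python")
```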
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1512 from tgravescs/SPARK-1680 and squashes the following commits:
      
      11525df [Thomas Graves] more doc changes
      553bad0 [Thomas Graves] fix documentation
      152bf7c [Thomas Graves] fix docs
      5382326 [Thomas Graves] try fix docs
      32f86a4 [Thomas Graves] use configs for specifying environment variables on YARN
    • SPARK-2380: Support displaying accumulator values in the web UI · 74f82c71
      Patrick Wendell authored
This patch adds support for giving accumulators user-visible names and displaying accumulator values in the web UI. This allows users to create custom counters that can display in the UI. The current approach displays both the accumulator deltas caused by each task and a "current" value of the accumulator totals for each stage, which gets updated as tasks finish.
      
Currently in Spark, developers have been extending the `TaskMetrics` functionality to provide custom instrumentation for RDDs. This provides a potentially nicer alternative: going through the existing accumulator framework (actually `TaskMetrics` and accumulators are on an awkward collision course as we add more features to the former). The current patch demos how we can use the feature to provide instrumentation for RDD input sizes. The nice thing about going through accumulators is that users can read the current value of the data being tracked in their programs. This could be useful to e.g. decide to short-circuit a Spark stage depending on how things are going.
      
      ![counters](https://cloud.githubusercontent.com/assets/320616/3488815/6ee7bc34-0505-11e4-84ce-e36d9886e2cf.png)
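
A minimal sketch of the user-facing API this describes (assuming a live SparkContext `sc`; the accumulator name is hypothetical):

```
import org.apache.spark.SparkContext

def example(sc: SparkContext): Unit = {
  // Giving the accumulator a name is what makes it visible in the web UI.
  val records = sc.accumulator(0, "Records Seen")
  sc.parallelize(1 to 1000).foreach(_ => records += 1)
  println(records.value) // the driver can also read the running total
}
```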
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1309 from pwendell/metrics and squashes the following commits:
      
      8815308 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into HEAD
      93fbe0f [Patrick Wendell] Other minor fixes
      cc43f68 [Patrick Wendell] Updating unit tests
      c991b1b [Patrick Wendell] Moving some code into the Accumulators class
      9a9ba3c [Patrick Wendell] More merge fixes
      c5ace9e [Patrick Wendell] More merge conflicts
      1da15e3 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into metrics
      9860c55 [Patrick Wendell] Potential solution to posting listener events
      0bb0e33 [Patrick Wendell] Remove "display" variable and assume display = name.isDefined
      0ec4ac7 [Patrick Wendell] Java API's
      e95bf69 [Patrick Wendell] Stash
      be97261 [Patrick Wendell] Style fix
      8407308 [Patrick Wendell] Removing examples in Hadoop and RDD class
      64d405f [Patrick Wendell] Adding missing file
      5d8b156 [Patrick Wendell] Changes based on Kay's review.
      9f18bad [Patrick Wendell] Minor style changes and tests
      7a63abc [Patrick Wendell] Adding Json serialization and responding to Reynold's feedback
      ad85076 [Patrick Wendell] Example of using named accumulators for custom RDD metrics.
      0b72660 [Patrick Wendell] Initial WIP example of supporing globally named accumulators.
    • [SPARK-2859] Update url of Kryo project in related docs · ac3440f4
      Guancheng (G.C.) Chen authored
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-2859
      
      Kryo project has been migrated from googlecode to github, hence we need to update its URL in related docs such as tuning.md.
      
      Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com>
      
      Closes #1782 from gchen/kryo-docs and squashes the following commits:
      
      b62543c [Guancheng (G.C.) Chen] update url of Kryo project
    • SPARK-1890 and SPARK-1891 - add admin and modify acls · 1c5555a2
      Thomas Graves authored
It was easier to combine these two JIRAs since they touch many of the same places. This PR adds the following:
      
      - adds modify acls
      - adds admin acls (list of admins/users that get added to both view and modify acls)
      - modify Kill button on UI to take modify acls into account
- changes the config name spark.ui.acls.enable to spark.acls.enable, since I chose poorly with the original name. We keep backwards compatibility so people can still use spark.ui.acls.enable. The acls should apply to any web UI as well as any CLI interfaces.
- sends view and modify acls information on to YARN so that YARN interfaces can use them (the YARN CLI for killing applications, for example); see the configuration sketch after this list.
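
A hedged configuration sketch of the properties described above (user names are hypothetical; the admin/modify property names follow the description and should be treated as assumptions):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.acls.enable", "true")       // new name; spark.ui.acls.enable still works
  .set("spark.admin.acls", "alice")       // added to both view and modify acls
  .set("spark.modify.acls", "bob,carol")  // may e.g. use the UI's Kill button
```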
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1196 from tgravescs/SPARK-1890 and squashes the following commits:
      
      8292eb1 [Thomas Graves] review comments
      b92ec89 [Thomas Graves] remove unneeded variable from applistener
      4c765f4 [Thomas Graves] Add in admin acls
      72eb0ac [Thomas Graves] Add modify acls
    • SPARK-1528 - spark on yarn, add support for accessing remote HDFS · 2c0f705e
      Thomas Graves authored
Add a config (spark.yarn.access.namenodes) to allow applications running on YARN to access other secure HDFS clusters. Users just specify the namenodes of the other clusters, and we obtain tokens for them and ship them with the Spark application.
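
A hedged sketch of the config (hostnames are hypothetical):

```
import org.apache.spark.SparkConf

// Delegation tokens are fetched for each listed namenode and shipped with the app.
val conf = new SparkConf()
  .set("spark.yarn.access.namenodes",
       "hdfs://nn1.example.com:8020,hdfs://nn2.example.com:8020")
```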
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1159 from tgravescs/spark-1528 and squashes the following commits:
      
      ddbcd16 [Thomas Graves] review comments
      0ac8501 [Thomas Graves] SPARK-1528 - add support for accessing remote HDFS
    • [SPARK-2856] Decrease initial buffer size for Kryo to 64KB. · 184048f8
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1780 from rxin/kryo-init-size and squashes the following commits:
      
      551b935 [Reynold Xin] [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.
    • [SPARK-2857] Correct properties to set Master / Worker ports · a646a365
      Andrew Or authored
      `master.ui.port` and `worker.ui.port` were never picked up by SparkConf, simply because they are not prefixed with "spark." Unfortunately, this is also currently the documented way of setting these values.
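
A hedged sketch of the corrected, "spark."-prefixed forms; the exact property names are an assumption inferred from the description:

```
import org.apache.spark.SparkConf

// The unprefixed master.ui.port / worker.ui.port were silently ignored.
val conf = new SparkConf()
  .set("spark.master.ui.port", "9090")
  .set("spark.worker.ui.port", "9091")
```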
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1779 from andrewor14/master-worker-port and squashes the following commits:
      
      8475e95 [Andrew Or] Update docs to reflect changes in configs
      4db3d5d [Andrew Or] Stop using configs that don't actually work
  13. Aug 04, 2014
    • SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter · 8e7d5ba1
      Matei Zaharia authored
All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in ExternalAppendOnlyMap: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup to make sure files are closed.
      
      In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1722 from mateiz/spark-2792 and squashes the following commits:
      
      5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
      18fe865 [Matei Zaharia] Update docs on objectStreamReset
      576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
      0374217 [Matei Zaharia] Remove super paranoid code to close file handles
      bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
      0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
      9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
  14. Aug 03, 2014
    • [SPARK-2784][SQL] Deprecate hql() method in favor of a config option, 'spark.sql.dialect' · 236dfac6
      Michael Armbrust authored
Many users have reported being confused by the distinction between the `sql` and `hql` methods.  Specifically, many users think that `sql(...)` cannot be used to read Hive tables.  In this PR I introduce a new configuration option `spark.sql.dialect` that picks which dialect will be used for parsing.  For SQLContext this must be set to `sql`.  In `HiveContext` it defaults to `hiveql` but can also be set to `sql`.
      
      The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated.
      
      **This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.**
      
For example: `hiveContext.sql("SELECT 1")` will now throw a parsing exception by default.
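
A sketch of the new switch (assuming a live SparkContext `sc` and a Hive table `src`):

```
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

def example(sc: SparkContext): Unit = {
  val hiveContext = new HiveContext(sc)
  hiveContext.sql("SELECT key, value FROM src") // parsed as hiveql by default
  hiveContext.setConf("spark.sql.dialect", "sql")
  hiveContext.sql("SELECT 1")                   // now parsed by the basic sql dialect
}
```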
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits:
      
      ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf
      20c43f8 [Michael Armbrust] override function instead of just setting the value
      7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
    • SPARK-2712 - Add a small note to maven doc that mvn package must happen before test · f8cd143b
      Stephen Boesch authored
Per a request by Reynold, adding a small note about the proper sequencing of build, then test.
      
      Author: Stephen Boesch <javadba@gmail.com>
      
      Closes #1615 from javadba/docs and squashes the following commits:
      
      6c3183e [Stephen Boesch] Moved updated testing blurb per PWendell
      5764757 [Stephen Boesch] SPARK-2712 - Add a small note to maven doc that mvn package must happen before test
  15. Aug 02, 2014
    • [SPARK-2739][SQL] Rename registerAsTable to registerTempTable · 1a804373
      Michael Armbrust authored
      There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle.  This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening.  `registerAsTable` remains, but will cause a deprecation warning.
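
A minimal sketch of the rename (assuming a SQLContext `sqlContext` and a SchemaRDD `people`):

```
import org.apache.spark.sql.{SQLContext, SchemaRDD}

def example(sqlContext: SQLContext, people: SchemaRDD): Unit = {
  people.registerTempTable("people")  // new name, same behavior
  // people.registerAsTable("people") // still works, with a deprecation warning
  sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19").collect()
}
```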
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1743 from marmbrus/registerTempTable and squashes the following commits:
      
      d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
      4dff086 [Michael Armbrust] Fix .java files too
      89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
      0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
    • [SPARK-1981] Add AWS Kinesis streaming support · 91f9504e
      Chris Fregly authored
      Author: Chris Fregly <chris@fregly.com>
      
      Closes #1434 from cfregly/master and squashes the following commits:
      
      4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
      0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
      691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
      0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
      e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
      d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
      912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
      db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
      338997e [Chris Fregly] improve build docs for kinesis
      828f8ae [Chris Fregly] more cleanup
      e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      cd68c0d [Chris Fregly] fixed typos and backward compatibility
      d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
  16. Aug 01, 2014
    • [SQL] Documentation: Explain cacheTable command · c82fe478
      CrazyJvm authored
      add the `cacheTable` specification
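
A minimal sketch of the command being documented (assuming a table already registered as "people"):

```
import org.apache.spark.sql.SQLContext

def example(sqlContext: SQLContext): Unit = {
  sqlContext.cacheTable("people")   // cache the table in memory
  sqlContext.sql("SELECT COUNT(*) FROM people").collect() // served from the cache
  sqlContext.uncacheTable("people") // release it
}
```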
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1681 from CrazyJvm/sql-programming-guide-cache and squashes the following commits:
      
      0a231e0 [CrazyJvm] grammar fixes
      a04020e [CrazyJvm] modify title to Cached tables
      18b6594 [CrazyJvm] fix format
      2cbbf58 [CrazyJvm] add cacheTable guide
    • SPARK-2099. Report progress while task is running. · 8d338f64
      Sandy Ryza authored
      This is a sketch of a patch that allows the UI to show metrics for tasks that have not yet completed.  It adds a heartbeat every 2 seconds from the executors to the driver, reporting metrics for all of the executor's tasks.
      
      It still needs unit tests, polish, and cluster testing, but I wanted to put it up to get feedback on the approach.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1056 from sryza/sandy-spark-2099 and squashes the following commits:
      
      93b9fdb [Sandy Ryza] Up heartbeat interval to 10 seconds and other tidying
      132aec7 [Sandy Ryza] Heartbeat and HeartbeatResponse are already Serializable as case classes
      38dffde [Sandy Ryza] Additional review feedback and restore test that was removed in BlockManagerSuite
      51fa396 [Sandy Ryza] Remove hostname race, add better comments about threading, and some stylistic improvements
      3084f10 [Sandy Ryza] Make TaskUIData a case class again
      3bda974 [Sandy Ryza] Stylistic fixes
      0dae734 [Sandy Ryza] SPARK-2099. Report progress while task is running.
  17. Jul 31, 2014
    • Docs: monitoring, streaming programming guide · cc820502
      kballou authored
      Fix several awkward wordings and grammatical issues in the following
      documents:
      
      *   docs/monitoring.md
      
      *   docs/streaming-programming-guide.md
      
      Author: kballou <kballou@devnulllabs.io>
      
      Closes #1662 from kennyballou/grammar_fixes and squashes the following commits:
      
      e1b8ad6 [kballou] Docs: monitoring, streaming programming guide
    • [SPARK-2397][SQL] Deprecate LocalHiveContext · 72cfb139
      Michael Armbrust authored
LocalHiveContext is redundant with HiveContext.  The only difference is that it creates `./metastore` instead of `./metastore_db`.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1641 from marmbrus/localHiveContext and squashes the following commits:
      
      e5ec497 [Michael Armbrust] Add deprecation version
      626e056 [Michael Armbrust] Don't remove from imports yet
      905cc5f [Michael Armbrust] Merge remote-tracking branch 'apache/master' into localHiveContext
      1c2727e [Michael Armbrust] Deprecate LocalHiveContext
    • automatically set master according to `spark.master` in `spark-defaults.conf` · 669e3f05
      CrazyJvm authored
      automatically set master according to `spark.master` in `spark-defaults.conf`
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1644 from CrazyJvm/standalone-guide and squashes the following commits:
      
      bb12b95 [CrazyJvm] automatically set master according to `spark.master` in `spark-defaults.conf`
  18. Jul 30, 2014
    • [SPARK-2024] Add saveAsSequenceFile to PySpark · 94d1f46f
      Kan Zhang authored
      JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024
      
      This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats.
      
      * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.
      
      * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.
      
* No out-of-the-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples.
      
      * Added HBase and Cassandra output examples to show how custom output formats and converters can be used.
      
      cc MLnick mateiz ahirreddy pwendell
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:
      
      c01e3ef [Kan Zhang] [SPARK-2024] code formatting
      6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
      d998ad6 [Kan Zhang] [SPARK-2024] refectoring to get method params below 10
      57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
      75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
      0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
      9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
      0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
      7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
    • SPARK-2543: Allow user to set maximum Kryo buffer size · 7c5fc28a
      Koert Kuipers authored
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #735 from koertkuipers/feat-kryo-max-buffersize and squashes the following commits:
      
      15f6d81 [Koert Kuipers] change default for spark.kryoserializer.buffer.max.mb to 64mb and add some documentation
      1bcc22c [Koert Kuipers] Merge branch 'master' into feat-kryo-max-buffersize
      0c9f8eb [Koert Kuipers] make default for kryo max buffer size 16MB
      143ec4d [Koert Kuipers] test resizable buffer in kryo Output
      0732445 [Koert Kuipers] support setting maxCapacity to something different than capacity in kryo Output
  19. Jul 28, 2014
    • [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix) · a7a9d144
      Cheng Lian authored
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
Another try for #1399 & #1600. Those two PRs broke Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module was defined outside the `hive-thriftserver` profile. Thus every pull request that didn't touch SQL code would also execute the test suites defined in `hive-thriftserver`, and the tests failed because the related .class files were not included in the assembly jar.
      
      In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:
      
      629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
      ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server