  1. Aug 01, 2014
  2. Jul 31, 2014
    • [SPARK-2782][mllib] Bug fix for getRanks in SpearmanCorrelation · c4755403
      Doris Xin authored
      Before this patch, getRanks computed the wrong rank when numPartitions >= size of the input RDDs. Unit tests were added to cover this bug.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1710 from dorx/correlationBug and squashes the following commits:
      
      733def4 [Doris Xin] bugs and reviewer comments.
      31db920 [Doris Xin] revert unnecessary change
      043ff83 [Doris Xin] bug fix for spearman corner case
      c4755403
    • [SPARK-2777][MLLIB] change ALS factors storage level to MEMORY_AND_DISK · b1900832
      Xiangrui Meng authored
      Now the factors are persisted in memory only. If they get evicted by later jobs, we might have to restart the computation from the very beginning. A better solution is changing the storage level to `MEMORY_AND_DISK`.
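      
      A minimal sketch of the change in isolation (the helper is illustrative, not the actual ALS internals):
      
      ```
      import org.apache.spark.rdd.RDD
      import org.apache.spark.storage.StorageLevel
      
      // Persist a factor RDD so that partitions evicted from memory spill to
      // disk instead of forcing recomputation from the start of the lineage.
      def persistFactors(factors: RDD[(Int, Array[Double])]): RDD[(Int, Array[Double])] =
        factors.persist(StorageLevel.MEMORY_AND_DISK)
      ```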
      
      srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1700 from mengxr/als-level and squashes the following commits:
      
      c103d76 [Xiangrui Meng] change ALS factors storage level to MEMORY_AND_DISK
      b1900832
    • SPARK-2766: ScalaReflectionSuite throws an IllegalArgumentException in JDK 6 · 9998efab
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1683 from witgo/SPARK-2766 and squashes the following commits:
      
      d0db00c [GuoQiang Li] ScalaReflectionSuite throws an IllegalArgumentException in JDK 6
      9998efab
    • [SPARK-2779] [SQL] asInstanceOf[Map[...]] should use scala.collection.Map... · 9632719c
      Yin Huai authored
      [SPARK-2779] [SQL] asInstanceOf[Map[...]] should use scala.collection.Map instead of scala.collection.immutable.Map
      
      Since we let users create Rows, it makes sense to accept mutable Maps as values of MapType columns.
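      
      A small sketch of the effect (illustrative values only):
      
      ```
      import scala.collection.{Map => CMap}
      
      // scala.collection.Map is a supertype of both mutable and immutable maps,
      // so a cast to it accepts either kind of value.
      val immutableValue: Any = Map(1 -> "a")
      val mutableValue: Any = scala.collection.mutable.Map(2 -> "b")
      
      val m1 = immutableValue.asInstanceOf[CMap[Int, String]] // ok
      val m2 = mutableValue.asInstanceOf[CMap[Int, String]]   // ok
      // A cast to scala.collection.immutable.Map would throw a
      // ClassCastException for the mutable value.
      ```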
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2779
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1705 from yhuai/SPARK-2779 and squashes the following commits:
      
      00d72fd [Yin Huai] Use scala.collection.Map.
      9632719c
    • [SPARK-2756] [mllib] Decision tree bug fixes · b124de58
      Joseph K. Bradley authored
      (1) Inconsistent aggregate (agg) indexing for unordered features.
      (2) Fixed gain calculations for edge cases.
      (3) Off-by-one error in choosing thresholds for continuous features for small datasets.
      (4) (not a bug) Changed meaning of tree depth by 1 to fit scikit-learn and rpart. (Depth 1 used to mean 1 leaf node; depth 0 now means 1 leaf node.)
      
      Other updates, to help with tests:
      * Updated DecisionTreeRunner to print more info.
      * Added utility functions to DecisionTreeModel: toString, depth, numNodes
      * Improved internal DecisionTree documentation
      
      Bug fix details:
      
      (1) Indexing was inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).
      
      * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
      ** featureValue was from arr (so it was a feature value)
      ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
      * The rest of the code indexed agg as (node, feature, binIndex, label).
      * Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.
      
      Unit tests in DecisionTreeSuite
      * Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
      * Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.
      
      (2) Bug fix: calculateGainForSplit (for classification):
      * It used to return dummy prediction values when either the right or left children had 0 weight.  These were incorrect for multiclass classification.  It has been corrected.
      
      Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).
      
      Small updates to documentation and coding style.
      
      (3) Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
      
      * Exhibited bug in new test in DecisionTreeSuite: “stump with 1 continuous variable for binary classification, to check off-by-1 error”
      * Description: When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values.
      * Fix: The threshold is set to be the average of 2 consecutive (sorted) examples’ feature values.  E.g.: If the old code set the threshold using example i, the new code sets it to the average of examples i and i+1 (see the sketch after this list).
      * Note: In 4 DecisionTreeSuite tests with all labels identical, removed check of threshold since it is somewhat arbitrary.
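      
      An illustrative sketch of the corrected thresholding (a hypothetical helper, not the actual findSplitsBins code):
      
      ```
      // Candidate thresholds are midpoints of consecutive distinct sorted
      // feature values, rather than the examples' own feature values.
      def candidateThresholds(featureValues: Array[Double]): Array[Double] =
        featureValues.sorted.sliding(2).collect {
          case Array(a, b) if a != b => (a + b) / 2.0
        }.toArray
      ```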
      
      CC: mengxr manishamde  Please let me know if I missed something!
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1673 from jkbradley/decisiontree-bugfix and squashes the following commits:
      
      2b20c61 [Joseph K. Bradley] Small doc and style updates
      dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
      8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
      376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
      59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
      52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
      8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
      2283df8 [Joseph K. Bradley] 2 bug fixes.
      73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
      b124de58
    • [SPARK-2724] Python version of RandomRDDGenerators · d8430148
      Doris Xin authored
      A Python version of RandomRDDGenerators, but without support for randomRDD and randomVectorRDD, which take in an arbitrary DistributionGenerator.
      
      `randomRDD.py` is named to avoid collision with the built-in Python `random` package.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1628 from dorx/pythonRDD and squashes the following commits:
      
      55c6de8 [Doris Xin] review comments. all python units passed.
      f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
      2d73917 [Doris Xin] fix for linalg.py
      8663e6a [Doris Xin] reverting back to a single python file for random
      f47c481 [Doris Xin] docs update
      687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
      4338f40 [Doris Xin] renamed randomRDD to rand and import as random
      29d205e [Doris Xin] created mllib.random package
      bd2df13 [Doris Xin] typos
      07ddff2 [Doris Xin] units passed.
      23b2ecd [Doris Xin] WIP
      d8430148
    • [SPARK-2531 & SPARK-2436] [SQL] Optimize the BuildSide when planning BroadcastNestedLoopJoin. · 8f51491e
      Zongheng Yang authored
      This PR resolves the following two tickets:
      
      - [SPARK-2531](https://issues.apache.org/jira/browse/SPARK-2531): BNLJ currently assumes the build side is the right relation. This patch refactors some of its logic to take into account a BuildSide properly.
      - [SPARK-2436](https://issues.apache.org/jira/browse/SPARK-2436): building on top of the above, we simply use the physical size statistics (if available) of both relations and make the smaller relation the build side in the planner (see the sketch after this list).
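      
      A hedged sketch of the resulting heuristic (names simplified relative to the actual planner code):
      
      ```
      sealed trait BuildSide
      case object BuildLeft extends BuildSide
      case object BuildRight extends BuildSide
      
      // Broadcast the relation with the smaller estimated physical size.
      def chooseBuildSide(leftSizeInBytes: BigInt, rightSizeInBytes: BigInt): BuildSide =
        if (rightSizeInBytes <= leftSizeInBytes) BuildRight else BuildLeft
      ```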
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #1448 from concretevitamin/bnlj-buildSide and squashes the following commits:
      
      1780351 [Zongheng Yang] Use size estimation to decide optimal build side of BNLJ.
      68e6c5b [Zongheng Yang] Consolidate two adjacent pattern matchings.
      96d312a [Zongheng Yang] Use a while loop instead of collection methods chaining.
      4bc525e [Zongheng Yang] Make BroadcastNestedLoopJoin take a BuildSide.
      8f51491e
    • SPARK-2282: Reuse Socket for sending accumulator updates to Pyspark · ef4ff00f
      Aaron Davidson authored
      Prior to this change, every PySpark task completion opened a new socket to the accumulator server, passed its updates through, and then quit. I'm not entirely sure why PySpark always sends accumulator updates, but regardless this causes a very rapid buildup of ephemeral TCP connections that remain in the TIME_WAIT state for around a minute before being cleaned up.
      
      Rather than trying to allow these sockets to be cleaned up faster, this patch simply reuses the connection between task completions (since they're fed updates in a single-threaded manner by the DAGScheduler anyway).
      
      The only tricky part here was making sure that the AccumulatorServer was able to shutdown in a timely manner (i.e., stop polling for new data), and this was accomplished via minor feats of magic.
      
      I have confirmed that this patch eliminates the buildup of ephemeral sockets due to the accumulator updates. However, I did note that there were still significant sockets being created against the PySpark daemon port, but my machine was not able to create enough sockets fast enough to fail. This may not be the last time we've seen this issue, though.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1503 from aarondav/accum and squashes the following commits:
      
      b3e12f7 [Aaron Davidson] SPARK-2282: Reuse Socket for sending accumulator updates to Pyspark
      ef4ff00f
    • SPARK-2740: allow user to specify ascending and numPartitions for sortBy... · 492a195c
      Rui Li authored
      It would be more convenient if users could specify ascending and numPartitions when calling sortByKey.
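      
      A usage sketch (assuming an existing SparkContext `sc`); both parameters have defaults, so existing calls stay source-compatible:
      
      ```
      import org.apache.spark.SparkContext._ // implicit conversion to OrderedRDDFunctions
      
      val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
      // Sort keys in descending order into 4 partitions.
      val sorted = pairs.sortByKey(ascending = false, numPartitions = 4)
      ```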
      
      Author: Rui Li <rui.li@intel.com>
      
      Closes #1645 from lirui-intel/spark-2740 and squashes the following commits:
      
      fb5d52e [Rui Li] SPARK-2740: allow user to specify ascending and numPartitions for sortByKey
      492a195c
    • Docs: monitoring, streaming programming guide · cc820502
      kballou authored
      Fix several awkward wordings and grammatical issues in the following
      documents:
      
      *   docs/monitoring.md
      
      *   docs/streaming-programming-guide.md
      
      Author: kballou <kballou@devnulllabs.io>
      
      Closes #1662 from kennyballou/grammar_fixes and squashes the following commits:
      
      e1b8ad6 [kballou] Docs: monitoring, streaming programming guide
      cc820502
    • Improvements to merge_spark_pr.py · e0213621
      Josh Rosen authored
      This commit fixes a couple of issues in the merge_spark_pr.py developer script:
      
      - Allow recovery from failed cherry-picks.
      - Fix detection of pull requests that have already been merged.
      
      Both of these fixes are useful when backporting changes.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1668 from JoshRosen/pr-script-improvements and squashes the following commits:
      
      ff4f33a [Josh Rosen] Default SPARK_HOME to cwd(); detect missing JIRA credentials.
      ed5bc57 [Josh Rosen] Improvements for backporting using merge_spark_pr:
      e0213621
    • [SPARK-2523] [SQL] Hadoop table scan bug fixing (fix failing Jenkins maven test) · 49b36129
      Yin Huai authored
      This PR tries to resolve the broken Jenkins maven test issue introduced by #1439. Now, we create a single query test to run both the setup work and the test query.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1669 from yhuai/SPARK-2523-fixTest and squashes the following commits:
      
      358af1a [Yin Huai] Make partition_based_table_scan_with_different_serde run atomically.
      49b36129
    • [SPARK-2511][MLLIB] add HashingTF and IDF · dc0865bc
      Xiangrui Meng authored
      This is roughly the TF-IDF implementation used in the Databricks Cloud Demo: http://databricks.com/cloud/ .
      
      Both `HashingTF` and `IDF` are implemented as transformers, similar to scikit-learn.
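      
      A usage sketch of the two transformers (assuming an existing SparkContext `sc` and pre-tokenized documents):
      
      ```
      import org.apache.spark.mllib.feature.{HashingTF, IDF}
      
      val documents = sc.parallelize(Seq(Seq("spark", "mllib"), Seq("spark", "sql")))
      val tf = new HashingTF().transform(documents) // hashed term-frequency vectors
      tf.cache()                                    // IDF takes two passes over the data
      val tfidf = new IDF().fit(tf).transform(tf)   // scale TF vectors by inverse document frequency
      ```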
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1671 from mengxr/tfidf and squashes the following commits:
      
      7d65888 [Xiangrui Meng] use JavaConverters._
      5fe9ec4 [Xiangrui Meng] fix unit test
      6e214ec [Xiangrui Meng] add apache header
      cfd9aed [Xiangrui Meng] add Java-friendly methods move classes to mllib.feature
      3814440 [Xiangrui Meng] add HashingTF and IDF
      dc0865bc
    • SPARK-2646. log4j initialization not quite compatible with log4j 2.x · e5749a13
      Sean Owen authored
      The logging code that handles log4j initialization leads to a stack overflow error when used with log4j 2.x, which has just been released. This occurs even when a downstream project has correctly adjusted its SLF4J bindings, and that is the right thing to do for log4j 2.x, since it is effectively a separate project from 1.x.
      
      Here is the relevant bit of Logging.scala:
      
      ```
        private def initializeLogging() {
          // If Log4j is being used, but is not initialized, load a default properties file
          val binder = StaticLoggerBinder.getSingleton
          val usingLog4j = binder.getLoggerFactoryClassStr.endsWith("Log4jLoggerFactory")
          val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
          if (!log4jInitialized && usingLog4j) {
            val defaultLogProps = "org/apache/spark/log4j-defaults.properties"
            Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
              case Some(url) =>
                PropertyConfigurator.configure(url)
                log.info(s"Using Spark's default log4j profile: $defaultLogProps")
              case None =>
                System.err.println(s"Spark was unable to load $defaultLogProps")
            }
          }
          Logging.initialized = true
      
          // Force a call into slf4j to initialize it. Avoids this happening from multiple threads
          // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html
          log
        }
      ```
      
      The first minor issue is that this method, which initializes logging, itself contains a call to a logger. In this situation, it ends up causing initialization to be called recursively until the stack overflows. It would be slightly tidier to log this only after Logging.initialized = true, or not at all. But it's not the root problem, or else it would not work at all now.
      
      The calls to log4j classes here always reference log4j 1.2 no matter what. For example, there is no getAllAppenders in log4j 2.x. That's fine. Really, "usingLog4j" means "using log4j 1.2" and "log4jInitialized" means "log4j 1.2 is initialized".
      
      usingLog4j should be false for log4j 2.x, because the initialization only matters for log4j 1.2. But, it's true, and that's the real issue. And log4jInitialized is always false, since calls to the log4j 1.2 API are stubs and no-ops in this setup, where the caller has swapped in log4j 2.x. Hence the loop.
      
      This is fixed, I believe, if "usingLog4j" can be false for log4j 2.x. The SLF4J static binding class has the same name for both versions, unfortunately, which causes the issue. However they're in different packages. For example, if the test included "... and begins with org.slf4j", it should work, as the SLF4J binding for log4j 2.x is provided by log4j 2.x at the moment, and is in package org.apache.logging.slf4j.
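      
      A sketch of the check this suggests (a hedged reading of the description, not necessarily the merged code):
      
      ```
      import org.slf4j.impl.StaticLoggerBinder
      
      val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
      // The log4j 2.x binding lives in org.apache.logging.slf4j, so it no longer
      // matches, and the log4j 1.2-specific initialization is skipped.
      val usingLog4j12 = "org.slf4j.impl.Log4jLoggerFactory" == binderClass
      ```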
      
      Of course, I assume that SLF4J will eventually offer its own binding. I hope to goodness they at least name the binding class differently, or else this will again not work. But then some other check can probably be made.
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1547 from srowen/SPARK-2646 and squashes the following commits:
      
      92a9898 [Sean Owen] System.out -> System.err
      94be4c7 [Sean Owen] Add back log message as System.out, with informational comment
      a7f8876 [Sean Owen] Updates from review
      6f3c1d3 [Sean Owen] Remove log statement in logging initialization, and distinguish log4j 1.2 from 2.0, to avoid stack overflow in initialization
      e5749a13
    • SPARK-2749 [BUILD] Part 2. Fix a follow-on scalastyle error · 4dbabb39
      Sean Owen authored
      The test compile error is fixed, but the build still fails because of one scalastyle error.
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/lastFailedBuild/hadoop.version=1.0.4,label=centos/console
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1690 from srowen/SPARK-2749 and squashes the following commits:
      
      1c9e7a6 [Sean Owen] Also: fix scalastyle error by wrapping a long line
      4dbabb39
    • SPARK-2664. Deal with `--conf` options in spark-submit that relate to fl... · f68105df
      Sandy Ryza authored
      Deal with `--conf` options in spark-submit that relate to flags.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1665 from sryza/sandy-spark-2664 and squashes the following commits:
      
      0518c63 [Sandy Ryza] SPARK-2664. Deal with `--conf` options in spark-submit that relate to flags
      f68105df
    • SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD · f1933123
      Aaron Davidson authored
      This allows users to gain access to the InputSplit which backs each partition.
      
      An alternative solution would have been to have a .withInputSplit() method which returns a new RDD[(InputSplit, (K, V))], but this is confusing because you could not cache this RDD or shuffle it, as InputSplit is not inherently serializable.
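      
      A usage sketch (assuming an existing SparkContext `sc`; the cast reflects that the new method lives on HadoopRDD rather than the base RDD):
      
      ```
      import org.apache.hadoop.io.{LongWritable, Text}
      import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
      import org.apache.spark.rdd.HadoopRDD
      
      // Tag every record with the file its partition was read from.
      val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data")
      val withFiles = rdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
        .mapPartitionsWithInputSplit { (split, iter) =>
          val file = split.asInstanceOf[FileSplit].getPath.toString
          iter.map { case (_, line) => (file, line.toString) }
        }
      ```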
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #973 from aarondav/hadoop and squashes the following commits:
      
      9c9112b [Aaron Davidson] Add JavaAPISuite test
      9942cd7 [Aaron Davidson] Add Java API
      1284a3a [Aaron Davidson] SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD
      f1933123
    • [SPARK-2397][SQL] Deprecate LocalHiveContext · 72cfb139
      Michael Armbrust authored
      LocalHiveContext is redundant with HiveContext.  The only difference is it creates `./metastore` instead of `./metastore_db`.
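      
      A migration sketch (assuming an existing SparkContext `sc`):
      
      ```
      import org.apache.spark.sql.hive.HiveContext
      
      // HiveContext replaces the deprecated LocalHiveContext; only the default
      // metastore directory name differs.
      val hiveContext = new HiveContext(sc)
      ```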
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1641 from marmbrus/localHiveContext and squashes the following commits:
      
      e5ec497 [Michael Armbrust] Add deprecation version
      626e056 [Michael Armbrust] Don't remove from imports yet
      905cc5f [Michael Armbrust] Merge remote-tracking branch 'apache/master' into localHiveContext
      1c2727e [Michael Armbrust] Deprecate LocalHiveContext
      72cfb139
    • [SPARK-2743][SQL] Resolve original attributes in ParquetTableScan · 3072b960
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1647 from marmbrus/parquetCase and squashes the following commits:
      
      a1799b7 [Michael Armbrust] move comment
      2a2a68b [Michael Armbrust] Merge remote-tracking branch 'apache/master' into parquetCase
      bb35d5b [Michael Armbrust] Fix test case that produced an invalid plan.
      e6870bf [Michael Armbrust] Better error message.
      539a2e1 [Michael Armbrust] Resolve original attributes in ParquetTableScan
      3072b960
    • [SPARK-2762] SparkILoop leaks memory in multi-repl configurations · 92ca910e
      Timothy Hunter authored
      This pull request is a small refactor so that a partial function (hence a closure) is not created. Instead, a regular function is used. The behavior of the code is not changed.
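      
      An illustrative contrast (not the actual SparkILoop code): evaluating a partial-function literal allocates a fresh closure object each time, while a plain method does not:
      
      ```
      // Each evaluation of this body allocates a new PartialFunction instance.
      def handlerPF: PartialFunction[String, Unit] = {
        case line if line.nonEmpty => println(line)
      }
      
      // A regular method involves no per-call closure allocation.
      def handle(line: String): Unit = if (line.nonEmpty) println(line)
      ```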
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #1674 from thunterdb/closure_issue and squashes the following commits:
      
      e1e664d [Timothy Hunter] simplify closure
      92ca910e
    • automatically set master according to `spark.master` in `spark-defaults.... · 669e3f05
      CrazyJvm authored
      automatically set master according to `spark.master` in `spark-defaults.conf`
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1644 from CrazyJvm/standalone-guide and squashes the following commits:
      
      bb12b95 [CrazyJvm] automatically set master according to `spark.master` in `spark-defaults.conf`
      669e3f05
    • [SPARK-2497] Included checks for module symbols too. · 5a110da2
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1463 from ScrapCodes/SPARK-2497/mima-exclude-all and squashes the following commits:
      
      72077b1 [Prashant Sharma] Check separately for module symbols.
      cd96192 [Prashant Sharma] SPARK-2497 Produce "member excludes" irrespective of the fact that class itself is excluded or not.
      5a110da2
    • [SPARK-2737] Add retag() method for changing RDDs' ClassTags. · 4fb25935
      Josh Rosen authored
      The Java API's use of fake ClassTags doesn't seem to cause any problems for Java users, but it can lead to issues when passing JavaRDDs' underlying RDDs to Scala code (e.g. in the MLlib Java API wrapper code). If we call collect() on a Scala RDD with an incorrect ClassTag, this causes ClassCastExceptions when we try to allocate an array of the wrong type (for example, see SPARK-2197).
      
      There are a few possible fixes here. An API-breaking fix would be to completely remove the fake ClassTags and require Java API users to pass java.lang.Class instances to all parallelize() calls and add returnClass fields to all Function implementations. This would be extremely verbose.
      
      Instead, this patch adds internal APIs to "repair" a Scala RDD with an incorrect ClassTag by wrapping it and overriding its ClassTag. This should be okay for cases where the Scala code that calls collect() knows what type of array should be allocated, which is the case in the MLlib wrappers.
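      
      A sketch of the idea (the internal API's exact shape may differ): re-wrap the RDD with a corrected ClassTag via an identity mapPartitions, preserving partitioning since the data itself is untouched:
      
      ```
      import scala.reflect.ClassTag
      import org.apache.spark.rdd.RDD
      
      def retag[T](rdd: RDD[T], cls: Class[T]): RDD[T] = {
        implicit val ct: ClassTag[T] = ClassTag(cls)
        // Identity transformation; only the ClassTag attached to the result changes.
        rdd.mapPartitions(iter => iter, preservesPartitioning = true)
      }
      ```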
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1639 from JoshRosen/SPARK-2737 and squashes the following commits:
      
      572b4c8 [Josh Rosen] Replace newRDD[T] with mapPartitions().
      469d941 [Josh Rosen] Preserve partitioner in retag().
      af78816 [Josh Rosen] Allow retag() to get classTag implicitly.
      d1d54e6 [Josh Rosen] [SPARK-2737] Add retag() method for changing RDDs' ClassTags.
      4fb25935
  3. Jul 30, 2014
    • [SPARK-2340] Resolve event logging and History Server paths properly · a7c305b8
      Andrew Or authored
      We resolve relative paths to the local `file:/` system for `--jars` and `--files` in spark submit (#853). We should do the same for the history server.
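      
      A hedged sketch of the resolution rule (the helper name is illustrative):
      
      ```
      import java.io.File
      import java.net.URI
      
      // Paths without a scheme resolve against the local filesystem, so
      // e.g. "logs" becomes "file:/current/working/dir/logs".
      def resolveURI(path: String): URI = {
        val uri = new URI(path)
        if (uri.getScheme != null) uri
        else new File(path).getAbsoluteFile.toURI
      }
      ```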
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1280 from andrewor14/hist-serv-fix and squashes the following commits:
      
      13ff406 [Andrew Or] Merge branch 'master' of github.com:apache/spark into hist-serv-fix
      b393e17 [Andrew Or] Strip trailing "/" from logging directory
      622a471 [Andrew Or] Fix test in EventLoggingListenerSuite
      0e20f71 [Andrew Or] Shift responsibility of resolving paths up one level
      b037c0c [Andrew Or] Use resolved paths for everything in history server
      c7e36ee [Andrew Or] Resolve paths for event logging too
      40e3933 [Andrew Or] Resolve history server file paths
      a7c305b8
    • Required AM memory is "amMem", not "args.amMemory" · 118c1c42
      derek ma authored
      "ERROR yarn.Client: Required AM memory (1024) is above the max threshold (1048) of this cluster" appears if this code is not changed. obviously, 1024 is less than 1048, so change this
      
      Author: derek ma <maji3@asiainfo-linkage.com>
      
      Closes #1494 from maji2014/master and squashes the following commits:
      
      b0f6640 [derek ma] Required AM memory is "amMem", not "args.amMemory"
      118c1c42
    • [SPARK-2758] UnionRDD's UnionPartition should not reference parent RDDs · 894d48ff
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1675 from rxin/unionrdd and squashes the following commits:
      
      941d316 [Reynold Xin] Clear RDDs for checkpointing.
      c9f05f2 [Reynold Xin] [SPARK-2758] UnionRDD's UnionPartition should not reference parent RDDs
      894d48ff
    • SPARK-2045 Sort-based shuffle · e9662844
      Matei Zaharia authored
      This adds a new ShuffleManager based on sorting, as described in https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts key-value pairs by partition ID and can be used to create a single sorted file with a map task's output. (Longer-term I think this can take on the remaining functionality in ExternalAppendOnlyMap and replace it so we don't have code duplication.)
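      
      A minimal sketch of the core idea, under the simplifying assumption that everything fits in memory (the real ExternalSorter also buffers, spills, and merges):
      
      ```
      // Order a map task's output by target reduce partition so it can be
      // written out as a single sorted file.
      def sortByPartition[K, V](
          records: Iterator[(K, V)],
          getPartition: K => Int): Iterator[(Int, (K, V))] =
        records
          .map { case (k, v) => (getPartition(k), (k, v)) }
          .toSeq
          .sortBy(_._1)
          .iterator
      ```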
      
      The main TODOs still left are:
      - [x] enabling ExternalSorter to merge across spilled files
        - [x] with an Ordering
        - [x] without an Ordering, using the keys' hash codes
      - [x] adding more tests (e.g. a version of our shuffle suite that runs on this)
      - [x] rebasing on top of the size-tracking refactoring in #1165 when that is merged
      - [x] disabling spilling if spark.shuffle.spill is set to false
      
      Despite these TODOs, this seems to work pretty well (running successfully in cases where the hash shuffle would OOM, such as 1000 reduce tasks on executors with only 1G memory), and it seems to be comparable in speed or faster than hash-based shuffle (it will create far fewer files for the OS to keep track of). So I'm posting it to get some early feedback.
      
      After these TODOs are done, I'd also like to enable ExternalSorter to sort data within each partition by a key as well, which will allow us to use it to implement external spilling in reduce tasks in `sortByKey`.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1499 from mateiz/sort-based-shuffle and squashes the following commits:
      
      bd841f9 [Matei Zaharia] Various review comments
      d1c137fd [Matei Zaharia] Various review comments
      a611159 [Matei Zaharia] Compile fixes due to rebase
      62c56c8 [Matei Zaharia] Fix ShuffledRDD sometimes not returning Tuple2s.
      f617432 [Matei Zaharia] Fix a failing test (seems to be due to change in SizeTracker logic)
      9464d5f [Matei Zaharia] Simplify code and fix conflicts after latest rebase
      0174149 [Matei Zaharia] Add cleanup behavior and cleanup tests for sort-based shuffle
      eb4ee0d [Matei Zaharia] Remove customizable element type in ShuffledRDD
      fa2e8db [Matei Zaharia] Allow nextBatchStream to be called after we're done looking at all streams
      a34b352 [Matei Zaharia] Fix tracking of indices within a partition in SpillReader, and add test
      03e1006 [Matei Zaharia] Add a SortShuffleSuite that runs ShuffleSuite with sort-based shuffle
      3c7ff1f [Matei Zaharia] Obey the spark.shuffle.spill setting in ExternalSorter
      ad65fbd [Matei Zaharia] Rebase on top of Aaron's Sorter change, and use Sorter in our buffer
      44d2a93 [Matei Zaharia] Use estimateSize instead of atGrowThreshold to test collection sizes
      5686f71 [Matei Zaharia] Optimize merging phase for in-memory only data:
      5461cbb [Matei Zaharia] Review comments and more tests (e.g. tests with 1 element per partition)
      e9ad356 [Matei Zaharia] Update ContextCleanerSuite to make sure shuffle cleanup tests use hash shuffle (since they were written for it)
      c72362a [Matei Zaharia] Added bug fix and test for when iterators are empty
      de1fb40 [Matei Zaharia] Make trait SizeTrackingCollection private[spark]
      4988d16 [Matei Zaharia] tweak
      c1b7572 [Matei Zaharia] Small optimization
      ba7db7f [Matei Zaharia] Handle null keys in hash-based comparator, and add tests for collisions
      ef4e397 [Matei Zaharia] Support for partial aggregation even without an Ordering
      4b7a5ce [Matei Zaharia] More tests, and ability to sort data if a total ordering is given
      e1f84be [Matei Zaharia] Fix disk block manager test
      5a40a1c [Matei Zaharia] More tests
      614f1b4 [Matei Zaharia] Add spill metrics to map tasks
      cc52caf [Matei Zaharia] Add more error handling and tests for error cases
      bbf359d [Matei Zaharia] More work
      3a56341 [Matei Zaharia] More partial work towards sort-based shuffle
      7a0895d [Matei Zaharia] Some more partial work towards sort-based shuffle
      b615476 [Matei Zaharia] Scaffolding for sort-based shuffle
      e9662844
    • Update DecisionTreeRunner.scala · da501766
      strat0sphere authored
      Author: strat0sphere <stratos.dimopoulos@gmail.com>
      
      Closes #1676 from strat0sphere/patch-1 and squashes the following commits:
      
      044d2fa [strat0sphere] Update DecisionTreeRunner.scala
      da501766
    • SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets · e9b275b7
      Sean Owen authored
      Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1663 from srowen/SPARK-2341 and squashes the following commits:
      
      8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
      18a8c8e [Sean Owen] Updates from review
      83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
      e9b275b7
    • [SPARK-2734][SQL] Remove tables from cache when DROP TABLE is run. · 88a519db
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1650 from marmbrus/dropCached and squashes the following commits:
      
      e6ab80b [Michael Armbrust] Support if exists.
      83426c6 [Michael Armbrust] Remove tables from cache when DROP TABLE is run.
      88a519db
    • SPARK-2741 - Publish version of spark assembly which does not contain Hive · 2ac37db7
      Brock Noland authored
      Provide a version of the Spark tarball which does not package Hive. This is meant for Hive + Spark users.
      
      Author: Brock Noland <brock@apache.org>
      
      Closes #1667 from brockn/master and squashes the following commits:
      
      5beafb2 [Brock Noland] SPARK-2741 - Publish version of spark assembly which does not contain Hive
      2ac37db7
    • SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven... · 6ab96a6f
      Sean Owen authored
      SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
      
      The Maven-based builds in the build matrix have been failing for a few days:
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark/
      
      On inspection, it looks like the Spark SQL Java tests don't compile:
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
      
      I confirmed it by repeating the command vs master:
      
      `mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package`
      
      The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but `com.novocode:junit-interface` (the SBT-JUnit bridge) pulls it in, in most places. However, this module doesn't depend on `com.novocode:junit-interface`.
      
      Adding the `junit:junit` dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via `com.novocode:junit-interface`, since that is a bit SBT/Scala-specific (and I am not even sure it's needed).
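      
      For reference, the sbt equivalent of the dependency being added (the actual fix edits the module's Maven pom; the version shown is illustrative):
      
      ```
      libraryDependencies += "junit" % "junit" % "4.10" % "test"
      ```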
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1660 from srowen/SPARK-2749 and squashes the following commits:
      
      858ff7c [Sean Owen] Add explicit junit dep to other modules with Java tests for robustness
      9636794 [Sean Owen] Add junit dep so that Spark SQL Java tests compile
      6ab96a6f
    • Properly pass SBT_MAVEN_PROFILES into sbt. · 2f4b1705
      Reynold Xin authored
      2f4b1705
    • Set AMPLAB_JENKINS_BUILD_PROFILE. · 10973275
      Reynold Xin authored
      10973275
    • Wrap JAR_DL in dev/check-license. · 7c7ce545
      Reynold Xin authored
      7c7ce545
    • [SPARK-2024] Add saveAsSequenceFile to PySpark · 94d1f46f
      Kan Zhang authored
      JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024
      
      This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats.
      
      * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.
      
      * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.
      
      * No out-of-the-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples.
      
      * Added HBase and Cassandra output examples to show how custom output formats and converters can be used.
      
      cc MLnick mateiz ahirreddy pwendell
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:
      
      c01e3ef [Kan Zhang] [SPARK-2024] code formatting
      6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
      d998ad6 [Kan Zhang] [SPARK-2024] refactoring to get method params below 10
      57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
      75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
      0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
      9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
      0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
      7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
      94d1f46f
    • dev/check-license wrap folders in quotes. · 437dc8c5
      Reynold Xin authored
      437dc8c5
    • [SQL] Fix compiling of catalyst docs. · 2248891a
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1653 from marmbrus/fixDocs and squashes the following commits:
      
      0aa1feb [Michael Armbrust] Fix compiling of catalyst docs.
      2248891a
    • More wrapping FWDIR in quotes. · 0feb349e
      Reynold Xin authored
      0feb349e