  1. Sep 15, 2014
    • [SPARK-3433][BUILD] Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations. · ecf0c029
      Prashant Sharma authored
      The reported false positives were actually due to the MiMa generator not picking up the new jars in the presence of the old jars (theoretically this should not have happened). As a workaround, the two runs are done separately and the results are appended together.
      
      Author: Prashant Sharma <prashant@apache.org>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2285 from ScrapCodes/mima-fix and squashes the following commits:
      
      093c76f [Prashant Sharma] Update mima
      59012a8 [Prashant Sharma] Update mima
      35b6c71 [Prashant Sharma] SPARK-3433 Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations.
  2. Sep 07, 2014
    • [HOTFIX] Fix broken Mima tests on the master branch · 4ba26735
      Josh Rosen authored
      By merging #2268, which bumped the Spark version to 1.2.0-SNAPSHOT, I inadvertently broke the Mima binary compatibility tests.  The issue is that we were comparing 1.2.0-SNAPSHOT against Spark 1.0.0 without using any Mima excludes.  The right long-term fix for this is probably to publish nightly snapshots on Maven central and change the master branch to test binary compatibility against the current release candidate branch's snapshots until that release is finalized.
      
      As a short-term fix until 1.1.0 is published on Maven central, I've configured the build to test the master branch for binary compatibility against the 1.1.0-RC4 jars.  I'll loop back and remove the Apache staging repo as soon as 1.1.0 final is available.
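      A hypothetical sbt fragment sketching the short-term fix described above; the staging URL and the previous-artifact setting name are assumptions (the setting has been renamed across sbt-mima-plugin versions), not the literal change in this commit.

      ```
      // Hypothetical sketch only: point MiMa at the 1.1.0 release candidate staged on
      // the Apache repository. Setting names vary by sbt-mima-plugin version.
      resolvers += "Apache Staging" at "https://repository.apache.org/content/repositories/staging/"

      mimaPreviousArtifacts := Set("org.apache.spark" %% "spark-core" % "1.1.0")
      ```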
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #2315 from JoshRosen/mima-fix and squashes the following commits:
      
      776bc2c [Josh Rosen] Add two excludes to workaround Mima annotation issues.
      ec90e21 [Josh Rosen] Add deploy and graphx to 1.2 MiMa excludes.
      57569be [Josh Rosen] Fix MiMa tests in master branch; test against 1.1.0 RC.
  3. Sep 03, 2014
    • [SPARK-3388] Expose application ID in ApplicationStart event, use it in history server. · f2b5b619
      Marcelo Vanzin authored
      This change exposes the application ID generated by the Spark Master, Mesos, or YARN
      via the SparkListenerApplicationStart event. The history server then uses that ID to
      expose the application, instead of using the internal directory name generated by the
      event logger as an application ID. Besides looking better, this lets anyone who knows
      the application ID easily figure out the URL for the application's entry in the HS.
      
      In Yarn mode, this is used to generate a direct link from the RM application list to the
      Spark history server entry (thus providing a fix for SPARK-2150).
      
      Note that this assumes the different cluster managers generate application IDs that
      are sufficiently distinct from each other that clashes will not occur.
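      A minimal sketch of consuming the newly exposed ID from a SparkListener, assuming the field is carried as an `Option[String]` on `SparkListenerApplicationStart` (as in later releases); the history-server URL format is illustrative.

      ```
      import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart}

      // Prints a link to the application's history-server entry once the app ID is known.
      class AppIdListener extends SparkListener {
        override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
          event.appId.foreach { id =>
            println(s"History server entry: http://history-host:18080/history/$id")
          }
        }
      }
      ```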
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Andrew Or <andrewor14@gmail.com>
      
      Closes #1218 from vanzin/yarn-hs-link-2 and squashes the following commits:
      
      2d19f3c [Marcelo Vanzin] Review feedback.
      6706d3a [Marcelo Vanzin] Implement applicationId() in base classes.
      56fe42e [Marcelo Vanzin] Fix cluster mode history address, plus a cleanup.
      44112a8 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
      8278316 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
      a86bbcf [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
      a0056e6 [Marcelo Vanzin] Unbreak test.
      4b10cfd [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
      cb0cab2 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
      25f2826 [Marcelo Vanzin] Add MIMA excludes.
      f0ba90f [Marcelo Vanzin] Use BufferedIterator.
      c90a08d [Marcelo Vanzin] Remove unused code.
      3f8ec66 [Marcelo Vanzin] Review feedback.
      21aa71b [Marcelo Vanzin] Fix JSON test.
      b022bae [Marcelo Vanzin] Undo SparkContext cleanup.
      c6d7478 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
      4e3483f [Marcelo Vanzin] Fix test.
      57517b8 [Marcelo Vanzin] Review feedback. Mostly, more consistent use of Scala's Option.
      311e49d [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
      d35d86f [Marcelo Vanzin] Fix yarn backend after rebase.
      36dc362 [Marcelo Vanzin] Don't use Iterator::takeWhile().
      0afd696 [Marcelo Vanzin] Wait until master responds before returning from start().
      abc4697 [Marcelo Vanzin] Make FsHistoryProvider keep a map of applications by id.
      26b266e [Marcelo Vanzin] Use Mesos framework ID as Spark application ID.
      b3f3664 [Marcelo Vanzin] [yarn] Make the RM link point to the app direcly in the HS.
      2fb7de4 [Marcelo Vanzin] Expose the application ID in the ApplicationStart event.
      ed10348 [Marcelo Vanzin] Expose application id to spark context.
  4. Sep 02, 2014
    • SPARK-2636: Expose job ID in JobWaiter API · fbf2678c
      lirui authored
      This PR adds the async actions to the Java API. Users can call these async actions to get a FutureAction and use JobWaiter (for SimpleFutureAction) to retrieve the job ID.
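      A rough Scala-side sketch of the pattern (the commit itself targets the Java API); reading the job IDs off the returned FutureAction is the idea being exposed here, so treat the exact `jobIds` accessor as an assumption.

      ```
      import scala.concurrent.Await
      import scala.concurrent.duration.Duration

      import org.apache.spark.{SparkConf, SparkContext}

      object AsyncCountSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("async-count").setMaster("local[2]"))
          // countAsync returns a FutureAction immediately; the job runs in the background.
          val future = sc.parallelize(1 to 100000, 4).countAsync()
          println(s"Job id(s) behind this action: ${future.jobIds.mkString(", ")}") // assumed accessor
          println(s"Count: ${Await.result(future, Duration.Inf)}")
          sc.stop()
        }
      }
      ```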
      
      Author: lirui <rui.li@intel.com>
      
      Closes #2176 from lirui-intel/SPARK-2636 and squashes the following commits:
      
      ccaafb7 [lirui] SPARK-2636: fix java doc
      5536d55 [lirui] SPARK-2636: mark the async API as experimental
      e2e01d5 [lirui] SPARK-2636: add mima exclude
      0ca320d [lirui] SPARK-2636: fix method name & javadoc
      3fa39f7 [lirui] SPARK-2636: refine the patch
      af4f5d9 [lirui] SPARK-2636: remove unused imports
      843276c [lirui] SPARK-2636: only keep foreachAsync in the java API
      fbf5744 [lirui] SPARK-2636: add more async actions for java api
      1b25abc [lirui] SPARK-2636: expose some fields in JobWaiter
      d09f732 [lirui] SPARK-2636: fix build
      eb1ee79 [lirui] SPARK-2636: change some parameters in SimpleFutureAction to member field
      6e2b87b [lirui] SPARK-2636: add java API for async actions
  5. Aug 30, 2014
    • [SPARK-2288] Hide ShuffleBlockManager behind ShuffleManager · acea9280
      Raymond Liu authored
      By hiding ShuffleBlockManager behind ShuffleManager, we decouple the management of the shuffle data's block mapping from DiskBlockManager. This gives a clearer interface and makes it easier for other shuffle managers to implement their own block management logic. The JIRA ticket has more details.
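      An illustrative sketch of the decoupling described above; these trait names and signatures are placeholders, not the actual Spark interfaces.

      ```
      import java.nio.ByteBuffer

      // Placeholder interfaces sketching the idea: each shuffle implementation owns its own
      // mapping from (shuffle, map, reduce) to bytes on disk, instead of leaking that logic
      // into DiskBlockManager.
      trait ShuffleBlockResolverSketch {
        def getBlockData(shuffleId: Int, mapId: Int, reduceId: Int): ByteBuffer
        def removeShuffle(shuffleId: Int): Unit
      }

      trait ShuffleManagerSketch {
        // The block-management component is reached only through the shuffle manager.
        def shuffleBlockResolver: ShuffleBlockResolverSketch
      }
      ```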
      
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #1241 from colorant/shuffle and squashes the following commits:
      
      0e01ae3 [Raymond Liu] Move ShuffleBlockmanager behind shuffleManager
  6. Aug 16, 2014
    • [SPARK-3048][MLLIB] add LabeledPoint.parse and remove loadStreamingLabeledPoints · 7e70708a
      Xiangrui Meng authored
      Move `parse()` from `LabeledPointParser` to `LabeledPoint` and make it public. This breaks binary compatibility only when a user uses synthesized methods like `tupled` and `curried`, which is rare.
      
      `LabeledPoint.parse` is more consistent with `Vectors.parse`, which is why `LabeledPointParser` is not preferred.
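      A small usage sketch of the now-public parser; the string below assumes the `(label,[features])` format produced by `LabeledPoint.toString`.

      ```
      import org.apache.spark.mllib.regression.LabeledPoint

      // Round-trips the toString representation: label 1.0, dense features [1.0, 0.0, 3.0].
      val point = LabeledPoint.parse("(1.0,[1.0,0.0,3.0])")
      println(point.label)     // 1.0
      println(point.features)  // [1.0,0.0,3.0]
      ```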
      
      freeman-lab tdas
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1952 from mengxr/labelparser and squashes the following commits:
      
      c818fb2 [Xiangrui Meng] merge master
      ce20e6f [Xiangrui Meng] update mima excludes
      b386b8d [Xiangrui Meng] fix tests
      2436b3d [Xiangrui Meng] add parse() to LabeledPoint
    • [SPARK-3045] Make Serializer interface Java friendly · a83c7723
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1948 from rxin/kryo and squashes the following commits:
      
      a3a80d8 [Reynold Xin] [SPARK-3046] use executor's class loader as the default serializer classloader
      3d13277 [Reynold Xin] Reverted that in TestJavaSerializerImpl too.
      196f3dc [Reynold Xin] Ok one more commit to revert the classloader change.
      c49b50c [Reynold Xin] Removed JavaSerializer change.
      afbf37d [Reynold Xin] Moved the test case also.
      a2e693e [Reynold Xin] Removed the Kryo bug fix from this pull request.
      c81bd6c [Reynold Xin] Use defaultClassLoader when executing user specified custom registrator.
      68f261e [Reynold Xin] Added license check excludes.
      0c28179 [Reynold Xin] [SPARK-3045] Make Serializer interface Java friendly [SPARK-3046] Set executor's class loader as the default serializer class loader
  7. Aug 15, 2014
    • [SPARK-2924] remove default args to overloaded methods · 7589c39d
      Anand Avati authored
      Not supported in Scala 2.11. Split them into separate methods instead.
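      An illustrative (non-Spark) sketch of the pattern this commit applies: the defaulted overload is replaced by an explicit no-argument variant that forwards to the full one.

      ```
      object SaveOps {
        // Before (rejected when `save` is overloaded under Scala 2.11):
        //   def save(path: String, overwrite: Boolean = false): Unit = ...

        // After: split into separate methods, no default arguments on the overload set.
        def save(path: String): Unit = save(path, overwrite = false)

        def save(path: String, overwrite: Boolean): Unit =
          println(s"saving to $path (overwrite = $overwrite)")
      }
      ```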
      
      Author: Anand Avati <avati@redhat.com>
      
      Closes #1704 from avati/SPARK-1812-default-args and squashes the following commits:
      
      3e3924a [Anand Avati] SPARK-1812: Add Mima excludes for the broken ABI
      901dfc7 [Anand Avati] SPARK-1812: core - Fix overloaded methods with default arguments
      07f00af [Anand Avati] SPARK-1812: streaming - Fix overloaded methods with default arguments
  8. Aug 12, 2014
    • [SPARK-2923][MLLIB] Implement some basic BLAS routines · 9038d94e
      Xiangrui Meng authored
      Having some basic BLAS operations implemented in MLlib helps simplify the current implementation and improves performance.
      
      Tested on my local machine:
      
      ~~~
      bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
      examples/target/scala-*/spark-examples-*.jar --algorithm LR --regType L2 \
      --regParam 1.0 --numIterations 1000 ~/share/data/rcv1.binary/rcv1_train.binary
      ~~~
      
      1. before: ~1m
      2. after: ~30s
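      The real `org.apache.spark.mllib.linalg.BLAS` introduced here is package-private, so the sketch below is only a stand-in showing the kind of level-1 routines involved (axpy and dot over dense vectors).

      ```
      import org.apache.spark.mllib.linalg.DenseVector

      // Stand-in for the package-private MLlib BLAS: simple dense level-1 routines.
      object TinyBlas {
        /** y += a * x, in place. */
        def axpy(a: Double, x: DenseVector, y: DenseVector): Unit = {
          require(x.size == y.size, "vector sizes must match")
          var i = 0
          while (i < x.size) { y.values(i) += a * x.values(i); i += 1 }
        }

        /** Dot product of two dense vectors. */
        def dot(x: DenseVector, y: DenseVector): Double = {
          require(x.size == y.size, "vector sizes must match")
          var s = 0.0
          var i = 0
          while (i < x.size) { s += x.values(i) * y.values(i); i += 1 }
          s
        }
      }
      ```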
      
      CC: jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1849 from mengxr/ml-blas and squashes the following commits:
      
      ba583a2 [Xiangrui Meng] exclude Vector.copy
      a4d7d2f [Xiangrui Meng] Merge branch 'master' into ml-blas
      6edeab9 [Xiangrui Meng] address comments
      940bdeb [Xiangrui Meng] rename MLlibBLAS to BLAS
      c2a38bc [Xiangrui Meng] enhance dot tests
      4cfaac4 [Xiangrui Meng] add apache header
      48d01d2 [Xiangrui Meng] add tests for zeros and copy
      3b882b1 [Xiangrui Meng] use blas.scal in gradient
      735eb23 [Xiangrui Meng] remove d from BLAS routines
      d2d7d3c [Xiangrui Meng] update gradient and lbfgs
      7f78186 [Xiangrui Meng] add zeros to Vectors; add dscal and dcopy to BLAS
      14e6645 [Xiangrui Meng] add ddot
      cbb8273 [Xiangrui Meng] add daxpy test
      07db0bb [Xiangrui Meng] Merge branch 'master' into ml-blas
      e8c326d [Xiangrui Meng] axpy
  9. Aug 08, 2014
    • [SPARK-1997][MLLIB] update breeze to 0.9 · 74d6f622
      Xiangrui Meng authored
      Breeze 0.9 dependencies (this version doesn't depend on scalalogging, and I excluded commons-math3 from its transitive dependencies):
      ~~~
      +-org.scalanlp:breeze_2.10:0.9 [S]
        +-com.github.fommil.netlib:core:1.1.2
        +-com.github.rwl:jtransforms:2.4.0
        +-net.sf.opencsv:opencsv:2.3
        +-net.sourceforge.f2j:arpack_combined_all:0.1
        +-org.scalanlp:breeze-macros_2.10:0.3.1 [S]
        | +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
        |
        +-org.slf4j:slf4j-api:1.7.5
        +-org.spire-math:spire_2.10:0.7.4 [S]
          +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
          |
          +-org.spire-math:spire-macros_2.10:0.7.4 [S]
            +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
      ~~~
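      A hypothetical sbt fragment mirroring the upgrade and the exclusion mentioned above (coordinates taken from the tree; the actual wiring in Spark's build differs).

      ```
      // Hypothetical sketch, not the actual Spark build change.
      libraryDependencies += ("org.scalanlp" %% "breeze" % "0.9")
        .exclude("org.apache.commons", "commons-math3")
      ```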
      
      Closes #1749
      
      CC: witgo avati
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1857 from mengxr/breeze-0.9 and squashes the following commits:
      
      7fc16b6 [Xiangrui Meng] don't know why but exclude a private method for mima
      dcc502e [Xiangrui Meng] update breeze to 0.9
  10. Aug 02, 2014
  11. Aug 01, 2014
    • [SPARK-2103][Streaming] Change to ClassTag for KafkaInputDStream and fix reflection issue · a32f0fb7
      jerryshao authored
      This PR changes the Manifest previously used for KafkaInputDStream's Decoder to a ClassTag, and also fixes the problem reported in [SPARK-2103](https://issues.apache.org/jira/browse/SPARK-2103).

      The previous Java interface could not actually obtain the Decoder's type, so reconstructing the decoder object from that Manifest hit a reflection exception.

      Also, for the other two Java interfaces, ClassTag[String] is unnecessary because calling the Scala API picks up the right implicit ClassTag.

      The current Kafka unit test cannot actually verify the interface; I've tested these interfaces in my local and distributed settings.
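      A rough sketch of why the ClassTag matters: the receiver instantiates the user's Decoder subclass reflectively, which requires its concrete runtime class (a Manifest arriving through the Java API did not reliably carry it). Names and wiring are illustrative, not the actual KafkaInputDStream code.

      ```
      import scala.reflect.ClassTag

      import kafka.serializer.Decoder
      import kafka.utils.VerifiableProperties

      // Build the user's Decoder reflectively from its runtime class.
      def createDecoder[T <: Decoder[_]: ClassTag](props: VerifiableProperties): T =
        implicitly[ClassTag[T]].runtimeClass
          .getConstructor(classOf[VerifiableProperties])
          .newInstance(props)
          .asInstanceOf[T]
      ```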
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #1508 from jerryshao/SPARK-2103 and squashes the following commits:
      
      e90c37b [jerryshao] Add Mima excludes
      7529810 [jerryshao] Change Manifest to ClassTag for KafkaInputDStream's Decoder and fix Decoder construct issue when using Java API
  12. Jul 30, 2014
  13. Jul 27, 2014
    • [SPARK-1777] Prevent OOMs from single partitions · ecf30ee7
      Andrew Or authored
      **Problem.** When caching, we currently unroll the entire RDD partition before making sure we have enough free memory. This is a common cause for OOMs especially when (1) the BlockManager has little free space left in memory, and (2) the partition is large.
      
      **Solution.** We maintain a global memory pool of `M` bytes shared across all threads, similar to the way we currently manage memory for shuffle aggregation. Then, while we unroll each partition, periodically check if there is enough space to continue. If not, drop enough RDD blocks to ensure we have at least `M` bytes to work with, then try again. If we still don't have enough space to unroll the partition, give up and drop the block to disk directly if applicable.
      
      **New configurations.**
      - `spark.storage.bufferFraction` - the value of `M` as a fraction of the storage memory. (default: 0.2)
      - `spark.storage.safetyFraction` - a margin of safety in case size estimation is slightly off. This is the equivalent of the existing `spark.shuffle.safetyFraction`. (default 0.9)
      
      For more detail, see the [design document](https://issues.apache.org/jira/secure/attachment/12651793/spark-1777-design-doc.pdf). Tests pending for performance and memory usage patterns.
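      A very rough sketch of the periodic check described above; the names, check interval, and reservation sizes are illustrative, not the actual MemoryStore code.

      ```
      import scala.collection.mutable.ArrayBuffer
      import scala.reflect.ClassTag

      // Illustrative only: unroll an iterator while periodically asking a shared pool
      // for more room; if the pool says no, hand back an iterator so the caller can
      // drop the block to disk (or give up) instead of OOM-ing.
      def unrollSketch[T: ClassTag](
          values: Iterator[T],
          reserveBytes: Long => Boolean,        // request from the shared unroll pool
          estimateBytes: Seq[T] => Long): Either[Array[T], Iterator[T]] = {
        val buffer = new ArrayBuffer[T]
        var keepUnrolling = reserveBytes(1L << 20)   // small initial reservation
        var count = 0L
        while (values.hasNext && keepUnrolling) {
          buffer += values.next()
          count += 1
          if (count % 16 == 0) {                     // check periodically, not per element
            keepUnrolling = reserveBytes(estimateBytes(buffer))
          }
        }
        if (keepUnrolling) Left(buffer.toArray)      // fully unrolled in memory
        else Right(buffer.iterator ++ values)        // partially unrolled; caller decides
      }
      ```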
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1165 from andrewor14/them-rdd-memories and squashes the following commits:
      
      e77f451 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      c7c8832 [Andrew Or] Simplify logic + update a few comments
      269d07b [Andrew Or] Very minor changes to tests
      6645a8a [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      b7e165c [Andrew Or] Add new tests for unrolling blocks
      f12916d [Andrew Or] Slightly clean up tests
      71672a7 [Andrew Or] Update unrollSafely tests
      369ad07 [Andrew Or] Correct ensureFreeSpace and requestMemory behavior
      f4d035c [Andrew Or] Allow one thread to unroll multiple blocks
      a66fbd2 [Andrew Or] Rename a few things + update comments
      68730b3 [Andrew Or] Fix weird scalatest behavior
      e40c60d [Andrew Or] Fix MIMA excludes
      ff77aa1 [Andrew Or] Fix tests
      1a43c06 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      b9a6eee [Andrew Or] Simplify locking behavior on unrollMemoryMap
      ed6cda4 [Andrew Or] Formatting fix (super minor)
      f9ff82e [Andrew Or] putValues -> putIterator + putArray
      beb368f [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      8448c9b [Andrew Or] Fix tests
      a49ba4d [Andrew Or] Do not expose unroll memory check period
      69bc0a5 [Andrew Or] Always synchronize on putLock before unrollMemoryMap
      3f5a083 [Andrew Or] Simplify signature of ensureFreeSpace
      dce55c8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      8288228 [Andrew Or] Synchronize put and unroll properly
      4f18a3d [Andrew Or] bufferFraction -> unrollFraction
      28edfa3 [Andrew Or] Update a few comments / log messages
      728323b [Andrew Or] Do not synchronize every 1000 elements
      5ab2329 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      129c441 [Andrew Or] Fix bug: Use toArray rather than array
      9a65245 [Andrew Or] Update a few comments + minor control flow changes
      57f8d85 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      abeae4f [Andrew Or] Add comment clarifying the MEMORY_AND_DISK case
      3dd96aa [Andrew Or] AppendOnlyBuffer -> Vector (+ a few small changes)
      f920531 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      0871835 [Andrew Or] Add an effective storage level interface to BlockManager
      64e7d4c [Andrew Or] Add/modify a few comments (minor)
      8af2f35 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      4f4834e [Andrew Or] Use original storage level for blocks dropped to disk
      ecc8c2d [Andrew Or] Fix binary incompatibility
      24185ea [Andrew Or] Avoid dropping a block back to disk if reading from disk
      2b7ee66 [Andrew Or] Fix bug in SizeTracking*
      9b9a273 [Andrew Or] Fix tests
      20eb3e5 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      649bdb3 [Andrew Or] Document spark.storage.bufferFraction
      a10b0e7 [Andrew Or] Add initial memory request threshold + rename a few things
      e9c3cb0 [Andrew Or] cacheMemoryMap -> unrollMemoryMap
      198e374 [Andrew Or] Unfold -> unroll
      0d50155 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      d9d02a8 [Andrew Or] Remove unused param in unfoldSafely
      ec728d8 [Andrew Or] Add tests for safe unfolding of blocks
      22b2209 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      078eb83 [Andrew Or] Add check for hasNext in PrimitiveVector.iterator
      0871535 [Andrew Or] Fix tests in BlockManagerSuite
      d68f31e [Andrew Or] Safely unfold blocks for all memory puts
      5961f50 [Andrew Or] Fix tests
      195abd7 [Andrew Or] Refactor: move unfold logic to MemoryStore
      1e82d00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      3ce413e [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      d5dd3b4 [Andrew Or] Free buffer memory in finally
      ea02eec [Andrew Or] Fix tests
      b8e1d9c [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      a8704c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      e1b8b25 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      87aa75c [Andrew Or] Fix mima excludes again (typo)
      11eb921 [Andrew Or] Clarify comment (minor)
      50cae44 [Andrew Or] Remove now duplicate mima exclude
      7de5ef9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      df47265 [Andrew Or] Fix binary incompatibility
      6d05a81 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      f94f5af [Andrew Or] Update a few comments (minor)
      776aec9 [Andrew Or] Prevent OOM if a single RDD partition is too large
      bbd3eea [Andrew Or] Fix CacheManagerSuite to use Array
      97ea499 [Andrew Or] Change BlockManager interface to use Arrays
      c12f093 [Andrew Or] Add SizeTrackingAppendOnlyBuffer and tests
  14. Jul 23, 2014
    • [SPARK-2549] Functions defined inside of other functions trigger failures · 9b763329
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1510 from ScrapCodes/SPARK-2549/fun-in-fun and squashes the following commits:
      
      9458bc5 [Prashant Sharma] Tested by removing an inner function from excludes.
      bc03b1c [Prashant Sharma] SPARK-2549 Functions defined inside of other functions trigger failures
  15. Jul 22, 2014
  16. Jul 21, 2014
    • [SPARK-2086] Improve output of toDebugString to make shuffle boundaries more clear · c3462c65
      Gregory Owen authored
      Changes RDD.toDebugString() to show hierarchy and shuffle transformations more clearly
      
      New output:
      
      ```
      (3) FlatMappedValuesRDD[325] at apply at Transformer.scala:22
       |  MappedValuesRDD[324] at apply at Transformer.scala:22
       |  CoGroupedRDD[323] at apply at Transformer.scala:22
       +-(5) MappedRDD[320] at apply at Transformer.scala:22
       |  |  MappedRDD[319] at apply at Transformer.scala:22
       |  |  MappedValuesRDD[318] at apply at Transformer.scala:22
       |  |  MapPartitionsRDD[317] at apply at Transformer.scala:22
       |  |  ShuffledRDD[316] at apply at Transformer.scala:22
       |  +-(10) MappedRDD[315] at apply at Transformer.scala:22
       |     |   ParallelCollectionRDD[314] at apply at Transformer.scala:22
       +-(100) MappedRDD[322] at apply at Transformer.scala:22
           |   ParallelCollectionRDD[321] at apply at Transformer.scala:22
      ```
      
      Author: Gregory Owen <greowen@gmail.com>
      
      Closes #1364 from GregOwen/to-debug-string and squashes the following commits:
      
      08f5c78 [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
      1603f7b [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
  17. Jul 18, 2014
    • [MLlib] SPARK-1536: multiclass classification support for decision tree · d88f6be4
      Manish Amde authored
      The ability to perform multiclass classification is a big advantage for using decision trees and was a highly requested feature for MLlib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample-weight support using a WeightedLabeledPoint class for handling unbalanced datasets during classification, and will also support algorithms such as AdaBoost, which require instances to be weighted.
      
      It handles the special case where the categorical variables cannot be ordered for multiclass classification, and thus the optimizations used for speeding up binary classification cannot be applied directly to multiclass classification with categorical variables. More specifically, for m categories in a categorical feature, it analyses all ```2^(m-1) - 1``` categorical splits, provided the number of splits is less than the maxBins provided in the input. This condition will not be met for features with a large number of categories; using decision trees is not recommended for such datasets in general, since the categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing the bin size of the tree algorithms, using binary encoding for categorical features, or using a one-vs-all classification strategy) to avoid these constraints.
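      A hedged sketch using the later `trainClassifier` convenience API (the exact entry point and parameter names have shifted across releases): feature 0 is declared categorical with 4 categories, so up to 2^(4-1) - 1 = 7 unordered splits are considered for it in the multiclass case.

      ```
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.tree.DecisionTree
      import org.apache.spark.mllib.tree.model.DecisionTreeModel
      import org.apache.spark.rdd.RDD

      // Sketch: 3-class classification where feature 0 is categorical with 4 categories.
      def trainMulticlass(data: RDD[LabeledPoint]): DecisionTreeModel =
        DecisionTree.trainClassifier(
          data,
          3,               // numClasses
          Map(0 -> 4),     // categoricalFeaturesInfo: feature 0 has 4 categories
          "gini",          // impurity
          5,               // maxDepth
          32)              // maxBins
      ```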
      
      The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets.
      
      cc: mengxr, etrain, hirakendu, atalwalkar, srowen
      
      Author: Manish Amde <manish9ue@gmail.com>
      Author: manishamde <manish9ue@gmail.com>
      Author: Evan Sparks <sparks@cs.berkeley.edu>
      
      Closes #886 from manishamde/multiclass and squashes the following commits:
      
      26f8acc [Manish Amde] another attempt at fixing mima
      c5b2d04 [Manish Amde] more MIMA fixes
      1ce7212 [Manish Amde] change problem filter for mima
      10fdd82 [Manish Amde] fixing MIMA excludes
      e1c970d [Manish Amde] merged master
      abf2901 [Manish Amde] adding classes to MimaExcludes.scala
      45e767a [Manish Amde] adding developer api annotation for overriden methods
      c8428c4 [Manish Amde] fixing weird multiline bug
      afced16 [Manish Amde] removed label weights support
      2d85a48 [Manish Amde] minor: fixed scalastyle issues reprise
      4e85f2c [Manish Amde] minor: fixed scalastyle issues
      b2ae41f [Manish Amde] minor: scalastyle
      e4c1321 [Manish Amde] using while loop for regression histograms
      d75ac32 [Manish Amde] removed WeightedLabeledPoint from this PR
      0fecd38 [Manish Amde] minor: add newline to EOF
      2061cf5 [Manish Amde] merged from master
      06b1690 [Manish Amde] fixed off-by-one error in bin to split conversion
      9cc3e31 [Manish Amde] added implicit conversion import
      5c1b2ca [Manish Amde] doc for PointConverter class
      485eaae [Manish Amde] implicit conversion from LabeledPoint to WeightedLabeledPoint
      3d7f911 [Manish Amde] updated doc
      8e44ab8 [Manish Amde] updated doc
      adc7315 [Manish Amde] support ordered categorical splits for multiclass classification
      e3e8843 [Manish Amde] minor code formatting
      23d4268 [Manish Amde] minor: another minor code style
      34ee7b9 [Manish Amde] minor: code style
      237762d [Manish Amde] renaming functions
      12e6d0a [Manish Amde] minor: removing line in doc
      9a90c93 [Manish Amde] Merge branch 'master' into multiclass
      1892a2c [Manish Amde] tests and use multiclass binaggregate length when atleast one categorical feature is present
      f5f6b83 [Manish Amde] multiclass for continous variables
      8cfd3b6 [Manish Amde] working for categorical multiclass classification
      828ff16 [Manish Amde] added categorical variable test
      bce835f [Manish Amde] code cleanup
      7e5f08c [Manish Amde] minor doc
      1dd2735 [Manish Amde] bin search logic for multiclass
      f16a9bb [Manish Amde] fixing while loop
      d811425 [Manish Amde] multiclass bin aggregate logic
      ab5cb21 [Manish Amde] multiclass logic
      d8e4a11 [Manish Amde] sample weights
      ed5a2df [Manish Amde] fixed classification requirements
      d012be7 [Manish Amde] fixed while loop
      18d2835 [Manish Amde] changing default values for num classes
      6b912dc [Manish Amde] added numclasses to tree runner, predict logic for multiclass, add multiclass option to train
      75f2bfc [Manish Amde] minor code style fix
      e547151 [Manish Amde] minor modifications
      34549d0 [Manish Amde] fixing error during merge
      098e8c5 [Manish Amde] merged master
      e006f9d [Manish Amde] changing variable names
      5c78e1a [Manish Amde] added multiclass support
      6c7af22 [Manish Amde] prepared for multiclass without breaking binary classification
      46e06ee [Manish Amde] minor mods
      3f85a17 [Manish Amde] tests for multiclass classification
      4d5f70c [Manish Amde] added multiclass support for find splits bins
      46f909c [Manish Amde] todo for multiclass support
      455bea9 [Manish Amde] fixed tests
      14aea48 [Manish Amde] changing instance format to weighted labeled point
      a1a6e09 [Manish Amde] added weighted point class
      968ca9d [Manish Amde] merged master
      7fc9545 [Manish Amde] added docs
      ce004a1 [Manish Amde] minor formatting
      b27ad2c [Manish Amde] formatting
      426bb28 [Manish Amde] programming guide blurb
      8053fed [Manish Amde] more formatting
      5eca9e4 [Manish Amde] grammar
      4731cda [Manish Amde] formatting
      5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
      cbd9f14 [Manish Amde] modified scala.math to math
      dad9652 [Manish Amde] removed unused imports
      e0426ee [Manish Amde] renamed parameter
      718506b [Manish Amde] added unit test
      1517155 [Manish Amde] updated documentation
      9dbdabe [Manish Amde] merge from master
      719d009 [Manish Amde] updating user documentation
      fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
      0287772 [Evan Sparks] Fixing scalastyle issue.
      2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
      2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
      abc5a23 [Evan Sparks] Parameterizing max memory.
      50b143a [Manish Amde] adding support for very deep trees
  18. Jul 17, 2014
    • [SPARK-2534] Avoid pulling in the entire RDD in various operators · d988d345
      Reynold Xin authored
      This should go into both master and branch-1.0.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1450 from rxin/agg-closure and squashes the following commits:
      
      e40f363 [Reynold Xin] Mima check excludes.
      9186364 [Reynold Xin] Define the return type more explicitly.
      38e348b [Reynold Xin] Fixed the cases in RDD.scala.
      ea6b34d [Reynold Xin] Blah
      89b9c43 [Reynold Xin] Fix other instances of accidentally pulling in extra stuff in closures.
      73b2783 [Reynold Xin] [SPARK-2534] Avoid pulling in the entire RDD in groupByKey.
  19. Jul 12, 2014
    • [SPARK-1969][MLlib] Online summarizer APIs for mean, variance, min, and max · 55960869
      DB Tsai authored
      This basically moves the private ColumnStatisticsAggregator class from RowMatrix into a publicly available DeveloperApi, with documentation and unit tests.

      Changes:
      1) Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
      2) When creating a MultivariateOnlineSummarizer, the number of columns is no longer needed in the constructor; it is determined when users add the first sample (see the sketch after this list).
      3) Added API documentation for MultivariateOnlineSummarizer.
      4) Added unit tests for MultivariateOnlineSummarizer.
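      A small usage sketch of the new public API (method names as in the released MultivariateOnlineSummarizer):

      ```
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

      // No column count in the constructor; the first sample fixes the dimension.
      val summarizer = new MultivariateOnlineSummarizer()
      summarizer.add(Vectors.dense(1.0, 2.0, 3.0))
      summarizer.add(Vectors.dense(4.0, 0.0, -3.0))

      println(summarizer.mean)      // column-wise mean
      println(summarizer.variance)  // column-wise variance
      println(summarizer.min)
      println(summarizer.max)
      ```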
      
      Author: DB Tsai <dbtsai@dbtsai.com>
      
      Closes #955 from dbtsai/dbtsai-summarizer and squashes the following commits:
      
      b13ac90 [DB Tsai] dbtsai-summarizer
  20. Jul 10, 2014
    • [SPARK-1478].3: Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 · 40a8fef4
      tmalaska authored
      This is a modified version of PR https://github.com/apache/spark/pull/1168 by @tmalaska.
      It adds MIMA binary check exclusions.
      
      Author: tmalaska <ted.malaska@cloudera.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #1347 from tdas/FLUME-1915 and squashes the following commits:
      
      96065df [Tathagata Das] Added Mima exclusion for FlumeReceiver.
      41d5338 [tmalaska] Address line 57 that was too long
      12617e5 [tmalaska] SPARK-1478: Upgrade FlumeInputDStream's Flume...
    • [SPARK-1776] Have Spark's SBT build read dependencies from Maven. · 628932b8
      Prashant Sharma authored
      This patch introduces the new way of working while retaining the existing ways of doing things.

      For example, the Maven build instruction for YARN is
      `mvn -Pyarn -PHadoop2.2 clean package -DskipTests`
      and in sbt it can become
      `MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
      It also supports
      `sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:
      
      a8ac951 [Prashant Sharma] Updated sbt version.
      62b09bb [Prashant Sharma] Improvements.
      fa6221d [Prashant Sharma] Excluding sql from mima
      4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
      72651ca [Prashant Sharma] Addresses code reivew comments.
      acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
      ac4312c [Prashant Sharma] Revert "minor fix"
      6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
      65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
      446768e [Prashant Sharma] minor fix
      89b9777 [Prashant Sharma] Merge conflicts
      d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
      dccc8ac [Prashant Sharma] updated mima to check against 1.0
      a49c61b [Prashant Sharma] Fix for tools jar
      a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
      cf88758 [Prashant Sharma] cleanup
      9439ea3 [Prashant Sharma] Small fix to run-examples script.
      96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
      36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
      4973dbd [Patrick Wendell] Example build using pom reader.
  21. Jun 23, 2014
    • [SPARK-1768] History server enhancements. · 21ddd7d1
      Marcelo Vanzin authored
      Two improvements to the history server:
      
      - Separate the HTTP handling from history fetching, so that it's easy to add
        new backends later (thinking about SPARK-1537 in the long run)
      
      - Avoid loading all UIs in memory. Do lazy loading instead, keeping a few in
        memory for faster access. This allows the app limit to go away, since holding
        just the listing in memory shouldn't be too expensive unless the user has millions
        of completed apps in the history (at which point I'd expect other issues to arise
        aside from history server memory usage, such as FileSystem.listStatus()
        starting to become ridiculously expensive).
      
      I also fixed a few minor things along the way which aren't really worth mentioning.
      I also removed the app's log path from the UI since that information may not even
      exist depending on which backend is used (even though there is only one now).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #718 from vanzin/hist-server and squashes the following commits:
      
      53620c9 [Marcelo Vanzin] Add mima exclude, fix scaladoc wording.
      c21f8d8 [Marcelo Vanzin] Feedback: formatting, docs.
      dd8cc4b [Marcelo Vanzin] Standardize on using spark.history.* configuration.
      4da3a52 [Marcelo Vanzin] Remove UI from ApplicationHistoryInfo.
      2a7f68d [Marcelo Vanzin] Address review feedback.
      4e72c77 [Marcelo Vanzin] Remove comment about ordering.
      249bcea [Marcelo Vanzin] Remove offset / count from provider interface.
      ca5d320 [Marcelo Vanzin] Remove code that deals with unfinished apps.
      6e2432f [Marcelo Vanzin] Second round of feedback.
      b2c570a [Marcelo Vanzin] Make class package-private.
      4406f61 [Marcelo Vanzin] Cosmetic change to listing header.
      e852149 [Marcelo Vanzin] Initialize new app array to expected size.
      e8026f4 [Marcelo Vanzin] Review feedback.
      49d2fd3 [Marcelo Vanzin] Fix a comment.
      91e96ca [Marcelo Vanzin] Fix scalastyle issues.
      6fbe0d8 [Marcelo Vanzin] Better handle failures when loading app info.
      eee2f5a [Marcelo Vanzin] Ensure server.stop() is called when shutting down.
      bda2fa1 [Marcelo Vanzin] Rudimentary paging support for the history UI.
      b284478 [Marcelo Vanzin] Separate history server from history backend.
  22. Jun 21, 2014
  23. Jun 12, 2014
    • [Minor] Fix style, formatting and naming in BlockManager etc. · 44daec5a
      Andrew Or authored
      This is a precursor to a bigger change. I wanted to separate out the relatively insignificant changes so the ultimate PR is not inflated.
      
      (Warning: this PR is full of unimportant nitpicks)
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1058 from andrewor14/bm-minor and squashes the following commits:
      
      8e12eaf [Andrew Or] SparkException -> BlockException
      c36fd53 [Andrew Or] Make parts of BlockManager more readable
      0a5f378 [Andrew Or] Entry -> MemoryEntry
      e9762a5 [Andrew Or] Tone down string interpolation (minor reverts)
      c4de9ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into bm-minor
      b3470f1 [Andrew Or] More string interpolation (minor)
      7f9dcab [Andrew Or] Use string interpolation (minor)
      94a425b [Andrew Or] Refactor against duplicate code + minor changes
      8a6a7dc [Andrew Or] Exception -> SparkException
      97c410f [Andrew Or] Deal with MIMA excludes
      2480f1d [Andrew Or] Fixes in StorgeLevel.scala
      abb0163 [Andrew Or] Style, formatting and naming fixes
    • SPARK-554. Add aggregateByKey. · ce92a9c1
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #705 from sryza/sandy-spark-554 and squashes the following commits:
      
      2302b8f [Sandy Ryza] Add MIMA exclude
      f52e0ad [Sandy Ryza] Fix Python tests for real
      2f3afa3 [Sandy Ryza] Fix Python test
      0b735e9 [Sandy Ryza] Fix line lengths
      ae56746 [Sandy Ryza] Fix doc (replace T with V)
      c2be415 [Sandy Ryza] Java and Python aggregateByKey
      23bf400 [Sandy Ryza] SPARK-554.  Add aggregateByKey.
  24. Jun 11, 2014
    • [SPARK-1672][MLLIB] Separate user and product partitioning in ALS · d9203350
      Tor Myklebust authored
      Some cleanup work following #593.

      1. Allow setting different numbers of user blocks and product blocks in `ALS` (see the sketch after this list).
      2. Update `MovieLensALS` to reflect the change.
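      A hedged sketch of the new knobs on the `ALS` builder; the setter names reflect what this change adds, but treat them as assumptions if your version differs.

      ```
      import org.apache.spark.mllib.recommendation.{ALS, Rating}
      import org.apache.spark.rdd.RDD

      // Partition users and products independently; useful when one side is much larger.
      def buildModel(ratings: RDD[Rating]) =
        new ALS()
          .setRank(10)
          .setIterations(10)
          .setUserBlocks(20)      // number of user blocks
          .setProductBlocks(40)   // number of product blocks
          .run(ratings)
      ```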
      
      Author: Tor Myklebust <tmyklebu@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1014 from mengxr/SPARK-1672 and squashes the following commits:
      
      0e910dd [Xiangrui Meng] change private[this] to private[recommendation]
      36420c7 [Xiangrui Meng] set exclusion rules for ALS
      9128b77 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
      294efe9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
      9bab77b [Xiangrui Meng] clean up add numUserBlocks and numProductBlocks to MovieLensALS
      84c8e8c [Xiangrui Meng] Merge branch 'master' into SPARK-1672
      d17a8bf [Xiangrui Meng] merge master
      a4925fd [Tor Myklebust] Style.
      bd8a75c [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsseppar
      021f54b [Tor Myklebust] Separate user and product blocks.
      dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
      23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
      495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
      674933a [Tor Myklebust] Fix style.
      40edc23 [Tor Myklebust] Fix missing space.
      f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
      5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
      36a0f43 [Tor Myklebust] Make the partitioner private.
      d872b09 [Tor Myklebust] Add negative id ALS test.
      df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
      c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
      c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
  25. Jun 04, 2014
    • [SPARK-1817] RDD.zip() should verify partition sizes for each partition · c402a4a6
      Kan Zhang authored
      RDD.zip() will throw an exception if it finds partition sizes are not the same.
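      A quick spark-shell style illustration of the new check (assumes an existing `sc`):

      ```
      // Same number of partitions (2) but unequal per-partition sizes: 5+5 vs 4+5.
      val a = sc.parallelize(1 to 10, 2)
      val b = sc.parallelize(1 to 9, 2)

      // Now fails at runtime with a SparkException about mismatched partition sizes,
      // instead of the mismatch going unnoticed.
      a.zip(b).collect()
      ```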
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #944 from kanzhang/SPARK-1817 and squashes the following commits:
      
      c073848 [Kan Zhang] [SPARK-1817] Cosmetic updates
      524c670 [Kan Zhang] [SPARK-1817] RDD.zip() should verify partition sizes for each partition
  26. Jun 03, 2014
    • SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. · 1faef149
      Reynold Xin authored
      I also corrected some errors made in the previous approximate HLL count API, including that relativeSD wasn't really a measure of error (and we used it to test error bounds in test results).
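      A short sketch of the public entry point affected (spark-shell style, `sc` assumed); `relativeSD` is the accuracy knob referred to above.

      ```
      val ids = sc.parallelize(1 to 1000000, 8)

      // Approximate distinct count backed by HyperLogLog++; smaller relativeSD means
      // higher accuracy at the cost of more memory per partition.
      val approx = ids.countApproxDistinct(relativeSD = 0.05)
      println(approx)
      ```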
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #897 from rxin/hll and squashes the following commits:
      
      4d83f41 [Reynold Xin] New error bound and non-randomness.
      f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
      e367527 [Reynold Xin] One more round of code review.
      41e649a [Reynold Xin] Update final mima list.
      9e320c8 [Reynold Xin] Incorporate code review feedback.
      e110d70 [Reynold Xin] Merge branch 'master' into hll
      354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
      acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
      6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
      1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
      9221b27 [Reynold Xin] Merge branch 'master' into hll
      88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
      1294be6 [Reynold Xin] Updated HLL+.
      e7786cb [Reynold Xin] Merge branch 'master' into hll
      c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
    • Synthetic GraphX Benchmark · 894ecde0
      Joseph E. Gonzalez authored
      This PR accomplishes two things:
      
      1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph. This can be used to profile the GraphX system on arbitrary clusters without access to large graph datasets.
      
      2. This PR improves the implementation of the log-normal graph generator.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits:
      
      e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
      bccccad [Ankur Dave] Fix long lines
      374678a [Ankur Dave] Bugfix and style changes
      1bdf39a [Joseph E. Gonzalez] updating options
      d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder.
      f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
  27. Jun 01, 2014
    • Better explanation for how to use MIMA excludes. · d17d2214
      Patrick Wendell authored
      This patch does a few things:
      1. We have a file MimaExcludes.scala exclusively for excludes (a sketch of an entry follows this list).
      2. The test runner tells users about that file if a test fails.
      3. I've added back the excludes used from 0.9->1.0. We should keep
         these in the project as an official audit trail of times where
         we decided to make exceptions.
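      A sketch of what an entry in MimaExcludes.scala looks like; the excluded member below is a made-up placeholder, not a real exclusion from this patch.

      ```
      import com.typesafe.tools.mima.core._

      object MimaExcludesSketch {
        // Each accepted incompatibility gets an explicit, auditable filter entry.
        val excludes = Seq(
          ProblemFilters.exclude[MissingMethodProblem](
            "org.apache.spark.somepackage.SomeClass.someRemovedMethod")
        )
      }
      ```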
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #937 from pwendell/mima and squashes the following commits:
      
      7ee0db2 [Patrick Wendell] Better explanation for how to use MIMA excludes.