-
- Downloads
[SPARK-1212] Adding sparse data support and update KMeans
Continue our discussions from https://github.com/apache/incubator-spark/pull/575 This PR is WIP because it depends on a SNAPSHOT version of breeze. Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze. @fommil , please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling. I'm going to update this PR to include: 1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time. 2. Some numbers about the performance. 3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks! Author: Xiangrui Meng <meng@databricks.com> Closes #117 from mengxr/sparse-kmeans and squashes the following commits: 67b368d [Xiangrui Meng] fix SparseVector.toArray 5eda0de [Xiangrui Meng] update NOTICE 67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd 1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly 9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray 226d2cd [Xiangrui Meng] update Java friendly methods in Vectors 238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]] b28ba2f [Xiangrui Meng] add toArray to Vector e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java 72bde33 [Xiangrui Meng] clean up code for distance computation 712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly 27858e4 [Xiangrui Meng] update breeze version to 0.7 07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation 6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs 42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel 42b4e50 [Xiangrui Meng] line feed at the end a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans 3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm 0107e19 [Xiangrui Meng] update NOTICE 87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays 0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans 4e7d5ca [Xiangrui Meng] minor style update 07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data
Showing
- NOTICE 9 additions, 0 deletionsNOTICE
- examples/src/main/java/org/apache/spark/examples/JavaKMeans.java 0 additions, 138 deletions...s/src/main/java/org/apache/spark/examples/JavaKMeans.java
- examples/src/main/java/org/apache/spark/mllib/examples/JavaKMeans.java 12 additions, 11 deletions...main/java/org/apache/spark/mllib/examples/JavaKMeans.java
- mllib/pom.xml 5 additions, 0 deletionsmllib/pom.xml
- mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala 7 additions, 5 deletions...la/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
- mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala 154 additions, 79 deletions...main/scala/org/apache/spark/mllib/clustering/KMeans.scala
- mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala 18 additions, 6 deletions...scala/org/apache/spark/mllib/clustering/KMeansModel.scala
- mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala 37 additions, 21 deletions...scala/org/apache/spark/mllib/clustering/LocalKMeans.scala
- mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala 177 additions, 0 deletions...rc/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
- mllib/src/main/scala/org/apache/spark/mllib/rdd/VectorRDDs.scala 32 additions, 0 deletions...rc/main/scala/org/apache/spark/mllib/rdd/VectorRDDs.scala
- mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala 50 additions, 11 deletions.../src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
- mllib/src/test/java/org/apache/spark/mllib/clustering/JavaKMeansSuite.java 32 additions, 56 deletions...va/org/apache/spark/mllib/clustering/JavaKMeansSuite.java
- mllib/src/test/java/org/apache/spark/mllib/linalg/JavaVectorsSuite.java 44 additions, 0 deletions.../java/org/apache/spark/mllib/linalg/JavaVectorsSuite.java
- mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala 106 additions, 69 deletions...scala/org/apache/spark/mllib/clustering/KMeansSuite.scala
- mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeVectorConversionSuite.scala 58 additions, 0 deletions...ache/spark/mllib/linalg/BreezeVectorConversionSuite.scala
- mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala 85 additions, 0 deletions...st/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala
- mllib/src/test/scala/org/apache/spark/mllib/rdd/VectorRDDsSuite.scala 33 additions, 0 deletions...st/scala/org/apache/spark/mllib/rdd/VectorRDDsSuite.scala
- mllib/src/test/scala/org/apache/spark/mllib/util/LocalSparkContext.scala 17 additions, 0 deletions...scala/org/apache/spark/mllib/util/LocalSparkContext.scala
- mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala 52 additions, 0 deletions...test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala
- project/SparkBuild.scala 2 additions, 1 deletionproject/SparkBuild.scala
Loading
Please register or sign in to comment