[SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value

RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.

Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function

python/run-tests script
* Added stat.py (doc test)

CC: mengxr dorx

Main changes were examples to show usage across APIs.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:

ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review. Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
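
The RowMatrix.scala checks listed in the commit message can be illustrated with a small sketch. This is plain Python over a list of rows, not the Scala implementation: the function names `num_cols` and `compute_covariance` and the error messages are hypothetical, and the `n - 1` denominator is an assumption chosen because it is what makes a single-row input undefined.

```python
def num_cols(rows):
    """Return the column count, failing loudly when there are no rows."""
    if len(rows) == 0:
        # Hypothetical message; the commit only says an error message was added.
        raise ValueError("Cannot determine the number of columns: matrix has no rows.")
    return len(rows[0])


def compute_covariance(rows):
    """Sample covariance matrix of the columns of `rows` (list of equal-length lists)."""
    n = len(rows)
    if n <= 1:
        # With n == 1 the (n - 1) denominator below would be zero.
        raise ValueError(
            "Cannot compute covariance of a matrix with %d row(s); need at least 2." % n
        )
    d = num_cols(rows)
    # Column means.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # Entry (i, j) is the sum of products of centered values, divided by n - 1.
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


if __name__ == "__main__":
    print(compute_covariance([[1.0, 2.0], [3.0, 6.0]]))
```

Failing early with a descriptive message, rather than letting a division by zero or an empty-collection error surface later, is the design choice the two checks reflect.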
Changed files:
- examples/src/main/python/als.py (2 additions, 0 deletions)
- examples/src/main/python/cassandra_inputformat.py (2 additions, 0 deletions)
- examples/src/main/python/cassandra_outputformat.py (2 additions, 0 deletions)
- examples/src/main/python/hbase_inputformat.py (2 additions, 0 deletions)
- examples/src/main/python/hbase_outputformat.py (2 additions, 0 deletions)
- examples/src/main/python/kmeans.py (2 additions, 0 deletions)
- examples/src/main/python/logistic_regression.py (2 additions, 0 deletions)
- examples/src/main/python/mllib/correlations.py (60 additions, 0 deletions)
- examples/src/main/python/mllib/decision_tree_runner.py (5 additions, 0 deletions)
- examples/src/main/python/mllib/kmeans.py (1 addition, 0 deletions)
- examples/src/main/python/mllib/logistic_regression.py (1 addition, 0 deletions)
- examples/src/main/python/mllib/random_rdd_generation.py (55 additions, 0 deletions)
- examples/src/main/python/mllib/sampled_rdds.py (86 additions, 0 deletions)
- examples/src/main/python/pagerank.py (2 additions, 0 deletions)
- examples/src/main/python/pi.py (2 additions, 0 deletions)
- examples/src/main/python/sort.py (2 additions, 0 deletions)
- examples/src/main/python/transitive_closure.py (2 additions, 0 deletions)
- examples/src/main/python/wordcount.py (2 additions, 0 deletions)
- examples/src/main/scala/org/apache/spark/examples/mllib/Correlations.scala (92 additions, 0 deletions)
- examples/src/main/scala/org/apache/spark/examples/mllib/MultivariateSummarizer.scala (98 additions, 0 deletions)