  1. Sep 04, 2014
  2. Sep 03, 2014
    • [SPARK-2435] Add shutdown hook to pyspark · 7c6e71f0
      Matthew Farrellee authored
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2183 from mattf/SPARK-2435 and squashes the following commits:
      
      ee0ee99 [Matthew Farrellee] [SPARK-2435] Add shutdown hook to pyspark
      7c6e71f0
    • [SPARK-3335] [SQL] [PySpark] support broadcast in Python UDF · c5cbc492
      Davies Liu authored
      After this patch, broadcast variables can be used in Python UDFs (a usage sketch follows this entry).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2243 from davies/udf_broadcast and squashes the following commits:
      
      7b88861 [Davies Liu] support broadcast in UDF
      c5cbc492
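      A hedged sketch of what the commit above enables; the UDF name, table, and column are placeholders, not taken from the patch:

          from pyspark import SparkContext
          from pyspark.sql import SQLContext

          sc = SparkContext(appName="udf-broadcast-sketch")
          sqlContext = SQLContext(sc)

          # Broadcast a small lookup table once to every executor.
          codes = sc.broadcast({"a": "1", "b": "2"})

          # The registered Python UDF can now read the broadcast value.
          sqlContext.registerFunction("code_of", lambda k: codes.value.get(k, ""))
          # e.g. sqlContext.sql("SELECT code_of(key) FROM events")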
    • [SPARK-3309] [PySpark] Put all public API in __all__ · 6481d274
      Davies Liu authored
      Put all public APIs in __all__ and also list them in pyspark.__init__.py, so we can get documentation for the whole public API with `pydoc pyspark`. This can also be used by other programs (such as Sphinx or Epydoc) to generate documentation only for public APIs. A minimal illustration of the convention follows this entry.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2205 from davies/public and squashes the following commits:
      
      c6c5567 [Davies Liu] fix message
      f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module
      7e3016a [Davies Liu] add __all__ in mllib
      6281b48 [Davies Liu] fix doc for SchemaRDD
      6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
      6481d274
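      A minimal illustration of the convention the commit above applies; the names listed are placeholders, not the actual pyspark exports:

          # some_module.py -- only names in __all__ are treated as public by
          # `from some_module import *`, pydoc, and doc tools that honor it.
          __all__ = ["PublicClass", "public_function"]

          class PublicClass(object):
              pass

          def public_function():
              return PublicClass()

          def _internal_helper():   # underscore prefix + absent from __all__ => private
              pass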
  3. Sep 02, 2014
    • [SPARK-2871] [PySpark] add countApproxDistinct() API · e2c901b4
      Davies Liu authored
      RDD.countApproxDistinct(relativeSD=0.05):
      
              :: Experimental ::
              Return approximate number of distinct elements in the RDD.
      
              The algorithm used is based on streamlib's implementation of
              "HyperLogLog in Practice: Algorithmic Engineering of a State
              of The Art Cardinality Estimation Algorithm", available
              <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
      
        This supports all types of objects that are supported by
        Pyrolite, i.e. nearly all built-in types.
      
              param relativeSD Relative accuracy. Smaller values create
                                 counters that require more space.
                                 It must be greater than 0.000017.
      
              >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
              >>> 950 < n < 1050
              True
              >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
              >>> 18 < n < 22
              True
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2142 from davies/countApproxDistinct and squashes the following commits:
      
      e20da47 [Davies Liu] remove the correction in Python
      c38c4e4 [Davies Liu] fix doc tests
      2ab157c [Davies Liu] fix doc tests
      9d2565f [Davies Liu] add commments and link for hash collision correction
      d306492 [Davies Liu] change range of hash of tuple to [0, maxint]
      ded624f [Davies Liu] calculate hash in Python
      4cba98f [Davies Liu] add more tests
      a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct
      e97e342 [Davies Liu] add countApproxDistinct()
      e2c901b4
  4. Aug 30, 2014
    • SPARK-3318: Documentation update in addFile on how to use SparkFiles.get · ba78383b
      Holden Karau authored
      Rather than specifying the full path, we need to pass just the filename to SparkFiles.get (a usage sketch follows this entry).
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #2210 from holdenk/SPARK-3318-documentation-for-addfiles-should-say-to-use-file-not-path and squashes the following commits:
      
      a25d27a [Holden Karau] Update the JavaSparkContext addFile method to be clear about using fileName with SparkFiles as well
      0ebcb05 [Holden Karau] Documentation update in addFile on how to use SparkFiles.get to specify filename rather than path
      ba78383b
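      A small usage sketch of the documented behavior; the driver-side path is a placeholder. The point is that SparkFiles.get takes only the file name:

          from pyspark import SparkContext, SparkFiles

          sc = SparkContext(appName="addfile-sketch")
          sc.addFile("/path/on/driver/data.txt")        # placeholder path

          def first_line(_):
              # Resolve the executor-local copy by file name, not by the original path.
              with open(SparkFiles.get("data.txt")) as f:
                  return [f.readline().strip()]

          print(sc.parallelize([0]).flatMap(first_line).collect())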
  5. Aug 29, 2014
  6. Aug 27, 2014
    • [SPARK-2871] [PySpark] add RDD.lookup(key) · 4fa2fda8
      Davies Liu authored
      RDD.lookup(key)
      
              Return the list of values in the RDD for key `key`. This operation
              is done efficiently if the RDD has a known partitioner by only
              searching the partition that the key maps to.
      
              >>> l = range(1000)
              >>> rdd = sc.parallelize(zip(l, l), 10)
              >>> rdd.lookup(42)  # slow
              [42]
              >>> sorted = rdd.sortByKey()
              >>> sorted.lookup(42)  # fast
              [42]
      
      It also cleans up the code in RDD.py and fixes several bugs (related to preservesPartitioning).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2093 from davies/lookup and squashes the following commits:
      
      1789cd4 [Davies Liu] `f` in foreach could be generator or not.
      2871b80 [Davies Liu] Merge branch 'master' into lookup
      c6390ea [Davies Liu] address all comments
      0f1bce8 [Davies Liu] add test case for lookup()
      be0e8ba [Davies Liu] fix preservesPartitioning
      eb1305d [Davies Liu] add RDD.lookup(key)
      4fa2fda8
    • [SPARK-3167] Handle special driver configs in Windows · 7557c4cf
      Andrew Or authored
      This is an effort to bring the Windows scripts up to speed after recent splashing changes in #1845.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2129 from andrewor14/windows-config and squashes the following commits:
      
      881a8f0 [Andrew Or] Add reference to Windows taskkill
      92e6047 [Andrew Or] Update a few comments (minor)
      22b1acd [Andrew Or] Fix style again (minor)
      afcffea [Andrew Or] Fix style (minor)
      72004c2 [Andrew Or] Actually respect --driver-java-options
      803218b [Andrew Or] Actually respect SPARK_*_CLASSPATH
      eeb34a0 [Andrew Or] Update outdated comment (minor)
      35caecc [Andrew Or] In Windows, actually kill Java processes on exit
      f97daa2 [Andrew Or] Fix Windows spark shell stdin issue
      83ebe60 [Andrew Or] Parse special driver configs in Windows (broken)
      7557c4cf
  7. Aug 26, 2014
    • [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() · f1e71d4c
      Davies Liu authored
      Use an external sort to support sorting large datasets in the reduce stage (a usage sketch follows this entry).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1978 from davies/sort and squashes the following commits:
      
      bbcd9ba [Davies Liu] check spilled bytes in tests
      b125d2f [Davies Liu] add test for external sort in rdd
      eae0176 [Davies Liu] choose different disks from different processes and instances
      1f075ed [Davies Liu] Merge branch 'master' into sort
      eb53ca6 [Davies Liu] Merge branch 'master' into sort
      644abaf [Davies Liu] add license in LICENSE
      19f7873 [Davies Liu] improve tests
      55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
      f1e71d4c
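      A usage sketch for the sort APIs this commit touches. The on-disk merge itself is internal; spark.python.worker.memory is assumed here to be the setting bounding per-worker memory before spilling, which is not stated in the commit itself:

          from pyspark import SparkConf, SparkContext

          conf = SparkConf().set("spark.python.worker.memory", "512m")   # assumed config key
          sc = SparkContext(conf=conf, appName="external-sort-sketch")

          pairs = sc.parallelize([(k % 100, k) for k in range(100000)])
          print(pairs.sortByKey(ascending=True).take(3))                 # sort by key
          print(pairs.sortBy(lambda kv: kv[1], ascending=False).take(3)) # sort by value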
    • [SPARK-2969][SQL] Make ScalaReflection be able to handle... · 98c2bb0b
      Takuya UESHIN authored
      [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull.
      
      Make `ScalaReflection` able to handle types like:
      
      - `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
      - `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
      - `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
      - `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:
      
      24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
      79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
      7cd1a7a [Takuya UESHIN] Fix json test failures.
      2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
      2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
      9fa02f5 [Takuya UESHIN] Fix a test failure.
      1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
      98c2bb0b
    • [SPARK-2871] [PySpark] add histgram() API · 3cedc4f4
      Davies Liu authored
      RDD.histogram(buckets)
      
              Compute a histogram using the provided buckets. The buckets
              are all open to the right except for the last which is closed.
              e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
              which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
              and 50 we would have a histogram of 1,0,1.
      
        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
        this can be switched from an O(log n) insertion to O(1) per
        element (where n = # buckets).

        Buckets must be sorted, must not contain any duplicates, and
        must have at least two elements.
      
        If `buckets` is a number, it will generate buckets that are
        evenly spaced between the minimum and maximum of the RDD. For
        example, if the min value is 0 and the max is 100, given 2
        buckets, the resulting buckets will be [0,50) [50,100].
        `buckets` must be at least 1. If the RDD contains infinity or
        NaN, an exception is thrown. If the elements in the RDD do not
        vary (max == min), a single bucket is always returned.

        It returns a tuple of buckets and histogram counts.
      
              >>> rdd = sc.parallelize(range(51))
              >>> rdd.histogram(2)
              ([0, 25, 50], [25, 26])
              >>> rdd.histogram([0, 5, 25, 50])
              ([0, 5, 25, 50], [5, 20, 26])
              >>> rdd.histogram([0, 15, 30, 45, 60], True)
              ([0, 15, 30, 45, 60], [15, 15, 15, 6])
              >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
              >>> rdd.histogram(("a", "b", "c"))
              (('a', 'b', 'c'), [2, 2])
      
      Closes #122, which is a duplicate.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2091 from davies/histgram and squashes the following commits:
      
      a322f8a [Davies Liu] fix deprecation of e.message
      84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
      d9a0722 [Davies Liu] address comments
      0e18a2d [Davies Liu] add histgram() API
      3cedc4f4
  8. Aug 24, 2014
    • [SPARK-2871] [PySpark] add zipWithIndex() and zipWithUniqueId() · fb0db772
      Davies Liu authored
      RDD.zipWithIndex()
      
              Zips this RDD with its element indices.
      
              The ordering is first based on the partition index and then the
              ordering of items within each partition. So the first item in
              the first partition gets index 0, and the last item in the last
              partition receives the largest index.
      
        This method needs to trigger a Spark job when this RDD contains
        more than one partition.
      
              >>> sc.parallelize(range(4), 2).zipWithIndex().collect()
              [(0, 0), (1, 1), (2, 2), (3, 3)]
      
      RDD.zipWithUniqueId()
      
              Zips this RDD with generated unique Long ids.
      
        Items in the kth partition will get ids k, n+k, 2*n+k, ..., where
        n is the number of partitions. So there may be gaps, but this
        method won't trigger a Spark job, unlike L{zipWithIndex}.
      
              >>> sc.parallelize(range(4), 2).zipWithUniqueId().collect()
              [(0, 0), (2, 1), (1, 2), (3, 3)]
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2092 from davies/zipWith and squashes the following commits:
      
      cebe5bf [Davies Liu] improve test cases, reverse the order of index
      0d2a128 [Davies Liu] add zipWithIndex() and zipWithUniqueId()
      fb0db772
  9. Aug 23, 2014
    • [SPARK-2871] [PySpark] add approx API for RDD · 8df4dad4
      Davies Liu authored
      RDD.countApprox(self, timeout, confidence=0.95)
      
              :: Experimental ::
              Approximate version of count() that returns a potentially incomplete
              result within a timeout, even if not all tasks have finished.
      
              >>> rdd = sc.parallelize(range(1000), 10)
              >>> rdd.countApprox(1000, 1.0)
              1000
      
      RDD.sumApprox(self, timeout, confidence=0.95)
      
              Approximate operation to return the sum within a timeout
              or meet the confidence.
      
              >>> rdd = sc.parallelize(range(1000), 10)
              >>> r = sum(xrange(1000))
        >>> (rdd.sumApprox(1000) - r) / r < 0.05
        True
      
      RDD.meanApprox(self, timeout, confidence=0.95)
      
              :: Experimental ::
              Approximate operation to return the mean within a timeout
              or meet the confidence.
      
              >>> rdd = sc.parallelize(range(1000), 10)
              >>> r = sum(xrange(1000)) / 1000.0
              >>> (rdd.meanApprox(1000) - r) / r < 0.05
              True
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2095 from davies/approx and squashes the following commits:
      
      e8c252b [Davies Liu] add approx API for RDD
      8df4dad4
    • [SPARK-2871] [PySpark] add `key` argument for max(), min() and top(n) · db436e36
      Davies Liu authored
      RDD.max(key=None)
      
              param key: A function used to generate key for comparing
      
              >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
              >>> rdd.max()
              43.0
              >>> rdd.max(key=str)
              5.0
      
      RDD.min(key=None)
      
              Find the minimum item in this RDD.
      
              param key: A function used to generate key for comparing
      
              >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
              >>> rdd.min()
              2.0
              >>> rdd.min(key=str)
              10.0
      
      RDD.top(num, key=None)
      
        Get the top N elements from an RDD.
      
              Note: It returns the list sorted in descending order.
              >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
              [12]
              >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
              [6, 5]
              >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
              [4, 3, 2]
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2094 from davies/cmp and squashes the following commits:
      
      ccbaf25 [Davies Liu] add `key` to top()
      ad7e374 [Davies Liu] fix tests
      2f63512 [Davies Liu] change `comp` to `key` in min/max
      dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
      db436e36
  10. Aug 20, 2014
    • [SPARK-3140] Clarify confusing PySpark exception message · ba3c730e
      Andrew Or authored
      We read the py4j port from the stdout of the `bin/spark-submit` subprocess. If there is interference in stdout (e.g. a random echo in `spark-submit`), we throw an exception with a warning message. We did not, however, distinguish this case from the case where no stdout is produced at all.
      
      I wasted a non-trivial amount of time being baffled by this exception in search of places where I print random whitespace (in vain, of course). A clearer exception message that distinguishes between these cases will prevent similar headaches that I have gone through.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2067 from andrewor14/python-exception and squashes the following commits:
      
      742f823 [Andrew Or] Further clarify warning messages
      e96a7a0 [Andrew Or] Distinguish between unexpected output and no output at all
      ba3c730e
    • [SPARK-3141] [PySpark] fix sortByKey() with take() · 0a7ef633
      Davies Liu authored
      Fix sortByKey() with take()
      
      The function `f` used in mapPartitions should always return an iterator (a minimal sketch follows this entry).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2045 from davies/fix_sortbykey and squashes the following commits:
      
      1160f59 [Davies Liu] fix sortByKey() with take()
      0a7ef633
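      A minimal sketch of the rule stated above (the RDD and function here are illustrative): anything passed to mapPartitions must return an iterator or iterable, never a bare value.

          from pyspark import SparkContext

          sc = SparkContext(appName="mappartitions-sketch")
          rdd = sc.parallelize(range(10), 4)

          def running_totals(iterator):
              total = 0
              for x in iterator:
                  total += x
                  yield total          # a generator is an iterator, so this is valid

          print(rdd.mapPartitions(running_totals).take(5))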
    • [SPARK-2974] [SPARK-2975] Fix two bugs related to spark.local.dirs · ebcb94f7
      Josh Rosen authored
      This PR fixes two bugs related to `spark.local.dirs` and `SPARK_LOCAL_DIRS`, one where `Utils.getLocalDir()` might return an invalid directory (SPARK-2974) and another where the `SPARK_LOCAL_DIRS` override didn't affect the driver, which could cause problems when running tasks in local mode (SPARK-2975).
      
      This patch fixes both issues: the new `Utils.getOrCreateLocalRootDirs(conf: SparkConf)` utility method manages the creation of local directories and handles the precedence among the different configuration options, so we should see the same behavior whether we're running in local mode or on a worker.
      
      It's kind of a pain to mock out environment variables in tests (no easy way to mock System.getenv), so I added a `private[spark]` method to SparkConf for accessing environment variables (by default, it just delegates to System.getenv).  By subclassing SparkConf and overriding this method, we can mock out SPARK_LOCAL_DIRS in tests.
      
      I also fixed a typo in PySpark where we used `SPARK_LOCAL_DIR` instead of `SPARK_LOCAL_DIRS` (I think this was technically innocuous, but it seemed worth fixing).
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #2002 from JoshRosen/local-dirs and squashes the following commits:
      
      efad8c6 [Josh Rosen] Address review comments:
      1dec709 [Josh Rosen] Minor updates to Javadocs.
      7f36999 [Josh Rosen] Use env vars to detect if running in YARN container.
      399ac25 [Josh Rosen] Update getLocalDir() documentation.
      bb3ad89 [Josh Rosen] Remove duplicated YARN getLocalDirs() code.
      3e92d44 [Josh Rosen] Move local dirs override logic into Utils; fix bugs:
      b2c4736 [Josh Rosen] Add failing tests for SPARK-2974 and SPARK-2975.
      007298b [Josh Rosen] Allow environment variables to be mocked in tests.
      6d9259b [Josh Rosen] Fix typo in PySpark: SPARK_LOCAL_DIR should be SPARK_LOCAL_DIRS
      ebcb94f7
  11. Aug 19, 2014
    • [SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs · 825d4fe4
      Xiangrui Meng authored
      Though we don't use default arguments for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users should expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation. brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2041 from mengxr/stat-doc and squashes the following commits:
      
      fc5eedf [Xiangrui Meng] add missing comma
      ffde810 [Xiangrui Meng] address comments
      aef6d07 [Xiangrui Meng] add doc for random data generation
      b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs
      825d4fe4
    • [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes. · d7e80c25
      Davies Liu authored
      If two RDDs have different batch sizes in their serializers, the one with the smaller batch size is re-serialized, and then RDD.zip() is called in Spark.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1894 from davies/zip and squashes the following commits:
      
      c4652ea [Davies Liu] add more test cases
      6d05fc8 [Davies Liu] Merge branch 'master' into zip
      813b1e4 [Davies Liu] add more tests for failed cases
      a4aafda [Davies Liu] fix zip with serializers which have different batch sizes.
      d7e80c25
  12. Aug 18, 2014
    • [SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL. · 1f1819b2
      Josh Rosen authored
      This fixes SPARK-3114, an issue where we inadvertently broke Python UDFs in Spark SQL.
      
      This PR modifies the test runner script to always run the PySpark SQL tests, irrespective of whether SparkSQL itself has been modified.  It also includes Davies' fix for the bug.
      
      Closes #2026.
      
      Author: Josh Rosen <joshrosen@apache.org>
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2027 from JoshRosen/pyspark-sql-fix and squashes the following commits:
      
      9af2708 [Davies Liu] bugfix: disable compression of command
      0d8d3a4 [Josh Rosen] Always run Python Spark SQL tests.
      1f1819b2
    • [SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes · c8b16ca0
      Joseph K. Bradley authored
      Added examples for statistical summarization:
      * Scala: StatisticalSummary.scala
      ** Tests: correlation, MultivariateOnlineSummarizer
      * python: statistical_summary.py
      ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
      
      Added examples for random and sampled RDDs:
      * Scala: RandomAndSampledRDDs.scala
      * python: random_and_sampled_rdds.py
      * Both test:
      ** RandomRDDGenerators.normalRDD, normalVectorRDD
      ** RDD.sample, takeSample, sampleByKey
      
      Added sc.stop() to all examples.
      
      CorrelationSuite.scala
      * Added 1 test for RDDs with only 1 value
      
      RowMatrix.scala
      * numCols(): Added check for numRows = 0, with error message.
      * computeCovariance(): Added check for numRows <= 1, with error message.
      
      Python SparseVector (pyspark/mllib/linalg.py)
      * Added toDense() function
      
      python/run-tests script
      * Added stat.py (doc test)
      
      CC: mengxr dorx  Main changes were examples to show usage across APIs.
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:
      
      ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
      8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
      b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
      32173b7 [Joseph K. Bradley] Stats examples update.
      c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
      ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
      65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
      064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
      c8b16ca0
    • [mllib] DecisionTree: treeAggregate + Python example bug fix · 115eeb30
      Joseph K. Bradley authored
      Small DecisionTree updates:
      * Changed main DecisionTree aggregate to treeAggregate.
      * Fixed bug in python example decision_tree_runner.py with missing argument (since categoricalFeaturesInfo is no longer an optional argument for trainClassifier; a call sketch follows this entry).
      * Fixed same bug in python doc tests, and added tree.py to doc tests.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:
      
      b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
      8e4665d [Joseph K. Bradley] Added tree.py to python doc tests.  Fixed bug from missing categoricalFeaturesInfo argument.
      b7b2922 [Joseph K. Bradley] Fixed bug in python example decision_tree_runner.py with missing argument.  Changed main DecisionTree aggregate to treeAggregate.
      85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
      3726d20 [Joseph K. Bradley] Small code improvements based on code review.
      ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
      db0d773 [Joseph K. Bradley] scala style fix
      6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
      931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
      f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
      5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
      6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
      2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
      356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
      430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
      d036089 [Joseph K. Bradley] Print timing info to logDebug.
      e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
      8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
      a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
      b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
      b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
      0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
      3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
      f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
      115eeb30
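      A hedged sketch of the call the bug fix above concerns: categoricalFeaturesInfo must now be passed explicitly (an empty dict means all features are continuous). Data and parameter values are illustrative only.

          from pyspark import SparkContext
          from pyspark.mllib.regression import LabeledPoint
          from pyspark.mllib.tree import DecisionTree

          sc = SparkContext(appName="decision-tree-sketch")
          data = sc.parallelize([
              LabeledPoint(0.0, [0.0, 1.0]),
              LabeledPoint(1.0, [1.0, 0.0]),
              LabeledPoint(1.0, [1.0, 1.0]),
          ])

          model = DecisionTree.trainClassifier(
              data, numClasses=2, categoricalFeaturesInfo={},   # no longer optional
              impurity="gini", maxDepth=3)
          print(model.predict([1.0, 0.0]))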
    • [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8 · d1d0ee41
      Davies Liu authored
      Bugfix: an exception was raised when trying to encode non-ASCII byte strings into unicode. Only unicode should be encoded as "utf-8" (a small sketch follows this entry).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2018 from davies/fix_utf8 and squashes the following commits:
      
      4db7967 [Davies Liu] fix saveAsTextFile() with utf-8
      d1d0ee41
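      A tiny sketch of the case the fix above covers: an RDD mixing unicode and plain byte strings is written out without an encoding error. The output path is a placeholder.

          from pyspark import SparkContext

          sc = SparkContext(appName="utf8-save-sketch")
          rdd = sc.parallelize([u"caf\u00e9", "plain ascii", u"\u4e2d\u6587"])
          rdd.saveAsTextFile("/tmp/utf8-output")   # placeholder output directory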
  13. Aug 16, 2014
    • [SPARK-1065] [PySpark] improve supporting for large broadcast · 2fc8aca0
      Davies Liu authored
      Passing large objects through Py4J is very slow (and costs a lot of memory), so broadcast objects are passed via files (similar to parallelize()).

      Also add an option to keep the object in the driver (False by default) to save driver memory.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1912 from davies/broadcast and squashes the following commits:
      
      e06df4a [Davies Liu] load broadcast from disk in driver automatically
      db3f232 [Davies Liu] fix serialization of accumulator
      631a827 [Davies Liu] Merge branch 'master' into broadcast
      c7baa8c [Davies Liu] compress serrialized broadcast and command
      9a7161f [Davies Liu] fix doc tests
      e93cf4b [Davies Liu] address comments: add test
      6226189 [Davies Liu] improve large broadcast
      2fc8aca0
    • [SPARK-3035] Wrong example with SparkContext.addFile · 379e7585
      iAmGhost authored
      https://issues.apache.org/jira/browse/SPARK-3035
      
      Fix for the wrong documentation.
      
      Author: iAmGhost <kdh7807@gmail.com>
      
      Closes #1942 from iAmGhost/master and squashes the following commits:
      
      487528a [iAmGhost] [SPARK-3035] Wrong example with SparkContext.addFile fix for wrong document.
      379e7585
    • [SPARK-3081][MLLIB] rename RandomRDDGenerators to RandomRDDs · ac6411c6
      Xiangrui Meng authored
      `RandomRDDGenerators` means factory for `RandomRDDGenerator`. However, its methods return RDDs but not RDDGenerators. So a more proper (and shorter) name would be `RandomRDDs`.
      
      dorx brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1979 from mengxr/randomrdds and squashes the following commits:
      
      b161a2d [Xiangrui Meng] rename RandomRDDGenerators to RandomRDDs
      ac6411c6
    • [SQL] Using safe floating-point numbers in doctest · b4a05928
      Cheng Lian authored
      Test code in `sql.py` tries to compare two floating-point numbers directly, and caused [build failure(s)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18365/consoleFull).

      [Doctest documentation](https://docs.python.org/3/library/doctest.html#warnings) recommends using numbers in the form of `I/2**J` to avoid the precision issue (a short illustration follows this entry).
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1925 from liancheng/fix-pysql-fp-test and squashes the following commits:
      
      0fbf584 [Cheng Lian] Removed unnecessary `...' from inferSchema doctest
      e8059d4 [Cheng Lian] Using safe floating-point numbers in doctest
      b4a05928
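      A short illustration of the doctest guideline cited above: values of the form I/2**J (0.5, 0.25, 1.75, ...) are exactly representable in binary floating point, so comparing them in a doctest is stable across platforms.

          def halve(x):
              """
              >>> halve(3.5)    # 3.5 == 7/2**1, exactly representable
              1.75
              """
              return x / 2.0

          if __name__ == "__main__":
              import doctest
              doctest.testmod()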
  14. Aug 14, 2014
    • [SQL] Python JsonRDD UTF8 Encoding Fix · fde692b3
      Ahir Reddy authored
      Only encode unicode objects to UTF-8, not byte strings.
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #1914 from ahirreddy/json-rdd-unicode-fix1 and squashes the following commits:
      
      ca4e9ba [Ahir Reddy] Encoding Fix
      fde692b3
  15. Aug 13, 2014
    • [SPARK-2983] [PySpark] improve performance of sortByKey() · 434bea1c
      Davies Liu authored
      1. skip partitionBy() when numOfPartition is 1
      2. use bisect_left (O(lg(N))) instead of a loop (O(N)) in
      rangePartitioner (a standalone sketch follows this entry)
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1898 from davies/sort and squashes the following commits:
      
      0a9608b [Davies Liu] Merge branch 'master' into sort
      1cf9565 [Davies Liu] improve performance of sortByKey()
      434bea1c
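      A standalone sketch of the second point (not the actual rangePartitioner code): with sorted partition boundaries, bisect_left finds the target partition in O(log N) instead of scanning. Boundary values are illustrative.

          from bisect import bisect_left

          bounds = [10, 20, 30, 40]            # upper bounds of the first N-1 partitions

          def partition_for(key):
              return bisect_left(bounds, key)  # index of the partition that holds `key`

          print([partition_for(k) for k in (5, 10, 25, 99)])   # -> [0, 0, 2, 4]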
    • [SPARK-3013] [SQL] [PySpark] convert array into list · c974a716
      Davies Liu authored
      because Pyrolite does not support the array type from Python 2.6
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1928 from davies/fix_array and squashes the following commits:
      
      858e6c5 [Davies Liu] convert array into list
      c974a716
    • [SPARK-2993] [MLLib] colStats (wrapper around MultivariateStatisticalSummary) in Statistics · fe473595
      Doris Xin authored
      For both Scala and Python (a usage sketch follows this entry).
      
      The ser/de util functions were moved out of `PythonMLLibAPI` and into their own object to avoid creating the `PythonMLLibAPI` object inside of `MultivariateStatisticalSummarySerialized`, which is then referenced inside of a method in `PythonMLLibAPI`.
      
      `MultivariateStatisticalSummarySerialized` was created to serialize the `Vector` fields in `MultivariateStatisticalSummary`.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1911 from dorx/colStats and squashes the following commits:
      
      77b9924 [Doris Xin] developerAPI tag
      de9cbbe [Doris Xin] reviewer comments and moved more ser/de
      459faba [Doris Xin] colStats in Statistics for both Scala and Python
      fe473595
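      A hedged usage sketch of the Python API the commit above exposes; the vector values are illustrative.

          from pyspark import SparkContext
          from pyspark.mllib.linalg import Vectors
          from pyspark.mllib.stat import Statistics

          sc = SparkContext(appName="colstats-sketch")
          rows = sc.parallelize([
              Vectors.dense([1.0, 10.0]),
              Vectors.dense([2.0, 20.0]),
              Vectors.dense([3.0, 30.0]),
          ])

          summary = Statistics.colStats(rows)
          print(summary.mean())       # per-column means
          print(summary.variance())   # per-column variances
          print(summary.count())      # number of rows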
  16. Aug 12, 2014
    • fix flaky tests · 882da57a
      Davies Liu authored
      Python 2.6 does not handle floating-point error as well as 2.7+.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1910 from davies/fix_test and squashes the following commits:
      
      7e51200 [Davies Liu] fix flaky tests
      882da57a
  17. Aug 11, 2014
  18. Aug 10, 2014
    • [SPARK-2898] [PySpark] fix bugs in deamon.py · 28dcbb53
      Davies Liu authored
      1. do not use a signal handler for SIGCHLD, since it can easily cause deadlock
      2. handle EINTR during accept() (a generic retry sketch follows this entry)
      3. pass errno into the JVM
      4. handle EAGAIN during fork()

      Now it passes a 50k-task test in 180 seconds.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1842 from davies/qa and squashes the following commits:
      
      f0ea451 [Davies Liu] fix lint
      03a2e8c [Davies Liu] cleanup dead children every seconds
      32cb829 [Davies Liu] fix lint
      0cd0817 [Davies Liu] fix bugs in deamon.py
      28dcbb53
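      A generic sketch (not the actual daemon.py code) of the EINTR handling the second item refers to: retry accept() when a signal interrupts the blocking call.

          import errno
          import socket

          def accept_retrying(server_socket):
              while True:
                  try:
                      return server_socket.accept()
                  except socket.error as e:
                      if e.args[0] == errno.EINTR:
                          continue              # interrupted by a signal; retry
                      raise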
  19. Aug 09, 2014
    • [SPARK-2894] spark-shell doesn't accept flags · 4f4a9884
      Kousuke Saruta authored
      As sryza reported, spark-shell doesn't accept any flags.
      The root cause is incorrect usage of spark-submit in spark-shell, and it came to the surface with #1801.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1715, Closes #1864, and Closes #1861
      
      Closes #1825 from sarutak/SPARK-2894 and squashes the following commits:
      
      47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894
      2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py
      98287ed [Kousuke Saruta] Removed useless code from java_gateway.py
      513ad2e [Kousuke Saruta] Modified util.sh to enable to use option including white spaces
      28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments
      5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway
      e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
      4f4a9884
  20. Aug 07, 2014
    • [SPARK-2851] [mllib] DecisionTree Python consistency update · 47ccd5e7
      Joseph K. Bradley authored
      Added 6 static train methods to match the Python API, but without default arguments (with the Python default args noted in the docs).
      
      Added factory classes for Algo and Impurity, but made them private[mllib].
      
      CC: mengxr dorx  Please let me know if there are other changes which would help with API consistency---thanks!
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1798 from jkbradley/dt-python-consistency and squashes the following commits:
      
      6f7edf8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
      a0d7dbe [Joseph K. Bradley] DecisionTree: In Java-friendly train* methods, changed to use JavaRDD instead of RDD.
      ee1d236 [Joseph K. Bradley] DecisionTree API updates: * Removed train() function in Python API (tree.py) ** Removed corresponding function in Scala/Java API (the ones taking basic types)
      00f820e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
      fe6dbfa [Joseph K. Bradley] removed unnecessary imports
      e358661 [Joseph K. Bradley] DecisionTree API change: * Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs).
      c699850 [Joseph K. Bradley] a few doc comments
      eaf84c0 [Joseph K. Bradley] Added DecisionTree static train() methods API to match Python, but without default parameters
      47ccd5e7
  21. Aug 06, 2014