  Jan 15, 2016
    • [SPARK-11925][ML][PYSPARK] Add PySpark missing methods for ml.feature during Spark 1.6 QA · 5f843781
      Yanbo Liang authored
      Add missing PySpark methods and params for ml.feature:
      * ```RegexTokenizer``` should support setting ```toLowercase```.
      * ```MinMaxScalerModel``` should support output ```originalMin``` and ```originalMax```.
      * ```PCAModel``` should support output ```pc```.
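A minimal PySpark sketch of the new surface, assuming an existing SQLContext named ```sqlContext``` (the toy data and column names here are made up):

```python
from pyspark.ml.feature import RegexTokenizer, MinMaxScaler, PCA
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame(
    [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([3.0, 4.0]),)], ["features"])

# RegexTokenizer now accepts toLowercase from Python.
tokenizer = RegexTokenizer(inputCol="text", outputCol="words", toLowercase=False)

# MinMaxScalerModel exposes the min/max of the original data.
scaler_model = MinMaxScaler(inputCol="features", outputCol="scaled").fit(df)
print(scaler_model.originalMin, scaler_model.originalMax)

# PCAModel exposes the principal components matrix.
pca_model = PCA(k=1, inputCol="features", outputCol="pca").fit(df)
print(pca_model.pc)
```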
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9908 from yanboliang/spark-11925.
    • [SPARK-12575][SQL] Grammar parity with existing SQL parser · 7cd7f220
      Herman van Hovell authored
      In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base.
      
      Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
      - The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. Making this work would have required either hardcoding approximate operators in the parser or creating an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain, so this PR **removes** the keyword.
      - The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this.
      - Hive supports a charset-name/string-literal combination; for instance the expression ```_ISO-8859-1 0x4341464562616265``` yields the string ```CAFEbabe```. Hive only allows charset names that start with an underscore. This is quite annoying in Spark because, as soon as you use a tuple, names will start with an underscore. In this PR we **remove** this feature from the parser. It would be quite easy to implement such a feature as an Expression later on.
      - Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed.
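For illustration, a sketch of the new decimal-literal behaviour, assuming an existing ```sqlContext```; the ```approxCountDistinct``` DataFrame function shown as an alternative to the removed keyword is not part of this PR:

```python
from pyspark.sql import functions as F

# A plain decimal literal now becomes a Double (Hive behaviour), while the
# BD suffix requests a big decimal.
sqlContext.sql("SELECT 81923801.42 AS as_double").printSchema()
sqlContext.sql("SELECT 81923801.42BD AS as_big_decimal").printSchema()

# APPROXIMATE(0.01) COUNT(DISTINCT a) is no longer accepted; one alternative
# is the approximate count-distinct aggregate on the DataFrame side.
df = sqlContext.range(100).withColumnRenamed("id", "a")
df.agg(F.approxCountDistinct("a", 0.01)).show()
```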
      
      cc rxin viirya marmbrus yhuai cloud-fan
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #10745 from hvanhovell/SPARK-12575-2.
  Jan 14, 2016
    • [SPARK-12756][SQL] use hash expression in Exchange · 962e9bcf
      Wenchen Fan authored
      This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and the bucketed data source, which enables us to shuffle only one side when joining a bucketed table with a normal one.
      
      This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
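As a rough illustration of the bucketed-join scenario this enables, a hypothetical sketch using the later public ```DataFrameWriter.bucketBy``` API (which is not part of this PR), assuming a SparkSession named ```spark``` and made-up table names:

```python
# Write one side as a bucketed table; with bucketing and shuffle sharing one
# hash function, the bucketed side can avoid an extra Exchange when joined.
big = spark.range(1000000).withColumnRenamed("id", "key")
big.write.bucketBy(8, "key").sortBy("key").saveAsTable("bucketed_big")

small = spark.range(1000).withColumnRenamed("id", "key")
joined = spark.table("bucketed_big").join(small, "key")
joined.explain()  # inspect the plan: the bucketed side should avoid a shuffle
```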
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
  Jan 06, 2016
    • [SPARK-12617][PYSPARK] Move Py4jCallbackConnectionCleaner to Streaming · 1e6648d6
      Shixiong Zhu authored
      Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10621 from zsxwing/SPARK-12617-2.
    • [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None · fcd013cf
      zero323 authored
      If the initial model passed to GMM is not None, it causes a `net.razorvine.pickle.PickleException`. This can be fixed by converting `initialModel.weights` to a `list`.
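A small sketch of the now-working warm-start path, assuming an existing SparkContext ```sc``` (the toy data is made up):

```python
from pyspark.mllib.clustering import GaussianMixture
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([Vectors.dense([0.0]), Vectors.dense([0.1]),
                       Vectors.dense([9.0]), Vectors.dense([9.1])])

# Train once, then pass the result back as initialModel; this call
# previously failed with net.razorvine.pickle.PickleException.
first = GaussianMixture.train(data, k=2, seed=42)
refined = GaussianMixture.train(data, k=2, initialModel=first)
```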
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9986 from zero323/SPARK-12006.
    • [SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed · 3aa34882
      Yanbo Liang authored
      PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed```, as they do on the Scala side.
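A minimal sketch; a DataFrame ```df``` with ```label``` and ```features``` columns is assumed to exist:

```python
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.regression import DecisionTreeRegressor

dtc = DecisionTreeClassifier(maxDepth=3).setSeed(42)
dtr = DecisionTreeRegressor(maxDepth=3, seed=42)  # seed also works as a keyword param
# model = dtc.fit(df)
```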
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9807 from yanboliang/spark-11815.
    • [SPARK-11945][ML][PYSPARK] Add computeCost to KMeansModel for PySpark spark.ml · 95eb6516
      Yanbo Liang authored
      Add ```computeCost``` to ```KMeansModel``` as an evaluator for PySpark spark.ml.
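A short sketch, assuming an existing ```sqlContext``` (the toy data is made up):

```python
from pyspark.ml.clustering import KMeans
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)], ["features"])

model = KMeans(k=2, seed=1).fit(df)
# Sum of squared distances from each point to its nearest center.
print(model.computeCost(df))
```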
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9931 from yanboliang/SPARK-11945.
    • [SPARK-11531][ML] SparseVector error Msg · 007da1a9
      Joshi authored
      PySpark SparseVector should raise a "Found duplicate indices" error message when duplicate indices are supplied.
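For example, a hedged sketch of the failure mode (the exact exception type may differ):

```python
from pyspark.mllib.linalg import Vectors

# Index 2 appears twice; construction now fails fast with a message
# mentioning the duplicate instead of a confusing error later on.
try:
    Vectors.sparse(4, [2, 2], [1.0, 3.0])
except Exception as e:
    print(e)
```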
      
      Author: Joshi <rekhajoshm@gmail.com>
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      
      Closes #9525 from rekhajoshm/SPARK-11531.
    • [SPARK-7675][ML][PYSPARK] sparkml params type conversion · 3b29004d
      Holden Karau authored
      From JIRA:
      Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method.
      
      A possible fix would be to add a method "_checkType" to PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available.
      
      This fix instead checks the types at set time, since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float; other conversions (like scipy matrix to array) are left for the future.
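A minimal sketch of the behaviour this enables (assuming the usual ```features``` input column):

```python
from pyspark.ml.feature import Normalizer

# Previously p had to be passed as a float ("2.0"); the param machinery
# now converts the int before handing it to the JVM side.
normalizer = Normalizer(inputCol="features", outputCol="norm", p=2)
normalizer.setP(3)  # setters get the same conversion
```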
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
  Dec 30, 2015
    • [SPARK-12300] [SQL] [PYSPARK] fix schema inferance on local collections · d1ca634d
      Holden Karau authored
      Current schema inference for local Python collections halts as soon as there are no NullTypes. This is different from what happens when we specify a sampling ratio of 1.0 on a distributed collection, and it can result in incomplete schema information.
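A small sketch, assuming an existing ```sqlContext```; the rows here are made up:

```python
from pyspark.sql import Row

rows = [Row(name="Alice", age=None),   # age is unknown in the first row
        Row(name="Bob", age=30)]

# The whole local collection is now considered when merging field types,
# mirroring samplingRatio=1.0 on a distributed collection.
df = sqlContext.createDataFrame(rows)
df.printSchema()   # age should resolve to long rather than staying null
```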
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.
  Dec 20, 2015
    • [SPARK-10158][PYSPARK][MLLIB] ALS better error message when using Long IDs · ce1798b3
      Bryan Cutler authored
      Added a catch for the Long-to-Int cast exception raised when PySpark ALS Ratings are serialized. It is easy to accidentally use Long IDs for user/product, and previously this failed with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647."
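A hedged sketch of the failure, assuming an existing SparkContext ```sc``` (the ID is deliberately out of range):

```python
from pyspark.mllib.recommendation import ALS, Rating

ratings = sc.parallelize([Rating(1205640308657491975, 1, 5.0)])  # user id overflows Int

try:
    ALS.train(ratings, rank=10, iterations=5)
except Exception as e:
    # Expect the new, descriptive message about the id exceeding the
    # max integer value instead of a bare ClassCastException.
    print(e)
```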
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
  Dec 18, 2015
    • [SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels · 499ac3e6
      gatorsmile authored
      The current default storage level of the Python persist API is MEMORY_ONLY_SER. This differs from the default level MEMORY_ONLY in the official documentation and the RDD APIs.
      
      davies Is this inconsistency intentional? Thanks!
      
      Updates: Since the data is always serialized on the Python side, the JAVA-specific deserialized storage level names, such as MEMORY_ONLY, are kept rather than removed.
      
      Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
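For example, with an existing SparkContext ```sc```:

```python
from pyspark import StorageLevel

rdd = sc.parallelize(range(100))

# Objects are always pickled on the Python side, so MEMORY_ONLY here ends
# up storing serialized bytes just as MEMORY_ONLY_SER did.
rdd.persist(StorageLevel.MEMORY_ONLY)
```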
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10092 from gatorsmile/persistStorageLevel.
  Dec 17, 2015
    • [SQL] Update SQLContext.read.text doc · 6e077166
      Yanbo Liang authored
      Since we renamed the column from ```text``` to ```value``` for DataFrames loaded by ```SQLContext.read.text```, we need to update the doc.
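For example, assuming an existing ```sqlContext``` and some text file (the path is a placeholder):

```python
df = sqlContext.read.text("README.md")
df.printSchema()                            # root |-- value: string (nullable = true)
df.select("value").show(3, truncate=False)
```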
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10349 from yanboliang/text-value.