- Jan 20, 2016
Gábor Lipták authored
This is #9263 from gliptak (improving the grouping/display of test case results) with a small fix to the bisecting k-means unit test. Author: Gábor Lipták <gliptak@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10850 from mengxr/SPARK-11295.
- Jan 19, 2016
Xiangrui Meng authored
This reverts commit c6f971b4.
BenFradet authored
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10472 from BenFradet/SPARK-9716.
Gábor Lipták authored
SPARK-11295: Add packages to JUnit output for Python tests. This improves the grouping/display of test case results. Author: Gábor Lipták <gliptak@gmail.com> Closes #9263 from gliptak/SPARK-11295.
Holden Karau authored
From the coverage issues for 1.6: add a Python API for mllib.clustering.BisectingKMeans. Author: Holden Karau <holden@us.ibm.com> Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
Sean Owen authored
Fix the order of arguments that PySpark RDD.fold passes to its op; it should be (acc, obj) like the other implementations. Obviously, this is a potentially breaking change, so it can only happen in 2.x. CC davies Author: Sean Owen <sowen@cloudera.com> Closes #10771 from srowen/SPARK-7683.
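A minimal sketch of the corrected contract, assuming an existing SparkContext `sc` (the variable is illustrative, not from the commit):

```python
# After this change the op is called as op(acc, obj), i.e. accumulator
# first, element second, matching the Scala/Java implementations.
rdd = sc.parallelize([1, 2, 3, 4])

# The order only matters for non-commutative ops, but it is now
# consistent across the language APIs.
total = rdd.fold(0, lambda acc, obj: acc + obj)
print(total)  # 10
```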
- Jan 15, 2016
Yanbo Liang authored
Add missing PySpark methods and params for ml.feature:
* ```RegexTokenizer``` should support setting ```toLowercase```.
* ```MinMaxScalerModel``` should support outputting ```originalMin``` and ```originalMax```.
* ```PCAModel``` should support outputting ```pc```.
Author: Yanbo Liang <ybliang8@gmail.com> Closes #9908 from yanboliang/spark-11925.
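A hedged usage sketch of one of the added params (`toLowercase` on `RegexTokenizer`), assuming an active SparkContext; column names are illustrative:

```python
from pyspark.ml.feature import RegexTokenizer

# toLowercase can now be set from Python, mirroring the Scala param.
tok = RegexTokenizer(inputCol="text", outputCol="words", toLowercase=False)
```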
Herman van Hovell authored
In this PR the new CatalystQl parser stack reaches grammar parity with the old parser-combinator based SQL parser. This PR also replaces all uses of the old parser and removes it from the code base. Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
- The SQL parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. To make this work we would need to hardcode approximate operators in the parser, or create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain, so this PR **removes** this keyword.
- The old SQL parser supported ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See https://github.com/apache/spark/pull/10689 for the rationale.
- Hive supports a charset-name/char-set-literal combination: for instance the expression ```_ISO-8859-1 0x4341464562616265``` yields the string ```CAFEbabe```. Hive only allows charset names that start with an underscore, which is quite annoying in Spark because as soon as you use a tuple, names will start with an underscore. This PR **removes** the feature from the parser; it would be quite easy to implement such a feature as an Expression later on.
- Hive and the SQL parser treat decimal literals differently. Hive turns any decimal into a ```Double```, whereas the SQL parser converted a non-scientific decimal into a ```BigDecimal``` and a scientific decimal into a ```Double```. We follow Hive's behavior here. The new parser also supports a big decimal literal, for instance ```81923801.42BD```, which can be used when a big decimal is needed.
cc rxin viirya marmbrus yhuai cloud-fan Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10745 from hvanhovell/SPARK-12575-2.
- Jan 14, 2016
Wenchen Fan authored
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and bucketed data sources, which enables us to shuffle only one side when joining a bucketed table with a normal one. This PR also fixes the tests that are broken by the new hash behaviour in shuffle. Author: Wenchen Fan <wenchen@databricks.com> Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
- Jan 13, 2016
Reynold Xin authored
This pull request rewrites the CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field. Prior to this pull request, each even position in "branches" represented the condition for a branch, and each odd position the corresponding value. Using them was pretty confusing, with a lot of sliding-window or grouped(2) calls. Author: Reynold Xin <rxin@databricks.com> Closes #10734 from rxin/simplify-case.
Wenchen Fan authored
https://issues.apache.org/jira/browse/SPARK-12642 Author: Wenchen Fan <wenchen@databricks.com> Closes #10694 from cloud-fan/hash-expr.
Erik Selin authored
This replaces the `execfile` used for running custom Python shell scripts with an explicit open, compile and exec (as recommended by 2to3). The reason for this change is to make the PYTHONSTARTUP option compatible with Python 3. Author: Erik Selin <erik.selin@gmail.com> Closes #10255 from tyro89/pythonstartup-python3.
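A minimal sketch of the 2to3-style replacement, assuming the script path comes from the PYTHONSTARTUP environment variable:

```python
import os

# execfile() no longer exists in Python 3; open, compile and exec the
# startup script explicitly instead.
startup = os.environ.get("PYTHONSTARTUP")
if startup:
    with open(startup, "rb") as f:
        code = compile(f.read(), startup, "exec")
    exec(code)
```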
- Jan 12, 2016
Shixiong Zhu authored
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
  - Still keep the change that reads the checkpoint only once. This is a manual change and worth a careful look: https://github.com/zsxwing/spark/commit/bfd4b5c040eb29394c3132af3c670b1a7272457c
- [x] Verify there is no leak anymore after reverting our workarounds
Author: Shixiong Zhu <shixiong@databricks.com> Closes #10692 from zsxwing/py4j-0.9.1.
- Jan 11, 2016
Yanbo Liang authored
[SPARK-12603][MLLIB] PySpark MLlib ```GaussianMixtureModel``` should support single-instance ```predict```/```predictSoft``` just like the Scala side does. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10552 from yanboliang/spark-12603.
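A hedged usage sketch, assuming an existing SparkContext `sc` and toy data:

```python
from pyspark.mllib.clustering import GaussianMixture
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([Vectors.dense([0.0]), Vectors.dense([10.0])])
model = GaussianMixture.train(data, 2)

# predict/predictSoft now accept a single vector as well as an RDD.
print(model.predict(Vectors.dense([9.5])))      # cluster index for one point
print(model.predictSoft(Vectors.dense([9.5])))  # per-component membership
```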
- Jan 08, 2016
Sean Owen authored
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs. Author: Sean Owen <sowen@cloudera.com> Closes #10570 from srowen/SPARK-12618.
- Jan 07, 2016
zero323 authored
If the initial model passed to GMM is not empty, it causes a `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to a `list`. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #10644 from zero323/SPARK-12006.
- Jan 06, 2016
Shixiong Zhu authored
Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10621 from zsxwing/SPARK-12617-2.
zero323 authored
If the initial model passed to GMM is not empty, it causes a `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to a `list`. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9986 from zero323/SPARK-12006.
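A hedged sketch of the call that previously failed, assuming `data` is an RDD of vectors built elsewhere:

```python
from pyspark.mllib.clustering import GaussianMixture

# Training with a non-empty initialModel no longer raises
# net.razorvine.pickle.PickleException.
init = GaussianMixture.train(data, 2)
model = GaussianMixture.train(data, 2, initialModel=init)
```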
Yanbo Liang authored
[SPARK-11815][ML][PYSPARK] PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed```, like what we do on the Scala side. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9807 from yanboliang/spark-11815.
Yanbo Liang authored
Add ```computeCost``` to ```KMeansModel``` as an evaluator for PySpark spark.ml. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9931 from yanboliang/SPARK-11945.
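A hedged usage sketch of the new method (pyspark.ml.clustering), assuming `df` is a DataFrame with a "features" vector column:

```python
from pyspark.ml.clustering import KMeans

model = KMeans(k=2, seed=1).fit(df)
# Sum of squared distances of points to their nearest center.
print(model.computeCost(df))
```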
Joshi authored
PySpark SparseVector should have "Found duplicate indices" error message Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #9525 from rekhajoshm/SPARK-11531.
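A hedged sketch of the failure mode; the exact exception type and message wording are assumed from the commit title:

```python
from pyspark.mllib.linalg import SparseVector

# Index 1 appears twice: this now raises an error mentioning
# "Found duplicate indices" instead of silently building a bad vector.
SparseVector(4, [1, 1], [3.0, 4.0])
```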
Holden Karau authored
From JIRA: Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before the Python wrappers call Java's Params setter method. A possible fix is to add a method "_checkType" to PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available. This fix instead checks the types at set time, since failing sooner is better, but it could be switched around to check at copy time if that would be better. So far this only converts int to float; other conversions (like scipy matrix to array) are left for the future. Author: Holden Karau <holden@us.ibm.com> Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
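A hedged sketch of the behavior this enables, assuming an active SparkContext (instantiating a pyspark.ml stage touches the JVM):

```python
from pyspark.ml.feature import Normalizer

# An int is now converted to float when set on a float-typed Param,
# instead of failing later when copied to the Scala side.
norm = Normalizer(inputCol="features", outputCol="norm", p=2)
assert norm.getP() == 2.0
```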
- Jan 05, 2016
Kai Jiang authored
Add `columnSimilarities` to IndexedRowMatrix for PySpark spark.mllib.linalg. Author: Kai Jiang <jiangkai@gmail.com> Closes #10158 from vectorijk/spark-12041.
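A hedged usage sketch, assuming an existing SparkContext `sc`:

```python
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

rows = sc.parallelize([IndexedRow(0, [1.0, 0.0]), IndexedRow(1, [0.0, 1.0])])
# Returns a CoordinateMatrix of cosine similarities between columns.
sims = IndexedRowMatrix(rows).columnSimilarities()
print(sims.entries.collect())
```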
Shixiong Zhu authored
There is an issue where Py4J's PythonProxyHandler.finalize blocks forever (https://github.com/bartdag/py4j/pull/184). Py4J will create a PythonProxyHandler in Java for "transformer_serializer" when calling "registerSerializer". If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one; the first one will then be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call "registerSerializer" more than once, so that the "PythonProxyHandler" on the Java side won't be GCed. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10514 from zsxwing/SPARK-12511.
Shixiong Zhu authored
This patch added Py4jCallbackConnectionCleaner to clean up the leaked sockets of Py4J every 30 seconds. This is a workaround until Py4J fixes the leak issue https://github.com/bartdag/py4j/issues/187. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10579 from zsxwing/SPARK-12617.
Wenchen Fan authored
Address comments in #10435. This makes the API easier to use when users programmatically generate the call to hash, and they will get an analysis exception if the argument list of hash is empty. Author: Wenchen Fan <wenchen@databricks.com> Closes #10588 from cloud-fan/hash.
- Jan 04, 2016
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #10559 from rxin/remove-deprecated-sql.
- Jan 03, 2016
Holden Karau authored
Previously (when the PR was first created) not specifying b= explicitly was fine (and treated as a default null); instead, be explicit about b being None in the test. Author: Holden Karau <holden@us.ibm.com> Closes #10564 from holdenk/SPARK-12611-fix-test-infer-schema-local.
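A hedged sketch of the test pattern, assuming an existing SQLContext `sqlContext`; field names are illustrative:

```python
from pyspark.sql import Row

# Spell out b=None rather than relying on an implicit default.
rows = [Row(a=1, b=None), Row(a=2, b="hello")]
df = sqlContext.createDataFrame(rows)  # b is inferred as a nullable string
```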
Cazen authored
This provides an option to choose whether the JSON parser should accept backslash quoting of all characters or not. Author: Cazen <Cazen@korea.com> Author: Cazen Lee <cazen.lee@samsung.com> Author: Cazen Lee <Cazen@korea.com> Author: cazen.lee <cazen.lee@samsung.com> Closes #10497 from Cazen/master.
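A hedged sketch, assuming the option name `allowBackslashEscapingAnyCharacter` (taken from the related JIRA discussion, not stated in this message) and an existing SQLContext:

```python
# When enabled, the JSON parser accepts backslash quoting of any
# character, e.g. "\$" inside a string field.
df = (sqlContext.read
      .option("allowBackslashEscapingAnyCharacter", "true")
      .json("records.json"))
```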
- Dec 30, 2015
Holden Karau authored
Current schema inference for local Python collections halts as soon as there are no NullTypes. This is different from when we specify a sampling ratio of 1.0 on a distributed collection, and it could result in incomplete schema information. Author: Holden Karau <holden@us.ibm.com> Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.
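A hedged illustration of the fixed behavior, assuming an existing SQLContext `sqlContext`:

```python
from pyspark.sql import Row

# A None in an early row no longer leaves the field unresolved; inference
# keeps scanning the local collection until all NullTypes are resolved.
data = [Row(a=1, b=None), Row(a=2, b="x")]
sqlContext.createDataFrame(data).printSchema()  # b resolved to string
```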
- Dec 28, 2015
jerryshao authored
The semantics of Python countByValue differ from the Scala API; it behaves more like countDistinctValue, so this change makes it consistent with the Scala/Java API. Author: jerryshao <sshao@hortonworks.com> Closes #10350 from jerryshao/SPARK-12353.
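A hedged sketch (pyspark.streaming), assuming an existing StreamingContext `ssc`:

```python
lines = ssc.socketTextStream("localhost", 9999)

# Now emits (value, count) pairs per batch, consistent with Scala/Java,
# rather than a single count of distinct values.
lines.countByValue().pprint()
```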
gatorsmile authored
After reading the JIRA https://issues.apache.org/jira/browse/SPARK-12520, I double-checked the code. For example, users can do an equi-join like ```df.join(df2, 'name', 'outer').select('name', 'height').collect()```.
- There is a bug in 1.5 and 1.4: the code simply ignores the third parameter (join type) that users pass, so the join performed is always `Inner`, even if the user-specified type is something else (e.g., `Outer`).
- After PR https://github.com/apache/spark/pull/8600, 1.6 no longer has this issue, but the description had not been updated.
Plan to submit another PR to fix 1.5 and issue an error message if users specify a non-inner join type when using an equi-join. Author: gatorsmile <gatorsmile@gmail.com> Closes #10477 from gatorsmile/pyOuterJoin.
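A sketch of the exact call from the description, assuming two DataFrames sharing a `name` column:

```python
# In 1.4/1.5 the 'outer' argument was silently ignored and an inner
# join was performed instead.
df.join(df2, 'name', 'outer').select('name', 'height').collect()
```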
- Dec 22, 2015
Holden Karau authored
Some methods are missing, such as ways to access the std, mean, etc. This PR is for feature parity for pyspark.mllib.feature.StandardScaler & StandardScalerModel. Author: Holden Karau <holden@us.ibm.com> Closes #10298 from holdenk/SPARK-12296-feature-parity-pyspark-mllib-StandardScalerModel.
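A hedged usage sketch, assuming `vectors` is an RDD of feature vectors and that the new attribute names mirror the Scala model (`mean`, `std`):

```python
from pyspark.mllib.feature import StandardScaler

model = StandardScaler(withMean=True, withStd=True).fit(vectors)
print(model.mean, model.std)  # now accessible from Python
```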
- Dec 21, 2015
pshearer authored
Author: pshearer <pshearer@massmutual.com> Closes #10414 from pshearer/patch-1.
Jeff Zhang authored
No JIRA was created since this is a trivial change. davies, please help review it. Author: Jeff Zhang <zjffdu@apache.org> Closes #10143 from zjffdu/pyspark_typo.
- Dec 20, 2015
Bryan Cutler authored
Added a catch for the Long-to-Int casting exception when PySpark ALS Ratings are serialized. It is easy to accidentally use Long IDs for user/product, and before this change it would fail with the somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now, if this happens, a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647." Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
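A hedged sketch of the failure mode, assuming an existing SparkContext `sc`:

```python
from pyspark.mllib.recommendation import ALS, Rating

# The user id exceeds Int range, so training now fails with the
# descriptive PickleException instead of a raw ClassCastException.
ratings = sc.parallelize([Rating(1205640308657491975, 1, 5.0)])
ALS.train(ratings, rank=10)
```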
- Dec 19, 2015
Yanbo Liang authored
Fix the mistaken join-type documentation for ```dataframe.join```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10378 from yanboliang/leftsemi.
- Dec 18, 2015
gatorsmile authored
The current default storage level of the Python persist API is MEMORY_ONLY_SER. This differs from the default level MEMORY_ONLY in the official documentation and the RDD APIs. davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the Java-specific deserialized storage levels, such as MEMORY_ONLY, are not removed.
Updates based on the reviewers' feedback: In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`. Author: gatorsmile <gatorsmile@gmail.com> Closes #10092 from gatorsmile/persistStorageLevel.
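A minimal usage sketch, assuming an existing SparkContext `sc`:

```python
from pyspark import StorageLevel

# In Python the data is always pickled, so a "serialized" level buys
# nothing extra; any of the listed levels can be used directly.
rdd = sc.parallelize(range(100))
rdd.persist(StorageLevel.MEMORY_AND_DISK)
```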
- Dec 17, 2015
Yanbo Liang authored
Since we renamed the column from ```text``` to ```value``` for DataFrames loaded by ```SQLContext.read.text```, we need to update the doc. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10349 from yanboliang/text-value.
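A hedged sketch, assuming an existing SQLContext and a text file path:

```python
df = sqlContext.read.text("README.md")
# The single output column is now named "value" rather than "text".
df.select("value").show()
```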