  1. Jul 02, 2015
    • Reynold Xin's avatar
      Revert "[SPARK-8784] [SQL] Add Python API for hex and unhex" · e589e71a
      Reynold Xin authored
      This reverts commit fc7aebd9.
      e589e71a
    • Yu ISHIKAWA's avatar
      [SPARK-7104] [MLLIB] Support model save/load in Python's Word2Vec · 488bad31
      Yu ISHIKAWA authored
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6821 from yu-iskw/SPARK-7104 and squashes the following commits:
      
      975136b [Yu ISHIKAWA] Organize import
      0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs
      cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save`
      1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
      488bad31
    • Davies Liu's avatar
      [SPARK-8784] [SQL] Add Python API for hex and unhex · fc7aebd9
      Davies Liu authored
      Also improve the performance of hex/unhex
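      For illustration, a minimal PySpark sketch of the `hex`/`unhex` wrappers described here (the `SparkSession` setup and sample data are illustrative; the original patch targeted the SQLContext-era API):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import hex, unhex

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(255, "4D79")], ["n", "h"])
      # hex(255) yields the string 'FF'; unhex('4D79') yields the raw bytes b'My'
      df.select(hex("n"), unhex("h")).show()
      ```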
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7181 from davies/hex and squashes the following commits:
      
      f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
      49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
      b31fc9a [Davies Liu] Update math.scala
      25156b7 [Davies Liu] address comments and fix test
      c3af78c [Davies Liu] address commments
      1a24082 [Davies Liu] Add Python API for hex and unhex
      fc7aebd9
    • Yijie Shen's avatar
      [SPARK-8407] [SQL] complex type constructors: struct and named_struct · 52302a80
      Yijie Shen authored
      This is a follow up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](https://github.com/apache/spark/pull/6828)), to support both `struct` and `named_struct` in Spark SQL.
      
      After [#6725](https://github.com/apache/spark/pull/6828), the semantics of the [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) method have changed slightly: it is no longer limited to columns of `NamedExpression`s, and it names non-NamedExpression fields following the Hive convention (col1, col2, ...).
      
      This PR both loosens [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to accept children of `Expression` type and adds `named_struct` support.
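      For illustration, a hedged PySpark sketch of the loosened `struct` and the new `named_struct` (sample data and names are illustrative, not taken from this patch):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import struct, expr

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, "a")], ["x", "y"])

      # struct() now accepts arbitrary expressions; fields without an explicit
      # name are auto-named following the Hive convention described above.
      df.select(struct(df.x + 1, df.y).alias("s")).printSchema()

      # named_struct lets the caller pick the field names explicitly (SQL form).
      df.select(expr("named_struct('id', x, 'label', y)").alias("s")).printSchema()
      ```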
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #6874 from yijieshen/SPARK-8283 and squashes the following commits:
      
      4cd3375ac [Yijie Shen] change struct documentation
      d599d0b [Yijie Shen] rebase code
      9a7039e [Yijie Shen] fix reviews and regenerate golden answers
      b487354 [Yijie Shen] replace assert using checkAnswer
      f07e114 [Yijie Shen] tiny fix
      9613be9 [Yijie Shen] review fix
      7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable
      60812a7 [Yijie Shen] Fix type check
      828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method
      fd3cd8e [Yijie Shen] remove type check from eval
      7a71255 [Yijie Shen] tiny fix
      ccbbd86 [Yijie Shen] Fix reviews
      47da332 [Yijie Shen] remove nameStruct API from DataFrame
      917e680 [Yijie Shen] Fix reviews
      4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children
      0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
      52302a80
    • Tarek Auel's avatar
      [SPARK-8223] [SPARK-8224] [SQL] shift left and shift right · 5b333813
      Tarek Auel authored
      Jira:
      https://issues.apache.org/jira/browse/SPARK-8223
      https://issues.apache.org/jira/browse/SPARK-8224
      
      ~~I am aware of #7174 and will update this pr, if it's merged.~~ Done
      I don't know if #7034 can simplify this, but we can have a look on it, if it gets merged
      
      rxin In the Jira ticket the function has no second argument. I added a `numBits` argument that allows specifying the number of bits, which I think improves usability. I wanted to add `shiftleft(value)` as well, but the `selectExpr` dataframe tests crash if I have both. In order to do this, I added the following to functions.scala: `def shiftRight(e: Column): Column = ShiftRight(e.expr, lit(1).expr)`, but as I mentioned this doesn't pass tests like `df.selectExpr("shiftRight(a)", ...` (not enough arguments exception).
      
      If we need the bitwise shift in order to be Hive compatible, I suggest adding `shiftLeft` and something like `shiftLeftX`.
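      For illustration, a small sketch of the resulting SQL functions called from PySpark via `selectExpr` (sample data is illustrative, not from this patch):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(8,)], ["v"])
      # shiftleft(8, 2) -> 32, shiftright(8, 2) -> 2
      df.selectExpr("shiftleft(v, 2)", "shiftright(v, 2)").show()
      ```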
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7178 from tarekauel/8223 and squashes the following commits:
      
      8023bb5 [Tarek Auel] [SPARK-8223][SPARK-8224] fixed test
      f3f64e6 [Tarek Auel] [SPARK-8223][SPARK-8224] Integer -> Int
      f628706 [Tarek Auel] [SPARK-8223][SPARK-8224] removed toString; updated function description
      3b56f2a [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
      5189690 [Tarek Auel] [SPARK-8223][SPARK-8224] minor fix and style fix
      9434a28 [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
      44ee324 [Tarek Auel] [SPARK-8223][SPARK-8224] docu fix
      ac7fe9d [Tarek Auel] [SPARK-8223][SPARK-8224] right and left bit shift
      5b333813
  2. Jul 01, 2015
    • Reynold Xin's avatar
      [SPARK-8770][SQL] Create BinaryOperator abstract class. · 9fd13d56
      Reynold Xin authored
      Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.
      
      This patch creates a new BinaryOperator abstract class and updates the analyzer to apply the type-casting rule only there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7174 from rxin/binary-opterator and squashes the following commits:
      
      f31900d [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
      fceb216 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into binary-opterator
      d8518cf [Reynold Xin] Updated Python tests.
      9fd13d56
    • Davies Liu's avatar
      [SPARK-8766] support non-ascii character in column names · f958f27e
      Davies Liu authored
      Use UTF-8 to encode the names of columns in Python 2; otherwise they may fail to encode with the default encoding ('ascii').
      
      This PR also fixes a bug when there is a Java exception without an error message.
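      For illustration, a minimal sketch of the behaviour this fixes (assuming a modern SparkSession; the non-ASCII column name is an arbitrary example):

      ```python
      # -*- coding: utf-8 -*-
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      # A non-ASCII column name should round-trip without a UnicodeEncodeError.
      df = spark.createDataFrame([(1, "x")], ["数量", "label"])
      df.select("数量").show()
      ```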
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7165 from davies/non_ascii and squashes the following commits:
      
      02cb61a [Davies Liu] fix tests
      3b09d31 [Davies Liu] add encoding in header
      867754a [Davies Liu] support non-ascii character in column names
      f958f27e
    • zsxwing's avatar
      [SPARK-8378] [STREAMING] Add the Python API for Flume · 75b9fe4c
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6830 from zsxwing/flume-python and squashes the following commits:
      
      78dfdac [zsxwing] Fix the compile error in the test code
      f1bf3c0 [zsxwing] Address TD's comments
      0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
      e93736b [zsxwing] Fix the test case for determine_modules_to_test
      9d5821e [zsxwing] Fix pyspark_core dependencies
      f9ee681 [zsxwing] Merge branch 'master' into flume-python
      7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
      b96b0de [zsxwing] Merge branch 'master' into flume-python
      ce85e83 [zsxwing] Fix incompatible issues for Python 3
      01cbb3d [zsxwing] Add import sys
      152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
      14ba0ff [zsxwing] Add flume-assembly for sbt building
      b8d5551 [zsxwing] Merge branch 'master' into flume-python
      4762c34 [zsxwing] Fix the doc
      0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
      9f33873 [zsxwing] Add the Python API for Flume
      75b9fe4c
    • Joseph K. Bradley's avatar
      [SPARK-8765] [MLLIB] [PYTHON] removed flaky python PIC test · b8faa328
      Joseph K. Bradley authored
      See failure: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36133/console]
      
      CC yanboliang  mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7164 from jkbradley/pic-python-test and squashes the following commits:
      
      156d55b [Joseph K. Bradley] removed flaky python PIC test
      b8faa328
    • lewuathe's avatar
      [SPARK-6263] [MLLIB] Python MLlib API missing items: Utils · 184de91d
      lewuathe authored
      Implement the missing APIs in PySpark.
      
      MLUtils
      * appendBias
      * loadVectors
      
      `kFold` is also missing; however, I am not sure whether a `ClassTag` can be passed or restored through Python.
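      For illustration, a hedged sketch of calling `appendBias` from `pyspark.mllib.util` (setup is illustrative; a live SparkContext is assumed):

      ```python
      from pyspark import SparkContext
      from pyspark.mllib.linalg import Vectors
      from pyspark.mllib.util import MLUtils

      sc = SparkContext.getOrCreate()
      # appendBias appends a trailing 1.0 so a linear model can learn an intercept.
      print(MLUtils.appendBias(Vectors.dense([1.0, 2.0])))  # [1.0, 2.0, 1.0]
      ```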
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits:
      
      16863ea [lewuathe] Merge master
      3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263
      6084e9c [lewuathe] Resolv conflict
      d2aa2a0 [lewuathe] Resolv conflict
      9c329d8 [lewuathe] Fix efficiency
      3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263
      1d4714b [lewuathe] Fix style
      b29e2bc [lewuathe] Remove scipy dependencies
      e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263
      25d3c9d [lewuathe] Remove unnecessary imports
      7ec04db [lewuathe] Resolv conflict
      1502d13 [lewuathe] Resolv conflict
      d6bd416 [lewuathe] Check existence of scipy.sparse
      5d555b1 [lewuathe] Construct scipy.sparse matrix
      c345a44 [lewuathe] Merge branch 'master' into SPARK-6263
      b8b5ef7 [lewuathe] Fix unnecessary sort method
      d254be7 [lewuathe] Merge branch 'master' into SPARK-6263
      62a9c7e [lewuathe] Fix appendBias return type
      454c73d [lewuathe] Merge branch 'master' into SPARK-6263
      a353354 [lewuathe] Remove unnecessary appendBias implementation
      44295c2 [lewuathe] Merge branch 'master' into SPARK-6263
      64f72ad [lewuathe] Merge branch 'master' into SPARK-6263
      c728046 [lewuathe] Fix style
      2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
      184de91d
    • cocoatomo's avatar
      [SPARK-8763] [PYSPARK] executing run-tests.py with Python 2.6 fails with... · fdcad6ef
      cocoatomo authored
      [SPARK-8763] [PYSPARK] executing run-tests.py with Python 2.6 fails with absence of subprocess.check_output function
      
      Running run-tests.py with Python 2.6 causes the following error:
      
      ```
      Running PySpark tests. Output is in python//Users/tomohiko/.jenkins/jobs/pyspark_test/workspace/python/unit-tests.log
      Will test against the following Python executables: ['python2.6', 'python3.4', 'pypy']
      Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
      Traceback (most recent call last):
        File "./python/run-tests.py", line 196, in <module>
          main()
        File "./python/run-tests.py", line 159, in main
          python_implementation = subprocess.check_output(
      AttributeError: 'module' object has no attribute 'check_output'
      ...
      ```
      
      The cause of this error is the use of the subprocess.check_output function, which has only existed since Python 2.7.
      (ref. https://docs.python.org/2.7/library/subprocess.html#subprocess.check_output)
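      For reference, the usual backport pattern for Python 2.6 looks roughly like this (a sketch of the general idiom, not necessarily the exact code added by this commit):

      ```python
      import subprocess

      if not hasattr(subprocess, "check_output"):
          def check_output(*popenargs, **kwargs):
              # Minimal re-implementation of Python 2.7's subprocess.check_output.
              process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs)
              output, _ = process.communicate()
              retcode = process.poll()
              if retcode:
                  raise subprocess.CalledProcessError(retcode, popenargs[0])
              return output
          subprocess.check_output = check_output
      ```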
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #7161 from cocoatomo/issues/8763-test-fails-py26 and squashes the following commits:
      
      cf4f901 [cocoatomo] [SPARK-8763] backport process.check_output function from Python 2.7
      fdcad6ef
  3. Jun 30, 2015
    • x1-'s avatar
      [SPARK-8535] [PYSPARK] PySpark : Can't create DataFrame from Pandas dataframe... · b6e76edf
      x1- authored
      [SPARK-8535] [PYSPARK] PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name
      
      The implicit names of `pandas.columns` are Ints, but the `StructField` JSON expects `String`s.
      So I think `pandas.columns` should be converted to `String`.
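      For illustration, a minimal sketch of the scenario (assuming a modern SparkSession and pandas installed):

      ```python
      import pandas as pd
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      pdf = pd.DataFrame([[1, 2]])      # no explicit names, so the columns are the ints 0 and 1
      df = spark.createDataFrame(pdf)   # with this fix the names become the strings "0" and "1"
      df.printSchema()
      ```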
      
      ### issue
      
      * [SPARK-8535 PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name](https://issues.apache.org/jira/browse/SPARK-8535)
      
      Author: x1- <viva008@gmail.com>
      
      Closes #7124 from x1-/SPARK-8535 and squashes the following commits:
      
      d68fd38 [x1-] modify unit-test using pandas.
      ea1897d [x1-] For implicit name of pandas.columns are Int, so should be convert to String.
      b6e76edf
    • Tarek Auel's avatar
      [SPARK-8727] [SQL] Missing python api; md5, log2 · ccdb0522
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-8727
      
      Author: Tarek Auel <tarek.auel@gmail.com>
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7114 from tarekauel/missing-python and squashes the following commits:
      
      ef4c61b [Tarek Auel] [SPARK-8727] revert dataframe change
      4029d4d [Tarek Auel] removed dataframe pi and e unit test
      66f0d2b [Tarek Auel] removed pi and e from python api and dataframe api; added _to_java_column(col) for strlen
      4d07318 [Tarek Auel] fixed python unit test
      45f2bee [Tarek Auel] fixed result of pi and e
      c39f47b [Tarek Auel] add python api
      bd50a3a [Tarek Auel] add missing python functions
      ccdb0522
    • Davies Liu's avatar
      [SPARK-8738] [SQL] [PYSPARK] capture SQL AnalysisException in Python API · 58ee2a2e
      Davies Liu authored
      Capture the AnalysisException from SQL, hide the long Java stack trace, and only show the error message.
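      For illustration, a hedged sketch of the resulting behaviour in PySpark (the table name is made up):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.utils import AnalysisException

      spark = SparkSession.builder.getOrCreate()
      try:
          spark.sql("SELECT some_col FROM table_that_does_not_exist")
      except AnalysisException as e:
          print(e)  # concise analysis error, without the long Java stack trace
      ```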
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7135 from davies/ananylis and squashes the following commits:
      
      dad7ae7 [Davies Liu] add comment
      ec0c0e8 [Davies Liu] Update utils.py
      cdd7edd [Davies Liu] add doc
      7b044c2 [Davies Liu] fix python 3
      f84d3bd [Davies Liu] capture SQL AnalysisException in Python API
      58ee2a2e
    • MechCoder's avatar
      [SPARK-8679] [PYSPARK] [MLLIB] Default values in Pipeline API should be immutable · 5fa08636
      MechCoder authored
      It might be dangerous to have a mutable object as the value of a default param. (http://stackoverflow.com/a/11416002/1170730)
      
      e.g.
      
          def func(example, f={}):
              f[example] = 1
              return f
      
          func(2)   # {2: 1}
          func(3)   # {2: 1, 3: 1} -- the same default dict is shared across calls
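      The usual remedy, shown below as a sketch (not necessarily the exact change made in this PR), is an immutable sentinel with the mutable value created inside the function:

      ```python
      def func(example, f=None):
          if f is None:
              f = {}      # a fresh dict per call instead of one shared default
          f[example] = 1
          return f

      func(2)  # {2: 1}
      func(3)  # {3: 1}
      ```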
      
      mengxr
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7058 from MechCoder/pipeline_api_playground and squashes the following commits:
      
      40a5eb2 [MechCoder] copy
      95f7ff2 [MechCoder] [SPARK-8679] [PySpark] [MLlib] Default values in Pipeline API should be immutable
      5fa08636
    • MechCoder's avatar
      [SPARK-4127] [MLLIB] [PYSPARK] Python bindings for StreamingLinearRegressionWithSGD · 45281664
      MechCoder authored
      Python bindings for StreamingLinearRegressionWithSGD
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6744 from MechCoder/spark-4127 and squashes the following commits:
      
      d8f6457 [MechCoder] Moved StreamingLinearAlgorithm to pyspark.mllib.regression
      d47cc24 [MechCoder] Inherit from StreamingLinearAlgorithm
      1b4ddd6 [MechCoder] minor
      4de6c68 [MechCoder] Minor refactor
      5e85a3b [MechCoder] Add tests for simultaneous training and prediction
      fb27889 [MechCoder] Add example and docs
      505380b [MechCoder] Add tests
      d42bdae [MechCoder] [SPARK-4127] Python bindings for StreamingLinearRegressionWithSGD
      45281664
    • zsxwing's avatar
      [SPARK-8434][SQL]Add a "pretty" parameter to the "show" method to display long strings · 12671dd5
      zsxwing authored
      Sometimes the user may want to show the complete content of cells. Now `sql("set -v").show()` displays:
      
      ![screen shot 2015-06-18 at 4 34 51 pm](https://cloud.githubusercontent.com/assets/1000778/8227339/14d3c5ea-15d9-11e5-99b9-f00b7e93beef.png)
      
      The user needs to use something like `sql("set -v").collect().foreach(r => r.toSeq.mkString("\t"))` to show the complete content.
      
      This PR adds a `pretty` parameter to show. If `pretty` is false, `show` won't truncate strings or align cells right.
      
      ![screen shot 2015-06-18 at 4 21 44 pm](https://cloud.githubusercontent.com/assets/1000778/8227407/b6f8dcac-15d9-11e5-8219-8079280d76fc.png)
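      For illustration, a sketch of the resulting PySpark call (the parameter ended up being named `truncate`, per the squashed commits below; setup is illustrative):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      spark.sql("SET -v").show(truncate=False)  # show full cell contents
      spark.sql("SET -v").show()                # default: long values are truncated
      ```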
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6877 from zsxwing/show and squashes the following commits:
      
      22e28e9 [zsxwing] pretty -> truncate
      e582628 [zsxwing] Add pretty parameter to the show method in R
      a3cd55b [zsxwing] Fix calling showString in R
      923cee4 [zsxwing] Add a "pretty" parameter to show to display long strings
      12671dd5
    • Josh Rosen's avatar
      [SPARK-5161] [HOTFIX] Fix bug in Python test failure reporting · 6c5a6db4
      Josh Rosen authored
      This patch fixes a bug introduced in #7031 which can cause Jenkins to incorrectly report a build with failed Python tests as passing if an error occurred while printing the test failure message.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7112 from JoshRosen/python-tests-hotfix and squashes the following commits:
      
      c3f2961 [Josh Rosen] Hotfix for bug in Python test failure reporting
      6c5a6db4
  4. Jun 29, 2015
    • Josh Rosen's avatar
      [SPARK-5161] Parallelize Python test execution · 7bbbe380
      Josh Rosen authored
      This commit parallelizes the Python unit test execution, significantly reducing Jenkins build times.  Parallelism is now configurable by passing the `-p` or `--parallelism` flags to either `dev/run-tests` or `python/run-tests` (the default parallelism is 4, but I've successfully tested with higher parallelism).
      
      To avoid flakiness, I've disabled the Spark Web UI for the Python tests, similar to what we've done for the JVM tests.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7031 from JoshRosen/parallelize-python-tests and squashes the following commits:
      
      feb3763 [Josh Rosen] Re-enable other tests
      f87ea81 [Josh Rosen] Only log output from failed tests
      d4ded73 [Josh Rosen] Logging improvements
      a2717e1 [Josh Rosen] Make parallelism configurable via dev/run-tests
      1bacf1b [Josh Rosen] Merge remote-tracking branch 'origin/master' into parallelize-python-tests
      110cd9d [Josh Rosen] Fix universal_newlines for Python 3
      cd13db8 [Josh Rosen] Also log python_implementation
      9e31127 [Josh Rosen] Log Python --version output for each executable.
      a2b9094 [Josh Rosen] Bump up parallelism.
      5552380 [Josh Rosen] Python 3 fix
      866b5b9 [Josh Rosen] Fix lazy logging warnings in Prospector checks
      87cb988 [Josh Rosen] Skip MLLib tests for PyPy
      8309bfe [Josh Rosen] Temporarily disable parallelism to debug a failure
      9129027 [Josh Rosen] Disable Spark UI in Python tests
      037b686 [Josh Rosen] Temporarily disable JVM tests so we can test Python speedup in Jenkins.
      af4cef4 [Josh Rosen] Initial attempt at parallelizing Python test execution
      7bbbe380
    • Yanbo Liang's avatar
      [SPARK-7667] [MLLIB] MLlib Python API consistency check · f9b6bf2f
      Yanbo Liang authored
      MLlib Python API consistency check
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6856 from yanboliang/spark-7667 and squashes the following commits:
      
      21bae35 [Yanbo Liang] remove duplicate code
      eb12f95 [Yanbo Liang] fix doc inherit problem
      9e7ec3c [Yanbo Liang] address comments
      e763d32 [Yanbo Liang] MLlib Python API consistency check
      f9b6bf2f
    • Feynman Liang's avatar
      [SPARK-8456] [ML] Ngram featurizer python · 620605a4
      Feynman Liang authored
      Python API for N-gram feature transformer
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #6960 from feynmanliang/ngram-featurizer-python and squashes the following commits:
      
      f9e37c9 [Feynman Liang] Remove debugging code
      4dd81f4 [Feynman Liang] Fix typo and doctest
      06c79ac [Feynman Liang] Style guide
      26c1175 [Feynman Liang] Add python NGram API
      620605a4
    • Ai He's avatar
      [SPARK-7810] [PYSPARK] solve python rdd socket connection problem · ecd3aacf
      Ai He authored
      Method "_load_from_socket" in rdd.py cannot load data from jvm socket when ipv6 is used. The current method only works well with ipv4. New modification should work around both two protocols.
      
      Author: Ai He <ai.he@ussuning.com>
      Author: AiHe <ai.he@ussuning.com>
      
      Closes #6338 from AiHe/pyspark-networking-issue and squashes the following commits:
      
      d4fc9c4 [Ai He] handle code review 2
      e75c5c8 [Ai He] handle code review
      5644953 [AiHe] solve python rdd socket connection problem to jvm
      ecd3aacf
    • Ilya Ganelin's avatar
      [SPARK-8056][SQL] Design an easier way to construct schema for both Scala and Python · f6fc254e
      Ilya Ganelin authored
      I've added functionality to create new StructType similar to how we add parameters to a new SparkContext.
      
      I've also added tests for this type of creation.
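      For illustration, a sketch of building a schema incrementally with the new `add` method (field names are made up):

      ```python
      from pyspark.sql.types import StructType, IntegerType

      schema = (StructType()
                .add("id", IntegerType(), False)   # DataType instance plus nullability
                .add("name", "string"))            # or a simple type name as a string
      print(schema)
      ```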
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #6686 from ilganeli/SPARK-8056B and squashes the following commits:
      
      27c1de1 [Ilya Ganelin] Rename
      467d836 [Ilya Ganelin] Removed from_string in favor of _parse_Datatype_json_value
      5fef5a4 [Ilya Ganelin] Updates for type parsing
      4085489 [Ilya Ganelin] Style errors
      3670cf5 [Ilya Ganelin] added string to DataType conversion
      8109e00 [Ilya Ganelin] Fixed error in tests
      41ab686 [Ilya Ganelin] Fixed style errors
      e7ba7e0 [Ilya Ganelin] Moved some python tests to tests.py. Added cleaner handling of null data type and added test for correctness of input format
      15868fa [Ilya Ganelin] Fixed python errors
      b79b992 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-8056B
      a3369fc [Ilya Ganelin] Fixing space errors
      e240040 [Ilya Ganelin] Style
      bab7823 [Ilya Ganelin] Constructor error
      73d4677 [Ilya Ganelin] Style
      4ed00d9 [Ilya Ganelin] Fixed default arg
      67df57a [Ilya Ganelin] Removed Foo
      04cbf0c [Ilya Ganelin] Added comments for single object
      0484d7a [Ilya Ganelin] Restored second method
      6aeb740 [Ilya Ganelin] Style
      689e54d [Ilya Ganelin] Style
      f497e9e [Ilya Ganelin] Got rid of old code
      e3c7a88 [Ilya Ganelin] Fixed doctest failure
      a62ccde [Ilya Ganelin] Style
      966ac06 [Ilya Ganelin] style checks
      dabb7e6 [Ilya Ganelin] Added Python tests
      a3f4152 [Ilya Ganelin] added python bindings and better comments
      e6e536c [Ilya Ganelin] Added extra space
      7529a2e [Ilya Ganelin] Fixed formatting
      d388f86 [Ilya Ganelin] Fixed small bug
      c4e3bf5 [Ilya Ganelin] Reverted to using parse. Updated parse to support long
      d7634b6 [Ilya Ganelin] Reverted to fromString to properly support types
      22c39d5 [Ilya Ganelin] replaced FromString with DataTypeParser.parse. Replaced empty constructor initializing a null to have it instead create a new array to allow appends to it.
      faca398 [Ilya Ganelin] [SPARK-8056] Replaced default argument usage. Updated usage and code for DataType.fromString
      1acf76e [Ilya Ganelin] Scala style
      e31c674 [Ilya Ganelin] Fixed bug in test
      8dc0795 [Ilya Ganelin] Added tests for creation of StructType object with new methods
      fdf7e9f [Ilya Ganelin] [SPARK-8056] Created add methods to facilitate building new StructType objects.
      f6fc254e
    • Davies Liu's avatar
      [SPARK-8070] [SQL] [PYSPARK] avoid spark jobs in createDataFrame · afae9766
      Davies Liu authored
      Avoid unnecessary jobs when inferring the schema from a list.
      
      cc yhuai mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6606 from davies/improve_create and squashes the following commits:
      
      a5928bf [Davies Liu] Update MimaExcludes.scala
      62da911 [Davies Liu] fix mima
      bab4d7d [Davies Liu] Merge branch 'improve_create' of github.com:davies/spark into improve_create
      eee44a8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
      8d9292d [Davies Liu] Update context.py
      eb24531 [Davies Liu] Update context.py
      c969997 [Davies Liu] bug fix
      d5a8ab0 [Davies Liu] fix tests
      8c3f10d [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
      6ea5925 [Davies Liu] address comments
      6ceaeff [Davies Liu] avoid spark jobs in createDataFrame
      afae9766
    • Vladimir Vladimirov's avatar
      [SPARK-8528] Expose SparkContext.applicationId in PySpark · 492dca3a
      Vladimir Vladimirov authored
      Use case: we want to log the applicationId (YARN in our case) to request troubleshooting help from DevOps.
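      For illustration, a minimal sketch of the exposed property (the printed value is only an example of the YARN id format):

      ```python
      from pyspark import SparkContext

      sc = SparkContext.getOrCreate()
      # applicationId is a plain string property, convenient for log correlation.
      print(sc.applicationId)  # e.g. 'application_1435086534367_0001' on YARN
      ```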
      
      Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
      
      Closes #6936 from smartkiwi/master and squashes the following commits:
      
      870338b [Vladimir Vladimirov] this would make doctest to run in python3
      0eae619 [Vladimir Vladimirov] Scala doesn't use u'...' for unicode literals
      14d77a8 [Vladimir Vladimirov] stop using ELLIPSIS
      b4ebfc5 [Vladimir Vladimirov] addressed PR feedback - updated docstring
      223a32f [Vladimir Vladimirov] fixed test - applicationId is property that returns the string
      3221f5a [Vladimir Vladimirov] [SPARK-8528] added documentation for Scala
      2cff090 [Vladimir Vladimirov] [SPARK-8528] add applicationId property for SparkContext object in pyspark
      492dca3a
    • Tarek Auel's avatar
      [SPARK-8235] [SQL] misc function sha / sha1 · a5c2961c
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-8235
      
      I added support for sha1. If I understood rxin correctly, sha and sha1 should execute the same algorithm, shouldn't they?
      
      Please take a close look at the Python part. This is adopted from #6934
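      For illustration, a hedged PySpark sketch of `sha1` alongside the related `sha2` from #6934 (sample data and setup are made up):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import sha1, sha2

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("Spark",)], ["s"])
      # sha1 returns the 40-character hex digest; sha2 takes the bit length (224/256/384/512).
      df.select(sha1("s"), sha2("s", 256)).show(truncate=False)
      ```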
      
      Author: Tarek Auel <tarek.auel@gmail.com>
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #6963 from tarekauel/SPARK-8235 and squashes the following commits:
      
      f064563 [Tarek Auel] change to shaHex
      7ce3cdc [Tarek Auel] rely on automatic cast
      a1251d6 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-8235
      68eb043 [Tarek Auel] added docstring
      be5aff1 [Tarek Auel] improved error message
      7336c96 [Tarek Auel] added type check
      cf23a80 [Tarek Auel] simplified example
      ebf75ef [Tarek Auel] [SPARK-8301] updated the python documentation. Removed sha in python and scala
      6d6ff0d [Tarek Auel] [SPARK-8233] added docstring
      ea191a9 [Tarek Auel] [SPARK-8233] fixed signatureof python function. Added expected type to misc
      e3fd7c3 [Tarek Auel] SPARK[8235] added sha to the list of __all__
      e5dad4e [Tarek Auel] SPARK[8235] sha / sha1
      a5c2961c
    • Reynold Xin's avatar
      [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should... · 660c6cec
      Reynold Xin authored
      [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7079 from rxin/SPARK-8698 and squashes the following commits:
      
      8513e1c [Reynold Xin] [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
      660c6cec
    • Cheolsoo Park's avatar
      [SPARK-8355] [SQL] Python DataFrameReader/Writer should mirror Scala · ac2e17b0
      Cheolsoo Park authored
      I compared the PySpark DataFrameReader/Writer against the Scala ones. The `option` function is missing in both the reader and the writer, but the rest all seem to match.
      
      I added `option` to the reader and writer and updated the `pyspark-sql` test.
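      For illustration, a sketch of the mirrored `option` call in PySpark (format and path are illustrative; the built-in csv source assumes a newer Spark than this patch targeted):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = (spark.read
            .format("csv")
            .option("header", "true")   # the option() call now mirrored from Scala
            .load("/tmp/example.csv"))
      ```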
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #7078 from piaozhexiu/SPARK-8355 and squashes the following commits:
      
      c63d419 [Cheolsoo Park] Fix version
      524e0aa [Cheolsoo Park] Add option function to df reader and writer
      ac2e17b0
    • Yanbo Liang's avatar
      [SPARK-5962] [MLLIB] Python support for Power Iteration Clustering · dfde31da
      Yanbo Liang authored
      Python support for Power Iteration Clustering
      https://issues.apache.org/jira/browse/SPARK-5962
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6992 from yanboliang/pyspark-pic and squashes the following commits:
      
      6b03d82 [Yanbo Liang] address comments
      4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
      dfde31da
    • Feynman Liang's avatar
      [SPARK-7212] [MLLIB] Add sequence learning flag · 25f574eb
      Feynman Liang authored
      Support mining of ordered frequent item sequences.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #6997 from feynmanliang/fp-sequence and squashes the following commits:
      
      7c14e15 [Feynman Liang] Improve scalatests with R code and Seq
      0d3e4b6 [Feynman Liang] Fix python test
      ce987cb [Feynman Liang] Backwards compatibility aux constructor
      34ef8f2 [Feynman Liang] Fix failing test due to reverse orderering
      f04bd50 [Feynman Liang] Naming, add ordered to FreqItemsets, test ordering using Seq
      648d4d4 [Feynman Liang] Test case for frequent item sequences
      252a36a [Feynman Liang] Add sequence learning flag
      25f574eb
  5. Jun 27, 2015
    • Josh Rosen's avatar
      [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with... · 40648c56
      Josh Rosen authored
      [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with dev/run-tests module system
      
      This patch refactors the `python/run-tests` script:
      
      - It's now written in Python instead of Bash.
      - The descriptions of the tests to run are now stored in `dev/run-tests`'s modules.  This allows the pull request builder to skip Python test suites that were not affected by the pull request's changes.  For example, we can now skip the PySpark Streaming test cases when only SQL files are changed.
      - `python/run-tests` now supports command-line flags to make it easier to run individual test suites (this addresses SPARK-5482):
      
        ```
      Usage: run-tests [options]
      
      Options:
        -h, --help            show this help message and exit
        --python-executables=PYTHON_EXECUTABLES
                              A comma-separated list of Python executables to test
                              against (default: python2.6,python3.4,pypy)
        --modules=MODULES     A comma-separated list of Python modules to test
                              (default: pyspark-core,pyspark-ml,pyspark-mllib
                              ,pyspark-sql,pyspark-streaming)
         ```
      - `dev/run-tests` has been split into multiple files: the module definitions and test utility functions are now stored inside of a `dev/sparktestsupport` Python module, allowing them to be re-used from the Python test runner script.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6967 from JoshRosen/run-tests-python-modules and squashes the following commits:
      
      f578d6d [Josh Rosen] Fix print for Python 2.x
      8233d61 [Josh Rosen] Add python/run-tests.py to Python lint checks
      34c98d2 [Josh Rosen] Fix universal_newlines for Python 3
      8f65ed0 [Josh Rosen] Fix handling of  module in python/run-tests
      37aff00 [Josh Rosen] Python 3 fix
      27a389f [Josh Rosen] Skip MLLib tests for PyPy
      c364ccf [Josh Rosen] Use which() to convert PYSPARK_PYTHON to an absolute path before shelling out to run tests
      568a3fd [Josh Rosen] Fix hashbang
      3b852ae [Josh Rosen] Fall back to PYSPARK_PYTHON when sys.executable is None (fixes a test)
      f53db55 [Josh Rosen] Remove python2 flag, since the test runner script also works fine under Python 3
      9c80469 [Josh Rosen] Fix passing of PYSPARK_PYTHON
      d33e525 [Josh Rosen] Merge remote-tracking branch 'origin/master' into run-tests-python-modules
      4f8902c [Josh Rosen] Python lint fixes.
      8f3244c [Josh Rosen] Use universal_newlines to fix dev/run-tests doctest failures on Python 3.
      f542ac5 [Josh Rosen] Fix lint check for Python 3
      fff4d09 [Josh Rosen] Add dev/sparktestsupport to pep8 checks
      2efd594 [Josh Rosen] Update dev/run-tests to use new Python test runner flags
      b2ab027 [Josh Rosen] Add command-line options for running individual suites in python/run-tests
      caeb040 [Josh Rosen] Fixes to PySpark test module definitions
      d6a77d3 [Josh Rosen] Fix the tests of dev/run-tests
      def2d8a [Josh Rosen] Two minor fixes
      aec0b8f [Josh Rosen] Actually get the Kafka stuff to run properly
      04015b9 [Josh Rosen] First attempt at getting PySpark Kafka test to work in new runner script
      4c97136 [Josh Rosen] PYTHONPATH fixes
      dcc9c09 [Josh Rosen] Fix time division
      32660fc [Josh Rosen] Initial cut at Python test runner refactoring
      311c6a9 [Josh Rosen] Move shell utility functions to own module.
      1bdeb87 [Josh Rosen] Move module definitions to separate file.
      40648c56
  6. Jun 26, 2015
    • Josh Rosen's avatar
      [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() · 41afa165
      Josh Rosen authored
      This patch addresses a critical issue in the PySpark tests:
      
      Several of our Python modules' `__main__` methods call `doctest.testmod()` in order to run doctests but forget to check and handle its return value. As a result, some PySpark test failures can go unnoticed because they will not fail the build.
      
      Fortunately, there was only one test failure which was masked by this bug: a `pyspark.profiler` doctest was failing due to changes in RDD pipelining.
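      The enforced idiom looks roughly like this (a sketch of the general pattern, not a verbatim excerpt from the patch):

      ```python
      import doctest
      import sys

      if __name__ == "__main__":
          # testmod returns (failure_count, test_count); exit non-zero so CI notices failures.
          failure_count, _ = doctest.testmod()
          if failure_count:
              sys.exit(-1)
      ```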
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7032 from JoshRosen/testmod-fix and squashes the following commits:
      
      60dbdc0 [Josh Rosen] Account for int vs. long formatting change in Python 3
      8b8d80a [Josh Rosen] Fix failing test.
      e6423f9 [Josh Rosen] Check return code for all uses of doctest.testmod().
      41afa165
    • Liang-Chi Hsieh's avatar
      [SPARK-8237] [SQL] Add misc function sha2 · 47c874ba
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8237
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6934 from viirya/expr_sha2 and squashes the following commits:
      
      35e0bb3 [Liang-Chi Hsieh] For comments.
      68b5284 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_sha2
      8573aff [Liang-Chi Hsieh] Remove unnecessary Product.
      ee61e06 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_sha2
      59e41aa [Liang-Chi Hsieh] Add misc function: sha2.
      47c874ba
  7. Jun 25, 2015
    • Yanbo Liang's avatar
      [MINOR] [MLLIB] rename some functions of PythonMLLibAPI · 2519dcc3
      Yanbo Liang authored
      Keep the same naming conventions for PythonMLLibAPI.
      Only the following three functions differ from the others:
      ```scala
      trainNaiveBayes
      trainGaussianMixture
      trainWord2Vec
      ```
      So change them to
      ```scala
      trainNaiveBayesModel
      trainGaussianMixtureModel
      trainWord2VecModel
      ```
      It does not affect any users or public APIs; it only makes the code easier to understand for developers and code hackers.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits:
      
      771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
      2519dcc3
  8. Jun 24, 2015
    • MechCoder's avatar
      [SPARK-7633] [MLLIB] [PYSPARK] Python bindings for StreamingLogisticRegressionwithSGD · fb32c388
      MechCoder authored
      Add Python bindings to StreamingLogisticRegressionwithSGD.
      
      No Java wrappers are needed as models are updated directly using train.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6849 from MechCoder/spark-3258 and squashes the following commits:
      
      b4376a5 [MechCoder] minor
      d7e5fc1 [MechCoder] Refactor into StreamingLinearAlgorithm Better docs
      9c09d4e [MechCoder] [SPARK-7633] Python bindings for StreamingLogisticRegressionwithSGD
      fb32c388
  9. Jun 23, 2015
    • Reynold Xin's avatar
      Revert "[SPARK-7157][SQL] add sampleBy to DataFrame" · a458efc6
      Reynold Xin authored
      This reverts commit 0401cbaa.
      
      The new test case on Jenkins is failing.
      a458efc6
    • Xiangrui Meng's avatar
      [SPARK-7157][SQL] add sampleBy to DataFrame · 0401cbaa
      Xiangrui Meng authored
      Add `sampleBy` to DataFrame. rxin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6769 from mengxr/SPARK-7157 and squashes the following commits:
      
      991f26f [Xiangrui Meng] fix seed
      4a14834 [Xiangrui Meng] move sampleBy to stat
      832f7cc [Xiangrui Meng] add sampleBy to DataFrame
      0401cbaa
    • Davies Liu's avatar
      [SPARK-8573] [SPARK-8568] [SQL] [PYSPARK] raise Exception if column is used in booelan expression · 7fb5ae50
      Davies Liu authored
      It's a common mistake for users to put a Column in a boolean expression (together with `and`, `or`), which does not work as expected. We should raise an exception in that case and suggest using `&`, `|` instead.
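      For illustration, a minimal sketch of the mistake and the suggested fix (sample data is made up):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, 2)], ["a", "b"])

      # Wrong: Python's `and` coerces a Column to bool, which now raises an exception.
      # df.filter(col("a") > 0 and col("b") > 0)

      # Right: use the overloaded bitwise operators on Columns.
      df.filter((col("a") > 0) & (col("b") > 0)).show()
      ```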
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6961 from davies/column_bool and squashes the following commits:
      
      9f19beb [Davies Liu] update message
      af74bd6 [Davies Liu] fix tests
      07dff84 [Davies Liu] address comments, fix tests
      f70c08e [Davies Liu] raise Exception if column is used in booelan expression
      7fb5ae50
    • MechCoder's avatar
      [SPARK-8265] [MLLIB] [PYSPARK] Add LinearDataGenerator to pyspark.mllib.utils · f2022fa0
      MechCoder authored
      It is useful to generate linear data for easy testing of linear models, and in general. Scala already has it; this is just a wrapper around the Scala code.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6715 from MechCoder/generate_linear_input and squashes the following commits:
      
      6182884 [MechCoder] Minor changes
      8bda047 [MechCoder] Minor style fixes
      0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
      f2022fa0
    • Scott Taylor's avatar
      [SPARK-8541] [PYSPARK] test the absolute error in approx doctests · f0dcbe8a
      Scott Taylor authored
      A minor change, but one which is (presumably) visible on the public API docs webpage.
      
      Author: Scott Taylor <github@megatron.me.uk>
      
      Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits:
      
      fbed000 [Scott Taylor] test the absolute error in approx doctests
      f0dcbe8a