  1. Apr 05, 2014
    • SPARK-1421. Make MLlib work on Python 2.6 · 0b855167
      Matei Zaharia authored
      The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java.) Now we just convert those bytearrays to strings of bytes, which preserves non-printable characters as well.
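      A minimal sketch of the fix (Python 2; the helper name is hypothetical): convert the bytearray to a str of raw bytes before writing.

          def write_bytes(stream, data):
              # Python 2.6's stream.write() rejects bytearray; str() keeps every
              # byte, including non-printable ones, so nothing is lost
              if isinstance(data, bytearray):
                  data = str(data)
              stream.write(data)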
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #335 from mateiz/mllib-python-2.6 and squashes the following commits:
      
      f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
      a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
  2. Apr 04, 2014
    • SPARK-1305: Support persisting RDD's directly to Tachyon · b50ddfde
      Haoyuan Li authored
      Moves PR #468 from apache/incubator-spark to apache/spark:
      "Adding an option to persist Spark RDD blocks into Tachyon."
      
      Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
      Author: RongGu <gurongwalker@gmail.com>
      
      Closes #158 from RongGu/master and squashes the following commits:
      
      72b7768 [Haoyuan Li] merge master
      9f7fa1b [Haoyuan Li] fix code style
      ae7834b [Haoyuan Li] minor cleanup
      a8b3ec6 [Haoyuan Li] merge master branch
      e0f4891 [Haoyuan Li] better check offheap.
      55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
      7cd4600 [RongGu] remove some logic code for tachyonstore's replication
      51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
      8adfcfa [RongGu] address aaron's comment on inTachyonSize
      120e48a [RongGu] changed the root-level dir name in Tachyon
      5cc041c [Haoyuan Li] address aaron's comments
      9b97935 [Haoyuan Li] address aaron's comments
      d9a6438 [Haoyuan Li] fix for pyspark
      77d2703 [Haoyuan Li] change python api
      3dcace4 [Haoyuan Li] address matei's comments
      91fa09d [Haoyuan Li] address patrick's comments
      589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
      64348b2 [Haoyuan Li] update conf docs.
      ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
      619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
      be79d77 [RongGu] find a way to clean up some unnecessary methods and classes to make the code simpler
      49cc724 [Haoyuan Li] update docs with off_heap option
      4572f9f [RongGu] reserving the old apply function API of StorageLevel
      04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
      c9aeabf [RongGu] rename the StorageLevel.TACHYON as StorageLevel.OFF_HEAP
      76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
      e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
      fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon; minor code style fix
      939e467 [Haoyuan Li] 0.4.1-thrift from maven central
      86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
      16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
      eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
      6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      d827250 [RongGu] fix JsonProtocolSuite test failure
      716e93b [Haoyuan Li] revert the version
      ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
      2825a13 [RongGu] up-merging to the current master branch of the apache spark
      6a22c1a [Haoyuan Li] fix scalastyle
      8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
      77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
      1dcadf9 [Haoyuan Li] typo
      bf278fa [Haoyuan Li] fix python tests
      e82909c [Haoyuan Li] minor cleanup
      776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
      8859371 [Haoyuan Li] various minor fixes and clean up
      e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
      fcaeab2 [Haoyuan Li] address Aaron's comment
      e554b1e [Haoyuan Li] add python code
      47304b3 [Haoyuan Li] make tachyonStore in BlockManager a lazy val; add more comments to StorageLevels.
      dc8ef24 [Haoyuan Li] add old storelevel constructor
      e01a271 [Haoyuan Li] update tachyon 0.4.1
      8011a96 [RongGu] fix a brought-in mistake in StorageLevel
      70ca182 [RongGu] a bit change in comment
      556978b [RongGu] fix the scalastyle errors
      791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
    • SPARK-1414. Python API for SparkContext.wholeTextFiles · 60e18ce7
      Matei Zaharia authored
      Also clarified the comment that each file has to fit in memory
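      Usage sketch (assuming an existing SparkContext sc); each file comes back whole, which is why it has to fit in memory:

          pairs = sc.wholeTextFiles("hdfs://host/dir")  # RDD of (path, full content)
          for path, text in pairs.take(2):
              print("%s: %d bytes" % (path, len(text)))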
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #327 from mateiz/py-whole-files and squashes the following commits:
      
      9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
  3. Apr 03, 2014
    • Spark 1162 Implemented takeOrdered in pyspark. · c1ea3afb
      Prashant Sharma authored
      Since Python does not have a max-heap library, and the usual tricks such as inverting values do not work in all cases, we have our own max-heap implementation.
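      The idea, sketched with heapq over plain numbers (the actual patch implements its own max heap so it also works where negation does not apply):

          import heapq

          def take_ordered(iterator, num):
              # keep the `num` smallest items seen so far; storing negated values
              # turns heapq's min-heap into a max-heap, so heap[0] is the negated
              # largest item and can be evicted cheaply
              heap = []
              for x in iterator:
                  if len(heap) < num:
                      heapq.heappush(heap, -x)
                  elif -x > heap[0]:
                      heapq.heapreplace(heap, -x)
              return sorted(-v for v in heap)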
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:
      
      35f86ba [Prashant Sharma] code review
      2b1124d [Prashant Sharma] fixed tests
      e8a08e2 [Prashant Sharma] Code review comments.
      49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
  4. Apr 02, 2014
    • [SPARK-1212, Part II] Support sparse data in MLlib · 9c65fa76
      Xiangrui Meng authored
      In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:
      
      1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
      2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
      3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
      4. Add libSVMFile to MLContext (the input format is sketched below).
      5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
      6. Gradient computation no longer creates temp vectors.
      7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.
      
      TODO:
      1. ~~Use axpy when possible.~~
      2. ~~Optimize Naive Bayes.~~
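      For reference, a hypothetical sketch of the one-based index:value text format that the new libSVMFile/loadLibSVMData reader consumes:

          def parse_libsvm_line(line):
              # "1.0 3:2.5 10:0.1" -> (1.0, [2, 9], [2.5, 0.1])
              items = line.split()
              label = float(items[0])
              indices, values = [], []
              for item in items[1:]:
                  idx, val = item.split(":")
                  indices.append(int(idx) - 1)  # libSVM indices are one-based
                  values.append(float(val))
              return label, indices, values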
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #245 from mengxr/vector and squashes the following commits:
      
      eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
      c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
      11999c7 [Xiangrui Meng] Merge branch 'master' into vector
      f7da54b [Xiangrui Meng] add minSplits to libSVMFile
      da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
      493f26f [Xiangrui Meng] Merge branch 'master' into vector
      7c1bc01 [Xiangrui Meng] add a TODO to NB
      b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
      b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
      4addc50 [Xiangrui Meng] merge master
      4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
      f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
      d088552 [Xiangrui Meng] use static constructor for MLContext
      6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
      3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
      0f8759b [Xiangrui Meng] minor updates to NB
      b11659c [Xiangrui Meng] style update
      78c4671 [Xiangrui Meng] add libSVMFile to MLContext
      f0fe616 [Xiangrui Meng] add a test for sparse linear regression
      44733e1 [Xiangrui Meng] use in-place gradient computation
      e981396 [Xiangrui Meng] use axpy in Updater
      db808a1 [Xiangrui Meng] update JavaLR example
      befa592 [Xiangrui Meng] passed scala/java tests
      75c83a4 [Xiangrui Meng] passed test compile
      1859701 [Xiangrui Meng] passed compile
      834ada2 [Xiangrui Meng] optimized MLUtils.computeStats; update some ml algorithms to use Vector (cont.)
      135ab72 [Xiangrui Meng] merge glm
      0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM; mark createModel protected; mark predictPoint protected
      d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
      3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
  5. Mar 30, 2014
    • SPARK-1336 Reducing the output of run-tests script. · df1b9f7b
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Prashant Sharma <scrapcodes@gmail.com>
      
      Closes #262 from ScrapCodes/SPARK-1336/ReduceVerbosity and squashes the following commits:
      
      87dfa54 [Prashant Sharma] Further reduction in noise and made pyspark tests fail fast.
      811170f [Prashant Sharma] Reducing the output of run-tests script.
  6. Mar 19, 2014
    • Added doctest for map function in rdd.py · 67fa71cb
      Jyotiska NK authored
      Doctest added for map in rdd.py
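      An illustrative snippet of the kind of doctest added (assuming an existing SparkContext sc):

          rdd = sc.parallelize(["b", "a", "c"])
          sorted(rdd.map(lambda x: (x, 1)).collect())
          # [('a', 1), ('b', 1), ('c', 1)]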
      
      Author: Jyotiska NK <jyotiska123@gmail.com>
      
      Closes #177 from jyotiska/pyspark_rdd_map_doctest and squashes the following commits:
      
      a38527f [Jyotiska NK] Added doctest for map function in rdd.py
  7. Mar 18, 2014
    • Spark 1246 add min max to stat counter · e3681f26
      Dan McClary authored
      This adds min and max to statscounter.py, and min and max methods to rdd.py.
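      Usage sketch (assuming an existing SparkContext sc):

          rdd = sc.parallelize([1.0, 5.0, 2.0])
          rdd.min(), rdd.max()   # (1.0, 5.0)
          print(rdd.stats())     # StatCounter output now includes min and max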
      
      Author: Dan McClary <dan.mcclary@gmail.com>
      
      Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:
      
      fd3fd4b [Dan McClary] fixed error, updated test
      82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
      5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
      21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
      1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
      a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
      ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
      1e7056d [Dan McClary] added underscore to getBucket
      37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
      29981f2 [Dan McClary] fixed indentation on doctest comment
      eaf89d9 [Dan McClary] added correct doctest for histogram
      4916016 [Dan McClary] added histogram method, added max and min to statscounter
  8. Mar 17, 2014
    • SPARK-1240: handle the case of empty RDD when takeSample · dc965463
      CodingCat authored
      https://spark-project.atlassian.net/browse/SPARK-1240
      
      It seems that the current implementation does not handle the empty-RDD case when running takeSample.
      
      In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when the RDD is empty; in sample(), I also add a check for invalid fraction values.
      
      I also add several lines to the test cases covering this case.
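      A simplified sketch of the two checks (names hypothetical; the real change lives inside RDD.takeSample and RDD.sample):

          def take_sample(rdd, with_replacement, num, seed=None):
              # check 1: an empty RDD yields an empty result instead of failing
              if rdd.count() == 0:
                  return []
              fraction = min(1.0, float(num) / rdd.count())
              # check 2: reject invalid fractions with require-style validation
              if not 0.0 <= fraction <= 1.0:
                  raise ValueError("invalid fraction: %s" % fraction)
              return rdd.sample(with_replacement, fraction, seed).take(num)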
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:
      
      fef57d4 [CodingCat] fix the same problem in PySpark
      36db06b [CodingCat] create new test cases for takeSample from an empty rdd
      810948d [CodingCat] further fix
      a40e8fb [CodingCat] replace if with require
      ad483fd [CodingCat] handle the case with empty RDD when take sample
  9. Mar 12, 2014
    • SPARK-1162 Added top in python. · b8afe305
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:
      
      ece1fa4 [Prashant Sharma] Added top in python.
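      Usage sketch (assuming an existing SparkContext sc):

          sc.parallelize([10, 4, 2, 12, 3]).top(3)
          # [12, 10, 4] -- the largest elements, in descending order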
    • Spark-1163, Added missing Python RDD functions · af7f2f10
      prabinb authored
      Author: prabinb <prabin.banka@imaginea.com>
      
      Closes #92 from prabinb/python-api-rdd and squashes the following commits:
      
      51129ca [prabinb] Added missing Python RDD functions. Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
  10. Mar 10, 2014
    • SPARK-1168, Added foldByKey to pyspark. · a59419c2
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:
      
      db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
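      Usage sketch (assuming an existing SparkContext sc):

          from operator import add

          rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
          sorted(rdd.foldByKey(0, add).collect())
          # [('a', 2), ('b', 1)]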
    • [SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted) · f5518989
      jyotiska authored
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #34 from jyotiska/pyspark_code and squashes the following commits:
      
      c9439be [jyotiska] replaced dict with namedtuple
      a6bf4cd [jyotiska] added callsite info for context.py
    • SPARK-977 Added Python RDD.zip function · e1e09e0e
      Prabin Banka authored
      was raised earlier as part of apache/incubator-spark#486
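      Usage sketch (assuming an existing SparkContext sc); both RDDs must have the same number of partitions and elements:

          x = sc.parallelize(range(0, 5))
          y = sc.parallelize(range(1000, 1005))
          x.zip(y).collect()
          # [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]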
      
      Author: Prabin Banka <prabin.banka@imaginea.com>
      
      Closes #76 from prabinb/python-api-zip and squashes the following commits:
      
      b1a31a0 [Prabin Banka] Added Python RDD.zip function
  11. Mar 09, 2014
    • SPARK-929: Fully deprecate usage of SPARK_MEM · 52834d76
      Aaron Davidson authored
      (Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615)
      
      This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables:
      SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY
      
      The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public.
      
      SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property.
      
      SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory run by jobs launched by spark-class, without possibly affecting executor memory.
      
      Other memory considerations:
      - The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY.
      - run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overridden in all cases by spark-class).
      
      This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however.
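      For example, a sketch of the sanctioned way to size executors from PySpark, via the property rather than SPARK_MEM:

          from pyspark import SparkConf, SparkContext

          conf = SparkConf().setAppName("app").set("spark.executor.memory", "2g")
          sc = SparkContext(conf=conf)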
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #99 from aarondav/sparkmem and squashes the following commits:
      
      9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM
  12. Mar 07, 2014
    • Spark 1165 rdd.intersection in python and java · 6e730edc
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Prashant Sharma <scrapcodes@gmail.com>
      
      Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:
      
      9b015e9 [Prashant Sharma] Added a note that a shuffle is required for intersection.
      1fea813 [Prashant Sharma] correct the lines wrapping
      d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
      d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
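      Usage sketch (assuming an existing SparkContext sc); duplicates are removed and a shuffle is required:

          a = sc.parallelize([1, 10, 2, 3, 4, 5])
          b = sc.parallelize([1, 6, 2, 3, 7, 8])
          sorted(a.intersection(b).collect())
          # [1, 2, 3]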
  13. Mar 06, 2014
    • SPARK-1187, Added missing Python APIs · 3d3acef0
      Prabin Banka authored
      The following Python APIs are added,
      RDD.id()
      SparkContext.setJobGroup()
      SparkContext.setLocalProperty()
      SparkContext.getLocalProperty()
      SparkContext.sparkUser()
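      Illustrative usage of the new calls (assuming an existing SparkContext sc):

          sc.setJobGroup("nightly", "nightly aggregation job")
          sc.setLocalProperty("team", "infra")
          print(sc.getLocalProperty("team"))   # 'infra'
          print(sc.sparkUser())                # OS user running the driver
          print(sc.parallelize([1, 2]).id())   # unique id of this RDD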
      
      was raised earlier as part of apache/incubator-spark#486
      
      Author: Prabin Banka <prabin.banka@imaginea.com>
      
      Closes #75 from prabinb/python-api-backup and squashes the following commits:
      
      cc3c6cd [Prabin Banka] Added missing Python APIs
  14. Feb 26, 2014
    • SPARK-1115: Catch depickling errors · 12738c1a
      Bouke van der Bijl authored
      This surrounds the complete worker code in a try/except block so we catch any error that arrives; an example would be the depickling failing for some reason.
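      Shape of the change, sketched with hypothetical helper names:

          import sys
          import traceback

          def process(infile, outfile):
              # stand-in for the real worker loop: read a task, depickle it,
              # run it, and write results back to the JVM
              raise ValueError("depickling failed")

          def main(infile, outfile):
              try:
                  process(infile, outfile)
              except Exception:
                  # any failure, depickling included, is reported to the JVM
                  # instead of killing the worker silently
                  outfile.write(traceback.format_exc())
                  sys.exit(-1)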
      
      @JoshRosen
      
      Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
      
      Closes #644 from bouk/catch-depickling-errors and squashes the following commits:
      
      f0f67cc [Bouke van der Bijl] Lol indentation
      0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
  15. Feb 22, 2014
    • doctest updated for mapValues, flatMapValues in rdd.py · 722199fa
      jyotiska authored
      Updated doctests for mapValues and flatMapValues in rdd.py
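      The operations under test, sketched (assuming an existing SparkContext sc):

          rdd = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
          rdd.mapValues(len).collect()
          # [('a', 3), ('b', 2)]
          rdd.flatMapValues(lambda v: v).collect()
          # [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]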
      
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #621 from jyotiska/python_spark and squashes the following commits:
      
      716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
    • Fixed minor typo in worker.py · 3ff077d4
      jyotiska authored
      Fixed minor typo in worker.py
      
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #630 from jyotiska/pyspark_code and squashes the following commits:
      
      ee44201 [jyotiska] typo fixed in worker.py
  16. Feb 20, 2014
    • SPARK-1114: Allow PySpark to use existing JVM and Gateway · 59b13795
      Ahir Reddy authored
      Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
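      A hedged sketch of the new entry point (the gateway parameter follows this change; exact details may differ):

          from py4j.java_gateway import JavaGateway
          from pyspark import SparkContext

          # attach to a JVM that is already running a py4j GatewayServer,
          # instead of letting PySpark launch its own
          gateway = JavaGateway()
          sc = SparkContext(gateway=gateway)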
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits:
      
      a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
  17. Feb 09, 2014
    • Merge pull request #562 from jyotiska/master. Closes #562. · 2ef37c93
      jyotiska authored
      Added example Python code for sort
      
      I added example Python code for sorting. Right now, PySpark has limited examples for newcomers to the project. This example sorts integers stored in a file; I was able to sort 5 million, 10 million, and 25 million integers with it.
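      Roughly what such an example looks like (a hypothetical sketch; the actual sort.py may differ):

          from pyspark import SparkContext

          sc = SparkContext("local", "PythonSort")
          lines = sc.textFile("numbers.txt")
          sorted_nums = lines.map(lambda x: (int(x), 1)).sortByKey()
          for num, _ in sorted_nums.take(10):
              print(num)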
      
      Author: jyotiska <jyotiska123@gmail.com>
      
      == Merge branch commits ==
      
      commit 8ad8faf6c8e02ae1cd68565d98524edf165f54df
      Author: jyotiska <jyotiska123@gmail.com>
      Date:   Sun Feb 9 11:00:41 2014 +0530
      
          Added comments in code on collect() method
      
      commit 6f98f1e313f4472a7c2207d36c4f0fbcebc95a8c
      Author: jyotiska <jyotiska123@gmail.com>
      Date:   Sat Feb 8 13:12:37 2014 +0530
      
          Updated python example code sort.py
      
      commit 945e39a5d68daa7e5bab0d96cbd35d7c4b04eafb
      Author: jyotiska <jyotiska123@gmail.com>
      Date:   Sat Feb 8 12:59:09 2014 +0530
      
          Added example python code for sort
  18. Feb 08, 2014
    • Merge pull request #542 from markhamstra/versionBump. Closes #542. · c2341c92
      Mark Hamstra authored
      Version number to 1.0.0-SNAPSHOT
      
      Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore.
      
      @pwendell
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      == Merge branch commits ==
      
      commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71
      Author: Mark Hamstra <markhamstra@gmail.com>
      Date:   Wed Feb 5 09:30:32 2014 -0800
      
          Version number to 1.0.0-SNAPSHOT
  19. Feb 06, 2014
    • Merge pull request #498 from ScrapCodes/python-api. Closes #498. · 084839ba
      Prashant Sharma authored
      Python api additions
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      == Merge branch commits ==
      
      commit 8b51591f1a7a79a62c13ee66ff8d83040f7eccd8
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Date:   Fri Jan 24 11:50:29 2014 +0530
      
    Josh's and Patrick's review comments.
      
      commit d37f9677838e43bef6c18ef61fbf08055ba6d1ca
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Date:   Thu Jan 23 17:27:17 2014 +0530
      
          fixed doc tests
      
      commit 27cb54bf5c99b1ea38a73858c291d0a1c43d8b7c
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Date:   Thu Jan 23 16:48:43 2014 +0530
      
          Added keys and values methods for PairFunctions in python
      
      commit 4ce76b396fbaefef2386d7a36d611572bdef9b5d
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Date:   Thu Jan 23 13:51:26 2014 +0530
      
          Added foreachPartition
      
      commit 05f05341a187cba829ac0e6c2bdf30be49948c89
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Date:   Thu Jan 23 13:02:59 2014 +0530
      
    Added coalesce function to python API
      
      commit 6568d2c2fa14845dc56322c0f39ba2e13b3b26dd
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Date:   Thu Jan 23 12:52:44 2014 +0530
      
          added repartition function to python API.
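      Usage sketches of the methods added in this PR (assuming an existing SparkContext sc):

          rdd = sc.parallelize(range(10), 4)
          pairs = rdd.map(lambda x: (x % 2, x))
          pairs.keys().collect()                    # [0, 1, 0, 1, ...]
          pairs.values().collect()                  # [0, 1, 2, 3, ...]
          rdd.coalesce(2)                           # shrink to 2 partitions
          rdd.repartition(8)                        # reshuffle into 8 partitions
          rdd.foreachPartition(lambda part: None)   # run once per partition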
  20. Jan 28, 2014
    • Switch from MUTF8 to UTF8 in PySpark serializers. · 1381fc72
      Josh Rosen authored
      This fixes SPARK-1043, a bug introduced in 0.9.0
      where PySpark couldn't serialize strings > 64kB.
      
      This fix was written by @tyro89 and @bouk in #512.
      This commit squashes and rebases their pull request
      in order to fix some merge conflicts.
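      The 64 kB ceiling came from Java's writeUTF, whose length prefix is only two bytes; a sketch of the length-prefixed UTF-8 framing that avoids it (names hypothetical):

          import struct

          def write_utf8(stream, s):
              data = s.encode("utf-8")
              stream.write(struct.pack(">i", len(data)))  # 4-byte length prefix
              stream.write(data)                          # no 64 kB limit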
  21. Jan 12, 2014
    • Log Python exceptions to stderr as well · 5741078c
      Matei Zaharia authored
      This helps in case the exception happened while serializing a record to
      be sent to Java, leaving the stream to Java in an inconsistent state
      where PythonRDD won't be able to read the error.
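      The effect, sketched (names hypothetical):

          import sys
          import traceback

          def send(stream, record):
              # stand-in for serializing a record to the Java-facing stream
              raise ValueError("boom while serializing %r" % (record,))

          try:
              send(sys.stdout, {"k": 1})
          except Exception:
              # the Java-facing stream may now be corrupt, so also log the
              # error where a human can still see it
              traceback.print_exc(file=sys.stderr)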
    • Update some Python MLlib parameters to use camelCase, and tweak docs · 4c28a2ba
      Matei Zaharia authored
      We've used camel case in other Spark methods so it felt reasonable to
      keep using it here and make the code match Scala/Java as much as
      possible. Note that parameter names matter in Python because it allows
      passing optional parameters by name.
    • Add Naive Bayes to Python MLlib, and some API fixes · 9a0dfdf8
      Matei Zaharia authored
      - Added a Python wrapper for Naive Bayes
      - Updated the Scala Naive Bayes to match the style of our other
        algorithms better and in particular make it easier to call from Java
        (added builder pattern, removed default value in train method)
      - Updated Python MLlib functions to not require a SparkContext; we can
        get that from the RDD the user gives
      - Added a toString method in LabeledPoint
      - Made the Python MLlib tests run as part of run-tests as well (before
        they could only be run individually through each file)
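      A hedged usage sketch of the Python wrapper, in the form it takes in later releases (signatures in this early version may differ; assumes an existing SparkContext sc):

          from pyspark.mllib.classification import NaiveBayes
          from pyspark.mllib.regression import LabeledPoint

          data = sc.parallelize([
              LabeledPoint(0.0, [0.0, 1.0]),
              LabeledPoint(1.0, [1.0, 0.0]),
          ])
          model = NaiveBayes.train(data)     # the RDD supplies the SparkContext
          print(model.predict([1.0, 0.0]))   # 1.0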