  1. May 20, 2015
• [SPARK-7251] Perform sequential scan when iterating over BytesToBytesMap · f2faa7af
      Josh Rosen authored
      This patch modifies `BytesToBytesMap.iterator()` to iterate through records in the order that they appear in the data pages rather than iterating through the hashtable pointer arrays. This results in fewer random memory accesses, significantly improving performance for scan-and-copy operations.
      
This is possible because our data pages are laid out as sequences of `[keyLength][data][valueLength][data]` entries. To mark the end of a partially-filled data page, we write `-1` as a special end-of-page length (BytesToBytesMap supports empty/zero-length keys and values, which is why we had to use a negative length).
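For illustration, here is a minimal sketch of scanning one page with this layout, assuming the page is a plain byte array with 4-byte lengths; the real implementation works on memory blocks via Unsafe, so the names and types below are illustrative only:

```scala
// Hypothetical sketch: iterate one data page laid out as
// [keyLength][key bytes][valueLength][value bytes] ..., with -1 marking end of page.
import java.nio.ByteBuffer

def scanPage(page: Array[Byte]): Iterator[(Array[Byte], Array[Byte])] =
  new Iterator[(Array[Byte], Array[Byte])] {
    private val buf = ByteBuffer.wrap(page)
    private var nextKeyLen = if (buf.remaining() >= 4) buf.getInt() else -1

    // -1 is the end-of-page sentinel; 0 is a legal (empty) key length.
    override def hasNext: Boolean = nextKeyLen != -1

    override def next(): (Array[Byte], Array[Byte]) = {
      val key = new Array[Byte](nextKeyLen)
      buf.get(key)
      val value = new Array[Byte](buf.getInt())
      buf.get(value)
      nextKeyLen = if (buf.remaining() >= 4) buf.getInt() else -1
      (key, value)
    }
  }
```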
      
      This patch incorporates / closes #5836.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6159 from JoshRosen/SPARK-7251 and squashes the following commits:
      
      05bd90a [Josh Rosen] Compare capacity, not size, to MAX_CAPACITY
      2a20d71 [Josh Rosen] Fix maximum BytesToBytesMap capacity
      bc4854b [Josh Rosen] Guard against overflow when growing BytesToBytesMap
      f5feadf [Josh Rosen] Add test for iterating over an empty map
      273b842 [Josh Rosen] [SPARK-7251] Perform sequential scan when iterating over entries in BytesToBytesMap
      f2faa7af
• [SPARK-7698] Cache and reuse buffers in ExecutorMemoryAllocator when using heap allocation · 7956dd7a
      Josh Rosen authored
When on-heap memory allocation is used, ExecutorMemoryManager should maintain a cache / pool of buffers for re-use by tasks. This will significantly improve the performance of the new Tungsten sort-shuffle for jobs with many short-lived tasks by eliminating a major source of GC pressure.
      
      This pull request is a minimum-viable-implementation of this idea.  In its current form, this patch significantly improves performance on a stress test which launches huge numbers of short-lived shuffle map tasks back-to-back in the same JVM.
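A minimal sketch of the pooling idea follows. The class and method names are illustrative, not Spark's actual `ExecutorMemoryManager` API: buffers are keyed by size and held through weak references so the GC can still reclaim them under memory pressure.

```scala
// Illustrative buffer pool keyed by size; WeakReference lets the GC reclaim
// pooled buffers that are never reused.
import java.lang.ref.WeakReference
import scala.collection.mutable

class PooledHeapAllocator {
  private val pool = mutable.HashMap.empty[Int, mutable.Queue[WeakReference[Array[Long]]]]

  def allocate(numWords: Int): Array[Long] = synchronized {
    val queue = pool.getOrElseUpdate(numWords, mutable.Queue.empty[WeakReference[Array[Long]]])
    while (queue.nonEmpty) {
      val buf = queue.dequeue().get()  // null if the referent was already collected
      if (buf != null) return buf
    }
    new Array[Long](numWords)          // pool miss: allocate a fresh buffer
  }

  def release(buf: Array[Long]): Unit = synchronized {
    pool.getOrElseUpdate(buf.length, mutable.Queue.empty[WeakReference[Array[Long]]])
      .enqueue(new WeakReference(buf))
  }
}
```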
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6227 from JoshRosen/SPARK-7698 and squashes the following commits:
      
      fd6cb55 [Josh Rosen] SoftReference -> WeakReference
      b154e86 [Josh Rosen] WIP sketch of pooling in ExecutorMemoryManager
      7956dd7a
• [SPARK-7767] [STREAMING] Added test for checkpoint serialization in StreamingContext.start() · 3c434cbf
      Tathagata Das authored
Currently, the background checkpointing thread fails silently if the checkpoint is not serializable. That is hard to debug, so it's best to fail fast at `start()` when checkpointing is enabled and the checkpoint is not serializable.
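A minimal sketch of the fail-fast idea (the method name and exception type are illustrative; the real check lives around `StreamingContext.start()` and `Checkpoint`): serialize the checkpoint eagerly and surface the `NotSerializableException` immediately.

```scala
// Illustrative fail-fast check: serialize the checkpoint up front instead of
// letting the background checkpointing thread fail silently later.
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

def validateCheckpointSerializable(checkpoint: AnyRef): Unit = {
  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  try {
    out.writeObject(checkpoint)
  } catch {
    case e: NotSerializableException =>
      throw new IllegalStateException(
        "DStream checkpointing has been enabled but the DStreams and their functions " +
          "are not serializable", e)
  } finally {
    out.close()
  }
}
```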
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6292 from tdas/SPARK-7767 and squashes the following commits:
      
      51304e6 [Tathagata Das] Addressed comments.
      c35237b [Tathagata Das] Added test for checkpoint serialization in StreamingContext.start()
      3c434cbf
• [SPARK-7237] [SPARK-7741] [CORE] [STREAMING] Clean more closures that need cleaning · 9b84443d
      Andrew Or authored
      SPARK-7741 is the equivalent of SPARK-7237 in streaming. This is an alternative to #6268.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6269 from andrewor14/clean-moar and squashes the following commits:
      
      c51c9ab [Andrew Or] Add periods (trivial)
      6c686ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-moar
      79a435b [Andrew Or] Fix tests
      d18c9f9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-moar
      65ef07b [Andrew Or] Fix tests?
      4b487a3 [Andrew Or] Add tests for closures passed to DStream operations
      328139b [Andrew Or] Do not forget foreachRDD
      5431f61 [Andrew Or] Clean streaming closures
      72b7b73 [Andrew Or] Clean core closures
      9b84443d
• [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42... · 191ee474
      Holden Karau authored
      [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6139 from holdenk/SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random and squashes the following commits:
      
      591f8e5 [Holden Karau] specify old seed for doc tests
      2470004 [Holden Karau] Fix a bunch of seeds with default values to have None as the default which will then result in using the hash of the class name
      cbad96d [Holden Karau] Add the setParams function that is used in the real code
      423b8d7 [Holden Karau] Switch the test code to behave slightly more like production code. also don't check the param map value only check for key existence
      140d25d [Holden Karau] remove extra space
      926165a [Holden Karau] Add some missing newlines for pep8 style
      8616751 [Holden Karau] merge in master
      58532e6 [Holden Karau] its the __name__ method, also treat None values as not set
      56ef24a [Holden Karau] fix test and regenerate base
      afdaa5c [Holden Karau] make sure different classes have different results
      68eb528 [Holden Karau] switch default seed to hash of type of self
      89c4611 [Holden Karau] Merge branch 'master' into SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random
      31cd96f [Holden Karau] specify the seed to randomforestregressor test
      e1b947f [Holden Karau] Style fixes
      ce90ec8 [Holden Karau] merge in master
      bcdf3c9 [Holden Karau] update docstring seeds to none and some other default seeds from 42
      65eba21 [Holden Karau] pep8 fixes
      0e3797e [Holden Karau] Make seed default to random in more places
      213a543 [Holden Karau] Simplify the generated code to only include set default if there is a default rather than having None is note None in the generated code
      1ff17c2 [Holden Karau] Make the seed random for HasSeed in python
      191ee474
• Revert "[SPARK-7320] [SQL] Add Cube / Rollup for dataframe" · 6338c40d
      Patrick Wendell authored
      This reverts commit 10698e11.
      6338c40d
• [SPARK-7579] [ML] [DOC] User guide update for OneHotEncoder · 829f1d95
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #6126 from sryza/sandy-spark-7579 and squashes the following commits:
      
      5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder
      829f1d95
• [SPARK-7537] [MLLIB] spark.mllib API updates · 2ad4837c
      Xiangrui Meng authored
      Minor updates to the spark.mllib APIs:
      
      1. Add `DeveloperApi` to `PMMLExportable` and add `Experimental` to `toPMML` methods.
      2. Mention `RankingMetrics.of` in the `RankingMetrics` constructor.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6280 from mengxr/SPARK-7537 and squashes the following commits:
      
      1bd2583 [Xiangrui Meng] organize imports
      94afa7a [Xiangrui Meng] mark all toPMML methods experimental
      4c40da1 [Xiangrui Meng] mention the factory method for RankingMetrics for Java users
      88c62d0 [Xiangrui Meng] add DeveloperApi to PMMLExportable
      2ad4837c
• [SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan. · b631bf73
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-7713
      
      I tested the performance with the following code:
      ```scala
      import sqlContext._
      import sqlContext.implicits._
      
      (1 to 5000).foreach { i =>
        val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
      }
      
      sqlContext.sql("""
      CREATE TEMPORARY TABLE partitionedParquet
      USING org.apache.spark.sql.parquet
      OPTIONS (
        path '/tmp/partitioned'
      )""")
      
      table("partitionedParquet").explain(true)
      ```
      
On current master, `explain` takes 40s on my laptop. With this PR, `explain` takes 14s.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6252 from yhuai/broadcastHadoopConf and squashes the following commits:
      
      6fa73df [Yin Huai] Address comments of Josh and Andrew.
      807fbf9 [Yin Huai] Make the new buildScan and SqlNewHadoopRDD private sql.
      e393555 [Yin Huai] Cheng's comments.
      2eb53bb [Yin Huai] Use a shared broadcast Hadoop Configuration for partitioned HadoopFsRelations.
      b631bf73
• [SPARK-6094] [MLLIB] Add MultilabelMetrics in PySpark/MLlib · 98a46f9d
      Yanbo Liang authored
      Add MultilabelMetrics in PySpark/MLlib
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6276 from yanboliang/spark-6094 and squashes the following commits:
      
      b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
      98a46f9d
• [SPARK-7654] [MLLIB] Migrate MLlib to the DataFrame reader/writer API · 589b12f8
      Xiangrui Meng authored
      parquetFile -> read.parquet rxin
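In code, the migration looks like this (assuming an existing `sqlContext` and an illustrative path):

```scala
// Old API (deprecated in 1.4):
// val df = sqlContext.parquetFile("/path/to/data")

// New DataFrame reader API:
val df = sqlContext.read.parquet("/path/to/data")
```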
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6281 from mengxr/SPARK-7654 and squashes the following commits:
      
      a79b612 [Xiangrui Meng] parquetFile -> read.parquet
      589b12f8
• [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats. · 3ddf051e
      ehnalis authored
Added faster RM heartbeats while container allocations are pending, with multiplicative back-off.
Also updated the related documentation.
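A rough sketch of the back-off idea follows; the interval values and names are made up for the example, not actual Spark/YARN configuration keys: heartbeat quickly while allocations are pending and double the interval each round, up to the regular heartbeat interval.

```scala
// Illustrative multiplicative back-off for AM-RM heartbeats.
object HeartbeatBackoff {
  val initialAllocationIntervalMs = 200L   // fast interval while containers are pending
  val heartbeatIntervalMs = 3000L          // regular AM-RM heartbeat interval
  private var currentIntervalMs = initialAllocationIntervalMs

  /** Reset to the fast interval whenever new containers are requested. */
  def onNewAllocationRequest(): Unit = { currentIntervalMs = initialAllocationIntervalMs }

  /** How long to sleep before the next heartbeat. */
  def nextIntervalMs(pendingAllocations: Boolean): Long = {
    currentIntervalMs =
      if (pendingAllocations) math.min(currentIntervalMs * 2, heartbeatIntervalMs)
      else heartbeatIntervalMs             // nothing pending: regular heartbeat
    currentIntervalMs
  }
}
```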
      
      Author: ehnalis <zoltan.zvara@gmail.com>
      
      Closes #6082 from ehnalis/yarn and squashes the following commits:
      
      a1d2101 [ehnalis] MIss-spell fixed.
      90f8ba4 [ehnalis] Changed default HB values.
      6120295 [ehnalis] Removed the bug, when allocation heartbeat would not start from initial value.
      08bac63 [ehnalis] Refined style, grammar, removed duplicated code.
      073d283 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
      d4408c9 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
      3ddf051e
• [SPARK-7320] [SQL] Add Cube / Rollup for dataframe · 09265ad7
      Cheng Hao authored
      Add `cube` & `rollup` for DataFrame
      For example:
      ```scala
      testData.rollup($"a" + $"b", $"b").agg(sum($"a" - $"b"))
      testData.cube($"a" + $"b", $"b").agg(sum($"a" - $"b"))
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6257 from chenghao-intel/rollup and squashes the following commits:
      
      7302319 [Cheng Hao] cancel the implicit keyword
      a66e38f [Cheng Hao] remove the unnecessary code changes
      a2869d4 [Cheng Hao] update the code as comments
      c441777 [Cheng Hao] update the code as suggested
      84c9564 [Cheng Hao] Remove the CubedData & RollupedData
      279584c [Cheng Hao] hiden the CubedData & RollupedData
      ef357e1 [Cheng Hao] Add Cube / Rollup for dataframe
      09265ad7
• [SPARK-7663] [MLLIB] Add requirement for word2vec model · b3abf0b8
      Xusen Yin authored
      JIRA issue [link](https://issues.apache.org/jira/browse/SPARK-7663).
      
We should check the size of the word2vec model to prevent it from being unexpectedly empty.
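A minimal sketch of the kind of guard being added (illustrative, not the exact code in `Word2Vec`):

```scala
// Illustrative guard: fail with a clear message instead of building an empty model.
def buildModel(wordVectors: Map[String, Array[Float]]): Map[String, Array[Float]] = {
  require(wordVectors.nonEmpty,
    "The vocabulary size should be > 0. Check that minCount is not too large " +
      "for the training corpus.")
  wordVectors
}
```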
      
      CC srowen.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #6228 from yinxusen/SPARK-7663 and squashes the following commits:
      
      21770c5 [Xusen Yin] check the vocab size
      54ae63e [Xusen Yin] add requirement for word2vec model
      b3abf0b8
  2. May 19, 2015
• [SPARK-7656] [SQL] use CatalystConf in FunctionRegistry · 60336e3b
      scwf authored
      follow up for #5806
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #6164 from scwf/FunctionRegistry and squashes the following commits:
      
      15e6697 [scwf] use catalogconf in FunctionRegistry
      60336e3b
• [SPARK-7744] [DOCS] [MLLIB] Distributed matrix" section in MLlib "Data Types"... · 38605206
      Mike Dusenberry authored
      [SPARK-7744] [DOCS] [MLLIB] Distributed matrix" section in MLlib "Data Types" documentation should be reordered.
      
The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the latter three types, and RowMatrix is considered the "basic" distributed matrix. This will improve the comprehensibility of the "Distributed matrix" section, especially for new readers.
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6270 from dusenberrymw/Reorder_MLlib_Data_Types_Distributed_matrix_docs and squashes the following commits:
      
      6313bab [Mike Dusenberry] The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the later three types, and RowMatrix is considered the "basic" distributed matrix.  This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader.
      38605206
• [SPARK-6246] [EC2] fixed support for more than 100 nodes · 2bc5e061
      alyaxey authored
This is a small fix, but it is important for Amazon users because, as the ticket states, spark-ec2 currently "can't handle clusters with > 100 nodes".
      
      Author: alyaxey <oleksii.sliusarenko@grammarly.com>
      
      Closes #6267 from alyaxey/ec2_100_nodes_fix and squashes the following commits:
      
      1e0d747 [alyaxey] [SPARK-6246] fixed support for more than 100 nodes
      2bc5e061
• [SPARK-7662] [SQL] Resolve correct names for generator in projection · bcb1ff81
      Cheng Hao authored
      ```
      select explode(map(value, key)) from src;
      ```
      Throws exception
      ```
      org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0 ;
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
      at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
      at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
      at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6178 from chenghao-intel/explode and squashes the following commits:
      
      916fbe9 [Cheng Hao] add more strict rules for TGF alias
      5c3f2c5 [Cheng Hao] fix bug in unit test
      e1d93ab [Cheng Hao] Add more unit test
      19db09e [Cheng Hao] resolve names for generator in projection
      bcb1ff81
• [SPARK-7738] [SQL] [PySpark] add reader and writer API in Python · 4de74d26
      Davies Liu authored
      cc rxin, please take a quick look, I'm working on tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6238 from davies/readwrite and squashes the following commits:
      
      c7200eb [Davies Liu] update tests
      9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
      f0c5a04 [Davies Liu] use sqlContext.read.load
      5f68bc8 [Davies Liu] update tests
      6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
      bcc6668 [Davies Liu] add reader amd writer API in Python
      4de74d26
• [SPARK-7652] [MLLIB] Update the implementation of naive Bayes prediction with BLAS · c12dff9b
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7652
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6189 from viirya/naive_bayes_blas_prediction and squashes the following commits:
      
      ab611fd [Liang-Chi Hsieh] Remove unnecessary space.
      ddc48b9 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into naive_bayes_blas_prediction
      b5772b4 [Liang-Chi Hsieh] Fix binary compatibility.
      2f65186 [Liang-Chi Hsieh] Remove toDense.
      1b6cdfe [Liang-Chi Hsieh] Update the implementation of naive Bayes prediction with BLAS.
      c12dff9b
• [SPARK-7586] [ML] [DOC] Add docs of Word2Vec in ml package · 68fb2a46
      Xusen Yin authored
      CC jkbradley.
      
      JIRA [issue](https://issues.apache.org/jira/browse/SPARK-7586).
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #6181 from yinxusen/SPARK-7586 and squashes the following commits:
      
      77014c5 [Xusen Yin] comment fix
      57a4c07 [Xusen Yin] small fix for docs
      1178c8f [Xusen Yin] remove the correctness check in java suite
      1c3f389 [Xusen Yin] delete sbt commit
      1af152b [Xusen Yin] check python example code
      1b5369e [Xusen Yin] add docs of word2vec
      68fb2a46
• [SPARK-7726] Fix Scaladoc false errors · 3c4c1f96
      Iulian Dragos authored
      Visibility rules for static members are different in Scala and Java, and this case requires an explicit static import. Even though these are Java files, they are run through scaladoc, which enforces Scala rules.
      
Also reverted the commit that reverted the upgrade to 2.11.6.
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #6260 from dragos/issue/scaladoc-false-error and squashes the following commits:
      
      f2e998e [Iulian Dragos] Revert "[HOTFIX] Revert "[SPARK-7092] Update spark scala version to 2.11.6""
      0bad052 [Iulian Dragos] Fix scaladoc faux-error.
      3c4c1f96
• [SPARK-7678] [ML] Fix default random seed in HasSeed · 7b16e9f2
      Joseph K. Bradley authored
      Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
      Also, removed fixed random seeds from Word2Vec and ALS.
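A simplified sketch of the new default (the real `HasSeed` is a `Params` trait in spark.ml; this just illustrates deriving the default from the hash of the class name):

```scala
// Simplified illustration: the default seed is derived from the concrete class
// name, so it is deterministic per algorithm but not a single shared constant.
trait HasSeed {
  protected var seed: Long = this.getClass.getName.hashCode.toLong

  def setSeed(value: Long): this.type = { this.seed = value; this }
  def getSeed: Long = seed
}

class Word2VecLike extends HasSeed   // gets a default seed based on its own class name
class ALSLike extends HasSeed        // gets a different default seed
```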
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6251 from jkbradley/scala-fixed-seed and squashes the following commits:
      
      0e37184 [Joseph K. Bradley] Fixed Word2VecSuite, ALSSuite in spark.ml to use original fixed random seeds
      678ec3a [Joseph K. Bradley] Removed fixed random seeds from Word2Vec and ALS. Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
      7b16e9f2
• [SPARK-7047] [ML] ml.Model optional parent support · fb902732
      Joseph K. Bradley authored
Made Model.parent transient. Added Model.hasParent to test for a null parent.
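A minimal illustration of the change (simplified; the real `Model.parent` field is an `Estimator[M]`, not `AnyRef`):

```scala
// Simplified illustration: the parent estimator is @transient so it is not
// serialized with the model, and hasParent checks for a null parent.
abstract class Model {
  @transient var parent: AnyRef = _   // the real field is an Estimator[M]

  def hasParent: Boolean = parent != null
}
```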
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5914 from jkbradley/parent-optional and squashes the following commits:
      
      d501774 [Joseph K. Bradley] Made Model.parent transient.  Added Model.hasParent to test for null parent
      fb902732
• [SPARK-7704] Updating Programming Guides per SPARK-4397 · 32fa611b
      Dice authored
The change in SPARK-4397 lets the compiler find the implicit objects in SparkContext automatically. As a result, we no longer need to import o.a.s.SparkContext._ explicitly and can remove some statements about the "implicit conversions" from the latest Programming Guides (1.3.0 and higher).
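For example, since 1.3 the following pair-RDD code compiles without the explicit import (assuming `sc` is an existing `SparkContext`):

```scala
// No `import org.apache.spark.SparkContext._` needed any more; the implicit
// conversions are picked up from the RDD companion object automatically.
val counts = sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
```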
      
      Author: Dice <poleon.kd@gmail.com>
      
      Closes #6234 from daisukebe/patch-1 and squashes the following commits:
      
      b77ecd9 [Dice] fix a typo
      45dfcd3 [Dice] rewording per Sean's advice
      a094bcf [Dice] Adding a note for users on any previous releases
      a29be5f [Dice] Updating Programming Guides per SPARK-4397
      32fa611b
• [SPARK-7681] [MLLIB] remove mima excludes for 1.3 · 6845cb2f
      Xiangrui Meng authored
These excludes are unnecessary for 1.3 because the changes were made in 1.4.x.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6254 from mengxr/SPARK-7681-mima and squashes the following commits:
      
      7f0cea0 [Xiangrui Meng] remove mima excludes for 1.3
      6845cb2f
• [SPARK-7723] Fix string interpolation in pipeline examples · df34793a
      Saleem Ansari authored
      https://issues.apache.org/jira/browse/SPARK-7723
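The class of bug being fixed, shown in a self-contained form (the variable names here are illustrative, not the exact lines from the guide): the example code used interpolation placeholders in a plain string literal, so the literal `$`-names were printed instead of the values.

```scala
val (id, text, prediction) = (4L, "spark i j k", 1.0)

println("($id, $text) --> prediction=$prediction")   // wrong: prints the placeholders verbatim
println(s"($id, $text) --> prediction=$prediction")  // fixed: interpolates the values
```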
      
      Author: Saleem Ansari <tuxdna@gmail.com>
      
      Closes #6258 from tuxdna/master and squashes the following commits:
      
      2bb5a42 [Saleem Ansari] Merge branch 'master' into mllib-pipeline
      e39db9c [Saleem Ansari] Fix string interpolation in pipeline examples
      df34793a
• Patrick Wendell · 27fa88b9
• Fixing a few basic typos in the Programming Guide. · 61f164d3
      Mike Dusenberry authored
      Just a few minor fixes in the guide, so a new JIRA issue was not created per the guidelines.
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6240 from dusenberrymw/Fix_Programming_Guide_Typos and squashes the following commits:
      
      ffa76eb [Mike Dusenberry] Fixing a few basic typos in the Programming Guide.
      61f164d3
• [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion · 6008ec14
      Xusen Yin authored
      JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581).
      
      CC jkbradley
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:
      
      1a7d80d [Xusen Yin] merge with master
      892a8e9 [Xusen Yin] fix python 3 compatibility
      ec935bf [Xusen Yin] small fix
      3e9fa1d [Xusen Yin] delete note
      69fcf85 [Xusen Yin] simplify and add python example
      81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
      40babfb [Xusen Yin] add java test suite for PolynomialExpansion
      6008ec14
• Patrick Wendell · 23cf8971
• [HOTFIX]: Java 6 Build Breaks · 9ebb44f8
      Patrick Wendell authored
      These were blocking RC1 so I fixed them manually.
      9ebb44f8
  3. May 18, 2015
• [SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String · c9fa870a
      Josh Rosen authored
      In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to.  As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema.
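The idea, sketched with the public API rather than the internal implementation (`df` and the "age" column are assumptions for the example): every aggregate is wrapped in a cast to string so all output columns share one type.

```scala
import org.apache.spark.sql.functions.{avg, col, count, max, min}

// `df` is assumed to be an existing DataFrame with a numeric "age" column.
val uniformlyTyped = df.agg(
  count(col("age")).cast("string"),
  avg(col("age")).cast("string"),
  min(col("age")).cast("string"),
  max(col("age")).cast("string"))
```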
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits:
      
      146b615 [Josh Rosen] Fix R test.
      2974bd5 [Josh Rosen] Cast to string type instead
      f206580 [Josh Rosen] Cast to double to fix SPARK-7687
      307ecbf [Josh Rosen] Add failing regression test for SPARK-7687
      c9fa870a
• [SPARK-7150] SparkContext.range() and SQLContext.range() · c2437de1
      Daoyuan Wang authored
      This PR is based on #6081, thanks adrian-wang.
      
      Closes #6081
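Usage of the new APIs (assuming existing `sc` and `sqlContext`):

```scala
// RDD of Longs from 0 until 1000, split into 4 partitions:
val rdd = sc.range(0L, 1000L, step = 1, numSlices = 4)

// DataFrame with a single "id" column of Longs from 0 until 1000:
val df = sqlContext.range(0L, 1000L)
```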
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6230 from davies/range and squashes the following commits:
      
      d3ce5fe [Davies Liu] add tests
      789eda5 [Davies Liu] add range() in Python
      4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
      cbf5200 [Daoyuan Wang] let's add python support in a separate PR
      f45e3b2 [Daoyuan Wang] remove redundant toLong
      617da76 [Daoyuan Wang] fix safe marge for corner cases
      867c417 [Daoyuan Wang] fix
      13dbe84 [Daoyuan Wang] update
      bd998ba [Daoyuan Wang] update comments
      d3a0c1b [Daoyuan Wang] add range api()
      c2437de1
• [SPARK-7681] [MLLIB] Add SparseVector support for gemv · d03638cc
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7681
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:
      
      ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
      b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
      57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
      458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
      054f05d [Liang-Chi Hsieh] Fix scala style.
      410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
      4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
      5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
      c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.
      d03638cc
• [SPARK-7692] Updated Kinesis examples · 3a600386
      Tathagata Das authored
      - Updated Kinesis examples to use stable API
      - Cleaned up comments, etc.
      - Renamed KinesisWordCountProducerASL to KinesisWordProducerASL
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6249 from tdas/kinesis-examples and squashes the following commits:
      
      7cc307b [Tathagata Das] More tweaks
      f080872 [Tathagata Das] More cleanup
      841987f [Tathagata Das] Small update
      011cbe2 [Tathagata Das] More fixes
      b0d74f9 [Tathagata Das] Updated examples.
      3a600386
• [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners · 0a7a94ea
      jerluc authored
PR per [SPARK-7621](https://issues.apache.org/jira/browse/SPARK-7621), which makes both `KafkaReceiver` and `ReliableKafkaReceiver` report their errors to the `ReceiverTracker`, which in turn adds the events to the bus to fire off any registered `StreamingListener`s.
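With this change, an application can observe those receiver errors through the existing listener API (a usage sketch; `ssc` is assumed to be a `StreamingContext`):

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerReceiverError}

ssc.addStreamingListener(new StreamingListener {
  override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {
    // Reported by the receiver and forwarded through the ReceiverTracker to the listener bus.
    println(s"Receiver error: ${receiverError.receiverInfo.lastErrorMessage}")
  }
})
```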
      
      Author: jerluc <jeremyalucas@gmail.com>
      
      Closes #6204 from jerluc/master and squashes the following commits:
      
      82439a5 [jerluc] [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners
      0a7a94ea
• [SPARK-7624] Revert #4147 · 4fb52f95
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6172 from davies/revert_4147 and squashes the following commits:
      
      3bfbbde [Davies Liu] Revert #4147
      4fb52f95
• [SQL] Fix serializability of ORC table scan · eb4632f2
      Michael Armbrust authored
      A follow-up to #6244.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6247 from marmbrus/fixOrcTests and squashes the following commits:
      
      e39ee1b [Michael Armbrust] [SQL] Fix serializability of ORC table scan
      eb4632f2
• [SPARK-7063] when lz4 compression is used, it causes core dump · 6525fc0a
      Jihong MA authored
This fix addresses an issue found in lz4 1.2.0, which caused a core dump in Spark Core with the IBM JDK. That issue is fixed in lz4 1.3.0.
      
      Author: Jihong MA <linlin200605@gmail.com>
      
      Closes #6226 from JihongMA/SPARK-7063-1 and squashes the following commits:
      
      0cca781 [Jihong MA] SPARK-7063
      4559ed5 [Jihong MA] SPARK-7063
      daa520f [Jihong MA] SPARK-7063 upgrade lz4 jars
      71738ee [Jihong MA] Merge remote-tracking branch 'upstream/master'
      dfaa971 [Jihong MA] SPARK-7265 minor fix of the content
      ace454d [Jihong MA] SPARK-7265 take out PySpark on YARN limitation
      9ea0832 [Jihong MA] Merge remote-tracking branch 'upstream/master'
      d5bf3f5 [Jihong MA] Merge remote-tracking branch 'upstream/master'
      7b842e6 [Jihong MA] Merge remote-tracking branch 'upstream/master'
      9c84695 [Jihong MA] SPARK-7265 address review comment
      a399aa6 [Jihong MA] SPARK-7265 Improving documentation for Spark SQL Hive support
      6525fc0a