  1. May 22, 2015
    • Michael Armbrust's avatar
      [SPARK-7834] [SQL] Better window error messages · 3c130510
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6363 from marmbrus/windowErrors and squashes the following commits:
      
      516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages
      3c130510
    • Liang-Chi Hsieh's avatar
      [SPARK-7270] [SQL] Consider dynamic partition when inserting into hive table · 126d7235
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7270
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5864 from viirya/dyn_partition_insert and squashes the following commits:
      
      b5627df [Liang-Chi Hsieh] For comments.
      3b21e4b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into dyn_partition_insert
      8a4352d [Liang-Chi Hsieh] Consider dynamic partition when inserting into hive table.
      126d7235
    • Santiago M. Mola's avatar
      [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL. · e4aef91f
      Santiago M. Mola authored
      Author: Santiago M. Mola <santi@mola.io>
      
      Closes #6327 from smola/feature/catalyst-dsl-set-ops and squashes the following commits:
      
      11db778 [Santiago M. Mola] [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
      e4aef91f
    • WangTaoTheTonic's avatar
      [SPARK-7758] [SQL] Override more configs to avoid failure when connecting to PostgreSQL · 31d5d463
      WangTaoTheTonic authored
      https://issues.apache.org/jira/browse/SPARK-7758
      
      When initializing `executionHive`, we only mask
      `javax.jdo.option.ConnectionURL` to override the metastore location.  However,
      other properties that relate to the actual Hive metastore data source are not
      masked.  For example, when using Spark SQL with a PostgreSQL-backed Hive
      metastore, `executionHive` actually tries to use settings read from
      `hive-site.xml`, which point to PostgreSQL, to connect to the temporary
      Derby metastore, which causes errors.
      
      To fix this, we need to mask all metastore data source properties.
      Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()`
      method][1], all properties whose names mention "jdo" or "datanucleus" must be
      included.
      
      [1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288
      
      Tested using PostgreSQL as the metastore; it worked fine.
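      
      As an illustration only (not the actual `executionHive` code), a minimal Scala sketch of the masking idea — dropping every configuration key that mentions "jdo" or "datanucleus" before the conf reaches the Derby-backed execution metastore:
      ```scala
      // Hypothetical sketch: filter out Hive metastore data source properties so that
      // hive-site.xml settings (e.g. a PostgreSQL ConnectionURL) never reach the
      // temporary Derby metastore used by executionHive.
      def maskDataSourceProps(conf: Map[String, String]): Map[String, String] =
        conf.filterNot { case (key, _) =>
          val k = key.toLowerCase
          k.contains("jdo") || k.contains("datanucleus")
        }
      
      val userConf = Map(
        "javax.jdo.option.ConnectionURL" -> "jdbc:postgresql://dbhost/metastore",
        "datanucleus.autoCreateSchema"   -> "true",
        "hive.exec.dynamic.partition"    -> "true"
      )
      
      // Only the non-data-source key survives; Derby defaults fill in the rest.
      maskDataSourceProps(userConf).foreach(println)
      ```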
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:
      
      ca7ae7c [WangTaoTheTonic] add comments
      86caf2c [WangTaoTheTonic] delete unused import
      e4f0feb [WangTaoTheTonic] block more data source related property
      92a81fa [WangTaoTheTonic] fix style check
      e3e683d [WangTaoTheTonic] override more configs to avoid failuer connecting to postgre sql
      31d5d463
    • Michael Armbrust's avatar
      [SPARK-6743] [SQL] Fix empty projections of cached data · 3b68cb04
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6165 from marmbrus/wrongColumn and squashes the following commits:
      
      4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
      aad7eab [Michael Armbrust] rxins comments
      f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
      3b68cb04
    • Cheng Lian's avatar
      [MINOR] [SQL] Ignores Thrift server UISeleniumSuite · 4e5220c3
      Cheng Lian authored
      This Selenium test case has been flaky for a while and has led to frequent Jenkins build failures. Let's disable it temporarily until we figure out a proper solution.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6345 from liancheng/ignore-selenium-test and squashes the following commits:
      
      09996fe [Cheng Lian] Ignores Thrift server UISeleniumSuite
      4e5220c3
    • Cheng Hao's avatar
      [SPARK-7322][SQL] Window functions in DataFrame · f6f2eeb1
      Cheng Hao authored
      This closes #6104.
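      
      As a hypothetical sketch (data and column names invented here, not taken from the PR), the DataFrame window API can be exercised roughly like this:
      ```scala
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions._
      import sqlContext.implicits._
      
      // Hypothetical data: (dept, amount).
      val sales = sc.parallelize(Seq(
        ("a", 100.0), ("a", 250.0), ("b", 75.0), ("b", 300.0)
      )).toDF("dept", "amount")
      
      // Window partitioned by department, ordered by amount descending.
      val byDept = Window.partitionBy($"dept").orderBy($"amount".desc)
      
      sales.select(
        $"dept",
        $"amount",
        avg($"amount").over(byDept).as("avgSoFar"),  // running average within the department
        max($"amount").over(byDept).as("deptMax")
      ).show()
      ```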
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6343 from rxin/window-df and squashes the following commits:
      
      026d587 [Reynold Xin] Address code review feedback.
      dc448fe [Reynold Xin] Fixed Hive tests.
      9794d9d [Reynold Xin] Moved Java test package.
      9331605 [Reynold Xin] Refactored API.
      3313e2a [Reynold Xin] Merge pull request #6104 from chenghao-intel/df_window
      d625a64 [Cheng Hao] Update the dataframe window API as suggsted
      c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
      3b1865f [Cheng Hao] scaladoc typos
      f3fd2d0 [Cheng Hao] polish the unit test
      6847825 [Cheng Hao] Add additional analystcs functions
      57e3bc0 [Cheng Hao] typos
      24a08ec [Cheng Hao] scaladoc
      28222ed [Cheng Hao] fix bug of range/row Frame
      1d91865 [Cheng Hao] style issue
      53f89f2 [Cheng Hao] remove the over from the functions.scala
      964c013 [Cheng Hao] add more unit tests and window functions
      64e18a7 [Cheng Hao] Add Window Function support for DataFrame
      f6f2eeb1
  2. May 21, 2015
    • Yin Huai's avatar
      [SPARK-7737] [SQL] Use leaf dirs having data files to discover partitions. · 347b5010
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-7737
      
      cc liancheng
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6329 from yhuai/spark-7737 and squashes the following commits:
      
      7e0dfc7 [Yin Huai] Use leaf dirs having data files to discover partitions.
      347b5010
    • Andrew Or's avatar
      [SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning · 5287eec5
      Andrew Or authored
      According to yhuai, we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark, we know for sure they are serializable, so we can bypass the cleaning.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits:
      
      a82b451 [Andrew Or] Fix style
      10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
      17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
      523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
      f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
      5287eec5
    • Tathagata Das's avatar
      [SPARK-7478] [SQL] Added SQLContext.getOrCreate · 3d0cccc8
      Tathagata Das authored
      Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like
      
      1. In a REPL/notebook environment, rerunning the line `val sqlContext = new SQLContext` multiple times creates a new context each time while overwriting the reference to the previous one, leading to issues like registered temp tables going missing.
      
      2. In Streaming, creating a SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. To get around this problem I previously had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
      
      This can be solved by `SQLContext.getOrCreate`, which gets or creates a singleton instance of SQLContext using either a given SparkContext or a given SparkConf.
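      
      A minimal usage sketch, assuming the final API accepts a `SparkContext` (the data and table name below are invented for illustration):
      ```scala
      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.SQLContext
      
      // Repeated calls return the same lazily created singleton instead of
      // instantiating (and silently replacing) a fresh SQLContext each time,
      // so registered temp tables survive across calls.
      def process(rdd: RDD[(String, Int)]): Unit = {
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        import sqlContext.implicits._
        rdd.toDF("word", "count").registerTempTable("words")
        sqlContext.sql("SELECT word, count FROM words WHERE count > 1").show()
      }
      ```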
      
      rxin marmbrus
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6006 from tdas/SPARK-7478 and squashes the following commits:
      
      25f4da9 [Tathagata Das] Addressed comments.
      79fe069 [Tathagata Das] Added comments.
      c66ca76 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
      48adb14 [Tathagata Das] Removed HiveContext.getOrCreate
      bf8cf50 [Tathagata Das] Fix more bug
      dec5594 [Tathagata Das] Fixed bug
      b4e9721 [Tathagata Das] Remove unnecessary import
      4ef513b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
      d3ea8e4 [Tathagata Das] Added HiveContext
      83bc950 [Tathagata Das] Updated tests
      f82ae81 [Tathagata Das] Fixed test
      bc72868 [Tathagata Das] Added SQLContext.getOrCreate
      3d0cccc8
    • Yin Huai's avatar
      [SPARK-7763] [SPARK-7616] [SQL] Persists partition columns into metastore · 30f3f556
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6285 from liancheng/spark-7763 and squashes the following commits:
      
      bb2829d [Yin Huai] Fix hashCode.
      d677f7d [Cheng Lian] Fixes Scala style issue
      44b283f [Cheng Lian] Adds test case for SPARK-7616
      6733276 [Yin Huai] Fix a bug that potentially causes https://issues.apache.org/jira/browse/SPARK-7616.
      6cabf3c [Yin Huai] Update unit test.
      7e02910 [Yin Huai] Use metastore partition columns and do not hijack maybePartitionSpec.
      e9a03ec [Cheng Lian] Persists partition columns into metastore
      30f3f556
    • scwf's avatar
      [SQL] [TEST] udf_java_method failed due to jdk version · f6c486aa
      scwf authored
      `java.lang.Math.exp(1.0)` returns different results across JDK versions, so do not use createQueryTest; write a separate test for it instead.
      ```
      jdk version   	result
      1.7.0_11		2.7182818284590455
      1.7.0_05        2.7182818284590455
      1.7.0_71		2.718281828459045
      ```
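      
      For illustration only (the query below is an assumption, not copied from the new test), such a test can compare the query result against the running JVM's own `Math.exp` value rather than a JDK-specific golden answer:
      ```scala
      // Hedged sketch, inside a HiveContext-backed suite where `sql` is available.
      // java_method is Hive's reflection UDF, so both sides run on the same JVM.
      val resultRow = sql(
        "SELECT java_method('java.lang.Math', 'exp', cast(1.0 as double))").collect().head
      assert(resultRow.getString(0).toDouble == java.lang.Math.exp(1.0))
      ```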
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #6274 from scwf/java_method and squashes the following commits:
      
      3dd2516 [scwf] address comments
      5fa1459 [scwf] style
      df46445 [scwf] fix test error
      fcb6d22 [scwf] fix udf_java_method
      f6c486aa
    • Cheng Lian's avatar
      [SPARK-7749] [SQL] Fixes partition discovery for non-partitioned tables · 8730fbb4
      Cheng Lian authored
      When no partition columns can be found, we should have an empty `PartitionSpec`, rather than a `PartitionSpec` with empty partition columns.
      
      This PR together with #6285 should fix SPARK-7749.
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6287 from liancheng/spark-7749 and squashes the following commits:
      
      a799ff3 [Cheng Lian] Adds test cases for SPARK-7749
      c4949be [Cheng Lian] Minor refactoring, and tolerant _TEMPORARY directory name
      5aa87ea [Yin Huai] Make parsePartitions more robust.
      fc56656 [Cheng Lian] Returns empty PartitionSpec if no partition columns can be inferred
      19ae41e [Cheng Lian] Don't list base directory as leaf directory
      8730fbb4
    • Davies Liu's avatar
      [SPARK-7565] [SQL] fix MapType in JsonRDD · a25c1ab8
      Davies Liu authored
      The keys of Map values in JsonRDD should be converted into UTF8String (as should failed records). Thanks to yhuai viirya
      
      Closes #6084
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6299 from davies/string_in_json and squashes the following commits:
      
      0dbf559 [Davies Liu] improve test, fix corrupt record
      6836a80 [Davies Liu] move unit tests into Scala
      b97af11 [Davies Liu] fix MapType in JsonRDD
      a25c1ab8
    • Cheng Hao's avatar
      [SPARK-7320] [SQL] [Minor] Move the testData into beforeAll() · feb3a9d3
      Cheng Hao authored
      Follow-up to #6340, to avoid the test report going missing when it fails.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6312 from chenghao-intel/rollup_minor and squashes the following commits:
      
      b03a25f [Cheng Hao] simplify the testData instantiation
      09b7e8b [Cheng Hao] move the testData into beforeAll()
      feb3a9d3
    • Liang-Chi Hsieh's avatar
      [SPARK-7746][SQL] Add FetchSize parameter for JDBC driver · d0eb9ffe
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7746
      
      Looks like an easy-to-add parameter, but it can yield a significant performance improvement if the JDBC driver honors it.
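      
      A hedged usage sketch; the option key spelling (`fetchsize`) and the connection details below are assumptions, not taken from the PR:
      ```scala
      // Hypothetical connection details; the fetch size is forwarded to the JDBC
      // driver (Statement.setFetchSize) so rows are pulled in larger batches.
      val df = sqlContext.read.format("jdbc").options(Map(
        "url"       -> "jdbc:postgresql://dbhost:5432/dbname",
        "dbtable"   -> "public.events",
        "fetchsize" -> "10000"
      )).load()
      df.count()
      ```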
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6283 from viirya/jdbc_fetchsize and squashes the following commits:
      
      de47f94 [Liang-Chi Hsieh] Don't keep fetchSize as single parameter.
      b7bff2f [Liang-Chi Hsieh] Add FetchSize parameter for JDBC driver.
      d0eb9ffe
  3. May 20, 2015
    • Cheng Hao's avatar
      [SPARK-7320] [SQL] Add Cube / Rollup for dataframe · 42c592ad
      Cheng Hao authored
      This is a follow-up to #6257, which broke the Maven test.
      
      Add cube & rollup for DataFrame
      For example:
      ```scala
      testData.rollup($"a" + $"b", $"b").agg(sum($"a" - $"b"))
      testData.cube($"a" + $"b", $"b").agg(sum($"a" - $"b"))
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6304 from chenghao-intel/rollup and squashes the following commits:
      
      04bb1de [Cheng Hao] move the table register/unregister into beforeAll/afterAll
      a6069f1 [Cheng Hao] cancel the implicit keyword
      ced4b8f [Cheng Hao] remove the unnecessary code changes
      9959dfa [Cheng Hao] update the code as comments
      e1d88aa [Cheng Hao] update the code as suggested
      03bc3d9 [Cheng Hao] Remove the CubedData & RollupedData
      5fd62d0 [Cheng Hao] hiden the CubedData & RollupedData
      5ffb196 [Cheng Hao] Add Cube / Rollup for dataframe
      42c592ad
    • Patrick Wendell's avatar
      Revert "[SPARK-7320] [SQL] Add Cube / Rollup for dataframe" · 6338c40d
      Patrick Wendell authored
      This reverts commit 10698e11.
      6338c40d
    • Yin Huai's avatar
      [SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan. · b631bf73
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-7713
      
      I tested the performance with the following code:
      ```scala
      import sqlContext._
      import sqlContext.implicits._
      
      (1 to 5000).foreach { i =>
        val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
      }
      
      sqlContext.sql("""
      CREATE TEMPORARY TABLE partitionedParquet
      USING org.apache.spark.sql.parquet
      OPTIONS (
        path '/tmp/partitioned'
      )""")
      
      table("partitionedParquet").explain(true)
      ```
      
      On current master, `explain` takes 40s on my laptop. With this PR, `explain` takes 14s.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6252 from yhuai/broadcastHadoopConf and squashes the following commits:
      
      6fa73df [Yin Huai] Address comments of Josh and Andrew.
      807fbf9 [Yin Huai] Make the new buildScan and SqlNewHadoopRDD private sql.
      e393555 [Yin Huai] Cheng's comments.
      2eb53bb [Yin Huai] Use a shared broadcast Hadoop Configuration for partitioned HadoopFsRelations.
      b631bf73
    • Cheng Hao's avatar
      [SPARK-7320] [SQL] Add Cube / Rollup for dataframe · 09265ad7
      Cheng Hao authored
      Add `cube` & `rollup` for DataFrame
      For example:
      ```scala
      testData.rollup($"a" + $"b", $"b").agg(sum($"a" - $"b"))
      testData.cube($"a" + $"b", $"b").agg(sum($"a" - $"b"))
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6257 from chenghao-intel/rollup and squashes the following commits:
      
      7302319 [Cheng Hao] cancel the implicit keyword
      a66e38f [Cheng Hao] remove the unnecessary code changes
      a2869d4 [Cheng Hao] update the code as comments
      c441777 [Cheng Hao] update the code as suggested
      84c9564 [Cheng Hao] Remove the CubedData & RollupedData
      279584c [Cheng Hao] hiden the CubedData & RollupedData
      ef357e1 [Cheng Hao] Add Cube / Rollup for dataframe
      09265ad7
  4. May 19, 2015
    • scwf's avatar
      [SPARK-7656] [SQL] use CatalystConf in FunctionRegistry · 60336e3b
      scwf authored
      follow up for #5806
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #6164 from scwf/FunctionRegistry and squashes the following commits:
      
      15e6697 [scwf] use catalogconf in FunctionRegistry
      60336e3b
    • Cheng Hao's avatar
      [SPARK-7662] [SQL] Resolve correct names for generator in projection · bcb1ff81
      Cheng Hao authored
      ```
      select explode(map(value, key)) from src;
      ```
      Throws exception
      ```
      org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0 ;
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
      at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
      at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
      at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6178 from chenghao-intel/explode and squashes the following commits:
      
      916fbe9 [Cheng Hao] add more strict rules for TGF alias
      5c3f2c5 [Cheng Hao] fix bug in unit test
      e1d93ab [Cheng Hao] Add more unit test
      19db09e [Cheng Hao] resolve names for generator in projection
      bcb1ff81
    • Patrick Wendell's avatar
      [HOTFIX]: Java 6 Build Breaks · 9ebb44f8
      Patrick Wendell authored
      These were blocking RC1 so I fixed them manually.
      9ebb44f8
  5. May 18, 2015
    • Josh Rosen's avatar
      [SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String · c9fa870a
      Josh Rosen authored
      In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to.  As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema.
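      
      A small illustration of the resulting behavior (the DataFrame below is hypothetical):
      ```scala
      import sqlContext.implicits._
      
      // After this change every field of the describe() output is a string,
      // so the observed schema matches the declared one regardless of the
      // numeric types of the underlying columns.
      val people = sc.parallelize(Seq((30, 1.65), (42, 1.80))).toDF("age", "height")
      val summary = people.describe("age", "height")
      summary.printSchema()  // summary, age, height all reported as string
      summary.show()
      ```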
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits:
      
      146b615 [Josh Rosen] Fix R test.
      2974bd5 [Josh Rosen] Cast to string type instead
      f206580 [Josh Rosen] Cast to double to fix SPARK-7687
      307ecbf [Josh Rosen] Add failing regression test for SPARK-7687
      c9fa870a
    • Daoyuan Wang's avatar
      [SPARK-7150] SparkContext.range() and SQLContext.range() · c2437de1
      Daoyuan Wang authored
      This PR is based on #6081, thanks adrian-wang.
      
      Closes #6081
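      
      Illustrative calls to the two new APIs (argument values are arbitrary; treat this as a sketch of the PR title rather than its exact signatures):
      ```scala
      // RDD of Longs from the SparkContext...
      val longs = sc.range(0L, 1000L, step = 2L)
      println(longs.count())  // 500
      
      // ...and a single-column DataFrame ("id") from the SQLContext.
      val ids = sqlContext.range(0L, 1000L)
      ids.show(5)
      ```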
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6230 from davies/range and squashes the following commits:
      
      d3ce5fe [Davies Liu] add tests
      789eda5 [Davies Liu] add range() in Python
      4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
      cbf5200 [Daoyuan Wang] let's add python support in a separate PR
      f45e3b2 [Daoyuan Wang] remove redundant toLong
      617da76 [Daoyuan Wang] fix safe marge for corner cases
      867c417 [Daoyuan Wang] fix
      13dbe84 [Daoyuan Wang] update
      bd998ba [Daoyuan Wang] update comments
      d3a0c1b [Daoyuan Wang] add range api()
      c2437de1
    • Michael Armbrust's avatar
      [SQL] Fix serializability of ORC table scan · eb4632f2
      Michael Armbrust authored
      A follow-up to #6244.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6247 from marmbrus/fixOrcTests and squashes the following commits:
      
      e39ee1b [Michael Armbrust] [SQL] Fix serializability of ORC table scan
      eb4632f2
    • Michael Armbrust's avatar
      [HOTFIX] Fix ORC build break · fcf90b75
      Michael Armbrust authored
      Fix break caused by merging #6225 and #6194.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:
      
      b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break
      fcf90b75
    • Davies Liu's avatar
      [SPARK-6216] [PYSPARK] check python version of worker with driver · 32fbd297
      Davies Liu authored
      This PR reverts #5404; instead, it passes the driver's Python version into the JVM and checks it in the worker before deserializing closures, so it works with different major versions of Python.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6203 from davies/py_version and squashes the following commits:
      
      b8fb76e [Davies Liu] fix test
      6ce5096 [Davies Liu] use string for version
      47c6278 [Davies Liu] check python version of worker with driver
      32fbd297
    • Cheng Lian's avatar
      [SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations · 9dadf019
      Cheng Lian authored
      This PR introduces several performance optimizations to `HadoopFsRelation` and `ParquetRelation2`:
      
      1.  Moving `FileStatus` listing from `DataSourceStrategy` into a cache within `HadoopFsRelation`.
      
          This new cache generalizes and replaces the one used in `ParquetRelation2`.
      
          This also introduces an interface change: to reuse cached `FileStatus` objects, `HadoopFsRelation.buildScan` methods now receive `Array[FileStatus]` instead of `Array[String]`.
      
      1.  When Parquet task side metadata reading is enabled, skip reading row group information when reading Parquet footers.
      
          This is basically what PR #5334 does. Also, we now use `ParquetFileReader.readAllFootersInParallel` to read footers in parallel.
      
      Another optimization in question is, instead of asking `HadoopFsRelation.buildScan` to return an `RDD[Row]` for each selected partition and then unioning them all, we ask it to return a single `RDD[Row]` for all selected partitions. This optimization is based on the fact that Hadoop configuration broadcasting used in `NewHadoopRDD` takes 34% of the time in the following microbenchmark.  However, it complicates data source user code, because user code must merge partition values manually.
      
      To check the cost of broadcasting in `NewHadoopRDD`, I also ran the microbenchmark after removing the `broadcast` call in `NewHadoopRDD`.  All results are shown below.
      
      ### Microbenchmark
      
      #### Preparation code
      
      Generating a partitioned table with 5k partitions, 1k rows per partition:
      
      ```scala
      import sqlContext._
      import sqlContext.implicits._
      
      // Same output location as in the benchmarking code below.
      val path = "hdfs://localhost:9000/user/lian/5k"
      
      for (n <- 0 until 500) {
        val data = for {
          p <- (n * 10) until ((n + 1) * 10)
          i <- 0 until 1000
        } yield (i, f"val_$i%04d", f"$p%04d")
      
        data.
          toDF("a", "b", "p").
          write.
          partitionBy("p").
          mode("append").
          parquet(path)
      }
      ```
      
      #### Benchmarking code
      
      ```scala
      import sqlContext._
      import sqlContext.implicits._
      
      import org.apache.spark.sql.types._
      import com.google.common.base.Stopwatch
      
      val path = "hdfs://localhost:9000/user/lian/5k"
      
      def benchmark(n: Int)(f: => Unit) {
        val stopwatch = new Stopwatch()
      
        def run() = {
          stopwatch.reset()
          stopwatch.start()
          f
          stopwatch.stop()
          stopwatch.elapsedMillis()
        }
      
        val records = (0 until n).map(_ => run())
      
        (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms"))
        println(s"Average: ${records.sum / n.toDouble} ms")
      }
      
      benchmark(3) { read.parquet(path).explain(extended = true) }
      ```
      
      #### Results
      
      Before:
      
      ```
      Round 0: 72528 ms
      Round 1: 68938 ms
      Round 2: 65372 ms
      Average: 68946.0 ms
      ```
      
      After:
      
      ```
      Round 0: 59499 ms
      Round 1: 53645 ms
      Round 2: 53844 ms
      Round 3: 49093 ms
      Round 4: 50555 ms
      Average: 53327.2 ms
      ```
      
      After also removing Hadoop configuration broadcasting:
      
      (Note that I was testing on a local laptop, thus network cost is pretty low.)
      
      ```
      Round 0: 15806 ms
      Round 1: 14394 ms
      Round 2: 14699 ms
      Round 3: 15334 ms
      Round 4: 14123 ms
      Average: 14871.2 ms
      ```
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6225 from liancheng/spark-7673 and squashes the following commits:
      
      2d58a2b [Cheng Lian] Skips reading row group information when using task side metadata reading
      7aa3748 [Cheng Lian] Optimizes FileStatusCache by introducing a map from parent directories to child files
      ba41250 [Cheng Lian] Reuses HadoopFsRelation FileStatusCache in ParquetRelation2
      3d278f7 [Cheng Lian] Fixes a bug when reading a single Parquet data file
      b84612a [Cheng Lian] Fixes Scala style issue
      6a08b02 [Cheng Lian] WIP: Moves file status cache into HadoopFSRelation
      9dadf019
    • Yin Huai's avatar
      [SPARK-7567] [SQL] [follow-up] Use a new flag to set output committer based on mapreduce apis · 530397ba
      Yin Huai authored
      cc liancheng marmbrus
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6130 from yhuai/directOutput and squashes the following commits:
      
      312b07d [Yin Huai] A data source can use spark.sql.sources.outputCommitterClass to override the output committer.
      530397ba
    • Wenchen Fan's avatar
      [SPARK-7269] [SQL] Incorrect analysis for aggregation(use semanticEquals) · 103c863c
      Wenchen Fan authored
      A modified version of https://github.com/apache/spark/pull/6110, using `semanticEquals` to make it more efficient.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6173 from cloud-fan/7269 and squashes the following commits:
      
      e4a3cc7 [Wenchen Fan] address comments
      cc02045 [Wenchen Fan] consider elements length equal
      d7ff8f4 [Wenchen Fan] fix 7269
      103c863c
    • scwf's avatar
      [SPARK-7631] [SQL] treenode argString should not print children · fc2480ed
      scwf authored
      spark-sql>
      > explain extended
      > select * from (
      > select key from src union all
      > select key from src) t;
      
      Currently, the physical plan prints its children in argString:
      ```
      == Physical Plan ==
      Union[ HiveTableScan key#1, (MetastoreRelation default, src, None), None,
      HiveTableScan key#3, (MetastoreRelation default, src, None), None]
      HiveTableScan key#1, (MetastoreRelation default, src, None), None
      HiveTableScan key#3, (MetastoreRelation default, src, None), None
      ```
      
      after this patch:
      ```
      == Physical Plan ==
      Union
       HiveTableScan [key#1], (MetastoreRelation default, src, None), None
       HiveTableScan [key#3], (MetastoreRelation default, src, None), None
      ```
      
      I have tested this locally
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #6144 from scwf/fix-argString and squashes the following commits:
      
      1a642e0 [scwf] fix treenode argString
      fc2480ed
    • Zhan Zhang's avatar
      [SPARK-2883] [SQL] ORC data source for Spark SQL · aa31e431
      Zhan Zhang authored
      This PR updates PR #6135 authored by zhzhan from Hortonworks.
      
      ----
      
      This PR implements a Spark SQL data source for accessing ORC files.
      
      > **NOTE**
      >
      > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive.  That's why the new ORC data source lives under the `org.apache.spark.sql.hive` package and must be used with `HiveContext`.  However, it doesn't require an existing Hive installation to access ORC files.
      
      1.  Saving/loading ORC files without contacting the Hive metastore
      
      1.  Support for complex data types (i.e. array, map, and struct)
      
      1.  Aware of common optimizations provided by Spark SQL:
      
          - Column pruning
          - Partition pruning
          - Filter push-down
      
      1.  Schema evolution support
      1.  Hive metastore table conversion
      
      This PR also includes initial work done by scwf from Huawei (PR #3753).
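      
      A minimal read/write sketch under the constraints described above (paths and data are placeholders, and the data source is addressed by its package name rather than any assumed convenience method):
      ```scala
      import org.apache.spark.sql.hive.HiveContext
      
      // ORC support lives under org.apache.spark.sql.hive, so a HiveContext is
      // required, but no running Hive installation or metastore is needed.
      val hiveContext = new HiveContext(sc)
      import hiveContext.implicits._
      
      val people = sc.parallelize(Seq((1, "alice"), (2, "bob"))).toDF("id", "name")
      people.write.format("org.apache.spark.sql.hive.orc").save("/tmp/people_orc")
      
      val loaded = hiveContext.read.format("org.apache.spark.sql.hive.orc").load("/tmp/people_orc")
      loaded.filter(loaded("id") > 1).show()  // column pruning / filter push-down apply here
      ```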
      
      Author: Zhan Zhang <zhazhan@gmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6194 from liancheng/polishing-orc and squashes the following commits:
      
      55ecd96 [Cheng Lian] Reorganizes ORC test suites
      d4afeed [Cheng Lian] Addresses comments
      21ada22 [Cheng Lian] Adds @since and @Experimental annotations
      128bd3b [Cheng Lian] ORC filter bug fix
      d734496 [Cheng Lian] Polishes the ORC data source
      2650a42 [Zhan Zhang] resolve review comments
      3c9038e [Zhan Zhang] resolve review comments
      7b3c7c5 [Zhan Zhang] save mode fix
      f95abfd [Zhan Zhang] reuse test suite
      7cc2c64 [Zhan Zhang] predicate fix
      4e61c16 [Zhan Zhang] minor change
      305418c [Zhan Zhang] orc data source support
      aa31e431
    • Wenchen Fan's avatar
      [SQL] [MINOR] [THIS] use private for internal field in ScalaUdf · 56ede884
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6235 from cloud-fan/tmp and squashes the following commits:
      
      8f16367 [Wenchen Fan] use private[this]
      56ede884
    • Cheng Lian's avatar
      [SPARK-7570] [SQL] Ignores _temporary during partition discovery · 010a1c27
      Cheng Lian authored
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6091 from liancheng/spark-7570 and squashes the following commits:
      
      8ff07e8 [Cheng Lian] Ignores _temporary during partition discovery
      010a1c27
    • Rene Treffer's avatar
      [SPARK-6888] [SQL] Make the jdbc driver handling user-definable · e1ac2a95
      Rene Treffer authored
      Replace DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect)
      and allow developers to change the dialects on the fly (for new JDBCRDDs only).
      
      Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java;
      the status quo is that the RDD will fail to load.
      This patch makes it possible to override the type mapping to read e.g.
      64-bit numbers as strings and handle them afterwards in software.
      
      JDBCSuite has an example that maps all types to String, which should always
      work (at the cost of extra code afterwards).
      
      As a side effect it should now be possible to develop simple dialects
      out-of-tree and even with spark-shell.
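      
      A rough out-of-tree dialect sketch, assuming the `JdbcDialect`/`JdbcDialects` API introduced here; method names and signatures are best-effort approximations, not verified against this exact revision:
      ```scala
      import java.sql.Types
      import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
      import org.apache.spark.sql.types._
      
      // Read unsigned 64-bit MySQL columns as strings instead of failing to load.
      object UnsignedBigIntAsString extends JdbcDialect {
        override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
      
        override def getCatalystType(
            sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
          if (sqlType == Types.BIGINT && typeName.toUpperCase.contains("UNSIGNED")) Some(StringType)
          else None  // fall back to the default mapping
      }
      
      JdbcDialects.registerDialect(UnsignedBigIntAsString)
      ```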
      
      Author: Rene Treffer <treffer@measite.de>
      
      Closes #5555 from rtreffer/jdbc-dialects and squashes the following commits:
      
      3cbafd7 [Rene Treffer] [SPARK-6888] ignore classes belonging to changed API in MIMA report
      fe7e2e8 [Rene Treffer] [SPARK-6888] Make the jdbc driver handling user-definable
      e1ac2a95
    • Liang-Chi Hsieh's avatar
      [SPARK-7299][SQL] Set precision and scale for Decimal according to JDBC... · e32c0f69
      Liang-Chi Hsieh authored
      [SPARK-7299][SQL] Set precision and scale for Decimal according to JDBC metadata instead of returned BigDecimal
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-7299
      
      When connecting to an Oracle DB through JDBC, the precision and scale of the `BigDecimal` object returned by `ResultSet.getBigDecimal` do not always match the table schema reported by `ResultSetMetaData.getPrecision` and `ResultSetMetaData.getScale`.
      
      So if you insert a value like `19999` into a column of type `NUMBER(12, 2)`, you get back a `BigDecimal` object with scale 0, while the DataFrame schema has the correct type, `DecimalType(12, 2)`. Thus, after you save the DataFrame to a Parquet file and then retrieve it, you get the wrong result, `199.99`.
      
      Because this is only reported on JDBC connections to an Oracle DB, it might be difficult to add a test case for it. But according to the user's test on JIRA, this change solves the problem.
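      
      A pure-JDK illustration of the mismatch, using the values from the description above:
      ```scala
      import java.math.{BigDecimal => JBigDecimal, BigInteger}
      
      // What the driver hands back for 19999 in a NUMBER(12, 2) column: scale 0.
      val fromDriver = new JBigDecimal(BigInteger.valueOf(19999L))
      // What the declared DecimalType(12, 2) schema implies when the same unscaled
      // digits are reinterpreted with scale 2: 199.99 -- the wrong result observed.
      val perSchema = new JBigDecimal(BigInteger.valueOf(19999L), 2)
      
      println(s"driver: $fromDriver, reinterpreted under schema: $perSchema")
      ```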
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5833 from viirya/jdbc_decimal_precision and squashes the following commits:
      
      69bc2b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into jdbc_decimal_precision
      928f864 [Liang-Chi Hsieh] Add comments.
      5f9da94 [Liang-Chi Hsieh] Set up Decimal's precision and scale according to table schema instead of returned BigDecimal.
      e32c0f69
  6. May 17, 2015
    • zsxwing's avatar
      [SPARK-7693][Core] Remove "import scala.concurrent.ExecutionContext.Implicits.global" · ff71d34e
      zsxwing authored
      A lesson learnt from SPARK-7655: Spark should avoid using `scala.concurrent.ExecutionContext.Implicits.global`, because users may submit blocking actions to it and exhaust all of its threads. This could crash Spark, so Spark should always use its own thread pools for safety.
      
      This PR removes all usages of `scala.concurrent.ExecutionContext.Implicits.global` and uses proper thread pools to replace them.
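      
      The general pattern (not Spark's exact internal pools) looks like this:
      ```scala
      import java.util.concurrent.Executors
      import scala.concurrent.{ExecutionContext, Future}
      
      // A dedicated, bounded pool owned by this subsystem.
      val pool = Executors.newFixedThreadPool(128)
      implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
      
      // Blocking work submitted here can exhaust only this pool, not the global one
      // shared implicitly by every user of ExecutionContext.Implicits.global.
      val result = Future { Thread.sleep(1000); "done" }
      ```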
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6223 from zsxwing/SPARK-7693 and squashes the following commits:
      
      a33ff06 [zsxwing] Decrease the max thread number from 1024 to 128
      cf4b3fc [zsxwing] Remove "import scala.concurrent.ExecutionContext.Implicits.global"
      ff71d34e
    • Wenchen Fan's avatar
      [SQL] [MINOR] use catalyst type converter in ScalaUdf · 2f22424e
      Wenchen Fan authored
      It's a follow-up to https://github.com/apache/spark/pull/5154; we can speed up Scala UDF evaluation by creating the type converters in advance.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6182 from cloud-fan/tmp and squashes the following commits:
      
      241cfe9 [Wenchen Fan] use converter in ScalaUdf
      2f22424e
    • Michael Armbrust's avatar
      [SPARK-7491] [SQL] Allow configuration of classloader isolation for hive · 2ca60ace
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6167 from marmbrus/configureIsolation and squashes the following commits:
      
      6147cbe [Michael Armbrust] filter other conf
      22cc3bc7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into configureIsolation
      07476ee [Michael Armbrust] filter empty prefixes
      dfdf19c [Michael Armbrust] [SPARK-6906][SQL] Allow configuration of classloader isolation for hive
      2ca60ace