  1. Sep 28, 2017
    • [SPARK-22160][SQL] Make sample points per partition (in range partitioner) configurable and bump the default value up to 100 · 323806e6
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      Spark's RangePartitioner hard-codes the number of sampling points per partition at 20, which is sometimes too low. This ticket makes it configurable via spark.sql.execution.rangeExchange.sampleSizePerPartition and raises the default in Spark SQL to 100.
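
      As a rough sketch of how the new knob is used (only the config key comes from this change; the session and DataFrame names are assumed):

      ```python
      # Sketch, assuming an existing SparkSession `spark` and a DataFrame `df`.
      # Raise the number of rows the range partitioner samples per partition.
      spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", 200)

      # Any sort-induced range exchange now samples 200 rows per partition
      # when computing partition bounds.
      df.orderBy("key").write.mode("overwrite").parquet("/tmp/sorted_by_key")
      ```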
      
      ## How was this patch tested?
      Added a pretty sophisticated test based on the chi-square test ...
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #19387 from rxin/SPARK-22160.
    • [SPARK-22159][SQL] Make config names consistently end with "enabled". · d29d1e87
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Rename spark.sql.execution.arrow.enable and spark.sql.codegen.aggregate.map.twolevel.enable to spark.sql.execution.arrow.enabled and spark.sql.codegen.aggregate.map.twolevel.enabled.
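
      In user terms the rename looks like this (a minimal sketch, assuming a SparkSession `spark`):

      ```python
      # Sketch: only the trailing "d" changes in these config keys.
      spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      spark.conf.set("spark.sql.codegen.aggregate.map.twolevel.enabled", "true")
      ```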
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #19384 from rxin/SPARK-22159.
    • [SPARK-22153][SQL] Rename ShuffleExchange -> ShuffleExchangeExec · d74dee13
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      For some reason when we added the Exec suffix to all physical operators, we missed this one. I was looking for this physical operator today and couldn't find it, because I was looking for ExchangeExec.
      
      ## How was this patch tested?
      This is a simple rename and should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #19376 from rxin/SPARK-22153.
    • [SPARK-22128][CORE] Update paranamer to 2.8 to avoid BytecodeReadingParanamer ArrayIndexOutOfBoundsException with Scala 2.12 + Java 8 lambda · 01bd00d1
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Un-manage the jackson-module-paranamer version to let it use the version desired by jackson-module-scala; manage paranamer up to 2.8 for jackson-module-scala 2.7.9, overriding Avro 1.7.7's desired paranamer 2.3.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19352 from srowen/SPARK-22128.
    • [SPARK-22135][MESOS] metrics in spark-dispatcher not being registered properly · f20be4d7
      Paul Mackles authored
      ## What changes were proposed in this pull request?
      
      Fix a trivial bug in how metrics are registered in the Mesos dispatcher: a new registry was created each time the metricRegistry() method was called.
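
      The bug class is easy to sketch outside the dispatcher (a hypothetical Python sketch; all names are invented, the real fix is in the Scala dispatcher code):

      ```python
      class Registry:
          """Stand-in for a Codahale-style MetricRegistry."""

      class BrokenSource:
          def metric_registry(self):
              return Registry()            # bug: a fresh registry per call, so metrics never accumulate

      class FixedSource:
          def __init__(self):
              self._registry = Registry()  # create the registry once

          def metric_registry(self):
              return self._registry        # always hand out the same instance
      ```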
      
      ## How was this patch tested?
      
      Verified manually on a local Mesos setup.
      
      Author: Paul Mackles <pmackles@adobe.com>
      
      Closes #19358 from pmackles/SPARK-22135.
  2. Sep 27, 2017
    • [SPARK-22123][CORE] Add latest failure reason for task set blacklist · 3b117d63
      zhoukang authored
      ## What changes were proposed in this pull request?
      This patch adds the latest failure reason to the task set blacklist abort message, which is shown on the Spark UI so users can see the failure reason directly.
      Until now, every job aborted by a completed blacklist showed only a log line like the one below, with no further information:
      `Aborting $taskSet because task $indexInTaskSet (partition $partition) cannot run anywhere due to node and executor blacklist. Blacklisting behavior can be configured via spark.blacklist.*.`
      **After this change:**
      ```
      Aborting TaskSet 0.0 because task 0 (partition 0)
      cannot run anywhere due to node and executor blacklist.
      Most recent failure:
      Some(Lost task 0.1 in stage 0.0 (TID 3,xxx, executor 1): java.lang.Exception: Fake error!
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:73)
      at org.apache.spark.scheduler.Task.run(Task.scala:99)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:305)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)
      ).
      
      Blacklisting behavior can be configured via spark.blacklist.*.
      
      ```
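
      For context, blacklisting is opt-in; a minimal sketch of turning it on (`spark.blacklist.enabled` is the documented Spark 2.x switch):

      ```python
      from pyspark import SparkConf

      # Sketch: enable task/executor blacklisting so these abort messages can occur.
      conf = SparkConf().set("spark.blacklist.enabled", "true")
      ```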
      
      ## How was this patch tested?
      
      Unit tests and manual testing.
      
      Author: zhoukang <zhoukang199191@gmail.com>
      
      Closes #19338 from caneGuy/zhoukang/improve-blacklist.
    • [MINOR] Fixed up pandas_udf related docs and formatting · 7bf4da8a
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      Fixed some minor issues with pandas_udf related docs and formatting.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19375 from BryanCutler/arrow-pandas_udf-cleanup-minor.
    • [SPARK-22140] Add TPCDSQuerySuite · 9244957b
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Currently we do not run the TPC-DS queries as regular test cases, so we add a test suite that runs them against empty tables to ensure that new code changes do not break them. For example, optimizer/analyzer batches should not exceed the max iteration count.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19361 from gatorsmile/tpcdsQuerySuite.
    • [SPARK-22143][SQL] Fix memory leak in OffHeapColumnVector · 02bb0682
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      `WriteableColumnVector` does not close its child column vectors. This can create memory leaks for `OffHeapColumnVector`, where we do not clean up the memory allocated by a vector's children. This is especially bad for string columns, which use a child byte column vector.
      
      ## How was this patch tested?
      I have updated the existing tests to always use both on-heap and off-heap vectors. Testing and diagnosis were done locally.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #19367 from hvanhovell/SPARK-22143.
    • [HOTFIX][BUILD] Fix finalizer checkstyle error and re-disable checkstyle · 9b98aef6
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix the finalizer checkstyle violation by just turning that rule off; re-disable checkstyle overall since it won't be run by the SBT PR builder. See https://github.com/apache/spark/pull/18887#issuecomment-332580700
      
      ## How was this patch tested?
      
      `./dev/lint-java` runs successfully
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19371 from srowen/HotfixFinalizerCheckstlye.
    • [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format for vectorized UDF. · 09cbf3df
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Currently we use the Arrow File format to communicate with the Python worker when invoking a vectorized UDF, but we can use the Arrow Stream format instead.

      This PR replaces the Arrow File format with the Arrow Stream format.
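
      The difference can be sketched with pyarrow directly (illustrative only, not Spark's internal serializer code):

      ```python
      import pyarrow as pa

      # Sketch: write one batch in the Arrow *stream* format. Unlike the file
      # format, the stream format has no footer, so batches can be produced and
      # consumed incrementally -- which suits worker-to-worker pipes.
      batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["v"])
      sink = pa.BufferOutputStream()
      writer = pa.RecordBatchStreamWriter(sink, batch.schema)
      writer.write_batch(batch)
      writer.close()
      ```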
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #19349 from ueshin/issues/SPARK-22125.
    • [SPARK-22130][CORE] UTF8String.trim() scans " " twice · 12e740bb
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR allows us to scan a string consisting only of white space (e.g. `"     "`) once, while the current implementation scans it twice (right to left, then left to right).
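
      A Python sketch of the single-scan idea (the real code operates on UTF8String's bytes):

      ```python
      # Sketch: an all-space input is detected after one left-to-right scan,
      # so the right-to-left scan only runs for strings with visible characters.
      def trim(s):
          i = 0
          while i < len(s) and s[i] == ' ':
              i += 1
          if i == len(s):          # blank string: done after a single scan
              return ""
          j = len(s) - 1
          while s[j] == ' ':       # safe: at least one non-space exists
              j -= 1
          return s[i:j + 1]
      ```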
      
      ## How was this patch tested?
      
      Existing test suites
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19355 from kiszk/SPARK-22130.
    • [SPARK-20785][WEB-UI][SQL] Spark should provide jump links and add (count) in the SQL web ui. · d2b8b63b
      guoxiaolong authored
      ## What changes were proposed in this pull request?
      
      Proposal:

      Provide links that jump to the Running Queries, Completed Queries, and Failed Queries sections, and show a (count) for each. This is a small optimization of the SQL web UI.
      
      Before the fix:
      
      ![1](https://user-images.githubusercontent.com/26266482/30840686-36025cc0-a2ab-11e7-8d8d-1de0122a84fb.png)
      
      After the fix:
      ![2](https://user-images.githubusercontent.com/26266482/30840723-6cc67a52-a2ab-11e7-8002-9191a55895a6.png)
      
      ## How was this patch tested?
      
      manual tests
      
      
      Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
      
      Closes #19346 from guoxiaolongzte/SPARK-20785.
    • [SPARK-20642][CORE] Store FsHistoryProvider listing data in a KVStore. · 74daf622
      Marcelo Vanzin authored
      The application listing is still generated from event logs, but is now stored
      in a KVStore instance. By default an in-memory store is used, but a new config
      allows setting a local disk path to store the data, in which case a LevelDB
      store will be created.
      
      The provider stores things internally using the public REST API types; I believe
      this is better going forward since it will make it easier to get rid of the
      internal history server API which is mostly redundant at this point.
      
      I also added a finalizer to LevelDBIterator, to make sure that resources are
      eventually released. This helps when code iterates but does not exhaust the
      iterator, thus not triggering the auto-close code.
      
      HistoryServerSuite was modified to not re-start the history server unnecessarily;
      this makes the json validation tests run more quickly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18887 from vanzin/SPARK-20642.
    • [SPARK-22141][SQL] Propagate empty relation before checking Cartesian products · 9c5935d0
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
      When inferring constraints from children, a Join's condition can be simplified to None.
      For example,
      ```
      val testRelation = LocalRelation('a.int)
      val x = testRelation.as("x")
      val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y")
      x.join(y).where($"x.a" === $"y.a")
      ```
      The plan will become
      ```
      Join Inner
      :- LocalRelation <empty>, [a#23]
      +- LocalRelation <empty>, [a#224]
      ```
      And the Cartesian product check will throw an exception for the above plan.

      Propagating the empty relation before checking for Cartesian products resolves the issue.
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Wang Gengliang <ltnwgl@gmail.com>
      
      Closes #19362 from gengliangwang/MoveCheckCartesianProducts.
  3. Sep 26, 2017
    • [SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark · 1fdfe693
      goldmedal authored
      ## What changes were proposed in this pull request?
      We added a method to the Scala API for creating a `DataFrame` from a `Dataset[String]` storing CSV in [SPARK-15463](https://issues.apache.org/jira/browse/SPARK-15463), but PySpark doesn't have `Dataset` to support this feature. Therefore, I add an API to create a `DataFrame` from an `RDD[String]` storing CSV, which is also consistent with PySpark's `spark.read.json`.
      
      For example:
      ```
      >>> rdd = sc.textFile('python/test_support/sql/ages.csv')
      >>> df2 = spark.read.csv(rdd)
      >>> df2.dtypes
      [('_c0', 'string'), ('_c1', 'string')]
      ```
      ## How was this patch tested?
      Added unit test cases.
      
      Author: goldmedal <liugs963@gmail.com>
      
      Closes #19339 from goldmedal/SPARK-22112.
    • [BUILD] Close stale PRs · ceaec938
      hyukjinkwon authored
      Closes #13794
      Closes #18474
      Closes #18897
      Closes #18978
      Closes #19152
      Closes #19238
      Closes #19295
      Closes #19334
      Closes #19335
      Closes #19347
      Closes #19236
      Closes #19244
      Closes #19300
      Closes #19315
      Closes #19356
      Closes #15009
      Closes #18253
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19348 from HyukjinKwon/stale-prs.
    • [SPARK-22103][FOLLOWUP] Rename addExtraCode to addInnerClass · f21f6ce9
      Juliusz Sompolski authored
      ## What changes were proposed in this pull request?
      
      Address PR comments that appeared post-merge: rename `addExtraCode` to `addInnerClass`, and do not count the size of the inner class toward the size of the outer class.
      
      ## How was this patch tested?
      
      YOLO.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #19353 from juliuszsompolski/SPARK-22103followup.
    • [SPARK-22124][SQL] Sample and Limit should also defer input evaluation under codegen · 64fbd1ce
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We can override `usedInputs` to claim that an operator defers input evaluation. `Sample` and `Limit` are two operators that should claim this but don't. We should do it.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19345 from viirya/SPARK-22124.
  4. Sep 25, 2017
    • [SPARK-22106][PYSPARK][SQL] Disable 0-parameter pandas_udf and add doctests · d8e825e3
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This change disables the use of 0-parameter pandas_udfs because the API is overly complex and awkward, and it can easily be worked around by using an index column as an input argument. Also added doctests for pandas_udfs, which revealed bugs in handling empty partitions and in using the pandas_udf decorator.
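
      A sketch of the suggested workaround (the DataFrame and column names are assumed; the `pandas_udf(f, returnType)` signature is the one added for Spark 2.3):

      ```python
      import pandas as pd
      from pyspark.sql.functions import pandas_udf, lit

      # Sketch: instead of a 0-parameter pandas_udf, accept a throwaway column
      # and ignore its values, using only its length/index.
      def ones(v):
          return pd.Series(1.0, index=v.index)

      ones_udf = pandas_udf(ones, "double")
      df.withColumn("one", ones_udf(lit(0)))   # `df` is assumed to exist
      ```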
      
      ## How was this patch tested?
      
      Reworked existing 0-parameter test to verify error is raised, added doctest for pandas_udf, added new tests for empty partition and decorator usage.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19325 from BryanCutler/arrow-pandas_udf-0-param-remove-SPARK-22106.
    • [SPARK-22120][SQL] TestHiveSparkSession.reset() should clean out Hive warehouse directory · ce204780
      Greg Owen authored
      ## What changes were proposed in this pull request?
      During TestHiveSparkSession.reset(), which is called after each TestHiveSingleton suite, we now delete and recreate the Hive warehouse directory.
      
      ## How was this patch tested?
      Ran full suite of tests locally, verified that they pass.
      
      Author: Greg Owen <greg@databricks.com>
      
      Closes #19341 from GregOwen/SPARK-22120.
    • [SPARK-22103] Move HashAggregateExec parent consume to a separate function in codegen · 038b1857
      Juliusz Sompolski authored
      ## What changes were proposed in this pull request?
      
      HashAggregateExec codegen uses two paths: one for the fast hash table and a generic one.
      It generates code for iterating over both, and each path generates the consume code of the parent operator, so that code is expanded twice.
      This leads to a long generated function that might be an issue for the compiler (see e.g. SPARK-21603).
      I propose to remove the double expansion by generating the consume code in a helper function that can be called from both iterating loops.
      
      An issue with separating the `consume` code into a helper function was that a number of places relied on being in the scope of an outer `produce` loop, e.g. using `continue` to jump out.
      I replaced such code flows with nested scopes. The compiler should handle this code the same way, while we no longer depend on assumptions about scopes outside of `consume`'s own.
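
      A minimal sketch of the refactoring idea (plain Python standing in for the generated Java):

      ```python
      # Sketch: the parent's consume logic is emitted once as a helper and
      # called from both hash-map iteration loops, instead of being inlined twice.
      def consume_parent(row, out):
          out.append(row)            # stand-in for the parent operator's consume code

      def produce(fast_map, generic_map):
          out = []
          for row in fast_map:       # fast hash table path
              consume_parent(row, out)
          for row in generic_map:    # generic path
              consume_parent(row, out)
          return out
      ```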
      
      ## How was this patch tested?
      
      Existing test coverage.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #19324 from juliuszsompolski/aggrconsumecodegen.
    • [SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace · 2c5b9b11
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      MemoryStore.evictBlocksToFreeSpace acquires write locks for all the
      blocks it intends to evict up front. If there is a failure to evict
      blocks (e.g., some failure dropping a block to disk), then we have to
      release the locks. Otherwise the locks are never released and an executor
      trying to get one of them will wait forever.
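
      A hypothetical sketch of the fix pattern (all names invented; the real code is Scala in MemoryStore):

      ```python
      # Sketch: locks taken up front are released even if eviction fails part-way.
      def evict_blocks_to_free_space(blocks, lock, unlock, drop):
          acquired = []
          try:
              for b in blocks:
                  lock(b)            # write locks acquired up front
                  acquired.append(b)
              for b in acquired:
                  drop(b)            # e.g. dropping to disk; may fail
          finally:
              for b in acquired:
                  unlock(b)          # the fix: never leave write locks held
      ```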
      
      ## How was this patch tested?
      
      Added unit test.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #19311 from squito/SPARK-22083.
    • [SPARK-22100][SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type · 365a29bd
      Zhenhua Wang authored
      
      ## What changes were proposed in this pull request?
      
      The `percentile_approx` function previously accepted numeric input and produced double results.

      But since numeric, date, and timestamp types are all represented as numerics internally, `percentile_approx` can support them easily.
      
      After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
      
      This change is also required when we generate equi-height histograms for these types.
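
      A small sketch of the user-visible effect (the table and column names are assumed):

      ```python
      # Sketch, assuming a table `events` with a timestamp column `ts`:
      # before this change the median came back as a double; now it keeps
      # the input type.
      spark.sql(
          "SELECT percentile_approx(ts, 0.5) AS median_ts FROM events"
      ).printSchema()
      ```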
      
      ## How was this patch tested?
      
      Added a new test and modified some existing tests.
      
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      
      Closes #19321 from wzhfy/approx_percentile_support_types.
  5. Sep 24, 2017
    • [SPARK-22107] Change as to alias in python quickstart · 20adf9aa
      John O'Leary authored
      ## What changes were proposed in this pull request?
      
      Updated docs so that a line of Python in the quick start guide executes. Closes #19283
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: John O'Leary <jgoleary@gmail.com>
      
      Closes #19326 from jgoleary/issues/22107.
    • [SPARK-22087][SPARK-14650][WIP][BUILD][REPL][CORE] Compile Spark REPL for Scala 2.12 + other 2.12 fixes · 576c43fb
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Enable Scala 2.12 REPL. Fix most remaining issues with 2.12 compilation and warnings, including:
      
      - Selecting Kafka 0.10.1+ for Scala 2.12 and patching over a minor API difference
      - Fixing lots of "eta expansion of zero arg method deprecated" warnings
      - Resolving the SparkContext.sequenceFile implicits compile problem
      - Fixing an odd but valid jetty-server missing dependency in hive-thriftserver
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19307 from srowen/Scala212.
    • [SPARK-22058][CORE] the BufferedInputStream will not be closed if an exception occurs. · 4943ea59
      zuotingbing authored
      ## What changes were proposed in this pull request?
      
      EventLoggingListener uses `val in = new BufferedInputStream(fs.open(log))` and closes it if `codec.map(_.compressedInputStream(in)).getOrElse(in)` throws an exception.
      But if `CompressionCodec.createCodec(new SparkConf, c)` throws an exception, the BufferedInputStream `in` is never closed.
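
      A hypothetical sketch of the fix pattern (Python stand-in; the real code is Scala around the event log stream):

      ```python
      # Sketch: close the raw stream if wrapping it in a codec fails.
      def open_event_log(path, wrap_with_codec):
          raw = open(path, "rb")
          try:
              return wrap_with_codec(raw)   # codec creation/wrapping may raise
          except Exception:
              raw.close()                   # the fix: don't leak the raw stream
              raise
      ```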
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #19277 from zuotingbing/SPARK-22058.
    • [SPARK-22093][TESTS] Fixes `assume` in `UtilsSuite` and `HiveDDLSuite` · 9d48bd0b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to remove `assume` from the `Utils.resolveURIs` test cases and to replace `assume` with `assert` in the `Utils.resolveURI` test cases in `UtilsSuite`.

      `Utils.resolveURIs` supports both multiple paths and a single path as input, so it is not meaningful to check whether the input contains `,`.

      For the `Utils.resolveURI` test, I replaced `assume` with `assert` because it takes a single path, and to prevent future mistakes when more tests are added here.

      For the `assume` in `HiveDDLSuite`, it should be `assert` because it is the final check of the test.

      ## How was this patch tested?
      
      Fixed unit tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19332 from HyukjinKwon/SPARK-22093.