  1. Mar 01, 2018
    • KaiXinXiaoLei's avatar
      [SPARK-23405] Generate additional constraints for Join's children · cdcccd7b
      KaiXinXiaoLei authored
      ## What changes were proposed in this pull request?
      
I ran a SQL query: `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is small (one row), while the `catalog_sales` table is large (about 10 billion rows). The task hangs. I found that `cs_order_number` in the `catalog_sales` table contains many null values, and I think these null values should be removed in the logical plan.
      
      >== Optimized Logical Plan ==
      >Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
      >:- Project cs_order_number#1
      >   : +- Filter isnotnull(cs_order_number#1)
      >      : +- MetastoreRelation 100t, ls
      >+- Project cs_order_number#22
      >   +- MetastoreRelation 100t, catalog_sales
      
      Now, use this patch, the plan will be:
      >== Optimized Logical Plan ==
      >Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
      >:- Project cs_order_number#1
      >   : +- Filter isnotnull(cs_order_number#1)
      >      : +- MetastoreRelation 100t, ls
      >+- Project cs_order_number#22
      >   : **+- Filter isnotnull(cs_order_number#22)**
      >     :+- MetastoreRelation 100t, catalog_sales
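
To make the behavior concrete, here is a small, hypothetical repro (table contents and sizes are illustrative, not the original workload): with this patch, `explain(true)` should show an `isnotnull` filter on both join children.

```scala
import org.apache.spark.sql.SparkSession

object InferNotNullFromJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-23405-demo").getOrCreate()
    import spark.implicits._

    val ls = Seq(Option(1L)).toDF("cs_order_number")
    val catalogSales = Seq(Option(1L), None, Option(2L)).toDF("cs_order_number")

    // With this patch the optimized plan should contain Filter isnotnull(cs_order_number)
    // on the catalog_sales side as well, so null rows are pruned before the join.
    ls.join(catalogSales, Seq("cs_order_number"), "left_semi").explain(true)

    spark.stop()
  }
}
```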
      
      ## How was this patch tested?
      
      
      Author: KaiXinXiaoLei <584620569@qq.com>
      Author: hanghang <584620569@qq.com>
      
      Closes #20670 from KaiXinXiaoLei/Spark-23405.
      cdcccd7b
    • Yuming Wang's avatar
      [SPARK-23510][SQL] Support Hive 2.2 and Hive 2.3 metastore · ff148018
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      This is based on https://github.com/apache/spark/pull/20668 for supporting Hive 2.2 and Hive 2.3 metastore.
      
      When we merge the PR, we should give the major credit to wangyum
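
A minimal sketch of what this support enables (the configuration keys are the standard metastore options; the version string is only illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2.3-metastore")
  .config("spark.sql.hive.metastore.version", "2.3.2")   // a 2.2/2.3 metastore becomes selectable
  .config("spark.sql.hive.metastore.jars", "maven")      // fetch matching client jars
  .enableHiveSupport()
  .getOrCreate()
```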
      
      ## How was this patch tested?
      Added the test cases
      
      Author: Yuming Wang <yumwang@ebay.com>
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #20671 from gatorsmile/pr-20668.
      ff148018
    • liuxian's avatar
      [SPARK-23389][CORE] When the shuffle dependency specifies aggregation ,and... · 22f3d333
      liuxian authored
[SPARK-23389][CORE] When the shuffle dependency specifies aggregation and `dependency.mapSideCombine = false`, we should be able to use serialized sorting.
      
      ## What changes were proposed in this pull request?
When the shuffle dependency specifies aggregation and `dependency.mapSideCombine = false`, there is no need for aggregation or sorting on the map side, so we should be able to use serialized sorting.
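
A rough sketch of the eligibility check this implies (simplified; parameter names are illustrative, not the exact `SortShuffleManager` code): the serialized shuffle path only requires that no map-side combine is needed, so an aggregator alone should not disqualify it.

```scala
// Simplified sketch of the condition described above.
def canUseSerializedShuffle(
    serializerSupportsRelocation: Boolean,
    mapSideCombine: Boolean,
    numPartitions: Int,
    maxSupportedPartitions: Int): Boolean = {
  serializerSupportsRelocation &&
    !mapSideCombine &&                       // aggregation alone no longer disqualifies
    numPartitions <= maxSupportedPartitions
}
```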
      
      ## How was this patch tested?
      Existing unit test
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #20576 from 10110346/mapsidecombine.
      22f3d333
  2. Feb 28, 2018
    • Xingbo Jiang's avatar
      [SPARK-23523][SQL][FOLLOWUP] Minor refactor of OptimizeMetadataOnlyQuery · 25c2776d
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      Inside `OptimizeMetadataOnlyQuery.getPartitionAttrs`, avoid using `zip` to generate attribute map.
      Also include other minor update of comments and format.
      
      ## How was this patch tested?
      
      Existing test cases.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #20693 from jiangxb1987/SPARK-23523.
      25c2776d
    • Juliusz Sompolski's avatar
[SPARK-23514] Use SessionState.newHadoopConf() to propagate Hadoop configs set in SQLConf. · 476a7f02
      Juliusz Sompolski authored
      ## What changes were proposed in this pull request?
      
      A few places in `spark-sql` were using `sc.hadoopConfiguration` directly. They should be using `sessionState.newHadoopConf()` to blend in configs that were set through `SQLConf`.
      
      Also, for better UX, for these configs blended in from `SQLConf`, we should consider removing the `spark.hadoop` prefix, so that the settings are recognized whether or not they were specified by the user.
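
A sketch of the pattern, written as it would appear inside Spark's own `org.apache.spark.sql` package (where `sessionState` is accessible); the object name is made up for illustration.

```scala
package org.apache.spark.sql

import org.apache.hadoop.conf.Configuration

object BlendedHadoopConf {
  // Unlike spark.sparkContext.hadoopConfiguration, this also folds in
  // "spark.hadoop."-prefixed entries that were set through SQLConf at runtime.
  def forSession(spark: SparkSession): Configuration =
    spark.sessionState.newHadoopConf()
}
```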
      
      ## How was this patch tested?
      
      Tested that AlterTableRecoverPartitions now correctly recognizes settings that are passed in to the FileSystem through SQLConf.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #20679 from juliuszsompolski/SPARK-23514.
      476a7f02
    • hyukjinkwon's avatar
      [SPARK-23517][PYTHON] Make `pyspark.util._exception_message` produce the trace... · fab563b9
      hyukjinkwon authored
      [SPARK-23517][PYTHON] Make `pyspark.util._exception_message` produce the trace from Java side by Py4JJavaError
      
      ## What changes were proposed in this pull request?
      
      This PR proposes for `pyspark.util._exception_message` to produce the trace from Java side by `Py4JJavaError`.
      
Currently, in Python 2, it uses the `message` attribute, which `Py4JJavaError` happens not to have:
      
      ```python
      >>> from pyspark.util import _exception_message
      >>> try:
      ...     sc._jvm.java.lang.String(None)
      ... except Exception as e:
      ...     pass
      ...
      >>> e.message
      ''
      ```
      
It seems we should use `str` instead for now:
      
       https://github.com/bartdag/py4j/blob/aa6c53b59027925a426eb09b58c453de02c21b7c/py4j-python/src/py4j/protocol.py#L412
      
but this doesn't address the problem with non-ASCII strings from the Java side -
       `https://github.com/bartdag/py4j/issues/306`
      
      So, we could directly call `__str__()`:
      
      ```python
      >>> e.__str__()
      u'An error occurred while calling None.java.lang.String.\n: java.lang.NullPointerException\n\tat java.lang.String.<init>(String.java:588)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:422)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:238)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:745)\n'
      ```
      
      which doesn't type coerce unicodes to `str` in Python 2.
      
This can actually be a problem:
      
      ```python
      from pyspark.sql.functions import udf
      spark.conf.set("spark.sql.execution.arrow.enabled", True)
      spark.range(1).select(udf(lambda x: [[]])()).toPandas()
      ```
      
      **Before**
      
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
          raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
      RuntimeError:
      Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
      ```
      
      **After**
      
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
          raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
      RuntimeError: An error occurred while calling o47.collectAsArrowToPython.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
        File "/.../spark/python/pyspark/worker.py", line 245, in main
          process()
        File "/.../spark/python/pyspark/worker.py", line 240, in process
      ...
      Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
      ```
      
      ## How was this patch tested?
      
      Manually tested and unit tests were added.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #20680 from HyukjinKwon/SPARK-23517.
      fab563b9
    • zhoukang's avatar
[SPARK-23508][CORE] Fix BlockManagerId in case blockManagerIdCache causes OOM · 6a8abe29
      zhoukang authored
      
      ## What changes were proposed in this pull request?
The `blockManagerIdCache` in `BlockManagerId` never removes old values, which may cause an OOM.

`val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()`
Whenever we create a new `BlockManagerId`, it is put into this map and never evicted.

This patch uses a Guava cache for `blockManagerIdCache` instead.
      
A heap dump is shown in [SPARK-23508](https://issues.apache.org/jira/browse/SPARK-23508).
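
A minimal sketch of the change (the size bound and loader are illustrative, not the exact values in the patch): replacing the unbounded map with a bounded Guava cache lets old entries be evicted.

```scala
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

final case class BlockManagerId(executorId: String, host: String, port: Int)

object BlockManagerIdCache {
  private val cache: LoadingCache[BlockManagerId, BlockManagerId] =
    CacheBuilder.newBuilder()
      .maximumSize(10000)   // bound is illustrative; old entries get evicted instead of piling up
      .build(new CacheLoader[BlockManagerId, BlockManagerId] {
        override def load(id: BlockManagerId): BlockManagerId = id
      })

  def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId = cache.get(id)
}
```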
      
      ## How was this patch tested?
Existing tests.
      
      Author: zhoukang <zhoukang199191@gmail.com>
      
      Closes #20667 from caneGuy/zhoukang/fix-history.
      6a8abe29
  3. Feb 27, 2018
    • Liang-Chi Hsieh's avatar
      [SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document · b14993e1
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Clarify JSON and CSV reader behavior in document.
      
JSON doesn't support partial results for corrupted records.
CSV only supports partial results for records with more or fewer tokens than expected.
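
A small illustration of the documented behavior (the column names and tiny in-memory datasets are made up): in `PERMISSIVE` mode, CSV can return a partial row when the token count is wrong, while a corrupted JSON record yields only the corrupt-record column.

```scala
// assuming a SparkSession named `spark`, e.g. in spark-shell
import spark.implicits._

val schema = "a INT, b INT, _corrupt_record STRING"

// CSV: the second record has too few tokens, so `a` is still parsed and `b` is null.
spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv(Seq("1,2", "3").toDS())
  .show()

// JSON: a malformed record produces no partial result, only the corrupt-record column.
spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .json(Seq("""{"a": 1, "b": 2}""", """{"a": 1, "b":""").toDS())
  .show()
```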
      
      ## How was this patch tested?
      
      Pass existing tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #20666 from viirya/SPARK-23448-2.
      b14993e1
    • Bruce Robbins's avatar
      [SPARK-23417][PYTHON] Fix the build instructions supplied by exception... · 23ac3aab
      Bruce Robbins authored
      [SPARK-23417][PYTHON] Fix the build instructions supplied by exception messages in python streaming tests
      
      ## What changes were proposed in this pull request?
      
      Fix the build instructions supplied by exception messages in python streaming tests.
      
I also added `-DskipTests` to the Maven instructions to avoid the ~170 minutes of Scala tests that run each time one wants to add a jar to the assembly directory.
      
      ## How was this patch tested?
      
      - clone branch
      - run build/sbt package
      - run python/run-tests --modules "pyspark-streaming" , expect error message
      - follow instructions in error message. i.e., run build/sbt assembly/package streaming-kafka-0-8-assembly/assembly
      - rerun python tests, expect error message
      - follow instructions in error message. i.e run build/sbt -Pflume assembly/package streaming-flume-assembly/assembly
      - rerun python tests, see success.
      - repeated all of the above for mvn version of the process.
      
      Author: Bruce Robbins <bersprockets@gmail.com>
      
      Closes #20638 from bersprockets/SPARK-23417_propa.
      23ac3aab
    • Marco Gaido's avatar
      [SPARK-23501][UI] Refactor AllStagesPage in order to avoid redundant code · 598446b7
      Marco Gaido authored
As suggested in #20651, the code in `AllStagesPage` is very redundant, and modifying it is copy-and-paste work. We should avoid such a pattern, which is error prone, and have a cleaner solution that avoids code redundancy.
      
      existing UTs
      
      Author: Marco Gaido <marcogaido91@gmail.com>
      
      Closes #20663 from mgaido91/SPARK-23475_followup.
      598446b7
    • Imran Rashid's avatar
      [SPARK-23365][CORE] Do not adjust num executors when killing idle executors. · ecb8b383
      Imran Rashid authored
      The ExecutorAllocationManager should not adjust the target number of
      executors when killing idle executors, as it has already adjusted the
      target number down based on the task backlog.
      
      The name `replace` was misleading with DynamicAllocation on, as the target number
      of executors is changed outside of the call to `killExecutors`, so I adjusted that name.  Also separated out the logic of `countFailures` as you don't always want that tied to `replace`.
      
      While I was there I made two changes that weren't directly related to this:
1) Fixed `countFailures` in a couple of cases where it was getting an incorrect value since it used to be tied to `replace`, e.g. when killing executors on a blacklisted node.
      2) hard error if you call `sc.killExecutors` with dynamic allocation on, since that's another way the ExecutorAllocationManager and the CoarseGrainedSchedulerBackend would get out of sync.
      
      Added a unit test case which verifies that the calls to ExecutorAllocationClient do not adjust the number of executors.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #20604 from squito/SPARK-23365.
      ecb8b383
    • gatorsmile's avatar
      [SPARK-23523][SQL] Fix the incorrect result caused by the rule OptimizeMetadataOnlyQuery · 414ee867
      gatorsmile authored
      ## What changes were proposed in this pull request?
      ```Scala
val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
  .write.json(tablePath.getCanonicalPath)
val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
df.show()
      ```
      
      It generates a wrong result.
      ```
      [c,e,a]
      ```
      
We have a bug in the rule `OptimizeMetadataOnlyQuery`. We should respect the attribute order in the original leaf node. This PR fixes it.
      
      ## How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #20684 from gatorsmile/optimizeMetadataOnly.
      414ee867
    • cody koeninger's avatar
      [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive offsets · eac0b067
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Add a configuration spark.streaming.kafka.allowNonConsecutiveOffsets to allow streaming jobs to proceed on compacted topics (or other situations involving gaps between offsets in the log).
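
A minimal sketch of enabling the new flag in a DStreams job (the app name and batch interval are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("compacted-topic-job")
  // Allow the stream to proceed across offset gaps, e.g. on compacted topics.
  .set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")

val ssc = new StreamingContext(conf, Seconds(10))
```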
      
      ## How was this patch tested?
      
      Added new unit test
      
      justinrmiller has been testing this branch in production for a few weeks
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #20572 from koeninger/SPARK-17147.
      eac0b067
    • Kazuaki Ishizaki's avatar
      [SPARK-23509][BUILD] Upgrade commons-net from 2.2 to 3.1 · 649ed9c5
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR avoids version conflicts of `commons-net` by upgrading commons-net from 2.2 to 3.1. We are seeing the following message during the build using sbt.
      
      ```
      [warn] Found version conflict(s) in library dependencies; some are suspected to be binary incompatible:
      ...
      [warn] 	* commons-net:commons-net:3.1 is selected over 2.2
      [warn] 	    +- org.apache.hadoop:hadoop-common:2.6.5              (depends on 3.1)
      [warn] 	    +- org.apache.spark:spark-core_2.11:2.4.0-SNAPSHOT    (depends on 2.2)
      [warn]
      ```
      
      [Here](https://commons.apache.org/proper/commons-net/changes-report.html) is a release history.
      
      [Here](https://commons.apache.org/proper/commons-net/migration.html) is a migration guide from 2.x to 3.0.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #20672 from kiszk/SPARK-23509.
      649ed9c5
    • Juliusz Sompolski's avatar
      [SPARK-23445] ColumnStat refactoring · 8077bb04
      Juliusz Sompolski authored
      ## What changes were proposed in this pull request?
      
      Refactor ColumnStat to be more flexible.
      
      * Split `ColumnStat` and `CatalogColumnStat` just like `CatalogStatistics` is split from `Statistics`. This detaches how the statistics are stored from how they are processed in the query plan. `CatalogColumnStat` keeps `min` and `max` as `String`, making it not depend on dataType information.
      * For `CatalogColumnStat`, parse column names from property names in the metastore (`KEY_VERSION` property), not from metastore schema. This means that `CatalogColumnStat`s can be created for columns even if the schema itself is not stored in the metastore.
      * Make all fields optional. `min`, `max` and `histogram` for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult / impossible to calculate.
      
      The added flexibility will make it possible to have alternative implementations for stats, and separates stats collection from stats and estimation processing in plans.
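
A rough sketch of the shape this gives the catalog-side statistics (field names and types are approximate, not the exact definition): everything is optional and `min`/`max` are kept as strings.

```scala
case class CatalogColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,   // stored as String, independent of the column's dataType
    max: Option[String] = None,
    nullCount: Option[BigInt] = None,
    avgLen: Option[Long] = None,
    maxLen: Option[Long] = None,
    histogram: Option[Seq[(Double, Double)]] = None)  // placeholder type for the histogram
```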
      
      ## How was this patch tested?
      
      Refactored existing tests to work with refactored `ColumnStat` and `CatalogColumnStat`.
      New tests added in `StatisticsSuite` checking that backwards / forwards compatibility is not broken.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #20624 from juliuszsompolski/SPARK-23445.
      8077bb04
  4. Feb 26, 2018
    • Jose Torres's avatar
      [SPARK-23491][SS] Remove explicit job cancellation from ContinuousExecution reconfiguring · 7ec83658
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
Remove queryExecutionThread.interrupt() from ContinuousExecution. As detailed in the JIRA, interrupting the thread is only relevant in the microbatch case; for continuous processing the query execution can quickly clean itself up without it.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Jose Torres <jose@databricks.com>
      
      Closes #20622 from jose-torres/SPARK-23441.
      7ec83658
    • Andrew Korzhuev's avatar
      [SPARK-23449][K8S] Preserve extraJavaOptions ordering · 185f5bc7
      Andrew Korzhuev authored
For some JVM options, like `-XX:+UnlockExperimentalVMOptions`, ordering is necessary.
      
      ## What changes were proposed in this pull request?
      
Keep the original `extraJavaOptions` ordering when passing them through environment variables inside the Docker container.
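
An illustration of why order matters (the flags are examples, not part of the patch): the unlock flag must appear before the experimental option it enables, so the user's original order has to survive the trip through the container environment.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // -XX:+UnlockExperimentalVMOptions must precede the experimental flag it unlocks.
  .set("spark.driver.extraJavaOptions",
    "-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap")
```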
      
      ## How was this patch tested?
      
Ran the base branch a couple of times and checked the startup command in the logs; the ordering differed every time. After applying the change, the ordering was consistent with what the user specified in `extraJavaOptions`.
      
      Author: Andrew Korzhuev <korzhuev@andrusha.me>
      
      Closes #20628 from andrusha/patch-2.
      185f5bc7
    • Gabor Somogyi's avatar
      [SPARK-23438][DSTREAMS] Fix DStreams data loss with WAL when driver crashes · b308182f
      Gabor Somogyi authored
      ## What changes were proposed in this pull request?
      
      There is a race condition introduced in SPARK-11141 which could cause data loss.
The problem is that the `ReceivedBlockTracker.insertAllocatedBatch` function assumes that all blocks from `streamIdToUnallocatedBlockQueues` were allocated to the batch, and clears the whole queue.
      
      In this PR only the allocated blocks will be removed from the queue which will prevent data loss.
      
      ## How was this patch tested?
      
      Additional unit test + manually.
      
      Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
      
      Closes #20620 from gaborgsomogyi/SPARK-23438.
      b308182f
  5. Feb 25, 2018
    • Gabor Somogyi's avatar
      [SPARK-22886][ML][TESTS] ML test for structured streaming: ml.recomme… · 3ca9a2c5
      Gabor Somogyi authored
      ## What changes were proposed in this pull request?
      
      Converting spark.ml.recommendation tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.
      
      ## How was this patch tested?
      
      Automated: Pass the Jenkins.
      
      Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
      
      Closes #20362 from gaborgsomogyi/SPARK-22886.
      3ca9a2c5
  6. Feb 23, 2018
    • Kazuaki Ishizaki's avatar
      [SPARK-23459][SQL] Improve the error message when unknown column is specified in partition columns · 1a198ce8
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
This PR avoids printing internal schema information when an unknown column is specified in the partition columns, and instead prints the column names in the schema in a more readable format.
      
      The following is an example.
      
      Source code
      ```
      test("save with an unknown partition column") {
        withTempDir { dir =>
          val path = dir.getCanonicalPath
            Seq(1L -> "a").toDF("i", "j").write
              .format("parquet")
              .partitionBy("unknownColumn")
              .save(path)
        }
      ```
      Output without this PR
      ```
      Partition column unknownColumn not found in schema StructType(StructField(i,LongType,false), StructField(j,StringType,true));
      ```
      
      Output with this PR
      ```
      Partition column unknownColumn not found in schema struct<i:bigint,j:string>;
      ```
      
      ## How was this patch tested?
      
      Manually tested
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #20653 from kiszk/SPARK-23459.
      1a198ce8
    • Tathagata Das's avatar
      [SPARK-23408][SS] Synchronize successive AddData actions in Streaming*JoinSuite · 855ce13d
      Tathagata Das authored
      **The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1**
      
      ## What changes were proposed in this pull request?
      
      The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.
      
A prior attempt to solve this issue by jose-torres in #20646 tried to synchronize on all memory sources together when consecutive AddData actions were found. However, this carries the risk of deadlock as well as unintended modification of stress tests (see the above PR for a detailed explanation). Instead, this PR attempts the following.
      
      - A new action called `StreamProgressBlockedActions` that allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
      - An alias of `StreamProgressBlockedActions` called `MultiAddData` is explicitly used in the `Streaming*JoinSuites` to add data to two memory sources simultaneously.
      
      This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.
      
      ## How was this patch tested?
      Modified test cases in `Streaming*JoinSuites` where there are consecutive `AddData` actions.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #20650 from tdas/SPARK-23408.
      855ce13d
  7. Feb 22, 2018
    • Wang Gengliang's avatar
      [SPARK-23490][SQL] Check storage.locationUri with existing table in CreateTable · 049f243c
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
For CreateTable with Append mode, we should check whether `storage.locationUri` is the same as the existing table's in `PreprocessTableCreation`.

In the current code, if `storage.locationUri` differs from the existing table's, there is only an unhelpful exception:
`org.apache.spark.sql.AnalysisException: Table or view not found:`

which can be improved.
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Wang Gengliang <gengliang.wang@databricks.com>
      
      Closes #20660 from gengliangwang/locationUri.
      049f243c
    • Gabor Somogyi's avatar
      [SPARK-23476][CORE] Generate secret in local mode when authentication on · c5abb3c2
      Gabor Somogyi authored
      ## What changes were proposed in this pull request?
      
If Spark is run with `spark.authenticate=true`, it will fail to start in local mode.

This PR generates a secret in local mode when authentication is on.
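
The failing scenario described above, sketched as a local session (after this patch a secret is generated automatically instead of failing at startup):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.authenticate", "true")   // previously failed in local mode without a secret
  .appName("auth-in-local-mode")
  .getOrCreate()
```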
      
      ## How was this patch tested?
      
      Modified existing unit test.
      Manually started spark-shell.
      
      Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
      
      Closes #20652 from gaborgsomogyi/SPARK-23476.
      c5abb3c2
    • Marco Gaido's avatar
      [SPARK-23475][UI] Show also skipped stages · 87293c74
      Marco Gaido authored
      ## What changes were proposed in this pull request?
      
SPARK-20648 introduced the `SKIPPED` status for stages. Previously, skipped stages were shown on the UI as `PENDING`; after that change, they are not shown on the UI at all.

This PR introduces a new section in order to also show `SKIPPED` stages in a proper table.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Marco Gaido <marcogaido91@gmail.com>
      
      Closes #20651 from mgaido91/SPARK-23475.
      87293c74
  8. Feb 21, 2018
    • Shixiong Zhu's avatar
      [SPARK-23475][WEBUI] Skipped stages should be evicted before completed stages · 45cf714e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
The root cause of missing completed stages is that `cleanupStages` never removes skipped stages.

This PR changes the logic to always remove skipped stages first. This is safe since the job itself contains enough information to render skipped stages in the UI.
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <zsxwing@gmail.com>
      
      Closes #20656 from zsxwing/SPARK-23475.
      45cf714e
    • Shixiong Zhu's avatar
      [SPARK-23481][WEBUI] lastStageAttempt should fail when a stage doesn't exist · 744d5af6
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
The issue here is that `AppStatusStore.lastStageAttempt` will return the next available stage in the store when a stage doesn't exist.

This PR adds `last(stageId)` to ensure it returns a correct `StageData`.
      
      ## How was this patch tested?
      
      The new unit test.
      
      Author: Shixiong Zhu <zsxwing@gmail.com>
      
      Closes #20654 from zsxwing/SPARK-23481.
      744d5af6
    • Tathagata Das's avatar
      [SPARK-23484][SS] Fix possible race condition in KafkaContinuousReader · 3fd0ccb1
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
The var `KafkaContinuousReader.knownPartitions` should be thread-safe, as it is accessed from multiple threads: the query thread at the time of reader factory creation, and the epoch-tracking thread at the time of `needsReconfiguration`.
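
A minimal sketch of the fix pattern (class and method names are illustrative, not the actual reader code): marking the shared field `@volatile` makes writes from one thread visible to the other.

```scala
class PartitionTracker {
  @volatile private var knownPartitions: Set[Int] = Set.empty

  // Called from the query thread when reader factories are created.
  def setKnownPartitions(partitions: Set[Int]): Unit = { knownPartitions = partitions }

  // Called from the epoch-tracking thread to decide whether to reconfigure.
  def needsReconfiguration(current: Set[Int]): Boolean = current != knownPartitions
}
```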
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #20655 from tdas/SPARK-23484.
      3fd0ccb1
    • Marco Gaido's avatar
      [SPARK-23217][ML][PYTHON] Add distanceMeasure param to ClusteringEvaluator Python API · e836c27c
      Marco Gaido authored
      ## What changes were proposed in this pull request?
      
      The PR adds the `distanceMeasure` param to ClusteringEvaluator in the Python API. This allows the user to specify `cosine` as distance measure in addition to the default `squaredEuclidean`.
      
      ## How was this patch tested?
      
      added UT
      
      Author: Marco Gaido <marcogaido91@gmail.com>
      
      Closes #20627 from mgaido91/SPARK-23217_python.
      e836c27c
    • Ryan Blue's avatar
      [SPARK-23418][SQL] Fail DataSourceV2 reads when user schema is passed, but not supported. · c8c4441d
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      DataSourceV2 initially allowed user-supplied schemas when a source doesn't implement `ReadSupportWithSchema`, as long as the schema was identical to the source's schema. This is confusing behavior because changes to an underlying table can cause a previously working job to fail with an exception that user-supplied schemas are not allowed.
      
      This reverts commit adcb25a0624, which was added to #20387 so that it could be removed in a separate JIRA issue and PR.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #20603 from rdblue/SPARK-23418-revert-adcb25a0624.
      c8c4441d
  9. Feb 20, 2018
    • Kazuaki Ishizaki's avatar
      [SPARK-23424][SQL] Add codegenStageId in comment · 95e25ed1
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR always adds `codegenStageId` in comment of the generated class. This is a replication of #20419  for post-Spark 2.3.
      Closes #20419
      
      ```
      /* 001 */ public Object generate(Object[] references) {
      /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
      /* 003 */ }
      /* 004 */
      /* 005 */ // codegenStageId=1
      /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 007 */   private Object[] references;
      ...
      ```
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #20612 from kiszk/SPARK-23424.
      95e25ed1
    • Tathagata Das's avatar
      [SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide · 601d653b
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      - Added clear information about triggers
- Made the semantic guarantees of watermarks clearer for streaming aggregations and stream-stream joins.
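
For the trigger information added to the guide, a minimal example using the built-in `rate` source (the sink and interval are arbitrary, and a SparkSession named `spark` is assumed):

```scala
import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("rate")                                   // built-in test source
  .load()
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))    // fire a micro-batch every 10 seconds
  .start()
```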
      
      ## How was this patch tested?
      
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #20631 from tdas/SPARK-23454.
      601d653b
    • Marcelo Vanzin's avatar
      [SPARK-23468][CORE] Stringify auth secret before storing it in credentials. · 6d398c05
      Marcelo Vanzin authored
The secret is used as a string in many parts of the code, so it has to be turned into a hex string to avoid issues such as the random byte sequence not being a valid UTF-8 sequence.
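
A sketch of the idea (the helper name and secret size are illustrative): hex-encode the random bytes so the stored credential is always plain text.

```scala
import java.security.SecureRandom

def createSecretString(numBits: Int = 256): String = {
  val bytes = new Array[Byte](numBits / 8)
  new SecureRandom().nextBytes(bytes)
  bytes.map("%02x".format(_)).mkString   // always a valid string, unlike raw random bytes
}
```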
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #20643 from vanzin/SPARK-23468.
      6d398c05
    • Marcelo Vanzin's avatar
      [SPARK-23470][UI] Use first attempt of last stage to define job description. · 2ba77ed9
      Marcelo Vanzin authored
      This is much faster than finding out what the last attempt is, and the
      data should be the same.
      
      There's room for improvement in this page (like only loading data for
      the jobs being shown, instead of loading all available jobs and sorting
      them), but this should bring performance on par with the 2.2 version.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #20644 from vanzin/SPARK-23470.
      2ba77ed9
    • Dongjoon Hyun's avatar
      [SPARK-23434][SQL] Spark should not warn `metadata directory` for a HDFS file path · 3e48f3b9
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), it logs a misleading warning while looking up `people.json/_spark_metadata`. The root cause of this situation is the difference between `LocalFileSystem` and `DistributedFileSystem`. `LocalFileSystem.exists()` returns `false`, but `DistributedFileSystem.exists` raises `org.apache.hadoop.security.AccessControlException`.
      
      ```scala
      scala> spark.version
      res0: String = 2.4.0-SNAPSHOT
      
      scala> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
      +----+-------+
      | age|   name|
      +----+-------+
      |null|Michael|
      |  30|   Andy|
      |  19| Justin|
      +----+-------+
      
      scala> spark.read.json("hdfs:///tmp/people.json")
      18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.
      18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.
      ```
      
      After this PR,
      ```scala
      scala> spark.read.json("hdfs:///tmp/people.json").show
      +----+-------+
      | age|   name|
      +----+-------+
      |null|Michael|
      |  30|   Andy|
      |  19| Justin|
      +----+-------+
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #20616 from dongjoon-hyun/SPARK-23434.
      3e48f3b9
    • Dongjoon Hyun's avatar
      [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and PPD by default · 83c00876
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
Apache Spark 2.3 introduced `native` ORC support with vectorization and many fixes. However, it was shipped as a non-default option. This PR enables the `native` ORC implementation and predicate pushdown by default for Apache Spark 2.4. We will improve and stabilize the ORC data source before Apache Spark 2.4, and eventually Apache Spark will drop the old Hive-based ORC code.
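
The two settings whose defaults change, shown explicitly (setting them by hand like this was how one opted in before this PR):

```scala
// assuming a SparkSession named `spark`, e.g. in spark-shell
spark.conf.set("spark.sql.orc.impl", "native")          // previously not the default implementation
spark.conf.set("spark.sql.orc.filterPushdown", "true")  // predicate pushdown, previously off by default
```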
      
      ## How was this patch tested?
      
      Pass the Jenkins with existing tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #20634 from dongjoon-hyun/SPARK-23456.
      83c00876
    • Kent Yao's avatar
      [SPARK-23383][BUILD][MINOR] Make a distribution should exit with usage while... · 189f56f3
      Kent Yao authored
      [SPARK-23383][BUILD][MINOR] Make a distribution should exit with usage while detecting wrong options
      
      ## What changes were proposed in this pull request?
      ```shell
      ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz  -Phadoop-2.7
      +++ dirname ./dev/make-distribution.sh
      ++ cd ./dev/..
      ++ pwd
      + SPARK_HOME=/Users/Kent/Documents/spark
      + DISTDIR=/Users/Kent/Documents/spark/dist
      + MAKE_TGZ=false
      + MAKE_PIP=false
      + MAKE_R=false
      + NAME=none
      + MVN=/Users/Kent/Documents/spark/build/mvn
      + ((  5  ))
      + case $1 in
      + NAME=ne-1.0.0-SNAPSHOT
      + shift
      + shift
      + ((  3  ))
      + case $1 in
      + break
      + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
      + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
      ++ command -v git
      + '[' /usr/local/bin/git ']'
      ++ git rev-parse --short HEAD
      + GITREV=98ea6a7
      + '[' '!' -z 98ea6a7 ']'
      + GITREVSTRING=' (git revision 98ea6a7)'
      + unset GITREV
      ++ command -v /Users/Kent/Documents/spark/build/mvn
      + '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
      ++ /Users/Kent/Documents/spark/build/mvn help:evaluate -Dexpression=project.version xyz --tgz -Phadoop-2.7
      ++ grep -v INFO
      ++ tail -n 1
      + VERSION=' -X,--debug                             Produce execution debug output'
      ```
It is better to report the mistake and exit with the usage message than to `break`.
      
      ## How was this patch tested?
      
      manually
      
      cc srowen
      
      Author: Kent Yao <yaooqinn@hotmail.com>
      
      Closes #20571 from yaooqinn/SPARK-23383.
      189f56f3
    • Bruce Robbins's avatar
      [SPARK-23240][PYTHON] Better error message when extraneous data in pyspark.daemon's stdout · 862fa697
      Bruce Robbins authored
      ## What changes were proposed in this pull request?
      
Print a more helpful message when the daemon module's stdout is empty or contains a bad port number.
      
      ## How was this patch tested?
      
      Manually recreated the environmental issues that caused the mysterious exceptions at one site. Tested that the expected messages are logged.
      
      Also, ran all scala unit tests.
      
      
      Author: Bruce Robbins <bersprockets@gmail.com>
      
      Closes #20424 from bersprockets/SPARK-23240_prop2.
      862fa697
    • Ryan Blue's avatar
      [SPARK-23203][SQL] DataSourceV2: Use immutable logical plans. · aadf9535
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      SPARK-23203: DataSourceV2 should use immutable catalyst trees instead of wrapping a mutable DataSourceV2Reader. This commit updates DataSourceV2Relation and consolidates much of the DataSourceV2 API requirements for the read path in it. Instead of wrapping a reader that changes, the relation lazily produces a reader from its configuration.
      
      This commit also updates the predicate and projection push-down. Instead of the implementation from SPARK-22197, this reuses the rule matching from the Hive and DataSource read paths (using `PhysicalOperation`) and copies most of the implementation of `SparkPlanner.pruneFilterProject`, with updates for DataSourceV2. By reusing the implementation from other read paths, this should have fewer regressions from other read paths and is less code to maintain.
      
The new push-down rules also support the following edge cases:
      
      * The output of DataSourceV2Relation should be what is returned by the reader, in case the reader can only partially satisfy the requested schema projection
      * The requested projection passed to the DataSourceV2Reader should include filter columns
      * The push-down rule may be run more than once if filters are not pushed through projections
      
      ## How was this patch tested?
      
      Existing push-down and read tests.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #20387 from rdblue/SPARK-22386-push-down-immutable-trees.
      aadf9535
  10. Feb 19, 2018
    • Marco Gaido's avatar
      [SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date · 651b0277
      Marco Gaido authored
      ## What changes were proposed in this pull request?
      
Before the patch, Spark could infer as Date a partition value which cannot be cast to Date (this can happen when there are extra characters after a valid date, like `2018-02-15AAA`).

When this happens and the input format has metadata which defines the schema of the table, then `null` is returned as the value for the partition column, because the `cast` operator used in `PartitioningAwareFileIndex.inferPartitioning` is unable to convert the value.

The PR checks during partition inference that values can actually be cast to Date or Timestamp, in order to infer those data types for them.
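
A small repro of the situation described above (paths and values are illustrative, and a SparkSession named `spark` is assumed): the directory name starts with a valid date but has trailing characters, so after this patch the partition column should be inferred as a string rather than silently becoming `null`.

```scala
import java.nio.file.Files
import spark.implicits._

val basePath = Files.createTempDirectory("partition-inference").toString
Seq((1, "a")).toDF("id", "value")
  .write.parquet(s"$basePath/part=2018-02-15AAA")

spark.read.parquet(basePath).printSchema()   // `part` should now be inferred as string
```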
      
      ## How was this patch tested?
      
      added UT
      
      Author: Marco Gaido <marcogaido91@gmail.com>
      
      Closes #20621 from mgaido91/SPARK-23436.
      651b0277
    • Dongjoon Hyun's avatar
      [SPARK-23457][SQL] Register task completion listeners first in ParquetFileFormat · f5850e78
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
ParquetFileFormat leaks opened files in some cases. This PR prevents that by registering task completion listeners first, before initialization.
      
      - [spark-branch-2.3-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/205/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/)
      - [spark-master-test-sbt-hadoop-2.6](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4228/testReport/junit/org.apache.spark.sql.execution.datasources.parquet/ParquetQuerySuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/)
      
      ```
      Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
      	at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
      	at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
      	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
      	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
      	at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
      	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
      	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
      	at
      ```
      
      ## How was this patch tested?
      
      Manual. The following test case generates the same leakage.
      
      ```scala
        test("SPARK-23457 Register task completion listeners first in ParquetFileFormat") {
          withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> s"${Int.MaxValue}") {
            withTempDir { dir =>
              val basePath = dir.getCanonicalPath
              Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, "first").toString)
              Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, "second").toString)
              val df = spark.read.parquet(
                new Path(basePath, "first").toString,
                new Path(basePath, "second").toString)
              val e = intercept[SparkException] {
                df.collect()
              }
              assert(e.getCause.isInstanceOf[OutOfMemoryError])
            }
          }
        }
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #20619 from dongjoon-hyun/SPARK-23390.
      f5850e78