  1. Nov 23, 2017
  2. Nov 22, 2017
    • Jakub Nowacki's avatar
      [SPARK-22495] Fix setup of SPARK_HOME variable on Windows · b4edafa9
      Jakub Nowacki authored
      ## What changes were proposed in this pull request?
      
      This fixes how `SPARK_HOME` is resolved on Windows. While the previous version worked with the built release download, the set of directories changed slightly for the PySpark `pip` or `conda` install. This has been reflected in the Linux files in `bin` but not in the Windows `cmd` files.
      
      The first fix improves how the `jars` directory is found, as this was stopping the Windows version of the `pip/conda` install from working; JARs were not found on Session/Context setup.
      
      The second fix adds a `find-spark-home.cmd` script, which, like the Linux version, uses the `find_spark_home.py` script to resolve `SPARK_HOME`. It is based on the `find-spark-home` bash script, though some operations are done in a different order due to the limitations of the `cmd` script language. If the environment variable is already set, the Python script `find_spark_home.py` will not be run. The process can fail if Python is not installed, but this path is mostly taken when PySpark is installed via `pip/conda`, in which case Python is present on the system.
      
      ## How was this patch tested?
      
      Tested on local installation.
      
      Author: Jakub Nowacki <j.s.nowacki@gmail.com>
      
      Closes #19370 from jsnowacki/fix_spark_cmds.
      b4edafa9
    • Ilya Matiach's avatar
      [SPARK-21866][ML][PYSPARK] Adding spark image reader · 1edb3175
      Ilya Matiach authored
      ## What changes were proposed in this pull request?
      Adding spark image reader, an implementation of schema for representing images in spark DataFrames
      
      The code is taken from the spark package located here:
      (https://github.com/Microsoft/spark-images)
      
      Please see the JIRA for more information (https://issues.apache.org/jira/browse/SPARK-21866)
      
      Please see mailing list for SPIP vote and approval information:
      (http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)
      
      # Background and motivation
      As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.
      This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.
      This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.
      The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.
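
      For orientation, here is a sketch in Scala of the kind of single-column struct schema described above. The field names follow the SPIP discussion and are illustrative, not a quote of the merged code:

      ```scala
      import org.apache.spark.sql.types._

      // One "image" column whose struct carries the source URI, the dimensions, the channel count,
      // an OpenCV-compatible type code, and the decompressed pixel bytes.
      val imageFields = StructType(Seq(
        StructField("origin", StringType, nullable = true),
        StructField("height", IntegerType, nullable = false),
        StructField("width", IntegerType, nullable = false),
        StructField("nChannels", IntegerType, nullable = false),
        StructField("mode", IntegerType, nullable = false),
        StructField("data", BinaryType, nullable = false)
      ))
      val imageSchema = StructType(StructField("image", imageFields, nullable = true) :: Nil)
      ```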
      
      ## How was this patch tested?
      
      Unit tests in Scala (`ImageSchemaSuite`) and unit tests in Python.
      
      Author: Ilya Matiach <ilmat@microsoft.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19439 from imatiach-msft/ilmat/spark-images.
      1edb3175
    • Wenchen Fan's avatar
      [SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions · 0605ad76
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      A frequently reported issue in Spark is the Java 64KB compile error. It occurs because Spark generates a very big method, which is usually caused by one of 3 reasons:
      
      1. a deep expression tree, e.g. a very complex filter condition
      2. many individual expressions, e.g. expressions can have many children, operators can have many expressions.
      3. a deep query plan tree (with whole stage codegen)
      
      This PR focuses on reason 1. There are already several patches (#15620, #18972, #18641) trying to fix this issue, and some of them are already merged. However, this is an endless job, as every non-leaf expression has this issue.
      
      This PR proposes to fix this issue in `Expression.genCode`, to make sure the code for a single expression won't grow too big.
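
      A minimal repro sketch (an assumed shape, not taken from the PR) of reason 1: a single column wrapped in thousands of nested operations yields one huge generated Java method unless the code for sub-expressions is split out, tripping the JVM's 64KB-per-method bytecode limit.

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions._

      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._

      // Build a deep expression tree: i + 1 + 1 + ... repeated 2000 times.
      val df = Seq(1, 2, 3).toDF("i")
      val deepExpr = (1 to 2000).foldLeft(col("i"))((acc, _) => acc + lit(1))
      df.select(deepExpr.as("result")).show()  // before this fix, such queries could fail to compile
      ```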
      
      According to maropu's benchmark, no regression is found with TPC-DS (thanks maropu!): https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing
      
      ## How was this patch tested?
      
      existing test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Wenchen Fan <cloud0fan@gmail.com>
      
      Closes #19767 from cloud-fan/codegen.
      0605ad76
    • Mark Petruska's avatar
      [SPARK-22572][SPARK SHELL] spark-shell does not re-initialize on :replay · 327d25fe
      Mark Petruska authored
      ## What changes were proposed in this pull request?
      
      Ticket: [SPARK-22572](https://issues.apache.org/jira/browse/SPARK-22572)
      
      ## How was this patch tested?
      
      Added a new test case to `org.apache.spark.repl.ReplSuite`
      
      Author: Mark Petruska <petruska.mark@gmail.com>
      
      Closes #19791 from mpetruska/SPARK-22572.
      327d25fe
    • Kazuaki Ishizaki's avatar
      [SPARK-20101][SQL][FOLLOW-UP] use correct config name "spark.sql.columnVector.offheap.enabled" · 572af502
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR addresses [the spelling miss](https://github.com/apache/spark/pull/17436#discussion_r152189670) of the config name `spark.sql.columnVector.offheap.enable`.
      We should use `spark.sql.columnVector.offheap.enabled` instead.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19794 from kiszk/SPARK-20101-follow.
      572af502
    • Takeshi Yamamuro's avatar
      [SPARK-22445][SQL][FOLLOW-UP] Respect stream-side child's needCopyResult in BroadcastHashJoin · 2c0fe818
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      I found that #19656 causes some bugs; for example, it changed the result set of `q6` in TPC-DS (I keep tracking TPC-DS results daily [here](https://github.com/maropu/spark-tpcds-datagen/tree/master/reports/tests)):
      - w/o pr19658
      ```
      +-----+---+
      |state|cnt|
      +-----+---+
      |   MA| 10|
      |   AK| 10|
      |   AZ| 11|
      |   ME| 13|
      |   VT| 14|
      |   NV| 15|
      |   NH| 16|
      |   UT| 17|
      |   NJ| 21|
      |   MD| 22|
      |   WY| 25|
      |   NM| 26|
      |   OR| 31|
      |   WA| 36|
      |   ND| 38|
      |   ID| 39|
      |   SC| 45|
      |   WV| 50|
      |   FL| 51|
      |   OK| 53|
      |   MT| 53|
      |   CO| 57|
      |   AR| 58|
      |   NY| 58|
      |   PA| 62|
      |   AL| 63|
      |   LA| 63|
      |   SD| 70|
      |   WI| 80|
      | null| 81|
      |   MI| 82|
      |   NC| 82|
      |   MS| 83|
      |   CA| 84|
      |   MN| 85|
      |   MO| 88|
      |   IL| 95|
      |   IA|102|
      |   TN|102|
      |   IN|103|
      |   KY|104|
      |   NE|113|
      |   OH|114|
      |   VA|130|
      |   KS|139|
      |   GA|168|
      |   TX|216|
      +-----+---+
      ```
      - w/   pr19658
      ```
      +-----+---+
      |state|cnt|
      +-----+---+
      |   RI| 14|
      |   AK| 16|
      |   FL| 20|
      |   NJ| 21|
      |   NM| 21|
      |   NV| 22|
      |   MA| 22|
      |   MD| 22|
      |   UT| 22|
      |   AZ| 25|
      |   SC| 28|
      |   AL| 36|
      |   MT| 36|
      |   WA| 39|
      |   ND| 41|
      |   MI| 44|
      |   AR| 45|
      |   OR| 47|
      |   OK| 52|
      |   PA| 53|
      |   LA| 55|
      |   CO| 55|
      |   NY| 64|
      |   WV| 66|
      |   SD| 72|
      |   MS| 73|
      |   NC| 79|
      |   IN| 82|
      | null| 85|
      |   ID| 88|
      |   MN| 91|
      |   WI| 95|
      |   IL| 96|
      |   MO| 97|
      |   CA|109|
      |   CA|109|
      |   TN|114|
      |   NE|115|
      |   KY|128|
      |   OH|131|
      |   IA|156|
      |   TX|160|
      |   VA|182|
      |   KS|211|
      |   GA|230|
      +-----+---+
      ```
      This PR keeps the original logic of `CodegenContext.copyResult` in `BroadcastHashJoinExec`.
      
      ## How was this patch tested?
      Existing tests
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #19781 from maropu/SPARK-22445-bugfix.
      2c0fe818
    • vinodkc's avatar
      [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Support writing to Hive table... · e0d7665c
      vinodkc authored
      [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Support writing to Hive table which uses Avro schema url 'avro.schema.url'
      
      ## What changes were proposed in this pull request?
      SPARK-19580 Support for avro.schema.url while writing to hive table
      SPARK-19878 Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
      SPARK-17920 HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url
      
      Support writing to Hive table which uses Avro schema url 'avro.schema.url'
      For example:
      create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
      
      create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
      
       insert overwrite table avro_out select * from avro_in;  // fails with java.lang.NullPointerException
      
       WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem
      java.lang.NullPointerException
      	at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
      	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
      
      ## Changes proposed in this fix
      Currently a null value is passed to the serializer, which causes an NPE during the insert operation; instead, pass the Hadoop configuration object.
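
      A minimal sketch of the shape of the fix, assuming the Hive `AbstractSerDe.initialize(Configuration, Properties)` entry point; the actual change lives in Spark's Hive write path:

      ```scala
      import java.util.Properties
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.hive.serde2.AbstractSerDe

      // Passing a real Hadoop Configuration (instead of null) lets AvroSerDe resolve
      // 'avro.schema.url' through a FileSystem rather than failing with a NullPointerException.
      def initializeSerde(serde: AbstractSerDe, hadoopConf: Configuration, tableProps: Properties): Unit = {
        serde.initialize(hadoopConf, tableProps)
      }
      ```
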
      ## How was this patch tested?
      Added new test case in VersionsSuite
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      
      Closes #19779 from vinodkc/br_Fix_SPARK-17920.
      e0d7665c
  3. Nov 21, 2017
    • Jia Li's avatar
      [SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source · 881c5c80
      Jia Li authored
      ## What changes were proposed in this pull request?
      
      Let’s say I have a nested AND expression shown below and p2 can not be pushed down,
      
      (p1 AND p2) OR p3
      
      In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with the JDBC data source and is similar to [SPARK-12218](https://github.com/apache/spark/pull/10362) for Parquet. When we have an AND nested below another expression, we should either push both legs or nothing (see the sketch after the notes below).
      
      Note that:
      - The current Spark code will always split conjunctive predicate before it determines if a predicate can be pushed down or not
      - If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be nested AND expression.
      - The current Spark code logic for OR is OK. It either pushes both legs or nothing.
      
      The same translation method is also called by Data Source V2.
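
      A simplified sketch of the corrected rule for a nested AND (the real logic lives in the shared filter translation; `translate` below is a stand-in for that recursion): the conjunction is pushed only when both legs translate.

      ```scala
      import org.apache.spark.sql.catalyst.expressions
      import org.apache.spark.sql.sources

      // Hypothetical helper for illustration: if either leg of the AND cannot be translated,
      // nothing is pushed down, so predicates like p2 are no longer silently dropped.
      def translateAnd(
          e: expressions.And,
          translate: expressions.Expression => Option[sources.Filter]): Option[sources.Filter] = {
        for {
          left  <- translate(e.left)
          right <- translate(e.right)
        } yield sources.And(left, right)
      }
      ```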
      
      ## How was this patch tested?
      
      Added new unit test cases to JDBCSuite
      
      gatorsmile
      
      Author: Jia Li <jiali@us.ibm.com>
      
      Closes #19776 from jliwork/spark-22548.
      881c5c80
    • Kazuaki Ishizaki's avatar
      [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem with cast · ac10171b
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR changes `cast` code generation to place the generated code for each field of a struct into separate methods when the total size could be large.
      
      ## How was this patch tested?
      
      Added new test cases into `CastSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19730 from kiszk/SPARK-22500.
      ac10171b
    • Marco Gaido's avatar
      [SPARK-22475][SQL] show histogram in DESC COLUMN command · b96f61b6
      Marco Gaido authored
      ## What changes were proposed in this pull request?
      
      Added the histogram representation to the output of the `DESCRIBE EXTENDED table_name column_name` command.
      
      ## How was this patch tested?
      
      Modified SQL UT and checked output
      
      
      Author: Marco Gaido <mgaido@hortonworks.com>
      
      Closes #19774 from mgaido91/SPARK-22475.
      b96f61b6
    • hyukjinkwon's avatar
      [SPARK-22165][SQL] Fixes type conflicts between double, long, decimals, dates... · 6d7ebf2f
      hyukjinkwon authored
      [SPARK-22165][SQL] Fixes type conflicts between double, long, decimals, dates and timestamps in partition column
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to add a rule that re-uses `TypeCoercion.findWiderCommonType` when resolving type conflicts in partition values.
      
      Currently, this uses a numeric-precedence-like comparison; therefore, it appears to introduce failures for type conflicts between timestamps, dates and decimals. Please see:
      
      ```scala
      private val upCastingOrder: Seq[DataType] =
        Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
      ...
      literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
      ```
      
      The codes below:
      
      ```scala
      val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
      df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
      spark.read.load("/tmp/foo").printSchema()
      
      val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
      df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
      spark.read.load("/tmp/bar").printSchema()
      ```
      
      produces output as below:
      
      **Before**
      
      ```
      root
       |-- i: integer (nullable = true)
       |-- ts: date (nullable = true)
      
      root
       |-- i: integer (nullable = true)
       |-- decimal: integer (nullable = true)
      ```
      
      **After**
      
      ```
      root
       |-- i: integer (nullable = true)
       |-- ts: timestamp (nullable = true)
      
      root
       |-- i: integer (nullable = true)
       |-- decimal: decimal(30,0) (nullable = true)
      ```
      
      ### Type coercion table:
      
      This PR proposes the type conflict resolution below:
      
      **Before**
      
      |InputA \ InputB|`NullType`|`IntegerType`|`LongType`|`DecimalType(38,0)`|`DoubleType`|`DateType`|`TimestampType`|`StringType`|
      |------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
      |**`NullType`**|`StringType`|`IntegerType`|`LongType`|`StringType`|`DoubleType`|`StringType`|`StringType`|`StringType`|
      |**`IntegerType`**|`IntegerType`|`IntegerType`|`LongType`|`IntegerType`|`DoubleType`|`IntegerType`|`IntegerType`|`StringType`|
      |**`LongType`**|`LongType`|`LongType`|`LongType`|`LongType`|`DoubleType`|`LongType`|`LongType`|`StringType`|
      |**`DecimalType(38,0)`**|`StringType`|`IntegerType`|`LongType`|`DecimalType(38,0)`|`DoubleType`|`DecimalType(38,0)`|`DecimalType(38,0)`|`StringType`|
      |**`DoubleType`**|`DoubleType`|`DoubleType`|`DoubleType`|`DoubleType`|`DoubleType`|`DoubleType`|`DoubleType`|`StringType`|
      |**`DateType`**|`StringType`|`IntegerType`|`LongType`|`DateType`|`DoubleType`|`DateType`|`DateType`|`StringType`|
      |**`TimestampType`**|`StringType`|`IntegerType`|`LongType`|`TimestampType`|`DoubleType`|`TimestampType`|`TimestampType`|`StringType`|
      |**`StringType`**|`StringType`|`StringType`|`StringType`|`StringType`|`StringType`|`StringType`|`StringType`|`StringType`|
      
      **After**
      
      |InputA \ InputB|`NullType`|`IntegerType`|`LongType`|`DecimalType(38,0)`|`DoubleType`|`DateType`|`TimestampType`|`StringType`|
      |------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
      |**`NullType`**|`NullType`|`IntegerType`|`LongType`|`DecimalType(38,0)`|`DoubleType`|`DateType`|`TimestampType`|`StringType`|
      |**`IntegerType`**|`IntegerType`|`IntegerType`|`LongType`|`DecimalType(38,0)`|`DoubleType`|`StringType`|`StringType`|`StringType`|
      |**`LongType`**|`LongType`|`LongType`|`LongType`|`DecimalType(38,0)`|`StringType`|`StringType`|`StringType`|`StringType`|
      |**`DecimalType(38,0)`**|`DecimalType(38,0)`|`DecimalType(38,0)`|`DecimalType(38,0)`|`DecimalType(38,0)`|`StringType`|`StringType`|`StringType`|`StringType`|
      |**`DoubleType`**|`DoubleType`|`DoubleType`|`StringType`|`StringType`|`DoubleType`|`StringType`|`StringType`|`StringType`|
      |**`DateType`**|`DateType`|`StringType`|`StringType`|`StringType`|`StringType`|`DateType`|`TimestampType`|`StringType`|
      |**`TimestampType`**|`TimestampType`|`StringType`|`StringType`|`StringType`|`StringType`|`TimestampType`|`TimestampType`|`StringType`|
      |**`StringType`**|`StringType`|`StringType`|`StringType`|`StringType`|`StringType`|`StringType`|`StringType`|`StringType`|
      
      This was produced by:
      
      ```scala
        test("Print out chart") {
          val supportedTypes: Seq[DataType] = Seq(
            NullType, IntegerType, LongType, DecimalType(38, 0), DoubleType,
            DateType, TimestampType, StringType)
      
          // Old type conflict resolution:
          val upCastingOrder: Seq[DataType] =
            Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
          def oldResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = {
            val topType = dataTypes.maxBy(upCastingOrder.indexOf(_))
            if (topType == NullType) StringType else topType
          }
          println(s"|InputA \\ InputB|${supportedTypes.map(dt => s"`${dt.toString}`").mkString("|")}|")
          println(s"|------------------------|${supportedTypes.map(_ => "----------").mkString("|")}|")
          supportedTypes.foreach { inputA =>
            val types = supportedTypes.map(inputB => oldResolveTypeConflicts(Seq(inputA, inputB)))
            println(s"|**`$inputA`**|${types.map(dt => s"`${dt.toString}`").mkString("|")}|")
          }
      
          // New type conflict resolution:
          def newResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = {
            dataTypes.fold[DataType](NullType)(findWiderTypeForPartitionColumn)
          }
          println(s"|InputA \\ InputB|${supportedTypes.map(dt => s"`${dt.toString}`").mkString("|")}|")
          println(s"|------------------------|${supportedTypes.map(_ => "----------").mkString("|")}|")
          supportedTypes.foreach { inputA =>
            val types = supportedTypes.map(inputB => newResolveTypeConflicts(Seq(inputA, inputB)))
            println(s"|**`$inputA`**|${types.map(dt => s"`${dt.toString}`").mkString("|")}|")
          }
        }
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `ParquetPartitionDiscoverySuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19389 from HyukjinKwon/partition-type-coercion.
      6d7ebf2f
    • WeichenXu's avatar
      [SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API · 2d868d93
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add a Python API for VectorIndexerModel's support for handling unseen categories via handleInvalid (see the usage sketch below).
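
      A usage sketch in Scala for context (the Python parameter added here mirrors this behavior; treat the exact option values as illustrative):

      ```scala
      import org.apache.spark.ml.feature.VectorIndexer

      // With handleInvalid = "keep", categories unseen during fit are mapped to an extra
      // bucket at transform time instead of raising an error.
      val indexer = new VectorIndexer()
        .setInputCol("features")
        .setOutputCol("indexedFeatures")
        .setMaxCategories(10)
        .setHandleInvalid("keep")  // assumed options: "error" (default), "skip", "keep"
      ```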
      
      ## How was this patch tested?
      
      doctest added.
      
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19753 from WeichenXu123/vector_indexer_invalid_py.
      2d868d93
    • Prashant Sharma's avatar
      [MINOR][DOC] The left navigation bar should be fixed with respect to scrolling. · 5855b5c0
      Prashant Sharma authored
      ## What changes were proposed in this pull request?
      A minor CSS style change to keep the left navigation bar fixed while scrolling; it improves the usability of the docs.
      
      ## How was this patch tested?
      It was tested on both Firefox and Chrome.
      ### Before
      ![a2](https://user-images.githubusercontent.com/992952/33004206-6acf9fc0-cde5-11e7-9070-02f26f7899b0.gif)
      
      ### After
      ![a1](https://user-images.githubusercontent.com/992952/33004205-69b27798-cde5-11e7-8002-509b29786b37.gif)
      
      Author: Prashant Sharma <prashsh1@in.ibm.com>
      
      Closes #19785 from ScrapCodes/doc/css.
      5855b5c0
    • gatorsmile's avatar
      [SPARK-22569][SQL] Clean usage of addMutableState and splitExpressions · 96e947ed
      gatorsmile authored
      ## What changes were proposed in this pull request?
      This PR cleans up the usage of `addMutableState` and `splitExpressions`:

      1. replace hardcoded type strings with `ctx.JAVA_BOOLEAN`, etc.
      2. provide a default value for the initCode in `ctx.addMutableState`
      3. use named arguments when calling `splitExpressions`
      
      ## How was this patch tested?
      The existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19790 from gatorsmile/codeClean.
      96e947ed
    • Kazuaki Ishizaki's avatar
      [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem with elt · 9bdff0bc
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR changes `elt` code generation to place the generated code for each argument expression into separate methods when the total size could be large.
      This resolves the case of `elt` with many arguments.
      
      ## How was this patch tested?
      
      Added new test cases into `StringExpressionsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19778 from kiszk/SPARK-22550.
      9bdff0bc
    • Kazuaki Ishizaki's avatar
      [SPARK-22508][SQL] Fix 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create() · c9577148
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR changes `GenerateUnsafeRowJoiner.create()` code generation to place the generated statements that operate on the bitmap and offsets into separate methods when the total size could be large.
      
      ## How was this patch tested?
      
      Added a new test case into `GenerateUnsafeRowJoinerSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19737 from kiszk/SPARK-22508.
      c9577148
    • Liang-Chi Hsieh's avatar
      [SPARK-22541][SQL] Explicitly claim that Python udfs can't be conditionally... · 9d45e675
      Liang-Chi Hsieh authored
      [SPARK-22541][SQL] Explicitly claim that Python UDFs can't be conditionally executed with short-circuit evaluation
      
      ## What changes were proposed in this pull request?
      
      Besides conditional expressions such as `when` and `if`, users may want to conditionally execute Python UDFs via short-circuit evaluation. We should explicitly note that Python UDFs don't support this kind of conditional execution either.
      
      ## How was this patch tested?
      
      N/A, just document change.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19787 from viirya/SPARK-22541.
      9d45e675
  4. Nov 20, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem with concat_ws · 41c6f360
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR changes `concat_ws` code generation to place the generated code for each argument expression into separate methods when the total size could be large.
      This resolves the case of `concat_ws` with many arguments.
      
      ## How was this patch tested?
      
      Added new test cases into `StringExpressionsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19777 from kiszk/SPARK-22549.
      41c6f360
    • Marcelo Vanzin's avatar
      [SPARK-22533][CORE] Handle deprecated names in ConfigEntry. · c13b60e0
      Marcelo Vanzin authored
      This change hooks up the config reader to `SparkConf.getDeprecatedConfig`,
      so that config constants with deprecated names generate the proper warnings.
      It also changes two deprecated configs from the new "alternatives" system to
      the old deprecation system, since they're not yet hooked up to each other.
      
      Added a few unit tests to verify the desired behavior.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #19760 from vanzin/SPARK-22533.
      c13b60e0
    • Kazuaki Ishizaki's avatar
      [SPARK-20101][SQL] Use OffHeapColumnVector when... · 3c3eebc8
      Kazuaki Ishizaki authored
      [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.sql.columnVector.offheap.enable" is set to "true"
      
      This PR enables the use of `OffHeapColumnVector` when `spark.sql.columnVector.offheap.enable` is set to `true`. While `ColumnVector` has two implementations, `OnHeapColumnVector` and `OffHeapColumnVector`, only `OnHeapColumnVector` was used.

      This PR implements the following:
      - Pass `OffHeapColumnVector` to `ColumnarBatch.allocate()` when `spark.sql.columnVector.offheap.enable` is set to `true`
      - Free all off-heap memory regions via `OffHeapColumnVector.close()`
      - Ensure that `OffHeapColumnVector.close()` is called
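
      A usage sketch (assumed configuration shape; note that the follow-up commit above corrects the key's spelling to `spark.sql.columnVector.offheap.enabled`):

      ```scala
      import org.apache.spark.sql.SparkSession

      // Turn on the off-heap column vectors for the vectorized readers.
      val spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.columnVector.offheap.enabled", "true")
        .getOrCreate()
      ```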
      
      Use existing tests
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #17436 from kiszk/SPARK-20101.
      3c3eebc8
  5. Nov 19, 2017
  6. Nov 18, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat · d54bfec2
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR changes `concat` code generation to place the generated code for each argument expression into separate methods when the total size could be large.
      This resolves the case of `concat` with many arguments.
      
      ## How was this patch tested?
      
      Added new test cases into `StringExpressionsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19728 from kiszk/SPARK-22498.
      d54bfec2
  7. Nov 17, 2017
    • Shixiong Zhu's avatar
      [SPARK-22544][SS] FileStreamSource should use its own hadoop conf to call globPathIfNecessary · bf0c0ae2
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Pass the FileSystem created using the correct Hadoop conf into `globPathIfNecessary` so that it can pick up the user's Hadoop configuration, such as credentials.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <zsxwing@gmail.com>
      
      Closes #19771 from zsxwing/fix-file-stream-conf.
      bf0c0ae2
    • Liang-Chi Hsieh's avatar
      [SPARK-22538][ML] SQLTransformer should not unpersist possibly cached input dataset · fccb337f
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `SQLTransformer.transform` unpersists the input dataset when dropping the temporary view. We should not change the input dataset's cache status.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19772 from viirya/SPARK-22538.
      fccb337f
    • Li Jin's avatar
      [SPARK-22409] Introduce function type argument in pandas_udf · 7d039e0c
      Li Jin authored
      ## What changes were proposed in this pull request?
      
      * Add a "function type" argument to pandas_udf.
      * Add a new public enum class `PandasUDFType` in pyspark.sql.functions
      * Refactor udf related code from pyspark.sql.functions to pyspark.sql.udf
      * Merge "PythonUdfType" and "PythonEvalType" into a single enum class "PythonEvalType"
      
      Example:
      ```
      from pyspark.sql.functions import pandas_udf, PandasUDFType
      
      @pandas_udf('double', PandasUDFType.SCALAR)
      def plus_one(v):
          return v + 1
      ```
      
      ## Design doc
      https://docs.google.com/document/d/1KlLaa-xJ3oz28xlEJqXyCAHU3dwFYkFs_ixcUXrJNTc/edit
      
      ## How was this patch tested?
      
      Added PandasUDFTests
      
      ## TODO:
      * [x] Implement proper enum type for `PandasUDFType`
      * [x] Update documentation
      * [x] Add more tests in PandasUDFTests
      
      Author: Li Jin <ice.xelloss@gmail.com>
      
      Closes #19630 from icexelloss/spark-22409-pandas-udf-type.
      7d039e0c
    • yucai's avatar
      [SPARK-22540][SQL] Ensure HighlyCompressedMapStatus calculates correct avgSize · d00b55d4
      yucai authored
      ## What changes were proposed in this pull request?
      
      Ensure HighlyCompressedMapStatus calculates correct avgSize
      
      ## How was this patch tested?
      
      New unit test added.
      
      Author: yucai <yucai.yu@intel.com>
      
      Closes #19765 from yucai/avgsize.
      d00b55d4
  8. Nov 16, 2017
    • Wenchen Fan's avatar
      [SPARK-22542][SQL] remove unused features in ColumnarBatch · b9dcbe5e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `ColumnarBatch` provides features to do fast filter and project in a columnar fashion; however, these features are never used by Spark, as Spark uses whole-stage codegen and processes the data in a row fashion. This PR proposes to remove these unused features, as we won't switch to columnar execution in the near future. Even if we do, I think this part needs a proper redesign.

      This is also a step toward making `ColumnVector` public, as we don't want to expose these features to users.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19766 from cloud-fan/vector.
      b9dcbe5e
    • Kazuaki Ishizaki's avatar
      [SPARK-22501][SQL] Fix 64KB JVM bytecode limit problem with in · 7f2e62ee
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR changes `In` code generation to place the generated code for the argument expressions into separate methods when the total size could be large.
      
      ## How was this patch tested?
      
      Added new test cases into `PredicateSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19733 from kiszk/SPARK-22501.
      7f2e62ee
    • Marco Gaido's avatar
      [SPARK-22494][SQL] Fix 64KB limit exception with Coalesce and AtleastNNonNulls · 4e7f07e2
      Marco Gaido authored
      ## What changes were proposed in this pull request?
      
      Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions.
      This PR splits their expressions in order to avoid the issue.
      
      ## How was this patch tested?
      
      Added UTs
      
      Author: Marco Gaido <marcogaido91@gmail.com>
      Author: Marco Gaido <mgaido@hortonworks.com>
      
      Closes #19720 from mgaido91/SPARK-22494.
      4e7f07e2
    • Kazuaki Ishizaki's avatar
      [SPARK-22499][SQL] Fix 64KB JVM bytecode limit problem with least and greatest · ed885e7a
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR changes `least` and `greatest` code generation to place the generated code for each argument expression into separate methods when the total size could be large.
      This PR resolves two cases:

      * `least` with many arguments
      * `greatest` with many arguments
      
      ## How was this patch tested?
      
      Added a new test case into `ArithmeticExpressionsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19729 from kiszk/SPARK-22499.
      ed885e7a
  9. Nov 15, 2017
    • Shixiong Zhu's avatar
      [SPARK-22535][PYSPARK] Sleep before killing the python worker in PythonRunner.MonitorThread · 03f2b7bf
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `PythonRunner.MonitorThread` should give the task a little time to finish before forcibly killing the Python worker. This greatly reduces the chance of hitting the race condition. I also improved the logging a bit to make it easier to identify the task to blame when it's stuck.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <zsxwing@gmail.com>
      
      Closes #19762 from zsxwing/SPARK-22535.
      03f2b7bf
    • ArtRand's avatar
      [SPARK-21842][MESOS] Support Kerberos ticket renewal and creation in Mesos · 1e823354
      ArtRand authored
      ## What changes were proposed in this pull request?
      tl;dr: Add a class, `MesosHadoopDelegationTokenManager`, that updates delegation tokens on a schedule on behalf of Spark drivers, and broadcasts renewed credentials to the executors.
      
      ## The problem
      We recently added Kerberos support to Mesos-based Spark jobs as well as Secrets support to the Mesos Dispatcher (SPARK-16742 and SPARK-20812, respectively). However, the delegation tokens have a defined expiration. This poses a problem for long-running Spark jobs (e.g. Spark Streaming applications). YARN has a solution for this where a thread is scheduled to renew the tokens when they reach 75% of their lifetime. It then writes the tokens to HDFS for the executors to find (using a monotonically increasing suffix).
      
      ## This solution
      We replace the current method in `CoarseGrainedSchedulerBackend`, which used to discard the token renewal time, with a protected method `fetchHadoopDelegationTokens`. The individual cluster backends are now responsible for overriding this method to fetch and manage token renewal. The delegation tokens themselves are still part of `CoarseGrainedSchedulerBackend` as before.
      In the case of Mesos, renewed credentials are broadcast to the executors. This keeps all transfer of credentials within Spark (as opposed to Spark-to-HDFS), requires no writing of credentials to disk, and requires no GC of old files.
      
      ## How was this patch tested?
      Manually against a Kerberized HDFS cluster.
      
      Thank you for the reviews.
      
      Author: ArtRand <arand@soe.ucsc.edu>
      
      Closes #19272 from ArtRand/spark-21842-450-kerberos-ticket-renewal.
      1e823354
    • osatici's avatar
      [SPARK-22479][SQL] Exclude credentials from SaveIntoDataSourceCommand.simpleString · 2014e7a7
      osatici authored
      ## What changes were proposed in this pull request?
      
      Do not include JDBC properties that may contain credentials when logging a logical plan containing `SaveIntoDataSourceCommand`.
      
      ## How was this patch tested?
      
      building locally and trying to reproduce (per the steps in https://issues.apache.org/jira/browse/SPARK-22479):
      ```
      == Parsed Logical Plan ==
      SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *********(redacted), password -> *********(redacted)), ErrorIfExists
         +- Range (0, 100, step=1, splits=Some(8))
      
      == Analyzed Logical Plan ==
      SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *********(redacted), password -> *********(redacted)), ErrorIfExists
         +- Range (0, 100, step=1, splits=Some(8))
      
      == Optimized Logical Plan ==
      SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *********(redacted), password -> *********(redacted)), ErrorIfExists
         +- Range (0, 100, step=1, splits=Some(8))
      
      == Physical Plan ==
      Execute SaveIntoDataSourceCommand
         +- SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *********(redacted), password -> *********(redacted)), ErrorIfExists
               +- Range (0, 100, step=1, splits=Some(8))
      ```
      
      Author: osatici <osatici@palantir.com>
      
      Closes #19708 from onursatici/os/redact-jdbc-creds.
      2014e7a7
    • Marcelo Vanzin's avatar
      [SPARK-20649][CORE] Simplify REST API resource structure. · 39b3f10d
      Marcelo Vanzin authored
      With the new UI store, the API resource classes have a lot less code,
      since there's no need for complicated translations between the UI
      types and the API types. So the code ended up with a bunch of files
      with a single method declared in them.
      
      This change re-structures the API code so that it uses less classes;
      mainly, most sub-resources were removed, and the code to deal with
      single-attempt and multi-attempt apps was simplified.
      
      The only change was the addition of a method to return a single
      attempt's information; that was missing in the old API, so trying
      to retrieve "/v1/applications/appId/attemptId" would result in a
      404 even if the attempt existed (and URIs under that one would
      return valid data).
      
      The streaming API resources also received the same treatment, even
      though their data is not stored in the new UI store.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #19748 from vanzin/SPARK-20649.
      39b3f10d
    • liutang123's avatar
      [SPARK-22469][SQL] Accuracy problem in comparison with string and numeric · bc0848b4
      liutang123 authored
      ## What changes were proposed in this pull request?
      This fixes a problem caused by #15880
      `select '1.5' > 0.5;` returns NULL in Spark but true in Hive.
      When comparing a string with a numeric value, cast both to double, as Hive does.
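
      A quick check of the intended behavior (a sketch assuming a local session):

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      // With both operands cast to double, the comparison matches Hive:
      spark.sql("select '1.5' > 0.5").show()  // expected: true, not NULL
      ```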
      
      Author: liutang123 <liutang123@yeah.net>
      
      Closes #19692 from liutang123/SPARK-22469.
      bc0848b4
    • Dongjoon Hyun's avatar
      [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder · aa88b8db
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In the PySpark API documentation, [SparkSession.builder](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html) is not documented and only shows the default value description.
      ```
      SparkSession.builder = <pyspark.sql.session.Builder object ...
      ```
      
      This PR adds the doc.
      
      ![screen](https://user-images.githubusercontent.com/9700541/32705514-1bdcafaa-c7ca-11e7-88bf-05566fea42de.png)
      
      The following is the diff of the generated result.
      
      ```
      $ diff old.html new.html
      95a96,101
      > <dl class="attribute">
      > <dt id="pyspark.sql.SparkSession.builder">
      > <code class="descname">builder</code><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt>
      > <dd><p>A class attribute having a <a class="reference internal" href="#pyspark.sql.SparkSession.Builder" title="pyspark.sql.SparkSession.Builder"><code class="xref py py-class docutils literal"><span class="pre">Builder</span></code></a> to construct <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a> instances</p>
      > </dd></dl>
      >
      212,216d217
      < <dt id="pyspark.sql.SparkSession.builder">
      < <code class="descname">builder</code><em class="property"> = &lt;pyspark.sql.session.SparkSession.Builder object&gt;</em><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt>
      < <dd></dd></dl>
      <
      < <dl class="attribute">
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      ```
      cd python/docs
      make html
      open _build/html/pyspark.sql.html
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19726 from dongjoon-hyun/SPARK-22490.
      aa88b8db
    • test's avatar
      [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics · 7f99a05e
      test authored
      ## What changes were proposed in this pull request?
      
      I added adjusted R2 as a regression metric, which is implemented in all major statistical analysis tools.

      In practice, no one looks at R2 alone, because R2 by itself is misleading: adding more parameters never decreases R2, it can only increase (or stay the same), which encourages overfitting. Adjusted R2 addresses this issue by using the number of parameters as a "weight" for the sum of errors.
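
      For reference, the standard adjusted R2 formula this metric follows (a sketch, not necessarily the exact code added to `RegressionMetrics`), where `n` is the number of observations and `p` the number of predictors:

      ```scala
      // Adjusted R2 falls unless an added parameter reduces error enough to offset
      // the (n - 1) / (n - p - 1) penalty.
      def adjustedR2(r2: Double, n: Long, p: Int): Double =
        1.0 - (1.0 - r2) * (n - 1).toDouble / (n - p - 1)
      ```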
      
      ## How was this patch tested?
      
      - Added a new unit test and passed.
      - ./dev/run-tests all passed.
      
      Author: test <joseph.peng@quetica.com>
      Author: tengpeng <tengpeng@users.noreply.github.com>
      
      Closes #19638 from tengpeng/master.
      7f99a05e
    • Bryan Cutler's avatar
      [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in createDataFrame with Arrow · 8f0e88df
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      If the schema is passed as a list of unicode strings for column names, they should be re-encoded to 'utf-8' to be consistent. This is similar to #13097, but for DataFrame creation using Arrow.
      
      ## How was this patch tested?
      
      Added new test of using unicode names for schema.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19738 from BryanCutler/arrow-createDataFrame-followup-unicode-SPARK-20791.
      8f0e88df