Skip to content
Snippets Groups Projects
  1. Nov 11, 2016
    • Dongjoon Hyun's avatar
      [SPARK-17982][SQL] SQLBuilder should wrap the generated SQL with parenthesis for LIMIT · d42bb7cc
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `SQLBuilder` handles `LIMIT` by always adding `LIMIT` at the end of the generated subSQL. It makes `RuntimeException`s like the following. This PR adds a parenthesis always except `SubqueryAlias` is used together with `LIMIT`.
      
      **Before**
      
      ``` scala
      scala> sql("CREATE TABLE tbl(id INT)")
      scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
      java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
      ```
      
      **After**
      
      ``` scala
      scala> sql("CREATE TABLE tbl(id INT)")
      scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
      scala> sql("SELECT id2 FROM v1")
      res4: org.apache.spark.sql.DataFrame = [id2: int]
      ```
      
      **Fixed cases in this PR**
      
      The following two cases are the detail query plans having problematic SQL generations.
      
      1. `SELECT * FROM (SELECT id FROM tbl LIMIT 2)`
      
          Please note that **FROM SELECT** part of the generated SQL in the below. When we don't use '()' for limit, this fails.
      
      ```scala
      # Original logical plan:
      Project [id#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- Project [id#1]
               +- MetastoreRelation default, tbl
      
      # Canonicalized logical plan:
      Project [gen_attr_0#1 AS id#4]
      +- SubqueryAlias tbl
         +- Project [gen_attr_0#1]
            +- GlobalLimit 2
               +- LocalLimit 2
                  +- Project [gen_attr_0#1]
                     +- SubqueryAlias gen_subquery_0
                        +- Project [id#1 AS gen_attr_0#1]
                           +- SQLTable default, tbl, [id#1]
      
      # Generated SQL:
      SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM SELECT `gen_attr_0` FROM (SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2) AS tbl
      ```
      
      2. `SELECT * FROM (SELECT id FROM tbl TABLESAMPLE (2 ROWS))`
      
          Please note that **((~~~) AS gen_subquery_0 LIMIT 2)** in the below. When we use '()' for limit on `SubqueryAlias`, this fails.
      
      ```scala
      # Original logical plan:
      Project [id#1]
      +- Project [id#1]
         +- GlobalLimit 2
            +- LocalLimit 2
               +- MetastoreRelation default, tbl
      
      # Canonicalized logical plan:
      Project [gen_attr_0#1 AS id#4]
      +- SubqueryAlias tbl
         +- Project [gen_attr_0#1]
            +- GlobalLimit 2
               +- LocalLimit 2
                  +- SubqueryAlias gen_subquery_0
                     +- Project [id#1 AS gen_attr_0#1]
                        +- SQLTable default, tbl, [id#1]
      
      # Generated SQL:
      SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM ((SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2)) AS tbl
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a newly added test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15546 from dongjoon-hyun/SPARK-17982.
      d42bb7cc
  2. Jul 27, 2016
    • Dongjoon Hyun's avatar
      [SPARK-16621][SQL] Generate stable SQLs in SQLBuilder · 5b8e848b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, the generated SQLs have not-stable IDs for generated attributes.
      The stable generated SQL will give more benefit for understanding or testing the queries.
      This PR provides stable SQL generation by the followings.
      
       - Provide unique ids for generated subqueries, `gen_subquery_xxx`.
       - Provide unique and stable ids for generated attributes, `gen_attr_xxx`.
      
      **Before**
      ```scala
      scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
      res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
      scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
      res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0
      ```
      
      **After**
      ```scala
      scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
      res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
      scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
      res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
      ```
      
      ## How was this patch tested?
      
      Pass the existing Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14257 from dongjoon-hyun/SPARK-16621.
      5b8e848b
  3. Jul 18, 2016
    • Dongjoon Hyun's avatar
      [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to check generated SQL directly · ea78edb8
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR improves `LogicalPlanToSQLSuite` to check the generated SQL directly by **structure**. So far, `LogicalPlanToSQLSuite` relies on  `checkHiveQl` to ensure the **successful SQL generation** and **answer equality**. However, it does not guarantee the generated SQL is the same or will not be changed unnoticeably.
      
      ## How was this patch tested?
      
      Pass the Jenkins. This is only a testsuite change.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14235 from dongjoon-hyun/SPARK-16590.
      ea78edb8
Loading