[SPARK-2096][SQL] Correctly parse dot notations
First, let me write down the current `projections` grammar of Spark SQL:

    expression : orExpression
    orExpression : andExpression {"or" andExpression}
    andExpression : comparisonExpression {"and" comparisonExpression}
    comparisonExpression : termExpression | termExpression "=" termExpression | termExpression ">" termExpression | ...
    termExpression : productExpression {"+"|"-" productExpression}
    productExpression : baseExpression {"*"|"/"|"%" baseExpression}
    baseExpression : expression "[" expression "]" | ... | ident | ...
    ident : identChar {identChar | digit} | delimiters | ...
    identChar : letter | "_" | "."
    delimiters : "," | ";" | "(" | ")" | "[" | "]" | ...
    projection : expression [["AS"] ident]
    projections : projection { "," projection}

For something like `a.b.c[1]`, it will be parsed as:

<img src="http://img51.imgspice.com/i/03008/4iltjsnqgmtt_t.jpg" border=0>

But for something like `a[1].b`, the current grammar can't parse it correctly, because "." is treated as an identifier character rather than as a delimiter.

A simple solution is written in `ParquetQuerySuite#NestedSqlParser`. The changed grammar rules are:

    delimiters : "." | "," | ";" | "(" | ")" | "[" | "]" | ...
    identChar : letter | "_"
    baseExpression : expression "[" expression "]" | expression "." ident | ... | ident | ...

This works well, but it can't cover some corner cases like `select t.a.b from table as t`:

<img src="http://img51.imgspice.com/i/03008/v2iau3hoxoxg_t.jpg" border=0>

With this new grammar, `t.a.b` is parsed as `GetField(GetField(UnResolved("t"), "a"), "b")` instead of `GetField(UnResolved("t.a"), "b")`. However, we can't resolve `t`, as it is not a field but the whole table. (If we could do this, then `select t from table as t` would be legal, which is unexpected.)

My solution is:

    dotExpressionHeader : ident "." ident
    baseExpression : expression "[" expression "]" | expression "." ident | ... | dotExpressionHeader | ident | ...

I passed all test cases under sql locally and added a more complex case.

"arrayOfStruct.field1 to access all values of field1" is not supported yet. Since this PR has changed a lot of code, I will open another PR for it.

I'm not familiar with the later optimization phases, so please correct me if I missed something.

Author: Wenchen Fan <cloud0fan@163.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #2230 from cloud-fan/dot and squashes the following commits:

e1a8898 [Wenchen Fan] remove support for arbitrary nested arrays
ee8a724 [Wenchen Fan] rollback LogicalPlan, support dot operation on nested array type
a58df40 [Michael Armbrust] add regression test for doubly nested data
16bc4c6 [Wenchen Fan] some enhance
95d733f [Wenchen Fan] split long line
dc31698 [Wenchen Fan] SPARK-2096 Correctly parse dot notations
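
The effect of the `dotExpressionHeader` rule from the commit message can be illustrated with a small, self-contained sketch. It is not the actual `SqlParser` code: the case classes `UnresolvedAttribute`, `GetField`, and `GetItem` are simplified, hypothetical stand-ins for the Catalyst nodes, the tokenizer is a toy, and only identifiers, `.` access, and integer `[n]` indexing are handled. The point is only to show that the leading `ident "." ident` stays together as one unresolved name, while every later `.` or `[...]` wraps the expression built so far.

```scala
object DotNotationSketch {
  // Hypothetical, simplified stand-ins for Catalyst expression nodes.
  sealed trait Expr
  case class UnresolvedAttribute(name: String) extends Expr
  case class GetField(child: Expr, field: String) extends Expr
  case class GetItem(child: Expr, ordinal: Int) extends Expr

  // Toy lexer: identifiers, integer literals, and the delimiters ".", "[", "]".
  // "." is a delimiter here, not an identifier character.
  private val Token = """[A-Za-z_][A-Za-z_0-9]*|[0-9]+|[.\[\]]""".r

  def tokenize(input: String): List[String] = Token.findAllIn(input).toList

  private def isIdent(tok: String): Boolean =
    tok.head.isLetter || tok.head == '_'

  def parse(input: String): Expr = {
    val (head, rest) = tokenize(input) match {
      // dotExpressionHeader: the leading `ident "." ident` becomes ONE name,
      // so a table alias such as `t` is never resolved on its own.
      case a :: "." :: b :: tail if isIdent(a) && isIdent(b) =>
        (UnresolvedAttribute(s"$a.$b"), tail)
      case a :: tail if isIdent(a) =>
        (UnresolvedAttribute(a), tail)
      case other =>
        throw new IllegalArgumentException(s"expected identifier, got: $other")
    }
    suffixes(head, rest)
  }

  // Every remaining `.field` or `[n]` wraps the expression parsed so far.
  @annotation.tailrec
  private def suffixes(expr: Expr, tokens: List[String]): Expr = tokens match {
    case Nil => expr
    case "." :: f :: tail if isIdent(f) =>
      suffixes(GetField(expr, f), tail)
    case "[" :: n :: "]" :: tail if n.forall(_.isDigit) =>
      suffixes(GetItem(expr, n.toInt), tail)
    case other =>
      throw new IllegalArgumentException(s"unexpected tokens: $other")
  }

  def main(args: Array[String]): Unit = {
    println(parse("t.a.b"))    // GetField(UnresolvedAttribute(t.a),b)
    println(parse("a[1].b"))   // GetField(GetItem(UnresolvedAttribute(a),1),b)
    println(parse("a.b.c[1]")) // GetItem(GetField(UnresolvedAttribute(a.b),c),1)
  }
}
```

In this sketch `t.a.b` yields `GetField(UnresolvedAttribute("t.a"), "b")`, the shape the commit message asks for, and `a[1].b` is accepted because the header rule only applies when the first identifier is immediately followed by `.` and another identifier.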
Changed files:
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala: 11 additions, 2 deletions
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala: 1 addition, 5 deletions
- sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala: 14 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala: 26 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetQuerySuite.scala: 24 additions, 78 deletions
- sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: 12 additions, 5 deletions