Commit c6610a99 authored by Takeshi Yamamuro, committed by gatorsmile

[SPARK-22122][SQL] Use analyzed logical plans to count input rows in TPCDSQueryBenchmark

## What changes were proposed in this pull request?
The current code ignores WITH clauses when collecting the input relations of TPCDS queries, which leads to inaccurate per-row processing times in the benchmark results. For example, with this fix all the input relations of `q2` are caught: `web_sales`, `date_dim`, and `catalog_sales` (the current code catches `date_dim` only). About one-third of the TPCDS queries use WITH clauses, so I think it is worth fixing this.
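
For illustration, the difference is easy to reproduce. Below is a minimal sketch (not part of the patch; it assumes a local `SparkSession`, a throwaway temp view standing in for a TPCDS table, and the Spark 2.x plan APIs this patch is written against — the pattern signatures differ in later versions). The parser keeps CTE definitions in the `With` node's `cteRelations` field rather than in its child tree, so traversing `queryExecution.logical` only sees the CTE alias, while the analyzed plan has the CTE substituted and exposes the underlying relation:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias

val spark = SparkSession.builder().master("local[1]").appName("cte-demo").getOrCreate()
spark.range(10).createOrReplaceTempView("web_sales")  // stand-in for a real TPCDS table

val qe = spark.sql(
  "WITH ws AS (SELECT * FROM web_sales) SELECT COUNT(*) FROM ws").queryExecution

// Parsed plan: `With` keeps its CTE definitions out of the child tree, so a
// tree traversal finds only the alias `ws`, never `web_sales`.
val parsed = qe.logical.collect { case UnresolvedRelation(t) => t.table }

// Analyzed plan: the analyzer inlines the CTE, so `web_sales` is visible.
// (The patch restricts the match to SubqueryAlias over LogicalRelation; it is
// relaxed here because the demo view is backed by Range, not a file source.)
val analyzed = qe.analyzed.collect { case SubqueryAlias(alias, _) => alias }

println(s"parsed: $parsed")      // List(ws)
println(s"analyzed: $analyzed")  // List(ws, web_sales)
```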

## How was this patch tested?
Manually checked.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #19344 from maropu/RespectWithInTPCDSBench.
parent 530fe683
@@ -20,11 +20,10 @@ package org.apache.spark.sql.execution.benchmark
 import org.apache.spark.SparkConf
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.SparkSession
-import org.apache.spark.sql.catalyst.TableIdentifier
-import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
-import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
-import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
+import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
 import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.execution.datasources.LogicalRelation
 import org.apache.spark.util.Benchmark
 
 /**
@@ -66,24 +65,15 @@ object TPCDSQueryBenchmark extends Logging {
         classLoader = Thread.currentThread().getContextClassLoader)
 
       // This is an indirect hack to estimate the size of each query's input by traversing the
-      // logical plan and adding up the sizes of all tables that appear in the plan. Note that this
-      // currently doesn't take WITH subqueries into account which might lead to fairly inaccurate
-      // per-row processing time for those cases.
+      // logical plan and adding up the sizes of all tables that appear in the plan.
       val queryRelations = scala.collection.mutable.HashSet[String]()
-      spark.sql(queryString).queryExecution.logical.map {
-        case UnresolvedRelation(t: TableIdentifier) =>
-          queryRelations.add(t.table)
-        case lp: LogicalPlan =>
-          lp.expressions.foreach { _ foreach {
-            case subquery: SubqueryExpression =>
-              subquery.plan.foreach {
-                case UnresolvedRelation(t: TableIdentifier) =>
-                  queryRelations.add(t.table)
-                case _ =>
-              }
-            case _ =>
-          }
-        }
-      }
+      spark.sql(queryString).queryExecution.analyzed.foreach {
+        case SubqueryAlias(alias, _: LogicalRelation) =>
+          queryRelations.add(alias)
+        case LogicalRelation(_, _, Some(catalogTable), _) =>
+          queryRelations.add(catalogTable.identifier.table)
+        case HiveTableRelation(tableMeta, _, _) =>
+          queryRelations.add(tableMeta.identifier.table)
+        case _ =>
+      }
       val numRows = queryRelations.map(tableSizes.getOrElse(_, 0L)).sum
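
For reference, the new traversal can be read as a standalone helper over an analyzed plan. A hedged sketch follows (not part of the commit; `inputRelationNames` is a hypothetical name, and `tableSizes` is the benchmark's existing table-name-to-row-count map), using the same three cases as the diff above:

```scala
import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SubqueryAlias}
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Collects the names of all base relations appearing in an *analyzed* plan.
def inputRelationNames(plan: LogicalPlan): Set[String] = {
  val names = scala.collection.mutable.HashSet[String]()
  plan.foreach {
    // A temp view registered over a data source relation keeps its alias.
    case SubqueryAlias(alias, _: LogicalRelation) =>
      names += alias
    // A data source table resolved through the catalog.
    case LogicalRelation(_, _, Some(catalogTable), _) =>
      names += catalogTable.identifier.table
    // A Hive-serde table.
    case HiveTableRelation(tableMeta, _, _) =>
      names += tableMeta.identifier.table
    case _ =>
  }
  names.toSet
}

// Usage, mirroring the benchmark's per-query row estimate:
//   val numRows = inputRelationNames(qe.analyzed).map(tableSizes.getOrElse(_, 0L)).sum
```

Matching on the analyzed plan rather than the parsed one works because CTE substitution has already happened by that point, so relations referenced only inside WITH clauses appear as ordinary leaf relations.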