[SPARK-23445] ColumnStat refactoring
## What changes were proposed in this pull request?

Refactor ColumnStat to be more flexible.

* Split `ColumnStat` and `CatalogColumnStat` just like `CatalogStatistics` is split from `Statistics`. This detaches how the statistics are stored from how they are processed in the query plan. `CatalogColumnStat` keeps `min` and `max` as `String`, making it not depend on dataType information.
* For `CatalogColumnStat`, parse column names from property names in the metastore (`KEY_VERSION` property), not from metastore schema. This means that `CatalogColumnStat`s can be created for columns even if the schema itself is not stored in the metastore.
* Make all fields optional. `min`, `max` and `histogram` for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult / impossible to calculate.

The added flexibility will make it possible to have alternative implementations for stats, and separates stats collection from stats and estimation processing in plans.

## How was this patch tested?

Refactored existing tests to work with refactored `ColumnStat` and `CatalogColumnStat`. New tests added in `StatisticsSuite` checking that backwards / forwards compatibility is not broken.

Author: Juliusz Sompolski <julek@databricks.com>

Closes #20624 from juliuszsompolski/SPARK-23445.
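The split between storage-side and plan-side column statistics described above can be illustrated with a minimal, self-contained sketch. It does not use Spark's classes: every name ending in `Sketch`, the `SketchType` hierarchy, and the `<column>.<field>` property keys are assumptions made for illustration, and the real property key layout may differ.

```scala
// Hypothetical stand-in for a column data type (not Spark's DataType).
sealed trait SketchType
case object IntType extends SketchType
case object StrType extends SketchType

// Plan-side stats: min/max are typed values usable during estimation.
final case class ColumnStatSketch(
    distinctCount: Option[BigInt] = None,
    min: Option[Any] = None,
    max: Option[Any] = None,
    nullCount: Option[BigInt] = None)

// Storage-side stats: every field is optional, and min/max stay as strings,
// so no data type is needed to read or write them.
final case class CatalogColumnStatSketch(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,
    max: Option[String] = None,
    nullCount: Option[BigInt] = None) {

  // Flatten to string properties; absent fields are skipped.
  def toMap(colName: String): Map[String, String] = {
    val fields = Seq(
      "distinctCount" -> distinctCount.map(_.toString),
      "min" -> min,
      "max" -> max,
      "nullCount" -> nullCount.map(_.toString))
    Map(s"$colName.version" -> "1") ++
      fields.collect { case (key, Some(value)) => s"$colName.$key" -> value }
  }

  // The data type is only required when converting for use in a plan.
  def toPlanStat(dataType: SketchType): ColumnStatSketch = {
    def parse(s: String): Any = dataType match {
      case IntType => s.toInt
      case StrType => s
    }
    ColumnStatSketch(distinctCount, min.map(parse), max.map(parse), nullCount)
  }
}

object CatalogColumnStatSketch {
  // Column names come from the "<column>.version" keys, not from a schema.
  def columnNames(props: Map[String, String]): Set[String] =
    props.keys.filter(_.endsWith(".version")).map(_.stripSuffix(".version")).toSet

  // Rebuild stats for one column; None if no version key is present.
  def fromMap(colName: String, props: Map[String, String]): Option[CatalogColumnStatSketch] =
    if (!props.contains(s"$colName.version")) None
    else Some(CatalogColumnStatSketch(
      distinctCount = props.get(s"$colName.distinctCount").map(BigInt(_)),
      min = props.get(s"$colName.min"),
      max = props.get(s"$colName.max"),
      nullCount = props.get(s"$colName.nullCount").map(BigInt(_))))
}
```

In this sketch, a round trip through `toMap` and `fromMap` needs no schema at all, while `toPlanStat` is the only step that asks for a data type. That mirrors the intent stated in the description: storage can evolve independently of how the optimizer consumes the statistics.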
Showing
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (143 additions, 3 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala (3 additions, 3 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala (24 additions, 232 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala (4 additions, 2 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala (15 additions, 5 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala (64 additions, 34 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (38 additions, 17 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala (8 additions, 17 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala (32 additions, 64 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinReorderSuite.scala (29 additions, 48 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/AggregateEstimationSuite.scala (12 additions, 12 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala (8 additions, 4 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala (145 additions, 134 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala (78 additions, 60 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/ProjectEstimationSuite.scala (40 additions, 30 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationTestBase.scala (8 additions, 2 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala (128 additions, 10 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (5 additions, 4 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala (6 additions, 3 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionTestBase.scala (143 additions, 25 deletions)