-
- Downloads
[SPARK-11117] [SPARK-11345] [SQL] Makes all HadoopFsRelation data sources produce UnsafeRow
This PR fixes two issues: 1. `PhysicalRDD.outputsUnsafeRows` is always `false` Thus a `ConvertToUnsafe` operator is often required even if the underlying data source relation does output `UnsafeRow`. 1. Internal/external row conversion for `HadoopFsRelation` is kinda messy Currently we're using `HadoopFsRelation.needConversion` and [dirty type erasure hacks][1] to indicate whether the relation outputs external row or internal row and apply external-to-internal conversion when necessary. Basically, all builtin `HadoopFsRelation` data sources, i.e. Parquet, JSON, ORC, and Text output `InternalRow`, while typical external `HadoopFsRelation` data sources, e.g. spark-avro and spark-csv, output `Row`. This PR adds a `private[sql]` interface method `HadoopFsRelation.buildInternalScan`, which by default invokes `HadoopFsRelation.buildScan` and converts `Row`s to `UnsafeRow`s (which are also `InternalRow`s). All builtin `HadoopFsRelation` data sources override this method and directly output `UnsafeRow`s. In this way, now `HadoopFsRelation` always produces `UnsafeRow`s. Thus `PhysicalRDD.outputsUnsafeRows` can be properly set by checking whether the underlying data source is a `HadoopFsRelation`. A remaining question is that, can we assume that all non-builtin `HadoopFsRelation` data sources output external rows? At least all well known ones do so. However it's possible that some users implemented their own `HadoopFsRelation` data sources that leverages `InternalRow` and thus all those unstable internal data representations. If this assumption is safe, we can deprecate `HadoopFsRelation.needConversion` and cleanup some more conversion code (like [here][2] and [here][3]). This PR supersedes #9125. Follow-ups: 1. Makes JSON and ORC data sources output `UnsafeRow` directly 1. Makes `HiveTableScan` output `UnsafeRow` directly This is related to 1 since ORC data source shares the same `Writable` unwrapping code with `HiveTableScan`. [1]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L353 [2]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L331-L335 [3]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L630-L669 Author: Cheng Lian <lian@databricks.com> Closes #9305 from liancheng/spark-11345.unsafe-hadoop-fs-relation.
Showing
- sql/core/src/main/scala/org/apache/spark/sql/columnar/GenerateColumnAccessor.scala 1 addition, 1 deletion...rg/apache/spark/sql/columnar/GenerateColumnAccessor.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala 5 additions, 3 deletions...in/scala/org/apache/spark/sql/execution/ExistingRDD.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala 19 additions, 11 deletions.../spark/sql/execution/datasources/DataSourceStrategy.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONRelation.scala 13 additions, 5 deletions...e/spark/sql/execution/datasources/json/JSONRelation.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRecordMaterializer.scala 1 addition, 1 deletion...tion/datasources/parquet/CatalystRecordMaterializer.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala 6 additions, 2 deletions.../execution/datasources/parquet/CatalystRowConverter.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala 3 additions, 3 deletions...k/sql/execution/datasources/parquet/ParquetRelation.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/DefaultSource.scala 22 additions, 12 deletions.../spark/sql/execution/datasources/text/DefaultSource.scala
- sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala 41 additions, 3 deletions.../main/scala/org/apache/spark/sql/sources/interfaces.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala 1 addition, 1 deletion...sql/execution/datasources/parquet/ParquetQuerySuite.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala 13 additions, 17 deletions...ain/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
- sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala 31 additions, 0 deletions...org/apache/spark/sql/sources/hadoopFsRelationSuites.scala
Loading
Please register or sign in to comment