[SPARK-14959][SQL] handle partitioned table directories in distributed filesystem
## What changes were proposed in this pull request?

##### The root cause:

When `DataSource.resolveRelation` builds a `ListingFileCatalog` object, it invokes `listLeafFiles`, which retrieves a list of `FileStatus` objects from the provided path. These `FileStatus` objects include the directories for the partitions (id=0 and id=2 in the JIRA). However, `getFileBlockLocations` is also invoked on these directory `FileStatus` objects, and `DistributedFileSystem` does not allow that call on a directory, hence the exception.

This PR removes the block of code that invokes `getFileBlockLocations` for every `FileStatus` object under the provided path. Instead, we call `HadoopFsRelation.listLeafFiles` directly, because this utility method filters out the directories before calling `getFileBlockLocations` to generate `LocatedFileStatus` objects.

## How was this patch tested?

The regression tests were run. Manual test:

```
scala> spark.read.format("parquet").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_part").show
+-----+---+
| text| id|
+-----+---+
|hello|  0|
|world|  0|
|hello|  1|
|there|  1|
+-----+---+

scala> spark.read.format("orc").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_orc").show
+-----+---+
| text| id|
+-----+---+
|hello|  0|
|world|  0|
|hello|  1|
|there|  1|
+-----+---+
```

I also tried it with two levels of partitioning. I have not found a way to add a test case to the unit test suite that exercises a real HDFS file location. Any suggestions will be appreciated.

Author: Xin Wu <xinwu@us.ibm.com>

Closes #13463 from xwu0226/SPARK-14959.
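To illustrate the idea behind the fix, here is a minimal sketch of listing leaf files so that directories are separated out before `getFileBlockLocations` is ever called. It is not the code from this patch: the object name `LeafFileListingSketch` and the method `listLeafFilesWithLocations` are hypothetical, it assumes `hadoop-common` is on the classpath, and it omits any additional filtering the real `HadoopFsRelation.listLeafFiles` may perform.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, LocatedFileStatus, Path}

// Hypothetical helper, for illustration only.
object LeafFileListingSketch {

  // Recursively collects the leaf files under `root`.
  // Directories (e.g. partition directories such as id=0) are only recursed
  // into; getFileBlockLocations is called on plain files exclusively.
  def listLeafFilesWithLocations(fs: FileSystem, root: Path): Seq[LocatedFileStatus] = {
    val (dirs, files) = fs.listStatus(root).toSeq.partition(_.isDirectory)

    val located = files.map {
      // Some file systems already return located statuses; reuse them.
      case f: LocatedFileStatus => f
      // Otherwise look up block locations for the whole file.
      case f: FileStatus =>
        new LocatedFileStatus(f, fs.getFileBlockLocations(f, 0L, f.getLen))
    }

    located ++ dirs.flatMap(d => listLeafFilesWithLocations(fs, d.getPath))
  }
}
```

Called as `LeafFileListingSketch.listLeafFilesWithLocations(FileSystem.get(new Configuration()), new Path(...))`, the result contains only files, so a partitioned table layout should not trigger the directory-related exception described above.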
Showing 3 changed files:
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala (3 additions, 33 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala (10 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala (1 addition, 0 deletions)