    [SPARK-2179][SQL] Public API for DataTypes and Schema · 7003c163
    Yin Huai authored
    The current PR contains the following changes:
    * Expose `DataType`s in the sql package (internal details are private to sql).
    * Users can create Rows.
    * Introduce `applySchema` to create a `SchemaRDD` by applying a `schema: StructType` to an `RDD[Row]`.
    * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
    * `ScalaReflection.typeOfObject` infers the Catalyst data type of an object. It can also be composed with custom logic to form a new inference function for other use cases.
    * `JsonRDD` has been refactored to use changes introduced by this PR.
    * Add a field `containsNull` to `ArrayType`, so that we can explicitly mark whether an `ArrayType` can contain null values. The default value of `containsNull` is `false`.
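
    The composition mentioned for `typeOfObject` can be sketched as follows. This is a minimal, self-contained illustration, not the code in this PR: the data types are toy stand-ins defined locally, and it assumes the inference function is modelled as a `PartialFunction[Any, DataType]`. The names `customRules`, `inferType` and `TypeInferenceSketch` are hypothetical.

    ```scala
    // Toy stand-ins for Catalyst data types (hypothetical, not Spark's classes).
    sealed trait DataType
    case object StringType extends DataType
    case object IntegerType extends DataType
    case object DoubleType extends DataType
    case object BooleanType extends DataType
    case object DecimalType extends DataType
    case object NullType extends DataType

    object TypeInferenceSketch {
      // Base inference, assumed here to be a partial function from a value to its type.
      val typeOfObject: PartialFunction[Any, DataType] = {
        case _: String  => StringType
        case _: Int     => IntegerType
        case _: Double  => DoubleType
        case _: Boolean => BooleanType
        case null       => NullType
      }

      // Custom logic for one use case: extra rules plus a catch-all fallback.
      val customRules: PartialFunction[Any, DataType] = {
        case _: java.math.BigInteger => DecimalType
        case _                       => StringType
      }

      // The base rules are tried first; customRules handles whatever they reject.
      val inferType: PartialFunction[Any, DataType] = typeOfObject orElse customRules

      def main(args: Array[String]): Unit = {
        println(inferType(42))                             // IntegerType
        println(inferType(new java.math.BigInteger("7")))  // DecimalType
        println(inferType(List(1, 2)))                     // StringType
        println(typeOfObject.isDefinedAt(List(1, 2)))      // false
      }
    }
    ```

    Because the base function stays partial, each caller can decide what to do with values it does not recognize instead of inheriting a single global fallback.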
    
    New APIs are introduced in the sql package object and SQLContext. You can find the scaladoc at
    [sql package object](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.package) and [SQLContext](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext).
    
    An example of using `applySchema` is shown below.
    ```scala
    import org.apache.spark.sql._
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    
    val schema =
      StructType(
        StructField("name", StringType, false) ::
        StructField("age", IntegerType, true) :: Nil)
    
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Row(p(0), p(1).trim.toInt))
    val peopleSchemaRDD = sqlContext.applySchema(people, schema)
    peopleSchemaRDD.printSchema
    // root
    // |-- name: string (nullable = false)
    // |-- age: integer (nullable = true)
    
    peopleSchemaRDD.registerAsTable("people")
    sqlContext.sql("select name from people").collect.foreach(println)
    ```
    
    I will add new content to the SQL programming guide later.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2179
    
    Author: Yin Huai <huai@cse.ohio-state.edu>
    
    Closes #1346 from yhuai/dataTypeAndSchema and squashes the following commits:
    
    1d45977 [Yin Huai] Clean up.
    a6e08b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    c712fbf [Yin Huai] Converts types of values based on defined schema.
    4ceeb66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    e5f8df5 [Yin Huai] Scaladoc.
    122d1e7 [Yin Huai] Address comments.
    03bfd95 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    2476ed0 [Yin Huai] Minor updates.
    ab71f21 [Yin Huai] Format.
    fc2bed1 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    bd40a33 [Yin Huai] Address comments.
    991f860 [Yin Huai] Move "asJavaDataType" and "asScalaDataType" to DataTypeConversions.scala.
    1cb35fe [Yin Huai] Add "valueContainsNull" to MapType.
    3edb3ae [Yin Huai] Python doc.
    692c0b9 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    1d93395 [Yin Huai] Python APIs.
    246da96 [Yin Huai] Add java data type APIs to javadoc index.
    1db9531 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    d48fc7b [Yin Huai] Minor updates.
    33c4fec [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    b9f3071 [Yin Huai] Java API for applySchema.
    1c9f33c [Yin Huai] Java APIs for DataTypes and Row.
    624765c [Yin Huai] Tests for applySchema.
    aa92e84 [Yin Huai] Update data type tests.
    8da1a17 [Yin Huai] Add Row.fromSeq.
    9c99bc0 [Yin Huai] Several minor updates.
    1d9c13a [Yin Huai] Update applySchema API.
    85e9b51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    e495e4e [Yin Huai] More comments.
    42d47a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    c3f4a02 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    2e58dbd [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    b8b7db4 [Yin Huai] 1. Move sql package object and package-info to sql-core. 2. Minor updates on APIs. 3. Update scala doc.
    68525a2 [Yin Huai] Update JSON unit test.
    3209108 [Yin Huai] Add unit tests.
    dcaf22f [Yin Huai] Add a field containsNull to ArrayType to indicate if an array can contain null values or not. If an ArrayType is constructed by "ArrayType(elementType)" (the existing constructor), the value of containsNull is false.
    9168b83 [Yin Huai] Update comments.
    fc649d7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    eca7d04 [Yin Huai] Add two apply methods which will be used to extract StructField(s) from a StructType.
    949d6bb [Yin Huai] When creating a SchemaRDD for a JSON dataset, users can apply an existing schema.
    7a6a7e5 [Yin Huai] Fix bug introduced by the change made on SQLContext.inferSchema.
    43a45e1 [Yin Huai] Remove sql.util.package introduced in a previous commit.
    0266761 [Yin Huai] Format
    03eec4c [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
    90460ac [Yin Huai] Infer the Catalyst data type from an object and cast a data value to the expected type.
    3fa0df5 [Yin Huai] Provide easier ways to construct a StructType.
    16be3e5 [Yin Huai] This commit contains three changes: * Expose `DataType`s in the sql package (internal details are private to sql). * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`, * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.