    [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD · 880eabec
    Davies Liu authored
    Convert each Row in JavaSchemaRDD into Array[Any] and unpickle it as a tuple in Python, then convert it into a namedtuple, so users can access fields just like attributes.
    
    This lets nested structures be accessed as objects. It also reduces the size of the serialized data and improves performance.
    
    root
     |-- field1: integer (nullable = true)
     |-- field2: string (nullable = true)
     |-- field3: struct (nullable = true)
     |    |-- field4: integer (nullable = true)
     |    |-- field5: array (nullable = true)
     |    |    |-- element: integer (containsNull = false)
     |-- field6: array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- field7: string (nullable = true)
    
    With a schema like this, nested values can be accessed as row.field3.field5[0] or row.field6[5].field7.
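The attribute-style access described above can be illustrated without Spark. This is a minimal sketch that uses plain `collections.namedtuple` classes as stand-ins for the Row objects; the class names `Record`, `Field3`, and `Field6Element` and all sample values are hypothetical, chosen only to mirror the schema shown.

```python
from collections import namedtuple

# Hypothetical stand-ins for the row classes; the field names follow the schema above.
Field3 = namedtuple("Field3", ["field4", "field5"])
Field6Element = namedtuple("Field6Element", ["field7"])
Record = namedtuple("Record", ["field1", "field2", "field3", "field6"])

# Sample data invented for illustration.
row = Record(
    field1=1,
    field2="a",
    field3=Field3(field4=2, field5=[10, 20, 30]),
    field6=[Field6Element(field7=str(i)) for i in range(6)],
)

# A struct nested in a struct, then an array element.
first_of_field5 = row.field3.field5[0]

# An array of structs: index first, then read the attribute.
sixth_field7 = row.field6[5].field7
```

Because each nested struct is itself a tuple with named fields, no dictionary lookups by string key are needed at any level.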
    
    It also infers the schema in Python, converting Row/dict/namedtuple/objects into tuples before serialization, and then calls applySchema in the JVM. During inferSchema(), the top-level dict in a row becomes a StructType, but any nested dictionary becomes a MapType.
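The struct-versus-map rule can be sketched in plain Python. The function below is not PySpark's implementation: `infer_type`, its `top_level` flag, and the string type names it returns are all assumptions made for illustration, and it handles only a few value types.

```python
def infer_type(value, top_level=False):
    """Return a readable type description for a sample value (illustrative only)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        if not value:
            raise ValueError("cannot infer a type from an empty dict")
        if top_level:
            # The dict that represents the row itself becomes a struct of named fields.
            fields = ", ".join(
                "{}: {}".format(k, infer_type(v)) for k, v in sorted(value.items())
            )
            return "struct<{}>".format(fields)
        # Any dict nested inside the row becomes a map; sample one key/value pair.
        key, val = next(iter(value.items()))
        return "map<{}, {}>".format(infer_type(key), infer_type(val))
    if isinstance(value, (list, tuple)):
        if not value:
            raise ValueError("cannot infer a type from an empty sequence")
        return "array<{}>".format(infer_type(value[0]))
    raise TypeError("unsupported type: {!r}".format(type(value)))


row_type = infer_type({"name": "Alice", "scores": {"math": 90}}, top_level=True)
```

In this sketch `row_type` is `"struct<name: string, scores: map<string, integer>>"`: the outer dict is treated as a record with fixed fields, while the inner dict is treated as a key/value collection.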
    
    You can use pyspark.sql.Row to convert an unnamed structure into a Row object, so that the schema of the RDD can be inferred. For example:
    
    ctx.inferSchema(rdd.map(lambda x: Row(a=x[0], b=x[1])))
    
    Or you could use Row to create a class just like namedtuple, for example:
    
    Person = Row("name", "age")
    ctx.inferSchema(rdd.map(lambda x: Person(*x)))
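The message compares this second form to namedtuple, so the pattern can be tried locally with `collections.namedtuple` itself. This is only an analogy: the sketch below does not use pyspark.sql.Row, and the `records` list is invented sample data standing in for the contents of an RDD.

```python
from collections import namedtuple

# Analogous to Person = Row("name", "age"): define the field names once.
Person = namedtuple("Person", ["name", "age"])

# Invented sample data: unnamed tuples, as an RDD of tuples would hold.
records = [("Alice", 30), ("Bob", 25)]

# Analogous to rdd.map(lambda x: Person(*x)): name every positional value.
people = [Person(*x) for x in records]

names = [p.name for p in people]
ages = [p.age for p in people]
```

Defining the class once keeps the field names in one place, whereas the keyword form repeats them in every lambda.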
    
    You can also call applySchema to apply a schema to an RDD of tuples or lists and turn it into a SchemaRDD. The `schema` should be a StructType; see the API docs for details.
    
    schema = StructType([StructField("name", StringType(), True),
                         StructField("age", IntegerType(), True)])
    ctx.applySchema(rdd, schema)
    
    PS: To use a namedtuple with inferSchema, the namedtuple class must be picklable.
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #1598 from davies/nested and squashes the following commits:
    
    f1d15b6 [Davies Liu] verify schema with the first few rows
    8852aaf [Davies Liu] check type of schema
    abe9e6e [Davies Liu] address comments
    61b2292 [Davies Liu] add @deprecated to pythonToJavaMap
    1e5b801 [Davies Liu] improve cache of classes
    51aa135 [Davies Liu] use Row to infer schema
    e9c0d5c [Davies Liu] remove string typed schema
    353a3f2 [Davies Liu] fix code style
    63de8f8 [Davies Liu] fix typo
    c79ca67 [Davies Liu] fix serialization of nested data
    6b258b5 [Davies Liu] fix pep8
    9d8447c [Davies Liu] apply schema provided by string of names
    f5df97f [Davies Liu] refactor, address comments
    9d9af55 [Davies Liu] use arrry to applySchema and infer schema in Python
    84679b3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into nested
    0eaaf56 [Davies Liu] fix doc tests
    b3559b4 [Davies Liu] use generated Row instead of namedtuple
    c4ddc30 [Davies Liu] fix conflict between name of fields and variables
    7f6f251 [Davies Liu] address all comments
    d69d397 [Davies Liu] refactor
    2cc2d45 [Davies Liu] refactor
    182fb46 [Davies Liu] refactor
    bc6e9e1 [Davies Liu] switch to new Schema API
    547bf3e [Davies Liu] Merge branch 'master' into nested
    a435b5a [Davies Liu] add docs and code refactor
    2c8debc [Davies Liu] Merge branch 'master' into nested
    644665a [Davies Liu] use tuple and namedtuple for schemardd