[SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark
    goldmedal authored
    ## What changes were proposed in this pull request?
We added a method to the Scala API for creating a `DataFrame` from a `Dataset[String]` storing CSV in [SPARK-15463](https://issues.apache.org/jira/browse/SPARK-15463), but PySpark has no `Dataset` to support this feature. Therefore, this adds an API to create a `DataFrame` from an `RDD[String]` storing CSV, which is also consistent with PySpark's `spark.read.json`.
    
For example:
    ```
    >>> rdd = sc.textFile('python/test_support/sql/ages.csv')
    >>> df2 = spark.read.csv(rdd)
    >>> df2.dtypes
    [('_c0', 'string'), ('_c1', 'string')]
    ```
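For readers without a running Spark session, the parsing behavior can be approximated in plain Python with the standard `csv` module. This is only an illustrative sketch of what the reader produces (string-typed columns with default names `_c0`, `_c1`, ...), not the actual Spark implementation; the helper name `parse_csv_lines` is hypothetical.

```python
import csv

def parse_csv_lines(lines):
    """Illustrative sketch only: parse an iterable of CSV-formatted strings
    into dicts keyed by the default column names `_c0`, `_c1`, ... that
    Spark assigns when no header or schema is given."""
    return [
        {f"_c{i}": value for i, value in enumerate(fields)}
        for fields in csv.reader(lines)
    ]

# Every value stays a string, matching the all-string dtypes shown above.
rows = parse_csv_lines(["Joe,20", "Tom,30"])
```

In real PySpark, the same lines would live in an `RDD[String]` (e.g. from `sc.textFile`) and be handed directly to `spark.read.csv`, which additionally supports schema inference and explicit schemas.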
    ## How was this patch tested?
Added unit test cases.
    
    Author: goldmedal <liugs963@gmail.com>
    
    Closes #19339 from goldmedal/SPARK-22112.