    [SPARK-11086][SPARKR] Use dropFactors column-wise instead of nested loop when createDataFrame · d7d9fa0b
    zero323 authored
    Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`
    
    At the moment, SparkR `createDataFrame` uses a nested loop to convert factors to character when called on a local data.frame. It works, but it is incredibly slow, especially with data.table (~2 orders of magnitude slower than the PySpark / Pandas version on a DataFrame of 1M rows x 2 columns).
    
    A simple improvement is to apply `dropFactor` column-wise and then reshape the output list.
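
The column-wise approach can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the body of `dropFactor` is an assumed stand-in for the helper named above, and the `mapply`-based reshape is one plausible way to turn a list of columns into a list of rows.

```r
# Assumed stand-in for the helper named in the commit message:
# convert a factor column to character, leave other columns untouched.
dropFactor <- function(x) {
  if (is.factor(x)) as.character(x) else x
}

# Small local data.frame with one factor column (illustrative data).
df <- data.frame(
  id = 1:3,
  label = factor(c("a", "b", "a")),
  stringsAsFactors = FALSE
)

# Step 1: one call per column instead of one call per cell.
cols <- lapply(df, dropFactor)

# Step 2: reshape the list of columns into a list of rows.
# mapply walks the columns in parallel; each row becomes an unnamed list.
# unname() keeps column names from clashing with mapply's own arguments.
rows <- do.call(
  mapply,
  c(list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE), unname(cols))
)
```

The conversion cost now scales with the number of columns rather than the number of cells, which is where a nested loop over rows and columns spends most of its time.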
    
    It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).
    
    Author: zero323 <matthew.szymkiewicz@gmail.com>
    
    Closes #9099 from zero323/SPARK-11086.