769a909d
[SPARK-7264][ML] Parallel lapply for sparkR
Timothy Hunter authored
    ## What changes were proposed in this pull request?
    
    This PR adds a new function in SparkR called `sparkLapply(list, function)`. This function implements a distributed version of `lapply` using Spark as a backend.
    
    TODO:
     - [x] check documentation
     - [ ] check tests
    
    Trivial example in SparkR:
    
    ```R
    sparkLapply(1:5, function(x) { 2 * x })
    ```
    
    Output:
    
    ```
    [[1]]
    [1] 2
    
    [[2]]
    [1] 4
    
    [[3]]
    [1] 6
    
    [[4]]
    [1] 8
    
    [[5]]
    [1] 10
    ```
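For reference, the output matches what base R's `lapply` produces locally; a minimal sketch that needs no Spark session:

```R
# Base R equivalent of the sparkLapply call above; the distributed
# version returns the same ordinary R list, just computed on the cluster.
lapply(1:5, function(x) { 2 * x })
```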
    
    Here is a slightly more complex example to perform distributed training of multiple models. Under the hood, Spark broadcasts the dataset.
    
    ```R
    library("MASS")
    data(menarche)
    families <- c("gaussian", "poisson")
train <- function(family) { glm(Menarche ~ Age, family = family, data = menarche) }
    results <- sparkLapply(families, train)
    ```
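Since the result is an ordinary R list of fitted models, it can be inspected with the usual tools. A local sketch of the same workflow, substituting base `lapply` for `sparkLapply` so it runs without a Spark session (the substitution is for illustration only):

```R
library("MASS")
data(menarche)
families <- c("gaussian", "poisson")
train <- function(family) { glm(Menarche ~ Age, family = family, data = menarche) }

# Locally, sparkLapply behaves like lapply over the list of families:
results <- lapply(families, train)

# Each element is a fitted glm object; e.g. pull out the coefficients
# (intercept and Age slope) for each family:
coefs <- lapply(results, coef)
```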
    
    ## How was this patch tested?
    
    This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.
    
    cc falaki davies
    
    Author: Timothy Hunter <timhunter@databricks.com>
    
    Closes #12426 from thunterdb/7264.