  1. May 26, 2016
    • Xin Ren's avatar
      [SPARK-15542][SPARKR] Make error message clear for script './R/install-dev.sh'... · 6ab973ec
      Xin Ren authored
      [SPARK-15542][SPARKR] Make error message clear for script './R/install-dev.sh' when R is missing on Mac
      
      https://issues.apache.org/jira/browse/SPARK-15542
      
      ## What changes were proposed in this pull request?
      
      When running `./R/install-dev.sh` in a **Mac OS X El Capitan** environment, I got
      ```
      mbp185-xr:spark xin$ ./R/install-dev.sh
      usage: dirname path
      ```
      This message was very confusing to me. I then found that R is not properly configured on my Mac; the script uses `$(which R)` to get the R home.
      
      I tried the same situation on CentOS with R missing, and it gives a very clear error message, while Mac OS does not.
      On CentOS:
      ```
      [rootip-xxx-31-9-xx spark]# which R
      /usr/bin/which: no R in (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin:/root/bin)
      ```
      but on Mac, if R is not found, nothing is returned, which causes the confusing message for the R build failure when running R/install-dev.sh:
      ```
      mbp185-xr:spark xin$ which R
      mbp185-xr:spark xin$
      ```
      
      Here I just added a clear message for this R misconfiguration when running `R/install-dev.sh`.
      ```
      mbp185-xr:spark xin$ ./R/install-dev.sh
      Cannot find R home by running 'which R', please make sure R is properly installed.
      ```
      
      ## How was this patch tested?
      Manually tested on local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #13308 from keypointt/SPARK-15542.
      6ab973ec
    • felixcheung's avatar
      [SPARK-10903][SPARKR] R - Simplify SQLContext method signatures and use a singleton · c76457c8
      felixcheung authored
      Eliminate the need to pass sqlContext to methods since it is a singleton - we don't want to support multiple contexts in an R session.
      
      Changes are done in a backward-compatible way with a deprecation warning added. Method signatures for the S3 methods are added in a concise, clean approach so that in the next release the deprecated signatures can be taken out easily/cleanly (just delete a few lines per method).
      
      Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.
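
      For illustration, a minimal before/after sketch of the simplified signatures, using `createDataFrame` as an example (the exact set of affected methods and the deprecation messages are as described above, not reproduced here):
      ```r
      df <- createDataFrame(sqlContext, iris)   # old signature, kept with a deprecation warning
      df <- createDataFrame(iris)               # new signature, uses the singleton context
      ```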
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9192 from felixcheung/rsqlcontext.
      c76457c8
  2. May 25, 2016
    • wm624@hotmail.com's avatar
      [SPARK-15439][SPARKR] Failed to run unit test in SparkR · 06bae8af
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      There are some failures when running the SparkR unit tests.
      In this PR, I fixed two of these failures, in test_context.R and test_sparkSQL.R.
      The first one is due to a different masked name; I added the missing names to the expected arrays.
      The second one is because a PR removed the logic of a previous fix for a missing subset method.
      
      The file privilege issue is still there; I am debugging it. The SparkR shell can run the test case successfully:
      test_that("pipeRDD() on RDDs", {
        actual <- collect(pipeRDD(rdd, "more"))
      When using the run-test script, it complains that the directory does not exist:
      cannot open file '/tmp/Rtmp4FQbah/filee2273f9d47f7': No such file or directory
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13284 from wangmiao1981/R.
      06bae8af
  3. May 24, 2016
    • Daoyuan Wang's avatar
      [SPARK-15397][SQL] fix string udf locate as hive · d642b273
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1, and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2, and `locate("aa", "aaa", 2)` yields 0. This results from a different understanding of the third parameter of the `locate` UDF: it is the starting index and starts from 1, so when we pass 0, the return value is always 0.
      
      ## How was this patch tested?
      
      tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #13186 from adrian-wang/locate.
      d642b273
  4. May 23, 2016
    • hyukjinkwon's avatar
      [MINOR][SPARKR][DOC] Add a description for running unit tests in Windows · a8e97d17
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds the description for running unit tests in Windows.
      
      ## How was this patch tested?
      
      On a bare machine (Windows 7, 32-bit), this was manually built and tested.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13217 from HyukjinKwon/minor-r-doc.
      a8e97d17
  5. May 18, 2016
  6. May 12, 2016
    • Sun Rui's avatar
      [SPARK-15202][SPARKR] add dapplyCollect() method for DataFrame in SparkR. · b3930f74
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      dapplyCollect() applies an R function on each partition of a SparkDataFrame and collects the result back to R as a data.frame.
      ```
      dapplyCollect(df, function(ldf) {...})
      ```
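
      A minimal usage sketch based on the signature above (the dataset and the filtering condition are illustrative):
      ```r
      df <- createDataFrame(sqlContext, iris)
      ldf <- dapplyCollect(df, function(ldf) {
        # runs on each partition; must return a local data.frame
        ldf[ldf$Sepal_Length > 5, ]
      })
      class(ldf)  # "data.frame"
      ```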
      
      ## How was this patch tested?
      SparkR unit tests.
      
      Author: Sun Rui <sunrui2016@gmail.com>
      
      Closes #12989 from sun-rui/SPARK-15202.
      b3930f74
  7. May 09, 2016
    • Yanbo Liang's avatar
      [MINOR] [SPARKR] Update data-manipulation.R to use native csv reader · ee3b1715
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Since Spark now supports a native csv reader, it is not necessary to use the third-party ```spark-csv``` package in ```examples/src/main/r/data-manipulation.R```. Meanwhile, remove all ```spark-csv``` usage in SparkR (a sketch of the native reader follows this list).
      * Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we switch to using ```./bin/spark-submit``` to run the example.
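
      A hedged sketch of reading a csv file with the native data source (the file path and options here are illustrative, not taken from the example script):
      ```r
      df <- read.df(sqlContext, "/path/to/flights.csv", source = "csv",
                    header = "true", inferSchema = "true")
      ```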
      
      ## How was this patch tested?
      Offline test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13005 from yanboliang/r-df-examples.
      ee3b1715
  8. May 08, 2016
  9. May 05, 2016
    • Sun Rui's avatar
      [SPARK-11395][SPARKR] Support over and window specification in SparkR. · 157a49aa
      Sun Rui authored
      This PR:
      1. Implements a WindowSpec S4 class.
      2. Implements Window.partitionBy() and Window.orderBy() as utility functions to create WindowSpec objects.
      3. Implements over() for the Column class (see the usage sketch after this list).
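
      A hedged usage sketch; the constructor names `windowPartitionBy`/`orderBy` and the column names are assumptions for illustration (the description above refers to the constructors as Window.partitionBy() and Window.orderBy()):
      ```r
      df <- createDataFrame(sqlContext, data.frame(dept = c("a", "a", "b"), salary = c(100, 200, 150)))
      ws <- orderBy(windowPartitionBy("dept"), "salary")
      ranked <- select(df, df$dept, df$salary, over(row_number(), ws))
      ```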
      
      Author: Sun Rui <rui.sun@intel.com>
      Author: Sun Rui <sunrui2016@gmail.com>
      
      Closes #10094 from sun-rui/SPARK-11395.
      157a49aa
    • NarineK's avatar
      [SPARK-15110] [SPARKR] Implement repartitionByColumn for SparkR DataFrames · 22226fcc
      NarineK authored
      ## What changes were proposed in this pull request?
      
      Implements repartitionByColumn on DataFrame.
      This allows us to run R functions on each partition identified by column groups with the dapply() method.
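
      A hypothetical sketch of the intended workflow (the column names and the exact argument name for the partitioning column are assumptions):
      ```r
      df <- createDataFrame(sqlContext, data.frame(group = c("a", "a", "b"), value = c(1, 2, 3)))
      df2 <- repartition(df, col = df$group)                   # partition rows by column value
      res <- dapply(df2, function(ldf) { ldf }, schema(df))    # run an R function per partition
      ```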
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: NarineK <narine.kokhlikyan@us.ibm.com>
      
      Closes #12887 from NarineK/repartitionByColumns.
      22226fcc
  10. May 03, 2016
  11. Apr 30, 2016
    • Yanbo Liang's avatar
      [SPARK-15030][ML][SPARKR] Support formula in spark.kmeans in SparkR · 19a6d192
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * ```RFormula``` supports an empty response variable like ```~ x + y```.
      * Support formula in ```spark.kmeans``` in SparkR (see the sketch after this list).
      * Fix some outdated docs for SparkR.
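
      A hedged sketch of fitting k-means with a formula (the dataset, formula, and the `k` argument name are assumptions for illustration):
      ```r
      df <- createDataFrame(sqlContext, iris)
      model <- spark.kmeans(df, ~ Sepal_Length + Sepal_Width, k = 3)
      summary(model)
      ```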
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12813 from yanboliang/spark-15030.
      19a6d192
    • Xiangrui Meng's avatar
      [SPARK-14831][.2][ML][R] rename ml.save/ml.load to write.ml/read.ml · b3ea5793
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      Continue the work of #12789 to rename ml.save/ml.load to write.ml/read.ml, which are more consistent with read.df/write.df and other methods in SparkR.
      
      I didn't rename `data` to `df` because we still use `predict` for prediction, which uses `newData` to match the signature in R.
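
      For illustration, the renamed persistence calls (`model` and `path` are placeholders):
      ```r
      write.ml(model, path)     # previously ml.save(model, path)
      model2 <- read.ml(path)   # previously ml.load(path)
      ```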
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      cc: yanboliang thunterdb
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12807 from mengxr/SPARK-14831.
      b3ea5793
    • Timothy Hunter's avatar
      [SPARK-14831][SPARKR] Make the SparkR MLlib API more consistent with Spark · bc36fe6e
      Timothy Hunter authored
      ## What changes were proposed in this pull request?
      
      This PR splits the MLlib algorithms into two flavors:
       - the R flavor, which tries to mimic the existing R API for these algorithms (and works as an S4 specialization for Spark dataframes)
       - the Spark flavor, which follows the same API and naming conventions as the rest of the MLlib algorithms in the other languages
      
      In practice, the former calls the latter.
      
      ## How was this patch tested?
      
      The tests for the various algorithms were adapted to be run against both interfaces.
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #12789 from thunterdb/14831.
      bc36fe6e
  12. Apr 29, 2016
    • Sun Rui's avatar
      [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR. · 4ae9fe09
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
      
      The function signature is:
      
      	dapply(df, function(localDF) {}, schema = NULL)
      
      R function input: local data.frame from the partition on local node
      R function output: local data.frame
      
      The schema specifies the row format of the resulting DataFrame and must match the R function's output.
      If the schema is not specified, each partition of the resulting DataFrame will be serialized in R into a single byte array. Such a DataFrame can be processed by successive calls to dapply().
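
      A minimal sketch with an explicit schema (the column names and types are illustrative):
      ```r
      df <- createDataFrame(sqlContext, data.frame(a = 1:10))
      schema <- structType(structField("a", "integer"), structField("b", "double"))
      df2 <- dapply(df, function(ldf) {
        # returns a local data.frame matching the schema
        data.frame(a = ldf$a, b = ldf$a * 0.5)
      }, schema)
      ```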
      
      ## How was this patch tested?
      SparkR unit tests.
      
      Author: Sun Rui <rui.sun@intel.com>
      Author: Sun Rui <sunrui2016@gmail.com>
      
      Closes #12493 from sun-rui/SPARK-12919.
      4ae9fe09
    • Yanbo Liang's avatar
      [SPARK-14314][SPARK-14315][ML][SPARKR] Model persistence in SparkR (glm & kmeans) · 87ac84d4
      Yanbo Liang authored
      SparkR ```glm``` and ```kmeans``` model persistence.
      
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Gayathri Murali <gayathri.m.softie@gmail.com>
      
      Closes #12778 from yanboliang/spark-14311.
      Closes #12680
      Closes #12683
      87ac84d4
    • Timothy Hunter's avatar
      [SPARK-7264][ML] Parallel lapply for sparkR · 769a909d
      Timothy Hunter authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new function in SparkR called `sparkLapply(list, function)`. This function implements a distributed version of `lapply` using Spark as a backend.
      
      TODO:
       - [x] check documentation
       - [ ] check tests
      
      Trivial example in SparkR:
      
      ```R
      sparkLapply(1:5, function(x) { 2 * x })
      ```
      
      Output:
      
      ```
      [[1]]
      [1] 2
      
      [[2]]
      [1] 4
      
      [[3]]
      [1] 6
      
      [[4]]
      [1] 8
      
      [[5]]
      [1] 10
      ```
      
      Here is a slightly more complex example to perform distributed training of multiple models. Under the hood, Spark broadcasts the dataset.
      
      ```R
      library("MASS")
      data(menarche)
      families <- c("gaussian", "poisson")
      train <- function(family) { glm(Menarche ~ Age, family = family, data = menarche) }
      results <- sparkLapply(families, train)
      ```
      
      ## How was this patch tested?
      
      This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.
      
      cc falaki davies
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #12426 from thunterdb/7264.
      769a909d
  13. Apr 28, 2016
    • Sun Rui's avatar
      [SPARK-12235][SPARKR] Enhance mutate() to support replace existing columns. · 9e785079
      Sun Rui authored
      Make the behavior of mutate() more consistent with that in dplyr, besides adding support for replacing existing columns.
      1. Throw an error message when there are duplicated column names in the DataFrame being mutated.
      2. When duplicated column names are specified in the arguments, the last column of the same name takes effect (see the sketch below).
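
      A minimal sketch of replacing an existing column with mutate() (the dataset and column name are illustrative):
      ```r
      df <- createDataFrame(sqlContext, iris)
      df <- mutate(df, Sepal_Length = df$Sepal_Length * 2)   # replaces the existing column
      ```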
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10220 from sun-rui/SPARK-12235.
      9e785079
  14. Apr 27, 2016
    • Oscar D. Lara Yejas's avatar
      [SPARK-13436][SPARKR] Added parameter drop to subsetting operator [ · e4bfb4aa
      Oscar D. Lara Yejas authored
      Added parameter drop to subsetting operator [. This is useful to get a Column from a DataFrame, given its name. R supports it.
      
      In R:
      ```
      > name <- "Sepal_Length"
      > class(iris[, name])
      [1] "numeric"
      ```
      Currently, in SparkR:
      ```
      > name <- "Sepal_Length"
      > class(irisDF[, name])
      [1] "DataFrame"
      ```
      
      Previous code returns a DataFrame, which is inconsistent with R's behavior. SparkR should return a Column instead. Currently, the only way for the user to get a Column given a column name as a character variable is through `eval(parse(x))`, where x is the string `"irisDF$Sepal_Length"`. That itself is pretty hacky. `SparkR:::getColumn()` is another choice, but I don't see why this method should be externalized. Instead, following R's way of doing things, the proposed implementation allows this:
      
      ```
      > name <- "Sepal_Length"
      > class(irisDF[, name, drop=T])
      [1] "Column"
      
      > class(irisDF[, name, drop=F])
      [1] "DataFrame"
      ```
      
      This is consistent with R:
      
      ```
      > name <- "Sepal_Length"
      > class(iris[, name])
      [1] "numeric"
      > class(iris[, name, drop=F])
      [1] "data.frame"
      ```
      
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      
      Closes #11318 from olarayej/SPARK-13436.
      e4bfb4aa
  15. Apr 26, 2016
    • Oscar D. Lara Yejas's avatar
      [SPARK-13734][SPARKR] Added histogram function · 0c99c23b
      Oscar D. Lara Yejas authored
      ## What changes were proposed in this pull request?
      
      Added method histogram() to compute the histogram of a Column
      
      Usage:
      
      ```
      ## Create a DataFrame from the Iris dataset
      irisDF <- createDataFrame(sqlContext, iris)
      
      ## Render a histogram for the Sepal_Length column
      histogram(irisDF, "Sepal_Length", nbins=12)
      
      ```
      ![histogram](https://cloud.githubusercontent.com/assets/13985649/13588486/e1e751c6-e484-11e5-85db-2fc2115c4bb2.png)
      
      Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name
      
      ## How was this patch tested?
      
      All unit tests pass. I added specific unit cases for different scenarios.
      
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      
      Closes #11569 from olarayej/SPARK-13734.
      0c99c23b
    • Yanbo Liang's avatar
      [SPARK-14313][ML][SPARKR] AFTSurvivalRegression model persistence in SparkR · 92f66331
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR.
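
      A sketch of the save/load pattern (assuming `model` is an already-fitted AFT survival regression model and `path` is a placeholder; this follows the same API as the NaiveBayes persistence example further down the page):
      ```r
      ml.save(model, path)
      model2 <- ml.load(path)
      ```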
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12685 from yanboliang/spark-14313.
      92f66331
  16. Apr 25, 2016
    • Yanbo Liang's avatar
      [SPARK-14312][ML][SPARKR] NaiveBayes model persistence in SparkR · 9cb3ba10
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API:
      ```
      df <- createDataFrame(sqlContext, infert)
      model <- naiveBayes(education ~ ., df, laplace = 0)
      ml.save(model, path)
      model2 <- ml.load(path)
      ```
      
      ## How was this patch tested?
      Add unit tests.
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12573 from yanboliang/spark-14312.
      9cb3ba10
    • Dongjoon Hyun's avatar
      [SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date · 6ab4d9e0
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
      
      - Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
      - Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
      - Fix datatypes in `sparkr.md`.
      - Update a data result in `sparkr.md`.
      - Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet (see the sketch after this list)
      - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
      - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
      - Other minor syntax fixes and a typo.
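
      For reference, a hedged sketch of one such replacement (the file path is illustrative):
      ```r
      people <- jsonFile(sqlContext, "path/to/people.json")    # deprecated, emits a warning
      people <- read.json(sqlContext, "path/to/people.json")   # up-to-date equivalent
      ```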
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12649 from dongjoon-hyun/SPARK-14883.
      6ab4d9e0
  17. Apr 23, 2016
    • felixcheung's avatar
      [SPARK-12148][SPARKR] fix doc after renaming DataFrame to SparkDataFrame · 1b7eab74
      felixcheung authored
      ## What changes were proposed in this pull request?
      
      Fixed inadvertent roxygen2 doc changes and added the class name change to the programming guide.
      Follow-up of #12621
      
      ## How was this patch tested?
      
      manually checked
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #12647 from felixcheung/rdataframe.
      1b7eab74
    • Reynold Xin's avatar
      [SPARK-14869][SQL] Don't mask exceptions in ResolveRelations · 890abd12
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.
      
      ## How was this patch tested?
      I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12634 from rxin/SPARK-14869.
      890abd12
    • felixcheung's avatar
      [SPARK-14594][SPARKR] check execution return status code · 39d3bc62
      felixcheung authored
      ## What changes were proposed in this pull request?
      
      When the JVM backend fails without going through proper error handling (e.g. the process crashed), the R error message can be ambiguous.
      
      ```
      Error in if (returnStatus != 0) { : argument is of length zero
      ```
      
      This change attempts to make it clearer (however, one would still need to investigate why the JVM failed)
      
      ## How was this patch tested?
      
      manually
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #12622 from felixcheung/rreturnstatus.
      39d3bc62
    • felixcheung's avatar
      [SPARK-12148][SPARKR] SparkR: rename DataFrame to SparkDataFrame · a55fbe2a
      felixcheung authored
      ## What changes were proposed in this pull request?
      
      Changed the class name defined in R from "DataFrame" to "SparkDataFrame". A popular package, S4Vector, already defines "DataFrame" - this change is to avoid the conflict.
      
      Aside from the class name and API/roxygen2 references, SparkR APIs like `createDataFrame` and `as.DataFrame` are not changed (S4Vector does not define an "as.DataFrame").
      
      Since in R one would rarely reference the type/class, this change should have minimal to almost no impact on SparkR users in terms of backward compatibility.
      
      ## How was this patch tested?
      
      SparkR tests, manually loading S4Vector then SparkR package
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #12621 from felixcheung/rdataframe.
      a55fbe2a
  18. Apr 22, 2016
  19. Apr 21, 2016
    • Dongjoon Hyun's avatar
      [SPARK-14780] [R] Add `setLogLevel` to SparkR · 41145447
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to add `setLogLevel` function to SparkR shell.
      
      **Spark Shell**
      ```scala
      scala> sc.setLogLevel("ERROR")
      ```
      
      **PySpark**
      ```python
      >>> sc.setLogLevel("ERROR")
      ```
      
      **SparkR (this PR)**
      ```r
      > setLogLevel(sc, "ERROR")
      NULL
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests including a new R testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12547 from dongjoon-hyun/SPARK-14780.
      41145447
  20. Apr 20, 2016
    • Dongjoon Hyun's avatar
      [SPARK-14639] [PYTHON] [R] Add `bround` function in Python/R. · 14869ae6
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This issue aims to expose Scala `bround` function in Python/R API.
      `bround` function is implemented in SPARK-14614 by extending current `round` function.
      We used the following semantics from Hive.
      ```java
      public static double bround(double input, int scale) {
          if (Double.isNaN(input) || Double.isInfinite(input)) {
            return input;
          }
          return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
      }
      ```
      
      After this PR, `pyspark` and `sparkR` also support `bround` function.
      
      **PySpark**
      ```python
      >>> from pyspark.sql.functions import bround
      >>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
      [Row(r=2.0)]
      ```
      
      **SparkR**
      ```r
      > df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
      > head(collect(select(df, bround(df$x, 0))))
        bround(x, 0)
      1            2
      2            4
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12509 from dongjoon-hyun/SPARK-14639.
      14869ae6
  21. Apr 19, 2016
    • Sun Rui's avatar
      [SPARK-13905][SPARKR] Change signature of as.data.frame() to be consistent with the R base package. · 8eedf0b5
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      Change the signature of as.data.frame() to be consistent with that in the R base package to meet R user's convention.
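
      A sketch of the intended call after the change, mirroring `base::as.data.frame` (the argument defaults shown are those of base R and are assumptions about the final SparkR method):
      ```r
      df <- createDataFrame(sqlContext, iris)
      ldf <- as.data.frame(df, row.names = NULL, optional = FALSE)
      ```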
      
      ## How was this patch tested?
      dev/lint-r
      SparkR unit tests
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #11811 from sun-rui/SPARK-13905.
      8eedf0b5
    • felixcheung's avatar
      [SPARK-12224][SPARKR] R support for JDBC source · ecd877e8
      felixcheung authored
      Add R API for `read.jdbc`, `write.jdbc`.
      
      Tested this quite a bit manually with different combinations of parameters. It's not clear if we could have automated tests in R for this - Scala `JDBCSuite` depends on Java H2 in-memory database.
      
      Refactored some code into util so they could be tested.
      
      Core's R SerDe code needs to be updated to allow access to java.util.Properties as a `jobj` handle, which is required by DataFrameReader/Writer's `jdbc` method. It would be possible, though more code, to add a `sql/r/SQLUtils` helper function.
      
      Tested:
      ```
      # with postgresql
      ../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar
      
      # read.jdbc
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)
      
      # partitionColumn and numPartitions test
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
      a <- SparkR:::toRDD(df)
      SparkR:::getNumPartitions(a)
      [1] 4
      SparkR:::collectPartition(a, 2L)
      
      # defaultParallelism test
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
      SparkR:::getNumPartitions(a)
      [1] 2
      
      # predicates test
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
      count(df) == 1
      
      # write.jdbc, default save mode "error"
      irisDf <- as.DataFrame(sqlContext, iris)
      write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
      "error, already exists"
      
      write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
      ```
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10480 from felixcheung/rreadjdbc.
      ecd877e8
  22. Apr 15, 2016
    • Yanbo Liang's avatar
      [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for... · 83af297a
      Yanbo Liang authored
      [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for more family and link functions
      
      ## What changes were proposed in this pull request?
      Expose R-like summary statistics in SparkR::glm for more family and link functions.
      Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed, we only provide the most commonly used statistics in this PR. More statistics can be added in the followup work.
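
      For illustration, a sketch of a call that produces summary output like the SparkR output shown below (the dataset and formula are inferred from the coefficient names and are illustrative):
      ```r
      df <- createDataFrame(sqlContext, iris)
      model <- glm(Sepal_Width ~ Sepal_Length + Species, data = df, family = "gaussian")
      summary(model)
      ```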
      
      ## How was this patch tested?
      Unit tests.
      
      SparkR Output:
      ```
      Deviance Residuals:
      (Note: These are approximate quantiles with relative error <= 0.01)
           Min        1Q    Median        3Q       Max
      -0.95096  -0.16585  -0.00232   0.17410   0.72918
      
      Coefficients:
                          Estimate  Std. Error  t value  Pr(>|t|)
      (Intercept)         1.6765    0.23536     7.1231   4.4561e-11
      Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
      Species_versicolor  -0.98339  0.072075    -13.644  0
      Species_virginica   -1.0075   0.093306    -10.798  0
      
      (Dispersion parameter for gaussian family taken to be 0.08351462)
      
          Null deviance: 28.307  on 149  degrees of freedom
      Residual deviance: 12.193  on 146  degrees of freedom
      AIC: 59.22
      
      Number of Fisher Scoring iterations: 1
      ```
      R output:
      ```
      Deviance Residuals:
           Min        1Q    Median        3Q       Max
      -0.95096  -0.16522   0.00171   0.18416   0.72918
      
      Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
      (Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
      Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
      Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
      Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      (Dispersion parameter for gaussian family taken to be 0.08351462)
      
          Null deviance: 28.307  on 149  degrees of freedom
      Residual deviance: 12.193  on 146  degrees of freedom
      AIC: 59.217
      
      Number of Fisher Scoring iterations: 2
      ```
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12393 from yanboliang/spark-13925.
      83af297a
  23. Apr 12, 2016
    • Yanbo Liang's avatar
      [SPARK-12566][SPARK-14324][ML] GLM model family, link function support in SparkR:::glm · 75e05a5a
      Yanbo Liang authored
      * SparkR glm supports families and link functions which match R's signature for family.
      * SparkR glm API refactor. The comparative standard of the new API is R glm, so I only expose the arguments that R glm supports: ```formula, family, data, epsilon and maxit```.
      * This PR focuses on glm() and predict(); summary statistics will be done in a separate PR after this gets in.
      * This PR depends on #12287, which makes GLMs support link prediction on the Scala side. After that is merged, I will add more tests for predict() to this PR.
      
      Unit tests.
      
      cc mengxr jkbradley hhbyyh
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12294 from yanboliang/spark-12566.
      75e05a5a
  24. Apr 10, 2016
  25. Apr 05, 2016
    • Burak Yavuz's avatar
      [SPARK-14353] Dataset Time Window `window` API for R · 1146c534
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
      This PR adds the R API for this function.
      
      With this PR, R shares the same APIs as SQL, Java, and Scala, in that users can use:
       - `window(timeColumn, windowDuration)`
       - `window(timeColumn, windowDuration, slideDuration)`
       - `window(timeColumn, windowDuration, slideDuration, startTime)`
      
      In Python and R, users can access all APIs above, but in addition they can do
       - In R:
         `window(timeColumn, windowDuration, startTime=...)`
      
      that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.
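
      A hedged sketch of using the new function (assuming `df` is a SparkDataFrame with a timestamp column `time`; the column names and durations are illustrative):
      ```r
      win <- window(df$time, "10 minutes")                             # tumbling 10-minute windows
      counts <- count(groupBy(df, win))
      win2 <- window(df$time, "10 minutes", startTime = "5 minutes")   # tumbling windows with a start offset
      ```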
      
      ## How was this patch tested?
      
      Unit tests + manual tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12141 from brkyvz/R-windows.
      1146c534
  26. Apr 01, 2016
    • Yanbo Liang's avatar
      [SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans · 22249afb
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. It's only the code refactor for the original ```KMeans``` wrapper.
      
      ## How was this patch tested?
      Existing tests.
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12039 from yanboliang/spark-14059.
      22249afb
  27. Mar 28, 2016