  1. Jan 24, 2018
  2. Jan 03, 2018
  3. Jan 01, 2018
  4. Dec 30, 2017
    • [SPARK-22924][SPARKR] R API for sortWithinPartitions · ea0a5eef
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add to `arrange` the option to sort only within each partition
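
      For illustration, a minimal sketch of the intended usage (assuming the option is exposed as a `withinPartitions` argument on `arrange`):

      ```r
      df <- createDataFrame(mtcars)
      # sort rows only within each partition instead of performing a global sort
      head(arrange(df, "mpg", withinPartitions = TRUE))
      ```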
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #20118 from felixcheung/rsortwithinpartition.
      ea0a5eef
  5. Dec 29, 2017
    • [SPARK-22920][SPARKR] sql functions for current_date, current_timestamp,... · 66a7d6b3
      Felix Cheung authored
      [SPARK-22920][SPARKR] sql functions for current_date, current_timestamp, rtrim/ltrim/trim with trimString
      
      ## What changes were proposed in this pull request?
      
      Add SQL functions `current_date`, `current_timestamp`, and `rtrim`/`ltrim`/`trim` with a `trimString` argument
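
      For illustration, a hedged sketch of the new functions (the optional `trimString` argument on `trim`/`ltrim`/`rtrim` is assumed from the title):

      ```r
      df <- createDataFrame(data.frame(s = "xxSpark Rxx", stringsAsFactors = FALSE))
      head(select(df, current_date(), current_timestamp()))
      # trim a caller-supplied character set instead of whitespace
      head(select(df, trim(df$s, "x")))
      ```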
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #20105 from felixcheung/rsqlfuncs.
      66a7d6b3
  6. Dec 28, 2017
    • [SPARK-21208][R] Adds setLocalProperty and getLocalProperty in R · 1eebfbe1
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `setLocalProperty` and `getLocalProperty` in R.
      
      ```R
      > df <- createDataFrame(iris)
      > setLocalProperty("spark.job.description", "Hello world!")
      > count(df)
      > setLocalProperty("spark.job.description", "Hi !!")
      > count(df)
      ```
      
      <img width="775" alt="2017-12-25 4 18 07" src="https://user-images.githubusercontent.com/6477701/34335213-60655a7c-e990-11e7-88aa-12debe311627.png">
      
      ```R
      > print(getLocalProperty("spark.job.description"))
      NULL
      > setLocalProperty("spark.job.description", "Hello world!")
      > print(getLocalProperty("spark.job.description"))
      [1] "Hello world!"
      > setLocalProperty("spark.job.description", "Hi !!")
      > print(getLocalProperty("spark.job.description"))
      [1] "Hi !!"
      ```
      
      ## How was this patch tested?
      
      Manually tested and a test in `R/pkg/tests/fulltests/test_context.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #20075 from HyukjinKwon/SPARK-21208.
      1eebfbe1
    • [SPARK-22843][R] Adds localCheckpoint in R · 76e8a1d7
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add `localCheckpoint(..)` in R API.
      
      ```r
      df <- localCheckpoint(createDataFrame(iris))
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #20073 from HyukjinKwon/SPARK-22843.
      76e8a1d7
  7. Dec 23, 2017
    • [SPARK-22844][R] Adds date_trunc in R API · aeb45df6
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `date_trunc` in R API as below:
      
      ```r
      > df <- createDataFrame(list(list(a = as.POSIXlt("2012-12-13 12:34:00"))))
      > head(select(df, date_trunc("hour", df$a)))
        date_trunc(hour, a)
      1 2012-12-13 12:00:00
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #20031 from HyukjinKwon/r-datetrunc.
      aeb45df6
  8. Nov 26, 2017
  9. Nov 12, 2017
    • [SPARK-21693][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build · 3d90b2cb
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to reduce the max iterations in the Linear SVM test in SparkR. This particular test takes roughly 5 minutes on my Mac and over 20 minutes on Windows.

      The root cause appears to be that it triggers roughly 2,500 jobs with the default 100 max iterations. On Linux, `daemon.R` is forked, but on Windows a new process is launched each time, which is extremely slow.

      So, from my observation, many processes are launched (not forked) on Windows, which accounts for the difference in elapsed time.

      After reducing the max iterations to 10, the total number of jobs in this single test drops to roughly 550.

      After reducing the max iterations to 5, the total number of jobs drops to roughly 360.
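
      For illustration, a hedged sketch of the kind of change (assuming the test builds the model via `spark.svmLinear` and simply caps `maxIter`):

      ```r
      df <- suppressWarnings(createDataFrame(iris))
      # linear SVM is a binary classifier, so keep only two classes
      training <- df[df$Species %in% c("versicolor", "virginica"), ]
      # cap the iterations so the test finishes quickly on Windows/AppVeyor
      model <- spark.svmLinear(training, Species ~ ., regParam = 0.01, maxIter = 5)
      ```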
      
      ## How was this patch tested?
      
      Manually tested the elapsed times.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19722 from HyukjinKwon/SPARK-21693-test.
      3d90b2cb
  10. Nov 11, 2017
    • [SPARK-22488][SQL] Fix the view resolution issue in the SparkSession internal table() API · d6ee69e7
      gatorsmile authored
      ## What changes were proposed in this pull request?
      The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls the `sessionState.catalog.lookupRelation` API. This skips the view resolution logic in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands and by public and internal APIs.
      
      Users might get a strange error caused by view resolution when the default database is different.
      ```
      Table or view not found: t1; line 1 pos 14
      org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14
      	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      ```
      
      This PR fixes it by enforcing the use of `ResolveRelations` to resolve the table.
      
      ## How was this patch tested?
      Added a test case and modified the existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19713 from gatorsmile/viewResolution.
      d6ee69e7
    • [SPARK-22476][R] Add dayofweek function to R · 223d83ee
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `dayofweek` to R API:
      
      ```r
      data <- list(list(d = as.Date("2012-12-13")),
                   list(d = as.Date("2013-12-14")),
                   list(d = as.Date("2014-12-15")))
      df <- createDataFrame(data)
      collect(select(df, dayofweek(df$d)))
      ```
      
      ```
        dayofweek(d)
      1            5
      2            7
      3            2
      ```
      
      ## How was this patch tested?
      
      Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19706 from HyukjinKwon/add-dayofweek.
      223d83ee
  11. Nov 10, 2017
  12. Nov 09, 2017
  13. Oct 26, 2017
  14. Oct 11, 2017
    • [SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError... · 655f6f86
      Zhenhua Wang authored
      [SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError and starting from index 0
      
      ## What changes were proposed in this pull request?
      
      Currently percentile_approx never returns the first element when the percentile is in (relativeError, 1/N], where relativeError defaults to 1/10000 and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.
      
      For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.
      
      Based on the paper, targetError should not be rounded up, and the search index should start from 0 instead of 1. Following the paper fixes the cases mentioned above.
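
      A hedged illustration of the expected behavior after the fix (the table and column names are made up for the example):

      ```r
      df <- createDataFrame(data.frame(x = 1:10))
      createOrReplaceTempView(df, "t")
      # with the fix, the 10% percentile of 1..10 is 1 (previously it returned 2)
      head(sql("SELECT percentile_approx(x, 0.1) FROM t"))
      ```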
      
      ## How was this patch tested?
      
      Added a new test case and fixed existing test cases.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #19438 from wzhfy/improve_percentile_approx.
      655f6f86
  15. Oct 05, 2017
    • [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns · ae61f187
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. This becomes a problem when `EnsureRequirements` runs, so `gapply` in R can't work on empty grouping columns.
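
      A hypothetical repro sketch (assuming an empty vector is accepted for the grouping columns, so the whole DataFrame forms a single group):

      ```r
      df <- createDataFrame(mtcars)
      # no grouping columns: previously this failed when EnsureRequirements ran
      result <- gapply(df, c(), function(key, x) { x }, schema(df))
      count(result)
      ```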
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19436 from viirya/fix-flatmapinr-distribution.
      ae61f187
  16. Oct 01, 2017
  17. Sep 25, 2017
    • [SPARK-22100][SQL] Make percentile_approx support date/timestamp type and... · 365a29bd
      Zhenhua Wang authored
      [SPARK-22100][SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type
      
      ## What changes were proposed in this pull request?
      
      The `percentile_approx` function previously accepted numeric input and produced double-typed results.

      But since all numeric types, as well as date and timestamp types, are represented as numerics internally, `percentile_approx` can support them easily.

      After this PR, it supports date, timestamp, and numeric types as input. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
      
      This change is also required when we generate equi-height histograms for these types.
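
      A hedged sketch of the new behavior (names are illustrative); the result now carries the input's type, e.g. a date for a date column:

      ```r
      df <- createDataFrame(data.frame(d = as.Date("2017-01-01") + 0:9))
      createOrReplaceTempView(df, "dates")
      # returns a date (the input type) rather than a double
      head(sql("SELECT percentile_approx(d, 0.5) FROM dates"))
      ```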
      
      ## How was this patch tested?
      
      Added a new test and modified some existing tests.
      
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      
      Closes #19321 from wzhfy/approx_percentile_support_types.
      365a29bd
  18. Sep 21, 2017
    • [SPARK-21780][R] Simpler Dataset.sample API in R · a8d9ec8a
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR makes `sample(...)` able to omit `withReplacement`, defaulting to `FALSE`.
      
      In short, the following examples are allowed:
      
      ```r
      > df <- createDataFrame(as.list(seq(10)))
      > count(sample(df, fraction=0.5, seed=3))
      [1] 4
      > count(sample(df, fraction=1.0))
      [1] 10
      ```
      
      In addition, this PR also adds some type-checking logic, as below:
      
      ```r
      > sample(df, fraction = "a")
      Error in sample(df, fraction = "a") :
        fraction must be numeric; however, got character
      > sample(df, fraction = 1, seed = NULL)
      Error in sample(df, fraction = 1, seed = NULL) :
        seed must not be NULL or NA; however, got NULL
      > sample(df, list(1), 1.0)
      Error in sample(df, list(1), 1) :
        withReplacement must be logical; however, got list
      > sample(df, fraction = -1.0)
      ...
      Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
      ```
      
      ## How was this patch tested?
      
      Manually tested, unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19243 from HyukjinKwon/SPARK-21780.
      a8d9ec8a
  19. Sep 14, 2017
    • [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to... · a28728a9
      goldmedal authored
      [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR
      
      ## What changes were proposed in this pull request?
      In the previous work SPARK-21513, we allowed `MapType` and `ArrayType` of `MapType`s to be converted to a JSON string, but only for the Scala API. In this follow-up PR, we make Spark SQL support it for PySpark and SparkR, too. We also fix some small bugs and comments from the previous work.
      
      ### For PySpark
      ```
      >>> data = [(1, {"name": "Alice"})]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'{"name":"Alice")']
      >>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
      ```
      ### For SparkR
      ```
      # Converts a map into a JSON object
      df2 <- sql("SELECT map('name', 'Bob')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      # Converts an array of maps into a JSON array
      df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      ```
      ## How was this patch tested?
      Add unit test cases.
      
      cc viirya HyukjinKwon
      
      Author: goldmedal <liugs963@gmail.com>
      
      Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
      a28728a9
  20. Sep 03, 2017
    • [SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R · 07fd68a2
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add a wrapper for `unionByName` API to R and Python as well.
      
      **Python**
      
      ```python
      df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
      df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
      df1.unionByName(df2).show()
      ```
      
      ```
      +----+----+----+
      |col0|col1|col2|
      +----+----+----+
      |   1|   2|   3|
      |   6|   4|   5|
      +----+----+----+
      ```
      
      **R**
      
      ```R
      df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
      df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
      head(unionByName(limit(df1, 2), limit(df2, 2)))
      ```
      
      ```
        carb am gear
      1    4  1    4
      2    4  1    4
      3    4  1    4
      4    4  1    4
      ```
      
      ## How was this patch tested?
      
      Doctests for Python and unit test added in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19105 from HyukjinKwon/unionByName-r-python.
      07fd68a2
  21. Aug 29, 2017
  22. Aug 22, 2017
    • [SPARK-21584][SQL][SPARKR] Update R method for summary to call new implementation · 5c9b3017
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM, which includes additional statistics and the ability to select which ones to compute.

      This does not break the current interface, as the present `summary` method does not take additional arguments like `describe`, and the output was never meant to be used programmatically.
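
      A minimal sketch of the updated call (assuming the R wrapper simply forwards statistic names to the new JVM `summary`):

      ```r
      df <- createDataFrame(mtcars)
      # default set of statistics (count, mean, stddev, min, quartiles, max)
      head(summary(df))
      # or select specific statistics to compute
      head(summary(df, "count", "min", "25%", "75%", "max"))
      ```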
      
      ## How was this patch tested?
      
      Modified and additional unit tests.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18786 from aray/summary-r.
      5c9b3017
  23. Aug 06, 2017
  24. Aug 03, 2017
    • [SPARK-21602][R] Add map_keys and map_values functions to R · 97ba4918
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `map_values` and `map_keys` to R API.
      
      ```r
      > df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
      > tmp <- mutate(df, v = create_map(df$model, df$cyl))
      > head(select(tmp, map_keys(tmp$v)))
      ```
      ```
              map_keys(v)
      1         Mazda RX4
      2     Mazda RX4 Wag
      3        Datsun 710
      4    Hornet 4 Drive
      5 Hornet Sportabout
      6           Valiant
      ```
      ```r
      > head(select(tmp, map_values(tmp$v)))
      ```
      ```
        map_values(v)
      1             6
      2             6
      3             4
      4             6
      5             8
      6             6
      ```
      
      ## How was this patch tested?
      
      Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18809 from HyukjinKwon/map-keys-values-r.
      97ba4918
  25. Jul 31, 2017
    • [SPARK-21381][SPARKR] SparkR: pass on setHandleInvalid for classification algorithms · 9570e81a
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      SPARK-20307 added the handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for the other classification algorithms in SparkR.

      This is a follow-up PR for SPARK-20307.
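
      A hedged sketch of the intended usage (assuming the option is exposed as a `handleInvalid` argument on the classifier wrappers, mirroring the tree-based ones; `spark.naiveBayes` is used only as an example):

      ```r
      training <- createDataFrame(iris)
      # "keep" puts unseen string labels/levels into an extra bucket instead of erroring
      model <- spark.naiveBayes(training, Species ~ ., handleInvalid = "keep")
      ```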
      
      ## How was this patch tested?
      
      New unit tests are added.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #18605 from wangmiao1981/class.
      9570e81a
  26. Jul 15, 2017
    • [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both... · 69e5282d
      Yanbo Liang authored
      [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column.
      
      ## What changes were proposed in this pull request?
      `RFormula` should handle invalid values for both the features and label columns.
      #18496 only handled invalid values in the features column. This PR adds handling of invalid values for the label column, plus test cases.
      
      ## How was this patch tested?
      Add test cases.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18613 from yanboliang/spark-20307.
      69e5282d
  27. Jul 13, 2017
    • [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  28. Jul 10, 2017
    • [SPARK-21266][R][PYTHON] Support schema a DDL-formatted string in dapply/gapply/from_json · 2bfd5acc
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR supports a schema given as a DDL-formatted string for `from_json` in R/Python and for `dapply` and `gapply` in R, which are commonly used and/or consistent with the Scala APIs.
      
      Additionally, this PR exposes `structType` in R to allow working around other possible corner cases.
      
      **Python**
      
      `from_json`
      
      ```python
      from pyspark.sql.functions import from_json
      
      data = [(1, '''{"a": 1}''')]
      df = spark.createDataFrame(data, ("key", "value"))
      df.select(from_json(df.value, "a INT").alias("json")).show()
      ```
      
      **R**
      
      `from_json`
      
      ```R
      df <- sql("SELECT named_struct('name', 'Bob') as people")
      df <- mutate(df, people_json = to_json(df$people))
      head(select(df, from_json(df$people_json, "name STRING")))
      ```
      
      `structType.character`
      
      ```R
      structType("a STRING, b INT")
      ```
      
      `dapply`
      
      ```R
      dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
      ```
      
      `gapply`
      
      ```R
      gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
      ```
      
      ## How was this patch tested?
      
      Doc tests for `from_json` in Python and unit tests in `test_sparkSQL.R` in R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18498 from HyukjinKwon/SPARK-21266.
      2bfd5acc
  29. Jul 08, 2017
    • [SPARK-20307][SPARKR] SparkR: pass on setHandleInvalid to spark.mllib... · a7b46c62
      wangmiao1981 authored
      [SPARK-20307][SPARKR] SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
      
      ## What changes were proposed in this pull request?
      
      For the randomForest classifier, if test data contains unseen labels, it will throw an error. StringIndexer already has the handleInvalid logic. This patch adds a new method to set the underlying StringIndexer's handleInvalid logic.

      This approach should also apply to other classifiers. This PR focuses on the main logic and the randomForest classifier. I will do a follow-up PR for the other classifiers.
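
      A hedged sketch of the intended usage (assuming the option surfaces as a `handleInvalid` argument on `spark.randomForest`, mirroring StringIndexer's values):

      ```r
      training <- suppressWarnings(createDataFrame(iris))
      # "keep" maps labels unseen at training time to an extra index instead of erroring
      model <- spark.randomForest(training, Species ~ ., type = "classification",
                                  numTrees = 5, handleInvalid = "keep")
      ```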
      
      ## How was this patch tested?
      
      Add a new unit test based on the error case in the JIRA.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #18496 from wangmiao1981/handle.
      a7b46c62
  30. Jun 28, 2017
    • [SPARK-21224][R] Specify a schema by using a DDL-formatted string when reading in R · db44f5f3
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support a DDL-formatted string as the schema, as below:
      
      ```r
      mockLines <- c("{\"name\":\"Michael\"}",
                     "{\"name\":\"Andy\", \"age\":30}",
                     "{\"name\":\"Justin\", \"age\":19}")
      jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
      writeLines(mockLines, jsonPath)
      df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
      collect(df)
      ```
      
      ## How was this patch tested?
      
      Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18431 from HyukjinKwon/r-ddl-schema.
      db44f5f3
  31. Jun 23, 2017
  32. Jun 21, 2017
  33. Jun 18, 2017
  34. Jun 11, 2017
    • [SPARK-20877][SPARKR][FOLLOWUP] clean up after test move · 9f4ff955
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      clean up after big test move
      
      ## How was this patch tested?
      
      unit tests, jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18267 from felixcheung/rtestset2.
      9f4ff955
    • [SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN · dc4c3518
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Move all existing tests to a non-installed directory so that they will never run when installing the SparkR package
      
      For a follow-up PR:
      - remove all skip_on_cran() calls in tests
      - clean up test timer
      - improve or change basic tests that do run on CRAN (if anyone has suggestions)
      
      It looks like `R CMD build pkg` will still put `pkg/tests` (i.e. the full tests) into the source package, but `R CMD INSTALL` on such a source package does not install these tests (and so `R CMD check` does not run them)
      
      ## How was this patch tested?
      
      - [x] unit tests, Jenkins
      - [x] AppVeyor
      - [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18264 from felixcheung/rtestset.
      dc4c3518