  1. Jul 10, 2017
    • [SPARK-21266][R][PYTHON] Support schema as a DDL-formatted string in dapply/gapply/from_json · 2bfd5acc
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR supports a schema given as a DDL-formatted string for `from_json` in R/Python and for `dapply` and `gapply` in R, which is commonly used and consistent with the Scala APIs.
      
      Additionally, this PR exposes `structType` in R to allow working around other possible corner cases.
      
      **Python**
      
      `from_json`
      
      ```python
      from pyspark.sql.functions import from_json
      
      data = [(1, '''{"a": 1}''')]
      df = spark.createDataFrame(data, ("key", "value"))
      df.select(from_json(df.value, "a INT").alias("json")).show()
      ```
      
      **R**
      
      `from_json`
      
      ```R
      df <- sql("SELECT named_struct('name', 'Bob') as people")
      df <- mutate(df, people_json = to_json(df$people))
      head(select(df, from_json(df$people_json, "name STRING")))
      ```
      
      `structType.character`
      
      ```R
      structType("a STRING, b INT")
      ```
      
      `dapply`
      
      ```R
      dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
      ```
      
      `gapply`
      
      ```R
      gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
      ```
      
      ## How was this patch tested?
      
      Doctests for `from_json` in Python and unit tests in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18498 from HyukjinKwon/SPARK-21266.
  2. Jun 28, 2017
    • [SPARK-21224][R] Specify a schema by using a DDL-formatted string when reading in R · db44f5f3
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support a DDL-formatted string as the schema, as below:
      
      ```r
      mockLines <- c("{\"name\":\"Michael\"}",
                     "{\"name\":\"Andy\", \"age\":30}",
                     "{\"name\":\"Justin\", \"age\":19}")
      jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
      writeLines(mockLines, jsonPath)
      df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
      collect(df)
      ```
      
      ## How was this patch tested?
      
      Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18431 from HyukjinKwon/r-ddl-schema.
  3. Jun 18, 2017
  4. Jun 11, 2017
    • [SPARK-20877][SPARKR][FOLLOWUP] clean up after test move · 9f4ff955
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Clean up after the big test move.
      
      ## How was this patch tested?
      
      unit tests, jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18267 from felixcheung/rtestset2.
    • [SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN · dc4c3518
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Move all existing tests to a non-installed directory so that they never run when the SparkR package is installed.
      
      For a follow-up PR:
      - remove all skip_on_cran() calls in tests
      - clean up test timer
      - improve or change basic tests that do run on CRAN (if anyone has suggestion)
      
      It looks like `R CMD build pkg` will still put `pkg/tests` (i.e. the full tests) into the source package, but `R CMD INSTALL` on such a source package does not install these tests (and so `R CMD check` does not run them).
      
      ## How was this patch tested?
      
      - [x] unit tests, Jenkins
      - [x] AppVeyor
      - [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18264 from felixcheung/rtestset.
  5. May 31, 2017
    • [SPARK-20877][SPARKR][WIP] add timestamps to test runs · 382fefd1
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add timestamps to test runs to investigate how long they take.
      
      ## How was this patch tested?
      
      Jenkins, AppVeyor
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18104 from felixcheung/rtimetest.
  6. May 23, 2017
    • [SPARK-20727] Skip tests that use Hadoop utils on CRAN Windows · d06610f9
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      This change skips tests that use the Hadoop libraries while the CRAN check
      runs on Windows. This handles cases where the Hadoop winutils binaries are
      missing on the target system. The skipped tests (see the sketch after this
      list) consist of:
      1. Tests that save, load a model in MLlib
      2. Tests that save, load CSV, JSON and Parquet files in SQL
      3. Hive tests
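      
      As an illustration of the pattern (a minimal sketch; `skip_if_cran_windows` is a hypothetical helper name, not the exact code in this patch):
      
      ```r
      library(testthat)
      
      # Skip when running a CRAN check on Windows, where the Hadoop
      # winutils binaries may be missing.
      skip_if_cran_windows <- function() {
        on_cran <- !identical(Sys.getenv("NOT_CRAN"), "true")
        on_windows <- .Platform$OS.type == "windows"
        if (on_cran && on_windows) {
          skip("Hadoop winutils may be unavailable on CRAN Windows checks")
        }
      }
      
      test_that("model can be saved and loaded", {
        skip_if_cran_windows()
        expect_true(TRUE)  # placeholder for the Hadoop-dependent test body
      })
      ```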
      
      ## How was this patch tested?
      
      Tested by running on a local Windows VM with HADOOP_HOME unset. Also tested with https://win-builder.r-project.org
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #17966 from shivaram/sparkr-windows-cran.
  7. May 14, 2017
    • [SPARK-20726][SPARKR] wrapper for SQL broadcast · 5a799fd8
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Adds an R wrapper for `o.a.s.sql.functions.broadcast` (usage sketch below).
      - Renames `broadcast` to `broadcast_`.
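      
      A usage sketch (the data here is illustrative):
      
      ```r
      df <- createDataFrame(mtcars)
      small <- agg(groupBy(createDataFrame(mtcars), "cyl"), avg(column("mpg")))
      # Mark the small aggregated table as broadcastable before the join.
      head(join(df, broadcast(small), df$cyl == small$cyl))
      ```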
      
      ## How was this patch tested?
      
      Unit tests, `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17965 from zero323/SPARK-20726.
  8. May 12, 2017
    • [SPARK-20704][SPARKR] change CRAN test to run single thread · 888b84ab
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      - [x] Need to test by running `R CMD check --as-cran`
      - [x] Sanity check vignettes
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17945 from felixcheung/rchangesforpackage.
  9. May 09, 2017
  10. May 08, 2017
  11. May 07, 2017
    • [SPARK-20550][SPARKR] R wrapper for Dataset.alias · 1f73d358
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add a SparkR wrapper for `Dataset.alias` (usage sketch below).
      - Adjust roxygen annotations for `functions.alias` (including example usage).
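      
      A usage sketch of both forms:
      
      ```r
      df <- createDataFrame(mtcars)
      # alias on a Column (existing behavior)
      head(select(df, alias(df$mpg, "miles_per_gallon")))
      # alias on a SparkDataFrame (added here), handy for self-joins
      df2 <- alias(df, "cars")
      ```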
      
      ## How was this patch tested?
      
      Unit tests, `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17825 from zero323/SPARK-20550.
    • [SPARK-20543][SPARKR][FOLLOWUP] Don't skip tests on AppVeyor · 7087e011
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add an environment variable so tests are not skipped on AppVeyor.
      
      ## How was this patch tested?
      
      Wait for the AppVeyor run.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17878 from felixcheung/appveyorrcran.
  12. May 04, 2017
    • [SPARK-20544][SPARKR] R wrapper for input_file_name · f21897fc
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds wrapper for `o.a.s.sql.functions.input_file_name`
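      
      A minimal sketch (the file path is illustrative):
      
      ```r
      jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".json")
      writeLines("{\"name\":\"Bob\"}", jsonPath)
      df <- read.df(jsonPath, "json")
      # Each row carries the name of the file it was read from.
      head(select(df, input_file_name()))
      ```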
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests, `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17818 from zero323/SPARK-20544.
    • [SPARK-20585][SPARKR] R generic hint support · 9c36aa27
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds support for generic hints on `SparkDataFrame`
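      
      A usage sketch with the broadcast hint:
      
      ```r
      df <- createDataFrame(mtcars)
      # Attach a broadcast hint to the right-hand side of the join.
      small <- hint(createDataFrame(mtcars), "broadcast")
      head(join(df, small, df$cyl == small$cyl))
      ```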
      
      ## How was this patch tested?
      
      Unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17851 from zero323/SPARK-20585.
  13. May 03, 2017
    • [SPARK-20543][SPARKR] skip tests when running on CRAN · fc472bdd
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      General rule on whether to skip or not (see the sketch after this list): skip if
      - RDD tests
      - tests that could run long or are complicated (streaming, HiveContext)
      - tests of error conditions
      - tests that are unlikely to change/break
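      
      With testthat, each such test is guarded as in this sketch (`skip_on_cran()` honors the `NOT_CRAN` environment variable):
      
      ```r
      library(testthat)
      
      test_that("structured streaming query runs", {
        skip_on_cran()     # skipped on CRAN; still runs on Jenkins/AppVeyor
        expect_true(TRUE)  # placeholder for the long-running test body
      })
      ```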
      
      ## How was this patch tested?
      
      unit tests, `R CMD check --as-cran`, `R CMD check`
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17817 from felixcheung/rskiptest.
  14. May 01, 2017
    • [SPARK-20532][SPARKR] Implement grouping and grouping_id · 90d77e97
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds R wrappers for:
      
      - `o.a.s.sql.functions.grouping` as `is_grouping` (to avoid masking `base::grouping`)
      - `o.a.s.sql.functions.grouping_id`
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests. `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17807 from zero323/SPARK-20532.
    • [SPARK-20490][SPARKR] Add R wrappers for eqNullSafe and ! / not · 80e9cf1b
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add the null-safe equality operator `%<=>%` (same as `o.a.s.sql.Column.eqNullSafe`, `o.a.s.sql.Column.<=>`).
      - Add the boolean negation operator `!` and the function `not` (usage sketch below).
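      
      A short sketch of both additions:
      
      ```r
      df <- createDataFrame(data.frame(x = c(1, NA), y = c(1, 2)))
      # %<=>% is null-safe: comparing NA (null) yields FALSE rather than null.
      head(select(df, df$x %<=>% df$y, !(df$x == df$y)))
      ```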
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests, `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17783 from zero323/SPARK-20490.
  15. Apr 30, 2017
    • [SPARK-20535][SPARKR] R wrappers for explode_outer and posexplode_outer · ae3df4e9
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add R wrappers for the following (usage sketch after this list):
      
      - `o.a.s.sql.functions.explode_outer`
      - `o.a.s.sql.functions.posexplode_outer`
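      
      For example, `explode_outer` keeps rows whose array is null, yielding a NULL value instead of dropping the row (a minimal sketch):
      
      ```r
      df <- sql(paste0("SELECT 1 AS id, array(1, 2) AS v ",
                       "UNION ALL SELECT 2, cast(null AS array<int>)"))
      # explode() would drop id = 2; explode_outer() keeps it with NULL.
      head(select(df, df$id, explode_outer(df$v)))
      ```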
      
      ## How was this patch tested?
      
      Additional unit tests, manual testing.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17809 from zero323/SPARK-20535.
  16. Apr 29, 2017
    • [SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R · 70f1bcd7
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems we are using `SQLUtils.getSQLDataType` for the type string in `structField`. It looks like we can replace this with `CatalystSqlParser.parseDataType`.
      
      They accept similar DDL-like type definitions, as below:
      
      ```scala
      scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
      ```
      ```
      +---+
      | _1|
      +---+
      |[a]|
      +---+
      ```
      
      ```scala
      scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
      ```
      ```
      +---+
      | _1|
      +---+
      |[a]|
      +---+
      ```
      
      Such type strings look identical to R's, as below:
      
      ```R
      > write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
      > collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
        struct
      1      a
      ```
      
      R's is stricter because we check the types via regular expressions on the R side ahead of time.
      
      The actual logic looks a bit different, but as we check it ahead on the R side, replacing it should not (I think) introduce any behaviour changes. To make sure, tests dedicated to this were added in SPARK-20105. (It looks like `structField` is the only place that calls this method.)
      
      ## How was this patch tested?
      
      Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17785 from HyukjinKwon/SPARK-20493.
  17. Apr 26, 2017
    • [SPARK-20437][R] R wrappers for rollup and cube · df58a95a
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add `rollup` and `cube` methods and corresponding generics (usage sketch after this list).
      - Add short description to the vignette.
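      
      A usage sketch:
      
      ```r
      df <- createDataFrame(mtcars)
      # Aggregates at the (cyl, gear), (cyl), and grand-total levels.
      head(agg(rollup(df, "cyl", "gear"), avg(df$mpg)))
      # cube() additionally includes the (gear) level.
      head(agg(cube(df, "cyl", "gear"), avg(df$mpg)))
      ```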
      
      ## How was this patch tested?
      
      - Existing unit tests.
      - Additional unit tests covering new features.
      - `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17728 from zero323/SPARK-20437.
  18. Apr 24, 2017
    • [SPARK-20438][R] SparkR wrappers for split and repeat · 8a272ddc
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add wrappers for `o.a.s.sql.functions` (usage sketch after this list):
      
      - `split` as `split_string`
      - `repeat` as `repeat_string`
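      
      A usage sketch:
      
      ```r
      df <- createDataFrame(data.frame(s = "a,b,c", stringsAsFactors = FALSE))
      # Split on a regular expression; repeat the string twice.
      head(select(df, split_string(df$s, ","), repeat_string(df$s, 2)))
      ```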
      
      ## How was this patch tested?
      
      Existing tests, additional unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17729 from zero323/SPARK-20438.
  19. Apr 21, 2017
    • [SPARK-20371][R] Add wrappers for collect_list and collect_set · fd648bff
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds wrappers for `collect_list` and `collect_set`.
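      
      A usage sketch:
      
      ```r
      df <- createDataFrame(data.frame(g = c(1, 1, 2), v = c(1, 1, 3)))
      # collect_list keeps duplicates; collect_set de-duplicates.
      head(agg(groupBy(df, "g"), collect_list(df$v), collect_set(df$v)))
      ```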
      
      ## How was this patch tested?
      
      Unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17672 from zero323/SPARK-20371.
  20. Apr 19, 2017
    • [SPARK-20375][R] R wrappers for array and map · 46c57497
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds wrappers for `o.a.s.sql.functions.array` and `o.a.s.sql.functions.map`
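      
      The R-side names avoid masking base functions; assuming the wrappers landed as `create_array` and `create_map` (as in the SparkR function reference), usage looks like this sketch:
      
      ```r
      df <- createDataFrame(data.frame(a = 1, b = 2))
      # Combine columns into an array column and a map column.
      head(select(df, create_array(df$a, df$b), create_map(lit("a"), df$a)))
      ```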
      
      ## How was this patch tested?
      
      Unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17674 from zero323/SPARK-20375.
  21. Apr 17, 2017
    • [SPARK-19828][R][FOLLOWUP] Rename asJsonArray to as.json.array in from_json function in R · 24f09b39
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This was suggested to be `as.json.array` in the first place in the PR for SPARK-19828, but we could not do this because the lint check emitted an error for multiple dots in variable names.
      
      After SPARK-20278, we are now able to use `multiple.dots.in.names`. `asJsonArray` in the `from_json` function can still be changed, as 2.2 is not released yet.
      
      So, this PR proposes to rename `asJsonArray` to `as.json.array`.
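      
      After the rename, parsing a JSON array of structs looks like this sketch (the schema is built with `structType`/`structField`):
      
      ```r
      df <- as.DataFrame(list(list(people = "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]")))
      schema <- structType(structField("name", "string"))
      # as.json.array = TRUE parses the string as an array of the given struct.
      head(select(df, from_json(df$people, schema, as.json.array = TRUE)))
      ```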
      
      ## How was this patch tested?
      
      Jenkins tests, local tests with `./R/run-tests.sh` and manual `./dev/lint-r`. Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17653 from HyukjinKwon/SPARK-19828-followup.
  22. Apr 12, 2017
  23. Apr 06, 2017
  24. Apr 02, 2017
  25. Mar 27, 2017
  26. Mar 20, 2017
    • [SPARK-19949][SQL] unify bad record handling in CSV and JSON · 68d65fae
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently JSON and CSV have exactly the same logic for handling bad records. This PR tries to abstract it and put it at an upper level to reduce code duplication.
      
      The overall idea is that we make the JSON and CSV parsers throw a BadRecordException; the upper level, FailureSafeParser, then handles bad records according to the parse mode.
      
      Behavior changes (see the sketch after this list):
      1. With PERMISSIVE mode, if the number of tokens doesn't match the schema, the CSV parser previously treated the record as legal and parsed as many tokens as possible. After this PR, we treat it as an illegal record and put the raw record string in a special column, but we still parse as many tokens as possible.
      2. All logging is removed, as it is not very useful in practice.
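      
      For instance, reading malformed CSV in PERMISSIVE mode from R might look like this sketch (the path and column names are illustrative; the DDL-string schema relies on SPARK-21224 above):
      
      ```r
      csvPath <- tempfile(pattern = "sparkr-test", fileext = ".csv")
      writeLines(c("1,2", "bad"), csvPath)
      df <- read.df(csvPath, "csv", "a INT, b INT, _corrupt STRING",
                    mode = "PERMISSIVE", columnNameOfCorruptRecord = "_corrupt")
      # The malformed row survives, with its raw text in the _corrupt column.
      collect(df)
      ```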
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Wenchen Fan <cloud0fan@gmail.com>
      
      Closes #17315 from cloud-fan/bad-record2.
    • [SPARK-20020][SPARKR] DataFrame checkpoint API · c4059772
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add checkpoint, setCheckpointDir API to R
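      
      A minimal sketch of the new API:
      
      ```r
      setCheckpointDir(file.path(tempdir(), "checkpoints"))
      df <- createDataFrame(mtcars)
      # Truncates the lineage by materializing df under the checkpoint dir.
      df2 <- checkpoint(df)
      ```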
      
      ## How was this patch tested?
      
      unit tests, manual tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17351 from felixcheung/rdfcheckpoint.
    • [SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array · 0cdcf911
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support an array of struct type in `to_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      
      val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
      df.select(to_json($"a").as("json")).show()
      ```
      
      ```
      +----------+
      |      json|
      +----------+
      |[{"_1":1}]|
      +----------+
      ```
      
      Currently, it throws an exception as below (a newline manually inserted for readability):
      
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
      mismatch: structtojson requires that the expression is a struct expression.;;
      ```
      
      This allows the roundtrip with `from_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
      df.show()
      
      // Read back.
      df.select(to_json($"array").as("json")).show()
      ```
      
      ```
      +----------+
      |     array|
      +----------+
      |[[1], [2]]|
      +----------+
      
      +-----------------+
      |             json|
      +-----------------+
      |[{"a":1},{"a":2}]|
      +-----------------+
      ```
      
      Also, this PR proposes to rename `StructToJson` to `StructsToJson` and `JsonToStruct` to `JsonToStructs`.
      
      ## How was this patch tested?
      
      Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17192 from HyukjinKwon/SPARK-19849.
  27. Mar 19, 2017
    • [SPARK-18817][SPARKR][SQL] change derby log output to temp dir · 422aa67d
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Passes R's `tempdir()` (the R session temp dir, shared with other temp files/dirs) to the JVM and sets the system property for the Derby home dir, so that `derby.log` is moved there.
      
      ## How was this patch tested?
      
      Manually, unit tests
      
      With this, the Derby files are relocated under `/tmp`:
      ```
      # ls /tmp/RtmpG2M0cB/
      derby.log
      ```
      They are removed automatically when the R session ends.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16330 from felixcheung/rderby.
  28. Mar 14, 2017
    • [SPARK-19828][R] Support array type in from_json in R · d1f6c64c
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Since we cannot directly define an array type in R, this PR proposes to support array types in R via the DDL-like type strings used in `structField`, as below:
      
      ```R
      jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
      df <- as.DataFrame(list(list("people" = jsonArr)))
      collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol")))
      ```
      
      prints
      
      ```R
            arrcol
      1 Bob, Alice
      ```
      
      ## How was this patch tested?
      
      Unit tests in `test_sparkSQL.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17178 from HyukjinKwon/SPARK-19828.
  29. Mar 08, 2017
    • [SPARK-19601][SQL] Fix CollapseRepartition rule to preserve shuffle-enabled Repartition · 9a6ac722
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      As observed by felixcheung in https://github.com/apache/spark/pull/16739, when users use the shuffle-enabled `repartition` API, they expect the number of partitions to be exactly the number they provided, even if they call the shuffle-disabled `coalesce` later.
      
      Currently, the `CollapseRepartition` rule does not consider whether shuffle is enabled. Thus, we get the following unexpected results.
      
      ```Scala
          val df = spark.range(0, 10000, 1, 5)
          val df2 = df.repartition(10)
          assert(df2.coalesce(13).rdd.getNumPartitions == 5)
          assert(df2.coalesce(7).rdd.getNumPartitions == 5)
          assert(df2.coalesce(3).rdd.getNumPartitions == 3)
      ```
      
      This PR is to fix the issue. We preserve shuffle-enabled Repartition.
      
      ### How was this patch tested?
      Added a test case
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #16933 from gatorsmile/CollapseRepartition.
  30. Mar 06, 2017
  31. Mar 05, 2017
    • [SPARK-19795][SPARKR] add column functions to_json, from_json · 80d5338b
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add column functions: to_json, from_json, and tests covering error cases.
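      
      A round-trip sketch (the schema is given as a `structType`, since DDL strings only arrived later with SPARK-21266 above):
      
      ```r
      df <- sql("SELECT named_struct('name', 'Bob') AS people")
      df <- mutate(df, people_json = to_json(df$people))
      schema <- structType(structField("name", "string"))
      head(select(df, from_json(df$people_json, schema)))
      ```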
      
      ## How was this patch tested?
      
      unit tests, manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17134 from felixcheung/rtojson.
  32. Mar 01, 2017
    • [DOC][MINOR][SPARKR] Update SparkR doc for names, columns and colnames · 2ff1467d
      actuaryzhang authored
      Update the R doc (see the examples after this list):
      1. `columns`, `names` and `colnames` return a vector of strings, not a **list** as stated in the current doc.
      2. `colnames<-` does allow subset assignment, so the length of `value` can be less than the number of columns, e.g., `colnames(df)[1] <- "a"`.
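      
      For example:
      
      ```r
      df <- createDataFrame(mtcars)
      is.character(columns(df))  # TRUE: a character vector, not a list
      colnames(df)[1] <- "a"     # subset assignment is allowed
      ```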
      
      felixcheung
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #17115 from actuaryzhang/sparkRMinorDoc.
  33. Feb 23, 2017
    • [SPARK-19682][SPARKR] Issue warning (or error) when subset method "[[" takes vector index · 7bf09433
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      The `[[` method is supposed to take a single index and return a column. This is different from base R, which takes a vector index. We should check for this and issue a warning or error when a vector index is supplied (which is very likely, given the behavior in base R).
      
      Currently I'm issuing a warning message and just taking the first element of the vector index. We could change this to an error if that's better.
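      
      A sketch of the behavior:
      
      ```r
      df <- createDataFrame(mtcars)
      df[["mpg"]]    # a single Column, as intended
      df[[c(1, 2)]]  # warns and uses only the first index
      ```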
      
      ## How was this patch tested?
      new tests
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #17017 from actuaryzhang/sparkRSubsetter.