  1. Feb 06, 2018
  2. Jan 31, 2018
  3. Jan 30, 2018
    • [SPARK-23157][SQL] Explain restriction on column expression in withColumn() · 8b983243
      Henry Robinson authored
      ## What changes were proposed in this pull request?
      
      It's not obvious from the comments that any added column must be a
      function of the dataset that we are adding it to. Add a comment to
      that effect to Scala, Python and R Data* methods.
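
      A minimal SparkR sketch of the restriction being documented (column names are illustrative; the same applies to the Scala and Python APIs): the added column must be an expression over the DataFrame it is being added to.

      ```r
      df <- createDataFrame(iris)
      # OK: the new column is derived from df's own columns
      df2 <- withColumn(df, "doubled", df$Sepal_Length * 2)
      # Not allowed: a column taken from a different DataFrame
      # other <- createDataFrame(mtcars)
      # withColumn(df, "bad", other$mpg)
      ```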
      
      Author: Henry Robinson <henry@cloudera.com>
      
      Closes #20429 from henryr/SPARK-23157.
      8b983243
  4. Jan 24, 2018
  5. Jan 17, 2018
    • [SPARK-23062][SQL] Improve EXCEPT documentation · 1f3d933e
      Henry Robinson authored
      ## What changes were proposed in this pull request?
      
      Make the default behavior of EXCEPT (i.e. EXCEPT DISTINCT) more
      explicit in the documentation, and call out the change in behavior
      from 1.x.
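
      A hedged SparkR sketch of the default behavior being documented (view names are illustrative): plain `EXCEPT` means `EXCEPT DISTINCT`, so duplicate rows on the left side collapse to one.

      ```r
      sql("CREATE OR REPLACE TEMPORARY VIEW t1 AS SELECT * FROM VALUES (1), (1), (2) AS t(c)")
      sql("CREATE OR REPLACE TEMPORARY VIEW t2 AS SELECT * FROM VALUES (2) AS t(c)")
      collect(sql("SELECT c FROM t1 EXCEPT SELECT c FROM t2"))  # a single row with c = 1
      ```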
      
      Author: Henry Robinson <henry@cloudera.com>
      
      Closes #20254 from henryr/spark-23062.
      1f3d933e
  6. Jan 16, 2018
  7. Jan 14, 2018
  8. Jan 12, 2018
  9. Jan 11, 2018
  10. Jan 10, 2018
    • [SPARK-22993][ML] Clarify HasCheckpointInterval param doc · 70bcc9d5
      sethah authored
      ## What changes were proposed in this pull request?
      
      Add a note to the `HasCheckpointInterval` parameter doc that clarifies that this setting is ignored when no checkpoint directory has been set on the spark context.
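
      As a hedged SparkR-side illustration of the clarified behavior (the doc change itself is in the Scala ML params; the path, data frame name and parameters here are hypothetical): the interval only has an effect once a checkpoint directory is set.

      ```r
      setCheckpointDir("/tmp/spark-checkpoints")  # without this, checkpointInterval is silently ignored
      model <- spark.als(ratings, "rating", "user", "item", checkpointInterval = 5)
      ```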
      
      ## How was this patch tested?
      
      No tests necessary, just a doc update.
      
      Author: sethah <shendrickson@cloudera.com>
      
      Closes #20188 from sethah/als_checkpoint_doc.
      70bcc9d5
  11. Jan 09, 2018
  12. Jan 03, 2018
  13. Jan 01, 2018
  14. Dec 30, 2017
    • [SPARK-22924][SPARKR] R API for sortWithinPartitions · ea0a5eef
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add to `arrange` the option to sort only within each partition
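
      A hedged sketch of the new option (assuming the argument is named `withinPartitions`, per this PR's title; the data frame is illustrative):

      ```r
      df <- createDataFrame(mtcars)
      sortedWithin <- arrange(df, "mpg", withinPartitions = TRUE)
      ```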
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #20118 from felixcheung/rsortwithinpartition.
      ea0a5eef
    • [SPARK-22771][SQL] Concatenate binary inputs into a binary output · f2b3525c
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR modifies `concat` to concatenate binary inputs into a single binary output.
      `concat` in the current master always outputs data as a string. But in some databases (e.g., PostgreSQL), if all inputs are binary, `concat` also outputs binary.
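
      A hedged sketch of the new behavior from SparkR (the casts are illustrative): when every argument is binary, the output column is binary rather than string.

      ```r
      binDF <- sql("SELECT concat(cast('ab' AS BINARY), cast('cd' AS BINARY)) AS c")
      printSchema(binDF)  # c should now be binary rather than string
      ```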
      
      ## How was this patch tested?
      Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #19977 from maropu/SPARK-22771.
      f2b3525c
  15. Dec 29, 2017
    • [SPARK-22920][SPARKR] sql functions for current_date, current_timestamp,... · 66a7d6b3
      Felix Cheung authored
      [SPARK-22920][SPARKR] sql functions for current_date, current_timestamp, rtrim/ltrim/trim with trimString
      
      ## What changes were proposed in this pull request?
      
      Add SQL functions for `current_date`, `current_timestamp`, and `rtrim`/`ltrim`/`trim` with a `trimString` argument
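
      A hedged sketch of how the new wrappers might be used (the data frame is illustrative; `trimString` is the new optional argument):

      ```r
      df <- createDataFrame(data.frame(x = "xxhelloxx", stringsAsFactors = FALSE))
      head(select(df, current_date(), current_timestamp()))
      head(select(df, trim(df$x, "x"), ltrim(df$x, "x"), rtrim(df$x, "x")))
      ```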
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #20105 from felixcheung/rsqlfuncs.
      66a7d6b3
  16. Dec 28, 2017
    • [SPARK-21208][R] Adds setLocalProperty and getLocalProperty in R · 1eebfbe1
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `setLocalProperty` and `getLocalProperty`in R.
      
      ```R
      > df <- createDataFrame(iris)
      > setLocalProperty("spark.job.description", "Hello world!")
      > count(df)
      > setLocalProperty("spark.job.description", "Hi !!")
      > count(df)
      ```
      
      (Screenshot of the Spark UI showing the job descriptions set above.)
      
      ```R
      > print(getLocalProperty("spark.job.description"))
      NULL
      > setLocalProperty("spark.job.description", "Hello world!")
      > print(getLocalProperty("spark.job.description"))
      [1] "Hello world!"
      > setLocalProperty("spark.job.description", "Hi !!")
      > print(getLocalProperty("spark.job.description"))
      [1] "Hi !!"
      ```
      
      ## How was this patch tested?
      
      Manually tested and a test in `R/pkg/tests/fulltests/test_context.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #20075 from HyukjinKwon/SPARK-21208.
      1eebfbe1
    • [SPARK-22843][R] Adds localCheckpoint in R · 76e8a1d7
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add `localCheckpoint(..)` in R API.
      
      ```r
      df <- localCheckpoint(createDataFrame(iris))
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #20073 from HyukjinKwon/SPARK-22843.
      76e8a1d7
  17. Dec 23, 2017
    • [SPARK-22889][SPARKR] Set overwrite=T when install SparkR in tests · 1219d7a4
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      Since all CRAN checks go through the same machine, if there is an older partial download or partial install of Spark left behind, the tests fail. This PR overwrites the install files when running tests. This shouldn't affect Jenkins as `SPARK_HOME` is set when running Jenkins tests.
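
      For reference, a hedged sketch of the corresponding public API (the actual change lives in the test setup): `install.spark()` exposes an `overwrite` flag that forces a clean re-download and re-install.

      ```r
      library(SparkR)
      install.spark(overwrite = TRUE)  # discard any stale partial download or install
      ```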
      
      ## How was this patch tested?
      
      Test manually by running `R CMD check --as-cran`
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #20060 from shivaram/sparkr-overwrite-cran.
      1219d7a4
    • [SPARK-22844][R] Adds date_trunc in R API · aeb45df6
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `date_trunc` in R API as below:
      
      ```r
      > df <- createDataFrame(list(list(a = as.POSIXlt("2012-12-13 12:34:00"))))
      > head(select(df, date_trunc("hour", df$a)))
        date_trunc(hour, a)
      1 2012-12-13 12:00:00
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #20031 from HyukjinKwon/r-datetrunc.
      aeb45df6
  18. Nov 26, 2017
  19. Nov 12, 2017
    • [SPARK-21693][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build · 3d90b2cb
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to reduce the max iterations in the Linear SVM test in SparkR. This particular test takes roughly 5 minutes on my Mac and over 20 minutes on Windows.
      
      The root cause appears to be that the default of 100 max iterations triggers roughly 2,500 jobs. On Linux, `daemon.R` is forked, but on Windows a new process is launched instead, which is extremely slow.
      
      So, based on my observation, the many non-forked processes launched on Windows account for the difference in elapsed time.
      
      After reducing the max iterations to 10, the total number of jobs in this single test drops to roughly 550; reducing it to 5 drops the total to roughly 360.
      
      ## How was this patch tested?
      
      Manually tested the elapsed times.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19722 from HyukjinKwon/SPARK-21693-test.
      3d90b2cb
  20. Nov 11, 2017
    • [SPARK-22488][SQL] Fix the view resolution issue in the SparkSession internal table() API · d6ee69e7
      gatorsmile authored
      ## What changes were proposed in this pull request?
      The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls the `sessionState.catalog.lookupRelation` API. This skips the view resolution logic in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands as well as other public and internal APIs.
      
      Users might get a strange error caused by view resolution when the default database is different.
      ```
      Table or view not found: t1; line 1 pos 14
      org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14
      	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      ```
      
      This PR fixes this by enforcing the use of `ResolveRelations` to resolve the table.
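
      A hedged sketch of the scenario described above (database, table and view names are illustrative; which DDL commands hit the internal `table()` path is not spelled out here):

      ```r
      sql("CREATE DATABASE IF NOT EXISTS db1")
      sql("USE db1")
      sql("CREATE TABLE IF NOT EXISTS t1 (id INT) USING parquet")
      sql("CREATE VIEW IF NOT EXISTS v1 AS SELECT * FROM t1")
      sql("USE default")
      # Commands that resolve db1.v1 through the internal table() API could
      # previously fail with 'Table or view not found: t1'.
      ```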
      
      ## How was this patch tested?
      Added a test case and modified the existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19713 from gatorsmile/viewResolution.
      d6ee69e7
    • [SPARK-22476][R] Add dayofweek function to R · 223d83ee
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `dayofweek` to R API:
      
      ```r
      data <- list(list(d = as.Date("2012-12-13")),
                   list(d = as.Date("2013-12-14")),
                   list(d = as.Date("2014-12-15")))
      df <- createDataFrame(data)
      collect(select(df, dayofweek(df$d)))
      ```
      
      ```
        dayofweek(d)
      1            5
      2            7
      3            2
      ```
      
      ## How was this patch tested?
      
      Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19706 from HyukjinKwon/add-dayofweek.
      223d83ee
  21. Nov 10, 2017
  22. Nov 09, 2017
  23. Nov 07, 2017
    • [SPARK-22281][SPARKR] Handle R method breaking signature changes · 2ca5aae4
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      This is to fix the code for the latest R changes in R-devel, when running CRAN check
      ```
      checking for code/documentation mismatches ... WARNING
      Codoc mismatches from documentation object 'attach':
      attach
      Code: function(what, pos = 2L, name = deparse(substitute(what),
      backtick = FALSE), warn.conflicts = TRUE)
      Docs: function(what, pos = 2L, name = deparse(substitute(what)),
      warn.conflicts = TRUE)
      Mismatches in argument default values:
      Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: deparse(substitute(what))
      
      Codoc mismatches from documentation object 'glm':
      glm
      Code: function(formula, family = gaussian, data, weights, subset,
      na.action, start = NULL, etastart, mustart, offset,
      control = list(...), model = TRUE, method = "glm.fit",
      x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
      NULL, ...)
      Docs: function(formula, family = gaussian, data, weights, subset,
      na.action, start = NULL, etastart, mustart, offset,
      control = list(...), model = TRUE, method = "glm.fit",
      x = FALSE, y = TRUE, contrasts = NULL, ...)
      Argument names in code not in docs:
      singular.ok
      Mismatches in argument names:
      Position: 16 Code: singular.ok Docs: contrasts
      Position: 17 Code: contrasts Docs: ...
      ```
      
      With `attach`, we're pulling in the function definition from `base::attach`. We need to disable that, but we still need a function signature for roxygen2 to build with.
      
      With `glm`, we're pulling in the function definition (i.e. the "usage") from `stats::glm`. Since this is "compiled in" when we build the source package into the .Rd file, if it changes at runtime or in the CRAN check it won't match the latest signature. The solution is to stop pulling in from `stats::glm`, since there isn't much value in doing so (we don't actually use most of its parameters, and the ones we do use are explicitly documented).
      
      Also, with `attach` we are changing to call it dynamically.
      
      ## How was this patch tested?
      
      Manually.
      - [x] check documentation output - yes
      - [x] check help `?attach` `?glm` - yes
      - [x] check on other platforms, r-hub, on r-devel etc..
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #19557 from felixcheung/rattachglmdocerror.
      2ca5aae4
  24. Nov 06, 2017
    • [SPARK-22315][SPARKR] Warn if SparkR package version doesn't match SparkContext · 65a8bf60
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      This PR adds a check between the R package version used and the version reported by the SparkContext running in the JVM. The goal here is to warn users when they have an R package downloaded from CRAN and are using it to connect to an existing Spark cluster.
      
      This is raised as a warning rather than an error as users might want to use patch versions interchangeably (e.g., 2.1.3 with 2.1.2 etc.)
      
      ## How was this patch tested?
      
      Manually by changing the `DESCRIPTION` file
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #19624 from shivaram/sparkr-version-check.
      65a8bf60
  25. Oct 30, 2017
    • [SPARK-22327][SPARKR][TEST] check for version warning · ded3ed97
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add a test for the version warning. Will need to port this to branch-1.6, -2.0, -2.1 and -2.2.
      
      ## How was this patch tested?
      
      manually
      Jenkins, AppVeyor
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #19549 from felixcheung/rcranversioncheck.
      ded3ed97
  26. Oct 29, 2017
    • [SPARK-22344][SPARKR] Set java.io.tmpdir for SparkR tests · 1fe27612
      Shivaram Venkataraman authored
      This PR sets `java.io.tmpdir` for CRAN checks and also disables hsperfdata for the JVM when running CRAN checks. Together these prevent files from being left behind in `/tmp`.
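
      A hedged sketch of the idea (the actual change lives in the CRAN test setup; the config values here are illustrative):

      ```r
      tmp <- tempdir()
      sparkR.session(sparkConfig = list(
        spark.driver.extraJavaOptions = paste0("-Djava.io.tmpdir=", tmp, " -XX:-UsePerfData")))
      ```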
      
      ## How was this patch tested?
      Tested manually on a clean EC2 machine
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #19589 from shivaram/sparkr-tmpdir-clean.
      1fe27612
  27. Oct 26, 2017
  28. Oct 11, 2017
    • [SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError... · 655f6f86
      Zhenhua Wang authored
      [SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError and starting from index 0
      
      ## What changes were proposed in this pull request?
      
      Currently percentile_approx never returns the first element when the percentile is in (relativeError, 1/N], where relativeError defaults to 1/10000 and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.
      
      For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.
      
      Based on the paper, targetError should not be rounded up, and the search index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.
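
      A hedged SparkR sketch of the example above (the view name is illustrative):

      ```r
      df <- createDataFrame(data.frame(v = 1:10))
      createOrReplaceTempView(df, "tbl")
      collect(sql("SELECT percentile_approx(v, 0.1) FROM tbl"))  # expected to be 1 after this fix
      ```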
      
      ## How was this patch tested?
      
      Added a new test case and fixed existing test cases.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #19438 from wzhfy/improve_percentile_approx.
      655f6f86
  29. Oct 05, 2017
    • [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns · ae61f187
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. This becomes a problem when running `EnsureRequirements`, so `gapply` in R can't work on empty grouping columns.
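
      A hedged sketch of the case this fixes (assuming empty grouping columns can be passed as `c()`; the schema and function are illustrative): `gapply` with no grouping columns should treat the whole SparkDataFrame as a single group.

      ```r
      df <- createDataFrame(data.frame(x = 1:3))
      counts <- gapply(df, c(),
                       function(key, g) data.frame(n = nrow(g)),
                       structType(structField("n", "integer")))
      collect(counts)  # one row: n = 3
      ```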
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19436 from viirya/fix-flatmapinr-distribution.
      ae61f187
  30. Oct 02, 2017
    • [SPARK-22167][R][BUILD] sparkr packaging issue allow zinc · 8fab7995
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      When zinc is running, the working directory might be the root of the project. A quick solution is to not go a level up in case we are in the root rather than in root/core/. If we are in the root everything works fine; if we are in core, a script is added which goes up a level and runs from there.
      
      ## How was this patch tested?
      
      set -x in the SparkR install scripts.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc.
      8fab7995
  31. Oct 01, 2017
  32. Sep 25, 2017
    • [SPARK-22100][SQL] Make percentile_approx support date/timestamp type and... · 365a29bd
      Zhenhua Wang authored
      [SPARK-22100][SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type
      
      ## What changes were proposed in this pull request?
      
      The `percentile_approx` function previously accepted numeric type input and produced double type results.
      
      But since all numeric types, as well as date and timestamp types, are represented as numerics internally, `percentile_approx` can support them easily.
      
      After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
      
      This change is also required when we generate equi-height histograms for these types.
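
      A hedged SparkR sketch of the widened support (the view name is illustrative): the result type now matches the input type, e.g. a date for date input.

      ```r
      df <- createDataFrame(data.frame(d = as.Date("2017-01-01") + 0:9))
      createOrReplaceTempView(df, "dates")
      collect(sql("SELECT percentile_approx(d, 0.5) FROM dates"))  # returns a date, not a double
      ```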
      
      ## How was this patch tested?
      
      Added a new test and modified some existing tests.
      
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      
      Closes #19321 from wzhfy/approx_percentile_support_types.
      365a29bd
  33. Sep 21, 2017
    • [SPARK-21780][R] Simpler Dataset.sample API in R · a8d9ec8a
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR makes `sample(...)` able to omit `withReplacement`, defaulting it to `FALSE`.
      
      In short, the following examples are allowed:
      
      ```r
      > df <- createDataFrame(as.list(seq(10)))
      > count(sample(df, fraction=0.5, seed=3))
      [1] 4
      > count(sample(df, fraction=1.0))
      [1] 10
      ```
      
      In addition, this PR also adds some type-checking logic, as below:
      
      ```r
      > sample(df, fraction = "a")
      Error in sample(df, fraction = "a") :
        fraction must be numeric; however, got character
      > sample(df, fraction = 1, seed = NULL)
      Error in sample(df, fraction = 1, seed = NULL) :
        seed must not be NULL or NA; however, got NULL
      > sample(df, list(1), 1.0)
      Error in sample(df, list(1), 1) :
        withReplacement must be logical; however, got list
      > sample(df, fraction = -1.0)
      ...
      Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
      ```
      
      ## How was this patch tested?
      
      Manually tested, unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19243 from HyukjinKwon/SPARK-21780.
      a8d9ec8a
  34. Sep 20, 2017
  35. Sep 14, 2017
    • [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to... · a28728a9
      goldmedal authored
      [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR
      
      ## What changes were proposed in this pull request?
      In the previous work SPARK-21513, we allowed `MapType` and `ArrayType` of `MapType`s to be converted to a JSON string, but only for the Scala API. In this follow-up PR, we make Spark SQL support it for PySpark and SparkR, too. We also fix some small bugs and comments from the previous work in this follow-up PR.
      
      ### For PySpark
      ```
      >>> data = [(1, {"name": "Alice"})]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'{"name":"Alice"}')]
      >>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
      ```
      ### For SparkR
      ```
      # Converts a map into a JSON object
      df2 <- sql("SELECT map('name', 'Bob')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      # Converts an array of maps into a JSON array
      df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      ```
      ## How was this patch tested?
      Add unit test cases.
      
      cc viirya HyukjinKwon
      
      Author: goldmedal <liugs963@gmail.com>
      
      Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
      a28728a9