Commits · 87293c746e19d66f475d506d0adb43421f496843 · cs525-sp18-g07 / spark

Feb 06, 2018

[SPARK-23327][SQL] Update the description and tests of three external API or functions · c36fecc3

gatorsmile authored 7 years ago

## What changes were proposed in this pull request?
Update the description and tests of three external API or functions `createFunction `, `length` and `repartitionByRange `

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #20495 from gatorsmile/updateFunc.

c36fecc3

Jan 31, 2018
- [SPARK-23157][SQL][FOLLOW-UP] DataFrame -> SparkDataFrame in R comment · f470df2f
  Henry Robinson authored 7 years ago
  
  Author: Henry Robinson <henry@cloudera.com> Closes #20443 from henryr/SPARK-23157.
  f470df2f
Jan 30, 2018

[SPARK-23157][SQL] Explain restriction on column expression in withColumn() · 8b983243

Henry Robinson authored 7 years ago

## What changes were proposed in this pull request?

It's not obvious from the comments that any added column must be a
function of the dataset that we are adding it to. Add a comment to
that effect to Scala, Python and R Data* methods.

Author: Henry Robinson <henry@cloudera.com>

Closes #20429 from henryr/SPARK-23157.

8b983243

Jan 24, 2018

[SPARK-20906][SPARKR] Add API doc example for Constrained Logistic Regression · e18d6f53

Felix Cheung authored 7 years ago

## What changes were proposed in this pull request?

doc only changes

## How was this patch tested?

manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20380 from felixcheung/rclrdoc.

e18d6f53

[SPARK-21727][R] Allow multi-element atomic vector as column type in SparkR DataFrame · f54b65c1

neilalex authored 7 years ago

## What changes were proposed in this pull request?

A fix to https://issues.apache.org/jira/browse/SPARK-21727, "Operating on an ArrayType in a SparkR DataFrame throws error"

## How was this patch tested?

- Ran tests at R\pkg\tests\run-all.R (see below attached results)
- Tested the following lines in SparkR, which now seem to execute without error:

```
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
mySparkDf <- as.DataFrame(myDf)
collect(mySparkDf)
```

[2018-01-22 SPARK-21727 Test Results.txt](https://github.com/apache/spark/files/1653535/2018-01-22.SPARK-21727.Test.Results.txt)

felixcheung yanboliang sun-rui shivaram

_The contribution is my original work and I license the work to the project under the project’s open source license_

Author: neilalex <neil@neilalex.com>

Closes #20352 from neilalex/neilalex-sparkr-arraytype.

f54b65c1

Jan 17, 2018

[SPARK-23062][SQL] Improve EXCEPT documentation · 1f3d933e

Henry Robinson authored 7 years ago

## What changes were proposed in this pull request?

Make the default behavior of EXCEPT (i.e. EXCEPT DISTINCT) more
explicit in the documentation, and call out the change in behavior
from 1.x.

Author: Henry Robinson <henry@cloudera.com>

Closes #20254 from henryr/spark-23062.

1f3d933e

Jan 16, 2018

[SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator. · 4371466b

Bago Amirbekian authored 7 years ago

## What changes were proposed in this pull request?

RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid using the deprecated OneHotEncoder & to ensure the model produced can be used in streaming.

## How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Bago Amirbekian <bago@databricks.com>

Closes #20229 from MrBago/rFormula.

4371466b

Jan 14, 2018

[SPARK-23069][DOCS][SPARKR] fix R doc for describe missing text · 66738d29

Felix Cheung authored 7 years ago

## What changes were proposed in this pull request?

fix doc truncated

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20263 from felixcheung/r23docfix.

66738d29

Jan 12, 2018

[SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT · 651f7615

gatorsmile authored 7 years ago

## What changes were proposed in this pull request?
This patch bumps the master branch version to `2.4.0-SNAPSHOT`.

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #20222 from gatorsmile/bump24.

651f7615

Jan 11, 2018

[SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline · 186bf8fb

Bago Amirbekian authored 7 years ago

## What changes were proposed in this pull request?

Including VectorSizeHint in RFormula piplelines will allow them to be applied to streaming dataframes.

## How was this patch tested?

Unit tests.

Author: Bago Amirbekian <bago@databricks.com>

Closes #20238 from MrBago/rFormulaVectorSize.

186bf8fb

Jan 10, 2018

[SPARK-22993][ML] Clarify HasCheckpointInterval param doc · 70bcc9d5

sethah authored 7 years ago

## What changes were proposed in this pull request?

Add a note to the `HasCheckpointInterval` parameter doc that clarifies that this setting is ignored when no checkpoint directory has been set on the spark context.

## How was this patch tested?

No tests necessary, just a doc update.

Author: sethah <shendrickson@cloudera.com>

Closes #20188 from sethah/als_checkpoint_doc.

70bcc9d5

Jan 09, 2018

[SPARK-21293][SPARKR][DOCS] structured streaming doc update · 02214b09

Felix Cheung authored 7 years ago

## What changes were proposed in this pull request?

doc update

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20197 from felixcheung/rwadoc.

02214b09

Jan 03, 2018

[SPARK-22933][SPARKR] R Structured Streaming API for withWatermark, trigger, partitionBy · df95a908

Felix Cheung authored 7 years ago

## What changes were proposed in this pull request?

R Structured Streaming API for withWatermark, trigger, partitionBy

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20129 from felixcheung/rwater.

df95a908

Jan 01, 2018

[SPARK-21616][SPARKR][DOCS] update R migration guide and vignettes · 7a702d8d

Felix Cheung authored 7 years ago

## What changes were proposed in this pull request?

update R migration guide and vignettes

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20106 from felixcheung/rreleasenote23.

7a702d8d

Dec 30, 2017

[SPARK-22924][SPARKR] R API for sortWithinPartitions · ea0a5eef

Felix Cheung authored 7 years ago

## What changes were proposed in this pull request?

Add to `arrange` the option to sort only within partition

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20118 from felixcheung/rsortwithinpartition.

ea0a5eef

[SPARK-22771][SQL] Concatenate binary inputs into a binary output · f2b3525c

Takeshi Yamamuro authored 7 years ago

## What changes were proposed in this pull request?
This pr modified `concat` to concat binary inputs into a single binary output.
`concat` in the current master always output data as a string. But, in some databases (e.g., PostgreSQL), if all inputs are binary, `concat` also outputs binary.

## How was this patch tested?
Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #19977 from maropu/SPARK-22771.

f2b3525c

Dec 29, 2017

[SPARK-22920][SPARKR] sql functions for current_date, current_timestamp,... · 66a7d6b3

Felix Cheung authored 7 years ago

[SPARK-22920][SPARKR] sql functions for current_date, current_timestamp, rtrim/ltrim/trim with trimString

## What changes were proposed in this pull request?

Add sql functions

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20105 from felixcheung/rsqlfuncs.

66a7d6b3

Dec 28, 2017

[SPARK-21208][R] Adds setLocalProperty and getLocalProperty in R · 1eebfbe1

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

This PR adds `setLocalProperty` and `getLocalProperty`in R.

```R
> df <- createDataFrame(iris)
> setLocalProperty("spark.job.description", "Hello world!")
> count(df)
> setLocalProperty("spark.job.description", "Hi !!")
> count(df)
```

<img width="775" alt="2017-12-25 4 18 07" src="https://user-images.githubusercontent.com/6477701/34335213-60655a7c-e990-11e7-88aa-12debe311627.png">

```R
> print(getLocalProperty("spark.job.description"))
NULL
> setLocalProperty("spark.job.description", "Hello world!")
> print(getLocalProperty("spark.job.description"))
[1] "Hello world!"
> setLocalProperty("spark.job.description", "Hi !!")
> print(getLocalProperty("spark.job.description"))
[1] "Hi !!"
```

## How was this patch tested?

Manually tested and a test in `R/pkg/tests/fulltests/test_context.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20075 from HyukjinKwon/SPARK-21208.

1eebfbe1

[SPARK-22843][R] Adds localCheckpoint in R · 76e8a1d7

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

This PR proposes to add `localCheckpoint(..)` in R API.

```r
df <- localCheckpoint(createDataFrame(iris))
```

## How was this patch tested?

Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20073 from HyukjinKwon/SPARK-22843.

76e8a1d7

Dec 23, 2017

[SPARK-22889][SPARKR] Set overwrite=T when install SparkR in tests · 1219d7a4

Shivaram Venkataraman authored 7 years ago

## What changes were proposed in this pull request?

Since all CRAN checks go through the same machine, if there is an older partial download or partial install of Spark left behind the tests fail. This PR overwrites the install files when running tests. This shouldn't affect Jenkins as `SPARK_HOME` is set when running Jenkins tests.

## How was this patch tested?

Test manually by running `R CMD check --as-cran`

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #20060 from shivaram/sparkr-overwrite-cran.

1219d7a4

[SPARK-22844][R] Adds date_trunc in R API · aeb45df6

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

This PR adds `date_trunc` in R API as below:

```r
> df <- createDataFrame(list(list(a = as.POSIXlt("2012-12-13 12:34:00"))))
> head(select(df, date_trunc("hour", df$a)))
  date_trunc(hour, a)
1 2012-12-13 12:00:00
```

## How was this patch tested?

Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20031 from HyukjinKwon/r-datetrunc.

aeb45df6

Nov 26, 2017

[SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions running R worker in few tests to speed up · d49d9e40

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

This is a followup to reduce AppVeyor test time. This PR proposes to reduce the number of shuffle partitions to reduce the tasks running R workers in few particular tests.

The symptom is similar as described in `https://github.com/apache/spark/pull/19722`. There are many R processes newly launched on Windows without forking and it makes the differences of elapsed time between Linux and Windows.

Here is the simple comparison for before/after of this change. I manually tested this by disabling `spark.sparkr.use.daemon`. Disabling it resembles the tests on Windows:

**Before**

<img width="672" alt="2017-11-25 12 22 13" src="https://user-images.githubusercontent.com/6477701/33217949-b5528dfa-d17d-11e7-8050-75675c39eb20.png">

**After**

<img width="682" alt="2017-11-25 12 32 00" src="https://user-images.githubusercontent.com/6477701/33217958-c6518052-d17d-11e7-9f8e-1be21a784559.png">

So, this probably will reduce roughly more than 10 minutes.

## How was this patch tested?

AppVeyor tests

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19816 from HyukjinKwon/SPARK-21693-followup.

d49d9e40

Nov 12, 2017

[SPARK-21693][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build · 3d90b2cb

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

This PR proposes to reduce max iteration in Linear SVM test in SparkR. This particular test elapses roughly 5 mins on my Mac and over 20 mins on Windows.

The root cause appears, it triggers 2500ish jobs by the default 100 max iterations. In Linux, `daemon.R` is forked but on Windows another process is launched, which is extremely slow.

So, given my observation, there are many processes (not forked) ran on Windows, which makes the differences of elapsed time.

After reducing the max iteration to 10, the total jobs in this single test is reduced to 550ish.

After reducing the max iteration to 5, the total jobs in this single test is reduced to 360ish.

## How was this patch tested?

Manually tested the elapsed times.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19722 from HyukjinKwon/SPARK-21693-test.

3d90b2cb

Nov 11, 2017

[SPARK-22488][SQL] Fix the view resolution issue in the SparkSession internal table() API · d6ee69e7

gatorsmile authored 7 years ago

## What changes were proposed in this pull request?
The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls `sessionState.catalog.lookupRelation` API. This skips the view resolution logics in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands, public and internal APIs.

Users might get the strange error caused by view resolution when the default database is different.
```
Table or view not found: t1; line 1 pos 14
org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

This PR is to fix it by enforcing it to use `ResolveRelations` to resolve the table.

## How was this patch tested?
Added a test case and modified the existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19713 from gatorsmile/viewResolution.

d6ee69e7

[SPARK-22476][R] Add dayofweek function to R · 223d83ee

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

This PR adds `dayofweek` to R API:

```r
data <- list(list(d = as.Date("2012-12-13")),
             list(d = as.Date("2013-12-14")),
             list(d = as.Date("2014-12-15")))
df <- createDataFrame(data)
collect(select(df, dayofweek(df$d)))
```

```
  dayofweek(d)
1            5
2            7
3            2
```

## How was this patch tested?

Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19706 from HyukjinKwon/add-dayofweek.

223d83ee

Nov 10, 2017

[SPARK-22344][SPARKR] clean up install dir if running test as source package · b70aa9e0

Felix Cheung authored 7 years ago

## What changes were proposed in this pull request?

remove spark if spark downloaded & installed

## How was this patch tested?

manually by building package
Jenkins, AppVeyor

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19657 from felixcheung/rinstalldir.

b70aa9e0

Nov 09, 2017

[SPARK-21640][SQL][PYTHON][R][FOLLOWUP] Add errorifexists in SparkR and other documentations · 695647bf

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

This PR proposes to add `errorifexists` to SparkR API and fix the rest of them describing the mode, mainly, in API documentations as well.

This PR also replaces `convertToJSaveMode` to `setWriteMode` so that string as is is passed to JVM and executes:

https://github.com/apache/spark/blob/b034f2565f72aa73c9f0be1e49d148bb4cf05153/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L72-L82

and remove the duplication here:

https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L187-L194

## How was this patch tested?

Manually checked the built documentation. These were mainly found by `` grep -r `error` `` and `grep -r 'error'`.

Also, unit tests added in `test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19673 from HyukjinKwon/SPARK-21640-followup.

695647bf

Nov 07, 2017

[SPARK-22281][SPARKR] Handle R method breaking signature changes · 2ca5aae4

Felix Cheung authored 7 years ago

## What changes were proposed in this pull request?

This is to fix the code for the latest R changes in R-devel, when running CRAN check
```
checking for code/documentation mismatches ... WARNING
Codoc mismatches from documentation object 'attach':
attach
Code: function(what, pos = 2L, name = deparse(substitute(what),
backtick = FALSE), warn.conflicts = TRUE)
Docs: function(what, pos = 2L, name = deparse(substitute(what)),
warn.conflicts = TRUE)
Mismatches in argument default values:
Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: deparse(substitute(what))

Codoc mismatches from documentation object 'glm':
glm
Code: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
NULL, ...)
Docs: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, contrasts = NULL, ...)
Argument names in code not in docs:
singular.ok
Mismatches in argument names:
Position: 16 Code: singular.ok Docs: contrasts
Position: 17 Code: contrasts Docs: ...
```

With attach, it's pulling in the function definition from base::attach. We need to disable that but we would still need a function signature for roxygen2 to build with.

With glm it's pulling in the function definition (ie. "usage") from the stats::glm function. Since this is "compiled in" when we build the source package into the .Rd file, when it changes at runtime or in CRAN check it won't match the latest signature. The solution is not to pull in from stats::glm since there isn't much value in doing that (none of the param we actually use, the ones we do use we have explicitly documented them)

Also with attach we are changing to call dynamically.

## How was this patch tested?

Manually.
- [x] check documentation output - yes
- [x] check help `?attach` `?glm` - yes
- [x] check on other platforms, r-hub, on r-devel etc..

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19557 from felixcheung/rattachglmdocerror.

2ca5aae4

Nov 06, 2017

[SPARK-22315][SPARKR] Warn if SparkR package version doesn't match SparkContext · 65a8bf60

Shivaram Venkataraman authored 7 years ago

## What changes were proposed in this pull request?

This PR adds a check between the R package version used and the version reported by SparkContext running in the JVM. The goal here is to warn users when they have a R package downloaded from CRAN and are using that to connect to an existing Spark cluster.

This is raised as a warning rather than an error as users might want to use patch versions interchangeably (e.g., 2.1.3 with 2.1.2 etc.)

## How was this patch tested?

Manually by changing the `DESCRIPTION` file

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #19624 from shivaram/sparkr-version-check.

65a8bf60

Oct 30, 2017

[SPARK-22327][SPARKR][TEST] check for version warning · ded3ed97

Felix Cheung authored 7 years ago

## What changes were proposed in this pull request?

Will need to port to this to branch-1.6, -2.0, -2.1, -2.2

## How was this patch tested?

manually
Jenkins, AppVeyor

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19549 from felixcheung/rcranversioncheck.

ded3ed97

Oct 29, 2017

[SPARK-22344][SPARKR] Set java.io.tmpdir for SparkR tests · 1fe27612

Shivaram Venkataraman authored 7 years ago

This PR sets the java.io.tmpdir for CRAN checks and also disables the hsperfdata for the JVM when running CRAN checks. Together this prevents files from being left behind in `/tmp`

## How was this patch tested?
Tested manually on a clean EC2 machine

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #19589 from shivaram/sparkr-tmpdir-clean.

1fe27612

Oct 26, 2017

[SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR · a83d8d5a

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c.

Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion.

## How was this patch tested?

Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19551 from HyukjinKwon/SPARK-17902.

a83d8d5a

Oct 11, 2017

[SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError... · 655f6f86

Zhenhua Wang authored 7 years ago

[SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError and starting from index 0

## What changes were proposed in this pull request?

Currently percentile_approx never returns the first element when percentile is in (relativeError, 1/N], where relativeError default 1/10000, and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.

For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.

Based on the paper, targetError is not rounded up, and searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.

## How was this patch tested?

Added a new test case and fix existing test cases.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #19438 from wzhfy/improve_percentile_approx.

655f6f86

Oct 05, 2017

[SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns · ae61f187

Liang-Chi Hsieh authored 7 years ago

## What changes were proposed in this pull request?

Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. It should be a problem when running `EnsureRequirements` and `gapply` in R can't work on empty grouping columns.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19436 from viirya/fix-flatmapinr-distribution.

ae61f187

Oct 02, 2017

[SPARK-22167][R][BUILD] sparkr packaging issue allow zinc · 8fab7995

Holden Karau authored 7 years ago

## What changes were proposed in this pull request?

When zinc is running the pwd might be in the root of the project. A quick solution to this is to not go a level up incase we are in the root rather than root/core/. If we are in the root everything works fine, if we are in core add a script which goes and runs the level up

## How was this patch tested?

set -x in the SparkR install scripts.

Author: Holden Karau <holden@us.ibm.com>

Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc.

8fab7995

Oct 01, 2017

[SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r · 02c91e03

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

Currently, we set lintr to jimhester/lintra769c0b (see [this](https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026) and [SPARK-14074](https://issues.apache.org/jira/browse/SPARK-14074)).

I first tested and checked lintr-1.0.1 but it looks many important fixes are missing (for example, checking 100 length). So, I instead tried the latest commit, https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72, in my local and fixed the check failures.

It looks it has fixed many bugs and now finds many instances that I have observed and thought should be caught time to time, here I filed [the results](https://gist.github.com/HyukjinKwon/4f59ddcc7b6487a02da81800baca533c).

The downside looks it now takes about 7ish mins, (it was 2ish mins before) in my local.

## How was this patch tested?

Manually, `./dev/lint-r` after manually updating the lintr package.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: zuotingbing <zuo.tingbing9@zte.com.cn>

Closes #19290 from HyukjinKwon/upgrade-r-lint.

02c91e03

Sep 25, 2017

[SPARK-22100][SQL] Make percentile_approx support date/timestamp type and... · 365a29bd

Zhenhua Wang authored 7 years ago

[SPARK-22100][SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type

## What changes were proposed in this pull request?

The `percentile_approx` function previously accepted numeric type input and output double type results.

But since all numeric types, date and timestamp types are represented as numerics internally, `percentile_approx` can support them easily.

After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.

This change is also required when we generate equi-height histograms for these types.

## How was this patch tested?

Added a new test and modified some existing tests.

Author: Zhenhua Wang <wangzhenhua@huawei.com>

Closes #19321 from wzhfy/approx_percentile_support_types.

365a29bd

Sep 21, 2017

[SPARK-21780][R] Simpler Dataset.sample API in R · a8d9ec8a

hyukjinkwon authored 7 years ago

## What changes were proposed in this pull request?

This PR make `sample(...)` able to omit `withReplacement` defaulting to `FALSE`.

In short, the following examples are allowed:

```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, fraction=1.0))
[1] 10
```

In addition, this PR also adds some type checking logics as below:

```r
> sample(df, fraction = "a")
Error in sample(df, fraction = "a") :
  fraction must be numeric; however, got character
> sample(df, fraction = 1, seed = NULL)
Error in sample(df, fraction = 1, seed = NULL) :
  seed must not be NULL or NA; however, got NULL
> sample(df, list(1), 1.0)
Error in sample(df, list(1), 1) :
  withReplacement must be logical; however, got list
> sample(df, fraction = -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
```

## How was this patch tested?

Manually tested, unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19243 from HyukjinKwon/SPARK-21780.

a8d9ec8a

Sep 20, 2017

[SPARK-22049][DOCS] Confusing behavior of from_utc_timestamp and to_utc_timestamp · e17901d6

Sean Owen authored 7 years ago

## What changes were proposed in this pull request?

Clarify behavior of to_utc_timestamp/from_utc_timestamp with an example

## How was this patch tested?

Doc only change / existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19276 from srowen/SPARK-22049.

e17901d6

Sep 14, 2017

[SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to... · a28728a9

goldmedal authored 7 years ago

[SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR

## What changes were proposed in this pull request?
In previous work SPARK-21513, we has allowed `MapType` and `ArrayType` of `MapType`s convert to a json string but only for Scala API. In this follow-up PR, we will make SparkSQL support it for PySpark and SparkR, too. We also fix some little bugs and comments of the previous work in this follow-up PR.

### For PySpark
```
>>> data = [(1, {"name": "Alice"})]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'{"name":"Alice")']
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
```
### For SparkR
```
# Converts a map into a JSON object
df2 <- sql("SELECT map('name', 'Bob')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
# Converts an array of maps into a JSON array
df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
```
## How was this patch tested?
Add unit test cases.

cc viirya HyukjinKwon

Author: goldmedal <liugs963@gmail.com>

Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.

a28728a9