  1. Nov 15, 2015
    • [SPARK-11086][SPARKR] Use dropFactors column-wise instead of nested loop when createDataFrame · d7d9fa0b
      zero323 authored
      Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`
      
      At the moment, SparkR's createDataFrame uses a nested loop to convert factors to character when called on a local data.frame. It works, but it is incredibly slow, especially with data.table (~2 orders of magnitude slower than the PySpark / Pandas version on a DataFrame of 1M rows x 2 columns).
      
      A simple improvement is to apply `dropFactors` column-wise and then reshape the output list.
      
      It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).
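      A minimal sketch of the idea (not the actual patch; data and names are illustrative): convert factor columns in a single pass per column instead of looping over every cell.

      ```
      # Column-wise factor conversion on a local data.frame
      localDF <- data.frame(x = factor(c("a", "b")), y = 1:2)
      converted <- as.data.frame(
        lapply(localDF, function(col) if (is.factor(col)) as.character(col) else col),
        stringsAsFactors = FALSE
      )
      str(converted)
      ```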
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9099 from zero323/SPARK-11086.
  2. Nov 12, 2015
  3. Nov 11, 2015
  4. Nov 10, 2015
  5. Nov 09, 2015
  6. Nov 06, 2015
  7. Nov 05, 2015
  8. Oct 30, 2015
  9. Oct 26, 2015
  10. Oct 21, 2015
    • [SPARK-11197][SQL] run SQL on files directly · f8c6bec6
      Davies Liu authored
      This PR introduces a new feature to run SQL directly on files without creating a table, for example:
      
      ```
      select id from json.`path/to/json/files` as j
      ```
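      A hedged SparkR-flavored sketch of the same query (assuming the 1.x API where `sql` takes a sqlContext; the path is a placeholder):

      ```
      df <- sql(sqlContext, "select id from json.`path/to/json/files` as j")
      head(df)
      ```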
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9173 from davies/source.
  11. Oct 14, 2015
  12. Oct 13, 2015
    • [SPARK-10913] [SPARKR] attach() function support · f7f28ee7
      Adrian Zhuang authored
      Bring the changed code up to date.
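      A hypothetical usage sketch (dataset and column names are illustrative): attach() puts a DataFrame's columns on the R search path so they can be referenced without the df$ prefix.

      ```
      df <- createDataFrame(sqlContext, faithful)
      attach(df)
      head(select(df, waiting))   # 'waiting' now resolves to df$waiting
      ```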
      
      Author: Adrian Zhuang <adrian555@users.noreply.github.com>
      Author: adrian555 <wzhuang@us.ibm.com>
      
      Closes #9031 from adrian555/attach2.
    • [SPARK-10888] [SPARKR] Added as.DataFrame as a synonym to createDataFrame · 1e0aba90
      Narine Kokhlikyan authored
      as.DataFrame is a more R-style signature.
      Also, I'd like to know if we could make the context (e.g. sqlContext) global, so that we do not have to specify it as an argument each time we create a DataFrame.
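      A minimal sketch of the synonym (assuming the 1.x API with an explicit sqlContext):

      ```
      df1 <- createDataFrame(sqlContext, iris)
      df2 <- as.DataFrame(sqlContext, iris)   # equivalent to the call above
      ```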
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      
      Closes #8952 from NarineK/sparkrasDataFrame.
    • [SPARK-10051] [SPARKR] Support collecting data of StructType in DataFrame · 5e3868ba
      Sun Rui authored
      Two points in this PR:
      
      1.    The original assumption was that a named R list is treated as a struct in SerDe. But this is problematic because some R functions implicitly generate named lists that are not intended to be structs when transferred by SerDe. So SerDe clients have to explicitly mark a named list as a struct by changing its class from "list" to "struct" (see the sketch after these points).
      
      2.    SerDe is in the Spark Core module, and data of StructType is represented as GenericRow, which is defined in the Spark SQL module. SerDe can't import GenericRow because, in the Maven build, the Spark SQL module depends on the Spark Core module. So this PR adds a registration hook in SerDe to allow SQLUtils in the Spark SQL module to register its functions for serialization and deserialization of StructType.
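      A hypothetical sketch of point 1, assuming the class swap is all that is needed on the R side:

      ```
      s <- list(a = 1L, b = "x")
      class(s) <- "struct"   # explicitly mark the named list as a struct for SerDe
      ```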
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8794 from sun-rui/SPARK-10051.
  13. Oct 10, 2015
    • [SPARK-10079] [SPARKR] Make 'column' and 'col' functions be S4 functions. · 864de3bf
      Sun Rui authored
      1.  Add a "col" function into DataFrame.
      2.  Move the current "col" function in Column.R to functions.R and convert it to an S4 function.
      3.  Add an S4 "column" function in functions.R (see the sketch after this list).
      4.  Convert the "column" function in Column.R to an S4 function. This is for private use.
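      A hypothetical usage sketch (assuming the character form of `column` is exported):

      ```
      df <- createDataFrame(sqlContext, mtcars)
      head(select(df, column("mpg")))   # column() builds a Column from a name
      ```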
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8864 from sun-rui/SPARK-10079.
  14. Oct 09, 2015
  15. Oct 08, 2015
    • [SPARK-10836] [SPARKR] Added sort(x, decreasing, col, ... ) method to DataFrame · e8f90d9d
      Narine Kokhlikyan authored
      The sort function can be used as an alternative to arrange(...).
      As arguments it accepts x (a DataFrame), decreasing (TRUE/FALSE, or a vector of orderings for the columns), and the columns to sort by, given as string names.
      
      For example:
      ```
      sort(df, TRUE, "col1", "col2", "col3", "col5")        # if we want to sort some of the columns in the same order
      sort(df, decreasing = TRUE, "col1")
      sort(df, decreasing = c(TRUE, FALSE), "col1", "col2")
      ```
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      
      Closes #8920 from NarineK/sparkrsort.
  16. Oct 07, 2015
  17. Oct 04, 2015
  18. Sep 30, 2015
  19. Sep 16, 2015
  20. Sep 12, 2015
    • [SPARK-6548] Adding stddev to DataFrame functions · f4a22808
      JihongMa authored
      Adds STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
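      A hedged sketch of how the new aggregate might be invoked from SparkR via SQL (the `measurements` temp table and its columns are placeholders):

      ```
      agg <- sql(sqlContext, "SELECT name, stddev(value) FROM measurements GROUP BY name")
      head(agg)
      ```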
      
      Author: JihongMa <linlin200605@gmail.com>
      Author: Jihong MA <linlin200605@gmail.com>
      Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
      Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
      
      Closes #6297 from JihongMA/SPARK-SQL.
  21. Sep 10, 2015
    • [SPARK-10049] [SPARKR] Support collecting data of ArrayType in DataFrame. · 45e3be5c
      Sun Rui authored
      This PR:
      1.  Enhances reflection in RBackend, automatically matching a Java array to a Scala Seq when finding methods. Util functions like seq() and listToSeq() on the R side can be removed, as they would conflict with the SerDe logic that transfers a Scala Seq to the R side.

      2.  Enhances the SerDe to support transferring a Scala Seq to the R side. Data of ArrayType in a DataFrame is observed to be of Scala Seq type after collection (see the sketch below).

      3.  Supports ArrayType in createDataFrame().
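      A minimal sketch of collecting an ArrayType column (the query is illustrative):

      ```
      df <- sql(sqlContext, "SELECT array(1, 2, 3) AS xs")
      collect(df)$xs   # each cell arrives on the R side as a list of elements
      ```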
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8458 from sun-rui/SPARK-10049.
  22. Sep 03, 2015
    • [SPARK-8951] [SPARKR] support Unicode characters in collect() · af0e3125
      CHOIJAEHONG authored
      Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
      I changed SerDe.scala so that Spark supports Unicode characters when writing a string to R.
      
      Author: CHOIJAEHONG <redrock07@naver.com>
      
      Closes #7494 from CHOIJAEHONG1/SPARK-8951.
  23. Aug 28, 2015
    • [SPARK-9803] [SPARKR] Add subset and transform + tests · 2a4e00ca
      felixcheung authored
      Add subset and transform
      Also reorganize `[` & `[[` to subset instead of select
      
      Note: transform is very similar to mutate. Spark doesn't seem to replace an existing column of the same name in mutate (i.e. `mutate(df, age = df$age + 2)` returns a DataFrame with 2 columns named 'age'), so transform is not doing that for now either.
      Though it is clearly stated that it should replace a column with a matching name (should I open a JIRA for mutate/transform?).
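      A hypothetical usage sketch (columns follow mtcars and are illustrative):

      ```
      df <- createDataFrame(sqlContext, mtcars)
      sub <- subset(df, df$mpg > 20, c("mpg", "cyl"))   # row filter + column selection
      df2 <- transform(df, kpl = df$mpg * 0.425)        # adds a new column, like mutate
      ```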
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #8503 from felixcheung/rsubset_transform.
    • [SPARK-10328] [SPARKR] Fix generic for na.omit · 2f99c372
      Shivaram Venkataraman authored
      The S3 function is documented at https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #8495 from shivaram/na-omit-fix.
  24. Aug 27, 2015
  25. Aug 26, 2015
    • [MINOR] [SPARKR] Fix some validation problems in SparkR · 773ca037
      Yu ISHIKAWA authored
      Getting rid of some validation problems in SparkR
      https://github.com/apache/spark/pull/7883
      
      cc shivaram
      
      ```
      inst/tests/test_Serde.R:26:1: style: Trailing whitespace is superfluous.
      
      ^~
      inst/tests/test_Serde.R:34:1: style: Trailing whitespace is superfluous.
      
      ^~
      inst/tests/test_Serde.R:37:38: style: Trailing whitespace is superfluous.
        expect_equal(class(x), "character")
                                           ^~
      inst/tests/test_Serde.R:50:1: style: Trailing whitespace is superfluous.
      
      ^~
      inst/tests/test_Serde.R:55:1: style: Trailing whitespace is superfluous.
      
      ^~
      inst/tests/test_Serde.R:60:1: style: Trailing whitespace is superfluous.
      
      ^~
      inst/tests/test_sparkSQL.R:611:1: style: Trailing whitespace is superfluous.
      
      ^~
      R/DataFrame.R:664:1: style: Trailing whitespace is superfluous.
      
      ^~~~~~~~~~~~~~
      R/DataFrame.R:670:55: style: Trailing whitespace is superfluous.
                      df <- data.frame(row.names = 1 : nrow)
                                                            ^~~~~~~~~~~~~~~~
      R/DataFrame.R:672:1: style: Trailing whitespace is superfluous.
      
      ^~~~~~~~~~~~~~
      R/DataFrame.R:686:49: style: Trailing whitespace is superfluous.
                          df[[names[colIndex]]] <- vec
                                                      ^~~~~~~~~~~~~~~~~~
      ```
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #8474 from yu-iskw/minor-fix-sparkr.
    • [SPARK-9316] [SPARKR] Add support for filtering using `[` (synonym for filter / select) · 75d4773a
      felixcheung authored
      Add support for
      ```
         df[df$name == "Smith", c(1,2)]
         df[df$age %in% c(19, 30), 1:2]
      ```
      
      shivaram
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #8394 from felixcheung/rsubset.
  26. Aug 19, 2015
  27. Aug 18, 2015
    • [SPARK-10075] [SPARKR] Add `when` expression function in SparkR · bf32c1f7
      Yu ISHIKAWA authored
      - Add `when` and `otherwise` as `Column` methods
      - Add `When` as an expression function
      - Add `%otherwise%` infix as an alias of `otherwise`
      
      Since R doesn't support a feature like method chaining, the `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange to shivaram, I can remove it. What do you think?
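      A hypothetical usage sketch of the nested style described above (data and labels are illustrative):

      ```
      df <- createDataFrame(sqlContext, mtcars)
      label <- otherwise(when(df$mpg > 20, "high"), "low")
      head(select(df, label))
      ```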
      
      ### JIRA
      [[SPARK-10075] Add `when` expression function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #8266 from yu-iskw/SPARK-10075.
  28. Aug 17, 2015
    • [SPARK-9871] [SPARKR] Add expression functions into SparkR which have a variable parameter · 26e76058
      Yu ISHIKAWA authored
      ### Summary
      
      - Add `lit` function
      - Add `concat`, `greatest`, `least` functions (see the sketch below)
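      A hypothetical usage sketch of the variadic functions (data is illustrative):

      ```
      local <- data.frame(a = c("foo", "bar"), b = c("baz", "qux"),
                          x = c(1, 5), y = c(3, 2), stringsAsFactors = FALSE)
      df <- createDataFrame(sqlContext, local)
      head(select(df, concat(df$a, lit("-"), df$b), greatest(df$x, df$y), least(df$x, df$y)))
      ```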
      
      I think we need to improve the `collect` function in order to implement a `struct` function, since `collect` doesn't work with arguments that include a nested `list` variable. It seems that a list corresponding to a `struct` still has `jobj` classes, so it would be better to solve this problem in another issue.
      
      ### JIRA
      [[SPARK-9871] Add expression functions into SparkR which have a variable parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #8194 from yu-iskw/SPARK-9856.
  29. Aug 16, 2015
    • [SPARK-8844] [SPARKR] head/collect is broken in SparkR. · 5f9ce738
      Sun Rui authored
      This is a WIP patch for SPARK-8844, for collecting reviews.

      The bug is about reading an empty DataFrame: in readCol(),
            lapply(1:numRows, function(x) {
      does not take into consideration the case where numRows = 0 (in R, 1:0 produces c(1, 0) rather than an empty sequence).

      Will add a unit test case.
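      A minimal illustration of the pitfall (seq_len shown as the usual fix; whether the patch uses it is an assumption):

      ```
      numRows <- 0
      1:numRows          # c(1, 0): iterates twice instead of zero times
      seq_len(numRows)   # integer(0): iterates zero times
      ```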
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #7419 from sun-rui/SPARK-8844.
  30. Aug 12, 2015
    • [SPARK-9855] [SPARKR] Add expression functions into SparkR whose params are simple · f4bc01f1
      Yu ISHIKAWA authored
      I added lots of expression functions for SparkR. This PR includes only functions whose params are only `(Column)` or `(Column, Column)`. I also think we need to improve how we test those functions; however, it would be better to work on that in another issue.
      
      ## Diff Summary
      
      - Add lots of functions in `functions.R` and their generic in `generic.R`
      - Add aliases for `ceiling` and `sign`
      - Move expression functions from `column.R` to `functions.R`
      - Modify `rdname` from `column` to `functions`
      
      I haven't supported the `not` function, because the name collides with the `testthat` package, and I haven't thought of a way to define it.
      
      ## New Supported Functions
      
      ```
      approxCountDistinct
      ascii
      base64
      bin
      bitwiseNOT
      ceil (alias: ceiling)
      crc32
      dayofmonth
      dayofyear
      explode
      factorial
      hex
      hour
      initcap
      isNaN
      last_day
      length
      log2
      ltrim
      md5
      minute
      month
      negate
      quarter
      reverse
      round
      rtrim
      second
      sha1
      signum (alias: sign)
      size
      soundex
      to_date
      trim
      unbase64
      unhex
      weekofyear
      year
      
      datediff
      levenshtein
      months_between
      nanvl
      pmod
      ```
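      A hypothetical usage sketch for a few of the functions listed above (data is illustrative):

      ```
      local <- data.frame(x = c(-1.5, 2.5), s = c("abc", "de"), stringsAsFactors = FALSE)
      df <- createDataFrame(sqlContext, local)
      head(select(df, round(df$x), length(df$s), reverse(df$s)))
      ```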
      
      ## JIRA
      [[SPARK-9855] Add expression functions into SparkR whose params are simple - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9855)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #8123 from yu-iskw/SPARK-9855.
  31. Jul 31, 2015
    • [SPARK-9318] [SPARK-9320] [SPARKR] Aliases for merge and summary functions on DataFrames · 712f5b7a
      Hossein authored
      This PR adds synonyms for ```merge``` and ```summary``` in the SparkR DataFrame API.
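      A hypothetical sketch of the summary alias (merge mirrors join in the same way):

      ```
      df <- createDataFrame(sqlContext, mtcars)
      collect(summary(df))   # same result as collect(describe(df))
      ```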
      
      cc shivaram
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #7806 from falaki/SPARK-9320 and squashes the following commits:
      
      72600f7 [Hossein] Updated docs
      92a6e75 [Hossein] Fixed merge generic signature issue
      4c2b051 [Hossein] Fixing naming with mllib summary
      0f3a64c [Hossein] Added ... to generic for merge
      30fbaf8 [Hossein] Merged master
      ae1a4cf [Hossein] Merge branch 'master' into SPARK-9320
      e8eb86f [Hossein] Add a generic for merge
      fc01f2d [Hossein] Added unit test
      8d92012 [Hossein] Added merge as an alias for join
      5b8bedc [Hossein] Added unit test
      632693d [Hossein] Added summary as an alias for describe for DataFrame
    • [SPARK-9324] [SPARK-9322] [SPARK-9321] [SPARKR] Some aliases for R-like functions in DataFrames · 710c2b5d
      Hossein authored
      Adds the following aliases (see the sketch after this list):
      * unique (distinct)
      * rbind (unionAll): accepts many DataFrames
      * nrow (count)
      * ncol
      * dim
      * names (columns): along with the replacement function to change names
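      A hypothetical usage sketch of the aliases (data is illustrative):

      ```
      df <- createDataFrame(sqlContext, mtcars)
      dim(df)                 # row count and column count
      names(df)               # same as columns(df)
      u <- unique(df)         # distinct(df)
      both <- rbind(df, df)   # unionAll(df, df)
      ```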
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #7764 from falaki/sparkR-alias and squashes the following commits:
      
      56016f5 [Hossein] Updated R documentation
      5e4a4d0 [Hossein] Removed extra code
      f51cbef [Hossein] Merge branch 'master' into sparkR-alias
      c1b88bd [Hossein] Moved setGeneric and other comments applied
      d9307f8 [Hossein] Added tests
      b5aa988 [Hossein] Added dim, ncol, nrow, names, rbind, and unique functions to DataFrames
    • [SPARK-9510] [SPARKR] Remaining SparkR style fixes · 82f47b81
      Shivaram Venkataraman authored
      With the change in this patch, I get no more warnings from `./dev/lint-r` on my machine.
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7834 from shivaram/sparkr-style-fixes and squashes the following commits:
      
      716cd8e [Shivaram Venkataraman] Remaining SparkR style fixes