  1. Nov 15, 2015
    • [SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable · 835a79d7
      Sun Rui authored
      The basic idea is that:
      The archive of the SparkR package itself, that is sparkr.zip, is created during the build process and is contained in the Spark binary distribution. It is not changed after the distribution is installed, as the directory it resides in ($SPARK_HOME/R/lib) may not be writable.
      
      When R source code is contained in jars or Spark packages specified with the "--jars" or "--packages" command line options, a temporary directory is created by calling Utils.createTempDir(), and the R packages built from that R source code are installed there. The temporary directory is writable, does not interfere with the directories of other concurrent SparkR sessions, and is deleted when the SparkR session ends. The R binary packages installed in the temporary directory are then packed into an archive named rpkg.zip.
      
      sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.
      
      The distribution of rpkg.zip in Standalone modes is not supported in this PR, and will be addressed in another PR.
      
      Various R files are updated to accept multiple lib paths (one for the SparkR package, the other for other R packages) so that these packages can be accessed in R.
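
      As a rough illustration of the per-session directory idea (plain JDK calls and made-up names, not Spark's internal Utils.createTempDir):
      ```scala
      import java.nio.file.Files

      // Sketch only: each SparkR session gets its own writable directory for R
      // packages built from --jars / --packages, so $SPARK_HOME/R/lib is never
      // written to after installation.
      val rPkgDir = Files.createTempDirectory("sparkr-rpkg-").toFile
      rPkgDir.deleteOnExit() // cleaned up when the session's JVM exits
      // R packages built from the jar-provided R source would be installed here,
      // then zipped into rpkg.zip and shipped alongside the pre-built sparkr.zip.
      println(s"session-local R library: ${rPkgDir.getAbsolutePath}")
      ```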
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #9390 from sun-rui/SPARK-10500.
    • [SPARK-11086][SPARKR] Use dropFactors column-wise instead of nested loop when createDataFrame · d7d9fa0b
      zero323 authored
      Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`
      
      At the moment, SparkR's createDataFrame uses a nested loop to convert factors to character when called on a local data.frame. It works, but is incredibly slow, especially with data.table (~2 orders of magnitude slower than the PySpark / Pandas version on a DataFrame of 1M rows x 2 columns).
      
      A simple improvement is to apply `dropFactor` column-wise and then reshape the output list.
      
      It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).
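
      The same column-wise-then-reshape idea, sketched in Scala with made-up data (the actual change lives in SparkR's R code):
      ```scala
      // Illustration only: convert each column in one pass instead of visiting
      // every cell in a nested loop, then transpose back into row order.
      val columns: Seq[Vector[Any]] = Seq(Vector(1, 2, 3), Vector("a", "b", "c"))
      val converted = columns.map(col => col.map(_.toString)) // one pass per column
      val rows = converted.transpose                          // reshape to rows
      ```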
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9099 from zero323/SPARK-11086.
    • [SPARK-10181][SQL] Do kerberos login for credentials during hive client initialization · 72c1d68b
      Yu Gao authored
      On driver process start-up, UserGroupInformation.loginUserFromKeytab is called with the principal and keytab passed in, so the static var UserGroupInformation.loginUser is set to that principal, with the Kerberos credentials saved in its private credential set, and all threads within the driver process are supposed to see and use these login credentials to authenticate with Hive and Hadoop. However, because of IsolatedClientLoader, the UserGroupInformation class is not shared for Hive metastore clients; instead it is loaded separately and is therefore unable to see the Kerberos login credentials prepared in the main thread.
      
      The first proposed fix would cause other classloader conflict errors and is not an appropriate solution. This new change does the Kerberos login during Hive client initialization, which makes the credentials ready for the particular Hive client instance.
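
      A hedged sketch of that login step using Hadoop's UserGroupInformation API (the principal and keytab values here are placeholders):
      ```scala
      import org.apache.hadoop.security.UserGroupInformation

      // Placeholder values; in the real change they come from the Spark/Hadoop
      // configuration supplied at submit time.
      val principal = "user/host@EXAMPLE.COM"
      val keytab = "/path/to/user.keytab"

      // Run while the isolated Hive client is being initialized, so the
      // UserGroupInformation class loaded by that classloader holds its own login.
      if (UserGroupInformation.isSecurityEnabled) {
        UserGroupInformation.loginUserFromKeytab(principal, keytab)
      }
      ```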
      
       yhuai Please take a look and let me know. If you are not the right person to talk to, could you point me to someone responsible for this?
      
      Author: Yu Gao <ygao@us.ibm.com>
      Author: gaoyu <gaoyu@gaoyu-macbookpro.roam.corp.google.com>
      Author: Yu Gao <crystalgaoyu@gmail.com>
      
      Closes #9272 from yolandagao/master.
    • [SPARK-11738] [SQL] Making ArrayType orderable · 3e2e1873
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-11738
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9718 from yhuai/makingArrayOrderable.
    • [SPARK-11672][ML] set active SQLContext in JavaDefaultReadWriteSuite · 64e55511
      Xiangrui Meng authored
      The same as #9694, but for Java test suite. yhuai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9719 from mengxr/SPARK-11672.4.
    • [SPARK-11734][SQL] Rename TungstenProject -> Project, TungstenSort -> Sort · d22fc108
      Reynold Xin authored
      I didn't remove the old Sort operator, since we still use it in randomized tests. I moved it into the test module and renamed it ReferenceSort.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9700 from rxin/SPARK-11734.
  2. Nov 14, 2015
  3. Nov 13, 2015
  4. Nov 12, 2015
    • [SPARK-11629][ML][PYSPARK][DOC] Python example code for Multilayer Perceptron Classification · ea5ae270
      Yanbo Liang authored
      Add Python example code for Multilayer Perceptron Classification, and make the example code in the user guide document testable. mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9594 from yanboliang/spark-11629.
    • [SPARK-11717] Ignore R session and history files from git · 2035ed39
      Lewuathe authored
      see: https://issues.apache.org/jira/browse/SPARK-11717
      
      SparkR generates R session data and history files under the current directory.
      It is useful to ignore these files even when running SparkR from the Spark directory for test or development.
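
      The ignore entries would look something like this (assuming the standard file names R uses for saved session data and command history):
      ```
      # R session image and command history written to the working directory
      .RData
      .Rhistory
      ```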
      
      Author: Lewuathe <lewuathe@me.com>
      
      Closes #9681 from Lewuathe/SPARK-11717.
    • [SPARK-11263][SPARKR] lintr Throws Warnings on Commented Code in Documentation · ed04846e
      felixcheung authored
      Clean out hundreds of `style: Commented code should be removed.` from lintr
      
      Like these:
      ```
      /opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:513:3: style: Commented code should be removed.
      # sc <- sparkR.init()
        ^~~~~~~~~~~~~~~~~~~
      /opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:514:3: style: Commented code should be removed.
      # sqlContext <- sparkRSQL.init(sc)
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      /opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:515:3: style: Commented code should be removed.
      # path <- "path/to/file.json"
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      ```
      
      Tried without export or rdname; neither worked.
      Instead, added `#' noRd` to suppress .Rd file generation.
      
      Also updated `family` for DataFrame functions to use longer descriptive text instead of `dataframe_funcs`:
      ![image](https://cloud.githubusercontent.com/assets/8969467/10933937/17bf5b1e-8291-11e5-9777-40fc632105dc.png)
      
      This covers *most* of the 'Commented code' warnings, but I left out a few that look legitimate.
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9463 from felixcheung/rlintr.
    • [SPARK-11672][ML] flaky spark.ml read/write tests · e71c0755
      Xiangrui Meng authored
      We set `sqlContext = null` in `afterAll`. However, this doesn't change `SQLContext.activeContext`, so `SQLContext.getOrCreate` might use the `SparkContext` from a previous test suite and hence cause the error. This PR calls `clearActive` in `beforeAll` and `afterAll` to avoid using an old context from other test suites.
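
      A minimal sketch of that pattern with ScalaTest (the `clearActive` call is Spark-internal, so it is shown only as a comment):
      ```scala
      import org.scalatest.{BeforeAndAfterAll, FunSuite}

      // Clear the active SQLContext around the suite so SQLContext.getOrCreate
      // cannot pick up a context left behind by a previously-run suite.
      class ReadWriteSuiteSketch extends FunSuite with BeforeAndAfterAll {
        override def beforeAll(): Unit = {
          super.beforeAll()
          // SQLContext.clearActive()
        }

        override def afterAll(): Unit = {
          // SQLContext.clearActive()
          super.afterAll()
        }
      }
      ```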
      
      cc: yhuai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9677 from mengxr/SPARK-11672.2.
    • [SPARK-11681][STREAMING] Correctly update state timestamp even when state is not updated · e4e46b20
      Tathagata Das authored
      Bug: The timestamp is not updated if there is data but the corresponding state is not updated. This is wrong, since timeout is defined as "no data for a while", not "no state update for a while".
      
      Fix: Update the timestamp when a timeout is specified; otherwise there is no need.
      Also refactored the code for better testability and added unit tests.
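
      A conceptual sketch of the corrected rule (the names here are illustrative, not Spark's internals):
      ```scala
      // With a timeout configured, refresh the key's last-seen time whenever data
      // arrives for it in a batch, regardless of whether the tracking function
      // updated the state itself.
      case class StateRecord[S](state: S, lastSeenTimeMs: Long)

      def onDataReceived[S](record: StateRecord[S], batchTimeMs: Long,
                            timeoutEnabled: Boolean): StateRecord[S] =
        if (timeoutEnabled) record.copy(lastSeenTimeMs = batchTimeMs) else record
      ```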
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #9648 from tdas/SPARK-11681.
    • [SPARK-11419][STREAMING] Parallel recovery for FileBasedWriteAheadLog + minor recovery tweaks · 7786f9cc
      Burak Yavuz authored
      The support for closing WriteAheadLog files after writes was just merged in. Closing every file after a write is a very expensive operation as it creates many small files on S3. It's not necessary to enable it on HDFS anyway.
      
      However, when you have many small files on S3, recovery takes very long. In addition, files start stacking up pretty quickly and deletes may not be able to keep up, so deletes can also be parallelized.
      
      This PR adds support for the two parallelization steps mentioned above, in addition to fixing a couple more failures I encountered during recovery.
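
      A rough sketch of the parallelization idea (readSegment and the paths are placeholders, not the FileBasedWriteAheadLog API):
      ```scala
      import scala.concurrent.{Await, Future}
      import scala.concurrent.ExecutionContext.Implicits.global
      import scala.concurrent.duration._

      def readSegment(path: String): Seq[Array[Byte]] = Seq.empty // placeholder reader

      // Recover many small WAL segments concurrently instead of one at a time;
      // old-segment deletes can be dispatched the same way so they keep up.
      val segmentPaths = Seq("wal-0001", "wal-0002", "wal-0003")
      val recovered = Await.result(
        Future.traverse(segmentPaths)(p => Future(readSegment(p))), 10.minutes)
      ```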
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #9373 from brkyvz/par-recovery.
    • [SPARK-11663][STREAMING] Add Java API for trackStateByKey · 0f1d00a9
      Shixiong Zhu authored
      TODO
      - [x] Add Java API
      - [x] Add API tests
      - [x] Add a function test
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #9636 from zsxwing/java-track.
    • [SPARK-11654][SQL] add reduce to GroupedDataset · 41bbd230
      Michael Armbrust authored
      This PR adds a new method, `reduce`, to `GroupedDataset`, which allows operations similar to `reduceByKey` on a traditional `PairRDD`.
      
      ```scala
      val ds = Seq("abc", "xyz", "hello").toDS()
      ds.groupBy(_.length).reduce(_ + _).collect()  // not actually commutative :P
      
      res0: Array(3 -> "abcxyz", 5 -> "hello")
      ```
      
      While implementing this method and its test cases several more deficiencies were found in our encoder handling.  Specifically, in order to support positional resolution, named resolution and tuple composition, it is important to keep the unresolved encoder around and to use it when constructing new `Datasets` with the same object type but different output attributes.  We now divide the encoder lifecycle into three phases (that mirror the lifecycle of standard expressions) and have checks at various boundaries:
      
       - Unresolved Encoders: all user-facing encoders (those constructed by implicits, static methods, or tuple composition) are unresolved, meaning they have only `UnresolvedAttributes` for named fields and `BoundReferences` for fields accessed by ordinal.
       - Resolved Encoders: internal to a `[Grouped]Dataset` the encoder is resolved, meaning all input has been resolved to a specific `AttributeReference`.  Any encoders that are placed into a logical plan for use in object construction should be resolved.
       - Bound Encoders: constructed by physical plans, right before the actual conversion from row -> object is performed.
      
      It is left to future work to add explicit checks for resolution and provide good error messages when it fails.  We might also consider enforcing the above constraints in the type system (i.e. `fromRow` only exists on a `ResolvedEncoder`), but we should probably wait before spending too much time on this.
      
      Author: Michael Armbrust <michael@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9673 from marmbrus/pr/9628.
    • [SPARK-11712][ML] Make spark.ml LDAModel be abstract · dcb896fd
      Joseph K. Bradley authored
      Per discussion in the initial Pipelines LDA PR [https://github.com/apache/spark/pull/9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases.
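
      A minimal sketch of the resulting hierarchy (member names are illustrative only, not the actual spark.ml API surface):
      ```scala
      // LDAModel stays as the user-facing abstract type; the concrete local
      // implementation moves into a subclass, leaving room for a distributed
      // implementation later without breaking the API.
      abstract class LDAModel {
        def vocabSize: Int
      }

      class LocalLDAModel(override val vocabSize: Int) extends LDAModel
      ```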
      
      CC feynmanliang mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #9678 from jkbradley/lda-pipelines-2.
    • [SPARK-11709] include creation site info in SparkContext.assertNotStopped error message · bc092966
      Xiangrui Meng authored
      This helps debug issues caused by multiple SparkContext instances. JoshRosen andrewor14
      
      ~~~
      scala> sc.stop()
      
      scala> sc.parallelize(0 until 10)
      java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
      This stopped SparkContext was created at:
      
      org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
      org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
      $iwC$$iwC.<init>(<console>:9)
      $iwC.<init>(<console>:18)
      <init>(<console>:20)
      .<init>(<console>:24)
      .<clinit>(<console>)
      .<init>(<console>:7)
      .<clinit>(<console>)
      $print(<console>)
      sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      java.lang.reflect.Method.invoke(Method.java:606)
      org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
      org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
      org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
      org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
      org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
      org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
      
      The active context was created at:
      
      (No active SparkContext.)
      ~~~
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9675 from mengxr/SPARK-11709.
    • [SPARK-11658] simplify documentation for PySpark combineByKey · 68ef61bb
      Chris Snow authored
      Author: Chris Snow <chsnow123@gmail.com>
      
      Closes #9640 from snowch/patch-3.
    • [SPARK-11667] Update dynamic allocation docs to reflect supported cluster managers · 12a0784a
      Andrew Or authored
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #9637 from andrewor14/update-da-docs.
    • [SPARK-11670] Fix incorrect kryo buffer default value in docs · cf38fc75
      Andrew Or authored
      <img width="931" alt="screen shot 2015-11-11 at 1 53 21 pm" src="https://cloud.githubusercontent.com/assets/2133137/11108261/35d183d4-889a-11e5-9572-85e9d6cebd26.png">
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #9638 from andrewor14/fix-kryo-docs.
    • [SPARK-2533] Add locality levels on stage summary view · 74c30049
      Jean-Baptiste Onofré authored
      Author: Jean-Baptiste Onofré <jbonofre@apache.org>
      
      Closes #9487 from jbonofre/SPARK-2533-2.
    • [SPARK-11671] documentation code example typo · 380dfcc0
      Chris Snow authored
      Example for sqlContext.createDataDrame from pandas.DataFrame has a typo
      
      Author: Chris Snow <chsnow123@gmail.com>
      
      Closes #9639 from snowch/patch-2.
    • [SPARK-11290][STREAMING][TEST-MAVEN] Fix the test for maven build · f0d3b58d
      Shixiong Zhu authored
      We should not create a SparkContext in the constructor of `TrackStateRDDSuite`. This is a follow-up PR for #9256 to fix the test for the Maven build.
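
      A sketch of the pattern (suite and app names are made up): build the context in `beforeAll` and stop it in `afterAll`, so merely instantiating the suite never starts a context.
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.scalatest.{BeforeAndAfterAll, FunSuite}

      class TrackStateSuiteSketch extends FunSuite with BeforeAndAfterAll {
        private var sc: SparkContext = _

        override def beforeAll(): Unit = {
          sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sketch"))
        }

        override def afterAll(): Unit = {
          if (sc != null) sc.stop()
        }
      }
      ```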
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #9668 from zsxwing/hotfix.
    • [SPARK-11655][CORE] Fix deadlock in handling of launcher stop(). · 767d288b
      Marcelo Vanzin authored
      The stop() callback was trying to close the launcher connection in the
      same thread that handles connection data, which ended up causing a
      deadlock. So avoid that by dispatching the stop() request in its own
      thread.
      
      On top of that, add some exception safety to a few parts of the code,
      and use "destroyForcibly" from Java 8 if it's available, to force
      kill the child process. The flip side is that "kill()" may not actually
      work if running Java 7.
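
      A rough sketch of both ideas (helper names are made up, not the launcher's actual code):
      ```scala
      // Never close the connection from its own I/O thread; hand the close off to
      // a separate thread so the two cannot deadlock on each other.
      def stopAsync(close: () => Unit): Unit = {
        val t = new Thread(new Runnable { def run(): Unit = close() }, "launcher-stop")
        t.setDaemon(true)
        t.start()
      }

      // Prefer Java 8's Process.destroyForcibly via reflection; fall back to
      // destroy() when running on Java 7.
      def forceKill(process: Process): Unit =
        try {
          classOf[Process].getMethod("destroyForcibly").invoke(process)
        } catch {
          case _: NoSuchMethodException => process.destroy()
        }
      ```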
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9633 from vanzin/SPARK-11655.