  1. Mar 19, 2016
    • Dongjoon Hyun's avatar
      [MINOR][DOCS] Use `spark-submit` instead of `sparkR` to submit R script. · 2082a495
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since `sparkR` is no longer used for submitting R scripts as of Spark 2.0, a user who follows the instructions in `R/README.md` hits the following error message. This PR updates `R/README.md`.
      ```bash
      $ ./bin/sparkR examples/src/main/r/dataframe.R
      Running R applications through 'sparkR' is not supported as of Spark 2.0.
      Use ./bin/spark-submit <R file>
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11842 from dongjoon-hyun/update_r_readme.
      2082a495
    • Reynold Xin's avatar
      [SPARK-14018][SQL] Use 64-bit num records in BenchmarkWholeStageCodegen · 1970d911
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      500L << 20 is actually pretty close to the 32-bit int limit. I was trying to increase this to 500L << 23 and got negative numbers instead.
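The overflow is easy to reproduce outside Spark by emulating 32-bit signed arithmetic (a hypothetical helper for illustration, not Spark code):

```python
def to_int32(x):
    """Truncate x to a signed 32-bit integer, emulating JVM int overflow."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

# 500 << 20 (~524 million) still fits in a 32-bit int...
print(to_int32(500 << 20))  # 524288000
# ...but 500 << 23 (~4.2 billion) wraps around to a negative number
print(to_int32(500 << 23))  # -100663296
```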
      
      ## How was this patch tested?
      I'm only modifying test code.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11839 from rxin/SPARK-14018.
      1970d911
    • Sameer Agarwal's avatar
      [SPARK-14012][SQL] Extract VectorizedColumnReader from VectorizedParquetRecordReader · b3959447
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This is a minor followup on https://github.com/apache/spark/pull/11799 that extracts out the `VectorizedColumnReader` from `VectorizedParquetRecordReader` into its own file.
      
      ## How was this patch tested?
      
      N/A (refactoring only)
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11834 from sameeragarwal/rename.
      b3959447
  2. Mar 18, 2016
    • Dongjoon Hyun's avatar
      [MINOR][DOCS] Update build descriptions and commands · c11ea2e4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR updates Scala and Hadoop versions in the build description and commands in `Building Spark` documents.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11838 from dongjoon-hyun/fix_doc_building_spark.
      c11ea2e4
    • Yuhao Yang's avatar
      [SPARK-13629][ML] Add binary toggle Param to CountVectorizer · f43a26ef
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      This is a continued work for https://github.com/apache/spark/pull/11536#issuecomment-198511013,
      containing some comment update and style adjustment.
      jkbradley
      
      ## How was this patch tested?
      
      unit tests.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #11830 from hhbyyh/cvToggle.
      f43a26ef
    • Sameer Agarwal's avatar
      [SPARK-13989] [SQL] Remove non-vectorized/unsafe-row parquet record reader · 54794113
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR cleans up the new parquet record reader with the following changes:
      
      1. Removes the non-vectorized parquet reader code from `UnsafeRowParquetRecordReader`.
      2. Removes the non-vectorized column reader code from `ColumnReader`.
      3. Renames `UnsafeRowParquetRecordReader` to `VectorizedParquetRecordReader` and `ColumnReader` to `VectorizedColumnReader`
      4. Deprecate `PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED`
      
      ## How was this patch tested?
      
      Refactoring only; Existing tests should reveal any problems.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11799 from sameeragarwal/vectorized-parquet.
      54794113
    • Yin Huai's avatar
      [SPARK-13972][SQL][FOLLOW-UP] When creating the query execution for a... · 238fb485
      Yin Huai authored
      [SPARK-13972][SQL][FOLLOW-UP] When creating the query execution for a converted SQL query, we eagerly trigger analysis
      
      ## What changes were proposed in this pull request?
      As part of testing SQL generation from an analyzed SQL plan, we run the generated SQL for tests in HiveComparisonTest. This PR makes the generated SQL eagerly analyzed, so when a generated SQL query has any analysis error, we can see the error message created by
      ```
                        case NonFatal(e) => fail(
                          s"""Failed to analyze the converted SQL string:
                              |
                              |# Original HiveQL query string:
                              |$queryString
                              |
                              |# Resolved query plan:
                              |${originalQuery.analyzed.treeString}
                              |
                              |# Converted SQL query string:
                              |$convertedSQL
                           """.stripMargin, e)
      ```
      
      Right now, if we can parse a generated SQL string but fail to analyze it, we will see the error message generated by the following code (it only mentions that we cannot execute the original query, i.e. `queryString`).
      ```
                  case e: Throwable =>
                    val errorMessage =
                      s"""
                        |Failed to execute query using catalyst:
                        |Error: ${e.getMessage}
                        |${stackTraceToString(e)}
                        |$queryString
                        |$query
                        |== HIVE - ${hive.size} row(s) ==
                        |${hive.mkString("\n")}
                      """.stripMargin
      ```
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #11825 from yhuai/SPARK-13972-follow-up.
      238fb485
    • Sital Kedia's avatar
      [SPARK-13958] Executor OOM due to unbounded growth of pointer array in… · 2e0c5284
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      This change fixes the executor OOM that was recently introduced in PR apache/spark#11095.
      
      ## How was this patch tested?
      Tested by running a Spark job on the cluster.
      
      … Sorter
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #11794 from sitalkedia/SPARK-13958.
      2e0c5284
    • jerryshao's avatar
      [SPARK-13885][YARN] Fix attempt id regression for Spark running on Yarn · 35377821
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      This regression was introduced in #9182. Previously the attempt id was simply a counter, "1" or "2". With the change in #9182 it became a full name such as "appattempt-xxx-00001", which affects everything that uses the attempt id, like the event log file name and the history server app URL link. This change switches it back to the counter to keep consistent with the previous code.
      
      This PR also reverts #11518, which fixed the history log URL link for the new attempt id format; since we are switching back to the previous format, that patch is no longer necessary.
      
      It also removes "spark.yarn.app.id" and "spark.yarn.app.attemptId", since they are no longer needed.
      
      ## How was this patch tested?
      
      Test it with unit test and manually test different scenario:
      
      1. application running in yarn-client mode.
      2. application running in yarn-cluster mode.
      3. application running in yarn-cluster mode with multiple attempts.
      
      Checked both the event log file name and url link.
      
      CC vanzin tgravescs, please help to review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #11721 from jerryshao/SPARK-13885.
      35377821
    • Davies Liu's avatar
      [SPARK-13977] [SQL] Brings back Shuffled hash join · 9c23c818
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      ShuffledHashJoin (including its outer-join variants) was removed in 1.6 in favor of SortMergeJoin, which is more robust and also fast.

      ShuffledHashJoin is still useful when: 1) one table is much smaller than the other, so the cost of building a hash table on the smaller table is lower than sorting the larger one, and 2) any single partition of the small table fits in memory.

      This PR brings back ShuffledHashJoin, essentially reverting #9645 and fixing the conflicts. It also merges outer join and left-semi join into the same class. This PR does not implement full outer join, because that cannot be done efficiently here (it would require building hash tables on both sides).
      
      A simple benchmark (one table is 5x smaller than other one) show that ShuffledHashJoin could be 2X faster than SortMergeJoin.
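The build-and-probe idea behind ShuffledHashJoin can be sketched as a toy inner join (illustrative Python, not the actual Spark operator):

```python
def shuffled_hash_join(small, large):
    """Toy inner hash join: build on the smaller relation, probe with the larger."""
    table = {}
    for key, value in small:            # build phase: O(|small|)
        table.setdefault(key, []).append(value)
    # probe phase: the large side is streamed, no sorting needed
    return [(k, sv, lv) for k, lv in large for sv in table.get(k, [])]

rows = shuffled_hash_join([(1, "a"), (2, "b")], [(1, "x"), (3, "y"), (1, "z")])
print(rows)  # [(1, 'a', 'x'), (1, 'a', 'z')]
```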
      
      ## How was this patch tested?
      
      Added new unit tests for ShuffledHashJoin.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11788 from davies/shuffle_join.
      9c23c818
    • Cheng Lian's avatar
      [SPARK-14004][SQL][MINOR] AttributeReference and Alias should only use the... · 14c7236d
      Cheng Lian authored
      [SPARK-14004][SQL][MINOR] AttributeReference and Alias should only use the first qualifier to generate SQL strings
      
      ## What changes were proposed in this pull request?
      
      The current implementations of `AttributeReference.sql` and `Alias.sql` join all available qualifiers, which is logically wrong. This mistake doesn't cause any real SQL generation bugs, though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`.

      This PR fixes the issue by only picking the first qualifier.
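The intended behavior amounts to using at most the first qualifier when rendering the attribute; a rough sketch with a hypothetical helper (not the actual Scala code):

```python
def sql_name(name, qualifiers):
    """Prefix the attribute name with the first qualifier only, if one exists."""
    return f"{qualifiers[0]}.{name}" if qualifiers else name

print(sql_name("id", []))            # id
print(sql_name("id", ["t1"]))        # t1.id
print(sql_name("id", ["t1", "t2"]))  # t1.id  (never t1.t2.id)
```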
      
      ## How was this patch tested?
      
      Existing tests should be enough.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11820 from liancheng/spark-14004-single-qualifier.
      14c7236d
    • Wenchen Fan's avatar
      [SPARK-13972][SQL] hive tests should fail if SQL generation failed · 0acb32a3
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Now we should be able to convert all logical plans to SQL strings if they were parsed from a Hive query. This PR changes the error handling to throw exceptions instead of just logging.
      
      We will send new PRs for spotted bugs, and merge this one after all bugs are fixed.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11782 from cloud-fan/test.
      0acb32a3
    • Zheng RuiFeng's avatar
      [MINOR][DOC] Fix nits in JavaStreamingTestExample · 53f32a22
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      Fix some nits discussed in https://github.com/apache/spark/pull/11776#issuecomment-198207419:

      * use `!rdd.isEmpty` instead of `rdd.count() > 0`
      * use `static` instead of `AtomicInteger`
      * remove the unneeded `throws Exception`
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11821 from zhengruifeng/je_fix.
      53f32a22
    • Wenchen Fan's avatar
      [SPARK-14001][SQL] support multi-children Union in SQLBuilder · 0f1015ff
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The fix is simple: use the existing `CombineUnions` rule to combine adjacent Unions before building the SQL string.
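Flattening adjacent unions can be sketched on a toy plan representation (tuples stand in for logical plan nodes here; this is not the Catalyst rule itself):

```python
def combine_unions(node):
    """Collapse nested binary unions into a single n-ary union node."""
    if node[0] != "union":
        return node
    children = []
    for child in node[1]:
        child = combine_unions(child)
        if child[0] == "union":
            children.extend(child[1])  # adjacent union: splice its children in
        else:
            children.append(child)
    return ("union", children)

nested = ("union", [("union", [("table", "a"), ("table", "b")]), ("table", "c")])
print(combine_unions(nested))  # ('union', [('table', 'a'), ('table', 'b'), ('table', 'c')])
```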
      
      ## How was this patch tested?
      
      The re-enabled test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11818 from cloud-fan/bug-fix.
      0f1015ff
    • Yanbo Liang's avatar
      [MINOR][ML] When trainingSummary is None, it should throw RuntimeException. · 7783b6f3
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      When trainingSummary is None, it should throw ```RuntimeException```.
      cc mengxr
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11784 from yanboliang/fix-summary.
      7783b6f3
    • Reynold Xin's avatar
      [SPARK-13826][SQL] Addendum: update documentation for Datasets · bb1fda01
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch updates documentations for Datasets. I also updated some internal documentation for exchange/broadcast.
      
      ## How was this patch tested?
      Just documentation/api stability update.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11814 from rxin/dataset-docs.
      bb1fda01
    • Liang-Chi Hsieh's avatar
      [SPARK-13930] [SQL] Apply fast serialization on collect limit operator · 750ed64c
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-13930
      
      Recently, fast serialization was introduced for collecting a DataFrame/Dataset (#11664). The same technique can be used for the collect limit operator too.
      
      ## How was this patch tested?
      
      Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`.
      
      Without this patch:
      
          model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
          collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          -------------------------------------------------------------------------------------------
          collect limit 1 million                  3413 / 3768          0.3        3255.0       1.0X
          collect limit 2 millions                9728 / 10440          0.1        9277.3       0.4X
      
      With this patch:
      
          model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
          collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          -------------------------------------------------------------------------------------------
          collect limit 1 million                   833 / 1284          1.3         794.4       1.0X
          collect limit 2 millions                 3348 / 4005          0.3        3193.3       0.2X
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #11759 from viirya/execute-take.
      750ed64c
  3. Mar 17, 2016
    • Cheng Lian's avatar
      [SPARK-13826][SQL] Revises Dataset ScalaDoc · 10ef4f3e
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR revises Dataset API ScalaDoc.  All public methods are divided into the following groups
      
      * `groupname basic`: Basic Dataset functions
      * `groupname action`: Actions
      * `groupname untypedrel`: Untyped Language Integrated Relational Queries
      * `groupname typedrel`: Typed Language Integrated Relational Queries
      * `groupname func`: Functional Transformations
      * `groupname rdd`: RDD Operations
      * `groupname output`: Output Operations
      
      `since` tag and sample code are also updated.  We may want to add more sample code for typed APIs.
      
      ## How was this patch tested?
      
      Documentation change.  Checked by building unidoc locally.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11769 from liancheng/spark-13826-ds-api-doc.
      10ef4f3e
    • tedyu's avatar
      [SPARK-12719][HOTFIX] Fix compilation against Scala 2.10 · 90a1d8db
      tedyu authored
      PR #11696 introduced a complex pattern match that broke the Scala 2.10 match unreachability check and caused a build failure. This PR fixes the issue by expanding that pattern match into several simpler ones.
      
      Note that tuning or turning off `-Dscalac.patmat.analysisBudget` doesn't work for this case.
      
      Compilation against Scala 2.10
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #11798 from yy2016/master.
      90a1d8db
    • Josh Rosen's avatar
      [SPARK-13921] Store serialized blocks as multiple chunks in MemoryStore · 6c2d894a
      Josh Rosen authored
      This patch modifies the BlockManager, MemoryStore, and several other storage components so that serialized cached blocks are stored as multiple small chunks rather than as a single contiguous ByteBuffer.
      
      This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks. Our current serialization code uses a ByteBufferOutputStream, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted.
      
      This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not been implemented yet).
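The memory-waste arithmetic from the description can be sketched as follows (the 4 MB chunk size is an assumption for illustration, not the value Spark uses):

```python
MB = 1024 * 1024

def doubled_capacity(n, initial=64):
    """Capacity reserved by a doubling byte-array buffer holding n bytes."""
    cap = initial
    while cap < n:
        cap *= 2
    return cap

def chunked_capacity(n, chunk=4 * MB):
    """Capacity reserved when storing n bytes as fixed-size chunks."""
    return -(-n // chunk) * chunk  # ceil(n / chunk) * chunk

# a 129 MB block: the doubling buffer reserves 256 MB, chunks reserve 132 MB
print(doubled_capacity(129 * MB) // MB, chunked_capacity(129 * MB) // MB)  # 256 132
```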
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11748 from JoshRosen/chunked-block-serialization.
      6c2d894a
    • Wenchen Fan's avatar
      [SPARK-13976][SQL] do not remove sub-queries added by user when generate SQL · 6037ed0a
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      We haven't figured out the correct logic for adding sub-queries yet, so we should not clear all sub-queries before generating SQL. This PR changes the logic to only remove sub-queries directly above a table relation.

      An example of this bug. Original SQL: `SELECT a FROM (SELECT a FROM tbl) t WHERE a = 1`
      before this PR, we will generate:
      ```
      SELECT attr_1 AS a FROM
        SELECT attr_1 FROM (
          SELECT a AS attr_1 FROM tbl
        ) AS sub_q0
        WHERE attr_1 = 1
      ```
      We missed a sub-query and this SQL string is illegal.
      
      After this PR, we will generate:
      ```
      SELECT attr_1 AS a FROM (
        SELECT attr_1 FROM (
          SELECT a AS attr_1 FROM tbl
        ) AS sub_q0
        WHERE attr_1 = 1
      ) AS t
      ```
      
      TODO: for long term, we should find a way to add sub-queries correctly, so that arbitrary logical plans can be converted to SQL string.
      
      ## How was this patch tested?
      
      `LogicalPlanToSQLSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11786 from cloud-fan/bug-fix.
      6037ed0a
    • Wenchen Fan's avatar
      [SPARK-13974][SQL] sub-query names do not need to be globally unique while generate SQL · 453455c4
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      We only need to make sub-query names unique within each generated SQL string, not globally. This PR moves the `newSubqueryName` method into `class SQLBuilder` and removes `object SQLBuilder`.

      It also addresses 2 minor comments in https://github.com/apache/spark/pull/11696
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11783 from cloud-fan/tmp.
      453455c4
    • sethah's avatar
      [SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees · 1614485f
      sethah authored
      Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.
      
      Say there are 3 categories A, B, C. We consider 3 splits:
      
      * A vs. B, C
      * A, B vs. C
      * A, C vs. B
      
      Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for only the 3 subsets on the left-hand side of the 3 possible splits: A; A,B; and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).
      
      This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features.
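The pseudomath above is just element-wise subtraction over aggregated stat buffers; a minimal sketch (not the actual DTStatsAggregator code):

```python
def right_child_stats(parent_stats, left_stats):
    """Recover right-child stats as parent minus left: stats(B,C) = stats(A,B,C) - stats(A)."""
    return [p - l for p, l in zip(parent_stats, left_stats)]

# counts per label for the whole node (A,B,C) and for the left subset (A)
print(right_child_stats([10.0, 6.0], [4.0, 1.0]))  # [6.0, 5.0]
```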
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #9474 from sethah/SPARK-10788.
      1614485f
    • Joseph K. Bradley's avatar
      [SPARK-13761][ML] Remove remaining uses of validateParams · b39e80d3
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Cleanups from [https://github.com/apache/spark/pull/11620]: remove remaining uses of validateParams, and put functionality into transformSchema
      
      ## How was this patch tested?
      
      Existing unit tests, modified to check using transformSchema instead of validateParams
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11790 from jkbradley/SPARK-13761-cleanup.
      b39e80d3
    • Yin Huai's avatar
      Revert "[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10" · 4c08e2c0
      Yin Huai authored
      This reverts commit 3ee79961.
      4c08e2c0
    • Xusen Yin's avatar
      [SPARK-11891] Model export/import for RFormula and RFormulaModel · edf8b877
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-11891
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #9884 from yinxusen/SPARK-11891.
      edf8b877
    • Bryan Cutler's avatar
      [SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static to member variable · 828213d4
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      In PySpark wrapper.py JavaWrapper change _java_obj from an unused static variable to a member variable that is consistent with usage in derived classes.
      
      ## How was this patch tested?
      Ran python tests for ML and MLlib.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
      828213d4
    • tedyu's avatar
      [SPARK-12719][HOTFIX] Fix compilation against Scala 2.10 · 3ee79961
      tedyu authored
      ## What changes were proposed in this pull request?
      
      Compilation against Scala 2.10 fails with:
      ```
      [error] [warn] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala:483: Cannot check match for         unreachability.
      [error] (The analysis required more space than allowed. Please try with scalac -Dscalac.patmat.analysisBudget=512 or -Dscalac.patmat.analysisBudget=off.)
      [error] [warn]     private def addSubqueryIfNeeded(plan: LogicalPlan): LogicalPlan = plan match {
      ```
      
      ## How was this patch tested?
      
      Compilation against Scala 2.10
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #11787 from yy2016/master.
      3ee79961
    • Liang-Chi Hsieh's avatar
      [SPARK-13838] [SQL] Clear variable code to prevent it from being re-evaluated in BoundAttribute · 5f3bda6f
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13838
      ## What changes were proposed in this pull request?
      
      We should also clear the variable code in `BoundReference.genCode` to prevent it from being evaluated twice, as we did in `evaluateVariables`.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #11674 from viirya/avoid-reevaluate.
      5f3bda6f
    • Dilip Biswal's avatar
      [SPARK-13427][SQL] Support USING clause in JOIN. · 637a78f1
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      
      Support queries that JOIN tables with a USING clause:
      SELECT * FROM table1 JOIN table2 USING <column_list>

      The USING clause can be used as a means to simplify the join condition when:

      1) equijoin semantics are desired, and
      2) the columns being joined on have the same name in both tables.
      
      We already have the support for Natural Join in Spark. This PR makes
      use of the already existing infrastructure for natural join to
      form the join condition and also the projection list.
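The USING semantics can be tried out in SQLite, which supports the same clause (the table and column names below are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (id INTEGER, a TEXT)")
con.execute("CREATE TABLE table2 (id INTEGER, b TEXT)")
con.execute("INSERT INTO table1 VALUES (1, 'x'), (2, 'y')")
con.execute("INSERT INTO table2 VALUES (1, 'p')")

# USING (id) expresses the equijoin table1.id = table2.id and, like a
# natural join, keeps a single copy of the shared column in SELECT *
rows = con.execute("SELECT * FROM table1 JOIN table2 USING (id)").fetchall()
print(rows)  # [(1, 'x', 'p')]
```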
      
      ## How was this patch tested?
      
      Added unit tests in SQLQuerySuite, CatalystQlSuite, and ResolveNaturalJoinSuite.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #11297 from dilipbiswal/spark-13427.
      637a78f1
    • Shixiong Zhu's avatar
      [SPARK-13776][WEBUI] Limit the max number of acceptors and selectors for Jetty · 65b75e66
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Since each acceptor/selector in Jetty uses one thread, the number of threads should be at least the number of acceptors and selectors plus 1. Otherwise, the Jetty server's thread pool may be exhausted by acceptors/selectors and unable to respond to any request.
      
      To avoid wasting threads, the PR limits the max number of acceptors and selectors and also updates the max thread number if necessary.
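The sizing rule reduces to simple arithmetic; a sketch of the constraint with hypothetical helper names (not the actual Jetty/Spark code):

```python
def min_thread_pool_size(acceptors, selectors):
    """Each acceptor/selector pins one thread; at least one more must remain for requests."""
    return acceptors + selectors + 1

def adjusted_max_threads(configured_max, acceptors, selectors):
    """Raise the configured pool cap if acceptors/selectors would otherwise exhaust it."""
    return max(configured_max, min_thread_pool_size(acceptors, selectors))

print(adjusted_max_threads(8, 4, 4))    # 9  (cap raised so one request thread survives)
print(adjusted_max_threads(200, 4, 4))  # 200 (already large enough, unchanged)
```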
      
      ## How was this patch tested?
      
      Just make sure we don't break any existing tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11615 from zsxwing/SPARK-13776.
      65b75e66
    • Wenchen Fan's avatar
      [SPARK-12719][SQL] SQL generation support for Generate · 1974d1d3
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR adds SQL generation support for the `Generate` operator. It always converts a `Generate` operator into `LATERAL VIEW` format, as there are many limitations on putting a UDTF in the project list.
      
      This PR is based on https://github.com/apache/spark/pull/11658, please see the last commit to review the real changes.
      
      Thanks dilipbiswal for his initial work! Takes over https://github.com/apache/spark/pull/11596
      
      ## How was this patch tested?
      
      new tests in `LogicalPlanToSQLSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11696 from cloud-fan/generate.
      1974d1d3
    • Wenchen Fan's avatar
      [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging · 8ef3399a
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11764 from cloud-fan/logger.
      8ef3399a
    • trueyao's avatar
      [SPARK-13901][CORE] correct the logDebug information when jump to the next locality level · ea9ca6f0
      trueyao authored
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-13901
      In the getAllowedLocalityLevel method of TaskSetManager, we log the wrong logDebug information when jumping to the next locality level, so we should fix it.
      
      Author: trueyao <501663994@qq.com>
      
      Closes #11719 from trueyao/logDebug-localityWait.
      ea9ca6f0
    • Yuhao Yang's avatar
      [SPARK-13629][ML] Add binary toggle Param to CountVectorizer · 357d82d8
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
      If set, then all non-zero counts will be set to 1.
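The toggle's effect on a count vector is easy to sketch (illustrative only, not the Spark implementation):

```python
def binarize(counts):
    """With binary=True, every non-zero term count collapses to 1."""
    return [1.0 if c > 0 else 0.0 for c in counts]

# raw term counts for a document vs. its binary presence/absence vector
print(binarize([3.0, 0.0, 1.0]))  # [1.0, 0.0, 1.0]
```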
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #11536 from hhbyyh/cvToggle.
      357d82d8
    • Zheng RuiFeng's avatar
      [MINOR][DOC] Add JavaStreamingTestExample · 204c9dec
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      Add the java example of StreamingTest
      
      ## How was this patch tested?
      
      manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11776 from zhengruifeng/streaming_je.
      204c9dec
    • Davies Liu's avatar
      Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPruning... · 30c18841
      Davies Liu authored
      Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator"
      
      This reverts commit 99bd2f0e.
      30c18841
    • Josh Rosen's avatar
      [SPARK-13948] MiMa check should catch if the visibility changes to private · 82066a16
      Josh Rosen authored
      MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version.
      
      This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11774 from JoshRosen/SPARK-13948.
      82066a16
    • Ryan Blue's avatar
      [SPARK-13403][SQL] Pass hadoopConfiguration to HiveConf constructors. · 5faba9fa
      Ryan Blue authored
      This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf.
      
      I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)).
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.
      5faba9fa
    • Josh Rosen's avatar
      [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types · de1a84e5
      Josh Rosen authored
      Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.
      
      This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.
      
      In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).
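The "best serializer" dispatch can be caricatured as a type check (a hypothetical Python analogue of the ClassTag-based selection; the real SerializerManager works on JVM ClassTags and also covers primitive arrays):

```python
KRYO_FRIENDLY = {int, float, bool, str, bytes}

def pick_serializer(key_type, value_type):
    """Pick Kryo when both key and value types are known to be Kryo-compatible."""
    if key_type in KRYO_FRIENDLY and value_type in KRYO_FRIENDLY:
        return "kryo"
    return "java"  # fall back to the default serializer for arbitrary objects

print(pick_serializer(int, str))   # kryo
print(pick_serializer(int, dict))  # java
```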
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11755 from JoshRosen/automatically-pick-best-serializer.
      de1a84e5