  1. Jun 05, 2016
    • [MINOR][R][DOC] Fix R documentation generation instruction. · 8a911051
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      changes in R/README.md
      
      - Make the steps for generating SparkR documentation clearer.
      - link R/DOCUMENTATION.md from R/README.md
      - turn on some code syntax highlighting in R/README.md
      
      ## How was this patch tested?
      local test
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #13488 from vectorijk/R-Readme.
    • [SPARK-15770][ML] Annotation audit for Experimental and DeveloperApi · 372fa61f
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1, remove comments `:: Experimental ::` for non-experimental API
      2, add comments `:: Experimental ::` for experimental API
      3, add comments `:: DeveloperApi ::` for DeveloperApi API
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13514 from zhengruifeng/del_experimental.
    • [SPARK-15723] Fixed local-timezone-brittle test where short-timezone form "EST" is … · 4e767d0f
      Brett Randall authored
      ## What changes were proposed in this pull request?
      
      Stop using the abbreviated and ambiguous timezone "EST" in a test, since it is machine-local default timezone dependent, and fails in different timezones.  Fixed [SPARK-15723](https://issues.apache.org/jira/browse/SPARK-15723).
      
      ## How was this patch tested?
      
      Note that to reproduce this problem in any locale/timezone, you can modify the scalatest-maven-plugin argLine to add a timezone:
      
          <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="Australia/Sydney"</argLine>
      
      and run
      
          $ mvn test -DwildcardSuites=org.apache.spark.status.api.v1.SimpleDateParamSuite -Dtest=none

      Equally, this will fix it in an affected timezone:
      
          <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="America/New_York"</argLine>
      
      To test the fix, apply the above change to `pom.xml` to set test TZ to `Australia/Sydney`, and confirm the test now passes.
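
      The ambiguity is easy to demonstrate in plain Scala (an illustration, not part of the patch): Java maps the legacy three-letter ID "EST" to a fixed -05:00 zone with no daylight-saving rules, while full region IDs carry complete rules.

      ```scala
      import java.util.TimeZone

      val est = TimeZone.getTimeZone("EST")                 // legacy abbreviation
      val sydney = TimeZone.getTimeZone("Australia/Sydney") // unambiguous region ID

      println(est.getRawOffset / 3600000)  // fixed -5, never observes DST
      println(est.useDaylightTime())       // false
      println(sydney.useDaylightTime())    // true
      ```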
      
      Author: Brett Randall <javabrett@gmail.com>
      
      Closes #13462 from javabrett/SPARK-15723-SimpleDateParamSuite.
  2. Jun 04, 2016
    • [SPARK-15707][SQL] Make Code Neat - Use map instead of if check. · 0f307db5
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      In the `forType` function of object `RandomDataGenerator`, the following code:

          if (maybeSqlTypeGenerator.isDefined) {
            ....
            Some(generator)
          } else {
            None
          }

      will be changed to use `maybeSqlTypeGenerator.map` instead.
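
      A minimal, self-contained sketch of the refactoring (with a hypothetical `buildGenerator` standing in for the real generator logic):

      ```scala
      // Hypothetical stand-in for building a generator from the wrapped value.
      def buildGenerator(x: Int): Int = x + 1

      // The if/isDefined pattern the PR removes:
      def viaIf(maybeSqlTypeGenerator: Option[Int]): Option[Int] =
        if (maybeSqlTypeGenerator.isDefined) {
          val generator = buildGenerator(maybeSqlTypeGenerator.get)
          Some(generator)
        } else {
          None
        }

      // The equivalent, neater form using Option.map:
      def viaMap(maybeSqlTypeGenerator: Option[Int]): Option[Int] =
        maybeSqlTypeGenerator.map(buildGenerator)
      ```

      Both forms agree on `Some` and `None` inputs; `map` simply threads the `None` case through.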
      
      ## How was this patch tested?
      All of the current unit tests passed.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #13448 from Sherry302/master.
    • [SPARK-15762][SQL] Cache Metadata & StructType hashCodes; use singleton Metadata.empty · 091f81e1
      Josh Rosen authored
      We should cache `Metadata.hashCode` and use a singleton for `Metadata.empty` because calculating metadata hashCodes appears to be a bottleneck for certain workloads.
      
      We should also cache `StructType.hashCode`.
      
      In an optimizer stress-test benchmark run by ericl, these `hashCode` calls accounted for roughly 40% of the total CPU time and this bottleneck was completely eliminated by the caching added by this patch.
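
      As an illustration of the idea (a hedged sketch, not Spark's actual `Metadata` class), caching an expensive `hashCode` on an immutable wrapper can be done with a `lazy val`, plus a shared singleton for the empty case:

      ```scala
      // Sketch only: an immutable wrapper that memoizes its hashCode.
      class CachedMetadata(val map: Map[String, String]) {
        // Computed once on first use; safe because `map` never changes.
        override lazy val hashCode: Int = map.hashCode()

        override def equals(other: Any): Boolean = other match {
          case that: CachedMetadata => this.map == that.map
          case _ => false
        }
      }

      object CachedMetadata {
        // One shared instance for the common empty case.
        val empty: CachedMetadata = new CachedMetadata(Map.empty)
      }
      ```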
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13504 from JoshRosen/metadata-fix.
    • [MINOR][BUILD] Add modernizr MIT license; specify "2014 and onwards" in license copyright · 681387b2
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Per conversation on dev list, add missing modernizr license.
      Specify "2014 and onwards" in copyright statement.
      
      ## How was this patch tested?
      
      (none required)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13510 from srowen/ModernizrLicense.
    • [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score · 2099e05f
      Ruifeng Zheng authored
      ## What changes were proposed in this pull request?
      1, del precision, recall in `ml.MulticlassClassificationEvaluator`
      2, update user guide for `mllib.weightedFMeasure`
      
      ## How was this patch tested?
      local build
      
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #13390 from zhengruifeng/clarify_f1.
    • [SPARK-15756][SQL] Support command 'create table stored as orcfile/parquetfile/avrofile' · 2ca563cc
      Lianhui Wang authored
      ## What changes were proposed in this pull request?
      Now Spark SQL supports 'create table src stored as orc/parquet/avro' for ORC/Parquet/Avro tables, but Hive supports both commands: 'stored as orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'.
      So this PR supports these keywords 'orcfile/parquetfile/avrofile' in Spark SQL.
      
      ## How was this patch tested?
      add unit tests
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #13500 from lianhuiwang/SPARK-15756.
  3. Jun 03, 2016
    • [SPARK-15754][YARN] Not letting the credentials containing hdfs delegation... · 61d729ab
      Subroto Sanyal authored
      [SPARK-15754][YARN] Not letting the credentials containing hdfs delegation tokens to be added in current user credential.
      
      ## What changes were proposed in this pull request?
      The credentials are not added to the credentials of `UserGroupInformation.getCurrentUser()`. Further, if the client can log in using a keytab, the updateDelegationToken thread is not started on the client.
      
      ## How was this patch tested?
      ran dev/run-tests
      
      Author: Subroto Sanyal <ssanyal@datameer.com>
      
      Closes #13499 from subrotosanyal/SPARK-15754-save-ugi-from-changing.
    • [SPARK-15391] [SQL] manage the temporary memory of timsort · 3074f575
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, the memory for the temporary buffer used by TimSort is always allocated on-heap without bookkeeping, which can cause OOM in both on-heap and off-heap modes.

      This PR tries to manage that memory by preallocating it together with the pointer array, the same as RadixSort. It works in both on-heap and off-heap modes.

      This PR also changes the loadFactor of BytesToBytesMap to 0.5 (it was 0.70); this enables the use of radix sort and also makes sure that we have enough memory for TimSort.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13318 from davies/fix_timsort.
    • [SPARK-15168][PYSPARK][ML] Add missing params to MultilayerPerceptronClassifier · 67cc89ff
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. Also clarify the scaladoc a bit while we are updating these params.
      
      Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169 )
      
      ## How was this patch tested?
      
      Doc tests
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.
    • [SPARK-15722][SQL] Disallow specifying schema in CTAS statement · b1cc7da3
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      As of this patch, the following throws an exception because the schemas may not match:
      ```
      CREATE TABLE students (age INT, name STRING) AS SELECT * FROM boxes
      ```
      but this is OK:
      ```
      CREATE TABLE students AS SELECT * FROM boxes
      ```
      
      ## How was this patch tested?
      
      SQLQuerySuite, HiveDDLCommandSuite
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13490 from andrewor14/ctas-no-column.
    • [SPARK-15140][SQL] make the semantics of null input object for encoder clear · 11c83f83
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow a row to be null; only its columns can be null.

      This PR explicitly adds this constraint and throws an exception if users break it.
      
      ## How was this patch tested?
      
      several new tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13469 from cloud-fan/null-object.
    • [SPARK-15681][CORE] allow lowercase or mixed case log level string when calling sc.setLogLevel · 28ad0f7b
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Currently the `SparkContext` API `setLogLevel(level: String)` cannot handle lowercase or mixed-case input strings, but `org.apache.log4j.Level.toLevel` can.
      
      This PR is to allow case-insensitive user input for the log level.
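
      A minimal sketch of the case-insensitive handling (the names here are assumed for illustration; the real implementation lives in Spark):

      ```scala
      import java.util.Locale

      // log4j's standard level names.
      val validLevels = Set("ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN")

      // Normalize user input so "warn", "Warn" and "WARN" are all accepted.
      def normalizeLogLevel(level: String): String = {
        val upper = level.toUpperCase(Locale.ROOT)
        require(validLevels.contains(upper),
          s"Supplied level $level did not match one of: ${validLevels.mkString(", ")}")
        upper
      }
      ```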
      
      ## How was this patch tested?
      A unit testcase is added.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #13422 from xwu0226/reset_loglevel.
    • [SPARK-15547][SQL] nested case class in encoder can have different number of... · 61b80d55
      Wenchen Fan authored
      [SPARK-15547][SQL] nested case class in encoder can have different number of fields from the real schema
      
      ## What changes were proposed in this pull request?
      
      There are 2 kinds of `GetStructField`:
      
      1. resolved from `UnresolvedExtractValue`, and it will have a `name` property.
      2. created when we build deserializer expression for nested tuple, no `name` property.
      
      When we want to validate the ordinals of nested tuple, we should only catch `GetStructField` without the name property.
      
      ## How was this patch tested?
      
      new test in `EncoderResolutionSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13474 from cloud-fan/ordinal-check.
    • [SPARK-15286][SQL] Make the output readable for EXPLAIN CREATE TABLE and DESC EXTENDED · eb10b481
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Before this PR, the output of EXPLAIN of following SQL is like
      
      ```SQL
      CREATE EXTERNAL TABLE extTable_with_partitions (key INT, value STRING)
      PARTITIONED BY (ds STRING, hr STRING)
      LOCATION '/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-b39a6185-8981-403b-a4aa-36fb2f4ca8a9'
      ```
      ``ExecutedCommand CreateTableCommand CatalogTable(`extTable_with_partitions`,CatalogTableType(EXTERNAL),CatalogStorageFormat(Some(/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-dd234718-e85d-4c5a-8353-8f1834ac0323),Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(key,int,true,None), CatalogColumn(value,string,true,None), CatalogColumn(ds,string,true,None), CatalogColumn(hr,string,true,None)),List(ds, hr),List(),List(),-1,,1463026413544,-1,Map(),None,None,None), false``
      
      After this PR, the output is like
      
      ```
      ExecutedCommand
      :  +- CreateTableCommand CatalogTable(
      	Table:`extTable_with_partitions`
      	Created:Thu Jun 02 21:30:54 PDT 2016
      	Last Access:Wed Dec 31 15:59:59 PST 1969
      	Type:EXTERNAL
      	Schema:[`key` int, `value` string, `ds` string, `hr` string]
      	Partition Columns:[`ds`, `hr`]
      	Storage(Location:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a06083b8-8e88-4d07-9ff0-d6bd8d943ad3, InputFormat:org.apache.hadoop.mapred.TextInputFormat, OutputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false
      ```
      
      This is also applicable to `DESC EXTENDED`. However, this does not have special handling for Data Source tables. If needed, we need to move the logic of `DDLUtil`. Let me know if we should do it in this PR. Thanks! rxin liancheng
      
      #### How was this patch tested?
      Manual testing
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13070 from gatorsmile/betterExplainCatalogTable.
    • [SPARK-15742][SQL] Reduce temp collections allocations in TreeNode transform methods · e5269139
      Josh Rosen authored
      In Catalyst's TreeNode transform methods we end up calling `productIterator.map(...).toArray` in a number of places, which is slightly inefficient because it needs to allocate an `ArrayBuilder` and grow a temporary array. Since we already know the size of the final output (`productArity`), we can simply allocate an array up-front and use a while loop to consume the iterator and populate the array.
      
      For most workloads, this performance difference is negligible but it does make a measurable difference in optimizer performance for queries that operate over very wide schemas (such as the benchmark queries in #13456).
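
      The pattern can be sketched in isolation (illustrative only; the real code operates on Catalyst `TreeNode`s):

      ```scala
      // Map over a Product's fields into a preallocated array, instead of
      // productIterator.map(f).toArray (which grows a temporary ArrayBuilder).
      def mapProductToArray(p: Product, f: Any => Any): Array[Any] = {
        val arr = new Array[Any](p.productArity) // final size known up-front
        val it = p.productIterator
        var i = 0
        while (it.hasNext) {
          arr(i) = f(it.next())
          i += 1
        }
        arr
      }
      ```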
      
      ### Perf results (from #13456 benchmarks)
      
      **Before**
      
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Mac OS X 10.10.5
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      
      parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      1 select expressions                            19 /   22          0.0    19119858.0       1.0X
      10 select expressions                           23 /   25          0.0    23208774.0       0.8X
      100 select expressions                          55 /   73          0.0    54768402.0       0.3X
      1000 select expressions                        229 /  259          0.0   228606373.0       0.1X
      2500 select expressions                        530 /  554          0.0   529938178.0       0.0X
      ```
      
      **After**
      
      ```
      parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      1 select expressions                            15 /   21          0.0    14978203.0       1.0X
      10 select expressions                           22 /   27          0.0    22492262.0       0.7X
      100 select expressions                          48 /   64          0.0    48449834.0       0.3X
      1000 select expressions                        189 /  208          0.0   189346428.0       0.1X
      2500 select expressions                        429 /  449          0.0   428943897.0       0.0X
      ```
      
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13484 from JoshRosen/treenode-productiterator-map.
    • [SPARK-15665][CORE] spark-submit --kill and --status are not working · efd3b11a
      Devaraj K authored
      ## What changes were proposed in this pull request?
      `--kill` and `--status` were not handled in `OptionParser`, which caused them to fail. They are now handled as part of `OptionParser.handle`.
      
      ## How was this patch tested?
      Added a test org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.testCliKillAndStatus() and also I have verified these manually by running --kill and --status commands.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #13407 from devaraj-kavali/SPARK-15665.
    • [SPARK-15677][SQL] Query with scalar sub-query in the SELECT list throws... · 9e2eb13c
      Ioana Delaney authored
      [SPARK-15677][SQL] Query with scalar sub-query in the SELECT list throws UnsupportedOperationException
      
      ## What changes were proposed in this pull request?
      Queries with a scalar sub-query in the SELECT list, run against a local in-memory relation, throw an UnsupportedOperationException.
      
      Problem repro:
      ```SQL
      scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
      scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
      scala> sql("select (select min(c1) from t2) from t1").show()
      
      java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#62 []
        at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:215)
        at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:62)
        at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:45)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:29)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$37.applyOrElse(Optimizer.scala:1473)
      ```
      The problem is specific to local, in-memory relations. It is caused by the rule ConvertToLocalRelation, which attempts to push down a scalar-subquery expression to the local tables.

      The solution prevents the rule from applying if the Project references scalar subqueries.
      
      ## How was this patch tested?
      Added regression tests to SubquerySuite.scala
      
      Author: Ioana Delaney <ioanamdelaney@gmail.com>
      
      Closes #13418 from ioana-delaney/scalarSubV2.
    • [SPARK-15737][CORE] fix jetty warning · 8fa00dd0
      bomeng authored
      ## What changes were proposed in this pull request?
      
      After upgrading Jetty to 9.2, we always see "WARN org.eclipse.jetty.server.handler.AbstractHandler: No Server set for org.eclipse.jetty.server.handler.ErrorHandler" while running any test cases.
      
      This PR will fix it.
      
      ## How was this patch tested?
      
      The existing test cases will cover it.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #13475 from bomeng/SPARK-15737.
    • [SPARK-15714][CORE] Fix flaky o.a.s.scheduler.BlacklistIntegrationSuite · c2f0cb4f
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      BlacklistIntegrationSuite (introduced by SPARK-10372) is a bit flaky because of some race conditions:
      1. Failed jobs might have non-empty results, because the resultHandler will be invoked for successful tasks (if there are task successes before failures)
      2. taskScheduler.taskIdToTaskSetManager must be protected by a lock on taskScheduler
      
      (1) has failed a handful of jenkins builds recently.  I don't think I've seen (2) in jenkins, but I've run into it with some uncommitted tests I'm working on where there are lots more tasks.
      
      While I was in there, I also made an unrelated fix to `runningTasks` in the test framework -- there was a pointless `O(n)` operation to remove completed tasks that could be `O(1)`.
      
      ## How was this patch tested?
      
      I modified the o.a.s.scheduler.BlacklistIntegrationSuite to have it run the tests 1k times on my laptop.  It failed 11 times before this change, and none with it.  (Pretty sure all the failures were problem (1), though I didn't check all of them).
      
      Also the full suite of tests via jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #13454 from squito/SPARK-15714.
    • [SPARK-15494][SQL] encoder code cleanup · 190ff274
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Our encoder framework has evolved a lot. This PR tries to clean up the code to make it more readable and emphasise the concept that an encoder should be used as a container of serde expressions.
      
      1. move validation logic to analyzer instead of encoder
      2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
      3. `Dataset` doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework.
      4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups)
      
      ## How was this patch tested?
      
      existing test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13269 from cloud-fan/clean-encoder.
    • [SPARK-15744][SQL] Rename two TungstenAggregation*Suites and update codgen/error messages/comments · b9fcfb3b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      For consistency, this PR updates some remaining `TungstenAggregation/SortBasedAggregate` after SPARK-15728.
      - Update a comment in codegen in `VectorizedHashMapGenerator.scala`.
      - `TungstenAggregationQuerySuite` --> `HashAggregationQuerySuite`
      - `TungstenAggregationQueryWithControlledFallbackSuite` --> `HashAggregationQueryWithControlledFallbackSuite`
      - Update two error messages in `SQLQuerySuite.scala` and `AggregationQuerySuite.scala`.
      - Update several comments.
      
      ## How was this patch tested?
      
      Manual (Only comment changes and test suite renamings).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13487 from dongjoon-hyun/SPARK-15744.
    • [SPARK-15745][SQL] Use classloader's getResource() for reading resource files in HiveTests · f7288e16
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This is a cleaner approach in general but my motivation behind this change in particular is to be able to run these tests from anywhere without relying on system properties.
      
      ## How was this patch tested?
      
      Test only change
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13489 from sameeragarwal/resourcepath.
    • [SPARK-14959][SQL] handle partitioned table directories in distributed filesystem · 76aa45d3
      Xin Wu authored
      ## What changes were proposed in this pull request?
      ##### The root cause:
      When `DataSource.resolveRelation` is trying to build a `ListingFileCatalog` object, `ListLeafFiles` is invoked, where a list of `FileStatus` objects is retrieved from the provided path. These `FileStatus` objects include the directories for the partitions (id=0 and id=2 in the jira). However, these directory `FileStatus` objects also invoke `getFileBlockLocations`, which is not allowed on directories for `DistributedFileSystem`, hence the exception.
      
      This PR is to remove the block of code that invokes `getFileBlockLocations` for every FileStatus object of the provided path. Instead, we call `HadoopFsRelation.listLeafFiles` directly because this utility method filters out the directories before calling `getFileBlockLocations` for generating `LocatedFileStatus` objects.
      
      ## How was this patch tested?
      Regtest is run. Manual test:
      ```
      scala> spark.read.format("parquet").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_part").show
      +-----+---+
      | text| id|
      +-----+---+
      |hello|  0|
      |world|  0|
      |hello|  1|
      |there|  1|
      +-----+---+
      
             spark.read.format("orc").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_orc").show
      +-----+---+
      | text| id|
      +-----+---+
      |hello|  0|
      |world|  0|
      |hello|  1|
      |there|  1|
      +-----+---+
      ```
      I also tried it with 2 level of partitioning.
      I have not found a way to add a test case to the unit test bucket that can test a real HDFS file location. Any suggestions will be appreciated.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #13463 from xwu0226/SPARK-14959.
    • [SPARK-15733][SQL] Makes the explain output less verbose by hiding some... · 6dde2740
      Sean Zhong authored
      [SPARK-15733][SQL] Makes the explain output less verbose by hiding some verbose output like None, null, empty List, and etc.
      
      ## What changes were proposed in this pull request?
      
      This PR makes the explain output less verbose by hiding some verbose output like `None`, `null`, the empty list `[]`, the empty set `{}`, etc.
      
      **Before change**:
      
      ```
      == Physical Plan ==
      ExecutedCommand
      :  +- ShowTablesCommand None, None
      ```
      
      **After change**:
      
      ```
      == Physical Plan ==
      ExecutedCommand
      :  +- ShowTablesCommand
      ```
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13470 from clockfly/verbose_breakdown_4.
  4. Jun 02, 2016
    • [SPARK-15724] Add benchmarks for performance over wide schemas · 901b2e69
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This adds microbenchmarks for tracking performance of queries over very wide or deeply nested DataFrames. It seems performance degrades when DataFrames get thousands of columns wide or hundreds of fields deep.
      
      ## How was this patch tested?
      
      Current results included.
      
      cc rxin JoshRosen
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13456 from ericl/sc-3468.
    • [SPARK-15732][SQL] better error message when use java reserved keyword as field name · 6323e4bd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When users create a case class and use a Java reserved keyword as a field name, Spark SQL will generate illegal Java code and throw an exception at runtime.

      This PR checks the field names when building the encoder and, if illegal field names are used, throws an exception immediately with a good error message.
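
      The check can be sketched as follows (a hypothetical helper; Spark's actual keyword list and error message differ):

      ```scala
      // A subset of Java's reserved words, for illustration.
      val javaKeywords = Set("abstract", "class", "enum", "final", "if", "int",
        "new", "public", "return", "static", "void", "while")

      // Fail fast with a clear message instead of generating illegal Java code.
      def checkFieldNames(names: Seq[String]): Unit = {
        val bad = names.filter(javaKeywords.contains)
        if (bad.nonEmpty) {
          throw new UnsupportedOperationException(
            s"`${bad.mkString("`, `")}` cannot be used as field name(s): Java reserved keyword")
        }
      }
      ```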
      
      ## How was this patch tested?
      
      new test in DatasetSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13485 from cloud-fan/java.
    • [SPARK-15715][SQL] Fix alter partition with storage information in Hive · d1c1fbc3
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This command didn't work for Hive tables. Now it does:
      ```
      ALTER TABLE boxes PARTITION (width=3)
          SET SERDE 'com.sparkbricks.serde.ColumnarSerDe'
          WITH SERDEPROPERTIES ('compress'='true')
      ```
      
      ## How was this patch tested?
      
      `HiveExternalCatalogSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13453 from andrewor14/alter-partition-storage.
    • [SPARK-15740][MLLIB] ignore big model load / save in Word2VecSuite · e23370ec
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      andrewor14 noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed.
      
      This PR disables the test. I will leave the JIRA open for a proper fix.
      
      ## How was this patch tested?
      
      No new features.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13478 from mengxr/SPARK-15740.
    • [SPARK-15718][SQL] better error message for writing bucketed data · f34aadc5
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently we don't support bucketing for `save` and `insertInto`.
      
      For `save`, we just write the data out into a directory the user specified; it's not a table, and we don't keep its metadata. When we read it back, we have no idea if the data is bucketed or not, so it doesn't make sense to use `save` to write bucketed data, as we can't use the bucket information anyway.
      
      We can support it in the future, once we have features like bucket discovery, or we save bucket information in the data directory too, so that we don't need to rely on a metastore.
      
      For `insertInto`, it inserts data into an existing table, so it doesn't make sense to specify bucket information, as we should get the bucket information from the existing table.
      
      This PR improves the error messages for the above two cases.

      ## How was this patch tested?
      
      new test in `BucketedWriteSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13452 from cloud-fan/error-msg.
    • [SPARK-15736][CORE] Gracefully handle loss of DiskStore files · 229f9022
      Josh Rosen authored
      If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates / does not invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be repeatedly retried, leading to repeated task failures and eventually a total job failure.
      
      In order to fix this problem, the executor with the missing file needs to properly mark the corresponding block as missing so that it stops advertising itself as a cache location for that block.
      
      This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13473 from JoshRosen/handle-missing-cache-files.
      229f9022
    • Yuhao Yang's avatar
      [SPARK-15668][ML] ml.feature: update check schema to avoid confusion when user... · 5855e005
      Yuhao Yang authored
      [SPARK-15668][ML] ml.feature: update check schema to avoid confusion when user use MLlib.vector as input type
      
      ## What changes were proposed in this pull request?
      
      ml.feature: update the schema checks to avoid confusion when users use an MLlib vector as the input type.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #13411 from hhbyyh/schemaCheck.
      5855e005
    • Nick Pentreath's avatar
      [MINOR] clean up style for storage param setters in ALS · ccd298eb
      Nick Pentreath authored
      Clean up the style of the storage param setter methods in ALS to match the standard style and the other setters in the class (this is an artefact of one of my previous PRs that wasn't cleaned up).
      
      ## How was this patch tested?
      Existing tests - no functionality change.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13480 from MLnick/als-param-minor-cleanup.
      ccd298eb
    • Sean Zhong's avatar
      [SPARK-15734][SQL] Avoids printing internal row in explain output · 985d5328
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR avoids printing internal rows in explain output for some operators.
      
      **Before change:**
      
      ```
      scala> (1 to 10).toSeq.map(_ => (1,2,3)).toDF().createTempView("df3")
      scala> spark.sql("select * from df3 where 1=2").explain(true)
      ...
      == Analyzed Logical Plan ==
      _1: int, _2: int, _3: int
      Project [_1#37,_2#38,_3#39]
      +- Filter (1 = 2)
         +- SubqueryAlias df3
            +- LocalRelation [_1#37,_2#38,_3#39], [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]]
      ...
      == Physical Plan ==
      LocalTableScan [_1#37,_2#38,_3#39]
      ```
      
      **After change:**
      
      ```
      scala> spark.sql("select * from df3 where 1=2").explain(true)
      ...
      == Analyzed Logical Plan ==
      _1: int, _2: int, _3: int
      Project [_1#58,_2#59,_3#60]
      +- Filter (1 = 2)
         +- SubqueryAlias df3
            +- LocalRelation [_1#58,_2#59,_3#60]
      ...
      == Physical Plan ==
      LocalTableScan <empty>, [_1#58,_2#59,_3#60]
      ```
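
The change can be pictured with a toy plan node; the class below is a sketch, not Spark's actual `LocalRelation`:

```python
# Sketch: the plan string prints only the schema, plus "<empty>" when
# there are no rows, instead of dumping the internal row data.
class LocalTableScan:
    def __init__(self, output, rows):
        self.output = output
        self.rows = rows

    def simple_string(self):
        prefix = "<empty>, " if not self.rows else ""
        return f"LocalTableScan {prefix}[{','.join(self.output)}]"
```
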
      
      ## How was this patch tested?
      Manual test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13471 from clockfly/verbose_breakdown_5.
      985d5328
    • Cheng Lian's avatar
      [SPARK-15719][SQL] Disables writing Parquet summary files by default · 43154276
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR disables writing Parquet summary files by default (i.e., when Hadoop configuration "parquet.enable.summary-metadata" is not set).
      
      Please refer to [SPARK-15719][1] for more details.
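
Users who still want summary files can opt back in by setting the Hadoop configuration key named above (the key comes from this PR's description; where you set it, e.g. a Hadoop configuration file, depends on your deployment):

```
parquet.enable.summary-metadata=true
```
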
      
      ## How was this patch tested?
      
      New test case added in `ParquetQuerySuite` to check no summary files are written by default.
      
      [1]: https://issues.apache.org/jira/browse/SPARK-15719
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13455 from liancheng/spark-15719-disable-parquet-summary-files.
      43154276
    • Holden Karau's avatar
      [SPARK-15092][SPARK-15139][PYSPARK][ML] Pyspark TreeEnsemble missing methods · 72353311
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Add `toDebugString` and `totalNumNodes` to `TreeEnsembleModels` and add `toDebugString` to `DecisionTreeModel`
      
      ## How was this patch tested?
      
      Extended doc tests.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12919 from holdenk/SPARK-15139-pyspark-treeEnsemble-missing-methods.
      72353311
    • Sean Zhong's avatar
      [SPARK-15711][SQL] Ban CREATE TEMPORARY TABLE USING AS SELECT · d109a1be
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR bans syntax like `CREATE TEMPORARY TABLE USING AS SELECT`
      
      `CREATE TEMPORARY TABLE ... USING ... AS ...` is not properly implemented: the temporary data is not cleaned up when the session exits. Until there is a full fix, we should ban this syntax.
      
      This PR only impact syntax like `CREATE TEMPORARY TABLE ... USING ... AS ...`.
      Other syntax like `CREATE TEMPORARY TABLE .. USING ...` and `CREATE TABLE ... USING ...` are not impacted.
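
A minimal sketch of the ban (a hypothetical helper, not Spark's parser code): only the combination of TEMPORARY, USING, and AS SELECT is rejected.

```python
# Hypothetical sketch: reject only CREATE TEMPORARY TABLE ... USING ... AS ...
def check_create_table(temporary, has_provider, has_query):
    if temporary and has_provider and has_query:
        raise ValueError(
            "CREATE TEMPORARY TABLE ... USING ... AS query is not supported")

check_create_table(temporary=True, has_provider=True, has_query=False)   # allowed
check_create_table(temporary=False, has_provider=True, has_query=True)   # allowed
```
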
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13451 from clockfly/ban_create_temp_table_using_as.
      d109a1be
    • gatorsmile's avatar
      [SPARK-15515][SQL] Error Handling in Running SQL Directly On Files · 9aff6f3b
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This PR is to address the following issues:
      
      - **ISSUE 1:** For the ORC source format, we report a confusing error message when Hive support is not enabled:
      ```SQL
      SQL Example:
        select id from `org.apache.spark.sql.hive.orc`.`file_path`
      Error Message:
        Table or view not found: `org.apache.spark.sql.hive.orc`.`file_path`
      ```
      Instead, we should issue an error message like:
      ```
      Expected Error Message:
         The ORC data source must be used with Hive support enabled
      ```
      - **ISSUE 2:** For the Avro format, we report a confusing error message:

      The example queries are:
        ```SQL
      SQL Example:
        select id from `avro`.`file_path`
        select id from `com.databricks.spark.avro`.`file_path`
      Error Message:
        Table or view not found: `com.databricks.spark.avro`.`file_path`
         ```
      The desired message should be:
      ```
      Expected Error Message:
        Failed to find data source: avro. Please use Spark package http://spark-packages.org/package/databricks/spark-avro"
      ```
      
      - ~~**ISSUE 3:** Unable to detect incompatible libraries for Spark 2.0 in Data Source Resolution. We report a strange error message:~~
      
      **Update**: The latest code changes contain:
      - For the JDBC format, we added an extra check in the rule `ResolveRelations` of `Analyzer`. Without this PR, Spark returns an error message like `Option 'url' not specified`; now we report `Unsupported data source type for direct query on files: jdbc`.
      - Made the data source format name case-insensitive so that error handling behaves consistently with the normal cases.
      - Added the test cases for all the supported formats.
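
Case-insensitive resolution can be sketched as a normalized lookup (the provider names below are illustrative):

```python
# Sketch: short names are lower-cased before lookup, so "JDBC", "Jdbc"
# and "jdbc" all resolve to the same provider.
PROVIDERS = {"parquet": "ParquetFileFormat", "jdbc": "JdbcRelationProvider"}

def lookup_data_source(name):
    provider = PROVIDERS.get(name.lower())
    if provider is None:
        raise ValueError(f"Failed to find data source: {name}.")
    return provider
```
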
      
      #### How was this patch tested?
      Added test cases to cover all the above issues
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #13283 from gatorsmile/runSQLAgainstFile.
      9aff6f3b
    • Reynold Xin's avatar
      [SPARK-15728][SQL] Rename aggregate operators: HashAggregate and SortAggregate · 8900c8d8
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently have two physical aggregate operators: TungstenAggregate and SortBasedAggregate. These names don't make a lot of sense from an end-user point of view. This patch renames them HashAggregate and SortAggregate.
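
The two strategies behind the new names can be sketched in a few lines; this is a conceptual illustration, not Spark's operators:

```python
from itertools import groupby

# Hash-based aggregation builds a map of groups; sort-based aggregation
# sorts by key and folds over consecutive runs of the same key. Same
# result, different physical strategy.
def hash_aggregate(rows):
    acc = {}
    for key, value in rows:
        acc[key] = acc.get(key, 0) + value
    return acc

def sort_aggregate(rows):
    return {key: sum(v for _, v in run)
            for key, run in groupby(sorted(rows), key=lambda r: r[0])}
```
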
      
      ## How was this patch tested?
      Updated test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13465 from rxin/SPARK-15728.
      8900c8d8