  1. May 26, 2017
    • zero323's avatar
      [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide · ae33abf7
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
      - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
      - Remove bucketing from Unsupported Hive Functionalities.
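      A minimal Scala sketch of how these writer options fit together (assuming a DataFrame `peopleDF` with the illustrative columns `name`, `age` and `favorite_color`; names and paths are placeholders):
      
      ```scala
      // Partition output files on disk, one directory per distinct value of the column.
      peopleDF.write
        .partitionBy("favorite_color")
        .format("parquet")
        .save("namesPartByColor.parquet")

      // Bucketing and sorting apply only to persistent tables, hence saveAsTable.
      peopleDF.write
        .bucketBy(42, "name")
        .sortBy("age")
        .saveAsTable("people_bucketed")
      ```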
      
      ## How was this patch tested?
      
      Manual tests, docs build.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
      ae33abf7
  2. May 25, 2017
  3. Apr 19, 2017
    • ymahajan's avatar
      Fixed typos in docs · bdc60569
      ymahajan authored
      ## What changes were proposed in this pull request?
      
      Typos at a couple of place in the docs.
      
      ## How was this patch tested?
      
      build including docs
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: ymahajan <ymahajan@snappydata.io>
      
      Closes #17690 from ymahajan/master.
      bdc60569
  4. Apr 12, 2017
    • hyukjinkwon's avatar
      [MINOR][DOCS] JSON APIs related documentation fixes · bca4259f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes corrections related to JSON APIs as below:
      
      - Rendering links in Python documentation
      - Replacing `RDD` with `Dataset` in the programming guide
      - Adding a missing description about JSON Lines consistently to `DataFrameReader.json` in the Python API
      - De-duplicating a bit of `DataFrameReader.json` in the Scala/Java API
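      
      For context, a minimal sketch of the JSON Lines reading behavior being documented (the example path is the one shipped with Spark; assumes a `SparkSession` named `spark`):
      
      ```scala
      // spark.read.json returns a DataFrame/Dataset, not an RDD; each input line
      // must be a self-contained JSON object (JSON Lines format).
      val people = spark.read.json("examples/src/main/resources/people.json")
      people.printSchema()
      people.createOrReplaceTempView("people")
      spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19").show()
      ```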
      
      ## How was this patch tested?
      
      Manually built the documentation via `jekyll build`. Corresponding screenshots will be left as comments on the code.
      
      Note that currently there are Javadoc8 breaks in several places. These are proposed to be handled in https://github.com/apache/spark/pull/17477. So, this PR does not fix those.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17602 from HyukjinKwon/minor-json-documentation.
      bca4259f
  5. Apr 11, 2017
    • Dongjoon Hyun's avatar
      [MINOR][DOCS] Update supported versions for Hive Metastore · cde9e328
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since SPARK-18112 and SPARK-13446, Apache Spark supports reading Hive metastore 2.0 through 2.1.1. This updates the docs accordingly.
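      
      As a hedged illustration (the config keys are the standard ones; the app name and chosen version are only examples), a session pointed at a newer metastore could be configured like this:
      
      ```scala
      import org.apache.spark.sql.SparkSession

      // Talk to a Hive 2.1.1 metastore, downloading matching client jars from Maven.
      val spark = SparkSession.builder()
        .appName("metastore-example")
        .config("spark.sql.hive.metastore.version", "2.1.1")
        .config("spark.sql.hive.metastore.jars", "maven")
        .enableHiveSupport()
        .getOrCreate()
      ```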
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17612 from dongjoon-hyun/metastore.
      cde9e328
  6. Mar 23, 2017
    • sureshthalamati's avatar
      [SPARK-10849][SQL] Adds option to the JDBC data source write for user to... · c7911807
      sureshthalamati authored
      [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table
      
      ## What changes were proposed in this pull request?
      Currently the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables in the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.
      
      The solution is to allow users to specify the database column data type for the created table as a JDBC data source option (`createTableColumnTypes`) on write. The data type information can be specified in the same format as table schema DDL (e.g. `name CHAR(64), comments VARCHAR(1024)`).
      
      Not all target database types can be specified; the data types also have to be valid Spark SQL data types. For example, users cannot specify a target database CLOB data type. This will be supported in a follow-up PR.
      
      Example:
      ```Scala
      df.write
        .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
        .jdbc(url, "TEST.DBCOLTYPETEST", properties)
      ```
      ## How was this patch tested?
      Added new test cases to the JDBCWriteSuite
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #16209 from sureshthalamati/jdbc_custom_dbtype_option_json-spark-10849.
      c7911807
  7. Mar 02, 2017
  8. Feb 25, 2017
    • Boaz Mohar's avatar
      [MINOR][DOCS] Fixes two problems in the SQL programming guide page · 061bcfb8
      Boaz Mohar authored
      ## What changes were proposed in this pull request?
      
      Removed duplicated lines in the SQL Python example and fixed a typo.
      
      ## How was this patch tested?
      
      Searched for other typos in the page to minimize PRs.
      
      Author: Boaz Mohar <boazmohar@gmail.com>
      
      Closes #17066 from boazmohar/doc-fix.
      061bcfb8
  9. Feb 14, 2017
  10. Jan 30, 2017
  11. Jan 25, 2017
    • aokolnychyi's avatar
      [SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide · 3fdce814
      aokolnychyi authored
      ## What changes were proposed in this pull request?
      
      - A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
      - Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
      - Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
      - Python is not covered.
      - The PR might not resolve the ticket since I do not know what exactly was planned by the author.
      
      In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references these examples and no longer contains hard-coded snippets.
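      
      A condensed sketch of the type-safe `Aggregator` pattern the new examples demonstrate (class, object and column names are illustrative):
      
      ```scala
      import org.apache.spark.sql.{Encoder, Encoders}
      import org.apache.spark.sql.expressions.Aggregator

      case class Employee(name: String, salary: Long)
      case class Average(var sum: Long, var count: Long)

      // Computes the average salary over a Dataset[Employee] in a type-safe way.
      object MyAverage extends Aggregator[Employee, Average, Double] {
        def zero: Average = Average(0L, 0L)
        def reduce(buffer: Average, employee: Employee): Average = {
          buffer.sum += employee.salary
          buffer.count += 1
          buffer
        }
        def merge(b1: Average, b2: Average): Average = {
          b1.sum += b2.sum
          b1.count += b2.count
          b1
        }
        def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
        def bufferEncoder: Encoder[Average] = Encoders.product
        def outputEncoder: Encoder[Double] = Encoders.scalaDouble
      }

      // Usage, given ds: Dataset[Employee]
      // ds.select(MyAverage.toColumn.name("average_salary")).show()
      ```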
      
      ## How was this patch tested?
      
      The patch was tested locally by building the docs. The examples were run as well.
      
      ![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)
      
      Author: aokolnychyi <okolnychyyanton@gmail.com>
      
      Closes #16329 from aokolnychyi/SPARK-16046.
      3fdce814
  12. Jan 07, 2017
  13. Jan 05, 2017
  14. Dec 30, 2016
    • Cheng Lian's avatar
      [SPARK-19016][SQL][DOC] Document scalable partition handling · 871f6114
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR documents the scalable partition handling feature in the body of the programming guide.
      
      Before this PR, we only mentioned it in the migration guide, and it was not clear that, since 2.1, external data source tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted.
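      
      A hedged sketch of the documented behavior (the table name and path are hypothetical; assumes a `SparkSession` named `spark`):
      
      ```scala
      // A partitioned external data source table over pre-existing data.
      spark.sql("""
        CREATE TABLE logs (msg STRING, dt STRING)
        USING parquet
        OPTIONS (path '/path/to/existing/logs')
        PARTITIONED BY (dt)
      """)
      // Since 2.1 partition metadata lives in the metastore, so partitions already
      // present at the external location must be registered before they are visible:
      spark.sql("MSCK REPAIR TABLE logs")
      ```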
      
      ## How was this patch tested?
      
      N/A.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #16424 from liancheng/scalable-partition-handling-doc.
      871f6114
  15. Dec 06, 2016
  16. Dec 05, 2016
  17. Nov 29, 2016
  18. Nov 26, 2016
    • Weiqing Yang's avatar
      [WIP][SQL][DOC] Fix incorrect `code` tag · f4a98e42
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      This PR fixes an incorrect `code` tag in `sql-programming-guide.md`.
      
      ## How was this patch tested?
      Manually.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15941 from weiqingy/fixtag.
      f4a98e42
  19. Nov 25, 2016
  20. Nov 21, 2016
    • Dongjoon Hyun's avatar
      [SPARK-18413][SQL] Add `maxConnections` JDBCOption · 07beb5d2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new JDBC option, `maxConnections`, which is the maximum number of simultaneous JDBC connections allowed. This option applies only to writes; Spark coalesces the data if the number of partitions exceeds `maxConnections`. It defaults to the number of partitions of the RDD. Previously, SQL users could not control this, while Scala/Java/Python users could use the `coalesce` (or `repartition`) API.
      
      **Reported Scenario**
      
      For the following case, the number of connections becomes 200 and the database cannot handle all of them.
      
      ```sql
      CREATE OR REPLACE TEMPORARY VIEW resultview
      USING org.apache.spark.sql.jdbc
      OPTIONS (
        url "jdbc:oracle:thin:10.129.10.111:1521:BKDB",
        dbtable "result",
        user "HIVE",
        password "HIVE"
      );
      -- set spark.sql.shuffle.partitions=200
      INSERT OVERWRITE TABLE resultview SELECT g, count(1) AS COUNT FROM tnet.DT_LIVE_INFO GROUP BY g
      ```
      
      ## How was this patch tested?
      
      Manual. Do the following and check the Spark UI.
      
      **Step 1 (MySQL)**
      ```
      CREATE TABLE t1 (a INT);
      CREATE TABLE data (a INT);
      INSERT INTO data VALUES (1);
      INSERT INTO data VALUES (2);
      INSERT INTO data VALUES (3);
      ```
      
      **Step 2 (Spark)**
      ```scala
      SPARK_HOME=$PWD bin/spark-shell --driver-memory 4G --driver-class-path mysql-connector-java-5.1.40-bin.jar
      scala> sql("SET spark.sql.shuffle.partitions=3")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW data USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 'data', user 'root', password '')")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '1')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '2')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '3')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '4')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      ```
      
      ![maxconnections](https://cloud.githubusercontent.com/assets/9700541/20287987/ed8409c2-aa84-11e6-8aab-ae28e63fe54d.png)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15868 from dongjoon-hyun/SPARK-18413.
      07beb5d2
  21. Nov 16, 2016
    • Weiqing Yang's avatar
      [MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and... · 241e04bc
      Weiqing Yang authored
      [MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation
      
      ## What changes were proposed in this pull request?
      
      Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation.
      
      ## How was this patch tested?
      Manually.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15886 from weiqingy/fixTypo.
      241e04bc
  22. Oct 27, 2016
  23. Oct 24, 2016
    • Sean Owen's avatar
      [SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but... · 4ecbe1b9
      Sean Owen authored
      [SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path
      
      ## What changes were proposed in this pull request?
      
      Always resolve `spark.sql.warehouse.dir` as a local path, and resolve relative values against the working directory rather than the home directory.
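      
      For illustration (the config key is real; the app name and relative value are assumptions), the change matters when the warehouse location is configured like this:
      
      ```scala
      import org.apache.spark.sql.SparkSession

      // A relative value such as "spark-warehouse" is now resolved against the local
      // working directory, rather than the home directory or the default (possibly HDFS) FS.
      val spark = SparkSession.builder()
        .appName("warehouse-dir-example")
        .config("spark.sql.warehouse.dir", "spark-warehouse")
        .getOrCreate()
      ```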
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15382 from srowen/SPARK-17810.
      4ecbe1b9
  24. Oct 18, 2016
  25. Oct 14, 2016
  26. Oct 11, 2016
    • hyukjinkwon's avatar
      [SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in... · 0c0ad436
      hyukjinkwon authored
      [SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in JDBC datasource package
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix arbitrary usages among `Map[String, String]`, `Properties` and `JDBCOptions` instances for options in `execution/jdbc` package and make the connection properties exclude Spark-only options.
      
      This PR includes some changes as below:
      
      - Unify `Map[String, String]`, `Properties` and `JDBCOptions` in the `execution/jdbc` package into `JDBCOptions`.
      
      - Move the `batchsize`, `fetchsize`, `driver` and `isolationlevel` options into the `JDBCOptions` instance.
      
      - Document `batchsize` and `isolationlevel`, marking each as a read-only or write-only option. Also fix minor typos and add a more detailed explanation for some options such as `url`.
      
      - Throw exceptions early by checking arguments first rather than at execution time (e.g. for `fetchsize`).
      
      - Exclude Spark-only options in connection properties.
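      
      From the user's side these options still flow through the data source API; a hedged sketch (URL, table names and credentials are placeholders):
      
      ```scala
      // Read path: fetchsize tunes the JDBC fetch size; Spark-only options such as
      // this one are no longer forwarded to the driver as connection properties.
      val df = spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")
        .option("dbtable", "schema.source_table")
        .option("user", "username")
        .option("password", "password")
        .option("fetchsize", "1000")
        .load()

      // Write path: batchsize and isolationLevel are write-only options.
      df.write.format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")
        .option("dbtable", "schema.target_table")
        .option("user", "username")
        .option("password", "password")
        .option("batchsize", "1000")
        .option("isolationLevel", "READ_COMMITTED")
        .save()
      ```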
      
      ## How was this patch tested?
      
      Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15292 from HyukjinKwon/SPARK-17719.
      0c0ad436
  27. Oct 10, 2016
    • Wenchen Fan's avatar
      [SPARK-17338][SQL] add global temp view · 23ddff4b
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      A global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system-preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. `SELECT * FROM global_temp.view1`.
      
      changes for `SessionCatalog`:
      
      1. add a new field `globalTempViews: GlobalTempViewManager`, to access the shared global temp views and the global temp database name.
      2. `createDatabase` will fail if users want to create `global_temp`, which is system preserved.
      3. `setCurrentDatabase` will fail if users want to set `global_temp`, which is system preserved.
      4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
      5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
      6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
      7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.
      
      changes for SQL commands:
      
      1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
      2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
      3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.
      
      changes for other public API
      
      1. add a new method `dropGlobalTempView` in `Catalog`
      2. `Catalog.findTable` can find global temp views
      3. add a new method `createGlobalTempView` in `Dataset`
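      
      A brief user-facing sketch of the feature (the view name is illustrative; assumes an existing DataFrame `df`):
      
      ```scala
      // A global temp view lives in the system-preserved database `global_temp`
      // and stays visible across sessions for the lifetime of the application.
      df.createGlobalTempView("view1")
      spark.sql("SELECT * FROM global_temp.view1").show()
      // Also visible from a different session:
      spark.newSession().sql("SELECT * FROM global_temp.view1").show()
      // Dropped explicitly through the Catalog API added by this change:
      spark.catalog.dropGlobalTempView("view1")
      ```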
      
      ## How was this patch tested?
      
      new tests in `SQLViewSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14897 from cloud-fan/global-temp-view.
      23ddff4b
  28. Sep 26, 2016
    • Justin Pihony's avatar
      [SPARK-14525][SQL] Make DataFrameWriter.save work for jdbc · 50b89d05
      Justin Pihony authored
      ## What changes were proposed in this pull request?
      
      This change modifies the implementation of `DataFrameWriter.save` so that it works with jdbc, and the `jdbc` call merely delegates to `save`.
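      
      A hedged sketch of what this enables (connection details and table name are placeholders):
      
      ```scala
      // After this change the generic save() path works for the jdbc format,
      // instead of requiring the dedicated df.write.jdbc(url, table, props) call.
      df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")
        .option("dbtable", "schema.tablename")
        .option("user", "username")
        .option("password", "password")
        .save()
      ```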
      
      ## How was this patch tested?
      
      This was tested via unit tests in the JDBCWriteSuite, to which I added one new test to cover this scenario.
      
      ## Additional details
      
      rxin This seems to have been most recently touched by you and was also commented on in the JIRA.
      
      This contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: Justin Pihony <justin.pihony@gmail.com>
      Author: Justin Pihony <justin.pihony@typesafe.com>
      
      Closes #12601 from JustinPihony/jdbc_reconciliation.
      50b89d05
  29. Sep 17, 2016
  30. Aug 22, 2016
    • GraceH's avatar
      [SPARK-16968] Document additional options in jdbc Writer · 4b6c2cbc
      GraceH authored
      ## What changes were proposed in this pull request?
      
      This documents the previously added JDBC writer options.
      
      ## How was this patch tested?
      
      A unit test was added in a previous PR.
      
      
      Author: GraceH <jhuang1@paypal.com>
      
      Closes #14683 from GraceH/jdbc_options.
      4b6c2cbc
  31. Aug 07, 2016
    • keliang's avatar
      [SPARK-16870][DOCS] Add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md · 1275f646
      keliang authored
      ## What changes were proposed in this pull request?
      The default value of `spark.sql.broadcastTimeout` is 300s, but this property does not appear anywhere in the Spark docs, so add `spark.sql.broadcastTimeout` to docs/sql-programming-guide.md to help people fix this timeout error when it happens.
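      
      For reference, the timeout can be raised per session; a minimal sketch (600 is only an example value):
      
      ```scala
      // Default is 300 seconds; broadcast joins that take longer fail with a timeout.
      spark.conf.set("spark.sql.broadcastTimeout", "600")
      // Equivalent SQL form:
      spark.sql("SET spark.sql.broadcastTimeout=600")
      ```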
      
      ## How was this patch tested?
      
      Not needed.
      
      
      Author: keliang <keliang@cmss.chinamobile.com>
      
      Closes #14477 from biglobster/keliang.
      1275f646
  32. Aug 02, 2016
    • Cheng Lian's avatar
      [SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings · 10e1c0e6
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR makes various minor updates to the examples of all language bindings to make sure they are consistent with each other. Some typos are fixed and missing parts (the JDBC example in Scala/Java/Python) are added.
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14368 from liancheng/revise-examples.
      10e1c0e6
  33. Jul 25, 2016
  34. Jul 23, 2016
    • Cheng Lian's avatar
      [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding · 53b2456d
      Cheng Lian authored
      This PR is based on PR #14098 authored by wangmiao1981.
      
      ## What changes were proposed in this pull request?
      
      This PR replaces the original Python Spark SQL example file with the following three files:
      
      - `sql/basic.py`
      
        Demonstrates basic Spark SQL features.
      
      - `sql/datasource.py`
      
        Demonstrates various Spark SQL data sources.
      
      - `sql/hive.py`
      
        Demonstrates Spark SQL Hive interaction.
      
      This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14317 from liancheng/py-examples-update.
      53b2456d
  35. Jul 19, 2016
    • WeichenXu's avatar
      [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code · 9674af6f
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Update the `refreshTable` API in the Python code of the sql-programming-guide.
      
      This API was added in SPARK-15820.
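      
      The call being documented is the Catalog-based refresh; a one-line sketch (table name is illustrative, shown here in Scala although the guide change is for the Python snippet):
      
      ```scala
      // Invalidate and refresh the cached metadata (and cached data) of the table.
      spark.catalog.refreshTable("my_table")
      ```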
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14220 from WeichenXu123/update_sql_doc_catalog.
      9674af6f
    • Cheng Lian's avatar
      [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update · 1426a080
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR moves the last remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that every "Sql" in the file names is updated to "SQL".
      
      ## How was this patch tested?
      
      Manually verified the generated HTML page.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14245 from liancheng/minor-scala-example-update.
      1426a080
  36. Jul 14, 2016
    • Shivaram Venkataraman's avatar
      [SPARK-16553][DOCS] Fix SQL example file name in docs · 01c4c1fa
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      Fixes a typo in the sql programming guide
      
      ## How was this patch tested?
      
      Building docs locally
      
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #14208 from shivaram/spark-sql-doc-fix.
      01c4c1fa
  37. Jul 13, 2016
  38. Jul 12, 2016
    • Lianhui Wang's avatar
      [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose... · 5ad68ba5
      Lianhui Wang authored
      [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators.
      
      ## What changes were proposed in this pull request?
      When a query uses only metadata (for example, partition keys), it can return results based on the metadata without scanning data files. Hive did this in HIVE-1003.
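      
      A hedged illustration of the kind of query this targets (table and column names are assumptions; `spark.sql.optimizer.metadataOnly` is the flag that controls the optimization):
      
      ```scala
      // With the optimization enabled, queries that touch only partition columns
      // are answered from catalog metadata, without scanning any data files.
      spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
      spark.sql("SELECT MAX(dt) FROM partitioned_logs").show()
      spark.sql("SELECT DISTINCT dt FROM partitioned_logs").show()
      ```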
      
      ## How was this patch tested?
      Added unit tests.
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
      
      Closes #13494 from lianhuiwang/metadata-only.
      5ad68ba5