  1. Nov 01, 2016
  2. Oct 05, 2016
    • [SPARK-17346][SQL] Add Kafka source for Structured Streaming · 9293734d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new project `external/kafka-0-10-sql` for the Structured Streaming Kafka source.
      
      It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
      
      tdas did most of the work, and parts of it were inspired by koeninger's work.
      
      ### Introduction
      
      The Kafka source is a Structured Streaming data source that polls data from Kafka. The schema of the read data is as follows:
      
      Column | Type
      ---- | ----
      key | binary
      value | binary
      topic | string
      partition | int
      offset | long
      timestamp | long
      timestampType | int
      
      The source can handle topic deletion. However, the user should make sure that no Spark job is processing the data when a topic is deleted.
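
      As a minimal sketch (not part of the PR text; the broker address and topic are placeholders), the binary `key` and `value` columns from the schema above are typically cast to strings for downstream processing:

      ```Scala
      // Sketch only: assumes a SparkSession named `spark` and a reachable Kafka broker.
      val df = spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1")
        .load()

      // Project the columns listed in the schema table, casting the binary ones.
      val messages = df.selectExpr(
        "CAST(key AS STRING)",
        "CAST(value AS STRING)",
        "topic", "partition", "offset", "timestamp")
      ```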
      
      ### Configuration
      
      The user can use `DataStreamReader.option` to set the following configurations.
      
      Kafka source options | Value | Default | Meaning
      ------ | ------- | ------ | -----
      startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest", which starts from the earliest offsets, or "latest", which starts from the latest offsets. Note: this only applies when a new streaming query is started; resuming always picks up from where the query left off.
      failOnDataLoss | [true, false] | true | Whether to fail the query when it's possible that data has been lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm; you can disable it when it doesn't work as you expect.
      subscribe | A comma-separated list of topics | (none) | The topic list to subscribe to. Only one of "subscribe" and "subscribePattern" can be specified for the Kafka source.
      subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of "subscribe" and "subscribePattern" can be specified for the Kafka source.
      kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors.
      fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
      fetchOffset.retryIntervalMs | long | 10 | Milliseconds to wait before retrying to fetch Kafka offsets.
      
      Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")`.
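
      For illustration, a hedged sketch combining the options above (option names as listed in the table; the values are arbitrary examples, not recommendations):

      ```Scala
      // Sketch only: source-specific options plus a pass-through Kafka option.
      val df = spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")  // Kafka's own config via the "kafka." prefix
        .option("subscribe", "topic1,topic2")
        .option("startingOffset", "earliest")            // start a new query from the earliest offsets
        .option("failOnDataLoss", "false")               // don't fail the query on possible data loss
        .option("fetchOffset.numRetries", "5")           // retry offset fetches a few extra times
        .load()
      ```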
      
      ### Usage
      
      * Subscribe to 1 topic
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1")
        .load()
      ```
      
      * Subscribe to multiple topics
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1,topic2")
        .load()
      ```
      
      * Subscribe to a pattern
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribePattern", "topic.*")
        .load()
      ```
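
      As a follow-up sketch beyond this PR's description (the console sink is an illustrative choice, not part of the Kafka source itself), a subscribed stream can then be started like any other streaming query:

      ```Scala
      // Sketch only: read from Kafka, keep the message value as a string,
      // and print incoming micro-batches to the console.
      val query = spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1")
        .load()
        .selectExpr("CAST(value AS STRING)")
        .writeStream
        .format("console")
        .start()

      query.awaitTermination()
      ```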
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Shixiong Zhu <zsxwing@gmail.com>
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15102 from zsxwing/kafka-source.
  3. Sep 26, 2016
    • [SPARK-17153][SQL] Should read partition data when reading new files in filestream without globbing · 8135e0e5
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      When reading a file stream with a non-globbing path, the results contain all `null`s for the partitioned columns. E.g.,
      
          case class A(id: Int, value: Int)
          val data = spark.createDataset(Seq(
            A(1, 1),
            A(2, 2),
            A(2, 3))
          )
          val url = "/tmp/test"
          data.write.partitionBy("id").parquet(url)
          spark.read.parquet(url).show
      
          +-----+---+
          |value| id|
          +-----+---+
          |    2|  2|
          |    3|  2|
          |    1|  1|
          +-----+---+
      
          val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url)
          s.writeStream.queryName("test").format("memory").start()
      
          sql("SELECT * FROM test").show
      
          +-----+----+
          |value|  id|
          +-----+----+
          |    2|null|
          |    3|null|
          |    1|null|
          +-----+----+
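
      With this fix, the streaming query is expected to return the partition column as the batch read above does (a sketch of the expected output; row order may differ):

          +-----+---+
          |value| id|
          +-----+---+
          |    2|  2|
          |    3|  2|
          |    1|  1|
          +-----+---+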
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #14803 from viirya/filestreamsource-option.
  4. Sep 01, 2016
    • fixed typos · dd859f95
      Seigneurin, Alexis (CONT) authored
      fixed 2 typos
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #14877 from aseigneurin/fix-typo-2.
  5. Aug 30, 2016
    • [MINOR][DOCS] Fix minor typos in python example code · d4eee993
      Dmitriy Sokolov authored
      ## What changes were proposed in this pull request?
      
      Fix minor typos in the Python example code in the streaming programming guide.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dmitriy Sokolov <silentsokolov@gmail.com>
      
      Closes #14805 from silentsokolov/fix-typos.
  6. Aug 29, 2016
    • fixed a typo · 08913ce0
      Seigneurin, Alexis (CONT) authored
      idempotant -> idempotent
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #14833 from aseigneurin/fix-typo.
  7. Aug 23, 2016
  8. Aug 22, 2016
  9. Aug 13, 2016
    • [SPARK-12370][DOCUMENTATION] Documentation should link to examples … · e46cb78b
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
      When documentation is built, it should reference examples from the same build. There are times when the docs have links that point to files at the GitHub head, which may not be valid for the current release. Changed the URLs to point to the right tag in git using `SPARK_VERSION_SHORT`.
      
      …from its own release version] [Streaming programming guide]
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #14596 from jagadeesanas2/SPARK-12370.
  10. Aug 11, 2016
    • [SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation · 7186e8c3
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Originally this PR was based on #14491, but I realised that fixing the examples is more sensible than fixing the comments.
      
      This PR fixes three things below:
      
      - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>`, not `Dataset<String>`, in Java (see the sketch after this list).

      - Fix indentation across `structured-streaming-programming-guide.md`. Python uses 4 spaces while Scala and Java use 2 spaces; these are inconsistent across the examples.

      - Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load a `DataFrame`/`Dataset<Row>`, to be consistent with the comments and some examples in `structured-streaming-programming-guide.md`, and to match the Scala and Java versions to the Python one (the Python version loads it as a `DataFrame` initially).
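
      For reference, a minimal Scala sketch of the distinction behind the first bullet (the socket source, host, and port are placeholders; the Java case is analogous with `Dataset<Row>` vs `Dataset<String>`):

      ```Scala
      // Sketch only: assumes a SparkSession named `spark`.
      import spark.implicits._

      // load() alone yields an untyped DataFrame, i.e. Dataset[Row];
      // only an explicit .as[String] produces a typed Dataset[String].
      val lines = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()                       // DataFrame (Dataset[Row])

      val words = lines.as[String]    // Dataset[String]
      ```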
      
      ## How was this patch tested?
      
      N/A
      
      Closes https://github.com/apache/spark/pull/14491
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>
      
      Closes #14564 from HyukjinKwon/SPARK-16886.
  11. Jul 21, 2016
  12. Jul 19, 2016
    • [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar · 6caa2205
      Ahmed Mahran authored
      ## What changes were proposed in this pull request?
      
      Minor fixes correcting some typos, punctuation, and grammar.
      Adding more anchors for easy navigation.
      Fixing minor issues with code snippets.
      
      ## How was this patch tested?
      
      `jekyll serve`
      
      Author: Ahmed Mahran <ahmed.mahran@mashin.io>
      
      Closes #14234 from ahmed-mahran/b-struct-streaming-docs.
  13. Jul 13, 2016
    • [SPARK-16114][SQL] updated structured streaming guide · 51a6706b
      James Thomas authored
      ## What changes were proposed in this pull request?
      
      Updated the structured streaming programming guide with a new windowed example.
      
      ## How was this patch tested?
      
      Docs
      
      Author: James Thomas <jamesjoethomas@gmail.com>
      
      Closes #14183 from jjthomas/ss_docs_update.
  14. Jun 30, 2016
  15. Jun 29, 2016