- Nov 01, 2016
Dongjoon Hyun authored
## What changes were proposed in this pull request?

This PR merges the multiple lines that enumerate items, in order to remove the redundant spaces following slashes in the [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).

- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`

## How was this patch tested?

Tested manually with the following commands, because this is a documentation update.

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15686 from dongjoon-hyun/minor_doc_space.
-
- Oct 05, 2016
Shixiong Zhu authored
## What changes were proposed in this pull request?

This PR adds a new project `external/kafka-0-10-sql` for the Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of the work, and parts of it were inspired by koeninger's work.

### Introduction

The Kafka source is a structured streaming data source that polls data from Kafka. The schema of the data it reads is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest", which is from the earliest offset, or "latest", which is just from the latest offset. Note: This only applies when a new streaming query is started; resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe to. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors.
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
fetchOffset.retryIntervalMs | long | 10 | Milliseconds to wait before retrying to fetch Kafka offsets.

Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")`.

### Usage

* Subscribe to 1 topic
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.
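The rule that only one of `subscribe` and `subscribePattern` may be set can be illustrated with a small, Spark-free sketch. Everything named here (`KafkaSubscriptionSketch`, `parse`, `TopicList`, `TopicPattern`) is hypothetical and is not the code added by this PR. Rejecting the case where neither option is set is an assumption of the sketch, not something the description above states.

```Scala
import java.util.regex.{Pattern, PatternSyntaxException}

object KafkaSubscriptionSketch {
  sealed trait Subscription
  final case class TopicList(topics: List[String]) extends Subscription
  final case class TopicPattern(regex: String) extends Subscription

  // Hypothetical helper (not Spark's code): accepts the reader options and
  // returns the chosen subscription, or an error message.
  def parse(options: Map[String, String]): Either[String, Subscription] =
    (options.get("subscribe"), options.get("subscribePattern")) match {
      case (Some(_), Some(_)) =>
        Left("Only one of 'subscribe' and 'subscribePattern' can be specified")
      case (None, None) =>
        // Assumption of this sketch: at least one of the two is required.
        Left("One of 'subscribe' and 'subscribePattern' must be specified")
      case (Some(csv), None) =>
        // "topic1,topic2" -> List("topic1", "topic2")
        val topics = csv.split(",").map(_.trim).filter(_.nonEmpty).toList
        if (topics.isEmpty) Left("'subscribe' lists no topics")
        else Right(TopicList(topics))
      case (None, Some(regex)) =>
        try {
          // Compile eagerly so an invalid Java regex is reported up front.
          Pattern.compile(regex)
          Right(TopicPattern(regex))
        } catch {
          case e: PatternSyntaxException =>
            Left(s"Invalid pattern: ${e.getDescription}")
        }
    }
}
```

Returning `Either` keeps the check free of side effects, so a caller can decide whether to throw, log, or fall back.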
-
- Sep 26, 2016
Liang-Chi Hsieh authored
## What changes were proposed in this pull request?

When reading a file stream from a non-globbing path, the result contains only `null`s in the partition columns.

E.g.,

    case class A(id: Int, value: Int)
    val data = spark.createDataset(Seq(
      A(1, 1),
      A(2, 2),
      A(2, 3))
    )
    val url = "/tmp/test"
    data.write.partitionBy("id").parquet(url)
    spark.read.parquet(url).show

    +-----+---+
    |value| id|
    +-----+---+
    |    2|  2|
    |    3|  2|
    |    1|  1|
    +-----+---+

    val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url)
    s.writeStream.queryName("test").format("memory").start()

    sql("SELECT * FROM test").show

    +-----+----+
    |value|  id|
    +-----+----+
    |    2|null|
    |    3|null|
    |    1|null|
    +-----+----+

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #14803 from viirya/filestreamsource-option.
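The `null`s arise because `partitionBy("id")` does not store `id` inside the Parquet files; the value lives only in directory names such as `id=1` under `/tmp/test`. The following is a minimal sketch of recovering those values from a file path. `PartitionPathSketch`, `partitionValues`, and the file name `part-0000.parquet` are hypothetical, and this is not Spark's actual partition discovery code.

```Scala
object PartitionPathSketch {
  // Hypothetical helper: extracts Hive-style `column=value` directory
  // segments from a file path, relative to the base path of the table.
  def partitionValues(basePath: String, filePath: String): Map[String, String] = {
    val base = basePath.stripSuffix("/")
    require(filePath.startsWith(base + "/"), s"$filePath is not under $basePath")
    val relative = filePath.substring(base.length + 1)
    relative
      .split("/")
      .toList
      .dropRight(1)          // drop the file name, keep the directories
      .map(_.split("=", 2))  // "id=1" -> Array("id", "1")
      .collect { case Array(name, value) if name.nonEmpty => name -> value }
      .toMap
  }
}
```

For example, `PartitionPathSketch.partitionValues("/tmp/test", "/tmp/test/id=1/part-0000.parquet")` yields `Map("id" -> "1")`, while a file placed directly under the base path yields an empty map.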
-
- Sep 01, 2016
Seigneurin, Alexis (CONT) authored
fixed 2 typos

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #14877 from aseigneurin/fix-typo-2.
-
- Aug 30, 2016
Dmitriy Sokolov authored
## What changes were proposed in this pull request?

Fix minor typos in the Python example code in the streaming programming guide.

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.
-
- Aug 29, 2016
Seigneurin, Alexis (CONT) authored
idempotant -> idempotent

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #14833 from aseigneurin/fix-typo.
-
- Aug 23, 2016
hyukjinkwon authored
[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR changes curly quotes (`“` and `”`) to standard quotes (`"`).

Curly quotes are an actual problem when users copy and paste the examples, because the pasted code does not work.

This seems to happen only in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

[image: before] to [image: after]

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.
-
- Aug 22, 2016
Jagadeesan authored
Changes to the Spark Structured Streaming doc at this link: https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations

Author: Jagadeesan <as2@us.ibm.com>

Closes #14715 from jagadeesanas2/SPARK-17085.
-
- Aug 13, 2016
Jagadeesan authored
## What changes were proposed in this pull request?

When the documentation is built, it should reference examples from the same build. At times the docs have links that point to files at the GitHub head, which may not be valid for the current release. This PR changes those URLs to point to the right tag in git using ```SPARK_VERSION_SHORT```.

…from its own release version] [Streaming programming guide]

Author: Jagadeesan <as2@us.ibm.com>

Closes #14596 from jagadeesanas2/SPARK-12370.
-
- Aug 11, 2016
hyukjinkwon authored
## What changes were proposed in this pull request?

Originally this PR was based on #14491, but I realised that fixing the examples is more sensible than fixing the comments.

This PR fixes the three things below:

- Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` gives `Dataset<Row>`, not `Dataset<String>`, in Java.
- Fix indentation across `structured-streaming-programming-guide.md`. Python uses 4 spaces, and Scala and Java use two spaces. These were inconsistent across the examples.
- Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>`, to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match the Scala and Java examples to the Python one (the Python one loads it as a `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.
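The first point can be seen in Scala as well, where `DataFrame` is an alias for `Dataset[Row]`. The sketch below is an illustration only, not code from this PR: it assumes Spark SQL is on the classpath and runs in local mode, and `TypedLoadSketch`, `readLines`, and the temporary file are hypothetical names. It uses `read.text` rather than `read.load` so that the single string column can be converted with `as[String]`.

```Scala
import java.nio.file.Files
import java.util.Arrays

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object TypedLoadSketch {
  // Reads a text file untyped first, then converts it to a typed Dataset.
  def readLines(spark: SparkSession, path: String): Seq[String] = {
    import spark.implicits._ // brings Encoder[String] into scope for `as`

    // Without `as`, the result is a DataFrame, i.e. Dataset[Row]
    // (Dataset<Row> in Java), with one string column named "value".
    val rows: DataFrame = spark.read.text(path)

    // `as[String]` turns it into a typed Dataset[String].
    val lines: Dataset[String] = rows.as[String]
    lines.collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("typed-load-sketch")
      .getOrCreate()
    try {
      val file = Files.createTempFile("lines", ".txt")
      Files.write(file, Arrays.asList("hello world", "hello spark"))
      readLines(spark, file.toString).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

Declaring the intermediate types explicitly (`DataFrame`, then `Dataset[String]`) makes the compiler confirm the distinction the PR describes.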
-
- Jul 21, 2016
Holden Karau authored
## What changes were proposed in this pull request?

Fix parquet to csv in a comment to match the input format being read.

## How was this patch tested?

N/A (doc change only)

Author: Holden Karau <holden@us.ibm.com>

Closes #14274 from holdenk/minor-docfix-schema-of-csv-rather-than-parquet.
-
- Jul 19, 2016
Ahmed Mahran authored
## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuation, and grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.
-
- Jul 13, 2016
James Thomas authored
## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.
-
- Jun 30, 2016
Tathagata Das authored
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14001 from tdas/SPARK-16256-2.
-
Tathagata Das authored
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.
-
- Jun 29, 2016
Tathagata Das authored
Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.
-