Commit dd859f95 authored by Seigneurin, Alexis (CONT), committed by Sean Owen

fixed typos

fixed 2 typos

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #14877 from aseigneurin/fix-typo-2.
parent a18c169f
@@ -400,7 +400,7 @@ data, thus relieving the users from reasoning about it. As an example, let’s
 see how this model handles event-time based processing and late arriving data.
 ## Handling Event-time and Late Data
-Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them. This event-time is very naturally expressed in this model -- each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations (e.g. number of event every minute) to be just a special type of grouping and aggregation on the even-time column -- each time window is a group and each row can belong to multiple windows/groups. Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.
+Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them. This event-time is very naturally expressed in this model -- each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the even-time column -- each time window is a group and each row can belong to multiple windows/groups. Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.
 Furthermore, this model naturally handles data that has arrived later than expected based on its event-time. Since Spark is updating the Result Table, it has full control over updating/cleaning up the aggregates when there is late data. While not yet implemented in Spark 2.0, event-time watermarking will be used to manage this data. These are explained later in more details in the [Window Operations](#window-operations-on-event-time) section.
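For context, the paragraph changed in the hunk above describes event-time windows as just another grouping column. A minimal sketch of that idea (not part of this commit) follows; the socket source, the "device,timestamp" line format, and the column names device/eventTime are assumptions for illustration only.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("EventTimeWindowSketch").getOrCreate()
import spark.implicits._

// Assumed input: lines like "device-1,2016-08-26 12:00:01" arriving on a socket.
val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val Array(device, ts) = line.split(",")
    (device, java.sql.Timestamp.valueOf(ts))
  }
  .toDF("device", "eventTime")

// Number of events every minute: the event-time window is an ordinary grouping key,
// so the same query could also run over a static Dataset of logged events.
val counts = events
  .groupBy(window($"eventTime", "1 minute"))
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()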
@@ -535,7 +535,7 @@ ds.filter(_.signal > 10).map(_.device) // using typed APIs
 df.groupBy("type").count() // using untyped API
 // Running average signal for each device type
-Import org.apache.spark.sql.expressions.scalalang.typed._
+import org.apache.spark.sql.expressions.scalalang.typed._
 ds.groupByKey(_.type).agg(typed.avg(_.signal)) // using typed API
 {% endhighlight %}
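To make the snippet in the second hunk self-contained, here is a sketch (again, not the commit's code) using the corrected lowercase import. The DeviceData case class and the socket source are assumptions for illustration; backticks around type are needed because it is a Scala keyword.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.scalalang.typed

// `type` is a Scala keyword, hence the backticks.
case class DeviceData(device: String, `type`: String, signal: Double)

val spark = SparkSession.builder.appName("TypedAggregationSketch").getOrCreate()
import spark.implicits._

// Assumed input: lines like "device-1,thermostat,12.5" arriving on a socket.
val ds = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val Array(device, kind, signal) = line.split(",")
    DeviceData(device, kind, signal.toDouble)
  }

ds.filter(_.signal > 10).map(_.device)   // typed API: devices with a strong signal
ds.groupBy("type").count()               // untyped API: count per device type

// Running average signal for each device type, using the typed aggregator.
ds.groupByKey(_.`type`).agg(typed.avg(_.signal))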