Last year, in Apache Spark 2.0, we introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs, including DataFrames, Datasets, and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and providing end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, we've been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, whether it comes from messy unstructured files, a structured columnar historical data warehouse, or arrives in real time from Kafka or Kinesis.
4. Complexities in stream processing
Complex Data
Diverse data formats (json, avro, binary, …)
Data can be dirty, late, out-of-order
Complex Systems
Diverse storage systems and formats (SQL, NoSQL, parquet, ...)
System failures
Complex Workloads
Event time processing
Combining streaming with interactive queries, machine learning
5. Structured Streaming
Stream processing on the Spark SQL engine
Fast, scalable, fault-tolerant
Rich, unified, high-level APIs to deal with complex data and complex workloads
Rich ecosystem of data sources to integrate with many storage systems
8. Treat Streams as Unbounded Tables
New data in the data stream = new rows appended to an unbounded input table
9. New Model
Trigger: every 1 sec
[Diagram: at each trigger (t = 1, 2, 3) the input table contains all data received up to that time]
Input: data from the source as an append-only table
Trigger: how frequently to check the input for new data
Query: operations on the input (usual map/filter/reduce; new window, session ops)
10. New Model
[Diagram: at each trigger (t = 1, 2, 3) the query runs on the input received up to that time and produces the result up to that time]
Result: the output table produced by the query, updated after every trigger
Output: what part of the result to write to storage after every trigger
Complete output mode: write all rows of the full result table to storage every time
11. New Model
[Diagram: in append mode, only the rows added to the result since the last trigger are written to storage]
Result: the output table produced by the query, updated after every trigger
Output: what part of the result to write to storage after every trigger
Complete output mode: write the full result table every time
Append output mode: write only the new rows added to the result table since the previous batch
*Not all output modes are feasible with all queries
12. New Model
This is a conceptual model that guides how to think of a streaming query as a simple table query
The engine does not need to keep the full input table in memory once it has streamified the query
13. API - Dataset/DataFrame
static data = bounded table
streaming data = unbounded table
Single API!
14. Batch Queries with DataFrames
input = spark.read
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.save("dest-path")
Read from Json file
Select some devices
Write to parquet file
15. Streaming Queries with DataFrames
input = spark.readStream
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")
Read from Json file stream
Replace read with readStream
Select some devices
Code does not change
Write to Parquet file stream
Replace save() with start()
16. DataFrames, Datasets, SQL
input = spark.readStream
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")
Logical Plan: Streaming Source -> Project (device, signal) -> Filter (signal > 15) -> Streaming Sink
Spark automatically streamifies!
Spark SQL converts the batch-like query into a series of incremental execution plans, each processing the new files available at every trigger (t = 1, t = 2, t = 3, ...)
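As a minimal Scala sketch of the same query (not from the slides; paths, the app name, and the schema are illustrative): streaming file sources need an explicit schema, and the physical plan of the most recent incremental execution can be inspected with StreamingQuery.explain().

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

val spark = SparkSession.builder.appName("streamify-sketch").getOrCreate()

val schema = new StructType()
  .add("device", StringType)
  .add("signal", DoubleType)

val input = spark.readStream
  .format("json")
  .schema(schema)                                     // required for streaming file sources
  .load("source-path")

val result = input
  .select("device", "signal")
  .where("signal > 15")

val streamingQuery = result.writeStream
  .format("parquet")
  .option("checkpointLocation", "checkpoint-path")    // the file sink needs a checkpoint
  .start("dest-path")

streamingQuery.explain()   // prints the physical plan of the most recent incremental execution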
17. Fault-tolerance with Checkpointing
Checkpointing: metadata (e.g. offsets) of the current batch is stored in a write-ahead log in HDFS/S3
The query can be restarted from the log
Streaming sources can replay the exact data range in case of failure
Streaming sinks are idempotent by design, so they can deduplicate reprocessed data when writing
Together, these give end-to-end exactly-once guarantees
[Diagram: each trigger (t = 1, 2, 3) processes new files and records its offsets in the write-ahead log]
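A hedged sketch of how checkpointing surfaces in user code, assuming result is the filtered DataFrame from the earlier example and the paths are placeholders:

// The checkpoint directory holds the write-ahead log of offsets.
val restartableQuery = result.writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoint")   // offset log in HDFS/S3
  .start("/outputTable")

// If the application fails and is restarted, running the same code with the same
// checkpointLocation makes the engine recover the last committed offsets and
// continue processing exactly where it left off.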
19. Traditional ETL
Raw, dirty, un/semi-structured data is dumped as files
Periodic jobs run every few hours to convert the raw data into structured data ready for further analytics
[Diagram: files are dumped within seconds, but the structured table is only ready hours later]
20. Traditional ETL
Hours of delay before decisions can be taken on the latest data
Unacceptable when time is of the essence [intrusion detection, anomaly detection, etc.]
21. Streaming ETL w/ Structured Streaming
Structured Streaming enables raw data to be available as structured data as soon as possible
[Diagram: raw data becomes a queryable table within seconds]
22. Streaming ETL w/ Structured Streaming
Example
- JSON data being received in Kafka
- Parse nested JSON and flatten it
- Store in a structured Parquet table
- Get end-to-end fault-tolerance guarantees
val rawData = spark.readStream
.format("kafka")
.option("subscribe", "topic")
.option("kafka.bootstrap.servers",...)
.load()
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")
val query = parsedData.writeStream
.option("checkpointLocation", "/checkpoint")
.partitionBy("date")
.format("parquet")
.start("/parquetTable/")
23. Reading from Kafka [Spark 2.1]
Supports Kafka 0.10.0.1
Specify options to configure
How?
kafka.bootstrap.servers => broker1
What?
subscribe => topic1,topic2,topic3 // fixed list of topics
subscribePattern => topic* // dynamic list of topics
assign => {"topicA":[0,1] } // specific partitions
Where?
startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345} }
val rawData = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers",...)
.option("subscribe", "topic")
.load()
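A hedged sketch of the other source options listed above; broker addresses and topic names are placeholders.

val fromPattern = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribePattern", "topic.*")             // dynamic list of topics
  .option("startingOffsets", "earliest")             // or a per-partition JSON spec
  .load()

val fromOffsets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("assign", """{"topicA":[0,1]}""")          // specific partitions
  .option("startingOffsets", """{"topicA":{"0":23,"1":345}}""")
  .load()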
24. Reading from Kafka
val rawData = spark.readStream
.format("kafka")
.option("subscribe", "topic")
.option("kafka.bootstrap.servers",...)
.load()
The rawData DataFrame has the following columns:
key      | value    | topic    | partition | offset | timestamp
[binary] | [binary] | "topicA" | 0         | 345    | 1486087873
[binary] | [binary] | "topicB" | 3         | 2890   | 1486086721
25. Transforming Data
Cast the binary value to a string and name the column json
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")
26. Transforming Data
Cast the binary value to a string and name the column json
Parse the json string and expand it into nested columns, named data
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")
from_json("json").as("data") turns each json string, e.g.
{ "timestamp": 1486087873, "device": "devA", …}
{ "timestamp": 1486082418, "device": "devX", …}
into a nested data column:
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …
27. Transforming Data
Cast the binary value to a string and name the column json
Parse the json string and expand it into nested columns, named data
Flatten the nested columns
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")
select("data.*") flattens the nested data column into top-level (not nested) columns:
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …
28. Transforming Data
Cast the binary value to a string and name the column json
Parse the json string and expand it into nested columns, named data
Flatten the nested columns
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")
Powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, ... 100s of functions (see our blog post)
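A small illustration (not from the slides) of two of the functions named above; the readings array column is hypothetical, and parsedData is the DataFrame built earlier.

import org.apache.spark.sql.functions.{col, explode, struct, to_json}

// explode: one output row per element of an array column
val exploded = parsedData
  .select(col("device"), explode(col("readings")).as("reading"))

// to_json: re-serialize selected columns into a single JSON string column
val asJson = parsedData
  .select(to_json(struct(col("timestamp"), col("device"))).as("value"))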
29. Writing to Parquet table
Save the parsed data as a Parquet table at the given path
Partition files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.partitionBy("date")
.format("parquet")
.start("/parquetTable")
30. Checkpointing
Enable checkpointing by setting the checkpoint location where offset logs are saved
start() actually starts a continuously running StreamingQuery in the Spark cluster
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.format("parquet")
.partitionBy("date")
.start("/parquetTable/")
31. Streaming Query
query is a handle to the continuously running StreamingQuery
Used to monitor and manage the execution
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.format("parquet")
.partitionBy("date")
.start("/parquetTable/")
[Diagram: the StreamingQuery processes new data at each trigger t = 1, 2, 3]
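A hedged sketch of using the handle; these StreamingQuery methods exist in Spark 2.1+.

println(query.status)         // whether the query is waiting for data or processing a trigger
println(query.lastProgress)   // input rate, processing rate, and offsets of the last trigger
// query.stop()               // stop the query gracefully
query.awaitTermination()      // block the driver until the query stops or fails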
32. Data Consistency on Ad-hoc Queries
Data is available for complex, ad-hoc analytics within seconds
The Parquet table is updated atomically, which ensures prefix integrity
Even if distributed, ad-hoc queries will see either all updates from the streaming query or none; read more in our blog
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
34. Event time Aggregations
Many use cases require aggregate statistics by event time
E.g. what's the number of errors in each system in 1 hour windows?
Many challenges
Extracting event time from data, handling late, out-of-order data
DStream APIs were insufficient for event-time processing
35. Event time Aggregations
Windowing is just another type of grouping in Structured Streaming
Supports UDAFs!
Number of records every hour:
parsedData
.groupBy(window(col("timestamp"), "1 hour"))
.count()
Average signal strength of each device every 10 mins:
parsedData
.groupBy(
col("device"),
window(col("timestamp"), "10 mins"))
.avg("signal")
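A sketch of running the hourly count above as a continuous query; the console sink and complete output mode are chosen here purely for illustration, and parsedData is assumed to have a timestamp column.

import org.apache.spark.sql.functions.{col, window}

val hourlyCounts = parsedData
  .groupBy(window(col("timestamp"), "1 hour"))
  .count()

val aggQuery = hourlyCounts.writeStream
  .outputMode("complete")     // aggregations without a watermark need complete or update mode
  .format("console")
  .start()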
36. Stateful Processing for Aggregations
Aggregates have to be saved as distributed state between triggers
Each trigger reads the previous state and writes updated state
State is stored in memory, backed by a write-ahead log in HDFS/S3
Fault-tolerant, exactly-once guarantee!
[Diagram: at each trigger (t = 1, 2, 3) the query reads from the source, updates its state, writes to the sink, and the state updates are written to the write-ahead log for checkpointing]
37. Watermarking and Late Data
Watermark [Spark 2.1]: a threshold on how late an event is expected to be in event time
Trails behind the max seen event time
The trailing gap is configurable
[Diagram: with a max event time of 12:30 PM and a trailing gap of 10 mins, the watermark is at 12:20 PM; data older than the watermark is not expected]
38. Watermarking and Late Data
Data newer than the watermark may be late, but is allowed to aggregate
Data older than the watermark is "too late" and dropped
Windows older than the watermark are automatically deleted to limit the amount of intermediate state
[Diagram: late data between the watermark and the max event time is allowed to aggregate; data older than the watermark is too late and dropped]
39. Watermarking and Late Data
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window(col("timestamp"), "5 minutes"))
.count()
[Diagram: an allowed lateness of 10 mins between the watermark and the max event time; late data within that gap is allowed to aggregate, data older than the watermark is dropped]
40. Watermarking to Limit State [Spark 2.1]
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window(col("timestamp"), "5 minutes"))
.count()
The system tracks the max observed event time: e.g. after seeing an event at 12:14, the watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state for windows older than 12:04 is deleted
Data that arrives late but is newer than the watermark is still considered in the counts; data older than the watermark is ignored in the counts and its state is dropped
More details in the online programming guide
41. Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState allows any user-defined stateful ops to be applied to a user-defined state
supports timeouts
fault-tolerant, exactly-once
supports Scala and Java
dataset
.groupByKey(groupingFunc)
.mapGroupsWithState(mappingFunc)
def mappingFunc(
key: K,
values: Iterator[V],
state: KeyedState[S]): U = {
// update or remove state
// set timeouts
// return mapped value
}
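A concrete (hypothetical) mapping function in the shape of the skeleton above: a running event count per device. Note that in the released Spark 2.2 API the state class is named GroupState; the sketch assumes spark.implicits._ is imported and parsedData has device and signal columns.

import org.apache.spark.sql.streaming.GroupState

case class DeviceEvent(device: String, signal: Double)

def countPerDevice(
    device: String,
    events: Iterator[DeviceEvent],
    state: GroupState[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + events.size  // add this trigger's events
  state.update(newCount)                                      // persist state for the next trigger
  (device, newCount)
}

val runningCounts = parsedData
  .select("device", "signal").as[DeviceEvent]
  .groupByKey(_.device)
  .mapGroupsWithState(countPerDevice _)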
42. Many more updates!
StreamingQueryListener [Spark 2.1]
Receive regular progress heartbeats for health and perf monitoring
Automatic in Databricks!!
Streaming Deduplication [Spark 2.2]
Automatically eliminate duplicate data from Kafka/Kinesis/etc.
More Kafka Integration [Spark 2.2]
Run batch queries on Kafka, and write to Kafka from batch/streaming queries
Kinesis Source
Read from Amazon Kinesis
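Hedged sketches of two of the Spark 2.2 additions above; the eventId column, broker address, and topic name are placeholders.

// Streaming deduplication: drop duplicate records by eventId, with a watermark
// bounding how long duplicates are tracked.
val deduped = parsedData
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("eventId", "timestamp")

// Batch query on Kafka: read a fixed offset range with the same source.
val kafkaBatch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "topic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()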
43. Future Directions
Stability, stability, stability
Needed to remove the Experimental tag
More supported operations
Stream-stream joins, …
Stable source and sink APIs
Connect to your streams and stores
More sources and sinks
JDBC, …
44. More Info
Structured Streaming Programming Guide
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks blog posts for more focused discussions
https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
and more to come, stay tuned!!