© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Deep Dive into Concepts and Tools for Analyzing Streaming Data
Dr. Steffen Hausmann
Sr. Solutions Architect, Amazon Web Services
Data originates in real-time
Creek 1 by mountainamoeba / cc by 2.0
Analytics is done in batches
Königsee by andresumida / cc by 2.0
Insights are Perishable
Chillis by Lucas Cobb / cc by 2.0
Analyzing Streaming Data on AWS
Challenges of Stream Processing
Lines by FollowYour Nose / cc by 2.0
Comparing Streams and Relations
Relation: 𝑅 ⊆ 𝐼𝑑 × 𝐶𝑜𝑙𝑜𝑟
Stream: 𝑆 ⊆ 𝐼𝑑 × 𝐶𝑜𝑙𝑜𝑟 × 𝑇𝑖𝑚𝑒
(Diagram: elements of the stream placed on a time axis that ends at "now")
Querying Streams and Relations
Relation: fixed data and ad-hoc queries
Stream: fixed queries and continuously ingested data
Challenges of Querying Infinite Streams
SELECT * FROM S WHERE color = 'black'
SELECT * FROM S JOIN S'
SELECT color, COUNT(1) FROM S GROUP BY color
... NOT EXISTS (SELECT * FROM S WHERE color = 'red')
Analyzing Streaming Data on AWS
Amazon Kinesis Analytics
• Runs standard SQL queries on top of streaming data
• Fully managed and scales automatically
• Only pay for the resources your queries consume
Apache Flink
• Open-source stream processing framework
• Included in Amazon Elastic MapReduce (EMR)
• Flexible APIs for Java and Scala, plus SQL and CEP support
Evaluating Queries over Streams
Windows by Brad Greenlee / cc by 2.0
Evaluating Non-monotonic Operators
Tumbling Windows
SELECT STREAM color, COUNT(1)
FROM ...
GROUP BY STEP(rowtime BY INTERVAL '10' SECOND), color;
(Diagram: events t1, t3, t5, t6, t9 on a timeline split into 10-second windows)
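The tumbling-window grouping can be sketched in plain Java, without any streaming framework. This is only an illustration of the idea that every event falls into exactly one fixed-size window; the class `TumblingWindowCount`, its methods, and the sample events are hypothetical and are not part of Kinesis Analytics or Flink.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not a Kinesis Analytics or Flink API.
class TumblingWindowCount {
    // Start of the tumbling window that contains the timestamp (all values in milliseconds).
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Counts events per "windowStart/color" key; each event lands in exactly one window.
    static Map<String, Integer> count(long[] timestampsMs, String[] colors, long sizeMs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            String key = windowStart(timestampsMs[i], sizeMs) + "/" + colors[i];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample events: timestamps in ms and a color per event.
        long[] ts = {1_000, 3_000, 5_000, 6_000, 9_000, 12_000};
        String[] colors = {"red", "black", "red", "red", "black", "red"};
        // prints {0/black=2, 0/red=3, 10000/red=1}
        System.out.println(count(ts, colors, 10_000));
    }
}
```

Unlike the SQL query, this sketch sees all events at once; a streaming engine would emit each window's result as soon as the window closes.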
Evaluating Non-monotonic Operators
Sliding Windows
SELECT STREAM color, COUNT(1) OVER w
FROM ...
WINDOW w AS (PARTITION BY color RANGE INTERVAL '10' SECOND PRECEDING);
(Diagram: events t1, t3, t5, t6, t9 on a timeline)
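A sliding window emits one result per incoming event, computed over the events of the preceding interval. The following plain-Java sketch illustrates that behavior under two assumptions: timestamps arrive in non-decreasing order, and an event exactly one range old still counts. The class `SlidingWindowCount` is hypothetical and not part of any streaming library.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; assumes non-decreasing timestamps.
class SlidingWindowCount {
    private final long rangeMs;
    // Per color, the timestamps that are still inside the sliding range.
    private final Map<String, Deque<Long>> recent = new HashMap<>();

    SlidingWindowCount(long rangeMs) {
        this.rangeMs = rangeMs;
    }

    // Adds an event and returns the number of events of the same color
    // within [timestampMs - rangeMs, timestampMs], including the new one.
    int add(String color, long timestampMs) {
        Deque<Long> window = recent.computeIfAbsent(color, c -> new ArrayDeque<>());
        window.addLast(timestampMs);
        // Evict everything older than the range; the new event itself always stays.
        while (window.peekFirst() < timestampMs - rangeMs) {
            window.removeFirst();
        }
        return window.size();
    }

    public static void main(String[] args) {
        SlidingWindowCount counter = new SlidingWindowCount(10_000);
        System.out.println(counter.add("red", 1_000));   // prints 1
        System.out.println(counter.add("red", 5_000));   // prints 2
        System.out.println(counter.add("red", 12_000));  // prints 2 (the event at 1000 has expired)
    }
}
```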
Evaluating Non-monotonic Operators
Session Windows
stream
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
(Diagram: events t1, t3, t5, t6, t8, t9 on a timeline, grouped into sessions separated by a session gap)
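How a session gap splits the events of one key can be sketched without Flink. This is a simplified illustration, not Flink's implementation: it assumes the timestamps of a single key are already sorted, and it starts a new session whenever the distance to the previous event exceeds the gap. The class `SessionWindows` and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the Flink session window assigner.
class SessionWindows {
    // Splits the ascending timestamps of one key into sessions.
    // A new session starts when the distance to the previous event exceeds the gap.
    static List<List<Long>> sessions(long[] sortedTimestamps, long gap) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gap) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up event times in minutes, with a 10-minute session gap.
        long[] minutes = {1, 3, 5, 6, 20, 21};
        // prints [[1, 3, 5, 6], [20, 21]]
        System.out.println(sessions(minutes, 10));
    }
}
```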
Evaluating Unbounded Queries
SELECT STREAM *
FROM S AS s JOIN S' AS t
ON s.color = t.color
SELECT STREAM *
FROM S OVER w AS s JOIN S' OVER w AS t
ON s.color = t.color
WINDOW w AS (RANGE INTERVAL '10' SECOND PRECEDING);
(Diagram: two timelines for the streams S and S', one with events t1, t3, t5, t6, t9 and one with events t2, t4, t7, t8)
Different Time Semantics
Maintaining Order of Events
Event time: t1 t3 t7 t8 t11
Processing time: t1 t3 t8 t7 t11
Maintaining Order of Events
Using processing time based windows
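(Diagram: events arrive in the order t1, t3, t8, t7, t11 along the processing-time axis; the result table has the columns processing time and count, with rows for the windows starting at 0 and 10)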
Maintaining Order of Events
Using multiple time-windows
SELECT STREAM
    STEP(rowtime BY INTERVAL '10' SECOND) AS processing_time,
    STEP(event_time BY INTERVAL '10' SECOND) AS event_time,
    color,
    COUNT(1)
FROM ...
GROUP BY processing_time, event_time, color;
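Grouping by both time attributes can be illustrated with a small plain-Java sketch. It leaves out the color column for brevity, and the arrival times are made up so that t7 arrives after t8, as in the slides. The class `DualTimeWindowCount` is hypothetical and only mimics what the two `STEP` expressions compute.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; the color column is omitted for brevity.
class DualTimeWindowCount {
    // Rounds a time (in seconds) down to the start of its window.
    static long step(long timeSec, long sizeSec) {
        return timeSec - Math.floorMod(timeSec, sizeSec);
    }

    // Each event is {processingTimeSec, eventTimeSec}; counts per pair of windows.
    static Map<String, Integer> count(long[][] events, long sizeSec) {
        Map<String, Integer> counts = new TreeMap<>();
        for (long[] e : events) {
            String key = "processing=" + step(e[0], sizeSec) + ",event=" + step(e[1], sizeSec);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up arrival times: t1, t3, t8 arrive on time; t7 arrives late, after t8.
        long[][] events = {{2, 1}, {4, 3}, {9, 8}, {12, 7}, {13, 11}};
        // prints {processing=0,event=0=3, processing=10,event=0=1, processing=10,event=10=1}
        System.out.println(count(events, 10));
    }
}
```

The late event shows up as a separate row for an event-time window that already produced a result, so a downstream consumer has to add up the rows that share an event-time window.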
Maintaining Order of Events
Using multiple time-windows
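(Diagram: events arrive in the order t1, t3, t8, t7, t11 along the processing-time axis; the result table has the columns processing time, event time, and count, with rows for the window pairs (0, 0), (10, 0), and (10, 10))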
Maintaining Order of Events
Using event time and watermarks
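(Diagram: events arrive in the order t1, t3, t8, t7, t11 along the processing-time axis, interleaved with the watermarks 10 and 20; the result table has the columns event time and count, with rows for the windows starting at 0 and 10)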
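The interplay of event-time windows and watermarks can be sketched in plain Java. This is a simplified model, not Flink's implementation: a window fires once the watermark reaches its end, and an event whose window has already fired is reported as late. The class `WatermarkedWindows`, its methods, and the sample sequence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; a simplified model of watermark-driven windows.
class WatermarkedWindows {
    private final long size;
    private final TreeMap<Long, Integer> open = new TreeMap<>(); // window start -> count
    private long watermark = Long.MIN_VALUE;

    WatermarkedWindows(long size) {
        this.size = size;
    }

    // Adds an event by its event time; returns false if its window has already fired.
    boolean onEvent(long eventTime) {
        long start = eventTime - Math.floorMod(eventTime, size);
        if (start + size <= watermark) {
            return false; // late: the watermark has already passed the end of this window
        }
        open.merge(start, 1, Integer::sum);
        return true;
    }

    // Advances the watermark and returns "start=count" for every window it closes.
    List<String> onWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        List<String> fired = new ArrayList<>();
        while (!open.isEmpty() && open.firstKey() + size <= watermark) {
            Map.Entry<Long, Integer> e = open.pollFirstEntry();
            fired.add(e.getKey() + "=" + e.getValue());
        }
        return fired;
    }

    public static void main(String[] args) {
        WatermarkedWindows windows = new WatermarkedWindows(10);
        for (long t : new long[] {1, 3, 8, 7}) {
            windows.onEvent(t); // t7 is out of order but arrives before the watermark
        }
        System.out.println(windows.onWatermark(10)); // prints [0=4]
        windows.onEvent(11);
        System.out.println(windows.onWatermark(20)); // prints [10=1]
    }
}
```

Because the result for a window is held back until the watermark passes, the out-of-order event t7 is still counted in the correct event-time window.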
Adding Watermarks to a Stream
- Periodic watermarks
- Assuming ascending timestamps
- Punctuated watermarks
stream.assignTimestampsAndWatermarks(
    new AscendingTimestampExtractor<MyEvent>() {
        @Override
        public long extractAscendingTimestamp(MyEvent element) {
            return element.getCreationTime();
        }
    });
Watermarks and Allowed Lateness
stream
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .sideOutputLateData(lateOutputTag)
(Diagram: events t3, t1, t8, t4, t5 along the processing-time axis, with a marker labeled 80)
Different Processing Semantics
Kaseki 2010 by Dominic Alves / cc by 2.0
Consuming Data from a Stream
Consumer
Output sink
Different Processing Semantics
At-most Once Semantics
(Diagram: consumer, output sink, and offset store; positions shown: pos 561 and pos 1105)
Different Processing Semantics
At-least Once Semantics
(Diagram: consumer, output sink, and offset store; positions shown: pos 561 and pos 0)
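The difference between at-most-once and at-least-once processing comes down to when the consumer stores its offset relative to writing to the sink. The plain-Java simulation below is a sketch under stated assumptions: the consumer crashes exactly once, and the crash hits between the two steps (after the commit but before the write, or after the write but before the commit). The class `OffsetCommitDemo` and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation for illustration only; models a single crash and restart.
class OffsetCommitDemo {
    // Processes records 0..n-1, crashing once while handling record crashAt,
    // then restarting from the stored offset. Returns what reached the sink.
    static List<Integer> run(int n, int crashAt, boolean commitBeforeProcessing) {
        List<Integer> sink = new ArrayList<>();
        int storedOffset = 0; // offset store: position to resume from after a restart
        boolean crashed = false;
        int pos = storedOffset;
        while (pos < n) {
            if (commitBeforeProcessing) {
                storedOffset = pos + 1; // at-most-once: commit first
            }
            if (pos == crashAt && !crashed) {
                crashed = true;
                if (!commitBeforeProcessing) {
                    sink.add(pos); // at-least-once: written, but the commit is lost
                }
                pos = storedOffset; // restart from the offset store
                continue;
            }
            sink.add(pos);
            if (!commitBeforeProcessing) {
                storedOffset = pos + 1; // at-least-once: commit after the write
            }
            pos++;
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 2, true));  // prints [0, 1, 3] (record 2 is lost)
        System.out.println(run(4, 2, false)); // prints [0, 1, 2, 2, 3] (record 2 is duplicated)
    }
}
```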
Different Processing Semantics
Exactly-once Semantics
Message Deduplication
• At-least-once event delivery plus message deduplication
• Keep a transaction log of processed messages
• On failure, replay events and remove duplicated events for every operator
Distributed Snapshots
• State for each operator is periodically checkpointed
• On failure, rewind operator to the previous consistent state
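The message-deduplication approach can be sketched as a sink that remembers which message IDs it has already written. This is a minimal in-memory illustration, assuming every message carries a unique ID; a real system would keep the log of processed messages in durable storage. The class `DeduplicatingSink` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; the in-memory set stands in for a durable transaction log.
class DeduplicatingSink {
    private final Set<Long> processedIds = new HashSet<>();
    private final List<String> output = new ArrayList<>();

    // Writes the payload unless the message ID was seen before.
    // Returns true if the message was written, false if it was dropped as a duplicate.
    boolean write(long messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false;
        }
        output.add(payload);
        return true;
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        DeduplicatingSink sink = new DeduplicatingSink();
        sink.write(1, "a");
        sink.write(2, "b");
        sink.write(2, "b"); // replayed after a failure, dropped as a duplicate
        sink.write(3, "c");
        System.out.println(sink.output()); // prints [a, b, c]
    }
}
```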
Go Build!
Please complete the session survey in
the summit mobile app.
Thank you!
