Have Your Cake & Eat It Too
Further Dispelling the Myths of the Lambda Architecture
Tyler Akidau
Staff Software Engineer
Google Docs version of slides (with animations) available at: http://goo.gl/eX5kxa
Watch the video with slide synchronization on InfoQ.com:
http://www.infoq.com/presentations/millwheel
Presented at QCon San Francisco
www.qconsf.com
MillWheel - Stream Processing System
Streaming Flume - High-level API
Cloud Dataflow - Data Processing Service
[Figure: Google Cloud Dataflow; stages: Optimize, Schedule; endpoints: GCS, GCS]
MillWheel - Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more...
Streaming Flume - Robert Bradshaw, Daniel Mills, and more...
Cloud Dataflow - Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...
Cloud Dataflow is unreleased.
Things may change.
Agenda
1. Lambda vs Streaming
2. Strong Consistency
3. Reasoning About Time
1. Lambda vs Streaming
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
The Lambda Architecture
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
The Evolution of Streaming
What does it take?
• Strong Consistency
• Tools for Reasoning About Time
2. Strong Consistency
Consistent Storage
Storage
Why consistency is important
• Mostly correct is not good enough
• Required for exactly-once processing
• Required for repeatable results
• Cannot replace batch without it
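The link between consistent storage and exactly-once processing can be sketched as follows. This is a minimal illustration, not how MillWheel or Dataflow implement it: it assumes every record carries a unique ID and that the "seen IDs" set and the running total are committed atomically. An in-memory set stands in for the consistent store, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: redelivered records are detected by ID so that each
// record contributes to the result exactly once.
class ExactlyOnceSum {
    // Stand-in for consistent storage (assumption: ID check and state update
    // commit together, modeled here with synchronized methods).
    private final Set<String> seenIds = new HashSet<>();
    private long total = 0;

    // Returns true if the record was applied, false if it was a duplicate.
    synchronized boolean process(String recordId, long value) {
        if (!seenIds.add(recordId)) {
            return false; // already applied: a retry must not double count
        }
        total += value;
        return true;
    }

    synchronized long total() {
        return total;
    }
}
```

If the ID check and the state update could diverge, a retry after a partial failure would either drop or double count the record, which is the "mostly correct" outcome the slide rules out.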
How?
• Sequencers (e.g. BigTable)
• Leases (e.g. Spanner)
• Federation of storage silos (e.g. Samza, Dataflow)
• RDDs (e.g. Spark)
http://research.google.com/pubs/pub41378.html
3. Reasoning About Time
• Event Time vs Stream Time
• Batch vs Streaming
• Approaches
• Dataflow API
Event Time - When Events Happened
Stream Time - When Events Are Processed
Batch vs Streaming
Batch
[Figure: input processed by MapReduce]
Batch: Fixed Windows
[Figure: MapReduce over input split into hourly windows, [10:00 - 11:00) through [23:00 - 0:00)]
Batch: User Sessions
[Figure: MapReduce over the [10:00 - 11:00) and [11:00 - 12:00) inputs, grouped into sessions per user: Joan, Larry, Ingo, Amanda, Cheryl, Arthur]
Streaming
[Figure: data stream on a time axis from 10:00 to 16:00]
Confounding characteristics of data streams
• Unordered
• Unbounded
• Of Varying Event Time Skew
[Figure: Event Time Skew; axes: Event Time, Stream Time; labeled region: Skew]
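One way to read skew, given the earlier definitions of event time and stream time, is as the gap between when an event happened and when it is processed. The sketch below makes that arithmetic concrete. The class name and the timestamps in the usage note are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: skew = stream (processing) time minus event time.
class EventTimeSkew {
    static Duration skew(Instant eventTime, Instant streamTime) {
        return Duration.between(eventTime, streamTime);
    }
}
```

For example, an event that happened at 10:00:00 and is processed at 10:05:30 has a skew of 5 minutes 30 seconds. Because skew varies per record, arrival order alone says little about event order.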
Approaches
Approaches to reasoning about time
1. Time-Agnostic Processing
2. Approximation
3. Stream Time Windowing
4. Event Time Windowing
1. Time-Agnostic Processing - Filters
[Figure: stream elements on a Stream Time axis, 10:00 to 16:00]
Example Input: Web server traffic logs
Example Output: All traffic from specific domains
Pros: Straightforward; Efficient
Cons: Limited utility
1. Time-Agnostic Processing - Hash Join
[Figure: stream elements on a Stream Time axis, 10:00 to 16:00]
Example Input: Query & Click traffic
Example Output: Joined stream of Query + Click pairs
Pros: Straightforward; Efficient
Cons: Limited utility
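A time-agnostic hash join of this kind can be sketched as below. The sketch assumes queries and clicks share a join key (called `queryId` here) and that each key sees at most one query and one click. The class, method names, and output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a streaming hash join: whichever side arrives first
// is buffered by key until its counterpart shows up.
class StreamingHashJoin {
    private final Map<String, String> pendingQueries = new HashMap<>();
    private final Map<String, String> pendingClicks = new HashMap<>();
    private final List<String> joined = new ArrayList<>();

    void onQuery(String queryId, String query) {
        String click = pendingClicks.remove(queryId);
        if (click != null) {
            joined.add(query + " + " + click); // counterpart already buffered
        } else {
            pendingQueries.put(queryId, query); // wait for the click
        }
    }

    void onClick(String queryId, String click) {
        String query = pendingQueries.remove(queryId);
        if (query != null) {
            joined.add(query + " + " + click);
        } else {
            pendingClicks.put(queryId, click); // click arrived before its query
        }
    }

    List<String> joined() {
        return joined;
    }
}
```

No timestamps are consulted, so arrival order does not matter. The cost is that unmatched elements stay buffered indefinitely unless some expiry policy is added.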
2. Approximation via Online Algorithms
[Figure: stream elements on a Stream Time axis, 10:00 to 16:00]
Example Input: Twitter hashtags
Example Output: Approximate top N hashtags per prefix
Pros: Efficient
Cons: Inexact; Complicated Algorithms
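The talk does not name a specific online algorithm. As one illustration, the sketch below uses the Misra-Gries frequent-items summary, which tracks heavy hitters in bounded memory. The class name and the choice of algorithm are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Misra-Gries summary with at most k counters. Estimates never exceed the
// true count, and frequent items are guaranteed to keep a counter.
class MisraGries {
    private final int k;
    private final Map<String, Integer> counters = new HashMap<>();

    MisraGries(int k) {
        this.k = k;
    }

    void add(String item) {
        Integer count = counters.get(item);
        if (count != null) {
            counters.put(item, count + 1);
        } else if (counters.size() < k) {
            counters.put(item, 1);
        } else {
            // No free counter: decrement every counter and drop those at zero.
            Iterator<Map.Entry<String, Integer>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Integer> e = it.next();
                if (e.getValue() == 1) {
                    it.remove();
                } else {
                    e.setValue(e.getValue() - 1);
                }
            }
        }
    }

    // Lower-bound estimate of how often the item has been seen.
    int estimate(String item) {
        return counters.getOrDefault(item, 0);
    }
}
```

Memory stays fixed at k counters however long the stream runs, which is the efficiency the slide refers to. The price is the inexactness: counts are underestimates, and rare items may vanish entirely.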
3. Windowing by Stream Time
[Figure: stream elements on a Stream Time axis, 10:00 to 16:00]
Example Input: Web server request traffic
Example Output: Per-minute rate of received requests
Pros: Straightforward; Results reflect contents of stream
Cons: Results don’t reflect events as they happened; If approximating event time, usefulness varies
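Stream time windowing for the per-minute request rate can be sketched as a counter keyed by arrival minute. This is a minimal illustration. The class name is hypothetical, and the arrival timestamp is passed in explicitly (rather than read from the system clock) to keep the example deterministic.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count requests per minute of arrival (stream time).
class StreamTimeRate {
    private final Map<Long, Integer> perMinute = new TreeMap<>();

    // arrivalMillis is when the system received the request, not when the
    // request was originally made.
    void onRequest(long arrivalMillis) {
        long minute = Math.floorDiv(arrivalMillis, 60_000L);
        perMinute.merge(minute, 1, Integer::sum);
    }

    int countForMinute(long minute) {
        return perMinute.getOrDefault(minute, 0);
    }
}
```

Since arrival time only moves forward, a minute's count is final as soon as that minute has passed and no buffering for late data is needed. That is also why the result describes the stream rather than the underlying events.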
4. Windowing by Event Time - Fixed Windows
[Figure: elements on a Stream Time axis mapped into fixed windows on an Event Time axis, 10:00 to 16:00]
Example Input: Twitter hashtags
Example Output: Top N hashtags by prefix per hour
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
4. Windowing by Event Time - Sessions
[Figure: elements on a Stream Time axis mapped into session windows on an Event Time axis, 10:00 to 16:00]
Example Input: User activity stream
Example Output: Per-session group of activities
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
Dataflow API
What are you computing?
Where in event time?
When in stream time?
What = Aggregation API
Where = Windowing API
When = Watermarks + Triggers API
Aggregation API
PCollection<KV<String, Double>> sums = Pipeline
    .begin()
    .read("userRequests")
    // What: sum the values
    .apply(new Sum());
[Figure: Aggregation API; Sum applied to the input values]
Streaming Mode
[Figure: input values plotted against Stream Time and against Event Time, 10:00 to 10:06]
Windowing API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    // Where: two-minute fixed windows in event time
    .apply(Window.into(new FixedWindows(2, MINUTE)))
    .apply(new Sum());
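What fixed windowing by event time followed by a sum computes can be sketched in plain Java, independent of the (unreleased) Dataflow API. The class and method names are hypothetical. Windows are identified by their start timestamp in milliseconds.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: assign each value to a fixed event-time window and
// keep a running sum per window.
class FixedWindowSum {
    private final long windowMillis;
    private final Map<Long, Long> sums = new TreeMap<>(); // window start -> sum

    FixedWindowSum(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Start of the window [start, start + windowMillis) containing the event.
    long windowStart(long eventMillis) {
        return Math.floorDiv(eventMillis, windowMillis) * windowMillis;
    }

    // Arrival order is irrelevant: only the event timestamp picks the window.
    void add(long eventMillis, long value) {
        sums.merge(windowStart(eventMillis), value, Long::sum);
    }

    long sum(long windowStart) {
        return sums.getOrDefault(windowStart, 0L);
    }
}
```

The sketch says nothing about when a window's sum is complete enough to emit. It only answers the "where in event time" question.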
[Figure: Windowing API; input values assigned to two-minute FixedWindows in Event Time (10:00 to 10:06), then summed per window]
Watermarks
● f(S) -> E
● S = a point in stream time (i.e. now)
● E = the point in event time up to which input data is complete as of S
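The talk does not say how f is computed. One simple heuristic, shown here purely as an assumption, is to take the largest event time observed so far and subtract a fixed bound on expected skew. The class name and the bound are hypothetical.

```java
// Hypothetical heuristic watermark: E = (max event time seen) - (assumed skew).
class HeuristicWatermark {
    private final long maxSkewMillis;
    private long maxEventMillis = Long.MIN_VALUE;

    HeuristicWatermark(long maxSkewMillis) {
        this.maxSkewMillis = maxSkewMillis;
    }

    // Records one element and reports whether it was late, i.e. older than
    // the watermark as it stood before this element was observed.
    boolean observe(long eventMillis) {
        boolean late = eventMillis < watermark();
        maxEventMillis = Math.max(maxEventMillis, eventMillis);
        return late;
    }

    // Current estimate of the event time up to which input is complete.
    long watermark() {
        if (maxEventMillis == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing observed yet
        }
        return maxEventMillis - maxSkewMillis;
    }
}
```

Because the bound is only a guess, such a watermark is an estimate of completeness rather than a guarantee, and data older than the estimate can still arrive.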
[Figure: Event Time Skew; axes: Event Time, Stream Time]
Watermarks
[Figure: FixedWindows and Sum over the same input, with the watermark relating Stream Time (10:00 to 10:06) to Event Time]
Watermark Caveats
Too slow = more latency
Too fast = late data
Triggers
When in stream time to emit?
Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    // When: emit each window once, when the watermark passes it
    .apply(Window.into(new FixedWindows(2, MINUTE))
        .trigger(new AtWatermark()))
    .apply(new Sum());
[Figure: per-window sums emitted at the watermark, plotted in Event Time and Stream Time (10:00 to 10:06); a late datum arrives after its window has been emitted]
A Better Strategy
1. Once per stream time minute
2. At watermark
3. Once per record for two weeks
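The three-part strategy can be sketched for a single window as a small state machine. This is an illustration under stated assumptions, not the Dataflow implementation: the class and method names are hypothetical, the caller drives it with explicit watermark values, and "two weeks" is measured by the watermark relative to the window end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one window's emissions under the strategy above.
class WindowTriggerSketch {
    static final long MINUTE = 60_000L;
    static final long TWO_WEEKS = 14L * 24 * 60 * MINUTE;

    private final long windowEnd; // event time, exclusive
    private long sum = 0;
    private boolean onTimeEmitted = false;
    private final List<String> emitted = new ArrayList<>();

    WindowTriggerSketch(long windowEnd) {
        this.windowEnd = windowEnd;
    }

    // A record for this window arrives while the watermark is at `watermark`.
    void onRecord(long value, long watermark) {
        if (watermark >= windowEnd + TWO_WEEKS) {
            return; // past the two-week horizon: dropped
        }
        onWatermark(watermark); // make sure the on-time result fires first
        sum += value;
        if (onTimeEmitted) {
            emitted.add("late:" + sum); // 3. once per late record
        }
    }

    // Called once per stream-time minute.
    void onStreamMinute() {
        if (!onTimeEmitted) {
            emitted.add("early:" + sum); // 1. speculative result
        }
    }

    // Called whenever the watermark advances.
    void onWatermark(long watermark) {
        if (!onTimeEmitted && watermark >= windowEnd) {
            onTimeEmitted = true;
            emitted.add("on-time:" + sum); // 2. at watermark
        }
    }

    List<String> emitted() {
        return emitted;
    }
}
```

Each emission carries the running sum, so a late record refines the earlier on-time result instead of being lost.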
[Figure: the same input under the better strategy, plotted in Event Time and Stream Time (10:00 to 10:06); each window emits early per-minute sums, a sum at the watermark, and an updated sum when a late datum arrives]
Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE))
        .trigger(new SequenceOf(
            // 1. Once per stream time minute, until the watermark
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            // 2. At watermark
            new AtWatermark(),
            // 3. Once per record for two weeks
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(
                    14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
Lambda vs Streaming
Low-latency, approximate results
Complete, correct results as soon as possible
Ability to deal with changes upstream
One Last Thing...
What if I want sessions?
Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    // Only the windowing changes: sessions with a one-minute gap
    .apply(Window.into(new Sessions(1, MINUTE))
        .trigger(new SequenceOf(
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            new AtWatermark(),
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(
                    14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
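Session windowing with a one-minute gap can be sketched as interval merging: each event opens a proto-session [t, t + gap), and overlapping proto-sessions are merged. The sketch below does this in one batch pass over sorted timestamps for a single key. A streaming system would merge incrementally as data arrives. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: build gap-based sessions from event timestamps.
class SessionWindows {
    // Returns sessions as {start, end} pairs with exclusive end, where
    // end = (last event in the session) + gapMillis.
    static List<long[]> sessions(long[] eventMillis, long gapMillis) {
        long[] sorted = eventMillis.clone(); // input may be unordered
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                last[1] = t + gapMillis; // falls inside the open session: extend
            } else {
                result.add(new long[] {t, t + gapMillis}); // start a new session
            }
        }
        return result;
    }
}
```

Unlike fixed windows, session boundaries depend on the data itself, so a late datum that lands between two sessions can cause them to merge into one.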
[Figure: the same input windowed into sessions with a 1 minute gap, plotted in Event Time and Stream Time (10:00 to 10:06); sessions grow and merge as data arrives, including a late datum]
Summary
Lambda is great
Streaming by itself is better :-)
Strong Consistency = Correctness
Streaming = Aggregation + Windowing + Triggers
Tools For Reasoning About Time = Power + Flexibility
Thank you!
Questions?
Questions about this talk: takidau@google.com (Tyler Akidau)
Questions about Cloud Dataflow: cloude@google.com (Eric Schmidt)
