This document discusses approaches for streaming data processing and reasoning about time in streams. It summarizes the limitations of the Lambda architecture and argues that streaming systems alone can provide low-latency and exactly-once processing if they support strong consistency, windowing, and watermark-based triggers. The document also presents Google Cloud Dataflow as a streaming data processing system that provides these capabilities through its aggregation, windowing, and triggers APIs to allow flexible reasoning about event and processing times.
Samza: Real-time Stream Processing at LinkedIn (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1eGbVJv.
Chris Riccomini discusses: Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap. Filmed at qconsf.com.
Chris Riccomini is a Staff Software Engineer at LinkedIn, where he is currently working as a committer and PMC member for Apache Samza. He's been involved in a wide range of projects at LinkedIn, including "People You May Know", REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.
Extending the Yahoo Streaming Benchmark (Jamie Grier)
This presentation describes my own benchmarking of Apache Storm and Apache Flink, based on the work started by Yahoo!. It shows the incredible performance of Apache Flink.
Apache Flink: Streaming Done Right @ FOSDEM 2016 (Till Rohrmann)
The talk I gave at FOSDEM 2016 on the 31st of January.
The talk explains how we can do stateful stream processing with Apache Flink, using the example of counting tweet impressions. It covers Flink's windowing semantics, stateful operators, fault tolerance, and performance numbers. The talk ends with an outlook on what is going to happen in the next couple of months.
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli... (Flink Forward)
Pattern matching over event streams is increasingly being employed in many areas including financial services and click stream analysis. Flink, as a true stream processing engine, emerges as a natural candidate for these use cases. In this talk, we will present FlinkCEP, a library for Complex Event Processing (CEP) based on Flink. At the conceptual level, we will see the different patterns the library can support, we will present the main building blocks we implemented to support them, and we will discuss possible future additions that will further enhance the coverage of the library. At the practical level, we will show how the integration of FlinkCEP with Flink allows the former to take advantage of Flink's rich ecosystem (e.g. connectors) and its stream processing capabilities, such as support for event-time processing, exactly-once state semantics, fault-tolerance, savepoints and high throughput.
Webhooks do's and dont's: what we learned after integrating +100 APIs - Giuli... (Codemotion)
Modern applications are increasingly oriented toward being a composition of APIs and having a serverless architecture, which is why API developers cannot limit themselves to exposing the most common REST endpoints. Webhooks cannot be missing from a modern API, yet there is nothing in the HTTP API literature that comes close to a standard format for designing them, which has given rise to the most disparate implementations. After integrating more than 100 APIs with Stamplay, we share the pros and cons of the design choices made when developing webhooks.
(DVO204) Monitoring Strategies: Finding Signal in the Noise (Amazon Web Services)
"You need to monitor only a few machines and applications before fixing issues in your environment becomes very complicated. Throw in the type of dynamic infrastructure provided by Amazon EC2, and your static monitoring strategies will most likely not scale. Knowing which metrics to watch and how to troubleshoot based on those metrics will help you solve problems more quickly. In this session, we will look at a framework for your metrics and how to use it to find solutions to the issues that come up. We will cover the three types of monitoring data; what to collect; what should trigger an alert (avoiding an alert storm); and how to follow the resources to find the root causes of problems. Session sponsored by Datadog.
"
Dave Klein, Confluent, Developer Advocate
Apache Kafka is the core of an amazing ecosystem of tools and frameworks that enable us to get more value from our data. In this session we'll have a gentle introduction to Apache Kafka and a survey of some of the more popular components in the Kafka ecosystem.
https://www.meetup.com/KafkaBayArea/events/276592389/
Aljoscha Krettek - The Future of Apache Flink (Flink Forward)
http://flink-forward.org/kb_sessions/the-future-of-apache-flinktm/
In this session we will first have a look at the current state of Apache Flink before diving into some of the upcoming features that are either already in development or still in the design phase. Some of the features currently in development that we are going to cover are: – Dynamic Scaling: Adapting a running program to changing workloads. – Queryable State: External querying of internal Flink state. This has the power to replace key/value stores by turning Flink into a key value store that allows for up to date querying of results. – Side Inputs: Having additional data that evolves over time as input to a stream operation. For the glimpse at the far-off future of Apache Flink™ we dare not make any predictions yet. In the session we will look at the latest whisperings and see what the community is currently thinking up as solutions to existing problems and predicted future challenges in the stream processing space.
Building a real time Tweet map with Flink in six weeks (Matthias Kricke)
In this talk we present OSTMap, a tool which was built by 6 students over the course of 6 weeks. Each student spent as little as 5-10 hours per week and had no prior experience with big data or the frameworks used. We also present the concept of geotemporal indices for our use case.
Data Stream Analytics - Why they are important (Paris Carbone)
Streaming is cool and it can help us do quick analytics and make a profit, but what about tsunamis? This is a motivation talk presented at the SeRC Big Data Workshop in Sweden during spring 2016. It motivates the streaming paradigm and provides examples with Apache Flink.
This talk is an application-driven walkthrough of modern stream processing, exemplified by Apache Flink, and of how it enables new applications and makes old applications easier and more efficient. In this talk, we will walk through several real-world stream processing application scenarios of Apache Flink, highlighting unique features in Flink that make these applications possible. In particular, we will see (1) how support for handling out-of-order streams enables real-time monitoring of cloud infrastructure, (2) how the ability to handle high-volume data streams with low latency SLAs enables real-time alerts in network equipment, (3) how the combination of high throughput and the ability to handle batch as a special case of streaming enables an architecture where exactly the same program is used for real-time and historical data processing, and (4) how stateful stream processing can enable an architecture that eliminates the need for an external database store, leading to more than 100x performance speedup, among many other benefits.
Zoltán Zvara - Advanced visualization of Flink and Spark jobs (Flink Forward)
http://flink-forward.org/kb_sessions/advanced-visualization-of-flink-and-spark-jobs/
Understanding the physical plan of a big data application is often crucial for tracking down bottlenecks and faulty behavior. Although Flink and Spark offer useful Web UI components for monitoring and understanding the logical plan of jobs, both lack a tool that helps to understand the physical plan of the scheduler and the ability to monitor execution at a very low level, along with the communication that occurs between parallel vertex instances. We propose a tool that allows users to monitor job executions in real time, and later to replay and examine them, on any cluster currently supported by Flink or Spark. The tool also offers monitoring of the distribution of keys in a data stream, which can help optimize data partitioning across parallel subtasks in the future.
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...) (confluent)
The stream/table duality in Kafka lets us look at our data in two different ways, whichever is more convenient for our use. But what about when the connections between the data points add much more value to our data? For this, we need to look at our data as a graph. Graphs help drive financial fraud investigations, social media analyses, network & IT management use cases, recommendation engines, and knowledge management. These are all cases where patterns of interaction in your data (for example, a pattern of structured financial transactions) matter more than the individual data points (a single transfer). We'll cover how to easily transform Kafka streams or tables into graphs, and query them declaratively using Cypher or GraphQL. In graph shape, we can enrich our social network streams with powerful graph algorithms that tell us about user and event influence through graph centrality, then stream the results back to Kafka. Stream/table duality becomes the stream/table/graph trinity. We will demonstrate the trinity by: getting started with regular Kafka streams, using Confluent Hub's Neo4j sink, exposing query-able graphs with Cypher & GraphQL, analyzing data with Neo4j's graph algorithms, and transforming graphs back into streams. The trinity means not choosing between representations, but using the best one for your use case. We'll demonstrate how it can be used to tackle social network analysis problems and discuss how the approach can be extended to real-time financial fraud detection and more.
Ted Dunning - Faster and Furiouser: Flink Drift (Flink Forward)
http://flink-forward.org/kb_sessions/faster-and-furiouser-flink-drift/
Not long ago, we had the opportunity to test Apache Flink to see just how fast it would go on a moderately realistic task with fast hardware and with a good streaming transport layer underneath. Our goal was not so much careful comparison with other software, but flat-out speed, Flink against Flink. In the process, we learned a lot about what it takes to go fast. Some of the lessons were ones that we had “learned” a number of times before: – the bottleneck isn’t where you thought it was – copying data is expensive – context switches are expensive – measure twice, cut once But there were some real surprises along the way. The really important knobs weren’t quite what people say you should turn. One of the biggest surprises was the degree to which high performance libraries have threading built into them, which makes the actual concurrency much higher than the apparent concurrency. The result was that at least one cluster parameter needed to be adjusted by 30x to get real
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A... (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/15ACXCw.
Tyler Akidau from Google demonstrates Google's MillWheel, a streaming system that promises low latency, strong consistency, and flexibility without relying on the Lambda Architecture. Filmed at qconsf.com.
Tyler Akidau is a Senior Software Engineer at Google. The current Tech Lead for the MillWheel team, he’s spent five years working on massive-scale streaming data processing systems.
AI-Powered Streaming Analytics for Real-Time Customer Experience (Databricks)
Interacting with customers in the moment and in a relevant, meaningful way can be challenging to organizations faced with hundreds of various data sources at the edge, on-premises, and in multiple clouds.
To capitalize on real-time customer data, you need a data management infrastructure that allows you to do three things:
1) Sense: capture event data and stream data from a source, e.g. social media, web logs, machine logs, IoT sensors.
2) Reason: automatically combine and process this data with existing data for context.
3) Act: respond appropriately in a reliable, timely, consistent way. In this session we’ll describe and demo an AI-powered streaming solution that can tackle the entire end-to-end sense-reason-act process at any latency (real-time, streaming, and batch) using Spark Structured Streaming.
The solution uses AI (e.g. A* and NLP for data structure inference and machine learning algorithms for ETL transform recommendations) and metadata to automate data management processes (e.g. parse, ingest, integrate, and cleanse dynamic and complex structured and unstructured data) and guide user behavior for real-time streaming analytics. It’s built on Spark Structured Streaming to take advantage of unified APIs, multi-latency and event time-based processing, out-of-order data delivery, and other capabilities.
You will gain a clear understanding of how to use Spark Structured Streaming for data engineering using an intelligent data streaming solution that unifies fast-lane data streaming and batch lane data processing to deliver in-the-moment next best actions that improve customer experience.
Advanced Stream Processing with Flink and Pulsar - Pulsar Summit NA 2021 Keynote (StreamNative)
In this talk, Till Rohrmann and Addison Higham discuss how Flink allows for ambitious stream processing workflows and how Pulsar and Flink enable new capabilities that push forward the state-of-the-art in streaming. They will also share upcoming features and new capabilities in the integrations between Flink and Pulsar and how these two communities are working together to truly advance the power of stream processing.
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing (DoiT International)
Dataflow is a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation. Cloud Dataflow frees you from operational tasks like resource management and performance optimization.
Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine... (Flink Forward)
Real-time Processing with Flink for Machine Learning at Netflix
Machine learning plays a critical role in providing a great Netflix member experience. It is used to drive many parts of the site including video recommendations, search results ranking, and selection of artwork images. Providing high-fidelity, near real-time data is increasingly important for these machine learning pipelines, especially as multi-armed bandit and reinforcement learning techniques, in addition to more "traditional" supervised learning, become more prevalent. With access to this data, models are able to converge more quickly, features can be updated more frequently, and analysis can be done in a more timely manner.
In this talk, we will focus on the practical details of leveraging Flink to process trillions of events per day, work with the time dimension, and manage large and frequently-changing state. We will discuss different processing schemes and dataflows, scalability and resiliency challenges we tackled, operational considerations, and instrumentation we added for monitoring job health in production.
Independent of the source of data, the integration of event streams into an enterprise architecture is getting more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and not such a challenge anymore. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later. You have to be able to include part of your analytics right after you consume the data streams. Products for doing event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the past few years, another family of products has appeared, mostly out of the big data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming, Flink and Kafka Streams, as well as supporting infrastructures such as Apache Kafka. In this talk I will present the theoretical foundations of stream processing, discuss the core properties a stream processing platform should provide, and highlight the differences you might find between the more traditional CEP and the more modern stream processing solutions.
The Rise of Streaming SQL and Evolution of Streaming Applications (Srinath Perera)
First-generation stream processors, such as Apache Storm, wanted us to write code. It was a great start. However, when building real-world apps, which are used for a long time and evolve, writing code gets us into trouble.
If we want to query a database or query data stored in Hadoop, we use SQL. Why can't we query streaming data using SQL? We can. Almost all open source stream processors, including Storm, Flink, and Kafka, have adopted SQL.
In this webinar, Srinath will talk about the evolution of stream processing, streaming SQL, the status quo, and what this means to stream applications. He will also dissect the experience of building streaming applications by exploring common patterns and pitfalls.
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co... (Codemotion)
Representing the passage of time is not a simple task, especially with "traditional" tools. Yet the temporal dimension is fundamental in a thousand different contexts, from statistical analysis to representing cause-and-effect relationships, from forecasting to automatic control. In this talk we will see how to make the best use of OrientDB, a document-graph database, for storing, processing, and querying this kind of information.
DBA Fundamentals Group: Continuous SQL with Kafka and Flink (Timothy Spann)
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
20-Feb-2024
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
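As a rough sketch of what such a continuous query can look like, here is a hypothetical Java snippet using Flink's Table API with the Kafka SQL connector. The topic names, fields, and broker address are invented for illustration, and the connector options are the commonly documented ones rather than anything taken from this talk.

// Hypothetical sketch: continuous SQL over Kafka topics with Flink's Table API.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ContinuousKafkaSqlSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source table backed by a Kafka topic of raw click events (schema is made up).
        tEnv.executeSql(
            "CREATE TABLE clicks (" +
            "  user_id STRING, url STRING, ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'clicks'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset', 'format' = 'json')");

        // Sink table backed by another Kafka topic.
        tEnv.executeSql(
            "CREATE TABLE clicks_per_minute (" +
            "  user_id STRING, window_end TIMESTAMP(3), cnt BIGINT" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'clicks-per-minute'," +
            "  'properties.bootstrap.servers' = 'localhost:9092', 'format' = 'json')");

        // Continuous query: new results flow into the sink topic as events arrive.
        tEnv.executeSql(
            "INSERT INTO clicks_per_minute " +
            "SELECT user_id, TUMBLE_END(ts, INTERVAL '1' MINUTE), COUNT(*) " +
            "FROM clicks GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)");
    }
}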
Tim Spann
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
Dataflow - A Unified Model for Batch and Streaming Data Processing (DoiT International)
Batch and Streaming Data Processing and Vizualize 300Tb in 5 Seconds meetup on April 18th, 2016 (http://www.meetup.com/Big-things-are-happening-here/events/229532500)
This session takes an in-depth look at:
- Trends in stream processing
- How streaming SQL has become a standard
- The advantages of Streaming SQL
- Ease of development with streaming SQL: Graphical and Streaming SQL query editors
- Business value of streaming SQL and its related tools: Domain-specific UIs
- Scalable deployment of streaming SQL: Distributed processing
Streaming SQL to unify batch and stream processing: Theory and practice with ... (Fabian Hueske)
SQL is the lingua franca for querying and processing data. To this day, it provides non-programmers with a powerful tool for analyzing and manipulating data. But with the emergence of stream processing as a core technology for data infrastructures, can you still use SQL and bring real-time data analysis to a broader audience?
The answer is yes, you can. SQL fits into the streaming world very well and forms an intuitive and powerful abstraction for streaming analytics. More importantly, you can use SQL as an abstraction to unify batch and streaming data processing. Viewing streams as dynamic tables, you can obtain consistent results from SQL evaluated over static tables and streams alike and use SQL to build materialized views as a data integration tool.
Fabian Hueske and Shuyi Chen explore SQL’s role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges and how the unified stream and batch processing platform enables both technical and nontechnical users to process real-time and batch data reliably using the same SQL at Uber scale.
This slide deck explores trends in stream processing, how streaming SQL has become a standard, the advantages of streaming SQL and more.
View video: https://wso2.com/library/conference/2018/07/wso2con-usa-2018-the-rise-of-streaming-sql/
4. MillWheel - Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more...
Streaming Flume - Robert Bradshaw, Daniel Mills, and more...
Cloud Dataflow - Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...
15. Why consistency is important
• Mostly correct is not good enough
• Required for exactly-once processing
• Required for repeatable results
• Cannot replace batch without it
30. 1. Time-Agnostic Processing - Filters
[Figure: example events on a stream-time axis, 10:00-16:00]
Example Input: Web server traffic logs
Example Output: All traffic from specific domains
Pros: Straightforward; Efficient
Cons: Limited utility
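To make the "time-agnostic" idea concrete, here is a tiny, hypothetical plain-Java sketch of such a filter (the record type and domain list are invented): every record is kept or dropped on its own, so neither event time nor arrival time matters.

// Time-agnostic filtering: keep only log records from specific domains.
import java.util.Set;
import java.util.stream.Stream;

record LogRecord(String domain, String path, long eventTimeMillis) {}

public class DomainFilter {
    static final Set<String> WANTED = Set.of("example.com", "example.org");

    // No buffering, windows, or timestamps needed; each record is independent.
    static Stream<LogRecord> filter(Stream<LogRecord> logs) {
        return logs.filter(r -> WANTED.contains(r.domain()));
    }
}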
31. 1. Time-Agnostic Processing - Hash Join
[Figure: example events on a stream-time axis, 10:00-16:00]
Example Input: Query & Click traffic
Example Output: Joined stream of Query + Click pairs
Pros: Straightforward; Efficient
Cons: Limited utility
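The hash join on this slide can be sketched in the same time-agnostic style. The following hypothetical plain-Java illustration (not the system's actual implementation) matches interleaved query and click events by a shared ID; unmatched events are buffered forever, which is exactly why the pattern has limited utility without windowing.

// Time-agnostic streaming hash join of query and click events on queryId.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

record Query(String queryId, String text) {}
record Click(String queryId, String url) {}
record QueryClick(Query query, Click click) {}

public class StreamingHashJoin {
    private final Map<String, Query> pendingQueries = new HashMap<>();
    private final Map<String, List<Click>> pendingClicks = new HashMap<>();

    // A query arrives: emit joins with any clicks buffered for its id.
    public List<QueryClick> onQuery(Query q) {
        pendingQueries.put(q.queryId(), q);
        List<QueryClick> out = new ArrayList<>();
        for (Click c : pendingClicks.getOrDefault(q.queryId(), List.of())) {
            out.add(new QueryClick(q, c));
        }
        pendingClicks.remove(q.queryId());
        return out;
    }

    // A click arrives: join immediately if the query is known, otherwise buffer it.
    public List<QueryClick> onClick(Click c) {
        Query q = pendingQueries.get(c.queryId());
        if (q != null) {
            return List.of(new QueryClick(q, c));
        }
        pendingClicks.computeIfAbsent(c.queryId(), k -> new ArrayList<>()).add(c);
        return List.of();
    }
}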
32. 2. Approximation via Online Algorithms
[Figure: example events on a stream-time axis, 10:00-16:00]
Example Input: Twitter hashtags
Example Output: Approximate top N hashtags per prefix
Pros: Efficient
Cons: Inexact; Complicated algorithms
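One classic online algorithm for this kind of approximate top-N problem is Misra-Gries ("frequent items"). The sketch below is a minimal, hypothetical Java illustration, not something taken from the talk; it shows the trade-off the slide describes: bounded memory and a single pass, but counts that are only estimates.

// Misra-Gries heavy-hitters sketch: approximate the most frequent items in a stream
// using at most k-1 counters. Returned counts are under-estimates.
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class MisraGries {
    private final int k;
    private final Map<String, Long> counters = new HashMap<>();

    public MisraGries(int k) { this.k = k; }

    public void add(String item) {
        if (counters.containsKey(item)) {
            counters.merge(item, 1L, Long::sum);
        } else if (counters.size() < k - 1) {
            counters.put(item, 1L);
        } else {
            // No room: decrement every counter and drop those that reach zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() == 1L) it.remove(); else e.setValue(e.getValue() - 1);
            }
        }
    }

    public Map<String, Long> estimates() { return Map.copyOf(counters); }
}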
33. 3. Windowing by Stream Time
[Figure: example events on a stream-time axis, 10:00-16:00]
Example Input: Web server request traffic
Example Output: Per-minute rate of received requests
Pros: Straightforward; Results reflect contents of the stream
Cons: Results don't reflect events as they happened; If approximating event time, usefulness varies
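A per-minute request rate keyed by arrival time can be sketched as follows (hypothetical plain Java); the buckets reflect when records reached the pipeline, not when the requests actually happened, which is both the pro and the con listed above.

// Windowing by stream (processing) time: count requests per minute of arrival.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ProcessingTimeWindows {
    // Minute of arrival (wall clock) -> request count.
    private final Map<Long, Long> countsPerMinute = new ConcurrentHashMap<>();

    public void onRequest(String requestLine) {
        long arrivalMinute = System.currentTimeMillis() / 60_000; // processing time, not event time
        countsPerMinute.merge(arrivalMinute, 1L, Long::sum);
    }

    public Map<Long, Long> snapshot() { return Map.copyOf(countsPerMinute); }
}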
34. 4. Windowing by Event Time - Fixed Windows
[Figure: example events plotted by event time vs. stream time, 10:00-16:00]
Example Input: Twitter hashtags
Example Output: Top N hashtags by prefix per hour
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
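Bucketing by event time instead looks like this hypothetical sketch: records are assigned to hourly windows based on the timestamp they carry, which requires buffering per window and leaves open the completeness question of when a window can be considered done.

// Windowing by event time: assign each record to the hour in which it occurred.
// Deciding when an hour is "complete" (the watermark problem) is deliberately not shown.
import java.util.HashMap;
import java.util.Map;

record Hashtag(String tag, long eventTimeMillis) {}

public class EventTimeFixedWindows {
    // Hour of event time -> (tag -> count). A late record simply updates an old window.
    private final Map<Long, Map<String, Long>> windows = new HashMap<>();

    public void onHashtag(Hashtag h) {
        long hourWindow = h.eventTimeMillis() / 3_600_000;
        windows.computeIfAbsent(hourWindow, k -> new HashMap<>())
               .merge(h.tag(), 1L, Long::sum);
    }

    public Map<String, Long> windowContents(long hourWindow) {
        return windows.getOrDefault(hourWindow, Map.of());
    }
}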
35. 4. Windowing by Event Time - Sessions
[Figure: example events plotted by event time vs. stream time, 10:00-16:00]
Example Input: User activity stream
Example Output: Per-session group of activities
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
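Session windows can be sketched as grouping a user's activities whenever the gap between consecutive event times stays below a threshold. This is again a hypothetical plain-Java illustration of the idea, operating on an already-sorted list; a real streaming implementation must also merge sessions as out-of-order events arrive.

// Session windowing by event time: split a user's activities into sessions separated
// by inactivity gaps longer than GAP_MILLIS.
import java.util.ArrayList;
import java.util.List;

record Activity(String userId, long eventTimeMillis) {}

public class SessionWindows {
    static final long GAP_MILLIS = 60_000; // one-minute inactivity gap

    static List<List<Activity>> sessionize(List<Activity> sortedByEventTime) {
        List<List<Activity>> sessions = new ArrayList<>();
        List<Activity> current = new ArrayList<>();
        long lastTs = Long.MIN_VALUE;
        for (Activity a : sortedByEventTime) {
            if (!current.isEmpty() && a.eventTimeMillis() - lastTs > GAP_MILLIS) {
                sessions.add(current);          // gap exceeded: close the current session
                current = new ArrayList<>();
            }
            current.add(a);
            lastTs = a.eventTimeMillis();
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }
}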
53. Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE))
        .trigger(new SequenceOf(
            // Early firings: repeat a one-minute periodic trigger until the watermark passes.
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            // On-time firing when the watermark reaches the end of the window.
            new AtWatermark(),
            // Late firings: emit per late element, until 14 days of event time have passed.
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
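For comparison, roughly the same trigger strategy can be expressed with today's Apache Beam API, the open-source descendant of the Dataflow SDK shown above. This is a hedged sketch using Beam's documented trigger builders, not code from the talk; the early firings are approximated with a processing-time delay after the first element in each pane.

// Approximate Apache Beam equivalent: early firings about once a minute before the
// watermark, an on-time firing at the watermark, then one firing per late element,
// with 14 days of allowed lateness in event time.
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class TriggerSketch {
    static Window<KV<String, Long>> twoMinuteSums() {
        return Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(14))
            .accumulatingFiredPanes();
    }
}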
54. Lambda vs Streaming
Low-latency, approximate results
Complete, correct results as soon as possible
Ability to deal with changes upstream
56. Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new Sessions(1, MINUTE))
        .trigger(new SequenceOf(
            // Same trigger as before, now applied to session windows with a one-minute gap.
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            new AtWatermark(),
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
58. Summary
Lambda is great
Streaming by itself is better :-)
Strong Consistency = Correctness
Streaming = Aggregation + Windowing + Triggers
Tools For Reasoning About Time = Power + Flexibility