3. Original creators of Apache Flink®
Providers of the dA Platform, a supported Flink distribution
4. Outline
What is data streaming
Myth 1: The Lambda architecture
Myth 2: The throughput/latency tradeoff
Myth 3: Exactly once not possible
Myth 4: Streaming is for (near) real-time
Myth 5: Batching and buffering
Myth 6: Streaming is hard
6. Reconsideration of data architecture
Better app isolation
More real-time reaction to events
Robust continuous applications
Process both real-time and historical data
8. What is (distributed) streaming
Computations on never-ending “streams” of data records (“events”)
A stream processor distributes the computation in a cluster
9. What is stateful streaming
Computation and state
• E.g., counters, windows of past events, state machines, trained ML models
The result depends on the history of the stream
A stateful stream processor gives you the tools to manage state
• Recover, roll back, version, upgrade, etc.
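The dependence on stream history can be sketched in a few lines of plain Python (illustrative names, not a Flink API): the operator keeps a counter per key, so the output for an event depends on all events seen before it.

```python
# Minimal sketch of stateful stream processing: the result for each
# event depends on state accumulated from the history of the stream.
# All names here are illustrative, not Flink APIs.

class CountPerKey:
    """Keeps one running counter per key -- the 'state' of the computation."""

    def __init__(self):
        self.state = {}  # key -> running count

    def process(self, event):
        key = event["user"]
        self.state[key] = self.state.get(key, 0) + 1
        return key, self.state[key]

op = CountPerKey()
results = [op.process(e) for e in [
    {"user": "a"}, {"user": "b"}, {"user": "a"},
]]
# The third result is ("a", 2): it depends on the first event,
# i.e., on the history of the stream.
```

This is exactly the state a stateful stream processor must be able to recover, roll back, and version: lose `op.state` and all subsequent results are wrong.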
10. What is event-time streaming
Data records are associated with timestamps (time series data)
Processing depends on the timestamps
An event-time stream processor gives you the tools to reason about time
• E.g., handle streams that are out of order
• The core feature is watermarks – a clock to measure event time
[Diagram: out-of-order events t1…t4 being assigned to their event-time intervals t1–t2 and t3–t4]
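The watermark idea can be sketched in plain Python (illustrative, not the Flink API): events carry timestamps and may arrive out of order, and a tumbling window is emitted only once the watermark — here simply the highest timestamp seen minus an assumed out-of-orderness bound — has passed the window's end.

```python
# Sketch of event-time windowing driven by a watermark "clock".
# WINDOW and LATENESS are assumptions for the example.

WINDOW = 10      # tumbling window size in event-time units
LATENESS = 5     # assumed bound on how out of order events can be

windows = {}     # window start -> event count (the operator's state)
max_ts = 0
emitted = []

def on_event(ts):
    global max_ts
    start = (ts // WINDOW) * WINDOW
    windows[start] = windows.get(start, 0) + 1
    max_ts = max(max_ts, ts)
    watermark = max_ts - LATENESS
    # Emit every window whose end is at or before the watermark.
    for s in sorted(list(windows)):
        if s + WINDOW <= watermark:
            emitted.append((s, windows.pop(s)))

for ts in [3, 12, 1, 7, 25]:   # 1 and 7 arrive after 12: out of order
    on_event(ts)
# emitted == [(0, 3), (10, 1)] -- the late events 1 and 7 still land
# in window [0, 10), because the watermark had not yet passed it.
```

The watermark is what lets the operator decide that a window is complete despite out-of-order arrival; without it, there is no principled moment to emit a result.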
11. What is streaming
Continuous processing on data that is continuously generated
I.e., pretty much all “big” data
It’s all about state and time
14. Myth variations
Stream processing is approximate
Stream processing is for transient data
Stream processing cannot handle high data volume
Hence, stream processing needs to be coupled with batch processing
16. Lambda no longer needed
Lambda was useful in the early days of stream processing (the beginning of Apache Storm)
Not any more
• Stream processors can handle very large volumes
• Stream processors can compute accurate results
The good news is that I don’t hear about Lambda so often anymore
18. Myth flavors
Low-latency systems cannot support high throughput
In general, you need to trade off one for the other
There is a “high-throughput” category and a “low-latency” category (naming varies)
19. Physical limits
Most stream processing pipelines are network-bottlenecked
The network dictates both (1) the latency and (2) the throughput
A well-engineered system achieves the physical limits allowed by the network
20. Buffering
It is natural to handle many records together
• All software and hardware systems do that
• E.g., the network bundles bytes into frames
Every streaming system buffers records for performance (Flink certainly does)
• You don’t want to send single records over the network
• “Record-at-a-time” does not exist at the physical level
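A minimal sketch of the idea, in plain Python (names invented for illustration): records are flushed as a batch when the buffer fills or, in a real system, when a timeout fires, so the same records cross the network in far fewer sends.

```python
# Sketch of record buffering before a network send. Buffering is a
# performance optimization: the records and their order are unchanged.

class BufferedSender:
    def __init__(self, send, capacity=4):
        self.send = send          # callable taking a list (batch) of records
        self.capacity = capacity
        self.buffer = []

    def emit(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []

batches = []
sender = BufferedSender(batches.append, capacity=3)
for r in range(7):
    sender.emit(r)
sender.flush()   # in practice triggered by a timer, bounding latency
# batches == [[0, 1, 2], [3, 4, 5], [6]]: same records, fewer sends.
```

The timeout-based flush is what reconciles throughput with latency: a partially full buffer never waits longer than the timeout.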
21. Buffering (2)
Buffering is a performance optimization
• It should be opaque to the user
• It should not dictate system behavior in any other way
• It should not impose artificial boundaries
• It should not limit what you can do with the system
• Etc.
25. What is “exactly once”
Under failures, the system computes the result as if there had been no failure
In contrast to:
• At most once: no guarantees
• At least once: duplicates are possible
Exactly-once state versus exactly-once delivery
26. Myth variations
Exactly once is not possible in nature
Exactly once is not possible end-to-end
Exactly once is not needed
You need to trade off performance for exactly once
(Usually perpetuated by folks until they implement exactly once)
27. Transactions
“Exactly once” means transactions: either all actions succeed or none succeed
Transactions are possible
Transactions are useful
Let’s not start the eventual-consistency debate all over again…
28. Flink checkpoints
Periodic, asynchronous, consistent snapshots of application state
Provide exactly-once state guarantees under failures
[Diagram: checkpoint barriers n−1 and n embedded in the data stream, dividing the records into those belonging to checkpoint n−1 (older), checkpoint n, and checkpoint n+1 (newer)]
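A minimal simulation of the recovery mechanism, in plain Python rather than Flink (all names invented): operator state is snapshotted periodically together with the source position, and after a failure the state is rolled back to the snapshot and the source replayed from the matching offset.

```python
# Sketch of exactly-once *state* via checkpointing: roll back state to
# the last snapshot, replay the (replayable) source from its offset.

import copy

source = [1, 2, 3, 4, 5]          # a replayable source (e.g., a log)
state = {"sum": 0}
checkpoint = {"offset": 0, "state": copy.deepcopy(state)}

def run(from_offset, fail_at=None):
    for offset in range(from_offset, len(source)):
        if offset == fail_at:
            raise RuntimeError("simulated failure")
        state["sum"] += source[offset]
        if offset == 2:           # a "periodic" checkpoint, taken once here
            checkpoint.update(offset=offset + 1,
                              state=copy.deepcopy(state))

try:
    run(0, fail_at=4)             # crash after processing records 1..4
except RuntimeError:
    state.clear()
    state.update(copy.deepcopy(checkpoint["state"]))   # roll back state
    run(checkpoint["offset"])                          # replay the source

# state["sum"] ends at 15 -- as if no failure had happened, even though
# record 4 was processed twice in wall-clock time.
```

The key property: the snapshot and the source offset are consistent with each other, so rollback plus replay yields the failure-free result.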
29. End-to-end exactly once
Checkpoints double as a transaction coordination mechanism
Source and sink operators can take part in checkpoints
Exactly once internally, “effectively once” end to end: e.g., Flink + Cassandra with idempotent updates
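The idempotent-update trick can be sketched in a few lines of plain Python (the dict stands in for a key-value store like Cassandra; everything else is invented for illustration): the sink keys each write by a deterministic record id, so a replay after failure overwrites rather than duplicates.

```python
# Sketch of "effectively once" end to end via idempotent sink updates.

sink = {}   # stands in for a key-value store such as Cassandra

def write(record_id, value):
    sink[record_id] = value      # an upsert: applying it twice is harmless

for attempt in range(2):         # the second pass simulates a replay
    for record_id, value in [(1, "a"), (2, "b"), (3, "c")]:
        write(record_id, value)

# Despite every record being delivered twice, the sink holds each
# result exactly once.
```

This is why "effectively once" is the honest phrase: the delivery is at-least-once, but the observable effect is as if each record were delivered exactly once.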
30. State management
Checkpoints triple as a state versioning mechanism (savepoints)
Go back and forth in time while maintaining state consistency
Eases code upgrades (Flink or application), maintenance, migration, debugging, what-if simulations, and A/B tests
32. Myth variations
I don’t have low-latency applications, hence I don’t need stream processing
Stream processing is only relevant for data before it is stored
We need a batch processor for heavy offline computations
33. Low latency and high latency streams
[Diagram: the same hourly partitions (2016-03-01 12:00 am through 2016-03-12 3:00 am) viewed three ways: as a low-latency stream, as a high-latency stream, and as a batch (bounded stream)]
35. Accurate computation
Batch processing is not an accurate computation model for continuous data
• It misses the right concepts and primitives
• Time handling, state across batch boundaries
Stateful stream processing is a better model
• Real-time/low-latency is the icing on the cake
37. Myth variations
There is a “mini-batch” category between batch and streaming
“Record-at-a-time” versus “mini-batching” or similar “choices”
Mini-batch systems can get better throughput
38. Myth variations (2)
The difference between mini-batching and streaming is latency
I don’t need low latency, hence I need mini-batching
I have a mini-batching use case
39. We have answered this already
You can get both throughput and latency (myth #2)
• Every system buffers data, from the network to the OS to Flink
Streaming is a model, not just about being fast (myth #4)
• Time and state
• Low latency is the icing on the cake
40. Continuous operation
Data is continuously produced
Computation should keep up with data production
• With dynamic scaling and pause-and-resume
Restarting our pipelines every second is not a great idea, and not just for latency reasons
42. Myth variations
Streaming is hard to learn
Streaming is hard to reason about
Windows? Event time? Triggers? Oh, my!!
Streaming needs to be coupled with batch
I know batch already
43. It's about your data and code
What's the form of your data?
• Unbounded (e.g., clicks, sensors, logs), or
• Bounded (e.g., ???*)
What changes more often?
• My code changes faster than my data
• My data changes faster than my code
* Please help me find a great example of naturally static data
44. It's about your data and code
If your data changes faster than your code, you have a streaming problem
• You may be solving it with hourly batch jobs, depending on someone else to create the hourly batches
• You are probably living with inaccurate results without knowing it
45. It's about your data and code
If your code changes faster than your data, you have an exploration problem
• Using notebooks or other tools for quick data exploration is a good idea
• Once your code stabilizes, you will have a streaming problem, so you might as well think of it as such from the beginning
47. Flink community
> 240 contributors, 95 contributors in Flink 1.1
42 meetups around the world with > 15,000 members
2x-3x growth in 2015, similar in 2016
48. Powered by Flink
Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring.
King, the creators of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.
Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.
Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time.
See more at flink.apache.org/poweredby.html
49. 30 Flink applications in production for more than one year; 10 billion events (2 TB) processed daily
Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees
The largest job has > 20 operators, runs on > 5,000 vCores in a 1,000-node cluster, and processes millions of events per second
53. Flink's unique combination of features
• Performance: low latency, high throughput, well-behaved flow control (back pressure)
• Event time: out-of-order events, works on real-time and historic data
• Stateful streaming: exactly-once semantics for fault tolerance, windows & user-defined state, fast and large out-of-core state, savepoints (replays, A/B testing, upgrades, versioning), consistency
• APIs & libraries: flexible windows (time, count, session, roll-your-own), complex event processing, fluent API
55. Flink 1.1 + ongoing development
• In Flink 1.1: connectors, session windows, (Stream) SQL, library enhancements, metric system
• Ongoing development: metrics & visualization, dynamic scaling, savepoint compatibility, checkpoints to savepoints, more connectors, Stream SQL windows, large state maintenance, fine-grained recovery, side in-/outputs, window DSL, security & authentication, Mesos & others, dynamic resource management, queryable state
56. Flink 1.1 + ongoing development
[Diagram: the same roadmap items as the previous slide, grouped into Operations, Ecosystem, Application Features, and Broader Audience]
58. Streaming use cases
Application → Technology
• (Near) real-time apps → low-latency streaming
• Continuous apps → high-latency streaming
• Analytics on historical data → batch as a special case of streaming
• Request/response apps → large queryable state
59. Request/response applications
Queryable state: query Flink state directly instead of pushing results into a database
Large state support and a query API are coming to Flink
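The idea can be sketched in plain Python (the functions here are invented for illustration, not Flink's queryable-state API): the streaming side keeps per-key state up to date, and the request/response side reads that live state directly, with no external database in between.

```python
# Sketch of queryable state: query the job's live state directly
# instead of pushing results to an external store.

state = {}                        # per-key state held by the running job

def process(event):
    """The streaming side: keeps the state updated per key."""
    state[event["key"]] = state.get(event["key"], 0) + event["value"]

def query(key):
    """The request/response side: reads the live state."""
    return state.get(key, 0)

for e in [{"key": "x", "value": 2}, {"key": "x", "value": 3}]:
    process(e)

# query("x") returns 5 without any database between job and query.
```

In a real deployment the query side would go through a network API to the task holding the key's state; the point is that the state itself serves the request.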
60. In summary
The need for streaming comes from a rethinking of data infrastructure architecture
• Stream processing then just becomes natural
Debunking 6 myths
• Myth 1: The Lambda architecture
• Myth 2: The throughput/latency tradeoff
• Myth 3: Exactly once not possible
• Myth 4: Streaming is for (near) real-time
• Myth 5: Batching and buffering
• Myth 6: Streaming is hard