Kostas Tzoumas
@kostas_tzoumas
Apache Flink™
Counting elements in streams
Introduction
2
3
Data streaming is becoming
increasingly popular*
*Biggest understatement of 2016
4
Streaming technology is enabling the
obvious: continuous processing on data that
is continuously produced
5
Streaming is the next programming
paradigm for data applications, and you
need to start thinking in terms of streams
Counting
6
Continuous counting
• A seemingly simple application, but generally an unsolved problem
• E.g., count visitors, impressions, interactions, clicks, etc.
• Aggregations and OLAP cube operations are generalizations of counting
7
Counting in batch architecture
• Continuous ingestion
• Periodic (e.g., hourly) files
• Periodic batch jobs
8
Problems with batch architecture
• High latency
• Too many moving parts
• Implicit treatment of time
• Out-of-order event handling
• Implicit batch boundaries
9
Counting in λ architecture
• "Batch layer": what we had before
• "Stream layer": approximate early results
10
Problems with batch and λ
• Way too many moving parts (and code duplication)
• Implicit treatment of time
• Out-of-order event handling
• Implicit batch boundaries
11
Counting in streaming architecture
• Message queue ensures stream durability and replay
• Stream processor ensures consistent counting
12
Counting in Flink DataStream API
Number of visitors in last hour by country
13
DataStream<LogEvent> stream = env
    .addSource(new FlinkKafkaConsumer(...)) // create stream from Kafka
    .keyBy("country")                       // group by country
    .timeWindow(Time.minutes(60))           // window of size 1 hour
    .apply(new CountPerWindowFunction());   // do operations per window
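The slide does not show CountPerWindowFunction itself. As a hedged, Flink-free sketch (the class and method names here are hypothetical, not from the talk), the per-window work reduces to counting the window's elements per key:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java sketch: given the events that fell into one
// 1-hour window, count them per country -- conceptually what a
// CountPerWindowFunction would compute over a window's elements.
public class WindowCount {
    public static Map<String, Long> countByCountry(List<String> countriesInWindow) {
        Map<String, Long> counts = new HashMap<>();
        for (String country : countriesInWindow) {
            counts.merge(country, 1L, Long::sum); // increment the per-key counter
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByCountry(List.of("DE", "US", "DE")));
    }
}
```

In the real job, Flink invokes the window function once per (key, window) pair; the sketch above only illustrates the counting step.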
Counting hierarchy of needs
14
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable (Flink 1.1+)
Based on Maslow's
hierarchy of needs
Rest of this talk
21
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
Latency
22
Yahoo! Streaming Benchmark
23
• Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!
• Focus on measuring end-to-end latency at low throughputs
• First benchmark that was modeled after a real application
• Read more: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Benchmark task: counting!
24
• Count ad impressions grouped by campaign
• Compute aggregates over last 10 seconds
• Make aggregates available for queries (Redis)
Results (lower is better)
25
Flink and Storm at sub-second latencies
Spark has a latency-throughput tradeoff
170k events/sec
Efficiency, and scalability
26
Handling high-volume streams
• Scalability: how many events/sec can a system scale to, with infinite resources?
• Scalability comes at a cost (systems add overhead to be scalable)
• Efficiency: how many events/sec can a system scale to, with limited resources?
27
Extending the Yahoo! benchmark
28
• Yahoo! benchmark is a great starting point to understand engine behavior.
• However, the benchmark stops at low write throughput and its programs are not fault tolerant.
• We extended the benchmark (1) to high volumes, and (2) to use Flink's built-in state.
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Results (higher is better)
29
Also: Flink jobs are correct under failures (exactly once), Storm jobs are not
500k events/sec
3 million events/sec
15 million events/sec
Fault tolerance and repeatability
30
Stateful streaming applications
• Most interesting stream applications are stateful
• How to ensure that state is correct after failures?
31
Fault tolerance definitions
• At least once
  • May over-count
  • In some cases may even under-count (under a different definition)
• Exactly once
  • Counts are the same after failure
• End-to-end exactly once
  • Counts appear the same in an external sink (database, file system) after failure
32
Fault tolerance in Flink
• Flink guarantees exactly once
• End-to-end exactly once supported with specific sources and sinks
  • E.g., Kafka → Flink → HDFS
• Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation
33
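Flink's actual mechanism is asynchronous, distributed snapshotting; purely as a toy illustration (not Flink's implementation, all names hypothetical), exactly-once counting falls out of snapshotting the state together with the source position and replaying from there after a failure:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of checkpoint-and-replay (not Flink's actual algorithm):
// a snapshot captures the counter state *and* the stream position
// consistently, so recovery re-reads events from that position and
// no event is counted twice or dropped.
public class SnapshotCounter {
    long offset = 0;                          // position in the input stream
    Map<String, Long> counts = new HashMap<>();

    public void process(String[] stream, long upTo) {
        while (offset < upTo && offset < stream.length) {
            counts.merge(stream[(int) offset], 1L, Long::sum);
            offset++;
        }
    }

    public SnapshotCounter snapshot() {       // a consistent copy of (state, position)
        SnapshotCounter copy = new SnapshotCounter();
        copy.offset = offset;
        copy.counts = new HashMap<>(counts);
        return copy;
    }
}
```

If the live instance crashes after a snapshot, restoring the copy and reprocessing the rest of the stream yields the same totals as a failure-free run; snapshotting state without the position (or vice versa) would break that guarantee.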
Savepoints
• Maintaining stateful applications in production is challenging
• Flink savepoints: externally triggered, durable checkpoints
• Easy code upgrades (Flink or app), maintenance, migration and debugging, what-if simulations, A/B tests
34
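For reference, the savepoint workflow on the command line looks roughly like this in Flink 1.0 (the job ID, savepoint path, and jar name are placeholders):

```shell
# trigger a savepoint for a running job; the CLI prints the savepoint path
bin/flink savepoint <jobID>

# resume the (possibly upgraded) application from that savepoint
bin/flink run -s <savepointPath> my-app.jar
```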
Explicit handling of time
35
Notions of time
36
Release years: 1977, 1980, 1983, 1999, 2002, 2005, 2015
Episodes in release order: IV "A New Hope" (1977), V "The Empire Strikes Back" (1980), VI "Return of the Jedi" (1983), I "The Phantom Menace" (1999), II "Attack of the Clones" (2002), III "Revenge of the Sith" (2005), VII "The Force Awakens" (2015)
The episode numbering is called event time
The release order is called processing time
Out of order streams
37
[Figure: two bursts of events processed three ways: instant event-at-a-time, arrival-time windows, and event-time windows]
Why event time
• Most stream processors are limited to processing/arrival time; Flink can operate on event time as well
• Benefits of event time
  • Accurate results for out-of-order data
  • Sessions and unaligned windows
  • Time travel (backstreaming)
38
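Independent of any Flink API, the core idea can be sketched in plain Java (class and method names here are hypothetical): assigning an event to a window by the timestamp it carries makes the result independent of arrival order:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (not the Flink API): tumbling 1-minute windows keyed by
// the timestamp carried inside each event. Because the window is derived
// from event time, shuffling the arrival order cannot change the counts.
public class EventTimeWindows {
    static final long WINDOW_MS = 60_000;

    // eventTimes = per-event timestamps in millis; returns count per window start
    public static Map<Long, Long> countPerWindow(long[] eventTimes) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long t : eventTimes) {
            long windowStart = (t / WINDOW_MS) * WINDOW_MS; // window chosen by the event's own time
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }
}
```

Feeding the same events in any arrival order yields identical per-window counts, which is exactly the "accurate results for out-of-order data" property.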
What's coming up in Flink
39
Evolution of streaming in Flink
• Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing
• Flink 0.10 (Nov 2015): event time support, windowing mechanism based on the Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, Nifi, Elastic, ...), improved monitoring
• Flink 1.0 (Mar 2016): DataStream API stability, out-of-core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support
40
Upcoming features
• SQL: ongoing work in collaboration with Apache Calcite
• Dynamic scaling: adapt resources to stream volume, historical stream processing
• Queryable state: ability to query the state inside the stream processor
• Mesos support
• More sources and sinks (e.g., Kinesis, Cassandra)
41
Queryable state
42
Using the stream processor as a database
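The queryable-state API was still upcoming at the time of this talk; as a framework-free illustration of the idea (all names hypothetical), the counting operator's own state map doubles as the serving store:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "stream processor as a database": the counting
// operator keeps state in-process and serves reads from that same state,
// instead of pushing every update to an external store such as Redis.
public class QueryableCounts {
    private final Map<String, Long> state = new ConcurrentHashMap<>();

    public void onEvent(String key) {   // streaming side: update the count
        state.merge(key, 1L, Long::sum);
    }

    public long query(String key) {     // query side: read the live state
        return state.getOrDefault(key, 0L);
    }
}
```

This removes the external key-value store from the Yahoo!-benchmark-style architecture; the trade-off is that queries now hit the processor itself.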
Closing
43
Summary
• Stream processing gaining momentum – the right paradigm for continuous data applications
• Even seemingly simple applications can be complex at scale and in production – choice of framework crucial
44
• Flink: unique combination of capabilities, performance, and robustness
Flink in the wild
45
30 billion events daily
2 billion events in 10 1Gb machines
Picked Flink for "Saiki" data integration & distribution platform
Join the community!
46
• Follow: @ApacheFlink, @dataArtisans
• Read: flink.apache.org/blog, data-artisans.com/blog
• Subscribe: (news | user | dev) @ flink.apache.org
2 meetups next week in Bay Area!
47
April 5, San Francisco and April 6, San Jose
What's new in Flink 1.0 & recent performance benchmarks with Flink
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/

Apache Flink at Strata San Jose 2016


Editor's Notes

  • #10 Three systems (batch) or five systems (streaming); need to add a new system for millisecond alerts. What if I want to count every 5 minutes instead of every hour? Out-of-order events are simply ignored. What if I want to do sessions?
  • #12 Three systems (batch) or five systems (streaming); need to add a new system for millisecond alerts. What if I want to count every 5 minutes instead of every hour? Out-of-order events are simply ignored. What if I want to do sessions?