Kostas Tzoumas
@kostas_tzoumas
Apache Flink™
Counting elements in streams
Introduction
2
3
Data streaming is becoming
increasingly popular*
*Biggest understatement of 2016
4
Streaming technology is enabling the
obvious: continuous processing on data that
is continuously produced
5
Streaming is the next programming
paradigm for data applications, and you
need to start thinking in terms of streams
Counting
6
Continuous counting
• A seemingly simple application, but generally an unsolved problem
• E.g., count visitors, impressions, interactions, clicks, etc.
• Aggregations and OLAP cube operations are generalizations of counting
7
Counting in batch architecture
• Continuous ingestion
• Periodic (e.g., hourly) files
• Periodic batch jobs
8
Problems with batch architecture
• High latency
• Too many moving parts
• Implicit treatment of time
• Out-of-order event handling
• Implicit batch boundaries
9
Counting in λ architecture
• "Batch layer": what we had before
• "Stream layer": approximate early results
10
Problems with batch and λ
• Way too many moving parts (and code duplication)
• Implicit treatment of time
• Out-of-order event handling
• Implicit batch boundaries
11
Counting in streaming architecture
• Message queue ensures stream durability and replay
• Stream processor ensures consistent counting
12
Counting in Flink DataStream API
Number of visitors in last hour by country
13
DataStream<LogEvent> stream = env
    .addSource(new FlinkKafkaConsumer(...)) // create stream from Kafka
    .keyBy("country")                       // group by country
    .timeWindow(Time.minutes(60))           // window of size 1 hour
    .apply(new CountPerWindowFunction());   // do operations per window
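The slide does not show CountPerWindowFunction itself. As a hedged, Flink-free sketch (the class and method names here are hypothetical, not from the talk), the per-window work reduces to counting the window's elements per key:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java sketch: given the events that fell into one
// 1-hour window, count them per country -- conceptually what a
// CountPerWindowFunction would compute over a window's elements.
public class WindowCount {
    public static Map<String, Long> countByCountry(List<String> countriesInWindow) {
        Map<String, Long> counts = new HashMap<>();
        for (String country : countriesInWindow) {
            counts.merge(country, 1L, Long::sum); // increment the per-key counter
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByCountry(List.of("DE", "US", "DE")));
    }
}
```

In the real job, Flink invokes the window function once per (key, window) pair; the sketch above only illustrates the counting step.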
Counting hierarchy of needs
14
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable (Flink 1.1+)
Based on Maslow's
hierarchy of needs
Rest of this talk
21
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
Latency
22
Yahoo! Streaming Benchmark
23
• Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!
• Focus on measuring end-to-end latency at low throughputs
• First benchmark that was modeled after a real application
• Read more: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Benchmark task: counting!
24
• Count ad impressions grouped by campaign
• Compute aggregates over last 10 seconds
• Make aggregates available for queries (Redis)
Results (lower is better)
25
Flink and Storm at sub-second latencies
Spark has a latency-throughput tradeoff
170k events/sec
Efficiency, and scalability
26
Handling high-volume streams
• Scalability: how many events/sec can a system scale to, with infinite resources?
• Scalability comes at a cost (systems add overhead to be scalable)
• Efficiency: how many events/sec can a system scale to, with limited resources?
27
Extending the Yahoo! benchmark
28
• Yahoo! benchmark is a great starting point to understand engine behavior.
• However, the benchmark stops at low write throughput and its programs are not fault tolerant.
• We extended the benchmark (1) to high volumes, and (2) to use Flink's built-in state.
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Results (higher is better)
29
Also: Flink jobs are correct under failures (exactly once), Storm jobs are not
500k events/sec
3 million events/sec
15 million events/sec
Fault tolerance and repeatability
30
Stateful streaming applications
• Most interesting stream applications are stateful
• How to ensure that state is correct after failures?
31
Fault tolerance definitions
• At least once
  • May over-count
  • In some cases may even under-count (under a different definition)
• Exactly once
  • Counts are the same after failure
• End-to-end exactly once
  • Counts appear the same in an external sink (database, file system) after failure
32
Fault tolerance in Flink
• Flink guarantees exactly once
• End-to-end exactly once supported with specific sources and sinks
  • E.g., Kafka → Flink → HDFS
• Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation
33
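Flink's actual mechanism is asynchronous, distributed snapshotting; purely as a toy illustration (not Flink's implementation, all names hypothetical), exactly-once counting falls out of snapshotting the state together with the source position and replaying from there after a failure:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of checkpoint-and-replay (not Flink's actual algorithm):
// a snapshot captures the counter state *and* the stream position
// consistently, so recovery re-reads events from that position and
// no event is counted twice or dropped.
public class SnapshotCounter {
    long offset = 0;                          // position in the input stream
    Map<String, Long> counts = new HashMap<>();

    public void process(String[] stream, long upTo) {
        while (offset < upTo && offset < stream.length) {
            counts.merge(stream[(int) offset], 1L, Long::sum);
            offset++;
        }
    }

    public SnapshotCounter snapshot() {       // a consistent copy of (state, position)
        SnapshotCounter copy = new SnapshotCounter();
        copy.offset = offset;
        copy.counts = new HashMap<>(counts);
        return copy;
    }
}
```

If the live instance crashes after a snapshot, restoring the copy and reprocessing the rest of the stream yields the same totals as a failure-free run; snapshotting state without the position (or vice versa) would break that guarantee.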
Savepoints
• Maintaining stateful applications in production is challenging
• Flink savepoints: externally triggered, durable checkpoints
• Easy code upgrades (Flink or app), maintenance, migration and debugging, what-if simulations, A/B tests
34
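For reference, the savepoint workflow on the command line looks roughly like this in Flink 1.0 (the job ID, savepoint path, and jar name are placeholders):

```shell
# trigger a savepoint for a running job; the CLI prints the savepoint path
bin/flink savepoint <jobID>

# resume the (possibly upgraded) application from that savepoint
bin/flink run -s <savepointPath> my-app.jar
```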
Explicit handling of time
35
Notions of time
36
Release years: 1977, 1980, 1983, 1999, 2002, 2005, 2015
Episodes in release order: IV "A New Hope" (1977), V "The Empire Strikes Back" (1980), VI "Return of the Jedi" (1983), I "The Phantom Menace" (1999), II "Attack of the Clones" (2002), III "Revenge of the Sith" (2005), VII "The Force Awakens" (2015)
The episode numbering is called event time
The release order is called processing time
Out of order streams
37
[Figure: two bursts of events processed three ways: instant event-at-a-time, arrival-time windows, and event-time windows]
Why event time
• Most stream processors are limited to processing/arrival time; Flink can operate on event time as well
• Benefits of event time
  • Accurate results for out-of-order data
  • Sessions and unaligned windows
  • Time travel (backstreaming)
38
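Independent of any Flink API, the core idea can be sketched in plain Java (class and method names here are hypothetical): assigning an event to a window by the timestamp it carries makes the result independent of arrival order:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (not the Flink API): tumbling 1-minute windows keyed by
// the timestamp carried inside each event. Because the window is derived
// from event time, shuffling the arrival order cannot change the counts.
public class EventTimeWindows {
    static final long WINDOW_MS = 60_000;

    // eventTimes = per-event timestamps in millis; returns count per window start
    public static Map<Long, Long> countPerWindow(long[] eventTimes) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long t : eventTimes) {
            long windowStart = (t / WINDOW_MS) * WINDOW_MS; // window chosen by the event's own time
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }
}
```

Feeding the same events in any arrival order yields identical per-window counts, which is exactly the "accurate results for out-of-order data" property.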
What's coming up in Flink
39
Evolution of streaming in Flink
• Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing
• Flink 0.10 (Nov 2015): event time support, windowing mechanism based on the Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, Nifi, Elastic, ...), improved monitoring
• Flink 1.0 (Mar 2016): DataStream API stability, out-of-core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support
40
Upcoming features
• SQL: ongoing work in collaboration with Apache Calcite
• Dynamic scaling: adapt resources to stream volume, historical stream processing
• Queryable state: ability to query the state inside the stream processor
• Mesos support
• More sources and sinks (e.g., Kinesis, Cassandra)
41
Queryable state
42
Using the stream processor as a database
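The queryable-state API was still upcoming at the time of this talk; as a framework-free illustration of the idea (all names hypothetical), the counting operator's own state map doubles as the serving store:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "stream processor as a database": the counting
// operator keeps state in-process and serves reads from that same state,
// instead of pushing every update to an external store such as Redis.
public class QueryableCounts {
    private final Map<String, Long> state = new ConcurrentHashMap<>();

    public void onEvent(String key) {   // streaming side: update the count
        state.merge(key, 1L, Long::sum);
    }

    public long query(String key) {     // query side: read the live state
        return state.getOrDefault(key, 0L);
    }
}
```

This removes the external key-value store from the Yahoo!-benchmark-style architecture; the trade-off is that queries now hit the processor itself.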
Closing
43
Summary
• Stream processing gaining momentum – the right paradigm for continuous data applications
• Even seemingly simple applications can be complex at scale and in production – choice of framework crucial
44
• Flink: unique combination of capabilities, performance, and robustness
Flink in the wild
45
30 billion events daily
2 billion events in 10 1Gb machines
Picked Flink for "Saiki" data integration & distribution platform
Join the community!
46
• Follow: @ApacheFlink, @dataArtisans
• Read: flink.apache.org/blog, data-artisans.com/blog
• Subscribe: (news | user | dev) @ flink.apache.org
2 meetups next week in Bay Area!
47
April 5, San Francisco and April 6, San Jose
What's new in Flink 1.0 & recent performance benchmarks with Flink
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/

Apache Flink at Strata San Jose 2016


Editor's Notes

  • #10 Three systems (batch) or five systems (streaming); need to add a new system for millisecond alerts. What if I want to count every 5 minutes instead of every hour? Out-of-order events are simply ignored. What if I want to do sessions?
  • #12 Three systems (batch) or five systems (streaming); need to add a new system for millisecond alerts. What if I want to count every 5 minutes instead of every hour? Out-of-order events are simply ignored. What if I want to do sessions?