Apache Flink at Strata San Jose 2016

Apache Flink presentation at Strata/Hadoop World San Jose 2016
  1. Kostas Tzoumas (@kostas_tzoumas). Apache Flink™: Counting elements in streams
  2. Introduction
  3. Data streaming is becoming increasingly popular* (*biggest understatement of 2016)
  4. Streaming technology is enabling the obvious: continuous processing on data that is continuously produced
  5. Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams
  6. Counting
  7. Continuous counting
     • A seemingly simple application, but generally an unsolved problem
     • E.g., count visitors, impressions, interactions, clicks, etc.
     • Aggregations and OLAP cube operations are generalizations of counting
  8. Counting in a batch architecture
     • Continuous ingestion
     • Periodic (e.g., hourly) files
     • Periodic batch jobs
  9. Problems with the batch architecture
     • High latency
     • Too many moving parts
     • Implicit treatment of time
     • Out-of-order event handling
     • Implicit batch boundaries
  10. Counting in a λ architecture
      • "Batch layer": what we had before
      • "Stream layer": approximate early results
  11. Problems with batch and λ
      • Way too many moving parts (and code duplication)
      • Implicit treatment of time
      • Out-of-order event handling
      • Implicit batch boundaries
  12. Counting in a streaming architecture
      • A message queue ensures stream durability and replay
      • The stream processor ensures consistent counting
  13. Counting in the Flink DataStream API: number of visitors in the last hour, by country

      DataStream<LogEvent> stream = env
          .addSource(new FlinkKafkaConsumer<>(...)) // create stream from Kafka
          .keyBy("country")                         // group by country
          .timeWindow(Time.minutes(60))             // window of size 1 hour
          .apply(new CountPerWindowFunction());     // do operations per window
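The Flink snippet above is the real API. As a rough illustration of what it computes, here is a plain-Java sketch (no Flink dependency) that assigns events to hour-aligned tumbling windows and counts them per country. The `LogEvent` fields and the `countPerWindow` helper are assumptions for illustration; the slide does not show them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event type; the slide's LogEvent fields are not shown.
record LogEvent(String country, long timestampMillis) {}

public class TumblingWindowCount {
    static final long WINDOW_MILLIS = 60 * 60 * 1000L; // 1-hour windows

    // Assign each event to its hour-aligned window start,
    // then count events per (window, country).
    static Map<Long, Map<String, Long>> countPerWindow(List<LogEvent> events) {
        Map<Long, Map<String, Long>> counts = new HashMap<>();
        for (LogEvent e : events) {
            long windowStart = e.timestampMillis() - (e.timestampMillis() % WINDOW_MILLIS);
            counts.computeIfAbsent(windowStart, w -> new HashMap<>())
                  .merge(e.country(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<LogEvent> events = List.of(
                new LogEvent("DE", 0L),
                new LogEvent("DE", 10_000L),
                new LogEvent("US", 20_000L),
                new LogEvent("DE", WINDOW_MILLIS + 5_000L)); // falls into the next hour
        System.out.println(countPerWindow(events));
    }
}
```

In the real API, the windowing and state management happen inside Flink; this sketch only shows the grouping logic the pipeline expresses declaratively.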
  14. Counting hierarchy of needs (based on Maslow's hierarchy of needs): continuous counting
      ... with low latency,
      ... efficiently on high-volume streams,
      ... fault tolerant (exactly once),
      ... accurate and repeatable,
      ... queryable (1.1+)
  15.–20. Counting hierarchy of needs (the list above, built up one level per slide)
  21. Rest of this talk: continuous counting with low latency, efficiently on high-volume streams, fault tolerant (exactly once), accurate and repeatable, and queryable
  22. Latency
  23. Yahoo! Streaming Benchmark
      • Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!
      • Focus on measuring end-to-end latency at low throughputs
      • First benchmark that was modeled after a real application
      • Read more: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
  24. Benchmark task: counting!
      • Count ad impressions grouped by campaign
      • Compute aggregates over the last 10 seconds
      • Make aggregates available for queries (Redis)
  25. Results (lower is better): Flink and Storm run at sub-second latencies; Spark has a latency-throughput tradeoff (170k events/sec)
  26. Efficiency and scalability
  27. Handling high-volume streams
      • Scalability: how many events/sec can a system scale to, with infinite resources?
      • Scalability comes at a cost (systems add overhead to be scalable)
      • Efficiency: how many events/sec can a system handle with limited resources?
  28. Extending the Yahoo! benchmark
      • The Yahoo! benchmark is a great starting point for understanding engine behavior.
      • However, the benchmark stops at low write throughput, and its programs are not fault tolerant.
      • We extended the benchmark to (1) high volumes and (2) use Flink's built-in state: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
  29. Results (higher is better): 500k events/sec, 3 million events/sec, 15 million events/sec. Also: Flink jobs are correct under failures (exactly once); Storm jobs are not
  30. Fault tolerance and repeatability
  31. Stateful streaming applications
      • Most interesting stream applications are stateful
      • How to ensure that state is correct after failures?
  32. Fault tolerance definitions
      • At least once: may over-count (in some cases may even under-count, with a different definition)
      • Exactly once: counts are the same after a failure
      • End-to-end exactly once: counts appear the same in an external sink (database, file system) after a failure
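The definitions above can be made concrete with a toy replay model (plain Java; all names are illustrative, not Flink APIs). A counter crashes partway through an offset-indexed stream; a naive restart replays everything and over-counts, while restoring a checkpointed (offset, count) snapshot yields the same count as a failure-free run.

```java
import java.util.List;

public class ReplaySemantics {
    // At-least-once: restart from offset 0 after a crash, so the
    // events processed before the crash are counted a second time.
    static long atLeastOnce(List<String> events, int processedBeforeCrash) {
        long count = processedBeforeCrash; // counted before the crash
        count += events.size();            // naive restart replays the whole stream
        return count;                      // over-counts by processedBeforeCrash
    }

    // Exactly-once: restore a snapshot of (offset, count) taken atomically,
    // then resume from the checkpointed offset.
    static long exactlyOnce(List<String> events, int checkpointOffset) {
        long count = checkpointOffset;             // restore the snapshotted count
        count += events.size() - checkpointOffset; // replay only unprocessed events
        return count;                              // every event counted exactly once
    }

    public static void main(String[] args) {
        List<String> events = List.of("a", "b", "c", "d", "e");
        System.out.println(atLeastOnce(events, 3)); // prints 8: three events double-counted
        System.out.println(exactlyOnce(events, 3)); // prints 5
    }
}
```

The point of the sketch: the state (the count) and the position in the stream (the offset) must be snapshotted together, which is what Flink's consistent snapshots provide.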
  33. Fault tolerance in Flink
      • Flink guarantees exactly once
      • End-to-end exactly once is supported with specific sources and sinks (e.g., Kafka → Flink → HDFS)
      • Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation
  34. Savepoints
      • Maintaining stateful applications in production is challenging
      • Flink savepoints: externally triggered, durable checkpoints
      • Easy code upgrades (Flink or application), maintenance, migration, debugging, what-if simulations, A/B tests
  35. Explicit handling of time
  36. Notions of time: the Star Wars films listed in release order (1977, 1980, 1983, 1999, 2002, 2005, 2015: Episode IV: A New Hope; Episode V: The Empire Strikes Back; Episode VI: Return of the Jedi; Episode I: The Phantom Menace; Episode II: Attack of the Clones; Episode III: Revenge of the Sith; Episode VII: The Force Awakens). The episode order is event time; the release order is processing time
  37. Out-of-order streams (diagram: a first and a second burst of events, processed instantly event-at-a-time as they arrive, assigned to event-time windows vs. arrival-time windows)
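The out-of-order effect on slides 36 and 37 can be shown with a minimal sketch, assuming hypothetical events that carry both the time they happened (event time) and the time they reached the processor (arrival time), bucketed into toy 10-unit tumbling windows.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical event type: when it happened vs. when it was observed.
record Event(long eventTime, long arrivalTime) {}

public class TimeWindows {
    static final long SIZE = 10; // toy tumbling-window size

    // Count events per window, keyed by window start time.
    static Map<Long, Long> windowCounts(List<Event> events, boolean useEventTime) {
        Map<Long, Long> counts = new TreeMap<>();
        for (Event e : events) {
            long t = useEventTime ? e.eventTime() : e.arrivalTime();
            counts.merge(t - (t % SIZE), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The event that happened at time 7 arrives late, at time 15.
        List<Event> events = List.of(
                new Event(1, 1), new Event(7, 15), new Event(12, 12));
        System.out.println(windowCounts(events, true));  // event time: {0=2, 10=1}
        System.out.println(windowCounts(events, false)); // arrival time: {0=1, 10=2}
    }
}
```

With event time, the late event is still credited to the window in which it happened; with arrival time, it is mis-assigned to a later window, which is exactly the accuracy problem slide 38 describes.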
  38. Why event time
      • Most stream processors are limited to processing/arrival time; Flink can operate on event time as well
      • Benefits of event time: accurate results for out-of-order data; sessions and unaligned windows; time travel (backstreaming)
  39. What's coming up in Flink
  40. Evolution of streaming in Flink
      • Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing
      • Flink 0.10 (Nov 2015): event time support, windowing mechanism based on the Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, NiFi, Elasticsearch, ...), improved monitoring
      • Flink 1.0 (Mar 2016): DataStream API stability, out-of-core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support
  41. Upcoming features
      • SQL: ongoing work in collaboration with Apache Calcite
      • Dynamic scaling: adapt resources to stream volume; historical stream processing
      • Queryable state: the ability to query the state inside the stream processor
      • Mesos support
      • More sources and sinks (e.g., Kinesis, Cassandra)
  42. Queryable state: using the stream processor as a database
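A minimal sketch of the idea behind queryable state (not Flink's actual API, which was still an upcoming feature at the time of this talk): the processor keeps its running counts in memory and serves point queries directly, instead of pushing every update to an external store such as Redis.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch only; class and method names are illustrative.
public class QueryableCounter {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    // Called by the stream processor for each incoming event.
    public void onEvent(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    // Called by an external client; reads the live state in place.
    public long query(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        QueryableCounter c = new QueryableCounter();
        c.onEvent("campaign-1");
        c.onEvent("campaign-1");
        System.out.println(c.query("campaign-1")); // prints 2
    }
}
```

The design point is that the freshest counts already live inside the processor, so serving reads from there removes the extra hop and the write amplification of mirroring state into a database.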
  43. Closing
  44. Summary
      • Stream processing is gaining momentum and is the right paradigm for continuous data applications
      • Even seemingly simple applications can be complex at scale and in production; the choice of framework is crucial
      • Flink: a unique combination of capabilities, performance, and robustness
  45. Flink in the wild: 30 billion events daily; 2 billion events on 10 1Gb machines; picked Flink for the "Saiki" data integration & distribution platform (see talks by ... at ...)
  46. Join the community!
      • Follow: @ApacheFlink, @dataArtisans
      • Read: flink.apache.org/blog, data-artisans.com/blog
      • Subscribe: (news | user | dev) @ flink.apache.org
  47. Two meetups next week in the Bay Area! April 5, San Francisco; April 6, San Jose: what's new in Flink 1.0 & recent performance benchmarks with Flink. http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/
