SlideShare a Scribd company logo
1 of 47
Jamie Grier
@jamiegrier
jamie@data-artisans.com
Apache FlinkTM
Counting elements in streams
Introduction
2
3
Data streaming is becoming
increasingly popular*
*Biggest understatement of 2016
4
Streaming technology is enabling the
obvious: continuous processing on data that
is continuously produced
5
Streaming is the next programming paradigm
for data applications, and you need to start
thinking in terms of streams
Counting
6
Continuous counting
 A seemingly simple application,
but generally an unsolved
problem
 E.g., count visitors, impressions,
interactions, clicks, etc
 Aggregations and OLAP cube
operations are generalizations of
counting
7
Counting in batch architecture
Continuous
ingestion
Periodic (e.g.,
hourly) files
Periodic batch
jobs
8
Problems with batch architecture
High latency
Too many moving parts
Implicit treatment of time
Out of order event handling
Implicit batch boundaries
9
Counting in λ architecture
"Batch layer": what
we had before
"Stream layer":
approximate early
results
10
Problems with batch and λ
Way too many moving parts (and code dup)
Implicit treatment of time
Out of order event handling
Implicit batch boundaries
11
Counting in streaming architecture
 Message queue ensures stream durability and replay
 Stream processor ensures consistent counting
12
Counting in Flink DataStream API
Number of visitors in last hour by country
13
DataStream<LogEvent> stream = env
.addSource(new FlinkKafkaConsumer(...)); // create stream from Kafka
.keyBy("country"); // group by country
.timeWindow(Time.minutes(60)) // window of size 1 hour
.apply(new CountPerWindowFunction()); // do operations per window
Counting hierarchy of needs
14
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and
repeatable,
... queryable
Based on Maslow's
hierarchy of needs
Counting hierarchy of needs
15
Continuous counting
Counting hierarchy of needs
16
Continuous counting
... with low latency,
Counting hierarchy of needs
17
Continuous counting
... with low latency,
... efficiently on high volume streams,
Counting hierarchy of needs
18
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
Counting hierarchy of needs
19
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and
repeatable,
Counting hierarchy of needs
20
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and
repeatable,
... queryable
1.1+
Rest of this talk
21
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and
repeatable,
... queryable
Latency
22
Yahoo! Streaming Benchmark
23
 Storm, Spark Streaming, and Flink benchmark by the
Storm team at Yahoo!
 Focus on measuring end-to-end latency at low throughputs
 First benchmark that was modeled after a real application
 Read more:
• https://yahooeng.tumblr.com/post/135321837876/benchmarkin
g-streaming-computation-engines-at
Benchmark task: counting!
24
 Count ad impressions
grouped by campaign
 Compute aggregates
over last 10 seconds
 Make aggregates
available for queries
(Redis)
Results (lower is better)
25
Flink and Storm at
sub-second latencies
Spark has a latency-
throughput tradeoff
170k events/sec
Efficiency, and scalability
26
Handling high-volume streams
Scalability: how many events/sec can a system
scale to, with infinite resources?
Scalability comes at a cost (systems add
overhead to be scalable)
Efficiency: how many events/sec can a system
scale to, with limited resources?
27
Extending the Yahoo! benchmark
28
Yahoo! benchmark is a great starting point to understand
engine behavior.
However, benchmark stops at low write throughput and
programs are not fault tolerant.
We extended the benchmark to (1) high volumes, and (2)
to use Flink's built-in state.
• http://data-artisans.com/extending-the-yahoo-streaming-
benchmark/
Results (higher is better)
29
Also: Flink jobs are correct under failures (exactly once), Storm jobs are not
500k events/sec
3mi events/sec
15mi events/sec
Fault tolerance and repeatability
30
Stateful streaming applications
 Most interesting stream applications are stateful
 How to ensure that state is correct after failures?
31
Fault tolerance definitions
 At least once
• May over-count
• In some cases even under-count with different definition
 Exactly once
• Counts are the same after failure
 End-to-end exactly once
• Counts appear the same in an external sink (database, file
system) after failure
32
Fault tolerance in Flink
Flink guarantees exactly once
End-to-end exactly once supported with specific
sources and sinks
• E.g., Kafka Flink HDFS
Internally, Flink periodically takes consistent
snapshots of the state without ever stopping the
computation
33
Savepoints
 Maintaining stateful applications
in production is challenging
 Flink savepoints: externally
triggered, durable checkpoints
 Easy code upgrades (Flink or
app), maintenance, migration,
and debugging, what-if
simulations, A/B tests
34
Explicit handling of time
35
Notions of time
36
1977 1980 1983 1999 2002 2005 2015
Episode IV:
A New Hope
Episode V:
The Empire
Strikes Back
Episode VI:
Return of
the Jedi
Episode I:
The Phantom
Menace
Episode II:
Attach of
the Clones
Episode III:
Revenge of
the Sith
Episode VII:
The Force
Awakens
This is called event time
This is called processing time
Out of order streams
37
Event time windows
Arrival time windows
Instant event-at-a-time
First burst of events
Second burst of events
Why event time
Most stream processors are limited to
processing/arrival time, Flink can operate on event
time as well
Benefits of event time
• Accurate results for out of order data
• Sessions and unaligned windows
• Time travel (backstreaming)
38
What's coming up in Flink
39
Evolution of streaming in Flink
 Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once
guarantees via checkpoiting
 Flink 0.10 (Nov 2015): Event time support, windowing
mechanism based on Dataflow/Beam model, graduated
DataStream API, high availability, state interface, new/updated
connectors (Kafka, Nifi, Elastic, ...), improved monitoring
 Flink 1.0 (Mar 2015): DataStream API stability, out of core
state, savepoints, CEP library, improved monitoring, Kafka 0.9
support
40
Upcoming features
 SQL: ongoing work in collaboration with Apache Calcite
 Dynamic scaling: adapt resources to stream volume,
historical stream processing
 Queryable state: ability to query the state inside the
stream processor
 Mesos support
 More sources and sinks (e.g., Kinesis, Cassandra)
41
Queryable state
42
Using the
stream
processor as a
database
Closing
43
Summary
 Stream processing gaining momentum, the right
paradigm for continuous data applications
 Even seemingly simple applications can be complex at
scale and in production – choice of framework crucial
44
 Flink: unique
combination of
capabilities,
performance, and
robustness
Flink in the wild
45
30 billion events daily 2 billion events in
10 1Gb machines
Picked Flink for "Saiki"
data integration & distribution
platform
See talks by at
Join the community!
46
 Follow: @ApacheFlink, @dataArtisans
 Read: flink.apache.org/blog, data-artisans.com/blog
 Subscribe: (news | user | dev) @ flink.apache.org
2 meetups next week in Bay Area!
47
April 6, San JoseApril 5, San Francisco
What's new in Flink 1.0 & recent performance benchmarks with Flink
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/

More Related Content

What's hot

Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data MiningR A Akerkar
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in PythonMarc Garcia
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reductionMarco Quartulli
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classificationKrish_ver2
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree LearningMilind Gokhale
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning Mohammad Junaid Khan
 
Curse of dimensionality
Curse of dimensionalityCurse of dimensionality
Curse of dimensionalityNikhil Sharma
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
 
Feature selection
Feature selectionFeature selection
Feature selectionDong Guo
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmPınar Yahşi
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data PreprocessingT Kavitha
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining Sulman Ahmed
 
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Edureka!
 

What's hot (20)

Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data Mining
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in Python
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classification
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning
 
Curse of dimensionality
Curse of dimensionalityCurse of dimensionality
Curse of dimensionality
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
K means
K meansK means
K means
 
Feature scaling
Feature scalingFeature scaling
Feature scaling
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
 
LSTM Basics
LSTM BasicsLSTM Basics
LSTM Basics
 

Similar to Counting Elements in Streams

Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Kostas Tzoumas
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Debunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream ProcessingDebunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream ProcessingKostas Tzoumas
 
Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkFabian Hueske
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Ververica
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingKostas Tzoumas
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Apache Flink Taiwan User Group
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkRobert Metzger
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkRobert Metzger
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
 
Kostas Tzoumas - Apache Flink®: State of the Union and What's Next
Kostas Tzoumas - Apache Flink®: State of the Union and What's NextKostas Tzoumas - Apache Flink®: State of the Union and What's Next
Kostas Tzoumas - Apache Flink®: State of the Union and What's NextVerverica
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin MeetupMárton Balassi
 
The Power of Distributed Snapshots in Apache Flink
The Power of Distributed Snapshots in Apache FlinkThe Power of Distributed Snapshots in Apache Flink
The Power of Distributed Snapshots in Apache FlinkC4Media
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkKostas Tzoumas
 
Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basFlorent Ramiere
 
Data analytics at scale implementing stateful stream processing - publish
Data analytics at scale implementing stateful stream processing - publishData analytics at scale implementing stateful stream processing - publish
Data analytics at scale implementing stateful stream processing - publishCodeValue
 
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...Flink Forward
 
Why Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
Why Serverless Flink Matters - Blazing Fast Stream Processing Made ScalableWhy Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
Why Serverless Flink Matters - Blazing Fast Stream Processing Made ScalableHostedbyConfluent
 

Similar to Counting Elements in Streams (20)

Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Debunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream ProcessingDebunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream Processing
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache Flink
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream Processor
 
Kostas Tzoumas - Apache Flink®: State of the Union and What's Next
Kostas Tzoumas - Apache Flink®: State of the Union and What's NextKostas Tzoumas - Apache Flink®: State of the Union and What's Next
Kostas Tzoumas - Apache Flink®: State of the Union and What's Next
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin Meetup
 
The Power of Distributed Snapshots in Apache Flink
The Power of Distributed Snapshots in Apache FlinkThe Power of Distributed Snapshots in Apache Flink
The Power of Distributed Snapshots in Apache Flink
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en bas
 
Data analytics at scale implementing stateful stream processing - publish
Data analytics at scale implementing stateful stream processing - publishData analytics at scale implementing stateful stream processing - publish
Data analytics at scale implementing stateful stream processing - publish
 
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
 
Why Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
Why Serverless Flink Matters - Blazing Fast Stream Processing Made ScalableWhy Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
Why Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
 

More from Jamie Grier

Stream Processing @ Lyft
Stream Processing @ LyftStream Processing @ Lyft
Stream Processing @ LyftJamie Grier
 
Robust Stream Processing With Apache Flink
Robust Stream Processing With Apache FlinkRobust Stream Processing With Apache Flink
Robust Stream Processing With Apache FlinkJamie Grier
 
Robust Stream Processing with Apache Flink
Robust Stream Processing with Apache FlinkRobust Stream Processing with Apache Flink
Robust Stream Processing with Apache FlinkJamie Grier
 
Flink 1.0-slides
Flink 1.0-slidesFlink 1.0-slides
Flink 1.0-slidesJamie Grier
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksJamie Grier
 
Stateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedStateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedJamie Grier
 
Extending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming BenchmarkExtending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming BenchmarkJamie Grier
 

More from Jamie Grier (7)

Stream Processing @ Lyft
Stream Processing @ LyftStream Processing @ Lyft
Stream Processing @ Lyft
 
Robust Stream Processing With Apache Flink
Robust Stream Processing With Apache FlinkRobust Stream Processing With Apache Flink
Robust Stream Processing With Apache Flink
 
Robust Stream Processing with Apache Flink
Robust Stream Processing with Apache FlinkRobust Stream Processing with Apache Flink
Robust Stream Processing with Apache Flink
 
Flink 1.0-slides
Flink 1.0-slidesFlink 1.0-slides
Flink 1.0-slides
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
 
Stateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedStateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory Speed
 
Extending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming BenchmarkExtending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming Benchmark
 

Recently uploaded

Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdfKamal Acharya
 
Online book store management system project.pdf
Online book store management system project.pdfOnline book store management system project.pdf
Online book store management system project.pdfKamal Acharya
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopEmre Günaydın
 
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...Amil baba
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdfKamal Acharya
 
Construction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxConstruction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxwendy cai
 
Low rpm Generator for efficient energy harnessing from a two stage wind turbine
Low rpm Generator for efficient energy harnessing from a two stage wind turbineLow rpm Generator for efficient energy harnessing from a two stage wind turbine
Low rpm Generator for efficient energy harnessing from a two stage wind turbineAftabkhan575376
 
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamDr. Radhey Shyam
 
Lecture_8-Digital implementation of analog controller design.pdf
Lecture_8-Digital implementation of analog controller design.pdfLecture_8-Digital implementation of analog controller design.pdf
Lecture_8-Digital implementation of analog controller design.pdfmohamedsamy9878
 
"United Nations Park" Site Visit Report.
"United Nations Park" Site  Visit Report."United Nations Park" Site  Visit Report.
"United Nations Park" Site Visit Report.MdManikurRahman
 
An improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technologyAn improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technologyBOHRInternationalJou1
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1T.D. Shashikala
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in Krakówbim.edu.pl
 
Laundry management system project report.pdf
Laundry management system project report.pdfLaundry management system project report.pdf
Laundry management system project report.pdfKamal Acharya
 
School management system project report.pdf
School management system project report.pdfSchool management system project report.pdf
School management system project report.pdfKamal Acharya
 
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...Roi Lipman
 
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdfRESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...Lovely Professional University
 
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...MohammadAliNayeem
 
Electrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineElectrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineJulioCesarSalazarHer1
 

Recently uploaded (20)

Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdf
 
Online book store management system project.pdf
Online book store management system project.pdfOnline book store management system project.pdf
Online book store management system project.pdf
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering Workshop
 
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
 
Construction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxConstruction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptx
 
Low rpm Generator for efficient energy harnessing from a two stage wind turbine
Low rpm Generator for efficient energy harnessing from a two stage wind turbineLow rpm Generator for efficient energy harnessing from a two stage wind turbine
Low rpm Generator for efficient energy harnessing from a two stage wind turbine
 
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
 
Lecture_8-Digital implementation of analog controller design.pdf
Lecture_8-Digital implementation of analog controller design.pdfLecture_8-Digital implementation of analog controller design.pdf
Lecture_8-Digital implementation of analog controller design.pdf
 
"United Nations Park" Site Visit Report.
"United Nations Park" Site  Visit Report."United Nations Park" Site  Visit Report.
"United Nations Park" Site Visit Report.
 
An improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technologyAn improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technology
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in Kraków
 
Laundry management system project report.pdf
Laundry management system project report.pdfLaundry management system project report.pdf
Laundry management system project report.pdf
 
School management system project report.pdf
School management system project report.pdfSchool management system project report.pdf
School management system project report.pdf
 
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
 
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdfRESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
 
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
 
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
 
Electrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineElectrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission line
 

Counting Elements in Streams

  • 3. 3 Data streaming is becoming increasingly popular* *Biggest understatement of 2016
  • 4. 4 Streaming technology is enabling the obvious: continuous processing on data that is continuously produced
  • 5. 5 Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams
  • 7. Continuous counting  A seemingly simple application, but generally an unsolved problem  E.g., count visitors, impressions, interactions, clicks, etc  Aggregations and OLAP cube operations are generalizations of counting 7
  • 8. Counting in batch architecture Continuous ingestion Periodic (e.g., hourly) files Periodic batch jobs 8
  • 9. Problems with batch architecture High latency Too many moving parts Implicit treatment of time Out of order event handling Implicit batch boundaries 9
  • 10. Counting in λ architecture "Batch layer": what we had before "Stream layer": approximate early results 10
  • 11. Problems with batch and λ Way too many moving parts (and code dup) Implicit treatment of time Out of order event handling Implicit batch boundaries 11
  • 12. Counting in streaming architecture  Message queue ensures stream durability and replay  Stream processor ensures consistent counting 12
  • 13. Counting in Flink DataStream API Number of visitors in last hour by country 13 DataStream<LogEvent> stream = env .addSource(new FlinkKafkaConsumer(...)); // create stream from Kafka .keyBy("country"); // group by country .timeWindow(Time.minutes(60)) // window of size 1 hour .apply(new CountPerWindowFunction()); // do operations per window
  • 14. Counting hierarchy of needs 14 Continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable, ... queryable Based on Maslow's hierarchy of needs
  • 15. Counting hierarchy of needs 15 Continuous counting
  • 16. Counting hierarchy of needs 16 Continuous counting ... with low latency,
  • 17. Counting hierarchy of needs 17 Continuous counting ... with low latency, ... efficiently on high volume streams,
  • 18. Counting hierarchy of needs 18 Continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once),
  • 19. Counting hierarchy of needs 19 Continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable,
  • 20. Counting hierarchy of needs 20 Continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable, ... queryable 1.1+
  • 21. Rest of this talk 21 Continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable, ... queryable
  • 23. Yahoo! Streaming Benchmark 23  Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!  Focus on measuring end-to-end latency at low throughputs  First benchmark that was modeled after a real application  Read more: • https://yahooeng.tumblr.com/post/135321837876/benchmarkin g-streaming-computation-engines-at
  • 24. Benchmark task: counting! 24  Count ad impressions grouped by campaign  Compute aggregates over last 10 seconds  Make aggregates available for queries (Redis)
  • 25. Results (lower is better) 25 Flink and Storm at sub-second latencies Spark has a latency- throughput tradeoff 170k events/sec
  • 27. Handling high-volume streams Scalability: how many events/sec can a system scale to, with infinite resources? Scalability comes at a cost (systems add overhead to be scalable) Efficiency: how many events/sec can a system scale to, with limited resources? 27
  • 28. Extending the Yahoo! benchmark 28 Yahoo! benchmark is a great starting point to understand engine behavior. However, benchmark stops at low write throughput and programs are not fault tolerant. We extended the benchmark to (1) high volumes, and (2) to use Flink's built-in state. • http://data-artisans.com/extending-the-yahoo-streaming- benchmark/
  • 29. Results (higher is better) 29 Also: Flink jobs are correct under failures (exactly once), Storm jobs are not 500k events/sec 3mi events/sec 15mi events/sec
  • 30. Fault tolerance and repeatability 30
  • 31. Stateful streaming applications  Most interesting stream applications are stateful  How to ensure that state is correct after failures? 31
  • 32. Fault tolerance definitions  At least once • May over-count • In some cases even under-count with different definition  Exactly once • Counts are the same after failure  End-to-end exactly once • Counts appear the same in an external sink (database, file system) after failure 32
  • 33. Fault tolerance in Flink Flink guarantees exactly once End-to-end exactly once supported with specific sources and sinks • E.g., Kafka Flink HDFS Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation 33
  • 34. Savepoints  Maintaining stateful applications in production is challenging  Flink savepoints: externally triggered, durable checkpoints  Easy code upgrades (Flink or app), maintenance, migration, and debugging, what-if simulations, A/B tests 34
  • 36. Notions of time 36 1977 1980 1983 1999 2002 2005 2015 Episode IV: A New Hope Episode V: The Empire Strikes Back Episode VI: Return of the Jedi Episode I: The Phantom Menace Episode II: Attach of the Clones Episode III: Revenge of the Sith Episode VII: The Force Awakens This is called event time This is called processing time
  • 37. Out of order streams 37 Event time windows Arrival time windows Instant event-at-a-time First burst of events Second burst of events
  • 38. Why event time Most stream processors are limited to processing/arrival time, Flink can operate on event time as well Benefits of event time • Accurate results for out of order data • Sessions and unaligned windows • Time travel (backstreaming) 38
  • 39. What's coming up in Flink 39
  • 40. Evolution of streaming in Flink  Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpoiting  Flink 0.10 (Nov 2015): Event time support, windowing mechanism based on Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, Nifi, Elastic, ...), improved monitoring  Flink 1.0 (Mar 2015): DataStream API stability, out of core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support 40
  • 41. Upcoming features  SQL: ongoing work in collaboration with Apache Calcite  Dynamic scaling: adapt resources to stream volume, historical stream processing  Queryable state: ability to query the state inside the stream processor  Mesos support  More sources and sinks (e.g., Kinesis, Cassandra) 41
  • 44. Summary  Stream processing gaining momentum, the right paradigm for continuous data applications  Even seemingly simple applications can be complex at scale and in production – choice of framework crucial 44  Flink: unique combination of capabilities, performance, and robustness
  • 45. Flink in the wild 45 30 billion events daily 2 billion events in 10 1Gb machines Picked Flink for "Saiki" data integration & distribution platform See talks by at
  • 46. Join the community! 46  Follow: @ApacheFlink, @dataArtisans  Read: flink.apache.org/blog, data-artisans.com/blog  Subscribe: (news | user | dev) @ flink.apache.org
  • 47. 2 meetups next week in Bay Area! 47 April 6, San JoseApril 5, San Francisco What's new in Flink 1.0 & recent performance benchmarks with Flink http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/

Editor's Notes

  1. 3 systems (batch), or 5 systems (streaming), Need to add a new system for millisecond alerts What If I want to count every 5 minutes, not 1 hour? Just ignores out of order What if I wanna do sessions?
  2. 3 systems (batch), or 5 systems (streaming), Need to add a new system for millisecond alerts What If I want to count every 5 minutes, not 1 hour? Just ignores out of order What if I wanna do sessions?