SlideShare a Scribd company logo
1 of 35
Download to read offline
An Introduction to Data Stream
Analytics
using Apache Flink
SeRC Big Data Workshop
Paris Carbone<parisc@kth.se>
PhD Candidate
KTH Royal Institute of Technology
1
Motivation
• Time-critical problems / Actionable Insights
• Stock market predictions
• Fraud detection
• Network security
• Fresh customer recommendations
2
more like First-World Problems..
How about Tsunamis
3
4
Q =
Q
Deploy Sensors
Analyse Data
Regularly
Collect
Data
evacuation
window
earth & wave activity
Motivation
5
Q Q
Q =
Motivation
6
Q
Standing Query
Q =
evacuation
window
Data Stream Paradigm
• Standing queries are evaluated continuously
• Input data is unbounded
• Queries operate on the full data stream or on the
most recent views of the stream ~ windows
7
Data Stream Basics
• Events/Tuples : elements of computation - respect a schema
• Data Streams : unbounded sequences of events
• Stream Operators: consume streams and generate new ones.
• Events are consumed once - no backtracking!
8
f
S1
S2
So
S’1
S’2
Streaming Pipelines
9
stream1
stream2
approximations
predictions
alerts
……
Q
sources
sinks
Stream Analytics Systems
10
Proprietary Open Source
Google
DataFlow
IBM
Infosphere
Microsoft
Azure
Flink
Storm
Samza
Spark
Programming Models
11
Compositional Declarative
• Offer basic building blocks
for composing custom
operators and topologies
• Advanced behaviour such
as windowing is often
missing
• Custom Optimisation
• Expose a high-level API
• Operators are transformations
on abstract data types
• Advanced behaviour such as
windowing is supported
• Self-Optimisation
Introducing Apache Flink
0
20
40
60
80
100
120
juli-09 nov-10 apr-12 aug-13 dec-14 maj-16
#unique contributor ids by git
commits
• A Top-level project
• Community-driven open
source software development
• Publicly open to new
contributors
Native Workload Support
Apache Flink
Stream Pipelines
Batch Pipelines
Scalable
Machine Learning
Graph Analytics
14
The Apache Flink Stack
APIs
Execution
DataStreamDataSet
Distributed Dataflow
Deployment
• Bounded Data Sources
• Blocking Operations
• Structured Iterations
• Unbounded Data Sources
• Continuous Operations
• Asynchronous Iterations
The Big Picture
DataStreamDataSet
Distributed Dataflow
Deployment
Graph-Gelly
Table
ML
HadoopM/R
Table
CEP
SQL
SQL
ML
Graph-Gelly
16
Basic API Concept
Source
Data
Stream Operator
Data
Stream Sink
Source
Data
Set
Operator
Data
Set
Sink
Writing a Flink Program
1.Bootstrap Sources
2.Apply Operators
3.Output to Sinks
Data Streams as
Abstract Data Types
• Tasks are distributed and run in a pipelined fashion.
• State is kept within tasks.
• Transformations are applied per-record or window.
• Transformations: map, flatmap, filter, union…
• Aggregations: reduce, fold, sum
• Partitioning: forward, broadcast, shuffle, keyBy
• Sources/Sinks: custom or Kafka, Twitter, Collections…
17
DataStream
Example
18
textStream
.flatMap {_.split("W+")}
.map {(_, 1)}
.keyBy(0)
.sum(1)
.print()
“live and let live”
“live”	“and”	“let”	“live”
(live,1)	(and,1)	(let,1)	(live,1)
(live,1)
(and,1)
(let,1)
(live,2)
Working with Windows
19
Why windows?
We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently
under different notions of time and deal with late events!
#sec
40 80
SUM #2
0
SUM #1
20 60 100
#sec
40 80
SUM #3
SUM #2
0
SUM #1
20 60 100
120
15 38 65 88
15 38
38 65
65 88
15 38 65 88
110 120
myKeyedStream.timeWindow(
Time.seconds(60),
Time.seconds(20));
1) Sliding windows
2) Tumbling windows
myKeyedStream.timeWindow(
Time.seconds(60));
window buckets/panes
Example
20
textStream
.flatMap {_.split("W+")}
.map {(_, 1)}
.keyBy(0)
.timeWindow(Time.minutes(5))
.sum(1)
.print()
“live and”
(live,1)	(and,1)
(let,1)	(live,1)
counting words over windows
“let live”
10:48
11:01
Window (10:45-10:50)
Window (11:00-11:05)
Example
21
printwindow sumflatMap
textStream
.flatMap {_.split("W+")}
.map {(_, 1)}
.keyBy(0)
.timeWindow(Time.minutes(5))
.sum(1)
.print()
map
where counts are
kept in state
Example
22
window sum
flatMap
textStream
.flatMap {_.split("W+")}
.map {(_, 1)}
.keyBy(0)
.timeWindow(Time.minutes(5))
.sum(1)
.setParallelism(4)
.print()
map print
Making State Explicit
23
• Explicitly defined state is durable to failures
• Flink supports two types of explicit states
• Operator State - full state
• Key-Value State - partitioned state per key
• State Backends: In-memory, RocksDB, HDFS
Fault Tolerance
24
t2t1
snap - t1 snap - t2
snapshotting snapshotting
State is not affected by failures
When failures occur we
revert computation and state back to a snapshot
events
Also part of Apache Storm
Performance
• Twitter Hack Week - Flink as an in-memory data store
25
Jamie Grier - http://data-artisans.com/extending-the-
yahoo-streaming-benchmark/
So how is Flink different that
Spark?
26
Two major differences
1) Stream Execution
2) Mutable State
Flink vs Spark
27
(Spark Streaming)
put new states in output RDDdstream.updateStateByKey(…)
In S’
S
• dedicated resources
• leased resources
• mutable state
• immutable state
What about DataSets?
28
• Sophisticated SQL-inspired optimiser
• Efficient Join Strategies
• Managed Memory bypasses Garbage Collection
• Fast, in-memory Iterative Bulk Computations
Some Interesting Libraries
29
Detecting Patterns
30
PatternStream<Event> tsunamiPattern =
CEP.pattern(sensorStream,
Pattern
.begin("seismic").where(evt -> evt.motion.equals(“ClassB”))
.next("tidal").where(evt -> evt.elevation > 500));
DataStream<Alert> result = tsunamiPattern.select(
pattern -> {
return getEvacuationAlert(pattern);
});
CEP Java library Example
Scala DSL coming soon
Mining Graphs with Gelly
31
• Iterative Graph Processing
• Scatter-Gather
• Gather-Sum-Apply
• Graph Transformations/Properties
• Library Methods: Community Detection, Label
Propagation, Connected Components,
PageRank.Shortest Paths, Triangle Count etc…
Coming Soon : Real-time graph stream support
Machine Learning Pipelines
32
• Scikit-learn inspired pipelining
• Supervised: SVM, Linear Regression
• Preprocessing: Polynomial Features, Scalers
• Recommendation: ALS
Relational Queries
33
Table table = tableEnv.fromDataSet(input);
Table filtered = table
.groupBy("word")
.select("word.count as count, word")
.filter("count = 2");
DataSet<WC> result = tableEnv.toDataSet(filtered, WC.class);
Table API Example
SQL and Stream SQL coming soon
Real-Time Monitoring
34
…for real-time processing
Coming Soon
35
• SQL and Stream SQL
• Stream ML
• Stream Graph Processing (Gelly-Stream)
• Autoscaling
• Incremental Snapshots

More Related Content

What's hot

Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Flink Forward
 
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
Kostas Tzoumas
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
Stephan Ewen
 

What's hot (20)

Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
 
Don't Cross The Streams - Data Streaming And Apache Flink
Don't Cross The Streams  - Data Streaming And Apache FlinkDon't Cross The Streams  - Data Streaming And Apache Flink
Don't Cross The Streams - Data Streaming And Apache Flink
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
 
SICS: Apache Flink Streaming
SICS: Apache Flink StreamingSICS: Apache Flink Streaming
SICS: Apache Flink Streaming
 
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
 
Stateful stream processing with Apache Flink
Stateful stream processing with Apache FlinkStateful stream processing with Apache Flink
Stateful stream processing with Apache Flink
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Debunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream ProcessingDebunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream Processing
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 

Viewers also liked

Viewers also liked (20)

Chapter 2.1 : Data Stream
Chapter 2.1 : Data StreamChapter 2.1 : Data Stream
Chapter 2.1 : Data Stream
 
Big Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoBig Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di Milano
 
Detecting Anomalies in Streaming Data
Detecting Anomalies in Streaming DataDetecting Anomalies in Streaming Data
Detecting Anomalies in Streaming Data
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
Cerrera DINWC2015
Cerrera DINWC2015Cerrera DINWC2015
Cerrera DINWC2015
 
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm
 
Cloud computing - Pros and Cons
Cloud computing - Pros and ConsCloud computing - Pros and Cons
Cloud computing - Pros and Cons
 
Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Graph Stream Processing : spinning fast, large scale, complex analytics
Graph Stream Processing : spinning fast, large scale, complex analyticsGraph Stream Processing : spinning fast, large scale, complex analytics
Graph Stream Processing : spinning fast, large scale, complex analytics
 
Discover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalDiscover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.final
 
Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink-
 
IBM IoT Architecture and Capabilities at the Edge and Cloud
IBM IoT Architecture and Capabilities at the Edge and Cloud IBM IoT Architecture and Capabilities at the Edge and Cloud
IBM IoT Architecture and Capabilities at the Edge and Cloud
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientists
 
Streaming datasets for personalization
Streaming datasets for personalizationStreaming datasets for personalization
Streaming datasets for personalization
 

Similar to Data Stream Analytics - Why they are important

SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData
 

Similar to Data Stream Analytics - Why they are important (20)

Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya Meetup
 
data-stream-processing-SEEP.pptx
data-stream-processing-SEEP.pptxdata-stream-processing-SEEP.pptx
data-stream-processing-SEEP.pptx
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Baymeetup-FlinkResearch
Baymeetup-FlinkResearchBaymeetup-FlinkResearch
Baymeetup-FlinkResearch
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache Flink
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
The Rise of Streaming SQL
The Rise of Streaming SQLThe Rise of Streaming SQL
The Rise of Streaming SQL
 
[WSO2Con USA 2018] The Rise of Streaming SQL
[WSO2Con USA 2018] The Rise of Streaming SQL[WSO2Con USA 2018] The Rise of Streaming SQL
[WSO2Con USA 2018] The Rise of Streaming SQL
 
Continuous Intelligence - Intersecting Event-Based Business Logic and ML
Continuous Intelligence - Intersecting Event-Based Business Logic and MLContinuous Intelligence - Intersecting Event-Based Business Logic and ML
Continuous Intelligence - Intersecting Event-Based Business Logic and ML
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
 
Stream processing - Apache flink
Stream processing - Apache flinkStream processing - Apache flink
Stream processing - Apache flink
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas Weise
 

More from Paris Carbone

Aggregate Sharing for User-Define Data Stream Windows
Aggregate Sharing for User-Define Data Stream WindowsAggregate Sharing for User-Define Data Stream Windows
Aggregate Sharing for User-Define Data Stream Windows
Paris Carbone
 

More from Paris Carbone (9)

Scalable and Reliable Data Stream Processing - Doctorate Seminar
Scalable and Reliable Data Stream Processing - Doctorate SeminarScalable and Reliable Data Stream Processing - Doctorate Seminar
Scalable and Reliable Data Stream Processing - Doctorate Seminar
 
Stream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming eraStream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming era
 
Asynchronous Epoch Commits for Fast and Reliable Data Stream Execution in Apa...
Asynchronous Epoch Commits for Fast and Reliable Data Stream Execution in Apa...Asynchronous Epoch Commits for Fast and Reliable Data Stream Execution in Apa...
Asynchronous Epoch Commits for Fast and Reliable Data Stream Execution in Apa...
 
A Future Look of Data Stream Processing as an Architecture for AI
A Future Look of Data Stream Processing as an Architecture for AIA Future Look of Data Stream Processing as an Architecture for AI
A Future Look of Data Stream Processing as an Architecture for AI
 
Continuous Deep Analytics
Continuous Deep AnalyticsContinuous Deep Analytics
Continuous Deep Analytics
 
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
 
Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...
 
Not Less, Not More: Exactly Once, Large-Scale Stream Processing in Action
Not Less, Not More: Exactly Once, Large-Scale Stream Processing in ActionNot Less, Not More: Exactly Once, Large-Scale Stream Processing in Action
Not Less, Not More: Exactly Once, Large-Scale Stream Processing in Action
 
Aggregate Sharing for User-Define Data Stream Windows
Aggregate Sharing for User-Define Data Stream WindowsAggregate Sharing for User-Define Data Stream Windows
Aggregate Sharing for User-Define Data Stream Windows
 

Recently uploaded

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
JohnnyPlasten
 

Recently uploaded (20)

BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Data Stream Analytics - Why they are important

  • 1. An Introduction to Data Stream Analytics using Apache Flink SeRC Big Data Workshop Paris Carbone<parisc@kth.se> PhD Candidate KTH Royal Institute of Technology 1
  • 2. Motivation • Time-critical problems / Actionable Insights • Stock market predictions • Fraud detection • Network security • Fresh customer recommendations 2 more like First-World Problems..
  • 4. 4 Q = Q Deploy Sensors Analyse Data Regularly Collect Data evacuation window earth & wave activity
  • 7. Data Stream Paradigm • Standing queries are evaluated continuously • Input data is unbounded • Queries operate on the full data stream or on the most recent views of the stream ~ windows 7
  • 8. Data Stream Basics • Events/Tuples : elements of computation - respect a schema • Data Streams : unbounded sequences of events • Stream Operators: consume streams and generate new ones. • Events are consumed once - no backtracking! 8 f S1 S2 So S’1 S’2
  • 10. Stream Analytics Systems 10 Proprietary Open Source Google DataFlow IBM Infosphere Microsoft Azure Flink Storm Samza Spark
  • 11. Programming Models 11 Compositional Declarative • Offer basic building blocks for composing custom operators and topologies • Advanced behaviour such as windowing is often missing • Custom Optimisation • Expose a high-level API • Operators are transformations on abstract data types • Advanced behaviour such as windowing is supported • Self-Optimisation
  • 12. Introducing Apache Flink 0 20 40 60 80 100 120 juli-09 nov-10 apr-12 aug-13 dec-14 maj-16 #unique contributor ids by git commits • A Top-level project • Community-driven open source software development • Publicly open to new contributors
  • 13. Native Workload Support Apache Flink Stream Pipelines Batch Pipelines Scalable Machine Learning Graph Analytics
  • 14. 14 The Apache Flink Stack APIs Execution DataStreamDataSet Distributed Dataflow Deployment • Bounded Data Sources • Blocking Operations • Structured Iterations • Unbounded Data Sources • Continuous Operations • Asynchronous Iterations
  • 15. The Big Picture DataStreamDataSet Distributed Dataflow Deployment Graph-Gelly Table ML HadoopM/R Table CEP SQL SQL ML Graph-Gelly
  • 16. 16 Basic API Concept Source Data Stream Operator Data Stream Sink Source Data Set Operator Data Set Sink Writing a Flink Program 1.Bootstrap Sources 2.Apply Operators 3.Output to Sinks
  • 17. Data Streams as Abstract Data Types • Tasks are distributed and run in a pipelined fashion. • State is kept within tasks. • Transformations are applied per-record or window. • Transformations: map, flatmap, filter, union… • Aggregations: reduce, fold, sum • Partitioning: forward, broadcast, shuffle, keyBy • Sources/Sinks: custom or Kafka, Twitter, Collections… 17 DataStream
  • 18. Example 18 textStream .flatMap {_.split("W+")} .map {(_, 1)} .keyBy(0) .sum(1) .print() “live and let live” “live” “and” “let” “live” (live,1) (and,1) (let,1) (live,1) (live,1) (and,1) (let,1) (live,2)
  • 19. Working with Windows 19 Why windows? We are often interested in fresh data! Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events! #sec 40 80 SUM #2 0 SUM #1 20 60 100 #sec 40 80 SUM #3 SUM #2 0 SUM #1 20 60 100 120 15 38 65 88 15 38 38 65 65 88 15 38 65 88 110 120 myKeyedStream.timeWindow( Time.seconds(60), Time.seconds(20)); 1) Sliding windows 2) Tumbling windows myKeyedStream.timeWindow( Time.seconds(60)); window buckets/panes
  • 20. Example 20 textStream .flatMap {_.split("W+")} .map {(_, 1)} .keyBy(0) .timeWindow(Time.minutes(5)) .sum(1) .print() “live and” (live,1) (and,1) (let,1) (live,1) counting words over windows “let live” 10:48 11:01 Window (10:45-10:50) Window (11:00-11:05)
  • 21. Example 21 printwindow sumflatMap textStream .flatMap {_.split("W+")} .map {(_, 1)} .keyBy(0) .timeWindow(Time.minutes(5)) .sum(1) .print() map where counts are kept in state
  • 22. Example 22 window sum flatMap textStream .flatMap {_.split("W+")} .map {(_, 1)} .keyBy(0) .timeWindow(Time.minutes(5)) .sum(1) .setParallelism(4) .print() map print
  • 23. Making State Explicit 23 • Explicitly defined state is durable to failures • Flink supports two types of explicit states • Operator State - full state • Key-Value State - partitioned state per key • State Backends: In-memory, RocksDB, HDFS
  • 24. Fault Tolerance 24 t2t1 snap - t1 snap - t2 snapshotting snapshotting State is not affected by failures When failures occur we revert computation and state back to a snapshot events Also part of Apache Storm
  • 25. Performance • Twitter Hack Week - Flink as an in-memory data store 25 Jamie Grier - http://data-artisans.com/extending-the- yahoo-streaming-benchmark/
  • 26. So how is Flink different that Spark? 26 Two major differences 1) Stream Execution 2) Mutable State
  • 27. Flink vs Spark 27 (Spark Streaming) put new states in output RDDdstream.updateStateByKey(…) In S’ S • dedicated resources • leased resources • mutable state • immutable state
  • 28. What about DataSets? 28 • Sophisticated SQL-inspired optimiser • Efficient Join Strategies • Managed Memory bypasses Garbage Collection • Fast, in-memory Iterative Bulk Computations
  • 30. Detecting Patterns 30 PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream, Pattern .begin("seismic").where(evt -> evt.motion.equals(“ClassB”)) .next("tidal").where(evt -> evt.elevation > 500)); DataStream<Alert> result = tsunamiPattern.select( pattern -> { return getEvacuationAlert(pattern); }); CEP Java library Example Scala DSL coming soon
  • 31. Mining Graphs with Gelly 31 • Iterative Graph Processing • Scatter-Gather • Gather-Sum-Apply • Graph Transformations/Properties • Library Methods: Community Detection, Label Propagation, Connected Components, PageRank.Shortest Paths, Triangle Count etc… Coming Soon : Real-time graph stream support
  • 32. Machine Learning Pipelines 32 • Scikit-learn inspired pipelining • Supervised: SVM, Linear Regression • Preprocessing: Polynomial Features, Scalers • Recommendation: ALS
  • 33. Relational Queries 33 Table table = tableEnv.fromDataSet(input); Table filtered = table .groupBy("word") .select("word.count as count, word") .filter("count = 2"); DataSet<WC> result = tableEnv.toDataSet(filtered, WC.class); Table API Example SQL and Stream SQL coming soon
  • 35. Coming Soon 35 • SQL and Stream SQL • Stream ML • Stream Graph Processing (Gelly-Stream) • Autoscaling • Incremental Snapshots