Data Stream Analytics - Why they are important

An Introduction to Data Stream Analytics using Apache Flink
SeRC Big Data Workshop
Paris Carbone <parisc@kth.se>
PhD Candidate
KTH Royal Institute of Technology

Motivation
• Time-critical problems / Actionable Insights
• Stock market predictions
• Fraud detection
• Network security
• Fresh customer recommendations
more like First-World Problems..

[Diagram: deploy sensors, collect data on earth & wave activity, and analyse the data regularly with a query Q; Q = evacuation window]

Motivation
[Diagram: a standing query Q; Q = evacuation window]

Data Stream Paradigm
• Standing queries are evaluated continuously
• Input data is unbounded
• Queries operate on the full data stream or on the
most recent views of the stream ~ windows
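A standing query over the most recent view of a stream can be sketched in plain Java, without Flink. The class name `StandingAverage` and the count-based window of `size` readings are assumptions made for illustration only; the point is that the result is refreshed on every arriving event instead of being computed once over stored data.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical standing query: the average over the most recent `size` readings. */
class StandingAverage {
    private final int size;                                   // length of the "most recent view"
    private final Deque<Double> recent = new ArrayDeque<>();  // readings currently in the view
    private double sum = 0.0;

    StandingAverage(int size) {
        this.size = size;
    }

    /** Called once per arriving event; returns the refreshed query result. */
    double onEvent(double reading) {
        recent.addLast(reading);
        sum += reading;
        if (recent.size() > size) {
            sum -= recent.removeFirst();                      // evict the oldest reading
        }
        return sum / recent.size();
    }
}
```

Feeding the readings 10, 20, 40 into `new StandingAverage(2)` yields 10, 15 and 30 in turn: the input never ends, and the query never stops answering.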

Data Stream Basics
• Events/Tuples : elements of computation - respect a schema
• Data Streams : unbounded sequences of events
• Stream Operators: consume streams and generate new ones.
• Events are consumed once - no backtracking!
[Diagram: a stream operator f consuming input streams S1, S2 and emitting streams So, S’1, S’2]
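As a rough sketch of these basics, a stream operator can be modelled in plain Java as an iterator that wraps another iterator. `MapOperator` is a hypothetical name, not a Flink class; it only illustrates that an operator consumes one stream, generates a new one, and reads each event exactly once.

```java
import java.util.Iterator;
import java.util.function.Function;

/** Hypothetical operator: wraps an input stream and yields a new, transformed stream. */
class MapOperator<I, O> implements Iterator<O> {
    private final Iterator<I> input;   // upstream events, possibly unbounded
    private final Function<I, O> f;    // per-event transformation

    MapOperator(Iterator<I> input, Function<I, O> f) {
        this.input = input;
        this.f = f;
    }

    @Override
    public boolean hasNext() {
        return input.hasNext();
    }

    @Override
    public O next() {
        // Each input event is pulled exactly once; there is no way to go back.
        return f.apply(input.next());
    }
}
```

Because the operator is itself an iterator, operators of this kind can be chained into a pipeline, each one consuming the stream produced by the previous one.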

Streaming Pipelines
[Diagram: sources (stream1, stream2) feed a query Q, whose sinks emit approximations, predictions, alerts, …]

Stream Analytics Systems
Proprietary
• Google DataFlow
• IBM Infosphere
• Microsoft Azure
Open Source
• Flink
• Storm
• Samza
• Spark

Programming Models
Compositional
• Offer basic building blocks for composing custom operators and topologies
• Advanced behaviour such as windowing is often missing
• Custom Optimisation
Declarative
• Expose a high-level API
• Operators are transformations on abstract data types
• Advanced behaviour such as windowing is supported
• Self-Optimisation

Introducing Apache Flink
[Figure: number of unique contributor ids by git commits (0 to 120), July 2009 to May 2016]
• A Top-level project
• Community-driven open source software development
• Publicly open to new contributors

Native Workload Support
Apache Flink
• Stream Pipelines
• Batch Pipelines
• Scalable Machine Learning
• Graph Analytics

The Apache Flink Stack
Layers: APIs (DataStream, DataSet), Execution (Distributed Dataflow), Deployment
DataSet
• Bounded Data Sources
• Blocking Operations
• Structured Iterations
DataStream
• Unbounded Data Sources
• Continuous Operations
• Asynchronous Iterations

The Big Picture
[Diagram: the DataStream and DataSet APIs sit on top of Distributed Dataflow and Deployment, with libraries on top of the APIs: Graph-Gelly, Table, ML and SQL (each shown twice), Hadoop M/R, and CEP]

Basic API Concept
Source → Data Stream → Operator → Data Stream → Sink
Source → Data Set → Operator → Data Set → Sink
Writing a Flink Program
1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks

Data Streams as Abstract Data Types
• Tasks are distributed and run in a pipelined fashion.
• State is kept within tasks.
• Transformations are applied per-record or window.
• Transformations: map, flatmap, filter, union…
• Aggregations: reduce, fold, sum
• Partitioning: forward, broadcast, shuffle, keyBy
• Sources/Sinks: custom or Kafka, Twitter, Collections…

Example
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .sum(1)
  .print()
Data flow for the input “live and let live”:
flatMap: “live” “and” “let” “live”
map: (live,1) (and,1) (let,1) (live,1)
keyBy + sum: (live,1) (and,1) (let,1) (live,2)
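The same rolling word count can be traced in plain Java, without Flink. This is only a sketch: `RollingWordCount` is a hypothetical name, and the `HashMap` stands in for the per-key state that a keyed sum would maintain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical single-threaded imitation of flatMap -> map -> keyBy -> sum. */
class RollingWordCount {
    static List<String> run(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();   // state: one counter per key (word)
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\W+")) {     // flatMap: line -> words
                if (word.isEmpty()) {
                    continue;                            // split can yield a leading empty token
                }
                int updated = counts.merge(word, 1, Integer::sum);  // map to (word, 1) and sum per key
                output.add("(" + word + "," + updated + ")");       // emit the updated count
            }
        }
        return output;
    }
}
```

For the input "live and let live" this emits (live,1), (and,1), (let,1), (live,2): every incoming word produces an updated count for its key.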

Working with Windows
Why windows?
We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
1) Sliding windows
myKeyedStream.timeWindow(
Time.seconds(60),
Time.seconds(20));
2) Tumbling windows
myKeyedStream.timeWindow(
Time.seconds(60));
[Figures: two timelines in seconds (ticks 0 to 100 and 0 to 120) showing window sums SUM #1 to SUM #3 over events at 15, 38, 65, 88, 110, 120, grouped into window buckets/panes such as (15 38), (38 65), (65 88)]
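How an event lands in tumbling or sliding windows can be sketched with a little arithmetic. The class and method names below are hypothetical, timestamps are assumed to be non-negative seconds, and windows are assumed to be aligned to time 0; this is not Flink's implementation, only the idea behind it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: computes the start times of the windows containing a timestamp. */
class WindowAssigner {
    /** Tumbling: each event belongs to exactly one window [start, start + size). */
    static long tumblingStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    /** Sliding: each event belongs to size / slide overlapping windows. */
    static List<Long> slidingStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);   // latest window that already started
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);                               // window [start, start + size)
        }
        return starts;
    }
}
```

With a 60 second size and a 20 second slide, an event at second 65 falls into the windows starting at 60, 40 and 20, while with a 60 second tumbling window it falls only into the window starting at 60.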

Example
Counting words over windows
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
10:48 “live and” → (live,1) (and,1) in Window (10:45-10:50)
11:01 “let live” → (let,1) (live,1) in Window (11:00-11:05)
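The bucketing in this example can be reproduced in plain Java. This is a sketch under assumptions: `WindowedWordCount` is a hypothetical class, the event time is passed in explicitly, and windows are five-minute tumbling windows aligned to the hour.

```java
import java.time.LocalTime;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-window word counter using 5-minute tumbling windows. */
class WindowedWordCount {
    static final int WINDOW_SECONDS = 5 * 60;

    // window start -> (word -> count); counts never leak from one window into another
    private final Map<LocalTime, Map<String, Integer>> windows = new TreeMap<>();

    /** Rounds an event time down to the start of its window. */
    static LocalTime windowStart(LocalTime eventTime) {
        int second = eventTime.toSecondOfDay();
        return LocalTime.ofSecondOfDay(second - (second % WINDOW_SECONDS));
    }

    void add(LocalTime eventTime, String line) {
        Map<String, Integer> counts =
                windows.computeIfAbsent(windowStart(eventTime), k -> new HashMap<>());
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    int count(LocalTime windowStart, String word) {
        return windows.getOrDefault(windowStart, new HashMap<>()).getOrDefault(word, 0);
    }
}
```

An event at 10:48 is rounded down to the window starting at 10:45 and one at 11:01 to the window starting at 11:00, so "live" is counted once in each window rather than twice overall.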

Example
[Diagram: operator pipeline flatMap → map → window sum → print, where counts are kept in state]
textStream
.map {(_, 1)}
.keyBy(0)
.sum(1)
.print()

Example
[Diagram: operator pipeline flatMap → map → window sum → print]
textStream
.map {(_, 1)}
.keyBy(0)
.sum(1)
.setParallelism(4)
.print()
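Why keyed counting still works with four parallel instances can be illustrated with hash partitioning. The sketch below is a simplification with hypothetical names; Flink's actual key-to-subtask assignment is more involved, but the property that matters is the same: equal keys always reach the same parallel instance, so each instance owns the counts for its keys.

```java
/** Hypothetical key partitioner: routes every key to one of `parallelism` subtasks. */
class KeyPartitioner {
    static int subtaskFor(Object key, int parallelism) {
        // floorMod keeps the result in [0, parallelism) even for negative hash codes
        return Math.floorMod(key.hashCode(), parallelism);
    }
}
```

Calling `KeyPartitioner.subtaskFor("live", 4)` returns the same subtask index every time, so all occurrences of "live" are summed in one place.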

Making State Explicit
• Explicitly defined state is durable to failures
• Flink supports two types of explicit states
  • Operator State - full state
  • Key-Value State - partitioned state per key
• State Backends: In-memory, RocksDB, HDFS

Fault Tolerance
[Diagram: a stream of events with snapshots taken at t1 and t2 (snap - t1, snap - t2)]
State is not affected by failures
When failures occur we revert computation and state back to a snapshot
Also part of Apache Storm
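The revert-to-snapshot idea can be sketched on a single task in plain Java. `CountingTask` and its methods are hypothetical, and the sketch leaves out the distributed part (coordinating snapshots across tasks and rewinding the input to the snapshot position); it only shows that state copied at snapshot time can replace the live state after a failure.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical task with per-key counters that can be snapshotted and restored. */
class CountingTask {
    private Map<String, Integer> counts = new HashMap<>();

    void process(String key) {
        counts.merge(key, 1, Integer::sum);
    }

    /** Takes a copy of the state, so later updates do not alter the snapshot. */
    Map<String, Integer> snapshot() {
        return new HashMap<>(counts);
    }

    /** Reverts the task to the state captured in a snapshot. */
    void restore(Map<String, Integer> snapshot) {
        counts = new HashMap<>(snapshot);
    }

    int get(String key) {
        return counts.getOrDefault(key, 0);
    }
}
```

If a failure happens after the snapshot, restoring it and replaying the events that arrived since then brings the counters back to a consistent value.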

Performance
• Twitter Hack Week - Flink as an in-memory data store
Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

So how is Flink different from Spark?
Two major differences
1) Stream Execution
2) Mutable State

Flink vs Spark
Flink
• dedicated resources
• mutable state
Spark Streaming
• leased resources
• immutable state
• put new states in output RDD: dstream.updateStateByKey(…)
[Diagram: input In, state S, new state S’]

What about DataSets?
• Sophisticated SQL-inspired optimiser
• Efficient Join Strategies
• Managed Memory bypasses Garbage Collection
• Fast, in-memory Iterative Bulk Computations

Detecting Patterns
CEP Java library Example
PatternStream<Event> tsunamiPattern =
    CEP.pattern(sensorStream,
        Pattern.<Event>begin("seismic").where(evt -> evt.motion.equals("ClassB"))
            .next("tidal").where(evt -> evt.elevation > 500));
DataStream<Alert> result = tsunamiPattern.select(
    pattern -> {
        return getEvacuationAlert(pattern);
    });
Scala DSL coming soon
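What this pattern asks for can be spelled out in plain Java, without the CEP library. The sketch below is not how Flink CEP is implemented; `SensorEvent` and `TsunamiMatcher` are hypothetical names, and the events are assumed to arrive as an ordered list. It finds every event with elevation above 500 that directly follows a "ClassB" motion event, which is the strict contiguity that `next` expresses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sensor reading with the two fields the pattern inspects. */
class SensorEvent {
    final String motion;
    final int elevation;

    SensorEvent(String motion, int elevation) {
        this.motion = motion;
        this.elevation = elevation;
    }
}

/** Hypothetical matcher for "seismic event directly followed by tidal event". */
class TsunamiMatcher {
    /** Returns the index of each tidal event that completes a match. */
    static List<Integer> match(List<SensorEvent> events) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 1; i < events.size(); i++) {
            boolean seismic = events.get(i - 1).motion.equals("ClassB");
            boolean tidal = events.get(i).elevation > 500;
            if (seismic && tidal) {
                matches.add(i);   // an evacuation alert would be raised here
            }
        }
        return matches;
    }
}
```

A CEP library does the same matching incrementally as events arrive, keeping partial matches in state instead of scanning a finished list.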

Mining Graphs with Gelly
• Iterative Graph Processing
  • Scatter-Gather
  • Gather-Sum-Apply
• Graph Transformations/Properties
• Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count etc…
Coming Soon: Real-time graph stream support

Machine Learning Pipelines
• Scikit-learn inspired pipelining
• Supervised: SVM, Linear Regression
• Preprocessing: Polynomial Features, Scalers
• Recommendation: ALS

Relational Queries
Table API Example
Table table = tableEnv.fromDataSet(input);
Table filtered = table
    .groupBy("word")
    .select("word.count as count, word")
    .filter("count = 2");
DataSet<WC> result = tableEnv.toDataSet(filtered, WC.class);
SQL and Stream SQL coming soon
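For readers who want to check what this query computes, here is an equivalent over an in-memory list using only Java's standard streams. It does not use the Table API; `WordTableQuery` is a hypothetical name, and the input is assumed to be a plain list of words.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical in-memory equivalent of: group by word, count, keep count = 2. */
class WordTableQuery {
    static Map<String, Long> wordsSeenTwice(List<String> words) {
        return words.stream()
                // groupBy("word") + word.count
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                // filter("count = 2")
                .filter(entry -> entry.getValue() == 2L)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```

For the words live, and, let, live the result contains the single entry live with a count of 2.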

Real-Time Monitoring
…for real-time processing

Coming Soon
• SQL and Stream SQL
• Stream ML
• Stream Graph Processing (Gelly-Stream)
• Autoscaling
• Incremental Snapshots
