An Introduction to Data Stream Analytics using Apache Flink
SeRC Big Data Workshop
Paris Carbone <parisc@kth.se>
PhD Candidate
KTH Royal Institute of Technology
Motivation
• Time-critical problems / Actionable Insights
• Stock market predictions
• Fraud detection
• Network security
• Fresh customer recommendations
More like first-world problems...
How about tsunamis?
[Figure: deploy sensors, collect data on earth & wave activity, and analyse the data regularly with a query Q to obtain an evacuation window]
Motivation
[Figure: Q as a standing query producing the evacuation window]
Data Stream Paradigm
• Standing queries are evaluated continuously
• Input data is unbounded
• Queries operate on the full data stream or on the most recent views of the stream (windows)
Data Stream Basics
• Events/Tuples : elements of computation - respect a schema
• Data Streams : unbounded sequences of events
• Stream Operators: consume streams and generate new ones.
• Events are consumed once - no backtracking!
[Figure: a stream operator f consuming input streams (S1, S2) and generating new streams (So, S'1, S'2)]
Streaming Pipelines
[Figure: sources (stream1, stream2) feed a query Q, whose results (approximations, predictions, alerts, ……) go to sinks]
Stream Analytics Systems
• Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
• Open Source: Flink, Storm, Samza, Spark
Programming Models
Compositional
• Offer basic building blocks for composing custom operators and topologies
• Advanced behaviour such as windowing is often missing
• Custom Optimisation
Declarative
• Expose a high-level API
• Operators are transformations on abstract data types
• Advanced behaviour such as windowing is supported
• Self-Optimisation
Introducing Apache Flink
[Figure: number of unique contributor ids by git commits, July 2009 to May 2016]
• An Apache top-level project
• Community-driven open source software development
• Publicly open to new contributors
Native Workload Support
Apache Flink natively supports:
• Stream Pipelines
• Batch Pipelines
• Scalable Machine Learning
• Graph Analytics
The Apache Flink Stack
APIs: DataSet, DataStream
Execution: Distributed Dataflow
Deployment
DataSet
• Bounded Data Sources
• Blocking Operations
• Structured Iterations
DataStream
• Unbounded Data Sources
• Continuous Operations
• Asynchronous Iterations
The Big Picture
[Figure: the Flink stack (Deployment, Distributed Dataflow, DataSet and DataStream APIs) with libraries layered on top of the two APIs: Graph (Gelly), Table, ML, Hadoop M/R, CEP, SQL]
Basic API Concept
Source -> Data Stream -> Operator -> Data Stream -> Sink
Source -> Data Set -> Operator -> Data Set -> Sink
Writing a Flink Program
1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks
Data Streams as Abstract Data Types
• Tasks are distributed and run in a pipelined fashion.
• State is kept within tasks.
• Transformations are applied per-record or window.
• Transformations: map, flatMap, filter, union…
• Aggregations: reduce, fold, sum
• Partitioning: forward, broadcast, shuffle, keyBy
• Sources/Sinks: custom or Kafka, Twitter, Collections…
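Example
textStream
  .flatMap { _.split("\\W+") }  // split each line into words
  .map { (_, 1) }               // pair each word with a count of 1
  .keyBy(0)                     // partition the stream by word
  .sum(1)                       // keep a running sum of the counts per word
  .print()
Input: "live and let live"
After flatMap: "live" "and" "let" "live"
After map: (live,1) (and,1) (let,1) (live,1)
After keyBy and sum: (live,1) (and,1) (let,1) (live,2)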
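The per-record behaviour traced above can be imitated without Flink. The following is a minimal plain-Java sketch, not Flink code: the class name `WordCountSketch` and the method `runningCounts` are made up for illustration, and a `HashMap` stands in for the keyed state that Flink manages for `keyBy(0).sum(1)`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Flink dependency) of flatMap -> map -> keyBy -> sum.
// Names are hypothetical; Flink distributes this work across tasks.
class WordCountSketch {

    // Returns one output record per input word: the word with its running count.
    static List<String> runningCounts(String line) {
        Map<String, Integer> countsPerKey = new HashMap<>(); // stand-in for keyed state
        List<String> output = new ArrayList<>();
        for (String word : line.split("\\W+")) {              // flatMap: line -> words
            if (word.isEmpty()) {
                continue;                                      // split may yield a leading empty token
            }
            // map to (word, 1), then add it to the current sum for that key
            int updated = countsPerKey.merge(word, 1, Integer::sum);
            output.add("(" + word + "," + updated + ")");    // every update is emitted downstream
        }
        return output;
    }

    public static void main(String[] args) {
        // prints [(live,1), (and,1), (let,1), (live,2)]
        System.out.println(runningCounts("live and let live"));
    }
}
```

Each incoming word produces an output immediately, which is why the second "live" yields (live,2) rather than a single final total.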
Working with Windows
Why windows?
We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
1) Sliding windows
myKeyedStream.timeWindow(
  Time.seconds(60),   // window size
  Time.seconds(20));  // slide interval
2) Tumbling windows
myKeyedStream.timeWindow(
  Time.seconds(60));  // window size
[Figure: two timelines (#sec, 0 to 120) illustrating sliding and tumbling window sums (SUM #1, SUM #2, SUM #3) over sample values 15, 38, 65, 88, ..., grouped into window buckets/panes]
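To make the difference between the two window types concrete, here is a small plain-Java sketch of window assignment. It is an illustration under assumptions, not Flink's implementation: the class `WindowAssignSketch` and the method `windowStarts` are hypothetical names, timestamps are non-negative seconds, and windows are aligned to time 0.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Flink dependency) of assigning an event to time windows.
// Assumes non-negative timestamps and windows aligned to time 0.
class WindowAssignSketch {

    // Returns the start time of every window [start, start + size) that contains the timestamp.
    // A tumbling window is the special case slide == size.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);  // latest window that has already begun
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);                              // earlier windows that still cover the event
        }
        return starts;
    }

    public static void main(String[] args) {
        // Sliding, size 60 s, slide 20 s: the event at second 65 falls into three windows.
        System.out.println(windowStarts(65, 60, 20)); // prints [60, 40, 20]
        // Tumbling, size 60 s: the same event falls into exactly one window.
        System.out.println(windowStarts(65, 60, 60)); // prints [60]
    }
}
```

With a 60-second size and a 20-second slide every event belongs to three overlapping windows, whereas tumbling windows partition time so each event is counted once.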
Example: counting words over windows
textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.minutes(5))  // tumbling 5-minute windows
  .sum(1)
  .print()
10:48 "live and" -> Window (10:45-10:50): (live,1) (and,1)
11:01 "let live" -> Window (11:00-11:05): (let,1) (live,1)
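The windowed count above can be sketched in plain Java as well. This is only an approximation of the semantics: `WindowedCountSketch`, `add` and `label` are hypothetical names, time is given as hour and minute of one day, and the sketch keeps all windows in memory instead of emitting a window's result once it closes, as a Flink trigger would.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Flink dependency) of counting words per tumbling 5-minute window.
// Names are hypothetical; window triggering and eviction are deliberately left out.
class WindowedCountSketch {
    static final int WINDOW_MINUTES = 5;

    // window start (minutes since midnight) -> word -> count
    final Map<Integer, Map<String, Integer>> counts = new TreeMap<>();

    // Adds the words of a line that arrived at hour:minute to the window covering that time.
    void add(int hour, int minute, String line) {
        int minuteOfDay = hour * 60 + minute;
        int windowStart = minuteOfDay - (minuteOfDay % WINDOW_MINUTES); // e.g. 10:48 -> 10:45
        Map<String, Integer> perWord = counts.computeIfAbsent(windowStart, k -> new TreeMap<>());
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {
                perWord.merge(word, 1, Integer::sum);
            }
        }
    }

    // Formats a window start as HH:mm.
    static String label(int windowStart) {
        return String.format("%02d:%02d", windowStart / 60, windowStart % 60);
    }

    public static void main(String[] args) {
        WindowedCountSketch sketch = new WindowedCountSketch();
        sketch.add(10, 48, "live and");
        sketch.add(11, 1, "let live");
        sketch.counts.forEach((start, perWord) ->
                System.out.println("Window starting " + label(start) + ": " + perWord));
    }
}
```

Because the two lines arrive in different windows, "live" is counted once in each window instead of reaching 2.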
Example
textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)   // window sum: where counts are kept in state
  .print()
[Figure: the program as a pipeline of operators: flatMap -> map -> window sum -> print]
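Example
textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .setParallelism(4)  // run the window sum operator with 4 parallel instances
  .print()
[Figure: the same pipeline (flatMap -> map -> window sum -> print) with parallel operator instances]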
Making State Explicit
• Explicitly defined state is durable to failures
• Flink supports two types of explicit state:
  • Operator State - full state
  • Key-Value State - partitioned state per key
• State Backends: In-memory, RocksDB, HDFS
Fault Tolerance
State is not affected by failures
When failures occur we revert computation and state back to a snapshot
[Figure: a stream of events with snapshots taken at t1 and t2 (snap - t1, snap - t2)]
Also part of Apache Storm
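The revert-and-replay idea can be illustrated with a single-threaded plain-Java sketch. It is not Flink's distributed snapshotting mechanism: `SnapshotSketch`, `process` and `runWithFailure` are hypothetical names, the "snapshot" is just a copy of a map plus an input position, and the source is assumed to be replayable from that position.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Flink dependency) of recovering from a failure via a snapshot.
// Assumes a replayable input and 0 <= snapshotAt <= failAt <= events.size().
class SnapshotSketch {

    // Counts the events in positions [from, to) into the given state.
    static void process(List<String> events, int from, int to, Map<String, Integer> state) {
        for (int i = from; i < to; i++) {
            state.merge(events.get(i), 1, Integer::sum);
        }
    }

    // Snapshots after snapshotAt events, "fails" after failAt events, then recovers.
    static Map<String, Integer> runWithFailure(List<String> events, int snapshotAt, int failAt) {
        Map<String, Integer> state = new HashMap<>();
        process(events, 0, snapshotAt, state);

        Map<String, Integer> snapshot = new HashMap<>(state); // copy of the state at snapshot time
        int snapshotPosition = snapshotAt;                     // input position belonging to the snapshot

        process(events, snapshotAt, failAt, state);            // progress that is lost in the failure

        // Failure: drop the live state, revert to the snapshot, replay the input from its position.
        state = new HashMap<>(snapshot);
        process(events, snapshotPosition, events.size(), state);
        return state;
    }

    public static void main(String[] args) {
        List<String> events = List.of("live", "and", "let", "live");
        // Same counts as a run without failures: live=2, and=1, let=1 (map order may vary).
        System.out.println(runWithFailure(events, 2, 3));
    }
}
```

Since the state and the input position are restored together, the event processed between the snapshot and the failure is counted exactly once in the final result.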
Performance
• Twitter Hack Week - Flink as an in-memory data store
Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
So how is Flink different from Spark?
Two major differences
1) Stream Execution
2) Mutable State
Flink vs Spark
Flink
• dedicated resources
• mutable state
Spark (Spark Streaming)
• leased resources
• immutable state: dstream.updateStateByKey(…) puts new states in an output RDD
[Figure: input In, state S and new state S']
What about DataSets?
• Sophisticated SQL-inspired optimiser
• Efficient Join Strategies
• Managed Memory bypasses Garbage Collection
• Fast, in-memory Iterative Bulk Computations
Some Interesting Libraries
Detecting Patterns
PatternStream<Event> tsunamiPattern =
  CEP.pattern(sensorStream,
    Pattern
      .begin("seismic").where(evt -> evt.motion.equals("ClassB"))
      .next("tidal").where(evt -> evt.elevation > 500));
DataStream<Alert> result = tsunamiPattern.select(
  pattern -> {
    return getEvacuationAlert(pattern);
  });
CEP Java library Example
Scala DSL coming soon
Mining Graphs with Gelly
• Iterative Graph Processing
  • Scatter-Gather
  • Gather-Sum-Apply
• Graph Transformations/Properties
• Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count etc…
Coming Soon: Real-time graph stream support
Machine Learning Pipelines
• Scikit-learn inspired pipelining
• Supervised: SVM, Linear Regression
• Preprocessing: Polynomial Features, Scalers
• Recommendation: ALS
Relational Queries
Table table = tableEnv.fromDataSet(input);
Table filtered = table
  .groupBy("word")
  .select("word.count as count, word")
  .filter("count = 2");
DataSet<WC> result = tableEnv.toDataSet(filtered, WC.class);
Table API Example
SQL and Stream SQL coming soon
Real-Time Monitoring
…for real-time processing
Coming Soon
• SQL and Stream SQL
• Stream ML
• Stream Graph Processing (Gelly-Stream)
• Autoscaling
• Incremental Snapshots
