This document provides information about the first conference on Apache Flink. It summarizes key aspects of the Apache Flink streaming engine, including its improved DataStream API, support for event time processing, high availability, and integration of batch and streaming capabilities. The document also outlines Flink's progress towards version 1.0, which will focus on defining public APIs and backwards compatibility, and sketches future plans such as enhancing usability features on top of the DataStream API.
2. Some practical info
§ Registration, cloakroom, and meals are in Palais
§ Information point always staffed
§ WiFi is FlinkForward
§ Twitter hashtag is #ff15
§ Follow @FlinkForward
3. Some practical info
§ Need help? Look for a volunteer (pink badges)
§ All sessions are recorded and will be made available online
§ This includes the training sessions
16. 16
1. Radically simplified infrastructure
2. Internet of Things, on-demand services
3. Can completely subsume batch
17. 17
In a world of events and isolated apps, the stream processor is the backbone of the data infrastructure
[Diagram labels: apps, each with a local view, connected by consistent movement and analytics; apps sharing a global view over a consistent store]
18. 18
§ Until now, stream processors were less mature than batch processors
§ This led to
• in-house solutions
• abuse of batch processors
• Lambda architectures
§ This is no longer the case
19. 19
Flink 0.10
With the upcoming 0.10 release, Flink significantly surpasses the state of the art in open source stream processing systems.
And we are heading to Flink 1.0 after that.
20. 20
§ Streaming technology has matured
• e.g., Flink, Kafka, Dataflow
§ Flink and Dataflow duality
• a Google technology
• an open source Apache project
21. 21
§ Streaming is happening
§ Better adapt now
§ Flink 0.10: a ready-to-use open source stream processor
23. Improved DataStream API
§ Stream data analysis differs from batch data analysis by introducing time
§ Streams are unbounded and produce data over time
§ As simple as the batch API if you handle time in a simple way
§ Powerful if you want to handle time in an advanced way (out-of-order records, preliminary results, etc.)
24. Improved DataStream API
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …;

stream
  .filter { evt => isIntersection(evt.location) }
25. Improved DataStream API
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …;

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .sum("numVehicles")
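As a rough illustration of what a 15-minute window sliding every 5 minutes means, the sketch below computes which windows a single timestamp falls into. It is plain Scala with no Flink dependency. `SlidingWindows` and `windowStarts` are names made up for this sketch, and windows are assumed to be aligned to multiples of the slide, which may differ from how Flink itself assigns them.

```scala
// Sketch only: plain Scala, no Flink dependency. All names are illustrative.
object SlidingWindows {
  /** Start timestamps (ms) of every sliding window [start, start + sizeMs) that contains `timestamp`. */
  def windowStarts(timestamp: Long, sizeMs: Long, slideMs: Long): Seq[Long] = {
    // Latest window start at or before the timestamp, aligned to the slide (assumption).
    val lastStart = timestamp - Math.floorMod(timestamp, slideMs)
    // Step back one slide at a time while the window still covers the timestamp.
    Iterator.iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start > timestamp - sizeMs)
      .toSeq
  }
}
```

With a 15-minute size and a 5-minute slide, every event contributes to three overlapping windows, which is why the sum on the slide is reported every 5 minutes over the last 15.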
26. Improved DataStream API
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …;

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")
27. Improved DataStream API
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …;

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")
  .keyBy( evt => evt.location.grid )
  .mapWithState { (evt, state: Option[Model]) => {
    // Start from a fresh model when no state exists yet for this key
    val model = state.getOrElse(new Model())
    (model.classify(evt), Some(model.update(evt)))
  }}
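The contract of `mapWithState` on this slide (take an event and the optional state for its key, return an output and the new state) can be mimicked on an in-memory sequence. The following is a minimal sketch in plain Scala, not Flink's implementation: `KeyedState` and its signature are assumptions made for illustration, and a real stream processor would additionally checkpoint the state.

```scala
// Sketch only: emulates per-key state threading on a finite Seq. Names are illustrative.
object KeyedState {
  /** Apply `f` to each event in order, keeping one optional state value per key. */
  def mapWithState[E, K, S, O](events: Seq[E], key: E => K)(
      f: (E, Option[S]) => (O, Option[S])): Seq[O] = {
    val states = scala.collection.mutable.Map.empty[K, S]
    events.map { evt =>
      val k = key(evt)
      val (out, next) = f(evt, states.get(k))
      next match {
        case Some(s) => states.update(k, s) // keep the updated state for this key
        case None    => states.remove(k)    // returning None clears the state
      }
      out
    }
  }
}

object KeyedStateExample {
  // Running total per key: the state is the sum seen so far.
  val events: Seq[(String, Int)] = Seq("a" -> 1, "b" -> 5, "a" -> 2)
  val totals: Seq[Int] =
    KeyedState.mapWithState(events, (e: (String, Int)) => e._1) {
      (e: (String, Int), s: Option[Int]) =>
        val total = s.getOrElse(0) + e._2
        (total, Some(total))
    }
}
```

Here the state for key "a" carries over from the first event to the third, while key "b" has its own independent state.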
28. IoT / Mobile Applications
Events occur on devices
Queue / Log: events stored in a log
Stream Analysis: events analyzed in a data streaming system
32. IoT / Mobile Applications
Out of order!
First burst of events
Second burst of events
33. IoT / Mobile Applications
Event time windows
Arrival time windows
Instant event-at-a-time
Flink supports out-of-order (event time) windows, arrival time windows (and mixtures of the two), plus low-latency processing.
First burst of events
Second burst of events
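To make the difference between event time and arrival time windows concrete, the sketch below groups the same readings into tumbling windows twice, once by each clock. It is plain Scala over a finite collection and deliberately ignores watermarks and late data, which a real streaming system has to handle. `TimeWindows`, `Reading`, and the timestamps are all made up for this illustration.

```scala
// Sketch only: batch-style grouping to contrast the two notions of time. Names are illustrative.
object TimeWindows {
  final case class Reading(eventTimeMs: Long, arrivalTimeMs: Long, value: Long)

  /** Sum values per tumbling window (keyed by window start) using the chosen clock. */
  def sumPerWindow(readings: Seq[Reading], sizeMs: Long)(clock: Reading => Long): Map[Long, Long] =
    readings
      .groupBy(r => clock(r) - Math.floorMod(clock(r), sizeMs))
      .map { case (start, rs) => start -> rs.map(_.value).sum }
}

object TimeWindowsExample {
  import TimeWindows._
  // The first burst happened early but arrived late (e.g. the device was offline).
  val readings: Seq[Reading] = Seq(
    Reading(eventTimeMs = 1000L, arrivalTimeMs = 21000L, value = 1L),
    Reading(eventTimeMs = 2000L, arrivalTimeMs = 22000L, value = 1L),
    Reading(eventTimeMs = 11000L, arrivalTimeMs = 12000L, value = 1L)
  )
  val byEventTime: Map[Long, Long]   = sumPerWindow(readings, 10000L)(_.eventTimeMs)
  val byArrivalTime: Map[Long, Long] = sumPerWindow(readings, 10000L)(_.arrivalTimeMs)
}
```

Grouping by event time puts the delayed burst back into the window in which it actually happened, whereas grouping by arrival time attributes it to a later window.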
34. High Availability and Consistency
No single point of failure any more
Exactly-once processing semantics across the pipeline
Checkpoints/fault tolerance are decoupled from windows
→ Allows for highly flexible window implementations
[Diagram labels: ZooKeeper ensemble, multiple masters, failover]
36. Batch and Streaming
case class WordCount(word: String, count: Int)

val text: DataStream[String] = …;

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .keyBy("word")
  .window(GlobalWindows.create())
  .trigger(new EOFTrigger())
  .sum("count")

Batch Word Count in the DataStream API
37. Batch and Streaming
Batch Word Count in the DataSet API

case class WordCount(word: String, count: Int)

val text: DataStream[String] = …;

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .keyBy("word")
  .window(GlobalWindows.create())
  .trigger(new EOFTrigger())
  .sum("count")

val text: DataSet[String] = …;

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .groupBy("word")
  .sum("count")
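The idea behind the two word counts, that a batch job is a streaming job over a finite input whose single global window fires at end of input, can be checked on ordinary collections. The following is a minimal sketch in plain Scala with no Flink dependency; `WordCounts`, `batch`, and `streamingUntilEof` are hypothetical names, and the fold merely stands in for a global window with an end-of-input trigger.

```scala
// Sketch only: contrasts the two styles on in-memory data. Names are illustrative.
object WordCounts {
  /** Batch style: materialize all words, group them, then count each group. */
  def batch(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  /** Streaming style: update a running count per word, emit the result once at end of input. */
  def streamingUntilEof(lines: Iterator[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").toSeq)
      .filter(_.nonEmpty)
      .foldLeft(Map.empty[String, Int]) { (counts, word) =>
        counts.updated(word, counts.getOrElse(word, 0) + 1)
      }
}
```

Both functions return the same counts for the same finite input; the streaming variant just never needs the whole input in memory at once.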
38. Batch and Streaming
DataSet: Relational Optimizer
DataStream: Window Optimization

Batch Parameters
• Pipelined and blocking operators
• Schedule lazily
• Recompute whole operators
• Fully buffered streams

Streaming Dataflow Runtime
• Streaming data movement
• Stateful operations
• DAG recovery
• DAG resource management

Streaming Parameters
• Pipelined and windowed operators
• Schedule eagerly
• Periodic checkpoints
39. Batch and Streaming
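A full-fledged batch processor as well
[Stack diagram, top to bottom]
Libraries and compatibility layers: Gelly, Table, FlinkML, SAMOA, Hadoop M/R, Dataflow, Cascading, Storm (Table and Dataflow each appear twice)
APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
Runtime: Flink dataflow engine
Deployment: Local, Remote, Yarn, Tez, Embedded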
40. Batch and Streaming
A full-fledged batch processor as well
[Stack diagram, top to bottom]
Libraries and compatibility layers: Gelly, Table, FlinkML, SAMOA, Hadoop M/R, Dataflow, Cascading, Storm (Table and Dataflow each appear twice)
APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
Runtime: Flink dataflow engine
Deployment: Local, Remote, Yarn, Tez, Embedded
More details at Dongwon Kim's talk "A comparative performance evaluation of Flink"
42. Monitoring
Live system metrics and user-defined accumulators/statistics
Monitoring REST API for custom monitoring tools

GET http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators

{
  "id": "dceafe2df1f57a1206fcb907cb38ad97",
  "user-accumulators": [
    { "name": "avglen", "type": "DoubleCounter", "value": "123.03259440000001" },
    { "name": "genwords", "type": "LongCounter", "value": "75000000" }
  ]
}
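A custom monitoring tool could poll this endpoint with any HTTP client. The sketch below uses the JDK's `java.net.http` client (Java 11 or later) from Scala. The host `flink-m`, port 8081, and URL path are taken from the slide; `AccumulatorClient` and its method names are assumptions for illustration, and the response is returned as raw JSON rather than parsed.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch only: a hypothetical helper, not part of Flink.
object AccumulatorClient {
  /** Build the accumulators URL for a job, following the path shown on the slide. */
  def accumulatorsUrl(host: String, port: Int, jobId: String): String =
    s"http://$host:$port/jobs/$jobId/accumulators"

  /** Issue a blocking GET and return the raw JSON body. */
  def fetch(url: String): String = {
    val client  = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create(url)).GET().build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }
}
```

Calling `AccumulatorClient.fetch(AccumulatorClient.accumulatorsUrl("flink-m", 8081, jobId))` against a running job manager would be expected to return a JSON document shaped like the one above.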
43. Flink 0.10 Summary
§ Focus on operational readiness
• high availability
• monitoring
• integration with other systems
§ First-class support for event time
§ Refined DataStream API: easy and powerful
45. Towards Flink 1.0
§ Flink 1.0 is around the corner
§ Focus on defining public APIs and automatic API compatibility checks
§ Guarantee backwards compatibility in all Flink 1.X versions
46. Beyond Flink 1.0
§ Flink engine has most features in place
§ Focus on usability features on top of the DataStream API
• e.g., SQL, ML, more connectors
§ Continue work on elasticity and memory management
47. 47
Enjoy the rest of the first conference on Apache Flink