Community Update &
Roadmap 2016
Robert Metzger
@rmetzger_
rmetzger@apache.org
Berlin Apache Flink Meetup,
January 26, 2016
January Community Update
What happened in the last month
2
What happened?
3
 Google proposed the Dataflow API to the Apache Incubator
 Proposal discussions on the mailing list:
• SQL / Stream SQL support
• CEP (Complex Event Processing) library
 Flink Kinesis Connector
 Chengxiang Li added as committer
 Discussions for releasing 1.0.0
Now merged into master (1.0-SNAPSHOT)
4
 Savepoints: Manual checkpoints for restarting jobs with state
 Kafka 0.9.0.0 integration (see the consumer sketch below)
 Job submission through the JobManager web interface
 Checkpoint statistics in the JobManager web interface
 Streaming examples are now in the binary distribution
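To make the Kafka 0.9.0.0 integration concrete, here is a minimal sketch of consuming a topic with the 0.9 connector; the topic name, broker address, and group id are placeholders, and package names are given as they were around the 1.0 release.

// Minimal sketch: reading a Kafka 0.9 topic with the new connector.
// "my-topic", the broker address, and the group id are placeholders.
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object Kafka09Example {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "flink-meetup-demo")

    // FlinkKafkaConsumer09 targets the new Kafka 0.9 consumer API
    val stream: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer09[String]("my-topic", new SimpleStringSchema(), props))

    stream.print()
    env.execute("Kafka 0.9 example")
  }
}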
Reading List
 Benchmarking Streaming Computation Engines at Yahoo! [1]
 Receiving metrics from Apache Flink applications [2]
 Running Apache Flink on Amazon Elastic MapReduce [3]
5
1. http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
2. http://mnxfst.tumblr.com/post/136539620407/receiving-metrics-from-apache-flink-applications
3. http://themodernlife.github.io/scala/hadoop/hdfs/sclading/flink/streaming/realtime/emr/aws/2016/01/06/running-apache-flink-on-amazon-elastic-mapreduce/
Upcoming talks
 FOSDEM Brussels (4 talks) (Jan 30-31)
 Big Data Technology Summit Warsaw (Feb. 25-26)
 QCon London (March 7-9)
 Hadoop Summit Dublin (2 talks) (April 13-14)
 Strata San Jose
 Strata London
6
Global Meetup Community
 Brazil-Sao Paulo Apache Flink Meetup
 Apache Flink Taiwan User Group
 Also new groups in Delhi, Phoenix, and Dallas
7
GitHub stats
8
 900 Stars
Roadmap 2016
What's next?
9
Overview
10
 SQL / StreamSQL
 CEP Library
 Managed Operator State
 Dynamic Scaling
 Miscellaneous
SQL and StreamSQL
11
SQL / StreamSQL
12
 Structured queries over data sets and streams
 Add support for SQL
• Standard SQL queries over (batch) data sets
• Continuous StreamSQL queries over data streams
 Keep and extend the Table API as a structured query API on data sets and streams
Proposed Architecture
13
[Architecture diagram: Table API programs, (batch) SQL queries, and StreamSQL queries are handed to Apache Calcite, which provides a standard SQL parser plus a customized StreamSQL parser and an optimizer; the resulting logical plan is translated into either a DataSet program or a DataStream program, keeping the APIs separate from the internals.]
SQL integration into APIs
14
val stream: DataStream[(String, Double, Int)] =
  env.addSource(new FlinkKafkaConsumer(...))

val tabEnv = new TableEnvironment(env)
tabEnv.registerStream(stream, "myStream",
  ("ID", "MEASURE", "COUNT"))

val sqlQuery = tabEnv.sql(
  "SELECT ID, MEASURE FROM myStream WHERE COUNT > 17")
 Define Kafka input stream
 Define table environment
 SQL Query
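For the batch side of the proposal, a query over a DataSet might look like the sketch below; note that registerDataSet and the TableEnvironment signatures are assumptions modeled on the stream example above, not a released API, and the file path is a placeholder.

// Hypothetical batch-side counterpart of the proposed API; registerDataSet and
// the TableEnvironment signatures are assumptions, not a released API.
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val data: DataSet[(String, Double, Int)] =
  env.readCsvFile[(String, Double, Int)]("hdfs:///path/to/measurements.csv")

val tabEnv = new TableEnvironment(env)
tabEnv.registerDataSet(data, "myTable", ("ID", "MEASURE", "COUNT"))

// the same standard SQL, evaluated once over the finite data set
val result = tabEnv.sql("SELECT ID, MEASURE FROM myTable WHERE COUNT > 17")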
Complex Event Processing
15
CEP Library
 Complex Event Processing: the analysis of complex patterns, such as correlations and sequence detection, across events from multiple sources
 Most current systems are not distributed (beyond multi-threading)
 Goal: provide an easy-to-use API for CEP, running on a distributed, high-throughput, low-latency engine
16
CEP Example
17
[Diagram: real-time stock prices (15.1, 15.3, 15.2, 15.5) feed a state machine that emits alerts; from the Start state, a price drop of at least $0.5 leads to the Alert state, otherwise the event is ignored.]
Programming API for CEP
CEPStream<Event> cepStream = CEP.from(inputDataStream);

// grouping
GroupedCEPStream<Event> grouped = cepStream.groupBy("id");

// windows
WindowedCEPStream<Event> windowed = grouped.timeWindow(Time.minutes(10), Time.minutes(1));
WindowedCEPStream<Event> counted  = grouped.countWindow(10L, 1L);

// pattern matching
CEPStream<Result> resultStream = CEP.from(input).groupBy(0).pattern(
    Pattern.<Event>next("e1").where( (evt) -> evt.id == 42 )
        .followedBy("e2").where( (evt) -> evt.id == 1337 )
        .within(Time.minutes(10))
).select( (Map<String, Event> patternElements) ->
    new Result(patternElements.get("e2").timestamp -
               patternElements.get("e1").timestamp) );
 Convert the stream into a CEPStream of Events
 Window the events
 Define a pattern to match
DSL for CEP
select e1.id, e1.price
from every e1 = Event(price > 10) → e2 = Event(date == 42) → e3 = Event(price == 10)
within 10 seconds
where e1.id == e2.id
19
 No programming required
 Potentially integrated with SQL
Managed Operator State
20
State in Flink
21
[Diagram: an operator "count tweet impressions" whose user function retrieves and sets the count for each tweet id in local operator state (the impression counts).]
State in Flink
22
[Same diagram as above.]
What happens if the job crashes? Loss of data.
Solution: Checkpoints
23
[Diagram: the same operator now takes periodic checkpoints of its state to HDFS and restores the state from HDFS in case of failure.]
Solution: Checkpoints
24
[Same diagram as above: periodic checkpoints of state to HDFS, restore from HDFS in case of failure.]
This is the current state in Flink!
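To make the current mechanism concrete, below is a minimal sketch of the impression-count example using Flink's keyed ValueState with periodic checkpointing enabled; the class names, checkpoint interval, and placeholder source are illustrative, not taken from the slides.

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Keeps the impression count per tweet id in Flink-managed keyed state, so it is
// included in every checkpoint and restored automatically after a failure.
class CountImpressions extends RichFlatMapFunction[(String, Long), (String, Long)] {
  private var count: ValueState[Long] = _

  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[Long]("impression-count", createTypeInformation[Long]))
  }

  override def flatMap(impression: (String, Long), out: Collector[(String, Long)]): Unit = {
    val newCount = count.value() + 1   // value() yields 0 the first time a key is seen
    count.update(newCount)             // set the updated count for this tweet id
    out.collect((impression._1, newCount))
  }
}

object CheckpointedCounts {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)      // checkpoint all operator state every 5 seconds

    // placeholder source of (tweetId, timestamp) impression events
    val impressions: DataStream[(String, Long)] =
      env.fromElements(("tweet-1", 1L), ("tweet-2", 2L), ("tweet-1", 3L))

    impressions.keyBy(_._1).flatMap(new CountImpressions).print()
    env.execute("checkpointed impression counts")
  }
}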
State on Steroids
25
[Diagram: the same "count tweet impressions" operator with its impression counts in local state.]
State on Steroids
26
What if the state grows too big?
[Diagram: the operator's state spills to disk; snapshots are taken asynchronously/incrementally, and state is restored from HDFS in case of failure.]
State on Steroids
27
[Same diagram: the operator's state spills to disk.]
State on Steroids
28
What if the state grows too big? Checkpointing stalls processing!
[Same diagram: spill to disk, async/incremental snapshots, restore from HDFS in case of failure.]
State on Steroids
29
[Same diagram: spill to disk, async/incremental snapshots, restore from HDFS in case of failure.]
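One concrete way to hold state larger than memory, assuming the RocksDB state backend that was being added around this time, is sketched below; the checkpoint URI is a placeholder, and asynchronous/incremental snapshots arrived only in later releases.

// Sketch: keep keyed state in an embedded RocksDB instance on local disk
// ("spill to disk") and checkpoint it to a durable filesystem.
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

object LargeStateJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // the HDFS URI is a placeholder for any durable checkpoint location
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))
    env.enableCheckpointing(10000)

    // tiny placeholder pipeline; real jobs define their stateful operators here
    env.fromElements("a", "b", "a").map(_.toUpperCase).print()
    env.execute("large-state job")
  }
}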
Dealing with Dynamic Resources
30
Streams with varying data rate
31
[Plot: events/second over time; with static resources you provision for the maximum rate, leaving idle capacity the rest of the time.]
(1) Adjust Parallelism
32
[Diagram: an initial configuration scales out under load and scales back in to save resources.]
(1) Adjust Parallelism
 Adjusting parallelism without (significantly) interrupting the program
 Initial version:
• Checkpoint -> stop -> restart-with-different-parallelism
 Stateless operators: Trivial
 Stateful operators: Repartition state
• Transparent for key/value state and windows
• Consistent hashing simplifies state reorganization
33
(2) Dynamic Worker Pool
34
[Diagram: the JobManager requests TaskManagers from a ResourceManager that draws on a pool of cluster resources (YARN, Mesos, …), so workers can be allocated dynamically.]
Miscellaneous
 Support for Apache Mesos
 Security
• Over-the-wire encryption of RPC (Akka) and data transfers (Netty)
 More connectors
• Apache Cassandra
• Amazon Kinesis
 Enhance metrics
• Throughput / Latencies
• Backpressure monitoring
• Spilling / Out of Core
35

