Introduction to Apache Flink™
Maximilian Michels, mxm@apache.org, @stadtlegende
Ufuk Celebi, uce@apache.org, @iamuce
EIT ICT Summer School 2015
About Us
 Ufuk Celebi <uce@apache.org>
 Maximilian Michels <mxm@apache.org>
 Apache Flink Committers
 Working full-time on Apache Flink at dataArtisans
2
Structure
Goal: Get you started with Apache Flink
 Overview
 Internals
 APIs
 Exercise
 Demo
 Advanced features
3
Exercises
 We will ask you to do an exercise using Flink
 Please check out the website for this session:
http://dataArtisans.github.io/eit-summerschool-15
The website also contains a solution, which we will present later on; first, try to solve the exercise yourself.
4
1 year of Flink - code
[Charts: the code base in April 2014 vs. April 2015]
Flink Community
[Chart: #unique contributor ids by git commits, Aug 2010 to Jul 2015]
In top 5 of Apache's big data projects after one year in the Apache Software Foundation
The Apache Way
 Independent, non-profit organization
 Community-driven open source software development approach
 Consensus-based decision making
 Public communication and open to new contributors
7
Integration into the Big Data Stack
8
9
A stream processor with many applications
Streaming dataflow runtime
Apache Flink
What can I do with Flink?
10
Basic API Concept
How do I write a Flink program?
1. Bootstrap sources
2. Apply operations
3. Output to a sink
11
[Diagram: Source → Data Stream → Operation → Data Stream → Sink, and Source → Data Set → Operation → Data Set → Sink]
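A minimal sketch of these three steps with the DataSet API (the file paths and the uppercasing operation are placeholders, not part of the original example):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ThreeSteps {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // 1. Bootstrap a source
    DataSet<String> lines = env.readTextFile("/path/to/input");

    // 2. Apply an operation
    DataSet<String> upper = lines.map(new MapFunction<String, String>() {
      @Override
      public String map(String value) {
        return value.toUpperCase();
      }
    });

    // 3. Output to a sink
    upper.writeAsText("/path/to/output");
    env.execute("Three steps");
  }
}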
Stream & Batch Processing
 DataStream API
 DataSet API
12
[Streaming example: a stock feed of (Name, Price) events such as Microsoft 124, Google 516, Apple 235; the stream is used to alert if Microsoft > 120, write each event to a database, sum the prices every 10 seconds, and alert if the sum > 10000]
[Batch example: finite tables of numbers processed with Map and Reduce operators]
Streaming & Batch
13
                 Batch                   Streaming
Input            finite                  infinite
Data transfer    blocking or pipelined   pipelined
Latency          high                    low
Scaling out
14
[Diagram: the Source → Data Set → Operation → Data Set → Sink pipeline replicated across many parallel instances]
Scaling up
15
Flink’s Architecture
16
Architecture Overview
 Client
 Master (Job Manager)
 Worker (Task Manager)
17
[Diagram: the Client submits to the Job Manager, which coordinates several Task Managers]
Client
 Optimize
 Construct job graph
 Pass job graph to job manager
 Retrieve job results
18
[Diagram: the Client runs the optimizer and type extraction on the program below and ships the resulting dataflow plan to the Job Manager]

case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) =>
      Path(path.from, edge.to)
    }
    .union(paths)
    .distinct()
  next
}
Job Manager
 Parallelization: Create Execution Graph
 Scheduling: Assign tasks to task managers
 State tracking: Supervise the execution
19
[Diagram: the Job Manager turns the optimized dataflow into an Execution Graph and assigns its parallel tasks to the Task Managers]
Task Manager
 Operations are split up into tasks depending on the specified parallelism (see the sketch below)
 Each parallel instance of an operation runs in a separate task slot
 The scheduler may run several tasks from different operators in one task slot
20
[Diagram: Task Managers, each offering one or more task slots]
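As a hedged sketch (not from the slides), this is roughly how the parallelism that drives this task splitting is declared in a program; the numbers and the output path are illustrative:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ParallelismSketch {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(4); // default: every operator is split into 4 parallel tasks

    DataSet<Long> numbers = env.generateSequence(1, 1000);
    numbers
        .filter(new FilterFunction<Long>() {
          @Override
          public boolean filter(Long value) {
            return value % 2 == 0;
          }
        })
        .setParallelism(2) // this one operator runs with 2 parallel tasks instead
        .writeAsText("/path/to/output");

    env.execute("Parallelism sketch");
  }
}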
From Program to Execution
21
[Diagram: the transitive-closure program shown on the Client slide goes through pre-flight on the Client (optimizer, type extraction), the Job Manager handles task scheduling of the dataflow graph, and the Task Managers deploy operators and track intermediate results]
Flink's execution model
22
Flink execution model
 A program is a graph (DAG) of operators
 Operators = computation + state
 Operators produce intermediate results = logical streams of records
 Other operators can consume those
23
[Diagram: map and join operators feed a sum operator via intermediate results ID1, ID2, ID3]
24
A map-reduce job with Flink
[Diagram: the Execution Graph in the JobManager; map tasks M1 and M2 produce "blocked" result partitions RP1 and RP2, which reduce tasks R1 and R2 consume across TaskManager 1 and TaskManager 2]
25
Execution Graph
[Diagram: the same Execution Graph, but for streaming the result partitions are "pipelined" instead of blocked]
Non-native streaming
26
[Diagram: a stream discretizer chops the incoming stream into chunks and issues one batch job per chunk]

while (true) {
  // get next few records
  // issue batch job
}
Batch on Streaming
 Batch programs are a special kind of streaming program
27

Streaming programs: infinite streams, stream windows, pipelined data exchange
Batch programs: finite streams, global view, pipelined or blocking exchange
Stream processor applications
28
[Diagram: stream processing, batch processing, machine learning at scale, and graph analysis all running on the same stream processor]
Demo
29
Introduction to the DataSet API
30
Flink’s APIs
31
[Diagram: the DataSet (Java/Scala) and DataStream (Java/Scala) APIs sit on top of the streaming dataflow runtime]
API Preview
32
DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
WordCount: main method
public static void main(String[] args) throws Exception {
  // set up the execution environment
  final ExecutionEnvironment env =
      ExecutionEnvironment.getExecutionEnvironment();

  // get input data either from file or use example data
  DataSet<String> inputText = env.readTextFile(args[0]);

  DataSet<Tuple2<String, Integer>> counts =
      // split up the lines in tuples containing: (word, 1)
      inputText.flatMap(new Tokenizer())
      // group by the tuple field "0"
      .groupBy(0)
      // sum up tuple field "1"
      .reduceGroup(new SumWords());

  // emit result
  counts.writeAsCsv(args[1], "\n", " ");

  // execute program
  env.execute("WordCount Example");
}
33
The same listing, walked through piece by piece:
 Execution Environment - ExecutionEnvironment.getExecutionEnvironment()
 Data Sources - env.readTextFile(args[0])
 Data types - DataSet<String>, DataSet<Tuple2<String, Integer>>
 Transformations - flatMap(...), groupBy(0), reduceGroup(...)
 User functions - new Tokenizer(), new SumWords()
 DataSinks - counts.writeAsCsv(args[1], "\n", " ")
 Execute! - env.execute("WordCount Example")
34-40
WordCount: Map
public static class Tokenizer
    implements FlatMapFunction<String, Tuple2<String, Integer>> {

  @Override
  public void flatMap(String value,
                      Collector<Tuple2<String, Integer>> out) {
    // normalize and split the line
    String[] tokens = value.toLowerCase().split("\\W+");

    // emit the pairs
    for (String token : tokens) {
      if (token.length() > 0) {
        out.collect(new Tuple2<String, Integer>(token, 1));
      }
    }
  }
}
41
The same Tokenizer, walked through:
 Interface - implements FlatMapFunction<String, Tuple2<String, Integer>>
 Types - input String, output Tuple2<String, Integer>
 Collector - results are emitted via out.collect(new Tuple2<String, Integer>(token, 1))
42-44
WordCount: Reduce
public static class SumWords implements
    GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> {

  @Override
  public void reduce(Iterable<Tuple2<String, Integer>> values,
                     Collector<Tuple2<String, Integer>> out) {
    int count = 0;
    String word = null;
    for (Tuple2<String, Integer> tuple : values) {
      word = tuple.f0;
      count++;
    }
    out.collect(new Tuple2<String, Integer>(word, count));
  }
}
45
The same SumWords, walked through:
 Interface - implements GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>
 Types - input and output are both Tuple2<String, Integer>
 Collector - one (word, count) pair is emitted per group via out.collect(...)
46-48
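As the API preview already hinted, a grouped count like this does not strictly need a custom GroupReduceFunction; a sketch of the same aggregation with the built-in sum, reusing the inputText and Tokenizer names from the WordCount listing:

// group by the word (tuple field 0) and sum the pre-initialized counts (field 1)
DataSet<Tuple2<String, Integer>> counts =
    inputText.flatMap(new Tokenizer())
             .groupBy(0)
             .sum(1);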
DataSet API Concepts
49
Data Types
 Basic Java Types
• String, Long, Integer, Boolean,…
• Arrays
 Composite Types
• Tuple
• POJO (plain old Java objects)
• Custom type
50
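A custom type only has to follow Flink's POJO rules (public class, public no-argument constructor, fields that are public or reachable via getters/setters); a minimal sketch with made-up field names:

public class Person {
  public String name; // public fields (or getter/setter pairs) are analyzed by Flink
  public int age;

  public Person() {} // no-argument constructor is required

  public Person(String name, int age) {
    this.name = name;
    this.age = age;
  }
}

// usage sketch: POJO fields can then be referenced by name
// DataSet<Person> people = env.fromElements(new Person("Anna", 18), new Person("Julia", 27));
// people.groupBy("age");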
Tuples
 The easiest, most lightweight, and generic way of encapsulating data in Flink
 Tuple1 up to Tuple25

Tuple3<String, String, Integer> person =
    new Tuple3<>("Max", "Mustermann", 42);

// zero-based index!
String firstName = person.f0;
String secondName = person.f1;
Integer age = person.f2;
51
Transformations: Map
DataSet<Integer> integers = env.fromElements(1, 2, 3, 4);
// Regular Map - Takes one element and produces one element
DataSet<Integer> doubleIntegers =
integers.map(new MapFunction<Integer, Integer>() {
@Override
public Integer map(Integer value) {
return value * 2;
}
});
doubleIntegers.print();
> 2, 4, 6, 8
// Flat Map - Takes one element and produces zero, one, or more elements.
DataSet<Integer> doubleIntegers2 =
integers.flatMap(new FlatMapFunction<Integer, Integer>() {
@Override
public void flatMap(Integer value, Collector<Integer> out) {
out.collect(value * 2);
}
});
doubleIntegers2.print();
> 2, 4, 6, 8
52
Transformations: Filter
// The DataSet
DataSet<Integer> integers = env.fromElements(1, 2, 3, 4);
DataSet<Integer> filtered =
integers.filter(new FilterFunction<Integer>() {
@Override
public boolean filter(Integer value) {
return value != 3;
}
});
filtered.print();
> 1, 2, 4
53
Groupings and Reduce
// (name, age) of employees
DataSet<Tuple2<String, Integer>> employees = …

// group by second field (age)
DataSet<Tuple2<Integer, Integer>> grouped = employees.groupBy(1)
    // return a list of age groups with its counts
    .reduceGroup(new CountSameAge());
54

• DataSets can be split into groups
• Groups are defined using a common key

Name     Age
Stephan  18
Fabian   23
Julia    27
Romeo    27
Anna     18

AgeGroup  Count
18        2
23        1
27        2
GroupReduce
public static class CountSameAge implements
    GroupReduceFunction<Tuple2<String, Integer>, Tuple2<Integer, Integer>> {

  @Override
  public void reduce(Iterable<Tuple2<String, Integer>> values,
                     Collector<Tuple2<Integer, Integer>> out) {
    Integer ageGroup = 0;
    Integer countsInGroup = 0;
    for (Tuple2<String, Integer> person : values) {
      ageGroup = person.f1;
      countsInGroup++;
    }
    out.collect(new Tuple2<Integer, Integer>(ageGroup, countsInGroup));
  }
}
55
Joining two DataSets
// authors (id, name, email)
DataSet<Tuple3<Integer, String, String>> authors = ..;
// posts (title, content, author_id)
DataSet<Tuple3<String, String, Integer>> posts = ..;
DataSet<Tuple2<
Tuple3<Integer, String, String>,
Tuple3<String, String, Integer>
>> archive = authors.join(posts).where(0).equalTo(2);
56
Authors:
Id  Name    Email
1   Fabian  fabian@..
2   Julia   julia@...
3   Max     max@...
4   Romeo   romeo@.

Posts:
Title  Content  Author id
…      …        2
..     ..       4
..     ..       4
..     ..       1
..     ..       2
Joining two DataSets
// authors (id, name, email)
DataSet<Tuple3<Integer, String, String>> authors = ..;
// posts (title, content, author_id)
DataSet<Tuple3<String, String, Integer>> posts = ..;
DataSet<Tuple2<
Tuple3<Integer, String, String>,
Tuple3<String, String, Integer>
>> archive = authors.join(posts).where(0).equalTo(2);
57
Archive:
Id  Name    Email      Title  Content  Author id
1   Fabian  fabian@..  ..     ..       1
2   Julia   julia@...  ..     ..       2
2   Julia   julia@...  ..     ..       2
4   Romeo   romeo@...  ..     ..       4
4   Romeo   romeo@.    ..     ..       4
Join with join function
// authors (id, name, email)
DataSet<Tuple3<Integer, String, String>> authors = ..;
// posts (title, content, author_id)
DataSet<Tuple3<String, String, Integer>> posts = ..;

// (author name, title)
DataSet<Tuple2<String, String>> archive =
    authors.join(posts).where(0).equalTo(2)
           .with(new PostsByUser());

public static class PostsByUser implements
    JoinFunction<Tuple3<Integer, String, String>,
                 Tuple3<String, String, Integer>,
                 Tuple2<String, String>> {
  @Override
  public Tuple2<String, String> join(
      Tuple3<Integer, String, String> left,
      Tuple3<String, String, Integer> right) {
    return new Tuple2<String, String>(left.f1, right.f0);
  }
}
58
Archive:
Name    Title
Fabian  ..
Julia   ..
Julia   ..
Romeo   ..
Romeo   ..
Data Sources / Data Sinks
59
Data Sources
Text
 readTextFile("/path/to/file")
CSV
 readCsvFile("/path/to/file")
Collection
 fromCollection(collection)
 fromElements(1, 2, 3, 4, 5)
60
Data Sources: Collections
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// read from elements
DataSet<String> names = env.fromElements("Some", "Example", "Strings");

// read from Java collection
List<String> list = new ArrayList<String>();
list.add("Some");
list.add("Example");
list.add("Strings");
DataSet<String> namesFromList = env.fromCollection(list);
61
Data Sources: File-Based
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// read text file from local or distributed file system
DataSet<String> localLines =
    env.readTextFile("/path/to/my/textfile");

// read a CSV file with three fields
DataSet<Tuple3<Integer, String, Double>> csvInput =
    env.readCsvFile("/the/CSV/file")
       .types(Integer.class, String.class, Double.class);

// read a CSV file with five fields, taking only two of them
DataSet<Tuple2<String, Double>> csvInput2 =
    env.readCsvFile("/the/CSV/file")
       // take the first and the fourth field
       .includeFields("10010")
       .types(String.class, Double.class);
62
Data Sinks
Text
 writeAsText("/path/to/file")
 writeAsFormattedText("/path/to/file", formatFunction)
CSV
 writeAsCsv("/path/to/file")
Return data to the Client
 print()
 collect()
 count()
63
Data Sinks (lazy)
 Lazily executed when env.execute() is called
DataSet<…> result;

// write DataSet to a file on the local file system
result.writeAsText("/path/to/file");

// write DataSet to a file and overwrite the file if it exists
result.writeAsText("/path/to/file", FileSystem.WriteMode.OVERWRITE);

// tuples as lines with pipe as the separator "a|b|c"
result.writeAsCsv("/path/to/file", "\n", "|");

// this writes values as strings using a user-defined TextFormatter object
result.writeAsFormattedText("/path/to/file",
    new TextFormatter<Tuple2<Integer, Integer>>() {
      public String format(Tuple2<Integer, Integer> value) {
        return value.f1 + " - " + value.f0;
      }
    });
64
Data Sinks (eager)
 Eagerly executed
DataSet<Tuple2<String, Integer>> result;

// print
result.print();

// count
long numberOfElements = result.count();

// collect
List<Tuple2<String, Integer>> materializedResults = result.collect();
65
Execution Setups
66
Ways to Run a Flink Program
67
[Diagram: the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs run on the streaming dataflow runtime, which can execute Local, Remote, on YARN, on Tez, or Embedded]
Local Execution
 Starts a local Flink cluster
 All processes run in the same JVM
 Behaves just like a regular cluster
 Very useful for developing and debugging
68
[Diagram: a Job Manager and several Task Managers all running inside one JVM]
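Besides getExecutionEnvironment(), which already falls back to a local environment when the program is started from the IDE, a local environment can be created explicitly; a small sketch (the parallelism value is illustrative):

import org.apache.flink.api.java.ExecutionEnvironment;

// explicit local environment, e.g. for tests; everything runs inside this JVM
ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(2);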
Remote Execution
 The cluster mode
 Submits a job remotely
 Monitors the status of the job
70
[Diagram: the Client submits the job to the Job Manager of a cluster of Task Managers]
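A program can also be pointed at a running cluster directly by creating a remote environment; a sketch where host, port, and jar path are placeholders (6123 is the default Job Manager port):

import org.apache.flink.api.java.ExecutionEnvironment;

// connect to the Job Manager of a running cluster and ship the program jar
ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
    "jobmanager-host", 6123, "/path/to/program.jar");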
YARN Execution
 Multi-user scenario
 Resource sharing
 Uses YARN containers to run a Flink cluster
 Very easy to set up Flink
71
[Diagram: the Client talks to the YARN Resource Manager; Node Managers host the Flink Job Manager and Task Managers in containers, next to other applications]
Exercises and Hands-on
http://dataArtisans.github.io/eit-summerschool-15
73
Closing
74
tl;dr: What was this about?
 The case for Flink
• Low latency
• High throughput
• Fault-tolerant
• Easy to use APIs, library ecosystem
• Growing community
 A stream processor that is great for batch analytics as well
75
I ♥ Flink, do you?
76
 Get involved and start a discussion on Flink's mailing list
 { user, dev }@flink.apache.org
 Subscribe to news@flink.apache.org
 Follow flink.apache.org/blog and @ApacheFlink on Twitter
77
flink-forward.org
October 12-13, 2015
Call for papers deadline: August 14, 2015
Discount code: FlinkEITSummerSchool25
Thank you for listening!
78

More Related Content

What's hot

Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)KafkaZone
 
Advanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applicationsAdvanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applicationsAljoscha Krettek
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapKostas Tzoumas
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...HostedbyConfluent
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkVasia Kalavri
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream ProcessingSuneel Marthi
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Apache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-PatternApache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-Patternconfluent
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 

What's hot (20)

Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
 
Advanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applicationsAdvanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applications
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache Flink
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Apache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-PatternApache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-Pattern
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 

Similar to Introduction to Apache Flink

Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsStephan Ewen
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Databricks
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CARobert Metzger
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
(2) c sharp introduction_basics_part_i
(2) c sharp introduction_basics_part_i(2) c sharp introduction_basics_part_i
(2) c sharp introduction_basics_part_iNico Ludwig
 
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Robert Metzger
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark Anyscale
 
Data Analysis With Apache Flink
Data Analysis With Apache FlinkData Analysis With Apache Flink
Data Analysis With Apache FlinkDataWorks Summit
 
Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Aljoscha Krettek
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Robert Metzger
 
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...confluent
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 
A Brief Conceptual Introduction to Functional Java 8 and its API
A Brief Conceptual Introduction to Functional Java 8 and its APIA Brief Conceptual Introduction to Functional Java 8 and its API
A Brief Conceptual Introduction to Functional Java 8 and its APIJörn Guy Süß JGS
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonStephan Ewen
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 

Similar to Introduction to Apache Flink (20)

Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
(2) c sharp introduction_basics_part_i
(2) c sharp introduction_basics_part_i(2) c sharp introduction_basics_part_i
(2) c sharp introduction_basics_part_i
 
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
 
Data Analysis With Apache Flink
Data Analysis With Apache FlinkData Analysis With Apache Flink
Data Analysis With Apache Flink
 
Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)
 
Google cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache FlinkGoogle cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache Flink
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
A Brief Conceptual Introduction to Functional Java 8 and its API
A Brief Conceptual Introduction to Functional Java 8 and its APIA Brief Conceptual Introduction to Functional Java 8 and its API
A Brief Conceptual Introduction to Functional Java 8 and its API
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 

Recently uploaded

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 

Introduction to Apache Flink

  • 1. Introduction to Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT Summer School 2015 Ufuk Celebi uce@apache.org @iamuce
  • 2. About Us  Ufuk Celebi <uce@apache.org>  Maximilian Michels <mxm@apache.org>  Apache Flink Committers  Working full-time on Apache Flink at dataArtisans 2
  • 3. Structure Goal: Get you started with Apache Flink  Overview  Internals  APIs  Exercise  Demo  Advanced features 3
  • 4. Exercises  We will ask you to do an exercise using Flink  Please check out the website for this session: http://dataArtisans.github.io/eit-summerschool-15  The website also contains a solution which we will present later on but, first, try to solve the exercise. 4
  • 5. 1 year of Flink - code April 2014 April 2015
  • 6. Flink Community 0 20 40 60 80 100 120 Aug-10 Feb-11 Sep-11 Apr-12 Oct-12 May-13 Nov-13 Jun-14 Dec-14 Jul-15 #unique contributor ids by git commits In top 5 of Apache's big data projects after one year in the Apache Software Foundation
  • 7. The Apache Way  Independent, non-profit organization  Community-driven open source software development approach  Consensus-based decision making  Public communication and open to new contributors 7
  • 8. Integration into the Big Data Stack 8
  • 9. 9 A stream processor with many applications Streaming dataflow runtime Apache Flink
  • 10. What can I do with Flink? 10
  • 11. Basic API Concept How do I write a Flink program? 1. Bootstrap sources 2. Apply operations 3. Output to source 11 Data Stream Operation Data Stream Source Sink Data Set Operation Data Set Source Sink
  • 12. Stream & Batch Processing  DataStream API 12 Stock Feed Name Price Microsoft 124 Google 516 Apple 235 … … Alert if Microsoft > 120 Write event to database Sum every 10 seconds Alert if sum > 10000 Microsoft 124 Google 516 Apple 235 Microsoft 124 Google 516 Apple 235 b h 2 1 3 5 7 4 … … Map Reduce a 1 2 …  DataSet API
  • 13. Streaming & Batch 13 Batch finite blocking or pipelined high Streaming infinite pipelined low Input Data transfer Latency
  • 14. Scaling out 14 Data Set Operation Data Set Source Sink Data Set Operation Data Set Source SinkData Set Operation Data Set Source SinkData Set Operation Data Set Source SinkData Set Operation Data Set Source SinkData Set Operation Data Set Source SinkData Set Operation Data Set Source SinkData Set Operation Data Set Source Sink
  • 17. Architecture Overview  Client  Master (Job Manager)  Worker (Task Manager) 17 Client Job Manager Task Manager Task Manager Task Manager
  • 18. Client  Optimize  Construct job graph  Pass job graph to job manager  Retrieve job results 18 Job Manager Client case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next } Optimizer Type extraction Data Source orders.tbl Filter Map DataSourc e lineitem.tbl Join Hybrid Hash build HT probe hash-part [0] hash-part [0] GroupRed sort forward
  • 19. Job Manager  Parallelization: Create Execution Graph  Scheduling: Assign tasks to task managers  State tracking: Supervise the execution 19 Job Manager Data Source orders.tbl Filter Map DataSour ce lineitem.tbl Join Hybrid Hash build HT prob e hash-part [0] hash-part [0] GroupRed sort forwar d Task Manager Task Manager Task Manager Task Manager Data Source orders.tbl Filter Map DataSour ce lineitem.tbl Join Hybrid Hash build HT prob e hash-part [0] hash-part [0] GroupRed sort forwar d Data Source orders.tbl Filter Map DataSour ce lineitem.tbl Join Hybrid Hash build HT prob e hash-part [0] hash-part [0] GroupRed sort forwar d Data Source orders.tbl Filter Map DataSour ce lineitem.tbl Join Hybrid Hash build HT prob e hash-part [0] hash-part [0] GroupRed sort forwar d Data Source orders.tbl Filter Map DataSour ce lineitem.tbl Join Hybrid Hash build HT prob e hash-part [0] hash-part [0] GroupRed sort forwar d
  • 20. Task Manager  Operations are split up into tasks depending on the specified parallelism  Each parallel instance of an operation runs in a separate task slot  The scheduler may run several tasks from different operators in one task slot 20 Task Manager S l o t Task ManagerTask Manager S l o t S l o t
  • 21. From Program to Execution 21 case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next } Optimizer Type extraction stack Task scheduling Dataflow metadata Pre-flight (Client) Job Manager Task Managers Data Source orders.tbl Filter Map DataSourc e lineitem.tbl Join Hybrid Hash build HT probe hash-part [0] hash-part [0] GroupRed sort forward Program Dataflow Graph deploy operators track intermediate results
  • 23. Flink execution model  A program is a graph (DAG) of operators  Operators = computation + state  Operators produce intermediate results = logical streams of records  Other operators can consume those 23 map join sum ID1 ID2 ID3
  • 24. 24 A map-reduce job with Flink ExecutionGraph JobManager TaskManager 1 TaskManager 2 M1 M2 RP1RP2 R1 R2 1 2 3a 3b 4a 4b 5a 5b "Blocked" result partition
  • 25. 25 Execution Graph Job Manager Task Manager 1 Task Manager 2 M1 M2 RP1RP2 R1 R2 1 2 3a 3b 4a 4b 5a 5b Streaming "Pipelined" result partition
  • 26. Non-native streaming 26 stream discretizer Job Job Job Job while (true) { // get next few records // issue batch job }
  • 27. Batch on Streaming  Batch programs are a special kind of streaming program 27 Infinite Streams Finite Streams Stream Windows Global View Pipelined Data Exchange Pipelined or Blocking Exchange Streaming Programs Batch Programs
  • 30. Introduction to the DataSet API 30
  • 31. Flink’s APIs 31 DataStream (Java/Scala) Streaming dataflow runtime DataSet (Java/Scala)
  • 32. API Preview 32 case class Word (word: String, frequency: Int) val lines: DataStream[String] = env.fromSocketStream(...) lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS)) .groupBy("word").sum("frequency") .print() val lines: DataSet[String] = env.readTextFile(...) lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .groupBy("word").sum("frequency") .print() DataSet API (batch): DataStream API (streaming):
  • 33. WordCount: main method public static void main(String[] args) throws Exception { // set up the execution environment final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // get input data either from file or use example data DataSet<String> inputText = env.readTextFile(args[0]); DataSet<Tuple2<String, Integer>> counts = // split up the lines into tuples containing: (word,1) inputText.flatMap(new Tokenizer()) // group by the tuple field "0" .groupBy(0) // sum up tuple field "1" .reduceGroup(new SumWords()); // emit result counts.writeAsCsv(args[1], "\n", " "); // execute program env.execute("WordCount Example"); } 33
  • 34. Execution Environment public static void main(String[] args) throws Exception { // set up the execution environment final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // get input data either from file or use example data DataSet<String> inputText = env.readTextFile(args[0]); DataSet<Tuple2<String, Integer>> counts = // split up the lines into tuples containing: (word,1) inputText.flatMap(new Tokenizer()) // group by the tuple field "0" .groupBy(0) // sum up tuple field "1" .reduceGroup(new SumWords()); // emit result counts.writeAsCsv(args[1], "\n", " "); // execute program env.execute("WordCount Example"); } 34
  • 35. Data Sources public static void main(String[] args) throws Exception { // set up the execution environment final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // get input data either from file or use example data DataSet<String> inputText = env.readTextFile(args[0]); DataSet<Tuple2<String, Integer>> counts = // split up the lines into tuples containing: (word,1) inputText.flatMap(new Tokenizer()) // group by the tuple field "0" .groupBy(0) // sum up tuple field "1" .reduceGroup(new SumWords()); // emit result counts.writeAsCsv(args[1], "\n", " "); // execute program env.execute("WordCount Example"); } 35
  • 36. Data types public static void main(String[] args) throws Exception { // set up the execution environment final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // get input data either from file or use example data DataSet<String> inputText = env.readTextFile(args[0]); DataSet<Tuple2<String, Integer>> counts = // split up the lines into tuples containing: (word,1) inputText.flatMap(new Tokenizer()) // group by the tuple field "0" .groupBy(0) // sum up tuple field "1" .reduceGroup(new SumWords()); // emit result counts.writeAsCsv(args[1], "\n", " "); // execute program env.execute("WordCount Example"); } 36
  • 37. Transformations public static void main(String[] args) throws Exception { // set up the execution environment final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // get input data either from file or use example data DataSet<String> inputText = env.readTextFile(args[0]); DataSet<Tuple2<String, Integer>> counts = // split up the lines into tuples containing: (word,1) inputText.flatMap(new Tokenizer()) // group by the tuple field "0" .groupBy(0) // sum up tuple field "1" .reduceGroup(new SumWords()); // emit result counts.writeAsCsv(args[1], "\n", " "); // execute program env.execute("WordCount Example"); } 37
  • 38. User functions public static void main(String[] args) throws Exception { // set up the execution environment final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // get input data either from file or use example data DataSet<String> inputText = env.readTextFile(args[0]); DataSet<Tuple2<String, Integer>> counts = // split up the lines into tuples containing: (word,1) inputText.flatMap(new Tokenizer()) // group by the tuple field "0" .groupBy(0) // sum up tuple field "1" .reduceGroup(new SumWords()); // emit result counts.writeAsCsv(args[1], "\n", " "); // execute program env.execute("WordCount Example"); } 38
  • 39. DataSinks public static void main(String[] args) throws Exception { // set up the execution environment final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // get input data either from file or use example data DataSet<String> inputText = env.readTextFile(args[0]); DataSet<Tuple2<String, Integer>> counts = // split up the lines into tuples containing: (word,1) inputText.flatMap(new Tokenizer()) // group by the tuple field "0" .groupBy(0) // sum up tuple field "1" .reduceGroup(new SumWords()); // emit result counts.writeAsCsv(args[1], "\n", " "); // execute program env.execute("WordCount Example"); } 39
  • 40. Execute! public static void main(String[] args) throws Exception { // set up the execution environment final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // get input data either from file or use example data DataSet<String> inputText = env.readTextFile(args[0]); DataSet<Tuple2<String, Integer>> counts = // split up the lines into tuples containing: (word,1) inputText.flatMap(new Tokenizer()) // group by the tuple field "0" .groupBy(0) // sum up tuple field "1" .reduceGroup(new SumWords()); // emit result counts.writeAsCsv(args[1], "\n", " "); // execute program env.execute("WordCount Example"); } 40
  • 41. WordCount: Map public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> { @Override public void flatMap(String value, Collector<Tuple2<String, Integer>> out) { // normalize and split the line String[] tokens = value.toLowerCase().split("\\W+"); // emit the pairs for (String token : tokens) { if (token.length() > 0) { out.collect( new Tuple2<String, Integer>(token, 1)); } } } } 41
  • 42. WordCount: Map: Interface public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> { @Override public void flatMap(String value, Collector<Tuple2<String, Integer>> out) { // normalize and split the line String[] tokens = value.toLowerCase().split("\\W+"); // emit the pairs for (String token : tokens) { if (token.length() > 0) { out.collect( new Tuple2<String, Integer>(token, 1)); } } } } 42
  • 43. WordCount: Map: Types public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> { @Override public void flatMap(String value, Collector<Tuple2<String, Integer>> out) { // normalize and split the line String[] tokens = value.toLowerCase().split("\\W+"); // emit the pairs for (String token : tokens) { if (token.length() > 0) { out.collect( new Tuple2<String, Integer>(token, 1)); } } } } 43
  • 44. WordCount: Map: Collector public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> { @Override public void flatMap(String value, Collector<Tuple2<String, Integer>> out) { // normalize and split the line String[] tokens = value.toLowerCase().split("\\W+"); // emit the pairs for (String token : tokens) { if (token.length() > 0) { out.collect( new Tuple2<String, Integer>(token, 1)); } } } } 44
  • 45. WordCount: Reduce public static class SumWords implements GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> { @Override public void reduce(Iterable<Tuple2<String, Integer>> values, Collector<Tuple2<String, Integer>> out) { int count = 0; String word = null; for (Tuple2<String, Integer> tuple : values) { word = tuple.f0; count++; } out.collect(new Tuple2<String, Integer>(word, count)); } } 45
  • 46. WordCount: Reduce: Interface public static class SumWords implements GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> { @Override public void reduce(Iterable<Tuple2<String, Integer>> values, Collector<Tuple2<String, Integer>> out) { int count = 0; String word = null; for (Tuple2<String, Integer> tuple : values) { word = tuple.f0; count++; } out.collect(new Tuple2<String, Integer>(word, count)); } } 46
  • 47. WordCount: Reduce: Types public static class SumWords implements GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> { @Override public void reduce(Iterable<Tuple2<String, Integer>> values, Collector<Tuple2<String, Integer>> out) { int count = 0; String word = null; for (Tuple2<String, Integer> tuple : values) { word = tuple.f0; count++; } out.collect(new Tuple2<String, Integer>(word, count)); } } 47
  • 48. WordCount: Reduce: Collector public static class SumWords implements GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> { @Override public void reduce(Iterable<Tuple2<String, Integer>> values, Collector<Tuple2<String, Integer>> out) { int count = 0; String word = null; for (Tuple2<String, Integer> tuple : values) { word = tuple.f0; count++; } out.collect(new Tuple2<String, Integer>(word, count)); } } 48
  • 50. Data Types  Basic Java Types • String, Long, Integer, Boolean,… • Arrays  Composite Types • Tuple • POJO (Java objects) • Custom type 50
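A POJO just needs a public no-argument constructor and public fields (or getters and setters); Flink then treats it as a composite type whose fields can be used as keys. A minimal sketch (the Person class is hypothetical):

    public class Person {
        public String name;   // POJO fields can be referenced by name,
        public int age;       // e.g. persons.groupBy("age")
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
    }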
  • 51. Tuples  The easiest, lightweight, and generic way of encapsulating data in Flink  Tuple1 up to Tuple25 Tuple3<String, String, Integer> person = new Tuple3<>("Max", "Mustermann", 42); // zero based index! String firstName = person.f0; String secondName = person.f1; Integer age = person.f2; 51
  • 52. Transformations: Map DataSet<Integer> integers = env.fromElements(1, 2, 3, 4); // Regular Map - Takes one element and produces one element DataSet<Integer> doubleIntegers = integers.map(new MapFunction<Integer, Integer>() { @Override public Integer map(Integer value) { return value * 2; } }); doubleIntegers.print(); > 2, 4, 6, 8 // Flat Map - Takes one element and produces zero, one, or more elements. DataSet<Integer> doubleIntegers2 = integers.flatMap(new FlatMapFunction<Integer, Integer>() { @Override public void flatMap(Integer value, Collector<Integer> out) { out.collect(value * 2); } }); doubleIntegers2.print(); > 2, 4, 6, 8 52
  • 53. Transformations: Filter // The DataSet DataSet<Integer> integers = env.fromElements(1, 2, 3, 4); DataSet<Integer> filtered = integers.filter(new FilterFunction<Integer>() { @Override public boolean filter(Integer value) { return value != 3; } }); filtered.print(); > 1, 2, 4 53
  • 54. Groupings and Reduce // (name, age) of employees DataSet<Tuple2<String, Integer>> employees = … // group by second field (age) DataSet<Tuple2<Integer, Integer>> grouped = employees.groupBy(1) // return a list of age groups with their counts .reduceGroup(new CountSameAge()); 54 Name Age Stephan 18 Fabian 23 Julia 27 Romeo 27 Anna 18 • DataSets can be split into groups • Groups are defined using a common key AgeGroup Count 18 2 23 1 27 2
  • 55. GroupReduce public static class CountSameAge implements GroupReduceFunction <Tuple2<String, Integer>, Tuple2<Integer, Integer>> { @Override public void reduce(Iterable<Tuple2<String, Integer>> values, Collector<Tuple2<Integer, Integer>> out) { Integer ageGroup = 0; Integer countsInGroup = 0; for (Tuple2<String, Integer> person : values) { ageGroup = person.f1; countsInGroup++; } out.collect(new Tuple2<Integer, Integer> (ageGroup, countsInGroup)); } } 55
  • 56. Joining two DataSets // authors (id, name, email) DataSet<Tuple3<Integer, String, String>> authors = ..; // posts (title, content, author_id) DataSet<Tuple3<String, String, Integer>> posts = ..; DataSet<Tuple2< Tuple3<Integer, String, String>, Tuple3<String, String, Integer> >> archive = authors.join(posts).where(0).equalTo(2); 56 Authors Id Name email 1 Fabian fabian@.. 2 Julia julia@... 3 Max max@... 4 Romeo romeo@. Posts Title Content Author id … … 2 .. .. 4 .. .. 4 .. .. 1 .. .. 2
  • 57. Joining two DataSets // authors (id, name, email) DataSet<Tuple3<Integer, String, String>> authors = ..; // posts (title, content, author_id) DataSet<Tuple3<String, String, Integer>> posts = ..; DataSet<Tuple2< Tuple3<Integer, String, String>, Tuple3<String, String, Integer> >> archive = authors.join(posts).where(0).equalTo(2); 57 Archive Id Name email Title Content Author id 1 Fabian fabian@.. .. .. 1 2 Julia julia@... .. .. 2 2 Julia julia@... .. .. 2 4 Romeo romeo@... .. .. 4 4 Romeo romeo@. .. .. 4
  • 58. Join with join function // authors (id, name, email) DataSet<Tuple3<Integer, String, String>> authors = ..; // posts (title, content, author_id) DataSet<Tuple3<String, String, Integer>> posts = ..; // (author name, title) DataSet<Tuple2<String, String>> archive = authors.join(posts).where(0).equalTo(2) .with(new PostsByUser()); public static class PostsByUser implements JoinFunction<Tuple3<Integer, String, String>, Tuple3<String, String, Integer>, Tuple2<String, String>> { @Override public Tuple2<String, String> join( Tuple3<Integer, String, String> left, Tuple3<String, String, Integer> right) { return new Tuple2<String, String>(left.f1, right.f0); } } 58 Archive Name Title Fabian .. Julia .. Julia .. Romeo .. Romeo ..
  • 59. Data Sources / Data Sinks 59
  • 60. Data Sources Text  readTextFile(“/path/to/file”) CSV  readCsvFile(“/path/to/file”) Collection  fromCollection(collection)  fromElements(1,2,3,4,5) 60
  • 61. Data Sources: Collections ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // read from elements DataSet<String> names = env.fromElements("Some", "Example", "Strings"); // read from Java collection List<String> list = new ArrayList<String>(); list.add("Some"); list.add("Example"); list.add("Strings"); DataSet<String> namesFromList = env.fromCollection(list); 61
  • 62. Data Sources: File-Based ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // read text file from local or distributed file system DataSet<String> localLines = env.readTextFile("/path/to/my/textfile"); // read a CSV file with three fields DataSet<Tuple3<Integer, String, Double>> csvInput = env.readCsvFile("/the/CSV/file") .types(Integer.class, String.class, Double.class); // read a CSV file with five fields, taking only two of them DataSet<Tuple2<String, Double>> projectedCsvInput = env.readCsvFile("/the/CSV/file") // take the first and the fourth field .includeFields("10010") .types(String.class, Double.class); 62
  • 63. Data Sinks Text  writeAsText("/path/to/file")  writeAsFormattedText("/path/to/file", formatFunction) CSV  writeAsCsv("/path/to/file") Return data to the Client  print()  collect()  count() 63
  • 64. Data Sinks (lazy)  Lazily executed when env.execute() is called DataSet<…> result; // write DataSet to a file on the local file system result.writeAsText("/path/to/file"); // write DataSet to a file and overwrite the file if it exists result.writeAsText("/path/to/file", FileSystem.WriteMode.OVERWRITE); // tuples as lines with pipe as the separator "a|b|c" result.writeAsCsv("/path/to/file", "\n", "|"); // this writes values as strings using a user-defined TextFormatter object result.writeAsFormattedText("/path/to/file", new TextFormatter<Tuple2<Integer, Integer>>() { public String format (Tuple2<Integer, Integer> value) { return value.f1 + " - " + value.f0; } }); 64
  • 65. Data Sinks (eager)  Eagerly executed DataSet<Tuple2<String, Integer>> result; // print result.print(); // count long numberOfElements = result.count(); // collect List<Tuple2<String, Integer>> materializedResults = result.collect(); 65
  • 67. Ways to Run a Flink Program 67 [Figure: DataSet (Java/Scala/Python) and DataStream (Java/Scala) programs run on the streaming dataflow runtime, which can be deployed locally, on a remote cluster, on YARN, on Tez, or embedded]
  • 68. Local Execution  Starts a local Flink cluster  All processes run in the same JVM  Behaves just like a regular cluster  Very useful for developing and debugging 68 [Figure: Job Manager and Task Managers running together inside a single JVM]
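For example, an explicit local environment can be created instead of relying on getExecutionEnvironment(); this sketch simply doubles a few numbers inside the current JVM:

    ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
    env.fromElements(1, 2, 3)
       .map(new MapFunction<Integer, Integer>() {
           @Override
           public Integer map(Integer value) { return value * 2; }
       })
       .print();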
  • 69. Remote Execution  The cluster mode  Submit a Job remotely  Monitors the status of the job 70 [Figure: the client submits the job to the Job Manager of a remote cluster with several Task Managers]
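A minimal sketch of submitting to a remote cluster from code, assuming a Job Manager reachable at jobmanager-host:6123 and the program packaged as /path/to/program.jar (both hypothetical):

    ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
        "jobmanager-host", 6123, "/path/to/program.jar");
    // build the program against env as usual, then call env.execute(...)

The same can be done from the command line, e.g. with bin/flink run -m jobmanager-host:6123 program.jar.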
  • 70. YARN Execution  Multi-user scenario  Resource sharing  Uses YARN containers to run a Flink cluster  Very easy to set up Flink 71 [Figure: the client starts a Job Manager and Task Managers inside YARN containers, managed by the Resource Manager alongside other applications]
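A minimal command-line sketch (container count, memory, and the jar path are illustrative assumptions): start a YARN session and submit jobs against it:

    # start a long-running Flink cluster on YARN: 4 TaskManager containers, 1024 MB each
    ./bin/yarn-session.sh -n 4 -tm 1024
    # submit a job to the running session
    ./bin/flink run ./path/to/your-program.jar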
  • 73. tl;dr: What was this about?  The case for Flink • Low latency • High throughput • Fault-tolerant • Easy to use APIs, library ecosystem • Growing community  A stream processor that is great for batch analytics as well 75
  • 74. I ♥ Flink, do you? 76  Get involved and start a discussion on Flink's mailing list  { user, dev }@flink.apache.org  Subscribe to news@flink.apache.org  Follow flink.apache.org/blog and @ApacheFlink on Twitter
  • 75. 77 flink-forward.org Call for papers deadline: August 14, 2015 October 12-13, 2015 Discount code: FlinkEITSummerSchool25
  • 76. Thank you for listening! 78
