Kappa Architecture is a software architecture pattern built around an immutable, append-only log. All event processing is performed on the input streams and persisted as real-time views. Apache Flink is very well suited as the processing engine because it provides event-time semantics and stateful exactly-once processing while achieving high throughput and low latency at the same time. Apache Kudu is a storage system good at both ingesting streaming data and analysing it, using ad-hoc queries (e.g. interactive SQL) and full-scan processes (e.g. Spark/Flink), so Kudu is a good fit for storing the real-time views in a Kappa Architecture. We have developed and open-sourced a connector to integrate Apache Kudu and Apache Flink. It allows reading/writing data from/to Kudu using Flink's DataSet and DataStream APIs. The connector has been submitted to the Apache Bahir project and is already available from the Maven Central repository.
5. PROJECT STARTED BY JUNIOR DEVELOPERS
• Learn Apache Flink
• Learn Apache Kudu
• Encourage the use of and contribution to the open source community
1) Open Jira ticket
2) Ticket is assigned to you
3) Open a Pull-Request
4) Overhaul the code
5) Done!
10. APACHE KUDU AND APACHE FLINK: INTRODUCTION AND FEATURES
Agenda
11. APACHE KUDU
What is Apache Kudu?
Online (fast random access)
Analytics (fast scans)
https://www.youtube.com/watch?v=32zV7-I1JaM
https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
12. APACHE KUDU
What is Apache Kudu?
Designed for fast analytics on fast data
An open-source columnar-oriented data store
• Provides a combination of fast inserts/updates and efficient columnar scans
• Tables have a structured data model similar to an RDBMS
• Fast processing of OLAP workloads
• Integration with Hadoop ecosystem (Impala, Spark, Flink)
13. • Columnar data storage: strongly-typed columns
• Read efficiency: reads by columns, not by rows
• Very efficient when only a portion (some columns) of a row is needed: it reads a minimal number of blocks from disk
• Data compression: thanks to the strongly-typed columns, compression is more efficient
• Table: the place where data is stored. Split into segments called tablets
• Tablet: contiguous segment of a table (partition)
• Replicated on multiple tablet servers. Leader-follower model
• Tablet server: stores and serves tablets to clients
• Master: keeps track of everything. Leader-follower model
• Catalog table: central location for meta-data.
APACHE KUDU: CONCEPTS
Key concepts of Apache Kudu
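As a toy illustration (plain Java, not Kudu code) of why the columnar layout above makes partial reads cheap: summing one field in a columnar layout touches a single primitive array, while a row layout would drag every whole record through memory. All class and field names here are made up for the example.

```java
// Toy contrast between row-oriented and column-oriented storage (not Kudu code).
class ColumnarDemo {
    // Row layout: each record keeps all its fields together.
    record Measure(long vehicle, int speed, String plate) {}

    // Columnar layout: one array per column, strongly typed.
    static final long[] VEHICLES = {1L, 2L, 3L};
    static final int[] SPEEDS = {50, 60, 70};
    static final String[] PLATES = {"AAA", "BBB", "CCC"};

    // Reads only the 'speed' column; VEHICLES and PLATES are never touched.
    static int totalSpeedColumnar() {
        int sum = 0;
        for (int s : SPEEDS) sum += s;
        return sum;
    }
}
```

The same property is what lets a columnar store compress each column with a codec suited to its type.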
14. APACHE KUDU: ARCHITECTURE
Kudu network architecture
[Diagram: three Master Servers (Server A as LEADER, Servers B and C as Followers) and Tablet Servers D, E and F. Each tablet (Tablet 1, Tablet 2, ..., Tablet n) is replicated across the tablet servers, with one replica acting as LEADER and the rest as Followers.]
15. Apache Flink
Flink programs are all about operators
How Flink works
[Streaming dataflow diagram: source → map() → keyBy() → sink, i.e. SOURCE → TRANSFORMATION → SINK, all built from OPERATORS]
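The source → transformation → sink shape can be mimicked with plain java.util.stream code. This is an analogy only, not the Flink API: the grouping step plays the role of keyBy(), and collecting the result plays the role of the sink.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy of a Flink dataflow using plain Java streams (not Flink code).
class DataflowAnalogy {
    static Map<Integer, List<Integer>> run() {
        return Stream.of("a", "bb", "ccc", "dd")         // source
                .map(String::length)                     // map() transformation
                .collect(Collectors.groupingBy(n -> n)); // keyBy()-like grouping, then "sink"
    }
}
```

In real Flink the same pipeline runs continuously over an unbounded stream and the operators are distributed across task managers.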
19. FLINK IO API
Flink IO
org.apache.flink.api.common.io.InputFormat
• KuduInputFormat
org.apache.flink.api.common.io.OutputFormat
• KuduOutputFormat
org.apache.flink.streaming.api.functions.source.SourceFunction
• Not Implemented
• Kudu does not provide CDC
• Open issue: here
org.apache.flink.streaming.api.functions.sink.SinkFunction
• KuduSinkFunction
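A simplified sketch of the InputFormat lifecycle that KuduInputFormat implements: the method names follow org.apache.flink.api.common.io.InputFormat, but the signatures here are reduced (real Flink adds configuration, generics over InputSplit, and checked exceptions), and ListInputFormat is a made-up toy that stands in for scanning a tablet.

```java
import java.util.Iterator;
import java.util.List;

// Reduced sketch of the Flink InputFormat contract (simplified signatures).
interface SimpleInputFormat<T, S> {
    List<S> createInputSplits(int minNumSplits); // e.g. one split per Kudu tablet
    void open(S split);                          // open a scanner for the split
    boolean reachedEnd();                        // no more records in this split?
    T nextRecord();                              // read the next record
    void close();                                // release the scanner
}

// Toy implementation reading from an in-memory list, standing in for a tablet.
class ListInputFormat implements SimpleInputFormat<String, List<String>> {
    private Iterator<String> it;

    public List<List<String>> createInputSplits(int minNumSplits) {
        return List.of(List.of("row1", "row2")); // a single "tablet"
    }
    public void open(List<String> split) { it = split.iterator(); }
    public boolean reachedEnd() { return !it.hasNext(); }
    public String nextRecord() { return it.next(); }
    public void close() { it = null; }
}
```

Flink drives this lifecycle per split: open the split, loop nextRecord() until reachedEnd(), then close().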
20. KUDU CONNECTOR CLASSES
CLASS                 USAGE     BASE                                   API
KuduInputFormat       PUBLIC    InputFormat                            DataSet
KuduOutputFormat      PUBLIC    OutputFormat                           DataSet
KuduSink              PUBLIC    SinkFunction                           DataStream
KuduInputSplit        INTERNAL  InputSplit                             -
KuduBatchTableSource  PUBLIC    BatchTableSource                       Table
KuduTableSink         PUBLIC    AppendStreamTableSink, BatchTableSink  Table
21. KUDUINPUTFORMAT
KuduInputFormat: Example
sample program
KuduInputFormat.Conf inputConfig = KuduInputFormat.Conf.builder()
    .masterAddress("localhost")
    .tableName("myTable")
    //.addPredicate(new Predicate.PredicateBuilder("vehicle_tag").isEqualTo().val(32129))
    .build();
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.createInput(new KuduInputFormat<>(inputConfig), new TupleTypeInfo<Tuple3<Long, Integer, String>>(
        BasicTypeInfo.LONG_TYPE_INFO,
        BasicTypeInfo.INT_TYPE_INFO,
        BasicTypeInfo.STRING_TYPE_INFO))
    .flatMap(…)…
22. KUDUINPUTFORMAT
A program with parallelism = 5, TMs= 5, slots = 1 per TM
Reading data from Kudu in Flink
[Diagram: the KUDU MASTER assigns the 4 tablets (TABLET 1-4) of the KUDU TABLE to parallel instances 1-4 (TM 1-4, slot 1 each), each feeding a map operator. Parallel instance 5 (TM 5, slot 1) stays IDLE, and unevenly sized tablets can cause DATA SKEW.]
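The idle-instance situation can be modeled in a few lines of plain Java: 4 splits spread over 5 instances leave one instance with nothing to read. Round-robin assignment is assumed here purely for illustration; Flink's actual split assignment is demand-driven, but the count argument is the same.

```java
// Toy model: how many input splits each parallel instance receives when
// splits are handed out round-robin (assumed policy, for illustration only).
class SplitAssignment {
    static int[] splitsPerInstance(int numSplits, int parallelism) {
        int[] counts = new int[parallelism];
        for (int split = 0; split < numSplits; split++) {
            counts[split % parallelism]++;
        }
        return counts;
    }
}
```

With numSplits = 4 and parallelism = 5, the last instance gets zero splits, which is exactly the IDLE slot in the diagram; unequal tablet sizes then skew the work among the other four.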
24. KUDUOUTPUTFORMAT
A program with parallelism = 4, TMs= 4, slots = 1 per TM
Writing to Kudu from Flink
[Diagram: parallel instances 1-4 (TM 1-4, slot 1 each) each write their map output to the tablets (TABLET 1-4) of the KUDU TABLE.]
25. FLINK TABLE API
Kudu working with Flink Table API
Kudu and Table API
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
KuduBatchTableSource kuduTableSource = new KuduBatchTableSource(conf, new TupleTypeInfo<>(
// type information
));
tEnv.registerTableSource("Taxis", kuduTableSource);
Table result = tEnv.sql("SELECT f1, AVG(f4), COUNT(*) from Taxis group by f1").as("vehicle,avgSpeed,totalMeasures");
result.writeToSink(new KuduTableSink<>(outputConf, new TupleTypeInfo<>(
BasicTypeInfo.LONG_TYPE_INFO,
BasicTypeInfo.DOUBLE_TYPE_INFO,
BasicTypeInfo.LONG_TYPE_INFO
)));
28. FUTURE WORK
A first version is released, but the team keeps working
• Support more data types (currently only Tuples are supported)
• Implement missing features (e.g. KuduSource, once CDC (KUDU-2180) is available in Kudu)
• Fix the issue with the limit() operator: it is currently an open issue in the Kudu Java API: KUDU-16, KUDU-2093