FLINK-KUDU-CONNECTOR
AN OPEN-SOURCE CONTRIBUTION TO
DEVELOP KAPPA ARCHITECTURES
WHO WE ARE
Team
RUBÉN CASADO
Big Data Manager,
Accenture Digital
NACHO GARCÍA
Senior Big Data Engineer,
Accenture Digital
• Big Data Chapter Lead in Accenture Digital Delivery Spain
• Apache Flink Madrid Meetup organizer
• Director of the Master in Big Data Architecture at Kschool
• PhD in Computer Science / Software Engineering
• Passionate about open source
• Senior Big Data Engineer at Accenture Digital
• Lecturer of the Master in Big Data at Kschool
• MSc in Computer Science
Copyright © 2017 Accenture. All rights reserved.
ruben_casado
ruben.casado.tejedor@accenture.com
0xNacho
n.garcia.fernandez@accenture.com
█ CONTEXT
Agenda
ACCENTURE HAS BEEN RANKED AS THE #1 BIG DATA PROVIDER IN SPAIN BY
PENTEO, A VERY PRESTIGIOUS IT BENCHMARKING/ADVISORY FIRM,
THANKS TO ITS DEEP KNOWLEDGE OF BUSINESS REQUIREMENTS IN
DIFFERENT DOMAINS, ITS LARGE NETWORK OF ACADEMIC AND
TECHNOLOGY PARTNER ALLIANCES, AND ITS DELIVERY CAPACITY
PROJECT STARTED BY JUNIOR DEVELOPERS
• Learn Apache Flink
• Learn Apache Kudu
• Encourage the use and contribution to
the open source community
1) Open Jira ticket
2) Ticket is assigned to you
3) Open a Pull-Request
4) Code review and rework
5) Done!
█ KAPPA ARCHITECTURE: A REAL NEED
Agenda
MOTIVATION
Batch
Processing
NoSQL
Stream
processing
LAMBDA ARCHITECTURE
Two processing engines
Lambda architecture
(Diagram)
BATCH LAYER: processes ALL DATA into batch views
SPEED LAYER: processes RECENT DATA into real-time views
SERVING LAYER: merges batch views and real-time views to answer QUERIES
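The split between the two layers can be sketched in plain Java (an illustrative toy, not code from any framework): the serving layer merges a precomputed batch view with a real-time view over the data that arrived after the last batch run.

```java
import java.util.List;

public class LambdaSketch {
    public static void main(String[] args) {
        List<Integer> allData    = List.of(1, 2, 3, 4); // handled by the batch layer
        List<Integer> recentData = List.of(5, 6);       // handled by the speed layer

        // Batch view: precomputed over all (historical) data.
        int batchView = allData.stream().mapToInt(Integer::intValue).sum();

        // Real-time view: incrementally maintained over recent data.
        int realtimeView = recentData.stream().mapToInt(Integer::intValue).sum();

        // Serving layer: a query merges both views into one answer.
        int answer = batchView + realtimeView;
        System.out.println(answer); // 21
    }
}
```

The cost of this design, and the motivation for Kappa, is that the same business logic must be written and maintained twice, once per processing engine.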
KAPPA ARCHITECTURE
A single engine
Kappa Architecture
(Diagram)
STREAMING LAYER: a single streaming engine consumes a REPLAYABLE log
SERVING LAYER: answers QUERIES
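Conceptually (a plain-Java sketch, no Flink involved; names are illustrative), the Kappa idea is that the canonical store is a replayable, append-only log, and any serving view can be rebuilt at any time by replaying that log through the single streaming job:

```java
import java.util.ArrayList;
import java.util.List;

public class KappaSketch {
    // The replayable log: an append-only sequence of events.
    static final List<Integer> LOG = new ArrayList<>();

    // The single "streaming job": folds events into a view (here, a running sum).
    static int buildView(List<Integer> events) {
        int view = 0;
        for (int e : events) view += e;
        return view;
    }

    public static void main(String[] args) {
        LOG.add(3); LOG.add(4); LOG.add(5);

        int view    = buildView(LOG); // serving view built from the log
        int rebuilt = buildView(LOG); // replaying the log reproduces it exactly

        System.out.println(view + " " + rebuilt); // 12 12
    }
}
```

Because reprocessing is just a replay of the same job, there is one codebase instead of the two that Lambda requires.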
█ APACHE KUDU AND APACHE FLINK: INTRODUCTION AND FEATURES
Agenda
APACHE KUDU
What is Apache Kudu?
Online (fast random access)
Analytics (fast scans)
https://www.youtube.com/watch?v=32zV7-I1JaM
https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
APACHE KUDU
What is Apache Kudu?
Designed for fast analytics on fast data
An open-source columnar-oriented data store
• Provides a combination of fast inserts/updates and efficient columnar
scans
• Tables have a structured data model similar to an RDBMS
• Fast processing of OLAP workloads
• Integration with the Hadoop ecosystem (Impala, Spark, Flink)
• Columnar data storage: strongly-typed columns
• Read efficiency: reads by columns, not by rows
• Very efficient when only a portion (some columns) of a row is needed: it reads a minimal number of blocks from disk
• Data compression: because columns are strongly typed, compression is more efficient
• Table: where data is stored. Split into segments called tablets
• Tablet: a contiguous segment of a table (a partition)
• Replicated on multiple tablet servers. Leader-follower model
• Tablet server: stores and serves tablets to clients
• Master: keeps track of everything. Leader-follower model
• Catalog table: central location for metadata
APACHE KUDU: CONCEPTS
Key concepts of Apache Kudu
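The read-efficiency point above can be illustrated with a tiny, hypothetical sketch in plain Java (this is not Kudu code): storing a table column-wise means a scan over one column touches only that column's values, instead of every field of every row.

```java
import java.util.stream.IntStream;

public class ColumnarSketch {
    public static void main(String[] args) {
        // Row-oriented layout: each row carries every field, so a scan over
        // one field still walks whole rows (9 values here).
        int[][] rows = { {1, 100, 7}, {2, 200, 9}, {3, 300, 5} };

        // Column-oriented layout: one array per column; a scan over "speed"
        // touches only the speed array (3 values).
        int[] idCol    = {1, 2, 3};
        int[] speedCol = {100, 200, 300};
        int[] tagCol   = {7, 9, 5};

        double avgColumnar = IntStream.of(speedCol).average().orElse(0);

        double avgRowWise = IntStream.range(0, rows.length)
                .map(i -> rows[i][1]).average().orElse(0);

        // Same answer, but the columnar scan read a third of the data.
        System.out.println(avgColumnar + " " + avgRowWise); // 200.0 200.0
    }
}
```

The same per-column layout is what makes type-aware compression effective: all values in an array share one type and often similar magnitudes.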
Kudu network architecture
(Diagram) Three Master Servers: Master Server A (LEADER), Master Server B (Follower), Master Server C (Follower). Tablet Servers D, E and F each host replicas of Tablet 1, Tablet 2 … Tablet n; every tablet has one LEADER replica and several Follower replicas, with leadership spread across the tablet servers.
APACHE KUDU: ARCHITECTURE
Master Servers | Tablet Servers
Apache Flink
Flink programs are all about operators
How Flink works
source map() keyBy() sink
STREAMING DATAFLOW
SOURCE TRANSFORMATION SINK
OPERATORS
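As a loose, stdlib-only analogy (not the actual Flink API), the same source → transformation → sink shape can be sketched with Java streams; keyBy() is approximated here by groupingBy:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DataflowSketch {
    public static void main(String[] args) {
        Map<Integer, List<String>> grouped =
            Stream.of("flink", "kudu", "kappa")                      // SOURCE
                  .map(String::toUpperCase)                          // TRANSFORMATION: map()
                  .collect(Collectors.groupingBy(String::length));   // ~ keyBy()

        System.out.println(grouped);                                 // SINK: print
    }
}
```

In real Flink the pipeline is lazy and unbounded, and each operator can run with its own parallelism; the analogy only captures the operator chaining.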
█ FLINK-KUDU-CONNECTOR: A DEEP EXPLANATION
Agenda
FLINK CONNECTORS
out there
flink-connector-elasticsearch
flink-connector-kafka
flink-connector-redis
flink-connector-influxdb
flink-connector-kinesis
flink-connector-hbase
flink-jdbc
…
FLINK-KUDU-CONNECTOR
STREAMING DATAFLOW: Kudu Table (source) → map() … → Kudu Table (sink)
DataSet and DataStream APIs (batch & streaming)
FLINK IO API
Flink IO
org.apache.flink.api.common.io.InputFormat
• KuduInputFormat
org.apache.flink.api.common.io.OutputFormat
• KuduOutputFormat
org.apache.flink.streaming.api.functions.source.SourceFunction
• Not Implemented
• Kudu does not provide CDC
• Open issue: here
org.apache.flink.streaming.api.functions.sink.SinkFunction
• KuduSinkFunction
CLASS                 USAGE     BASE                                   API
KuduInputFormat       PUBLIC    InputFormat                            DataSet
KuduOutputFormat      PUBLIC    OutputFormat                           DataSet
KuduSink              PUBLIC    SinkFunction                           DataStream
KuduInputSplit        INTERNAL  InputSplit                             -
KuduBatchTableSource  PUBLIC    BatchTableSource                       Table
KuduTableSink         PUBLIC    AppendStreamTableSink, BatchTableSink  Table
KUDU CONNECTOR CLASSES
KUDUINPUTFORMAT
sample program
KuduInputFormat: Example

KuduInputFormat.Conf inputConfig = KuduInputFormat.Conf.builder()
        .masterAddress("localhost")
        .tableName("myTable")
        //.addPredicate(new Predicate.PredicateBuilder("vehicle_tag").isEqualTo().val(32129))
        .build();

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

env.createInput(new KuduInputFormat<>(inputConfig), new TupleTypeInfo<Tuple3<Long, Integer, String>>(
        BasicTypeInfo.LONG_TYPE_INFO,
        BasicTypeInfo.INT_TYPE_INFO,
        BasicTypeInfo.STRING_TYPE_INFO))
   .flatMap(…)…
KUDUINPUTFORMAT
A program with parallelism = 5, TMs= 5, slots = 1 per TM
Reading data from Kudu in Flink
(Diagram) The KUDU MASTER maps the KUDU TABLE's tablets to the five parallel instances (TM 1–5, one slot each). With only four tablets (TABLET 1–4), four instances each read one tablet and feed their map … chain, while the fifth instance stays IDLE; tablets of uneven size also cause DATA SKEW across instances.
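The idle-instance effect can be sketched with a small, hypothetical round-robin split assignment in plain Java (illustrative only; Flink's actual split assignment is more dynamic than this):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAssignmentSketch {
    public static void main(String[] args) {
        int parallelism = 5; // five parallel instances
        int tablets = 4;     // but only four input splits (tablets)

        // assignments.get(i) = tablets handled by parallel instance i
        List<List<Integer>> assignments = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) assignments.add(new ArrayList<>());

        // Round-robin the splits over the instances.
        for (int t = 0; t < tablets; t++) {
            assignments.get(t % parallelism).add(t);
        }

        long idle = assignments.stream().filter(List::isEmpty).count();
        System.out.println("idle instances: " + idle); // idle instances: 1
    }
}
```

Whenever parallelism exceeds the number of tablets, at least one instance receives no split, so sizing parallelism to the table's partitioning matters.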
KUDUOUTPUTFORMAT
sample program

KuduOutputFormat.Conf outputConfig = KuduOutputFormat.Conf.builder()
        .masterAddress("localhost")
        .tableName("test")
        .writeMode(KuduOutputFormat.Conf.WriteMode.UPSERT)
        .build();

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet dataset = env.fromElements(1,2,3,4,5,5,6);
dataset.map(…)
       .combineGroup(…)
       .reduceGroup(…)
       .output(new KuduOutputFormat(outputConfig));
KUDUOUTPUTFORMAT
A program with parallelism = 4, TMs= 4, slots = 1 per TM
Writing to Kudu from Flink
(Diagram) Four parallel instances (TM 1–4, one slot each) each run a map → output chain, writing into the KUDU TABLE's tablets (TABLET 1–4).
FLINK TABLE API
Kudu working with Flink Table API
Kudu and Table API
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

KuduBatchTableSource kuduTableSource = new KuduBatchTableSource(conf, new TupleTypeInfo<>(
        // type information
));
tEnv.registerTableSource("Taxis", kuduTableSource);

Table result = tEnv.sql("SELECT f1, AVG(f4), COUNT(*) FROM Taxis GROUP BY f1")
        .as("vehicle,avgSpeed,totalMeasures");

result.writeToSink(new KuduTableSink<>(outputConf, new TupleTypeInfo<>(
        BasicTypeInfo.LONG_TYPE_INFO,
        BasicTypeInfo.DOUBLE_TYPE_INFO,
        BasicTypeInfo.LONG_TYPE_INFO
)));
DEMO
• Ongoing contribution to Apache Bahir: https://github.com/apache/bahir-flink/pull/17
• Code samples: https://github.com/0xNacho/kudu-flink-examples/
• flink-connector-kudu: https://github.com/0xNacho/bahir-flink branch: feature/flink-connector-kudu
CODE
Contributions are welcome!
FUTURE WORK
A first version is released, but the team keeps working:
• Support more data types (currently only Tuples are supported)
• Implement missing features (e.g. KuduSource, once CDC (KUDU-2180) is
available in Kudu)
• Fix the issue with the limit() operator: it is currently an open issue in Kudu's
Java API: KUDU-16, KUDU-2093
THAT’S ALL FOLKS. THANK YOU!

Apache Flink & Kudu: a connector to develop Kappa architectures
