FLINK-KUDU-CONNECTOR
AN OPEN-SOURCE CONTRIBUTION TO
DEVELOP KAPPA ARCHITECTURES
WHO WE ARE
Team
RUBÉN CASADO
Big Data Manager,
Accenture Digital
NACHO GARCÍA
Senior Big Data Engineer,
Accenture Digital
• Big Data Chapter Lead in Accenture Digital Delivery Spain
• Apache Flink Madrid Meetup organizer
• Director of the Master in Big Data Architecture at Kschool
• PhD in Computer Science / Software Engineering
• Passionate about open source
• Senior Big Data Engineer at Accenture Digital
• Lecturer of the Master in Big Data at Kschool
• MSc in Computer Science
Copyright © 2017 Accenture. All rights reserved.
ruben_casado
ruben.casado.tejedor@accenture.com
0xNacho
n.garcia.fernandez@accenture.com
█ CONTEXT
Agenda
ACCENTURE HAS BEEN RANKED AS THE #1 BIG DATA PROVIDER IN SPAIN BY
PENTEO, A VERY PRESTIGIOUS IT BENCHMARKING/ADVISORY FIRM,
THANKS TO ITS DEEP KNOWLEDGE OF BUSINESS REQUIREMENTS IN
DIFFERENT DOMAINS, ITS LARGE NETWORK OF ACADEMIC AND
TECHNOLOGY PARTNER ALLIANCES, AND ITS DELIVERY CAPACITY
PROJECT STARTED BY JUNIOR DEVELOPERS
• Learn Apache Flink
• Learn Apache Kudu
• Encourage the use and contribution to
the open source community
1) Open Jira ticket
2) Ticket is assigned to you
3) Open a Pull-Request
4) Code review and rework
5) Done!
█ KAPPA ARCHITECTURE: A REAL NEED
Agenda
MOTIVATION
Batch
Processing
NoSQL
Stream
processing
LAMBDA ARCHITECTURE
Two processing engines
Lambda architecture
(Diagram)
BATCH LAYER: processes ALL DATA into batch views
SPEED LAYER: processes RECENT DATA into real-time views
SERVING LAYER: merges batch views and real-time views to answer QUERIES
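The split between the two layers can be sketched in plain Java (an illustrative toy, not code from any framework): the serving layer merges a precomputed batch view with a real-time view over the data that arrived after the last batch run.

```java
import java.util.List;

public class LambdaSketch {
    public static void main(String[] args) {
        List<Integer> allData    = List.of(1, 2, 3, 4); // handled by the batch layer
        List<Integer> recentData = List.of(5, 6);       // handled by the speed layer

        // Batch view: precomputed over all (historical) data.
        int batchView = allData.stream().mapToInt(Integer::intValue).sum();

        // Real-time view: incrementally maintained over recent data.
        int realtimeView = recentData.stream().mapToInt(Integer::intValue).sum();

        // Serving layer: a query merges both views into one answer.
        int answer = batchView + realtimeView;
        System.out.println(answer); // 21
    }
}
```

The cost of this design, and the motivation for Kappa, is that the same business logic must be written and maintained twice, once per processing engine.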
KAPPA ARCHITECTURE
A single engine
Kappa Architecture
(Diagram)
STREAMING LAYER: a single streaming engine consumes a REPLAYABLE log
SERVING LAYER: answers QUERIES
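Conceptually (a plain-Java sketch, no Flink involved; names are illustrative), the Kappa idea is that the canonical store is a replayable, append-only log, and any serving view can be rebuilt at any time by replaying that log through the single streaming job:

```java
import java.util.ArrayList;
import java.util.List;

public class KappaSketch {
    // The replayable log: an append-only sequence of events.
    static final List<Integer> LOG = new ArrayList<>();

    // The single "streaming job": folds events into a view (here, a running sum).
    static int buildView(List<Integer> events) {
        int view = 0;
        for (int e : events) view += e;
        return view;
    }

    public static void main(String[] args) {
        LOG.add(3); LOG.add(4); LOG.add(5);

        int view    = buildView(LOG); // serving view built from the log
        int rebuilt = buildView(LOG); // replaying the log reproduces it exactly

        System.out.println(view + " " + rebuilt); // 12 12
    }
}
```

Because reprocessing is just a replay of the same job, there is one codebase instead of the two that Lambda requires.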
█ APACHE KUDU AND APACHE FLINK: INTRODUCTION AND FEATURES
Agenda
APACHE KUDU
What is Apache Kudu?
Online (fast random access)
Analytics (fast scans)
https://www.youtube.com/watch?v=32zV7-I1JaM
https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
APACHE KUDU
What is Apache Kudu?
Designed for fast analytics on fast data
An open-source columnar-oriented data store
• Provides a combination of fast inserts/updates and efficient columnar
scans
• Tables have a structured data model similar to an RDBMS
• Fast processing of OLAP workloads
• Integration with the Hadoop ecosystem (Impala, Spark, Flink)
• Columnar data storage: strongly-typed columns
• Read efficiency: reads by columns, not by rows
• Very efficient when only a portion (some columns) of a row is needed: it reads a minimal number of blocks from disk
• Data compression: because columns are strongly typed, compression is more efficient
• Table: where data is stored. Split into segments called tablets
• Tablet: a contiguous segment of a table (a partition)
• Replicated on multiple tablet servers. Leader-follower model
• Tablet server: stores and serves tablets to clients
• Master: keeps track of everything. Leader-follower model
• Catalog table: central location for metadata
APACHE KUDU: CONCEPTS
Key concepts of Apache Kudu
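The read-efficiency point above can be illustrated with a tiny, hypothetical sketch in plain Java (this is not Kudu code): storing a table column-wise means a scan over one column touches only that column's values, instead of every field of every row.

```java
import java.util.stream.IntStream;

public class ColumnarSketch {
    public static void main(String[] args) {
        // Row-oriented layout: each row carries every field, so a scan over
        // one field still walks whole rows (9 values here).
        int[][] rows = { {1, 100, 7}, {2, 200, 9}, {3, 300, 5} };

        // Column-oriented layout: one array per column; a scan over "speed"
        // touches only the speed array (3 values).
        int[] idCol    = {1, 2, 3};
        int[] speedCol = {100, 200, 300};
        int[] tagCol   = {7, 9, 5};

        double avgColumnar = IntStream.of(speedCol).average().orElse(0);

        double avgRowWise = IntStream.range(0, rows.length)
                .map(i -> rows[i][1]).average().orElse(0);

        // Same answer, but the columnar scan read a third of the data.
        System.out.println(avgColumnar + " " + avgRowWise); // 200.0 200.0
    }
}
```

The same per-column layout is what makes type-aware compression effective: all values in an array share one type and often similar magnitudes.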
Kudu network architecture
(Diagram) Three Master Servers: Master Server A (LEADER), Master Server B (Follower), Master Server C (Follower). Tablet Servers D, E and F each host replicas of Tablet 1, Tablet 2 … Tablet n; every tablet has one LEADER replica and several Follower replicas, with leadership spread across the tablet servers.
APACHE KUDU: ARCHITECTURE
Master Servers | Tablet Servers
Apache Flink
Flink programs are all about operators
How Flink works
source map() keyBy() sink
STREAMING DATAFLOW
SOURCE TRANSFORMATION SINK
OPERATORS
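As a loose, stdlib-only analogy (not the actual Flink API), the same source → transformation → sink shape can be sketched with Java streams; keyBy() is approximated here by groupingBy:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DataflowSketch {
    public static void main(String[] args) {
        Map<Integer, List<String>> grouped =
            Stream.of("flink", "kudu", "kappa")                      // SOURCE
                  .map(String::toUpperCase)                          // TRANSFORMATION: map()
                  .collect(Collectors.groupingBy(String::length));   // ~ keyBy()

        System.out.println(grouped);                                 // SINK: print
    }
}
```

In real Flink the pipeline is lazy and unbounded, and each operator can run with its own parallelism; the analogy only captures the operator chaining.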
█ FLINK-KUDU-CONNECTOR: A DEEP EXPLANATION
Agenda
FLINK CONNECTORS
out there
flink-connector-elasticsearch
flink-connector-kafka
flink-connector-redis
flink-connector-influxdb
flink-connector-kinesis
flink-connector-hbase
flink-jdbc
…
FLINK-KUDU-CONNECTOR
STREAMING DATAFLOW: Kudu Table (source) → map() … → Kudu Table (sink)
DataSet and DataStream APIs (batch & streaming)
FLINK IO API
Flink IO
org.apache.flink.api.common.io.InputFormat
• KuduInputFormat
org.apache.flink.api.common.io.OutputFormat
• KuduOutputFormat
org.apache.flink.streaming.api.functions.source.SourceFunction
• Not Implemented
• Kudu does not provide CDC
• Open issue: here
org.apache.flink.streaming.api.functions.sink.SinkFunction
• KuduSinkFunction
CLASS                 USAGE     BASE                                   API
KuduInputFormat       PUBLIC    InputFormat                            DataSet
KuduOutputFormat      PUBLIC    OutputFormat                           DataSet
KuduSink              PUBLIC    SinkFunction                           DataStream
KuduInputSplit        INTERNAL  InputSplit                             -
KuduBatchTableSource  PUBLIC    BatchTableSource                       Table
KuduTableSink         PUBLIC    AppendStreamTableSink, BatchTableSink  Table
KUDU CONNECTOR CLASSES
KUDUINPUTFORMAT
sample program
KuduInputFormat: Example

KuduInputFormat.Conf inputConfig = KuduInputFormat.Conf.builder()
        .masterAddress("localhost")
        .tableName("myTable")
        //.addPredicate(new Predicate.PredicateBuilder("vehicle_tag").isEqualTo().val(32129))
        .build();

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

env.createInput(new KuduInputFormat<>(inputConfig), new TupleTypeInfo<Tuple3<Long, Integer, String>>(
        BasicTypeInfo.LONG_TYPE_INFO,
        BasicTypeInfo.INT_TYPE_INFO,
        BasicTypeInfo.STRING_TYPE_INFO))
   .flatMap(…)…
KUDUINPUTFORMAT
A program with parallelism = 5, TMs= 5, slots = 1 per TM
Reading data from Kudu in Flink
(Diagram) The KUDU MASTER maps the KUDU TABLE's tablets to the five parallel instances (TM 1–5, one slot each). With only four tablets (TABLET 1–4), four instances each read one tablet and feed their map … chain, while the fifth instance stays IDLE; tablets of uneven size also cause DATA SKEW across instances.
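The idle-instance effect can be sketched with a small, hypothetical round-robin split assignment in plain Java (illustrative only; Flink's actual split assignment is more dynamic than this):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAssignmentSketch {
    public static void main(String[] args) {
        int parallelism = 5; // five parallel instances
        int tablets = 4;     // but only four input splits (tablets)

        // assignments.get(i) = tablets handled by parallel instance i
        List<List<Integer>> assignments = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) assignments.add(new ArrayList<>());

        // Round-robin the splits over the instances.
        for (int t = 0; t < tablets; t++) {
            assignments.get(t % parallelism).add(t);
        }

        long idle = assignments.stream().filter(List::isEmpty).count();
        System.out.println("idle instances: " + idle); // idle instances: 1
    }
}
```

Whenever parallelism exceeds the number of tablets, at least one instance receives no split, so sizing parallelism to the table's partitioning matters.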
KUDUOUTPUTFORMAT
sample program

KuduOutputFormat.Conf outputConfig = KuduOutputFormat.Conf.builder()
        .masterAddress("localhost")
        .tableName("test")
        .writeMode(KuduOutputFormat.Conf.WriteMode.UPSERT)
        .build();

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet dataset = env.fromElements(1,2,3,4,5,5,6);
dataset.map(…)
       .combineGroup(…)
       .reduceGroup(…)
       .output(new KuduOutputFormat(outputConfig));
KUDUOUTPUTFORMAT
A program with parallelism = 4, TMs= 4, slots = 1 per TM
Writing to Kudu from Flink
(Diagram) Four parallel instances (TM 1–4, one slot each) each run a map → output chain, writing into the KUDU TABLE's tablets (TABLET 1–4).
FLINK TABLE API
Kudu working with Flink Table API
Kudu and Table API
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

KuduBatchTableSource kuduTableSource = new KuduBatchTableSource(conf, new TupleTypeInfo<>(
        // type information
));
tEnv.registerTableSource("Taxis", kuduTableSource);

Table result = tEnv.sql("SELECT f1, AVG(f4), COUNT(*) FROM Taxis GROUP BY f1")
        .as("vehicle,avgSpeed,totalMeasures");

result.writeToSink(new KuduTableSink<>(outputConf, new TupleTypeInfo<>(
        BasicTypeInfo.LONG_TYPE_INFO,
        BasicTypeInfo.DOUBLE_TYPE_INFO,
        BasicTypeInfo.LONG_TYPE_INFO
)));
DEMO
• Ongoing contribution to Apache Bahir: https://github.com/apache/bahir-flink/pull/17
• Code samples: https://github.com/0xNacho/kudu-flink-examples/
• flink-connector-kudu: https://github.com/0xNacho/bahir-flink branch: feature/flink-connector-kudu
CODE
Contributions are welcome!
FUTURE WORK
A first version is released, but the team keeps working:
• Support more data types (currently only Tuples are supported)
• Implement missing features (e.g. KuduSource, once CDC (KUDU-2180) is
available in Kudu)
• Fix the issue with the limit() operator: it is currently an open issue in Kudu's
Java API: KUDU-16, KUDU-2093
THAT’S ALL FOLKS. THANK YOU!

Apache Flink & Kudu: a connector to develop Kappa architectures
