FLINK-KUDU-CONNECTOR
AN OPEN-SOURCE CONTRIBUTION TO
DEVELOP KAPPA ARCHITECTURES
WHO WE ARE
Team
RUBÉN CASADO
Big Data Manager,
Accenture Digital
NACHO GARCÍA
Senior Big Data Engineer,
Accenture Digital
• Big Data Chapter Lead in Accenture Digital Delivery Spain
• Apache Flink Madrid Meetup organizer
• Master in Big Data Architecture director at Kschool
• PhD in Computer Science / Software Engineering
• Passionate about open source
• Senior Big Data Engineer at Accenture Digital
• Lecturer of Master in Big Data at Kschool
• MSc in Computer Science
Copyright © 2017 Accenture. All rights reserved.
ruben_casado
ruben.casado.tejedor@accenture.com
0xNacho
n.garcia.fernandez@accenture.com
█ CONTEXT
Agenda
ACCENTURE HAS BEEN RANKED AS #1 BIG DATA PROVIDER IN SPAIN BY
PENTEO, A VERY PRESTIGIOUS IT BENCHMARKING/ADVISORY FIRM,
THANKS TO ITS DEEP KNOWLEDGE OF BUSINESS REQUIREMENTS IN
DIFFERENT DOMAINS, LARGE NETWORK OF BOTH ACADEMIC AND
TECHNOLOGY PARTNER ALLIANCES, AND DELIVERY CAPACITIES
PROJECT STARTED BY JUNIOR DEVELOPERS
• Learn Apache Flink
• Learn Apache Kudu
• Encourage the use and contribution to
the open source community
1) Open Jira ticket
2) Ticket is assigned to you
3) Open a Pull-Request
4) Overhaul the code
5) Done!
█ KAPPA ARCHITECTURE: A REAL NEED
Agenda
MOTIVATION
Batch processing
NoSQL
Stream processing
LAMBDA ARCHITECTURE
Two processing engines
[Figure: Lambda architecture. All data flows into the batch layer, which produces batch views; recent data flows into the speed layer, which produces real-time views. The serving layer answers queries from both sets of views.]
KAPPA ARCHITECTURE
A single engine
[Figure: Kappa architecture. A replayable input feeds a single streaming layer, and the serving layer answers queries from its results.]
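The "single engine over a replayable input" idea can be sketched in plain Java. This is only an illustration under stated assumptions: ReplayableLog and CountingJob are hypothetical names, not Flink or Kudu classes, and an in-memory list stands in for a real replayable source. The point it shows is that reprocessing is just running the same code again from offset 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout: a plain-Java sketch, not Flink or Kudu API.
final class KappaSketch {

    // Append-only log standing in for the replayable input.
    static final class ReplayableLog {
        private final List<String> events = new ArrayList<>();

        void append(String event) {
            events.add(event);
        }

        // Returns a copy of every event from the given offset to the end of the log.
        List<String> readFrom(int offset) {
            return new ArrayList<>(events.subList(offset, events.size()));
        }
    }

    // The single processing path: counts events per key into a serving view.
    static final class CountingJob {
        private final Map<String, Integer> counts = new HashMap<>();

        void process(List<String> events) {
            for (String key : events) {
                counts.merge(key, 1, Integer::sum);
            }
        }

        Map<String, Integer> servingView() {
            return counts;
        }
    }

    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        log.append("car-1");
        log.append("car-2");
        log.append("car-1");

        CountingJob live = new CountingJob();
        live.process(log.readFrom(0));

        // Reprocessing: a fresh job instance replays the same log through the same code.
        CountingJob reprocessed = new CountingJob();
        reprocessed.process(log.readFrom(0));

        System.out.println(live.servingView().equals(reprocessed.servingView()));
    }
}
```

In a Lambda architecture the second run would be a separate batch job with its own code base; here both runs share one implementation.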
█ APACHE KUDU AND APACHE FLINK: INTRODUCTION AND FEATURES
Agenda
APACHE KUDU
What is Apache Kudu?
Online (fast random access)
Analytics (fast scans)
https://www.youtube.com/watch?v=32zV7-I1JaM
https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
APACHE KUDU
What is Apache Kudu?
Designed for fast analytics on fast data
An open-source column-oriented data store
• Provides a combination of fast inserts/updates and efficient columnar scans
• Tables have a structured data model similar to an RDBMS
• Fast processing of OLAP workloads
• Integration with Hadoop ecosystem (Impala, Spark, Flink)
• Columnar data storage: strongly-typed columns
• Read efficiency: reads by columns, and not by rows
• Very efficient when only a portion (some columns) of a row is needed, because it reads a minimal number of blocks on disk
• Data compression: because columns are strongly typed, compression is more efficient
• Table: the place where data is stored. Split into segments called tablets
• Tablet: contiguous segment of a table (partition)
• Replicated on multiple tablet servers. Leader-follower model
• Tablet server: stores and serves tablets to clients
• Master: keeps track of everything. Leader-follower model
• Catalog table: central location for metadata.
APACHE KUDU: CONCEPTS
Key concepts of Apache Kudu
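The read-efficiency point about columnar storage can be illustrated with a toy in-memory layout. This is a minimal sketch, not how Kudu stores data on disk: ColumnarSketch and its column names are hypothetical, and plain lists stand in for encoded column blocks. It shows that an aggregate over one column touches only that column, while rebuilding a full row needs one lookup per column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy table: one typed list per column instead of one object per row.
final class ColumnarSketch {
    private final List<Long> vehicleIds = new ArrayList<>();
    private final List<Integer> speeds = new ArrayList<>();
    private final List<String> tags = new ArrayList<>();

    // Inserting a row appends one value to every column.
    void insert(long vehicleId, int speed, String tag) {
        vehicleIds.add(vehicleId);
        speeds.add(speed);
        tags.add(tag);
    }

    // Scans only the speed column; the other columns are never read.
    double averageSpeed() {
        if (speeds.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (int speed : speeds) {
            sum += speed;
        }
        return (double) sum / speeds.size();
    }

    // Rebuilding a whole row needs one lookup in each column.
    String row(int index) {
        return vehicleIds.get(index) + "," + speeds.get(index) + "," + tags.get(index);
    }

    public static void main(String[] args) {
        ColumnarSketch table = new ColumnarSketch();
        table.insert(1L, 10, "a");
        table.insert(2L, 20, "b");
        System.out.println(table.averageSpeed());
        System.out.println(table.row(1));
    }
}
```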
APACHE KUDU: ARCHITECTURE
Kudu network architecture
[Figure: master servers and tablet servers. Master Server A is the leader and Master Servers B and C are followers. Each tablet server hosts replicas of several tablets (Tablet 1, Tablet 2, ..., Tablet n); every tablet has one leader replica, and its follower replicas live on other tablet servers.]
Apache Flink
Flink programs are all about operators
How Flink works
[Figure: a streaming dataflow made of operators: source → map() → keyBy() → sink (source, transformation, sink).]
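The source, map(), keyBy(), sink chain can be mimicked with the JDK alone. The sketch below is an analogy, not Flink code: it uses java.util.stream as a stand-in for Flink operators, and DataflowSketch is a hypothetical name. A real Flink job would run these steps as distributed, long-running operators instead of a single in-process pass.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Analogy only: java.util.stream plays the role of the Flink operator chain.
final class DataflowSketch {

    static Map<String, Long> run(List<String> source) {
        return source.stream()                        // source: where records enter the dataflow
                .map(String::toLowerCase)             // map(): one-to-one transformation
                .collect(Collectors.groupingBy(       // keyBy(): records with the same key go together
                        word -> word,
                        Collectors.counting()));      // per-key count; the returned map plays the sink
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Flink", "kudu", "FLINK")));
    }
}
```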
█ FLINK-KUDU-CONNECTOR: A DEEP EXPLANATION
Agenda
FLINK CONNECTORS
out there
flink-connector-elasticsearch
flink-connector-kafka
flink-connector-redis
flink-connector-influxdb
flink-connector-kinesis
flink-connector-hbase
flink-jdbc
…
FLINK-KUDU-CONNECTOR
[Figure: a streaming dataflow that reads from a Kudu table, applies map() … and writes to a Kudu table.]
DataSet and DataStream APIs (batch & streaming)
FLINK IO API
Flink IO
org.apache.flink.api.common.io.InputFormat
• KuduInputFormat
org.apache.flink.api.common.io.OutputFormat
• KuduOutputFormat
org.apache.flink.streaming.api.functions.source.SourceFunction
• Not Implemented
• Kudu does not provide CDC
• Open issue: here
org.apache.flink.streaming.api.functions.sink.SinkFunction
• KuduSinkFunction
KUDU CONNECTOR CLASSES
CLASS                 USAGE     BASE                                   API
KuduInputFormat       PUBLIC    InputFormat                            DataSet
KuduOutputFormat      PUBLIC    OutputFormat                           DataSet
KuduSink              PUBLIC    SinkFunction                           DataStream
KuduInputSplit        INTERNAL  InputSplit                             -
KuduBatchTableSource  PUBLIC    BatchTableSource                       Table
KuduTableSink         PUBLIC    AppendStreamTableSink, BatchTableSink  Table
KUDUINPUTFORMAT
sample program
KuduInputFormat: Example
// Connection settings and target table for the read
KuduInputFormat.Conf inputConfig = KuduInputFormat.Conf.builder()
        .masterAddress("localhost")
        .tableName("myTable")
        //.addPredicate(new Predicate.PredicateBuilder("vehicle_tag").isEqualTo().val(32129))
        .build();
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Read the table as a DataSet of (Long, Integer, String) tuples
env.createInput(new KuduInputFormat<>(inputConfig), new TupleTypeInfo<Tuple3<Long, Integer, String>>(
        BasicTypeInfo.LONG_TYPE_INFO,
        BasicTypeInfo.INT_TYPE_INFO,
        BasicTypeInfo.STRING_TYPE_INFO))
        .flatMap(…)…
KUDUINPUTFORMAT
A program with parallelism = 5, TMs = 5, slots = 1 per TM
Reading data from Kudu in Flink
[Figure: the Kudu master and a Kudu table with four tablets (Tablet 1 to Tablet 4), read by five parallel instances, one per TaskManager slot (TM 1 to TM 5, slot 1 each). Four instances each feed a map operator. The diagram is annotated with IDLE and DATA SKEW.]
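The IDLE and DATA SKEW annotations on this slide can be illustrated with a small calculation. The sketch below rests on assumptions that are not stated in the slides: one input split per tablet, and a simplified round-robin hand-out of splits to parallel instances (Flink actually assigns splits on request at runtime). SplitAssignmentSketch and its methods are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical sketch: one split per tablet, handed out round-robin (a simplification).
final class SplitAssignmentSketch {

    // Number of splits each parallel instance receives.
    static int[] splitsPerInstance(int tablets, int parallelism) {
        int[] splits = new int[parallelism];
        for (int tablet = 0; tablet < tablets; tablet++) {
            splits[tablet % parallelism]++;
        }
        return splits;
    }

    // Instances that received no split at all stay idle.
    static long idleInstances(int[] splits) {
        return Arrays.stream(splits).filter(count -> count == 0).count();
    }

    // Rows each instance reads when tablets hold different numbers of rows (data skew).
    static long[] rowsPerInstance(long[] rowsPerTablet, int parallelism) {
        long[] rows = new long[parallelism];
        for (int tablet = 0; tablet < rowsPerTablet.length; tablet++) {
            rows[tablet % parallelism] += rowsPerTablet[tablet];
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] splits = splitsPerInstance(4, 5);
        System.out.println(Arrays.toString(splits) + " idle=" + idleInstances(splits));
        System.out.println(Arrays.toString(rowsPerInstance(new long[] {1000, 10, 10, 10}, 5)));
    }
}
```

With four tablets and a parallelism of five, at least one instance has nothing to read, and if one tablet is much larger than the others its reader becomes the bottleneck regardless of how many instances run.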
KUDUOUTPUTFORMAT
sample program
Flink connectors
// Connection settings, target table and write mode for the output
KuduOutputFormat.Conf outputConfig = KuduOutputFormat.Conf.builder()
        .masterAddress("localhost")
        .tableName("test")
        .writeMode(KuduOutputFormat.Conf.WriteMode.UPSERT)
        .build();
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet dataset = env.fromElements(1,2,3,4,5,5,6);
dataset.map(…)
        .combineGroup(…)
        .reduceGroup(…)
        .output(new KuduOutputFormat(outputConfig));
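The choice of an UPSERT write mode matters when the same primary key can arrive more than once. The sketch below shows the difference in plain Java; it is an illustration only. WriteModeSketch and its two-value WriteMode enum are hypothetical (the slides show only the UPSERT constant), and a HashMap stands in for a Kudu table keyed by primary key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a map keyed by primary key stands in for a Kudu table.
final class WriteModeSketch {

    // Hypothetical enum for this sketch; not the connector's own type.
    enum WriteMode { INSERT, UPSERT }

    private final Map<Long, String> rowsByKey = new HashMap<>();

    // Returns true if the row was written, false if it was rejected.
    boolean write(WriteMode mode, long key, String value) {
        if (mode == WriteMode.INSERT && rowsByKey.containsKey(key)) {
            return false; // INSERT refuses a primary key that already exists
        }
        rowsByKey.put(key, value); // UPSERT inserts a new row or overwrites the existing one
        return true;
    }

    String get(long key) {
        return rowsByKey.get(key);
    }

    public static void main(String[] args) {
        WriteModeSketch table = new WriteModeSketch();
        System.out.println(table.write(WriteMode.INSERT, 5L, "first"));
        System.out.println(table.write(WriteMode.INSERT, 5L, "second"));
        System.out.println(table.write(WriteMode.UPSERT, 5L, "third"));
        System.out.println(table.get(5L));
    }
}
```

The sample input contains the value 5 twice, so if such a value ended up as the primary key, an insert-only mode would reject the second row while an upsert would overwrite the first.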
KUDUOUTPUTFORMAT
A program with parallelism = 4, TMs = 4, slots = 1 per TM
Writing to Kudu from Flink
[Figure: four parallel instances, one per TaskManager slot (TM 1 to TM 4, slot 1 each), each running map followed by output, write to a Kudu table with four tablets (Tablet 1 to Tablet 4).]
FLINK TABLE API
Kudu working with Flink Table API
Kudu and Table API
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
KuduBatchTableSource kuduTableSource = new KuduBatchTableSource(conf, new TupleTypeInfo<>(
// type information
));
tEnv.registerTableSource("Taxis", kuduTableSource);
Table result = tEnv.sql("SELECT f1, AVG(f4), COUNT(*) from Taxis group by f1")
        .as("vehicle,avgSpeed,totalMeasures");
result.writeToSink(new KuduTableSink<>(outputConf, new TupleTypeInfo<>(
BasicTypeInfo.LONG_TYPE_INFO,
BasicTypeInfo.DOUBLE_TYPE_INFO,
BasicTypeInfo.LONG_TYPE_INFO
)));
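To make the query's result concrete, the sketch below computes the same aggregation (group by vehicle, average speed, number of measures) in plain Java over an in-memory list. It is only an illustration of what the SQL returns, not of how Flink executes it: GroupBySketch and Measure are hypothetical names, and the assumption that f1 is a vehicle id and f4 a speed comes from the column aliases in the query.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of SELECT f1, AVG(f4), COUNT(*) ... GROUP BY f1 in plain Java.
final class GroupBySketch {

    // Assumed shape of one input row: f1 as vehicle id, f4 as speed.
    static final class Measure {
        final long vehicle;
        final double speed;

        Measure(long vehicle, double speed) {
            this.vehicle = vehicle;
            this.speed = speed;
        }
    }

    // One statistics object per vehicle; it tracks both the average and the count.
    static Map<Long, DoubleSummaryStatistics> aggregate(List<Measure> measures) {
        Map<Long, DoubleSummaryStatistics> byVehicle = new TreeMap<>();
        for (Measure measure : measures) {
            byVehicle.computeIfAbsent(measure.vehicle, key -> new DoubleSummaryStatistics())
                    .accept(measure.speed);
        }
        return byVehicle;
    }

    public static void main(String[] args) {
        List<Measure> measures = Arrays.asList(
                new Measure(1, 10.0), new Measure(1, 20.0), new Measure(2, 5.0));
        aggregate(measures).forEach((vehicle, stats) ->
                System.out.println(vehicle + "," + stats.getAverage() + "," + stats.getCount()));
    }
}
```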
DEMO
• Ongoing contribution to Apache Bahir: https://github.com/apache/bahir-flink/pull/17
• Code samples: https://github.com/0xNacho/kudu-flink-examples/
• flink-connector-kudu: https://github.com/0xNacho/bahir-flink branch: feature/flink-connector-kudu
C0DE
Contributions are welcome!
FUTURE WORK
A first version is released, but the team keeps working
• Support more data types (currently only Tuples are supported)
• Implement missing features (e.g. KuduSource, once CDC (KUDU-2180) is available in Kudu)
• Fix the issue with the limit() operator: this is an open issue in Kudu's Java API (KUDU-16, KUDU-2093)
THAT’S ALL FOLKS. THANK YOU!

Apache Flink & Kudu: a connector to develop Kappa architectures

  • 1. FLINK-KUDU-CONNECTOR AN OPEN-SOURCE CONTRIBUTION TO DEVELOP KAPPA ARCHITECTURES
  • 2. WHO WE ARE Team RUBÉN CASADO Big Data Manager, Accenture Digital NACHO GARCÍA Senior Big Data Engineer, Accenture Digital • Big Data Chapter Lead in Accenture Digital Delivery Spain • Apache Flink Madrid Meetup organizer • Master in Big Data Architecture director at Kschool • PhD in Computer Science / Software Engineering • Open-source passionate • Senior Big Data Engineer at Accenture Digital • Lecturer of Master in Big Data at Kschool • Msc in Computer Science Copyright © 2017 Accenture. All rights reserved. ruben_casado ruben.casado.tejedor@accenture.com 0xNacho n.garcia.fernandez@accenture.com
  • 4. ACCENTURE HAS BEEN RANKED AS #1 BIG DATA PROVIDER IN SPAIN BY PENTEO, A VERY PRESTIGIOUS IT BENCHMARKING/ADVISE FIRM, THANKS TO THE DEEP KNOWLEDGE OF BUSINESS REQUIREMENTES IN DIFERENTS DOMAINS, LARGE NETWORK OF BOTH ACADEMIC AND TECHNOLOGIC PARTNERS ALLIANZES AND DELIVERY CAPACITIES Copyright © 2017 Accenture. All rights reserved.
  • 5. PROJECT STARTED BY JUNIOR DEVELOPERS • Learn Apache Flink • Learn Apache Kudu • Encourage the use of, and contribution to, open source. Contribution steps: 1) Open a Jira ticket 2) Get the ticket assigned to you 3) Open a pull request 4) Overhaul the code 5) Done!
  • 6. █ KAPPA ARCHITECTURE: A REAL NEED Agenda
  • 8. LAMBDA ARCHITECTURE: two processing engines. Diagram: ALL DATA -> BATCH LAYER -> batch views; RECENT DATA -> SPEED LAYER -> real-time views; the SERVING LAYER answers QUERIES from both the batch views and the real-time views.
  • 9. KAPPA ARCHITECTURE: a single engine. Diagram: REPLAYABLE source -> STREAMING LAYER -> SERVING LAYER -> QUERIES.
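
As a rough illustration of the single-engine idea on the Kappa slide, the Java sketch below keeps events in a replayable in-memory log and derives the serving view by running one processing function over it. Changing the logic means replaying the same log through the new function instead of maintaining a separate batch job. All names (`KappaSketch`, `Event`, `replay`) are hypothetical, and the in-memory list merely stands in for a real replayable source; none of this is Flink or Kudu API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToLongFunction;

// Hypothetical sketch: an in-memory list stands in for a replayable log.
final class KappaSketch {

    // One immutable event in the log.
    static final class Event {
        final String key;
        final long value;

        Event(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    // The single processing path: replays the whole log and folds every event
    // into a per-key view, using the supplied logic to weigh each event.
    static Map<String, Long> replay(List<Event> log, ToLongFunction<Event> logic) {
        Map<String, Long> view = new TreeMap<>();
        for (Event e : log) {
            view.merge(e.key, logic.applyAsLong(e), Long::sum);
        }
        return view;
    }

    public static void main(String[] args) {
        List<Event> log = Arrays.asList(
                new Event("taxi-1", 40),
                new Event("taxi-2", 55),
                new Event("taxi-1", 60));

        // Version 1 of the logic: sum of values per key.
        System.out.println(replay(log, e -> e.value));
        // Version 2 of the logic: number of events per key, from the same log.
        System.out.println(replay(log, e -> 1L));
    }
}
```

The point of the design is that there is only one code path to keep correct: a new version of the logic produces a new view by reprocessing the retained events.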
  • 10. █ APACHE KUDU AND APACHE FLINK: INTRODUCTION AND FEATURES Agenda
  • 11. APACHE KUDU What is Apache Kudu? Online (fast random access) Analytics (fast scans) https://www.youtube.com/watch?v=32zV7-I1JaM https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
  • 12. APACHE KUDU What is Apache Kudu? Designed for fast analytics on fast data. An open-source column-oriented data store • Provides a combination of fast inserts/updates and efficient columnar scans • Tables have a structured data model similar to an RDBMS • Fast processing of OLAP workloads • Integration with the Hadoop ecosystem (Impala, Spark, Flink)
  • 13. APACHE KUDU: CONCEPTS Key concepts of Apache Kudu • Columnar data storage: strongly typed columns • Read efficiency: reads by column, not by row • Very efficient when only a portion (some columns) of a row is needed, because it reads a minimal number of blocks from disk • Data compression: because columns are strongly typed, compression is more efficient • Table: the place where data is stored. Split into segments called tablets • Tablet: contiguous segment of a table (partition) • Replicated on multiple tablet servers. Leader-follower model • Tablet server: stores and serves tablets to clients • Master: keeps track of everything. Leader-follower model • Catalog table: central location for metadata.
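
To make the read-efficiency bullet concrete, here is a minimal Java sketch that contrasts a row layout with a column layout for the same records. It only illustrates why scanning one column touches less data; it is not Kudu's actual storage format, which adds per-column encoding and compression, and every name in it (`ColumnarSketch`, `Row`, `Columns`) is hypothetical.

```java
// Hypothetical sketch contrasting row and column layouts; not Kudu's format.
final class ColumnarSketch {

    // Row-oriented layout: every record carries all of its fields.
    static final class Row {
        final long id;
        final int speed;
        final String tag;

        Row(long id, int speed, String tag) {
            this.id = id;
            this.speed = speed;
            this.tag = tag;
        }
    }

    // Column-oriented layout: one strongly typed array per column.
    static final class Columns {
        final long[] ids;
        final int[] speeds;
        final String[] tags;

        Columns(Row[] rows) {
            ids = new long[rows.length];
            speeds = new int[rows.length];
            tags = new String[rows.length];
            for (int i = 0; i < rows.length; i++) {
                ids[i] = rows[i].id;
                speeds[i] = rows[i].speed;
                tags[i] = rows[i].tag;
            }
        }
    }

    // Row scan: visits whole records even though only 'speed' is needed.
    static double averageSpeed(Row[] rows) {
        long sum = 0;
        for (Row r : rows) {
            sum += r.speed;
        }
        return rows.length == 0 ? 0.0 : (double) sum / rows.length;
    }

    // Column scan: reads only the contiguous 'speed' array.
    static double averageSpeed(Columns columns) {
        long sum = 0;
        for (int s : columns.speeds) {
            sum += s;
        }
        return columns.speeds.length == 0 ? 0.0 : (double) sum / columns.speeds.length;
    }

    public static void main(String[] args) {
        Row[] rows = {
            new Row(1L, 30, "a"), new Row(2L, 50, "b"), new Row(3L, 70, "c")
        };
        System.out.println(averageSpeed(rows));
        System.out.println(averageSpeed(new Columns(rows)));
    }
}
```

Both methods return the same average; the column version never touches `ids` or `tags`, which is the property the slide attributes to columnar reads.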
  • 14. APACHE KUDU: ARCHITECTURE Kudu network architecture. Master servers: Master Server A (LEADER), Master Server B (Follower), Master Server C (Follower). Tablet servers: each tablet server hosts replicas of several tablets (Tablet 1, Tablet 2, … Tablet n); every tablet has one LEADER replica on one tablet server and Follower replicas on the others.
  • 15. APACHE FLINK How Flink works. Flink programs are all about operators. STREAMING DATAFLOW: source -> map() -> keyBy() -> sink (SOURCE, TRANSFORMATION and SINK OPERATORS)
  • 16. █ FLINK-KUDU-CONNECTOR: A DEEP EXPLANATION Agenda
  • 17. FLINK CONNECTORS out there: flink-connector-elasticsearch, flink-connector-kafka, flink-connector-redis, flink-connector-influxdb, flink-connector-kinesis, flink-connector-hbase, flink-jdbc, …
  • 18. FLINK-KUDU-CONNECTOR Diagram: Kudu Table -> STREAMING DATAFLOW (map() …) -> Kudu Table. Supports the DataSet and DataStream APIs (batch & streaming)
  • 19. FLINK IO API Flink IO • org.apache.flink.api.common.io.InputFormat: KuduInputFormat • org.apache.flink.api.common.io.OutputFormat: KuduOutputFormat • org.apache.flink.streaming.api.functions.source.SourceFunction: not implemented (Kudu does not provide CDC; open issue: here) • org.apache.flink.streaming.api.functions.sink.SinkFunction: KuduSinkFunction
  • 20. KUDU CONNECTOR CLASSES (CLASS | USAGE | BASE | API): KuduInputFormat | PUBLIC | InputFormat | DataSet; KuduOutputFormat | PUBLIC | OutputFormat | DataSet; KuduSink | PUBLIC | SinkFunction | DataStream; KuduInputSplit | INTERNAL | InputSplit | -; KuduBatchTableSource | PUBLIC | BatchTableSource | Table; KuduTableSink | PUBLIC | AppendStreamTableSink, BatchTableSink | Table
  • 21. KUDUINPUTFORMAT sample program. KuduInputFormat: Example
        // Connection settings for the source table
        KuduInputFormat.Conf inputConfig = KuduInputFormat.Conf.builder()
            .masterAddress("localhost")
            .tableName("myTable")
            //.addPredicate(new Predicate.PredicateBuilder("vehicle_tag").isEqualTo().val(32129))
            .build();
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Read each Kudu row as a Tuple3<Long, Integer, String>
        env.createInput(new KuduInputFormat<>(inputConfig),
                new TupleTypeInfo<Tuple3<Long, Integer, String>>(
                    BasicTypeInfo.LONG_TYPE_INFO,
                    BasicTypeInfo.INT_TYPE_INFO,
                    BasicTypeInfo.STRING_TYPE_INFO))
            .flatMap(…)…
  • 22. KUDUINPUTFORMAT Reading data from Kudu in Flink. A program with parallelism = 5, TMs = 5, slots = 1 per TM, reading a KUDU TABLE with four tablets (TABLET 1 to TABLET 4, located through the KUDU MASTER). Four parallel instances each read one tablet and run map …; with only four tablets, one of the five parallel instances (each on its own TM, slot 1) stays IDLE. Also to watch for: DATA SKEW.
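
The idle instance on the reading slide can be reproduced with a small Java sketch. It assumes one input split per tablet, as the diagram suggests, and distributes splits round-robin purely for illustration; Flink hands out input splits on request at runtime, so the real assignment may differ. The class and method names (`SplitAssignmentSketch`, `assign`, `idleInstances`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one split per tablet, handed out round-robin.
final class SplitAssignmentSketch {

    // Returns, for each parallel instance, the 1-based tablet numbers it reads.
    static List<List<Integer>> assign(int tablets, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            assignment.add(new ArrayList<>());
        }
        for (int t = 0; t < tablets; t++) {
            assignment.get(t % parallelism).add(t + 1);
        }
        return assignment;
    }

    // Counts the parallel instances that received no tablet at all.
    static int idleInstances(List<List<Integer>> assignment) {
        int idle = 0;
        for (List<Integer> splits : assignment) {
            if (splits.isEmpty()) {
                idle++;
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        // The slide's scenario: 4 tablets, parallelism 5.
        List<List<Integer>> assignment = assign(4, 5);
        for (int i = 0; i < assignment.size(); i++) {
            System.out.println("instance " + (i + 1) + " reads tablets " + assignment.get(i));
        }
        System.out.println("idle instances: " + idleInstances(assignment));
    }
}
```

Under these assumptions, read parallelism is capped by the number of tablets, and tablets of unequal size would make some instances finish much later than others, which is the data-skew concern named on the slide.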
  • 23. KUDUOUTPUTFORMAT sample program
        // Connection settings for the target table; rows are upserted
        KuduOutputFormat.Conf outputConfig = KuduOutputFormat.Conf.builder()
            .masterAddress("localhost")
            .tableName("test")
            .writeMode(KuduOutputFormat.Conf.WriteMode.UPSERT)
            .build();
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet dataset = env.fromElements(1,2,3,4,5,5,6);
        // Transform the data, then write the result to Kudu
        dataset.map(…)
            .combineGroup(…)
            .reduceGroup(…)
            .output(new KuduOutputFormat(outputConfig));
  • 24. KUDUOUTPUTFORMAT Writing to Kudu from Flink. A program with parallelism = 4, TMs = 4, slots = 1 per TM: each of the four parallel instances (TM 1 to TM 4, slot 1) runs map -> output and writes to the KUDU TABLE (TABLET 1 to TABLET 4).
  • 25. FLINK TABLE API Kudu working with the Flink Table API
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
        // Register a Kudu table as a batch table source
        KuduBatchTableSource kuduTableSource = new KuduBatchTableSource(conf, new TupleTypeInfo<>(
            // type information
        ));
        tEnv.registerTableSource("Taxis", kuduTableSource);
        // Aggregate per vehicle and rename the result columns
        Table result = tEnv.sql("SELECT f1, AVG(f4), COUNT(*) from Taxis group by f1")
            .as("vehicle,avgSpeed,totalMeasures");
        // Write the result back to Kudu
        result.writeToSink(new KuduTableSink<>(outputConf, new TupleTypeInfo<>(
            BasicTypeInfo.LONG_TYPE_INFO,
            BasicTypeInfo.DOUBLE_TYPE_INFO,
            BasicTypeInfo.LONG_TYPE_INFO)));
  • 26. DEMO
  • 27. C0DE Contributions are welcome! • Ongoing contribution to Apache Bahir: https://github.com/apache/bahir-flink/pull/17 • Code samples: https://github.com/0xNacho/kudu-flink-examples/ • flink-connector-kudu: https://github.com/0xNacho/bahir-flink (branch: feature/flink-connector-kudu)
  • 28. FUTURE WORK A first version is released, but the team keeps working • Support more data types (currently only Tuples are supported) • Implement missing features (e.g. KuduSource, once CDC (KUDU-2180) is available in Kudu) • Fix the issue with the limit() operator: it is an open issue in Kudu's Java API (KUDU-16, KUDU-2093)
  • 29. THAT’S ALL FOLKS. THANK YOU!