SlideShare a Scribd company logo
APACHE KUDU
ASIM JALIS
GALVANIZE
INTRO
ASIM JALIS
Galvanize/Zipfian, Data
Engineering
Cloudera, Microso!,
Salesforce
MS in Computer Science
from University of
Virginia
WHAT IS GALVANIZE’S DATA
ENGINEERING IMMERSIVE?
Immersive Peer Learning
Environment
Master High-Demand
Skills and Technologies
Heart of San Francisco in
SOMA
YOU GET TO . . .
Play with Terabytes of
Data
Spark, Hadoop, Hive,
Kafka, Storm, HBase
Data Science at Scale
Level UP your Career
FOR MORE INFORMATION
Check out
http://galvanize.com
asim.jalis@galvanize.com
TALK OVERVIEW
WHAT IS THIS TALK ABOUT?
What is Kudu?
How can I use it to
simplify my Big Data
architecture?
How can I use it as an
alternative to HBase?
How does it work?
Demo
HOW MANY PEOPLE HERE ARE
FAMILIAR WITH HDFS?
HOW MANY PEOPLE HERE ARE
FAMILIAR WITH HBASE?
HOW MANY PEOPLE HERE ARE
FAMILIAR WITH KUDU?
WHY KUDU
WHAT IS A KUDU?
Kudus are a kind of
antelope
Found in eastern and
southern Africa
WHAT PROBLEM DOES KUDU
SOLVE?
WHAT IS LATENCY?
How long it takes to start
How long it takes to setup
WHICH HAS HIGHER LATENCY:
AIRPLANES OR CARS?
WHICH HAS HIGHER LATENCY:
AIRPLANES OR CARS?
Airplanes have high
latency
Cars have lower latency
Winner: Cars
WHAT IS THROUGHPUT?
Operations per second/minute/hour
How much you get done in unit time
WHICH HAS HIGHER
THROUGHPUT: AIRPLANES OR
CARS?
WHICH HAS HIGHER
THROUGHPUT: AIRPLANES OR
CARS?
Airplanes have high
throughput
Cars have lower
throughput
Winner: Airplanes
WHICH IS BETTER?
Low throughput High throughput
Low latency Cars Jet-Packs
High latency Horses Planes
It depends, usually there is a tradeoff
Going to Oakland: Drive
Going to Florida: Fly
WHY DOES HBASE EXIST?
HDFS immutable,
append-only
HBase mutable, random
access read/write
Fast random access
Low latency (good)
WHY DOES PARQUET EXIST?
Main idea: columnar
storage
Fast scan, fast batch
processing
High throughput (good)
High latency (bad)
WHY DOES KUDU EXIST?
Low throughput High throughput
Low latency HBase Kudu
High latency JSON on HDFS Parquet on HDFS
WHY KUDU?
Kudu is the child of
Parquet and HBase
HBase with columnar
storage
Mutable Parquet with
fast random access for
read/write
Parquet-like HBase
HBase-like Parquet
WHAT IS THE USE CASE FOR
KUDU?
Use HBase-like features to store data as it arrives in real-
time
Use Parquet-like features to run analytic workloads on
real-time data
In queries easily combine long-term historical data and
short-term real-time data
WHAT IS THE BIG IDEA OF KUDU?
HBase uses in-memory store for low-latency reads/writes
Parquet on HDFS uses columnar layout for high-
throughput scans
Kudu idea: in-memory store + columnar layout
KUDU MOTIVATION
Goal Competing With
Fast columnar scans Parquet/HDFS
Low-latency random updates HBase
Consistent performance -
KUDU, HBASE, HDFS
WHAT LANGUAGE IS KUDU
WRITTEN IN?
C++ for performance
Java and Python wrappers
HOW CAN I INTERACT WITH
KUDU?
Impala is Kudu’s defacto shell
C++, Java, or Python client
MapReduce
Spark (beta)
MapReduce and Spark read-only access (currently)
BENCHMARKS
KUDU VS PARQUET ON HDFS
TPC-H: Business-oriented queries/updates
Latency in ms: lower is better
KUDU VS HBASE
Yahoo! Cloud System
Benchmark (YCSB)
Evaluates key-value and
cloud serving stores
Random acccess
workload
Throughput: higher is
better
KUDU VS PHOENIX VS PARQUET
SQL analytic workload
TPC-H LINEITEM table only
Phoenix best-of-breed SQL on HBase
LAMBDA ARCHITECTURE
KUDU USE CASE: LAMBDA
ARCHITECTURE
LAMBDA ARCHITECTURE
Real-time data process
by Speed Layer
Historical data
processed by Batch
Layer
Speed Layer results
saved in HBase
New and old data joined
in Spark Streaming
ONLINE RETAILER USE CASE
List popular items top of
front page
What is best selling item:
this hour
today
this week
Keep inventory numbers
updated
KUDU
Using Kudu
Speed layer queries can
include analytics
PRE-KUDU ARCH
POST-KUDU ARCH
KUDU INTERNALS
KUDU ARCHITECTURE
KUDU MASTER
Fault tolerance
Failover to backup
masters
Ra! used for electing
new leaders
Only leader serves client
requests
KUDU MASTER ROLES
Catalog Manager
Metadata: schema,
replication
Master Tablet stores
metadata
Cluster Coordinator
Who is alive:
redistribute data on
death
Tablet directory
Which tablet is where
TABLETS
Tablets are similar to
HBase Regions
Tables statically
partitioned into tablets
Tablets can be
replicated to 3 or 5
Replication uses Ra!’s
Leader/Follower pattern
for consensus
Data stored on local file
system, not HDFS
TABLET REPLICATION
Writes must be done on
leader tablet
Reads can be on leader
or followers
Write log is replicated
Log replication driven by
leader
Uses Ra! consensus
THICK CLIENT
Client caches tablet metadata
Uses metadata to figure out leader to write to
Failure on write to deposed leader
Write failure forces metadata refresh
TABLET INTERNALS
Component Description
MemRowSet In-memory writes, stored row-wise
DiskRowSet Disk-based data store, columnar, 32 KB
DeltaMemStores In-memory update store
DeltaFile Disk-based update store
MEM ROW SET
Writes for new keys are
written in MemRowSets.
Then flushed out to
DiskRowSets.
Kudu uses Bloom filters
to determine if a key is in
DiskRowSet.
DELTA MEM STORES
Updates are written to
DeltaMemStores.
The flushed to
DeltaFiles.
TABLET COUNT
Tablet count based on partitions
Partitioning function maps row to tablet
Partitioning defined at table creation time
Partitions cannot be redefined dynamically
PARTITIONING
Hash Partitioning
Subset of the primary key columns and number of buckets
For example
DISTRIBUTE BY HASH(col1,col2) INTO 16 BUCK
Range Partitioning
Ordered subset of primary key columns
Maps tuples into binary strings by concatenating values of
specified columns using order-preserving encoding.
DOES KUDU REQUIRE HADOOP?
It does not depend on HDFS.
Has no required dependency on Hadoop, HDFS,
MapReduce, Spark.
However, in practice accessed most easily through Impala.
DO KUDU TABLETSERVERS
SHARE DISK SPACE WITH HDFS?
Kudu TabletServers and HDFS DataNodes can run on the
machines.
Kudu’s InputFormat enables data locality.
Data locality: MapReduce and Spark tasks likely to run on
machines containing data.
KUDU SCHEMA
WHAT DATA TYPES DOES KUDU
SUPPORT?
Boolean
8-bit signed integer
16-bit signed integer
32-bit signed integer
64-bit signed integer
Timestamp
32-bit floating-point
64-bit floating-point
String
Binary
HOW LARGE CAN VALUES BE IN
KUDU?
Values in the 10s of KB and above are not recommended
Poor performance
Stability issues in current release
Not intended for big blobs or images
ENCODING TYPES
Column Type Encoding
integer, timestamp plain, bitshuffle, run length
float plain, bitshuffle
bool plain, dictionary, run length
string, binary plain, prefix, dictionary
WHAT IS PLAIN ENCODING?
Data in its natural format.
E.g. int32 stored as 32-bit little-endian integers.
WHAT IS BITSHUFFLE ENCODING?
Bitwise columnar
encoding.
MSB stored first, then
second-MSB, etc.
Result LZ4 compressed.
Works well when values
repeat or change by
small amounts.
11010010 1111 1111
11010011 --> 0000 1111
11010100 0000 0011
11010101 1100 1010
WHAT IS RUN LENGTH
ENCODING?
Repeated values (runs) are stored as value and count.
Works well for denormalized tables with consecutive
repeated values when sorted by primary key.
54.231.184.7
54.231.184.7
54.231.184.7 --> 54.231.184.7 | 5
54.231.184.7
54.231.184.7
WHAT IS DICTIONARY ENCODING?
Dictionary of unique values.
Value encoded as index in dictionary.
Works well if column has small set of unique values.
If there are too many values Kudu falls back to plain
encoding.
WHAT IS PREFIX ENCODING?
Common prefixes compressed in consecutive column
values.
Works well when values share common prefixes.
COLUMN COMPRESSION
Per-column compression using LZ4, Snappy, or ZLib.
By default, columns stored uncompressed.
WHAT IMPACT WILL
COMPRESSION HAVE ON SCAN
PERFORMANCE AND ON SPACE?
It will reduce storage space.
It will reduce scan performance.
DEMO
ADMIN UI VM PORT SETUP
Virtual Box > Settings > Network > Adapter 2 > Port Forwarding > +
Enter:
Name: Rule 1
Protocol: TCP
Host IP:
Host Port: 8051
Guest IP:
Guest Port: 8051
Open browser:
http://quickstart.cloudera:8051
CONNECT TO VM
ssh demo@quickstart.cloudera
START IMPALA SHELL
impala-shell
CREATE A TABLE
CREATE TABLE sales (
state STRING,
id INTEGER,
sale_date STRING,
store INTEGER,
product INTEGER,
amount DOUBLE
)
DISTRIBUTE BY RANGE(state)
SPLIT ROWS(('CA'),('WA'),('OR'))
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'sales',
'kudu.master_addresses' = 'quickstart.cloudera:7051',
'kudu.key_columns' = 'state,id'
);
INSERT SOME DATA INTO IT
INSERT INTO sales
(state, id, sale_date, store, product, amount)
VALUES
('WA',101,'2014-11-13',100,331,300.00),
('OR',104,'2014-11-18',700,329,450.00),
('CA',102,'2014-11-15',203,321,200.00),
('CA',106,'2014-11-19',202,331,330.00),
('WA',103,'2014-11-17',101,373,750.00),
('CA',105,'2014-11-19',202,321,200.00);
TRY SQL
SELECT COUNT(*) FROM sales;
SELECT STATE,COUNT(*) FROM sales GROUP BY STATE;
VIEW TABLE IN ADMIN UI
Go to http://quickstart.cloudera:8051/
REFERENCES
READINGS
Kudu Whitepaper
http://getkudu.io/kudu.pdf
Quickstart VM
http://getkudu.io/docs/quickstart.html
Schema
http://getkudu.io/docs/schema_design.html
Kudu Impala Guide
http://www.cloudera.com/documentation/betas/kudu/0-5-
0/topics/kudu_impala.html
GALVANIZE DATA ENGINEERING
QUESTIONS

More Related Content

What's hot

Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
Spark Summit
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuning
Carlos del Cacho
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structures
confluent
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
Rakesh Jayaram
 
(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift
Amazon Web Services
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
CalvinSim10
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
Sudheer Kondla
 

What's hot (20)

Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuning
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structures
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
 

Viewers also liked

Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
Turi, Inc.
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
DataWorks Summit
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
Wes McKinney
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
DataWorks Summit/Hadoop Summit
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Turi, Inc.
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
rhatr
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
sscdotopen
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
Sandy Ryza
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
Ryan Bosshart
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
 

Viewers also liked (11)

Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 

Similar to Apache kudu

Core concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsCore concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data Analytics
Kaniska Mandal
 
SQL Server 2014 In-Memory Tables (XTP, Hekaton)
SQL Server 2014 In-Memory Tables (XTP, Hekaton)SQL Server 2014 In-Memory Tables (XTP, Hekaton)
SQL Server 2014 In-Memory Tables (XTP, Hekaton)
Tony Rogerson
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
Sigmoid
 
SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
Sudar Muthu
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
Ahmed Salman
 
Bdam presentation on parquet
Bdam presentation on parquetBdam presentation on parquet
Bdam presentation on parquet
Manpreet Khurana
 
Lecture 26
Lecture 26Lecture 26
Lecture 26
Shani729
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
Asim Jalis
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
James Chen
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Hadoop User Group
 
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Lviv Startup Club
 
Elephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to HadoopElephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to Hadoop
Stuart Ainsworth
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BI
Prasad Prabhu (PP)
 
Overview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational DatabasesOverview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational Databases
Andrew Kandels
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
Cloudera, Inc.
 
Nextag talk
Nextag talkNextag talk
Nextag talk
Joydeep Sen Sarma
 
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
DataStax Academy
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
nzhang
 

Similar to Apache kudu (20)

Core concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsCore concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data Analytics
 
SQL Server 2014 In-Memory Tables (XTP, Hekaton)
SQL Server 2014 In-Memory Tables (XTP, Hekaton)SQL Server 2014 In-Memory Tables (XTP, Hekaton)
SQL Server 2014 In-Memory Tables (XTP, Hekaton)
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
 
SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Bdam presentation on parquet
Bdam presentation on parquetBdam presentation on parquet
Bdam presentation on parquet
 
Lecture 26
Lecture 26Lecture 26
Lecture 26
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
 
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
 
Elephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to HadoopElephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to Hadoop
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BI
 
Overview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational DatabasesOverview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational Databases
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 

Recently uploaded

A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
sameer shah
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 

Recently uploaded (20)

A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 

Apache kudu

  • 3. ASIM JALIS Galvanize/Zipfian, Data Engineering Cloudera, Microso!, Salesforce MS in Computer Science from University of Virginia
  • 4. WHAT IS GALVANIZE’S DATA ENGINEERING IMMERSIVE? Immersive Peer Learning Environment Master High-Demand Skills and Technologies Heart of San Francisco in SOMA
  • 5. YOU GET TO . . . Play with Terabytes of Data Spark, Hadoop, Hive, Kafka, Storm, HBase Data Science at Scale Level UP your Career
  • 6.
  • 7. FOR MORE INFORMATION Check out http://galvanize.com asim.jalis@galvanize.com
  • 9. WHAT IS THIS TALK ABOUT? What is Kudu? How can I use it to simplify my Big Data architecture? How can I use it as an alternative to HBase? How does it work? Demo
  • 10. HOW MANY PEOPLE HERE ARE FAMILIAR WITH HDFS?
  • 11. HOW MANY PEOPLE HERE ARE FAMILIAR WITH HBASE?
  • 12. HOW MANY PEOPLE HERE ARE FAMILIAR WITH KUDU?
  • 14. WHAT IS A KUDU? Kudus are a kind of antelope Found in eastern and southern Africa
  • 15. WHAT PROBLEM DOES KUDU SOLVE?
  • 16. WHAT IS LATENCY? How long it takes to start How long it takes to setup
  • 17. WHICH HAS HIGHER LATENCY: AIRPLANES OR CARS?
  • 18. WHICH HAS HIGHER LATENCY: AIRPLANES OR CARS? Airplanes have high latency Cars have lower latency Winner: Cars
  • 19. WHAT IS THROUGHPUT? Operations per second/minute/hour How much you get done in unit time
  • 20. WHICH HAS HIGHER THROUGHPUT: AIRPLANES OR CARS?
  • 21. WHICH HAS HIGHER THROUGHPUT: AIRPLANES OR CARS? Airplanes have high throughput Cars have lower throughput Winner: Airplanes
  • 22. WHICH IS BETTER? Low throughput High throughput Low latency Cars Jet-Packs High latency Horses Planes It depends, usually there is a tradeoff Going to Oakland: Drive Going to Florida: Fly
  • 23. WHY DOES HBASE EXIST? HDFS immutable, append-only HBase mutable, random access read/write Fast random access Low latency (good)
  • 24. WHY DOES PARQUET EXIST? Main idea: columnar storage Fast scan, fast batch processing High throughput (good) High latency (bad)
  • 25. WHY DOES KUDU EXIST? Low throughput High throughput Low latency HBase Kudu High latency JSON on HDFS Parquet on HDFS
  • 26. WHY KUDU? Kudu is the child of Parquet and HBase HBase with columnar storage Mutable Parquet with fast random access for read/write Parquet-like HBase HBase-like Parquet
  • 27. WHAT IS THE USE CASE FOR KUDU? Use HBase-like features to store data as it arrives in real- time Use Parquet-like features to run analytic workloads on real-time data In queries easily combine long-term historical data and short-term real-time data
  • 28. WHAT IS THE BIG IDEA OF KUDU? HBase uses in-memory store for low-latency reads/writes Parquet on HDFS uses columnar layout for high- throughput scans Kudu idea: in-memory store + columnar layout
  • 29. KUDU MOTIVATION Goal Competing With Fast columnar scans Parquet/HDFS Low-latency random updates HBase Consistent performance -
  • 31. WHAT LANGUAGE IS KUDU WRITTEN IN? C++ for performance Java and Python wrappers
  • 32. HOW CAN I INTERACT WITH KUDU? Impala is Kudu’s defacto shell C++, Java, or Python client MapReduce Spark (beta) MapReduce and Spark read-only access (currently)
  • 34. KUDU VS PARQUET ON HDFS TPC-H: Business-oriented queries/updates Latency in ms: lower is better
  • 35. KUDU VS HBASE Yahoo! Cloud System Benchmark (YCSB) Evaluates key-value and cloud serving stores Random acccess workload Throughput: higher is better
  • 36. KUDU VS PHOENIX VS PARQUET SQL analytic workload TPC-H LINEITEM table only Phoenix best-of-breed SQL on HBase
  • 38. KUDU USE CASE: LAMBDA ARCHITECTURE
  • 39. LAMBDA ARCHITECTURE Real-time data process by Speed Layer Historical data processed by Batch Layer Speed Layer results saved in HBase New and old data joined in Spark Streaming
  • 40. ONLINE RETAILER USE CASE List popular items top of front page What is best selling item: this hour today this week Keep inventory numbers updated
  • 41. KUDU Using Kudu Speed layer queries can include analytics
  • 46. KUDU MASTER Fault tolerance Failover to backup masters Ra! used for electing new leaders Only leader serves client requests
  • 47. KUDU MASTER ROLES Catalog Manager Metadata: schema, replication Master Tablet stores metadata Cluster Coordinator Who is alive: redistribute data on death Tablet directory Which tablet is where
  • 48. TABLETS Tablets are similar to HBase Regions Tables statically partitioned into tablets Tablets can be replicated to 3 or 5 Replication uses Ra!’s Leader/Follower pattern for consensus Data stored on local file system, not HDFS
  • 49. TABLET REPLICATION Writes must be done on leader tablet Reads can be on leader or followers Write log is replicated Log replication driven by leader Uses Ra! consensus
  • 50. THICK CLIENT Client caches tablet metadata Uses metadata to figure out leader to write to Failure on write to deposed leader Write failure forces metadata refresh
  • 51. TABLET INTERNALS Component Description MemRowSet In-memory writes, stored row-wise DiskRowSet Disk-based data store, columnar, 32 KB DeltaMemStores In-memory update store DeltaFile Disk-based update store
  • 52. MEM ROW SET Writes for new keys are written in MemRowSets. Then flushed out to DiskRowSets. Kudu uses Bloom filters to determine if a key is in DiskRowSet.
  • 53. DELTA MEM STORES Updates are written to DeltaMemStores. The flushed to DeltaFiles.
  • 54. TABLET COUNT Tablet count based on partitions Partitioning function maps row to tablet Partitioning defined at table creation time Partitions cannot be redefined dynamically
  • 55. PARTITIONING Hash Partitioning Subset of the primary key columns and number of buckets For example DISTRIBUTE BY HASH(col1,col2) INTO 16 BUCK Range Partitioning Ordered subset of primary key columns Maps tuples into binary strings by concatenating values of specified columns using order-preserving encoding.
  • 56. DOES KUDU REQUIRE HADOOP? It does not depend on HDFS. Has no required dependency on Hadoop, HDFS, MapReduce, Spark. However, in practice accessed most easily through Impala.
  • 57. DO KUDU TABLETSERVERS SHARE DISK SPACE WITH HDFS? Kudu TabletServers and HDFS DataNodes can run on the machines. Kudu’s InputFormat enables data locality. Data locality: MapReduce and Spark tasks likely to run on machines containing data.
  • 59. WHAT DATA TYPES DOES KUDU SUPPORT? Boolean 8-bit signed integer 16-bit signed integer 32-bit signed integer 64-bit signed integer Timestamp 32-bit floating-point 64-bit floating-point String Binary
  • 60. HOW LARGE CAN VALUES BE IN KUDU? Values in the 10s of KB and above are not recommended Poor performance Stability issues in current release Not intended for big blobs or images
  • 61. ENCODING TYPES Column Type Encoding integer, timestamp plain, bitshuffle, run length float plain, bitshuffle bool plain, dictionary, run length string, binary plain, prefix, dictionary
  • 62. WHAT IS PLAIN ENCODING? Data in its natural format. E.g. int32 stored as 32-bit little-endian integers.
  • 63. WHAT IS BITSHUFFLE ENCODING? Bitwise columnar encoding. MSB stored first, then second-MSB, etc. Result LZ4 compressed. Works well when values repeat or change by small amounts. 11010010 1111 1111 11010011 --> 0000 1111 11010100 0000 0011 11010101 1100 1010
  • 64. WHAT IS RUN LENGTH ENCODING? Repeated values (runs) are stored as value and count. Works well for denormalized tables with consecutive repeated values when sorted by primary key. 54.231.184.7 54.231.184.7 54.231.184.7 --> 54.231.184.7 | 5 54.231.184.7 54.231.184.7
  • 65. WHAT IS DICTIONARY ENCODING? Dictionary of unique values. Value encoded as index in dictionary. Works well if column has small set of unique values. If there are too many values Kudu falls back to plain encoding.
  • 66. WHAT IS PREFIX ENCODING? Common prefixes compressed in consecutive column values. Works well when values share common prefixes.
  • 67. COLUMN COMPRESSION Per-column compression using LZ4, Snappy, or ZLib. By default, columns stored uncompressed.
  • 68. WHAT IMPACT WILL COMPRESSION HAVE ON SCAN PERFORMANCE AND ON SPACE? It will reduce storage space. It will reduce scan performance.
  • 69. DEMO
  • 70. ADMIN UI VM PORT SETUP Virtual Box > Settings > Network > Adapter 2 > Port Forwarding > + Enter: Name: Rule 1 Protocol: TCP Host IP: Host Port: 8051 Guest IP: Guest Port: 8051 Open browser: http://quickstart.cloudera:8051
  • 71. CONNECT TO VM ssh demo@quickstart.cloudera
  • 73. CREATE A TABLE CREATE TABLE sales ( state STRING, id INTEGER, sale_date STRING, store INTEGER, product INTEGER, amount DOUBLE ) DISTRIBUTE BY RANGE(state) SPLIT ROWS(('CA'),('WA'),('OR')) TBLPROPERTIES( 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler', 'kudu.table_name' = 'sales', 'kudu.master_addresses' = 'quickstart.cloudera:7051', 'kudu.key_columns' = 'state,id' );
  • 74. INSERT SOME DATA INTO IT INSERT INTO sales (state, id, sale_date, store, product, amount) VALUES ('WA',101,'2014-11-13',100,331,300.00), ('OR',104,'2014-11-18',700,329,450.00), ('CA',102,'2014-11-15',203,321,200.00), ('CA',106,'2014-11-19',202,331,330.00), ('WA',103,'2014-11-17',101,373,750.00), ('CA',105,'2014-11-19',202,321,200.00);
  • 75. TRY SQL SELECT COUNT(*) FROM sales; SELECT STATE,COUNT(*) FROM sales GROUP BY STATE;
  • 76. VIEW TABLE IN ADMIN UI Go to http://quickstart.cloudera:8051/