A Closer Look at Apache Kudu
By Andriy Zabavskyy, March 2017
Kudu: a species of antelope from the Big Data Zoo
Why Kudu
Analytics on Hadoop before Kudu
(Diagram: the trade-off between fast scans and fast random access)
Weaknesses of combining Parquet and HBase
• Complex code to manage the flow and synchronization of data between the two systems
• Consistent backups, security policies, and monitoring must be managed across multiple distinct systems
Lambda Architecture Challenges
• In the real world, systems often need to accommodate:
• Late-arriving data
• Corrections on past records
• Privacy-related deletions of data that has already been migrated to the immutable store
Happy Medium
• High throughput: goal within 2x of Impala
• Low latency for random reads/writes: goal of 1 ms on SSD
• Both SQL- and NoSQL-style APIs
(Diagram: Kudu as the happy medium between fast scans and fast random access)
Data Model
Tables, Schemas, Keys
• Kudu is a storage system for tables of structured data
• A schema consists of a finite number of columns
• Each column has a name and a type:
• Boolean, integer, unixtime_micros,
• floating point, string, or binary
Keys
• Some ordered subset of those columns is specified as the table's primary key
• The primary key:
• enforces a uniqueness constraint
• acts as the sole index by which rows may be efficiently updated or deleted
Write Operations
• Users mutate the table using Insert, Update, and Delete APIs
• Note: the primary key must be fully specified
• Java, C++, and Python APIs
• No multi-row transactional API:
• each mutation conceptually executes as its own transaction,
• despite being automatically batched with other mutations for better performance
Read Operations
• Scan operation:
• any number of predicates to filter the results
• two types of predicates:
• comparisons between a column and a constant value
• composite primary key ranges
• A user may specify a projection for a scan
• A projection consists of a subset of columns to be retrieved
Read/Write Python API Sample
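The sample on this slide was shown as an image in the original deck. Below is a minimal sketch of the same read/write flow using the kudu-python client; the master host, table name, and columns are illustrative, and set_projected_column_names is assumed to be available in the installed client version.

import kudu
from kudu.client import Partitioning

# Connect to the Kudu master (host and port are illustrative).
client = kudu.connect(host='kudu-master.example.com', port=7051)

# Define a schema; the primary key also acts as the sole index.
builder = kudu.schema_builder()
builder.add_column('key').type(kudu.int64).nullable(False).primary_key()
builder.add_column('value', type_=kudu.string)
schema = builder.build()

# Hash-partition on the key into 3 tablets and create the table.
partitioning = Partitioning().add_hash_partitions(column_names=['key'], num_buckets=3)
client.create_table('python_sample', schema, partitioning)
table = client.table('python_sample')

# Writes: each mutation is conceptually its own transaction; the session
# batches them for performance. The primary key must be fully specified.
session = client.new_session()
session.apply(table.new_insert({'key': 1, 'value': 'hello'}))
session.apply(table.new_update({'key': 1, 'value': 'world'}))
session.apply(table.new_delete({'key': 1}))
session.flush()

# Read: a scan with a predicate and a projection (a subset of columns).
scanner = table.scanner()
scanner.set_projected_column_names(['key', 'value'])  # assumed API; optional
scanner.add_predicate(table['key'] > 0)
print(scanner.open().read_all_tuples())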
Storage Layout
Storage Layout Goals
• Fast columnar scans
• best-of-breed immutable data formats such as Parquet
• efficiently encoded columnar data files
• Low-latency random updates
• O(lg n) lookup complexity for random access
• Consistency of performance
• most users are willing to trade peak performance for predictability
MemRowSet
• In-memory concurrent B-tree
• No removals from the tree: MVCC deletion records are inserted instead
• No in-place updates: only modifications that do not change the value size
• Leaf nodes are linked together for sequential scans
• Row-wise layout
DiskRowSet
• Column-organized
• Each column is written to disk in a single contiguous block of data
• The column itself is subdivided into small pages to allow granular random reads
• An embedded B-tree index allows efficient seeking to each page by row offset
Deltas
• A DeltaMemStore is a concurrent B-tree that shares its implementation with the MemRowSet
• A DeltaMemStore flushes into a DeltaFile
• A DeltaFile is a simple binary-typed column
Insert Path
• Each DiskRowSet stores a Bloom filter of the set of keys present
• For each DiskRowSet, Kudu also stores the minimum and maximum primary key (together with the Bloom filter, this lets an insert skip rowsets that cannot contain the key; see the sketch below)
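As a purely hypothetical sketch (attribute names are invented, not Kudu internals), the two structures above let an insert skip most DiskRowSets when checking whether the key already exists:

# Hypothetical sketch: which DiskRowSets must be consulted before an INSERT
# to enforce primary-key uniqueness.
def candidate_rowsets(key, disk_rowsets):
    candidates = []
    for rs in disk_rowsets:
        # Cheap range check: skip rowsets whose [min_key, max_key] interval
        # cannot contain the key.
        if key < rs.min_key or key > rs.max_key:
            continue
        # Bloom filter check: may give false positives, never false negatives,
        # so a negative answer lets us skip the rowset safely.
        if not rs.bloom_filter.maybe_contains(key):
            continue
        candidates.append(rs)
    return candidates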
Read Path
• Converts the key range predicate into a row offset range predicate
• Performs the scan one column at a time
• Seeks the target column to the correct row offset
• Consults the delta stores to see whether any later updates apply (sketched below)
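Again as a hypothetical sketch rather than real Kudu code, the per-column read path described above might look like this:

# Hypothetical sketch of scanning one column of one DiskRowSet.
def scan_column(rowset, column, key_lower, key_upper, snapshot_ts):
    # 1. Convert the key-range predicate into a row-offset range using the
    #    rowset's primary-key index.
    start = rowset.key_index.lower_bound(key_lower)
    end = rowset.key_index.upper_bound(key_upper)
    # 2. Seek the target column to the starting offset and read the base data.
    values = rowset.columns[column].read(start, end)
    # 3. Consult the delta stores for later updates in this offset range that
    #    are visible at the snapshot timestamp, and apply them.
    for delta in rowset.delta_stores.updates(start, end, snapshot_ts):
        values[delta.row_offset - start] = delta.value_for(column)
    return values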
Delta Compaction
• A background maintenance manager periodically:
• scans DiskRowSets to find cases where a large number of deltas have accumulated, and
• schedules a delta compaction operation, which merges those deltas back into the base data columns
RowSet Compaction
• A key-based merge of two or more DiskRowSets
• The output is written back to new DiskRowSets, rolling every 32 MB
• RowSet compaction has two goals (illustrated below):
• remove deleted rows
• reduce the number of DiskRowSets that overlap in key range
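A toy illustration of those two goals (a key-ordered merge that drops deleted rows); Kudu's real compaction works on columnar blocks and delta files, so this is only a sketch:

import heapq

def compact(rowsets):
    """Merge several key-sorted rowsets into one output stream."""
    merged = heapq.merge(*rowsets, key=lambda row: row['key'])
    for row in merged:
        if row.get('deleted'):
            continue      # goal 1: deleted rows are physically removed
        yield row         # goal 2: one output with a non-overlapping key range

# Example: two overlapping rowsets collapse into one ordered, compacted stream.
a = [{'key': 1}, {'key': 3, 'deleted': True}, {'key': 5}]
b = [{'key': 2}, {'key': 4}]
print(list(compact([a, b])))   # [{'key': 1}, {'key': 2}, {'key': 4}, {'key': 5}]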
Kudu Trade-Offs
• Random updates will be slower
• Kudu requires a key lookup before an update and a Bloom filter lookup before an insert
• Single-row seeks may be slower
• the columnar design is optimized for scans
• reading a row with many recent updates is especially slow
Cluster Architecture
Cluster Roles
The Kudu Master
Kudu’s central master process has several key responsibilities:
• A catalog manager
• keeping track of which tables and tablets exist, as well as their
schemas, desired replication levels, and other metadata
• A cluster coordinator
• keeping track of which servers in the cluster are alive and coordinating redistribution of data
• A tablet directory
• keeping track of which tablet servers are hosting replicas of
each tablet
Cluster Architecture: Partitioning
Partitioning
• Tables in Kudu are horizontally partitioned.
• Kudu, like BigTable, calls these partitions tablets
• Kudu supports a flexible array of partitioning schemes
Partitioning: Hash
Img source: https://github.com/cloudera/kudu/blob/master/docs/images/hash-partitioning-example.png
Partitioning: Range
Img source: https://github.com/cloudera/kudu/blob/master/docs/images/range-partitioning-example.png
Partitioning: Hash plus Range
Img source: https://github.com/cloudera/kudu/blob/master/docs/images/hash-range-partitioning-example.png
Partitioning Recommendations
• Partition bigger tables, such as fact tables, so that one tablet holds roughly 1 GB of data
• Do not partition small tables such as dimensions
• Note: Impala doesn't allow skipping the partitioning clause, so you need to specify the single range partition explicitly:
Dimension Table with One Partition
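The original slide showed an Impala DDL statement spelling out the single range partition; that exact SQL is not reproduced here. As a hedged sketch of the same layouts through the Kudu Python client (table and column names are illustrative; leaving the range bounds unspecified is assumed to yield one unbounded partition, i.e. a single tablet, and set_primary_keys plus the type constants are assumed per the kudu-python docs):

import kudu
from kudu.client import Partitioning

client = kudu.connect(host='kudu-master.example.com', port=7051)

# Dimension table: range-partitioned on the key with no splits -> one tablet.
dim = kudu.schema_builder()
dim.add_column('promo_id').type(kudu.int64).nullable(False).primary_key()
dim.add_column('promo_name', type_=kudu.string)
dim_part = Partitioning().set_range_partition_columns(['promo_id'])
client.create_table('dim_promotion', dim.build(), dim_part)

# Fact table: hash buckets on the dimension key, sized toward ~1 GB per tablet.
fact = kudu.schema_builder()
fact.add_column('promo_id').type(kudu.int64).nullable(False)
fact.add_column('event_time').type(kudu.unixtime_micros).nullable(False)
fact.add_column('amount', type_=kudu.double)
fact.set_primary_keys(['promo_id', 'event_time'])
fact_part = Partitioning().add_hash_partitions(column_names=['promo_id'], num_buckets=8)
client.create_table('fact_sales', fact.build(), fact_part)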
Cluster Architecture: Replication
Replication Approach
• Kudu uses leader/follower (master-slave) replication
• Kudu employs the Raft [25] consensus algorithm to replicate its tablets
• If a majority of replicas accept the write and log it to their own local write-ahead logs,
• the write is considered durably replicated and can therefore be committed on all replicas (illustrated below)
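The majority rule is simple enough to state directly; a trivial illustration (not Kudu code):

def durably_replicated(acks, replication_factor):
    """A write commits once a majority of replicas have logged it to their WALs."""
    return acks > replication_factor // 2

# With 3 replicas, 2 acknowledgements are enough; 1 is not.
assert durably_replicated(2, 3) is True
assert durably_replicated(1, 3) is False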
Raft: Replicated State Machine
• The replicated log ensures state machines execute the same commands in the same order
• The consensus module ensures proper log replication
• The system makes progress as long as any majority of servers are up
• Visualization: https://raft.github.io/raftscope/index.html
Consistency Model
• Kudu provides clients the choice between two consistency modes for reads (scans):
• READ_AT_SNAPSHOT
• READ_LATEST
READ_LATEST consistency
• Monotonic reads are (arguably) guaranteed; read-your-writes is not
• Corresponds to the "Read Committed" ACID isolation mode
• This is the default mode
READ_LATEST consistency
• The server will always return committed writes at the time
the request was received.
• This type of read is not repeatable.
READ_AT_SNAPSHOT Consistency
• Guarantees read-your-writes consistency from a single client
• Corresponds to the "Repeatable Read" ACID isolation mode
READ_AT_SNAPSHOT Consistency
• The server attempts to perform a read at the provided timestamp
• In this mode reads are repeatable,
• at the expense of waiting for in-flight transactions whose timestamp is lower than the snapshot's timestamp to complete (see the scanner sketch below)
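A brief sketch of choosing between the two modes with the kudu-python client; set_read_mode and set_snapshot are assumed to be available as described in the client documentation, and the table is the one from the earlier Python sample:

import kudu

client = kudu.connect(host='kudu-master.example.com', port=7051)
table = client.table('python_sample')

# READ_LATEST (default): returns writes committed when the request arrived;
# not repeatable, and no snapshot timestamp is returned.
latest = table.scanner()
latest.set_read_mode('latest')
rows_now = latest.open().read_all_tuples()

# READ_AT_SNAPSHOT: repeatable reads at a snapshot timestamp, at the cost of
# waiting for in-flight transactions below that timestamp to complete.
snap = table.scanner()
snap.set_read_mode('snapshot')
# snap.set_snapshot(timestamp)   # optionally pin an explicit snapshot timestamp
repeatable_rows = snap.open().read_all_tuples()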
Write Consistency
• Writes to a single tablet are always internally consistent
• By default, Kudu does not provide an external consistency
guarantee.
• However, for users who require a stronger guarantee, Kudu
offers the option to manually propagate timestamps between
clients
Replication Factor Limitation
• Since Kudu 1.2.0:
• The replication factor of tables is now limited to a
maximum of 7
• In addition, it is no longer allowed to create a table with an
even replication factor
Kudu and CAP Theorem
• Kudu is a CP type of storage engine.
• Writing to a tablet will be delayed if
the server that hosts that tablet’s
leader replica fails
• Kudu gains the following properties
by using Raft consensus:
• Leader elections are fast
• Follower replicas don’t allow
writes, but they do allow reads
Kudu Applicability
Applications for which Kudu is a viable solution
• Reporting applications where new data must be immediately
available for end users
• Time-series applications with
• queries across large amounts of historic data
• granular queries about an individual entity
• Applications that use predictive models to make real-time
decisions
Streaming Analytics Case Study
Business Case
• A leader in health care
compliance consulting and
technology-driven managed
services
• Cloud-based multi-services
platform
• It offers
• enhanced data security and
scalability,
• operational managed services,
and access to business
information
Img source: http://ihealthone.com/wp-content/uploads/2016/12/Healthcare_Compliance_Consultants-495x400.jpg
ETL Approach
Key Points:
• Leverage the Confluent platform with its Schema Registry
• Apply a configuration-based approach:
• Avro schemas in the Schema Registry for the input schema
• Impala Kudu SQL scripts for the target schema
• Stick to the Python app as the primary ETL code, but extend it:
• Develop new abstractions to work with mapping rules
• Stream processing for both facts and dimensions
Cons:
• Scaling requires extra effort
Data Flow
(Diagram: Event Topics → ETL Code → Analytics DWH, driven by Configuration: Input Schema, Mapping Rules, Target Schema, Other Configurations)
Stream ETL using Pipeline Architecture
(Diagram: Data Reader → Mapper/Flattener → Types Adjuster → Data Enricher → DB Sinker, supported by the Cache Manager and Configuration)
Pipeline Modules:
• Data Reader: reads data from the source DB
• Mapper/Flattener: flattens the tree-like JSON structure into a flat one and maps the field names to the target ones
• Types Adjuster: converts values to the proper data types
• Data Enricher: enriches the data structure with new data:
• generates surrogate keys
• looks up data from the target DB (using the cache)
• DB Sinker: writes data into the target DB
Other Modules:
• Cache Manager: manages the cache of dimension data (a hypothetical wiring sketch follows below)
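None of the module code appears in the deck, so the following is a purely hypothetical sketch of how such a pipeline could be wired together in Python; class names mirror the modules above, and the bodies are simplified stand-ins:

class MapperFlattener:
    """Flattens the tree-like JSON structure into one level (field-name mapping omitted for brevity)."""
    def process(self, record):
        flat = {}
        def walk(node, prefix=''):
            for key, value in node.items():
                name = prefix + key
                if isinstance(value, dict):
                    walk(value, name + '_')
                else:
                    flat[name] = value
        walk(record)
        return flat

class TypesAdjuster:
    """Converts values to the target column types (illustrative: *_id fields to int)."""
    def process(self, record):
        return {k: int(v) if k.endswith('_id') else v for k, v in record.items()}

class DbSinker:
    """Stand-in for the module that writes the finished row to the target DB."""
    def process(self, record):
        print('writing to target DB:', record)
        return record

def run_pipeline(stages, record):
    """Pass one record through every stage in order."""
    for stage in stages:
        record = stage.process(record)
    return record

run_pipeline([MapperFlattener(), TypesAdjuster(), DbSinker()],
             {'order_id': '7', 'customer': {'id': '42', 'name': 'ACME'}})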
Key Types Benchmark
Kudu Numeric vs String Keys
• Reason:
• Generating surrogate numeric keys adds an extra processing step and complexity to the overall ETL process
• Sample Schema:
• Dimensions:
• Promotion dimension with 1,000 unique members, 30 categories
• Products dimension with 50,000 unique members, 300 categories
• Facts:
• Fact table referencing the 2 dimensions above, with 1 million rows
• Fact table referencing the 2 dimensions above, with 100 million rows
Benchmark Result
Lessons Learnt
Pain Points
• Frequent releases with many changes
• Data type limitations (especially in the Python library and Impala)
• Lack of sequences/constraints
• Lack of multi-row transactions
Limitations
• More than 50 columns is not recommended
• Primary key values are immutable
• Primary key, partitioning, and column types cannot be altered
• Partitions cannot be split after table creation
Modeling Recommendations: Star Schema
Dimensions:
• Replication factor equal to the number of nodes in the cluster
• 1 tablet per dimension
Facts:
• Aim for as many tablets as you have cores in the cluster
What Kudu is Not
• Not a SQL interface itself
• It's just the storage layer; use Impala or Spark SQL on top of it
• Not an application that runs on HDFS
• It's an alternative, native Hadoop storage engine
• Not a replacement for HDFS or HBase
• Select the right storage for the right use case
• Cloudera will support and invest in all three
Kudu vs MPP Data Warehouses
In Common:
• Fast analytics queries via SQL
• Ability to insert, update, and delete data
Differences:
• Kudu advantages: faster streaming inserts, improved Hadoop integration
• Kudu drawbacks: slower batch inserts; no transactional data loading, multi-row transactions, or indexing
Useful resources
• Community, Downloads, VM:
• https://kudu.apache.org
• Whitepaper:
• http://kudu.apache.org/kudu.pdf
• Slack channel:
• https://getkudu-slack.herokuapp.com
USA HQ
Toll Free: 866-687-3588
Tel: +1-512-516-8880
Ukraine HQ
Tel: +380-32-240-9090
Bulgaria
Tel: +359-2-902-3760
Germany
Tel: +49-69-2602-5857
Netherlands
Tel: +31-20-262-33-23
Poland
Tel: +48-71-382-2800
UK
Tel: +44-207-544-8414
EMAIL
info@softserveinc.com
WEBSITE:
www.softserveinc.com
Questions?


Editor's Notes

  • On "Weaknesses of combining Parquet and HBase": Structured storage in the Hadoop ecosystem has typically been achieved in two ways: for static data sets, data is typically stored on HDFS using binary data formats such as Apache Avro [1] or Apache Parquet [3]. However, neither HDFS nor these formats has any provision for updating individual records, or for efficient random access. Mutable data sets are typically stored in semi-structured stores such as Apache HBase [2] or Apache Cassandra [21]. These systems allow for low-latency record-level reads and writes, but lag far behind the static file formats in terms of sequential read throughput for applications such as SQL-based analytics or machine learning.
  • On "The Kudu Master": Following the design of BigTable, Kudu relies on a single Master server responsible for metadata (which can be replicated for fault tolerance) and an arbitrary number of Tablet Servers responsible for data.
  • On "READ_LATEST consistency": When READ_LATEST is specified, the server will always return committed writes at the time the request was received. This type of read does not return a snapshot timestamp and is not repeatable. In ACID terms this corresponds to the "Read Committed" isolation mode. This is the default mode. Monotonic reads [19] are a guarantee that this kind of anomaly does not happen. It's a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency. When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backwards, i.e. they will not read older data after having previously read newer data. In this situation, we need read-after-write consistency, also known as read-your-writes consistency [20]. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves. It makes no promises about other users: other users' updates may not be visible until some later time. However, it reassures the user that their own input has been saved correctly.
  • On "Write Consistency": By default, Kudu does not provide an external consistency guarantee. That is to say, if a client performs a write, then communicates with a different client via an external mechanism (e.g. a message bus) and the other performs a write, the causal dependence between the two writes is not captured. A third reader may see a snapshot which contains the second write without the first. However, for users who require a stronger guarantee, Kudu offers the option to manually propagate timestamps between clients: after performing a write, the user may ask the client library for a timestamp token. This token may be propagated to another client through the external channel, and passed to the Kudu API on the other side, thus preserving the causal relationship between writes made across the two clients. In this situation, we need read-after-write consistency, also known as read-your-writes consistency [20]. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves. It makes no promises about other users: other users' updates may not be visible until some later time. However, it reassures the user that their own input has been saved correctly.
  • #43 By default, Kudu does not provide an external consistency guarantee. That is to say, if a client performs a write, then communicates with a di↵erent client via an external mecha- nism (e.g. a message bus) and the other performs a write, the causal dependence between the two writes is not captured. A third reader may see a snapshot which contains the second write without the first However, for users who require a stronger guarantee, Kudu o↵ers the option to man- ually propagate timestamps between clients: after performing a write, the user may ask the client library for a timestamp to- ken. This token may be propagated to another client through the external channel, and passed to the Kudu API on the other side, thus preserving the causal relationship between writes made across the two clients.