Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Todd Lipcon, on behalf of the Kudu team
© Cloudera, Inc. All rights reserved.
The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce,
Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco
Kudu
Storage for Fast Analytics on Fast Data
• New updating column store for
Hadoop
• Apache-licensed open source
• Beta now available
Motivation and Goals
Why build Kudu?
Motivating Questions
• Are there user problems we can’t address because of gaps in Hadoop ecosystem storage technologies?
• Are we positioned to take advantage of advancements in the hardware
landscape?
Current Storage Landscape in Hadoop
HDFS excels at:
• Efficiently scanning large amounts
of data
• Accumulating data with high
throughput
HBase excels at:
• Efficiently finding and writing
individual rows
• Making data mutable
Gaps exist when these properties
are needed simultaneously
Changing Hardware landscape
• Spinning disk -> solid state storage
• NAND flash: up to 450k read / 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping
• 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
• 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t
designed with CPU efficiency in mind.
• Takeaway 2: Column stores are feasible for random access
Kudu Design Goals
• High throughput for big scans (columnar storage and replication)
  Goal: within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication)
  Goal: 1 ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
• SQL query
• “NoSQL”-style scan/insert/update (Java client)
Kudu Usage
• Table has a SQL-like schema
• Finite number of columns (unlike HBase/Cassandra)
• Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY,
TIMESTAMP
• Some subset of columns makes up a possibly-composite primary key
• Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
• Insert(), Update(), Delete(), Scan() (see the sketch below)
• Integrations with MapReduce, Spark, and Impala
• more to come!
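The NoSQL-style API above can be illustrated with a short sketch against the Kudu Java client. This is a hedged example: the class and method names follow the Java client API as shipped in released versions (KuduClient, KuduSession, Insert/Update, KuduScanner), the beta-era signatures may differ in detail, and the "metrics" table and its columns are invented for the example.

import java.util.Arrays;

import org.apache.kudu.client.*;

public class KuduUsageSketch {
  public static void main(String[] args) throws KuduException {
    // Connect to the cluster; "kudu-master:7051" is a placeholder address.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduTable table = client.openTable("metrics");   // hypothetical table (see schema sketch later)
    KuduSession session = client.newSession();

    // Insert(): build a row and apply it.
    Insert insert = table.newInsert();
    insert.getRow().addString("host", "host01");
    insert.getRow().addString("metric", "cpu_idle");
    insert.getRow().addLong("timestamp", 1445000000L);   // assumes an INT64 timestamp column
    insert.getRow().addDouble("value", 87.5);
    session.apply(insert);

    // Update(): same pattern, addressed by the primary key columns.
    Update update = table.newUpdate();
    update.getRow().addString("host", "host01");
    update.getRow().addString("metric", "cpu_idle");
    update.getRow().addLong("timestamp", 1445000000L);
    update.getRow().addDouble("value", 88.0);
    session.apply(update);
    session.flush();

    // Scan(): project two columns and push down an equality predicate.
    KuduScanner scanner = client.newScannerBuilder(table)
        .setProjectedColumnNames(Arrays.asList("timestamp", "value"))
        .addPredicate(KuduPredicate.newComparisonPredicate(
            table.getSchema().getColumn("host"),
            KuduPredicate.ComparisonOp.EQUAL, "host01"))
        .build();
    while (scanner.hasMoreRows()) {
      for (RowResult row : scanner.nextRows()) {
        System.out.println(row.getLong("timestamp") + " " + row.getDouble("value"));
      }
    }
    client.shutdown();
  }
}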
Use cases and architectures
Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of
sequential and random reads and writes
● Time Series
○ Examples: Stream market data; fraud detection & prevention; risk monitoring
○ Workload: Insert, updates, scans, lookups
● Machine Data Analytics
○ Examples: Network threat detection
○ Workload: Inserts, scans, lookups
● Online Reporting
○ Examples: ODS
○ Workload: Inserts, updates, scans, lookups
Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
● How do I handle failure
during this process?
● How often do I reorganize incoming streaming data into a format appropriate for reporting?
● When reporting, how do I see
data that has not yet been
reorganized?
● How do I ensure that
important jobs aren’t
interrupted by maintenance?
[Diagram: incoming data from a messaging system lands in HBase (the most recent partition). A recurring check asks “Have we accumulated enough data?”; if so, the HBase data is reorganized into a Parquet file: wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file. Reporting requests are served by Impala on HDFS across the new partition, the most recent partition, and historic data.]
Real-Time Analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No cron jobs or background
processes
● Handle late arrivals or data
corrections with ease
● New data available
immediately for analytics or
operations
[Diagram: incoming data from the messaging system is written directly to Kudu, which holds both historical and real-time data and serves reporting requests.]
How it works
Replication and distribution
Tables and Tablets
• Table is horizontally partitioned into tablets
• Range or hash partitioning
• Example: PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS (see the sketch below)
• Each tablet has N replicas (3 or 5), with Raft consensus
• Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
• Store data on local disks (no HDFS)
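The partitioning example above is written in SQL; the same table can also be defined programmatically. A minimal sketch with the Java client follows, assuming current client class names (Schema, ColumnSchema, CreateTableOptions); the value column is invented, and the 0.5-era beta API may differ in detail.

import java.util.Arrays;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class CreateMetricsTable {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();

    // PRIMARY KEY (host, metric, timestamp): key columns come first in the schema.
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("metric", Type.STRING).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("timestamp", Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build()));

    // DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS, 3 replicas per tablet.
    CreateTableOptions opts = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("timestamp"), 100)
        .setNumReplicas(3);

    client.createTable("metrics", schema, opts);
    client.shutdown();
  }
}

Hash-partitioning on timestamp spreads otherwise-sequential time-series inserts across all 100 tablets instead of hot-spotting the newest range.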
Metadata
• Replicated master*
• Acts as a tablet directory (“META” table)
• Acts as a catalog (table schemas, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated
tablets)
• Caches all metadata in RAM for high performance
• 80-node load test, GetTableLocations RPC perf:
• 99th percentile: 68us, 99.99th percentile: 657us
• <2% peak CPU usage
• Client is configured with master addresses
• Asks the master for tablet locations as needed and caches them (simplified cache sketch below)
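To make the tablet-location caching concrete, here is a deliberately simplified, hypothetical cache: serve lookups from a local map keyed by tablet start key and fall back to a GetTableLocations-style RPC on a miss. This is not the actual client code; the real cache also tracks key-range end points and entry staleness.

import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class TabletLocationCache {
  /** Simplified stand-in for one tablet's location entry. */
  record TabletLocation(String startKey, String tabletId, String leaderHostPort) {}

  // Cached locations keyed by the start of each tablet's key range.
  private final NavigableMap<String, TabletLocation> byStartKey = new ConcurrentSkipListMap<>();

  /** Returns a tablet location for primaryKey, asking the master only on a cache miss. */
  public TabletLocation locate(String primaryKey) {
    Map.Entry<String, TabletLocation> hit = byStartKey.floorEntry(primaryKey);
    if (hit != null) {
      return hit.getValue();                       // served from the client-side cache
    }
    TabletLocation fresh = askMaster(primaryKey);  // GetTableLocations-style RPC (stubbed out)
    byStartKey.put(fresh.startKey(), fresh);       // remember the returned tablet's start key
    return fresh;
  }

  private TabletLocation askMaster(String primaryKey) {
    // Placeholder: a real client RPCs the master and caches the returned key ranges.
    return new TabletLocation("", "tablet-0000", "tserver-host:7050");
  }
}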
Raft consensus
[Diagram: Tablet 1 is replicated on tablet servers A (LEADER), B (FOLLOWER), and C (FOLLOWER), each with its own WAL. Write path:
1a. Client -> Leader: Write() RPC
2a. Leader -> Followers: UpdateConsensus() RPC
2b. Leader writes its local WAL
3. Each follower writes its WAL
4. Followers -> Leader: success
5. Leader has achieved a majority (see the sketch below)
6. Leader -> Client: Success!]
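A hedged illustration of step 5: the leader can acknowledge the client once its own WAL write plus follower acknowledgements cover a strict majority of the N replicas. This is a toy counting sketch, not Kudu's consensus implementation.

public class MajorityAck {
  private final int numReplicas;   // e.g. 3 or 5
  private int acks = 0;            // durable WAL writes so far (leader + followers)

  public MajorityAck(int numReplicas) {
    this.numReplicas = numReplicas;
  }

  /** Called when the leader's local WAL write or a follower's UpdateConsensus() succeeds. */
  public synchronized boolean recordAck() {
    acks++;
    return hasMajority();
  }

  /** A write is committed once a strict majority of replicas have it durably in their WALs. */
  public synchronized boolean hasMajority() {
    return acks >= numReplicas / 2 + 1;   // 2 of 3, or 3 of 5
  }

  public static void main(String[] args) {
    MajorityAck op = new MajorityAck(3);
    op.recordAck();                        // 2b: leader wrote its own WAL
    boolean committed = op.recordAck();    // 4: first follower reported success
    System.out.println("reply to client? " + committed);   // true: 2 of 3 replicas have the write
  }
}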
Fault tolerance
• Transient FOLLOWER failure:
• Leader can still achieve majority
• Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
• Followers expect to hear a heartbeat from their leader every 1.5 seconds
• 3 missed heartbeats: leader election! (see the timer sketch below)
• New LEADER is elected from remaining nodes within a few seconds
• Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
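A toy sketch of the follower-side timing described above (1.5-second heartbeats, election after 3 missed heartbeats). Real Raft implementations add randomized timeouts; the constants here are just the numbers from the slide.

public class ElectionTimer {
  static final long HEARTBEAT_INTERVAL_MS = 1500;   // leader heartbeat period
  static final int MISSED_BEFORE_ELECTION = 3;      // ~4.5 s of silence triggers an election

  private long lastHeartbeatMs;

  public ElectionTimer(long nowMs) {
    this.lastHeartbeatMs = nowMs;
  }

  public void onHeartbeat(long nowMs) {
    lastHeartbeatMs = nowMs;                         // leader is alive; reset the timer
  }

  /** A follower checks this periodically; true means it should start a leader election. */
  public boolean shouldStartElection(long nowMs) {
    return nowMs - lastHeartbeatMs > MISSED_BEFORE_ELECTION * HEARTBEAT_INTERVAL_MS;
  }
}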
Fault tolerance (2)
• Permanent failure:
• Leader notices that a follower has been dead for 5 minutes
• Evicts that follower
• Master selects a new replica
• Leader copies the data over to the new one, which joins as a new FOLLOWER
How it works
Storage engine internals
Tablet design
• Inserts buffered in an in-memory store (like HBase’s memstore)
• Flushed to disk
• Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
• Allows “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans (visibility sketch below)
• Near-optimal read path for “current time” scans
• No per-row branches; fast vectorized decoding and predicate evaluation
• Performance worsens with the number of recent updates
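To make the MVCC bullet concrete: every mutation carries a timestamp, and a scan at snapshot time T applies only mutations with timestamp <= T. A simplified visibility sketch follows; these are not Kudu's actual data structures.

import java.util.List;

public class MvccVisibilitySketch {
  /** A single timestamped mutation to one cell; lists are ordered oldest-first. */
  record Delta(long timestamp, String column, Object newValue) {}

  /**
   * Returns the value of `column` as of snapshot time T: start from the base value
   * and apply only the deltas committed at or before T.
   */
  static Object valueAsOf(Object baseValue, List<Delta> deltas, String column, long snapshotTs) {
    Object value = baseValue;
    for (Delta d : deltas) {                       // deltas are ordered by timestamp
      if (d.timestamp() > snapshotTs) break;       // too new for this snapshot: ignore
      if (d.column().equals(column)) value = d.newValue();
    }
    return value;
  }

  public static void main(String[] args) {
    List<Delta> deltas = List.of(
        new Delta(100, "pay", "$1M"),
        new Delta(200, "pay", "$2M"));
    // Scanning "AS OF" time 150 sees the t=100 update but not the t=200 one.
    System.out.println(valueAsOf("$1000", deltas, "pay", 150));  // prints $1M
  }
}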
LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to
on-disk files (HFile/SSTable)
• Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
• Shares some traits (memstores, compactions)
• More complex.
• Slower writes in exchange for faster reads (especially scans)
LSM Insert Path
[Diagram: an INSERT of row r1 (col c1=“blah”, col c2=“1”) lands in the MemStore as individual cells, which a flush later writes to HFile 1.]
LSM Insert Path
[Diagram: a later INSERT passes through the MemStore and is flushed to a new file, HFile 2 (holding row r2: c1=“blah2”, c2=“2”), while the earlier HFile 1 (holding row r1) remains on disk.]
LSM Update path
[Diagram: an UPDATE writes the new cell (row r2, c1=“newval”) into the MemStore only; the older versions of r2 remain in HFile 2 and r1 stays in HFile 1. Note: all updates are “fully decoupled” from reads, so a random-write workload is transformed into fully sequential writes.]
LSM Read path
[Diagram: a read merges the MemStore and all HFiles based on string row keys to produce the current rows (R1: c1=blah, c2=2; R2: c1=newval, c2=5; …). This is CPU-intensive: row keys must always be read, any given row may exist across multiple HFiles so a merge is always required, and the more HFiles there are to merge, the slower the read. (A toy merge sketch follows.)]
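As a toy illustration of why the LSM read path is CPU-intensive, the sketch below probes a set of key-sorted stores, newest first, comparing string row keys; it is not HBase code, just the shape of the work.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LsmMergeSketch {
  public static void main(String[] args) {
    // Each store maps rowkey -> value for one column; newer stores shadow older ones.
    TreeMap<String, String> hfile1 = new TreeMap<>(Map.of("r1", "blah", "r2", "v2"));
    TreeMap<String, String> hfile2 = new TreeMap<>(Map.of("r2", "newval"));
    List<TreeMap<String, String>> newestFirst = List.of(hfile2, hfile1);

    // Point read of r2: probe every store, newest first, comparing string keys each time.
    for (TreeMap<String, String> store : newestFirst) {
      String value = store.get("r2");
      if (value != null) {                      // the newest store that contains the key wins
        System.out.println("r2 = " + value);    // prints "newval"
        break;
      }
    }
    // A full scan would instead do an N-way merge of all stores in rowkey order;
    // every emitted row costs string-key comparisons, and the more HFiles there
    // are to merge, the slower the read gets.
  }
}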
Kudu storage – Inserts and Flushes
[Diagram: INSERT(“todd”, “$1000”, ”engineer”) goes into the MemRowSet; a flush later writes it to DiskRowSet 1 with columnar name / pay / role data.]
Kudu storage – Inserts and Flushes
[Diagram: a later INSERT(“doug”, “$1B”, “Hadoop man”) goes into the MemRowSet and is flushed to DiskRowSet 2; each flush produces a new DiskRowSet alongside DiskRowSet 1.]
Kudu storage - Updates
[Diagram: each DiskRowSet consists of columnar base data (name / pay / role) plus its own DeltaMemStore (“Delta MS”) that accumulates updates against that rowset.]
Kudu storage - Updates
[Diagram: UPDATE SET pay=“$1M” WHERE name=“todd”. Each DiskRowSet’s bloom filter is consulted: “Is the row in DiskRowSet 2?” — bloom says no; “Is the row in DiskRowSet 1?” — bloom says maybe. The key column of DiskRowSet 1 is then searched to find the row’s offset (rowid = 150), and the delta “150: col 1=$1M” is recorded in that rowset’s DeltaMemStore. (See the sketch below.)]
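A hedged sketch of that update flow: consult each DiskRowSet's bloom filter, locate the row's ordinal offset via the sorted key column, and record the change in that rowset's DeltaMemStore. The data structures are toy stand-ins, not Kudu's.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

public class KuduUpdateSketch {
  /** Toy DiskRowSet: sorted key column plus a per-rowset delta store keyed by row offset. */
  static class DiskRowSet {
    final String[] keyColumn;                                // sorted primary keys (base data)
    final Set<String> bloom = new HashSet<>();               // stand-in for a bloom filter
    final TreeMap<Integer, String> deltaMemStore = new TreeMap<>();  // rowid -> new pay value

    DiskRowSet(String... keys) {
      keyColumn = keys;
      bloom.addAll(Arrays.asList(keys));                     // a real bloom filter can false-positive
    }

    /** Returns true if this rowset owned the key and recorded the update. */
    boolean tryUpdatePay(String key, String newPay) {
      if (!bloom.contains(key)) return false;                // "Bloom says: no!" - skip this rowset
      int rowid = Arrays.binarySearch(keyColumn, key);       // search the key column for the offset
      if (rowid < 0) return false;                           // bloom false positive
      deltaMemStore.put(rowid, newPay);                      // e.g. "150: pay=$1M"
      return true;
    }
  }

  public static void main(String[] args) {
    DiskRowSet drs1 = new DiskRowSet("doug", "mike", "todd");
    DiskRowSet drs2 = new DiskRowSet("alice", "bob");
    for (DiskRowSet drs : Arrays.asList(drs2, drs1)) {
      if (drs.tryUpdatePay("todd", "$1M")) break;            // each key lives in exactly one rowset
    }
    System.out.println(drs1.deltaMemStore);                  // {2=$1M}
  }
}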
Kudu storage – Read path
[Diagram: a scan reads the rows of DiskRowSet 2, then the rows of DiskRowSet 1, applying each rowset’s accumulated deltas (e.g. “150: pay=$1M”) to its columnar base data. Any row lives in exactly one DiskRowSet, so there is no cross-DRS merge, and updates are merged by ordinal offset within the DRS: array indexing, no string compares. (See the sketch below.)]
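And a matching sketch of the read path: because a row lives in exactly one DiskRowSet, a scan walks each rowset's columnar base data and patches it with deltas addressed by ordinal offset, with no cross-rowset key merge. Again, toy structures only.

import java.util.Map;

public class KuduScanSketch {
  /** Scan one DiskRowSet's "pay" column, applying updates by row offset. */
  static void scanPayColumn(String[] basePay, Map<Integer, String> deltasByRowid) {
    for (int rowid = 0; rowid < basePay.length; rowid++) {
      // No string key comparisons here: deltas are addressed by ordinal offset.
      String pay = deltasByRowid.getOrDefault(rowid, basePay[rowid]);
      System.out.println(rowid + ": " + pay);
    }
  }

  public static void main(String[] args) {
    String[] basePay = {"$1000", "$500", "$1B"};        // columnar base data
    Map<Integer, String> deltas = Map.of(0, "$1M");     // e.g. "0: pay=$1M" from the DeltaMemStore
    scanPayColumn(basePay, deltas);                      // rowsets are scanned one after another
  }
}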
Kudu storage – Delta flushes
[Diagram: a DeltaMemStore is flushed to an on-disk REDO DeltaFile (e.g. “0: pay=foo”). A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version.]
Kudu storage – Major delta compaction
[Diagram: pre-compaction, a DiskRowSet has base data plus many accumulated REDO DeltaFiles, which means lots of delta-application work on reads. A major delta compaction merges the updates for columns with a high update percentage into the base data; the post-compaction DiskRowSet holds UNDO deltas and any unmerged REDO deltas. If a column has few updates it doesn’t need to be rewritten: those deltas are maintained in a new DeltaFile. (A per-column decision sketch follows.)]
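A hedged sketch of that per-column decision: rewrite only the columns whose fraction of updated rows crosses some threshold, and leave the other columns' deltas in a new delta file. The 10% threshold and the bookkeeping are invented for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MajorDeltaCompactionSketch {
  // Hypothetical knob: rewrite a column once >10% of its rows have pending deltas.
  static final double REWRITE_THRESHOLD = 0.10;

  /** Returns which columns to merge into new base data; the rest keep their deltas. */
  static List<String> columnsToRewrite(Map<String, Integer> updatedRowsPerColumn, long numRows) {
    List<String> rewrite = new ArrayList<>();
    for (Map.Entry<String, Integer> e : updatedRowsPerColumn.entrySet()) {
      double updateFraction = (double) e.getValue() / numRows;
      if (updateFraction > REWRITE_THRESHOLD) {
        rewrite.add(e.getKey());      // heavily-updated column: fold REDO deltas into base data
      }
    }
    return rewrite;
  }

  public static void main(String[] args) {
    // "pay" was updated on 40% of rows, "role" on 1%: only "pay" gets rewritten.
    System.out.println(columnsToRewrite(Map.of("pay", 4000, "role", 100), 10_000));  // [pay]
  }
}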
Kudu storage – RowSet Compactions
[Diagram: DRS 1, 2, and 3 (32 MB each) have overlapping key ranges — DRS 1 holds alice, joe, linda, zach; DRS 2 holds bob, jon, mary, zeke; DRS 3 holds carl, julie, omar, zoe. RowSet compaction rewrites them into DRS 4 [alice, bob, carl, joe], DRS 5 [jon, julie, linda, mary], and DRS 6 [omar, zach, zeke, zoe], reorganizing rows to avoid rowsets with overlapping key ranges.]
Kudu storage – Compaction policy
• Solves an optimization problem (knapsack problem)
• Minimize “height” of rowsets for the average key lookup
• Bound on number of seeks for write or random-read
• Restrict the total IO of any compaction to a budget (128 MB) (selection sketch below)
• No long compactions, ever
• No “minor” vs “major” distinction
• Always be compacting or flushing
• Low IO priority maintenance threads
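A heavily simplified sketch of the selection step: greedily pick the rowsets that contribute the most key-range overlap until the 128 MB IO budget is spent. The real policy solves a knapsack-style optimization over rowset "height"; the overlap score here is an invented stand-in.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CompactionPolicySketch {
  static final long IO_BUDGET_BYTES = 128L * 1024 * 1024;   // never run a compaction bigger than this

  record RowSet(String name, long sizeBytes, int overlapScore) {}  // overlapScore: invented metric

  /** Pick high-overlap rowsets until the IO budget is exhausted. */
  static List<RowSet> pickForCompaction(List<RowSet> candidates) {
    List<RowSet> sorted = new ArrayList<>(candidates);
    sorted.sort(Comparator.comparingInt(RowSet::overlapScore).reversed());

    List<RowSet> chosen = new ArrayList<>();
    long budget = IO_BUDGET_BYTES;
    for (RowSet rs : sorted) {
      if (rs.sizeBytes() <= budget) {    // keep the total IO of this compaction bounded
        chosen.add(rs);
        budget -= rs.sizeBytes();
      }
    }
    return chosen;                        // no long compactions, ever
  }

  public static void main(String[] args) {
    List<RowSet> candidates = List.of(
        new RowSet("DRS1", 32 << 20, 3),
        new RowSet("DRS2", 32 << 20, 3),
        new RowSet("DRS3", 32 << 20, 3),
        new RowSet("DRS4", 64 << 20, 1),
        new RowSet("DRS5", 64 << 20, 1));
    pickForCompaction(candidates).forEach(rs -> System.out.println(rs.name()));  // DRS1 DRS2 DRS3
  }
}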
Kudu trade-offs
• Random updates will be slower
• HBase model allows random updates without incurring a disk seek
• Kudu requires a key lookup before update, bloom lookup before insert
• Single-row reads may be slower
• Columnar design is optimized for scans
• Future: may introduce “column groups” for applications where single-row
access is more important
• Especially slow at reading a row that has had many recent updates (e.g. YCSB “zipfian”)
Benchmarks
TPC-H (Analytics benchmark)
• 75 tablet servers + 1 master
• 12 (spinning) disks each, enough RAM to fit the dataset
• Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
  FROM customer, orders, lineitem, supplier, nation, region
  WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey
    AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey AND r_name = 'ASIA'
    AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01'
  GROUP BY n_name ORDER BY revenue desc;
• Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
• Parquet is likely to outperform Kudu for HDD-resident data (larger IO requests)
What about Apache Phoenix?
• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)
Results (time in seconds, log-scale chart; lower is better):

            Load    TPCH Q1   COUNT(*)   COUNT(*) WHERE…   single-row lookup
Phoenix     2152    219       76         131               0.04
Kudu        1918    13.2      1.7        0.7               0.15
Parquet     155     9.3       1.4        1.5               1.37
What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10 node cluster
(9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops
What Kudu is not
Kudu is…
• NOT a SQL database
• “BYO SQL”
• NOT a filesystem
• data must have tabular structure
• NOT a replacement for HBase or HDFS
• Cloudera continues to invest in those systems
• Many use cases where they’re still more appropriate
• NOT an in-memory database
• Very fast for memory-sized workloads, but can operate on larger data too!
Getting started
Getting started as a user
• http://getkudu.io
• kudu-user@googlegroups.com
• Quickstart VM
• Easiest way to get started
• Impala and Kudu in an easy-to-install VM
• CSD and Parcels
• For installation on a Cloudera Manager-managed cluster
Getting started as a developer
• http://github.com/cloudera/kudu
• All commits go here first
• Public gerrit: http://gerrit.cloudera.org
• All code reviews happening here
• Public JIRA: http://issues.cloudera.org
• Includes bugs going back to 2013. Come see our dirty laundry!
• kudu-dev@googlegroups.com
• Apache 2.0 license open source
• Contributions are welcome and encouraged!
Demo?
(if we have time and internet gods willing)
http://getkudu.io/
@getkudu
