Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Todd Lipcon, on behalf of the Kudu team
© Cloudera, Inc. All rights reserved.
The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce,
Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco
Kudu
Storage for Fast Analytics on Fast Data
• New updating column store for
Hadoop
• Apache-licensed open source
• Beta now available
Motivation and Goals
Why build Kudu?
Motivating Questions
• Are there user problems we can’t address because of gaps in Hadoop ecosystem storage technologies?
• Are we positioned to take advantage of advancements in the hardware
landscape?
Current Storage Landscape in Hadoop
HDFS excels at:
• Efficiently scanning large amounts
of data
• Accumulating data with high
throughput
HBase excels at:
• Efficiently finding and writing
individual rows
• Making data mutable
Gaps exist when these properties
are needed simultaneously
Changing Hardware landscape
• Spinning disk -> solid state storage
• NAND flash: up to 450k read / 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping
• 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
• 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t
designed with CPU efficiency in mind.
• Takeaway 2: Column stores are feasible for random access
Kudu Design Goals
• High throughput for big scans (columnar storage and replication)
  Goal: within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication)
  Goal: 1 ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
• SQL query
• “NoSQL”-style scan/insert/update (Java client)
Kudu Usage
• Table has a SQL-like schema
• Finite number of columns (unlike HBase/Cassandra)
• Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY,
TIMESTAMP
• Some subset of columns makes up a possibly-composite primary key
• Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
• Insert(), Update(), Delete(), Scan() (see the sketch below)
• Integrations with MapReduce, Spark, and Impala
• more to come!
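The NoSQL-style API above can be illustrated with a short sketch against the Kudu Java client. This is a hedged example: the class and method names follow the Java client API as shipped in released versions (KuduClient, KuduSession, Insert/Update, KuduScanner), the beta-era signatures may differ in detail, and the "metrics" table and its columns are invented for the example.

import java.util.Arrays;

import org.apache.kudu.client.*;

public class KuduUsageSketch {
  public static void main(String[] args) throws KuduException {
    // Connect to the cluster; "kudu-master:7051" is a placeholder address.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduTable table = client.openTable("metrics");   // hypothetical table (see schema sketch later)
    KuduSession session = client.newSession();

    // Insert(): build a row and apply it.
    Insert insert = table.newInsert();
    insert.getRow().addString("host", "host01");
    insert.getRow().addString("metric", "cpu_idle");
    insert.getRow().addLong("timestamp", 1445000000L);   // assumes an INT64 timestamp column
    insert.getRow().addDouble("value", 87.5);
    session.apply(insert);

    // Update(): same pattern, addressed by the primary key columns.
    Update update = table.newUpdate();
    update.getRow().addString("host", "host01");
    update.getRow().addString("metric", "cpu_idle");
    update.getRow().addLong("timestamp", 1445000000L);
    update.getRow().addDouble("value", 88.0);
    session.apply(update);
    session.flush();

    // Scan(): project two columns and push down an equality predicate.
    KuduScanner scanner = client.newScannerBuilder(table)
        .setProjectedColumnNames(Arrays.asList("timestamp", "value"))
        .addPredicate(KuduPredicate.newComparisonPredicate(
            table.getSchema().getColumn("host"),
            KuduPredicate.ComparisonOp.EQUAL, "host01"))
        .build();
    while (scanner.hasMoreRows()) {
      for (RowResult row : scanner.nextRows()) {
        System.out.println(row.getLong("timestamp") + " " + row.getDouble("value"));
      }
    }
    client.shutdown();
  }
}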
Use cases and architectures
Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of
sequential and random reads and writes
● Time Series
○ Examples: Stream market data; fraud detection & prevention; risk monitoring
○ Workload: Insert, updates, scans, lookups
● Machine Data Analytics
○ Examples: Network threat detection
○ Workload: Inserts, scans, lookups
● Online Reporting
○ Examples: ODS
○ Workload: Inserts, updates, scans, lookups
Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
● How do I handle failure
during this process?
● How often do I reorganize incoming streaming data into a format appropriate for reporting?
● When reporting, how do I see
data that has not yet been
reorganized?
● How do I ensure that
important jobs aren’t
interrupted by maintenance?
[Diagram: incoming data from a messaging system lands in HBase (the most recent partition). A recurring check asks “Have we accumulated enough data?”; if so, the HBase data is reorganized into a Parquet file: wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file. Reporting requests are served by Impala on HDFS across the new partition, the most recent partition, and historic data.]
Real-Time Analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No cron jobs or background
processes
● Handle late arrivals or data
corrections with ease
● New data available
immediately for analytics or
operations
[Diagram: incoming data from the messaging system is written directly to Kudu, which holds both historical and real-time data and serves reporting requests.]
How it works
Replication and distribution
Tables and Tablets
• Table is horizontally partitioned into tablets
• Range or hash partitioning
• Example: PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS (see the sketch below)
• Each tablet has N replicas (3 or 5), with Raft consensus
• Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
• Store data on local disks (no HDFS)
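The partitioning example above is written in SQL; the same table can also be defined programmatically. A minimal sketch with the Java client follows, assuming current client class names (Schema, ColumnSchema, CreateTableOptions); the value column is invented, and the 0.5-era beta API may differ in detail.

import java.util.Arrays;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class CreateMetricsTable {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();

    // PRIMARY KEY (host, metric, timestamp): key columns come first in the schema.
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("metric", Type.STRING).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("timestamp", Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build()));

    // DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS, 3 replicas per tablet.
    CreateTableOptions opts = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("timestamp"), 100)
        .setNumReplicas(3);

    client.createTable("metrics", schema, opts);
    client.shutdown();
  }
}

Hash-partitioning on timestamp spreads otherwise-sequential time-series inserts across all 100 tablets instead of hot-spotting the newest range.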
Metadata
• Replicated master*
• Acts as a tablet directory (“META” table)
• Acts as a catalog (table schemas, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated
tablets)
• Caches all metadata in RAM for high performance
• 80-node load test, GetTableLocations RPC perf:
• 99th percentile: 68us, 99.99th percentile: 657us
• <2% peak CPU usage
• Client is configured with master addresses
• Asks the master for tablet locations as needed and caches them (simplified cache sketch below)
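To make the tablet-location caching concrete, here is a deliberately simplified, hypothetical cache: serve lookups from a local map keyed by tablet start key and fall back to a GetTableLocations-style RPC on a miss. This is not the actual client code; the real cache also tracks key-range end points and entry staleness.

import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class TabletLocationCache {
  /** Simplified stand-in for one tablet's location entry. */
  record TabletLocation(String startKey, String tabletId, String leaderHostPort) {}

  // Cached locations keyed by the start of each tablet's key range.
  private final NavigableMap<String, TabletLocation> byStartKey = new ConcurrentSkipListMap<>();

  /** Returns a tablet location for primaryKey, asking the master only on a cache miss. */
  public TabletLocation locate(String primaryKey) {
    Map.Entry<String, TabletLocation> hit = byStartKey.floorEntry(primaryKey);
    if (hit != null) {
      return hit.getValue();                       // served from the client-side cache
    }
    TabletLocation fresh = askMaster(primaryKey);  // GetTableLocations-style RPC (stubbed out)
    byStartKey.put(fresh.startKey(), fresh);       // remember the returned tablet's start key
    return fresh;
  }

  private TabletLocation askMaster(String primaryKey) {
    // Placeholder: a real client RPCs the master and caches the returned key ranges.
    return new TabletLocation("", "tablet-0000", "tserver-host:7050");
  }
}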
Raft consensus
[Diagram: Tablet 1 is replicated on tablet servers A (LEADER), B (FOLLOWER), and C (FOLLOWER), each with its own WAL. Write path:
1a. Client -> Leader: Write() RPC
2a. Leader -> Followers: UpdateConsensus() RPC
2b. Leader writes its local WAL
3. Each follower writes its WAL
4. Followers -> Leader: success
5. Leader has achieved a majority (see the sketch below)
6. Leader -> Client: Success!]
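A hedged illustration of step 5: the leader can acknowledge the client once its own WAL write plus follower acknowledgements cover a strict majority of the N replicas. This is a toy counting sketch, not Kudu's consensus implementation.

public class MajorityAck {
  private final int numReplicas;   // e.g. 3 or 5
  private int acks = 0;            // durable WAL writes so far (leader + followers)

  public MajorityAck(int numReplicas) {
    this.numReplicas = numReplicas;
  }

  /** Called when the leader's local WAL write or a follower's UpdateConsensus() succeeds. */
  public synchronized boolean recordAck() {
    acks++;
    return hasMajority();
  }

  /** A write is committed once a strict majority of replicas have it durably in their WALs. */
  public synchronized boolean hasMajority() {
    return acks >= numReplicas / 2 + 1;   // 2 of 3, or 3 of 5
  }

  public static void main(String[] args) {
    MajorityAck op = new MajorityAck(3);
    op.recordAck();                        // 2b: leader wrote its own WAL
    boolean committed = op.recordAck();    // 4: first follower reported success
    System.out.println("reply to client? " + committed);   // true: 2 of 3 replicas have the write
  }
}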
Fault tolerance
• Transient FOLLOWER failure:
• Leader can still achieve majority
• Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
• Followers expect to hear a heartbeat from their leader every 1.5 seconds
• 3 missed heartbeats: leader election! (see the timer sketch below)
• New LEADER is elected from remaining nodes within a few seconds
• Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
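A toy sketch of the follower-side timing described above (1.5-second heartbeats, election after 3 missed heartbeats). Real Raft implementations add randomized timeouts; the constants here are just the numbers from the slide.

public class ElectionTimer {
  static final long HEARTBEAT_INTERVAL_MS = 1500;   // leader heartbeat period
  static final int MISSED_BEFORE_ELECTION = 3;      // ~4.5 s of silence triggers an election

  private long lastHeartbeatMs;

  public ElectionTimer(long nowMs) {
    this.lastHeartbeatMs = nowMs;
  }

  public void onHeartbeat(long nowMs) {
    lastHeartbeatMs = nowMs;                         // leader is alive; reset the timer
  }

  /** A follower checks this periodically; true means it should start a leader election. */
  public boolean shouldStartElection(long nowMs) {
    return nowMs - lastHeartbeatMs > MISSED_BEFORE_ELECTION * HEARTBEAT_INTERVAL_MS;
  }
}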
Fault tolerance (2)
• Permanent failure:
• Leader notices that a follower has been dead for 5 minutes
• Evicts that follower
• Master selects a new replica
• Leader copies the data over to the new one, which joins as a new FOLLOWER
How it works
Storage engine internals
Tablet design
• Inserts buffered in an in-memory store (like HBase’s memstore)
• Flushed to disk
• Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
• Allows “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans (visibility sketch below)
• Near-optimal read path for “current time” scans
• No per-row branches; fast vectorized decoding and predicate evaluation
• Performance worsens with the number of recent updates
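To make the MVCC bullet concrete: every mutation carries a timestamp, and a scan at snapshot time T applies only mutations with timestamp <= T. A simplified visibility sketch follows; these are not Kudu's actual data structures.

import java.util.List;

public class MvccVisibilitySketch {
  /** A single timestamped mutation to one cell; lists are ordered oldest-first. */
  record Delta(long timestamp, String column, Object newValue) {}

  /**
   * Returns the value of `column` as of snapshot time T: start from the base value
   * and apply only the deltas committed at or before T.
   */
  static Object valueAsOf(Object baseValue, List<Delta> deltas, String column, long snapshotTs) {
    Object value = baseValue;
    for (Delta d : deltas) {                       // deltas are ordered by timestamp
      if (d.timestamp() > snapshotTs) break;       // too new for this snapshot: ignore
      if (d.column().equals(column)) value = d.newValue();
    }
    return value;
  }

  public static void main(String[] args) {
    List<Delta> deltas = List.of(
        new Delta(100, "pay", "$1M"),
        new Delta(200, "pay", "$2M"));
    // Scanning "AS OF" time 150 sees the t=100 update but not the t=200 one.
    System.out.println(valueAsOf("$1000", deltas, "pay", 150));  // prints $1M
  }
}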
LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to
on-disk files (HFile/SSTable)
• Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
• Shares some traits (memstores, compactions)
• More complex.
• Slower writes in exchange for faster reads (especially scans)
LSM Insert Path
[Diagram: an INSERT of row r1 (col c1=“blah”, col c2=“1”) lands in the MemStore as individual cells, which a flush later writes to HFile 1.]
LSM Insert Path
[Diagram: a later INSERT passes through the MemStore and is flushed to a new file, HFile 2 (holding row r2: c1=“blah2”, c2=“2”), while the earlier HFile 1 (holding row r1) remains on disk.]
LSM Update path
[Diagram: an UPDATE writes the new cell (row r2, c1=“newval”) into the MemStore only; the older versions of r2 remain in HFile 2 and r1 stays in HFile 1. Note: all updates are “fully decoupled” from reads, so a random-write workload is transformed into fully sequential writes.]
LSM Read path
[Diagram: a read merges the MemStore and all HFiles based on string row keys to produce the current rows (R1: c1=blah, c2=2; R2: c1=newval, c2=5; …). This is CPU-intensive: row keys must always be read, any given row may exist across multiple HFiles so a merge is always required, and the more HFiles there are to merge, the slower the read. (A toy merge sketch follows.)]
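As a toy illustration of why the LSM read path is CPU-intensive, the sketch below probes a set of key-sorted stores, newest first, comparing string row keys; it is not HBase code, just the shape of the work.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LsmMergeSketch {
  public static void main(String[] args) {
    // Each store maps rowkey -> value for one column; newer stores shadow older ones.
    TreeMap<String, String> hfile1 = new TreeMap<>(Map.of("r1", "blah", "r2", "v2"));
    TreeMap<String, String> hfile2 = new TreeMap<>(Map.of("r2", "newval"));
    List<TreeMap<String, String>> newestFirst = List.of(hfile2, hfile1);

    // Point read of r2: probe every store, newest first, comparing string keys each time.
    for (TreeMap<String, String> store : newestFirst) {
      String value = store.get("r2");
      if (value != null) {                      // the newest store that contains the key wins
        System.out.println("r2 = " + value);    // prints "newval"
        break;
      }
    }
    // A full scan would instead do an N-way merge of all stores in rowkey order;
    // every emitted row costs string-key comparisons, and the more HFiles there
    // are to merge, the slower the read gets.
  }
}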
Kudu storage – Inserts and Flushes
[Diagram: INSERT(“todd”, “$1000”, ”engineer”) goes into the MemRowSet; a flush later writes it to DiskRowSet 1 with columnar name / pay / role data.]
Kudu storage – Inserts and Flushes
[Diagram: a later INSERT(“doug”, “$1B”, “Hadoop man”) goes into the MemRowSet and is flushed to DiskRowSet 2; each flush produces a new DiskRowSet alongside DiskRowSet 1.]
Kudu storage - Updates
[Diagram: each DiskRowSet consists of columnar base data (name / pay / role) plus its own DeltaMemStore (“Delta MS”) that accumulates updates against that rowset.]
Kudu storage - Updates
[Diagram: UPDATE SET pay=“$1M” WHERE name=“todd”. Each DiskRowSet’s bloom filter is consulted: “Is the row in DiskRowSet 2?” — bloom says no; “Is the row in DiskRowSet 1?” — bloom says maybe. The key column of DiskRowSet 1 is then searched to find the row’s offset (rowid = 150), and the delta “150: col 1=$1M” is recorded in that rowset’s DeltaMemStore. (See the sketch below.)]
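A hedged sketch of that update flow: consult each DiskRowSet's bloom filter, locate the row's ordinal offset via the sorted key column, and record the change in that rowset's DeltaMemStore. The data structures are toy stand-ins, not Kudu's.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

public class KuduUpdateSketch {
  /** Toy DiskRowSet: sorted key column plus a per-rowset delta store keyed by row offset. */
  static class DiskRowSet {
    final String[] keyColumn;                                // sorted primary keys (base data)
    final Set<String> bloom = new HashSet<>();               // stand-in for a bloom filter
    final TreeMap<Integer, String> deltaMemStore = new TreeMap<>();  // rowid -> new pay value

    DiskRowSet(String... keys) {
      keyColumn = keys;
      bloom.addAll(Arrays.asList(keys));                     // a real bloom filter can false-positive
    }

    /** Returns true if this rowset owned the key and recorded the update. */
    boolean tryUpdatePay(String key, String newPay) {
      if (!bloom.contains(key)) return false;                // "Bloom says: no!" - skip this rowset
      int rowid = Arrays.binarySearch(keyColumn, key);       // search the key column for the offset
      if (rowid < 0) return false;                           // bloom false positive
      deltaMemStore.put(rowid, newPay);                      // e.g. "150: pay=$1M"
      return true;
    }
  }

  public static void main(String[] args) {
    DiskRowSet drs1 = new DiskRowSet("doug", "mike", "todd");
    DiskRowSet drs2 = new DiskRowSet("alice", "bob");
    for (DiskRowSet drs : Arrays.asList(drs2, drs1)) {
      if (drs.tryUpdatePay("todd", "$1M")) break;            // each key lives in exactly one rowset
    }
    System.out.println(drs1.deltaMemStore);                  // {2=$1M}
  }
}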
Kudu storage – Read path
[Diagram: a scan reads the rows of DiskRowSet 2, then the rows of DiskRowSet 1, applying each rowset’s accumulated deltas (e.g. “150: pay=$1M”) to its columnar base data. Any row lives in exactly one DiskRowSet, so there is no cross-DRS merge, and updates are merged by ordinal offset within the DRS: array indexing, no string compares. (See the sketch below.)]
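And a matching sketch of the read path: because a row lives in exactly one DiskRowSet, a scan walks each rowset's columnar base data and patches it with deltas addressed by ordinal offset, with no cross-rowset key merge. Again, toy structures only.

import java.util.Map;

public class KuduScanSketch {
  /** Scan one DiskRowSet's "pay" column, applying updates by row offset. */
  static void scanPayColumn(String[] basePay, Map<Integer, String> deltasByRowid) {
    for (int rowid = 0; rowid < basePay.length; rowid++) {
      // No string key comparisons here: deltas are addressed by ordinal offset.
      String pay = deltasByRowid.getOrDefault(rowid, basePay[rowid]);
      System.out.println(rowid + ": " + pay);
    }
  }

  public static void main(String[] args) {
    String[] basePay = {"$1000", "$500", "$1B"};        // columnar base data
    Map<Integer, String> deltas = Map.of(0, "$1M");     // e.g. "0: pay=$1M" from the DeltaMemStore
    scanPayColumn(basePay, deltas);                      // rowsets are scanned one after another
  }
}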
Kudu storage – Delta flushes
[Diagram: a DeltaMemStore is flushed to an on-disk REDO DeltaFile (e.g. “0: pay=foo”). A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version.]
Kudu storage – Major delta compaction
[Diagram: pre-compaction, a DiskRowSet has base data plus many accumulated REDO DeltaFiles, which means lots of delta-application work on reads. A major delta compaction merges the updates for columns with a high update percentage into the base data; the post-compaction DiskRowSet holds UNDO deltas and any unmerged REDO deltas. If a column has few updates it doesn’t need to be rewritten: those deltas are maintained in a new DeltaFile. (A per-column decision sketch follows.)]
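A hedged sketch of that per-column decision: rewrite only the columns whose fraction of updated rows crosses some threshold, and leave the other columns' deltas in a new delta file. The 10% threshold and the bookkeeping are invented for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MajorDeltaCompactionSketch {
  // Hypothetical knob: rewrite a column once >10% of its rows have pending deltas.
  static final double REWRITE_THRESHOLD = 0.10;

  /** Returns which columns to merge into new base data; the rest keep their deltas. */
  static List<String> columnsToRewrite(Map<String, Integer> updatedRowsPerColumn, long numRows) {
    List<String> rewrite = new ArrayList<>();
    for (Map.Entry<String, Integer> e : updatedRowsPerColumn.entrySet()) {
      double updateFraction = (double) e.getValue() / numRows;
      if (updateFraction > REWRITE_THRESHOLD) {
        rewrite.add(e.getKey());      // heavily-updated column: fold REDO deltas into base data
      }
    }
    return rewrite;
  }

  public static void main(String[] args) {
    // "pay" was updated on 40% of rows, "role" on 1%: only "pay" gets rewritten.
    System.out.println(columnsToRewrite(Map.of("pay", 4000, "role", 100), 10_000));  // [pay]
  }
}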
Kudu storage – RowSet Compactions
[Diagram: DRS 1, 2, and 3 (32 MB each) have overlapping key ranges — DRS 1 holds alice, joe, linda, zach; DRS 2 holds bob, jon, mary, zeke; DRS 3 holds carl, julie, omar, zoe. RowSet compaction rewrites them into DRS 4 [alice, bob, carl, joe], DRS 5 [jon, julie, linda, mary], and DRS 6 [omar, zach, zeke, zoe], reorganizing rows to avoid rowsets with overlapping key ranges.]
Kudu storage – Compaction policy
• Solves an optimization problem (knapsack problem)
• Minimize “height” of rowsets for the average key lookup
• Bound on number of seeks for write or random-read
• Restrict the total IO of any compaction to a budget (128 MB) (selection sketch below)
• No long compactions, ever
• No “minor” vs “major” distinction
• Always be compacting or flushing
• Low IO priority maintenance threads
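A heavily simplified sketch of the selection step: greedily pick the rowsets that contribute the most key-range overlap until the 128 MB IO budget is spent. The real policy solves a knapsack-style optimization over rowset "height"; the overlap score here is an invented stand-in.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CompactionPolicySketch {
  static final long IO_BUDGET_BYTES = 128L * 1024 * 1024;   // never run a compaction bigger than this

  record RowSet(String name, long sizeBytes, int overlapScore) {}  // overlapScore: invented metric

  /** Pick high-overlap rowsets until the IO budget is exhausted. */
  static List<RowSet> pickForCompaction(List<RowSet> candidates) {
    List<RowSet> sorted = new ArrayList<>(candidates);
    sorted.sort(Comparator.comparingInt(RowSet::overlapScore).reversed());

    List<RowSet> chosen = new ArrayList<>();
    long budget = IO_BUDGET_BYTES;
    for (RowSet rs : sorted) {
      if (rs.sizeBytes() <= budget) {    // keep the total IO of this compaction bounded
        chosen.add(rs);
        budget -= rs.sizeBytes();
      }
    }
    return chosen;                        // no long compactions, ever
  }

  public static void main(String[] args) {
    List<RowSet> candidates = List.of(
        new RowSet("DRS1", 32 << 20, 3),
        new RowSet("DRS2", 32 << 20, 3),
        new RowSet("DRS3", 32 << 20, 3),
        new RowSet("DRS4", 64 << 20, 1),
        new RowSet("DRS5", 64 << 20, 1));
    pickForCompaction(candidates).forEach(rs -> System.out.println(rs.name()));  // DRS1 DRS2 DRS3
  }
}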
Kudu trade-offs
• Random updates will be slower
• HBase model allows random updates without incurring a disk seek
• Kudu requires a key lookup before update, bloom lookup before insert
• Single-row reads may be slower
• Columnar design is optimized for scans
• Future: may introduce “column groups” for applications where single-row
access is more important
• Especially slow at reading a row that has had many recent updates (e.g. YCSB “zipfian”)
Benchmarks
TPC-H (Analytics benchmark)
• 75 tablet servers + 1 master
• 12 (spinning) disks each, enough RAM to fit the dataset
• Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
  FROM customer, orders, lineitem, supplier, nation, region
  WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey
    AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey AND r_name = 'ASIA'
    AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01'
  GROUP BY n_name ORDER BY revenue desc;
• Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
• Parquet is likely to outperform Kudu for HDD-resident data (larger IO requests)
What about Apache Phoenix?
• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)
Results (time in seconds, log-scale chart; lower is better):

            Load    TPCH Q1   COUNT(*)   COUNT(*) WHERE…   single-row lookup
Phoenix     2152    219       76         131               0.04
Kudu        1918    13.2      1.7        0.7               0.15
Parquet     155     9.3       1.4        1.5               1.37
What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10 node cluster
(9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops
What Kudu is not
Kudu is…
• NOT a SQL database
• “BYO SQL”
• NOT a filesystem
• data must have tabular structure
• NOT a replacement for HBase or HDFS
• Cloudera continues to invest in those systems
• Many use cases where they’re still more appropriate
• NOT an in-memory database
• Very fast for memory-sized workloads, but can operate on larger data too!
Getting started
Getting started as a user
• http://getkudu.io
• kudu-user@googlegroups.com
• Quickstart VM
• Easiest way to get started
• Impala and Kudu in an easy-to-install VM
• CSD and Parcels
• For installation on a Cloudera Manager-managed cluster
Getting started as a developer
• http://github.com/cloudera/kudu
• All commits go here first
• Public gerrit: http://gerrit.cloudera.org
• All code reviews happening here
• Public JIRA: http://issues.cloudera.org
• Includes bugs going back to 2013. Come see our dirty laundry!
• kudu-dev@googlegroups.com
• Apache 2.0 license open source
• Contributions are welcome and encouraged!
Demo?
(if we have time and internet gods willing)
http://getkudu.io/
@getkudu
