What is Kudu? What data model does it use? Why might it be better than Apache Parquet and NoSQL databases? Why are column-oriented databases important?
11. uuid         first_name  last_name  dob
    ee-c6-47-2c  John        Connor     Feb 28th, 1985
    84-ee-ff-d5  Sarah       Connor     May 11th, 1965
    57-4f-d9-d8  Kyle        Reese      Mar 1st, 2002
SELECT MIN(dob) FROM characters WHERE last_name = 'Connor';
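A rough sketch of why a query like this favors a column-oriented layout: only last_name and dob have to be read, while a row-oriented store reads every field of every row. The function names and the "values touched" counter are illustrative assumptions, not part of Kudu.

```python
from datetime import date

# Row-oriented layout: each row is stored (and read) as a whole.
rows = [
    ("ee-c6-47-2c", "John", "Connor", date(1985, 2, 28)),
    ("84-ee-ff-d5", "Sarah", "Connor", date(1965, 5, 11)),
    ("57-4f-d9-d8", "Kyle", "Reese", date(2002, 3, 1)),
]

# Column-oriented layout: one list per column, aligned by position.
columns = {
    "uuid": [r[0] for r in rows],
    "first_name": [r[1] for r in rows],
    "last_name": [r[2] for r in rows],
    "dob": [r[3] for r in rows],
}


def min_dob_row_store(rows, last_name):
    """Scan whole rows. Returns (min dob, number of values touched)."""
    touched = 0
    matches = []
    for row in rows:
        touched += len(row)  # the whole row is read, needed or not
        if row[2] == last_name:
            matches.append(row[3])
    return min(matches), touched


def min_dob_column_store(columns, last_name):
    """Scan only last_name, then fetch dob for matching positions."""
    touched = 0
    positions = []
    for i, value in enumerate(columns["last_name"]):
        touched += 1
        if value == last_name:
            positions.append(i)
    dobs = [columns["dob"][i] for i in positions]
    touched += len(dobs)
    return min(dobs), touched
```

Both functions return the same date; the difference is how many values they touch on the way, and that gap grows with the number of columns the query does not need.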
19. last_name  first_name  movie         actor                  actor_age
    Connor     John        Terminator 2  Edward Furlong         14
    Connor     John        Terminator 2  Michael Edwards        47
    Connor     Sarah       Terminator    Linda Hamilton         28
    Connor     Sarah       Terminator 2  Linda Hamilton         35
    Reese      Kyle        Terminator 2  Michael Biehn          35
    T-800                  Terminator    Arnold Schwarzenegger  37
CREATE TABLE characters (
    last_name STRING,
    first_name STRING,
    movie STRING,
    actor STRING,
    actor_age INT
)
DISTRIBUTE BY HASH (last_name, first_name) INTO 4 BUCKETS
TBLPROPERTIES (
    'kudu.key_columns' = 'last_name, first_name, movie, actor'
);
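A minimal sketch of what DISTRIBUTE BY HASH (last_name, first_name) INTO 4 BUCKETS means for the rows above: the two hashed columns decide the bucket, so all rows for one character land together. CRC32 is used here only as a stand-in; it is an assumption, not the hash function Kudu actually uses, and bucket_for is a hypothetical helper.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # matches INTO 4 BUCKETS


def bucket_for(last_name, first_name, num_buckets=NUM_BUCKETS):
    """Map the hashed key columns to a bucket number in [0, num_buckets)."""
    # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    key = f"{last_name}\x00{first_name}".encode("utf-8")
    return zlib.crc32(key) % num_buckets  # stand-in hash, not Kudu's


rows = [
    ("Connor", "John", "Terminator 2", "Edward Furlong", 14),
    ("Connor", "John", "Terminator 2", "Michael Edwards", 47),
    ("Connor", "Sarah", "Terminator", "Linda Hamilton", 28),
    ("Connor", "Sarah", "Terminator 2", "Linda Hamilton", 35),
    ("Reese", "Kyle", "Terminator 2", "Michael Biehn", 35),
    ("T-800", "", "Terminator", "Arnold Schwarzenegger", 37),
]

# Group rows by bucket; each bucket would correspond to one tablet.
buckets = defaultdict(list)
for row in rows:
    buckets[bucket_for(row[0], row[1])].append(row)

for bucket_id in sorted(buckets):
    print(bucket_id, [(r[0], r[1], r[2]) for r in buckets[bucket_id]])
```

Only last_name and first_name feed the hash, so the movie and actor columns (also part of the key) affect ordering inside a bucket but not which bucket a row goes to.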
24. last_name  first_name  movie         actor                  actor_age
    Connor     John        Terminator 2  Edward Furlong         14
    Connor     John        Terminator 2  Michael Edwards        47
    Connor     Sarah       Terminator    Linda Hamilton         28
    Connor     Sarah       Terminator 2  Linda Hamilton         35
    Reese      Kyle        Terminator 2  Michael Biehn          35
    T-800                  Terminator    Arnold Schwarzenegger  37
Kudu's partitioning sits somewhere between BigTable/HBase range partitioning and Cassandra's hash partitioning.
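One way to picture the "somewhere between": a table can be split by key ranges (BigTable/HBase style), by a hash of the key (Cassandra style), or by both at once. The sketch below combines the two. The split points, the bucket count, the choice of columns and the CRC32 hash are all hypothetical values picked for illustration.

```python
import bisect
import zlib

# Hypothetical range split points on last_name:
#   partition 0: last_name < "D"
#   partition 1: "D" <= last_name < "S"
#   partition 2: last_name >= "S"
SPLIT_POINTS = ["D", "S"]
HASH_BUCKETS = 2  # hypothetical bucket count for the hash level


def range_partition(last_name):
    """Range partitioning: neighboring keys stay in the same partition."""
    return bisect.bisect_right(SPLIT_POINTS, last_name)


def hash_partition(first_name):
    """Hash partitioning: keys are spread evenly, ordering is lost."""
    return zlib.crc32(first_name.encode("utf-8")) % HASH_BUCKETS


def tablet_for(last_name, first_name):
    """Combine both levels: one tablet per (range, hash bucket) pair."""
    return (range_partition(last_name), hash_partition(first_name))


for last_name, first_name in [("Connor", "John"), ("Connor", "Sarah"),
                              ("Reese", "Kyle"), ("T-800", "")]:
    print(last_name, first_name, tablet_for(last_name, first_name))
```

Range partitioning keeps scans over a key range local to a few tablets; hashing avoids hot spots when writes arrive in key order. Combining them trades a little of each.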
26. last_name:   Connor | Connor | Reese
    first_name:  John | John | Kyle
    movie:       Terminator 2 | Terminator 2 | Terminator 2
    actor:       Edward Furlong | Michael Edwards | Michael Biehn
    actor_age:   14 | 47 | 35

    last_name:   Connor | Connor
    first_name:  Sarah | Sarah
    movie:       Terminator | Terminator 2
    actor:       Linda Hamilton | Linda Hamilton
    actor_age:   28 | 35

    last_name:   T-800
    first_name:
    movie:       Terminator
    actor:       Arnold Schwarzenegger
    actor_age:   37
INSERT INTO characters (last_name, first_name, movie, actor, actor_age)
VALUES ('Connor', 'John', 'Terminator Genisys', 'Jason Clarke', 36);
27. last_name:   Connor | Connor | Connor | Reese
    first_name:  John | John | John | Kyle
    movie:       Terminator 2 | Terminator 2 | Terminator Genisys | Terminator 2
    actor:       Edward Furlong | Michael Edwards | Jason Clarke | Michael Biehn
    actor_age:   14 | 47 | 36 | 35

    last_name:   Connor | Connor
    first_name:  Sarah | Sarah
    movie:       Terminator | Terminator 2
    actor:       Linda Hamilton | Linda Hamilton
    actor_age:   28 | 35

    last_name:   T-800
    first_name:
    movie:       Terminator
    actor:       Arnold Schwarzenegger
    actor_age:   37
INSERT INTO characters (last_name, first_name, movie, actor, actor_age)
VALUES ('Connor', 'John', 'Terminator Genisys', 'Jason Clarke', 36);
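A conceptual sketch of the logical result of this INSERT on one group of columns: the new row takes the position dictated by its primary key (last_name, first_name, movie, actor), and every column receives its value at that same position. This models only the sorted outcome; it is not how Kudu physically applies an insert, and make_store and insert are hypothetical helpers.

```python
import bisect

COLUMNS = ["last_name", "first_name", "movie", "actor", "actor_age"]
KEY_WIDTH = 4  # the first four columns form the primary key


def make_store(rows):
    """Build a column-per-list store with rows sorted by primary key."""
    rows = sorted(rows, key=lambda r: r[:KEY_WIDTH])
    return {name: [r[i] for r in rows] for i, name in enumerate(COLUMNS)}


def insert(store, row):
    """Insert a row at its key-ordered position in every column."""
    keys = list(zip(*(store[name] for name in COLUMNS[:KEY_WIDTH])))
    key = tuple(row[:KEY_WIDTH])
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        raise KeyError(f"duplicate primary key: {key}")
    for name, value in zip(COLUMNS, row):
        store[name].insert(pos, value)
    return pos


store = make_store([
    ("Connor", "John", "Terminator 2", "Edward Furlong", 14),
    ("Connor", "John", "Terminator 2", "Michael Edwards", 47),
    ("Reese", "Kyle", "Terminator 2", "Michael Biehn", 35),
])
pos = insert(store, ("Connor", "John", "Terminator Genisys", "Jason Clarke", 36))
print(pos, store["actor"])
```

The new row lands between the two existing John Connor rows and the Kyle Reese row, which matches the column contents shown on the slide.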
39. MemRowSet
    • Col A
    • Col B
    • …
    In-memory concurrent B-tree; keeps all recently inserted rows.

    DiskRowSet
    • Col A
    • Col B
    • …
    • [Delta store]

    DiskRowSet
    • Col A
    • Col B
    • …
    • [Delta store]

    Base data: each column is written separately in a single contiguous block of data.
    Deltas: organized by rows (until compaction happens).
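A toy model of the pieces on this slide, to show how they interact: inserts go to the MemRowSet, a flush turns it into a DiskRowSet with one list per column, updates to flushed rows are recorded in a row-organized delta store, and compaction folds the deltas into the base data. All class and method names are hypothetical, a plain dict stands in for the concurrent B-tree, and nothing here touches disk.

```python
class DiskRowSet:
    """Columnar base data plus a row-organized delta store."""

    def __init__(self, column_names, rows_by_key):
        self.keys = sorted(rows_by_key)
        # Base data: one contiguous list per column.
        self.base = {
            name: [rows_by_key[key][name] for key in self.keys]
            for name in column_names
        }
        # Delta store: row position -> {column: new value}.
        self.deltas = {}

    def update(self, key, changes):
        position = self.keys.index(key)
        self.deltas.setdefault(position, {}).update(changes)

    def read(self, key):
        position = self.keys.index(key)
        row = {name: values[position] for name, values in self.base.items()}
        row.update(self.deltas.get(position, {}))  # apply pending deltas
        return row

    def compact(self):
        """Fold the deltas into the base columns and empty the delta store."""
        for position, changes in self.deltas.items():
            for name, value in changes.items():
                self.base[name][position] = value
        self.deltas = {}


class Tablet:
    def __init__(self, column_names):
        self.column_names = column_names
        self.mem_rowset = {}  # plain dict standing in for the B-tree
        self.disk_rowsets = []

    def insert(self, key, row):
        self.mem_rowset[key] = dict(row)

    def flush(self):
        """Turn the current MemRowSet into a new DiskRowSet."""
        if self.mem_rowset:
            self.disk_rowsets.append(
                DiskRowSet(self.column_names, self.mem_rowset))
            self.mem_rowset = {}

    def _find(self, key):
        for rowset in self.disk_rowsets:
            if key in rowset.keys:
                return rowset
        raise KeyError(key)

    def update(self, key, changes):
        if key in self.mem_rowset:
            self.mem_rowset[key].update(changes)
        else:
            self._find(key).update(key, changes)

    def read(self, key):
        if key in self.mem_rowset:
            return dict(self.mem_rowset[key])
        return self._find(key).read(key)
```

After a flush, an update leaves the base column untouched and a read merges the delta on the fly; only compact() rewrites the base data, which is why deltas stay row-organized "until compaction happens".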
43. Long story short:
- 30% faster than Parquet 1.0 (TPC-H)
- 16-187 times faster than Phoenix or HBase (TPC-H again)
- hundreds of thousands of rows inserted per second on a single tablet server
44. TPC-H test, scale factor 100, RF 3
- 75 nodes, each: 64 GB RAM, 12 spinning disks, 2x 6-core Xeon
- Expansion of 62 GB of data (post-replication, compactions done):
- 570 GB in HBase (9.2x)
- 227 GB in Kudu (3.7x)
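The expansion factors are simply on-disk size divided by the 62 GB of input data. A quick check of the arithmetic, using only the numbers from the slide (the variable names are illustrative):

```python
RAW_GB = 62  # size of the input data
on_disk_gb = {"HBase": 570, "Kudu": 227}  # post-replication, after compactions

# Expansion factor: on-disk size relative to the raw input.
expansion = {
    system: round(size / RAW_GB, 1) for system, size in on_disk_gb.items()
}

# How much more space HBase needs than Kudu for the same data.
hbase_vs_kudu = round(on_disk_gb["HBase"] / on_disk_gb["Kudu"], 1)

print(expansion, hbase_vs_kudu)
```

By these figures HBase takes about 2.5 times the disk space Kudu does for the same data set.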
http://getkudu.io/kudu.pdf