
Are you Kudu-ing me?!


What is Kudu? What data model does it use? Why might it be better than Apache Parquet and NoSQL databases? Why are column-oriented databases important?

Published in: Data & Analytics


  1. 1. These folks must all be wrong, mustn’t they?
  2. 2. uuid         first_name  last_name  dob
        ee-c6-47-2c  John        Connor     Feb 28th, 1985
        84-ee-ff-d5  Sarah       Connor     May 11th, 1965
        57-4f-d9-d8  Kyle        Reese      Mar 1st, 2002
        SELECT MIN(dob) FROM characters WHERE last_name='Connor'
  3. 3. uuid:       ee-c6-47-2c | 84-ee-ff-d5 | 57-4f-d9-d8
        last_name:  Connor | Connor | Reese
        first_name: John | Sarah | Kyle
        dob:        Feb 28th, 1985 | May 11th, 1965 | Mar 1st, 2002
        SELECT MIN(dob) FROM characters WHERE last_name='Connor'
  4. 4. What’s the problem with Apache Parquet then?
  5. 5. Ever implemented Lambda Architecture?
  6. 6. last_name  first_name  movie         actor                  actor_age
        Connor     John        Terminator 2  Edward Furlong         14
        Connor     John        Terminator 2  Michael Edwards        47
        Connor     Sarah       Terminator    Linda Hamilton         28
        Connor     Sarah       Terminator 2  Linda Hamilton         35
        Reese      Kyle        Terminator 2  Michael Biehn          35
        T-800                  Terminator    Arnold Schwarzenegger  37
        CREATE TABLE characters (
          last_name STRING,
          first_name STRING,
          movie STRING,
          actor STRING,
          actor_age INT
        )
        DISTRIBUTE BY HASH (last_name, first_name) INTO 4 BUCKETS
        TBLPROPERTIES (
          'kudu.key_columns' = 'last_name, first_name, movie, actor'
        )
  9. 9. last_name  first_name  movie         actor                  actor_age
        Connor     John        Terminator 2  Edward Furlong         14
        Connor     John        Terminator 2  Michael Edwards        47
        Connor     Sarah       Terminator    Linda Hamilton         28
        Connor     Sarah       Terminator 2  Linda Hamilton         35
        Reese      Kyle        Terminator 2  Michael Biehn          35
        T-800                  Terminator    Arnold Schwarzenegger  37
  10. 10. last_name  first_name  movie         actor                  actor_age
          Connor     John        Terminator 2  Edward Furlong         14
          Connor     John        Terminator 2  Michael Edwards        47
          Connor     Sarah       Terminator    Linda Hamilton         28
          Connor     Sarah       Terminator 2  Linda Hamilton         35
          Reese      Kyle        Terminator 2  Michael Biehn          35
          T-800                  Terminator    Arnold Schwarzenegger  37
          Somewhere between BigTable/HBase range partitioning and Cassandra’s hash partitioning.
  11. 11. last_name:  Connor | Connor | Reese
          first_name: John | John | Kyle
          movie:      Terminator 2 | Terminator 2 | Terminator 2
          actor:      Edward Furlong | Michael Edwards | Michael Biehn
          actor_age:  14 | 47 | 35

          last_name:  Connor | Connor
          first_name: Sarah | Sarah
          movie:      Terminator | Terminator 2
          actor:      Linda Hamilton | Linda Hamilton
          actor_age:  28 | 35

          last_name:  T-800
          first_name:
          movie:      Terminator
          actor:      Arnold Schwarzenegger
          actor_age:  37
  12. 12. last_name:  Connor | Connor | Reese
          first_name: John | John | Kyle
          movie:      Terminator 2 | Terminator 2 | Terminator 2
          actor:      Edward Furlong | Michael Edwards | Michael Biehn
          actor_age:  14 | 47 | 35

          last_name:  Connor | Connor
          first_name: Sarah | Sarah
          movie:      Terminator | Terminator 2
          actor:      Linda Hamilton | Linda Hamilton
          actor_age:  28 | 35

          last_name:  T-800
          first_name:
          movie:      Terminator
          actor:      Arnold Schwarzenegger
          actor_age:  37

          INSERT INTO characters (last_name, first_name, movie, actor, actor_age)
          VALUES ('Connor', 'John', 'Terminator Genisys', 'Jason Clarke', 36)
  13. 13. last_name:  Connor | Connor | Connor | Reese
          first_name: John | John | John | Kyle
          movie:      Terminator 2 | Terminator 2 | Terminator Genisys | Terminator 2
          actor:      Edward Furlong | Michael Edwards | Jason Clarke | Michael Biehn
          actor_age:  14 | 47 | 36 | 35

          last_name:  Connor | Connor
          first_name: Sarah | Sarah
          movie:      Terminator | Terminator 2
          actor:      Linda Hamilton | Linda Hamilton
          actor_age:  28 | 35

          last_name:  T-800
          first_name:
          movie:      Terminator
          actor:      Arnold Schwarzenegger
          actor_age:  37

          INSERT INTO characters (last_name, first_name, movie, actor, actor_age)
          VALUES ('Connor', 'John', 'Terminator Genisys', 'Jason Clarke', 36)
          Delta
  14. 14. last_name:  Connor | Connor | Connor | Reese
          first_name: John | John | John | Kyle
          movie:      Terminator 2 | Terminator 2 | Terminator Genisys | Terminator 2
          actor:      Edward Furlong | Michael Edwards | Jason Clarke | Michael Biehn
          actor_age:  14 | 47 | 36 | 35

          last_name:  Connor | Connor
          first_name: Sarah | Sarah
          movie:      Terminator | Terminator 2
          actor:      Linda Hamilton | Linda Hamilton
          actor_age:  28 | 35

          last_name:  T-800
          first_name:
          movie:      Terminator
          actor:      Arnold Schwarzenegger
          actor_age:  37

          SELECT MAX(actor_age) FROM characters WHERE last_name='Connor'
  15. 15. last_name:  Connor | Connor | Connor | Reese
          first_name: John | John | John | Kyle
          movie:      Terminator 2 | Terminator 2 | Terminator Genisys | Terminator 2
          actor:      Edward Furlong | Michael Edwards | Jason Clarke | Michael Biehn
          actor_age:  14 | 47 | 36 | 35

          last_name:  Connor | Connor
          first_name: Sarah | Sarah
          movie:      Terminator | Terminator 2
          actor:      Linda Hamilton | Linda Hamilton
          actor_age:  28 | 35

          last_name:  T-800
          first_name:
          movie:      Terminator
          actor:      Arnold Schwarzenegger
          actor_age:  37

          SELECT MAX(actor_age) FROM characters WHERE last_name='Connor'
          MPP FTW
  16. 16. last_name:  Connor | Connor | Connor | Reese
          first_name: John | John | John | Kyle
          movie:      Terminator 2 | Terminator 2 | Terminator Genisys | Terminator 2
          actor:      Edward Furlong | Michael Edwards | Jason Clarke | Michael Biehn
          actor_age:  14 | 47 | 36 | 35

          last_name:  Connor | Connor
          first_name: Sarah | Sarah
          movie:      Terminator | Terminator 2
          actor:      Linda Hamilton | Linda Hamilton
          actor_age:  28 | 35

          last_name:  T-800
          first_name:
          movie:      Terminator
          actor:      Arnold Schwarzenegger
          actor_age:  37

          SELECT MAX(actor_age) FROM characters WHERE movie='Terminator 2'
  17. 17. last_name:  Connor | Connor | Connor | Reese
          first_name: John | John | John | Kyle
          movie:      Terminator 2 | Terminator 2 | Terminator Genisys | Terminator 2
          actor:      Edward Furlong | Michael Edwards | Jason Clarke | Michael Biehn
          actor_age:  14 | 47 | 36 | 35

          last_name:  Connor | Connor
          first_name: Sarah | Sarah
          movie:      Terminator | Terminator 2
          actor:      Linda Hamilton | Linda Hamilton
          actor_age:  28 | 35

          last_name:  T-800
          first_name:
          movie:      Terminator
          actor:      Arnold Schwarzenegger
          actor_age:  37

          SELECT MAX(actor_age) FROM characters WHERE movie='Terminator 2'
          Bloom filters FTW
  18. 18. Master; Tablet Server 1; Tablet Server 2
  19. 19. Master; Master replica; Tablet Server 1; Tablet Server 2; Tablet Server 3; Leader (x4)
  20. 20. Master; Master replica; Tablet Server 1; Tablet Server 2; Tablet Server 3; Leader (x4). Typically 10-100 tablets per machine.
  21. 21. MemRowSet (Col A, Col B, …): in-memory concurrent B-tree, keeps all recently inserted rows.
          DiskRowSet (Col A, Col B, …, [Delta store]), shown twice:
          - Base data: each column written separately in a single contiguous block of data.
          - Deltas: organized by rows (until compaction happens).
  22. 22. Long story short:
          - 30% faster than Parquet 1.0 (TPC-H)
          - 16-187 times faster than Phoenix or HBase (TPC-H again)
          - hundreds of thousands of rows inserted per second on a single tablet server
  23. 23. TPC-H test, scale factor 100, RF 3
          - 75 nodes, each: 64 GB RAM, 12 spinning disks, 2x 6-core Xeon
          - Expansion of 62 GB of data (post-replication, compactions done):
            - 570 GB in HBase (9.2x)
            - 227 GB in Kudu (3.7x)
          http://getkudu.io/kudu.pdf
  24. 24. http://getkudu.io/ http://getkudu.io/faq.html
  25. 25. pmm@collective-sense.com
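
Slides 2 and 3 contrast a row-oriented and a column-oriented layout of the same three rows. The following is a minimal Python sketch of that contrast, not Kudu code: the function names and the in-memory lists are assumptions made purely for illustration, and the `dob` strings from the slides are represented as `datetime.date` values so that `MIN` is well defined.

```python
from datetime import date

# Rows from slide 2 (row-oriented layout): one tuple per character.
rows = [
    ("ee-c6-47-2c", "John", "Connor", date(1985, 2, 28)),
    ("84-ee-ff-d5", "Sarah", "Connor", date(1965, 5, 11)),
    ("57-4f-d9-d8", "Kyle", "Reese", date(2002, 3, 1)),
]

# The same data pivoted into one list per column (slide 3).
names = ("uuid", "first_name", "last_name", "dob")
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}


def min_dob_row_store(rows, last_name):
    """Row layout: every field of every row is touched to answer the query."""
    return min(dob for _uuid, _first, last, dob in rows if last == last_name)


def min_dob_column_store(columns, last_name):
    """Column layout: only the last_name and dob columns are read."""
    matches = [i for i, last in enumerate(columns["last_name"]) if last == last_name]
    return min(columns["dob"][i] for i in matches)


# Both layouts answer SELECT MIN(dob) ... WHERE last_name='Connor' identically.
result_rows = min_dob_row_store(rows, "Connor")
result_cols = min_dob_column_store(columns, "Connor")
```

The point of the sketch is only which data each function has to read: the column version never looks at `uuid` or `first_name`, which is where an analytical scan over wide tables saves I/O.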
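
Slides 6 to 15 show rows being hash-distributed on `(last_name, first_name)` into 4 buckets and a `MAX(actor_age)` query answered across the resulting groups. The sketch below imitates that idea in plain Python. It makes several assumptions: `zlib.crc32` stands in for whatever hash Kudu actually uses, so bucket numbers will not match a real cluster, and `bucket_of`, `partition`, and `max_actor_age` are hypothetical helper names, not part of any Kudu or Impala API.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # mirrors INTO 4 BUCKETS on slide 6

# (last_name, first_name, movie, actor, actor_age), as on slide 6.
rows = [
    ("Connor", "John", "Terminator 2", "Edward Furlong", 14),
    ("Connor", "John", "Terminator 2", "Michael Edwards", 47),
    ("Connor", "Sarah", "Terminator", "Linda Hamilton", 28),
    ("Connor", "Sarah", "Terminator 2", "Linda Hamilton", 35),
    ("Reese", "Kyle", "Terminator 2", "Michael Biehn", 35),
    ("T-800", "", "Terminator", "Arnold Schwarzenegger", 37),
]


def bucket_of(last_name, first_name):
    # Assumption: crc32 is a deterministic stand-in hash, not Kudu's own.
    key = f"{last_name}\x00{first_name}".encode("utf-8")
    return zlib.crc32(key) % NUM_BUCKETS


def partition(rows):
    """Group rows by the hash of their (last_name, first_name) pair."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0], row[1])].append(row)
    return buckets


def max_actor_age(buckets, last_name):
    """Compute a partial MAX per bucket, then combine the partial results."""
    partials = []
    for bucket in buckets.values():
        ages = [row[4] for row in bucket if row[0] == last_name]
        if ages:
            partials.append(max(ages))
    return max(partials, default=None)


buckets = partition(rows)
connor_max = max_actor_age(buckets, "Connor")
```

Because the hash covers both name columns, every row for a given character lands in the same bucket, while a filter on `last_name` alone may still have to consult more than one bucket. Each bucket's partial result can be computed independently, which is the property a parallel query engine relies on.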
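
The expansion factors quoted on slide 23 can be checked with a line of arithmetic. This short Python sketch uses only the sizes given on the slide; the variable names are illustrative.

```python
RAW_GB = 62  # size of the data set before loading, from slide 23

# On-disk size after replication and compaction, from slide 23.
on_disk_gb = {"HBase": 570, "Kudu": 227}

# Expansion factor = on-disk size / raw size, rounded to one decimal place.
expansion = {system: round(size / RAW_GB, 1) for system, size in on_disk_gb.items()}
```

The computed factors agree with the 9.2x and 3.7x figures on the slide.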
