Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Cassandra is getting more and more buzz, and that means two things: more development and more issues. Some issues are unavoidable, but others can be avoided just by understanding how our tooling works.
In this talk I'd like to review the core concepts on which Cassandra is built and how they shape the way we should work with it, using some examples that will hopefully give you both a 'Quick Reference' and a 'Checklist' to go through every time you want to build scalable data models.

About the Speaker
Carlos Alonso, Software Engineer, Job and Talent

Carlos received his Master's in Computer Science from Salamanca University, Spain. He worked there for a few years at a digital agency, gaining expertise in a wide range of technologies, before moving to London, where he narrowed his focus to backend and data engineering. His latest career step was moving back to Madrid to work for Job and Talent, where he currently helps build the best candidate-job opening matching technology. Outside of work he likes sharing as much as he can through public speaking, mentoring, and getting involved in OSS or OpenData initiatives.

Published in: Software


  1. Scalable data modelling by example. Carlos Alonso (@calonso)
  2. Carlos Alonso
     • Ex-Londoner
     • MSc, Salamanca University, Spain
     • Software Engineer @ Jobandtalent
     • Cassandra certified developer
     • DataStax Cassandra MVP 2015 & 2016
     • @calonso / http://mrcalonso.com
  3. Jobandtalent
     • Revolutionising how people find jobs and how businesses hire employees.
     • Leveraging data to produce a unique job matching technology.
     • 10M+ users and 150K+ companies worldwide
     • @jobandtalentEng / http://jobandtalent.com
     • We are hiring!!
  4. Cassandra Concepts
  5. The data model is the only thing you can’t change once in production.
  6. Data organisation (diagram label: Token)
  7. Physical Data Layout
  8. Consistent Hashing: a hash function maps a partition key such as “Carlos” to a token (values shown on the slide: 185664, 1773456738847666528349, -894763734895827651234)
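
The token ring idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an MD5-based stand-in for the hash (not the partitioner Cassandra actually ships), one token per node, and hypothetical node names and token values. A key belongs to the node holding the first token greater than or equal to the key's token, wrapping around at the end of the ring.

```python
import bisect
import hashlib


def token_for(key):
    """Map a partition key to a signed 64-bit token (MD5 stand-in, an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}; keep entries sorted by token
        self._entries = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def owner_of_token(self, token):
        """Node with the first token >= the given token, wrapping around the ring."""
        idx = bisect.bisect_left(self._tokens, token)
        return self._entries[idx % len(self._entries)][1]

    def owner(self, key):
        return self.owner_of_token(token_for(key))


# Hypothetical three-node ring
ring = Ring({
    "node-a": -6_000_000_000_000_000_000,
    "node-b": 0,
    "node-c": 6_000_000_000_000_000_000,
})
print(ring.owner("Carlos"))
```

Because the hash spreads keys uniformly over the token space, evenly spaced tokens give each node a similar share of the data.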
  9. Replication factor: how many copies (replicas) of your data are kept.
  10. Consistency Level: how many replicas of your data must acknowledge?
  11. A complete read/write example (diagram labels: Client, Driver, Partitioner, f81d4fae-…, 834)
     • RF = 3
     • CL = QUORUM
     • SELECT * … WHERE id = f81d4fae-…
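
For the RF = 3, CL = QUORUM example, the arithmetic behind QUORUM is small enough to write down. The sketch below uses hypothetical helper names; it computes the quorum size and checks the usual overlap rule that reads see the latest write when read acknowledgements plus write acknowledgements exceed the replication factor.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """True when every read replica set overlaps every write replica set."""
    return write_acks + read_acks > replication_factor


rf = 3
q = quorum(rf)
print(q)                                  # 2
print(reads_see_latest_write(rf, q, q))   # True: QUORUM writes + QUORUM reads
print(reads_see_latest_write(rf, 1, 1))   # False: ONE writes + ONE reads
```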
  14. Data Modelling
  15. Data Modelling
     • Understand your data
     • Decide (know) how you’ll query the data
     • Define column families to satisfy those queries
     • Implement and optimise
  16. Data Modelling (diagram labels: Conceptual Model, Logical Model, Physical Model, Query-Driven Methodology, Analysis & Validation)
  17. Query Driven Methodology: goals
     • Spread data evenly around the cluster
     • Minimise the number of partitions read
     • Keep partitions manageable
  18. Query Driven Methodology: process
     • Entities and relationships: map to tables
     • Key attributes: map to primary key columns
     • Equality search attributes: must be at the beginning of the primary key
     • Inequality search attributes: become clustering columns
     • Ordering attributes: become clustering columns
  19. The Primary Key: PARTITION KEY + CLUSTERING COLUMN(S)
     CREATE TABLE . . . (
       fields . . .
       PRIMARY KEY (part_key, clust1, . . .)
     );
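
As a small illustration of how the partition key and clustering columns combine, the hypothetical helper below renders a PRIMARY KEY clause from two lists of column names. It always wraps the partition key in its own parentheses, which CQL also accepts for single-column partition keys. The column names come from tables used later in the deck; the function itself is an assumption for illustration, not part of any driver.

```python
def primary_key_clause(partition_key, clustering=()):
    """Render a CQL PRIMARY KEY clause from lists of column names."""
    if not partition_key:
        raise ValueError("at least one partition key column is required")
    # Partition key columns go first, grouped in their own parentheses
    parts = ["(" + ", ".join(partition_key) + ")"]
    # Clustering columns follow, in the order rows should be sorted
    parts.extend(clustering)
    return "PRIMARY KEY (" + ", ".join(parts) + ")"


print(primary_key_clause(["user_ID", "month"], ["time"]))
# PRIMARY KEY ((user_ID, month), time)
print(primary_key_clause(["email"], ["password"]))
# PRIMARY KEY ((email), password)
```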
  20. Analysis & Validation
     • Data evenly spread?
     • 1 partition per read?
     • Are write conflicts (overwrites) possible?
     • How large are partitions?
       – Ncells = Nrow × (Ncols – Npk – Nstatic) + Nstatic < 1M
     • How much data duplication? (batches)
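
The cell-count check can be turned into a quick calculator. The sketch below applies the formula exactly as written on the slide, with the key-column count passed in the way the deck's later examples do; the function names and the `CELL_LIMIT` constant are hypothetical. It also solves the inequality for the largest row count that stays strictly under the limit.

```python
CELL_LIMIT = 1_000_000  # the slide's "< 1M" rule of thumb


def cells_per_partition(n_rows, n_cols, n_key_cols, n_static):
    """Ncells = Nrow x (Ncols - Npk - Nstatic) + Nstatic."""
    return n_rows * (n_cols - n_key_cols - n_static) + n_static


def max_rows(n_cols, n_key_cols, n_static, limit=CELL_LIMIT):
    """Largest Nrow for which Ncells stays strictly below the limit."""
    per_row = n_cols - n_key_cols - n_static
    return (limit - 1 - n_static) // per_row


print(cells_per_partition(1, 5, 1, 0))  # 4      (books table)
print(max_rows(7, 1, 1))                # 199999 (books_read_by_user)
print(max_rows(4, 1, 0))                # 333333 (actions_by_user, unbucketed)
```

The second result is what the deck rounds to 200,000 books per user.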
  21. An E-Library project.
  22. Requirement 1: Books can be uniquely identified and accessed by ISBN. We also need a title, genre, author and publisher.
     • Query-driven methodology (QDM), Q1: Find books by ISBN
     • Table Book: ISBN (K), Title, Author, Genre, Publisher
  23. Analysis & Validation (Book; Q1: Find books by ISBN)
     • Data evenly spread?
     • 1 partition per read?
     • Are write conflicts (overwrites) possible?
     • How large are partitions?
       – Ncells = Nrow × (Ncols – Npk – Nstatic) + Nstatic < 1M
       – 1 × (5 – 1 – 0) + 0 < 1M
     • How much data duplication? 0
  24. Physical data model (Book; Q1: Find books by ISBN)
     CREATE TABLE books (
       ISBN VARCHAR PRIMARY KEY,
       title VARCHAR,
       author VARCHAR,
       genre VARCHAR,
       publisher VARCHAR
     );
     SELECT * FROM books WHERE ISBN = '…';
  25. Requirement 2: Users register into the system, uniquely identified by an email and a password. We also want their full name. They will be accessed by email and password or by internal unique ID.
     • Q1: Find users by ID
     • Q2: Find users by login info
     • Q3: Find users by email (to guarantee uniqueness)
     • Table Users_by_ID (Q1): ID (K), full_name
     • Table Users_by_login_info (Q2, Q3): email (K), password (C), full_name, ID
  26. Analysis & Validation (Users_by_ID; Q1: Find users by ID)
     • Data evenly spread?
     • 1 partition per read?
     • Are write conflicts (overwrites) possible?
     • How large are partitions?
       – Ncells = Nrow × (Ncols – Npk – Nstatic) + Nstatic < 1M
       – 1 × (2 – 1 – 0) + 0 < 1M
     • How much data duplication? 0
  27. Physical Data Model (Users_by_ID; Q1: Find users by ID)
     CREATE TABLE users_by_id (
       ID TIMEUUID PRIMARY KEY,
       full_name VARCHAR
     );
     SELECT * FROM users_by_id WHERE ID = …;
  28. Analysis & Validation (Users_by_login_info: email (K), password (C), full_name, ID; Q2: Find users by login info; Q3: Find users by email, to guarantee uniqueness)
     • Data evenly spread?
     • 1 partition per read?
     • Are write conflicts (overwrites) possible?
     • How large are partitions?
       – Ncells = Nrow × (Ncols – Npk – Nstatic) + Nstatic < 1M
       – 1 × (4 – 1 – 0) + 0 < 1M
     • How much data duplication? 1
  29. Physical Data Model (Users_by_login_info; Q2: Find users by login info; Q3: Find users by email, to guarantee uniqueness)
     CREATE TABLE users_by_login_info (
       email VARCHAR,
       password VARCHAR,
       full_name VARCHAR,
       ID TIMEUUID,
       PRIMARY KEY (email, password)
     );
     SELECT * FROM users_by_login_info WHERE email = '…' [AND password = '…'];
  30. Physical Data Model
     BEGIN BATCH
       INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS;
       INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…);
     APPLY BATCH;
  31. Requirement 3: Users read books. We want to know which books a user has read and show them sorted by title and author.
     • QDM, Q1: Find all books a logged user has read
     • Table Books_read_by_user: user_ID (K), title (C), author (C), full_name (S), ISBN, genre, publisher
     • Legend: K = partition key, C = clustering column, S = static column
  32. Analysis & Validation (Books_read_by_user; Q1: Find all books a logged user has read)
     • Data evenly spread?
     • 1 partition per read?
     • Are write conflicts (overwrites) possible?
     • How large are partitions?
       – Ncells = Nrow × (Ncols – Npk – Nstatic) + Nstatic < 1M
       – Books × (7 – 1 – 1) + 1 < 1M => 200,000 books per user
     • How much data duplication? 0
  33. Physical Data Model (Books_read_by_user; Q1: Find all books a logged user has read)
     CREATE TABLE books_read_by_user (
       user_id TIMEUUID,
       title VARCHAR,
       author VARCHAR,
       full_name VARCHAR STATIC,
       ISBN VARCHAR,
       genre VARCHAR,
       publisher VARCHAR,
       PRIMARY KEY (user_id, title, author)
     );
     SELECT * FROM books_read_by_user WHERE user_ID = …;
  34. Requirement 4: In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have with our site.
     • QDM, Q1: Find all actions a user does in a time range
     • Table Actions_by_user: user_ID (K), time (C), element, type
  35. Analysis & Validation (Actions_by_user; Q1: Find all actions a user does in a time range)
     • Data evenly spread?
     • 1 partition per read?
     • Are write conflicts (overwrites) possible?
     • How large are partitions?
       – Ncells = Nrow × (Ncols – Npk – Nstatic) + Nstatic < 1M
       – Actions × (4 – 1 – 0) + 0 < 1M => 333,333
     • How much data duplication? 0
  36. Requirement 4: Bucketing
     • Table Actions_by_user: user_ID (K), month (K), time (C), element, type
     • Ncells = Nrow × (Ncols – Npk – Nstatic) + Nstatic < 1M
     • Actions × (5 – 2 – 0) + 0 < 1M => 333,333 per user every <bucket_size>
       – bucket_size = 1 year => 38 actions / h
       – bucket_size = 1 month => 462 actions / h
       – bucket_size = 1 week => 1984 actions / h
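
The per-hour figures follow from dividing the 333,333-row budget by the number of hours in each bucket. A minimal sketch, assuming a 365-day year and a 30-day month (the values that reproduce the slide's numbers); the names are hypothetical.

```python
MAX_ROWS_PER_BUCKET = 333_333  # row budget per partition from the cell formula

# Hours per bucket, assuming a 365-day year and a 30-day month
HOURS_PER_BUCKET = {
    "year": 365 * 24,
    "month": 30 * 24,
    "week": 7 * 24,
}


def sustained_actions_per_hour(bucket, max_rows=MAX_ROWS_PER_BUCKET):
    """Actions per hour a single user can sustain without exceeding the budget."""
    return max_rows // HOURS_PER_BUCKET[bucket]


for bucket in ("year", "month", "week"):
    print(bucket, sustained_actions_per_hour(bucket))
# year 38
# month 462
# week 1984
```

Smaller buckets tolerate more activity per user, at the cost of more partitions to read for a given time range.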
  37. Analysis & Validation (Actions_by_user: user_ID (K), month (K), time (C), element, type; Q1: Find all actions a user does in a time range)
     • Data evenly spread?
     • 1 partition per read?
     • Are write conflicts (overwrites) possible?
     • How large are partitions?
       – Ncells = Nrow × (Ncols – Npk – Nstatic) + Nstatic < 1M
       – Actions × (5 – 2 – 0) + 0 < 1M => 333,333 / month
     • How much data duplication? 0
  38. Physical Data Model (Actions_by_user; Q1: Find all actions a user does in a time range)
     CREATE TABLE actions_by_user (
       user_ID TIMEUUID,
       month INT,
       time TIMESTAMP,
       element VARCHAR,
       type VARCHAR,
       PRIMARY KEY ((user_ID, month), time)
     );
     SELECT * FROM actions_by_user WHERE user_ID = … AND month = … AND time < … AND time > …;
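
Because month is part of the partition key, a time range that crosses a month boundary touches more than one partition, so the SELECT has to be issued once per bucket. The sketch below lists the buckets a range covers. It assumes the month column stores a YYYYMM integer, which the slides do not specify, and the helper names are hypothetical.

```python
from datetime import datetime


def month_bucket(ts):
    """Encode a timestamp's month as a YYYYMM integer (an assumed encoding)."""
    return ts.year * 100 + ts.month


def buckets_between(start, end):
    """All YYYYMM buckets touched by the inclusive range [start, end]."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


print(buckets_between(datetime(2016, 11, 20), datetime(2017, 1, 5)))
# [201611, 201612, 201701]
```

Each bucket in the list corresponds to one partition read for that user, which is worth keeping in mind when choosing the bucket size.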
  39. Further validation
     ∑sizeOf(pk) + ∑sizeOf(sc) + Nr × (∑sizeOf(rc) + ∑sizeOf(clc)) + 8 × Nv < 200 MB
     – pk = Partition Key column
     – sc = Static column
     – Nr = Number of rows
     – rc = Regular column
     – clc = Clustering column
     – Nv = Number of values
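
The size estimate can be evaluated directly once column sizes are plugged in. The sketch below is a rough illustration for the bucketed actions_by_user table: the 16-byte UUID, 4-byte int and 8-byte timestamp sizes are the standard widths of those types, while the 20-byte average for each VARCHAR column is purely an assumption, as are the function and variable names.

```python
def partition_size_bytes(pk_sizes, static_sizes, regular_sizes,
                         clustering_sizes, n_rows, n_values):
    """sum(pk) + sum(sc) + Nr x (sum(rc) + sum(clc)) + 8 x Nv, in bytes."""
    return (sum(pk_sizes)
            + sum(static_sizes)
            + n_rows * (sum(regular_sizes) + sum(clustering_sizes))
            + 8 * n_values)


n_rows = 333_333
size = partition_size_bytes(
    pk_sizes=[16, 4],        # user_ID (TIMEUUID), month (INT)
    static_sizes=[],         # no static columns
    regular_sizes=[20, 20],  # element, type: assumed 20 bytes on average
    clustering_sizes=[8],    # time (TIMESTAMP)
    n_rows=n_rows,
    n_values=n_rows * 3,     # Nv as computed on the bucketing slide
)
print(size)                  # 23999996
print(size < 200 * 10**6)    # True
```

Under these assumptions a full monthly bucket comes to roughly 24 MB, well inside the 200 MB guideline.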
  40. Next Steps
     • Test your models against your hardware setup
       – cassandra-stress
       – http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez)
     • Monitor everything
       – DataStax OpsCenter
       – Graphite
       – Datadog
       – . . .
  41. Thanks! Carlos Alonso, Software Engineer, @calonso
