Introduction to Data Modeling in Cassandra

This is an introduction to data modeling in Cassandra. We cover the concept of denormalization, and why and how to embrace it when using Cassandra. We explain that a CQL table's primary key is composed of a partitioning key and clustering columns, and why it is so important to get those right. We also walk through some examples.

Published in: Technology

  1. 1. Jim Hatcher DFW Cassandra Users - Meetup 7/12/2016 Introduction to Data Modeling with Apache Cassandra
  2. 2. Agenda • Introduction • How does Cassandra work? • What is CQL? • Embracing Denormalization • Key Structure • Advanced Techniques • Resources
  3. 3. Introduction
     Jim Hatcher, james_hatcher@hotmail.com
     At IHS, we take raw data and turn it into information and insights for our customers.
     • Automotive Systems (CarFax)
     • Defense Systems (Jane’s)
     • Oil & Gas Systems (Petra)
     • Maritime Systems
     • Technology & Media Systems (Electronic Parts Database, Root Metrics)
     Pipeline: Sources of Raw Data → Structure Data → Add Value → Customer-facing Systems
  4. 4. How does Cassandra work?
     CREATE KEYSPACE orders
       WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };
     CREATE TABLE orders.customer (
       customer_id uuid,
       customer_name varchar,
       customer_age int,
       PRIMARY KEY ( customer_id )
     );
     INSERT INTO orders.customer (customer_id, customer_name, customer_age)
       VALUES (feb2b9e6-613b-4e6b-b470-981dc4d42525, 'Bob', 35);
     SELECT customer_name, customer_age
       FROM orders.customer
       WHERE customer_id = feb2b9e6-613b-4e6b-b470-981dc4d42525;
     Diagram: a client connected to a six-node Cassandra cluster (nodes A through F). Each node owns one of these token ranges:
     • -9223372036854775808 through -6148914691236517207
     • -6148914691236517206 through -3074457345618258605
     • -3074457345618258604 through -3
     • -2 through 3074457345618258599
     • 3074457345618258600 through 6148914691236517201
     • 6148914691236517202 through 9223372036854775807
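The token ranges on this slide can be read as a lookup table: the partition key is hashed to a signed 64-bit token, and the node whose range contains that token owns the row. The following is a minimal Python sketch of that lookup, not Cassandra's implementation. It assumes the ranges belong to nodes A through F in ring order (the transcript does not preserve which node owns which range), and `stand_in_token` is a hypothetical MD5-based hash used only for illustration, so its tokens will not match a real cluster. Replicas are taken as the owner plus the next nodes clockwise, which is the usual description of SimpleStrategy.

```python
import bisect
import hashlib

# Inclusive upper bound of each token range from the slide, in ring order.
RANGE_ENDS = [
    -6148914691236517207,
    -3074457345618258605,
    -3,
    3074457345618258599,
    6148914691236517201,
    9223372036854775807,
]
# Assumption: nodes own the ranges in alphabetical order.
NODES = ["A", "B", "C", "D", "E", "F"]


def stand_in_token(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token (stand-in hash, not Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def owner_index(token: int) -> int:
    """Index of the node whose inclusive range contains the token."""
    return bisect.bisect_left(RANGE_ENDS, token)


def replicas_for_token(token: int, replication_factor: int = 3) -> list:
    """Owner of the token plus the next nodes clockwise on the ring."""
    first = owner_index(token)
    return [NODES[(first + i) % len(NODES)] for i in range(replication_factor)]


if __name__ == "__main__":
    token = stand_in_token("feb2b9e6-613b-4e6b-b470-981dc4d42525")
    print(token, replicas_for_token(token))
```

With a replication factor of 3, every row lives on three of the six nodes, so a client can still read a row when one of its replicas is down.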
  5. 5. CQL
     Cassandra Query Language
     Standard interface for working with Cassandra
     Very similar to standard SQL, with a few notable exceptions:
     • No JOIN clauses
     • No GROUP BY / HAVING clauses
     • Restricted WHERE clauses
       • You can only query by key fields in prescribed ways
     CQL Data Types
     CQL Type | Description
     bigint | 64-bit signed long
     boolean | true or false
     decimal | Variable-precision decimal
     double | 64-bit IEEE-754 floating point
     float | 32-bit IEEE-754 floating point
     int | 32-bit signed integer
     text | UTF-8 encoded string
     timestamp | Date plus time, encoded as 8 bytes since epoch
     timeuuid | Type 1 UUID only
     uuid | A UUID in standard UUID format
     Others: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cql_data_types_c.html
  6. 6. Normalization
     In relational databases, we start with understanding how the data relates together. We create a conceptual model. Our physical model often looks identical to the conceptual model.
     Student: StudentID (PK), FirstName, LastName, DateOfBirth
     Course: CourseID (PK), CourseCode, CourseNumber, CourseName, Department, CourseDescription
     Class: ClassID (PK), CourseID (FK), Semester, Section, Professor, Classroom, DayAndTime
     StudentClassSchedule: StudentID (PK), ClassID (PK), Grade
  7. 7. Normalization
     Every piece of data lives in one and only one place. We use our data-layer to enforce referential integrity.
     Course
     CourseID | CourseCode | CourseNumber | CourseName | Department
     C-AAA | ENGL | 101 | American Literature | Humanities
     C-BBB | MATH | 203 | Linear Algebra | Mathematics
     C-CCC | BIOL | 201 | Molecular Biology | Science
     C-DDD | HIST | 108 | World History | History
     C-EEE | ENGL | 102 | British Literature | Humanities
     Class
     ClassID | CourseID | Semester | Section | Professor | Classroom | DayAndTime
     SP16-ENGL-101-01 | C-AAA | Spring 2016 | 01 | Mark Twain | XYZ Hall, Room 212 | MWF 8:00 AM
     SP16-MATH-203-01 | C-BBB | Spring 2016 | 01 | Isaac Newton | XYZ Hall, Room 212 | TuTh 9:30 AM
     FA16-BIOL-201-04 | C-CCC | Fall 2016 | 04 | Charles Darwin | XYZ Hall, Room 210 | MWF 9:00 AM
     FA16-HIST-108-03 | C-DDD | Fall 2016 | 03 | Napoleon Bonaparte | XYZ Hall, Room 317 | TuTh 12:00 PM
     FA16-ENGL-102-04 | C-EEE | Fall 2016 | 04 | Virginia Woolf | XYZ Hall, Room 184 | MWF 10:00 AM
     FA16-ENGL-102-04 | C-EEE | Fall 2016 | 05 | Jane Austen | XYZ Hall, Room 185 | TuTh 2:00 PM
     Student
     StudentID | FirstName | LastName | DateOfBirth
     S-111 | Joe | Smith | 1/1/1970
     S-222 | Jill | Jones | 2/2/1972
     S-333 | Betty | Williams | 3/3/1973
     StudentClassSchedule
     StudentID | ClassID | Grade
     S-111 | SP16-ENGL-101-01 | A
     S-111 | SP16-MATH-203-01 | C
     S-111 | FA16-BIOL-201-04 | <null>
     S-111 | FA16-HIST-108-03 | <null>
     S-111 | FA16-ENGL-102-04 | <null>
     S-222 | FA16-HIST-108-03 | <null>
  8. 8. Normalization
     To satisfy a query, we join tables together.
     To give a student his/her schedule, we might use this query:
     SELECT Course.CourseCode, Course.CourseNumber, Course.CourseName,
            Class.ClassID, Class.Section, Class.Classroom, Class.DayAndTime
       FROM StudentClassSchedule
       INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
       INNER JOIN Course ON Class.CourseID = Course.CourseID
       WHERE StudentClassSchedule.StudentID = 'S-111'
         AND Class.Semester = 'Fall 2016';
     To give a professor a class roster, we might use this query:
     SELECT Student.FirstName, Student.LastName, Class.Classroom, Class.DayAndTime
       FROM Student
       INNER JOIN StudentClassSchedule ON Student.StudentID = StudentClassSchedule.StudentID
       INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
       WHERE Class.ClassID = 'FA16-HIST-108-03';
  9. 9. Denormalization
     Queries: Student Schedule for a Given Semester; Student Roster for a Given Class
     Tables: student_schedule; class_roster
     Each query gets its own table, so the same data is stored more than once. If updates happen to “core data,” we have to have a mechanism to deal with it. For instance, if a class is relocated to a new classroom, we now have to update the classroom field in both student_schedule and class_roster.
  10. 10. Key Structure
      CREATE TABLE student_schedule (
        student_id text,
        semester text,
        class_id text,
        course_code text,
        course_number int,
        section text,
        classroom text,
        day_and_time text,
        PRIMARY KEY ( (student_id), semester, class_id )
      );
      Primary Key
      The primary key is the combination of
      1. the partitioning key, and
      2. the clustering columns
      As in a relational database, it uniquely identifies the row. The values in the primary key cannot be NULL.
      The first value in the PRIMARY KEY clause is the partitioning key. Any subsequent values are clustering columns. To specify a multi-column partitioning key, wrap its columns in an inner set of parentheses.
  11. 11. Key Structure
      PRIMARY KEY ( (student_id), semester, class_id )
      Partitioning Key: (student_id). Clustering Columns: semester, class_id.
      The partitioning key is responsible for distributing data across the cluster. It separates data.
      Within a given partition, clustering columns are responsible for clustering data values together. They connect data.
      This is a representation of how Cassandra stores data on disk:
      Partition student_id = S-111
        FALL 2016 : FA16-ENGL-102-04 : course_code = ENGL
        SPRING 2016 : SP16-ENGL-101-01 : course_code = ENGL
        ….
  12. 12. Querying with CQL
      When you access Cassandra data via CQL, you retrieve CQL Rows. A “CQL Row” can be (and usually is) different from the physical structure (a partition) in which the data is stored within the Cassandra cluster.
      Partitioning Key
      • Must be queried using an equality expression (i.e., = or IN).
      • If you have a multi-field partitioning key, you must specify all the fields in the partitioning key to query the data.
      Clustering Columns
      • Can be queried with an inequality (i.e., <, >) or an equality.
      • If you have multiple clustering columns, you don’t have to specify all of them, but you do have to specify them in order (i.e., you can’t specify clustering column #2 unless you also supply clustering column #1).
      SELECT * FROM student_schedule;
      student_schedule (Partitioning Key: student_id; Clustering Columns: semester, class_id)
      student_id | semester | class_id | course_code | course_number | section | classroom | day_and_time
      S-111 | FALL 2016 | FA16-ENGL-102-04 | ENGL | 102 | 04 | XYZ Hall, Room 185 | TuTh 2:00 PM
      S-111 | SPRING 2016 | SP16-ENGL-101-01 | ENGL | 101 | 01 | XYZ Hall, Room 212 | MWF 8:00 AM
      S-111 | SPRING 2016 | SP16-MATH-203-01 | MATH | 203 | 01 | XYZ Hall, Room 212 | TuTh 9:30 AM
  13. 13. CQL
      Acceptable Queries:
      SELECT * FROM student_schedule WHERE student_id = 'S-111';
      SELECT * FROM student_schedule WHERE student_id = 'S-111' AND semester = 'SPRING 2016';
      SELECT * FROM student_schedule WHERE student_id = 'S-111' AND semester = 'SPRING 2016' AND class_id = 'SP16-ENGL-101-01';
      SELECT * FROM student_schedule WHERE student_id = 'S-111' AND semester = 'SPRING 2016' AND class_id >= 'SP16-ENGL-101-01' AND class_id < 'SP16-ENGL-999-99';
      UN-acceptable Queries:
      SELECT * FROM student_schedule WHERE course_code = 'ENGL';
        (Non-key field)
      SELECT * FROM student_schedule WHERE student_id = 'S-111' OR student_id = 'S-222';
        (Non-equality condition against Partitioning Key)
      SELECT * FROM student_schedule WHERE student_id = 'S-111' AND class_id = 'SP16-ENGL-101-01';
        (Specifying a clustering column but not in order)
      Note: Yes, I know I could mention secondary indexes and the ALLOW FILTERING clause at this point; but they’re anti-patterns, so don’t use them.
      student_schedule (Partitioning Key: student_id; Clustering Columns: semester, class_id)
      student_id | semester | class_id | course_code | course_number | section | classroom | day_and_time
      S-111 | FALL 2016 | FA16-ENGL-102-04 | ENGL | 102 | 04 | XYZ Hall, Room 185 | TuTh 2:00 PM
      S-111 | SPRING 2016 | SP16-ENGL-101-01 | ENGL | 101 | 01 | XYZ Hall, Room 212 | MWF 8:00 AM
      S-111 | SPRING 2016 | SP16-MATH-203-01 | MATH | 203 | 01 | XYZ Hall, Room 212 | TuTh 9:30 AM
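The acceptable and unacceptable queries on this slide follow from a small set of rules, which can be sketched as a checker. This is a simplified Python model written for illustration, not how Cassandra validates queries: `is_acceptable` is a hypothetical helper, each WHERE condition is reduced to either "eq" (= or IN) or "range" (<, >, <=, >=), and secondary indexes and ALLOW FILTERING are deliberately left out. The OR query from the slide is not represented, since the model only describes conditions joined by AND.

```python
PARTITION_KEY = ("student_id",)
CLUSTERING_COLUMNS = ("semester", "class_id")


def is_acceptable(restrictions: dict) -> bool:
    """Check WHERE restrictions, given as {column: "eq" or "range"}."""
    key_columns = PARTITION_KEY + CLUSTERING_COLUMNS
    # Only key fields may be restricted, and only with known operator kinds.
    for column, operator in restrictions.items():
        if column not in key_columns or operator not in ("eq", "range"):
            return False
    # Every partitioning key column must be present with an equality.
    if any(restrictions.get(column) != "eq" for column in PARTITION_KEY):
        return False
    # Clustering columns must be supplied in order, with no gaps,
    # and nothing may follow a range restriction.
    must_stop = False
    for column in CLUSTERING_COLUMNS:
        operator = restrictions.get(column)
        if operator is None:
            must_stop = True
        elif must_stop:
            return False
        elif operator == "range":
            must_stop = True
    return True


if __name__ == "__main__":
    print(is_acceptable({"student_id": "eq", "semester": "eq"}))
    print(is_acceptable({"student_id": "eq", "class_id": "eq"}))
```

The second call models the last unacceptable query on the slide: class_id is supplied without semester, so the clustering columns are not given in order.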
  14. 14. Key Structure
      Partitioning Key - Considerations:
      1. Spread data adequately across the cluster so that you don’t create hotspots.
      2. Minimize the number of partition reads. Ideally, you can get all your data out of one partition.
      3. Updates that happen within the same partition have some atomicity guarantees.
      Clustering Columns - Considerations:
      1. A partition can contain a maximum of 2 billion clustering column values.
      2. A partition should not hold more than 100 MB of data.
      GETTING THE KEY STRUCTURE CORRECT IS THE KEY TO GOOD DATA MODELING
  15. 15. Key Structure
      Target query:
      SELECT * FROM student_schedule WHERE student_id = 'S-111' AND semester = 'Fall 2016';
      CREATE TABLE student_schedule_v1 ( student_id text, semester text, class_id text, course_code text, …, PRIMARY KEY ( (student_id), semester, class_id ) );
        Allows for queries: 1) by the student_id only, OR 2) by the student_id and semester.
      CREATE TABLE student_schedule_v2 ( student_id text, semester text, class_id text, course_code text, …, PRIMARY KEY ( (student_id, semester), class_id ) );
        Minimizes the number of partition reads. I consider this the winner.
      CREATE TABLE student_schedule_v3 ( student_id text, semester text, class_id text, course_code text, …, PRIMARY KEY ( (student_id, semester, class_id) ) );
        Requires that a field be passed to satisfy the query that we don’t necessarily have in our app.
      CREATE TABLE student_schedule_v4 ( semester text, student_id text, class_id text, course_code text, …, PRIMARY KEY ( (semester), student_id, class_id ) );
        Creates a potential hotspot.
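One way to compare the four key choices is to count how many partitions the same rows fall into under each one. The Python sketch below does that for the six enrollments from the StudentClassSchedule table earlier in the deck, with the semester taken from the matching Class row. It is only a model: `partition_sizes` and `KEY_CHOICES` are hypothetical names, and a plain tuple of column values stands in for Cassandra's hashed token.

```python
from collections import defaultdict

# Enrollments from the normalized example, with the semester of each class.
ROWS = [
    {"student_id": "S-111", "semester": "Spring 2016", "class_id": "SP16-ENGL-101-01"},
    {"student_id": "S-111", "semester": "Spring 2016", "class_id": "SP16-MATH-203-01"},
    {"student_id": "S-111", "semester": "Fall 2016", "class_id": "FA16-BIOL-201-04"},
    {"student_id": "S-111", "semester": "Fall 2016", "class_id": "FA16-HIST-108-03"},
    {"student_id": "S-111", "semester": "Fall 2016", "class_id": "FA16-ENGL-102-04"},
    {"student_id": "S-222", "semester": "Fall 2016", "class_id": "FA16-HIST-108-03"},
]

# Partitioning key columns of each table version on the slide.
KEY_CHOICES = {
    "v1": ("student_id",),
    "v2": ("student_id", "semester"),
    "v3": ("student_id", "semester", "class_id"),
    "v4": ("semester",),
}


def partition_sizes(rows, partition_key):
    """Count rows per partition for a given choice of partitioning key."""
    sizes = defaultdict(int)
    for row in rows:
        sizes[tuple(row[column] for column in partition_key)] += 1
    return dict(sizes)


if __name__ == "__main__":
    for version, partition_key in KEY_CHOICES.items():
        print(version, partition_sizes(ROWS, partition_key))
```

Under v4 every enrollment for a semester lands in a single partition, which is the hotspot concern; under v3 each row is its own partition, so reading one student's semester would need a separate lookup for every class_id.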
  16. 16. Upserts
      On writes, Cassandra always does an upsert (i.e., update if the record exists, and insert if the record doesn’t exist).
      Suppose you picked a poor key for your table (one that doesn’t make the rows unique) and then you inserted the following data.
      CREATE TABLE student_schedule_BAD_PK ( student_id text, semester text, class_id text, course_code text, …, PRIMARY KEY ( student_id ) );
      INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …) VALUES ( 'S-111', 'FALL 2016', 'FA16-ENGL-102-04', … );
      INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …) VALUES ( 'S-111', 'SPRING 2016', 'SP16-ENGL-101-01', … );
      INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …) VALUES ( 'S-111', 'SPRING 2016', 'SP16-MATH-203-01', … );
      SELECT * FROM student_schedule_BAD_PK WHERE student_id = 'S-111';
      Result?
      Accidental upserting is a common issue early in your data model testing. It can be tough to track down because it doesn’t throw an error.
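The "Result?" question can be worked through with a small model of upsert behavior. In this Python sketch a table is a dictionary keyed by the primary key values, and `insert` is a hypothetical helper that overwrites any existing row with the same key. It assumes the three writes are applied in the order shown, so the last one wins; it does not model Cassandra's per-column write timestamps.

```python
# The three inserts from the slide, reduced to the columns that are shown.
ROWS = [
    {"student_id": "S-111", "semester": "FALL 2016", "class_id": "FA16-ENGL-102-04"},
    {"student_id": "S-111", "semester": "SPRING 2016", "class_id": "SP16-ENGL-101-01"},
    {"student_id": "S-111", "semester": "SPRING 2016", "class_id": "SP16-MATH-203-01"},
]


def insert(table: dict, primary_key_columns: tuple, row: dict) -> None:
    """Upsert: create the row if its key is new, otherwise overwrite its columns."""
    key = tuple(row[column] for column in primary_key_columns)
    table.setdefault(key, {}).update(row)


def load(primary_key_columns: tuple) -> dict:
    """Apply all inserts, in order, to an empty table with the given primary key."""
    table = {}
    for row in ROWS:
        insert(table, primary_key_columns, row)
    return table


if __name__ == "__main__":
    bad = load(("student_id",))
    good = load(("student_id", "semester", "class_id"))
    print(len(bad), bad[("S-111",)])
    print(len(good))
```

With student_id alone as the primary key, the model keeps one row holding the values of the last insert; with the full key from student_schedule, all three rows survive. Neither case raises an error, which is why the problem is easy to miss.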
  17. 17. Advanced Techniques
      1. CQL Collections (sets, lists, maps)
      2. User Defined Types
      3. Tuples
      4. Static Columns
  18. 18. Resources
      1. DataStax Academy – Self-paced course: https://academy.datastax.com/courses/ds220-data-modeling
      2. KillrVideo: https://academy.datastax.com/resources/datastax-reference-application-killrvideo/
