Data modelling workshop

Richard Low

rlow@acunu.com @richardalow



Wednesday, 28 March 2012
Outline
                   • What is data modelling?
                   • What do I need to know to come up with a model?
                   • Options and available tools
                   • Denormalisation
                   • Example and demo: scalable messaging application


What is data modelling?




Data modelling

                   • How you organise your data
                   • Store all in one big value?
                   • Store as columns in one row or lots of rows?
                   • Use counters?
                   • Can I avoid read-modify-write?

Why care about it?

                   • Performance
                   • Ensure good load balancing
                   • Disk usage
                   • Future proofing

Performance

                   • Bad data model: do read-modify-write on large column
                   • Good data model: just overwrite updated data
                   • Difference? Could be 100 ops/s vs. 100k ops/s (1000x improvement)
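
The contrast between the two models can be sketched with a toy in-memory store that only counts operations. This is an illustration, not Cassandra client code: the `ColumnStore` class, the `profile` column and the field names are all assumptions made up for the example.

```python
class ColumnStore:
    """Toy in-memory stand-in for a column store that counts operations."""

    def __init__(self):
        self.columns = {}
        self.reads = 0
        self.writes = 0

    def get(self, name):
        self.reads += 1
        return self.columns.get(name)

    def put(self, name, value):
        self.writes += 1
        self.columns[name] = value


def update_bad(store, field, value):
    # Bad model: everything lives in one large column, so each update
    # reads the whole value, modifies it and writes it all back.
    blob = dict(store.get("profile") or {})
    blob[field] = value
    store.put("profile", blob)


def update_good(store, field, value):
    # Good model: one column per field, so an update is a blind overwrite.
    store.put(field, value)


bad, good = ColumnStore(), ColumnStore()
for i in range(3):
    update_bad(bad, "field%d" % i, i)
    update_good(good, "field%d" % i, i)
```

The bad model pays one read per update and rewrites a value that keeps growing; the good model never reads at all.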




Performance

                   • Cacheability
                    • Ensure your cache isn’t polluted by uncacheable things
                    • Cached reads are ~100x faster than uncached



What do you need?



Optimise for queries


                   • Data model design starts with queries
                   • What are the common queries?


Workload

                   • How many inserts?
                   • How many reads?
                   • Do inserts depend on current data?
                   • Is data write-once?

Sizes
                   • How big are the values?
                   • Are some ‘users’ bigger than others?
                   • How cacheable is your data?




How do I get this?
        • Back of the envelope calculation
        • Monitor existing solution
        • Prototype a solution




Options and tools




Keyspaces and Column Families
                    SQL                 Cassandra
                    Database            Keyspace
                    Table               Column Family

                    [Figure: a column family drawn as rows, each a row key followed by its columns]




Options and tools

                   • Rows
                   • Columns
                    • Supercolumns
                    • Composite columns

Rows and columns
                           col1   col2   col3   col4   col5   col6   col7
                row1               x                    x      x
                row2        x      x      x      x      x
                row3               x      x             x      x      x
                row4               x      x      x             x
                row5               x             x      x      x
                row6               x
                row7        x      x             x



Column options

                   • Regular columns
                   • Super columns: columns within columns
                   • Composite columns: multi-dimensional column names




Composite columns
                           alice: {
                              m2: {
                                 Sender: bob,
                                 Subject: ‘paper!’, ...
                              }
                           }

                           bob: {
                              m1: {
                                  Sender: alice,
                                  Subject: ‘rock?’, ...
                              }
                           }

                           charlie: {
                              m1: {
                                 Sender: alice,
                                 Subject: ‘rock?’, ...
                              },
                              m2: {
                                 Sender: bob,
                                 Subject: ‘paper!’, ...
                              }
                           }


Tools

                   • Counters: atomic inc and dec
                   • Expiring columns: TTL
                   • Secondary indexes: your WHERE clause


Rows vs columns
                   • Row key is the shard key
                   • Need lots of rows for scalability
                   • Don’t be afraid of large-ish rows
                    • But don’t make them too big
                   • Avoid range queries across rows, but use them within rows


Range queries
               • Within a row:
                      SELECT col3..col5 FROM
                      Standard1 WHERE KEY=row1


             row1          col1   col2   col5   col6   col8
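
Because columns within a row are kept sorted by name, a slice is a lookup followed by a sequential scan. A minimal sketch of that idea, assuming an in-memory list of `(name, value)` pairs; the `get_slice` function here is a hypothetical stand-in written for this example, and the column values are made up.

```python
from bisect import bisect_left, bisect_right


def get_slice(row, start, finish):
    """Return the (name, value) pairs with start <= name <= finish.

    `row` is a list of (name, value) pairs already sorted by name.
    """
    names = [name for name, _ in row]
    lo = bisect_left(names, start)    # first column >= start
    hi = bisect_right(names, finish)  # one past the last column <= finish
    return row[lo:hi]


# The row from the slide: only col1, col2, col5, col6 and col8 exist.
row1 = [("col1", "a"), ("col2", "b"), ("col5", "c"), ("col6", "d"), ("col8", "e")]
result = get_slice(row1, "col3", "col5")
```

For the row on the slide, the range col3..col5 contains only col5, since col3 and col4 are absent.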




Range queries
             • Across rows:
                    SELECT * FROM table WHERE key >
                    row2 LIMIT 2




Range queries
    SELECT * FROM table
    WHERE key > row2
    LIMIT 2
     > row2, row1

    [Figure: ring with the rows placed around it in the order row4, row2, row1, row3]


Range queries

                   • Range queries within rows (‘get_slice’) are fine
                   • Avoid range queries across rows (‘get_range_slices’)
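
One reason cross-row ranges are awkward is that rows are ordered by a hash of the key rather than by the key itself. A rough sketch of that behaviour, assuming an MD5-based token function; the `token` and `get_range_slices` functions below are simplified stand-ins written for this example, not the server's implementation, and the sketch does not wrap around the ring.

```python
import hashlib


def token(key):
    # Assumption: an MD5-derived integer token, in the spirit of a hash partitioner.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def get_range_slices(keys, start_key, limit):
    """Return up to `limit` keys in token order, starting at start_key's token."""
    ordered = sorted(keys, key=token)
    start = token(start_key)
    return [k for k in ordered if token(k) >= start][:limit]


keys = ["row1", "row2", "row3", "row4"]
result = get_range_slices(keys, "row2", 2)
```

The result starts at the requested key and continues in token order, so the rows that follow are effectively arbitrary from the application's point of view.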




Batching
                   • Overhead on each call
                   • Batch together inserts, better if in the same row
                   • Reduce read ops, use large get_slice reads
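
The batching idea can be sketched without a client library: collect individual column inserts and group them by row, so one call carries them all. The `group_by_row` helper and the `m2:sender` style column names are assumptions made for this example.

```python
from collections import defaultdict


def group_by_row(mutations):
    """Group (row_key, column, value) inserts so each row appears once."""
    batch = defaultdict(dict)
    for row_key, column, value in mutations:
        batch[row_key][column] = value
    return dict(batch)


mutations = [
    ("alice", "m2:sender", "bob"),
    ("alice", "m2:subject", "paper!"),
    ("charlie", "m2:sender", "bob"),
    ("charlie", "m2:subject", "paper!"),
]
batch = group_by_row(mutations)
calls_unbatched = len(mutations)  # one call per column insert
calls_batched = 1                 # one call carrying the whole batch
```

The per-call overhead is paid once instead of four times, and columns for the same row travel together.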



Denormalisation




Denormalisation

                   • Hard drive performance constraints:
                    • Sequential IO at 100s MB/s
                    • Seek at 100 IO/s
                   • Avoid random IO

Denormalisation
                   • Store columns accessed at similar times near to each other
                   • => put them in the same row
                   • Involves copying
                   • Copying isn’t bad: pre-flood disk prices were <$100 per TB



Messaging Application

Messaging application

                   • Users can send messages to other users
                   • Horizontally scalable
                   • Expect users to send to lots of recipients


Messaging

                   • In an RDBMS we might have a table for:
                    • Users
                    • Messages (sender is unique)
                    • Mappings, Message → Receiver


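A relational model
        Example relational DB model

        Users:        Id, username
        Messages:     Id, Subject, Content, Date, Sender_Id
        Msg_Receipt:  Id, Message_Id, User_Id, Is_read

        Users.Id     1:∞  Messages.Sender_Id
        Users.Id     1:∞  Msg_Receipt.User_Id
        Messages.Id  1:∞  Msg_Receipt.Message_Id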

Querying
        Most recent 10 messages sent by a user:
                SELECT *
                    FROM Messages
                    WHERE Messages.Sender_Id = <id>
                    ORDER BY Messages.Date DESC
                    LIMIT 10;



         Most recent 10 messages received by a user:
                SELECT Messages.*
                    FROM Messages, Msg_Receipt
                    WHERE Msg_Receipt.User_Id = <id>
                    AND Msg_Receipt.Message_Id = Messages.Id
                    ORDER BY Messages.Date DESC
                    LIMIT 10;
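
The two queries can be tried against an in-memory SQLite database. This is only a sketch: the table and column names follow the slide, but the column types, the integer dates and the sample rows are assumptions, and the `<id>` placeholder is replaced by a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (Id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE Messages (Id INTEGER PRIMARY KEY, Subject TEXT, Content TEXT,
                           Date INTEGER, Sender_Id INTEGER REFERENCES Users(Id));
    CREATE TABLE Msg_Receipt (Id INTEGER PRIMARY KEY,
                              Message_Id INTEGER REFERENCES Messages(Id),
                              User_Id INTEGER REFERENCES Users(Id),
                              Is_read INTEGER DEFAULT 0);
    INSERT INTO Users VALUES (1, 'alice'), (2, 'bob'), (3, 'charlie');
    INSERT INTO Messages VALUES (1, 'rock?', '', 100, 1), (2, 'paper!', '', 200, 2);
    INSERT INTO Msg_Receipt (Message_Id, User_Id)
        VALUES (1, 2), (1, 3), (2, 1), (2, 3);
""")


def sent(user_id, limit=10):
    """Subjects of the most recent messages sent by a user."""
    rows = conn.execute(
        "SELECT Subject FROM Messages WHERE Sender_Id = ? "
        "ORDER BY Date DESC LIMIT ?",
        (user_id, limit),
    )
    return [subject for (subject,) in rows]


def received(user_id, limit=10):
    """Subjects of the most recent messages received by a user (needs a join)."""
    rows = conn.execute(
        "SELECT Messages.Subject FROM Messages, Msg_Receipt "
        "WHERE Msg_Receipt.User_Id = ? AND Msg_Receipt.Message_Id = Messages.Id "
        "ORDER BY Messages.Date DESC LIMIT ?",
        (user_id, limit),
    )
    return [subject for (subject,) in rows]
```

The received query has to visit Msg_Receipt and then look up each matching row in Messages, which is the access pattern the following slides cost out in seeks.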


Under the hood
              Msg_Receipt
              id     msg_id   user_id
              0      0        0
              1      3        1
              2      4        2
              3      6000     0

              Messages
              id     subject  ...
              0      a
              1      b
              2      c
              3      d
              4      e
              ...
              6000   x


Under the hood

                   • Normalisation => seeks
                   • So denormalise
                    • Hit capacity limit of one node quickly


Back of the envelope...

                   • 1 M users
                   • Message size 1 KB
                   • Each user has 5000 messages
                   • => 5 TB data
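
The storage estimate is a single multiplication. A quick check, assuming decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes); the variable names are made up for this example.

```python
users = 1000000             # 1 M users
message_size_bytes = 1000   # 1 KB per message, decimal units assumed
messages_per_user = 5000

total_bytes = users * message_size_bytes * messages_per_user
total_tb = total_bytes / 1e12
```

This gives 5 TB for a single copy of the data, before replication or any denormalised copies.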

Back of the envelope...

                   • Reading 10 messages => 10 seeks
                   • If 10k active at once, need 100k seeks/s
                   • => need 1000 disks
                   • With 8 disks per node, RF 3, that’s 375 nodes



Back of the envelope...

                   • Denormalise: messages are immutable
                   • Insert them into everyone’s inbox
                   • Read 10 messages is one seek
                   • Paging is sequential
                   • => 10x fewer nodes: 38 nodes now!
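
The node counts before and after denormalising follow from the same arithmetic. A small sketch, assuming each active user issues one "read 10 messages" request per second and using the 100 seeks/s per disk figure from the Denormalisation slides; the function and variable names are made up for this example.

```python
import math

active_users = 10000        # 10k users active at once
seeks_per_disk_per_s = 100  # seek rate of one hard drive
disks_per_node = 8
replication_factor = 3


def nodes_needed(seeks_per_read):
    """Nodes required when each read of 10 messages costs `seeks_per_read` seeks."""
    seeks_per_s = active_users * seeks_per_read
    disks = seeks_per_s / seeks_per_disk_per_s
    return math.ceil(disks / disks_per_node * replication_factor)


normalised = nodes_needed(10)   # 10 seeks per read: one per message
denormalised = nodes_needed(1)  # 1 seek per read: messages stored together
```

With these inputs the normalised model needs 375 nodes and the denormalised one 38 (37.5 rounded up).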

In Cassandra

                   • Use a row per user
                   • Composite columns, with TimeUUID as ID
                   • Gives time ordering on messages
                   • Inserts go to all recipients

Messaging example
                               From:    alice
                               To:      bob, charlie
                               Subject: rock?


                              row       m1
                              alice
                              bob       sender: alice, subject: rock?
                              charlie   sender: alice, subject: rock?

Messaging example
                                From:    bob
                                To:      alice, charlie
                                Subject: paper!


                              row       m1                              m2
                              alice                                     sender: bob, subject: paper!
                              bob       sender: alice, subject: rock?
                              charlie   sender: alice, subject: rock?   sender: bob, subject: paper!

Data
                           alice: {
                              m2: {
                                 Sender: bob,
                                 Subject: ‘paper!’, ...
                              }
                           }

                           bob: {
                              m1: {
                                  Sender: alice,
                                  Subject: ‘rock?’, ...
                              }
                           }

                           charlie: {
                              m1: {
                                 Sender: alice,
                                 Subject: ‘rock?’, ...
                              },
                              m2: {
                                 Sender: bob,
                                 Subject: ‘paper!’, ...
                              }
                           }
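
The layout above can be sketched in plain Python, with a dictionary standing in for the column family. This is an illustration rather than the workshop's demo code: the `send_message` and `list_messages` helpers are hypothetical, and `uuid.uuid1()` is used as a stand-in for a TimeUUID.

```python
import uuid
from collections import defaultdict

# One row per user; each column name is a composite (message_id, field).
inboxes = defaultdict(dict)


def send_message(sender, recipients, subject):
    """Copy the message into every recipient's row (denormalised insert)."""
    message_id = uuid.uuid1()  # time-based UUID, standing in for a TimeUUID
    for recipient in recipients:
        row = inboxes[recipient]
        row[(message_id, "sender")] = sender
        row[(message_id, "subject")] = subject
    return message_id


def list_messages(user, limit=10):
    """Return the newest `limit` messages for `user`, newest first."""
    messages = defaultdict(dict)
    for (message_id, field), value in inboxes[user].items():
        messages[message_id][field] = value
    # The timestamp embedded in the UUID gives the time ordering.
    newest_first = sorted(messages, key=lambda m: m.time, reverse=True)
    return [messages[m] for m in newest_first[:limit]]


send_message("alice", ["bob", "charlie"], "rock?")
send_message("bob", ["alice", "charlie"], "paper!")
```

Listing a user's messages touches only that user's row, which is what makes the read a single seek in the estimate earlier.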


Demo

                   • Pycassa
                   • Send message
                   • List messages
                   • Unread count

Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 

Cassandra EU 2012 - Data modelling workshop by Richard Low

  • 1. Data modelling workshop
      Richard Low
      rlow@acunu.com @richardalow
      Wednesday, 28 March 2012
  • 2. Outline
      • What is data modelling?
      • What do I need to know to come up with a model?
      • Options and available tools
      • Denormalisation
      • Example and demo: scalable messaging application
  • 3. What is data modelling?
  • 4. Data modelling
      • How you organise your data
      • Store all in one big value?
      • Store as columns in one row or lots of rows?
      • Use counters?
      • Can I avoid read-modify-write?
  • 5. Why care about it?
      • Performance
      • Ensure good load balancing
      • Disk usage
      • Future proofing
  • 6. Performance
      • Bad data model: do read-modify-write on large column
      • Good data model: just overwrite updated data
      • Difference? Could be 100 ops/s vs. 100k ops/s (1000x improvement)
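The contrast on this slide can be sketched with plain Python dicts standing in for a column family. The `store` dict, the function names, and the JSON encoding of the large value are illustrative assumptions, not anything shown in the talk.

```python
import json

# Stand-in for a column family: row key -> {column name -> value}.
store = {}

def update_blob(row_key, field, value):
    """Bad model: the whole record lives in one large column."""
    row = store.setdefault(row_key, {})
    blob = json.loads(row.get("profile", "{}"))   # read the entire value
    blob[field] = value                           # modify one field
    row["profile"] = json.dumps(blob)             # write the entire value back

def update_column(row_key, field, value):
    """Good model: one column per field, so an update is a blind overwrite."""
    store.setdefault(row_key, {})[field] = value  # no read needed

update_blob("alice", "email", "alice@example.com")
update_column("bob", "email", "bob@example.com")
```

In the first function every update pays for a read and a full rewrite of the value; in the second the write does not depend on what is already stored.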
  • 7. Performance
      • Cacheability
          • Ensure your cache isn’t polluted by uncacheable things
          • Cached reads are ~100x faster than uncached
  • 8. What do you need?
  • 9. Optimise for queries
      • Data model design starts with queries
      • What are the common queries?
  • 10. Workload
      • How many inserts?
      • How many reads?
      • Do inserts depend on current data?
      • Is data write-once?
  • 11. Sizes
      • How big are the values?
      • Are some ‘users’ bigger than others?
      • How cacheable is your data?
  • 12. How do I get this?
      • Back of the envelope calculation
      • Monitor existing solution
      • Prototype a solution
  • 14. Keyspaces and Column Families
      SQL          Cassandra
      Database     Keyspace
      Table        Column Family
      [Figure: a table and a column family, each drawn as rows with a row key and columns col_1, col_2]
  • 15. Options and tools
      • Rows
      • Columns
      • Supercolumns
      • Composite columns
  • 16. Rows and columns
      [Figure: sparse grid of rows row1 to row7 against columns col1 to col7; each row has values (x) in only some of the columns]
  • 17. Column options
      • Regular columns
      • Super columns: columns within columns
      • Composite columns: multi-dimensional column names
  • 18. Composite columns
      alice: {
        m2: { Sender: bob, Subject: ‘paper!’, ... }
      }
      bob: {
        m1: { Sender: alice, Subject: ‘rock?’, ... }
      }
      charlie: {
        m1: { Sender: alice, Subject: ‘rock?’, ... },
        m2: { Sender: bob, Subject: ‘paper!’, ... }
      }
  • 19. Tools
      • Counters: atomic inc and dec
      • Expiring columns: TTL
      • Secondary indexes: your WHERE clause
  • 20. Rows vs columns
      • Row key is the shard key
      • Need lots of rows for scalability
      • Don’t be afraid of large-ish rows
      • But don’t make them too big
      • Avoid range queries across rows, but use them within rows
  • 21. Range queries
      • Within a row:
        SELECT col3..col5 FROM Standard1 WHERE KEY=row1
      [Figure: row1 holding columns col1, col2, col5, col6, col8]
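A range query within a row is cheap because the columns of a row are stored sorted by name. A minimal sketch of that idea, assuming a row is modelled as a sorted list of (name, value) pairs; the `get_slice` helper here is a hypothetical stand-in, not the Thrift call of the same name, and the column values are made up:

```python
from bisect import bisect_left, bisect_right

def get_slice(columns, start, finish):
    """Return the (name, value) pairs with start <= name <= finish.

    `columns` must already be sorted by column name.
    """
    names = [name for name, _ in columns]
    lo = bisect_left(names, start)    # first column >= start
    hi = bisect_right(names, finish)  # one past the last column <= finish
    return columns[lo:hi]

# row1 from the slide holds col1, col2, col5, col6 and col8.
row1 = [("col1", "a"), ("col2", "b"), ("col5", "c"), ("col6", "d"), ("col8", "e")]
print(get_slice(row1, "col3", "col5"))  # [('col5', 'c')]
```

The slice col3..col5 matches only col5, since col3 and col4 are absent from this row; the result is one contiguous run of the sorted columns.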
  • 22. Range queries
      • Across rows:
        SELECT * FROM table WHERE key > row2 LIMIT 2
  • 23. Range queries
      SELECT * FROM table WHERE key > row2 LIMIT 2
      [Figure: rows row1 to row4 and the range selected by key > row2]
  • 24. Range queries
      • Range queries within rows (‘get_slice’) are fine
      • Avoid range queries across rows (‘get_range_slices’)
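One reason range queries across rows are awkward can be sketched as follows, assuming a hash-based partitioner (Cassandra's RandomPartitioner places rows by the MD5 hash of the row key). The `token` and `range_scan` helpers are hypothetical simplifications: they ignore ring wrap-around and replication.

```python
import hashlib

def token(row_key):
    """Illustrative token: the MD5 hash of the row key as an integer."""
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)

def range_scan(row_keys, start_key, limit):
    """Return up to `limit` row keys whose token follows start_key's token."""
    ring = sorted(row_keys, key=token)   # storage order is token order
    start = token(start_key)
    return [k for k in ring if token(k) > start][:limit]

keys = ["row1", "row2", "row3", "row4"]
print(sorted(keys, key=token))       # token order, generally not key order
print(range_scan(keys, "row2", 2))   # the rows that follow row2 in token order
```

Under this assumption, "key > row2" returns whichever rows hash after row2, not the rows whose names sort after it, and neighbouring results may live on different nodes.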
  • 25. Batching
      • Overhead on each call
      • Batch together inserts, better if in the same row
      • Reduce read ops, use large get_slice reads
  • 27. Denormalisation
      • Hard drive performance constraints:
          • Sequential IO at 100s of MB/s
          • Seek at 100 IO/s
      • Avoid random IO
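The gap between those two figures can be made concrete with a rough calculation. This is a sketch under stated assumptions: 100 MB/s is taken as the low end of "100s of MB/s", values are assumed to be 1 KB, and all names are illustrative.

```python
SEEKS_PER_SECOND = 100                 # from the slide: about 100 IO/s
SEQUENTIAL_BYTES_PER_SECOND = 100e6    # assumption: low end of "100s of MB/s"
VALUE_BYTES = 1000                     # assumption: 1 KB values

def random_read_seconds(n_values):
    """Each value sits somewhere else on disk, so each read costs one seek."""
    return n_values / SEEKS_PER_SECOND

def sequential_read_seconds(n_values):
    """Values are adjacent: one seek, then a sequential transfer."""
    transfer = n_values * VALUE_BYTES / SEQUENTIAL_BYTES_PER_SECOND
    return 1 / SEEKS_PER_SECOND + transfer

print(f"{random_read_seconds(10) * 1000:.1f} ms")      # 100.0 ms
print(f"{sequential_read_seconds(10) * 1000:.1f} ms")  # 10.1 ms
```

With these numbers the transfer time is negligible next to the seek, so the cost of a read is roughly the number of seeks it needs.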
  • 28. Denormalisation
      • Store columns accessed at similar times near to each other
      • => put them in the same row
      • Involves copying
      • Copying isn’t bad - pre flood prices <$100 per TB
  • 30. Messaging application
      • Users can send messages to other users
      • Horizontally scalable
      • Expect users to send to lots of recipients
  • 31. Messaging
      • In an RDBMS we might have a table for:
          • Users
          • Messages (sender is unique)
          • Mappings, Message → Receiver
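  • 32. A relational model
      Example relational DB model:
        Users        (Id, username)
        Messages     (Id, Subject, Content, Date, Sender_Id)
        Msg_Receipt  (Id, Message_Id, User_Id, Is_read)
      Relationships (1 to ∞):
        Users.Id → Messages.Sender_Id
        Users.Id → Msg_Receipt.User_Id
        Messages.Id → Msg_Receipt.Message_Id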
  • 33. Querying
      Most recent 10 messages sent by a user:
        SELECT * FROM Messages
        WHERE Messages.Sender_Id = <id>
        ORDER BY Messages.Date DESC LIMIT 10;
      Most recent 10 messages received by a user:
        SELECT Messages.* FROM Messages, Msg_Receipt
        WHERE Msg_Receipt.User_Id = <id>
          AND Msg_Receipt.Message_Id = Messages.Id
        ORDER BY Messages.Date DESC LIMIT 10;
  • 34. Under the hood
      Msg_Receipt
        id   msg_id   user_id
        0    0        0
        1    3        1
        2    4        2
        3    6000     0
      Messages
        id     subject   ...
        0      a
        1      b
        2      c
        3      d
        4      e
        ...
        6000   x
  • 35. Under the hood
      • Normalisation => seeks
      • So denormalise
      • Hit capacity limit of one node quickly
  • 36. Back of the envelope...
      • 1 M users
      • Message size 1 KB
      • Each user has 5000 messages
      • => 5 TB data
  • 37. Back of the envelope...
      • Reading 10 messages => 10 seeks
      • If 10k active at once, need 100k seeks/s
      • => need 1000 disks
      • With 8 disks per node, RF 3, that’s 375 nodes
  • 38. Back of the envelope...
      • Denormalize: messages are immutable
      • Insert them into everyone’s inbox
      • Read 10 messages is one seek
      • Paging is sequential
      • => 10x fewer nodes: 38 nodes now!
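The arithmetic on the last three slides can be reproduced in a few lines. This sketch uses only the figures from the slides, plus two assumptions: each active user issues one 10-message read per second (which is what turns 10k active users into 100k seeks/s), and 1 KB is taken as 1000 bytes. The names are illustrative.

```python
import math

USERS = 1_000_000
MESSAGES_PER_USER = 5000
MESSAGE_BYTES = 1000          # 1 KB, decimal (assumption)
ACTIVE_USERS = 10_000
SEEKS_PER_DISK = 100          # IO/s per disk
DISKS_PER_NODE = 8
REPLICATION_FACTOR = 3

def nodes_needed(seeks_per_read):
    """Nodes required if every active user does one read per second (assumption)."""
    seeks_per_second = ACTIVE_USERS * seeks_per_read
    disks = seeks_per_second / SEEKS_PER_DISK
    return math.ceil(disks / DISKS_PER_NODE * REPLICATION_FACTOR)

total_tb = USERS * MESSAGES_PER_USER * MESSAGE_BYTES / 1e12
print(total_tb)          # 5.0
print(nodes_needed(10))  # 375 (normalised: one seek per message)
print(nodes_needed(1))   # 38  (denormalised: one seek per read)
```

The denormalised figure is 37.5 before rounding, which is where the "38 nodes" on the slide comes from.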
  • 39. In Cassandra
      • Use a row per user
      • Composite columns, with TimeUUID as ID
      • Gives time ordering on messages
      • Inserts go to all recipients
  • 40. Messaging example
      From: alice
      To: bob, charlie
      Subject: rock?
      Resulting rows:
        alice:   (empty)
        bob:     m1 { sender: alice, subject: rock? }
        charlie: m1 { sender: alice, subject: rock? }
  • 41. Messaging example
      From: bob
      To: alice, charlie
      Subject: paper!
      Resulting rows:
        alice:   m2 { sender: bob, subject: paper! }
        bob:     m1 { sender: alice, subject: rock? }
        charlie: m1 { sender: alice, subject: rock? }, m2 { sender: bob, subject: paper! }
  • 42. Data
      alice: {
        m2: { Sender: bob, Subject: ‘paper!’, ... }
      }
      bob: {
        m1: { Sender: alice, Subject: ‘rock?’, ... }
      }
      charlie: {
        m1: { Sender: alice, Subject: ‘rock?’, ... },
        m2: { Sender: bob, Subject: ‘paper!’, ... }
      }
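The model above (a row per user, composite column names led by a time-based ID, writes fanned out to every recipient) can be sketched in memory. This is not the demo code from the talk: the `inboxes` dict stands in for a column family, Python's `uuid.uuid1()` stands in for a TimeUUID, and the function names are hypothetical.

```python
import uuid

# Stand-in for a column family: one row per user, and within each row
# columns named (message_id, field).
inboxes = {}

def send_message(sender, recipients, subject):
    """Write a copy of the message into every recipient's row."""
    msg_id = uuid.uuid1()  # time-based UUID, standing in for a TimeUUID
    for user in recipients:
        row = inboxes.setdefault(user, {})
        row[(msg_id, "sender")] = sender
        row[(msg_id, "subject")] = subject
    return msg_id

def list_messages(user, count=10):
    """Return the newest `count` messages, reading only this user's row."""
    row = inboxes.get(user, {})
    ids = sorted({msg_id for msg_id, _ in row},
                 key=lambda m: m.time, reverse=True)
    return [
        {"sender": row[(m, "sender")], "subject": row[(m, "subject")]}
        for m in ids[:count]
    ]

send_message("alice", ["bob", "charlie"], "rock?")
send_message("bob", ["alice", "charlie"], "paper!")
print(list_messages("charlie"))
```

Sending costs one write per recipient, but listing an inbox touches a single row, which is the trade the denormalised design makes.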
  • 43. Demo
      • Pycassa
      • Send message
      • List messages
      • Unread count
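The transcript does not show how the demo computes the unread count. One plausible approach, assuming it relies on the counters listed under Tools (atomic inc and dec), is sketched below with `collections.Counter` standing in for a counter column per user; the function names are hypothetical.

```python
from collections import Counter

unread = Counter()  # stand-in for one counter column per user

def deliver(recipients):
    """Increment each recipient's unread count when a message is delivered."""
    for user in recipients:
        unread[user] += 1

def mark_read(user):
    """Decrement the unread count when the user opens a message."""
    unread[user] -= 1

deliver(["bob", "charlie"])    # alice's message
deliver(["alice", "charlie"])  # bob's message
mark_read("charlie")
print(unread["alice"], unread["bob"], unread["charlie"])  # 1 1 1
```

Both operations are blind increments or decrements, so neither needs to read the current value first.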