Cassandra EU 2012 - Data modelling workshop by Richard Low

Data modelling workshop

Richard Low

rlow@acunu.com @richardalow

Wednesday, 28 March 2012

Outline
• What is data modelling?
• What do I need to know to come up with a model?
• Options and available tools
• Denormalisation
• Example and demo: scalable messaging application


What is data modelling?


Data modelling

• How you organise your data
• Store all in one big value?
• Store as columns in one row or lots of rows?
• Use counters?
• Can I avoid read-modify-write?


Why care about it?

• Performance
• Ensure good load balancing
• Disk usage
• Future proofing


Performance

• Bad data model: do read-modify-write on large column
• Good data model: just overwrite updated data
• Difference? Could be 100 ops/s vs. 100k ops/s (1000x improvement)
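
A minimal sketch of the two update styles, using a plain Python dict as a stand-in for a Cassandra row. The profile data and function names are hypothetical and not part of any client library. The first function has to read and rewrite the whole serialised value to change one field; the second writes only the changed column and never reads.

```python
import json

# Stand-in for one row: column name -> value.
# Bad model: the whole profile lives in one large serialised column.
blob_row = {
    "profile": json.dumps({"name": "alice", "city": "London", "status": "away"})
}

# Good model: each field is its own column.
column_row = {"name": "alice", "city": "London", "status": "away"}


def update_status_blob(row, status):
    """Read-modify-write: one read plus one write of the full value."""
    profile = json.loads(row["profile"])   # read the whole column
    profile["status"] = status             # modify in the client
    row["profile"] = json.dumps(profile)   # write the whole column back
    return 1, 1                            # (reads, writes)


def update_status_column(row, status):
    """Blind overwrite: a single small write and no read."""
    row["status"] = status
    return 0, 1                            # (reads, writes)


print(update_status_blob(blob_row, "online"))      # (1, 1)
print(update_status_column(column_row, "online"))  # (0, 1)
```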


Performance

• Cacheability
• Ensure your cache isn’t polluted by uncacheable things
• Cached reads are ~100x faster than uncached


What do you need?


Optimise for queries

• Data model design starts with queries
• What are the common queries?


Workload

• How many inserts?
• How many reads?
• Do inserts depend on current data?
• Is data write-once?


Sizes
• How big are the values?
• Are some ‘users’ bigger than others?
• How cacheable is your data?


How do I get this?
• Back of the envelope calculation
• Monitor existing solution
• Prototype a solution


Options and tools


Keyspaces and Column Families
SQL         Cassandra
Database    Keyspace
Table       Column Family

(Figure: a column family drawn as rows, each with a row key and its columns.)


Options and tools

• Rows
• Columns
• Supercolumns
• Composite columns


Rows and columns
(Figure: a sparse grid of rows row1 to row7 against columns col1 to col7; each row has values in only some of the columns.)


Column options

• Regular columns
• Super columns: columns within columns
• Composite columns: multi-dimensional column names


Composite columns
alice: {
m2: {
Sender: bob,
Subject: ‘paper!’, ...
}
}

bob: {
m1: {
Sender: alice,
Subject: ‘rock?’, ...
}
}

charlie: {
m1: {
Sender: alice,
},
m2: {
Sender: bob,
}
}
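
One way to picture composite column names is as tuples, sketched below with plain Python data. The helper names `to_composite` and `slice_message` are hypothetical. The nested view is flattened into (message id, field) column names, which sort component by component, so all the columns of one message sit next to each other in the row.

```python
# Nested view from the slide: row key -> message id -> field -> value.
nested = {
    "charlie": {
        "m1": {"Sender": "alice"},
        "m2": {"Sender": "bob"},
    },
}


def to_composite(messages):
    """Flatten {message_id: {field: value}} into sorted composite columns."""
    columns = {
        (message_id, field): value
        for message_id, fields in messages.items()
        for field, value in fields.items()
    }
    # Columns are kept sorted by name; tuples compare component by component.
    return sorted(columns.items())


def slice_message(columns, message_id):
    """Return every column whose first name component is message_id."""
    return [(name, value) for name, value in columns if name[0] == message_id]


charlie = to_composite(nested["charlie"])
print(charlie)
print(slice_message(charlie, "m2"))
```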


Tools

• Counters: atomic inc and dec
• Expiring columns: TTL
• Secondary indexes: your WHERE clause


Rows vs columns
• Row key is the shard key
• Need lots of rows for scalability
• Don’t be afraid of large-ish rows
• But don’t make them too big
• Avoid range queries across rows, but use them within rows


Range queries
• Within a row:
SELECT col3..col5 FROM Standard1 WHERE KEY=row1

row1 col1 col2 col5 col6 col8
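
A rough sketch of why a slice within a row is cheap: the columns are stored sorted by name, so a range is one contiguous run. The `get_slice` function here is a plain-Python illustration using `bisect`, not the Thrift call of the same name.

```python
from bisect import bisect_left, bisect_right

# Columns of row1 from the slide, kept sorted by name.
row1 = ["col1", "col2", "col5", "col6", "col8"]


def get_slice(sorted_columns, start, finish):
    """Return the column names in the inclusive range [start, finish]."""
    lo = bisect_left(sorted_columns, start)
    hi = bisect_right(sorted_columns, finish)
    return sorted_columns[lo:hi]


print(get_slice(row1, "col3", "col5"))  # ['col5']
```

Only col5 falls inside col3..col5 for this row, because col3 and col4 were never written.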


Range queries
• Across rows:
SELECT * FROM table WHERE key > row2 LIMIT 2


Range queries
SELECT * FROM table WHERE key > row2 LIMIT 2

(Figure: ring of nodes holding row1, row2, row3 and row4, showing which rows the query returns.)

Range queries

• Range queries within rows (‘get_slice’) are fine
• Avoid range queries across rows (‘get_range_slices’)
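
A sketch of one reason scans across rows behave differently, assuming a hash-based partitioner such as Cassandra's RandomPartitioner. The `token` function below is a simplified stand-in and not the exact token calculation. Rows are placed by the hash of their key, so walking the ring visits keys in token order, which generally has nothing to do with key order.

```python
import hashlib

row_keys = ["row1", "row2", "row3", "row4"]


def token(key):
    """Simplified MD5-based token for a row key (illustrative only)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


# Order in which a scan across rows would visit the keys.
ring_order = sorted(row_keys, key=token)

print("key order:  ", row_keys)
print("token order:", ring_order)
```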


Batching
• Overhead on each call
• Batch together inserts, better if in the same row
• Reduce read ops, use large get_slice reads
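
A small sketch of batching on the client side, with made-up data and a hypothetical helper `group_by_row`. Individual inserts are collected into one mapping of row key to columns, so that they can go out in a single call and all columns for the same row travel together.

```python
from collections import defaultdict

# Individual inserts as (row_key, column_name, value); the data is made up.
inserts = [
    ("bob", "m1:sender", "alice"),
    ("charlie", "m1:sender", "alice"),
    ("bob", "m1:subject", "rock?"),
    ("charlie", "m1:subject", "rock?"),
]


def group_by_row(mutations):
    """Collect mutations into one {column: value} dict per row key."""
    batch = defaultdict(dict)
    for row_key, column, value in mutations:
        batch[row_key][column] = value
    return dict(batch)


batch = group_by_row(inserts)
print(len(inserts), "single inserts ->", len(batch), "rows in one batch")
```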


Denormalisation


Denormalisation

• Hard drive performance constraints:
• Sequential IO at 100s MB/s
• Seek at 100 IO/s
• Avoid random IO
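
A back-of-the-envelope sketch of what these numbers mean for one read of 10 items. It assumes 100 MB/s for sequential IO (the low end of "100s MB/s") and 1 KB items, and it ignores transfer time for the scattered case.

```python
SEEKS_PER_SECOND = 100                      # from the slide
SEQUENTIAL_BYTES_PER_SECOND = 100 * 10**6   # assumed: 100 MB/s
ITEM_BYTES = 1000                           # assumed: 1 KB per item


def scattered_read_seconds(items):
    """Each item lives somewhere else on disk: one seek per item."""
    return items / SEEKS_PER_SECOND


def contiguous_read_seconds(items):
    """Items are stored together: one seek, then a sequential read."""
    seek = 1 / SEEKS_PER_SECOND
    transfer = items * ITEM_BYTES / SEQUENTIAL_BYTES_PER_SECOND
    return seek + transfer


print(scattered_read_seconds(10))   # about 0.1 s
print(contiguous_read_seconds(10))  # about 0.0101 s
```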


Denormalisation
• Store columns accessed at similar times near to each other
• => put them in the same row
• Involves copying
• Copying isn’t bad - pre-flood prices <$100 per TB


Messaging Application

Messaging application

• Users can send messages to other users
• Horizontally scalable
• Expect users to send to lots of recipients


Messaging

• In an RDBMS we might have a table for:
• Users
• Messages (sender is unique)
• Mappings, Message → Receiver


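A relational model
Example relational DB model:

• Users: Id, username
• Messages: Id, Subject, Content, Date, Sender_Id
• Msg_Receipt: Id, Message_Id, User_Id, Is_read

• Users 1 → ∞ Messages (via Sender_Id)
• Messages 1 → ∞ Msg_Receipt (via Message_Id)
• Users 1 → ∞ Msg_Receipt (via User_Id)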


Querying
Most recent 10 messages sent by a user:
SELECT *
FROM Messages
WHERE Messages.Sender_Id = <id>
ORDER BY Messages.Date DESC
LIMIT 10;

Most recent 10 messages received by a user:
SELECT Messages.*
FROM Messages, Msg_Receipt
WHERE Msg_Receipt.User_Id = <id>
AND Msg_Receipt.Message_Id = Messages.Id
ORDER BY Messages.Date DESC
LIMIT 10;


Under the hood
Msg_Receipt
id     msg_id   user_id
0      0        0
1      3        1
2      4        2
3      6000     0

Messages
id     subject  ...
0      a
1      b
2      c
3      d
4      e
...
6000   x


Under the hood

• Normalisation => seeks
• So denormalise
• Hit capacity limit of one node quickly


Back of the envelope...

• 1 M users
• Message size 1 KB
• Each user has 5000 messages
• => 5 TB data



• Reading 10 messages => 10 seeks
• If 10k active at once, need 100k seeks/s
• => need 1000 disks
• With 8 disks per node, RF 3, that’s 375 nodes
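
The arithmetic above, written out as a sketch. It assumes decimal units, 100 seeks/s per disk (the figure from the denormalisation slides), and that each of the 10k active users issues one 10-message read per second.

```python
import math

USERS = 1_000_000
MESSAGE_BYTES = 1_000        # 1 KB, decimal units
MESSAGES_PER_USER = 5_000

ACTIVE_USERS = 10_000        # assumed: one read per second each
SEEKS_PER_READ = 10          # normalised model: one seek per message
SEEKS_PER_DISK = 100         # seeks/s a single disk can serve
DISKS_PER_NODE = 8
REPLICATION_FACTOR = 3

total_bytes = USERS * MESSAGES_PER_USER * MESSAGE_BYTES
seeks_per_second = ACTIVE_USERS * SEEKS_PER_READ
disks = seeks_per_second // SEEKS_PER_DISK
nodes = math.ceil(disks * REPLICATION_FACTOR / DISKS_PER_NODE)

print(total_bytes / 10**12, "TB")    # 5.0 TB
print(seeks_per_second, "seeks/s")   # 100000 seeks/s
print(disks, "disks")                # 1000 disks
print(nodes, "nodes")                # 375 nodes
```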



• Denormalise: messages are immutable
• Insert them into everyone’s inbox
• Read 10 messages is one seek
• Paging is sequential
• => 10x fewer nodes: 38 nodes now!
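
A sketch of where the 38 comes from, under the same assumptions as the estimate above: 10k reads/s, 100 seeks/s per disk, 8 disks per node and RF 3. The function name `nodes_needed` is hypothetical. The only input that changes is the number of seeks per read.

```python
import math


def nodes_needed(reads_per_second, seeks_per_read,
                 seeks_per_disk=100, disks_per_node=8, replication_factor=3):
    """Nodes required to serve the seek load, rounded up."""
    seeks_per_second = reads_per_second * seeks_per_read
    disks = seeks_per_second / seeks_per_disk
    return math.ceil(disks * replication_factor / disks_per_node)


normalised = nodes_needed(10_000, seeks_per_read=10)
denormalised = nodes_needed(10_000, seeks_per_read=1)

print(normalised)    # 375
print(denormalised)  # 38 (37.5 rounded up)
```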


In Cassandra

• Use a row per user
• Composite columns, with TimeUUID as ID
• Gives time ordering on messages
• Inserts go to all recipients


Messaging example
From: alice
To: bob, charlie
Subject: rock?

Row       m1
alice     -
bob       sender: alice, subject: rock?
charlie   sender: alice, subject: rock?

Messaging example
From: bob
To: alice, charlie
Subject: paper!

Row       m1                              m2
alice     -                               sender: bob, subject: paper!
bob       sender: alice, subject: rock?   -
charlie   sender: alice, subject: rock?   sender: bob, subject: paper!

Data
alice: {
m2: {
Sender: bob,
}
}

bob: {
m1: {
Sender: alice,
}
}

charlie: {
m1: {
Sender: alice,
},
m2: {
Sender: bob,
}
}


Demo

• Pycassa
• Send message
• List messages
• Unread count
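
The demo code itself is not reproduced in these slides. As one possible approach to the unread count, the sketch below keeps a counter per user, with a `collections.Counter` standing in for a Cassandra counter column. The function names are hypothetical. Sending increments the counter for each recipient and reading a message decrements it, so neither operation needs to read the inbox.

```python
from collections import Counter

# Stand-in for counter columns: user -> number of unread messages.
unread = Counter()


def on_send(recipients):
    """Increment the unread counter of every recipient."""
    for recipient in recipients:
        unread[recipient] += 1


def on_read(user):
    """Decrement the counter once for each message the user opens."""
    unread[user] -= 1


on_send(["bob", "charlie"])    # alice sends "rock?"
on_send(["alice", "charlie"])  # bob sends "paper!"
on_read("charlie")             # charlie opens one message

print(unread["alice"], unread["bob"], unread["charlie"])  # 1 1 1
```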

