Data Modeling Deep Dive

Data Modeling:
Four use cases
Toji George
Solutions Architect
MongoDB Inc.

Agenda
• 4 Real World Schemas
– Inbox
– History
– Indexed Attributes
– Multiple Identities
• Conclusions

In MongoDB
Application Development requires Good Schema Design
Success comes from Proper Data Structure
“Schema-less”?

Design Goals
• Efficiently send new messages to recipients
• Efficiently read inbox

Three (of many) Approaches
• Fan out on Read
• Fan out on Write
• Fan out on Write with Bucketing

Fan out on read
// Shard on "from"
sh.shardCollection( "mongodbdays.inbox", { from: 1 } )
// Make sure we have an index to handle inbox reads
db.inbox.ensureIndex( { to: 1, sent: 1 } )
msg = {
  from: "Joe",
  to: [ "Bob", "Jane" ],
  sent: new Date(),
  message: "Hi!"
}
// Send a message
db.inbox.save( msg )
// Read my inbox
db.inbox.find( { to: "Joe" } ).sort( { sent: -1 } )

Fan out on read – I/O
[Diagram: "Send Message" and "Read Inbox" I/O across Shard 1, Shard 2, Shard 3]

Considerations
• Write: One document per message sent
• Read: Find all messages with my own name in the recipient field
• Read: Requires scatter-gather on sharded cluster
• A lot of random I/O on a shard to find everything
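
To make the scatter-gather cost concrete, here is a minimal in-memory sketch in plain JavaScript (no MongoDB required). The three arrays standing in for shards, the toy `shardFor` hash, and the helper names `send` and `readInbox` are assumptions made for illustration; they do not reflect how mongos actually routes queries.

```javascript
// Hypothetical model: three in-memory "shards"; messages are placed by `from`.
const SHARD_COUNT = 3;
const shards = Array.from({ length: SHARD_COUNT }, () => []);

// Toy routing function (an assumption, not MongoDB's real chunk routing).
function shardFor(key) {
  let sum = 0;
  for (const ch of key) sum += ch.charCodeAt(0);
  return sum % SHARD_COUNT;
}

// Fan out on read: one document per message, stored on the sender's shard.
function send(msg) {
  shards[shardFor(msg.from)].push(msg);
}

// Reading an inbox filters on `to`, which is not the shard key,
// so every shard has to be consulted (scatter-gather).
function readInbox(user) {
  let shardsQueried = 0;
  const found = [];
  for (const shard of shards) {
    shardsQueried += 1;
    for (const m of shard) {
      if (m.to.includes(user)) found.push(m);
    }
  }
  found.sort((a, b) => b.sent - a.sent); // newest first
  return { messages: found, shardsQueried };
}

send({ from: "Joe", to: ["Bob", "Jane"], sent: new Date(2013, 2, 1), message: "Hi!" });
send({ from: "Jane", to: ["Bob"], sent: new Date(2013, 2, 2), message: "Lunch?" });

const bobInbox = readInbox("Bob");
console.log(bobInbox.messages.length, bobInbox.shardsQueried);
```

In this model the write touches a single shard, while the read cost grows with the number of shards regardless of how many messages the user actually has.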

Fan out on write
// Shard on "recipient" and "sent"
sh.shardCollection( "mongodbdays.inbox", { "recipient": 1, "sent": 1 } )
msg = {
  from: "Joe",
  to: [ "Bob", "Jane" ],
  sent: new Date(),
  message: "Hi!"
}
// Send a message: store one copy per recipient
msg.to.forEach( function( recipient ) {
  // Build a fresh document each time so every copy gets its own _id
  db.inbox.save( { from: msg.from, recipient: recipient,
                   sent: msg.sent, message: msg.message } )
} )
// Read my inbox
db.inbox.find( { recipient: "Joe" } ).sort( { sent: -1 } )

Fan out on write – I/O
[Diagram: "Send Message" and "Read Inbox" I/O]

Considerations
• Write: One document per recipient
• Read: Find all of the messages with me as the recipient
• Can shard on recipient, so inbox reads hit one shard
• But still lots of random I/O on the shard
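
The trade-off above can be sketched in plain JavaScript without a database. The in-memory shards, the toy `shardFor` hash, and the helper names are assumptions for illustration only, not MongoDB's routing logic.

```javascript
// Hypothetical model: three in-memory "shards"; copies are placed by `recipient`.
const SHARD_COUNT = 3;
const shards = Array.from({ length: SHARD_COUNT }, () => []);

// Toy routing function (an assumption, not MongoDB's real chunk routing).
function shardFor(key) {
  let sum = 0;
  for (const ch of key) sum += ch.charCodeAt(0);
  return sum % SHARD_COUNT;
}

// Fan out on write: one document per recipient, stored on the recipient's shard.
// Returns the number of documents written.
function send(msg) {
  for (const recipient of msg.to) {
    shards[shardFor(recipient)].push({
      from: msg.from,
      recipient: recipient,
      sent: msg.sent,
      message: msg.message,
    });
  }
  return msg.to.length;
}

// Reading an inbox filters on the shard key, so only one shard is consulted.
function readInbox(user) {
  return shards[shardFor(user)]
    .filter((m) => m.recipient === user)
    .sort((a, b) => b.sent - a.sent); // newest first
}

const written = send({ from: "Joe", to: ["Bob", "Jane"], sent: new Date(2013, 2, 1), message: "Hi!" });
const janeInbox = readInbox("Jane");
console.log(written, janeInbox.length);
```

The write cost now scales with the number of recipients, and each inbox read is confined to a single shard.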

Fan out on write with buckets
// Shard on "owner / sequence"
sh.shardCollection( "mongodbdays.inbox", { owner: 1, sequence: 1 } )
sh.shardCollection( "mongodbdays.users", { user_name: 1 } )
msg = {
  from: "Joe",
  to: [ "Bob", "Jane" ],
  sent: new Date(),
  message: "Hi!"
}

// Send a message
msg.to.forEach( function( recipient ) {
  // Atomically bump the recipient's message counter
  count = db.users.findAndModify({
    query: { user_name: recipient },
    update: { "$inc": { "msg_count": 1 } },
    upsert: true,
    new: true }).msg_count;
  // At most 50 messages per bucket
  sequence = Math.floor(count / 50);
  db.inbox.update(
    { owner: recipient, sequence: sequence },
    { $push: { "messages": msg } },
    { upsert: true } );
} )
// Read my inbox: the two newest buckets
db.inbox.find( { owner: "Joe" } )
  .sort( { sequence: -1 } ).limit( 2 )

• Each “inbox” document holds an array of messages
• Append a message onto the “inbox” of the recipient
• Bucket inboxes so there are not too many messages per document
• Can shard on recipient, so inbox reads hit one shard
• 1 or 2 documents to read the whole inbox
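
The bucket arithmetic can be checked with a small in-memory sketch. It uses the same `Math.floor(count / 50)` rule as the slide; the `counters` and `buckets` maps and the `deliver` helper are hypothetical stand-ins for the `users` counter and the upserted `inbox` documents.

```javascript
const BUCKET_SIZE = 50;      // same divisor as Math.floor(count / 50)
const counters = new Map();  // stands in for users.msg_count
const buckets = new Map();   // "owner/sequence" -> { owner, sequence, messages }

function deliver(owner, msg) {
  // Like findAndModify with $inc and new: true: the first returned value is 1.
  const count = (counters.get(owner) || 0) + 1;
  counters.set(owner, count);

  const sequence = Math.floor(count / BUCKET_SIZE);
  const key = owner + "/" + sequence;

  // Like update with upsert: true, create the bucket on first use.
  if (!buckets.has(key)) {
    buckets.set(key, { owner: owner, sequence: sequence, messages: [] });
  }
  buckets.get(key).messages.push(msg); // like $push
  return sequence;
}

for (let i = 1; i <= 120; i++) {
  deliver("Bob", { from: "Joe", message: "msg " + i });
}

const sizes = [...buckets.values()].map((b) => b.messages.length);
console.log(sizes);
```

Because the counter starts at 1, bucket 0 fills with 49 messages and every later bucket with 50, so no bucket exceeds the limit. Reading the two highest `sequence` values returns between 50 and 100 of the newest messages once a second bucket exists.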

Fan out on write with buckets – I/O
[Diagram: "Send Message" and "Read Inbox" I/O]

Design Goals
• Need to retain a limited amount of history, e.g.
– Hours, Days, Weeks
– May be a legislative requirement (e.g. HIPAA, SOX, DPA)
• Need to query efficiently by
– match
– ranges

3 (of many) approaches
• Bucket by Number of messages
• Fixed size array
• Bucket by date + TTL collections

Bucket by number of messages
db.inbox.find()
{ owner: "Joe", sequence: 25,
  messages: [
    { from: "Joe",
      sent: ISODate("2013-03-01T09:59:42.689Z"),
      message: "Hi!"
    },
    …
  ] }
// Query with a date range
db.inbox.find( { owner: "friend1",
  messages: {
    $elemMatch: { sent: { $gte: ISODate("…") } } } } )
// Remove elements based on a date
db.inbox.update( { owner: "friend1" },
  { $pull: { messages: {
      sent: { $gte: ISODate("…") } } } } )

Considerations
• Shrinking documents: space can be reclaimed with
– db.runCommand ( { compact: '<collection>' } )
• Remove the document after the last element in the array has been removed
– { "_id" : …, "messages" : [ ], "owner" : "friend1", "sequence" : 0 }

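The two-step cleanup described above (pull expired elements, then delete buckets that end up empty) can be sketched in memory as follows. The sample buckets, the cutoff date, and the `pruneBefore` helper are hypothetical, and the sketch assumes retention means dropping messages older than the cutoff.

```javascript
// Hypothetical buckets for one owner.
const buckets = [
  { owner: "friend1", sequence: 0, messages: [
    { sent: new Date("2013-01-01T00:00:00Z"), message: "old" },
    { sent: new Date("2013-01-15T00:00:00Z"), message: "also old" },
  ] },
  { owner: "friend1", sequence: 1, messages: [
    { sent: new Date("2013-01-20T00:00:00Z"), message: "old" },
    { sent: new Date("2013-03-01T00:00:00Z"), message: "recent" },
  ] },
];

// Step 1 mimics a $pull with a date predicate applied to every bucket.
// Step 2 drops buckets whose messages array became empty.
function pruneBefore(docs, cutoff) {
  for (const doc of docs) {
    doc.messages = doc.messages.filter((m) => m.sent >= cutoff);
  }
  return docs.filter((doc) => doc.messages.length > 0);
}

const remaining = pruneBefore(buckets, new Date("2013-02-01T00:00:00Z"));
console.log(remaining.length, remaining[0].sequence);
```

Without the second step, empty documents like the one shown in the slide would accumulate and still be returned by inbox reads.
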
Fixed size array
msg = {
  from: "Your Boss",
  to: [ "Bob" ],
  sent: new Date(),
  message: "CALL ME NOW!"
}
// 2.4 introduces $each, $sort and $slice for $push
db.messages.update(
  { _id: 1 },
  { $push: { messages: { $each: [ msg ],
                         $sort: { sent: 1 },
                         $slice: -50 } } }
)

Considerations
• Need to compute the size of the array based on retention period
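
As a rough illustration, the effect of `$each` + `$sort` + `$slice` and the sizing calculation can be modeled in plain JavaScript. The helpers `pushSortedCapped` and `arraySizeFor`, and the message rate used below, are hypothetical; the sizing rule (messages per day times retention days) is one simple assumption, not a formula from the slides.

```javascript
// Mimics { $push: { messages: { $each: newItems, $sort: { sent: 1 }, $slice: -maxSize } } }
function pushSortedCapped(messages, newItems, maxSize) {
  const merged = messages.concat(newItems);
  merged.sort((a, b) => a.sent - b.sent); // $sort: { sent: 1 }, oldest first
  return merged.slice(-maxSize);          // negative $slice keeps the last maxSize items
}

// Hypothetical sizing rule: expected messages per day times days of retention.
function arraySizeFor(messagesPerDay, retentionDays) {
  return Math.ceil(messagesPerDay * retentionDays);
}

const maxSize = arraySizeFor(5, 10); // e.g. 5 messages a day kept for 10 days

let inbox = [];
for (let i = 1; i <= 60; i++) {
  const msg = { sent: new Date(2013, 0, i), message: "msg " + i };
  inbox = pushSortedCapped(inbox, [msg], maxSize);
}

console.log(inbox.length, inbox[0].message, inbox[inbox.length - 1].message);
```

If the real message rate exceeds the estimate, the array silently retains less history than intended, which is why the size has to be derived from the retention period.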

TTL Collections
// messages: one doc per user per day
db.inbox.findOne()
{
  _id: 1,
  to: "Joe",
  sequence: ISODate("2013-02-04T00:00:00.392Z"),
  messages: [ ]
}
// Auto expires data after 31536000 seconds = 1 year
db.inbox.ensureIndex( { sequence: 1 },
  { expireAfterSeconds: 31536000 } )
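
The numbers above can be double-checked with a short sketch. The helpers `dayBucket` and `expiresAt` are hypothetical; truncating to midnight UTC is an assumed way of deriving the per-day `sequence` value, and the actual deletion is done by a background task, so documents disappear some time after the computed instant rather than exactly at it.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;            // 86400
const ONE_YEAR_SECONDS = 365 * SECONDS_PER_DAY;  // 31536000, as in the index definition

// Assumed bucketing rule: truncate a timestamp to midnight UTC of the same day.
function dayBucket(date) {
  return new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
}

// Earliest instant at which a document with this `sequence` becomes eligible for expiry.
function expiresAt(sequence, expireAfterSeconds) {
  return new Date(sequence.getTime() + expireAfterSeconds * 1000);
}

const bucket = dayBucket(new Date("2013-02-04T09:59:42.689Z"));
const expiry = expiresAt(bucket, ONE_YEAR_SECONDS);
console.log(bucket.toISOString(), expiry.toISOString());
```

Because expiry is keyed on the bucket's date, a whole day of messages for a user is removed in one document delete.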

Design Goal
• Application needs to store a variable number of attributes, e.g.
– User defined Form
– Meta Data tags
• Queries needed
– Equality
– Range based
• Need to be efficient, regardless of the number of attributes

2 (of many) Approaches
• Attributes as Embedded Document
• Attributes as Objects in an Array

Attributes as a sub-document
db.files.insert( { _id: "local.0",
  attr: { type: "text", size: 64,
          created: ISODate("...") } } )
db.files.insert( { _id: "local.1",
  attr: { type: "text", size: 128 } } )
db.files.insert( { _id: "mongod",
  attr: { type: "binary", size: 256,
          created: ISODate("...") } } )
// Need to create an index for each item in the sub-document
db.files.ensureIndex( { "attr.type": 1 } )
db.files.find( { "attr.type": "text" } )
// Can perform range queries
db.files.ensureIndex( { "attr.size": 1 } )
db.files.find( { "attr.size": { $gt: 64, $lte: 16384 } } )

Considerations
• Each attribute needs an Index
• Each time you extend, you add an index
• Lots and lots of indexes
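
A small sketch shows how the index count tracks the attribute count. The sample documents, the `owner` attribute, and the `indexesNeeded` helper are hypothetical; the sketch only lists the dotted paths that would each need their own index.

```javascript
// Hypothetical documents using the embedded-document layout.
const files = [
  { _id: "local.0", attr: { type: "text", size: 64, created: new Date(0) } },
  { _id: "mongod", attr: { type: "binary", size: 256, created: new Date(0) } },
];

// Collect every distinct attribute path that a query might filter on.
function indexesNeeded(docs) {
  const paths = new Set();
  for (const doc of docs) {
    for (const name of Object.keys(doc.attr)) {
      paths.add("attr." + name);
    }
  }
  return [...paths].sort();
}

const before = indexesNeeded(files);

// A user-defined form adds a new attribute: one more index is required.
files.push({ _id: "notes.txt", attr: { type: "text", owner: "joe" } });
const after = indexesNeeded(files);

console.log(before.length, after.length);
```

Every index adds write overhead and memory pressure, so this layout gets more expensive as users define more attributes.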

Attributes as objects in array
db.files.insert( { _id: "local.0",
  attr: [ { type: "text" },
          { size: 64 },
          { created: ISODate("...") } ] } )
db.files.insert( { _id: "local.1",
  attr: [ { type: "text" },
          { size: 128 } ] } )
db.files.insert( { _id: "mongod",
  attr: [ { type: "binary" },
          { size: 256 },
          { created: ISODate("...") } ] } )
// A single multikey index covers every attribute
db.files.ensureIndex( { attr: 1 } )

Considerations
• Only one index needed on attr
• Can support range queries, etc.
• Index can be used only once per query
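
Conceptually, a single index over the array holds one entry per array element. The following is a simplified in-memory model of that idea, not the actual index structure; `buildAttrIndex`, `findEq`, and `findRange` are hypothetical helpers, and each array element is assumed to carry exactly one key.

```javascript
const files = [
  { _id: "local.0", attr: [{ type: "text" }, { size: 64 }] },
  { _id: "local.1", attr: [{ type: "text" }, { size: 128 }] },
  { _id: "mongod", attr: [{ type: "binary" }, { size: 256 }] },
];

// One index entry per array element, regardless of which attribute it holds.
function buildAttrIndex(docs) {
  const entries = [];
  for (const doc of docs) {
    for (const element of doc.attr) {
      const [name, value] = Object.entries(element)[0];
      entries.push({ name: name, value: value, id: doc._id });
    }
  }
  return entries;
}

// Equality lookup on any attribute through the same index.
function findEq(index, name, value) {
  return index.filter((e) => e.name === name && e.value === value).map((e) => e.id);
}

// Range lookup (gt exclusive, lte inclusive) through the same index.
function findRange(index, name, gt, lte) {
  return index
    .filter((e) => e.name === name && e.value > gt && e.value <= lte)
    .map((e) => e.id);
}

const index = buildAttrIndex(files);
const textFiles = findEq(index, "type", "text");
const largerFiles = findRange(index, "size", 64, 16384);
console.log(textFiles, largerFiles);
```

New attributes simply add entries to the same index, which is why no additional index is needed when the schema grows.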

Design Goal
• Ability to look up by a number of different identities, e.g.
- Username
- Email address
- FB handle
- LinkedIn URL

2 (of many) approaches
• Identifiers in a single document
• Separate Identifiers from Content

Single document by user
db.users.findOne()
{ _id: "joe",
  email: "joe@example.com",
  fb: "joe.smith", // facebook
  li: "joe.e.smith", // linkedin
  other: {…}
}
// Shard collection by _id
sh.shardCollection( "mongodbdays.users", { _id: 1 } )
// Create indexes on each key
db.users.ensureIndex( { email: 1 } )
db.users.ensureIndex( { fb: 1 } )
db.users.ensureIndex( { li: 1 } )

Read by _id (shard key)
db.users.find( { _id: "joe" } )

Read by email (non-shard key)
db.users.find( { email: "joe@example.com" } )

Considerations
• Lookup by shard key is routed to 1 shard
• Lookup by other identifier is scatter-gathered across all shards
• Secondary keys cannot have a unique index (on a sharded collection, a unique index must be prefixed by the shard key)
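
The routing difference can be illustrated with an in-memory sketch. The three maps standing in for shards, the toy `shardFor` hash, and the helper names are assumptions for illustration and do not reflect MongoDB's actual routing.

```javascript
// Hypothetical model: users sharded by _id across three in-memory "shards".
const SHARD_COUNT = 3;
const shards = Array.from({ length: SHARD_COUNT }, () => new Map());

// Toy routing function (an assumption, not MongoDB's real chunk routing).
function shardFor(id) {
  let sum = 0;
  for (const ch of id) sum += ch.charCodeAt(0);
  return sum % SHARD_COUNT;
}

function insertUser(user) {
  shards[shardFor(user._id)].set(user._id, user);
}

// Lookup by shard key: the router can pick exactly one shard.
function findById(id) {
  const user = shards[shardFor(id)].get(id) || null;
  return { user: user, shardsQueried: 1 };
}

// Lookup by any other identifier: every shard must be asked.
function findByEmail(email) {
  let shardsQueried = 0;
  let user = null;
  for (const shard of shards) {
    shardsQueried += 1;
    for (const candidate of shard.values()) {
      if (candidate.email === email) user = candidate;
    }
  }
  return { user: user, shardsQueried: shardsQueried };
}

insertUser({ _id: "joe", email: "joe@example.com", fb: "joe.smith", li: "joe.e.smith" });

const byId = findById("joe");
const byEmail = findByEmail("joe@example.com");
console.log(byId.shardsQueried, byEmail.shardsQueried);
```

Both lookups return the same user, but the email lookup costs one query per shard even though at most one shard holds the document.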

Document per identity
// Create unique index
db.identities.ensureIndex( { identifier : 1 }, { unique: true } )
// Create one document per identifier, each pointing at the user
db.identities.save(
  { identifier : { hndl: "joe" }, user: "1200-42" } )
db.identities.save(
  { identifier : { email: "joe@abc.com" }, user: "1200-42" } )
db.identities.save(
  { identifier : { li: "joe.e.smith" }, user: "1200-42" } )
sh.shardCollection( "mydb.identities", { identifier : 1 } )
// _id is always uniquely indexed, so users needs no extra index
sh.shardCollection( "mydb.users", { _id: 1 } )

Read requires 2 reads
db.identities.find( { "identifier" : { "hndl" : "joe" } } )
db.users.find( { _id: "1200-42" } )

Considerations
• Lookup to Identities is a routed query
• Lookup to Users is a routed query
• Unique indexes available
• Must do two queries per lookup
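
The two-step lookup and the uniqueness guarantee can be sketched with two in-memory maps. The `identities` and `users` maps, the `kind:value` key format, the profile content, and the helpers `addIdentity` and `lookup` are all hypothetical stand-ins for the two collections and the unique index.

```javascript
const identities = new Map(); // "kind:value" -> user id, models the identities collection
const users = new Map();      // user id -> user document, models the users collection

// Models the unique index on `identifier`: a duplicate insert is rejected.
function addIdentity(kind, value, userId) {
  const key = kind + ":" + value;
  if (identities.has(key)) {
    throw new Error("duplicate identifier: " + key);
  }
  identities.set(key, userId);
}

// Read 1 resolves the identifier to a user id; read 2 fetches the user by _id.
function lookup(kind, value) {
  const userId = identities.get(kind + ":" + value);
  if (userId === undefined) return null;
  return users.get(userId) || null;
}

users.set("1200-42", { _id: "1200-42", profile: "hypothetical user content" });
addIdentity("hndl", "joe", "1200-42");
addIdentity("email", "joe@abc.com", "1200-42");
addIdentity("li", "joe.e.smith", "1200-42");

const viaEmail = lookup("email", "joe@abc.com");
const viaHandle = lookup("hndl", "joe");
console.log(viaEmail._id, viaHandle._id);
```

Each of the two reads is keyed on its collection's shard key, so both are routed to a single shard at the cost of an extra round trip.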

Summary
• Multiple ways to model a domain problem
• Understand the key use cases of your app
• Balance between ease of query vs. ease of write
• Reduce random I/O where possible for better performance
