Data Modeling: 
Four use cases 
Toji George 
Solutions Architect 
MongoDB Inc.
Agenda 
• 4 Real World Schemas 
– Inbox 
– History 
– Indexed Attributes 
– Multiple Identities 
• Conclusions
In MongoDB 
Application Development requires Good Schema 
Design 
Success comes from Proper Data Structure 
“Schema-less”?
#1 – Message Inbox
Let's get social
Sending Messages 
?
Design Goals 
• Efficiently send new messages to recipients 
• Efficiently read inbox
Reading My Inbox 
?
Three (of many) Approaches 
• Fan out on Read 
• Fan out on Write 
• Fan out on Write with Bucketing
Fan out on read 
// Shard on "from" 
sh.shardCollection( "mongodbdays.inbox", { from: 1 } ) 
// Make sure we have an index to handle inbox reads 
db.inbox.ensureIndex( { to: 1, sent: 1 } ) 
msg = { 
from: "Joe", 
to: [ "Bob", "Jane" ], 
sent: new Date(), 
message: "Hi!", 
} 
// Send a message 
db.inbox.save( msg ) 
// Read my inbox 
db.inbox.find( { to: "Joe" } ).sort( { sent: -1 } )
Fan out on read – I/O 
[Diagram: "Send Message" over Shard 1, Shard 2, Shard 3]
Fan out on read – I/O 
[Diagram: "Read Inbox" and "Send Message" over Shard 1, Shard 2, Shard 3]
Considerations 
• Write: One document per message sent 
• Read: Find all messages with my own name in 
the recipient field 
• Read: Requires scatter-gather on sharded 
cluster 
• A lot of random I/O on a shard to find everything
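The read cost listed above can be illustrated without a cluster. The following is a minimal in-memory sketch, not MongoDB code: the three arrays standing in for shards and the `shardFor` routing function are assumptions made purely for illustration.

```javascript
// In-memory stand-in for a 3-shard cluster (hypothetical; no MongoDB involved).
const SHARD_COUNT = 3;
const shards = Array.from({ length: SHARD_COUNT }, () => []);

// Toy routing function: maps a shard-key value to a shard number.
function shardFor(key) {
  let sum = 0;
  for (const ch of key) sum += ch.charCodeAt(0);
  return sum % SHARD_COUNT;
}

// Fan out on read: one document per message, stored on the sender's shard.
function send(msg) {
  shards[shardFor(msg.from)].push(msg);
  return 1; // documents written
}

// An inbox read cannot be routed by "from", so every shard is queried.
function readInbox(user) {
  let shardsQueried = 0;
  const inbox = [];
  for (const shard of shards) {
    shardsQueried += 1;
    for (const msg of shard) {
      if (msg.to.includes(user)) inbox.push(msg);
    }
  }
  inbox.sort((a, b) => b.sent - a.sent); // newest first
  return { inbox, shardsQueried };
}

const written = send({ from: "Joe", to: ["Bob", "Jane"], sent: new Date(), message: "Hi!" });
const result = readInbox("Bob");
console.log(written, result.inbox.length, result.shardsQueried);
```

The write touches one shard, while the read visits all three regardless of where the message landed.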
Fan out on write 
// Shard on "recipient" and "sent" 
sh.shardCollection( "mongodbdays.inbox", { recipient: 1, sent: 1 } ) 
msg = { 
from: "Joe", 
to: [ "Bob", "Jane" ], 
sent: new Date(), 
message: "Hi!", 
} 
// Send a message 
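for ( i in msg.to ) { 
// Copy per recipient: save() adds _id to its argument, so reusing msg 
// would overwrite the first copy instead of inserting a second 
copy = { from: msg.from, to: msg.to, sent: msg.sent, 
message: msg.message, recipient: msg.to[i] } 
db.inbox.save( copy ); 
} 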
// Read my inbox 
db.inbox.find( { recipient: "Joe" } ).sort( { sent: -1 } )
Fan out on write – I/O 
[Diagram: "Send Message" over Shard 1, Shard 2, Shard 3]
Fan out on write – I/O 
[Diagram: "Read Inbox" and "Send Message" over Shard 1, Shard 2, Shard 3]
Considerations 
• Write: One document per recipient 
• Read: Find all of the messages with me as the 
recipient 
• Can shard on recipient, so inbox reads hit one 
shard 
• But still lots of random I/O on the shard
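The trade-off above (more writes, routed reads) can be sketched in memory. This is an illustration under assumptions, not MongoDB code: the arrays standing in for shards and the `shardFor` routing function are hypothetical.

```javascript
// In-memory stand-in for a 3-shard cluster (hypothetical; no MongoDB involved).
const SHARD_COUNT = 3;
const shards = Array.from({ length: SHARD_COUNT }, () => []);

// Toy routing function: maps a shard-key value to a shard number.
function shardFor(key) {
  let sum = 0;
  for (const ch of key) sum += ch.charCodeAt(0);
  return sum % SHARD_COUNT;
}

// Fan out on write: one copy of the message per recipient,
// each stored on that recipient's shard.
function send(msg) {
  let written = 0;
  for (const recipient of msg.to) {
    const copy = { ...msg, recipient }; // fresh object per recipient
    shards[shardFor(recipient)].push(copy);
    written += 1;
  }
  return written;
}

// The recipient is the shard key, so the read goes to exactly one shard.
function readInbox(user) {
  const inbox = shards[shardFor(user)]
    .filter((m) => m.recipient === user)
    .sort((a, b) => b.sent - a.sent); // newest first
  return { inbox, shardsQueried: 1 };
}

const written = send({ from: "Joe", to: ["Bob", "Jane"], sent: new Date(), message: "Hi!" });
const result = readInbox("Jane");
console.log(written, result.inbox.length, result.shardsQueried);
```

Two documents are written for two recipients, and each inbox read is answered by a single shard.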
Fan out on write with buckets 
// Shard on "owner / sequence" 
sh.shardCollection( "mongodbdays.inbox", 
{ owner: 1, sequence: 1 } ) 
sh.shardCollection( "mongodbdays.users", { user_name: 1 } ) 
msg = { 
from: "Joe", 
to: [ "Bob", "Jane" ], 
sent: new Date(), 
message: "Hi!", 
}
Fan out on write with buckets 
// Send a message 
for( recipient in msg.to) { 
count = db.users.findAndModify({ 
query: { user_name: msg.to[recipient] }, 
update: { "$inc": { "msg_count": 1 } }, 
upsert: true, 
new: true }).msg_count; 
sequence = Math.floor(count / 50); 
db.inbox.update({ 
owner: msg.to[recipient], sequence: sequence }, 
{ $push: { "messages": msg } }, 
{ upsert: true } ); 
} 
// Read my inbox 
db.inbox.find( { owner: "Joe" } ) 
.sort ( { sequence: -1 } ).limit( 2 )
Fan out on write with buckets 
• Each “inbox” document is an array of messages 
• Append a message onto “inbox” of recipient 
• Bucket inboxes so there aren't too many 
messages per document 
• Can shard on recipient, so inbox reads hit one 
shard 
• 1 or 2 documents to read the whole inbox
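The bucketing arithmetic can be checked in isolation. The sketch below mirrors the slide's counter-then-divide logic using plain `Map` objects as stand-ins for the `users` and `inbox` collections; `deliver` and `readInbox` are hypothetical helper names, not MongoDB APIs.

```javascript
const BUCKET_SIZE = 50; // same divisor as Math.floor(count / 50) on the slide

const counters = new Map(); // user_name -> msg_count (stand-in for db.users)
const buckets = new Map();  // "owner/sequence" -> bucket document (stand-in for db.inbox)

function deliver(owner, msg) {
  const count = (counters.get(owner) || 0) + 1; // like $inc with new: true
  counters.set(owner, count);
  const sequence = Math.floor(count / BUCKET_SIZE);
  const key = `${owner}/${sequence}`;
  if (!buckets.has(key)) {
    buckets.set(key, { owner, sequence, messages: [] }); // like upsert: true
  }
  buckets.get(key).messages.push(msg); // like $push
  return sequence;
}

// Newest buckets first, at most `limit` documents.
function readInbox(owner, limit = 2) {
  return [...buckets.values()]
    .filter((b) => b.owner === owner)
    .sort((a, b) => b.sequence - a.sequence)
    .slice(0, limit);
}

for (let i = 1; i <= 120; i++) {
  deliver("Bob", { from: "Joe", message: `Hi #${i}` });
}
const docs = readInbox("Bob");
console.log(docs.map((d) => [d.sequence, d.messages.length]));
```

With 120 messages, the read returns buckets 2 and 1, holding 21 and 50 messages. Because the counter is incremented before dividing, bucket 0 holds 49 messages rather than 50.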
Fan out on write with buckets – I/O 
[Diagram: "Send Message" over Shard 1, Shard 2, Shard 3]
Fan out on write with buckets – I/O 
[Diagram: "Read Inbox" and "Send Message" over Shard 1, Shard 2, Shard 3]
#2 - History
Design Goals 
• Need to retain a limited amount of history e.g. 
– Hours, Days, Weeks 
– May be a legislative requirement (e.g. HIPAA, SOX, 
DPA) 
• Need to query efficiently by 
– match 
– ranges
3 (of many) approaches 
• Bucket by Number of messages 
• Fixed size array 
• Bucket by date + TTL collections
Bucket by number of messages 
db.inbox.find() 
{ owner: "Joe", sequence: 25, 
messages: [ 
{ from: "Joe", 
to: [ "Bob", "Jane" ], 
sent: ISODate("2013-03-01T09:59:42.689Z"), 
message: "Hi!" 
}, 
… 
] } 
// Query with a date range 
db.inbox.find ({owner: "friend1", 
messages: { 
$elemMatch: {sent:{$gte: ISODate("…") }}}}) 
// Remove elements older than a date 
db.inbox.update({owner: "friend1" }, 
{ $pull: { messages: { 
sent: { $lt: ISODate("…") } } } } )
Considerations 
• When documents shrink, space can be reclaimed 
with 
– db.runCommand ( { compact: '<collection>' } ) 
• Remove the document once the last element in 
its array has been removed 
– { "_id" : …, "messages" : [ ], "owner" : 
"friend1", "sequence" : 0 }
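The clean-up step can be sketched in memory. This is an illustration only: `prune` is a hypothetical helper that removes messages older than a cutoff from each bucket and then drops buckets left with an empty array, which is the housekeeping the considerations above call for.

```javascript
// Drop messages older than `cutoff` from every bucket,
// then discard buckets whose messages array is now empty.
function prune(buckets, cutoff) {
  for (const bucket of buckets) {
    bucket.messages = bucket.messages.filter((m) => m.sent >= cutoff);
  }
  return buckets.filter((b) => b.messages.length > 0);
}

const buckets = [
  { owner: "friend1", sequence: 0, messages: [
      { sent: new Date("2013-01-01T00:00:00Z") } ] },
  { owner: "friend1", sequence: 1, messages: [
      { sent: new Date("2013-02-01T00:00:00Z") },
      { sent: new Date("2013-03-01T00:00:00Z") } ] },
];

const kept = prune(buckets, new Date("2013-02-15T00:00:00Z"));
console.log(kept.map((b) => [b.sequence, b.messages.length]));
```

Bucket 0 becomes empty and is discarded; bucket 1 survives with its single remaining message.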
Fixed size array 
msg = { 
from: "Your Boss", 
to: [ "Bob" ], 
sent: new Date(), 
message: "CALL ME NOW!" 
} 
// MongoDB 2.4 introduces $each, $sort and $slice for $push 
db.messages.update( 
{ _id: 1 }, 
{ $push: { messages: { $each: [ msg ], 
$sort: { sent: 1 }, 
$slice: -50 } 
} 
} 
)
Considerations 
• Need to compute the size of the array based on 
retention period
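One way to size the array is to multiply an expected message rate by the retention period. The sketch below assumes such a rate is known; `arraySizeFor` and `pushCapped` are hypothetical helpers, and `pushCapped` only emulates in memory what `$push` with `$each`, `$sort: { sent: 1 }` and a negative `$slice` does.

```javascript
// Hypothetical sizing helper: messagesPerDay is an assumed expected rate.
function arraySizeFor(messagesPerDay, retentionDays) {
  return Math.ceil(messagesPerDay * retentionDays);
}

// Append, sort ascending by sent date, keep only the last `size` elements.
function pushCapped(messages, msg, size) {
  return [...messages, msg]
    .sort((a, b) => a.sent - b.sent)
    .slice(-size);
}

const size = arraySizeFor(5, 10); // 5 messages/day kept for 10 days
let messages = [];
for (let i = 0; i < 60; i++) {
  const msg = { sent: new Date(2013, 0, 1, 0, i), message: `#${i}` };
  messages = pushCapped(messages, msg, size);
}
console.log(size, messages.length, messages[0].message);
```

After 60 pushes into an array capped at 50, the ten oldest messages have been dropped and the first remaining one is `#10`. If the real rate exceeds the assumed one, messages fall out of the array before the retention period ends.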
TTL Collections 
// messages: one doc per user per day 
db.messages.findOne() 
{ 
_id: 1, 
to: "Joe", 
sequence: ISODate("2013-02-04T00:00:00.392Z"), 
messages: [ ] 
} 
// Auto expires data after 31536000 seconds = 1 year 
db.messages.ensureIndex( { sequence: 1 }, 
{ expireAfterSeconds: 31536000 } )
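The per-day bucket key and the expiry rule can be expressed as two small functions. This is a sketch under assumptions: `dayBucket` and `isExpired` are hypothetical names, and the real TTL removal is done by a background task that runs periodically, so documents may outlive their expiry time slightly.

```javascript
const ONE_YEAR_SECONDS = 365 * 24 * 60 * 60; // 31536000, as on the slide

// Truncate a timestamp to midnight UTC to get the per-day bucket key.
function dayBucket(date) {
  return new Date(
    Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate())
  );
}

// A document becomes eligible for removal once its indexed date
// is more than ttlSeconds in the past.
function isExpired(sequence, now, ttlSeconds = ONE_YEAR_SECONDS) {
  return now.getTime() - sequence.getTime() > ttlSeconds * 1000;
}

const bucket = dayBucket(new Date("2013-02-04T09:59:42.689Z"));
console.log(bucket.toISOString());
console.log(isExpired(bucket, new Date("2014-02-03T00:00:00Z")));
console.log(isExpired(bucket, new Date("2014-02-05T00:00:00Z")));
```

Every message sent on the same UTC day maps to the same bucket key, so the whole day's document expires together.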
#3 – Indexed Attributes
Design Goal 
• Application needs to store a variable number of 
attributes e.g. 
– User-defined form 
– Metadata tags 
• Queries needed 
– Equality 
– Range based 
• Need to be efficient, regardless of the number of 
attributes
2 (of many) Approaches 
• Attributes as Embedded Document 
• Attributes as Objects in an Array
Attributes as a sub-document 
db.files.insert( { _id: "local.0", 
attr: { type: "text", size: 64, 
created: ISODate("...") } } ) 
db.files.insert( { _id: "local.1", 
attr: { type: "text", size: 128} } ) 
db.files.insert( { _id: "mongod", 
attr: { type: "binary", size: 256, 
created: ISODate("...") } } ) 
// Need to create an index for each item in the sub-document 
db.files.ensureIndex( { "attr.type": 1 } ) 
db.files.find( { "attr.type": "text"} ) 
// Can perform range queries 
db.files.ensureIndex( { "attr.size": 1 } ) 
db.files.find( { "attr.size": { $gt: 64, $lte: 16384 } } )
Considerations 
• Each attribute needs an Index 
• Each time you extend, you add an index 
• Lots and lots of indexes
Attributes as objects in array 
db.files.insert( {_id: "local.0", 
attr: [ { type: "text" }, 
{ size: 64 }, 
{ created: ISODate("...") } ] } ) 
db.files.insert( { _id: "local.1", 
attr: [ { type: "text" }, 
{ size: 128 } ] } ) 
db.files.insert( { _id: "mongod", 
attr: [ { type: "binary" }, 
{ size: 256 }, 
{ created: ISODate("...") } ] } ) 
db.files.ensureIndex( { attr: 1 } )
Considerations 
• Only one index needed on attr 
• Can support range queries, etc. 
• Index can be used only once per query
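Moving from the sub-document shape to the array shape is a mechanical transformation. The sketch below shows it in plain JavaScript, with an in-memory matcher standing in for equality and range queries; `toAttrArray` and `hasAttr` are hypothetical helpers, not MongoDB operators.

```javascript
// Turn { type: "text", size: 64 } into [ { type: "text" }, { size: 64 } ].
function toAttrArray(attr) {
  return Object.entries(attr).map(([key, value]) => ({ [key]: value }));
}

// True if any array element carries `key` with a value accepted by `test`.
function hasAttr(doc, key, test) {
  return doc.attr.some(
    (a) => Object.prototype.hasOwnProperty.call(a, key) && test(a[key])
  );
}

const files = [
  { _id: "local.0", attr: toAttrArray({ type: "text", size: 64 }) },
  { _id: "local.1", attr: toAttrArray({ type: "text", size: 128 }) },
  { _id: "mongod", attr: toAttrArray({ type: "binary", size: 256 }) },
];

// Equality match on one attribute, range match on another.
const textFiles = files.filter((f) => hasAttr(f, "type", (v) => v === "text"));
const midSized = files.filter((f) => hasAttr(f, "size", (v) => v > 64 && v <= 16384));
console.log(textFiles.map((f) => f._id), midSized.map((f) => f._id));
```

Adding a new attribute only adds another element to the array, so no new index definition is needed when the attribute set grows.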
#4 – Multiple Identities
Design Goal 
• Ability to look up by a number of different 
identities e.g. 
- Username 
- Email address 
- FB handle 
- LinkedIn URL
2 (of many) approaches 
• Identifiers in a single document 
• Separate Identifiers from Content
Single document by user 
db.users.findOne() 
{ _id: "joe", 
email: "joe@example.com", 
fb: "joe.smith", // facebook 
li: "joe.e.smith", // linkedin 
other: {…} 
} 
// Shard collection by _id 
sh.shardCollection( "mongodbdays.users", { _id: 1 } ) 
// Create indexes on each key 
db.users.ensureIndex( { email: 1} ) 
db.users.ensureIndex( { fb: 1 } ) 
db.users.ensureIndex( { li: 1 } )
Read by _id (shard key) 
find( { _id: "joe" } ) 
[Diagram: Shard 1, Shard 2, Shard 3]
Read by email (non-shard key) 
find( { email: "joe@example.com" } ) 
[Diagram: Shard 1, Shard 2, Shard 3]
Considerations 
• Lookup by shard key is routed to 1 shard 
• Lookup by any other identifier is scatter-gathered 
across all shards 
• Secondary keys cannot have a unique index
Document per identity 
// Create unique index 
db.identities.ensureIndex( { identifier : 1} , { unique: true} ) 
// Create one identity document per identifier, each pointing at the user document 
db.identities.save( 
{ identifier : { hndl: "joe" }, user: "1200-42" } ) 
db.identities.save( 
{ identifier : { email: "joe@abc.com" }, user: "1200-42" } ) 
db.identities.save( 
{ identifier : { li: "joe.e.smith" }, user: "1200-42" } ) 
// Shard collection by identifier 
sh.shardCollection( "mydb.identities", { identifier : 1 } ) 
// Create unique index 
db.users.ensureIndex( { _id: 1} , { unique: true} ) 
// Shard collection by _id 
sh.shardCollection( "mydb.users", { _id: 1 } )
Read requires 2 reads 
db.identities.find( { identifier: { hndl: "joe" } } ) 
db.users.find( { _id: "1200-42" } ) 
[Diagram: Shard 1, Shard 2, Shard 3]
Considerations 
• Lookup to Identities is a routed query 
• Lookup to Users is a routed query 
• Unique indexes available 
• Must do two queries per lookup
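The two-step lookup and the uniqueness guarantee can be sketched with two in-memory maps. This is an illustration only: the `Map` objects stand in for the `identities` and `users` collections, and `addIdentity` and `findUser` are hypothetical helpers.

```javascript
const identities = new Map(); // serialized identifier -> user id
const users = new Map();      // _id -> user document

// Register an identifier; rejects duplicates like a unique index would.
function addIdentity(identifier, userId) {
  const key = JSON.stringify(identifier);
  if (identities.has(key)) throw new Error(`duplicate identifier: ${key}`);
  identities.set(key, userId);
}

// Read 1 resolves the identifier to a user id; read 2 fetches the user.
function findUser(identifier) {
  const userId = identities.get(JSON.stringify(identifier));
  if (userId === undefined) return null;
  return users.get(userId) || null;
}

users.set("1200-42", { _id: "1200-42", name: "Joe" });
addIdentity({ hndl: "joe" }, "1200-42");
addIdentity({ email: "joe@abc.com" }, "1200-42");
addIdentity({ li: "joe.e.smith" }, "1200-42");

console.log(findUser({ email: "joe@abc.com" }));
console.log(findUser({ li: "nobody" }));
```

Each step is keyed on the value that the corresponding collection is sharded on, which is why both reads can be routed to a single shard.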
Conclusion
Summary 
• Multiple ways to model a domain problem 
• Understand the key use cases of your app 
• Balance between ease of query vs. ease of write 
• Reduce random I/O where possible for better 
performance