10gen Presents Schema Design and Data Modeling

Schema Design
with MongoDB

Antoine Girbal

antoine@10gen.com
@antoinegirbal

So why model data?

http://www.flickr.com/photos/42304632@N00/493639870/

Normalization
• Goals
• Avoid anomalies when inserting, updating or
deleting
• Minimize redesign when extending the
schema
• Avoid bias toward a particular query
• Make use of all SQL features
• In MongoDB
• Similar goals apply but rules are different
• Denormalization for optimization is an option:
most features still work on embedded data, unlike with BLOBs

Terminology

RDBMS MongoDB
Table Collection
Row(s) JSON Document
Index Index
Join Embedding & Linking
Partition Shard
Partition Key Shard Key

Collections Basics
• Equivalent to a Table in SQL
• Cheap to create (max 24000)
• Collections don’t have a fixed schema
• Common for documents in a collection
to share a schema
• Document schema can evolve
• Consider using multiple related
collections tied together by a naming
convention:
• e.g. LogData-2011-02-08

Document basics
• Elements are name/value pairs,
equivalent to column value in SQL
• elements can be nested
• Rich data types for values
• JSON for the human eye
• BSON for all internals
• 16MB maximum size (many books..)
• What you see is what is stored

Schema Design - MongoDB
embedding

linking

Design Session

Design documents that simply map to your application

> post = { author: "Hergé",
date: ISODate("2011-09-18T09:56:06.298Z"),
text: "Destination Moon",
tags: ["comic", "adventure"]
}

> db.blogs.save(post)

Find the document
> db.blogs.find()

{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
author: "Hergé",
date: ISODate("2011-09-18T09:56:06.298Z"),
text: "Destination Moon",
tags: [ "comic", "adventure" ]
}

Notes:
• ID must be unique, but can be anything you’d like
• MongoDB will generate a default ID if one is not supplied

Add an index, find via index

Secondary index for “author”

// 1 means ascending, -1 means descending
> db.blogs.ensureIndex( { author: 1 } )

> db.blogs.find( { author: 'Hergé' } )

{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
date: ISODate("2011-09-18T09:56:06.298Z"),
author: "Hergé",
... }

Examine the query plan

> db.blogs.find( { author: "Hergé" } ).explain()
{
"cursor" : "BtreeCursor author_1",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 5,
"indexBounds" : {
"author" : [
[
"Hergé",
"Hergé"
]
]
}
}
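
To see why "nscanned" is 1 in the plan above, here is a rough sketch in plain JavaScript (not mongo shell code) that contrasts a full scan with a lookup through a sorted key list. The helpers `fullScan` and `indexLookup` and the sample data are hypothetical, and a real B-tree index is more elaborate than a sorted array.

```javascript
// In-memory stand-in for the blogs collection (sample data is illustrative).
const blogs = [
  { _id: 1, author: "Hergé", text: "Destination Moon" },
  { _id: 2, author: "Kyle", text: "Notes" },
  { _id: 3, author: "Antoine", text: "Schema Design" },
];

// Without an index: every document is examined.
function fullScan(docs, author) {
  let nscanned = 0;
  const results = [];
  for (const doc of docs) {
    nscanned++;
    if (doc.author === author) results.push(doc);
  }
  return { results, nscanned };
}

// Hypothetical index: [key, position] pairs sorted by key, like { author: 1 }.
const authorIndex = blogs
  .map((doc, pos) => [doc.author, pos])
  .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));

// With the index: binary search to the first matching key, then read only matches.
function indexLookup(docs, index, author) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid][0] < author) lo = mid + 1;
    else hi = mid;
  }
  const results = [];
  let nscanned = 0;
  for (let i = lo; i < index.length && index[i][0] === author; i++) {
    nscanned++;
    results.push(docs[index[i][1]]);
  }
  return { results, nscanned };
}
```

Calling `fullScan(blogs, "Hergé")` examines all three documents, while `indexLookup(blogs, authorIndex, "Hergé")` examines only the one that matches.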

Query operators
Conditional operators:
$ne, $in, $nin, $mod, $all, $size, $exists, $type, ...
$lt, $lte, $gt, $gte

// find posts with any tags
> db.blogs.find( { tags: { $exists: true } } )

Query operators


Regular expressions:
// posts where author starts with H (the match is case-sensitive)
> db.blogs.find( { author: /^H/ } )

Counting:
// number of posts written by Hergé
> db.blogs.find( { author: "Hergé" } ).count()

Extending the Schema
> new_comment =
{ author: "Kyle",
date: new Date(),
text: "great book" }

> db.blogs.update(
{ text: "Destination Moon" },
{ "$push": { comments: new_comment },
"$inc": { comments_count: 1 }
})

> db.blogs.find( { author: "Hergé"} )

{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
author : "Hergé",
date : ISODate("2011-09-18T09:56:06.298Z"),
text : "Destination Moon",
tags : [ "comic", "adventure" ],
comments : [
{
author : "Kyle",
date : ISODate("2011-09-19T09:56:06.298Z"),
text : "great book"
}
],
comments_count: 1
}
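
The effect of the "$push" and "$inc" modifiers used above can be sketched in plain JavaScript (not mongo shell code). `applyUpdate` is a hypothetical helper that only mimics the two behaviours shown on the slide: a missing array is created on the first push, and a missing counter starts from 0.

```javascript
// Hypothetical helper mimicking the effect of $push and $inc on one document.
function applyUpdate(doc, update) {
  const result = JSON.parse(JSON.stringify(doc)); // deep copy, fine for plain JSON data
  for (const [field, value] of Object.entries(update.$push || {})) {
    if (!Array.isArray(result[field])) result[field] = []; // created on first push
    result[field].push(value);
  }
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    result[field] = (result[field] || 0) + amount; // missing field starts at 0
  }
  return result;
}

const post = { author: "Hergé", text: "Destination Moon", tags: ["comic", "adventure"] };
const updated = applyUpdate(post, {
  $push: { comments: { author: "Kyle", text: "great book" } },
  $inc: { comments_count: 1 },
});
```

The original `post` is left untouched, and `updated` gains both the `comments` array and the `comments_count` field, with no schema change declared anywhere.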

// create index on nested documents:
> db.blogs.ensureIndex( { "comments.author": 1 } )

> db.blogs.find( { "comments.author": "Kyle" } )



// find last 5 posts:
> db.blogs.find().sort( { date: -1 } ).limit(5)



// most commented post:
> db.blogs.find().sort( { comments_count: -1 } ).limit(1)

When sorting, check if you need an index

Common Patterns

Patterns:
• Inheritance
• one to one
• one to many
• many to many

Single Table Inheritance -
MongoDB
shapes table
id  type    area  radius  length  width
1   circle  3.14  1
2   square  4             2
3   rect    10            5       2

MongoDB
> db.shapes.find()
{ _id: "1", type: "c", area: 3.14, radius: 1}
{ _id: "2", type: "s", area: 4, length: 2}
{ _id: "3", type: "r", area: 10, length: 5, width: 2}

missing values
not stored!

MongoDB
> db.shapes.find()

// find shapes where radius > 0
> db.shapes.find( { radius: { $gt: 0 } } )

// create index
> db.shapes.ensureIndex( { radius: 1 }, { sparse:true } )

index only
values present!
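
As a rough illustration of what "sparse" means, the plain JavaScript sketch below builds an index that only holds entries for documents containing the field. `buildSparseIndex` is a hypothetical helper, not a MongoDB API.

```javascript
// The three shape documents from the slide.
const shapes = [
  { _id: "1", type: "c", area: 3.14, radius: 1 },
  { _id: "2", type: "s", area: 4, length: 2 },
  { _id: "3", type: "r", area: 10, length: 5, width: 2 },
];

// Hypothetical sparse index: entries only for documents that have the field,
// sorted by the (numeric) key.
function buildSparseIndex(docs, field) {
  return docs
    .filter((doc) => field in doc)
    .map((doc) => ({ key: doc[field], _id: doc._id }))
    .sort((a, b) => a.key - b.key);
}

const radiusIndex = buildSparseIndex(shapes, "radius");
const lengthIndex = buildSparseIndex(shapes, "length");
```

Here `radiusIndex` holds a single entry (the circle) and `lengthIndex` holds two, rather than one entry per document in the collection.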

One to Many
Either:

•Embedded Array / Document:
• improves read speed
• simplifies schema
•Normalize:
• if list grows significantly
• if sub items are updated often
• if sub items are more than 1 level
deep and need updating

One to Many
Embedded Array:
•$slice operator to return subset of comments
•some queries become harder (e.g. find latest comments across all blogs)
blogs: {
author : "Hergé",
date : ISODate("2011-09-18T09:56:06.298Z"),
comments : [
{
author : "Kyle",
date : ISODate("2011-09-19T09:56:06.298Z"),
text : "great book"
}
]
}

One to Many
Normalized (2 collections)
•most flexible
•more queries
blogs: { _id: 1000,
author: "Hergé",
date: ISODate("2011-09-18T09:56:06.298Z") }

comments : { _id : 1,
blogId: 1000,
author : "Kyle",
date : ISODate("2011-09-19T09:56:06.298Z") }

// findOne returns a single document; find returns a cursor
> blog = db.blogs.findOne( { text: "Destination Moon" } );

> db.comments.ensureIndex( { blogId: 1 } ) // important!
> db.comments.find( { blogId: blog._id } );
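
The two-step lookup above amounts to a join done in the application. A minimal plain JavaScript sketch follows, assuming the blog document also carries a `text` field (as in the earlier slides); `commentsByBlogId` stands in for the index on "blogId" and `blogWithComments` is a hypothetical helper.

```javascript
// In-memory stand-ins for the two collections (sample data is illustrative).
const blogs = [{ _id: 1000, author: "Hergé", text: "Destination Moon" }];
const comments = [
  { _id: 1, blogId: 1000, author: "Kyle" },
  { _id: 2, blogId: 1001, author: "James" },
];

// Stand-in for the blogId index: Map from blogId to the comments referencing it.
const commentsByBlogId = new Map();
for (const c of comments) {
  if (!commentsByBlogId.has(c.blogId)) commentsByBlogId.set(c.blogId, []);
  commentsByBlogId.get(c.blogId).push(c);
}

// Query 1 fetches the blog; query 2 fetches its comments by blog._id.
function blogWithComments(text) {
  const blog = blogs.find((b) => b.text === text); // like findOne
  if (!blog) return null;
  return { ...blog, comments: commentsByBlogId.get(blog._id) || [] };
}
```

Without the index stand-in, the second step would have to walk every comment, which is why the slide marks the index as important.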

Many - Many
Example:

• Product can be in many categories
• Category can have many products

Many - Many
// Each product lists the IDs of its categories
products:
{ _id: 10, name: "Destination Moon",
category_ids: [ 20, 30 ] }

Many - Many
products:

// Each category lists the IDs of the products
categories:
{ _id: 20, name: "adventure",
product_ids: [ 10, 11, 12 ] }

categories:
{ _id: 21, name: "movie",
product_ids: [ 10 ] }

Cuts mapping table and 2 indexes, but:
• potential consistency issue
• lists can grow too large
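
The consistency issue can be made concrete with the sample data from these slides: product 10 lists categories 20 and 30, while category 21 also lists product 10. The plain JavaScript sketch below reports such one-way links; `findOneWayLinks` is a hypothetical helper, and products not present in the sample are simply skipped.

```javascript
// Sample documents taken from the slides.
const products = [
  { _id: 10, name: "Destination Moon", category_ids: [20, 30] },
];
const categories = [
  { _id: 20, name: "adventure", product_ids: [10, 11, 12] },
  { _id: 21, name: "movie", product_ids: [10] },
];

// Report category -> product links that the product does not list back.
function findOneWayLinks(products, categories) {
  const byId = new Map(products.map((p) => [p._id, p]));
  const problems = [];
  for (const cat of categories) {
    for (const pid of cat.product_ids) {
      const product = byId.get(pid);
      if (product && !product.category_ids.includes(cat._id)) {
        problems.push({ category: cat._id, product: pid });
      }
    }
  }
  return problems;
}

const oneWay = findOneWayLinks(products, categories);
```

Because the association is stored twice, every write has to touch both sides, and a check like this is one way to detect drift after the fact.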

Alternative
products:

// Association not stored on the categories
categories:
{ _id: 20,
name: "adventure"}

// All products for a given category
> db.products.ensureIndex( { category_ids: 1} ) // yes!
> db.products.find( { category_ids: 20 } )
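
An index on an array field holds one entry per array element, which is what makes the query above cheap. The plain JavaScript sketch below imitates that idea; `buildMultikeyIndex` is a hypothetical helper and product 11 is made-up sample data.

```javascript
// Products store the association; categories store nothing about products.
const products = [
  { _id: 10, name: "Destination Moon", category_ids: [20, 30] },
  { _id: 11, name: "Explorers on the Moon", category_ids: [20] },
];

// One index entry per array element: category id -> list of product ids.
function buildMultikeyIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    for (const value of doc[field] || []) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc._id);
    }
  }
  return index;
}

const byCategory = buildMultikeyIndex(products, "category_ids");
const inCategory20 = byCategory.get(20) || [];
```

With the association kept on one side only, there is a single place to update and nothing to keep in sync.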

Common Use Cases

Use cases:
• Trees
• Time Series

Trees

Hierarchical information

Trees

Full Tree in Document

{ retweet: [
{ who: "Kyle", text: "...",
retweet: [
{who: "James", text: "...",
retweet: []}
]}
]
}

Pros: Single Document, Performance, Intuitive

Cons: Hard to search or update, document can easily get
too large
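
"Hard to search" here means that finding one node requires walking the nested arrays in application code. A minimal plain JavaScript sketch, using the document shape from the slide; `findRetweetPath` is a hypothetical helper.

```javascript
// Full tree stored in a single document, as on the slide.
const tweet = {
  retweet: [
    {
      who: "Kyle",
      text: "...",
      retweet: [{ who: "James", text: "...", retweet: [] }],
    },
  ],
};

// Depth-first walk returning the chain of authors leading to `who`, or null.
function findRetweetPath(node, who, path = []) {
  for (const child of node.retweet || []) {
    const next = [...path, child.who];
    if (child.who === who) return next;
    const found = findRetweetPath(child, who, next);
    if (found) return found;
  }
  return null;
}

const pathToJames = findRetweetPath(tweet, "James");
```

The walk has to visit nodes level by level, and its depth is not known in advance, which is why a fixed query on a field path cannot express it.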

Array of Ancestors
// Store all Ancestors of a node
{ _id: "a" }
{ _id: "b", tree: [ "a" ], retweet: "a" }
{ _id: "c", tree: [ "a", "b" ], retweet: "b" }
{ _id: "d", tree: [ "a", "b" ], retweet: "b" }
{ _id: "e", tree: [ "a" ], retweet: "a" }
{ _id: "f", tree: [ "a", "e" ], retweet: "e" }

// find all direct retweets of "b"
> db.tweets.find( { retweet: "b" } )


// find all retweets of "e" anywhere in tree
> db.tweets.find( { tree: "e" } )

// find tweet history of f:
> tweets = db.tweets.findOne( { _id: "f" } ).tree
> db.tweets.find( { _id: { $in : tweets } } )
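
The three queries above can be checked against the sample documents with a plain JavaScript sketch (not mongo shell code). The helper names are hypothetical; `makeRetweet` shows one assumed way to build the "tree" array when a new node is inserted.

```javascript
// The six documents from the slide.
const tweets = [
  { _id: "a" },
  { _id: "b", tree: ["a"], retweet: "a" },
  { _id: "c", tree: ["a", "b"], retweet: "b" },
  { _id: "d", tree: ["a", "b"], retweet: "b" },
  { _id: "e", tree: ["a"], retweet: "a" },
  { _id: "f", tree: ["a", "e"], retweet: "e" },
];

// Direct children: match on the parent field.
const directRetweets = (id) =>
  tweets.filter((t) => t.retweet === id).map((t) => t._id);

// Whole subtree: match any document whose ancestor array contains id.
const descendants = (id) =>
  tweets.filter((t) => (t.tree || []).includes(id)).map((t) => t._id);

// History: the ancestor array of the node itself.
const history = (id) => {
  const t = tweets.find((x) => x._id === id);
  return t ? t.tree || [] : [];
};

// New node: copy the parent's ancestors and append the parent.
function makeRetweet(id, parentId) {
  const parent = tweets.find((t) => t._id === parentId);
  return { _id: id, tree: [...(parent.tree || []), parentId], retweet: parentId };
}
```

The cost of this layout is paid at write time: each insert copies the ancestor list, and moving a subtree means rewriting the array on every descendant.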

Trees as Paths
Store hierarchy as a path expression
• Separate each node by a delimiter, e.g. “,”
• Use a regular expression to find parts of a tree
• search must be left-rooted (anchored with ^) to use an index!
{ retweets: [
{ _id: "a", text: "initial tweet",
path: "a" },
{ _id: "b", text: "retweet with comment",
path: "a,b" },
{ _id: "c", text: "reply to retweet",
path : "a,b,c"} ] }

// Find the conversations "a" started
> db.tweets.find( { path: /^a/ } )
// Find the conversations under a branch
> db.tweets.find( { path: /^a,b/ } )
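
One caveat with plain prefix patterns: if node ids can share a prefix (an assumption, since the slide only uses single letters), `/^a,b/` would also match a path such as "a,bc". The plain JavaScript sketch below builds a left-rooted pattern that stops at the delimiter; `escapeRegex` and `subtreePattern` are hypothetical helpers.

```javascript
// Escape regex metacharacters so a path can be embedded in a pattern safely.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Left-rooted pattern matching `path` itself or anything below it.
function subtreePattern(path) {
  return new RegExp("^" + escapeRegex(path) + "(,|$)");
}

const underAB = subtreePattern("a,b");
const paths = ["a", "a,b", "a,b,c", "a,bc", "x,a,b"];
const matches = paths.filter((p) => underAB.test(p));
```

The pattern still starts with "^", so it remains the kind of left-rooted search the slide asks for.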

Time Series

• Records stats by
• Day, Hour, Minute

• Show time series

Time Series

// Time series buckets, hour and minute sub-docs
{ _id: "20111209-1231",
ts: ISODate("2011-12-09T00:00:00.000Z"),
daily: 67,
hourly: { 0: 3, 1: 14, 2: 19 ... 23: 72 },
minute: { 0: 0, 1: 4, 2: 6 ... 1439: 0 }
}

// Add one to the last minute before midnight
// (both counters go in a single $inc; a repeated key would be overwritten)
> db.votes.update(
{ _id: "20111209-1231",
ts: ISODate("2011-12-09T00:00:00.000Z") },
{ $inc: { "hourly.23": 1,
"minute.1439": 1 } })
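
The field names in such an update can be derived from the event timestamp. A minimal plain JavaScript sketch, assuming buckets are kept in UTC; `buildIncUpdate` is a hypothetical helper. In JavaScript a repeated key in an object literal keeps only the last value, so both counters are placed under one `$inc`.

```javascript
// Build the update document for one event: hour of day and minute of day (UTC).
function buildIncUpdate(date) {
  const hour = date.getUTCHours();
  const minuteOfDay = hour * 60 + date.getUTCMinutes();
  return {
    $inc: {
      ["hourly." + hour]: 1,
      ["minute." + minuteOfDay]: 1,
    },
  };
}

// An event in the last minute before midnight.
const lateUpdate = buildIncUpdate(new Date("2011-12-09T23:59:30Z"));
```

For that timestamp the update increments "hourly.23" and "minute.1439", matching the slide.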

BSON Storage

• Sequence of key/value pairs
• NOT a hash map
• Optimized to scan quickly

0 1 2 3 ... 1439
What is the cost of updating the minute before
midnight?

BSON Storage

• Can skip sub-documents

0 ... 23
0 1 ... 59 1380 ... 1439

How could this change the schema?

Time Series
Use more of a Tree structure by nesting!

// Time series buckets, each hour a sub-document
{ _id: "20111209-1231",
ts: ISODate("2011-12-09T00:00:00.000Z"),
daily: 67,
minute: { 0: { 0: 0, 1: 7, ... 59: 2 },
...
23: { 0: 15, ... 59: 6 }
}
}

// Add one to the last minute before midnight
> db.votes.update(
{ _id: "20111209-1231",
ts: ISODate("2011-12-09T00:00:00.000Z") },
{ $inc: { "minute.23.59": 1 } })
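
A rough cost model shows why nesting helps. It assumes, for illustration only, that reaching a field costs one step per sibling key skipped and that a whole sub-document can be skipped in one step; actual storage behaviour may differ. `flatCost`, `nestedCost` and `nestedKey` are hypothetical helpers written in plain JavaScript.

```javascript
// Flat layout: minute keys 0..1439 sit side by side, so all earlier keys are skipped.
function flatCost(minuteOfDay) {
  return minuteOfDay;
}

// Nested layout: skip earlier hour sub-documents, then earlier minutes in that hour.
function nestedCost(minuteOfDay) {
  const hour = Math.floor(minuteOfDay / 60);
  const minute = minuteOfDay % 60;
  return hour + minute;
}

// Field path for the nested layout, from a UTC timestamp.
function nestedKey(date) {
  return "minute." + date.getUTCHours() + "." + date.getUTCMinutes();
}

const lastMinute = 23 * 60 + 59; // 1439
const flatSteps = flatCost(lastMinute);
const nestedSteps = nestedCost(lastMinute);
const lastKey = nestedKey(new Date("2011-12-09T23:59:30Z"));
```

Under these assumptions the worst case drops from 1439 skipped keys to 23 + 59 = 82.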

Duplicate data
Document to represent a shopping order:

{ _id: 1234,
ts: ISODate("2011-12-09T00:00:00.000Z"),
customerId: 67,
total_price: 1050,
items: [{ sku: 123, quantity: 2, price: 50,
name: "macbook", thumbnail: "macbook.png" },
{ sku: 234, quantity: 1, price: 20,
name: "iphone", thumbnail: "iphone.png" },
...
]
}

The item information is duplicated in every order that references it.
MongoDB’s flexible schema makes this easy!

Duplicate data
• Pros:
• only 1 query to get all information needed to display
the order
• processing on the db is as fast as a BLOB
• can achieve much higher performance

• Cons:
• more storage used ... cheap enough
• updates are much more complicated ... just consider
fields immutable
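
Treating the duplicated fields as immutable means the order keeps a snapshot taken at purchase time. A minimal plain JavaScript sketch; the `catalog` map and the `buildOrder` helper are hypothetical, and the total here covers only the two sample items.

```javascript
// Hypothetical product catalog keyed by sku.
const catalog = new Map([
  [123, { sku: 123, name: "macbook", price: 50, thumbnail: "macbook.png" }],
  [234, { sku: 234, name: "iphone", price: 20, thumbnail: "iphone.png" }],
]);

// Copy the item fields into the order when it is created.
function buildOrder(id, customerId, lines) {
  const items = lines.map(({ sku, quantity }) => {
    const { name, price, thumbnail } = catalog.get(sku);
    return { sku, quantity, price, name, thumbnail };
  });
  const total_price = items.reduce((sum, i) => sum + i.price * i.quantity, 0);
  return { _id: id, customerId, total_price, items };
}

const order = buildOrder(1234, 67, [
  { sku: 123, quantity: 2 },
  { sku: 234, quantity: 1 },
]);

// A later catalog change does not touch the stored order.
catalog.get(123).price = 60;
```

The order can be displayed from this one document, and it still reflects the price the customer actually paid.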

Summary
• Basic data design principles stay the same ...
• But MongoDB is more flexible and brings possibilities
• embed or duplicate data to speed up operations, cut down
the number of collections and indexes

• watch for documents growing too large
• make sure to use the proper indexes for querying and sorting
• schema should feel natural to your application!

download at mongodb.org

conferences, appearances, and meetups
http://www.10gen.com/events

Facebook | Twitter | LinkedIn
http://bit.ly/mongofb @mongodb http://linkd.in/joinmongo
