10gen Presents Schema Design and Data Modeling


Transcript

  • 1. Schema Design with MongoDB Antoine Girbal antoine@10gen.com @antoinegirbal
  • 2. So why model data? http://www.flickr.com/photos/42304632@N00/493639870/
  • 3. Normalization • Goals: • Avoid anomalies when inserting, updating or deleting • Minimize redesign when extending the schema • Avoid bias toward a particular query • Make use of all SQL features • In MongoDB: • Similar goals apply but the rules are different • Denormalization for optimization is an option: most features still exist, contrary to BLOBs
  • 4. Terminology (RDBMS -> MongoDB): Table -> Collection; Row(s) -> JSON Document; Index -> Index; Join -> Embedding & Linking; Partition -> Shard; Partition Key -> Shard Key
  • 5. Collections Basics • Equivalent to a Table in SQL • Cheap to create (max 24000) • Collections don’t have a fixed schema • Common for documents in a collection to share a schema • Document schema can evolve • Consider using multiple related collections tied together by a naming convention, e.g. LogData-2011-02-08
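The naming convention on this slide can be sketched in plain JavaScript. This is only an illustration: the helper name `logCollectionName` and the choice of one collection per UTC day are assumptions, not something the slides specify.

```javascript
// Hypothetical helper: derive a per-day collection name such as "LogData-2011-02-08".
function logCollectionName(prefix, date) {
  // toISOString() is always in UTC and starts with YYYY-MM-DD.
  const day = date.toISOString().slice(0, 10);
  return `${prefix}-${day}`;
}

// Months are zero-based in Date.UTC, so 1 means February.
const name = logCollectionName("LogData", new Date(Date.UTC(2011, 1, 8)));
console.log(name); // LogData-2011-02-08
```

In the shell, a name built this way could be passed to `db.getCollection(name)` to reach that day's collection.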
  • 6. Document basics • Elements are name/value pairs, equivalent to column value in SQL • Elements can be nested • Rich data types for values • JSON for the human eye • BSON for all internals • 16MB maximum size (many books...) • What you see is what is stored
  • 7. Schema Design - Relational
  • 8. Schema Design - MongoDB
  • 9. Schema Design - MongoDB embedding
  • 10. Schema Design - MongoDB embedding linking
  • 11. Design Session: design documents that simply map to your application > post = { author: "Hergé", date: ISODate("2011-09-18T09:56:06.298Z"), text: "Destination Moon", tags: ["comic", "adventure"] } > db.blogs.save(post)
  • 12. Find the document > db.blogs.find() { _id: ObjectId("4c4ba5c0672c685e5e8aabf3"), author: "Hergé", date: ISODate("2011-09-18T09:56:06.298Z"), text: "Destination Moon", tags: [ "comic", "adventure" ] } Notes: • ID must be unique, but can be anything you’d like • MongoDB will generate a default ID if one is not supplied
  • 13. Add an index, find via index. Secondary index for “author” // 1 means ascending, -1 means descending > db.blogs.ensureIndex( { author: 1 } ) > db.blogs.find( { author: "Hergé" } ) { _id: ObjectId("4c4ba5c0672c685e5e8aabf3"), date: ISODate("2011-09-18T09:56:06.298Z"), author: "Hergé", ... }
  • 14. Examine the query plan> db.blogs.find( { author: "Hergé" } ).explain(){ "cursor" : "BtreeCursor author_1", "nscanned" : 1, "nscannedObjects" : 1, "n" : 1, "millis" : 5, "indexBounds" : { "author" : [ [ "Hergé", "Hergé" ] ] }}
  • 16. Query operators. Conditional operators: $ne, $in, $nin, $mod, $all, $size, $exists, $type, ... $lt, $lte, $gt, $gte ... // find posts with any tags > db.blogs.find( { tags: { $exists: true } } )
  • 17. Query operators. Conditional operators: $ne, $in, $nin, $mod, $all, $size, $exists, $type, ... $lt, $lte, $gt, $gte ... // find posts with any tags > db.blogs.find( { tags: { $exists: true } } ) Regular expressions (case-sensitive by default): // posts where author starts with H > db.blogs.find( { author: /^H/ } )
  • 18. Query operators. Conditional operators: $ne, $in, $nin, $mod, $all, $size, $exists, $type, ... $lt, $lte, $gt, $gte ... // find posts with any tags > db.blogs.find( { tags: { $exists: true } } ) Regular expressions (case-sensitive by default): // posts where author starts with H > db.blogs.find( { author: /^H/ } ) Counting: // number of posts written by Hergé > db.blogs.find( { author: "Hergé" } ).count()
  • 19. Extending the Schema > new_comment = { author: "Kyle", date: new Date(), text: "great book" } > db.blogs.update( { text: "Destination Moon" }, { "$push": { comments: new_comment }, "$inc": { comments_count: 1 } } )
  • 20. Extending the Schema > db.blogs.find( { author: "Hergé" } ) { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "Hergé", date : ISODate("2011-09-18T09:56:06.298Z"), text : "Destination Moon", tags : [ "comic", "adventure" ], comments : [ { author : "Kyle", date : ISODate("2011-09-19T09:56:06.298Z"), text : "great book" } ], comments_count: 1 }
  • 21. Extending the Schema // create index on nested documents: > db.blogs.ensureIndex( { "comments.author": 1 } ) > db.blogs.find( { "comments.author": "Kyle" } )
  • 22. Extending the Schema // create index on nested documents: > db.blogs.ensureIndex( { "comments.author": 1 } ) > db.blogs.find( { "comments.author": "Kyle" } ) // find last 5 posts: > db.blogs.find().sort( { date: -1 } ).limit(5)
  • 23. Extending the Schema // create index on nested documents: > db.blogs.ensureIndex( { "comments.author": 1 } ) > db.blogs.find( { "comments.author": "Kyle" } ) // find last 5 posts: > db.blogs.find().sort( { date: -1 } ).limit(5) // most commented post: > db.blogs.find().sort( { comments_count: -1 } ).limit(1) When sorting, check if you need an index
  • 24. Common Patterns Patterns: • Inheritance • one to one • one to many • many to many
  • 25. Inheritance
  • 26. Single Table Inheritance - MongoDB: shapes table ("-" marks a column with no value)
        id   type     area   radius   length   width
        1    circle   3.14   1        -        -
        2    square   4      -        2        -
        3    rect     10     -        5        2
  • 27. Single Table Inheritance - MongoDB > db.shapes.find() { _id: "1", type: "c", area: 3.14, radius: 1 } { _id: "2", type: "s", area: 4, length: 2 } { _id: "3", type: "r", area: 10, length: 5, width: 2 } Note: missing values are not stored!
  • 28. Single Table Inheritance - MongoDB > db.shapes.find() { _id: "1", type: "c", area: 3.14, radius: 1 } { _id: "2", type: "s", area: 4, length: 2 } { _id: "3", type: "r", area: 10, length: 5, width: 2 } // find shapes where radius > 0 > db.shapes.find( { radius: { $gt: 0 } } )
  • 29. Single Table Inheritance - MongoDB > db.shapes.find() { _id: "1", type: "c", area: 3.14, radius: 1 } { _id: "2", type: "s", area: 4, length: 2 } { _id: "3", type: "r", area: 10, length: 5, width: 2 } // find shapes where radius > 0 > db.shapes.find( { radius: { $gt: 0 } } ) // create index > db.shapes.ensureIndex( { radius: 1 }, { sparse: true } ) Note: a sparse index only covers documents where the value is present!
  • 30. One to Many Either: •Embedded Array / Document: • improves read speed • simplifies schema •Normalize: • if list grows significantly • if sub items are updated often • if sub items are more than 1 level deep and need updating
  • 31. One to Many: Embedded Array • $slice operator to return subset of comments • some queries become harder (e.g. find latest comments across all blogs) blogs: { author : "Hergé", date : ISODate("2011-09-18T09:56:06.298Z"), comments : [ { author : "Kyle", date : ISODate("2011-09-19T09:56:06.298Z"), text : "great book" } ] }
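The slide names `$slice` without showing it. In the shell, a projection such as `db.blogs.find( { author: "Hergé" }, { comments: { $slice: -2 } } )` returns only the last two comments of each post. The sketch below mimics that single-number form on a plain object so the behaviour can be tried without a server; `sliceComments` is a hypothetical stand-in, not a MongoDB function.

```javascript
// Hypothetical stand-in for the single-number form of the $slice projection:
// a positive n keeps the first n elements, a negative n keeps the last |n|.
function sliceComments(post, n) {
  const comments = post.comments || [];
  const subset = n >= 0 ? comments.slice(0, n) : comments.slice(n);
  // Return a copy so the stored document is left untouched.
  return { ...post, comments: subset };
}

const post = {
  author: "Hergé",
  comments: [{ text: "first" }, { text: "second" }, { text: "third" }],
};
const latest = sliceComments(post, -2);
console.log(latest.comments.map((c) => c.text).join(", ")); // second, third
```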
  • 32. One to Many: Normalized (2 collections) • most flexible • more queries blogs: { _id: 1000, author: "Hergé", text: "Destination Moon", date: ISODate("2011-09-18T09:56:06.298Z") } comments : { _id : 1, blogId: 1000, author : "Kyle", date : ISODate("2011-09-19T09:56:06.298Z") } // findOne returns a document; find would return a cursor with no _id > blog = db.blogs.findOne( { text: "Destination Moon" } ); > db.comments.ensureIndex( { blogId: 1 } ) // important! > db.comments.find( { blogId: blog._id } );
  • 33. Many - Many Example: • Product can be in many categories • Category can have many products
  • 34. Many - Many // Each product lists the IDs of the categories products: { _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] }
  • 35. Many - Many // Each product lists the IDs of the categories products: { _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] } // Each category lists the IDs of the products categories: { _id: 20, name: "adventure", product_ids: [ 10, 11, 12 ] } categories: { _id: 30, name: "movie", product_ids: [ 10 ] }
  • 36. Many - Many // Each product lists the IDs of the categories products: { _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] } // Each category lists the IDs of the products categories: { _id: 20, name: "adventure", product_ids: [ 10, 11, 12 ] } categories: { _id: 30, name: "movie", product_ids: [ 10 ] } Cuts mapping table and 2 indexes, but: • potential consistency issue • lists can grow too large
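The "potential consistency issue" arises because each association is written twice, once on the product and once on the category. A rough sketch of a checker over plain objects follows; `findMismatches` and the sample data are illustrative assumptions, not part of the talk.

```javascript
// Hypothetical checker: report (product, category) pairs recorded on only one side.
function findMismatches(products, categories) {
  const mismatches = [];
  const categoryById = new Map(categories.map((c) => [c._id, c]));
  const productById = new Map(products.map((p) => [p._id, p]));
  // Every category a product claims must list that product back.
  for (const p of products) {
    for (const cid of p.category_ids) {
      const c = categoryById.get(cid);
      if (!c || !c.product_ids.includes(p._id)) {
        mismatches.push({ product: p._id, category: cid, missingOn: "category" });
      }
    }
  }
  // Every product a category claims must list that category back.
  for (const c of categories) {
    for (const pid of c.product_ids) {
      const p = productById.get(pid);
      if (!p || !p.category_ids.includes(c._id)) {
        mismatches.push({ product: pid, category: c._id, missingOn: "product" });
      }
    }
  }
  return mismatches;
}

const products = [{ _id: 10, name: "Destination Moon", category_ids: [20, 30] }];
const categories = [
  { _id: 20, name: "adventure", product_ids: [10] },
  { _id: 30, name: "movie", product_ids: [] }, // simulates a lost second write
];
const mismatches = findMismatches(products, categories);
console.log(mismatches.length); // 1
```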
  • 37. Alternative // Each product lists the IDs of the categories products: { _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] } // Association not stored on the categories categories: { _id: 20, name: "adventure" }
  • 38. Alternative // Each product lists the IDs of the categories products: { _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] } // Association not stored on the categories categories: { _id: 20, name: "adventure" } // All products for a given category > db.products.ensureIndex( { category_ids: 1 } ) // yes! > db.products.find( { category_ids: 20 } )
  • 39. Common Use Cases Use cases: • Trees • Time Series
  • 40. Trees: Hierarchical information
  • 41. Trees: Full Tree in Document { retweet: [ { who: "Kyle", text: "...", retweet: [ { who: "James", text: "...", retweet: [] } ] } ] } Pros: Single Document, Performance, Intuitive Cons: Hard to search or update, document can easily get too large
  • 42. Array of Ancestors (diagram: tree rooted at A; B and E are children of A; C and D are children of B; F is a child of E) // Store all Ancestors of a node { _id: "a" } { _id: "b", tree: [ "a" ], retweet: "a" } { _id: "c", tree: [ "a", "b" ], retweet: "b" } { _id: "d", tree: [ "a", "b" ], retweet: "b" } { _id: "e", tree: [ "a" ], retweet: "a" } { _id: "f", tree: [ "a", "e" ], retweet: "e" } // find all direct retweets of "b" > db.tweets.find( { retweet: "b" } )
  • 43. Array of Ancestors (same tree and documents as above) // find all direct retweets of "b" > db.tweets.find( { retweet: "b" } ) // find all retweets of "e" anywhere in tree > db.tweets.find( { tree: "e" } )
  • 44. Array of Ancestors (same tree and documents as above) // find all direct retweets of "b" > db.tweets.find( { retweet: "b" } ) // find all retweets of "e" anywhere in tree > db.tweets.find( { tree: "e" } ) // find tweet history of f: > tweets = db.tweets.findOne( { _id: "f" } ).tree > db.tweets.find( { _id: { $in : tweets } } )
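The slides show the stored documents but not how the `tree` array gets filled. One possible way, sketched here on plain objects, is to copy the parent's ancestors and append the parent's own id; `makeRetweet` is a hypothetical helper name.

```javascript
// Hypothetical helper: build a retweet document whose `tree` holds every ancestor id.
function makeRetweet(id, parent) {
  // The root has no `tree` field, so fall back to an empty list.
  const ancestors = (parent.tree || []).concat(parent._id);
  return { _id: id, tree: ancestors, retweet: parent._id };
}

const a = { _id: "a" };
const e = makeRetweet("e", a);
const f = makeRetweet("f", e);
console.log(JSON.stringify(f)); // {"_id":"f","tree":["a","e"],"retweet":"e"}
```

The result for "f" matches the document shown on the slide, so the ancestor list can be maintained at insert time without walking the tree.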
  • 45. Trees as Paths (diagram: tree with nodes A to F) Store hierarchy as a path expression • Separate each node by a delimiter, e.g. "," • Use text search to find parts of a tree • search must be left-rooted and use an index! { retweets: [ { _id: "a", text: "initial tweet", path: "a" }, { _id: "b", text: "retweet with comment", path: "a,b" }, { _id: "c", text: "reply to retweet", path : "a,b,c" } ] } // Find the conversations "a" started > db.tweets.find( { path: /^a/ } ) // Find the conversations under a branch > db.tweets.find( { path: /^a,b/ } )
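A left-rooted pattern like `/^a,b/` also matches a sibling path such as "a,bc" once ids are longer than one character. The sketch below builds an anchored pattern that requires the delimiter or the end of the string after the prefix. `subtreePattern` is a hypothetical helper, and whether the server still derives index bounds from this form is an assumption to verify with `explain()`.

```javascript
// Hypothetical helper: build an anchored (left-rooted) regex for a path prefix.
function subtreePattern(path, delimiter = ",") {
  // Escape regex metacharacters so ids are matched literally.
  const escape = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Match the node itself or anything below it, but not sibling ids
  // that merely share the prefix (e.g. "a,bc" for prefix "a,b").
  return new RegExp("^" + escape(path) + "(" + escape(delimiter) + "|$)");
}

const underB = subtreePattern("a,b");
console.log(underB.test("a,b,c")); // true
console.log(underB.test("a,bc")); // false
```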
  • 46. Time Series • Records stats by Day, Hour, Minute • Show time series
  • 47. Time Series // Time series buckets, hour and minute sub-docs { _id: "20111209-1231", ts: ISODate("2011-12-09T00:00:00.000Z"), daily: 67, hourly: { 0: 3, 1: 14, 2: 19 ... 23: 72 }, minute: { 0: 0, 1: 4, 2: 6 ... 1439: 0 } } // Add one to the last minute before midnight (both counters go in a single $inc, since a repeated key would overwrite the first) > db.votes.update( { _id: "20111209-1231", ts: ISODate("2011-12-09T00:00:00.000Z") }, { $inc: { "hourly.23": 1, "minute.1439": 1 } } )
  • 48. BSON Storage • Sequence of key/value pairs • NOT a hash map • Optimized to scan quickly (diagram: keys 0 1 2 3 ... 1439 in sequence) What is the cost of updating the minute before midnight?
  • 49. BSON Storage • Can skip sub-documents (diagram: 0 ... 23, 0 1 ... 59, 1380 ... 1439) How could this change the schema?
  • 50. Time Series: Use more of a Tree structure by nesting! // Time series buckets, each hour a sub-document { _id: "20111209-1231", ts: ISODate("2011-12-09T00:00:00.000Z"), daily: 67, minute: { 0: { 0: 0, 1: 7, ... 59: 2 }, ... 23: { 0: 15, ... 59: 6 } } } // Add one to the last minute before midnight > db.votes.update( { _id: "20111209-1231", ts: ISODate("2011-12-09T00:00:00.000Z") }, { $inc: { "minute.23.59": 1 } } )
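To make the nesting argument concrete, the sketch below builds the `$inc` document for a given timestamp and compares a rough count of keys stepped over in each layout. The helper names are hypothetical, and the cost model (keys stored in ascending order, sub-documents skipped in one step) is a simplifying assumption based on the BSON Storage slides, not a measurement.

```javascript
// Hypothetical helper: build the $inc update for the nested hour/minute layout.
function minuteIncrement(date) {
  const hour = date.getUTCHours();
  const minute = date.getUTCMinutes();
  return { $inc: { [`minute.${hour}.${minute}`]: 1 } };
}

// Rough count of keys stepped over before reaching the target field.
// Flat layout: one key per minute of the day, numbered 0..1439.
function keysSkippedFlat(hour, minute) {
  return hour * 60 + minute;
}
// Nested layout: skip earlier hour sub-documents, then earlier minutes in that hour.
function keysSkippedNested(hour, minute) {
  return hour + minute;
}

const lastMinute = new Date(Date.UTC(2011, 11, 9, 23, 59));
const update = minuteIncrement(lastMinute);
console.log(Object.keys(update.$inc)[0]); // minute.23.59
console.log(keysSkippedFlat(23, 59), keysSkippedNested(23, 59)); // 1439 82
```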
  • 51. Duplicate data. Document to represent a shopping order: { _id: 1234, ts: ISODate("2011-12-09T00:00:00.000Z"), customerId: 67, total_price: 1050, items: [ { sku: 123, quantity: 2, price: 50, name: "macbook", thumbnail: "macbook.png" }, { sku: 234, quantity: 1, price: 20, name: "iphone", thumbnail: "iphone.png" }, ... ] } The item information is duplicated in every order that references it. Mongo’s flexible schema makes it easy!
  • 52. Duplicate data • Pros: • only 1 query to get all information needed to display the order • processing on the db is as fast as a BLOB • can achieve much higher performance • Cons: • more storage used ... cheap enough • updates are much more complicated ... just consider fields immutable
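A minimal sketch of the duplication idea, assuming an in-memory catalog keyed by sku: the fields needed to display an order line are copied into the order when it is created and treated as immutable afterwards. `buildOrder` and the catalog shape are illustrative assumptions.

```javascript
// Hypothetical helper: copy the catalog fields needed to display an order line,
// so the order can be shown with a single query even if the catalog changes later.
function buildOrder(orderId, customerId, catalog, lines) {
  const items = lines.map(({ sku, quantity }) => {
    const product = catalog.get(sku);
    if (!product) throw new Error(`unknown sku ${sku}`);
    const { price, name, thumbnail } = product;
    return { sku, quantity, price, name, thumbnail };
  });
  const total_price = items.reduce((sum, i) => sum + i.price * i.quantity, 0);
  return { _id: orderId, customerId, total_price, items };
}

const catalog = new Map([
  [123, { price: 50, name: "macbook", thumbnail: "macbook.png" }],
  [234, { price: 20, name: "iphone", thumbnail: "iphone.png" }],
]);
const order = buildOrder(1234, 67, catalog, [
  { sku: 123, quantity: 2 },
  { sku: 234, quantity: 1 },
]);
console.log(order.total_price); // 120
```

A later price change in the catalog does not alter orders that were already built, which is the "consider fields immutable" trade-off from the slide.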
  • 53. Summary • Basic data design principles stay the same ... • But MongoDB is more flexible and brings possibilities • embed or duplicate data to speed up operations, cut down the number of collections and indexes • watch for documents growing too large • make sure to use the proper indexes for querying and sorting • schema should feel natural to your application!
  • 54. download at mongodb.org • conferences, appearances, and meetups: http://www.10gen.com/events • Facebook: http://bit.ly/mongofb | Twitter: @mongodb | LinkedIn: http://linkd.in/joinmongo