Mongodb in-anger-boston-rb-2011
    Presentation Transcript

    • Using MongoDB in Anger: Techniques and Considerations
    • Kyle Banker (@hwaet)
    • Four topics: Schema design, Indexing, Concurrency, Durability
    • I. Schema design
    • Document size: Keys are stored in the documents themselves. For large data sets, you should use small key names.
    • > doc = { _id: ObjectId("4e94886ebd15f15834ff63c4"),
                username: "Kyle",
                date_of_birth: new Date(1970, 1, 1),
                site_visits: 1027 }
      > Object.bsonsize( doc );
      85
    • > doc = { _id: ObjectId("4e94886ebd15f15834ff63c4"),
                name: "Kyle",
                dob: new Date(1970, 1, 1),
                v: 1027 }
      > Object.bsonsize( doc );
      61 // 28% smaller!
    • Document growth: Certain schema designs require documents to grow significantly. This can be expensive.
    • // Sample: user with followers
      { _id: ObjectId("4e94886ebd15f15834ff63c4"),
        name: "Kyle",
        followers: [
          { user_id: ObjectId("4e94875fbd15f15834ff63c3"), name: "arussell" },
          { user_id: ObjectId("4e94875fbd15f15834ff63c4"), name: "bsmith" }
        ] }
    • An initial design:
      // Update using $push will grow the document
      new_follower = { user_id: ObjectId("4e94875fbd15f15834ff63c5"), name: "jcampbell" }
      db.users.update({name: "Kyle"}, {$push: {followers: new_follower}})
    • Let's break this down... At first, documents are inserted with no extra space. But updates that change the size of the documents will alter the padding factor. Even with a large padding factor, documents that grow unbounded will still eventually have to be moved.
    • Relocation is expensive: All index entry pointers must be updated. The entire document must be rewritten in a new place on disk (possibly not in RAM). May cause fragmentation. Increases the number of entries in the free list.
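    The cost of growth can be sketched in a few lines of plain JavaScript (a toy model, not server code; the record and padding mechanics are simplified from the behavior described above):

```javascript
// Toy model of record allocation: each record is given
// size * paddingFactor bytes on disk; an update that outgrows
// that allocation forces a relocation (rewrite elsewhere).
function makeRecord(size, paddingFactor) {
  return { size: size, allocated: Math.ceil(size * paddingFactor), moves: 0 };
}

function grow(record, bytes, paddingFactor) {
  record.size += bytes;
  if (record.size > record.allocated) {
    // No room in place: reallocate with fresh padding and move.
    record.allocated = Math.ceil(record.size * paddingFactor);
    record.moves += 1;
  }
  return record;
}
```

    Even with a generous padding factor, a record that keeps growing keeps moving; padding only delays the next relocation.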
    • A better design:
      // User collection
      { _id: ObjectId("4e94886ebd15f15834ff63c4"), name: "Kyle" }
      // Followers collection
      { friend_id: ObjectId("4e94875fbd15f15834ff63c3"), name: "arussell" },
      { friend_id: ObjectId("4e94875fbd15f15834ff63c4"), name: "bsmith" }
    • The upshot? Rich documents are still useful. They simplify the representation of objects and can increase query performance because of their pre-joined structure. However, if your documents are going to grow unbounded, it's best to separate them into multiple collections.
    • Pre-aggregation
    • Aggregation: Map-reduce and group are adequate, but may not be fast enough for large data sets. MongoDB 2.2 has a new, fast aggregation framework! Still, pre-aggregation will be faster than post-aggregation in a lot of cases. For real-time apps, it's almost a necessity.
    • Example: a counter cache.
      // User collection
      { _id: ObjectId("4e94886ebd15f15834ff63c4"),
        name: "Kyle",
        follower_ct: 4 }
    • Using the $inc operator:
      // This increment is in-place
      // (i.e., no rewriting of the document).
      db.users.update({name: "Kyle"}, {$inc: {follower_ct: 1}})
    • Need a real-world example?
    • A sophisticated example of pre-aggregation.
      { _id: { uri: BinData(0, "0beec7b5ea3f0fdbc95d0dd47f35"),
               day: "2011-5-1" },
        total: 2820,
        hrs: { 0: 500, 1: 700, 2: 450, 3: 343
               // ... 4-23 go here
        },
        // Minutes are rolling. This gives real-time
        // numbers for the last hour. So when you increment
        // minute n, you need to $set minute n-1 to 0.
        mins: { 1: 12, 2: 10, 3: 5, 4: 34
                // ... 5-60 go here
        } }
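    The rolling-minute bookkeeping above can be sketched as plain JavaScript (a simulation of the $inc/$set update semantics on an in-memory object, not a driver call; the field names follow the slide):

```javascript
// Record one hit for the current minute-of-hour, per the slide:
// $inc the total and the current minute's counter, and $set the
// previous minute's slot to 0 so the window keeps rolling.
function recordHit(statsDoc, minute) {
  const prev = (minute + 59) % 60;                          // minute n-1, wrapping at the hour
  statsDoc.total += 1;                                      // like {$inc: {total: 1}}
  statsDoc.mins[minute] = (statsDoc.mins[minute] || 0) + 1; // like {$inc: {"mins.<n>": 1}}
  statsDoc.mins[prev] = 0;                                  // like {$set: {"mins.<n-1>": 0}}
  return statsDoc;
}
```

    Against a real server this would be a single update using $inc and $set on dotted keys; the sketch only mirrors the arithmetic.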
    • Schema design summary: Think hard about the size of your documents. Optimize keys and data types (not discussed). If your documents are growing unbounded, you may have the wrong schema design. Consider operations that rewrite documents (and individual values) in place. $inc and (sometimes) $set are great examples of this.
    • II. Indexing
    • It's all about efficiency: Fundamental, but widely misunderstood. The right indexes give you the most efficient use of your hardware (RAM, disk, and CPU). The wrong indexes, or no indexes at all, make trivial workloads impossible to run, even on high-end hardware.
    • The basics: Every query should use an index. Use the MongoDB log or the query profiler to identify queries not using an index. The value of nscanned should be low. Know about compound-key indexes. Know which indexes can be utilized for sorts, ranges, etc. Learn to use explain(). Good resources on indexing: MongoDB in Action and High Performance MySQL.
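    What nscanned measures can be illustrated with a small JavaScript sketch (not MongoDB code): an index behaves like a sorted structure searched by bisection, so far fewer entries are examined than in a full collection scan.

```javascript
// Illustrates what `nscanned` counts: how many entries the server
// examines to answer a query. A collection scan examines every
// document; an index (modeled here as a sorted array searched by
// bisection) examines only about log2(n) entries.
function collectionScan(docs, target) {
  let nscanned = 0;
  for (const d of docs) { nscanned++; if (d === target) break; }
  return nscanned;
}

function indexScan(sortedDocs, target) {
  let nscanned = 0, lo = 0, hi = sortedDocs.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    nscanned++;
    if (sortedDocs[mid] === target) break;
    if (sortedDocs[mid] < target) lo = mid + 1; else hi = mid - 1;
  }
  return nscanned;
}
```

    On 1,000 documents, the scan examines up to 1,000 entries while the "index" examines roughly 10; that gap is what explain() is for.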
    • Working set: For the best performance, you should have enough RAM to contain indexes and working set. Working set is the portion of your total data size that's regularly used by the application. For some applications, working set might be 50% of data size. For others, it's close to 100%. For example, think about Foursquare's checkins database. Because checkins are constantly queried to calculate badges, checkins must live in RAM. So working set on this database is 100%.
    • Working set (cont.): On the other end of the spectrum, Craigslist uses MongoDB as a listing archive. This archive is rarely queried. Therefore, it doesn't matter if data size is much larger than RAM, since the working set is small.
    • Special indexing features...
    • Sparse indexes: Use a sparse index to reduce index size. A sparse index will include only those documents having the indexed key. For example, suppose you have 10 million users, of which only 100K are paying subscribers. You can index only those fields relevant to paid subscriptions with a sparse index.
    • A sparse index:
      db.users.ensureIndex({expiration: 1}, {sparse: true})
      // All users whose accounts expire next month
      db.users.find({expiration: {$lte: new Date(2011, 11, 30),
                                  $gte: new Date(2011, 11, 1)}})
    • Index-only queries: If you only need a few values, you can return those values directly from the index. This eliminates the indirection from index to data files on the server. Specify the fields you want, and exclude the _id field. The explain() method will display {indexOnly: true}.
    • An index-only query:
      db.users.ensureIndex({follower_ct: 1, name: 1})
      // This will be index-only.
      db.users.find({}, {follower_ct: 1, name: 1, _id: 0}).sort({follower_ct: -1})
    • Indexing summary: Learn about indexing. Ensure that your queries are using the most efficient index. Investigate sparse indexes and index-only queries for performance-intensive apps.
    • III. Concurrency
    • Current implementation: Concurrency is still somewhat coarse-grained. For any given mongod, there's a server-wide reader-writer lock, with a variety of yielding optimizations. For example, in MongoDB 2.0, the server won't hold a write lock around a page fault. On the roadmap are database-level locking, collection-level locking, and extent-based locking.
    • To avoid concurrency-related bottlenecks: Separate orthogonal concerns into multiple smaller deployments. For example, one for analytics and another for the rest of the app. Ensure that your indexes and working set fit in RAM. Do not attempt to scale reads with secondary nodes unless your application is mostly read-heavy.
    • IV. Durability
    • Four topics: Storage, Journaling, Write concern, Replication
    • Storage: Each file is mapped to virtual memory. All writes to data files are to a virtual memory address. Sync to disk is handled by the OS, with a forced flush every 60 seconds.
    • [Diagram: per-process virtual memory mapped to physical memory (RAM) and disk]
    • Journaling: Data is written to an append-only log and synced every 100ms. This imposes a write penalty, especially on slow drives. If you use journaling, you may want to mount a separate drive for the journal directory. Enabled by default in MongoDB 2.0.
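    The durability trade-off of the 100ms sync can be seen in a toy JavaScript model (an illustration, not MongoDB internals): writes sit in a buffer until the next journal sync, so a crash loses at most the writes since the last sync.

```javascript
// Toy group-commit model: write() buffers an operation, sync()
// (which the server performs roughly every 100ms) makes the
// buffer durable, and crash() shows what survives an abrupt stop.
function makeJournal() { return { buffer: [], durable: [] }; }
function write(j, op)  { j.buffer.push(op); }
function sync(j)       { j.durable.push(...j.buffer); j.buffer = []; }
function crash(j)      { return j.durable.slice(); } // buffered writes are lost
```

    With journaling off, the effective "sync" interval is the OS's 60-second flush described above, which is why the journal narrows the data-loss window so dramatically.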
    • Replication: Fast, automatic failover. Simplifies backups. If you don't want to use journaling, you can use replication instead. Recovery can be trickier, but writes will be faster.
    • Write concern
    • A default, fire-and-forget write:
      @users.insert( {:name => "Kyle"} )
    • Write with a round trip:
      @users.insert( {:name => "Kyle"}, :safe => true )
    • Write to two nodes with a 1000ms timeout:
      @users.insert( {:name => "Kyle"}, :safe => {:w => 2, :wtimeout => 1000} )
    • Write concern advice: Use a level of write concern appropriate to the data you're writing. By default, use {:safe => true}. That is, ensure a single round trip. For especially sensitive data, use replication acknowledgment. For analytics, clicks, logging, etc., use fire-and-forget.
    • Durability in anger: Use replication for durability. You can, optionally, keep a single, passive replica with durability enabled. Use write concern judiciously.
    • Topics we didn't cover: Hardware and deployment practices. Sharding and schema design at scale. (Lots of videos on these online!)
    • Announcements, Questions, and Credits
    • Thank you