MongoDB London 2013:Understanding MongoDB Storage for Performance and Data Safety by Christian Kvalheim, 10gen

MongoDB London 2013:Understanding MongoDB Storage for Performance and Data Safety by Christian Kvalheim, 10gen



In this deep dive, we'll look under the hood at how the MongoDB storage engine works to give you greater insight into both performance and data safety. You'll learn about storage layout, indexes, ...

In this deep dive, we'll look under the hood at how the MongoDB storage engine works to give you greater insight into both performance and data safety. You'll learn about storage layout, indexes, memory mapping, journaling, and fragmentation. This is an internals session intended for those who already have a basic understanding of MongoDB.



Total Views
Views on SlideShare
Embed Views



5 Embeds 260 150 97 5 5 3



Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment
  • -I’m particularly interested in low level programming
  • Just need colum 1 4 last covert to mb- Each database has one or more data files, all in same folder (e.g. test.0, test.1, …)Those files get larger and larger, up to 2GBThe journal folder contains the journal files
  • We’ll cover how data is laid out in more detail later, but at a very high level this is what’s contained in each data file.Extents are allocated within files as needed until space inside the file is fully consumed by extents, then another file is allocated.As more and more data is inserted into a collection, the size of the allocated extents grows exponentially up to 2GB.If you’re writing to many small collections, extents will be interleaved within a single database file, but as the collection grows, eventually a single 2GB file will be devoted to data for that single collection.
  • Imagine your data laid out in these (possibly) sparse extents like a backbone, with linked lists of data records hanging off of each of themGo over table scan logic. $natural  (sort({$natural:-1})When executing a find() with no parameters, the database returns objects in forward natural order. To do a TABLE SCAN, you would start at the first extent indicated in the NamespaceDetails and then traverse the records within.
  • Highlight the important parts
  • Highlight the important parts
  • Can use pmap to view this layout on linux.So here I have a single document located on disk and show how it’s mapped into memory.Let’s look at how we go about actually accessing this document once it’s loaded into memory
  • Break into 2 slides
  • Without journaling, the approach is quite straightforward, there is a one-to-one mapping of data files to memory and when either the OS or an explicit fsync happens, your data is now safe on disk.With journaling we do some tricks.Write ahead log, that is, we write the data to the journal before we update the data itself.Each file is mapped twice, once to a private view which is marked copy-on-write, and once to the shared view – shared in the context that the disk has access to this memory.Every time we do a write, we keep a list of the region of memory that was written to.Batches into group commits, compresses and appends in a group commit to disk by appending to a special journal fileOnce that data has been written to disk, we then do a remapping phase which copies the changes into the shared view, at which point those changes can then be synced to disk.Once that data is synced to disk then it’s safe (barring hardware failure). If there is a failure before the shared/storage view is written to disk, we simply need to apply all the changes in order to the data files since the last time it was synced and we get back to a consistent view of the data
  • *** Journal is Compressed using snappy to reduce the write volumeLSN (Last Sequence Number) is written to disk as a way to provide a marker to which section of the journal was last flushed to

MongoDB London 2013:Understanding MongoDB Storage for Performance and Data Safety by Christian Kvalheim, 10gen MongoDB London 2013:Understanding MongoDB Storage for Performance and Data Safety by Christian Kvalheim, 10gen Presentation Transcript

  • Engineering Team lead, 10gen, @christkvChristian Amor Kvalheim#fosdemUnderstanding MongoDBStorage for Performanceand Data Safety?
  • Journaling and StorageWho Am I• Engineering lead at 10gen• Work on the Mongo DB Node.js driver• Started at 10gen a year and a half ago• @christkv•
  • Journaling and StorageWhy Pop the Hood?• Understanding data safety• Estimating RAM / disk requirements• Optimizing performance
  • Storage Layout
  • drwxr-xr-x 136 Nov 19 10:12 journal-rw------- 16777216 Oct 25 14:58 test.0-rw------- 134217728 Mar 13 2012 test.1-rw------- 268435456 Mar 13 2012 test.2-rw------- 536870912 May 11 2012 test.3-rw------- 1073741824 May 11 2012 test.4-rw------- 2146435072 Nov 19 10:14 test.5-rw------- 16777216 Nov 19 10:13 test.nsDirectory LayoutJournaling and Storage
  • Directory Layout• Aggressive pre-allocation (always 1 spare file)• There is one namespace file per db which canhold 18000 entries per default• A namespace is a collection or an indexJournaling and Storage
  • Tuning with Options• Use --directoryperdb to separate dbs into ownfolders which allows to use different volumes(isolation, performance)• Use --smallfiles to keep data files smaller• If using many databases, use –nopreallocateand --smallfiles to reduce storage size• If using thousands of collections & indexes,increase namespace capacity with --nssizeJournaling and Storage
  • Internal Structure
  • Internal File FormatJournaling and Storage
  • Extent StructureJournaling and Storage
  • Extents and RecordsJournaling and Storage
  • To Sum Up: Internal FileFormat• Files on disk are broken into extents whichcontain the documents• A collection has one or more extents• Extent grow exponentially up to 2GB• Namespace entries in the ns file point to the firstextent for that collectionJournaling and Storage
  • What About Indexes?
  • Indexes• Indexes are BTree structures serialized to disk• They are stored in the same files as data butusing own extentsJournaling and Storage
  • > db.stats(){"db" : "test","collections" : 22,"objects" : 17000383, ## number of documents"avgObjSize" : 44.33690276272011,"dataSize" : 753744328, ## size of data"storageSize" : 1159569408, ## size of all containing extents"numExtents" : 81,"indexes" : 85,"indexSize" : 624204896, ## separate index storage size"fileSize" : 4176478208, ## size of data files on disk"nsSizeMB" : 16,"ok" : 1}The DB StatsJournaling and Storage
  • > db.large.stats(){"ns" : "test.large","count" : 5000000, ## number of documents"size" : 280000024, ## size of data"avgObjSize" : 56.0000048,"storageSize" : 409206784, ## size of all containing extents"numExtents" : 18,"nindexes" : 1,"lastExtentSize" : 74846208,"paddingFactor" : 1, ## amount of padding"systemFlags" : 0,"userFlags" : 0,"totalIndexSize" : 162228192, ## separate index storage size"indexSizes" : {"_id_" : 162228192},"ok" : 1}The Collection StatsJournaling and Storage
  • What’s MemoryMapping?
  • Memory Mapped Files• All data files are memory mapped to VirtualMemory by the OS• MongoDB just reads / writes to RAM in thefilesystem cache• OS takes care of the rest!• Virtual process size = total files size + overhead(connections, heap)• If journal is on, the virtual size will be roughlydoubledJournaling and Storage
  • Virtual Address SpaceJournaling and Storage
  • Memory Map, Love It or HateIt• Pros:– No complex memory / disk code in MongoDB, huge win!– The OS is very good at caching for any type of storage– Least Recently Used behavior– Cache stays warm across MongoDB restarts• Cons:– RAM usage is affected by disk fragmentation– RAM usage is affected by high read-ahead– LRU behavior does not prioritize things (like indexes)Journaling and Storage
  • How Much Data is in RAM?• Resident memory the best indicator of howmuch data in RAM• Resident is: process overhead (connections,heap) + FS pages in RAM that were accessed• Means that it resets to 0 upon restart eventhough data is still in RAM due to FS cache• Use free command to check on FS cache size• Can be affected by fragmentation and read-aheadJournaling and Storage
  • Journaling
  • The ProblemChanges in memory mapped files are not appliedin order and different parts of the file can be fromdifferent points in timeYou want a consistent point-in-time snapshotwhen restarting after a crashJournaling and Storage
  • corruption
  • Solution – Use a Journal• Data gets written to a journal before making it tothe data files• Operations written to a journal buffer in RAM thatgets flushed every 100ms by default or 100MB• Once the journal is written to disk, the data issafe• Journal prevents corruption and allowsdurability• Can be turned off, but don’t!Journaling and Storage
  • Journal FormatJournaling and Storage
  • Can I Lose Data on a HardCrash?• Maximum data loss is 100ms (journal flush). Thiscan be reduced with –journalCommitInterval• For durability (data is on disk when ack’ed) use theJOURNAL_SAFE write concern (“j” option).• Note that replication can reduce the data lossfurther. Use the REPLICAS_SAFE write concern(“w” option).• As write guarantees increase, latency increases.To maintain performance, use more connections!Journaling and Storage
  • What is the Cost of aJournal?• On read-heavy systems, no impact• Write performance is reduced by 5-30%• If using separate drive for journal, as low as 3%• For apps that are write-heavy (1000+ writes perserver) there can be slowdown due to mix ofjournal and data flushes. Use a separate drive!Journaling and Storage
  • Fragmentation
  • Journaling and StorageWhat it Looks LikeBoth on disk and in RAM!
  • Fragmentation• Files can get fragmented over time if remove()and update() are issued.• It gets worse if documents have varied sizes• Fragmentation wastes disk space and RAM• Also makes writes scattered and slower• Fragmentation can be checked by comparingsize to storageSize in the collection’s stats.Journaling and Storage
  • How to CombatFragmentation• compact command (maintenance op)• Normalize schema more (documents don’tgrow)• Pre-pad documents (documents don’t grow)• Use separate collections over time, then usecollection.drop() instead ofcollection.remove(query)• --usePowerOf2sizes option makes disk bucketsmore reusableJournaling and Storage
  • In Review
  • In Review• Understand disk layout and footprint• See how much data is actually in RAM• Memory mapping is cool• Answer how much data is ok to lose• Check on fragmentation and avoid it• and Storage
  • Engineering Team lead, 10gen, @christkvChristian Amor Kvalheim#fosdemQuestions?