
MongoDB Journaling and the Storage Engine


  2. Directory Layout
     •  Separate files per database
     •  Aggressive preallocation
     •  Files contain one or more extents

     -rw------- 1 ben ben  64M May 1 19:14 test.0
     -rw------- 1 ben ben 128M May 1 19:14 test.1
     -rw------- 1 ben ben 256M May 1 18:25 test.2
     -rw------- 1 ben ben 512M May 1 19:14 test.3
     -rw------- 1 ben ben 1.0G May 1 19:14 test.4
     -rw------- 1 ben ben 2.0G May 1 18:58 test.5
     -rw------- 1 ben ben  16M May 1 19:14 test.ns
  3. Memory Mapping
     [Diagram: process virtual memory of MONGOD, from 0x7fffffffffff at the top to 0x0 at the bottom: STACK, …, LIBS, …, test.ns, test.0, test.1, …, HEAP, NULL. Other labels on the slide: Disk, Document { … }, Process Virtual Memory.]
  4. Data Structures
     •  DiskLoc
        •  Stores file number and offset of data on disk
        •  Record *r = mmap base + DiskLoc.offset
        •  Max offset is 2^31 (2GB)
     •  NamespaceDetails
        •  Stores collection metadata
     •  Extent
        •  Stores contiguous blocks within a namespace
        •  Max extent size is 2GB
     •  Record
        •  Holds a BSON document or B-tree bucket
        •  DeletedRecord overwrites a Record
        •  Includes padding
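The DiskLoc arithmetic on this slide can be sketched in a few lines. This is an illustration only, not MongoDB's implementation: the 8-byte layout (two little-endian 32-bit integers), the `mapped_files` dictionary standing in for the per-file mmap base addresses, and the helper names `pack_diskloc` and `resolve` are all assumptions made for the example.

```python
import struct

# Assumed layout for illustration: file number and offset as two
# little-endian signed 32-bit integers (8 bytes total).
DISKLOC = struct.Struct("<ii")

def pack_diskloc(file_no, offset):
    """Encode a (file number, offset) pair."""
    if not 0 <= offset < 2**31:
        raise ValueError("offset must be below 2^31 (data files are capped at 2GB)")
    return DISKLOC.pack(file_no, offset)

def resolve(diskloc_bytes, mapped_files, length):
    """Return `length` bytes at the location named by a packed DiskLoc.

    `mapped_files` is a hypothetical stand-in for the mmap base addresses:
    it maps file number -> bytes-like view of that data file.
    """
    file_no, offset = DISKLOC.unpack(diskloc_bytes)
    view = mapped_files[file_no]
    return bytes(view[offset:offset + length])

# Two fake "data files"; a real server would mmap test.0, test.1, ...
mapped_files = {0: bytearray(64), 1: bytearray(64)}
mapped_files[1][16:21] = b"hello"

loc = pack_diskloc(1, 16)
record = resolve(loc, mapped_files, 5)
```

Because the offset is a signed 32-bit value in this sketch, no single file can be addressed past 2GB, which matches the largest file size in the directory listing on slide 2.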
  5. Namespace Details
     •  Holds metadata about a collection or index
     •  Stored in 1KB buckets in <dbname>.ns file
     •  .ns file fixed size of 16MB
     •  Maintains document count
     •  Contains heads of linked lists
     [Diagram: NamespaceDetails with fields firstExtent, lastExtent, _indexes[], stats, freeList[]]
  6. Extent Structure
     [Diagram: two Extent boxes side by side, each with fields length, xNext, xPrev, firstRecord, lastRecord]
  7. Extents
     > db.foo.validate( { full : true } ).extents.forEach(
           function(z){ print( z.loc + "\t\t" + z.size ); } )
     0:3000        20480
     0:12000       81920
     0:26000       327680
     0:76000       1310720
     0:1da000      5242880
     0:76a000      6291456
     0:d6a000      7553024
     0:16de000     9064448
     0:1f83000     10878976
     0:29e3000     13058048
     1:2000        15671296
     1:ef4000      18808832
     1:29e4000     22573056
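Each new extent in this listing is larger than the one before it. A short calculation over the slide's own numbers makes the pattern visible: the first few extents grow by a factor of 4, and later ones by roughly 1.2. The snippet only describes this particular output; it is not a statement of the allocator's actual rule.

```python
# Extent sizes in bytes, copied from the validate() output on the slide.
sizes = [20480, 81920, 327680, 1310720, 5242880, 6291456, 7553024,
         9064448, 10878976, 13058048, 15671296, 18808832, 22573056]

# Ratio of each extent to the one allocated before it.
ratios = [b / a for a, b in zip(sizes, sizes[1:])]

for size, ratio in zip(sizes[1:], ratios):
    print(f"{size:>10}  x{ratio:.2f}")

# Total space held by the collection's extents, in MB.
total_mb = sum(sizes) / 2**20
```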
  8. Index Extents
     > db.system.namespaces.find()
     { "name" : "test.foo" }
     { "name" : "test.system.indexes" }
     { "name" : "test.foo.$_id_" }

     > db["foo.$_id_"].validate( { full : true } ).extents.forEach(
           function(z){ print( z.loc + "\t\t" + z.size ); } )
     0:9000        36864
     0:1b6000      147456
     0:6da000      589824
     0:149e000     2359296
     1:20e4000     9437184
  9. Extents and Records
     [Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) containing one Data Record (length, rNext, rPrev) that holds a Document { _id: "foo", ... }]
  10. Extents and Records
     [Same diagram text as slide 9]
  11. Extents and Records
     [Diagram: the same Extent containing two Data Records, each with length, rNext, rPrev and a Document { _id: "foo", ... }]
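The two levels of linked lists in these diagrams (extents chained by xNext/xPrev, records chained by rNext/rPrev) can be sketched as follows. This is a toy model, not the on-disk format: Python object references stand in for on-disk locations, the class and function names are invented for the example, and the `"bar"` and `"baz"` documents are made-up sample data.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Record:
    document: dict
    r_next: Optional["Record"] = None
    r_prev: Optional["Record"] = None

@dataclass(eq=False)
class Extent:
    length: int
    x_next: Optional["Extent"] = None
    x_prev: Optional["Extent"] = None
    first_record: Optional[Record] = None
    last_record: Optional[Record] = None

    def append(self, document):
        """Link a new record at the tail of this extent's record list."""
        rec = Record(document, r_prev=self.last_record)
        if self.last_record is None:
            self.first_record = rec
        else:
            self.last_record.r_next = rec
        self.last_record = rec

def scan(first_extent):
    """Yield every document: follow xNext, and rNext within each extent."""
    extent = first_extent
    while extent is not None:
        rec = extent.first_record
        while rec is not None:
            yield rec.document
            rec = rec.r_next
        extent = extent.x_next

a, b = Extent(length=20480), Extent(length=81920)
a.x_next, b.x_prev = b, a
a.append({"_id": "foo"})
b.append({"_id": "bar"})
b.append({"_id": "baz"})
ids = [doc["_id"] for doc in scan(a)]
```

A full collection scan in this model is just the nested walk in `scan`, which is why the NamespaceDetails entry only needs to store the heads of the lists.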
  12. BSON Format
     { hello: "world" }

     \x16\x00\x00\x00               Doc Length
     \x02 hello\x00                 Value Type, followed by the key name
     \x06\x00\x00\x00 world\x00     Value Length, followed by the value
     \x00                           end of document
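The byte layout on this slide can be reproduced by hand with Python's `struct` module. The helper below is a minimal sketch that handles only a single string field, which is all the slide's example needs; the function name is invented for the example and it is not a general BSON encoder.

```python
import struct

def encode_string_doc(key, value):
    """Encode a one-field document whose value is a UTF-8 string."""
    key_bytes = key.encode("utf-8") + b"\x00"       # key name, NUL-terminated
    val_bytes = value.encode("utf-8") + b"\x00"     # string payload + NUL
    element = (
        b"\x02"                                     # value type 0x02: string
        + key_bytes
        + struct.pack("<i", len(val_bytes))         # value length, incl. NUL
        + val_bytes
    )
    body = element + b"\x00"                        # document terminator
    # The leading int32 is the total document length, including itself.
    return struct.pack("<i", 4 + len(body)) + body

doc = encode_string_doc("hello", "world")
```

The result is 22 bytes long, which is the 0x16 in the first byte of the slide's dump.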
  13. Index Extents
     [Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) containing two Index Records, each with length, rNext, rPrev and a Bucket (parent, numKeys). A key K in a bucket is drawn next to a { Document }.]
  14. Index Extents
     [Diagram: the same layout as slide 13, with a B-tree drawn above it whose nodes hold the keys 4 9, 1 3, 5 6 8, A B]
  15. Journaling
     •  Write ahead logging
     •  Operations written to journal before memory mapped regions
        •  Private view
        •  Shared view
     •  Once journal written, data safe unless hardware problem
     •  By default, journal flushed every 100ms, 100MB of writes, or on write concern of j=true
        •  User configurable with --journalCommitInterval
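The write-ahead ordering described on this slide can be illustrated with a small in-memory model. Everything here is a simplification made for the example: the class and method names are hypothetical, plain dictionaries stand in for the private and shared memory-mapped views, and a Python list stands in for the journal files.

```python
class JournaledStore:
    def __init__(self):
        self.journal = []        # durable log: one entry per group commit
        self.pending = []        # writes made since the last group commit
        self.private_view = {}   # what the process reads and writes
        self.shared_view = {}    # stands in for the mapped data files

    def write(self, key, value):
        self.private_view[key] = value
        self.pending.append((key, value))

    def group_commit(self):
        """Journal first, then apply the same writes to the shared view."""
        if not self.pending:
            return
        group = list(self.pending)
        self.journal.append(group)      # step 1: write ahead to the journal
        for key, value in group:        # step 2: update the data files
            self.shared_view[key] = value
        self.pending.clear()

    @classmethod
    def recover(cls, journal):
        """Rebuild state after a crash by replaying journaled groups."""
        store = cls()
        for group in journal:
            for key, value in group:
                store.shared_view[key] = value
        store.private_view = dict(store.shared_view)
        store.journal = [list(group) for group in journal]
        return store

store = JournaledStore()
store.write("a", 1)
store.write("b", 2)
store.group_commit()
store.write("c", 3)   # not yet committed: lost if the process dies here

recovered = JournaledStore.recover(store.journal)
```

In this model the window of possible loss is whatever sits in `pending`, which corresponds to the writes made since the last journal flush.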
  16. Journal Format
     •  Section contains single group commit
     •  Applied all-or-nothing
     [Diagram: a journal file is a JHeader followed by sections. Each section is a JSectHeader [LSN n], a run of DurOp entries, and a JSectFooter; the slide shows sections with LSN 3 and LSN 7, then "…". DurOp entries shown: Op_DbContext (set database context for subsequent operations) and Write Operation entries with fields length, offset, fileNo, data[length].]
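The all-or-nothing rule for sections can be sketched as a replay loop that buffers operations until it sees the section footer. The tuple format below is made up for the example and is far simpler than the real journal encoding; only the header/ops/footer shape is taken from the slide.

```python
def replay(entries):
    """Return the writes from complete sections only.

    `entries` is a flat list of tuples in a made-up format:
      ("header", lsn), ("op", file_no, offset, data), ("footer",)
    A section with no footer (for example a torn write at crash time)
    is dropped as a whole.
    """
    applied, current = [], None
    for entry in entries:
        kind = entry[0]
        if kind == "header":
            current = []                  # start buffering a new section
        elif kind == "op" and current is not None:
            current.append(entry[1:])
        elif kind == "footer" and current is not None:
            applied.extend(current)       # section is complete: apply it all
            current = None
    return applied                        # an unfinished section is ignored

journal = [
    ("header", 3), ("op", 0, 0x3000, b"aa"), ("op", 0, 0x3010, b"bb"), ("footer",),
    ("header", 7), ("op", 1, 0x2000, b"cc"),   # crash before the footer
]
writes = replay(journal)
```

Here the section with LSN 7 contributes nothing, because its footer never reached the journal.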
  17. Journal Performance
     •  On 99.9% read systems, no impact
     •  Write performance degraded 5-30% when journal on same drive
     •  Separate drive as low as 3%
  18. Journal Admin
     •  Journal stored in /dbpath/journal folder
     •  If faster, three 1GB files may be preallocated
     •  Can symlink to a different spindle
     •  --journalCommitInterval* (2ms - 300ms)
     •  When to journal
        •  Single node: required for data integrity
        •  Replica set: at least 1 node
        •  All nodes: removes possible need to resync
  19. Fragmentation
     •  Files may become fragmented over time if documents change size
     •  Free lists also contribute to fragmentation
        •  2.0 reduced scanning to reasonable amounts
        •  2.2 will change allocation strategy
        •  Need to re-write free list to do online compaction
  20. Compaction
     •  1.8 and previous: repairDatabase
     •  2.0+: compact command
        •  Currently resets paddingFactor, but can be changed
        •  Index (re)generation is now concurrent, so compaction can be N times faster
     •  Generally causes some extra allocation
        •  Does not delete or truncate files
  21. Planned Changes
     •  Split data and indexes into different files
     •  Indexes could be symlinked to a different drive (SSD)
     •  Improved allocation strategy
  22. Download MongoDB
     http://www.mongodb.org/downloads

     Ben Becker
     ben.becker@10gen.com
