Storage talk

C
#fosdem




How does Mongo store my
data?
Christian Amor Kvalheim
Engineering Team lead, 10gen, @christkv
Who Am I
• Engineering lead at 10gen

• Work on the MongoDB Node.js driver

• @christkv

• christkv@10gen.com




                        Journaling and Storage
Why Pop the Hood?
• Understanding data safety

• Estimating RAM / disk requirements

• Optimizing performance




                        Journaling and Storage
Storage Layout
Directory Layout
drwxr-xr-x   136           Nov 19 10:12            journal
-rw-------   16777216      Oct 25 14:58            test.0
-rw-------   134217728     Mar 13 2012             test.1
-rw-------   268435456     Mar 13 2012             test.2
-rw-------   536870912     May 11 2012             test.3
-rw-------   1073741824    May 11 2012             test.4
-rw-------   2146435072    Nov 19 10:14            test.5
-rw-------   16777216      Nov 19 10:13            test.ns




                          Journaling and Storage
Directory Layout
• Aggressive pre-allocation (always 1 spare file)
• There is one namespace file per db which can
 hold 24000 entries per default
• A namespace is a collection or an index




                     Journaling and Storage
Tuning with Options
• Use --directoryperdb to separate dbs into own folders
  which allows to use different volumes (isolation,
  performance)
• You can use --nopreallocate to prevent preallocation

• Use --smallfiles to keep data files smaller

• If using many databases, use –nopreallocate and --
  smallfiles to reduce storage size
• If using thousands of collections & indexes, increase
  namespace capacity with --nssize
                         Journaling and Storage
Internal Structure
Internal File Format




            Journaling and Storage
Extent Structure




           Journaling and Storage
Extents and Records




           Journaling and Storage
To Sum Up: Internal File
Format
• Files on disk are broken into extents which
 contain the documents
• A collection has 1 to many extents
• Extent grow exponentially up to 2GB
• Namespace entries in the ns file point to the first
 extent for that collection




                     Journaling and Storage
What About Indexes?
Indexes
• Indexes are BTree structures serialized to disk
• They are stored in the same files as data but
 using own extents




                    Journaling and Storage
The DB Stats
> db.stats()
{
    "db" : "test",
    "collections" : 22,
    "objects" : 17000383, ## number of documents
    "avgObjSize" : 44.33690276272011,
    "dataSize" : 753744328, ## size of data
    "storageSize" : 1159569408, ## size of all containing extents
    "numExtents" : 81,
    "indexes" : 85,
    "indexSize" : 624204896, ## separate index storage size
    "fileSize" : 4176478208, ## size of data files on disk
    "nsSizeMB" : 16,
    "ok" : 1
}



                            Journaling and Storage
The Collection Stats
> db.large.stats()
{
    "ns" : "test.large",
    "count" : 5000000, ## number of documents
    "size" : 280000024, ## size of data
    "avgObjSize" : 56.0000048,
    "storageSize" : 409206784, ## size of all containing extents
    "numExtents" : 18,
    "nindexes" : 1,
    "lastExtentSize" : 74846208,
    "paddingFactor" : 1, ## amount of padding
    "systemFlags" : 0,
    "userFlags" : 0,
    "totalIndexSize" : 162228192, ## separate index storage size
    "indexSizes" : {
        "_id_" : 162228192
    },
    "ok" : 1
}


                            Journaling and Storage
What’s Memory
Mapping?
Memory Mapped Files
• All data files are memory mapped to Virtual
 Memory by the OS
• MongoDB just reads / writes to RAM in the
 filesystem cache
• OS takes care of the rest!

• Virtual process size = total files size + overhead
 (connections, heap)
• If journal is on, the virtual size will be roughly
 doubled
                         Journaling and Storage
Virtual Address Space




           Journaling and Storage
Memory Map, Love It or Hate
It
• Pros:
   –   No complex memory / disk code in MongoDB, huge win!
   –   The OS is very good at caching for any type of storage
   –   Least Recently Used behavior
   –   Cache stays warm across MongoDB restarts
• Cons:
   – RAM usage is affected by disk fragmentation
   – RAM usage is affected by high read-ahead
   – LRU behavior does not prioritize things (like indexes)



                          Journaling and Storage
How Much Data is in RAM?
• Resident memory the best indicator of how
 much data in RAM
• Resident is: process overhead
 (connections, heap) + FS pages in RAM that
 were accessed
• Means that it resets to 0 upon restart even
 though data is still in RAM due to FS cache
• Use free command to check on FS cache size
• Can be affected by fragmentation and read-
 ahead               Journaling and Storage
Journaling
The Problem
Changed in memory mapped files are not applied
in order and different parts of the file can be from
different points in time


You want a consistent point-in-time snapshot
when restarting after a crash




                     Journaling and Storage
Storage talk
corruptio
n
Solution – Use a Journal
• Data gets written to a journal before making it to
 the data files
• Operations written to a journal buffer in RAM that
 gets flushed every 100ms by default or 100MB
• Once journal is written to disk, data is safe
• Journal prevents corruption and allows
 durability
• Can be turned off, but don’t!

                     Journaling and Storage
Journal Format




           Journaling and Storage
Can I Lose Data on a Hard
Crash?
• Maximum data loss is 100ms (journal flush). This
 can be reduced with –journalCommitInterval
• For durability (data is on disk when ack’ed) use the
 JOURNAL_SAFE write concern (“j” option).
• Note that replication can reduce the data loss
 further. Use the REPLICAS_SAFE write concern
 (“w” option).
• As write guarantees increase, latency increases.
 To maintain performance, use more connections!
                      Journaling and Storage
What is the Cost of a
Journal?
• On read-heavy systems, no impact
• Write performance is reduced by 5-30%
• If using separate drive for journal, as low as 3%
• For apps that are write-heavy (1000+ writes per
 server) there can be slowdown due to mix of
 journal and data flushes. Use a separate drive!




                     Journaling and Storage
Fragmentation
What it Looks Like
Both on disk and in RAM!




                   Journaling and Storage
Fragmentation
• Files can get fragmented over time if remove()
 and update() are issued.
• It gets worse if documents have varied sizes
• Fragmentation wastes disk space and RAM
• Also makes writes scattered and slower
• Fragmentation can be checked by comparing
 size to storageSize in the collection’s stats.


                     Journaling and Storage
How to Combat
Fragmentation
• compact command (maintenance op)
• Normalize schema more (documents don’t
 grow)
• Pre-pad documents (documents don’t grow)
• Use separate collections over time, then use
 collection.drop() instead of
 collection.remove(query)
• --usePowerOf2sizes option makes disk buckets
 more reusable
                     Journaling and Storage
In Review
In Review
• Understand disk layout and footprint
• See how much data is actually in RAM
• Memory mapping is cool
• Answer how much data is ok to lose
• Check on fragmentation and avoid it
• https://github.com/10gen-labs/storage-viz



                    Journaling and Storage
#fosdem




Questions?
Christian Amor Kvalheim
Engineering Team lead, 10gen, @christkv
1 of 36

Recommended

Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final by
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalSizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalVigyan Jain
395 views38 slides
MongoDB Miami Meetup 1/26/15: Introduction to WiredTiger by
MongoDB Miami Meetup 1/26/15: Introduction to WiredTigerMongoDB Miami Meetup 1/26/15: Introduction to WiredTiger
MongoDB Miami Meetup 1/26/15: Introduction to WiredTigerValeri Karpov
1.7K views27 slides
MongoDB Internals by
MongoDB InternalsMongoDB Internals
MongoDB InternalsSiraj Memon
1.2K views31 slides
MongoDB WiredTiger Internals by
MongoDB WiredTiger InternalsMongoDB WiredTiger Internals
MongoDB WiredTiger InternalsNorberto Leite
12.7K views66 slides
A Technical Introduction to WiredTiger by
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerMongoDB
14.3K views42 slides
MongoDB Evenings Boston - An Update on MongoDB's WiredTiger Storage Engine by
MongoDB Evenings Boston - An Update on MongoDB's WiredTiger Storage EngineMongoDB Evenings Boston - An Update on MongoDB's WiredTiger Storage Engine
MongoDB Evenings Boston - An Update on MongoDB's WiredTiger Storage EngineMongoDB
805 views34 slides

More Related Content

What's hot

A Technical Introduction to WiredTiger by
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerMongoDB
2.4K views38 slides
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15) by
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB
5.4K views29 slides
Let the Tiger Roar! - MongoDB 3.0 + WiredTiger by
Let the Tiger Roar! - MongoDB 3.0 + WiredTigerLet the Tiger Roar! - MongoDB 3.0 + WiredTiger
Let the Tiger Roar! - MongoDB 3.0 + WiredTigerJon Rangel
4.3K views38 slides
Mongo db dhruba by
Mongo db dhrubaMongo db dhruba
Mongo db dhrubaDhrubaji Mandal ♛
400 views54 slides
Building Hybrid data cluster using PostgreSQL and MongoDB by
Building Hybrid data cluster using PostgreSQL and MongoDBBuilding Hybrid data cluster using PostgreSQL and MongoDB
Building Hybrid data cluster using PostgreSQL and MongoDBAshnikbiz
3.6K views32 slides
MongoDB Pros and Cons by
MongoDB Pros and ConsMongoDB Pros and Cons
MongoDB Pros and Consjohnrjenson
16.8K views47 slides

What's hot(20)

A Technical Introduction to WiredTiger by MongoDB
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTiger
MongoDB2.4K views
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15) by MongoDB
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB5.4K views
Let the Tiger Roar! - MongoDB 3.0 + WiredTiger by Jon Rangel
Let the Tiger Roar! - MongoDB 3.0 + WiredTigerLet the Tiger Roar! - MongoDB 3.0 + WiredTiger
Let the Tiger Roar! - MongoDB 3.0 + WiredTiger
Jon Rangel4.3K views
Building Hybrid data cluster using PostgreSQL and MongoDB by Ashnikbiz
Building Hybrid data cluster using PostgreSQL and MongoDBBuilding Hybrid data cluster using PostgreSQL and MongoDB
Building Hybrid data cluster using PostgreSQL and MongoDB
Ashnikbiz3.6K views
MongoDB Pros and Cons by johnrjenson
MongoDB Pros and ConsMongoDB Pros and Cons
MongoDB Pros and Cons
johnrjenson16.8K views
Microsoft Hekaton by Siraj Memon
Microsoft HekatonMicrosoft Hekaton
Microsoft Hekaton
Siraj Memon782 views
Webinar: Deploying MongoDB to Production in Data Centers and the Cloud by MongoDB
Webinar: Deploying MongoDB to Production in Data Centers and the CloudWebinar: Deploying MongoDB to Production in Data Centers and the Cloud
Webinar: Deploying MongoDB to Production in Data Centers and the Cloud
MongoDB828 views
Scaling with MongoDB by Rick Copeland
Scaling with MongoDBScaling with MongoDB
Scaling with MongoDB
Rick Copeland15.2K views
WiredTiger & What's New in 3.0 by MongoDB
WiredTiger & What's New in 3.0WiredTiger & What's New in 3.0
WiredTiger & What's New in 3.0
MongoDB1.3K views
https://docs.google.com/presentation/d/1DcL4zK6i3HZRDD4xTGX1VpSOwyu2xBeWLT6a_... by MongoDB
https://docs.google.com/presentation/d/1DcL4zK6i3HZRDD4xTGX1VpSOwyu2xBeWLT6a_...https://docs.google.com/presentation/d/1DcL4zK6i3HZRDD4xTGX1VpSOwyu2xBeWLT6a_...
https://docs.google.com/presentation/d/1DcL4zK6i3HZRDD4xTGX1VpSOwyu2xBeWLT6a_...
MongoDB70.7K views
Agility and Scalability with MongoDB by MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
MongoDB2.5K views
Gruter TECHDAY 2014 Realtime Processing in Telco by Gruter
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter3.6K views
What'sNnew in 3.0 Webinar by MongoDB
What'sNnew in 3.0 WebinarWhat'sNnew in 3.0 Webinar
What'sNnew in 3.0 Webinar
MongoDB8.6K views
Introduction to MongoDB by Ravi Teja
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Ravi Teja53.5K views
HBaseConAsia2018 Track3-6: HBase at Meituan by Michael Stack
HBaseConAsia2018 Track3-6: HBase at MeituanHBaseConAsia2018 Track3-6: HBase at Meituan
HBaseConAsia2018 Track3-6: HBase at Meituan
Michael Stack515 views
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB by Athiq Ahamed
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDBBenchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Athiq Ahamed7.6K views
A complete guide to azure storage by Himanshu Sahu
A complete guide to azure storageA complete guide to azure storage
A complete guide to azure storage
Himanshu Sahu1.6K views
CouchDB by codebits
CouchDBCouchDB
CouchDB
codebits19.7K views

Similar to Storage talk

Webinar: Understanding Storage for Performance and Data Safety by
Webinar: Understanding Storage for Performance and Data SafetyWebinar: Understanding Storage for Performance and Data Safety
Webinar: Understanding Storage for Performance and Data SafetyMongoDB
1.2K views34 slides
Tuning Linux Windows and Firebird for Heavy Workload by
Tuning Linux Windows and Firebird for Heavy WorkloadTuning Linux Windows and Firebird for Heavy Workload
Tuning Linux Windows and Firebird for Heavy WorkloadMarius Adrian Popa
203 views60 slides
Capacity Planning by
Capacity PlanningCapacity Planning
Capacity PlanningMongoDB
4.4K views58 slides
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA... by
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum
638 views59 slides
Managing Security At 1M Events a Second using Elasticsearch by
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchJoe Alex
557 views27 slides
MariaDB Performance Tuning and Optimization by
MariaDB Performance Tuning and OptimizationMariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB plc
11.3K views51 slides

Similar to Storage talk(20)

Webinar: Understanding Storage for Performance and Data Safety by MongoDB
Webinar: Understanding Storage for Performance and Data SafetyWebinar: Understanding Storage for Performance and Data Safety
Webinar: Understanding Storage for Performance and Data Safety
MongoDB1.2K views
Tuning Linux Windows and Firebird for Heavy Workload by Marius Adrian Popa
Tuning Linux Windows and Firebird for Heavy WorkloadTuning Linux Windows and Firebird for Heavy Workload
Tuning Linux Windows and Firebird for Heavy Workload
Marius Adrian Popa203 views
Capacity Planning by MongoDB
Capacity PlanningCapacity Planning
Capacity Planning
MongoDB4.4K views
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA... by Maaz Anjum
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum638 views
Managing Security At 1M Events a Second using Elasticsearch by Joe Alex
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
Joe Alex557 views
MariaDB Performance Tuning and Optimization by MariaDB plc
MariaDB Performance Tuning and OptimizationMariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and Optimization
MariaDB plc11.3K views
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015) by Lars Marowsky-Brée
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Lars Marowsky-Brée1.3K views
MariaDB Server Performance Tuning & Optimization by MariaDB plc
MariaDB Server Performance Tuning & OptimizationMariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & Optimization
MariaDB plc8.1K views
Maximizing performance via tuning and optimization by MariaDB plc
Maximizing performance via tuning and optimizationMaximizing performance via tuning and optimization
Maximizing performance via tuning and optimization
MariaDB plc104 views
Maximizing performance via tuning and optimization by MariaDB plc
Maximizing performance via tuning and optimizationMaximizing performance via tuning and optimization
Maximizing performance via tuning and optimization
MariaDB plc642 views
Deployment Strategy by MongoDB
Deployment StrategyDeployment Strategy
Deployment Strategy
MongoDB1.4K views
Deployment Strategies by MongoDB
Deployment StrategiesDeployment Strategies
Deployment Strategies
MongoDB1.1K views
Accelerating hbase with nvme and bucket cache by David Grier
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cache
David Grier571 views
Geek Sync | Guide to Understanding and Monitoring Tempdb by IDERA Software
Geek Sync | Guide to Understanding and Monitoring TempdbGeek Sync | Guide to Understanding and Monitoring Tempdb
Geek Sync | Guide to Understanding and Monitoring Tempdb
IDERA Software724 views
VLDB Administration Strategies by Murilo Miranda
VLDB Administration StrategiesVLDB Administration Strategies
VLDB Administration Strategies
Murilo Miranda1.1K views
Elastic storage in the cloud session 5224 final v2 by BradDesAulniers2
Elastic storage in the cloud session 5224 final v2Elastic storage in the cloud session 5224 final v2
Elastic storage in the cloud session 5224 final v2
BradDesAulniers285 views
Red Hat Storage Day Dallas - Red Hat Ceph Storage Acceleration Utilizing Flas... by Red_Hat_Storage
Red Hat Storage Day Dallas - Red Hat Ceph Storage Acceleration Utilizing Flas...Red Hat Storage Day Dallas - Red Hat Ceph Storage Acceleration Utilizing Flas...
Red Hat Storage Day Dallas - Red Hat Ceph Storage Acceleration Utilizing Flas...
Red_Hat_Storage1.1K views
Best Practices with PostgreSQL on Solaris by Jignesh Shah
Best Practices with PostgreSQL on SolarisBest Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on Solaris
Jignesh Shah2K views

More from christkv

From SQL to MongoDB by
From SQL to MongoDBFrom SQL to MongoDB
From SQL to MongoDBchristkv
184 views41 slides
New in MongoDB 2.6 by
New in MongoDB 2.6New in MongoDB 2.6
New in MongoDB 2.6christkv
1.4K views29 slides
Lessons from 4 years of driver develoment by
Lessons from 4 years of driver develomentLessons from 4 years of driver develoment
Lessons from 4 years of driver develomentchristkv
1.7K views42 slides
Mongo db ecommerce by
Mongo db ecommerceMongo db ecommerce
Mongo db ecommercechristkv
3.8K views21 slides
Cdr stats-vo ip-analytics_solution_mongodb_meetup by
Cdr stats-vo ip-analytics_solution_mongodb_meetupCdr stats-vo ip-analytics_solution_mongodb_meetup
Cdr stats-vo ip-analytics_solution_mongodb_meetupchristkv
1.1K views18 slides
Mongodb intro by
Mongodb introMongodb intro
Mongodb introchristkv
4.1K views37 slides

More from christkv(9)

From SQL to MongoDB by christkv
From SQL to MongoDBFrom SQL to MongoDB
From SQL to MongoDB
christkv184 views
New in MongoDB 2.6 by christkv
New in MongoDB 2.6New in MongoDB 2.6
New in MongoDB 2.6
christkv1.4K views
Lessons from 4 years of driver develoment by christkv
Lessons from 4 years of driver develomentLessons from 4 years of driver develoment
Lessons from 4 years of driver develoment
christkv1.7K views
Mongo db ecommerce by christkv
Mongo db ecommerceMongo db ecommerce
Mongo db ecommerce
christkv3.8K views
Cdr stats-vo ip-analytics_solution_mongodb_meetup by christkv
Cdr stats-vo ip-analytics_solution_mongodb_meetupCdr stats-vo ip-analytics_solution_mongodb_meetup
Cdr stats-vo ip-analytics_solution_mongodb_meetup
christkv1.1K views
Mongodb intro by christkv
Mongodb introMongodb intro
Mongodb intro
christkv4.1K views
Schema design by christkv
Schema designSchema design
Schema design
christkv731 views
Node js mongodriver by christkv
Node js mongodriverNode js mongodriver
Node js mongodriver
christkv9K views
Node.js and ruby by christkv
Node.js and rubyNode.js and ruby
Node.js and ruby
christkv2.2K views

Storage talk

  • 1. #fosdem How does Mongo store my data? Christian Amor Kvalheim Engineering Team lead, 10gen, @christkv
  • 2. Who Am I • Engineering lead at 10gen • Work on the MongoDB Node.js driver • @christkv • christkv@10gen.com Journaling and Storage
  • 3. Why Pop the Hood? • Understanding data safety • Estimating RAM / disk requirements • Optimizing performance Journaling and Storage
  • 5. Directory Layout drwxr-xr-x 136 Nov 19 10:12 journal -rw------- 16777216 Oct 25 14:58 test.0 -rw------- 134217728 Mar 13 2012 test.1 -rw------- 268435456 Mar 13 2012 test.2 -rw------- 536870912 May 11 2012 test.3 -rw------- 1073741824 May 11 2012 test.4 -rw------- 2146435072 Nov 19 10:14 test.5 -rw------- 16777216 Nov 19 10:13 test.ns Journaling and Storage
  • 6. Directory Layout • Aggressive pre-allocation (always 1 spare file) • There is one namespace file per db which can hold 24000 entries per default • A namespace is a collection or an index Journaling and Storage
  • 7. Tuning with Options • Use --directoryperdb to separate dbs into own folders which allows to use different volumes (isolation, performance) • You can use --nopreallocate to prevent preallocation • Use --smallfiles to keep data files smaller • If using many databases, use –nopreallocate and -- smallfiles to reduce storage size • If using thousands of collections & indexes, increase namespace capacity with --nssize Journaling and Storage
  • 9. Internal File Format Journaling and Storage
  • 10. Extent Structure Journaling and Storage
  • 11. Extents and Records Journaling and Storage
  • 12. To Sum Up: Internal File Format • Files on disk are broken into extents which contain the documents • A collection has 1 to many extents • Extent grow exponentially up to 2GB • Namespace entries in the ns file point to the first extent for that collection Journaling and Storage
  • 14. Indexes • Indexes are BTree structures serialized to disk • They are stored in the same files as data but using own extents Journaling and Storage
  • 15. The DB Stats > db.stats() { "db" : "test", "collections" : 22, "objects" : 17000383, ## number of documents "avgObjSize" : 44.33690276272011, "dataSize" : 753744328, ## size of data "storageSize" : 1159569408, ## size of all containing extents "numExtents" : 81, "indexes" : 85, "indexSize" : 624204896, ## separate index storage size "fileSize" : 4176478208, ## size of data files on disk "nsSizeMB" : 16, "ok" : 1 } Journaling and Storage
  • 16. The Collection Stats > db.large.stats() { "ns" : "test.large", "count" : 5000000, ## number of documents "size" : 280000024, ## size of data "avgObjSize" : 56.0000048, "storageSize" : 409206784, ## size of all containing extents "numExtents" : 18, "nindexes" : 1, "lastExtentSize" : 74846208, "paddingFactor" : 1, ## amount of padding "systemFlags" : 0, "userFlags" : 0, "totalIndexSize" : 162228192, ## separate index storage size "indexSizes" : { "_id_" : 162228192 }, "ok" : 1 } Journaling and Storage
  • 18. Memory Mapped Files • All data files are memory mapped to Virtual Memory by the OS • MongoDB just reads / writes to RAM in the filesystem cache • OS takes care of the rest! • Virtual process size = total files size + overhead (connections, heap) • If journal is on, the virtual size will be roughly doubled Journaling and Storage
  • 19. Virtual Address Space Journaling and Storage
  • 20. Memory Map, Love It or Hate It • Pros: – No complex memory / disk code in MongoDB, huge win! – The OS is very good at caching for any type of storage – Least Recently Used behavior – Cache stays warm across MongoDB restarts • Cons: – RAM usage is affected by disk fragmentation – RAM usage is affected by high read-ahead – LRU behavior does not prioritize things (like indexes) Journaling and Storage
  • 21. How Much Data is in RAM? • Resident memory the best indicator of how much data in RAM • Resident is: process overhead (connections, heap) + FS pages in RAM that were accessed • Means that it resets to 0 upon restart even though data is still in RAM due to FS cache • Use free command to check on FS cache size • Can be affected by fragmentation and read- ahead Journaling and Storage
  • 23. The Problem Changed in memory mapped files are not applied in order and different parts of the file can be from different points in time You want a consistent point-in-time snapshot when restarting after a crash Journaling and Storage
  • 26. Solution – Use a Journal • Data gets written to a journal before making it to the data files • Operations written to a journal buffer in RAM that gets flushed every 100ms by default or 100MB • Once journal is written to disk, data is safe • Journal prevents corruption and allows durability • Can be turned off, but don’t! Journaling and Storage
  • 27. Journal Format Journaling and Storage
  • 28. Can I Lose Data on a Hard Crash? • Maximum data loss is 100ms (journal flush). This can be reduced with –journalCommitInterval • For durability (data is on disk when ack’ed) use the JOURNAL_SAFE write concern (“j” option). • Note that replication can reduce the data loss further. Use the REPLICAS_SAFE write concern (“w” option). • As write guarantees increase, latency increases. To maintain performance, use more connections! Journaling and Storage
  • 29. What is the Cost of a Journal? • On read-heavy systems, no impact • Write performance is reduced by 5-30% • If using separate drive for journal, as low as 3% • For apps that are write-heavy (1000+ writes per server) there can be slowdown due to mix of journal and data flushes. Use a separate drive! Journaling and Storage
  • 31. What it Looks Like Both on disk and in RAM! Journaling and Storage
  • 32. Fragmentation • Files can get fragmented over time if remove() and update() are issued. • It gets worse if documents have varied sizes • Fragmentation wastes disk space and RAM • Also makes writes scattered and slower • Fragmentation can be checked by comparing size to storageSize in the collection’s stats. Journaling and Storage
  • 33. How to Combat Fragmentation • compact command (maintenance op) • Normalize schema more (documents don’t grow) • Pre-pad documents (documents don’t grow) • Use separate collections over time, then use collection.drop() instead of collection.remove(query) • --usePowerOf2sizes option makes disk buckets more reusable Journaling and Storage
  • 35. In Review • Understand disk layout and footprint • See how much data is actually in RAM • Memory mapping is cool • Answer how much data is ok to lose • Check on fragmentation and avoid it • https://github.com/10gen-labs/storage-viz Journaling and Storage

Editor's Notes

  1. Just need colum 1 4 last covert to mb- Each database has one or more data files, all in same folder (e.g. test.0, test.1, …)Those files get larger and larger, up to 2GBThe journal folder contains the journal files
  2. We’ll cover how data is laid out in more detail later, but at a very high level this is what’s contained in each data file.Extents are allocated within files as needed until space inside the file is fully consumed by extents, then another file is allocated.As more and more data is inserted into a collection, the size of the allocated extents grows exponentially up to 2GB.If you’re writing to many small collections, extents will be interleaved within a single database file, but as the collection grows, eventually a single 2GB file will be devoted to data for that single collection.
  3. Imagine your data laid out in these (possibly) sparse extents like a backbone, with linked lists of data records hanging off of each of themGo over table scan logic. $natural  (sort({$natural:-1})When executing a find() with no parameters, the database returns objects in forward natural order. To do a TABLE SCAN, you would start at the first extent indicated in the NamespaceDetails and then traverse the records within.
  4. Highlight the important parts
  5. Highlight the important parts
  6. Can use pmap to view this layout on linux.So here I have a single document located on disk and show how it’s mapped into memory.Let’s look at how we go about actually accessing this document once it’s loaded into memory
  7. Break into 2 slides
  8. Without journaling, the approach is quite straightforward, there is a one-to-one mapping of data files to memory and when either the OS or an explicit fsync happens, your data is now safe on disk.With journaling we do some tricks.Write ahead log, that is, we write the data to the journal before we update the data itself.Each file is mapped twice, once to a private view which is marked copy-on-write, and once to the shared view – shared in the context that the disk has access to this memory.Every time we do a write, we keep a list of the region of memory that was written to.Batches into group commits, compresses and appends in a group commit to disk by appending to a special journal fileOnce that data has been written to disk, we then do a remapping phase which copies the changes into the shared view, at which point those changes can then be synced to disk.Once that data is synced to disk then it’s safe (barring hardware failure). If there is a failure before the shared/storage view is written to disk, we simply need to apply all the changes in order to the data files since the last time it was synced and we get back to a consistent view of the data
  9. *** Journal is Compressed using snappy to reduce the write volumeLSN (Last Sequence Number) is written to disk as a way to provide a marker to which section of the journal was last flushed to