#MDBE17
O2 Intercontinental
SCALING AND TRANSACTION
FUTURES
#MDBE17
Senior Staff Engineer, MongoDB Inc.
KEITH BOSTIC
keith.bostic@mongodb.com
#MDBE17
THE WIREDTIGER STORAGE ENGINE
Storage engine:
The “storage engine” is the MongoDB database server code
responsible for single-node durability as well as being the primary
enforcer of data consistency guarantees.
• MongoDB has a pluggable storage engine architecture
‒ Three well-known engines: MMAPv1, WiredTiger and RocksDB
• WiredTiger is the default storage engine
#MDBE17
SCALING MONGODB
A NEW TRANSACTIONAL MODEL
#MDBE17
SCALING MONGODB
... TO A MILLION COLLECTIONS
#MDBE17
PROBLEM 1: LOTS OF COLLECTIONS
• MongoDB applications create collections to hold their documents
• Each collection has some set of indexes
‒ Documents indexed in multiple ways
‒ Hundreds of collections
• But some applications create A LOT of collections:
‒ Avoiding concurrency bottlenecks in the MMAPv1 storage engine
‒ Multi-tenant applications
‒ Creative schema design: time-series data
“640K [of memory]
ought to be enough for anybody.”
Said nobody ever
#MDBE17
PROBLEM 2: INVALID ASSUMPTIONS
• WiredTiger designed for applications with known workloads
‒ WiredTiger design based on this assumption
‒ But MongoDB is used for all kinds of things!
• Application writers make assumptions, too!
‒ MMAPv1 built on top of mmap: different performance characteristics
‒ Most MMAPv1 users migrated without problems
• Engineering is a process of continual improvement
#MDBE17
WHAT DID WE DO ABOUT IT?
• Got better at measuring applications
‒ Full-time data capture (FTDC)
‒ Identifying bottlenecks
• WiredTiger with lots of collections:
‒ Handle caches didn’t scale
‒ Page cache eviction inefficient with lots of trees
‒ Especially when access patterns are skewed
#MDBE17
[Diagram: single-node architecture. C client connections enter mongod; the WiredTiger storage engine layer keeps a session cache and a cursor cache; WiredTiger core maps T collection tables and T index tables onto collection and index files. The cursor cache scales as T * C.]
#MDBE17
FIRST, FIND A TUNABLE WORKLOAD
• Runs standalone on a modern server
• 64 client threads doing 10K updates / second (total)
• Keep the data working set constant
• But with a small cache size so eviction is exercised
• Vary the number of collections
‒ And nothing else!
‒ Workload spread across an increasing number of collections
• Stop when average latency > 50ms per operation.
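A minimal sketch of this kind of workload, assuming the pymongo driver and a local standalone mongod; it is single-threaded for brevity (the talk used 64 client threads), and the database name, collection count, and key range are illustrative:

import random
import time
from pymongo import MongoClient   # assumes the pymongo driver is installed

NUM_COLLECTIONS = 10_000          # the only parameter varied between runs
UPDATES = 10_000                  # total updates per measurement pass

client = MongoClient()            # standalone mongod on localhost
db = client["bench"]

latencies = []
for _ in range(UPDATES):
    coll = db["c%d" % random.randrange(NUM_COLLECTIONS)]   # spread work across collections
    start = time.perf_counter()
    coll.update_one({"_id": random.randrange(1000)},        # constant working set
                    {"$inc": {"n": 1}}, upsert=True)
    latencies.append(time.perf_counter() - start)

avg_ms = 1000 * sum(latencies) / len(latencies)
print("average latency: %.1f ms" % avg_ms)   # stop scaling up collections when this passes 50ms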
#MDBE17
RESULTS – BASELINES
[Chart: average latency (ms) vs. number of collections, 1,000 to 100,000 (log scale), for MongoDB 3.2.0 and 3.4.0.]
#MDBE17
SOLUTION: IMPROVED HANDLE CACHES
• Hash table lookups instead of scanning a list
‒ Assumption: short lists, handle lookup uncommon
‒ Reality: every time a cursor is opened
• Singly-linked lists mean slow deletes
‒ Assumption: deletes uncommon, singly-linked lists smaller, simpler
‒ Reality: deletes common, removing from a singly-linked list requires a scan
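A hypothetical sketch of the data-structure change (the cache type and names are illustrative, not WiredTiger's): a hash-table-backed cache makes both lookup and removal O(1), where the original singly-linked list required a scan for each:

class HandleCache:
    """Toy handle cache: hash-table lookup and delete instead of list scans."""

    def __init__(self):
        self._handles = {}               # table name -> cached handle

    def get(self, name):
        return self._handles.get(name)   # O(1) hash lookup, done on every cursor open

    def put(self, name, handle):
        self._handles[name] = handle

    def remove(self, name):
        self._handles.pop(name, None)    # O(1); a singly-linked list needs a scan to unlink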
#MDBE17
SOLUTION: IMPROVED HANDLE CACHES
• Global lock means terrible concurrency
‒ Assumption: short lists, use an exclusive lock
‒ Reality: many operations read-only, shared read-write locks better
#MDBE17
SOLUTION: SMARTER EVICTION
• WiredTiger evicts some pages from every tree
‒ Assumption: uniformity of data across collections
• Finding the data is a significant problem
‒ Retrieval data structures are all you have
• Skewed data access
‒ Lots of trees are idle in common applications
‒ Multi-tenant or time-series data are prime examples
‒ Often 1-5 trees dominate a cache of 10K trees
#MDBE17
RESULTS – HANDLE CACHE AND EVICTION
[Chart: average latency (ms) vs. number of collections, 1,000 to 100,000, for 3.2.0, 3.4.0, and the eviction tweak.]
#MDBE17
SOLUTION: IMPROVE CHECKPOINTS
• Assumptions:
‒ Checkpoints are rare events, high-end applications configure journaling
‒ Exclusive lock while finding handles that need to be checkpointed
‒ Drops are rare events, and scheduled by the application
• Reality:
‒ Checkpoints continuous, every 60 seconds for historic reasons
‒ With 100K trees, exclusive lock held for far too long
‒ Drops happen randomly
#MDBE17
SOLUTION: IMPROVE CHECKPOINTS
• Skewed access patterns
‒ Reviewing handles with no data is wasted effort
‒ Won’t hold locks as long if we skip clean handles
• Split checkpoints into two phases
‒ Phase 1: most I/O happens, multithreaded
‒ Phase 2: trees made consistent, metadata updated/flushed, single-threaded
• “Delayed drops” feature to allow drops during checkpoints
#MDBE17
SOLUTION: AGGRESSIVE SWEEPING
• Assumption: lazily sweep the handle list
• Reality: 1M handles takes too long to walk
‒ Aggressively discard cached handles we don’t need
#MDBE17
RESULTS – IMPROVE CHECKPOINTS
[Chart: average latency (ms) vs. number of collections, 1,000 to 1,000,000, for 3.2.0, 3.4.0, eviction tweak, and eviction + sweep.]
#MDBE17
SOLUTION: GROUP COLLECTIONS
• Assumption: map each MongoDB collection/index to a table
• Reality:
‒ Makes all handle caches big
‒ Relies on fast caches and a fast filesystem
‒ 1M files in a directory problematic for some filesystems
• Add a “--groupCollections” option to MongoDB
‒ 2 tables per database (collections, indexes)
‒ Adds a prefix to keys
‒ Transparent to applications, although requires configuration
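A hypothetical sketch of the key layout behind grouped collections: many MongoDB collections share one WiredTiger table, and a fixed-width per-collection prefix keeps each collection's keys contiguous within it (the encoding shown is illustrative, not the actual format):

import struct

def grouped_key(collection_id: int, record_id: int) -> bytes:
    # Big-endian packing makes byte-wise comparison match numeric order,
    # so all of one collection's records sort together in the shared table.
    return struct.pack(">QQ", collection_id, record_id)

assert grouped_key(7, 999) < grouped_key(8, 0)   # collection 7 sorts entirely before collection 8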
#MDBE17
[Diagram: the same architecture with grouped collections. C client connections; each database maps to just 2 shared tables (collections, indexes) backed by 2 files, so the session and cursor caches scale as 2 * C instead of T * C.]
#MDBE17
RESULTS – GROUP COLLECTIONS
[Chart: average latency (ms) vs. number of collections, 1,000 to 1,000,000, for 3.2.0, 3.4.0, eviction tweak, eviction + sweep, and grouped collections.]
#MDBE17
RESULTS – SUMMARY
Maximum collections before average latency exceeds 50ms:
‒ MongoDB 3.2.0: 50,000
‒ MongoDB 3.4.0: 10,000
‒ MongoDB 3.4 tuned: 800,000
‒ Grouped collections: 1,000,000+
#MDBE17
MILLION COLLECTIONS PROGRESS
• 2014, MongoDB 3.0: WiredTiger integration
• 2015, MongoDB 3.2: handle cache, checkpoints
• 2016, MongoDB 3.4: concurrency, smarter eviction
• 2017+: grouped collections
#MDBE17
A MILLION COLLECTIONS: SUMMARY
• Got better at measuring performance
• Examined and changed our assumptions
• Tuned data structures and algorithms
• New data representation: grouped collections
“It’s not what you don’t know that gets
you into trouble -- it’s what you know
that just isn’t true.”
Said nobody ever
#MDBE17
A MILLION COLLECTIONS DELIVERABLES
• All tuning work included in the MongoDB 3.6 release.
• Grouped collections feature pushed out of the 3.6 release
‒ Improvements sufficient without requiring application API change?
‒ Increased focus on new transactional features
• More tuning is happening for the next MongoDB release
‒ Integrating the MongoDB and WiredTiger caching
#MDBE17
(We’re not done, the second part starts in 5 minutes!)
QUESTIONS?
#MDBE17
THE MONGODB JOURNEY TO
A NEW TRANSACTIONAL MODEL
#MDBE17
TO ACCOMMODATE NEW APPLICATIONS
• MongoDB designed for a NoSQL, schema-less world
‒ Transactional semantics less of an application requirement
• MongoDB application domain growing
‒ Supporting more traditional applications
‒ Often, applications surrounding the existing MongoDB space
• Also, simplifying existing applications
#MDBE17
TRANSACTIONS: ACID
• Atomicity
‒ All or nothing.
• Consistency
‒ Database constraints aren’t violated (what counts as a “constraint” is individually defined)
• Isolation
‒ Transaction integrity and visibility
• Durability
‒ Permanence in the face of bad stuff happening
#MDBE17
CAP THEOREM
• Consistency
• Availability
• Partition tolerance
#MDBE17
MONGODB’S PRESENT
• ACID, of course
• Single-document transactions
‒ Atomically update multiple fields of a document (and indices)
‒ Transaction cannot span multiple documents or collections
‒ Applications implement some version of two-phase commit
• Single server consistency
‒ Eventual consistency on the secondaries
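A minimal pymongo sketch of what single-document transactions already provide: one update atomically changes several fields of a document (and its index entries), but cannot span documents or collections (the database, collection, and field names are illustrative):

from pymongo import MongoClient

accounts = MongoClient()["bank"]["accounts"]

# Both field changes are applied atomically, or not at all.
accounts.update_one(
    {"_id": "alice"},
    {"$inc": {"balance": -100}, "$set": {"last_debit": "2017-11-08"}},
)

# Moving money between two documents still needs application-level two-phase commit.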
#MDBE17
MONGODB’S FUTURE:
MULTI-DOCUMENT TRANSACTIONS
• Application developers want them:
‒ Some workloads require them
‒ Developers struggle with error handling
‒ Increase application performance, decrease application complexity
• MongoDB developers want them:
‒ Chunk migration to balance content on shards
‒ Changing shard keys
#MDBE17
NECESSARY RISK:
INCREASING SHARD ENTANGLEMENT
• Increasing inter-shard entanglement
‒ The wrong answer is easy, the right answer takes more communication
• Chunk balance should not affect correctness
• Shards can’t simply abort transactions to get unstuck
• Additional migration complexity
• Shard entanglement impacts availability
#MDBE17
OTHER RISKS AND KNOCK-ON EFFECTS
• Developers use transactions rather than appropriate schemas
‒ Long-running transactions are seductive
• Inevitably, the rate of concurrency collisions increases
• Significant technical complexity
‒ Multi-year project
‒ Every part of the server team: replication, sharding, query, storage
‒ Significantly increases pressure on the storage engines
#MDBE17
FEATURES ALONG THE WAY
• Automatically avoid dirty secondary reads (3.6!)
• Retryable writes (3.6!)
‒ Applications don’t have to manage write collisions
• Global point-in-time reads
‒ Single system-wide clock ordering operations
• Multi-document transactions
#MDBE17
WIREDTIGER TRANSACTIONS
#MDBE17
WIREDTIGER: SINGLE-NODE TRANSACTION
• Per-thread “session” structure embodies a transaction
• Session structure references data-sources: cursors
• Transactions are implicit or explicit
‒ session.begin_transaction()
‒ session.commit_transaction()
‒ session.rollback_transaction()
• Transactions can already span objects and data-sources!
#MDBE17
WIREDTIGER SINGLE-NODE TRANSACTION
cursor = session.open_cursor()
session.begin_transaction()
cursor.set_key("fruit"); cursor.set_value("apple"); cursor.insert()
cursor.set_key("fruit"); cursor.set_value("orange"); cursor.update()
session.commit_transaction()
cursor.close()
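The same transaction as a self-contained sketch against WiredTiger's Python bindings (the home directory and table name are illustrative; assumes the wiredtiger Python package is built):

import os
from wiredtiger import wiredtiger_open

os.makedirs("WT_HOME", exist_ok=True)
conn = wiredtiger_open("WT_HOME", "create")
session = conn.open_session()
session.create("table:fruit", "key_format=S,value_format=S")

cursor = session.open_cursor("table:fruit")
session.begin_transaction()
cursor.set_key("fruit"); cursor.set_value("apple"); cursor.insert()
cursor.set_key("fruit"); cursor.set_value("orange"); cursor.update()
session.commit_transaction()      # both changes become visible together
cursor.close()
conn.close()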
#MDBE17
TRANSACTION INFORMATION
• 8B transaction ID
• Isolation level and snapshot information
‒ Read-uncommitted: everything
‒ Read-committed: committed updates after start
‒ Snapshot: committed updates before start
• Linked list of change records, called “updates”
‒ For logging on commit
‒ For discard on rollback
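The isolation level is a per-transaction configuration string; a short sketch, continuing with the session from the example above (configuration strings as documented in WiredTiger's API):

session.begin_transaction("isolation=snapshot")         # see only updates committed before start
# ... reads observe a stable point in time ...
session.commit_transaction()

session.begin_transaction("isolation=read-committed")   # also sees updates committed after start
# ... work ...
session.rollback_transaction()                          # the transaction's update list is discarded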
#MDBE17
UPDATE INFORMATION
• Updates include
‒ Transaction ID which embodies “state” (committed or not)
‒ Data package
[Diagram: a key references an update record holding a transaction ID plus the data.]
#MDBE17
MULTI-VERSION CONCURRENCY CONTROL
• Key references
‒ Chain of updates in most recently modified order
‒ Original value, the update visible to everybody
[Diagram: a key references a chain of updates, each a transaction ID plus data, ending in the globally visible (original) data.]
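A conceptual sketch of the read path this implies (hypothetical names and simplified visibility rules, not WiredTiger's actual structures): walk the key's update chain newest-first and return the first update the reader's snapshot can see, falling back to the globally visible original value:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Update:
    txn_id: int                  # transaction that made the change
    data: bytes
    older: Optional["Update"]    # next-older update; None past the original value

def mvcc_read(chain: Update, my_txn_id: int, concurrent: set) -> bytes:
    # "concurrent" holds the IDs of transactions still running when our snapshot
    # was taken; anything else is treated as committed and visible (simplified).
    upd = chain
    while upd is not None:
        if upd.txn_id == my_txn_id or upd.txn_id not in concurrent:
            return upd.data
        upd = upd.older
    raise KeyError("no visible version")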
#MDBE17
WIREDTIGER NAMED SNAPSHOTS FEATURE
• Snapshot: a point-in-time
• Snapshots can be named
‒ Transactions can be started “as of” that snapshot
‒ Readers use this to access data as of a point in time.
• But... snapshots keep data pinned in cache
‒ Newer data cannot be discarded
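A sketch of the named-snapshot calls, continuing with the session from the earlier example and assuming that era's session.snapshot() configuration strings ("name=" to create, "drop=" to release):

session.snapshot("name=majority_point")              # remember this point in time

session.begin_transaction("snapshot=majority_point")
# ... reads see the database exactly as of the snapshot ...
session.rollback_transaction()

session.snapshot("drop=(names=[majority_point])")    # release it so pinned cache can be reclaimed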
#MDBE17
MONGODB ON TOP OF WIREDTIGER MODEL
• MongoDB maps document changes into this model
‒ For example, a single document change involves indexes
‒ Glue layer below the pluggable storage engine API
• Read concern majority
‒ In other words, it won’t disappear
‒ Requires --enableMajorityReadConcern configuration
‒ Built on WiredTiger’s named snapshots
#MDBE17
INTRODUCING SYSTEM TIMESTAMPS
• Applications have their own notion of transactions and time
‒ Defines an expected commit order
‒ Defines durability for a set of systems
• WiredTiger takes a fixed-length byte-string transaction ID
‒ Simply increasing (but not necessarily monotonic)
‒ A “most significant bit first” hexadecimal string
‒ 8B but expected to grow to encompass system-wide ordering
‒ Mix-and-match with native WiredTiger transactions
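A minimal sketch of why a most-significant-bit-first hexadecimal string works as a timestamp: zero-padded hex compares byte-wise (memcmp-style) in the same order as the numbers it encodes, and the width can simply grow later (the helper name is illustrative):

def encode_ts(ts: int) -> str:
    return format(ts, "016x")            # 8 bytes of timestamp as 16 hex digits, MSB first

assert encode_ts(9) < encode_ts(10)      # string order matches numeric order
assert encode_ts(255) < encode_ts(256)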
#MDBE17
MONGODB USES AN “AS OF” TIMESTAMP
• Updates now include a timestamp transaction ID
‒ Timestamp tracked in WiredTiger’s update
‒ Smaller is better, since the timestamp is a significant overhead for small updates
• Commit “as of” a timestamp
‒ Set during the update or later, at transaction commit
• Read “as of” a timestamp
‒ Set at transaction begin
‒ Point-in-time reads: largest timestamp less than or equal to value
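A sketch of the two sides of "as of", continuing with the session and cursor from the earlier WiredTiger example and the encode_ts() helper above (configuration strings as in WiredTiger's timestamp API):

# Commit "as of" a timestamp: set it during the transaction (or at commit).
session.begin_transaction()
cursor.set_key("fruit"); cursor.set_value("pear"); cursor.update()
session.timestamp_transaction("commit_timestamp=" + encode_ts(100))
session.commit_transaction()

# Read "as of" a timestamp: sees the newest update with timestamp <= 90,
# so not the change committed as of 100.
session.begin_transaction("read_timestamp=" + encode_ts(90))
cursor.set_key("fruit")
cursor.search()
print(cursor.get_value())
session.rollback_transaction()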
#MDBE17
MONGODB SETS THE “OLDEST” TIMESTAMP
• Limit future reads
• The point at which WiredTiger can discard history
• Cannot go backward, must be updated frequently
#MDBE17
MONGODB SETS THE “STABLE” TIMESTAMP
• Limits future durability rollbacks
‒ Imagine an election where the primary hasn’t seen a committed update
• WiredTiger writes checkpoints at the stable timestamp
‒ The storage engine can’t write what might be rolled back
• Cannot go backward, must be updated frequently
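A sketch of how the layer above advances both global timestamps through the connection handle (assuming WiredTiger's connection-level set_timestamp call and the encode_ts() helper above):

# Nothing may read before 80, so history older than 80 can be discarded.
conn.set_timestamp("oldest_timestamp=" + encode_ts(80))

# Checkpoints include nothing newer than 95, so on-disk state never contains
# updates that replication might still roll back.
conn.set_timestamp("stable_timestamp=" + encode_ts(95))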
#MDBE17
READ CONCERN MAJORITY FEATURE
• In 3.4 implemented with WiredTiger named snapshots
‒ Every write a named snapshot
‒ Heavy-weight, interacts directly with WiredTiger transaction subsystem
• In 3.6 implemented with read “as of”
‒ Light-weight and fast
‒ Configuration is now a no-op, “always on”
#MDBE17
OPLOG IMPROVEMENTS
• MongoDB does replication by copying its “journal”
‒ Oplog is bulk-loaded on secondaries
‒ Oplog is loaded out-of-order for performance
• Scanning cursor has strict visibility order requirements
‒ No skipping records
‒ No updates visible after the oldest uncommitted update
#MDBE17
OPLOG IMPROVEMENTS
• In 3.4, implemented using WiredTiger named snapshots
• JIRA ticket:
“Under heavy insert load on a 2-node replica set, WiredTiger eviction
appears to hang on the secondary.”
• In 3.6, implemented using timestamps
#MDBE17
A NEW TRANSACTIONAL MODEL SUMMARY
• Significant storage engine changes
• Enhancing transactional consistency for new applications
• Features and improvements in MongoDB 3.6
‒ Retryable writes
‒ Safe secondary reads
‒ Significantly improved performance
#MDBE17
keith.bostic@mongodb.com
QUESTIONS?

Editor's Notes

  • #3 Member of the storage group: storage is part of the server development group. Server is the core MongoDB database product.
  • #4 Storage underlies the technology and features you’ll hear about today. Storage defines durability and consistency (isolation, visibility); as a consequence, storage owns concurrency. Pluggable architecture: per-workload storage engines. Default: acceptable behavior for all workloads.
  • #9 Engineering process discussion. WT began as a separate product and was integrated in 2014 as part of MongoDB 3.0. MMAPv1: lots of collections, fast in-place updates.
  • #10 Parallel effort at MongoDB to measure performance. FTDC data is heavily compressed where measurement doesn’t change. Lots of small collections make it hard to spot pages to discard, especially when few are hot. Assumed uniformity across large objects, found skewed access across tiny objects.
  • #11 Single-node overview. The layered diagram shows caching at each layer: MongoDB session/cursor cache (next area of work), WT cursor cache, WT data handle cache, WT file handle cache. 10,000 connections * 10,000 tables, add indexes, and that’s a multiplier; 1M files is problematic for some filesystems.
  • #12 Design a workload for tuning: there are too many moving parts in MongoDB. Risks losing the problem. In an ideal world, increasing the number of collections would make no difference.
  • #13 We didn’t know we were making the problem worse. 3.4 degrades much more quickly than 3.2: logarithmic scale!
  • #14 Data structures assumed we’d never have lots of collections or frequently change them
  • #16 Modern eviction algorithms don’t have any kind of real queue, it’s too slow. Pages reside elsewhere, and there’s information that lets you know the “age”. Assumes uniformity of the data across collections. Multi-tenant workloads are skewed. Once idle trees empty out, even looking at them is a waste of time.
  • #17 Still nowhere near 1M, but at least back to where we were in 3.2
  • #18 Obvious data structures and tuning changes. Checkpoints hold exclusive locks and slow everything down.
  • #21 Note the x axis scale change, we can now see the 1M target
  • #22 Someday a middle ground, we’ll need to create subdirectories for data Security becomes more interesting: data is co-resident.
  • #23 Architecture with grouped collections: revisits the main architecture diagram with the changes for grouped collections. Assuming a single database here, the cursor cache size is now based only on the number of connections, plus we have ways to limit how big it gets in practice.
  • #24 Grouped collections get us to 1M with < 10ms average latency. We get to about 250K with < 10ms average latency without changing the API, 800K at < 30ms.
  • #25 Tipping point (greater than 50ms). This graph shows how many collections we can support: more is better.
  • #28 It’s what you “know” that just isn’t true... It’s all about changing our assumptions to handle more workloads.
  • #29 Because the tuning efforts were successful (800,000 collections), reaching 1M is less important. Additionally, there are significant tuning, space and application-API issues with respect to grouped collections: for example, compaction, collection drop, security and so on. Solving without a new feature API is better. What if you change your mind later and want to split the two files up?
  • #33 Define the terms and get everybody on the same page Everybody offers a version of ACID, including MongoDB Differences generally around relaxing consistency guarantees
  • #34 MongoDB’s traditional applications have CAP tradeoffs MongoDB’s original design chose partition tolerance and availability over consistency (JD ???) Extending to support more consistency rules.
  • #35 MongoDB supports ACID, but it only applies to individual write operations. “write operations” is a high-level concept, indexes are kept consistent. In 3.4: linearizable reads: write the primary, force read from a secondary to block until it sees the write.
  • #36 Application developers want to shift complexity into the database. Application developer skill set not suited to building database applications.
  • #37 Golden Rule: may not impact the performance of applications not using transactions.
  • #39 safe secondary reads: automatically avoiding dirty reads (?) global point-in-time reads: applications read as of a single point in the causal chain retryable writes: retry automatically so applications don’t have to manage write collisions multi-document transactions: modify multiple documents/collections atomically
  • #41 Storage engine semantics: a relatively standard single-node model. Two types of durability: checkpoint and journalling (standard write-ahead logging). Log records are redo-only; the entire change record must fit into memory.
  • #42 Cursors iterate, remove, and do the standard CRUD operations. Key-value store: MongoDB maps documents and indexes onto it.
  • #44 Updates and inserts: the transaction ID is an identifier into a table of information.
  • #45 Inserts are single entries, with lists of updates. When a cursor encounters a key, compare the cursor and key/update transaction IDs.
  • #47 --enableMajorityReadConcern: visible data must have been written to a majority of the replica set
  • #48 Allow the distributed layer to define/order transactions. 8B is fast, and lockless on 64b machines; it will grow to incorporate cluster-wide clock information. MSB-first so memcmp ordering works. MongoDB and WiredTiger transactions co-exist: applications can mix-and-match where threads don’t care about timestamps, but you can get into trouble: operations on an item must be in timestamp order.
  • #49 Commit timestamps must be ahead of any read timestamp. Setting a read timestamp forces snapshot isolation at that timestamp.
  • #50 Oldest possible reader, including replicated state. To avoid caching infinite updates, the read “as of” timestamp must move forward. Moving complexity into the storage engine, particularly around caching.
  • #51 The distributed engine can roll back locally “durable” events. Write-concern majority: one node might have seen a committed event, but if the master never saw it, it is rolled back. Generally expected that a well-behaved replica set won’t fall behind. Locally crash-safe because the checkpoint happens at the stable event.
  • #52 Benefits in 3.6: complete a read query using the same specified snapshot for its entirety, on a replica set. Every write is a WT “snapshot”; when a secondary receives a majority read request, it finds a majority-confirmed snapshot to use; required all requests over a single socket.
  • #53 oplog is the source of truth
  • #54 Snapshots pin memory: two nodes running on a 24-cpu box, 32GB RAM, pushing 16 threads with vectored writes of 100 tiny documents at a time. The oplog was created for capped collections; it took on a replication role since it looks a lot like a shared log.