#MDBE17
O2 Intercontinental
SCALING AND TRANSACTION
FUTURES
#MDBE17
Senior Staff Engineer, MongoDB Inc.
KEITH BOSTIC
keith.bostic@mongodb.com
#MDBE17
THE WIREDTIGER STORAGE ENGINE
Storage engine:
The “storage engine” is the MongoDB database server code
responsible for single-node durability as well as being the primary
enforcer of data consistency guarantees.
• MongoDB has a pluggable storage engine architecture
‒ Three well-known engines: MMAPv1, WiredTiger and RocksDB
• WiredTiger is the default storage engine
#MDBE17
SCALING MONGODB
A NEW TRANSACTIONAL MODEL
#MDBE17
SCALING MONGODB
... TO A MILLION COLLECTIONS
#MDBE17
PROBLEM 1: LOTS OF COLLECTIONS
• MongoDB applications create collections to hold their documents
• Each collection has some set of indexes
‒ Documents indexed in multiple ways
‒ Hundreds of collections
• But some applications create A LOT of collections:
‒ Avoiding concurrency bottlenecks in the MMAPv1 storage engine
‒ Multi-tenant applications
‒ Creative schema design: time-series data
“640K [of memory]
ought to be enough for anybody.”
Said nobody ever
#MDBE17
PROBLEM 2: INVALID ASSUMPTIONS
• WiredTiger designed for applications with known workloads
‒ WiredTiger design based on this assumption
‒ But MongoDB is used for all kinds of things!
• Application writers make assumptions, too!
‒ MMAPv1 built on top of mmap: different performance characteristics
‒ Most MMAPv1 users migrated without problems
• Engineering is a process of continual improvement
#MDBE17
WHAT DID WE DO ABOUT IT?
• Got better at measuring applications
‒ Full-time data capture (FTDC)
‒ Identifying bottlenecks
• WiredTiger with lots of collections:
‒ Handle caches didn’t scale
‒ Page cache eviction inefficient with lots of trees
‒ Especially when access patterns are skewed
#MDBE17
[Diagram: single-node architecture. C client connections enter mongod; the WiredTiger storage engine layer keeps a session cache and a cursor cache; WiredTiger core maps T collection tables and T index tables onto collection and index files. The cursor cache scales as T * C.]
#MDBE17
FIRST, FIND A TUNABLE WORKLOAD
• Runs standalone on a modern server
• 64 client threads doing 10K updates / second (total)
• Keep the data working set constant
• But with a small cache size so eviction is exercised
• Vary the number of collections
‒ And nothing else!
‒ Workload spread across an increasing number of collections
• Stop when average latency > 50ms per operation.
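A minimal sketch of this kind of workload, assuming the pymongo driver and a local standalone mongod; it is single-threaded for brevity (the talk used 64 client threads), and the database name, collection count, and key range are illustrative:

import random
import time
from pymongo import MongoClient   # assumes the pymongo driver is installed

NUM_COLLECTIONS = 10_000          # the only parameter varied between runs
UPDATES = 10_000                  # total updates per measurement pass

client = MongoClient()            # standalone mongod on localhost
db = client["bench"]

latencies = []
for _ in range(UPDATES):
    coll = db["c%d" % random.randrange(NUM_COLLECTIONS)]   # spread work across collections
    start = time.perf_counter()
    coll.update_one({"_id": random.randrange(1000)},        # constant working set
                    {"$inc": {"n": 1}}, upsert=True)
    latencies.append(time.perf_counter() - start)

avg_ms = 1000 * sum(latencies) / len(latencies)
print("average latency: %.1f ms" % avg_ms)   # stop scaling up collections when this passes 50ms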
#MDBE17
RESULTS – BASELINES
[Chart: average latency (ms) vs. number of collections, 1,000 to 100,000 (log scale), for MongoDB 3.2.0 and 3.4.0.]
#MDBE17
SOLUTION: IMPROVED HANDLE CACHES
• Hash table lookups instead of scanning a list
‒ Assumption: short lists, handle lookup uncommon
‒ Reality: every time a cursor is opened
• Singly-linked lists mean slow deletes
‒ Assumption: deletes uncommon, singly-linked lists smaller, simpler
‒ Reality: deletes common, removing from a singly-linked list requires a scan
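A hypothetical sketch of the data-structure change (the cache type and names are illustrative, not WiredTiger's): a hash-table-backed cache makes both lookup and removal O(1), where the original singly-linked list required a scan for each:

class HandleCache:
    """Toy handle cache: hash-table lookup and delete instead of list scans."""

    def __init__(self):
        self._handles = {}               # table name -> cached handle

    def get(self, name):
        return self._handles.get(name)   # O(1) hash lookup, done on every cursor open

    def put(self, name, handle):
        self._handles[name] = handle

    def remove(self, name):
        self._handles.pop(name, None)    # O(1); a singly-linked list needs a scan to unlink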
#MDBE17
SOLUTION: IMPROVED HANDLE CACHES
• Global lock means terrible concurrency
‒ Assumption: short lists, use an exclusive lock
‒ Reality: many operations read-only, shared read-write locks better
#MDBE17
SOLUTION: SMARTER EVICTION
• WiredTiger evicts some pages from every tree
‒ Assumption: uniformity of data across collections
• Finding the data is a significant problem
‒ Retrieval data structures are all you have
• Skewed data access
‒ Lots of trees are idle in common applications
‒ Multi-tenant or time-series data are prime examples
‒ Often 1-5 trees dominate a cache of 10K trees
#MDBE17
RESULTS – HANDLE CACHE AND EVICTION
[Chart: average latency (ms) vs. number of collections, 1,000 to 100,000, for 3.2.0, 3.4.0, and the eviction tweak.]
#MDBE17
SOLUTION: IMPROVE CHECKPOINTS
• Assumptions:
‒ Checkpoints are rare events, high-end applications configure journaling
‒ Exclusive lock while finding handles that need to be checkpointed
‒ Drops are rare events, and scheduled by the application
• Reality:
‒ Checkpoints continuous, every 60 seconds for historic reasons
‒ With 100K trees, exclusive lock held for far too long
‒ Drops happen randomly
#MDBE17
SOLUTION: IMPROVE CHECKPOINTS
• Skewed access patterns
‒ Reviewing handles with no data is wasted effort
‒ Won’t hold locks as long if we skip clean handles
• Split checkpoints into two phases
‒ Phase 1: most I/O happens, multithreaded
‒ Phase 2: trees made consistent, metadata updated/flushed, single-threaded
• “Delayed drops” feature to allow drops during checkpoints
#MDBE17
SOLUTION: AGGRESSIVE SWEEPING
• Assumption: lazily sweep the handle list
• Reality: 1M handles takes too long to walk
‒ Aggressively discard cached handles we don’t need
#MDBE17
RESULTS – IMPROVE CHECKPOINTS
[Chart: average latency (ms) vs. number of collections, 1,000 to 1,000,000, for 3.2.0, 3.4.0, eviction tweak, and eviction + sweep.]
#MDBE17
SOLUTION: GROUP COLLECTIONS
• Assumption: map each MongoDB collection/index to a table
• Reality:
‒ Makes all handle caches big
‒ Relies on fast caches and a fast filesystem
‒ 1M files in a directory problematic for some filesystems
• Add a “--groupCollections” option to MongoDB
‒ 2 tables per database (collections, indexes)
‒ Adds a prefix to keys
‒ Transparent to applications, although requires configuration
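A hypothetical sketch of the key layout behind grouped collections: many MongoDB collections share one WiredTiger table, and a fixed-width per-collection prefix keeps each collection's keys contiguous within it (the encoding shown is illustrative, not the actual format):

import struct

def grouped_key(collection_id: int, record_id: int) -> bytes:
    # Big-endian packing makes byte-wise comparison match numeric order,
    # so all of one collection's records sort together in the shared table.
    return struct.pack(">QQ", collection_id, record_id)

assert grouped_key(7, 999) < grouped_key(8, 0)   # collection 7 sorts entirely before collection 8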
#MDBE17
[Diagram: the same architecture with grouped collections. C client connections; each database maps to just 2 shared tables (collections, indexes) backed by 2 files, so the session and cursor caches scale as 2 * C instead of T * C.]
#MDBE17
RESULTS – GROUP COLLECTIONS
[Chart: average latency (ms) vs. number of collections, 1,000 to 1,000,000, for 3.2.0, 3.4.0, eviction tweak, eviction + sweep, and grouped collections.]
#MDBE17
RESULTS – SUMMARY
Maximum collections before average latency exceeds 50ms:
‒ MongoDB 3.2.0: 50,000
‒ MongoDB 3.4.0: 10,000
‒ MongoDB 3.4 tuned: 800,000
‒ Grouped collections: 1,000,000+
#MDBE17
MILLION COLLECTIONS PROGRESS
• 2014, MongoDB 3.0: WiredTiger integration
• 2015, MongoDB 3.2: handle cache, checkpoints
• 2016, MongoDB 3.4: concurrency, smarter eviction
• 2017+: grouped collections
#MDBE17
A MILLION COLLECTIONS: SUMMARY
• Got better at measuring performance
• Examined and changed our assumptions
• Tuned data structures and algorithms
• New data representation: grouped collections
“It’s not what you don’t know that gets
you into trouble -- it’s what you know
that just isn’t true.”
Said nobody ever
#MDBE17
A MILLION COLLECTIONS DELIVERABLES
• All tuning work included in the MongoDB 3.6 release.
• Grouped collections feature pushed out of the 3.6 release
‒ Improvements sufficient without requiring application API change?
‒ Increased focus on new transactional features
• More tuning is happening for the next MongoDB release
‒ Integrating the MongoDB and WiredTiger caching
#MDBE17
(We’re not done, the second part starts in 5 minutes!)
QUESTIONS?
#MDBE17
THE MONGODB JOURNEY TO
A NEW TRANSACTIONAL MODEL
#MDBE17
TO ACCOMMODATE NEW APPLICATIONS
• MongoDB designed for a NoSQL, schema-less world
‒ Transactional semantics less of an application requirement
• MongoDB application domain growing
‒ Supporting more traditional applications
‒ Often, applications surrounding the existing MongoDB space
• Also, simplifying existing applications
#MDBE17
TRANSACTIONS: ACID
• Atomicity
‒ All or nothing.
• Consistency
‒ Database constraints aren’t violated (what counts as a “constraint” is individually defined)
• Isolation
‒ Transaction integrity and visibility
• Durability
‒ Permanence in the face of bad stuff happening
#MDBE17
CAP THEOREM
• Consistency
• Availability
• Partition tolerance
#MDBE17
MONGODB’S PRESENT
• ACID, of course
• Single-document transactions
‒ Atomically update multiple fields of a document (and indices)
‒ Transaction cannot span multiple documents or collections
‒ Applications implement some version of two-phase commit
• Single server consistency
‒ Eventual consistency on the secondaries
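A minimal pymongo sketch of what single-document transactions already provide: one update atomically changes several fields of a document (and its index entries), but cannot span documents or collections (the database, collection, and field names are illustrative):

from pymongo import MongoClient

accounts = MongoClient()["bank"]["accounts"]

# Both field changes are applied atomically, or not at all.
accounts.update_one(
    {"_id": "alice"},
    {"$inc": {"balance": -100}, "$set": {"last_debit": "2017-11-08"}},
)

# Moving money between two documents still needs application-level two-phase commit.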
#MDBE17
MONGODB’S FUTURE:
MULTI-DOCUMENT TRANSACTIONS
• Application developers want them:
‒ Some workloads require them
‒ Developers struggle with error handling
‒ Increase application performance, decrease application complexity
• MongoDB developers want them:
‒ Chunk migration to balance content on shards
‒ Changing shard keys
#MDBE17
NECESSARY RISK:
INCREASING SHARD ENTANGLEMENT
• Increasing inter-shard entanglement
‒ The wrong answer is easy, the right answer takes more communication
• Chunk balance should not affect correctness
• Shards can’t simply abort transactions to get unstuck
• Additional migration complexity
• Shard entanglement impacts availability
#MDBE17
OTHER RISKS AND KNOCK-ON EFFECTS
• Developers use transactions rather than appropriate schemas
‒ Long-running transactions are seductive
• Inevitably, the rate of concurrency collisions increases
• Significant technical complexity
‒ Multi-year project
‒ Every part of the server team: replication, sharding, query, storage
‒ Significantly increases pressure on the storage engines
#MDBE17
FEATURES ALONG THE WAY
• Automatically avoid dirty secondary reads (3.6!)
• Retryable writes (3.6!)
‒ Applications don’t have to manage write collisions
• Global point-in-time reads
‒ Single system-wide clock ordering operations
• Multi-document transactions
#MDBE17
WIREDTIGER TRANSACTIONS
#MDBE17
WIREDTIGER: SINGLE-NODE TRANSACTION
• Per-thread “session” structure embodies a transaction
• Session structure references data-sources: cursors
• Transactions are implicit or explicit
‒ session.begin_transaction()
‒ session.commit_transaction()
‒ session.rollback_transaction()
• Transactions can already span objects and data-sources!
#MDBE17
WIREDTIGER SINGLE-NODE TRANSACTION
cursor = session.open_cursor()
session.begin_transaction()
cursor.set_key("fruit"); cursor.set_value("apple"); cursor.insert()
cursor.set_key("fruit"); cursor.set_value("orange"); cursor.update()
session.commit_transaction()
cursor.close()
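The same transaction as a self-contained sketch against WiredTiger's Python bindings (the home directory and table name are illustrative; assumes the wiredtiger Python package is built):

import os
from wiredtiger import wiredtiger_open

os.makedirs("WT_HOME", exist_ok=True)
conn = wiredtiger_open("WT_HOME", "create")
session = conn.open_session()
session.create("table:fruit", "key_format=S,value_format=S")

cursor = session.open_cursor("table:fruit")
session.begin_transaction()
cursor.set_key("fruit"); cursor.set_value("apple"); cursor.insert()
cursor.set_key("fruit"); cursor.set_value("orange"); cursor.update()
session.commit_transaction()      # both changes become visible together
cursor.close()
conn.close()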
#MDBE17
TRANSACTION INFORMATION
• 8B transaction ID
• Isolation level and snapshot information
‒ Read-uncommitted: everything
‒ Read-committed: committed updates after start
‒ Snapshot: committed updates before start
• Linked list of change records, called “updates”
‒ For logging on commit
‒ For discard on rollback
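The isolation level is a per-transaction configuration string; a short sketch, continuing with the session from the example above (configuration strings as documented in WiredTiger's API):

session.begin_transaction("isolation=snapshot")         # see only updates committed before start
# ... reads observe a stable point in time ...
session.commit_transaction()

session.begin_transaction("isolation=read-committed")   # also sees updates committed after start
# ... work ...
session.rollback_transaction()                          # the transaction's update list is discarded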
#MDBE17
UPDATE INFORMATION
• Updates include
‒ Transaction ID which embodies “state” (committed or not)
‒ Data package
[Diagram: a key references an update record holding a transaction ID plus the data.]
#MDBE17
MULTI-VERSION CONCURRENCY CONTROL
• Key references
‒ Chain of updates in most recently modified order
‒ Original value, the update visible to everybody
[Diagram: a key references a chain of updates, each a transaction ID plus data, ending in the globally visible (original) data.]
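A conceptual sketch of the read path this implies (hypothetical names and simplified visibility rules, not WiredTiger's actual structures): walk the key's update chain newest-first and return the first update the reader's snapshot can see, falling back to the globally visible original value:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Update:
    txn_id: int                  # transaction that made the change
    data: bytes
    older: Optional["Update"]    # next-older update; None past the original value

def mvcc_read(chain: Update, my_txn_id: int, concurrent: set) -> bytes:
    # "concurrent" holds the IDs of transactions still running when our snapshot
    # was taken; anything else is treated as committed and visible (simplified).
    upd = chain
    while upd is not None:
        if upd.txn_id == my_txn_id or upd.txn_id not in concurrent:
            return upd.data
        upd = upd.older
    raise KeyError("no visible version")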
#MDBE17
WIREDTIGER NAMED SNAPSHOTS FEATURE
• Snapshot: a point-in-time
• Snapshots can be named
‒ Transactions can be started “as of” that snapshot
‒ Readers use this to access data as of a point in time.
• But... snapshots keep data pinned in cache
‒ Newer data cannot be discarded
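A sketch of the named-snapshot calls, continuing with the session from the earlier example and assuming that era's session.snapshot() configuration strings ("name=" to create, "drop=" to release):

session.snapshot("name=majority_point")              # remember this point in time

session.begin_transaction("snapshot=majority_point")
# ... reads see the database exactly as of the snapshot ...
session.rollback_transaction()

session.snapshot("drop=(names=[majority_point])")    # release it so pinned cache can be reclaimed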
#MDBE17
MONGODB ON TOP OF WIREDTIGER MODEL
• MongoDB maps document changes into this model
‒ For example, a single document change involves indexes
‒ Glue layer below the pluggable storage engine API
• Read concern majority
‒ In other words, it won’t disappear
‒ Requires --enableMajorityReadConcern configuration
‒ Built on WiredTiger’s named snapshots
#MDBE17
INTRODUCING SYSTEM TIMESTAMPS
• Applications have their own notion of transactions and time
‒ Defines an expected commit order
‒ Defines durability for a set of systems
• WiredTiger takes a fixed-length byte-string transaction ID
‒ Simply increasing (but not necessarily monotonic)
‒ A “most significant bit first” hexadecimal string
‒ 8B but expected to grow to encompass system-wide ordering
‒ Mix-and-match with native WiredTiger transactions
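A minimal sketch of why a most-significant-bit-first hexadecimal string works as a timestamp: zero-padded hex compares byte-wise (memcmp-style) in the same order as the numbers it encodes, and the width can simply grow later (the helper name is illustrative):

def encode_ts(ts: int) -> str:
    return format(ts, "016x")            # 8 bytes of timestamp as 16 hex digits, MSB first

assert encode_ts(9) < encode_ts(10)      # string order matches numeric order
assert encode_ts(255) < encode_ts(256)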
#MDBE17
MONGODB USES AN “AS OF” TIMESTAMP
• Updates now include a timestamp transaction ID
‒ Timestamp tracked in WiredTiger’s update
‒ Smaller is better, since the timestamp is a significant overhead for small updates
• Commit “as of” a timestamp
‒ Set during the update or later, at transaction commit
• Read “as of” a timestamp
‒ Set at transaction begin
‒ Point-in-time reads: largest timestamp less than or equal to value
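A sketch of the two sides of "as of", continuing with the session and cursor from the earlier WiredTiger example and the encode_ts() helper above (configuration strings as in WiredTiger's timestamp API):

# Commit "as of" a timestamp: set it during the transaction (or at commit).
session.begin_transaction()
cursor.set_key("fruit"); cursor.set_value("pear"); cursor.update()
session.timestamp_transaction("commit_timestamp=" + encode_ts(100))
session.commit_transaction()

# Read "as of" a timestamp: sees the newest update with timestamp <= 90,
# so not the change committed as of 100.
session.begin_transaction("read_timestamp=" + encode_ts(90))
cursor.set_key("fruit")
cursor.search()
print(cursor.get_value())
session.rollback_transaction()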
#MDBE17
MONGODB SETS THE “OLDEST” TIMESTAMP
• Limit future reads
• The point at which WiredTiger can discard history
• Cannot go backward, must be updated frequently
#MDBE17
MONGODB SETS THE “STABLE” TIMESTAMP
• Limits future durability rollbacks
‒ Imagine an election where the primary hasn’t seen a committed update
• WiredTiger writes checkpoints at the stable timestamp
‒ The storage engine can’t write what might be rolled back
• Cannot go backward, must be updated frequently
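A sketch of how the layer above advances both global timestamps through the connection handle (assuming WiredTiger's connection-level set_timestamp call and the encode_ts() helper above):

# Nothing may read before 80, so history older than 80 can be discarded.
conn.set_timestamp("oldest_timestamp=" + encode_ts(80))

# Checkpoints include nothing newer than 95, so on-disk state never contains
# updates that replication might still roll back.
conn.set_timestamp("stable_timestamp=" + encode_ts(95))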
#MDBE17
READ CONCERN MAJORITY FEATURE
• In 3.4 implemented with WiredTiger named snapshots
‒ Every write a named snapshot
‒ Heavy-weight, interacts directly with WiredTiger transaction subsystem
• In 3.6 implemented with read “as of”
‒ Light-weight and fast
‒ Configuration is now a no-op, “always on”
#MDBE17
OPLOG IMPROVEMENTS
• MongoDB does replication by copying its “journal”
‒ Oplog is bulk-loaded on secondaries
‒ Oplog is loaded out-of-order for performance
• Scanning cursor has strict visibility order requirements
‒ No skipping records
‒ No updates visible after the oldest uncommitted update
#MDBE17
OPLOG IMPROVEMENTS
• In 3.4, implemented using WiredTiger named snapshots
• JIRA ticket:
“Under heavy insert load on a 2-node replica set, WiredTiger eviction
appears to hang on the secondary.”
• In 3.6, implemented using timestamps
#MDBE17
A NEW TRANSACTIONAL MODEL SUMMARY
• Significant storage engine changes
• Enhancing transactional consistency for new applications
• Features and improvements in MongoDB 3.6
‒ Retryable writes
‒ Safe secondary reads
‒ Significantly improved performance
#MDBE17
keith.bostic@mongodb.com
QUESTIONS?

Editor's Notes

  • #3 Member of the storage group: storage is part of the server development group. Server is the core MongoDB database product.
  • #4 Storage underlies the technology and features you’ll hear about today. Storage defines durability and consistency (isolation, visibility); as a consequence, storage owns concurrency. Pluggable architecture: per-workload storage engines. Default: acceptable behavior for all workloads.
  • #9 Engineering process discussion. WT began as a separate product and was integrated in 2014 as part of MongoDB 3.0. MMAPv1: lots of collections, fast in-place updates.
  • #10 Parallel effort at MongoDB to measure performance. FTDC data is heavily compressed where measurement doesn’t change. Lots of small collections make it hard to spot pages to discard, especially when few are hot. Assumed uniformity across large objects, found skewed access across tiny objects.
  • #11 Single-node overview. The layered diagram shows caching at each layer: MongoDB session/cursor cache (next area of work), WT cursor cache, WT data handle cache, WT file handle cache. 10,000 connections * 10,000 tables, add indexes, and that’s a multiplier; 1M files is problematic for some filesystems.
  • #12 Design a workload for tuning: there are too many moving parts in MongoDB. Risks losing the problem. In an ideal world, increasing the number of collections would make no difference.
  • #13 We didn’t know we were making the problem worse. 3.4 degrades much more quickly than 3.2: logarithmic scale!
  • #14 Data structures assumed we’d never have lots of collections or frequently change them
  • #16 Modern eviction algorithms don’t have any kind of real queue, it’s too slow. Pages reside elsewhere, and there’s information that lets you know the “age”. Assumes uniformity of the data across collections. Multi-tenant workloads are skewed. Once idle trees empty out, even looking at them is a waste of time.
  • #17 Still nowhere near 1M, but at least back to where we were in 3.2
  • #18 Obvious data structures and tuning changes. Checkpoints hold exclusive locks and slow everything down.
  • #21 Note the x axis scale change, we can now see the 1M target
  • #22 Someday a middle ground, we’ll need to create subdirectories for data Security becomes more interesting: data is co-resident.
  • #23 Architecture with grouped collections: revisits the main architecture diagram with the changes for grouped collections. Assuming a single database here, the cursor cache size is now based only on the number of connections, plus we have ways to limit how big it gets in practice.
  • #24 Grouped collections get us to 1M with < 10ms average latency. We get to about 250K with < 10ms average latency without changing the API, 800K at < 30ms.
  • #25 Tipping point (greater than 50ms). This graph shows how many collections we can support: more is better.
  • #28 It’s what you “know” that just isn’t true... It’s all about changing our assumptions to handle more workloads.
  • #29 Because the tuning efforts were successful (800,000 collections), reaching 1M is less important. Additionally, there are significant tuning, space and application-API issues with respect to grouped collections: for example, compaction, collection drop, security and so on. Solving without a new feature API is better. What if you change your mind later and want to split the two files up?
  • #33 Define the terms and get everybody on the same page Everybody offers a version of ACID, including MongoDB Differences generally around relaxing consistency guarantees
  • #34 MongoDB’s traditional applications have CAP tradeoffs MongoDB’s original design chose partition tolerance and availability over consistency (JD ???) Extending to support more consistency rules.
  • #35 MongoDB supports ACID, but it only applies to individual write operations. “write operations” is a high-level concept, indexes are kept consistent. In 3.4: linearizable reads: write the primary, force read from a secondary to block until it sees the write.
  • #36 Application developers want to shift complexity into the database. Application developer skill set not suited to building database applications.
  • #37 Golden Rule: may not impact the performance of applications not using transactions.
  • #39 safe secondary reads: automatically avoiding dirty reads (?) global point-in-time reads: applications read as of a single point in the causal chain retryable writes: retry automatically so applications don’t have to manage write collisions multi-document transactions: modify multiple documents/collections atomically
  • #41 Storage engine semantics: a relatively standard single-node model. Two types of durability: checkpoint and journalling (standard write-ahead logging). Log records are redo-only; the entire change record must fit into memory.
  • #42 Cursors iterate, remove, and do the standard CRUD operations. Key-value store: MongoDB maps documents and indexes onto it.
  • #44 Updates and inserts: the transaction ID is an identifier into a table of information.
  • #45 Inserts are single entries, with lists of updates. When a cursor encounters a key, compare the cursor and key/update transaction IDs.
  • #47 --enableMajorityReadConcern: visible data must have been written to a majority of the replica set
  • #48 Allow the distributed layer to define/order transactions. 8B is fast, and lockless on 64b machines; it will grow to incorporate cluster-wide clock information. MSB-first so memcmp ordering works. MongoDB and WiredTiger transactions co-exist: applications can mix-and-match where threads don’t care about timestamps, but you can get into trouble: operations on an item must be in timestamp order.
  • #49 Commit timestamps must be ahead of any read timestamp. Setting a read timestamp forces snapshot isolation at that timestamp.
  • #50 Oldest possible reader, including replicated state. To avoid caching infinite updates, the read “as of” timestamp must move forward. Moving complexity into the storage engine, particularly around caching.
  • #51 The distributed engine can roll back locally “durable” events. Write-concern majority: one node might have seen a committed event, but if the master never saw it, it is rolled back. Generally expected that a well-behaved replica set won’t fall behind. Locally crash-safe because the checkpoint happens at the stable event.
  • #52 Benefits in 3.6: complete a read query using the same specified snapshot for its entirety, on a replica set. Every write is a WT “snapshot”; when a secondary receives a majority read request, it finds a majority-confirmed snapshot to use; required all requests over a single socket.
  • #53 oplog is the source of truth
  • #54 Snapshots pin memory: two nodes running on a 24-cpu box, 32GB RAM, pushing 16 threads with vectored writes of 100 tiny documents at a time. The oplog was created for capped collections; it took on a replication role since it looks a lot like a shared log.