MongoDB: A New Transactional Model
Keith Bostic
keith.bostic@mongodb.com
Distinguished Engineer, MongoDB Inc.
Welcome!
Review & Motivations
Relational vs document data modeling
Goals: performance and complexity
Goals: application domains, programmer models
Goals: checkboxes and vendor lock-in
Application developers are the focus
But not the only audience
“ACID transactions are a key capability for business critical
transactional systems, specifically around commerce processing.
No other database has both the power of NoSQL and cross
collection ACID transaction support. This combination will make it
easy for developers to write mission critical applications leveraging
the power of MongoDB.”
-- Dharmesh Panchmatia,
Director of E-commerce, Cisco Systems
ACID
Transactions: Acid
Transactions: aCid
Consistency, availability, partition tolerance
Transactions: acId
Transactions: aciD
MongoDB World talk!
Transactions and Durability: Putting the “D” in ACID
-- Sue LoVerso, Senior Engineer, MongoDB
The Path
Multi-year, all-hands company effort
• The storage layer,
• Sharding architecture,
• Introducing a global logical clock,
• Replication consensus protocol,
• Metadata management,
… and that’s just part of the list!
Not forgetting the driver teams, documentation and education.
MongoDB 2.6: MMAPv1 with ACID
[Timeline: 2.6]
[Diagram: replica set with one primary (P) and three secondaries (S)]
The “storage engine” is part of the MongoDB database server.
It’s where the actual data lives.
Detour: the storage engine
networking, sharding, replication, drivers
analytics, middleware, query optimizer
storage engine
[Diagram label: Single Node]
MongoDB’s pluggable storage architecture
• MMAPv1
• RocksDB (Facebook)
• TokuMX (TokuTek)
• WiredTiger
MongoDB 3.0: a new storage engine
[Timeline: 2.6, 3.0]
[Diagram: B-tree with internal pages holding key ranges (B-D, E-I, J-N) and leaf pages holding keys and values (E, F, G)]
Write-heavy workloads
Document level locking
Compaction
Encryption
Compression
In-Memory
MongoDB 3.0: transactional features
[Timeline: 2.6, 3.0]
MongoDB 3.2
[Diagram: replica set with one primary (P) and three secondaries (S)]
w: “majority”
readConcern
[Timeline: 2.6, 3.0, 3.2]
MongoDB 3.4
[Timeline: 2.6, 3.0, 3.2, 3.4]
MongoDB 3.6: transactional features
[Timeline: 2.6, 3.0, 3.2, 3.4, 3.6]
[Diagram: replica set with one primary (P) and three secondaries (S)]
logical sessions
global clock
… leads to causal consistency
MongoDB 3.6: enhanced consistency
[Timeline: 2.6, 3.0, 3.2, 3.4, 3.6]
[Diagram: replica set with one primary (P) and three secondaries (S)]
secondary reads
retryable writes
read concern majority
MongoDB University video
Implementation of Cluster-Wide Causal Consistency in MongoDB
-- Misha Tyulenev, Engineer, MongoDB
MongoDB 4.0: multi-document transactions
[Timeline: 2.6, 3.0, 3.2, 3.4, 3.6, 4.0]
[Diagram: replica set with one primary (P) and three secondaries (S)]
MongoDB 4.0: prepared transactions
[Timeline: 2.6, 3.0, 3.2, 3.4, 3.6, 4.0]
Phase One: each participant is asked “Commit?” and answers “Yes”
Phase Two: each participant is told “Commit!”
MongoDB 4.X
[Timeline: 2.6, 3.0, 3.2, 3.4, 3.6, 4.0, 4.X]
[Diagram: two replica sets, each with a primary (P) and secondaries (S)]
MongoDB World talk!
What’s Next? The Path to Sharded Transactions
-- Andy Schwerin, VP, Engineering, MongoDB
Why is this work hard?
Reason #1: single-node ordering (WiredTiger) vs. group ordering (MongoDB)
Replication example
[Diagram: four copies of the oplog, each ending ... 37 38 39; labels: Primary Oplog, Secondary reads]
Multi-threaded secondary oplog application
Primary oplog order: ... 37 38 39 40 41 42
Secondary apply order: 40 39 41 38
“People assume that time is a strict progression
of cause to effect, but actually, from a non-linear,
non-subjective point of view, it is more like a big
ball of wibbly-wobbly timey-wimey … stuff.”
-- Doctor Who
Reason #2: increasing entanglement
Getting the right answer requires communication.
Timestamps
MongoDB’s timestamp
... 38 39 40 41 42
ACID for a set of systems
durability based on replication
WiredTiger’s transaction ID
... 38 39 40 41 42
ACID for a single node
durability based on checkpoints and journaling
MongoDB timestamp overrides WiredTiger ID
64-bit counter
fast comparisons
strictly increasing
MSB hex API
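One way to picture the properties listed above is the sketch below. It assumes the BSON Timestamp layout (seconds in the high 32 bits, an increment in the low 32 bits) and reads “MSB hex API” as “timestamps are passed as hex strings, most significant digit first”. The helper names `pack_timestamp` and `to_hex` are hypothetical and are not part of any MongoDB or WiredTiger API.

```python
def pack_timestamp(seconds: int, increment: int) -> int:
    """Pack (seconds, increment) into one 64-bit integer, seconds in the high bits."""
    if not (0 <= seconds < 2**32 and 0 <= increment < 2**32):
        raise ValueError("both fields must fit in 32 bits")
    return (seconds << 32) | increment


def to_hex(ts: int) -> str:
    """Encode most significant digit first, zero-padded to 16 hex digits."""
    return format(ts, "016x")


# Illustrative values only: three timestamps in increasing order.
a = pack_timestamp(1_530_000_000, 7)
b = pack_timestamp(1_530_000_000, 8)
c = pack_timestamp(1_530_000_001, 1)

# Ordering is a single integer comparison.
print(a < b < c)
# Fixed-width hex strings sort in the same order as the integers.
print(to_hex(a) < to_hex(b) < to_hex(c))
```

Because the seconds occupy the high bits, comparing two timestamps never needs to look at the fields separately.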
MongoDB secondary example
Set the timestamp at update or commit
Primary oplog order: ... 37 38 39 40 41 42
Secondary timestamp order: ... 37 38 39 40 41 42
Timestamp stored with each MVCC update
[Diagram: key FRUIT with a chain of updates (mango, apple, banana), each tagged with a transaction ID and a timestamp]
Overlapping reads with oplog application
Secondary timestamp order: ... 37 38 39 40 41 42
[Diagram: reads at Timestamp #1 and Timestamp #2 overlap with oplog application]
3.6: Blocking
4.0: Parallel
Queries can be “as of” a timestamp
set at transaction begin
reads see the update with the largest timestamp less than or equal to (LTE) the read timestamp
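The “as of” read rule can be sketched in a few lines. This is a minimal illustration, not WiredTiger's data structure: the class `VersionedKey` and its methods are hypothetical, and the timestamps and fruit values are borrowed from the slides only as sample data.

```python
from bisect import bisect_right


class VersionedKey:
    """All versions of one key, kept sorted by timestamp."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def write(self, ts, value):
        # Insert in timestamp order, so updates may be applied out of order.
        i = bisect_right(self._timestamps, ts)
        self._timestamps.insert(i, ts)
        self._values.insert(i, value)

    def read_as_of(self, read_ts):
        # Pick the version with the largest timestamp <= read_ts, if any.
        i = bisect_right(self._timestamps, read_ts)
        return self._values[i - 1] if i else None


fruit = VersionedKey()
fruit.write(40, "banana")  # applied first, but logically last
fruit.write(38, "mango")
fruit.write(39, "apple")

print(fruit.read_as_of(39))
```

Because versions are ordered by timestamp rather than by arrival, a reader at timestamp 39 gets the same answer regardless of the order in which the updates were applied.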
The “oldest” timestamp
Durability based on the “stable” timestamp
Replication rollback for a single server
Primary oplog: ... 37 38 39 40 41 42
Corrected (secondary) oplog: ... 37 38 39 117 118 119 (entries after 39 are replacements)
The “stable” timestamp
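A rough sketch of the rollback shown above, assuming the stable timestamp is 39 (the last entry the two oplogs share). The function `rollback_to_stable` here is a hypothetical illustration over a plain list, not MongoDB's or WiredTiger's recovery code.

```python
def rollback_to_stable(oplog, stable_ts):
    """Keep only the entries at or before the stable timestamp."""
    return [(ts, op) for ts, op in oplog if ts <= stable_ts]


# Local oplog of the server being rolled back (entries 37 through 42).
local = [(ts, f"op-{ts}") for ts in range(37, 43)]

# Assumption: 39 is the stable timestamp for this example.
stable_ts = 39

# Entries fetched afterwards to replace the discarded ones.
replacements = [(ts, f"op-{ts}") for ts in (117, 118, 119)]

corrected = rollback_to_stable(local, stable_ts) + replacements
print([ts for ts, _ in corrected])
```

The point of the sketch is that discarding by timestamp is a single cut: everything newer than the stable timestamp goes, and the server then rolls forward with the replacement entries.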
Sydney & New York
Thank you!
WiredTiger and transactions
• Per-thread “session” structure embodies a transaction
• Session structure references data-sources: cursors
• Transactions are implicit or explicit (scope determined by the application)
• session.begin_transaction()
• session.commit_transaction()
• session.rollback_transaction() // Abort
• Transactions could always span objects and data-sources
Sample key-value store CRUD transaction
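// Open a cursor on the data source, then start an explicit transaction.
cursor = session.open_cursor("some collection");
session.begin_transaction();
// Insert the key "fruit" with the value "apple".
cursor.set_key("fruit");
cursor.set_value("apple");
cursor.insert();
// Update the same key to "banana" within the same transaction.
cursor.set_key("fruit");
cursor.set_value("banana");
cursor.update();
// Commit both changes together, then release the cursor.
session.commit_transaction();
cursor.close();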
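For readers who want something runnable, the same begin / insert / update / commit flow can be imitated with a toy in-memory store. This is only a sketch of the transactional pattern: the `Store` and `Transaction` classes are hypothetical and do not use or mirror the real WiredTiger API.

```python
class Transaction:
    """Buffers writes until commit; rollback simply discards the buffer."""

    def __init__(self, store):
        self._store = store
        self._pending = {}

    def _exists(self, key):
        return key in self._pending or key in self._store.data

    def insert(self, key, value):
        if self._exists(key):
            raise KeyError(f"duplicate key: {key}")
        self._pending[key] = value

    def update(self, key, value):
        if not self._exists(key):
            raise KeyError(f"no such key: {key}")
        self._pending[key] = value

    def commit(self):
        # All buffered changes become visible together.
        self._store.data.update(self._pending)
        self._pending = {}

    def rollback(self):
        self._pending = {}


class Store:
    def __init__(self):
        self.data = {}

    def begin_transaction(self):
        return Transaction(self)


store = Store()
txn = store.begin_transaction()
txn.insert("fruit", "apple")
txn.update("fruit", "banana")
txn.commit()
print(store.data)
```

Nothing is visible in `store.data` until `commit()` runs, which is the property the slide's example relies on.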
MongoDB on top of WiredTiger
• MongoDB maps the document model on top of this key/value model
• For example
• A single document change involves indexes
• Multiple cursors updating a collection and its indexes
• Glue layer below the pluggable storage API
• 14K lines of code
WiredTiger transaction information
• 64-bit transaction ID
• Isolation level and snapshot information
• Read-uncommitted: everything
• Read-committed: committed updates, including those committed after start
• Snapshot: committed updates as of start
• Linked list of change records
• For logging on commit, discard on rollback
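The three isolation levels above can be compared with a small visibility check. This is a simplified model under stated assumptions: each update carries a commit sequence number (or `None` while uncommitted), and a reader remembers the sequence number at which it started. The names `Update`, `visible`, and `read` are hypothetical, not WiredTiger identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Update:
    value: str
    commit_seq: Optional[int]  # None while the writing transaction is uncommitted


def visible(update, isolation, start_seq):
    if isolation == "read-uncommitted":
        return True  # everything, committed or not
    if update.commit_seq is None:
        return False  # both remaining levels hide uncommitted updates
    if isolation == "read-committed":
        return True  # any committed update, even one committed after start
    if isolation == "snapshot":
        return update.commit_seq <= start_seq  # committed as of start
    raise ValueError(f"unknown isolation level: {isolation}")


def read(chain, isolation, start_seq):
    """chain is ordered newest first; return the first visible value."""
    for update in chain:
        if visible(update, isolation, start_seq):
            return update.value
    return None


# Newest first: banana is uncommitted, apple committed at 12, mango at 8.
chain = [Update("banana", None), Update("apple", 12), Update("mango", 8)]

for level in ("read-uncommitted", "read-committed", "snapshot"):
    print(level, read(chain, level, start_seq=10))
```

A reader that started at sequence 10 sees a different value under each level, which is the whole difference between them.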
MongoDB 4.0
• Multi-document transactions in a replica set
• Storage engine support for prepared transactions
• Replica set point-in-time reads
• Recovery to a timestamp
• Performance enhancement so operations are only rolled forward
Getting the order right
Oplog: ... 37 38 39
WiredTiger operations: ... 37 39 38
The storage problem is concurrency control
• Engines support lots of concurrency to increase throughput
• Traditionally, the storage layer decides how operations interleave
Well, there’s durability as well
• The storage layer is also responsible for single-node crash recovery
If the storage layer owns concurrency and durability
… the rest of the system can ignore the details
Future use of timestamps: 4.0++
• Global snapshot reads
• 2-phase commit for transactions that span shards
• Unified oplog and WiredTiger journal
• Reducing pain of write amplification
• Originally 4 data copies (oplog and collection, plus 2 x journal)
Background: majority commits
[Diagram: four copies of the oplog, each ending ... 37 38 39; labels: Primary Oplog, Secondary reads]
WiredTiger
• Multi-row and multi-table transactions
• Document level locking
• Multi-version concurrency control (MVCC)
[Diagram: key FRUIT with versions MANGO, APPLE, BANANA]
Future use of timestamps: 4.0++
• Global snapshot reads
• 2-phase commit for transactions that span shards
• Rollback all changes if any shard cannot commit
• Make commits visible at a common timestamp across shards
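The two rules above (roll back everywhere if any shard cannot commit, and commit everywhere at one timestamp) can be sketched as follows. This is a toy model, not MongoDB's protocol: the `Shard` class, its `can_commit` flag, and the `two_phase_commit` function are all hypothetical.

```python
class Shard:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # stand-in for "prepare succeeded"
        self.pending = {}
        self.committed = {}  # key -> (commit_ts, value)

    def prepare(self, writes):
        self.pending = dict(writes)
        return self.can_commit

    def commit(self, commit_ts):
        for key, value in self.pending.items():
            self.committed[key] = (commit_ts, value)
        self.pending = {}

    def rollback(self):
        self.pending = {}


def two_phase_commit(writes_by_shard, commit_ts):
    """writes_by_shard is a list of (shard, writes) pairs."""
    # Phase one: ask every shard to prepare; stop at the first refusal.
    prepared = []
    for shard, writes in writes_by_shard:
        prepared.append(shard)
        if not shard.prepare(writes):
            for s in prepared:
                s.rollback()
            return False
    # Phase two: every shard commits at the same timestamp.
    for shard, _ in writes_by_shard:
        shard.commit(commit_ts)
    return True


a, b = Shard("a"), Shard("b")
ok = two_phase_commit([(a, {"x": 1}), (b, {"y": 2})], commit_ts=50)
print(ok, a.committed, b.committed)
```

Using one commit timestamp for all shards means a reader “as of” any timestamp sees either all of the transaction's writes or none of them.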

MongoDB World 2018: Building a New Transactional Model