MongoDB WiredTiger Internals: Journey To Transactions
Presented by
Manosh Malai
Senior Devops and DB/Cloud Consultant @ Mydbops
Ranjith A
DBA @ Mydbops
www.mydbops.com info@mydbops.com
About Me
Manosh Malai
Senior Devops and DB/Cloud Consultant @ Mydbops
mmalai@mydbops.com
@ManoshMalai
Ranjith A
DBA @ Mydbops
ranjith@mydbops.com
Mydbops at a Glance
● Founded in 2015, HQ in Bangalore India, 25 Employees.
● Mydbops focuses on database consulting, with core specialization in MySQL and MongoDB
administration and support.
● Mydbops was created with the motto of developing a DevOps model for database
administration.
● We help organisations scale MySQL/MongoDB and implement advanced technologies in
MySQL/MongoDB.
Happy Customers
Agenda
● Intro
● MongoDB Overview
● WiredTiger Architecture
● WiredTiger Session/Cursors/Schema(Internals)
● Internal Journey of MongoDB Transaction
● Q/A
Intro
DB-Engines Ranking 2019
What is NoSQL
Flexible Schemas to build Modern Applications.
MongoDB Prominent Features
FULL TEXT SEARCH
AGGREGATION FRAMEWORK and MAP REDUCE
TRANSACTION SUPPORT
BSON STORAGE FORMAT
BUILT-IN REPLICATION/SHARDING SUPPORT
GRIDFS
MongoDB Overview
MongoDB Architecture
● Applications: IoT Sensor Data, Content Repo, Ad Service, Real-Time Analytics, Mobile
● MongoDB Query Language
● MongoDB Data Model
● Storage engines: WiredTiger, MMAPv1, In-Memory, Encrypted
● Security, Management
Horizontally Scalable (Sharding)
● Sharding Types: Range, Tag-Aware, Hashed
● Increase or Decrease Capacity as you go
● Automatic Balancing
● Shard 1, Shard 2, Shard 3, Shard 4, ... Shard N
Vertically Scalable (Replica Set)
● Replica Set: Primary, Secondary, Secondary
● Factors:
○ RAM
○ Disk
○ CPU
○ Network
● Redundancy
● High Availability
WiredTiger Architecture
What’s Special about WiredTiger?
● Transactions use optimistic concurrency control algorithms
● Document Level Locking
● Snapshot and Checkpoint
● Write-ahead transaction log for the journal
● Compression
● Online compaction
● LSM and B-tree Indexing
[Diagram: WiredTiger architecture]
● APIs: Python API, C API, Java API
● Schema & Cursor
● Row Storage, Column Storage
● Cache, Block Management
● Transactions, Snapshot
● Page read/write, WAL
● On disk: Database Files, Journal
How MongoDB Is Glued to WiredTiger
● MongoDB 3.0 (March 2015) introduced an internal storage API, allowing new storage engines to
be added to MongoDB.
● From MongoDB 3.0, the WiredTiger storage engine is available as an option.
● From MongoDB 3.2 (Dec 2015), WiredTiger is the default storage engine.
● WiredTiger supports a C API, a Java API and a Python API.
● The MongoDB storage engine layer uses the C API to communicate with WiredTiger.
Further-details:
https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/README.md
http://source.wiredtiger.com/3.1.0/struct_w_t___c_u_r_s_o_r.html#details
WiredTiger (Internals)
Sessions/Cursors/Schema
WiredTiger Sessions
mongo Shell/mongo drivers
mongod
Session
Cursor
Schema & Cursor
db.createCollection("mydbops")
db.mydbops.find( { name: "MMalai" } )
db.mydbops.deleteOne( { name: "MMalai" } )
db.mydbops.stats()
db.mydbops.drop()
db.mydbops.explain().find( { name: "manosh" } )
Cursor == WT_CURSOR
● WT_CONNECTION, WT_SESSION and WT_CURSOR are the classes used to access and manage data.
● WT_CURSOR handles all CRUD (create, read, update, delete) operations internally.
● WT_CURSOR has member functions for the CRUD operations, for example:
○ WT_CURSOR::reset
○ WT_CURSOR::search
○ WT_CURSOR::search_near
○ WT_CURSOR::remove
● A WT_SESSION is opened on a specific connection; applications typically open one connection per process and one session per thread.
● WT_SESSION and WT_CURSOR methods are not thread safe.
● WT_CONNECTION methods are thread safe.
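Thread Safe
/*
 * This appears to be adapted from the WiredTiger example ex_thread.c:
 * example_setup() and error_check() are helpers from the examples' test_util.h,
 * and scan_thread() and NUM_THREADS are defined elsewhere in that example.
 */
static const char *home;

int main(int argc, char *argv[])
{
    WT_CONNECTION *conn;
    WT_SESSION *session;
    WT_CURSOR *cursor;
    wt_thread_t threads[NUM_THREADS];
    int i;

    home = example_setup(argc, argv);

    /* One connection for the whole process. */
    error_check(wiredtiger_open(home, NULL, "create", &conn));
    error_check(conn->open_session(conn, NULL, NULL, &session));
    error_check(session->create(session, "table:access",
        "key_format=S,value_format=S"));
    error_check(session->open_cursor(
        session, "table:access", NULL, "overwrite", &cursor));
    cursor->set_key(cursor, "key1");
    cursor->set_value(cursor, "value1");
    error_check(cursor->insert(cursor));

    /* Sessions are not thread safe: close this one before starting the threads. */
    error_check(session->close(session, NULL));

    /* Only the thread-safe connection handle is passed to the threads. */
    for (i = 0; i < NUM_THREADS; i++)
        error_check(
            __wt_thread_create(NULL, &threads[i], scan_thread, conn));
    for (i = 0; i < NUM_THREADS; i++)
        error_check(__wt_thread_join(NULL, threads[i]));

    error_check(conn->close(conn, NULL));
    return (EXIT_SUCCESS);
}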
Internal Journey of MongoDB
Transactions
WiredTiger Transaction
Three Pillars That Help Implement Transactions
Journal
MVCC
Snapshot
Transaction
● One transaction per session at a time
○ open
○ close
○ commit
○ rollback
● Up to and including MongoDB 3.6, transactions could only be emulated by implementing a two-phase commit.
● WiredTiger supports three isolation levels:
○ read-uncommitted
○ read-committed
○ snapshot
● Transactions are durable only once they are part of a checkpoint (unless the journal is enabled).
Cont..
● MongoDB 3.6 added features that transactions build on:
○ Logical Sessions
○ Global Clock
○ Retryable writes
● read-committed is the default isolation level in WiredTiger 3.1.0, but MongoDB's default is snapshot.
● MongoDB 4.0 introduced these commands:
○ To start a multi-statement transaction: Session.startTransaction()
○ To commit a transaction: Session.commitTransaction()
○ To roll back a transaction: Session.abortTransaction()
Code
ret = session->open_cursor(session, "table:mytable", NULL, NULL, &cursor);
ret = session->begin_transaction(session, NULL);
cursor->set_key(cursor, "key");
cursor->set_value(cursor, "value");
switch (ret = cursor->update(cursor)) {
case 0: /* Update success */
    ret = session->commit_transaction(session, NULL);
    /*
     * If commit_transaction succeeds, cursors remain positioned; if
     * commit_transaction fails, the transaction was rolled back and
     * all cursors are reset.
     */
    break;
case WT_ROLLBACK: /* Update conflict */
default: /* Other error */
    ret = session->rollback_transaction(session, NULL);
    /* The rollback_transaction call resets all cursors. */
    break;
}
MVCC Principles in WT Transaction
● MVCC in WiredTiger is based on a linked list.
● Each unit of the linked list stores:
○ the transaction ID of the transaction that made the modification
○ a timestamp (from 3.6)
○ the modified value
● Every time the value is modified, the new unit is added at the head of the list.
● Every session sees its own consistent snapshot of the database.
● A change made by one session is not seen by any other session until its transaction is committed.
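The list described above can be sketched in C. This is a minimal illustration, not WiredTiger's actual code: the names mvcc_update, mvcc_prepend, mvcc_length and mvcc_free are hypothetical, and the value is simplified to an int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for one unit of the MVCC list. */
typedef struct mvcc_update {
    uint64_t txn_id;          /* ID of the transaction that wrote this version */
    uint64_t timestamp;       /* timestamp of the change (0 if unused) */
    int value;                /* the modified value */
    struct mvcc_update *next; /* next older version */
} mvcc_update;

/* Add a new version at the head of the list. Returns 0 on success, -1 on allocation failure. */
static int
mvcc_prepend(mvcc_update **headp, uint64_t txn_id, uint64_t timestamp, int value)
{
    mvcc_update *upd;

    assert(headp != NULL);
    if ((upd = malloc(sizeof(*upd))) == NULL)
        return -1;
    upd->txn_id = txn_id;
    upd->timestamp = timestamp;
    upd->value = value;
    upd->next = *headp; /* older versions stay reachable behind the new head */
    *headp = upd;
    return 0;
}

/* Count the versions currently kept in the list. */
static size_t
mvcc_length(const mvcc_update *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Release every version in the list. */
static void
mvcc_free(mvcc_update *head)
{
    while (head != NULL) {
        mvcc_update *next = head->next;
        free(head);
        head = next;
    }
}
```

Because new versions go to the head, a reader walking from the head meets the newest version first and can stop at the first one it is allowed to see.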
MVCC Workflow
Concurrent transactions on one value (MVCC list shown from tail to head):
● Initial value: 10
● T0: read transaction
● T1: committed write, changed the value to 11
● T2: rolled-back write, changed the value to 12; marked as obsolete
● T3: uncommitted write, changed the value to 14
● T4: read transaction
WiredTiger Transaction Snapshot
Snapshot at the moment of T4: snap_min = T1, snap_max = T4
● (0, T1): interval of transactions already committed
● [T1, T4]: interval of transactions being executed (T1 committed, T2 rolled back, T3 uncommitted)
● T5 and later: interval of transactions still to be executed
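A visibility check against such a snapshot might look like the sketch below. The names snapshot and txn_visible are hypothetical, and the exact boundary handling (IDs below snap_min visible, IDs at or above snap_max invisible) is an assumption made for illustration. Rolled-back updates are assumed to be skipped separately because they are marked obsolete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified snapshot object. */
typedef struct {
    uint64_t snap_min;          /* smallest transaction ID in the interval being executed */
    uint64_t snap_max;          /* IDs at or above this started after the snapshot */
    const uint64_t *snap_array; /* IDs that were running and uncommitted at snapshot time */
    size_t snap_count;          /* number of entries in snap_array */
} snapshot;

/* Return true if a change made by txn_id is visible to a reader holding this snapshot. */
static bool
txn_visible(const snapshot *snap, uint64_t txn_id)
{
    size_t i;

    assert(snap != NULL);
    if (txn_id < snap->snap_min)
        return true;  /* finished before the snapshot was taken */
    if (txn_id >= snap->snap_max)
        return false; /* started after the snapshot was taken */
    for (i = 0; i < snap->snap_count; i++)
        if (snap->snap_array[i] == txn_id)
            return false; /* still uncommitted at snapshot time */
    return true;
}
```

With snap_min = 1, snap_max = 4 and snap_array = {3}, the committed T1 is visible and the uncommitted T3 is not, so a reader walking the list from the example above would skip 14 and the obsolete 12 and see 11.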
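Transaction Execution Process
[Diagram: a new update (OP) passes through the Wt_transaction object in numbered steps 1 to 5,
touching the Operation Array, the MVCC List and the Journal Buffer.]
Wt_transaction fields: Transaction id, Snapshot_object, Operation array, Redo log buffer, State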
Wt_transaction Data Structure
Wt_transaction {
  Transaction_id: The globally unique ID of this transaction; it serves as the version number of the
  data the transaction modifies.
  Snapshot_object: The set of transactions that are executing and not yet committed when the current
  transaction starts or when the operation runs; used for transaction isolation.
  Operation_array: The list of operations performed in this transaction; used for transaction rollback.
  Redo_log_buf: The operation log buffer; used for persistence after the transaction commits.
  State: The current state of the transaction.
}
Wt_mvcc and Snapshot_object Data Structures
Wt_mvcc {
  Transaction_id: ID of the transaction that made this modification
  Value: the modified value
}
Snapshot_object {
  Snap_min: <Min Transaction number>,
  Snap_max: <Max Transaction number>,
  Snap_array: IDs of the transactions that were executing and uncommitted when the snapshot was taken;
  a modification made by any transaction that appears in snap_array is not visible,
}
Transaction Flow
1. Create a value unit object (update) for the MVCC list.
2. Check the transaction object's transaction ID and state to determine whether a transaction ID has
already been assigned to this transaction. If not, assign one and set the transaction state to
HAS_TXN_ID.
3. Set this transaction's ID on the update unit as its MVCC version number.
4. Create an operation object, point its value pointer to the update, and add the operation to the
operation_array of the transaction object.
5. Add the update unit at the head of the MVCC list.
6. Write a redo log record to the redo_log_buf of this transaction object.
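The six steps can be traced in a small C sketch. Everything here is a simplified assumption for illustration (the types update, txn_op, redo_rec and txn, the fixed MAX_OPS capacity, the global ID counter); it is not WiredTiger's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_OPS 16 /* hypothetical fixed capacity to keep the sketch short */

typedef struct update {
    uint64_t txn_id;     /* MVCC version number */
    int value;           /* the modified value */
    struct update *next; /* next older version */
} update;

typedef struct { update *upd; } txn_op;                    /* one entry of operation_array */
typedef struct { uint64_t txn_id; int value; } redo_rec;   /* one redo log record */

enum { TXN_NO_ID = 0, TXN_HAS_TXN_ID = 1 };

typedef struct {
    uint64_t id;           /* transaction ID, valid once state is TXN_HAS_TXN_ID */
    int state;
    txn_op ops[MAX_OPS];   /* operation_array, used for rollback */
    size_t op_count;
    redo_rec redo[MAX_OPS]; /* redo_log_buf, persisted after commit */
    size_t redo_count;
} txn;

static uint64_t next_txn_id = 1; /* hypothetical global ID allocator */

/* Apply one modification inside transaction t. Returns 0 on success, -1 on failure. */
static int
txn_modify(txn *t, update **list_head, int value)
{
    update *upd;

    assert(t != NULL && list_head != NULL);
    if (t->op_count == MAX_OPS)
        return -1;
    /* 1. Create the update unit. */
    if ((upd = malloc(sizeof(*upd))) == NULL)
        return -1;
    upd->value = value;
    /* 2. Assign a transaction ID on first use. */
    if (t->state != TXN_HAS_TXN_ID) {
        t->id = next_txn_id++;
        t->state = TXN_HAS_TXN_ID;
    }
    /* 3. Stamp the update with the transaction ID. */
    upd->txn_id = t->id;
    /* 4. Record the operation so it can be rolled back. */
    t->ops[t->op_count++].upd = upd;
    /* 5. Put the update at the head of the MVCC list. */
    upd->next = *list_head;
    *list_head = upd;
    /* 6. Append a redo record for persistence after commit. */
    t->redo[t->redo_count].txn_id = t->id;
    t->redo[t->redo_count].value = value;
    t->redo_count++;
    return 0;
}
```

Assigning the ID lazily in step 2 means a transaction that only reads never consumes a transaction ID in this sketch.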
WiredTiger Transaction Data Flush Time
● PDFLUSH: 60 s
● log_flush: 100 ms
● Sync
● LifeTime: 60 s
Snapshot for END USER
● To minimize cache pressure, set the server parameter transactionLifetimeLimitSeconds to a
suitable value.
● The default value is 60.
● Before a transaction updates a document, it tries to acquire a write lock. If the document is already
locked, the transaction fails.
● Before a non-transactional operation updates a document, it tries to acquire a write lock. If the
document is already locked, the operation backs off and retries until maxTimeMS is reached.
Snapshot for END USER
● Pass the session to every statement inside your transaction.
● Implement retry logic. MongoDB returns error codes that tell you whether a transaction has failed and
whether the error is retryable.
● To reduce WiredTiger cache pressure, keep transactions short and don’t leave them open, even read-only
transactions.
● Keep in mind that long-running DDL operations (e.g. createIndex()) block transactions and vice versa.
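The retry advice above can be sketched as a generic loop in C. This does not use a MongoDB driver: txn_result, txn_body_fn and run_transaction_with_retry are hypothetical names, and the mapping from a driver's error codes to "retryable" or "fatal" is left to the callback. flaky_body is only an example callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one attempt to run and commit a transaction. */
typedef enum { TXN_COMMITTED, TXN_RETRYABLE_ERROR, TXN_FATAL_ERROR } txn_result;

/* The callback starts the transaction, runs its statements and tries to commit. */
typedef txn_result (*txn_body_fn)(void *ctx);

/* Re-run the body while it reports a retryable error, up to max_attempts times. */
static txn_result
run_transaction_with_retry(txn_body_fn body, void *ctx, int max_attempts, int *attempts_used)
{
    txn_result res = TXN_FATAL_ERROR;
    int attempt;

    assert(body != NULL && max_attempts > 0);
    for (attempt = 1; attempt <= max_attempts; attempt++) {
        res = body(ctx);
        if (res != TXN_RETRYABLE_ERROR)
            break; /* committed, or failed in a way retrying cannot fix */
    }
    if (attempts_used != NULL)
        *attempts_used = attempt > max_attempts ? max_attempts : attempt;
    return res;
}

/* Example body: reports a retryable error until the counter in ctx reaches 0. */
static txn_result
flaky_body(void *ctx)
{
    int *remaining_failures = ctx;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return TXN_RETRYABLE_ERROR;
    }
    return TXN_COMMITTED;
}
```

Capping the number of attempts keeps a transaction that keeps conflicting from holding resources indefinitely, which fits the advice to keep transactions short.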
Journal and Checkpoint
● With journaling, data is written first to the journal and then to the core data files.
● With the MMAPv1 storage engine, MongoDB uses memory-mapped files to write your data to disk.
● To improve performance, writes go first into the in-memory buffer of the journal log.
● Journal files have a size limit of 100 MB.
● WiredTiger creates a new journal file approximately every 100 MB of data.
● WiredTiger uses snappy compression for the journal data.
○ storage.wiredTiger.engineConfig.journalCompressor: default snappy
○ The minimum journal record size for WiredTiger is 128 bytes.
● When the buffered data reaches 100 MB, or every 100 milliseconds, the data in the journal buffer is
flushed to the journal file on disk.
○ storage.journal.commitIntervalMs: default 100 or 30
○ Write concern j: true causes an immediate sync of the journal.
○ If MongoDB exits abnormally, we may lose up to 100 MB of data or the last 100 ms of data.
● When the journal data reaches 2 GB, or every 60 seconds, changes are flushed to the data files.
○ storage.syncPeriodSecs: default 60
○ The amount of time that can pass before MongoDB flushes data to the data files via an fsync
operation.
○ storage.syncPeriodSecs has no effect on journal files.
What I Didn’t Cover
1. Block Manager
2. Cache
3. BTree/LSM
4. Compression, etc.