2. Presenter
• Keith Bostic
• Co-architect, WiredTiger
• Senior Staff Engineer, MongoDB
Ask questions as we go, or email keith.bostic@mongodb.com
3. This presentation is not…
• How to write stand-alone WiredTiger apps
– contact info@wiredtiger.com
• How to configure MongoDB with WiredTiger for your workload
4. WiredTiger
• Embedded database engine
– general purpose toolkit
– high-performing: scalable throughput with low latency
• Key-value store (NoSQL)
• Schema layer
– data typing, indexes
• Single-node
• OO APIs
– Python, C, C++, Java
• Open Source
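As a taste of the handle + method APIs, here is a minimal sketch of a stand-alone C program (the home directory and table name are illustrative, and error handling is elided):

#include <wiredtiger.h>

int
main(void)
{
	WT_CONNECTION *conn;
	WT_CURSOR *cursor;
	WT_SESSION *session;
	const char *value;

	/* Every object is a handle; all operations are handle methods. */
	wiredtiger_open("WT_HOME", NULL, "create", &conn);
	conn->open_session(conn, NULL, NULL, &session);

	/* Schema layer: a table of string keys and string values. */
	session->create(session,
	    "table:example", "key_format=S,value_format=S");

	/* Cursors are the data-access primitive. */
	session->open_cursor(session, "table:example", NULL, NULL, &cursor);
	cursor->set_key(cursor, "fruit");
	cursor->set_value(cursor, "apple");
	cursor->insert(cursor);

	cursor->set_key(cursor, "fruit");
	cursor->search(cursor);
	cursor->get_value(cursor, &value);

	return (conn->close(conn, NULL));
}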
5. Deployments
• Amazon AWS
• ORC/Tbricks: financial trading solution
And, most important of all:
• MongoDB: next-generation document store
8. MongoDB’s Storage Engine API
• Allows different storage engines to "plug-in"
– different workloads have different performance characteristics
– mmapV1 is not ideal for all workloads
– more flexibility
• mix storage engines on same replica set/sharded cluster
• Opportunity to innovate further
– HDFS, encrypted, other workloads
• WiredTiger is MongoDB’s general-purpose workhorse
9. Topics
➢ WiredTiger Architecture
• In-memory performance
• Record-level concurrency
• Compression
• Durability and the journal
• Future features
10. Motivation for WiredTiger
• Traditional engines struggle with modern hardware:
– lots of CPU cores
– lots of RAM
• Avoid thread contention for resources
– lock-free algorithms, for example, hazard pointers
– concurrency control without blocking
• Hotter cache, more work per I/O
– big blocks
– compact file formats
11. WiredTiger Architecture
[Architecture diagram: the Python, C, and Java APIs sit above a schema & cursors layer; inside the WiredTiger engine, transactions, snapshots, and the cache drive page read/write through row storage and column storage, with logging and block management below, persisting to database files and log files.]
12. Column-store, LSM
• Column-store
– implemented inside the B+tree
– 64-bit record number keys
– values located by the key’s position in the tree
– variable-length or fixed-length
• LSM
– forest of B+trees (row-store or column-store)
– bloom filters (fixed-length column-store)
• Mix-and-match
– sparse, wide table: column-store primary, LSM indexes
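For illustration, a sketch of how these types are selected at table-creation time (the URIs and column names are mine, not from the talk; assumes an open WT_SESSION *session):

/* Variable-length column-store: record-number keys, two value columns. */
session->create(session, "table:wide",
    "key_format=r,value_format=SS,columns=(id,name,addr)");

/* Fixed-length column-store: one 8-bit value per record number. */
session->create(session, "table:flags", "key_format=r,value_format=8t");

/* LSM tree of row-store B+trees, with bloom filters enabled. */
session->create(session, "lsm:ingest",
    "key_format=S,value_format=S,lsm=(bloom=true)");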
13. Topics
✓ WiredTiger Architecture
➢ In-memory performance
• Record-level concurrency
• Compression
• Durability and the journal
• Future features
16. Pages in cache
[Diagram: data files hold on-disk page images; in cache, a clean page is an on-disk page image plus an in-memory index, and a dirty page adds a list of updates on top of the image.]
17. Skiplists
• Updates stored in skiplists
– ordered linked lists with forward “skip” pointers
• William Pugh, 1989
– claimed: simpler, as fast as binary search, less space
– in practice: likely binary-search performance plus cache prefetch
– in practice: more space for an existing data set
• Implementation
– insert without locking
– forward/backward traversal without locking, while inserting
– removal requires locking
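A toy sketch of the idea (this is not WiredTiger’s implementation): readers follow forward skip pointers without locking, and inserts publish the new node level by level with compare-and-swap; removal, as noted, requires locking and is omitted.

#include <stdatomic.h>
#include <string.h>

#define MAX_LEVEL 8

typedef struct node {
	const char *key;
	_Atomic(struct node *) next[MAX_LEVEL];
} NODE;

/*
 * Return the first node with a key >= the search key, filling prev[]
 * with the node after which an insert would link at each level.
 */
static NODE *
skip_search(NODE *head, const char *key, NODE **prev)
{
	NODE *curr, *next;
	int lvl;

	curr = head;
	for (lvl = MAX_LEVEL - 1; lvl >= 0; --lvl) {
		while ((next = atomic_load(&curr->next[lvl])) != NULL &&
		    strcmp(next->key, key) < 0)
			curr = next;
		prev[lvl] = curr;
	}
	return (atomic_load(&curr->next[0]));
}

/*
 * Insert without locking: link the new node in at each level with a
 * compare-and-swap, re-searching whenever a concurrent insert wins.
 * (A toy: it ignores ABA issues, which locking on removal sidesteps.)
 */
static void
skip_insert(NODE *head, NODE *ins, int height)
{
	NODE *prev[MAX_LEVEL], *next;
	int lvl;

	for (lvl = 0; lvl < height; ++lvl)
		for (;;) {
			(void)skip_search(head, ins->key, prev);
			next = atomic_load(&prev[lvl]->next[lvl]);
			if (next != NULL && strcmp(next->key, ins->key) < 0)
				continue;	/* lost a race, retry */
			atomic_store(&ins->next[lvl], next);
			if (atomic_compare_exchange_strong(
			    &prev[lvl]->next[lvl], &next, ins))
				break;
		}
}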
18. In-memory performance
• Cache trees/pages optimized for in-memory access
• Follow pointers to traverse a tree
• No locking to read or write
• Keep updates separate from initial data
– updates are stored in skiplists
– updates are atomic in almost all cases
• Do structural changes (eviction, splits) in background threads
19. Topics
✓ WiredTiger Architecture
✓ In-memory performance
➢ Record-level concurrency
• Compression
• Durability and the journal
• Future features
20. Multiversion Concurrency Control (MVCC)
• Multiple versions of records maintained in cache
• Readers see most recently committed version
– read-uncommitted or snapshot isolation available
– configurable per-transaction or per-handle
• Writers can create new versions concurrent with readers
• Concurrent updates to a single record cause write conflicts
– one of the updates wins
– the other generally retries with back-off
• No locking, no lock manager
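At the API level, a write conflict surfaces as WT_ROLLBACK. A sketch of the retry pattern (the function and back-off policy are mine; assumes an open session and cursor):

#include <wiredtiger.h>

int
update_with_retry(WT_SESSION *session, WT_CURSOR *cursor,
    const char *key, const char *value)
{
	int ret;

	for (;;) {
		session->begin_transaction(session, "isolation=snapshot");
		cursor->set_key(cursor, key);
		cursor->set_value(cursor, value);
		if ((ret = cursor->update(cursor)) == 0)
			return (session->commit_transaction(session, NULL));

		/* Write conflict: another transaction updated the record. */
		session->rollback_transaction(session, NULL);
		if (ret != WT_ROLLBACK)
			return (ret);
		/* Back off (sleep/yield) before retrying. */
	}
}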
21. Pages in cache
[Same diagram as slide 16, with the dirty page’s updates shown stored in a skiplist.]
23. Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
➢ Compression
• Durability and the journal
• Future features
24. Block manager
• Block allocation
– fragmentation
– allocation policy
• Checksums
– block compression is at a higher level
• Checkpoints
– involved in durability guarantees
• Opaque address cookie
– stored as internal page key’s “value”
• Pluggable
25. Write path
[Diagram: the cached page image and its skiplists of updates are reconciled during write into a new page image in the data files.]
26. In-memory Compression
• Prefix compression
– index keys usually have a common prefix
– rolling, per-block, requires instantiation for performance
• Huffman/static encoding
– burns CPU
• Dictionary lookup
– single value per page
• Run-length encoding
– column-store values
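A sketch of how these encodings are enabled per-table (the table name is illustrative; block_compressor is the separate, block-level compression mentioned on the block manager slide):

session->create(session, "table:docs",
    "key_format=S,value_format=S,"
    "prefix_compression=true,"	/* common key prefixes stored once */
    "dictionary=1000,"		/* up to 1000 repeated values per page */
    /* "huffman_value=english," is also available, but burns CPU */
    "block_compressor=snappy");	/* block-level compression on write */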
29. Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
✓ Compression
➢ Durability and the journal
• Future features
30. Journal and Recovery
• Write-ahead logging (aka journal) enabled by default
• Only written at transaction commit
– only write redo records
• Log records are compressed
• Group commit for concurrency
• Automatic log archival / removal
– bounded by checkpoint frequency
• On startup, find a consistent checkpoint in the metadata
– use the checkpoint to figure out how much to roll forward
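A sketch of the relevant wiredtiger_open configuration (the values are illustrative, not defaults I am asserting):

WT_CONNECTION *conn;

/* Journal on, log records compressed, checkpoint every 60 seconds. */
wiredtiger_open("WT_HOME", NULL,
    "create,"
    "log=(enabled=true,compressor=snappy),"
    "checkpoint=(wait=60),"
    "transaction_sync=(enabled=true,method=fsync)",
    &conn);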
31. Durability without Journaling
• MongoDB’s MMAP storage requires the journal for consistency
– running with “nojournal” is unsafe
• WiredTiger is a no-overwrite data store
– with “nojournal”, updates since the last checkpoint may be lost
– data will still be consistent
– checkpoints every N seconds by default
• Replication can guarantee durability
– the network is generally faster than disk I/O
32. Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
✓ Compression
✓ Durability and the journal
➢ Future features
34. What’s next for WiredTiger?
• Our Big Year of Tuning
– applications doing “interesting” things
– stalls during checkpoints with 100GB+ caches
– MongoDB capped collections
• Encryption
• Advanced transactional semantics
– updates not stable until confirmed by replica majority
35. WiredTiger LSM support
• Random insert workloads
• Data set much larger than cache
• Query performance less important
• Background maintenance overhead acceptable
• Bloom filters
Feel free to ask questions as we go; hopefully there will be a few minutes for Q&A at the end.
Also happy to discuss by email; let me know how we can help.
Build a toolkit: one path is to build special-purpose engines for specific workloads; another is a general-purpose toolkit that can handle complex or changing workloads.
This kind of positive feedback isn’t common in engineering groups.
The structure for the rest of the talk.
Traditional storage engine designs struggle with modern hardware.
I/O, relative to memory, is a worse outcome than ever before: trade CPU for I/O wherever possible.
Big block I/O: if we have to do I/O, bring in a lot of data.
Moderately complex inside.
Outside APIs: handle + method style.
Top-level schema layer, which manages every table and its associated indexes.
Operations are transactionally protected, implemented in terms of in-memory snapshots.
Operations go to a row-store engine (ordered key/value pairs) or a column-store engine (the key is a 64-bit record number).
Block management is intended to be pluggable itself.
On disk, key/value pair files and log files.
From now on, I’m going to mostly be talking about “trees” in-memory, without distinguishing what type of tree – here’s the overview, after which it’s just a page in-memory.
WiredTiger focuses on in-memory performance: I/O means you’ve lost the war, you’re only trying to retreat gracefully.
WiredTiger does have root pages and internal pages, with leaf pages at the bottom: binary search of each page yields the child page for the subsequent search.
Importantly, pointers in memory are not disk offsets. (In many engines, in-memory objects find each other using disk offsets, so, for example, a transition from an internal page to a leaf page means handing the cache subsystem a disk offset, and the cache returns the in-memory pointer, possibly after a read.)
The WiredTiger in-memory tree is exactly that: an in-memory-optimized tree. This is good (fast in-memory operations) and bad (there is a translation step after reading, and before writing, disk blocks). We knew we wanted to be able to modify our in-memory representation without changing our on-disk format (avoid upgrades!), and we knew many of our compression algorithms would require translation before writing anyway (for example, our on-disk pages have no indexing information, which saves about 15-20% of the disk footprint in some workloads).
To make this efficient, pointers need to be protected: once data is larger than cache, there needs to be a check to ensure a pointer is valid.
There’s a background thread doing eviction of pages.
Hazard pointers can be thought of as micro-logging.
A reader or writer publishes the memory address of the page it wants on a non-shared cache line; after the publish, if the pointer is still valid, it can proceed.
Eviction threads must check those locations to ensure a page is not currently in use: eviction bears the burden, readers/writers go fast.
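A toy sketch of that hazard-pointer handshake (again, not WiredTiger’s code; the sizes and names are mine):

#include <stdatomic.h>
#include <stdbool.h>

#define CACHE_LINE 64
#define MAX_THREADS 64

/* One slot per thread, each on its own (non-shared) cache line. */
typedef struct {
	_Atomic(void *) ptr;
	char pad[CACHE_LINE - sizeof(_Atomic(void *))];
} HAZARD;
static HAZARD hazard[MAX_THREADS];

/* Reader/writer: publish the page's address, then re-check it. */
static void *
page_acquire(int tid, _Atomic(void *) *pagep)
{
	void *page;

	for (;;) {
		page = atomic_load(pagep);
		atomic_store(&hazard[tid].ptr, page);	/* publish */
		if (atomic_load(pagep) == page)		/* still valid? */
			return (page);			/* safe to use */
		/* Page evicted in the window: retry. */
	}
}

static void
page_release(int tid)
{
	atomic_store(&hazard[tid].ptr, NULL);
}

/* Eviction bears the burden: after unhooking the page so no new
 * reader can find it, scan every slot before freeing it. */
static bool
page_in_use(void *page)
{
	int i;

	for (i = 0; i < MAX_THREADS; ++i)
		if (atomic_load(&hazard[i].ptr) == page)
			return (true);
	return (false);
}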
Design principle: application threads never wait; shift work from application threads to system-internal threads.
Different in-memory representation vs on-disk – opportunities to optimize plus some interesting features.
Read on-disk page into cache, generally a read-only image. Add indexing information (binary search) on top of that image.
Updates are layered on top of that image, including new key/value pairs inserted between existing keys.
If the page grows too large, background threads will deal with it.
There is lots of magic in traversal: threads must go back and forth between the original image and the updates.
Writing a page:
Combine the previous page image with in-memory changes
If the result is too big, split
Allocate new space in the file
Always write multiples of the allocation size
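In outline (all names here are hypothetical, a paraphrase of the steps above rather than WiredTiger internals):

#include <stddef.h>
#include <sys/types.h>

typedef struct page PAGE;	/* in-memory page: image + updates */
typedef struct image IMAGE;	/* reconciled, ready-to-write image */

extern IMAGE *reconcile(PAGE *);		/* merge image + updates */
extern size_t image_size(IMAGE *);
extern int split_and_write(IMAGE *);
extern off_t block_alloc(size_t);		/* no-overwrite allocation */
extern int block_write(off_t, IMAGE *, size_t);

static size_t max_page_size, alloc_size;	/* set at configuration */

/* Round a size up to a multiple of the allocation size. */
static size_t
align_up(size_t len)
{
	return ((len + alloc_size - 1) / alloc_size * alloc_size);
}

static int
page_write(PAGE *page)
{
	IMAGE *img;
	size_t len;

	/* 1. Combine the previous page image with in-memory changes. */
	img = reconcile(page);

	/* 2. If the result is too big, split into multiple images. */
	if ((len = align_up(image_size(img))) > max_page_size)
		return (split_and_write(img));

	/* 3. Allocate new space in the file (never overwrite), and
	 * 4. always write a multiple of the allocation size. */
	return (block_write(block_alloc(len), img, len));
}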
To summarize:
Note we haven’t talked about locking at all: application threads can retrieve and update data without ever acquiring a lock.
Justin Levandoski’s Bw-tree work is interesting: they’ve avoided taking pages out of circulation during splits.
Readers don’t block writers, writers don’t block readers, writers don’t block writers, again, no locks have been acquired.
If there have been no updates, it’s easy: the on-page item is the correct item for any query.
If there are updates, each update has a transaction ID associated with it; that ID, combined with the reading transaction’s snapshot, determines the correct value.
The index references the original page image; once updates are installed, readers/writers have to check for updates.
The first update in the list is generally the one we want; if it’s not yet visible, older updates are checked. If no update is visible, the value on the original page must be the one we want.
All updates are installed atomically, swapping a new pointer into place, so readers run concurrently with updates.
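A sketch of that read path and the atomic install (the names and the visibility test are hypothetical stand-ins):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct upd {
	uint64_t txnid;		/* ID of the updating transaction */
	const void *value;	/* the new version of the record */
	struct upd *next;	/* next-older update, or NULL */
} UPD;

/* Is txnid visible to this reader's snapshot? (Details elided.) */
extern bool txn_visible(uint64_t txnid);

/* Reader: the newest visible update wins, else the on-page value. */
static const void *
read_value(_Atomic(UPD *) *headp, const void *onpage_value)
{
	UPD *upd;

	for (upd = atomic_load(headp); upd != NULL; upd = upd->next)
		if (txn_visible(upd->txnid))
			return (upd->value);
	return (onpage_value);
}

/* Writer: atomically swap the new update onto the head of the list. */
static void
install_update(_Atomic(UPD *) *headp, UPD *ins)
{
	ins->next = atomic_load(headp);
	while (!atomic_compare_exchange_weak(headp, &ins->next, ins))
		;	/* on failure, ins->next holds the current head */
}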
Writing the page to disk requires processing all of this information, and determining the values to write.
Page write transforms the in-memory version: selecting values to write, page splitting, all sorts of compression, checksums.
Checkpoints are simply another “snapshot reader”, so they run concurrently with other readers and writers.
Can configure direct I/O to keep reads and writes out of the filesystem cache
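For example (a sketch; direct I/O availability depends on the platform):

WT_CONNECTION *conn;

/* Bypass the filesystem cache for data files; "log" and
 * "checkpoint" can also be listed. */
wiredtiger_open("WT_HOME", NULL, "create,direct_io=[data]", &conn);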
All of these apply both in memory and on disk, so we save space both on disk and in the cache.
In the same code paths, we compress the data.
WiredTiger’s file formats are compact, and certain types of compression cannot be turned off: even without “compression” configured, files are roughly half the size.
Storage engines are all about not losing your stuff.
Pretty standard WAL implementation: before a commit is visible, a log record with all of the changes in the transaction has been flushed to stable storage.
Group commit: concurrent log writes are done with a single storage flush.
Started with “Scalability of write-ahead logging on multicore and multisocket hardware” by Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, and Anastasia Ailamaki; there’s lots more engineering there.
Checkpoints move from one stable point to another one.
Our goal was to build a single-node engine, and for that reason we had to be able to run without a log without losing durability.
Lookaside tables.
A very large data set over time:
Blue: a btree tails off over time as the probability of a page being found in cache decreases (that’s why the random nature of the insert matters).
Red/green are flatter: LSM only maintains the recent updates in cache, and merges the updates in the background.
LSM is write-optimized, though: the reason the btree is primary is that read-mostly workloads generally behave better in a btree than in an LSM tree.
What we want to do eventually is enable the conversion of an LSM tree into a btree (a forest of btrees collapsing into a single btree matches nicely with the typical workload of inserting a lot of data and then processing that data). Ideally, we’d also be able to reverse that process on demand.