LDAP at Lightning Speed
Howard Chu
CTO, Symas Corp. hyc@symas.com
Chief Architect, OpenLDAP hyc@openldap.org
2015-03-05
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/lmdb
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
2
OpenLDAP Project
● Open source code project
● Founded 1998
● Three core team members
● A dozen or so contributors
● Feature releases every 12-18 months
● Maintenance releases as needed
3
A Word About Symas
● Founded 1999
● Founders from Enterprise Software world
– platinum Technology (Locus Computing)
– IBM
● Howard joined OpenLDAP in 1999
– One of the Core Team members
– Appointed Chief Architect January 2007
● No debt, no VC investments: self-funded
4
Intro
● Howard Chu
– Founder and CTO Symas Corp.
– Developing Free/Open Source software since
1980s
● GNU compiler toolchain, e.g. "gmake -j", etc.
● Many other projects...
– Worked for NASA/JPL, wrote software for Space
Shuttle, etc.
5
Topics
(1) Background
(2) Features
(3) Design Approach
(4) Internals
(5) Special Features
(6) Results
6
(1) Background
● API inspired by Berkeley DB (BDB)
– OpenLDAP has used BDB extensively since 1999
– Deep experience with pros and cons of BDB design
and implementation
– Omits BDB features that were found to be of no benefit
● e.g. extensible hashing
– Avoids BDB characteristics that were problematic
● e.g. cache tuning, complex locking, transaction logs,
recovery
7
(2) Features
LMDB At A Glance
● Key/Value store using B+trees
● Fully transactional, ACID compliant
● MVCC, readers never block
● Uses memory-mapped files, needs no tuning
● Crash-proof, no recovery needed after restart
● Highly optimized, extremely compact
– under 40KB object code, fits in CPU L1 I$
● Runs on most modern OSs
– Linux, Android, *BSD, MacOSX, iOS, Solaris, Windows, etc...
8
Features
● Concurrency Support
– Both multi-process and multi-thread
– Single Writer + N readers
● Writers don't block readers
● Readers don't block writers
● Reads scale perfectly linearly with available CPUs
● No deadlocks
– Full isolation with MVCC - Serializable
– Nested transactions
– Batched writes
9
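As a minimal sketch of the nested-transaction support in LMDB's C API (an already-open environment and DB handle are assumed; error handling elided):

    #include <lmdb.h>

    /* A child write txn's changes become part of the parent only when
     * the child commits; aborting the child discards them without
     * disturbing the parent. */
    void nested_write(MDB_env *env, MDB_dbi dbi, MDB_val *key, MDB_val *data)
    {
        MDB_txn *parent, *child;
        mdb_txn_begin(env, NULL, 0, &parent);
        mdb_txn_begin(env, parent, 0, &child);
        if (mdb_put(child, dbi, key, data, 0) == 0)
            mdb_txn_commit(child);   /* merge into parent */
        else
            mdb_txn_abort(child);    /* parent unaffected */
        mdb_txn_commit(parent);
    }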
Features
● Uses Copy-on-Write
– Live data is never overwritten
– DB structure cannot be corrupted by incomplete
operations (system crashes)
– No write-ahead logs needed
– No transaction log cleanup/maintenance
– No recovery needed after crashes
10
Features
● Uses Single-Level Store
– Reads are satisfied directly from the memory map
● No malloc or memcpy overhead
– Writes can be performed directly to the memory map
● No write buffers, no buffer tuning
– Relies on the OS/filesystem cache
● No wasted memory in app-level caching
– Can store live pointer-based objects directly
● using a fixed address map
● minimal marshalling, no unmarshalling required
11
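What single-level store means at the API level, as a sketch using LMDB's public calls (environment and DB handle setup assumed; error handling elided): a read hands back a pointer directly into the map, valid until the transaction ends, with no allocation or copying.

    #include <stdio.h>
    #include <string.h>
    #include <lmdb.h>

    void print_value(MDB_env *env, MDB_dbi dbi, const char *k)
    {
        MDB_txn *txn;
        MDB_val key = { strlen(k), (void *)k }, data;
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        /* data.mv_data points straight into the memory map */
        if (mdb_get(txn, dbi, &key, &data) == 0)
            printf("%.*s\n", (int)data.mv_size, (char *)data.mv_data);
        mdb_txn_abort(txn);  /* read txns are simply abandoned */
    }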
Features
● LMDB config is simple, e.g. slapd:
database mdb
directory /var/lib/ldap/data/mdb
maxsize 4294967296
● BDB config is complex:
database hdb
directory /var/lib/ldap/data/hdb
cachesize 50000
idlcachesize 50000
dbconfig set_cachesize 4 0 1
dbconfig set_lg_regionmax 262144
dbconfig set_lg_bsize 2097152
dbconfig set_lg_dir /mnt/logs/hdb
dbconfig set_lk_max_locks 3000
dbconfig set_lk_max_objects 1500
dbconfig set_lk_max_lockers 1500
12
(3) Design Approach
● Motivation - problems dealing with BDB
● Obvious Solutions
● Approach
13
Motivation
● BDB slapd backend always required careful,
complex tuning
– Data comes through 3 separate layers of caches
– Each layer has different size and speed traits
– Balancing the 3 layers against each other can be a
difficult juggling act
– Performance without the backend caches is
unacceptably slow - over an order of magnitude worse
14
Motivation
● Backend caching significantly increased the
overall complexity of the backend code
– Two levels of locking required, since BDB database
locks are too slow
– Deadlocks occurring routinely in normal operation,
requiring additional backoff/retry logic
15
Motivation
● The caches were not always beneficial, and were
sometimes detrimental
– Data could exist in 3 places at once - filesystem, DB,
and backend cache - wasting memory
– Searches with result sets that exceeded the configured
cache size would reduce the cache effectiveness to
zero
– malloc/free churn from adding and removing entries in
the cache could trigger pathological heap
fragmentation in libc malloc
16
Obvious Solutions
● Cache management is a hassle, so don't do any
caching
– The filesystem already caches data; there's no
reason to duplicate the effort
● Lock management is a hassle, so don't do any
locking
– Use Multi-Version Concurrency Control (MVCC)
– MVCC makes it possible to perform reads with no
locking
17
Obvious Solutions
● BDB supports MVCC, but still requires complex
caching and locking
● To get the desired results, we need to abandon BDB
● Surveying the landscape revealed no other DB
libraries with the desired characteristics
● Thus LMDB was created in 2011
– "Lightning Memory-Mapped Database"
– BDB is now deprecated in OpenLDAP
18
Design Approach
● Based on the "Single-Level Store" concept
– Not new, first implemented in Multics in 1964
– Access a database by mapping the entire DB into
memory
– Data fetches are satisfied by direct reference to the
memory map; there is no intermediate page or
buffer cache
19
Single-Level Store
● Only viable if process address spaces are
larger than the expected data volumes
– For 32 bit processors, the practical limit on data
size is under 2GB
– For common 64 bit processors which only
implement 48 bit address spaces, the limit is 47 bits
or 128 terabytes
– The upper bound at 63 bits is 8 exabytes
20
Design Approach
● Uses a read-only memory map
– Protects the DB structure from corruption due to stray
writes in memory
– Any attempts to write to the map will cause a SEGV,
allowing immediate identification of software bugs
● Can optionally use a read-write mmap
– Slight performance gain for fully in-memory data sets
– Should only be used on fully-debugged application
code
21
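The choice is just an environment flag at open time; a sketch (the map size, path, and file mode here are arbitrary):

    #include <lmdb.h>

    MDB_env *open_env(const char *path, int writemap)
    {
        MDB_env *env;
        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 1UL << 30);  /* reserve a 1GB map */
        /* Default (flags = 0): the map is read-only, so stray writes
         * SEGV instead of corrupting the DB. MDB_WRITEMAP trades that
         * protection for a slight performance gain. */
        mdb_env_open(env, path, writemap ? MDB_WRITEMAP : 0, 0664);
        return env;
    }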
Design Approach
● Keith Bostic (BerkeleyDB author, personal email, 2008)
– "The most significant problem with building an mmap'd back-end is implementing
write-ahead-logging (WAL). (You probably know this, but just in case: the way
databases usually guarantee consistency is by ensuring that log records
describing each change are written to disk before their transaction commits, and
before the database page that was changed. In other words, log record X must hit
disk before the database page containing the change described by log record X.)
– In Berkeley DB WAL is done by maintaining a relationship between the database
pages and the log records. If a database page is being written to disk, there's a
look-aside into the logging system to make sure the right log records have already
been written. In a memory-mapped system, you would do this by locking modified
pages into memory (mlock), and flushing them at specific times (msync), otherwise
the VM might just push a database page with modifications to disk before its log
record is written, and if you crash at that point it's all over but the screaming."
22
Design Approach
● Implement MVCC using copy-on-write
– In-use data is never overwritten, modifications are
performed by copying the data and modifying the copy
– Since updates never alter existing data, the DB
structure can never be corrupted by incomplete
modifications
● Write-ahead transaction logs are unnecessary
– Readers always see a consistent snapshot of the DB,
they are fully isolated from writers
● Read accesses require no locks
23
MVCC Details
● "Full" MVCC can be extremely resource intensive
– DBs typically store complete histories reaching far back into time
– The volume of data grows extremely fast, and grows without
bound unless explicit pruning is done
– Pruning the data using garbage collection or compaction requires
more CPU and I/O resources than the normal update workload
● Either the server must be heavily over-provisioned, or updates must be
stopped while pruning is done
– Pruning requires tracking of in-use status, which typically
involves reference counters, which require locking
24
Design Approach
● LMDB nominally maintains only two versions of the DB
– Rolling back to a historical version is not interesting for
OpenLDAP
– Older versions can be held open longer by reader transactions
● LMDB maintains a free list tracking the IDs of unused pages
– Old pages are reused as soon as possible, so data volumes
don't grow without bound
● LMDB tracks in-use status without locks
25
Implementation Highlights
● LMDB library started from the append-only btree
code written by Martin Hedenfalk for his ldapd,
which is bundled in OpenBSD
– Stripped out all the parts we didn't need (page cache
management)
– Borrowed a couple pieces from slapd for expedience
– Changed from append-only to page-reclaiming
– Restructured to allow adding ideas from BDB that we
still wanted
26
Implementation Highlights
● Resulting library was under 32KB of object
code
– Compared to the original btree.c at 39KB
– Compared to BDB at 1.5MB
● API is loosely modeled after the BDB API to
ease migration of back-bdb code
27
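A minimal sketch of that API shape: one put/commit round trip (the path is hypothetical; error checks elided):

    #include <lmdb.h>

    int main(void)
    {
        MDB_env *env; MDB_txn *txn; MDB_dbi dbi;
        MDB_val key = { 5, "hello" }, data = { 5, "world" };

        mdb_env_create(&env);
        mdb_env_open(env, "./testdb", 0, 0664);  /* directory must exist */
        mdb_txn_begin(env, NULL, 0, &txn);       /* write txn */
        mdb_dbi_open(txn, NULL, 0, &dbi);        /* the main, unnamed DB */
        mdb_put(txn, dbi, &key, &data, 0);
        mdb_txn_commit(txn);                     /* durable on return */
        mdb_env_close(env);
        return 0;
    }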
Implementation Highlights
Footprint: size(1) output and source lines of code for each db_bench* driver

text     data  bss    dec      hex     filename             Lines of Code
285306   1516  352    287174   461c6   db_bench             39758
384206   9304  3488   396998   60ec6   db_bench_basho       26577
1688853  2416  312    1691581  19cfbd  db_bench_bdb         1746106
315491   1596  360    317447   4d807   db_bench_hyper       21498
121412   1644  320    123376   1e1f0   db_bench_mdb         7955
1014534  2912  6688   1024134  fa086   db_bench_rocksdb     81169
992334   3720  30352  1026406  fa966   db_bench_tokudb      227698
853216   2100  1920   857236   d1494   db_bench_wiredtiger  91410
28
(4) Internals
● Btree Operation
– Write-Ahead Logging
– Append-Only
– Copy-on-Write, LMDB-style
● Free Space Management
– Avoiding Compaction/Garbage Collection
● Transaction Handling
– Avoiding Locking
29
Btree Operation
Basic Elements
[Diagram: the three page types. Every Database Page header carries a page number (Pgno) and misc metadata. A Data Page adds an offset array pointing at its key,data pairs. A Meta Page adds a Root field naming the root page of the tree.]
30
Btree Operation
Write-Ahead Logging
[Diagram sequence: each operation ("Add 1,foo to page 1", "Commit", "Add 2,bar to page 1", "Commit") is appended to the write-ahead log before the corresponding data page and meta page are updated in place. A final Checkpoint entry marks the point at which all logged changes have reached the main DB file, after which the log can be trimmed.]
Btree Operation
How Append-Only/Copy-On-Write Works
● Updates are always performed bottom up
● Every branch node from the leaf to the root
must be copied/modified for any leaf update
● Any node not on the path from the leaf to the
root is unaltered
● The root node is always written last
39
Btree Operation
Append-Only
[Diagram sequence: start with a simple tree; update a leaf node by copying it and updating the copy; copy the root node and point it at the new leaf; the old root and old leaf remain as a previous version of the tree; further updates create additional versions of the tree.]
Btree Operation
In the Append-Only tree, new pages are always appended sequentially to
the DB file
● While there's significant overhead for making complete copies of
modified pages, the actual I/O is linear and relatively fast
● The root node is always the last page of the file, unless there was a
crash
● Any root node can be found by seeking backward from the end of the
file, and checking the page's header
● Recovery from a crash is relatively easy
– Everything from the last valid root to the beginning of the file is always pristine
– Anything between the end of the file and the last valid root is discarded
48
Btree Operation
Append-Only
[Diagram sequence: every update appends new pages to the end of the file. Starting from meta page 0 (Root: EMPTY), adding 1,foo appends data page 1 and meta page 2 (Root: 1); adding 2,bar appends data page 3 (1,foo; 2,bar) and meta page 4 (Root: 3); changing 1,foo to 1,blah appends data page 5 and meta page 6 (Root: 5); changing 2,bar to 2,xyz appends data page 7 and meta page 8 (Root: 7). Older data and meta pages are never touched.]
Btree Operation
Append-Only disk usage is very inefficient
● Disk space usage grows without bound
● 99+% of the space will be occupied by old versions of
the data
● The old versions are usually not interesting
● Reclaiming the old space requires a very expensive
compaction phase
● New updates must be throttled until compaction
completes
58
Btree Operation
The LMDB Approach
● Still Copy-on-Write, but using two fixed root nodes
– Page 0 and Page 1 of the file, used in double-buffer fashion
– Even faster cold-start than Append-Only, no searching
needed to find the last valid root node
– Any app always reads both pages and uses the one with
the greater Transaction ID stamp in its header
– Consequently, only 2 outstanding versions of the DB exist,
not fully "multi-version"
59
Btree Operation
[Diagram sequence: updates alternate between the two fixed meta pages in double-buffer fashion. After an update copies a leaf and every page on its path to the root, the old page copies are no longer referenced by anything else in the database, so they can be reclaimed; each subsequent update frees its predecessor's pages in the same way.]
Free Space Management
LMDB maintains two B+trees per root node
● One storing the user data, as illustrated above
● One storing lists of IDs of pages that have been freed
in a given transaction
● Old, freed pages are used in preference to new
pages, so the DB file size remains relatively static
over time
● No compaction or garbage collection phase is ever
needed
67
Free Space Management
[Diagram sequence: both meta pages start at TXN: 0 with FRoot (free-list root) and DRoot (data root) EMPTY. Txn 1 writes data page 2 (1,foo) and commits via meta page 1 (TXN: 1, DRoot: 2). Txn 2 copies the data into page 3 (1,foo; 2,bar), records "txn 2, page 2" in free-list page 4, and commits via meta page 0 (TXN: 2, FRoot: 4, DRoot: 3). Txn 3 writes data page 5 (1,blah; 2,bar) and free-list page 6 ("txn 3, page 3,4"; "txn 2, page 2"), committing via meta page 1 (TXN: 3, FRoot: 6, DRoot: 5). Txn 4 reuses freed page 2 for its new data (1,blah; 2,xyz), writes free-list page 7 ("txn 4, page 5,6"; "txn 3, page 3,4"), and commits via meta page 0 (TXN: 4, FRoot: 7, DRoot: 2). Once freed pages can be recycled, the file stops growing.]
Free Space Management
● Caveat: If a read transaction is open on a
particular version of the DB, that version and
every version after it are excluded from page
reclaiming.
● Thus, long-lived read transactions should be
avoided, otherwise the DB file size may grow
rapidly, devolving into Append-Only behavior
until the transactions are closed
80
Transaction Handling
● LMDB supports a single writer concurrent with many readers
– A single mutex serializes all write transactions
– The mutex is shared/multiprocess
● Readers run lockless and never block
– But for page reclamation purposes, readers are tracked
● Transactions are stamped with an ID which is a monotonically
increasing integer
– The ID is only incremented for Write transactions that actually modify
data
– If a Write transaction is aborted, or committed with no changes, the
same ID will be reused for the next Write transaction
81
Transaction Handling
● Transactions take a snapshot of the currently valid
meta page at the beginning of the transaction
● No matter what write transactions follow, a read
transaction's snapshot will always point to a valid
version of the DB
● The snapshot is totally isolated from subsequent
writes
● This provides the Consistency and Isolation in ACID
semantics
82
Transaction Handling
● The currently valid meta page is chosen based
on the greatest transaction ID in each meta
page
– The meta pages are page and CPU cache aligned
– The transaction ID is a single machine word
– The update of the transaction ID is atomic
– Thus, the Atomicity semantics of transactions are
guaranteed
83
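A sketch of that selection; the struct layout here is illustrative, not LMDB's actual one:

    #include <stddef.h>

    typedef struct meta {
        size_t txnid;      /* written last, a single atomic word */
        size_t data_root;  /* page number of the data B+tree root */
        size_t free_root;  /* page number of the free-list root   */
    } meta;

    /* Meta pages live at fixed locations (pages 0 and 1); the valid
     * one is simply whichever carries the higher transaction ID. */
    const meta *pick_meta(const meta *m0, const meta *m1)
    {
        return (m0->txnid > m1->txnid) ? m0 : m1;
    }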
Transaction Handling
● During Commit, the data pages are written and
then synchronously flushed before the meta page
is updated
– Then the meta page is written synchronously
– Thus, when a commit returns "success", it is
guaranteed that the transaction has been written intact
– This provides the Durability semantics
– If the system crashes before the meta page is updated,
then the data updates are irrelevant
84
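A sketch of that ordering, with hypothetical helpers write_dirty_pages() and write_meta_page() standing in for LMDB's internal page I/O:

    #include <unistd.h>

    /* Hypothetical helpers standing in for LMDB's internal page I/O: */
    void write_dirty_pages(int fd);
    void write_meta_page(int fd, const void *meta);

    int commit_to_disk(int fd, const void *new_meta)
    {
        write_dirty_pages(fd);           /* 1. write the new data pages   */
        if (fdatasync(fd) != 0)          /* 2. force them to stable media */
            return -1;
        write_meta_page(fd, new_meta);   /* 3. publish the new root       */
        return fdatasync(fd);            /* durable once this returns     */
        /* A crash before step 3 leaves the old meta page in effect;
         * the new data pages are simply unreferenced. */
    }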
Transaction Handling
● For tracking purposes, Readers must acquire a slot in the
readers table
– The readers table is also in a shared memory map, but separate
from the main data map
– This is a simple array recording the Process ID, Thread ID, and
Transaction ID of the reader
– The array elements are CPU cache aligned
– The first time a thread opens a read transaction, it must acquire
a mutex to reserve a slot in the table
– The slot ID is stored in Thread Local Storage; subsequent read
transactions performed by the thread need no further locks
85
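The table amounts to something like the following (an illustrative layout; LMDB's real struct differs):

    #include <stdint.h>

    /* One slot per reader, padded to a cache line so slots on different
     * CPUs never share a line (no false sharing between readers). */
    typedef struct reader_slot {
        uint64_t txnid;  /* snapshot this reader is pinned to */
        uint64_t tid;    /* thread ID  */
        uint32_t pid;    /* process ID */
        char pad[64 - 2*sizeof(uint64_t) - sizeof(uint32_t)];
    } reader_slot;       /* exactly one 64-byte cache line */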
Transaction Handling
● Write transactions use pages from the free list before
allocating new disk pages
– Pages in the free list are used in order, oldest transaction first
– The readers table must be scanned to see if any reader is
referencing an old transaction
– The writer doesn't need to lock the reader table when
performing this scan - readers never block writers
● The only consequence of scanning with no locks is that the writer
may see stale data
● This is irrelevant, newer readers are of no concern; only the oldest
readers matter
86
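Finding the oldest snapshot still in use is then a lock-free scan over that table (a sketch reusing the illustrative reader_slot above; nslots is an assumption):

    #include <stdint.h>

    /* Stale reads during the scan are harmless: they can only make the
     * result more conservative, never reclaim a page too early. */
    uint64_t oldest_reader(const reader_slot *table, int nslots,
                           uint64_t current_txn)
    {
        uint64_t oldest = current_txn;
        for (int i = 0; i < nslots; i++)
            if (table[i].txnid != 0 && table[i].txnid < oldest)
                oldest = table[i].txnid;
        return oldest;
    }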
(5) Special Features
● Reserve Mode
– Allocates space in the write buffer for data of user-specified size and returns its address
– Useful for data that is generated dynamically
instead of statically copied
– Allows generated data to be written directly to DB,
avoiding unnecessary memcpy
87
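In the public API this is the MDB_RESERVE flag to mdb_put(); a sketch, where generate_report() is a hypothetical producer:

    #include <lmdb.h>

    /* Hypothetical producer that writes exactly `len` bytes: */
    void generate_report(void *buf, size_t len);

    void put_generated(MDB_txn *txn, MDB_dbi dbi, size_t len)
    {
        MDB_val key  = { 5, "stats" };
        MDB_val data = { len, NULL };  /* size known, contents not yet */
        mdb_put(txn, dbi, &key, &data, MDB_RESERVE);
        generate_report(data.mv_data, len);  /* write straight into the DB */
    }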
Special Features
● Fixed Mapping
– Uses a fixed address for the memory map
– Allows complex pointer-based data structures to be
stored directly with minimal serialization
– Objects using persistent addresses can thus be
read back and used directly, with no deserialization
88
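This corresponds to the MDB_FIXEDMAP environment flag (documented as experimental); a sketch:

    #include <lmdb.h>

    /* Ask for the same mapping address on every open, so pointers
     * stored inside values remain valid across runs. */
    int open_fixed(MDB_env *env, const char *path)
    {
        return mdb_env_open(env, path, MDB_FIXEDMAP, 0664);
    }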
Special Features
● Sub-Databases
– Store multiple independent named B+trees in a single
LMDB environment
– A Sub-DB is simply a key/data pair in the main DB,
where the data item is the root node of another tree
– Allows many related databases to be managed easily
● Transactions may span all of the Sub-DBs
● Used in back-mdb for the main data and all of the indices
● Used in SQLightning for multiple tables and indices
89
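A sketch of opening named sub-DBs (the names are illustrative; mdb_env_set_maxdbs() must be called on the environment before mdb_env_open()):

    #include <lmdb.h>

    void open_subdbs(MDB_env *env)
    {
        MDB_txn *txn;
        MDB_dbi id2entry, cn_index;
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, "id2entry", MDB_CREATE, &id2entry);
        mdb_dbi_open(txn, "cn", MDB_CREATE, &cn_index);
        mdb_txn_commit(txn);  /* a single txn can span all sub-DBs */
    }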
Special Features
● Sorted Duplicates
– Allows multiple data values for a single key
– Values are stored in sorted order, with customizable
comparison functions
– When the data values are all of a fixed size, the values are
stored contiguously, with no extra headers
● maximizes storage efficiency and performance
– Implemented by the same code as SubDB support
● maximum coding efficiency
– Can be used to efficiently implement inverted indices and sets
90
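A sketch of an inverted index using these flags (the DB name is illustrative):

    #include <lmdb.h>

    /* One word key holds many document IDs; MDB_DUPSORT keeps them
     * sorted, MDB_DUPFIXED stores the fixed-size IDs contiguously. */
    int open_inverted_index(MDB_txn *txn, MDB_dbi *dbi)
    {
        return mdb_dbi_open(txn, "word2ids",
                            MDB_CREATE | MDB_DUPSORT | MDB_DUPFIXED, dbi);
    }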
Special Features
● Atomic Hot Backup
– The entire database can be backed up live
– No need to stop updates while backups run
– The backup runs at the maximum speed of the
target storage medium
– Essentially: write(outfd, map, mapsize);
● No memcpy's in or out of user space
● Pure DMA from the database to the backup
91
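In the public API this is mdb_env_copy(); a sketch (the destination path is hypothetical):

    #include <lmdb.h>

    /* Takes a read txn internally, then streams the map sequentially,
     * so the copy is a consistent snapshot made while updates continue. */
    int backup(MDB_env *env)
    {
        return mdb_env_copy(env, "/mnt/backup/db");
    }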
(6) Results
● Support for LMDB is available in scores of open source
projects and all major Linux and BSD distros
● In OpenLDAP slapd
– LMDB reads are 5-20x faster than BDB
– Writes are 2-5x faster than BDB
– Consumes 1/4 as much RAM as BDB
● In MemcacheDB
– LMDB reads are 2-200x faster than BDB
– Writes are 5-900x faster than BDB
– Multi-thread reads are 2-8x faster than pure-memory Memcached
92
Results
● LMDB has been tested exhaustively by multiple
parties
– Symas has tested on all major filesystems: btrfs, ext2,
ext3, ext4, jfs, ntfs, reiserfs, xfs, zfs
– ext3, ext4, jfs, reiserfs, xfs also tested with external
journalling
– Testing on physical servers, VMs, HDDs, SSDs, PCIe NVM
– Testing crash reliability as well as performance and
efficiency - LMDB is proven corruption-proof in real world
conditions
93
Results
● Microbenchmarks
– In-memory DB with 100M records, 16 byte keys,
100 byte values
94
Results
● Scaling up to 64 CPUs, 64 concurrent readers
95
Results
● Scaling up to 64 CPUs, 64 concurrent readers
96
Results
● Microbenchmarks
– On-disk, 1.6 billion records, 16 byte keys, 96 byte values, 160GB on disk with 32GB RAM, VM
[Chart: Load Time (User/Sys/Wall), smaller is better]
97
Results
● VM with 16 CPU cores, 64 concurrent readers
98
Results
● VM with 16 CPU cores, 64 concurrent readers
99
Results
● Microbenchmark
– On-disk, 384M records, 16 byte keys, 4000 byte
values, 160GB on disk with 32GB RAM
[Chart: Load Time (User/Sys/Wall), smaller is better]
100
Results
● 16 CPU cores, 64 concurrent readers
101
Results
● 16 CPU cores, 64 concurrent readers
102
Results
● Memcached
[Charts: Read Performance and Write Performance, single thread, log scale (msec); min, avg, max90th, max95th, max99th, and max latencies for BDB 5.3, LMDB, Memcached, and InnoDB]
103
Results
● Memcached
[Charts: Read Performance and Write Performance, 4 threads, log scale (msec); min, avg, max90th, max95th, max99th, and max latencies for BDB 5.3, LMDB, Memcached, and InnoDB]
104
Results
● HyperDex/YCSB
105
Results
● HyperDex/YCSB
[Chart: Latency, Sequential Insert (msec, log scale); Min, Avg, 95%, 99%, and Max for LMDB vs LevelDB]
106
Results
● HyperDex/YCSB
107
Results
● HyperDex/YCSB
[Chart: Latency, Random Update/Read (msec, log scale); Min, Avg, 95%, 99%, and Max for LMDB update, LMDB read, LevelDB update, and LevelDB read]
108
Results
● An Interview with Armory Technologies CEO Alan Reiner
– JMC For more normal users, who have been frustrated with long
load times. In my testing of the latest beta build, using bitcoin 0.10
and the new headers first format, I’ve seen you optimise the load
time from 3 days, to less than 2 hours now. Well done! Can you talk
us through how you did this?
– AR. It really comes down to the new database engine (LMDB instead
of LevelDB) and really hard [work] by some of our developers to
reshape the architecture and the optimizations of the databases
● http://bitcoinsinireland.com/an-interview-with-armory-technologies-ceo-alan-reiner/
109
Results
● LDAP Benchmarks - compared to:
– OpenLDAP 2.4 back-mdb and -hdb
– OpenLDAP 2.4 back-mdb on Windows 2012 x64
– OpenDJ 2.4.6, 389DS, ApacheDS 2.0.0-M13
– Latest proprietary servers from CA, Microsoft,
Novell, and Oracle
– Test on a VM with 32GB RAM, 10M entries
110
Results
● LDAP Benchmarks
[Chart: LDAP Performance, ops/second (0 to 40,000) for Search, Mixed Search, Modify, and Mixed Mod across OL mdb 2.5, OL mdb, OL hdb, OL mdb W64, OpenDJ, 389DS, Other #1-#4, AD LDS 2012, and ApacheDS]
111
Results
● Full benchmark reports are available on the
LMDB page
– http://www.symas.com/mdb/
● Supported builds of LMDB-based packages are
available from Symas
– http://www.symas.com/
– OpenLDAP, Cyrus-SASL, Heimdal Kerberos
112
Conclusions
● The combination of memory-mapped operation with MVCC
is extremely potent
– Reduced administrative overhead
● no periodic cleanup / maintenance required
● no particular tuning required
– Reduced developer overhead
● code size and complexity drastically reduced
– Enhanced efficiency
● minimal CPU and I/O use
– allows for longer battery life on mobile devices
– allows for lower electricity/cooling costs in data centers
– allows more work to be done with less hardware
113
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/lmdb

More Related Content

Similar to LDAP at Lightning Speed

Redis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs Talks
Redis Labs
 
PPCD_And_AmazonRDS
PPCD_And_AmazonRDSPPCD_And_AmazonRDS
PPCD_And_AmazonRDSVibhor Kumar
 
Simplify Consolidation with Oracle Database 12c
Simplify Consolidation with Oracle Database 12cSimplify Consolidation with Oracle Database 12c
Simplify Consolidation with Oracle Database 12c
Maris Elsins
 
Introduction to NoSQL and MongoDB
Introduction to NoSQL and MongoDBIntroduction to NoSQL and MongoDB
Introduction to NoSQL and MongoDB
Ahmed Farag
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenant
Swiss Data Forum Swiss Data Forum
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance Platform
Maris Elsins
 
Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx
JWORKS powered by Ordina
 
Say Hello to MyRocks
Say Hello to MyRocksSay Hello to MyRocks
Say Hello to MyRocks
Sergey Petrunya
 
DbB 10 Webcast #3 The Secrets Of Scalability
DbB 10 Webcast #3   The Secrets Of ScalabilityDbB 10 Webcast #3   The Secrets Of Scalability
DbB 10 Webcast #3 The Secrets Of Scalability
Laura Hood
 
001 hbase introduction
001 hbase introduction001 hbase introduction
001 hbase introductionScott Miao
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
HostedbyConfluent
 
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
Dávid Kőszeghy
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
Best practices for DB2 for z/OS log based recovery
Best practices for DB2 for z/OS log based recoveryBest practices for DB2 for z/OS log based recovery
Best practices for DB2 for z/OS log based recovery
Florence Dubois
 
MapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR HadoopMapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR Hadoopabord
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQLDon Demcsak
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
Edward Capriolo
 
An Elastic Metadata Store for eBay’s Media Platform
An Elastic Metadata Store for eBay’s Media PlatformAn Elastic Metadata Store for eBay’s Media Platform
An Elastic Metadata Store for eBay’s Media Platform
MongoDB
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Databricks
 

Similar to LDAP at Lightning Speed (20)

Redis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs Talks
 
PPCD_And_AmazonRDS
PPCD_And_AmazonRDSPPCD_And_AmazonRDS
PPCD_And_AmazonRDS
 
Simplify Consolidation with Oracle Database 12c
Simplify Consolidation with Oracle Database 12cSimplify Consolidation with Oracle Database 12c
Simplify Consolidation with Oracle Database 12c
 
Introduction to NoSQL and MongoDB
Introduction to NoSQL and MongoDBIntroduction to NoSQL and MongoDB
Introduction to NoSQL and MongoDB
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenant
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance Platform
 
Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx
 
Say Hello to MyRocks
Say Hello to MyRocksSay Hello to MyRocks
Say Hello to MyRocks
 
DbB 10 Webcast #3 The Secrets Of Scalability
DbB 10 Webcast #3   The Secrets Of ScalabilityDbB 10 Webcast #3   The Secrets Of Scalability
DbB 10 Webcast #3 The Secrets Of Scalability
 
001 hbase introduction
001 hbase introduction001 hbase introduction
001 hbase introduction
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
 
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
 
Best practices for DB2 for z/OS log based recovery
Best practices for DB2 for z/OS log based recoveryBest practices for DB2 for z/OS log based recovery
Best practices for DB2 for z/OS log based recovery
 
MapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR HadoopMapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR Hadoop
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
PostgreSQL and MySQL
PostgreSQL and MySQLPostgreSQL and MySQL
PostgreSQL and MySQL
 
An Elastic Metadata Store for eBay’s Media Platform
An Elastic Metadata Store for eBay’s Media PlatformAn Elastic Metadata Store for eBay’s Media Platform
An Elastic Metadata Store for eBay’s Media Platform
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 

More from C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
C4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
C4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
C4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
C4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
C4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
C4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
C4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
C4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
C4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
C4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
C4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
C4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
C4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
C4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
C4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
C4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
C4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
C4Media
 

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 

Recently uploaded (20)

Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 

LDAP at Lightning Speed

  • 13. 11 Features ● LMDB config is simple, e.g. slapd:
        database mdb
        directory /var/lib/ldap/data/mdb
        maxsize 4294967296
    ● BDB config is complex:
        database hdb
        directory /var/lib/ldap/data/hdb
        cachesize 50000
        idlcachesize 50000
        dbconfig set_cachesize 4 0 1
        dbconfig set_lg_regionmax 262144
        dbconfig set_lg_bsize 2097152
        dbconfig set_lg_dir /mnt/logs/hdb
        dbconfig set_lk_max_locks 3000
        dbconfig set_lk_max_objects 1500
        dbconfig set_lk_max_lockers 1500
  • 14. 12 (3) Design Approach ● Motivation - problems dealing with BDB ● Obvious Solutions ● Approach
  • 15. 13 Motivation ● BDB slapd backend always required careful, complex tuning – Data comes through 3 separate layers of caches – Each layer has different size and speed traits – Balancing the 3 layers against each other can be a difficult juggling act – Performance without the backend caches is unacceptably slow - more than an order of magnitude slower
  • 16. 14 Motivation ● Backend caching significantly increased the overall complexity of the backend code – Two levels of locking required, since BDB database locks are too slow – Deadlocks occurring routinely in normal operation, requiring additional backoff/retry logic
  • 17. 15 Motivation ● The caches were not always beneficial, and were sometimes detrimental – Data could exist in 3 places at once - filesystem, DB, and backend cache - wasting memory – Searches with result sets that exceeded the configured cache size would reduce the cache effectiveness to zero – malloc/free churn from adding and removing entries in the cache could trigger pathological heap fragmentation in libc malloc
  • 18. 16 Obvious Solutions ● Cache management is a hassle, so don't do any caching – The filesystem already caches data; there's no reason to duplicate the effort ● Lock management is a hassle, so don't do any locking – Use Multi-Version Concurrency Control (MVCC) – MVCC makes it possible to perform reads with no locking
  • 19. 17 Obvious Solutions ● BDB supports MVCC, but still requires complex caching and locking ● To get the desired results, we need to abandon BDB ● Surveying the landscape revealed no other DB libraries with the desired characteristics ● Thus LMDB was created in 2011 – "Lightning Memory-Mapped Database" – BDB is now deprecated in OpenLDAP
  • 20. 18 Design Approach ● Based on the "Single-Level Store" concept – Not new, first implemented in Multics in 1964 – Access a database by mapping the entire DB into memory – Data fetches are satisfied by direct reference to the memory map; there is no intermediate page or buffer cache
  • 21. 19 Single-Level Store ● Only viable if process address spaces are larger than the expected data volumes – For 32 bit processors, the practical limit on data size is under 2GB – For common 64 bit processors which only implement 48 bit address spaces, the limit is 47 bits or 128 terabytes – The upper bound at 63 bits is 8 exabytes
  • 22. 20 Design Approach ● Uses a read-only memory map – Protects the DB structure from corruption due to stray writes in memory – Any attempts to write to the map will cause a SEGV, allowing immediate identification of software bugs ● Can optionally use a read-write mmap – Slight performance gain for fully in-memory data sets – Should only be used on fully-debugged application code
  • 23. 21 Design Approach ● Keith Bostic (BerkeleyDB author, personal email, 2008) – "The most significant problem with building an mmap'd back-end is implementing write-ahead-logging (WAL). (You probably know this, but just in case: the way databases usually guarantee consistency is by ensuring that log records describing each change are written to disk before their transaction commits, and before the database page that was changed. In other words, log record X must hit disk before the database page containing the change described by log record X.) – In Berkeley DB WAL is done by maintaining a relationship between the database pages and the log records. If a database page is being written to disk, there's a look-aside into the logging system to make sure the right log records have already been written. In a memory-mapped system, you would do this by locking modified pages into memory (mlock), and flushing them at specific times (msync), otherwise the VM might just push a database page with modifications to disk before its log record is written, and if you crash at that point it's all over but the screaming."
  • 24. 22 Design Approach ● Implement MVCC using copy-on-write – In-use data is never overwritten, modifications are performed by copying the data and modifying the copy – Since updates never alter existing data, the DB structure can never be corrupted by incomplete modifications ● Write-ahead transaction logs are unnecessary – Readers always see a consistent snapshot of the DB, they are fully isolated from writers ● Read accesses require no locks
  • 25. 23 MVCC Details ● "Full" MVCC can be extremely resource intensive – DBs typically store complete histories reaching far back into time – The volume of data grows extremely fast, and grows without bound unless explicit pruning is done – Pruning the data using garbage collection or compaction requires more CPU and I/O resources than the normal update workload ● Either the server must be heavily over-provisioned, or updates must be stopped while pruning is done – Pruning requires tracking of in-use status, which typically involves reference counters, which require locking
  • 26. 24 Design Approach ● LMDB nominally maintains only two versions of the DB – Rolling back to a historical version is not interesting for OpenLDAP – Older versions can be held open longer by reader transactions ● LMDB maintains a free list tracking the IDs of unused pages – Old pages are reused as soon as possible, so data volumes don't grow without bound ● LMDB tracks in-use status without locks
  • 27. 25 Implementation Highlights ● LMDB library started from the append-only btree code written by Martin Hedenfalk for his ldapd, which is bundled in OpenBSD – Stripped out all the parts we didn't need (page cache management) – Borrowed a couple pieces from slapd for expedience – Changed from append-only to page-reclaiming – Restructured to allow adding ideas from BDB that we still wanted
  • 28. 26 Implementation Highlights ● Resulting library was under 32KB of object code – Compared to the original btree.c at 39KB – Compared to BDB at 1.5MB ● API is loosely modeled after the BDB API to ease migration of back-bdb code
  • 29. 27 Implementation Highlights - Footprint (size db_bench*)
        text     data  bss    dec      hex     filename             Lines of Code
        285306   1516  352    287174   461c6   db_bench             39758
        384206   9304  3488   396998   60ec6   db_bench_basho       26577
        1688853  2416  312    1691581  19cfbd  db_bench_bdb         1746106
        315491   1596  360    317447   4d807   db_bench_hyper       21498
        121412   1644  320    123376   1e1f0   db_bench_mdb         7955
        1014534  2912  6688   1024134  fa086   db_bench_rocksdb     81169
        992334   3720  30352  1026406  fa966   db_bench_tokudb      227698
        853216   2100  1920   857236   d1494   db_bench_wiredtiger  91410
  • 30. 28 (4) Internals ● Btree Operation – Write-Ahead Logging – Append-Only – Copy-on-Write, LMDB-style ● Free Space Management – Avoiding Compaction/Garbage Collection ● Transaction Handling – Avoiding Locking
  • 31. 29 Btree Operation - Basic Elements [diagram: the basic page types - a Meta Page (pgno, misc header, root pointer), a Database Page (pgno, misc header), and a Data Page (pgno, misc header, offset-indexed key,data pairs)]
  • 32-39. 30-37 Btree Operation [diagram sequence: write-ahead logging, step by step. Starting from an empty meta page, each operation ("Add 1,foo to page 1", "Commit", "Add 2,bar to page 1", "Commit") is first appended to the write-ahead log; the data page and the meta page's root pointer are modified in place. Only at a checkpoint are the modified data and meta pages flushed to the database file.]
  • 40. 38 Btree Operation How Append-Only/Copy-On-Write Works ● Updates are always performed bottom up ● Every branch node from the leaf to the root must be copied/modified for any leaf update ● Any node not on the path from the leaf to the root is unaltered ● The root node is always written last
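To make the bottom-up copy path concrete, here is a minimal C sketch under stated assumptions: the page type and the helper routines (page_copy, leaf_insert, node_set_child) are illustrative stand-ins, not LMDB's internal API.

    #include <stdint.h>

    typedef uint64_t pgno_t;
    typedef struct page { pgno_t pgno; /* header, keys, children... */ } page;

    /* Hypothetical helpers standing in for the real B+tree routines: */
    page *page_copy(page *p);                         /* copy a page          */
    void  leaf_insert(page *p, int key, int val);     /* add key,val to leaf  */
    void  node_set_child(page *p, int key, pgno_t c); /* repoint branch entry */

    /* Bottom-up copy-on-write update: copy the leaf, then copy every
     * branch page on the path back up; subtrees off the path are shared
     * unmodified with the old version, and the new root is written last. */
    page *cow_put(page *leaf, page *path[], int depth, int key, int val)
    {
        page *newpg = page_copy(leaf);        /* live data never overwritten */
        leaf_insert(newpg, key, val);
        for (int i = depth - 1; i >= 0; i--) {
            page *parent = page_copy(path[i]);
            node_set_child(parent, key, newpg->pgno);
            newpg = parent;                   /* climb toward the root */
        }
        return newpg;                         /* the new root version  */
    }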
  • 42. 40 Btree Operation Append-Only Update a leaf node by copying it and updating the copy
  • 43. 41 Btree Operation Append-Only Copy the root node, and point it at the new leaf
  • 44. 42 Btree Operation Append-Only The old root and old leaf remain as a previous version of the tree
  • 49. 47 Btree Operation In the Append-Only tree, new pages are always appended sequentially to the DB file ● While there's significant overhead for making complete copies of modified pages, the actual I/O is linear and relatively fast ● The root node is always the last page of the file, unless there was a crash ● Any root node can be found by seeking backward from the end of the file, and checking the page's header ● Recovery from a crash is relatively easy – Everything from the last valid root to the beginning of the file is always pristine – Anything between the end of the file and the last valid root is discarded
  • 51-58. 49-56 Btree Operation [diagram sequence: Append-Only growth. The file starts with meta page 0 (empty root) and data page 1 holding 1,foo; the commit appends meta page 2 (root: 1). Updating appends data page 3 (1,foo; 2,bar) and meta page 4 (root: 3), then data page 5 (1,blah; 2,bar) and meta page 6 (root: 5), then data page 7 (1,blah; 2,xyz) and meta page 8 (root: 7). Every update appends a new data page and a new meta page; nothing is ever reclaimed.]
  • 59. 57 Btree Operation Append-Only disk usage is very inefficient ● Disk space usage grows without bound ● 99+% of the space will be occupied by old versions of the data ● The old versions are usually not interesting ● Reclaiming the old space requires a very expensive compaction phase ● New updates must be throttled until compaction completes
  • 60. 58 Btree Operation The LMDB Approach ● Still Copy-on-Write, but using two fixed root nodes – Page 0 and Page 1 of the file, used in double-buffer fashion – Even faster cold-start than Append-Only, no searching needed to find the last valid root node – Any app always reads both pages and uses the one with the greater Transaction ID stamp in its header – Consequently, only 2 outstanding versions of the DB exist, not fully "multi-version"
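A minimal sketch of that meta page selection, assuming an illustrative meta layout (LMDB's real header carries more fields):

    #include <stdint.h>

    typedef uint64_t pgno_t;

    /* Illustrative meta page layout: the transaction ID is a single
     * machine word, so its update is atomic. */
    typedef struct meta {
        uint64_t txnid;   /* stamp of the committing transaction */
        pgno_t   froot;   /* root of the free-pages B+tree */
        pgno_t   droot;   /* root of the user-data B+tree  */
    } meta;

    /* Read both fixed meta pages (pages 0 and 1 of the file) and use
     * the one with the greater transaction ID stamp. */
    static meta *pick_meta(meta *page0, meta *page1)
    {
        return (page0->txnid > page1->txnid) ? page0 : page1;
    }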
  • 65. 63 Btree Operation [diagram] After this step the old blue page is no longer referenced by anything else in the database, so it can be reclaimed
  • 67. 65 Btree Operation [diagram] After this step the old yellow page is no longer referenced by anything else in the database, so it can also be reclaimed
  • 68. 66 Free Space Management LMDB maintains two B+trees per root node ● One storing the user data, as illustrated above ● One storing lists of IDs of pages that have been freed in a given transaction ● Old, freed pages are used in preference to new pages, so the DB file size remains relatively static over time ● No compaction or garbage collection phase is ever needed
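One plausible shape for a free-list entry, matching the "txn N, page ..." records in the diagrams that follow (the struct is illustrative, not LMDB's exact on-disk format):

    #include <stdint.h>

    typedef uint64_t pgno_t;

    /* Conceptual entry in the free-pages B+tree: keyed by the
     * transaction that freed the pages, valued by the list of freed
     * page numbers.  A writer may only reuse pages freed in a
     * transaction older than the oldest snapshot still held by any
     * reader. */
    typedef struct freelist_entry {
        uint64_t txnid;    /* transaction that freed these pages */
        unsigned count;    /* number of page IDs that follow     */
        pgno_t   pages[];  /* the freed page numbers             */
    } freelist_entry;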
  • 69-80. 67-78 Free Space Management [diagram sequence: the two meta pages (0 and 1) are used in double-buffer fashion, each stamped with a TXN ID, an FRoot (free-pages tree) and a DRoot (data tree). Txn 1 writes data page 2 (1,foo) and commits via meta page 1 (DRoot: 2). Txn 2 copies the data to page 3 (1,foo; 2,bar), records the freed page 2 in free-list page 4 ("txn 2, page 2"), and commits via meta page 0 (FRoot: 4, DRoot: 3). Txn 3 writes data page 5 and free-list page 6 and commits via meta page 1 (FRoot: 6, DRoot: 5). In txn 4, page 2 - freed in txn 2 and visible to no remaining reader - is reused for the new data (1,blah; 2,xyz), and the commit via meta page 0 sets DRoot back to 2: the file has stopped growing.]
  • 81. 79 Free Space Management ● Caveat: If a read transaction is open on a particular version of the DB, that version and every version after it are excluded from page reclaiming. ● Thus, long-lived read transactions should be avoided, otherwise the DB file size may grow rapidly, devolving into Append-Only behavior until the transactions are closed
  • 82. 80 Transaction Handling ● LMDB supports a single writer concurrent with many readers – A single mutex serializes all write transactions – The mutex is shared/multiprocess ● Readers run lockless and never block – But for page reclamation purposes, readers are tracked ● Transactions are stamped with an ID which is a monotonically increasing integer – The ID is only incremented for Write transactions that actually modify data – If a Write transaction is aborted, or committed with no changes, the same ID will be reused for the next Write transaction
  • 83. 81 Transaction Handling ● Transactions take a snapshot of the currently valid meta page at the beginning of the transaction ● No matter what write transactions follow, a read transaction's snapshot will always point to a valid version of the DB ● The snapshot is totally isolated from subsequent writes ● This provides the Consistency and Isolation in ACID semantics
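This snapshot behavior is visible directly in the public C API: a transaction opened with MDB_RDONLY pins one version of the DB until it ends. A minimal usage sketch (error handling omitted; "./db" is a placeholder directory that must already exist):

    #include <stdio.h>
    #include "lmdb.h"

    int main(void)
    {
        MDB_env *env;
        MDB_txn *rtxn, *wtxn;
        MDB_dbi dbi;
        MDB_val key  = { 5, "hello" };
        MDB_val data = { 5, "world" };
        MDB_val out;

        mdb_env_create(&env);
        mdb_env_open(env, "./db", 0, 0664);   /* directory must exist */

        /* Single writer: insert a key and commit. */
        mdb_txn_begin(env, NULL, 0, &wtxn);
        mdb_dbi_open(wtxn, NULL, 0, &dbi);
        mdb_put(wtxn, dbi, &key, &data, 0);
        mdb_txn_commit(wtxn);

        /* Reader: pins a snapshot; later commits are invisible to it. */
        mdb_txn_begin(env, NULL, MDB_RDONLY, &rtxn);
        mdb_get(rtxn, dbi, &key, &out);
        printf("%.*s\n", (int)out.mv_size, (char *)out.mv_data);
        mdb_txn_abort(rtxn);                  /* release the snapshot */

        mdb_env_close(env);
        return 0;
    }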
  • 84. 82 Transaction Handling ● The currently valid meta page is chosen based on the greatest transaction ID in each meta page – The meta pages are page and CPU cache aligned – The transaction ID is a single machine word – The update of the transaction ID is atomic – Thus, the Atomicity semantics of transactions are guaranteed
  • 85. 83 Transaction Handling ● During Commit, the data pages are written and then synchronously flushed before the meta page is updated – Then the meta page is written synchronously – Thus, when a commit returns "success", it is guaranteed that the transaction has been written intact – This provides the Durability semantics – If the system crashes before the meta page is updated, then the data updates are irrelevant
  • 86. 84 Transaction Handling ● For tracking purposes, Readers must acquire a slot in the readers table – The readers table is also in a shared memory map, but separate from the main data map – This is a simple array recording the Process ID, Thread ID, and Transaction ID of the reader – The array elements are CPU cache aligned – The first time a thread opens a read transaction, it must acquire a mutex to reserve a slot in the table – The slot ID is stored in Thread Local Storage; subsequent read transactions performed by the thread need no further locks
  • 87. 85 Transaction Handling ● Write transactions use pages from the free list before allocating new disk pages – Pages in the free list are used in order, oldest transaction first – The readers table must be scanned to see if any reader is referencing an old transaction – The writer doesn't need to lock the reader table when performing this scan - readers never block writers ● The only consequence of scanning with no locks is that the writer may see stale data ● This is irrelevant, newer readers are of no concern; only the oldest readers matter
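A sketch of the reader table and the writer's lock-free scan for the oldest snapshot (field names, the padding size, and the helper are illustrative; LMDB's actual table lives in a separate shared memory map):

    #include <stdint.h>

    /* One slot per active reader, padded to a cache line so readers on
     * different CPUs never share a line. */
    typedef struct reader_slot {
        uint64_t txnid;    /* snapshot (transaction ID) in use, 0 = free */
        uint32_t pid;      /* process ID of the reader */
        uint32_t tid;      /* thread ID of the reader  */
        char     pad[48];  /* pad the struct to 64 bytes */
    } reader_slot;

    /* Writer's lock-free scan for the oldest snapshot still in use;
     * pages freed before that transaction may be reused.  Reading the
     * table without locks can return stale values, which is harmless:
     * only the oldest reader matters. */
    uint64_t oldest_reader(reader_slot *table, int nslots, uint64_t newest)
    {
        uint64_t oldest = newest;
        for (int i = 0; i < nslots; i++)
            if (table[i].txnid && table[i].txnid < oldest)
                oldest = table[i].txnid;
        return oldest;
    }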
  • 88. 86 (5) Special Features ● Reserve Mode – Allocates space in write buffer for data of user- specified size, returns address – Useful for data that is generated dynamically instead of statically copied – Allows generated data to be written directly to DB, avoiding unnecessary memcpy
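A brief sketch of reserve mode via the MDB_RESERVE flag to mdb_put(), assuming an open txn and dbi as in the earlier example; generate_payload() is a hypothetical data generator:

    /* Reserve 4096 bytes for key "blob1", then generate the data
     * directly into the database's write buffer - no intermediate copy. */
    MDB_val key = { 5, "blob1" };
    MDB_val data;
    data.mv_size = 4096;                          /* size must be set first  */
    mdb_put(txn, dbi, &key, &data, MDB_RESERVE);  /* data.mv_data now points
                                                     at the reserved space   */
    generate_payload(data.mv_data, data.mv_size); /* hypothetical generator  */
    mdb_txn_commit(txn);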
  • 89. 87 Special Features ● Fixed Mapping – Uses a fixed address for the memory map – Allows complex pointer-based data structures to be stored directly with minimal serialization – Objects using persistent addresses can thus be read back and used directly, with no deserialization
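In the public API this corresponds to the (experimental) MDB_FIXEDMAP environment flag; a one-line sketch, assuming the OS grants the requested mapping address:

    MDB_env *env;
    mdb_env_create(&env);
    /* Map the DB at the same fixed address on every run, so stored
     * pointer-based objects remain valid across runs.  Experimental:
     * depends on the OS honoring the requested address. */
    mdb_env_open(env, "./db", MDB_FIXEDMAP, 0664);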
  • 90. 88 Special Features ● Sub-Databases – Store multiple independent named B+trees in a single LMDB environment – A Sub-DB is simply a key/data pair in the main DB, where the data item is the root node of another tree – Allows many related databases to be managed easily ● Transactions may span all of the Sub-DBs ● Used in back-mdb for the main data and all of the indices ● Used in SQLightning for multiple tables and indices
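A sketch of two named Sub-DBs updated in one atomic transaction; note that mdb_env_set_maxdbs() must be called before mdb_env_open() when named Sub-DBs are used (the names "people" and "index" are examples):

    #include "lmdb.h"

    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi people, idx;
    MDB_val key  = { 5, "jones" };
    MDB_val data = { 4, "1001" };

    mdb_env_create(&env);
    mdb_env_set_maxdbs(env, 4);               /* must precede mdb_env_open() */
    mdb_env_open(env, "./db", 0, 0664);

    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, "people", MDB_CREATE, &people);
    mdb_dbi_open(txn, "index",  MDB_CREATE, &idx);
    mdb_put(txn, people, &key, &data, 0);     /* main record             */
    mdb_put(txn, idx,    &data, &key, 0);     /* inverted index entry    */
    mdb_txn_commit(txn);                      /* both commit atomically  */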
  • 91. 89 Special Features ● Sorted Duplicates – Allows multiple data values for a single key – Values are stored in sorted order, with customizable comparison functions – When the data values are all of a fixed size, the values are stored contiguously, with no extra headers ● maximizes storage efficiency and performance – Implemented by the same code as SubDB support ● maximum coding efficiency – Can be used to efficiently implement inverted indices and sets
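A sketch of sorted duplicates using the MDB_DUPSORT flag (an inverted-index style mapping of one key to several values; names are examples, and env/txn are assumed open as above):

    MDB_dbi tags;
    MDB_val key = { 4, "blue" };
    MDB_val v1  = { 3, "car" };
    MDB_val v2  = { 3, "sky" };

    mdb_txn_begin(env, NULL, 0, &txn);
    /* MDB_DUPSORT: multiple sorted values per key.  Adding MDB_DUPFIXED
     * (all values the same size) packs the values contiguously. */
    mdb_dbi_open(txn, "tags", MDB_CREATE | MDB_DUPSORT, &tags);
    mdb_put(txn, tags, &key, &v1, 0);
    mdb_put(txn, tags, &key, &v2, 0);   /* second value under the same key */
    mdb_txn_commit(txn);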
  • 92. 90 Special Features ● Atomic Hot Backup – The entire database can be backed up live – No need to stop updates while backups run – The backup runs at the maximum speed of the target storage medium – Essentially: write(outfd, map, mapsize); ● No memcpy's in or out of user space ● Pure DMA from the database to the backup
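The public API wraps this as mdb_env_copy(); a minimal sketch, assuming an open env and an existing destination directory ("/backup/db" is an example path):

    /* Copy the live environment to a backup directory; readers and
     * writers continue unimpeded while the copy runs. */
    int rc = mdb_env_copy(env, "/backup/db");
    if (rc != 0)
        fprintf(stderr, "backup failed: %s\n", mdb_strerror(rc));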
  • 93. 91 (6) Results ● Support for LMDB is available in scores of open source projects and all major Linux and BSD distros ● In OpenLDAP slapd – LMDB reads are 5-20x faster than BDB – Writes are 2-5x faster than BDB – Consumes 1/4 as much RAM as BDB ● In MemcacheDB – LMDB reads are 2-200x faster than BDB – Writes are 5-900x faster than BDB – Multi-thread reads are 2-8x faster than pure-memory Memcached
  • 94. 92 Results ● LMDB has been tested exhaustively by multiple parties – Symas has tested on all major filesystems: btrfs, ext2, ext3, ext4, jfs, ntfs, reiserfs, xfs, zfs – ext3, ext4, jfs, reiserfs, xfs also tested with external journalling – Testing on physical servers, VMs, HDDs, SSDs, PCIe NVM – Testing crash reliability as well as performance and efficiency - LMDB is proven corruption-proof in real world conditions
  • 95. 93 Results ● Microbenchmarks – In-memory DB with 100M records, 16 byte keys, 100 byte values
  • 96. 94 Results ● Scaling up to 64 CPUs, 64 concurrent readers
  • 97. 95 Results ● Scaling up to 64 CPUs, 64 concurrent readers
  • 98. 96 Results ● Microbenchmarks – On-disk, 1.6 billion records, 16 byte keys, 96 byte values, 160GB on disk with 32GB RAM, VM [chart: Load Time (User, Sys, Wall), smaller is better]
  • 99. 97 Results ● VM with 16 CPU cores, 64 concurrent readers
  • 100. 98 Results ● VM with 16 CPU cores, 64 concurrent readers
  • 101. 99 Results ● Microbenchmark – On-disk, 384M records, 16 byte keys, 4000 byte values, 160GB on disk with 32GB RAM [chart: Load Time (User, Sys, Wall), smaller is better]
  • 102. 100 Results ● 16 CPU cores, 64 concurrent readers
  • 103. 101 Results ● 16 CPU cores, 64 concurrent readers
  • 104. 102 Results ● Memcached [charts: Read Performance and Write Performance, single thread, log scale; latency in msec (min, avg, 90th/95th/99th percentile, max) for BDB 5.3, LMDB, Memcached, InnoDB]
  • 105. 103 Results ● Memcached [charts: Read Performance and Write Performance, 4 threads, log scale; latency in msec (min, avg, 90th/95th/99th percentile, max) for BDB 5.3, LMDB, Memcached, InnoDB]
  • 109. 107 Results ● HyperDex/YCSB [chart: Latency, Random Update/Read, log scale; latency in msec (min, avg, 95%, 99%, max) for LMDB update, LMDB read, LevelDB update, LevelDB read]
  • 110. 108 Results ● An Interview with Armory Technologies CEO Alan Reiner – JMC: For more normal users, who have been frustrated with long load times: in my testing of the latest beta build, using bitcoin 0.10 and the new headers-first format, I've seen you optimise the load time from 3 days to less than 2 hours now. Well done! Can you talk us through how you did this? – AR: It really comes down to the new database engine (LMDB instead of LevelDB) and really hard [work] by some of our developers to reshape the architecture and the optimizations of the databases ● http://bitcoinsinireland.com/an-interview-with-armory-technologies-ceo-alan-reiner/
  • 111. 109 Results ● LDAP Benchmarks - compared to: – OpenLDAP 2.4 back-mdb and -hdb – OpenLDAP 2.4 back-mdb on Windows 2012 x64 – OpenDJ 2.4.6, 389DS, ApacheDS 2.0.0-M13 – Latest proprietary servers from CA, Microsoft, Novell, and Oracle – Test on a VM with 32GB RAM, 10M entries
  • 112. 110 Results ● LDAP Benchmarks [chart: LDAP Performance in ops/second (0-40000) for Search, Mixed Search, Modify, and Mixed Mod across OL mdb 2.5, OL mdb, OL hdb, OL mdb W64, OpenDJ, 389DS, Other #1-#4, AD LDS 2012, ApacheDS]
  • 113. 111 Results ● Full benchmark reports are available on the LMDB page – http://www.symas.com/mdb/ ● Supported builds of LMDB-based packages are available from Symas – http://www.symas.com/ – OpenLDAP, Cyrus-SASL, Heimdal Kerberos
  • 114. 112 Conclusions ● The combination of memory-mapped operation with MVCC is extremely potent – Reduced administrative overhead ● no periodic cleanup / maintenance required ● no particular tuning required – Reduced developer overhead ● code size and complexity drastically reduced – Enhanced efficiency ● minimal CPU and I/O use – allows for longer battery life on mobile devices – allows for lower electricity/cooling costs in data centers – allows more work to be done with less hardware
  • 116. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/lmdb