Perforce BTrees:
The Arcane and The Profane
Jeff Anton
Storage Architect, Perforce Software
Major changes in the P4D Database
• Berkeley DB 1.8X
• DBOpen2 (2001.1)
• +Reorg (2005.1)
• +Checksums (2008.2)
• +LockLess, +64-Bit Ref (2013.3)
Storage behavior and operational needs have changed over time.
SSDs and non-disk storage have changed the database world.
File System Caching is Critical
• Each P4D thread/process has only a small in-process cache
• The OS cache provides primary I/O caching
• Load up machines with real memory to get good I/O caching (this does mean archives and p4d processes compete for memory)
• Page size behaviors can be non-obvious
• 8K-byte pages seem best (quick check below)
• SSDs are not a substitute for real memory!
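A quick way to confirm the page size an existing table was actually built with is the dbstat output shown later in this deck; the grep is just a convenience, and dbstat requires super access:
  p4 dbstat -h db.have | grep "page size"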
Rebuilding for Space and Performance
• SDP and many other installations reload the DB regularly
• Pro: recovers disk space
• Pro: sequential reads are fast
• Con: downtime (can be minimal using offline backups)
• Con: updates can be slow for a while after a rebuild
• Con: space is needed for the rebuild
• dbopen.freepct (0-99, default 0)
• p4d -vdbopen.freepct=10 -jr <checkpoint>
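A minimal sketch of an offline rebuild using that tunable; the root and checkpoint paths are placeholders for your own layout, and the assumption (based on the "updates can be slow after a rebuild" point above) is that freepct leaves that percentage of each page free as the tables are rebuilt:
  # replay the checkpoint into an offline root, leaving 10% of each page free
  p4d -r /p4/offline_db -vdbopen.freepct=10 -jr /p4/checkpoints/checkpoint.latest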
Passive Reorganization
• OS file systems often schedule read-ahead I/O
• We want to take advantage of that
• Solution: re-write subtrees so they are kept in sequential pages
• Slows down some write operations
• Can make the DB files larger due to needing contiguous pages
• Larger table scans win
• Churns flash memory (expensive writes)
Reorganization Space Usage
• Getting sequential pages for a reorganization is hard
• The free page index can quickly find contiguous free page blocks, but often no such blocks are available
• If reorganizations happen too often, tables grow from reorganization while many scattered free pages remain unused!
• Summary point: reorganization makes tables larger, with more unused space
Is Reorganization Obsolete?
• In some cases we have seen that reorganization is not worth the cost: the extra write load can be expensive
• Solid-state "disks" make read-ahead less important
• The overhead of larger DB files may be costly on SSD
• New lock-free reading speeds up scans and eliminates readers blocking writers, so slower readers are OK
• Try turning it off (see the example below)
• db.reorg.disable = 1
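A hedged example of turning passive reorganization off. db.reorg.disable is undocumented (see p4 help undoc), and the assumption here is that it can be set like any other configurable:
  p4 configure set db.reorg.disable=1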
Page Location Choices
• The index of free pages allows page allocations to be made near referencing pages, i.e. we reuse pages close to existing related pages
• But if newer data is near the end of the db file, we keep using pages near the end of the file
• db.page.migrate can be set to a percentage to avoid allocating pages at the end of the file if possible (see the example below)
• Foreshadowing: shrinking the db file. If a lot of pages are free at the end of the file, we can truncate!
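A hedged example; the value 10 is purely illustrative, and the assumption is that the percentage describes how much of the tail of the file to steer allocations away from:
  p4 configure set db.page.migrate=10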
Other configuration
• dbopen.cache – in-p4d cache size (number of pages)
• dbopen.cache.wide – in-p4d cache size for db.integed
• dbopen.nofsync – skip fsync on close of a DB file
• dbopen.pagesize – default 8K, related to key size (only useful when tables are created, such as with checkpoint recovery)
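These are all undocumented tunables, so the definitive list is whatever your server reports. A hedged sketch of applying two of them, assuming they can be set like other configurables or passed with -v; the values are illustrative, not recommendations, and dbopen.pagesize only takes effect where tables get created, e.g. a checkpoint replay into a fresh root:
  p4 help undoc | grep db
  p4 configure set dbopen.cache=192
  # assumption: pagesize is given in bytes (16K pages here); applies only to newly created tables
  p4d -r /p4/offline_db -vdbopen.pagesize=16384 -jr <checkpoint>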
P4 dbstat
p4 dbstat -h db.working
db.working
internal+leaf 69+3452
page size 8k end page 5398
generation 18 levels 3 fanout 51
ordered leaves: 84%
Checksum 2028059175
.... : -1000 85
-1000 : -100 73
-100 : -10 63
-10 : -1 11
1 2926
1 : 10 54
10 : 100 70
100 : 1000 88
1000 : .... 81
(Reading the histogram: it buckets the page distance from each leaf to the next leaf in key order, so the 2926 leaves at distance 1 are the "ordered" ones; 2926 of 3452 leaves lines up with the 84% figure above.)
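The same flags work on any table; to compare several of the tables mentioned in this deck in one pass (a plain shell loop, nothing Perforce-specific):
  for t in db.have db.working db.locks; do p4 dbstat -h $t; done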
P4 dbstat -f
p4 dbstat -f -h db.locks
db.locks
2529 pages, 741 free, 29% of file
0% through 10% 0 pages 0 pct free
10% through 20% 0 pages 0 pct free
20% through 30% 0 pages 0 pct free
30% through 40% 0 pages 0 pct free
40% through 50% 0 pages 0 pct free
50% through 60% 0 pages 0 pct free
60% through 70% 1 pages 0 pct free
70% through 80% 236 pages 31 pct free
80% through 90% 253 pages 34 pct free
90% through 100% 251 pages 33 pct free
(Reading this: 741 of the 2529 pages are free, 29% of the file, and essentially all of them, 236+253+251, sit in the last 30% of the file. That is exactly the situation db.page.migrate and end-of-file truncation are aimed at.)
Reading -Ztrack output
--- lapse 445s … (from the p4 archive command)
--- db.revbx
--- pages in+out+cached 2071+962+96
--- pages split internal+leaf 7+260
--- locks read/write 1/605 rows get+pos+scan put+del 0+5524530+5524530 4083+0
--- total lock wait+held read/write 0ms+62479ms/0ms+187ms
--- max lock wait+held read/write 0ms+62479ms/0ms+141ms
(Reading this: the command took one read lock and 605 write locks on db.revbx, positioned and scanned about 5.5 million rows and put 4083, and held its read lock for roughly 62 seconds in total, while the write locks were held for only 187 ms combined, 141 ms at most.)
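With performance tracking enabled these lines land in the server log, so a plain grep (the log path is a placeholder) is enough to hunt for long lock holds:
  grep -e "--- max lock wait+held" /p4/logs/log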
P4 dbverify
p4 dbverify -v   (or p4d -xv -vdb=3)
Validating db.have
tree stats: leafs: 1219568 internal: 22065 free: 2734 levels: 4
items: 74449685 overflow chains: 0 overflow pages: 0
missing pages: 0 leaf page free space: 1%
leaf offset sum: 2316501851 wrinkle factor: 1899.44
main checksum: 1769119828 alt checksum 1244557999
P4 dbverify (cont)
Validating db.desc
tree stats: leafs: 19910 internal: 57 free: 1 levels: 2
items: 791223 overflow chains: 1775 overflow pages: 2489
missing pages: 0 leaf page free space: 2%
leaf offset sum: 374121 wrinkle factor: 18.79
main checksum: 3701700704 alt checksum 1844779787
(db.desc holds changelist descriptions; the overflow chains and pages are where values too large for a leaf page end up, which is presumably why db.have above has none.)
Why not use a DBMS instead of your DB?
• Lots of DBMSs provide lots of value
• Answer: P4D is a DBMS!
• OK, it's a special-purpose DBMS, not a general one
• Tightly integrated
• Maps and pattern matching are close to the database
• It might be possible to use an extensible DBMS to match the functionality
Useful References
USENIX FAST '16 Conference Proceedings
https://www.usenix.org/conference/fast16/technical-sessions
BTree introduction and graphics of splits
http://underpop.online.fr/j/java/algorithims-in-java-1-4/ch16lev1sec3.htm
p4 help undoc | grep db

Catch me at the Conference wherever you can to talk!
anton@perforce.com
