Perforce BTrees:
The Arcane and The Profane
Jeff Anton
Storage Architect, Perforce Software
Major changes in the P4D Database
• Berkeley DB 1.8X
• DBOpen2 (2001.1)
• +Reorg (2005.1)
• +Checksums (2008.2)
• +LockLess, +64-Bit Ref (2013.3)
Storage behavior and operational needs have changed over time.
SSDs and non-disk storage have changed the database world.
File System Caching is Critical
• Each P4D thread/process has only a small in-process cache
• The OS cache provides primary I/O caching
• Load up machines with real memory to get good I/O caching (this does mean archives and p4d processes compete for memory)
• Page size behaviors can be non-obvious
• 8K-byte pages seem best (quick check below)
• SSDs are not a substitute for real memory!
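A quick way to confirm the page size an existing table was actually built with is the dbstat output shown later in this deck; the grep is just a convenience, and dbstat requires super access:
  p4 dbstat -h db.have | grep "page size"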
Rebuilding for Space and Performance
• SDP and many other installations reload the DB regularly
• Pro: recovers disk space
• Pro: sequential reads are fast
• Con: downtime (can be minimal using offline backups)
• Con: updates can be slow for a while after a rebuild
• Con: space is needed for the rebuild
• dbopen.freepct (0-99, default 0)
• p4d -vdbopen.freepct=10 -jr <checkpoint>
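A minimal sketch of an offline rebuild using that tunable; the root and checkpoint paths are placeholders for your own layout, and the assumption (based on the "updates can be slow after a rebuild" point above) is that freepct leaves that percentage of each page free as the tables are rebuilt:
  # replay the checkpoint into an offline root, leaving 10% of each page free
  p4d -r /p4/offline_db -vdbopen.freepct=10 -jr /p4/checkpoints/checkpoint.latest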
Passive Reorganization
• OS file systems often schedule read-ahead I/O
• We want to take advantage of that
• Solution: re-write subtrees so they are kept in sequential pages
• Slows down some write operations
• Can make the DB files larger due to needing contiguous pages
• Larger table scans win
• Churns flash memory (expensive writes)
Reorganization Space Usage
• Getting sequential pages for a reorganization is hard
• The free page index can quickly find contiguous free page blocks, but often no such blocks are available
• If reorganizations happen too often, tables grow from reorganization while many scattered free pages remain unused!
• Summary point: reorganization makes tables larger, with more unused space
Is Reorganization Obsolete?
• In some cases we have seen that reorganization is not worth the cost: the extra write load can be expensive
• Solid-state "disks" make read-ahead less important
• The overhead of larger DB files may be costly on SSD
• New lock-free reading speeds up scans and eliminates readers blocking writers, so slower readers are OK
• Try turning it off (see the example below)
• db.reorg.disable = 1
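A hedged example of turning passive reorganization off. db.reorg.disable is undocumented (see p4 help undoc), and the assumption here is that it can be set like any other configurable:
  p4 configure set db.reorg.disable=1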
Page Location Choices
• The index of free pages allows page allocations to be made near referencing pages, i.e. we reuse pages close to existing related pages
• But if newer data is near the end of the db file, we keep using pages near the end of the file
• db.page.migrate can be set to a percentage to avoid allocating pages at the end of the file if possible (see the example below)
• Foreshadowing: shrinking the db file. If a lot of pages are free at the end of the file, we can truncate!
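A hedged example; the value 10 is purely illustrative, and the assumption is that the percentage describes how much of the tail of the file to steer allocations away from:
  p4 configure set db.page.migrate=10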
Other configuration
• dbopen.cache – in-p4d cache size (number of pages)
• dbopen.cache.wide – in-p4d cache size for db.integed
• dbopen.nofsync – skip fsync on close of a DB file
• dbopen.pagesize – default 8K, related to key size (only useful when tables are created, such as with checkpoint recovery)
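These are all undocumented tunables, so the definitive list is whatever your server reports. A hedged sketch of applying two of them, assuming they can be set like other configurables or passed with -v; the values are illustrative, not recommendations, and dbopen.pagesize only takes effect where tables get created, e.g. a checkpoint replay into a fresh root:
  p4 help undoc | grep db
  p4 configure set dbopen.cache=192
  # assumption: pagesize is given in bytes (16K pages here); applies only to newly created tables
  p4d -r /p4/offline_db -vdbopen.pagesize=16384 -jr <checkpoint>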
P4 dbstat
p4 dbstat -h db.working
db.working
internal+leaf 69+3452
page size 8k end page 5398
generation 18 levels 3 fanout 51
ordered leaves: 84%
Checksum 2028059175
.... : -1000 85
-1000 : -100 73
-100 : -10 63
-10 : -1 11
1 2926
1 : 10 54
10 : 100 70
100 : 1000 88
1000 : .... 81
(Reading the histogram: it buckets the page distance from each leaf to the next leaf in key order, so the 2926 leaves at distance 1 are the "ordered" ones; 2926 of 3452 leaves lines up with the 84% figure above.)
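The same flags work on any table; to compare several of the tables mentioned in this deck in one pass (a plain shell loop, nothing Perforce-specific):
  for t in db.have db.working db.locks; do p4 dbstat -h $t; done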
P4 dbstat -f
p4 dbstat -f -h db.locks
db.locks
2529 pages, 741 free, 29% of file
0% through 10% 0 pages 0 pct free
10% through 20% 0 pages 0 pct free
20% through 30% 0 pages 0 pct free
30% through 40% 0 pages 0 pct free
40% through 50% 0 pages 0 pct free
50% through 60% 0 pages 0 pct free
60% through 70% 1 pages 0 pct free
70% through 80% 236 pages 31 pct free
80% through 90% 253 pages 34 pct free
90% through 100% 251 pages 33 pct free
(Reading this: 741 of the 2529 pages are free, 29% of the file, and essentially all of them, 236+253+251, sit in the last 30% of the file. That is exactly the situation db.page.migrate and end-of-file truncation are aimed at.)
Reading -Ztrack output
--- lapse 445s … (from the p4 archive command)
--- db.revbx
--- pages in+out+cached 2071+962+96
--- pages split internal+leaf 7+260
--- locks read/write 1/605 rows get+pos+scan put+del 0+5524530+5524530 4083+0
--- total lock wait+held read/write 0ms+62479ms/0ms+187ms
--- max lock wait+held read/write 0ms+62479ms/0ms+141ms
(Reading this: the command took one read lock and 605 write locks on db.revbx, positioned and scanned about 5.5 million rows and put 4083, and held its read lock for roughly 62 seconds in total, while the write locks were held for only 187 ms combined, 141 ms at most.)
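With performance tracking enabled these lines land in the server log, so a plain grep (the log path is a placeholder) is enough to hunt for long lock holds:
  grep -e "--- max lock wait+held" /p4/logs/log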
P4 dbverify
p4 dbverify -v   (or p4d -xv -vdb=3)
Validating db.have
tree stats: leafs: 1219568 internal: 22065 free: 2734 levels: 4
items: 74449685 overflow chains: 0 overflow pages: 0
missing pages: 0 leaf page free space: 1%
leaf offset sum: 2316501851 wrinkle factor: 1899.44
main checksum: 1769119828 alt checksum 1244557999
P4 dbverify (cont)
Validating db.desc
tree stats: leafs: 19910 internal: 57 free: 1 levels: 2
items: 791223 overflow chains: 1775 overflow pages: 2489
missing pages: 0 leaf page free space: 2%
leaf offset sum: 374121 wrinkle factor: 18.79
main checksum: 3701700704 alt checksum 1844779787
(db.desc holds changelist descriptions; the overflow chains and pages are where values too large for a leaf page end up, which is presumably why db.have above has none.)
Why not use a DBMS instead of your DB?
• Lots of DBMSs provide lots of value
• Answer: P4D is a DBMS!
• OK, it's a special-purpose DBMS, not a general one
• Tightly integrated
• Maps and pattern matching are close to the database
• It might be possible to use an extensible DBMS to match the functionality
Useful References
USENIX FAST '16 Conference Proceedings
https://www.usenix.org/conference/fast16/technical-sessions
BTree introduction and graphics of splits
http://underpop.online.fr/j/java/algorithims-in-java-1-4/ch16lev1sec3.htm
p4 help undoc | grep db

Catch me at the Conference wherever you can to talk!
anton@perforce.com
