Scaling PostgreSQL on SMP Architectures
1. Scaling PostgreSQL on SMP Architectures
Doug Tolbert, David Strong, Johney Tsai
{doug.tolbert, david.strong, johney.tsai}@unisys.com
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 1
2. Unisys Cellular Multi-Processor Architecture
• Up to 32 CPU sockets
– In 4- or 8-CPU pods
– Dual-core supported, multi-core eventually
– Dynamically partitionable
• 32-bit & 64-bit technologies
• Multilayered caches
• Large cross-bar memories
– IA32: 64 GB (with PAE)
– EM64T, Opteron, Athlon: 256 TB
– Itanium: 1 PB
– Theoretical: multi-PB (processor family dependent)
• Up to 96 PCI slots
– Lower on newer models
• Linux: SLES, RedHat
• Windows: Data Center
[Diagram: CMP block diagram showing CPU, TLC, MSU (8 GB each), DIB, and PCI components]
3. Unisys Open And Secure Integrated Stack
4. PostgreSQL ASC Lab Engagement
• February 2005, Unisys Mission Viejo Facility
– Simon Riggs, Mark Wong (OSDL) participated
– PostgreSQL 8.0.1 measured
– Summary at
http://developer.osdl.org/markw/papers/PostgreSQL%20Visit%20Report%20External%20050513.pdf
• Interesting findings
– Locking protocol and granularity
– Context switch “storms” (~80K/sec)
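A rate like the ~80K/sec figure can be observed on Linux by sampling the cumulative `ctxt` counter in `/proc/stat` twice and dividing the difference by the elapsed time. The slides do not say how the lab measured it, so the following is only a minimal sketch of one way to do so; all function names are hypothetical.

```python
import time


def read_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def switches_per_second(count_before, count_after, elapsed_seconds):
    """Average context switches per second between two counter samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (count_after - count_before) / elapsed_seconds


def sample_rate(interval=1.0, path="/proc/stat"):
    """Sample the live counter twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = read_ctxt(f.read())
    start = time.monotonic()
    time.sleep(interval)
    with open(path) as f:
        after = read_ctxt(f.read())
    return switches_per_second(before, after, time.monotonic() - start)
```

On a Linux host, `sample_rate()` gives a system-wide figure; the `cs` column of `vmstat 1` reports the same quantity.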
5. PostgreSQL 8.1 Enhancements
• A few stand out from a performance perspective
– Improve concurrent access to the shared buffer cache (Tom)
– Allow index scans to use an intermediate in-memory bitmap (Tom)
– Improve performance for partitioned tables (Simon)
• Context switches reduced
– From ~80K/sec to ~30K/sec
• Other effects only seen in overall throughput improvements
• Undertook more complete characterization of 8.1 performance
improvements
6. TPC-C Benchmark
• Transaction Processing Performance Council (TPC)
– Defines suite of benchmarks
– Sole certifier of vendor-submitted benchmark results
– Specs, results, more info: http://www.tpc.org
• TPC-C benchmark is a complete order-entry environment
– Industry standard measure of DBMS OLTP performance across vendors
– Reports transactions per minute (tpmC) and cost metrics
– Mix of 5 transactions
• New Order (selects, updates, inserts)
• Payment (selects, updates, inserts)
• Order Status (selects)
• Stock Level (selects)
• Delivery (selects, updates, deletes), interactive & deferred variants
• Data volumes, number of clients scaled by vendor’s tpmC goal
– See spec for details
• Current (June 2006) result ranges
– Best performance: 3,210,540 tpmC @ $5.07/tpmC
– Lowest performance: 9,347 tpmC
– Best price/performance: 38,622 tpmC @ $0.99/tpmC
7. Test Application
Ah, these are NOT TPC-C numbers!
• DBMS stressor application with well-known characteristics
• TPCC4J = “TPC-C For Java”
– Unisys-developed from TPC-C version 5.3 (April 2004)
– Runs on 1.4.2 JVM (5.0 JVM works)
– Small application footprint (~72K)
– Very few garbage collection cycles
– Client overhead low (7-9% for ~1000 users)
• Deviates from TPC-C in important ways
– Non-audited components and results
– Data entry screens not populated, only SQL statements run
– User think time eliminated
– Reports TPS, not tpmC
• No, you can’t multiply by 60 to get tpmC!
– No checkpoints during benchmark runs
– Not run for a minimum of 8 hours
– Data volumes not scaled
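One reason the "multiply by 60" shortcut fails can be sketched numerically. In the TPC-C specification, tpmC counts only New-Order transactions per minute, and the required mix leaves New-Order at most about 45% of all transactions. If the TPS figure counts all five transaction types (an assumption; the slides do not say how TPCC4J counts), TPS × 60 overstates even the New-Order rate. The function name and the input values below are hypothetical.

```python
def new_orders_per_minute(total_tps, new_order_share):
    """Estimate New-Order transactions per minute from an all-types TPS figure.

    total_tps       -- transactions/sec counting every transaction type (assumed)
    new_order_share -- fraction of transactions that are New-Order (0.0 to 1.0)
    """
    if total_tps < 0:
        raise ValueError("total_tps must be non-negative")
    if not 0.0 <= new_order_share <= 1.0:
        raise ValueError("new_order_share must be between 0 and 1")
    return total_tps * 60.0 * new_order_share


# Hypothetical run: 100 TPS overall with a 45% New-Order share.
naive = 100 * 60                                  # the tempting shortcut
estimate = new_orders_per_minute(100, 0.45)       # New-Order only
print(f"naive: {naive}, New-Order only: {estimate:.0f}")
```

Even the smaller figure is not tpmC: with think time eliminated, no checkpoints, and unscaled data volumes, the workload itself differs from an audited TPC-C run.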
8. Test Configuration
• Client Driver: Unisys ES3040
– 4 x 3.0 GHz Intel Xeon MP CPUs (IA32)
– Hyper-threading enabled
– 4 GB RAM
– 1 x 36 GB boot drive (10K rpm)
– 1 Gbit NIC (copper)
– OS: SLES 9 SP3
• PostgreSQL: Unisys ES7000/540
– 32 x 3.0 GHz Intel Xeon MP CPUs (IA32)
– Hyper-threading disabled
– 16 GB RAM
– 1 x 36 GB boot drive (10K rpm)
– 1 Gbit NIC (copper)
– OS: SLES 9 SP3
• Storage: EMC CX-200 External Disk
– 8 x 36 GB drives (15K rpm)
– RAID 10 configured (striped mirrors)
– Small data volume: 50 TPC-C warehouses, ~5.4 GB (4.7 GB data + 700 MB indexes)
– No I/O issues in benchmarking
11. Performance vs. Scalability
• Performance and Scalability are not the same thing
– But are closely related
– Must be REALLY careful comparing anecdotal reports
• Performance is comparable only on equal configurations
– In this data set, the only equal configuration is 1 CPU
– 8.1.2 exhibits a small (~8%) improvement over 8.0.7
• Scalability measures throughput as resources are added
– 8.1.2 uses incremental resources more effectively than 8.0.7
– At 16 CPUs, 8.1.2 is 5.6 times better than 8.0.7
• Fair to conclude that 8.1.2 scales effectively to 16 CPUs
– At least on this application and platform combination
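The distinction can be made concrete with the usual speedup and efficiency ratios. The sketch below uses hypothetical function names. The only figures taken from the slides are the ~8% gain at 1 CPU and the 5.6x gap at 16 CPUs; the throughput values in the first two calls are made up for illustration.

```python
def speedup(throughput_n, throughput_1):
    """Throughput at N CPUs relative to the same release at 1 CPU."""
    if throughput_1 <= 0:
        raise ValueError("throughput_1 must be positive")
    return throughput_n / throughput_1


def efficiency(throughput_n, throughput_1, n_cpus):
    """Fraction of ideal linear scaling achieved at n_cpus."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return speedup(throughput_n, throughput_1) / n_cpus


def relative_scaling(ratio_at_n, ratio_at_1):
    """Ratio of two releases' speedups at N CPUs.

    ratio_at_n -- new/old throughput ratio at N CPUs
    ratio_at_1 -- new/old throughput ratio at 1 CPU
    """
    if ratio_at_1 <= 0:
        raise ValueError("ratio_at_1 must be positive")
    return ratio_at_n / ratio_at_1


# Hypothetical throughputs: 50 TPS on 1 CPU, 400 TPS on 16 CPUs.
print(speedup(400.0, 50.0))         # 8x the single-CPU throughput
print(efficiency(400.0, 50.0, 16))  # half of ideal linear scaling

# From the slides: 5.6x better at 16 CPUs, ~8% (1.08x) better at 1 CPU.
print(round(relative_scaling(5.6, 1.08), 2))
```

The last call separates the two effects. Of the 5.6x gap at 16 CPUs, only 1.08x is single-CPU performance; the remaining factor of roughly 5.2 is 8.1.2 making better use of the added CPUs.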
12. Hardware Architecture
• Cache invalidations
– Handled quickly when on-cell
• No visible degradation up to 8 CPUs
– Slower across cells
• Cross-cell interactions required, but slower
• Responsible for dip at 12 CPUs
– Processing power unbalanced across cells (8 + 4)
• Recovers by 16 CPUs
– Processing power balance across cells (8 + 8)
– Reduces chance of cache misses
• Cross-cell cache invalidations increase quickly when data structures are poorly aligned with cache line size
[Diagram: ES7000/540 cell with RAM, memory controller, shared cache, and 8 CPUs]
• Every large-scale computing platform, regardless of vendor,
will have some sort of hardware related behavior that needs to
be considered in performance and scalability work
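The alignment point can be illustrated with simple address arithmetic. A structure that straddles a cache-line boundary occupies two lines, so a write to it can invalidate both in other CPUs' caches. Padding a structure to a multiple of the line size keeps neighbouring array elements from sharing a line. The sketch below assumes a 64-byte line, which the slides do not state for this platform, and uses hypothetical function names.

```python
def cache_lines_touched(offset, size, line_size=64):
    """Number of cache lines covered by `size` bytes starting at byte `offset`."""
    if offset < 0 or size <= 0 or line_size <= 0:
        raise ValueError("offset must be >= 0; size and line_size must be > 0")
    first_line = offset // line_size
    last_line = (offset + size - 1) // line_size
    return last_line - first_line + 1


def padded_size(size, line_size=64):
    """Round `size` up to the next multiple of `line_size`."""
    if size <= 0 or line_size <= 0:
        raise ValueError("size and line_size must be > 0")
    return -(-size // line_size) * line_size  # ceiling division


# A 32-byte structure at offset 48 straddles the boundary at byte 64.
print(cache_lines_touched(48, 32))  # 2
# The same structure aligned to the start of a line fits in one.
print(cache_lines_touched(64, 32))  # 1
# Padding a 40-byte array element so each element owns a whole line.
print(padded_size(40))              # 64
```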
13. Software Architecture
• Degradation above 16 CPUs
– Continued decline implies systemic origin
– Characteristic of software algorithms or data structures
reaching inherent scalability limits
• Causes unclear, but something important is wrong
• Possibilities include
– The Usual Suspects: lock mgmt, buffer mgmt, etc.
– Lack of Repeatable Read isolation level
• A TPC-C requirement
• Elevation to Serializable hurts overall throughput
– See To Do List (a later slide)
14. It’s not just TPCC4J!
• Another industry standard, transactional Java benchmark
– ES7000 8x
– 8.0: 34 TPS
– 8.1: 544 TPS
• Different benchmark structure
• Different hardware platform
– No cache issues, CPUs on one cell
• Even more dramatic improvement (16x!)
16. Other Useful Tooling
• VTune
– Intel low-level performance monitor
• http://www.intel.com/cd/software/products/asmo-na/eng/index.htm
– High frequency functions, by clock ticks
• HeapTupleSatisfiesSnapshotData (??)
• LWLockAcquire
• XLogInsert
• PinBuffer
• hash_search
• oprofile
– Results pending
17. To Do List
• In-line some high frequency functions
• Speed up CRC algorithm by rearranging loop structure
• Improve data structure alignment to memory pages and cache
lines
– Possibly dynamically based on processor id and settings
• Re-type byte structures to words for newer Intel processors
• Investigate efficiency of bitmaps on newer Intel processors
• Residual spin-lock implementation issues
• Reduce query plan discards (very adventurous!)
• Improve statistics infrastructure and real-time monitoring
18. Conclusions
• Scalability dramatically improved in 8.1 release
– Scales to 16 CPUs (on test environment)
– Context switches down
– Supported by multiple benchmarks
• Minor single CPU performance improvement
• Hardware architecture can affect scalability
– Important consideration for high-capacity, high-throughput
applications
– Systems of all sizes moving to multi-core processor chips
• Substantial scalability and performance challenges remain
– TPC-C required feature content missing (repeatable read)
– Algorithm scalability limits exist (especially > 16 CPUs)