Showdown: DB2 vs. Oracle Database for OLTP
Conor O’Mahony | Email: [email_address] | Twitter: conor_omahony | Blogs: db2news.wordpress.com, database-diary.com
Agenda What Improves OLTP Performance? Comparing OLTP Performance Comparing Cluster Scalability Comparing Cluster Availability What Users are Saying
Technology for OLTP Performance: Efficient I/O (high-performing data access; the logger is the most critical choke point). Large memory exploitation (multiple buffer pools). Ability to handle lots of users (low memory footprint per user; connection concentrator).
Efficient I/O: Logging is the most critical factor in high-volume OLTP, and TPC-C results show that DB2 has a more efficient logger. DB2 writes less than half as much log as Oracle Database and SQL Server: DB2 9 = 2.4 KB of log per transaction; Oracle 10gR2 = 4.9 KB of log per transaction; SQL Server 2005 = 6.0 KB of log per transaction. Reduced logging = increased performance.
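To get a feel for what those per-transaction log volumes mean for the logging device at high transaction rates, here is a small, purely illustrative Python sketch. Only the KB-per-transaction figures come from the slide; the transaction rate is a hypothetical example, not a benchmark result.

```python
# Illustrative arithmetic only: per-transaction log volumes are the figures
# quoted above; the transaction rate is a hypothetical example.
LOG_KB_PER_TXN = {"DB2 9": 2.4, "Oracle 10gR2": 4.9, "SQL Server 2005": 6.0}

def log_mb_per_second(kb_per_txn: float, txns_per_minute: float) -> float:
    """Convert per-transaction log volume into sustained log-write bandwidth."""
    return kb_per_txn * (txns_per_minute / 60.0) / 1024.0

if __name__ == "__main__":
    rate = 1_000_000  # hypothetical transactions per minute
    for product, kb in LOG_KB_PER_TXN.items():
        print(f"{product:>15}: {log_mb_per_second(kb, rate):5.1f} MB/s of log")
```

At the same transaction rate, the engine that writes the least log per transaction puts the least pressure on the log device, which is the point the slide is making.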
Large Memory and Efficient Memory Usage: The TPC-C benchmark with DB2 on a p5 595 used 2 TB of real memory, with 1.9 TB of buffer pool space allocated by DB2 using multiple buffer pools of different page sizes. DB2 is now threaded: one flat memory address space for all of DB2, no more private memory (all shared), reducing the per-connection footprint by 1 MB. More efficient memory access = better performance.
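The point about multiple buffer pools of different page sizes can be illustrated with a small planning sketch. The pool names and individual sizes below are hypothetical; only the idea of several pools per page size and the roughly 2 TB / 1.9 TB figures come from the slide.

```python
# Hypothetical buffer pool plan: name -> (page size in KB, size in GB).
# Only the ~2 TB budget and ~1.9 TB aggregate reflect the slide's figures.
TB_IN_GB = 1024

buffer_pools = {
    "BP_CUSTOMER_4K": (4, 512),
    "BP_ORDERS_4K":   (4, 384),
    "BP_INDEX_8K":    (8, 512),
    "BP_HISTORY_16K": (16, 338),
    "BP_CATALOG_32K": (32, 200),
}

total_gb = sum(size_gb for _page_kb, size_gb in buffer_pools.values())
budget_gb = 2 * TB_IN_GB
print(f"Aggregate buffer pool space: {total_gb / TB_IN_GB:.2f} TB "
      f"({total_gb / budget_gb:.0%} of a 2 TB memory budget)")
```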
User Scalability: DB2 has a proven ability to support lots of users. It supports two connection methods: each client has a connection process on the server, or server processes are shared amongst incoming requests. DB2 has a lower memory footprint per agent (connection) than other vendors, which enables more client connections for a given amount of memory. Better user scalability = more throughput.
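A back-of-the-envelope sketch of the user-scalability claim: if each connection agent needs less memory, more connections fit in the same budget. The memory budget and absolute footprints below are assumptions for illustration; only the roughly 1 MB-per-connection saving is taken from the earlier slide.

```python
# Hypothetical figures: only the ~1 MB-per-connection saving comes from the deck.
def max_connections(memory_budget_gb: float, footprint_mb_per_conn: float) -> int:
    """How many connection agents fit in a fixed memory budget."""
    return int(memory_budget_gb * 1024 // footprint_mb_per_conn)

budget_gb = 32                                  # memory reserved for connections (example)
process_model_mb = 3.0                          # assumed footprint per connection (process model)
threaded_model_mb = process_model_mb - 1.0      # ~1 MB smaller per connection when threaded

print("process model :", max_connections(budget_gb, process_model_mb), "connections")
print("threaded model:", max_connections(budget_gb, threaded_model_mb), "connections")
```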
Agenda What Improves OLTP Performance? Comparing OLTP Performance Comparing Cluster Scalability Comparing Cluster Availability What Users are Saying
Transactional Performance: Two large-scale standardized performance benchmarks for transactional workloads are often used for comparisons: TPC-C, the industry-standard OLTP benchmark, and SAP SD, which represents an SAP R/3 Sales and Distribution application.
Longevity in TPC-C Performance Results as of April 21, 2008
Apples-to-Apples Comparison: Both results on exactly the same server (8-way 1.9 GHz p5 570), DB2 v8 vs. Oracle 10g. DB2 leads Oracle in performance by 16%. Results current as of Feb 24, 2008; check http://www.tpc.org for the latest results.
Longevity in SAP 3-Tier SD Performance Results as of Jan 8, 2008
SAP SD 3-tier: This benchmark represents a 3-tier SAP R/3 environment, with the database on a separate server. DB2 outperforms Oracle by 68% with half the number of cores: DB2 running on a 32-way p5 595, Oracle running on a 64-way HP.
2-tier SAP SD Benchmarks: DB2 on IBM Power leads Oracle on HP by 18% with half the number of cores. DB2 on IBM Power delivers significantly more SAPS/Watt, saving you money. Results as of April 8, 2008.
Agenda What Improves OLTP Performance? Comparing OLTP Performance Comparing Cluster Scalability Comparing Cluster Availability What Users are Saying
Oracle RAC - Single Instance Wants to Read a Page: A process on Instance 1 wants to read page 501, which is mastered by Instance 2. The system process checks the local buffer cache: page not found. The system process sends an IPC to the Global Cache Service (GCS) process to get page 501, requiring a context switch to schedule GCS on a CPU. GCS copies the request into kernel memory to make a TCP/IP stack call and sends the request to Instance 2; the IP receive call requires interrupt processing on the remote node. The remote node responds via an IP interrupt to GCS on Instance 1, and GCS sends an IPC to the system process (another context switch to process the request). The system process then performs I/O to get the page. [Diagram: Instance 1 and Instance 2 buffer caches, GCS processes on both instances, and page 501, with the request/response steps numbered 1-6.]
What Happens in DB2 pureScale to Read a Page: An agent on Member 1 wants to read page 501. db2agent checks the local buffer pool: page not found. db2agent performs a Read And Register (RaR) RDMA call directly into CF memory; there is no context switching and there are no kernel calls, just a synchronous request to the CF. The CF replies that it does not have the page (again via RDMA), and db2agent reads the page from disk. [Diagram: Member 1 buffer pool with db2agent, the PowerHA pureScale CF with its group buffer pool, and page 501, with the steps numbered 1-4.]
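The difference between the two page-request paths just described can be seen in a toy latency model. All of the per-step costs below are invented placeholders rather than measurements; only the sequence of steps follows the two slides above.

```python
# Toy model, not a measurement: invented step costs in microseconds.
RAC_STYLE_STEPS = [                      # multi-hop path through GCS and the IP stack
    ("check local buffer cache",              2),
    ("IPC to local GCS + context switch",    15),
    ("TCP/IP send to mastering instance",    60),
    ("interrupt + GCS work on remote node",  20),
    ("TCP/IP reply + interrupt",             60),
    ("IPC back to server process",           15),
]
PURESCALE_STYLE_STEPS = [                # single synchronous RDMA round trip to the CF
    ("check local buffer pool",               2),
    ("RaR RDMA round trip to CF",            15),
]

def total_us(steps):
    return sum(cost for _name, cost in steps)

print(f"RAC-style request (toy model):       {total_us(RAC_STYLE_STEPS)} us before any disk I/O")
print(f"pureScale-style request (toy model): {total_us(PURESCALE_STYLE_STEPS)} us before any disk I/O")
```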
The Advantage of DB2 Read and Register with RDMA: The DB2 agent on Member 1 writes directly into CF memory with the page number it wants to read and the buffer pool slot that it wants the page to go into. The CF responds by writing directly into memory on Member 1, either indicating that it does not have the page or delivering the requested page of data. Total end-to-end time for RaR is measured in microseconds. The calls are so fast that the agent may even stay on the CPU for the response. This approach is much more scalable and does not require locality of data. [Diagram: db2agent on Member 1 issues a direct remote memory write with the request “I want page 501. Put it into slot 42 of my buffer pool.”; a CF thread answers with a direct remote memory write of the response “I don’t have it, get it from disk”; sample rows of a data page are shown.]
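Schematically, the Read-and-Register exchange carries very little: the request names a page and a destination buffer pool slot, and the response is either a miss or the page itself. The sketch below is an illustration of that shape only; it is not DB2's wire format or API.

```python
# Schematic only: not DB2's actual wire format or API.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RaRRequest:
    page_number: int       # e.g. page 501
    buffer_pool_slot: int  # e.g. slot 42 in the requesting member's buffer pool

@dataclass
class RaRResponse:
    found_in_group_buffer_pool: bool
    page_data: Optional[bytes] = None  # present only on a hit

def handle_rar(req: RaRRequest, group_buffer_pool: Dict[int, bytes]) -> RaRResponse:
    """Sketch of the CF side: answer from the group buffer pool, or report a miss."""
    data = group_buffer_pool.get(req.page_number)
    return RaRResponse(found_in_group_buffer_pool=data is not None, page_data=data)

# A miss tells the member to read page 501 from disk itself.
print(handle_rar(RaRRequest(page_number=501, buffer_pool_slot=42), group_buffer_pool={}))
```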
Transparent Application Scalability: scalability without application or database partitioning. Centralized locking and a real global buffer pool with RDMA access deliver real scaling without making the application cluster-aware. Sharing of data pages happens via RDMA from a true shared cache (not synchronized access via process interrupts between servers). There is no need to partition the application or the data for scalability, resulting in lower administration and application development costs. Distributed locking in RAC, by contrast, results in higher overhead and lower scalability. Oracle RAC best practices recommend fewer rows per page (to avoid hot pages), partitioning the database to avoid hot pages, and partitioning the application to get some level of scalability; all of these result in higher management and development costs.
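The application-side consequence can be illustrated with a routing sketch. With partition-oriented tuning (the kind of advice listed above for RAC), the application has to know how data is spread across nodes; with a shared-cache design, any member can serve any request. Node names and keys below are hypothetical.

```python
# Illustrative only: node names and keys are invented.
from itertools import cycle

NODES = ["node1", "node2", "node3"]
_round_robin = cycle(NODES)

def route_partition_aware(customer_id: int) -> str:
    # The application hard-codes knowledge of the data partitioning so that a
    # given customer's hot pages are always touched by the same node.
    return NODES[customer_id % len(NODES)]

def route_any_member(_customer_id: int) -> str:
    # Shared-cache model: any member can serve the request, so simple
    # round-robin (or driver-side workload balancing) is enough.
    return next(_round_robin)

for cid in (101, 102, 103):
    print(cid, "->", route_partition_aware(cid), "|", route_any_member(cid))
```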
Scalability for OLTP Applications: 2, 4, and 8 members: over 95% scalability. 16 members: over 95% scalability. 32 members: over 95% scalability. 64 members: 95% scalability. 88 members: 90% scalability. 112 members: 89% scalability. 128 members: 84% scalability.
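Reading those percentages as efficiency relative to a single member, the effective throughput of the cluster is roughly members × per-member throughput × efficiency. The sketch below uses the slide's figures (treating "over 95%" as 0.95) and an arbitrary per-member throughput unit.

```python
# Efficiencies are the slide's figures; "over 95%" is treated as 0.95 here.
efficiency = {2: 0.95, 4: 0.95, 8: 0.95, 16: 0.95, 32: 0.95,
              64: 0.95, 88: 0.90, 112: 0.89, 128: 0.84}

single_member_throughput = 1.0  # arbitrary unit
for members, eff in efficiency.items():
    effective = members * single_member_throughput * eff
    print(f"{members:3d} members: ~{effective:6.1f}x single-member throughput ({eff:.0%} efficiency)")
```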
Agenda What Improves OLTP Performance? Comparing OLTP Performance Comparing Cluster Scalability Comparing Cluster Availability What Users are Saying
Online Recovery: The DB2 pureScale design point is to maximize availability during failure recovery processing. When a database member fails, only in-flight data remains locked until member recovery completes (in-flight = data being updated on the failed member at the time it failed). Target time to row availability: under 20 seconds. [Diagram: DB2 members with their logs, two CFs, and shared data; a chart of % of data available vs. time (~seconds) showing that only data with in-flight updates is locked during recovery after a database member failure.]
Steps Involved in DB2 pureScale Member Failure: 1) Failure detection. 2) The recovery process pulls directly from the CF the pages that need to be fixed and the location of the log file to start recovery from. 3) A restart light instance performs redo and undo recovery.
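A runnable sketch of that recovery sequence is shown below. All of the classes and field names are invented for illustration and are not DB2 APIs; the point it captures is that the pages to fix and the log starting position come straight from the CF, so no log scan is needed to find them.

```python
# Illustrative sketch only: these classes and fields are not DB2 APIs.
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class ToyCF:
    """Toy caching facility: tracks in-flight pages, log positions, and locks per member."""
    inflight_pages: Dict[str, Set[int]] = field(default_factory=dict)
    log_position: Dict[str, int] = field(default_factory=dict)
    locks: Dict[str, Set[int]] = field(default_factory=dict)

    def pages_needing_recovery(self, member: str) -> Set[int]:
        return self.inflight_pages.get(member, set())

    def recovery_log_start(self, member: str) -> int:
        return self.log_position.get(member, 0)

    def release_locks(self, member: str) -> None:
        self.locks.pop(member, None)

def recover_failed_member(cf: ToyCF, member: str, log_records: List[Tuple[int, int]]) -> List[int]:
    """log_records are (lsn, page_id) pairs from the failed member's log."""
    pages_to_fix = cf.pages_needing_recovery(member)   # pulled directly from the CF
    start_lsn = cf.recovery_log_start(member)          # also pulled from the CF
    redone = [page for lsn, page in log_records if lsn >= start_lsn and page in pages_to_fix]
    cf.release_locks(member)                            # recovered rows become available again
    return redone

cf = ToyCF(inflight_pages={"member1": {501, 502}},
           log_position={"member1": 900},
           locks={"member1": {501, 502}})
print("pages redone:", recover_failed_member(cf, "member1", [(880, 7), (910, 501), (920, 502)]))
```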
Failure Detection for a Failed Member: DB2 has a watchdog process to monitor itself for software failure. The watchdog is signaled any time the DB2 member is dying, and it interrupts the cluster manager to tell it to start recovery; software failure detection times are a fraction of a second. The DB2 cluster manager performs very low-level, sub-second heartbeating (with negligible impact on resource utilization) and performs other checks to distinguish congestion from failure. The result is hardware failure detection in under 3 seconds without false failovers.
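The heartbeating side of this can be pictured with a minimal monitor sketch: sub-second heartbeats, and a member is declared failed only after roughly 3 seconds of silence. The intervals mirror the slide; the code itself is purely illustrative and is not how DB2 cluster services are implemented.

```python
# Illustrative heartbeat monitor; intervals mirror the slide, implementation is invented.
import time
from typing import Dict, List

HEARTBEAT_INTERVAL_S = 0.5   # a sender would call heartbeat() roughly this often
FAILURE_TIMEOUT_S = 3.0      # declare a hardware failure after ~3 s of silence

class HeartbeatMonitor:
    def __init__(self) -> None:
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, member: str) -> None:
        self.last_seen[member] = time.monotonic()

    def failed_members(self) -> List[str]:
        now = time.monotonic()
        return [m for m, t in self.last_seen.items() if now - t > FAILURE_TIMEOUT_S]

monitor = HeartbeatMonitor()
monitor.heartbeat("member1")
monitor.heartbeat("member2")
# If member2 stops sending heartbeats, it appears in failed_members() after ~3 s.
print("failed so far:", monitor.failed_members())
```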
Member Failure Summary: DB2 Cluster Services automatically detects the member’s death, informs the other members and the CFs, and initiates an automated member restart on the same or a remote host. Member restart is like crash recovery in a single system, but much faster: redo is limited to in-flight transactions and benefits from the page cache in the CF. Clients are transparently re-routed to healthy members. The other members are fully available at all times (“Online Failover”): the CF holds the update locks held by the failed member, and the other members can continue to read and update data not locked by the failed member. When member restart completes, the locks are released and all data is fully available. [Diagram: a single database view with clients, cluster services (CS), DB2 members writing log records and pages, shared data, and primary/secondary CFs holding updated pages and global locks; the failure is induced with kill -9.]
Steps involved in a RAC node failure: node failure detection; data block remastering; locking of pages that need recovery; redo and undo recovery. Unlike DB2 pureScale, Oracle RAC does not centralize locks or the data cache.
With RAC – Access to the GRD and Disks is Frozen: Global Resource Directory (GRD) redistribution takes place, with no lock updates and no more I/O until the pages that need recovery are locked. [Diagram: Instance 1 fails; I/O requests are frozen on Instances 1-3 while the GRD is redistributed.]
With RAC – Pages that Need Recovery are Locked: The recovery instance reads the log of the failed node and locks the pages that need recovery. It must read the log and lock these pages before the freeze is lifted. [Diagram: Instance 1 fails; I/O requests remain frozen on Instances 1-3 while the recovery instance scans the failed node's redo log and locks affected pages.]
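A schematic of those two RAC recovery steps, purely for illustration (the data structures and page numbers are invented): mastership of the failed instance's resources is reassigned to the survivors, and a first pass over the failed node's redo log locks the pages that may need recovery; only then can the freeze be lifted.

```python
# Illustrative only: invented data structures and page numbers.
from typing import Dict, List, Set

def remaster_grd(resources_of_failed: List[int], survivors: List[str]) -> Dict[int, str]:
    """Reassign mastership of the orphaned blocks across the surviving instances."""
    return {block: survivors[i % len(survivors)] for i, block in enumerate(resources_of_failed)}

def first_pass_lock(redo_log_of_failed: List[int]) -> Set[int]:
    """Scan the failed node's redo log and lock every page that may need recovery."""
    return set(redo_log_of_failed)

# I/O and new lock requests stay frozen until BOTH steps have completed.
new_masters = remaster_grd(resources_of_failed=[501, 502, 777], survivors=["inst2", "inst3"])
locked_pages = first_pass_lock(redo_log_of_failed=[501, 502])
print("new masters:", new_masters)
print("pages locked before the freeze is lifted:", locked_pages)
```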
DB2 pureScale – No Freeze at All: There is no I/O freeze. The CF always knows which changes are in flight, so it knows which rows on which pages had in-flight updates at the time of the failure. [Diagram: Member 1 fails; Members 2 and 3 continue with no I/O freeze; the CF acts as the central lock manager.]
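For contrast, the pureScale side of the same situation can be sketched in a few lines: because the CF's central lock manager already tracks which pages have in-flight updates for each member, there is nothing to remaster or scan before the other members carry on. Names and values below are illustrative only.

```python
# Illustrative only: the CF already holds the in-flight page set per member.
from typing import Dict, Set

cf_inflight_pages: Dict[str, Set[int]] = {
    "member1": {501, 502},
    "member2": set(),
    "member3": {903},
}

def pages_locked_after_failure(failed_member: str) -> Set[int]:
    # No GRD remastering and no log scan: the answer is already in CF memory.
    return cf_inflight_pages[failed_member]

print("locked during member1 recovery:", pages_locked_after_failure("member1"))
print("all other data stays readable and updatable on the surviving members")
```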
Agenda What Improves OLTP Performance? Comparing OLTP Performance Comparing Cluster Scalability Comparing Cluster Availability What Users are Saying
Sample of Feedback… “We compared DB2 with Oracle and found that DB2 was more reliable and has a better price-performance ratio.” ― Dr. Yuki Kitaoka, Kyoto National Hospital. “We chose DB2 over Oracle and Sybase for its proven performance, availability and cost-effectiveness.” ― Atakan Karaman, Anadolu Group. “DB2 Universal Database is the bank’s database of choice for its high availability, scalability and performance.” ― Patrick Stearn, Bayerische Landesbank. “In comparison tests with both Oracle and Microsoft, DB2 continually demonstrated a better price-to-performance ratio -- the quality of DB2 is quite astonishing.” ― Benjamin Simmen, Zurich Financial Services. “When making our decision, we discovered that a $250,000 DB2/Linux solution performed better than a $1.5 million Oracle/Solaris solution.” ― Carl Ansley, CTO, Clarity Payment Solutions, Inc.
Thank You! For more information, see  www.ibm.com/db2 Or contact me at: Conor O’Mahony Email: [email_address] Twitter: conor_omahony Blog: database-diary.com


Editor's Notes

  • #4 What are the most important features in an RDBMS for OLTP transactions? In order to deliver very high throughput levels for transactional systems, the RDBMS must be able to efficiently perform I/O operations without holding up the transaction. It must be able to utilize memory more efficiently and must also be able to effectively handle large numbers of users. These three critical areas enable a database server to deliver very high levels of performance. We will explore each of these three areas in detail on the following 3 slides.
  • #5 In very high volume transactional systems the logger can quickly become the bottleneck. DB2 (and other RDBMSs) can do most of the work a transaction requires completely in memory (updates occur in the buffer pool in memory, authorizations and access plans are all cached in memory, etc.). However, there is one thing that cannot happen in memory. Whenever a transaction performs a COMMIT, the information that tells the RDBMS how to redo that transaction (i.e. the log information) must be flushed to disk. If the committed transaction is not recorded in the log files on disk then it would be possible to lose committed transactions. Therefore all database servers write records to the log files whenever a user commits a transaction. In order to achieve very high concurrency and high throughput, it is essential that the logger be as efficient as possible, since these I/Os can quickly become the bottleneck (disk access is significantly slower than memory access). There is a very strong proof point that demonstrates DB2 has a much more efficient logger than our competitors. TPC-C is an industry standard transaction processing benchmark that all major database vendors participate in. Each vendor runs their own benchmarks to try to demonstrate they have the best RDBMS. DB2 comes out on top more often than any other database vendor (but more on that later). One interesting thing to note about TPC-C is that one of the requirements of the benchmark is that you also publish how much log space you consume during the benchmark run. By comparing the log space consumed (and knowing that this standard benchmark requires every vendor to run the exact same transactions over and over again) we can compare the efficiency of the three database vendors’ loggers. The most current TPC-C results (as of March 18, 2008) are shown on this chart. You can see that for each transaction (standard TPC-C transaction) DB2 produced 2.4 KB of log. Oracle’s top result, with 10gR2 (there was no 11g top result as of 3/18/2008), consumes twice as much log space, meaning that the DB2 logger is twice as efficient and can therefore deliver higher levels of throughput. Oracle ran a TPC-C benchmark with RAC and consumed 20x more log space than DB2; ask Oracle why RAC consumes so much log space for the same transactions! Microsoft’s SQL Server 2005 result was even worse than Oracle’s, consuming more than 2.5x that of DB2. These benchmarks are the most highly tuned database systems available (tuned by each vendor’s own benchmark experts). This reduction in logging is one of the reasons why DB2 delivers better OLTP performance compared to Oracle and Microsoft.
  • #6 Efficient use of memory is also critical for high volume transaction processing. Given a limited amount of physical memory, you want your database to utilize it to the fullest in order to improve the throughput of your system. DB2 has two unique advantages over Oracle in this area. The first is that DB2 allows for multiple buffer pools. In Oracle you can have only one buffer pool per page size (i.e. 1 4KB pool, 1 8KB pool, 1 16KB pool and 1 32KB pool). This can severely limit your ability to effectively utilize the memory on the server to tune the system for optimal performance. DB2, however, allows as many buffer pools of any page size as you like. For example, you can have 4 buffer pools of 4KB and another 5 buffer pools of 8KB, etc. You can choose the buffer pool configuration that best suits your transaction processing needs. As an example, on a server with 2TB of real memory in a TPC-C benchmark, DB2 allocated several buffer pools of different page sizes whose total size in aggregate was 1.9TB. Now, with the new threaded engine in DB2 9.5, there is even more advantage over Oracle. By using threads rather than processes for user connections, the amount of memory consumed per connection is significantly lower. This allows more user connections for a given amount of memory and leaves more memory available to other areas of DB2 (like the buffer pool). This better memory utilization again results in higher throughput and better performance. Later in this presentation we will talk about the Self Tuning Memory Manager (STMM), which will show that not only does DB2 better exploit memory for higher performance, but it does so with less administration required.
  • #7 The final area that is critical to high transaction performance is the ability to support large numbers of concurrent users. Both DB2 and Oracle have the ability to do connection concentration to reduce memory requirements on the server. However, only DB2 has the threaded engine mentioned on the previous slide. This enables DB2 to scale higher than Oracle on the same server with the same amount of memory and therefore deliver higher throughput.
  • #9 There are several transaction processing benchmarks that demonstrate DB2’s performance leadership over Oracle. The first is TPC-C, which is an industry standard transaction processing benchmark. SAP Sales and Distribution (SD) is also a widely used performance benchmark which simulates real world SAP transactions. The third transaction processing benchmark is called SPECjAppServer, which measures the performance of a web-based Java application on the database system. We will discuss each of these benchmarks on the following slides.
  • #10 Benchmarks are often a leapfrog game where, on any given day, one database vendor can be in front of the rest if they run on some newly announced hardware or the latest software versions. This chart represents days of leadership for TPC-C from Jan 1, 2003 through April 21, 2008. It measures how long each of the vendors has held the top spot in TPC-C. Over this 5-year period, DB2 has been in a leadership position almost 2x longer than Oracle and, in fact, has led longer than all other database vendors combined.
  • #11 It is not very often that you get an apples-to-apples comparison where two database vendors run their benchmarks on the exact same hardware. This result is slightly dated (using DB2 v8 against Oracle 10g); however, it shows that on exactly the same hardware, DB2 delivered 16% better performance than Oracle. In fact, you would need 10 CPUs of Oracle to match the performance of 8 CPUs of DB2 on this class of server.
  • #12 In fact, Oracle has rarely been able to challenge DB2 over the past 5 years on the SAP SD 3-tier benchmark. This chart represents days of leadership for SAP SD 3-tier since Jan 1, 2003. As you can see, over the last 5 years DB2 has held the lead 8 times longer than Oracle (the only other competitor to lead in this timeframe).
  • #13 This result shows the top SAP SD 3-tier benchmark results as of March 18, 2008. SAP Sales and Distribution 3-tier represents a configuration where the database software is running on its own server hardware and there are several SAP application servers in the middle tier. This is the configuration that most enterprise customers would run their SAP workloads on, and DB2 has demonstrated clear performance leadership in this area.
  • #14 On the SAP SD 2-tier benchmark DB2 leads Oracle by 18% using half the number of processor cores. On April 8, 2008 DB2 9.5 running on a 64 core IBM Power 595 with AIX 6.1 delivered 35,400 SD Users. Oracle’s top result is 30,000 SD users with 10g running on a 128 core HP Integrity Superdome with HP-UX.
  • #16 A server process that wants to access a data page, for example page 501, will first check to see if that page is in its local buffer pool (step 1). If this page is not found, the server process will send an inter-process communication (IPC) request to a GCS process in order to ask the master node for that data page (step 2). This results in the server process yielding the CPU and the CPU performing a context switch to potentially re-establish the GCS process on the CPU to process the interrupt. High levels of context switching can be very costly to perform. The GCS process then sends an IP request to the master node for the data block being requested (step 3). Because IP calls are processed in the operating system kernel, the GCS process has to copy the requested information into kernel memory and then execute expensive IP stack calls to push the request to the remote node. Even if an InfiniBand network is being used, Oracle still uses IP over InfiniBand or in some cases Reliable Datagram Sockets (RDS). Use of a socket protocol even over InfiniBand is costly due to processor interrupts, IP stack traversal, etc. Next, the remote master GCS process will receive an interrupt and will be scheduled on the CPU to process the request. It will check to see if any other members have the page in their buffer caches. In this example, no member has the page, so the GCS process will send an IP message back to the requester telling it to read the page from disk (step 4). The GCS process on the requesting node will be interrupted again to process the incoming IP request, and will in turn send an IPC interrupt (step 5) to the server process to inform it that no other node in the cluster has the page. The server process will then read the page from disk into its own buffer cache (step 6).
  • #17 This slide illustrates the advantage of DB2 pureScale for very efficient access to data. A comparison to Oracle RAC will follow in the next section. The steps listed above show you how DB2 pureScale communicates with the CF to declare its intent to access a data page. Steps 2 and 3 are the critical success factors to DB2 pureScale efficiency. That is, when there is a need to communicate with the centralized CF, that communication uses RDMA. Essentially, the process on member 1 writes directly into the memory of the CF with its request. This is done without going through the IP socket stack, without context switching and in many cases without having to yield the CPU (the round trip communication time between the two servers can be as little as 15 microseconds).
  • #20 To dive deeper into the “secret sauce”, let’s look at exactly how the member communicates with the CF. If an agent on Member 1 wants to read a page, that agent will write directly into the memory of the CF, telling the CF exactly what page it wants and even which slot in its buffer pool on Member 1 the page should go into. If the CF does not have the page, it writes a message right into the memory of Member 1 to indicate that it doesn’t have the page. If the CF does have the page, it writes the data page directly into memory on Member 1 without any context switching or IP stack calls.
  • #21 As previously mentioned, the critical success factor for scalability in an active-active cluster is to ensure that when a transaction requests a piece of data, it can get that data with the lowest amount of latency. With DB2 pureScale, by centralizing data that is of interest to more than one member in the cluster, and by accessing that data using RDMA in an interrupt free processing environment, you can see near linear scalability even out to dozens of nodes. More importantly, you do not need to design your application to be cluster aware. There is no need to route transactions that access the same data pages to a single node. In practice, this is not the case with Oracle RAC. There are many stories on the internet, and in published books on Oracle RAC that tell customers to avoid hot pages being passed between nodes by using one of the methods described in the last 4 bullets of the above slide. These methods require costly DBA and application developer interventions, as well as potential application rework as the size of the cluster changes.
  • #22 To demonstrate the scalability of DB2 pureScale, the lab set up a configuration composed of 128 members (note that for server consolidation environments it is possible to put multiple members on an SMP server). A workload was created where the read-to-write ratio is typically 90:10. As well, to prove the scalability of the architecture, the application has no cluster awareness. In fact, the application updates or selects a random row, and therefore every row in the database will be touched by all members in the cluster (we did this to show that locality of data is not as essential for scaling as with other shared disk architectures). The results of this 128-member test show that there is near-linear scaling even out to 128 members in the cluster. Up to 64 members in the cluster, the scalability (compared to the 1-member result) is still above 95%, and at 128 members the scalability was 84%. Note that this is a validation of the architecture and includes some capabilities under development that will not be in the December GA code.
  • #24 The second key feature of DB2 pureScale is the high availability it provides. Again, the secret to its success is the centralized locking and caching. When one member fails, all other members in the cluster can continue to process transactions. The only data that is unavailable is the actual pages that were being updated in flight when the member failed. And if those pages are hot, then they will be in CF memory, which means the recovery of pages needed by other members will be very fast.
  • #25 At a high level, there are 3 things that occur during an instance failure: failure detection; pulling the pages that need to be fixed directly from CF memory; and fixing those pages. In each of these steps, DB2 pureScale has been optimized with the goal of getting these pages fixed and accessible again in under 20 seconds (all the while, the rest of the data in the database is completely available).
  • #26 Failure detection was a large part of the investment that went into DB2 pureScale. Software failure in a DB2 pureScale environment has been architected to be caught in a fraction of a second and to begin driving recovery processing within that second. Hardware failure is a more difficult challenge, but thanks to some innovative techniques, DB2 pureScale has built in a set of algorithms that can detect node failures in as little as 3 seconds without false failovers. When we talk about having the rows available again within 20 seconds of a failure, we mean from the time the failure occurred, not the time the failure was detected. Other vendors may exclude this time to give better numbers, but from an end user’s perspective this time is critical, and so we include it.
  • #27 Here is a detailed walk-through of what happens when a node fails. Run this in slide show mode to see the steps. Note that we call this process “Online Failover” because transactions on other nodes are not impacted in any way (which is different from Oracle RAC, as you will see in the following slides). As well, the data that needs to be fixed will primarily be in memory on the CF, so recovery will be working at memory speeds. In the event of a hardware failure, we take the additional step of automatically fencing off storage access from the failed member to prevent split-brain issues.
  • #28 In Oracle there is a similar set of steps to recover from a failed instance; however, the middle two are where things are very different when compared to DB2 pureScale: node failure detection; global lock remastering; locking pages that need recovery; fixing those pages. The biggest difference is that with DB2 pureScale there is centralized locking, so there is no need to remaster global locks. Also, pureScale does not need to find the pages to lock (it is already aware of the pages that need to be fixed).
  • #29 In Oracle RAC, each data page (called a data block in Oracle) is mastered by one of the instances in the cluster. Oracle employs a distributed locking mechanism and therefore each instance in the cluster is responsible for managing and granting lock requests for the pages that it masters. In the event of a node failure, the data pages for the failed node become momentarily orphaned while RAC goes through a lock redistribution process to assign new ownership of these orphaned pages to the surviving nodes in the cluster. This is called Global Resource Directory (GRD) reconfiguration and while this is occurring, any request to read a page, as well as any request to lock a page, is momentarily frozen. Applications can continue to process on the surviving nodes, however, during this time, they cannot perform any I/O operations or request any new locks. This results in many applications experiencing a freeze as shown in this slide.
  • #30 The second step in the Oracle RAC node recovery process is to lock all the data pages that need recovery. This must be done before the GRD freeze described earlier is released. If an instance was allowed to read a page from disk before the appropriate page locks were acquired, the update from the failed instance could be lost. The recovery instance performs a first pass read of the redo log file from the failed instance and locks any pages that need recovery as shown in Figure 2. This may require a significant number of random I/O operations as the log file, and potentially the pages that need recovery, may not be in the memory of any of the surviving nodes. The GRD freeze is lifted and the stalled applications can continue processing only after all these I/O operations are performed by the recovery instance and the appropriate pages are locked. Depending on the amount of work that the failed node was doing at the time of the failure, this process can take from tens of seconds up to as much as a minute before it completes. This GRD freeze and the fact that I/O operations cannot be performed during this period or new lock requests granted, is documented in several published books on Oracle RAC.
  • #31 In comparison, DB2 pureScale environments require no global freeze in the cluster. The CF is aware at all times which pages would need recovery should any member fail. If a member fails, all other members in the cluster can continue to run transactions and perform I/O operations. Only requests to access pages that need recovery will be blocked while the recovery process cleans up from the failed member as shown on this slide (and the process is likely to happen from memory).