OOW2007: Practical Performance Management for Oracle RAC


  • This graphic focuses on the interconnect fabric. It is a network like any other, but for a single, dedicated (private) purpose: cluster communication. People can get creative with their network design, using VLANs, various switches, and different topologies. Latency is the key factor to minimize, along with the reliability of the switches used for cluster communication.
  • OCSSD/OPROCD run in real time (RT). The total number of RT processes should be less than the number of CPUs.
  • From the point of view of process architecture, one or more block server processes, called LMS, handle the bulk of the message traffic. The LMS processes are Oracle background processes. When a shadow process makes a request for data, it sends a message directly to an LMS process on another node, which in turn returns either the data or a grant (permission to read from disk or to write to the data block) directly to the requester. The state objects used for globally cached data are maintained in the SGA and are accessed by all processes in an instance which need to maintain and manipulate global data consistently. LMSn runs in RT by default since 10gR2; predictable scheduling is needed for predictable runtime cache fusion and broadcast-on-commit performance. VKTM is a new fatal background process. It keeps updating a timer variable in the SGA, which reduces the CPU overhead for getting timing information considerably. VKTM needs to run in RT for correctness.
  • The Global Cache Service (GCS) manages data cached in the buffer caches of all instances which are part of the database cluster. In conjunction with an IPC transport layer, it initiates and handles the memory transfers for write access (CURRENT) or read access (CR) for all block types (e.g. data, index, undo, headers), the globally managed access permissions to cached data, and the global state of the block. The GCS can determine if and where a data block is cached and forwards data requests to the appropriate instances. It minimizes the access time to data, as the response time on a private network is faster than a read from disk. The message protocol scales and will involve at most 3 hops in a cluster of more than 2 nodes. In fact, the total number of messages is determined by the probability of finding the global state information for a data block on the local node or a remote node, and by whether the data is cached in the instance which also masters the global state for the data. Oracle RAC attempts to colocate buffered data and their global state as much as possible to minimize the impact of the message cost. Cache Fusion and the GCS constitute the infrastructure which allows the scale-out of a database tier by adding commodity servers.
  • In the simplest case, if the data is not in the local buffer cache but in the buffer cache of another instance, a data request involves a message to the instance where the data block is cached. The request message is usually small, approx. 200 bytes in size. The requesting shadow process initiates the send and then waits until the response arrives. The message is sent to an LMS process on a remote instance. The LMS process receives the message, executes a handler which processes it, and eventually sends either the data block or a grant message. The minimum roundtrip time involving an 8K data block is about 400 microseconds, so the pure wire time consumes only an insignificant portion of the total time. It should also be clear that the key factors for performance are the times it takes to send, receive and process the data, which makes the responsiveness of LMS under load a critical factor.
  • If the data is not cached in any of the instances and is on disk, a grant from the master may be required. The master can be thought of as the directory node for a block or an object. The global state of the resource (data block or object), i.e. whether it is cached or on disk, which instances have the blocks cached, and whether the blocks can be shared immediately or have modifications pending, is completely known at the master. When data is on disk, B - (B / N) messages may be required (where B = number of disk reads and N = number of nodes), as the resource masters are distributed over all instances in the cluster (i.e. each instance can be master for a particular data block or object).
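  • As a worked example of the formula above: with B = 10,000 buffered disk reads per second on an N = 4 node cluster, about 10,000 - 10,000/4 = 7,500 grant messages per second are needed, since roughly one in four blocks is mastered on the reading node itself and needs no message.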
  • In 11g, the case where data is on disk is optimized if the table/index/partition is found to be accessed mostly by reads. The read-mostly protocol detects the access pattern and marks the object as read-mostly. For read-only or read-mostly accesses, no messages are required. Many read-intensive applications, or parts of applications, in the OLTP, DW or Business Analytics space will be able to take advantage of this. The performance gain will depend on how frequently modifications to the read-mostly data are required, but the CPU saving can be very significant.
  • As shown above, the minimum roundtrip time involving an 8K data block is about 400 microseconds, and the pure wire time is only an insignificant portion of it. The actual cost is determined by: message propagation delay, IPC and CPU cost, operating system scheduling, block server process load, and interconnect stability.
  • It should be stressed that these are the minimum roundtrip latencies, measured at low to medium load (50% CPU utilization). The processing cost is affected by several factors, as just explained; hot database blocks may incur an extra processing cost in user space. Most average values presented in AWR reports come from a large distribution with some amount of variance, i.e. higher values can skew the average, and one often sees 1 or 2 ms average latency although the majority of accesses complete in less than 1 ms. The main purpose of this table is to serve as a reference for expected values. In 11g, latency probes for small and large messages allow you to correlate system load and average access time at run time. The results of the latency probes are stored in the AWR repository and can therefore be accessed and regressed easily.
  • The private network is important for performance and stability. Its bandwidth must be kept exclusive to keep variation low. Dual-ported or multiple NICs are good to have for failover, but rarely needed for performance in OLTP systems, as the utilized bandwidth is usually lower than the total capacity of a GbE link. For DSS and DW environments, it is very likely that the bandwidth of a single GbE NIC is not sufficient, so other options such as NIC bonding, IB or 10GbE should be considered. It is difficult to predict the actual interconnect requirements without historical data, so planning should include a large tolerance. For data shipping in OLTP and DSS, larger MTUs are more efficient, because they reduce interrupt load, save CPU, and avoid fragmentation and therefore the probability of “losing blocks” if a fragment is dropped due to congestion control, buffer overflows in switches, or similar incidents in the IPC and network layers. Jumbo frames need to be supported by drivers, NICs and switches, and usually require a certain amount of additional configuration.
  • In most known OLTP configurations to date, the bandwidth of 1 GbE is sufficient. The actual utilization depends on the size of the cluster nodes in terms of CPU power, the number of nodes accessing the same data, and the size of the working set of the application. Most applications have good cache locality, and interconnect requirements do not increase when scaling the application out by adding cluster nodes and distributing the work over more instances or adding additional load. For small working sets which fit into a small percentage of the available global buffer cache, the interconnect traffic may increase while the set remains constant. The actual utilization is difficult to predict, but in most OLTP cases it is no reason for concern when it comes to providing adequate bandwidth: typical utilizations are usually much lower than the total available network capacity of 1 GbE. As a rule of thumb, a total disk IO rate of 10,000 IOs/sec in a cluster with 4 nodes will require about 7.5 MB/sec of network bandwidth, given that the IOs read data into the buffer cache and are not direct reads (for a read-mostly workload, it will be only a small fraction of that as long as the read-mostly state is active). Direct reads and read-mostly accesses (11g) do not require any messages for global cache synchronization. For DSS queries which use inter-instance communication between slaves, the size of the data sets and the distribution of work between query slaves suggest using multiple GbE NICs, 10GbE or IB; the rule of thumb here is that it is good design practice to provide for more bandwidth than 1 GbE. For OLTP, a general rule is that if the number of CPUs in a cluster node exceeds 16-20, multiple NICs may be required to provide sufficient bandwidth.
  • It is recommended to check and test the network infrastructure and protocol stack configuration thoroughly before committing a system to production: specifically, socket buffer sizes, NIC data buffer and queue length sizes, negotiated bit rate and duplex mode for NICs and switch ports, and flow control settings. For jumbo frames, consult the hardware vendor for the optimal settings, because the NIC and driver resources may have to be increased. In some cases, network interrupts are handled by a dedicated CPU; if that CPU becomes 100% busy, performance will suffer and the IPC will not scale. Make sure this does not become a bottleneck; interrupt handling can be spread over more CPUs. While the cluster verification utility automates some of these checks, it is advisable to thoroughly test the hardware and OS configuration with non-Oracle tools, such as netperf, iperf and other publicly available software.
  • As seen earlier, many of the cycles for block access are actually spent in the OS, on process wakeup and scheduling as well as network stack processing. The LMS or block server processes are a crucial component: they should always be scheduled immediately when they need to run. On a very busy system with many concurrent processes, the system load may affect how predictably LMS can be scheduled. The default number of LMS processes is based on the number of available CPUs, and the goal is to minimize their number to keep individual LMS processes busy; fewer LMS processes have the additional advantage of allowing better message aggregation and therefore more CPU-efficient processing. The default is computed as MIN(MAX(1/4 * cpu_count, 2), 10), i.e. one quarter of the number of CPUs, but no more than 10 and no fewer than 2 (on a single-CPU system it is 1). So with fewer than 8 CPUs (or cores) per node, you still get a minimum of 2 LMS processes. You can use the gcs_server_processes parameter to change the number of LMS processes, as sketched below. In 10gR2, significant waits on events like gc cr block congested or gc current block congested likely mean that the LMS processes were starved for CPU. Depending on the size of the buffer cache, multiple LMS processes can speed up instance reconfiguration, recovery and startup; this should be borne in mind when configuring machines with large SGAs. With a large buffer cache you want more than one LMS, especially if you want fast failover. On most platforms, the block server processes run at a high priority by default in order to minimize delays due to scheduling; the priority for LMS is set at startup time.
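  • As an illustrative sketch (not from the slides), the default can be inspected and overridden via the gcs_server_processes parameter; the parameter is static, so a change takes effect only at the next instance startup:

```sql
-- Check the CPU count and the configured number of LMS processes.
SELECT name, value
FROM   v$parameter
WHERE  name IN ('cpu_count', 'gcs_server_processes');

-- Raise the LMS count cluster-wide. Static parameter: SPFILE scope and a
-- restart are required. The value 4 here is just an example.
ALTER SYSTEM SET gcs_server_processes = 4 SCOPE = SPFILE SID = '*';
```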
  • In the following slides, we present the most common issues you are likely to encounter with RAC and the global cache, along with their symptoms, possible solutions, and a guideline for diagnosing the different problems. A highly visible issue in 10g is the loss of messages due to network errors or congestion; these problems usually show up as “lost blocks”. The disk subsystem may impact performance in RAC significantly: loads such as queries scanning large amounts of data, backups, and other concurrent work may affect the same disks or disk groups and cause bottlenecks. When such extra loads run on a particular node, other nodes may be affected even though those nodes show no particular symptoms except for higher average log write and disk read times. A high CPU utilization or context switching load can affect the performance of the global cache by adding run queue wait time to the access latencies; it is important to ensure that the LMS processes can run predictably and that interconnect messages and clusterware heartbeats can be processed predictably. Avoiding negative feedback when the servers slow down under load and existing connections are busy is an important best practice: unconstrained dumping of new connections onto the database instance can aggravate a performance issue and render a system unstable. Application contention, such as frequent access to the same blocks, can cause serialization on latches, in the buffer cache of an instance, and in the global cache; if the serialization is on globally accessed data, the response time impact can be significant. When these symptoms become dominant, regular application and schema tuning will take care of most of these bottlenecks. Unexpectedly high latencies for data access should be rare, but can occur in some cases of network configuration problems, high system load, process spins or other extreme events.
  • With the so-called “lost block” issue, you will actually see a wait event indicating that time is spent waiting for blocks which are “lost”; it is almost always a network configuration or congestion issue. It means that a user process has made a request for data via the interconnect, the block has been sent by an LMS on a remote node, and it has not arrived after a certain period of time (usually about 5 secs in 10g). The block is then considered “lost”, probably due to a flow control or congestion issue (buffer overflows in switches or NICs). If lost blocks or packets occur frequently, the impact in 10g is usually severe; it therefore accounts for a large part of the performance-related escalations in RAC. Assuming that the interconnect is a private network, the most frequent symptoms which can be detected on the servers using the netstat or ifconfig commands are buffer overflows, packet reassembly failures, errors on the NIC, etc., and can be fixed by increasing receive and send buffer sizes or adjusting flow control settings. Where switches are involved, monitoring the ports which connect the nodes to the switch fabric is required. Sometimes a network sniffer (such as Ethereal) can be of great diagnostic value. The use of jumbo frames reduces the probability of lost blocks, as the Oracle data blocks are then less likely to be fragmented into small MTUs (e.g. an 8K block is sent in 5-6 frames over Ethernet).
  • A “lost block” issue by example: receive errors on eth0 are detected with ifconfig. The ifconfig command should not show any positive values for errors, dropped or overruns. Overruns indicate that the NIC's internal buffers should be increased, while dropped may indicate that the driver and OS layers cannot drain the queued messages fast enough. Here the problem is in the lower portions of the network stack.
  • Another lost block issue by example, a bit higher up in the network stack, namely at the IP layer. Oracle blocks are fragmented by the sender and reassembled by the receiver's IP stack; an 8K Oracle block may require 5-6 packets at an MTU size of 1500 bytes. The OS buffers the arriving packets until the last fragment is received. If a fragment does not arrive within a certain time period, all fragments of the UDP packet which constitutes the Oracle block are discarded.
  • Even higher up in the stack, at the application level, the lost block scenarios presented in the previous slides show up as time waited for blocks that do not arrive. The request is cancelled and retried, as you can see in the top 5 wait events of this report. For obvious reasons, these two events often occur together, and they should never be prominent, i.e. in the top 5 list of wait events. Note that the other findings in this list do not look good either, but the network issue needs to be fixed first: it indicates that the infrastructure cannot achieve good performance and scalability, and the problem cannot be solved by any other means of tuning.
  • In Oracle 11g, the impact of the lost block issue is mitigated by a lower detection time. The algorithm is robust and avoids false positives without causing any overhead. Although the impact of lost blocks is reduced, the issue is still of concern and should not be underestimated merely because the time spent waiting for data that does not arrive may no longer show up in the top 5 wait event list. Note that the cr request retry event can be a logical consequence of losing blocks; even in 11g, where the impact of the failure is reduced, these events should never become significant.
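  • Outside of an AWR report, a quick hedged check for these symptoms is to query the wait statistics directly (event names as in 10g/11g):

```sql
-- Lost-block symptoms per instance: these events should be near zero.
SELECT inst_id, event, total_waits,
       ROUND(time_waited_micro / 1000) AS time_waited_ms
FROM   gv$system_event
WHERE  event IN ('gc cr block lost', 'gc current block lost',
                 'cr request retry')
ORDER  BY time_waited_ms DESC;
```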
  • In 11g, probes of various sizes are sent infrequently from the IPC layer below the global cache to all instances. This results in a running “bottom line” for all messaging operations. This is a sample of a new section added to the AWR report which shows the interconnect statistics. The data summaries are stored in the AWR repository and are used by the automated diagnostics framework to provide advisories; the underlying V$ views can also be queried directly. The actual report has more statistics, such as throughput and send/receive errors and dropped packets, and it also groups data by the clients which call into the IPC layer, e.g. the global cache, global enqueue management or the parallel execution layer. This data is also useful in detecting errors and dropped packets, obviating the need to use netstat or ifconfig.
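  • A sketch of querying the probe results directly; the view and column names are as we recall them for 11g, so verify them against your release:

```sql
-- Current and running-average roundtrip times of the 500-byte and 8K
-- probes sent to each instance (column names may vary by release).
SELECT instance, current_500b, average_500b, current_8k, average_8k
FROM   v$instance_ping;
-- Historical summaries are retained in DBA_HIST_INTERCONNECT_PINGS.
```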
  • The solution to the lost block issue is almost always the same: network errors or congestion cause data requested by Oracle to be dropped. The problem can always be fixed by tuning buffer sizes, setting flow control and NIC hardware parameters correctly, replacing NICs, or updating firmware.
  • Moving on to the next big group of issues: disk IO. Any IO capacity problem or bottleneck may impact RAC. First off, the storage is global to the cluster, and a badly behaving node or badly balanced disk configuration can affect the disk read and write performance of all nodes. Some operations in the global cache may involve log flushes, in cases where a frequently modified block is also frequently read on all nodes in the cluster, i.e. read across the interconnect. If the changes (i.e. redo) for those blocks have not been written to the logs when such a read request from another node arrives, the global cache asks LGWR to synchronously write the redo before sending the block. For these blocks, the log file write or sync latency determines the access time for the other node. If the IO takes long, users on other nodes wait longer for the data, and the increased access time may result in serialization. In a scenario where a “bad” query saturates disks which are also used for log files, the impact of the bad query on log file sync performance can be considerable. In 11g, ADDM and AWR present a global picture as well as instance-specific drill-downs, i.e. cluster-wide IO issues can be identified with more ease.
  • Here is a cluster-wide IO issue by example; it is a real case. High IO volume on node 2, caused by a query with a plan that should not have run there, impacts log file sync on node 1. Note that the wait events for the global cache are marked as busy. If a wait event is marked as busy, it means that the block could not be sent immediately, and for data blocks it is highly likely that a log flush took long. If those blocks are frequently accessed by users on all nodes, serialization may become a secondary symptom, as indicated by the gc buffer busy wait event here.
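  • The busy vs. immediate breakdown behind these events can also be seen per block class and sending instance; a hedged sketch using the 10g view:

```sql
-- Block transfers into this instance, by sending instance and block class.
-- High *_busy counts relative to cr_block/current_block point to log flush
-- delays or serialization on the sending side.
SELECT instance, class,
       cr_block, cr_busy,
       current_block, current_busy
FROM   v$instance_cache_transfer
WHERE  cr_busy > 0 OR current_busy > 0;
```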
  • After the query issue was fixed, the system returned to normal, expected behavior. Note that the log file sync time has come down considerably, and that the events marked as “busy” have disappeared. All blocks are sent immediately, i.e. no log flushes are required. This list of events also presents the goal of any tuning for the global cache: to see only events marked as 2-way or 3-way in the top 5 or with significant impact on the call time.
  • In the post mortem for the problem in the previous slides, the top 5 wait event list from node 2 at the time of the problem shows clear signs of frequently executed table scans: scattered reads and multi-block reads are the indicators. In this scenario, it is most important to realize that the IO issue needs to be identified and fixed first, before looking at any other symptoms.
  • Best practice checkpoint: tuning IO layout and queries is most important in RAC, as in non-RAC systems. If there are clear signs of a disk performance problem, e.g. long-lasting log syncs and read bottlenecks, identify the cause and remove it, i.e. add more disks, stripe them differently, or simply fix the queries. As you have seen in the previous examples, the secondary symptoms indicated an issue with the global cache, although at first sight the wait events in the top 5 list did not appear correlated or causally connected. In this case, ADDM would have ranked the impact and significance of the problem, identified the query, and provided recommendations.
  • A highly utilized server in the cluster, in terms of high CPU utilization or context switches, can affect the efficiency with which the block server processes respond to messages and process the requests. If an LMS cannot be scheduled in order to process messages which have arrived in its request queue, the time in the run queue adds to the data access time for users on other nodes. The hint congested indicates that it may have taken long to access a block because the block server process was too busy or did not get the CPU in time to serve the data request. This case should be relatively infrequent, as LMS runs at a higher priority than any other database process, but it can still occur when processes external to the database run at an even higher priority and unfairly consume large shares of the CPU power. Of course, it is also possible that the high priority for the LMS processes could not be set at startup. Checking the priority of LMS and eliminating external processes which may cause starvation are the most important actions to take here. Starting more block server processes is possible, and recommended if the individual LMSs are already very busy (90-100% of a CPU); see the sketch below.
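  • A minimal sketch for confirming how many LMS processes are running before starting more; their CPU usage and priority can then be checked at the OS level using the listed PIDs:

```sql
-- List the LMS (GCS server) background processes with their OS PIDs.
SELECT b.name, p.spid
FROM   v$bgprocess b
JOIN   v$process   p ON p.addr = b.paddr
WHERE  b.name LIKE 'LMS%';
```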
  • As a short best practice checkpoint: when events marked as congested find their way into the top 5 events, CPU and process tuning is the correct course of action. The goal for the global cache is to minimize the wait time for global cache events and to see only waits marked as 2-way or 3-way in the top 5.
  • In RAC, as in non-RAC systems, contention and serialization in the application or schema design affect performance and scalability. Any frequently accessed data may have hotspots which are sensitive to how many users access the same data concurrently. Any slight increase in the access time to that data can cause queueing and serialization, which in a RAC cluster can magnify a bottleneck. To identify contention and serialization, a hint is added to the event which characterizes the time spent waiting; this hint is useful for identifying the tables and indexes for which contention is high (see the sketch below). SQL and schema tuning aimed at removing those hot spots is the correct action in such cases. However, in this example it is also very likely that the high average latency for immediate block transfers aggravates the contention, and it should be looked at first.
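  • A sketch for finding those hot tables and indexes; the statistic name 'gc buffer busy' is as in 10g and may differ in other releases:

```sql
-- Top 10 segments by global cache serialization.
SELECT *
FROM  (SELECT owner, object_name, object_type, value
       FROM   v$segment_statistics
       WHERE  statistic_name = 'gc buffer busy'
       ORDER  BY value DESC)
WHERE  rownum <= 10;
```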
  • The best practice checkpoint for the category of performance issues associated with contention is to find the hot spots and tune them, as one would in a non-RAC system. Note that if you are running a single instance, you will still see this problem: in single instance you see buffer busy waits or latch contention, which is often an indicator that, when moving from a non-RAC to a RAC system, the same bottleneck will cause a performance problem.
  • From the previous example, which dealt with contention, it became clear that it is not always the events marked with busy hints that are the most important. In this case, the unexpectedly high access times for immediate sends are problematic, and they also have the highest impact on the response times. Remember we said that a transfer should take less than a millisecond; here we see high latency for the transfer of blocks. This is not really a RAC problem: RAC is the victim of either a network problem or high system load. The tuning goal here should be to minimize the latencies for the 2-way and 3-way accesses. The rule of thumb is that they should be around 1 ms on average, and that double-digit average access times, or access times slower than the average disk read IO, are suspect (see the sketch below). Those must be tackled first, before moving on to removing the hot spots.
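  • The averages for the immediate transfers can be computed directly from the wait statistics; a hedged sketch (event names as in 10g/11g):

```sql
-- Average latency of immediate block transfers: expect roughly 1 ms;
-- double-digit averages are suspect.
SELECT inst_id, event, total_waits,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 2) AS avg_ms
FROM   gv$system_event
WHERE  event IN ('gc cr block 2-way',      'gc cr block 3-way',
                 'gc current block 2-way', 'gc current block 3-way')
ORDER  BY avg_ms DESC;
```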
  • The best practice checkpoint for this scenario is to run some network diagnostics, and to ensure that the interconnect is a private network and that the link is operating at the expected bit rate. Frequent retries due to network errors can also cause similar symptoms, since they may cause the user processes to spin for a short while. Of course, bugs are not excluded.
  • Last but not least for this section, it is important to keep in mind what is “good” and what is “bad”, or in other words, which events and performance levels are expected. As a final health check and summary for this section, here is a list of “bad” symptoms that one should be aware of and tackle if they show up in the top 5 list of events for which time is spent waiting. The list is ordered by importance for performance, i.e. when these symptoms are removed, the performance and scalability of a RAC cluster will be acceptable. Network issues will always be a problem; contention and serialization should be taken seriously and can almost always be solved by application and schema tuning; and load and system tuning will solve a large class of problems. In summary, the cluster should be tuned so that load, contention, network errors and unexpectedly high latencies do not show up in the top 5 list. The diagnostics framework built into the Oracle kernel provides useful recommendations and guidance to facilitate this process.
  • RAC best practices have accumulated a wealth of knowledge learned from real-life environments, and following them most of the time eliminates any tuning effort. In general, good application and database design and SQL tuning will resolve the large majority of performance and scalability issues in RAC; the practices are therefore not fundamentally different from performance tuning in a non-RAC system. Existing bottlenecks are very likely to become worse when the application is migrated to RAC.
  • Over the past few years, the performance and scalability issues have crystallized around a few fundamental themes. At the top of the list, hot spots and serialized access to data constitute the most serious scalability issues in RAC. A right-growing index, due to the use of sequence numbers for keys, can cause a severe response time increase when the index is modified from all instances in the cluster; so can a heavily accessed data block in a table, or the improper configuration of segments or tablespaces at create time. Full table scans in RAC may entail higher CPU consumption, and may not be a good thing anyway, regardless of RAC; it is always a good idea to look at execution plans and consider parallel execution and direct reads when large scans are made. In 11g, read-mostly parts of an application will benefit from a new optimization, as will full table scans for buffered reads. Concurrent DDL, such as dropping or truncating tables, involves cross-instance cache invalidations; these are heavy-handed operations which may serialize. Creating or dropping partitions on the fly and using them immediately in online processes can cause additional invalidations of library cache objects and hard parses. As with concurrent DDL, the implication is that invalidations and parsing need to occur on all instances and must be globally synchronized.
  • A lot of best practices have been accumulated and published over the years and are available in the form of tech notes and white papers. For 10g and 11g, these best practices have been incorporated into the AWR and ADDM advisories, which are accessible via reports or directly in EM.
  • In the previous section, we already outlined the basic flow of performance drill-downs using AWR data: one starts top-down with a look at where most of the time in the database is spent, then identifies the high-impact events related to contention and system load, checks the latencies against the expected ones, and then moves on to the SQL and segments for which the impact is highest. In 10g and 11g, ADDM condenses this multi-step approach into a single run; in fact, ADDM runs automatically for every statistics snapshot, and the findings can be queried. ADDM should always be consulted before considering the more detailed and less interpretive AWR statistics.
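  • A sketch of pulling the automatic ADDM findings from the advisor views (as in 10g/11g) instead of reading the raw AWR statistics:

```sql
-- ADDM findings ranked by impact; MESSAGE carries the finding text.
SELECT f.task_name, f.type, f.impact, f.message
FROM   dba_advisor_findings f
ORDER  BY f.impact DESC;
```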
  • The previous example illustrates the diagnostics flow: an IO response time issue with high impact on response times is found in the top 5 wait events. Looking at the events, it can be assumed that the majority of time is spent waiting for full table scans and that this causes contention in the buffer cache.
  • In a drill-down to identify the SQL, the section of the AWR report which lists the work done by individual queries is analyzed for queries with a high number of physical reads. In this case, a query which reads the table ES_SHELL is identified; for the same query, there is both a disk IO and an interconnect finding. The report also gives the hash id of the SQL, so that the execution plan can be inspected.
  • The segment statistics allow us to conclude that the table accessed in the query is not only the one with most of the physical reads, but also the one on which global contention is experienced. After this drill-down, the IO issue is largely explained and can be tackled immediately.
  • EM, based on ADDM and its automatically generated findings and recommendations, combines these steps and puts out database- and cluster-wide impact rankings and recommendations. Note the affected instances in this global view of performance in the cluster. The findings also indicate that there are recommendations for the cluster interconnect and for SQL affected by interconnect latencies, which can be resolved by SQL tuning.
  • The entire performance diagnostics framework in 10g and 11g makes performance analysis and troubleshooting more efficient, virtually at the push of a button. The ranking of impact and the diagnostics flow which we gave by example in the previous section are incorporated into this framework. The recommendation is therefore to use it, as it saves time and reduces the effort of identifying issues in a cluster. For trending and post mortem diagnostics, it is good practice to export the AWR repository regularly and archive it. The retention time for the snapshots is about 1 week by default, so a weekly export and archiving of the export file, or an import into a statistics warehouse, are infrequent and low-impact operations.
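  • A sketch of the corresponding housekeeping; the retention argument of DBMS_WORKLOAD_REPOSITORY is given in minutes, and awrextr.sql is the packaged export script:

```sql
-- Keep 30 days of snapshots instead of the default retention.
EXEC DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(retention => 30*24*60);

-- Export the AWR repository for archiving (run from SQL*Plus):
-- @?/rdbms/admin/awrextr.sql
```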
  • In 11g, ADDM is also global: the analyst obtains global and local findings (for specific instances and for the entire cluster/database) in the same report. Its findings include analyzing the impact of particular instances on other instances, i.e. remote dependencies.
  • The full scope of what ADDM does for RAC covers what we discussed in the previous sections: contention and congestion are found and diagnosed, as well as network problems. It is a productivity infrastructure for performance diagnostics which we recommend exploiting in any RAC system, for the benefit of users and analysts.

    Slide 2: Practical Performance Management for Oracle RAC. Barb Lundhild, RAC Product Management; Michael Zoll, RAC Development, Performance

    Slide 3: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

    Slide 4: Agenda
      • Oracle RAC Fundamentals and Infrastructure
      • Common Problems and Symptoms
      • Application and Database Design
      • Diagnostics and Problem Determination
      • Summary: Practical Performance Analysis
      • Appendix

    Slide 5: Objective
      • Realize that Oracle RAC performance does not require “black magic”
      • General system and SQL analysis and tuning experience is practically sufficient for Oracle RAC
      • Problems can be identified with a minimum of metrics and effort
      • The diagnostics framework and advisories are efficient

    Slide 6: RAC Fundamentals and Infrastructure
    Slide 7: Oracle RAC Architecture (diagram: nodes 1..n on a public network, each running the operating system, Oracle Clusterware, a database instance, ASM, a VIP, a listener and services; shared storage holds the database and control files, the redo and archive logs of all instances, and the OCR and voting disks, managed by ASM or on raw devices)

    Slide 8: Oracle Clusterware (diagram: each node runs EVMD, CRSD, OPROCD, ONS, CSSD and a VIP; CSSD runs at real-time priority; the OCR and voting disks reside on shared storage or raw devices)

    Slide 9: Under the Covers (diagram: each instance's SGA holds the buffer cache, log buffer, library cache, dictionary cache and Global Resource Directory; background processes include LMS0, LMON, LMD0, DIAG, VKTM, LGWR, DBW0, SMON and PMON; LMS runs at real-time priority; instances communicate over a private high-speed cluster network and share the data files, control files and redo log files)
    Slide 10: Global Cache Service (GCS)
      • Manages coherent access to data in the buffer caches of all instances in the cluster
      • Minimizes access time to data which is not in the local cache
         • access to data in the global cache is faster than disk access
      • Implements fast direct memory access over high-speed interconnects
         • for all data blocks and types
      • Uses an efficient and scalable messaging protocol
         • never more than 3 hops
      • New optimizations for read-mostly applications
    Slide 11: Cache Hierarchy: Data in Remote Cache (diagram: local cache miss; data block requested; data block returned; remote cache hit)

    Slide 12: Cache Hierarchy: Data on Disk (diagram: local cache miss; data block requested; remote cache miss; grant returned; disk read)

    Slide 13: Cache Hierarchy: Read Mostly (diagram: local cache miss; no message required; disk read)

    Slide 14: Performance of Cache Fusion (diagram: the requester initiates a send and waits; LMS receives, processes the block and sends; the ~200-byte message and the e.g. 8K block transfer at 200 bytes/(1 Gb/sec) and 8192 bytes/(1 Gb/sec) respectively). Total access time: e.g. ~360 microseconds (UDP over GbE). Network propagation delay (“wire time”) is a minor factor in the roundtrip time (approx. 6%, vs. 52% in the OS and network stack).
    Slide 15: Fundamentals: Minimum Latency (*), UDP/GbE and RDS/IB

      RT (ms)   2K     4K     8K     16K    (block size)
      UDP/GE    0.30   0.31   0.36   0.46
      RDS/IB    0.12   0.13   0.16   0.20

    (*) roundtrip, blocks are not “busy”, i.e. no log flush, no serialization (“buffer busy”). AWR and Statspack reports show averages as if they were normally distributed; the session wait history, included in Statspack in 10.2 and AWR in 11g, shows the actual quantiles. The minimum values in this table are the optimal values for 2-way and 3-way block transfers, but can be taken as the expected values (i.e. 10 ms for a 2-way block would be very high).
    Slide 16: Infrastructure: Private Interconnect
      • The network between the nodes of a RAC cluster MUST be private
         • best practice is not to share the interconnect with iSCSI storage
      • Supported links: GbE, IB (IPoIB: 10.2)
      • Supported transport protocols: UDP, RDS
      • Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding
      • Large (jumbo) frames recommended for GbE

    Slide 17: Infrastructure: Interconnect Bandwidth
      • Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application
      • Typical utilization approx. 10-30% in OLTP
         • 10000-12000 8K blocks per sec to saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth)
      • Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP
      • DSS/DW systems should be designed with > 1 Gb/sec capacity
      • A sizing approach with rules of thumb is described in “Project MegaGrid: Capacity Planning for Large Commodity Clusters” (http://otn.oracle.com/rac)

    Slide 18: Infrastructure: IPC Configuration
      • Important settings:
         • negotiated top bit rate and full duplex mode
         • NIC ring buffers
         • Ethernet flow control settings
         • CPU(s) receiving network interrupts
      • Verify your setup:
         • CVU does checking
         • load testing eliminates potential for problems
         • AWR and ADDM give estimations of link utilization
      • Buffer overflows, congested links and flow control can have severe consequences for performance

    Slide 19: Infrastructure: Operating System
      • Block access latencies increase when CPU(s) are busy and run queues are long
         • immediate LMS scheduling is critical for predictable block access latencies when CPU > 80% busy
      • Fewer and busier LMS processes may be more efficient
         • monitor their CPU utilization
         • caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time
         • the default is good for most requirements
      • Higher priority for LMS is the default
         • the implementation is platform-specific
    Slide 20: Common Problems and Symptoms

    Slide 21: Common Problems and Symptoms
      • “Lost blocks”: interconnect or switch problems
      • Slow or bottlenecked disks
      • System load and scheduling
      • Contention
      • Unexpectedly high latencies

    Slide 22: Misconfigured or Faulty Interconnect Can Cause:
      • Dropped packets/fragments
      • Buffer overflows
      • Packet reassembly failures or timeouts
      • Ethernet flow control kicking in
      • TX/RX errors
      • “lost blocks” at the RDBMS level, responsible for 64% of escalations
    Slide 23: “Lost Blocks”: NIC Receive Errors
      db_block_size = 8K
      ifconfig -a:
      eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04
           inet addr: Bcast: Mask:
           UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
           RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
           TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
           …

    Slide 24: “Lost Blocks”: IP Packet Reassembly Failures
      netstat -s
      Ip:
         84884742 total packets received
         …
         1201 fragments dropped after timeout
         …
         3384 packet reassembles failed
    Slide 25: Finding a Problem with the Interconnect or IPC
      Top 5 Timed Events        Waits     Time(s)  Avg wait (ms)  %Total Call Time  Wait Class
      log file sync             286,038   49,872    174           41.7              Commit
      gc buffer busy            177,315   29,021    164           24.3              Cluster
      gc cr block busy          110,348    5,703     52            4.8              Cluster
      gc cr block lost            4,272    4,953   1159            4.1              Cluster
      cr request retry            6,316    4,668    739            3.9              Other
      The lost and retry events should never be here.
    Slide 26: Global Cache Lost Block Handling
      • Detection time reduced in 11g
         • 500 ms (around 5 secs in 10g)
         • can be lowered if necessary
         • robust (no false positives)
         • no extra overhead
      • The cr request retry event is related to lost blocks
         • it is highly likely to be seen when gc cr blocks lost show up

    Slide 27: Interconnect Statistics in the Automatic Workload Repository (AWR)
      Target     Avg Latency  Stddev    Avg Latency  Stddev
      Instance   500B msg     500B msg  8K msg       8K msg
      1          .79          .65       1.04         1.06
      2          .75          .57        .95          .78
      3          .55          .59        .53          .59
      4          1.59         3.16      1.46         1.82
      • Latency probes for different message sizes
      • Exact throughput measurements (not shown)
      • Send and receive errors, dropped packets (not shown)

    Slide 28: “Blocks Lost”: Solution
      • Fix interconnect NICs and switches
      • Tune IPC buffer sizes

    Slide 29: Disk IO Performance Issues
      • Log flush IO delays can cause “busy” buffers
      • “Bad” queries on one node can saturate an interconnect link
      • IO is issued from ALL nodes to shared storage
      • Use the Automatic Database Diagnostic Monitor (ADDM) / AWR
         • single system image of I/O across the cluster
         • cluster-wide impact of IO or query plan issues is responsible for 23% of escalations
    Slide 30: Cluster-Wide I/O Impact
      Node 1, Top 5 Timed Events    Waits     Time(s)  Avg wait (ms)  %Total Call Time
      log file sync                 286,038   49,872   174            41.7
      gc buffer busy                177,315   29,021   164            24.3
      gc cr block busy              110,348    5,703    52             4.8
      Node 2, Load Profile (per second): Redo size 40,982.21; Logical reads 81,652.41; Physical reads 51,193.37 (expensive query on node 2).
      1. IO on the disk group containing the redo logs is bottlenecked. 2. Block shipping for “hot” blocks is delayed by log flush IO. 3. Serialization/queues build up.

    Slide 31: IO and/or Bad SQL Problem Fixed
      Top 5 Timed Events            Waits     Time(s)  Avg wait (ms)  %Total Call Time  Wait Class
      CPU time                                 4,580                  65.4
      log file sync                 276,281    1,501     5            21.4              Commit
      log file parallel write       298,045      923     3            13.2              System I/O
      gc current block 3-way        605,628      631     1             9.0              Cluster
      gc cr block 3-way             514,218      533     1             7.6              Cluster
      1. Log file writes are normal. 2. Global serialization has disappeared.

    Slide 32: Drill-down: An IO Capacity Problem (symptom of full table scans, I/O contention)
      Top 5 Timed Events            Waits        Time(s)  Avg wait (ms)  %Total Call Time  Wait Class
      db file scattered read        3,747,683    368,301   98            33.3              User I/O
      gc buffer busy                3,376,228    233,632   69            21.1              Cluster
      db file parallel read         1,552,284    225,218  145            20.4              User I/O
      gc cr multi block request    35,588,800    101,888    3             9.2              Cluster
      read by other session         1,263,599     82,915   66             7.5              User I/O
    Slide 33: IO Issues: Solution
      • Tune the IO layout
      • Tune queries with a lot of IO

    Slide 34: CPU Saturation or Long Run Queues
      Top 5 Timed Events            Waits      Time(s)  Avg wait (ms)  %Total Call Time  Wait Class
      db file sequential read       1,312,840  21,590    16            21.8              User I/O
      gc current block congested      275,004  21,054    77            21.3              Cluster
      gc cr grant congested           177,044  13,495    76            13.6              Cluster
      gc current block 2-way        1,192,113   9,931     8            10.0              Cluster
      gc cr block congested            85,975   8,917   104             9.0              Cluster
      “Congested”: LMS could not dequeue messages fast enough. Cause: long run queue, CPU starvation.

    Slide 35: High CPU Load: Solution
      • Run LMS at higher priority (the default)
      • Start more LMS processes
      • Reduce the number of user processes
      • Find the cause of the high CPU consumption
    Slide 36: Contention
      Event                    Waits    Time(s)  Avg (ms)  %Call Time
      gc cr block 2-way        317,062  5,767    18        19.0
      gc current block 2-way   201,663  4,063    20        13.4
      gc buffer busy           111,372  3,970    36        13.1
      CPU time                          2,938               9.7
      gc cr block busy          40,688  1,670    41         5.5
      Global contention on data; serialization. It is very likely that gc cr block busy and gc buffer busy are related.

    Slide 37: Contention: Solution
      • Identify the “hot” blocks in the application
      • Reduce concurrency on the hot blocks

    Slide 38: High Latencies
      Event                    Waits    Time(s)  Avg (ms)  %Call Time
      gc cr block 2-way        317,062  5,767    18        19.0
      gc current block 2-way   201,663  4,063    20        13.4
      gc buffer busy           111,372  3,970    36        13.1
      CPU time                          2,938               9.7
      gc cr block busy          40,688  1,670    41         5.5
      Tackle latency first, then tackle the busy events. Expected: seeing 2-way and 3-way events. Unexpected: averages > 1 ms (the avg should be around 1 ms).

    Slide 39: High Latencies: Solution
      • Check the network configuration
         • private
         • running at the expected bit rate
      • Find the cause of high CPU consumption
         • runaway or spinning processes
    Slide 40: Health Check. Look for:
      • Unexpected events
         • gc cr block lost, 1159 ms
      • Unexpected “hints”
         • contention and serialization: gc cr/current block busy, 52 ms
         • load and scheduling: gc current block congested, 14 ms
      • Unexpectedly high averages
         • gc cr/current block 2-way, 36 ms
    Slide 41: Application and Database Design

    Slide 42: General Principles
      • No fundamentally different design and coding practices for RAC
      • BUT: flaws in execution or design have a higher impact in RAC
         • performance and scalability in RAC are more sensitive to bad plans or bad schema design
         • serializing contention makes applications less scalable
      • Standard SQL and schema tuning solves > 80% of performance problems

    Slide 43: Scalability Pitfalls
      • Serializing contention on a small set of data/index blocks
         • monotonically increasing keys
         • frequent updates of small cached tables
         • segments without Automatic Segment Space Management (ASSM) or Free List Groups (FLG)
      • Full table scans
         • the optimization for full scans in 11g can save CPU and latency
      • Frequent invalidation and parsing of cursors
         • requires data dictionary lookups and synchronization
      • Concurrent DDL (e.g. truncate/drop)

    Slide 44: Health Check. Look for:
      • Indexes with right-growing characteristics
         • eliminate indexes which are not needed
      • Frequent updates and reads of “small” tables
         • “small” = fits into a single buffer cache
         • sparse blocks (PCTFREE 99) will reduce serialization
      • SQL which scans large amounts of data
         • perhaps more efficient when parallelized
         • direct reads do not need to be globally synchronized (hence less CPU for the global cache)

    Slide 45: Diagnostics and Problem Determination
      MOST OF THE TIME, A PERFORMANCE PROBLEM IS NOT AN Oracle RAC PROBLEM

    Slide 46: Checklist for the Skeptical Performance Analyst (AWR based)
      • Check where most of the time in the database is spent (“Top 5”)
      • Check whether gc events are “busy” or “congested”
      • Check the average wait times
      • Drill down
         • SQL with the highest cluster wait time
         • segment statistics with the highest block transfers
         • or JUST USE ADDM with Oracle RAC 11g!
    Slide 47: Drill-down: An IO Capacity Problem (symptom of full table scans, I/O contention)
      (same Top 5 Timed Events as slide 32: db file scattered read, gc buffer busy, db file parallel read, gc cr multi block request, read by other session)

    Slide 48: Drill-down: SQL Statements
      “Culprit”: a query that overwhelms the IO subsystem on one node:
      Physical Reads  Executions  Reads per Exec  %Total
      182,977,469     1,055       173,438.4       99.3
      SELECT SHELL FROM ES_SHELL WHERE MSG_ID = :msg_id ORDER BY ORDER_NO ASC
      The same query reads from the interconnect:
      Cluster Wait Time (s)  CWT % of Elapsed Time  CPU Time (s)  Executions
      341,080.54             31.2                   17,495.38     1,055
      SELECT SHELL FROM ES_SHELL WHERE MSG_ID = :msg_id ORDER BY ORDER_NO ASC

    Slide 49: Drill-Down: Top Segments
      Tablespace  Object    Subobject  Obj    GC Buffer  % of
      Name        Name      Name       Type   Busy       Capture
      ESSMLTBL    ES_SHELL  SYS_P537   TABLE  311,966    9.91
      ESSMLTBL    ES_SHELL  SYS_P538   TABLE  277,035    8.80
      ESSMLTBL    ES_SHELL  SYS_P527   TABLE  239,294    7.60
      …
      Apart from being the table with the highest IO demand, it was the table with the highest number of block transfers AND global serialization.
    Slide 50: Findings Summary in EM
      • Each finding type has a descriptive name
         • facilitates search/aggregation/directives etc.

    Slide 51: Recommendations
      • The most relevant data for analysis can be derived from the wait events
      • Always use Enterprise Manager (EM) and ADDM reports for performance health checks and analysis
      • Active Session History (ASH) can be used for session-based analysis of variation
      • Export the AWR repository regularly to save all of the above

    Slide 52: ADDM Diagnosis for RAC
      • Data sources are:
         • wait events (especially the Cluster class and buffer busy)
         • ASH
         • instance cache transfer data
         • interconnect statistics (throughput, usage by component, pings)
      • ADDM analyzes both the entire database (DATABASE analysis mode) and each instance (INSTANCE analysis mode)
      • Analysis of both database and instance resources is summarized in a single report
      • Allows drill-down to a specific instance

    Slide 53: What ADDM Diagnoses for RAC
      • Latency problems in the interconnect
      • Congestion (identifying the top instances affecting the entire cluster)
      • Contention (buffer busy, top objects etc.)
      • Top consumers of multiblock requests
      • Lost blocks
      • Reports information about interconnect devices; warns about using PUBLIC interfaces
      • Reports the throughput of devices, and how much of it is used by Oracle and for what purpose (GC, locks, PQ)
    Slide 54: Q & A (Questions and Answers)

    Slide 55: Other Sessions to Check Out (Thursday)
      • S291242 Demystifying Oracle RAC Internals, South 104, 10:00 AM
      • S291662 Using Oracle RAC and Microsoft Windows 64-bit as the Foundation (with Intel and Talx), South 309, 1:00 PM
      • S291670 Oracle Database 11g: First Experiences with Grid Computing (with Mobiltel and BCF), South 310, 4:00 PM

    Slide 56: For More Information
      http://search.oracle.com or otn.oracle.com/rac
      REAL APPLICATION CLUSTERS