Oracle database performance diagnostics - before you begin


This is an article that I had written in 2011 for publication on OTN. It never did appear. So I am making it available here. It is not "slides" but is only 7 pages long. I hope you find it useful.


Oracle Database "Performance Diagnostics": Before You Begin
Hemant K Chitale

Introduction

What do you, as the DBA / Developer / System Administrator / Analyst / Performance Analyst / Application Manager, do when you get calls like:

1. The "system" is slow
2. The batch job is "hanging"
3. Users cannot login

Are these Database Performance issues? Always? Where do you begin diagnostics? Do you jump into trace files, StatsPack / AWR, OS statistics, etc.?

This article is a primer on what you should be aware of *before* you begin looking at Oracle Trace Files, Explain Plans, Statistics and what-not. The diagnostic process must be able to help the Oracle Database Performance Analyst identify:

a. Whether there really is an "issue"
b. How well the issue is defined, and if necessary redefine it
c. Where the cause arises
d. What can be done to address the cause

Note: This article is NOT about how to use Oracle and OS methods to diagnose a performance issue and/or tune an SQL/Application/Schema/Database. It is about what you should be aware of before you begin.

Environmental Factors

Let's begin with some basic factors:

1. Response Time

"Response Time" is what users (and application servers!) see. They do not see 'consistent gets' or 'redo size' or 'enq: TX - row lock contention'. User perception of a system's usability is significantly impacted by Response Time. "Fit for use" (the application is usable) must co-exist with "fit for purpose" (the application does what it is supposed to do). On the other hand, Response Time for a batch job can vary from the execution time of a (significant) single SQL call to the elapsed time of a key stage in the job.

© Hemant K Chitale 2011.
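The way user-visible Response Time accumulates across the tiers (Factor 2, below) can be sketched numerically. The tier names and timings here are hypothetical, purely for illustration:

```python
# Hypothetical per-tier timings (seconds) for one user request.
# Real numbers would come from tracing each tier end to end.
tier_times = {
    "browser rendering": 0.3,
    "network to app server": 0.2,
    "app server processing": 0.5,
    "app-to-db round trips": 1.1,
    "database execution": 0.9,
}

total_response = sum(tier_times.values())

# The tier with the largest share is the first candidate for diagnosis.
worst_tier = max(tier_times, key=tier_times.get)

print(f"Total response time: {total_response:.1f}s")
print(f"Largest contributor: {worst_tier} "
      f"({tier_times[worst_tier] / total_response:.0%})")
```

The point of the exercise is that the database may be only one contributor to the total, and not necessarily the largest.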
2. Tiers

There are very many tiers through which a response reaches a user (or an application server, depending on who/what has "response issues"). From the desktop, via a browser, over the internet/intranet to an application server; rewritten as an SQL call to the database; parsed and executed by the database; CPU and I/O cycles consumed to fetch, filter and compute values; round-trips between the application server and database server; formatting on the application server; latency down to the user's desktop: very many tiers are comprised in an application's performance. Such tiers also exist in a batch job – often ignored are the round-trips between the application server and database server.

3. Capacity

Each "component" (be it the User's Desktop or the WAN Link or the App Server CPU or the App Server RAM, etc., down through the Tiers) has a defined Capacity – theoretical and practical. Within a database instance, too, there are capacity parameters – e.g. the SGA sizing parameters, the PROCESSES parameter, etc.

4. Usage

Usage of the available capacity of any component varies from time to time. Any tool that "measures" usage has to collect a snapshot of usage at a certain point in time. Multiple snapshots must be analyzed together.

5. Throughput

Throughput is the volume of "load" (Transactions/Queries/Rows/Users – each is a different facet of "load") that is being serviced by the "system".

6. Constraints

Capacity is a constraint. Concurrency is a constraint as well. Two users/processes/sessions may not be permitted to modify the same row/resource at the same time.

7. Serialisation

Because Capacity is not unlimited and because there are Constraints (automatic/system/artificial/user-defined), there may well be some points in application code or database code or the operating system where serialisation occurs.

8. Requirements

Volume requirements, usability requirements and control requirements are defined by users / analysts and must be built into the "system". Requirements also add to code complexity.

9. Scalability

Scalability of a system is its ability to handle additional workload without a more than proportional increase in component resource (CPU, RAM, I/O) usage. Scalability is
adversely impacted by points of contention or serialisation in the requirements / design / code.

10. Non-Linearity

Many systems are non-linear. If a query that processes ten thousand rows that are always in memory and never overflows to disk for Group/Sort operations takes 1 second to run, it doesn't necessarily follow that a hundred thousand rows would take 10 seconds. The hundred-thousand-row query may require multiple disk reads because not all rows are cached in memory and, furthermore, the Group/Sort operations may also overflow to disk.

11. Shared Resources

A database server may be configured to host multiple databases. The CPU and I/O load of one or more "other" databases may well be "interference" in the performance of the database under review. The "cost" of such "interference" must be computed and accounted for. Similarly, within a database, Batch reports may interfere with online queries. Also, when multiple schemas (e.g. for different "applications") are provided for within a database, they share and contend for shared pool, library cache and buffer cache resources as well as for CPU and I/O.

These basic Factors apply to any System. They apply to Airports and Aeroplanes. They apply to Factories and Refineries. They apply to Hotels and Restaurants. They apply to Applications using Oracle Databases. As an Oracle Database Performance Analyst (a DBA or a Developer or a System Administrator), it is necessary to be aware of these Factors.

Definition of "Issue"

The definition is the first step in the process. Start by identifying the command/process/job that is under contention. Is it a daily task? How many components (see Factor 2, "Tiers") does it involve? Do you / the team need to evaluate the capacity, usage and throughput of each of the Tiers? Can a specific Tier be identified as a constraint? Typically, for a performance issue, the best reference is "Time Taken".
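Whether the "Time Taken" of the latest run really signals an issue can be judged against the variance of previous runs. A minimal sketch, with hypothetical run times:

```python
import statistics

# Hypothetical run times (minutes) from previous comparable executions.
history = [14, 15, 16, 15, 14, 16, 15]
current = 45

mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Flag the run if it sits far outside the normal variance of past runs.
is_anomalous = current > mean + 3 * stdev

print(f"History: {mean:.1f} +/- {stdev:.1f} min; current run: {current} min")
print("Likely a real issue" if is_anomalous else "Within normal variance")
```

Comparable executions are essential here: the historical runs must be at the same time of day and for similar data volumes, or the variance is meaningless.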
How long does the particular command/process/job take to run? How long did it take to run on previous occasions? Was there any variance in run times on previous occasions? Can a test system / test run be executed? Can the test be traced (end to end, from the user to system level waits and back to the user)? Can the production run be traced? Can © Hemant K Chitale 2011.
both traces be compared? Remember: the test may not have the same level of components, capacity, throughput and usage, and may have a different set of constraints.

When analysing the performance of a particular system/job/process/function, it is also important to be able to differentiate between "short, sharp" queries and sessions and "medium to long running" batch jobs and reports. A system may have a mix of such operations.

Some of these questions may not need to be formally asked. The answers may be well known or documented (e.g. the components and capacity). Others may need to be discovered (e.g. previous response times, usage). Throughput and constraints may get identified only during the diagnostics phase (unless some of them are "well known" and documented).

A good definition of an issue might be: Program "A" that takes 15 minutes to run at (approximately) the same time every day (on the same server), for the same volume of data, is now (since the last 2 days) taking 45 minutes, although no change to program code or parameters has been made.

Another good definition might be: Users are usually able to view the details on screen within 5 seconds of submitting the query, navigate through all the screens in 15 seconds and commit in 2 seconds, but the same query on the same data is now taking 25 seconds, 30 seconds and 5 seconds respectively, under the same user workload.

Another example might be: We have exactly doubled the incoming data volume for the ETL job but processing time is now 5x, with no other changes to the system.

Collection of Data

Use the "Questionnaire for Issue Identification" in the Appendix. Remember, not all questions need to be formally raised. Some information may be available from documentation. Some recursion may be necessary – questions or answers that were deemed "insignificant" during the first round of diagnostics may have to be revisited and reviewed (e.g. early discussion may have considered that the network was always stable, but testing or trace files may indicate that network round trips are significant, so that network component ("Tier") may have to be revisited).
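The ETL example above (exactly 2x the volume, 5x the time) can be summarised as a scaling exponent, which makes the non-linearity explicit. A minimal sketch:

```python
import math

def scaling_exponent(volume_ratio: float, time_ratio: float) -> float:
    """Exponent k such that time_ratio == volume_ratio ** k.
    k near 1 means linear scaling; k > 1 means super-linear growth."""
    return math.log(time_ratio) / math.log(volume_ratio)

# The ETL example from the text: volume exactly doubled, run time went 5x.
k = scaling_exponent(2.0, 5.0)
print(f"Scaling exponent: {k:.2f}")  # well above 1: growth is super-linear
```

A k this far above 1 is a strong hint that some resource limit (memory, sort area, cache) was crossed between the two volumes, as described under Non-Linearity.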
Some of the data collection may take time – e.g. running a trace and analyzing the trace file. You may need to prioritize which data is to be collected early while other collection runs "in the background". Time Data should always be the first priority.

Time Data

Data about "time taken to process/run the query/request/job/batch" should be in terms of seconds, or minutes where the time exceeds an hour. Data about "time for an on-screen query" should be in terms of seconds. Data about "SQL execution time" should be in terms of milliseconds, seconds or minutes. Time data for previous runs (including min/avg/max) and test runs should also be collected. When collecting data about different executions, ensure that the executions are comparable – e.g. at the same time of day, for the same volume.

Time Series Data

Time Series Data (as distinct from "Time Data") is about plotting performance information and statistics over time and validating whether a trend exists. If such a trend exists, it must be considered as a factor when evaluating and projecting load and performance. Such Time Series Data covers not only performance and response times but also volume and workload, concurrency and throughput.

Components (Tiers) Data

Data about the Tiers involved should include:
a. Hardware size (number of CPUs/Cores, CPU speeds, RAM, HBAs, Network Interfaces)
b. Operating System and FileSystem types
c. OS performance counters – sar, vmstat, iostat, top, topas
d. Latency (min/avg/max)

Volume / Workload Data

Data about Volume and Workload should include:
a. Number of concurrent, active users
b. Number and sizes of rows being processed
c. Number and sizes of batch jobs running concurrently

Such workloads impact throughput and concurrency.
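Validating whether a trend exists in Time Series Data can be done with an ordinary least-squares slope over successive run times. A minimal sketch in plain Python, with hypothetical daily run times:

```python
# Hypothetical daily run times (minutes) for the same job over two weeks.
run_times = [15, 16, 15, 17, 18, 17, 19, 20, 21, 20, 22, 23, 24, 25]

def ols_slope(ys):
    """Least-squares slope of ys against their index (units per observation)."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

slope = ols_slope(run_times)
print(f"Run time trending {slope:+.2f} minutes/day")
if slope > 0.1:
    print("Upward trend: factor this into load/performance projections")
```

A persistent positive slope says the job is steadily degrading (e.g. with growing data volume) rather than having hit a one-off incident, which changes the kind of recommendation you would make.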
Execution Plans, Statistics, Wait Events

Details about SQL Execution Plans, Execution Statistics (e.g. "consistent gets") and Wait Events are to be collected and analyzed when it is determined that performance within the database needs to be reviewed. Let me emphasise: this is only after you have determined that the database and, in particular, a specific portion of the application needs to be reviewed. Do not jump into this too soon. I put this last in the list of data to be collected.

Interpreting the Data

The Time data must be interpreted to identify patterns. For example, has the job been taking ever-increasing time as the weeks/months have progressed? Does the job take more time on certain days or at certain times? Is there a correlation between the Time and the Volume? Can a report that is to be run every 30 minutes be allowed to take 10 minutes to run? Should the report OR the schema OR the data loads be redesigned to have the report run in less than 1 minute? Or should the frequency of the report be changed to run every 60 minutes?

Workload/Volume/Usage and Capacity/Throughput/Tiers data must be correlated. Does a 20% increase in Workload/Volume/Usage result in a 20% increase in CPU usage?

Oracle Trace Files, Oracle Wait Statistics and Server Performance (sar, vmstat, ping latency) data must be reviewed to identify component resource utilisation. The key resources (CPU, RAM and I/O) are used to transfer data to the user. Therefore, it is necessary to correlate the usage of these resources to the volume of data. Does the query that fetches 100 rows without having to do any aggregation really need to do 1 million buffer gets?

Making Recommendations

What changes (schema, code, architecture) you recommend will, to a not inconsiderable degree, depend on your prior experience and "confidence" level in the tools and methods used. Remember that your proposed changes may interact with and impact other environmental factors! Identify which "environmental factors" are impinging on performance. Your recommendation should be able to address the factor.
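The "100 rows but 1 million buffer gets" sanity check above can be expressed as a simple gets-per-row ratio. A sketch; the threshold is illustrative, not an Oracle-documented figure:

```python
def gets_per_row(buffer_gets: int, rows_fetched: int) -> float:
    """Logical reads per row returned: a rough SQL efficiency indicator."""
    return buffer_gets / rows_fetched

# The example from the text: 100 rows fetched, 1 million buffer gets.
ratio = gets_per_row(1_000_000, 100)
print(f"{ratio:,.0f} buffer gets per row returned")

# Illustrative threshold: thousands of gets per row for a plain fetch
# (no aggregation) usually points at a poor plan or a missing index.
needs_review = ratio > 100
```

The inputs would come from the execution statistics collected in the previous step; the ratio simply correlates resource usage to data volume, as the text recommends.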
A cardinal rule of performance is "never do anything that is not necessary". For example, when you review a user requirement, ask the questions: "Is this requirement necessary? Has it already been met by some portion of the design that the user is not aware of? Should the data be duplicated?" Similarly, when reviewing a system, configuration or code (or a diagnostic trace), ask the questions: "Is this component necessary? Is it duplicated? Is the same task being done repeatedly (e.g. a lookup on the same rows or a validation being done twice)?"

Managing Changes

Once the root cause for an issue is identified, and recommendations are made, the steps of defining, creating, testing and migrating the change (or changes) required have to be carefully managed. Some issues can be addressed by workarounds while others may require changes with long-term impacts. However, workarounds, themselves, may have adverse consequences. A reasonable degree of confidence in the impact assessment is a requirement.

Appendix

Example Questionnaire for Issue Definition:

What is the command/process/job that is under contention? What is it called?
Is it a daily task?
How many components (see Factor 2, "Tiers") does it involve? List each component.
Do you / the team need to evaluate the capacity, usage and throughput of each of the Tiers?
Can a specific Tier be identified as a constraint?
How long does the particular command/process/job take to run?
How long did it take to run on previous occasions?
Was there any variance in run times on previous occasions?
Can a test system / test run be executed?
Can the test be traced (end to end, from the user to system-level waits and back to the user)?
Can the production run be traced? Can both traces be compared?