This presentation will discuss sources and interpretations of DB2 CPU and response metrics. The intended audience is traditional large system capacity planners and performance analysts.
While the target audience is traditional performance and capacity analysts, some specific DB2 terminology must be used. While attempts will be made to provide high level definitions where appropriate, it may sometimes be desirable to reference more detailed explanations. There is a terminology section near the end. Other good definition sources are the glossary in the DB2 Installation Guide starting with DB2 V8 and DB2 Administration Guide for earlier releases, and your DB2 specialists. The last several pages also include a list of references and where to find DB2 record layouts.
MSTR, DBM1, and IRLM will always be present. They are the primary DB2 system address spaces. DIST will be present if there is server work (from another DB2 or remote system). Requests are not necessarily from another MVS and could be received through TCP/IP. SPAS(s) will probably be present if there is significant server work. They essentially contain preloaded stored procedures to provide good performance for high transaction rate environments. RRSAF is the DB2 Recoverable Resource Manager Services attachment facility which uses MVS Transaction Management and Recoverable Resource Manager Service. It is used by SAP. In many environments most DB2 work is received from batch, TSO, CICS, and IMS address spaces. The connection to DB2 is referred to as a thread. In some cases there might be multiple threads (and thus TCBs) for performance (e.g. a CICS region).
Enclaves and preemptible SRBs are examples of the joint evolution of DB2 and MVS to provide performance, especially for large server workloads. Enclaves are effectively virtual address spaces. They are much faster and more efficient to establish when needed, but still provide a place for a transaction to execute and be managed by WLM. A preemptible SRB has many of the desirable characteristics of a TCB (a WLM manageable and interruptible place to execute work) but is much easier and faster to establish. For most purposes there are still logically two kinds of CPU time – preemptible and non-preemptible. They are effectively what were previously referred to as TCB and SRB. Preemptible includes “traditional” TCB and newer manageable types (preemptible SRB). Non-preemptible is what was always referred to as SRB and is still reported separately. For detailed explanations some of the listed references are excellent sources, especially the Enrico and Arwe articles.
The primary focus will be on the DB2 SMF 100 and 101 records, though there will be some specific references to other sources.
Since server work is received directly from a remote system there is not a normal local requestor (e.g. CICS). The DIST AS is effectively the requestor. There is a separate SMF 72 record for each DDF enclave service class, which is where most server CPU time is reported. While DB2 system ASs (other than possibly DIST) do not normally represent a significant amount of CPU time, there are specific DB2 functions associated with reported TCB and SRB times. An easy way to track them is with RMF. If you observe any significant change, consultation with a DB2 specialist is appropriate. The specialist would know which functions are involved and whether the change is reasonable. While RMF reporting will include DB2 application CPU time with other workloads, there is no easy way to break it out from other work.
Usefulness of SMF 30 varies significantly with desire and environment. For batch and TSO you might be close to the DB2 view of a “transaction”. The desire might also be to know how much work someone was doing rather than how much of it was DB2. SMF 30 is obviously not very good for IMS and CICS work if the desire is a transaction view or to separate the DB2 function. System functions are easy to measure given the combination of interval records and separate fields for TCB and SRB time. Note that DIST includes system support for DDF and server application time. Typically most DIST reported time will be application. All of the enclave time (ENC & DET) is application, though some SRB also is. Notice that ASR, ENC, and DET are included with CPT (TCB) because they are effectively like TCB time.
SMF 100 provides a variety of useful system level data, most of which is beyond the scope of this discussion. All the activity metrics are measurements since DB2 monitoring started, so it is necessary to calculate deltas. Values can potentially wrap (overflow and start over). Some fields are the level of the measured thing at the time of reporting (e.g., number of active buffers). Prior to DB2 V7 there was no reasonably easy way to synchronize SMF 100 with other SMF records. The interval length also tended to vary but the average length was close to the nominal DB2 specification. Starting with V7, a logical approach is to set STATISTICS SYNC to 59 with the DB2 DSNTIPN installation panel. Server TCB time other than stored procedure is included with DIST SRB (probably because it really is SRB time) and SMF 101 (discussed later). DB2 system CPU times are recorded in QWSA segments. There is a segment for each system AS with a logical identifier (QWSAPROC is typically MSTR, DBM1, IRLM, or DIST).
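The delta calculation for cumulative SMF 100 counters can be sketched as follows. This is a minimal illustration, not DB2 or vendor code: the function names are hypothetical, and the 32-bit counter width is an assumption that should be checked against the record layout for each field.

```python
def counter_delta(previous, current, width_bits=32):
    """Activity between two cumulative readings, allowing for one wrap.

    width_bits is an assumed counter width; confirm it per field.
    """
    if current >= previous:
        return current - previous
    # The counter overflowed and started over between the two readings.
    return current + (1 << width_bits) - previous


def interval_deltas(readings, width_bits=32):
    """Convert a time-ordered list of cumulative readings into per-interval deltas."""
    return [
        counter_delta(prev, cur, width_bits)
        for prev, cur in zip(readings, readings[1:])
    ]


if __name__ == "__main__":
    # Three consecutive statistics intervals for one cumulative counter.
    print(interval_deltas([100, 250, 400]))
```

A DB2 restart also makes a cumulative counter go backwards, and this sketch cannot tell that apart from a wrap, so intervals that span a restart need to be excluded separately. Fields that report a level (such as number of active buffers) should not be differenced at all.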
While most interactive work results in an SMF 101 record for what one would logically consider a transaction, there are exceptions. In some cases records with common identifiers are optionally rolled up to reduce volume (parallel children, DDF, and RRSAF). A DB2 TSO transaction can include many end user interactions because it measures from start to end of connection, which could be a relatively long time. SAP, which uses RRSAF (SAP R/3 uses DDF starting in DB2 V8), looks like a very long running transaction unless commit boundary reporting is used. Given the wealth of identifiers and metrics, this is the best source for detail accounting and analysis. There are two basic types of metrics. Application level are essentially from start of connection until termination. It might include significant elapsed time outside of DB2. In-DB2 measurements are essentially time within DB2. For most analysis, In-DB2 times are the best ones to use. Identifiers available include who, what, where, and type of transaction. Meaningfulness of specific identification fields varies with environment.
The basic thing identified as the execution unit in an accounting record is the plan (QWHCPLAN), which may or may not be very meaningful. Some installations use blanket plan names as the initial thing invoked. Packages are often invoked by the selected plan. A package is essentially a compiled DB2 program. One or more could be used by a specific plan execution. Assuming package level accounting is active, there will be a QPAC segment for each unique package executed. They might be included in the standard IFCID 0003 SMF 101 record (max of ten segments, not rollup, and prior to V8) or in one or more separate IFCID 0239 SMF 101 records. If in a separate record you will not know if rollup might have been involved unless you associate it with the IFCID 0003. SMF time stamps will probably be equal (certainly very close) and correlation headers (QWHC) will be identical. In the case of rollup, you cannot be sure that all transactions necessarily used all packages. Metrics are a subset of thread level ones but there is a good set of In-DB2 elapsed, CPU, and delay times. Obviously since not all transactions use packages and a given transaction can use multiple packages, metrics are not likely to easily cross foot with anything. Stored procedures must be packages but not all packages are necessarily loaded as stored procedures. There is a flag (QPACAAFG) indicating whether a package was loaded as a stored procedure.
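The association of a separate IFCID 0239 package record with its IFCID 0003 thread record can be sketched as below. This is only an illustration under stated assumptions: records are assumed to be already decoded into dictionaries, the keys "qwhc" (the correlation header reduced to a hashable tuple) and "smf_time" (SMF time stamp in seconds) are hypothetical names, and the one second tolerance is an arbitrary choice.

```python
def match_package_records(thread_records, package_records, tolerance=1.0):
    """Pair each package record with the closest-in-time thread record
    that has an identical correlation header.

    Returns a list of (package_record, thread_record_or_None) tuples.
    """
    # Index thread (IFCID 0003) records by correlation header.
    by_header = {}
    for thread in thread_records:
        by_header.setdefault(thread["qwhc"], []).append(thread)

    matches = []
    for package in package_records:
        candidates = by_header.get(package["qwhc"], [])
        # Pick the thread record whose SMF time stamp is nearest.
        best = min(
            candidates,
            key=lambda t: abs(t["smf_time"] - package["smf_time"]),
            default=None,
        )
        if best is not None and abs(best["smf_time"] - package["smf_time"]) > tolerance:
            best = None  # Same header, but too far apart in time to trust.
        matches.append((package, best))
    return matches
```

Unmatched package records are returned with None rather than dropped, so that missing thread data is visible in later reporting.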
Sometimes requestor measurements are the best sources. With recent code levels, CICS 110 (and other monitors) do a good job of reporting DB2 CPU and response times. There is a further benefit of being closer to real transaction performance if a CICS transaction invokes multiple DB2 requests. L8CPUT is not necessarily just DB2 CPU time. Thread creation and termination time, which is not reported by DB2, is included. If the application is thread safe, related CPU time will be included. In general if it is a CICS DB2 environment, you can treat L8CPUT as DB2 CPU time for the transaction.
For locally (e.g. batch, CICS, IMS) generated requests, application CPU time is charged to the requestor. For workload level reporting, RMF is not very useful. SMF 30 is probably only effective for batch and TSO. SMF 101 is the best source for workload reporting because of the combination of granularity and available metrics.
This is an example of system address space measurements for a basic environment without any DDF. Note that most CPU time is DBM1 SRB, which is primarily from asynchronous database I/O. These values are from SMF 100, but could easily have been RMF or SMF 30.
To get total Class 1, all values must be added when present.
To get total Class 2, all values must be added when present. In-DB2 times are subsets of corresponding totals (Class 1). Note that for triggers (QWACTRTT and QWACTRTE) there is no separate In-DB2 value. Values are reported when Class 1 recording is active. Package times are a subset of In-DB2 values. Note: Recording setup is optional so In-DB2 and package data might not be available even though application metrics are present.
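The "add all values when present" rule can be sketched as a small helper. This is an illustration only: the record is assumed to be a dictionary of already decoded values in microseconds, and every field name in the example list other than QWACTRTT and QWACTRTE is a hypothetical placeholder for whatever components your record layout provides.

```python
def total_cpu(record, fields):
    """Sum the listed CPU components, skipping any that are absent or None."""
    return sum(record[name] for name in fields if record.get(name) is not None)


# Hypothetical Class 1 component list; only the two trigger fields are
# names taken from the discussion, the others are placeholders.
CLASS1_FIELDS = [
    "app_tcb_cpu_us",       # placeholder: thread TCB time
    "stored_proc_cpu_us",   # placeholder: stored procedure time
    "QWACTRTT",             # trigger time
    "QWACTRTE",             # trigger time
]

if __name__ == "__main__":
    # A record where one trigger field was not recorded.
    sample = {"app_tcb_cpu_us": 1200, "stored_proc_cpu_us": 300, "QWACTRTT": 50}
    print(total_cpu(sample, CLASS1_FIELDS))
```

Keeping the component list as data makes it easy to maintain a separate list for In-DB2 (Class 2) totals and to extend either list as new CPU fields are added in later DB2 releases.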
This example adds total application time to the earlier chart. Application data source is SMF 101. Other data sources would only be useful in special cases (e.g. a pure server environment). Notice that most of the CPU time is application.
This example breaks down application CPU time, from the previous chart, by type of request. Type is one of the correlation header identifiers.
There are numerous potential sources for transaction counts. Choices will vary with analysis objective and workload. SMF 101 is typically the most flexible because of all the identifiers, though RMF service and report classes and requestor data might be more attractive. If RMF provides desired granularity, it is an attractive easy approach. Requestor measurements have the potential attraction of being closer to logical application counts. For parallel there is the interesting philosophical choice of what to count. For transaction count (and response time) the root is probably all that matters. For rollup records, QWACPCNT is the aggregate transaction count. With most interactive work, commit count tends to be close to transaction count.
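One way to turn these counting choices into a rule is sketched below. It counts only the root for parallel work and uses QWACPCNT for rollup records. The record is assumed to be a dictionary of decoded values, with QWACPARR as a boolean (rollup flag set) and QWACPACE as an integer that is zero when there is no parent. Treating a rollup record with a nonzero QWACPACE as a parallel child rollup, and one with zero as a simple rollup, is an assumption that should be verified against your own data.

```python
def transaction_count(record):
    """Number of logical transactions one SMF 101 record represents."""
    is_rollup = record["QWACPARR"]
    has_parent = record["QWACPACE"] != 0

    if not is_rollup:
        # Parallel child: the parent (root) record is counted instead.
        # Normal record or parallel parent: one transaction.
        return 0 if has_parent else 1

    if has_parent:
        # Rollup of parallel children (assumed): the root is counted elsewhere.
        return 0

    # Simple rollup (e.g. DDF or RRSAF): QWACPCNT is the aggregate count.
    return record["QWACPCNT"]


def total_transactions(records):
    """Sum the transaction counts over a collection of records."""
    return sum(transaction_count(r) for r in records)
```

If the objective is resource accounting rather than transaction rates, the child and child rollup records still need to be included for CPU totals even though they contribute zero to the count.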
As with transaction counts, there are several source choices for response times, with similar considerations. When working with DB2 101 records there is the further consideration of choosing either application or In-DB2 measurements. Application times might be attractive for some workloads if more complete measurements are desired.
Application time is a superset of In-DB2. The two values are simply start and end time stamps unless rollup is involved. Total In-DB2 is the sum of all 6 values. Package is a subset of In-DB2.
While many of the details are beyond the scope of this discussion, it is worth knowing that numerous time metrics are available that allow for detailed response time analysis. We have previously discussed the various CPU components. When response targets are not being met, an analysis of component times can direct tuning and upgrade efforts. Note that delay fields require the appropriate accounting classes to be active. Package (QPAC) data in general is not as rich as thread level data, but there is a good set of response components available. Some activities are potentially asynchronous; examples include prefetch (an attempt to load records into a buffer pool prior to being needed) and parallel processing. In general the exception and I/O values reported in SMF 101 are delay times rather than total elapsed time of the function.
Teamwork is essential. While DB2 (and other subsystem) specialists might have some different objectives and requirements from the typical MVS analyst, there is valuable knowledge and information to share. Track key system and workload metrics. When something looks strange or changes significantly, confer with the appropriate subsystem specialist. For long term analysis (e.g. month) average hours for period profiles or planning intervals for history are probably appropriate. Shorter intervals (e.g. 30 min) are better for daily glitch analysis. Note that the shortest interval is limited by the SMF 100 reporting intervals where DB2 might be different from SMF defaults. Keep history and incorporate it with management reporting. Be careful of potential double counting or missed data. Server workloads are especially problematic. Mixing data sources can also confuse the issue.
RMF is probably the easiest source for system components, though SMF 100 (DB2 System) and SMF 30 are also usable. Remember that except for server work, application CPU is not included. Also, server recording is different across the three sources. (DASD paging is typically not an issue, but RMF is the best way to watch it. If it is an issue, check buffer pools for potential over allocation.) If flexible workload granularity is desired, SMF 101 is a good choice for almost all workloads. It is the primary source for response components. For some workloads (e.g. CICS) requestor measurements might be a desirable choice.
Typically the primary measures desired are CPU, response, and transaction counts. They have been available for many years. Since DB2 and MVS are both evolving, new measurements are periodically added. This tends to have the biggest impact on CPU and response profiles for exploiting workloads. Some environments (e.g. CICS) might have minimal value from use of CPU fields added since DB2 V2. In general the analysis discussed does not require detailed DB2 skills. When something strange happens or a significant change is observed, consult a DB2 specialist. There are many good references, some of which are listed at the end. There are numerous tools available that can aid with analysis and reporting.
Note that DB2 SMF records are not documented in the SMF manual. The combination of the layouts and DSNWMSGS provides good detail. Another useful field description source is the IBM Tivoli OMEGAMON XE for DB2 Performance Expert on z/OS Report Reference. Google can also be a good source of descriptive information.
Even though several sources are not especially new, they can still be very useful.
Evolving DB2 CPU and Response Metrics Ned Diehl The Information Systems Manager, Inc. [email_address] www.perfman.com 610-865-0300 Philadelphia CMG 17 November 2007
SMF 101 not necessarily one-to-one with user request
Multiple SMF 101 for parallel
Optional rollup records for children
RRSAF (e.g. WebSphere) might use commit level reporting
DDF & RRSAF might be in rollup records
QWACPCNT is aggregate count for rollup records
QWACRINV = 1, 2, or 3 for DDF & RRSAF
Commit count is good for most interactive work
Choose based on requirement
Coordinate with other reporting
Key Performance Metrics
Transaction Rates

Record Type       QWACPARR   QWACPCNT        QWACPACE
Normal            OFF        ZERO            ZERO
Parallel Parent   OFF        CHILD COUNT     ZERO
Parallel Child    OFF        ZERO            PARENT QWHSACE
Parallel Rollup   SET        CHILD COUNT     PARENT QWHSACE
Simple Rollup     SET        ROLL-UP COUNT   N/A