Workload Manager
Our Experiences Implementing IMS And CICS Transaction Goals
And
DB2 Stored Procedure Management
Len Jejer
Ray Smith
The Hartford Financial Services Group
This paper relates our experiences converting to Workload Manager
response time goals for our CICS and IMS online transaction
environments. Also included are experiences with implementing WLM-
managed DB2 Stored Procedures.
Introduction
This is the story of the conversion of our online
workloads to WLM transaction management. Out of
necessity, we first converted a CICS system that was
not making its SLA. After we achieved some success
with the CICS conversion, we moved on to converting
our IMS regions to transaction management. We were
later presented with an opportunity to use WLM to
manage our DB2 Stored Procedures.
WLM Level-Set
First, keep in mind that WLM’s entire philosophy is to
meet workload goals and to optimize resource
utilization. If consideration isn’t given to that
philosophy when developing goals and classifications,
it’s going to be difficult to obtain good results.
WLM uses samples to tell what’s going on. It will
sample each performance block once every quarter
second. There is one performance block for each
address space. When managing response time goals
for a CICS region, WLM builds a PB (Performance
Block) “table” with the number of entries equal to the
MAXTASK parameter for that region. WLM goes
through each PB whether it’s used or not in each
sampling cycle. To avoid unnecessary WLM
overhead, remember to review the MAXTASK value
for each region.
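The sampling cost described above can be sketched in a few lines. This is a back-of-the-envelope illustration, not WLM internals; the region names and MAXTASK values are invented.

```python
# Rough sketch of WLM performance block (PB) sampling activity.
# WLM samples every PB once each quarter second, used or not, so the
# cost grows with the sum of the regions' MAXTASK values.
SAMPLES_PER_SECOND = 4  # one sample per PB every 0.25 seconds

# Hypothetical regions and their MAXTASK settings.
regions = {"CICSPRD1": 120, "CICSPRD2": 200, "CICSTST1": 999}

def pb_samples_per_second(maxtask_by_region):
    """PB samples taken per second across all regions."""
    return sum(maxtask_by_region.values()) * SAMPLES_PER_SECOND

print(pb_samples_per_second(regions))  # 5276
```

An oversized MAXTASK (like the 999 above) inflates the PB table and the sampling work even if the region never runs anywhere near that many tasks, which is why reviewing MAXTASK pays off.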
The single most influential input WLM uses is service
class importance. This is how WLM prioritizes goal
achievement. Higher importance work tends to
receive resources; lower importance tends to donate
resources. Make sure you have enough donors. We
didn’t at first and that made it difficult to achieve our
project goals.
WLM can take cycles from higher importance work
that is exceeding the goal and re-distribute them to
lower importance or discretionary work; however, this
isn’t the norm. WLM does this by imposing internal
resource group capping. You may see “RG-Cap” as the
delay reason, even though you have no capped
resource groups.
The three types of goals WLM supports are velocity,
average response time and percentile response time
goals. The easiest to understand and most precise
are percentile response time goals. They are less
likely to be influenced by outlying transactions and do
not have to be revisited frequently due to workload or
environmental changes. Average response time goals
aren’t the best choice because they are easily
influenced by outliers.
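A quick numeric illustration of why percentile goals resist outliers better than average goals. The response times below are invented, not from our systems:

```python
import statistics

# 100 invented response times in seconds: almost all fast, one outlier.
times = [0.3] * 97 + [0.4, 0.5, 30.0]

avg = statistics.mean(times)
p98 = sorted(times)[int(0.98 * len(times)) - 1]  # simple 98th percentile

# The lone 30-second transaction roughly doubles the average, while
# the 98th percentile barely notices it.
print(f"average={avg:.2f}s  98th percentile={p98:.2f}s")
```

An average response time goal tuned to this workload would need revisiting every time a few outliers appeared; the percentile goal would not.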
When using either of the response time goals, make
sure that the goals you set are realistic. WLM will spin
its wheels trying to make goals when the goals cannot
possibly be made. If this work is high importance
work, all the lower importance levels will not get the
help they may need.
Velocity goals are difficult in concept and
implementation. “Not making the goal” can be
attributed to real delays or simply lack of samples in
the service class. Measured velocity can fluctuate
with changes in hardware as well, meaning that
velocity goals must constantly be re-evaluated. We
weren’t having good luck with velocity, even with a
goal of 90% velocity on the problem CICS regions. In
general, we take the approach of only using velocity
goals where we absolutely have to. Most batch
workloads and STC workloads are good candidates
for velocity goals because they tend to be long
running, in some cases for the life of the IPL.
Response time goals aren’t usually suited for these
types of background work.
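For reference, execution velocity is computed from WLM's samples as using divided by using plus delay. A small sketch shows why a service class with few samples gets a noisy velocity, which is one reason these goals need constant re-evaluation (the sample counts here are invented):

```python
def velocity(using_samples, delay_samples):
    """Execution velocity percentage: using / (using + delay) * 100."""
    total = using_samples + delay_samples
    if total == 0:
        return None  # no samples at all: WLM has nothing to assess
    return 100.0 * using_samples / total

# With plenty of samples the number is stable...
print(velocity(900, 100))  # 90.0

# ...but with only a handful, one extra delay sample swings it wildly.
print(velocity(3, 1))  # 75.0
print(velocity(3, 2))  # 60.0
```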
The CICS Project
Faced with failing to meet the SLA (98% of all
application transactions complete in 2 seconds or
less) in two critical CICS applications and having
failed using WLM velocity goals, we decided to try
WLM transaction management to get the transaction
response time more in line with the customer
requirements.
In our SLA diagrams, the graphs represent two high
profile applications with service expectation that 98%
of transactions will complete in two seconds or less.
They are located in two separate CICS regions.
Figure 1 shows the end-to-end response time
percentiles before the WLM conversion began.
Figure 1
Online CICS Performance Prior to WLM
Conversion
Environment
The CICS systems were on an Amdahl 8-way CMOS
processor running 2 logical partitions and were
processing 1.5 million transactions a day. The
operating system was OS/390 2.10 and CICS was
Version 4. The processor was at 100% utilization
during month-end processing with considerable latent
demand. Among the month-end batch, there were 10
jobs taking 5 hours each of CPU time. They ran 5 of
these at a time, which significantly impeded other
work in the system. There were also DDF enclaves
that were not well behaved.
Resources and Tools
We started by lining up resources. We found that
IBMLINK was the richest source of information. Between the
IBMLINK database and the RMF/WLM ETR Q&A,
there was a wealth of information. There are also
Redbooks covering WLM that discuss transaction
management. Roughly 75% of our knowledge came
from the answers and help the ETR Q&A folks gave
us, from the Redbooks, and from the hits we found
on IBMLINK.
We were using TMON at the time for our CICS
monitor and we were using RMF for our MVS Monitor.
We also had SAS/MXG and we used TYPE72GO
records. We used RMF/PM and Monitor III as well as
RMF postprocessor reports:
SYSRPTS(WLMGL(SCPER(<serviceclassname>)))
SYSRPTS(WLMGL(RCPER(<reportclassname>))).
These two reports will give you response time
distributions.
Data Gathering and Analysis
CICS 110 data was invaluable. We gathered and
dumped a portion of it just to see what there was that
we could use. We ran statistical programs to get
volume percentiles. We discovered that 80% of the
transaction volume was covered by less than 12
individual transactions. We used the 110 data to help
us establish the goals to use.
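The volume analysis amounts to a cumulative (Pareto) ranking of transaction counts. A minimal sketch of the idea, with invented transaction IDs and counts standing in for the summarized 110 data:

```python
from collections import Counter

# Hypothetical per-transaction volumes summarized from CICS 110 records.
volumes = Counter({"TRNA": 500000, "TRNB": 300000, "TRNC": 120000,
                   "TRND": 50000, "TRNE": 20000, "TRNF": 10000})

def codes_covering(volumes, fraction):
    """Smallest set of highest-volume transaction IDs that covers
    at least `fraction` of total volume."""
    total = sum(volumes.values())
    covered, picked = 0, []
    for tran, count in volumes.most_common():
        picked.append(tran)
        covered += count
        if covered / total >= fraction:
            break
    return picked

print(codes_covering(volumes, 0.80))  # ['TRNA', 'TRNB']
```

Running this kind of ranking against real 110 data is how a short list of transactions ends up covering most of the volume.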
Goal Determination
Using the response time field in the CICS 110 data,
we developed 3 buckets that more or less delineated
our CICS transactions. Those 3 buckets became our
first attempt at establishing service classes. We tried
using the high volume/short running transactions to lift
the region. We started out with three service classes
TRANFAST, TRANSLOW and TRANMED. We put
the CICS system transactions in TRANSLOW.
We set the importance for the transaction service
classes at 1. We took a conservative approach in
choosing the percentage making the response time
goal. This number turned out to be a good tweaking
tool.
Keep in mind that since there is only one address
space and only one dispatching priority, the region’s
DP will be managed to the most aggressive goal. This
means that some transactions will get a “free ride” as
the high volume transactions with high importance will
tend to “lift” the region.
Classification Determination
In hindsight and with more WLM experience, we
“over-classified” in this project. Later CICS
implementations consisted of one service class for a
region. In this project we had about 12 transactions
classified. Report classes were used with these so we
could get reporting granularity. Also, for a first shot at
doing transaction response time management, going
through picking out transactions and classifying them
is not a bad thing. We saw stuff that we would not
have seen in a blanket classification. If this is your
first attempt at this, go through the motions to better
understand the workload in the enterprise and how
WLM works.
The first step is to put the data in the CICS subsystem
classification rules. We kept it simple, just using
transaction name. Later on, we got a little fancier and
added subsystem instances. We changed service
class names as well. See Figure 2 for a sample
screenshot of the CICS subsystem classification
panel.
Figure 2
Sample CICS Subsystem Classification
We put the highest volume transactions first, to have a
better chance of getting out of classification routines
relatively quickly and reducing WLM overhead.
Our CICS regions run as jobs, so they are classified in
the JES section of the classification rules. If you PF11
over a couple times in the classification section (same
in STC and JES), you will see a column that says
“Manage Region Using Goals Of”. See Figure 3 for a
sample screenshot of the JES subsystem
classification panel.
Figure 3
Managing According to Goals Of Region
Be careful, as the default for management is
TRANSACTION. At the start, make sure you set this
to REGION. This way, when you put the information
in the CICS subsystem section, WLM will still manage
according to the velocity goals set for the region and
not the transaction response time goals. It’s not until
you change these over to “TRANSACTION” that WLM
will actually start using the response time goals.
So with “REGION” in that column and the CICS
transactions and service classes in the policy, we
installed and activated the new policy.
What We Saw
CICS 110 data started picking up the service class,
and TMON was showing it in the transaction screen.
So we knew we did something right and we were then
able to use the 110 data to make sure that we had the
classification rules right. We also used that data to
simulate response time goal achievement to some
extent.
We went through cycles of tweaks, still not managing
according to response time goals, and repeating the
measurements.
When we were somewhat satisfied with the numbers
we were getting, we put the response time goals to
work.
Transaction Management Implementation
We went into the policy definition, changed “Manage
Region Using Goals Of” to TRANSACTION (see
Figure 4) for each region, installed and activated the
policy and saw a wonderful thing. Using the Monitor
III SYSSUM report, we saw transaction flow start to
smooth out. We had started to achieve WLM goals
which meant achieving the SLA.
Figure 4
Managing According to Goals Of Transaction
We tuned by moving transactions around to different
service classes and by modifying response time
objectives. We eventually got to a point of diminishing
returns.
Customer Reaction
“How come my transaction is in the slow class?” A
TMON user noticed one of his transactions was in
TRANSLOW. We found out that it’s not a good idea
to have connotations of speed in service class names.
After we explained that the transactions weren’t being
slowed down, we changed the service class names
from TRANFAST, TRANMED and
TRANSLOW to TRANCP01, TRANCP02 and
TRANCP03.
Results
With no other changes to the system except putting
the transactions in response time goal mode, the SLA
graph looked like Figure 5.
Figure 5
Online CICS Performance After the WLM
Conversion
This represents 1.5 million transactions with the
processor at 100% utilization.
We noticed a consistent dip starting around 10AM in
the blue application’s response time
results. We couldn’t figure out exactly what was
causing it. Peter Enrico spoke at a Connecticut CMG
meeting about taking work out of SYSSTC.
We got back home and took another look at the policy.
We took Netview, DB2 and HSM out of SYSSTC;
however, we did leave the IRLM address space in
SYSSTC. There were concerns raised about taking
DB2 out of SYSSTC, but the transaction PB would
govern any threads presented by CICS and DB2
would be managed according to that PB.
We implemented the policy and measured, and the 10
AM dip on the blue application disappeared. Figure 6
shows the SLA graph for a month-end processing day
after taking work out of SYSSTC.
Figure 6
Online CICS Performance After Tweaking
CICS System Transactions
We eventually weeded out the CICS system
transactions, some long running, some never ending
and put them in their own service class. The overall
performance was not affected in our case.
IMS Was Next
On another sysplex, we have a bigger IMS
environment. The IMS was well behaved to some
degree, but there was some month-end stress. We
have 4 production control regions and 5 test control
regions. By the time we got around to IMS, we were
at z/OS 1.4 (+OA06672 for SYSRTD reports) and IMS
V7 (+PQ71906 for OTMA classification). The
processor was an IBM Z900 1C6. At the z/OS 1.2
level of WLM, IBM includes the capability of
measuring response time distribution from report
classes defined in the IMS subsystem section without
having management by response time actually turned
on. This made things much easier.
Classification of IMS Transactions
Our MPR’s were already set up to handle transaction
classes with similar response time requirements. So it
made sense for us to classify by IMS subsystem and
transaction class. There was a lot to type into the
policy, but it paid off in the long run.
We mapped out our report classes according to the
IMS region and IMS transaction class. This gave us
homogeneous report classes, which are required to
effectively use RMF response time distribution reports
(+OA06672) by report class. Our approach was to be
all inclusive, non-overlapping and non-defaulting.
We installed the policy with the definitions in the IMS
subsystem section and started to measure. Our
implementation provided for granularity at the
transaction class level. We could get RMF response
time distribution reports by IMS region/transaction
class.
What We Measured
We were surprised at first to see that when we added
up the counts in the report classes and then added up
the transactions in the IMS log data, they didn’t match.
After investigating where the missing transactions
were appearing and then working with IBM, we came
to the conclusion that the OTMA transactions were the
culprits. This led to APAR PQ71906 for IMS Version
7--the transaction class wasn’t being passed properly.
Using the Report Class response time distribution
statistics (from MXG TYPE72GO), we were able to
model exactly what was going on by using SAS to
map the report class statistics to what would be
service class statistics. We could move stuff around,
implement a new policy, do response time distribution
measurements again and repeat.
We knew ahead of implementation where we would
be as far as making the goals. We were able to model
at 100% accuracy with no disruption.
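The modeling our SAS code did can be sketched as merging report class response time distributions into a candidate service class and asking what fraction would have made a given goal. This is a toy version of the idea; the bucket boundaries and counts below are invented:

```python
# Toy version of the modeling step: combine report class response time
# distributions (as carried in RMF TYPE72GO data) into a candidate
# service class and check goal achievement before touching the policy.
# Each distribution maps bucket_upper_bound_seconds -> transaction_count.

def merge(distributions):
    """Sum several response time distributions bucket by bucket."""
    merged = {}
    for dist in distributions:
        for bound, count in dist.items():
            merged[bound] = merged.get(bound, 0) + count
    return merged

def pct_within(dist, goal_seconds):
    """Percent of transactions completing at or under the goal."""
    total = sum(dist.values())
    within = sum(c for b, c in dist.items() if b <= goal_seconds)
    return 100.0 * within / total

# Two homogeneous report classes we might fold into one service class.
rc1 = {0.25: 800, 0.5: 150, 1.0: 40, 2.0: 10}
rc2 = {0.25: 400, 0.5: 300, 1.0: 200, 2.0: 100}

candidate = merge([rc1, rc2])
print(f"{pct_within(candidate, 0.5):.1f}% within 0.5s")
```

Because the report classes are homogeneous and non-overlapping, any regrouping of them into service classes can be evaluated this way before a policy change, which is what made the no-disruption modeling possible.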
Implementation
We went into the JES/STC classification rules and
changed all the related IMS address spaces to
“Manage Region Using Goals Of” to TRANSACTION.
This included control regions, MPR’s, DBRC, DLI and
IMS Connect address spaces. We installed and
activated the policy.
There wasn’t much tweaking to do at this point, as we
had already done all the tweaking there was to do
during the modeling.
Observations
We saw more consistent IMS response times during
times of resource shortages. We were able to
overachieve aggressive IMS goals with a volume of
3,000,000 customer transactions in an 8 hour period
with the processor running 100%.
Overall Benefits
We had a better handle on IMS performance. Month-
end became more hands-off. We were able to take
DB2 out of SYSSTC on that sysplex in preparation for
a policy overhaul that took place after attending Peter
Enrico’s “Revisiting WLM Goals” class.
Using Monitor III, we could tell at a glance which IMS
region and which transaction class was contributing to
any missed goals using the report class in SYSSUM.
Any missed goals were usually due to looping or
abending transactions. We still revisit the goals and
make adjustments as needed.
After our policy re-write, we only have some of the
online transactions in importance 1, along with some
DDF enclaves. That’s it. The bulk of our work is in
IMP=3, 4, 5 or discretionary. We now keep IMP=1 for
production online transactions only.
The whole transaction conversion process taught us a
lot about WLM. Having the IMS workload in
transaction response time mode helped us a lot when
we started using WLM-managed DB2 Stored
Procedures. We had some problems getting that
running smoothly at first, as we had to deal with
dependent and independent enclaves, all doing the
same thing. Having the IMS structured as we did
made it easy to shift IMS workloads around by
transaction classes to different service classes.
Managing DB2 Stored Procedures
The DB2 administrator came over one day and started
talking about a new application that was going to use
DB2 Stored Procedures. He wanted to use WLM to
manage the address spaces. It sounded good and we
started another WLM adventure.
DB2 SPAS Application Environment
We started by reading the Redbook on DB2 Stored
Procedures and talked with the DB2 administrator
about setting up the WLM Application Environment
(AE). After we had some operational issues, we
agreed that the parameters, such as NUMTCB, would
be coded in the DB2 SPAS JCL and not the AE
definition in the WLM policy. This was more because
of our particular organization, rather than any
technical reasons.
A little later, we changed the NUMTCB, as we had the
number too low and there were too many SPAS
started. If you specify that WLM can start an unlimited
number of SPAS, WLM will start a SPAS AE when a
delay “for server” contributes to not making the goals.
Also, WLM will start a SPAS for each service class
served by the Application Environment. Stored
procedures of different service classes will never
execute in the same SPAS. We also learned that you
needed to refresh the AE when you made changes
such as NUMTCB. See Figure 7 for a sample AE
definition in WLM. Refer to the z/OS System
Commands Reference for the commands to display,
start, stop and refresh the AE.
Figure 7
Sample AE Definition
Managing the Enclaves
It was time to get an understanding of how the work
was going to flow and how the work would be
classified.
There would be one DB2 handling the stored
procedures. Requests could come in from local and
remote IMS’s, local and remote batch jobs and
customers on PC’s. This meant dependent and
independent enclaves. The WLM ETR team gave
some ideas for the foundation for the classification.
Dependent (local) enclaves would retain the
classification of the invoker. Independent (remote)
enclaves would have to be classified in the DDF
subsystem.
The difficulty came in where the same stored
procedure could be called by an IMS transaction or a
batch program. Since we had IMS’s on other LPAR’s
generating these calls and batch jobs on still other
LPAR’s generating these calls, we had to figure out a
way to keep the “online” enclaves within response
time goals of the original transaction and let “batch”
enclaves fare as batch work.
We talked with DB2 support at IBM on how to get
some detail data. We were having problems figuring
out what classification criteria to use to distinguish the
DDF work coming in from the remote IMS and the
remote batch. DB2 support told us to turn on
accounting trace options 7 and 8 in DB2 to provide
statistics at the detail enclave/stored procedure level
in the DB2 101’s.
Learning The Data
We collected data from the application testing that
was going on, dumped it and just looked at what we
had. The DB2 101 data had the 7 & 8 trace data in it,
along with a whole bunch of other stuff. In dumping it,
we saw a lot of good information, including the
origination of the enclave, from which we could tell if it
was a batch job or IMS online transaction. All we had
to do was get that information into our WLM policy.
Classifying The DDF Work
We put some basic classification rules in the WLM DDF
subsystem, based on stored procedure name. We
used Monitor III to look at enclave classification data.
Figure 8 shows the ENCLAVE report. Using the
Options for the ENCLAVE report, we put some
classification data in the column labeled “Attributes”.
Figure 8
Sample Enclave Report
If you put your cursor on the “ENCnnnnn” field and
press enter, you will get all the data available for this
enclave that WLM could use for classification. Figure
9 shows the first screen that pops up. Now you can
see all the information WLM has available to it and
how it relates to what’s on the DB2 accounting
records. It will start to come together for you when
you see it compared to the dumped DB2 101 data.
Figure 9
Enclave Drill-Down
Because Monitor III is a sampling monitor rather than
a real-time monitor, and things in testing being what
they are, it took some time to get a good idea of the
data involved. You won’t see every enclave in the system
with Monitor III. You will only see an enclave in the
ENCLAVE report if it’s been in the system for two
WLM sampling cycles (.5 secs) and it is there at the
end of the Monitor III MINTIME interval. You can use
the enclave command in either SDSF or EJES (V3.6)
to get a detail snapshot display of the enclaves.
We then went back to the DB2 101 data and the data
available to WLM and started looking for things to help
us distinguish batch from online enclaves in the WLM
classification rules. We came up with nothing.
Moving on to Plan B, we measured CPU consumption
of the enclaves coming from an online transaction and
translated that into service units. Then we put in a
service class (DDFP001) that had two periods—the
first period was long enough to keep the longest
running IMS transaction enclave in period 1; the
second period would cover the batch ones. For period
1, we used a percentile response time goal, for period
2 we went with velocity. The importance of the first
period was 1, to match the importance of the
transactions that were spawning them and the
importance of period 2 was 4 to match batch.
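Sizing the period 1 duration came down to converting the longest observed online enclave CPU time into service units, since WLM period durations are expressed in service units. A hedged sketch of that arithmetic; the SU-per-CPU-second coefficient is processor-model dependent, and both the rate and the CPU figure below are invented:

```python
# Hedged sketch of the period 1 duration arithmetic. The SU-per-
# CPU-second rate varies by processor model; this one is made up.
SU_PER_CPU_SECOND = 5000.0  # hypothetical rate for our processor model

def duration_for(cpu_seconds, headroom=1.2):
    """Period 1 duration in service units, padded with headroom so the
    longest-running online enclave still completes in period 1."""
    return round(cpu_seconds * SU_PER_CPU_SECOND * headroom)

longest_online_enclave_cpu = 0.08  # seconds, illustrative measurement
print(duration_for(longest_online_enclave_cpu))  # 480
```

Anything consuming more service than the duration falls through to period 2, which is how the batch-spawned enclaves end up under the velocity goal without any way to distinguish them in the classification rules.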
On the Friday prior to the Monday application
production implementation, the application developers
decided to do some volume testing. We knew at that
point that Monday wasn’t going to be pretty. The
results showed that the number of stored procedure
executions (with respect to the number of IMS
transactions) was higher than the application
developers forecasted.
“All Set” For Production Day
Production day came, and everything looked good
until about 9-10 AM. In looking at Monitor III,
everything was being delayed for enclaves. We were
struggling to make the goals for the fastest transaction
service classes. In talking with the WLM ETR team, they
suggested we might want to consider lowering the
goals on the enclaves, so we gave that a try. We
increased the response time and decreased the
percentile for the DDFP001 service class in period 1,
but left it at importance 1. We didn’t see as much
delay for enclaves, but there was still more than there
should be. And it was important work (MPR’s) being
delayed. We started to see more service class goals
not being made as volume increased.
Over the next couple of days, giving WLM time to get
settled and giving us time to gather more data and
think about things, we progressed further in the SPAS
adventure. We didn’t want to make knee-jerk changes
to the policy, and didn’t want to rely on 100-second
intervals, so it took time. But all in all, if you see 10-20
100-second intervals all looking pretty sad, the 15
minute/hour intervals won’t look much better.
We were seeing intermittent queuing in the IMS
control regions. Our fastest IMS service class was
95% in .5 secs or less. The wrong 5% were the ones
not making the goal, along with some others. These
particular IMS transactions weren’t even part of the
Stored Procedure application. Phones started to ring,
and there were some unhappy customers out in the
field. The online performance guys did some work,
gave us a new service class with 98% having to
complete in .2 seconds and gave us the list of IMS
transaction classes to which it should be applied. In
less than 5 minutes we made the changes, installed
and implemented the policy. That helped those IMS
transactions, and that problem went away. This is
where the IMS classification granularity paid off for
us--we could quickly re-arrange work.
We still had delay problems with the enclaves.
Lowering the goal helped us some, but not enough.
We looked at the local (dependent) enclaves that were
spawned off the local IMS transactions in Importance
1. We were able to isolate them by transaction class
and put them into importance 2. Right about this time,
we also found out there was an application design
defect that was causing the stored procedure to be
called multiple times from a transaction instead of
once. In this case, multiple times = 10-15 times.
Once we put the offending IMS transactions in
importance 2, things got a little better.
Instead of watching both the enclave goals and the
transaction goals, we decided to watch the transaction
goals of the foreign IMS. We wanted to try putting the
DDFP001 in a velocity goal and adjust the velocity up if
we saw issues. After all, meeting the transaction goal
was what was important in the end.
We put the DDFP001 service class period 1 into a
velocity goal (45%) with importance 1, and things
calmed down even more. The mix of delay reasons
and who was being delayed was good. It was
indicative of everyone taking turns instead of one
workload dominating the system.
We did some minor tweaking here and there, but we
were pretty much running OK at this point. After the
applications fixed the design defect, we put the
transactions (and their dependent enclaves) back in
importance 1. All was well and we were at the end of
the adventure.
Summary
What works for one place won’t necessarily work for another. If
there was one magnificent WLM policy, IBM would
have published it a long time ago. The end results are
important with WLM, and sometimes the ends do
justify the means.
In this paper, we looked at a couple of WLM-managed
workloads in our shop taken out of the context of our
entire workload. How these workloads fare in our
shop is dependent on how we have the whole policy
set up and the workload itself. The structure might or
might not work in another shop.
However, the migration to WLM transaction
management process entails some basic concepts
that apply everywhere.
1. Get resources lined up.
2. Learn the measurement data.
3. Gather more data than you think you need.
4. Don’t be afraid to change plans.
5. Don’t be afraid to ask IBM questions. This
benefits you and you’ll find out the ETR folks
are great.
6. Set reasonable expectations and know how to
react when things go awry. “What am I going
to do if…..”
7. Understand that you won’t get it right the first
time (see #6).
8. You manage WLM. Let WLM manage the
system.
References and Acknowledgments
• SG24-5326-00 – WLM Redbook
• SG24-6404-00 – IMS V7 Performance
Monitoring and Tuning
• MVS Planning: Workload Management (z/OS
Library)
• RMF Suite of Manuals
• SG24-4693-01 - Getting Started with DB2
Stored Procedures
• Special thanks go to the RMF/WLM ETR Q&A
support team for their contributions to the
project and patience with the “customer”.
• Thanks also go to the RMF, WLM, IMS, DB2
and CICS support folks at IBM who listened
and in a couple instances provided fixes for
us.
• Thanks also to Peter Enrico for his efforts in
preparing and delivering his WLM
presentations and for the WLM/HTML tool
which helped us see the policy better.
Trademarks and Disclaimers
• CICS and DB2 are registered trademarks of
IBM Corporation in the US and other
countries.
• RMF, WLM, z/OS, IMS are trademarks of IBM
Corporation in the US and other countries.
• SAS is a registered trademark of The SAS
Institute, Inc. in the US and other countries.
• MXG is a trademark of Barry Merrill in the US
and other countries.
• Use of and references to products in this
presentation is not intended to be a product
endorsement or recommendation of that
product by The Hartford Financial Services
Group or employees of The Hartford Financial
Services Group.