First Things First: Software-based Performance Tuning
Is your system experiencing high CPU use, memory faulting, poor response time, high disk activity, long-running batch jobs, or poorly performing SQL, client/server, or Web requests? Are you about to spend thousands of dollars on more memory, faster disk drives, or other upgrades? Before you do, make sure you know why you're having performance issues. A couple of days gathering data from your system and analyzing it from the highest level down to individual lines of code is time well spent, and will likely yield a result you may not have considered. New hardware may have little, if any, effect on system responsiveness, because the root cause of 90 percent of all performance issues is excessive I/O in one or more applications on your system.

This white paper focuses on part of the process of identifying and solving the root causes of these application performance issues. The sample data was collected using a toolset for the iSeries from MB Software called the Workload Performance Series. It collects data from a system and presents it in a way that lets you analyze your environment from various perspectives and at different levels of detail. The data is gleaned from a real analysis of an actual system, so it accurately reflects what you might expect to find in your own environment. But regardless of whether you use the Workload Performance Series or another data-gathering solution, what's important is how you use the data to address the root causes of performance issues within your environment.
Subsystem Analysis
First, look at a high-level view of what's going on in your system. (Figure 1) shows CPU use by subsystem. This particular system had significant response time issues, and when we look at the data over a 24-hour period, we find that QBATCH consumed 55 percent of the CPU on this system. MIMIXSBS consumed another 15 percent, and QINTER took 11 percent. Typically, we find that a one-to-one relationship exists between CPU use and I/O performed: the subsystem that consumes the most CPU also performs the most I/O. We explore this relationship in more detail later on; but if we can reduce the 8.5 billion I/Os this system performs, we're going to shorten job durations, consume less CPU, and see better response time, lower memory use, less memory faulting, and less disk activity.
Physical I/O is one of the slowest things on a box, relative to accessing data that's already in memory or being used internally in your programs. We want to block data as efficiently as possible, so whenever disk arms are physically going to disk to retrieve data, we want to make sure it's being done with the fastest disk drives, the most disk arms, and the newest caching algorithms. But we also want to make sure that we're only performing I/O when absolutely necessary to accomplish a task. Looking at the same data in another way (Figure 2) confirms the one-to-one relationship between CPU and I/O. QBATCH, which was responsible for 55 percent of the CPU consumption, is responsible for 52 percent of the I/O.
On the other hand, sorting the subsystem data by number of jobs (Figure 3) shows that more jobs run in the QPGMBATCH subsystem than in QBATCH. Initiating and terminating jobs creates a tremendous amount of overhead, even if the job only takes one CPU second each time it's run. Each run starts a new job number, opens 15 files, and loads 12 programs into memory just to do a tiny bit of work; then it has to remove all those programs from memory, close all the files, terminate the job, and perhaps even generate a job log.

In QPGMBATCH, that happened 288,793 times in a week. That could have been one job sitting in the background, waiting on a data queue. Those files could have been left open, and the programs left in memory; then QPGMBATCH would only be responsible for one percent of the jobs on the system, not 23 percent. In turn, you may have that much less I/O, that much less CPU.
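To make that pattern concrete, here is a minimal CL sketch of a never-ending background job that waits on a data queue instead of being submitted over and over. The queue, library, and worker program names (ORDERQ, APPLIB, PROCESS1) are hypothetical; QRCVDTAQ is the standard system API for receiving a data queue entry.

    PGM
      DCL VAR(&QNAME)  TYPE(*CHAR) LEN(10) VALUE('ORDERQ')  /* hypothetical queue   */
      DCL VAR(&QLIB)   TYPE(*CHAR) LEN(10) VALUE('APPLIB')  /* hypothetical library */
      DCL VAR(&DTALEN) TYPE(*DEC)  LEN(5 0)                 /* length of entry      */
      DCL VAR(&DATA)   TYPE(*CHAR) LEN(256)                 /* request data         */
      DCL VAR(&WAIT)   TYPE(*DEC)  LEN(5 0) VALUE(-1)       /* wait forever         */
    LOOP:
      /* Block until a request arrives; the job, its open files, and its */
      /* activated programs all stay in place between requests           */
      CALL PGM(QRCVDTAQ) PARM(&QNAME &QLIB &DTALEN &DATA &WAIT)
      CALL PGM(PROCESS1) PARM(&DATA) /* hypothetical worker program */
      GOTO CMDLBL(LOOP)
    ENDPGM

Requesters then hand work to this job by calling QSNDDTAQ rather than submitting a new job, so initiation and termination happen once a week instead of 288,793 times.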
Job Identification

If we look at the same data by job name, we find more areas to drill down into for root cause. (Figure 4) shows the top 10 CPU-consuming jobs on the system. Thousands of jobs run on this system all day long, but one job (PCC7C8619) was responsible for 22 percent of all CPU consumed on this system; that's a great opportunity. If you optimize that process to consume only 1 percent of the CPU, the whole CPU usage curve that rises at 8 a.m. and drops at 5 p.m. in a typical bell curve would drop by 21 percent. That's a dramatic improvement not only in CPU and system capacity, but also in I/O; this job performs almost two billion physical I/Os (remember, I/O typically is the largest resource-consuming function on the system). The second-largest CPU hog (CL1072CL) takes up 14 percent, which isn't much better.

The same job data, sorted by physical I/O (Figure 5), shows that PCC7C8619 also performs the most I/O, clearly causing the CPU issue. However, CL1072CL isn't here; maybe that job doesn't have an I/O issue, but another problem, such as excessive initiation and termination. From the previous figures, we know that it impacts the system negatively, but we'll have to dig deeper to find the reason.

Sorting the data by number of jobs (Figure 6) shows that the jobs running under the name ROBOTCNL comprise 38 percent of all jobs on the system. The QZRCSRVS jobs are remote procedure calls coming from other systems; the chart shows that 60,126 times during the test period, another platform did a remote procedure call and triggered a call to a native program. That's something that could have been done more efficiently. It's simpler to just use the built-in Client Access capability for these remote procedure calls, but when you discover what kind of impact it has on the system from a performance standpoint, you might consider doing things more efficiently. For example, you could implement a TCP/IP socket client and server capability within that application, or use data queues instead of remote procedure calls; either could have consumed a lot less resource.

QZDASOINIT jobs are ODBC requests, responsible for 10 percent of the jobs on the system. All day long, these jobs start connections, open SQL requests, do a little bit of work, and close the requests. Over and over again, these requests start and stop jobs, open and close files, and load and unload programs into and out of memory. Many of these jobs could have been left active all day instead of being initiated and terminated thousands of times throughout the day.
User Utilization

Typically, we believe that the end user using the WRKQRY command, SQL, or some type of client/server application like MS Access is the performance issue on our box. But when we look at utilization by user, we may find that what's being done in operations is actually the culprit. Regardless of what application you're running, or what tools you're using to schedule jobs or replicate data for high availability, you must address the underlying root causes of issues within your environment.

(Figure 7) immediately shows that two user IDs (ROBOT and MIMIXOWN) are responsible for a significant amount of the CPU resource consumed on this box. The third, TRANSFER, is probably transferring data between platforms. BATCH and EDIOPR round out the top five resource consumers. But the hundreds of other users on the system represent a very small piece of what's actually being utilized from a resource standpoint. (Figure 8) shows once again that the top CPU consumer is the top I/O performer, and hammers home the old message: focus on the I/O and you will address many of the performance issues that you experience on your systems.

(Figure 9) shows the data sorted by number of jobs, and immediately highlights an area for improvement. Note that the EDI process, EDIOPR, accounts for 57 percent of the jobs on the system. This job might be running every three seconds to check for data to send or receive through EDI. If you have a job that's running every three seconds, checking, checking, and checking again all day long, but never actually doing work, you're wasting a lot of resource. Using triggers, data queues, or other better-performing techniques for detecting transactions at the proper status could dramatically reduce the amount of resource consumed by this process.

DASD Analysis

Continue drilling down into the data: which files are most often accessed, and which could be archived or even purged? The key thing to remember is that you only want to process the data that will be selected; don't read a million records if you're only going to select 1,000. If you have seven years' worth of history in a general ledger transaction history file, and your accounting users only use it to look for recent transactions to close the books at month's end, you need to streamline that file. Move the old data from the current history file into an archived history file. Don't delete it; it's hard to get users to agree to delete anything. Instead, create a file that has only the last month's data in it for daily use, and move the previous six years' data to an archived file. If your users want current information, they can access one file, and if they want old history (which doesn't happen nearly as often) they can access the other.
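As a sketch of that split, assuming a history file GLHIST with a transaction date column TRNDAT and an identically defined archive file GLARC (all hypothetical names), the move can be as simple as two SQL statements:

    -- Copy everything older than the cutoff into the archive file
    INSERT INTO GLARC
      SELECT * FROM GLHIST WHERE TRNDAT < '2005-01-01';

    -- Then remove the archived rows from the daily-use file
    DELETE FROM GLHIST WHERE TRNDAT < '2005-01-01';

After the delete, reorganize GLHIST to reclaim the deleted-record space; queries and month-end jobs then scan one month of data instead of seven years.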
(Figure 10) shows that reorganizing database files could help reduce I/O of the applications that access them. One main production library (MHSFLP) is responsible for 43 percent of the storage on the system, using almost a terabyte of disk space and comprising nearly 2 billion records. How much of that library could be archived?

The same data, when sorted by deleted records, shows that library IHA440FP has 700 million active records and 72 million deleted ones (Figure 11). Interestingly, it's not even the main production data library; in fact, it's only a fraction of the size, but it accounts for 70 percent of the deleted records on the system. Performing some basic system management tasks, such as database file reorgs, could dramatically improve performance of all processes that access these files.

(Figure 12) goes even deeper into the data, to the user level. It shows that user QSECOFR is responsible for creating database files that account for 46 percent of the storage, TESTBENCH accounts for 18 percent, and ROBOT accounts for 13 percent. But SMARINO, an end user, is consuming 300 GB of disk; is this end user going wild with queries? You should keep a close eye on this data.

Looking at file size by database name (Figure 13) yields a few surprises as well. The CLMFILESAV file immediately jumps out; this is a save file that's apparently never been used, but it's 112 GB, 13 percent of the disk on the system. OBNMAS accounts for 100 GB, but has only been used 16 times in its history; is this another 100 GB of wasted space?

The files in (Figure 14) offer great opportunity for improvement through simple file reorgs. IER200P1 has 29 million deleted records and only 9 million active ones. IIT001W03 is an even better opportunity, simply because it will take less time to clean up; it only has 88,000 active records.
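Reclaiming that space is a one-command fix per file. A sketch in CL, assuming the files live in a library called PRODLIB (hypothetical; the actual library isn't named in the figures):

    RGZPFM FILE(PRODLIB/IER200P1)  /* compresses out 29 million deleted records */
    RGZPFM FILE(PRODLIB/IIT001W03) /* quick win: only 88,000 active records     */

Traditional RGZPFM needs exclusive use of the file while it runs, so schedule these during a quiet window.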
Query Analysis

Looking at queries provides an even more detailed level of analysis. This example (Figure 15) shows the system during normal business hours. QINTER had 13 billion skipped records; put another way, 13 billion records were read and not selected, and therefore didn't need to be read. If you read a record and don't select it, that's a full table scan, and full table scans unnecessarily consume tons of resource.

This example is particularly bad. In one day, users performed a total of 3,000 queries, which resulted in 14 billion I/Os, to select 8,000 records. This seems like a ridiculous example, but it's real. This invisible use of resource occurs every day in many environments. It's not truly "invisible," but it happens so quickly that most people never see it. These 3,000 requests might have appeared only momentarily, and by the time you refreshed your screen they were gone. The 14 billion unnecessary I/Os could have paralyzed the system from an interactive response time standpoint, but were individually invisible until we analyzed the data this way.

CO405JTCP isn't much better; it performed three billion I/Os via almost 3,000 queries to select 38 million records. However, another issue exists in this subsystem: why is it selecting so much? What in the world are they doing with so many records for each query? This problem results from applications that preload entire data sets into memory, while the user only accesses the first small set of records. When the user pages down, the application accesses the next page of data out of memory. From a coding perspective, it's much easier to preload 10,000 records into memory and let the user page through them; from a performance perspective, with hundreds of simultaneous users, it's deadly. For system performance, it's worth the additional time and effort to code the application properly: the application should read 10 records at a time, and not a single record more. When the user pages down, read 10 more records. It's horribly inefficient to load 10,000 records into memory that the user will never even think of reading.
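A sketch of the page-at-a-time approach in SQL, assuming an order history table ORDHIST keyed by ORDNUM (hypothetical names); the host variable :LASTKEY holds the last key shown on the current screen:

    -- Fetch exactly one screenful, starting after the last row displayed
    SELECT ORDNUM, CUSNUM, ORDDAT
      FROM ORDHIST
      WHERE ORDNUM > :LASTKEY
      ORDER BY ORDNUM
      FETCH FIRST 10 ROWS ONLY;

When the user pages down, the program saves the new last key and runs the same statement again. With an index over ORDNUM, each page touches roughly 10 records no matter how large the file is, instead of preloading 10,000.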
Let's look at it another way (Figure 16), sorted by records selected. Remember CO405JTCP? Not surprisingly, it accounts for 89 percent of all of the selected records. QBATCH, which made an appearance in our high-level analysis, is number two. The worst performer, however, is QINTER. It reads 13 billion records to select only 8,000. (Figure 17) shows the same data organized by number of queries, and again, QINTER stands out; it accounts for half of all the queries on the system. Unnecessary initiation and termination is the likely culprit here.

(Figure 18) shows the data by job. Remember PCC7C8619? It was our largest CPU consumer and our largest physical I/O performer, and now we see that it has the most skipped records: 36 percent of the total. That's more than 3 billion records skipped to select 38 million. This job is consuming 22 percent of our CPU, and now you see why.

We need to address a few issues here. First, we must make sure that this job has the proper permanent indexes so that it reads 38 million records instead of 3 billion. Next, we should find out why it's selecting so many records each time. Finally, we need to ask ourselves, "Why is it running so many times?" Most likely, you have 15 users throughout the company all wanting a copy of a single report, so they're all running the job individually. It would've been a lot more efficient to run the job once and print 15 copies of that report.

Ultimately, you want to get to the level of detail that shows which database needs tuning. (Figure 19) shows what looks to be a master file (ENBMAS) that can clearly benefit from additional indexing. The application performed 2,000 queries to select 300,000 records, and it skipped 7 billion records along the way. Fixing this problem won't involve building 1,000 new logical files on your system. Some people are very reluctant to build any new logical files, and in fact some corporate standards prohibit it. What a mistake that is from a performance standpoint. You don't want redundant indexes, of course, but you sometimes need a new one, and developers or database administrators must be allowed to properly index and tune databases so that applications can perform well. In this example, we see three databases that desperately need indexing; they're reading billions of records to select just a few million.
(Figure 20) shows the data sorted by selected records, while (Figure 21) shows it sorted by queries. Notice that the same files show up in each of these graphs; when you see this happen, the areas that need tweaking become obvious. From a program standpoint, (Figure 22) shows that IOMEM001 is skipping almost 12 billion records in 3,000 queries to select 3,000 records. That's 3,000 full table scans in one day to select one record each time.

Even beyond the database level is the index level. (Figure 23) seems confusing at first, but it's saying this: if the file ENPMAS had a new permanent logical file keyed by MEMBNO, GRPNUM, EFFDAT, and RECTYP, you would eliminate almost 7 billion I/Os on this system. These I/Os were occurring because an access path either didn't exist as a permanent logical file or didn't get used. Further analysis will show what's causing that: look at the job log of a job that's running queries with debug turned on, and see what the operating system is doing. See what decisions it's making.
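The fix itself is a single statement. A sketch over the key the analysis identified (the index name ENPMAS_IX1 is hypothetical):

    -- Permanent index over the columns Figure 23 says are missing an access path
    CREATE INDEX ENPMAS_IX1
        ON ENPMAS (MEMBNO, GRPNUM, EFFDAT, RECTYP);

The same access path could also be built as a keyed DDS logical file. To verify it gets used, start debug with STRDBG before running one of the offending queries; the query optimizer then writes its access plan decisions, including which access paths it considered and why it rejected them, as messages in the job log.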
(Figure 24) sorts the query data by selected records. SVCMAS is another master file that's missing an index keyed by MODCOD; because of that, there is 10 times more I/O than necessary. But another issue exists here as well: why did we select 40 million records? The reason is that we probably neglected to put another criterion in this query. We selected by MODCOD and downloaded 10 million records, then did another pass to further subset the data. If we had done all of that in one pass, we might have been able to select just a few thousand records.
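A sketch of the one-pass version, assuming SVCMAS carries a service date column SRVDAT and the second pass was a date filter (both are assumptions; only MODCOD is named in the figure):

    -- Apply every criterion in a single pass, over an index keyed by MODCOD
    SELECT *
      FROM SVCMAS
      WHERE MODCOD = :MODEL
        AND SRVDAT BETWEEN :FROMDAT AND :TODAT;

With the MODCOD index in place and both predicates in the same WHERE clause, the database returns the few thousand records you actually want, instead of downloading 10 million and filtering them again afterward.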
Use Good Judgment
This white paper focuses primarily on SQL-related issues. You also may have code issues with your RPG, Cobol, CL, or C code. Or, you may have 2 GB of journal receivers per day that your high availability tool is struggling to process. Other MB Software white papers focus on these areas.

No matter what analysis you do, you must use good judgment. You would never want to build 1,000 new permanent logical files; you have to know when to stop. Look at different periods of time and different samples of data; maybe the numbers vary dramatically during month-end, weekend, or day-end processing. Look at the data during a variety of key business times, and correct the issues that will have the greatest impact on your system.