First Things First: Software-based Performance Tuning




Is your system experiencing high CPU use, memory faulting, poor response time, high disk activity, long-running batch jobs, or poorly performing SQL, client/server, or Web requests? Are you about to spend thousands of dollars on more memory, faster disk drives, or other upgrades? Before you do, make sure you know why you're having performance issues. A couple of days gathering data from your system and analyzing it from the highest level down to individual lines of code is time well spent - and will likely yield a result you may not have considered. New hardware may have little, if any, effect on system responsiveness, because the root cause of 90 percent of all performance issues is excessive I/O in one or more applications on your system.
   This white paper will focus on part of the process of identifying and solving the root causes of these application performance issues. The sample data was collected using a toolset for the iSeries from MB Software called the Workload Performance Series. It collects data from a system and presents it in a way that lets you analyze your environment from various perspectives and different levels of detail. The data is gleaned from a real analysis of an actual system, so it accurately reflects what you might expect to find in your own environment. But regardless of whether you use the Workload Performance Series or another data-gathering solution, what's important is how you use the data to address the root causes of performance issues within your environment.
Subsystem Analysis
First, look at a high−level view of what's going on in
your system. (Figure 1) shows CPU use by subsystem.
This particular system had significant response time
issues, and when we look at the data over a 24-hour
period, we find that QBATCH consumed 55 percent of
the CPU on this system. MIMIXSBS consumed another
15 percent, and QINTER took 11 percent. Typically,
we find that a one-to-one relationship exists between
CPU use and I/O performed, and the subsystem that
consumes the most CPU also performs the most I/O.
We explore this relationship in more detail later on; but
if we can reduce these 8.5 billion I/Os, we're going to
shorten job durations, consume less CPU, and see better response time, lower memory use, less memory faulting, and less disk activity.
   Physical I/O is one of the slowest things on a box, relative to accessing data that's already in memory or being used internally in your programs. We want to block data as efficiently as possible, so whenever disk arms are physically going to disk to retrieve data, we want to make sure it's being done with the fastest disk drives, the most disk arms, and the newest caching algorithms. But we also want to make sure that we're only performing I/O when absolutely necessary to accomplish a task. Looking at the same data in another way (Figure 2) confirms the one-to-one relationship between CPU and I/O. QBATCH, which was responsible for 55 percent of the CPU consumption, is responsible for 52 percent of the I/O.
   On the other hand, sorting the subsystem data by
number of jobs (Figure 3) shows that more jobs run in
the QPGMBATCH subsystem than in QBATCH.
Initiating and terminating jobs creates a tremendous amount of overhead, even if the job only takes one CPU second each time it's run. It starts a new job number,
opens 15 files, loads 12 programs into memory just to
do a tiny bit of work; then it has to remove all those
programs from memory, close all the files, terminate
the job, and perhaps even generate a job log each time.
   In QPGMBATCH, that happened 288,793 times in
a week - that could have been one job sitting in the
background, waiting on a data queue. Those files could
have been left open, and the programs left in memory;
then QPGMBATCH would only be responsible for one
percent of the jobs on the system, not 23 percent. In
turn, you may have that much less I/O, that much less
CPU.

Job Identification
If we look at the same data by job name, we find more areas to drill down into for root cause. (Figure 4) shows the top 10 CPU-consuming jobs on the system. Thousands of jobs run on this system all day long, but one job (PCC7C8619) was responsible for 22 percent of all CPU consumed on this system - that's a great opportunity. If you optimize that process to consume only 1 percent of the CPU, the whole CPU usage curve that rises at 8 a.m. and drops at 5 p.m. in a typical bell curve would drop by 21 percent. That's a dramatic improvement not only in CPU and system capacity, but also in I/O - this job performs almost two billion physical I/Os (remember, I/O typically is the largest resource-consuming function on the system). The second-largest CPU hog (CL1072CL) takes up 14 percent, which isn't much better.
   The same job data, sorted by physical I/O (Figure 5), shows that PCC7C8619 also performs the most I/O, clearly causing the CPU issue. However, CL1072CL isn't here; maybe that job doesn't have an I/O issue, but another problem, such as excessive initiation and termination. From the previous figures, we know that it impacts the system negatively, but we'll have to dig deeper to find the reason.
   Sorting the data by number of jobs (Figure 6) shows that the jobs running under the name ROBOTCNL comprise 38 percent of all jobs on the system. The QZRCSRVS jobs are remote procedure calls coming from other systems; the chart shows that 60,126 times during the test period, another platform did a remote procedure call and triggered a call to a native program - that's something that could have been done more efficiently. It's simpler to just use the built-in Client Access capability for these remote procedure calls, but when you discover what kind of impact it has on the system from a performance standpoint, you might consider doing things more efficiently. For example, you could implement a TCP/IP socket client and server capability within that application, or use data queues instead of remote procedure calls - either could have consumed a lot less resource.
   QZDASOINIT jobs are ODBC requests, responsible for 10 percent of the jobs on the system. All day long, these jobs start connections, open SQL requests, do a little bit of work, and close the requests. Over and over again, these requests start and stop jobs, open and close files, and load and unload programs into and out of memory. Many of these jobs could have been left active all day instead of being initiated and terminated thousands of times throughout the day.

User Utilization
Typically, we believe that the end user using the
WRKQRY command, SQL, or some type of
client/server application like MS Access is the
performance issue on our box. But when we look at
utilization by user we may find that what's being done
in operations is actually the culprit. Regardless of what
application you're running, or what tools you're using to
schedule jobs or replicate data for high availability, you
must address the underlying root causes of issues within
your environment.

   (Figure 7) immediately shows that two user IDs (ROBOT and MIMIXOWN) are responsible for a significant amount of the CPU resource consumed on this box. The third, TRANSFER, is probably transferring data between platforms. BATCH and EDIOPR round out the top five resource consumers. But the hundreds of other users on the system represent a very small piece of what's actually being utilized from a resource standpoint. (Figure 8) shows once again that the top CPU consumer is the top I/O performer, and hammers home the old message: focus on the I/O and you will address many of the performance issues that you experience on your systems.
   (Figure 9) shows the data sorted by number of jobs, and immediately highlights an area for improvement. Note that the EDI process - EDIOPR - accounts for 57 percent of the jobs on the system. This job might be running every three seconds to check for data to send or receive through EDI. If you have a job that's running every three seconds, checking, checking, and checking again all day long, but never actually doing work, you're wasting a lot of resource. Using triggers, data queues, or other better-performing techniques for detecting transactions at the proper status could dramatically reduce the amount of resource consumed by this process.
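
To make the trigger idea concrete, here is a minimal sketch. It is not the actual EDI application - EDITRANS, EDIWORK, TRNID, and TRNSTS are hypothetical names, and the exact CREATE TRIGGER options vary slightly by DB2 for i release - but it shows the shape of the fix: capture transactions the moment they reach the right status, so the background job only wakes up when there is work to do.

    -- Hypothetical sketch: record each transaction that reaches the
    -- ready status in a small work file as it happens, instead of
    -- polling the transaction file every three seconds.
    CREATE TRIGGER EDIRDY_TRG
      AFTER UPDATE OF TRNSTS ON EDITRANS
      REFERENCING NEW AS N
      FOR EACH ROW
      WHEN (N.TRNSTS = 'READY')
        INSERT INTO EDIWORK (TRNID) VALUES (N.TRNID);

The background job then processes (and clears) the EDIWORK rows, or waits on a data queue the trigger feeds, rather than scanning on a timer.
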

DASD Analysis
Continue drilling down into the data; which files are most often accessed, and which could be archived or even purged? The key thing to remember is that you only want to process the data that will be selected - don't read a million records if you're only going to select 1,000. If you have seven years' worth of history in a general ledger transaction history file, and your accounting users only use it to look for recent transactions to close the books at month's end, you need to streamline that file. Move the old data from the current history file into an archived history file. Don't delete it - it's hard to get users to agree to delete anything. Instead, create a file that has only the last month's data in it for daily use, and move the previous six years' data to an archived file. If your users want current information, they can access one file, and if they want old history (which doesn't happen nearly as often) they can access the other.
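
As a minimal SQL sketch of that archiving step - GLHIST, GLHISTARC, and TRNDATE are hypothetical names, and both files are assumed to have identical layouts - the move is just an insert into the archive followed by a delete from the daily-use file, ideally inside one committed transaction:

    -- Move general ledger history older than the current month into
    -- the archive file, then remove it from the daily-use file.
    -- GLHIST, GLHISTARC, and TRNDATE are hypothetical names.
    INSERT INTO GLHISTARC
      SELECT *
        FROM GLHIST
       WHERE TRNDATE < CURRENT DATE - (DAY(CURRENT DATE) - 1) DAYS;

    DELETE FROM GLHIST
     WHERE TRNDATE < CURRENT DATE - (DAY(CURRENT DATE) - 1) DAYS;

A reorganize of the daily-use file afterward reclaims the space the deleted records leave behind.
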


   (Figure 10) shows that reorganizing database files could help reduce I/O of the applications that access them. One main production library (MHSFLP) is responsible for 43 percent of the storage on the system, using almost a terabyte of disk space and comprising nearly 2 billion records. How much of that library could be archived?
   The same data, when sorted by deleted records, shows that library IHA440FP has 700 million active records and 72 million deleted ones (Figure 11). Interestingly, it's not even the main production data library - in fact, it's only a fraction of the size - but it accounts for 70 percent of the deleted records on the system. Performing some basic system management tasks, such as database file reorgs, could dramatically improve performance of all processes that access these files.
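
You can go looking for these reorg candidates yourself with a catalog query. This is a hedged sketch - it assumes your IBM i release provides the QSYS2.SYSPARTITIONSTAT view with the NUMBER_DELETED_ROWS column, and MYLIB is a hypothetical library name; the cleanup itself is still a reorganize (for example, RGZPFM) of the files it surfaces:

    -- Files in a library carrying the most deleted records
    -- (reorganize candidates). MYLIB is a hypothetical library name.
    SELECT SYSTEM_TABLE_NAME,
           NUMBER_ROWS,
           NUMBER_DELETED_ROWS
      FROM QSYS2.SYSPARTITIONSTAT
     WHERE SYSTEM_TABLE_SCHEMA = 'MYLIB'
     ORDER BY NUMBER_DELETED_ROWS DESC
     FETCH FIRST 20 ROWS ONLY;
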
   (Figure 12) goes even deeper into the data, to the user level. It shows that user QSECOFR is responsible for creating database files that account for 46 percent of the storage, TESTBENCH accounts for 18 percent, and ROBOT accounts for 13 percent. But SMARINO, an end user, is consuming 300 GB of disk; is this end user going wild with queries? You should keep a close eye on this data.
   Looking at file size by database name (Figure 13) yields a few surprises as well. The CLMFILESAV file immediately jumps out; this is a save file that's apparently never been used, but it's 112 GB - 13 percent of the disk on the system. OBNMAS accounts for 100 GB, but has only been used 16 times in its history; is this another 100 GB of wasted space?
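
Dead weight like this is easy to go looking for once you know it exists. A hedged sketch, assuming the QSYS2.OBJECT_STATISTICS table function is available on your release (MYLIB is again a hypothetical library name), lists the largest objects that haven't been touched in the last year:

    -- Largest objects in a library that have not been used in a year.
    -- MYLIB is a hypothetical library name.
    SELECT OBJNAME, OBJTYPE, OBJSIZE, LAST_USED_TIMESTAMP
      FROM TABLE(QSYS2.OBJECT_STATISTICS('MYLIB', '*ALL')) AS T
     WHERE LAST_USED_TIMESTAMP IS NULL
        OR LAST_USED_TIMESTAMP < CURRENT TIMESTAMP - 1 YEAR
     ORDER BY OBJSIZE DESC
     FETCH FIRST 20 ROWS ONLY;
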
   The files shown in (Figure 14) offer great opportunity for improvement through simple file reorgs. IER200P1 has 29 million deleted records and only 9 million active ones. IIT001W03 is an even better opportunity, simply because it will take less time to clean up - it only has 88,000 active records.

Query Analysis
Looking at queries provides an even more detailed level of analysis. This example (Figure 15) shows the system during normal business hours. QINTER had 13 billion skipped records; put another way, 13 billion records were read and not selected, and therefore didn't need to be read. If you're reading records and not selecting them, that's a full table scan, and full table scans unnecessarily consume tons of resource.
   This example is particularly bad. In one day, users performed a total of 3,000 queries - which resulted in 14 billion I/Os - to select 8,000 records. This seems like a ridiculous example, but it's real. This invisible use of resource occurs every day in many environments. It's not truly "invisible," but it happens so quickly that most people never see it. These 3,000 requests might have appeared only momentarily, and by the time you refreshed your screen they were gone. The 14 billion unnecessary I/Os could have paralyzed the system from an interactive response time standpoint, but were individually invisible until we analyzed the data this way.
   CO405JTCP isn't much better; it performed three billion I/Os via almost 3,000 queries to select 38 million records. However, another issue exists in this subsystem - why is it selecting so much? What in the world are they doing with so many records for each query? This problem results from applications that preload entire data sets into memory, while the user only accesses the first small set of records. When the user pages down, the application accesses the next page of data out of memory. From a coding perspective, it's much easier to preload 10,000 records into memory and let the user page through them; from a performance perspective - with hundreds of simultaneous users - it's deadly. For system performance, it's worth the additional time and effort to code the application properly - the application should read 10 records at a time, and not a single record more. When the user pages down, read 10 more records. It's horribly inefficient to load 10,000 records into memory that the user will never even think of reading.
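
A hedged sketch of that "read one page at a time" approach in SQL: position on the last row the user saw and fetch only the next ten. ORDHIST, CUSTNO, ORDNUM, ORDDATE, and ORDAMT are hypothetical names; the same pattern works in a native program with keyed positioning.

    -- Fetch one screen of ten rows instead of preloading the whole
    -- result set. The host variables supply the customer and the
    -- highest order number already shown on the previous page.
    -- ORDHIST and its columns are hypothetical names.
    SELECT ORDNUM, ORDDATE, ORDAMT
      FROM ORDHIST
     WHERE CUSTNO = ?
       AND ORDNUM > ?
     ORDER BY ORDNUM
     FETCH FIRST 10 ROWS ONLY;

With an index over (CUSTNO, ORDNUM), each page-down touches only the ten records it displays, which is exactly the behavior described above.
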

   Let's look at it another way (Figure 16), sorted by records selected. Remember CO405JTCP? Not surprisingly, it accounts for 89 percent of all of the selected records.
QBATCH, which made an appearance in our high-level analysis, is number two. The worst performer, however, is QINTER. It reads 13 billion records to select only 8,000. (Figure 17) shows the same data organized by number of queries, and again, QINTER stands out; it accounts for half of all the queries on the system. Unnecessary initiation and termination is the likely culprit here.
   (Figure 18) shows the data by job. Remember PCC7C8619? It was our largest CPU consumer and our largest physical I/O performer, and now we see that it has the most skipped records: 36 percent of the total. That's more than 3 billion records skipped to select 38 million. This job is consuming 22 percent of our CPU, and now you see why.
   We need to address a few issues here. First, we must make sure that this job has the proper permanent indexes so that it reads 38 million records instead of 3 billion. Next, we should find out why it's selecting so many records each time.
   Finally, we need to ask ourselves, "Why is it running so many times?" Most likely, you have 15 users throughout the company all wanting a copy of a single report, so they're all running the job individually. It would've been a lot more efficient to run the job once and print 15 copies of that report.
   Ultimately, you want to get to the level of detail that shows which database needs tuning. (Figure 19) shows what looks to be the master file (ENBMAS) that can clearly benefit from additional indexing. The application performed 2,000 queries to select 300,000 records, and it skipped 7 billion records along the way. Fixing this problem won't involve the building of 1,000 new logical files on your system. Some people are very reluctant to build any new logical files, and in fact some corporate standards prohibit it. What a mistake that is from a performance standpoint. You don't want redundant indexes, of course, but you sometimes need a new one, and developers or database administrators must be allowed to properly index and tune databases so that applications can perform well. In this example, we see three databases that desperately need indexing - they're reading billions of records to select just a few million.
   (Figure 20) shows the data sorted by selected records, while (Figure 21) shows it sorted by queries. Notice that the same files are showing up in each of these graphs; when you see this happen, the areas that need tweaking become obvious. From a program standpoint, (Figure 22) shows that IOMEM001 is skipping almost 12 billion records in 3,000 queries to select 3,000 records. That's 3,000 full table scans in one day to select one record each time.
   Even beyond the database level is the index level. (Figure 23) seems confusing at first, but it's saying this: if the file ENPMAS had a new permanent logical file keyed by MEMBNO, GRPNUM, EFFDAT, and RECTYP, you would eliminate almost 7 billion I/Os on this system. These I/Os were occurring because an access path either didn't exist as a permanent logical or didn't get used. Further analysis will show what's causing that - look at the job log of a job that's running queries with debug turned on, and see what the operating system is doing. See what decisions it's making.
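
The fix Figure 23 is describing can be expressed as one SQL statement (or, equivalently, as a keyed logical file). ENPMAS and the key columns come from the analysis; the library MYLIB and the index name are hypothetical:

    -- Permanent access path over the columns the queries select on.
    -- ENPMAS and its key columns are from the analysis above;
    -- MYLIB and ENPMASIX1 are hypothetical names.
    CREATE INDEX MYLIB.ENPMASIX1
        ON MYLIB.ENPMAS (MEMBNO, GRPNUM, EFFDAT, RECTYP);

After it's built, the debug messages in the job log (and the database index advisor, on releases that have it) will confirm whether the optimizer actually uses it.
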




(Figure 24) sorts the query data by selected records. SVCMAS is another master file that's missing an index keyed by MODCOD. Because of that, there is 10 times more I/O than necessary. But another issue exists here as well: why did we select 40 million records? The reason is that we probably neglected to put another criterion in this query. We selected by MODCOD and downloaded 10 million records, then did another pass to further subset the data. If we had done all of that in one pass, we might have been able to select just a few thousand records.
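
A hedged sketch of the single-pass version: index the selection column and apply every criterion in one statement, rather than downloading millions of rows and filtering them again afterward. SVCMAS and MODCOD come from the example above; SVCIX1, SVCDATE, and SVCSTS are hypothetical stand-ins for the index name and the missing second criterion.

    -- Index the selection column, then do all of the filtering in
    -- one pass. SVCMAS and MODCOD are from the example; SVCIX1,
    -- SVCDATE, and SVCSTS are hypothetical.
    CREATE INDEX SVCIX1 ON SVCMAS (MODCOD, SVCDATE);

    SELECT *
      FROM SVCMAS
     WHERE MODCOD = ?
       AND SVCDATE >= ?
       AND SVCSTS = 'A';
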

Use Good Judgment
This white paper focuses primarily on SQL-related issues. You may also have code issues in your RPG, COBOL, CL, or C code. Or you may have 2 GB of journal receivers per day that your high availability tool is struggling to process. Other MB Software white papers focus on those areas.
   No matter what analysis you do, you must use good
judgment. You would never want to build 1,000 new
permanent logical files. You have to know when to
stop. Look at different periods of time and different samples of data; the numbers may vary dramatically during month-end, weekend, or day-end processing.
Look at the data during a variety of key business times,
and correct the issues that will have the greatest impact
on your system.
