Mainframe cost depends heavily on the actual CPU load because of the IBM mechanism that charges software by the 4-hour rolling average. By precisely monitoring the various loads (rapidly detecting abnormal CPU peaks, optimizing disk I/O, exploiting new features such as large pages), EPV provides a toolset to reduce the CPU load (and hence the IBM software charges) while making better use of it.
3. Introduction
Reducing mainframe cost while improving application performance is still one of the most important goals of companies running z/OS applications.
In many situations the needed actions require both technical analysis and a management decision.
In this presentation, starting from real-life examples, we will focus on the most common tuning opportunities we have found at many sites.
4. Agenda
1. Who’s Using My CPU?
2. The Best I/O is no I/O
3. Large Memory Pages
4. WLC Checks for Managers
6. Who’s Using My CPU?
This is an example of the abnormal behaviour of a monitoring tool.
It normally uses few MIPS but, for some reason, on a Saturday morning it started to loop, using almost a full CPU.
The customer's technical team tried to restart the STC; it worked. In the meantime they asked the ISV for a fix.
7. Two heavy TSO users in the peak hours.
The customer created a Type 3 WLM Resource Group with a maximum limit of 30%, including the ALLTSO service class.
A management decision may be needed.
Who’s Using My CPU?
9. Application tuning requires a joint effort between the technical and development teams.
Most of the time, a management decision and commitment are needed.
Who’s Using My CPU?
11. Accessing data in memory provides better performance and lower CPU usage.
Many Data In Memory options are available in z/OS; most of them have existed for many years.
Because of current disk performance, most sites don't care about the number of I/Os they perform.
To understand whether the system I/O load is excessive, we suggest using the IOC index (calculated by dividing the AVERAGE DISK I/O RATE by the AVERAGE MIPS USED).
Values higher than 3 should be investigated.
The Best I/O is no I/O
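The IOC index just described can be sketched as a simple calculation (an illustrative sketch: the formula and the threshold of 3 come from the text above, while the function name and the sample figures are hypothetical):

```python
def ioc_index(avg_disk_io_rate: float, avg_mips_used: float) -> float:
    """IOC index = AVERAGE DISK I/O RATE / AVERAGE MIPS USED."""
    return avg_disk_io_rate / avg_mips_used

# Hypothetical system: 12,000 disk I/Os per second on 3,000 MIPS used on average.
idx = ioc_index(12_000, 3_000)
print(f"IOC index: {idx:.1f}")             # IOC index: 4.0
print("investigate" if idx > 3 else "ok")  # investigate (above the suggested threshold of 3)
```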
13. Most common reasons for excessive I/Os:
Library not included in LLA/VLF or not frozen
The Best I/O is no I/O
14.
HOUR SSID VOLSER DEVNR HPAV UCBS IORATE DS ALLOC %WRITE
8 309 IMS10A 1947 Y 2,1 686 4 0,0
9 309 IMS10A 1947 Y 1,4 1.148 4 0,0
10 309 IMS10A 1947 Y 1,5 1.184 4 0,0
11 309 IMS10A 1947 Y 1,6 1.332 4 0,0
12 309 IMS10A 1947 Y 1,2 873 4 0,0
13 309 IMS10A 1947 Y 1,1 603 4 0,0
14 309 IMS10A 1947 Y 1,3 649 4 0,0
15 309 IMS10A 1947 Y 1,3 1.026 4 0,0
16 309 IMS10A 1947 Y 1,1 622 4 0,0
17 309 IMS10A 1947 Y 1 463 4 0,0
8 412 IMS20A 122D Y 3,1 1.099 4 0,0
9 412 IMS20A 122D Y 4,3 1.623 4 0,0
10 412 IMS20A 122D Y 4,4 1.783 4 0,0
11 412 IMS20A 122D Y 4,4 1.901 4 0,0
12 412 IMS20A 122D Y 4,2 1.306 4 0,0
13 412 IMS20A 122D Y 3,1 985 4 0,0
14 412 IMS20A 122D Y 3,2 1.041 4 0,0
15 412 IMS20A 122D Y 4,2 1.628 4 0,0
16 412 IMS20A 122D Y 3,1 882 4 0,0
17 412 IMS20A 122D Y 2 656 4 0,0
The Best I/O is no I/O
15. Most common reasons for excessive I/Os:
Library not included in LLA/VLF or not frozen
Small DB2 Buffer Pools
The Best I/O is no I/O
16.
HOUR SSID VOLSER DEVNR HPAV UCBS IORATE DS ALLOC %WRITE
8 325 DB1111 9D0C Y 9,3 14.696 160 0,0
9 325 DB1111 9D0C Y 11,9 14.379 125 0,0
10 325 DB1111 9D0C Y 11,5 13.852 136 0,0
11 325 DB1111 9D0C Y 15 16.619 126 0,0
12 325 DB1111 9D0C Y 9,7 11.784 166 0,0
13 325 DB1111 9D0C Y 7,2 9.323 220 0,0
14 325 DB1111 9D0C Y 13,2 11.294 200 0,0
15 325 DB1111 9D0C Y 11,7 15.884 203 0,0
16 325 DB1111 9D0C Y 5,8 7.324 225 0,0
17 325 DB1111 9D0C Y 3,3 3.622 197 0,1
The Best I/O is no I/O
17. Most common reasons for excessive I/Os:
Library not included in LLA/VLF or not frozen
Small DB2 Buffer Pools
Bad access paths
Bad SQL
...
The Best I/O is no I/O
18. How much CPU does an I/O cost?
Our study (some years ago) estimated 1 MIPS for every 50 directory-read I/Os per second:
1000 I/Os per second = 1000 / 50 = 20 MIPS
A more recent IBM study (February 2015) estimated 35 CPU microseconds (on a 2827-712) per DB2 synchronous I/O:
1000 I/Os per second = 0.035 * 14166 / 12 = 41 MIPS
The Best I/O is no I/O
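The two estimates above can be expressed as code (a sketch: the 1-MIPS-per-50-I/Os rule, the 35-microsecond figure per DB2 synchronous I/O, and the 2827-712 capacity of 14166 MIPS over 12 CPs are all from the slide; the function names are ours):

```python
def mips_old_estimate(io_per_sec: float) -> float:
    """EPV's earlier estimate: 1 MIPS for every 50 directory-read I/Os per second."""
    return io_per_sec / 50

def mips_ibm_estimate(io_per_sec: float,
                      cpu_sec_per_io: float = 35e-6,   # 35 CPU microseconds per I/O
                      machine_mips: float = 14166,     # 2827-712 capacity used in the slide
                      machine_cps: int = 12) -> float:
    """IBM's Feb 2015 estimate for DB2 synchronous I/O on a 2827-712."""
    mips_per_cp = machine_mips / machine_cps
    return io_per_sec * cpu_sec_per_io * mips_per_cp

print(mips_old_estimate(1000))          # 20.0
print(round(mips_ibm_estimate(1000)))   # 41
```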
20. Virtual memory above 2 GB can only be allocated by using memory objects.
A memory object is a contiguous range of virtual addresses allocated in units of megabytes on a megabyte boundary.
Memory objects can be backed by 4 KB, 1 MB and 2 GB pages (the latter available since zEC12).
1 MB and 2 GB pages are called large memory pages.
Exploiting Large Pages
21. From “ABCs of z/OS System Programming - Volume 1”
64-bit addressing. In addition to Segment and Page tables:
• Region 3 tables map 2048 segment tables (up to 4 TB)
• Region 2 tables map 2048 Region 3 tables (up to 8 PB)
• Region 1 tables map 2048 Region 2 tables (up to 16 EB)
Exploiting Large Pages
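The address ranges in the table above follow from the fact that each table level maps 2048 entries of the level below, starting from a segment table that maps 2 GB; a quick check of the arithmetic:

```python
# Each segment table maps a 2 GB range; every region table level multiplies
# the mapped range by 2048 (2**11), as listed on the slide.
GB, TB, PB, EB = 2**30, 2**40, 2**50, 2**60

segment_table = 2 * GB              # one segment table: 2 GB
region3 = 2048 * segment_table      # Region 3 table: 2048 segment tables
region2 = 2048 * region3            # Region 2 table: 2048 Region 3 tables
region1 = 2048 * region2            # Region 1 table: 2048 Region 2 tables

assert region3 == 4 * TB            # up to 4 TB
assert region2 == 8 * PB            # up to 8 PB
assert region1 == 16 * EB == 2**64  # up to 16 EB, the full 64-bit address space
```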
23. As a general rule, large pages may provide performance value to long-running, memory-access-intensive applications.
First large memory page exploiters:
the z/OS nucleus (since z/OS 1.12)
DB2 buffer pools (since V10) when the PGFIX=YES parameter is specified
the JVM, which can use large memory pages (both for code cache and heap) by specifying the -Xlp option; more recent JVM versions automatically use large memory pages if they are available
ADABAS
Exploiting Large Pages
24. Additional exploiters:
DB2 executable code (since V11)
IMS CQS (since V12)
Various IMS pools (since V13)
IMS OLDS (since V13)
System Logger (since z/OS 1.13)
USS
Exploiting Large Pages
26. WLC Checks for Managers
Customers have the primary responsibility for preventing
uncontrolled loops, operator errors, or unwanted
utilization spikes. However, IBM understands that,
occasionally, situations that could not be prevented
(especially situations related to disaster recovery) might
cause exceptional utilization values. In these situations,
IBM does not normally expect customers to pay for the
increased utilization associated with the unusual
situation. Use your best judgement to determine if an
unusual situation has occurred. IBM does not publish a
list of unusual situations because, by their nature, they
will be unpredictable.
From the “Using the Sub-Capacity Reporting Tool” manual.
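The 4-hour rolling average that drives sub-capacity charging can be sketched as follows (a simplified illustration only: real SCRT processing works from SMF interval data per LPAR and applies further rules; the sample workload and the function name are hypothetical):

```python
from collections import deque

def four_hour_rolling_average(msu_samples, interval_minutes=5):
    """Yield the 4-hour rolling average for a series of MSU samples
    taken every interval_minutes (48 samples cover 4 hours at 5 minutes)."""
    window = deque(maxlen=(4 * 60) // interval_minutes)
    for msu in msu_samples:
        window.append(msu)
        yield sum(window) / len(window)

# Hypothetical day: steady 300 MSUs with a one-hour spike to 900 MSUs.
samples = [300] * 48 + [900] * 12 + [300] * 48
peak = max(four_hour_rolling_average(samples))
print(round(peak))   # 450: the billed peak stays far below the 900-MSU spike
```

This is why a short spike costs much less than a sustained one, and why a multi-hour loop lifts the monthly 4-hour rolling average peak so visibly.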
27. Not a “beautiful” day?
• Machine is a 2097-717 valued 1,329 MSUs
• Report refers to February 2012
• 4-hour rolling average monthly peak is 1,309 MSUs
• It happened on a Sunday
• Note the big difference with the second peak value (354 MSUs)
28. Not a “beautiful” day?
Bad news: at this customer site Saturday and Sunday are not business days, so such a high value on a Sunday has to be considered abnormal.
In this case it was caused by a long recovery activity needed to fix a data corruption issue following a migration to new storage processors, which had happened the previous day.
29. • Machine is a 2827-711 valued 1.593 MSUs
• Report refers to December 2014
• 4-hour rolling average monthly peak is 1.017 MSUs
• It happened on a Friday
• The difference with the second peak value is 97 MSUs
(un)Happy Hour
DATE DAY TYPE MODEL MSU USED
19/12/2014 Fri 2827 711 1.593 1.017
03/12/2014 Wed 2827 711 1.593 914
04/12/2014 Thu 2827 711 1.593 866
15/12/2014 Mon 2827 711 1.593 836
30/12/2014 Tue 2827 711 1.593 827
29/12/2014 Mon 2827 711 1.593 824
16/12/2014 Tue 2827 711 1.593 824
18/12/2014 Thu 2827 711 1.593 823
23/12/2014 Tue 2827 711 1.593 809
17/12/2014 Wed 2827 711 1.593 782
24/12/2014 Wed 2827 711 1.593 774
02/12/2014 Tue 2827 711 1.593 738
22/12/2014 Mon 2827 711 1.593 728
05/12/2014 Fri 2827 711 1.593 722
31/12/2014 Wed 2827 711 1.593 702
19/12/2014 Fri 2827 711 1.593 621
01/01/2015 Thu 2827 711 1.593 584
06/12/2014 Sat 2827 711 1.593 574
20/12/2014 Sat 2827 711 1.593 572
25/12/2014 Thu 2827 711 1.593 532
13/12/2014 Sat 2827 711 1.593 261
22/12/2014 Mon 2827 711 1.593 257
28/12/2014 Sun 2827 711 1.593 218
21/12/2014 Sun 2827 711 1.593 213
30. Looking at the different systems’ contributions, it was clear that the peak was due to something running inside the SYS2 system.
Our customer asked the technical team for a deeper analysis.
(un)Happy Hour
SYSTEM 12 13 14 15 16 17 18 19 20 21 22 23
SYS1 96 103 120 130 130 125 106 87 75 69 56 21
SYS2 699 720 746 538 549 594 736 878 898 867 746 580
SYS3 4 4 3 4 5 3 4 4 4 4 3 3
SYS4 44 43 38 35 38 43 49 48 40 39 30 23
TOTAL 843 870 907 707 722 765 895 1017 1017 979 835 627
31. The late-afternoon peak was caused by a TSO user running in a loop.
As you can see in the report, TSO001 used almost all the MSUs of one CP continuously for about 5 hours.
(un)Happy Hour
WKL ADDRESS SPACE SRVCLASS MEAN 12 13 14 15 16 17 18 19 20 21 22 23
TSO TSO001 TSO 71 97 138 143 142 141 142 47
JOB BATCH001 BATCHHI 31 4,8 56,3
JOB BATCH002 BATCHHI 25 27,3 49,2 8,6 14,5
JOB BATCH005 BATCHHI 24 31,3 38,8 0,5
JOB BATCH006 BATCHHI 23 29,9 38,8 0,5
JOB BATCH008 BATCHHI 22 8,3 18,8 28,2 49,4 29 19,1 22,7 1,3
DB2 DB2DIST DDFDB2 22 29,5 22,7 33,4 52,4 63 30,7 6,6 5,6 3,3 8,7 3,8 1,2
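A quick sanity check of the numbers in the report above (assuming, as is standard for 7xx models, that the last two digits of the 2827-711 model number give the count of general-purpose CPs, i.e. 11):

```python
machine_msu = 1593          # 2827-711 rating from the report (written 1.593)
cps = 11                    # assumption: a 711 model has 11 general-purpose CPs
msu_per_cp = machine_msu / cps
print(round(msu_per_cp))    # 145 MSUs per CP

# TSO001's steady hourly readings hover around 142 MSUs, i.e. roughly
# 98% of one CP -- consistent with a TSO user looping on a single CP.
print(round(142 / msu_per_cp, 2))   # 0.98
```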
32. • The ZNET workload used to be very stable
• Something happened on October 24th
• It was a Monday!
• The first idea was to check for maintenance activities performed over the weekend
The system you don’t expect
DATE DAY MSU SYSA TST1 TST2 ZNET TOT
16/10/2011 Sun 1.139 395 5 5 18 423
17/10/2011 Mon 1.139 886 7 7 43 942
18/10/2011 Tue 1.139 896 8 7 43 954
19/10/2011 Wed 1.139 869 9 8 43 928
20/10/2011 Thu 1.139 851 8 7 45 910
21/10/2011 Fri 1.139 796 7 7 41 850
22/10/2011 Sat 1.139 684 5 5 24 718
23/10/2011 Sun 1.139 376 5 5 16 402
24/10/2011 Mon 1.139 863 7 7 79 955
25/10/2011 Tue 1.139 891 9 7 78 985
26/10/2011 Wed 1.139 900 10 8 78 996
27/10/2011 Thu 1.139 892 8 8 79 987
28/10/2011 Fri 1.139 842 7 7 75 931
29/10/2011 Sat 1.139 698 5 5 40 748
30/10/2011 Sun 1.139 385 5 5 38 433
31/10/2011 Mon 1.139 979 7 7 84 1077
01/11/2011 Tue 1.139 988 10 8 86 1092
33. The system you don’t expect
A more detailed ZNET workload analysis showed a corresponding CPU increase in the session manager address space.
The new version of the session manager had caused this big increase (about 40 MSUs). In this case most of these MSUs were recovered thanks to some PTFs.
Being able to measure and report this issue gave the customer the opportunity to discuss the October and November monthly bills with IBM in order to reduce them.
34. DATE CURR NO IIPCP
2014-10 1267 1192
2014-09 1218 1092
2014-08 1182 1076
2014-07 1206 1146
2014-06 1200 1140
2014-05 1194 1134
2014-04 1188 1129
2014-03 1152 1094
2014-02 1134 1077
2014-01 1128 1128
2013-12 1140 1140
2013-11 1110 1110
2013-10 1120 1120
• IIPCP was always substantially less than CURR
• In the October 2014 peak hour the difference is 75 MSUs
Could we save more money with zIIP?