Mainframe cost depends heavily on the actual CPU load because of the IBM mechanism that charges software by the 4-hour rolling average. By precisely monitoring the various loads (rapidly detecting abnormal CPU peaks, optimizing disk I/O, exploiting new features such as large pages), EPV provides a toolset to reduce the CPU load (and hence the IBM software charges) while making better use of it.
3. Introduction
Reducing mainframe cost while improving application performance is still one of the most important goals of companies running z/OS applications.
In many situations the needed actions require both technical analysis and a management decision.
In this presentation, starting from real-life examples, we will focus on the most common tuning opportunities we have found at many sites.
4. Agenda
1. Who’s Using My CPU?
2. The Best I/O is no I/O
3. Large Memory Pages
4. WLC Checks for Managers
6. Who’s Using My CPU?
This is an example of the abnormal behaviour of a monitoring tool.
It normally uses few MIPS but, for some reason, on a Saturday morning it started to loop, using almost a full CPU.
The customer's technical team tried to restart the STC; it worked. In the meantime they asked the ISV for a fix.
7. Two heavy TSO users in the peak hours.
The customer created a Type 3 WLM Resource Group with a maximum limit of 30%, including the ALLTSO service class.
A management decision may be needed.
Who’s Using My CPU?
9. Application tuning requires a joint effort between the technical and development teams.
Most of the time, a management decision and commitment are needed.
Who’s Using My CPU?
11. Accessing data in memory provides better performance and lower CPU usage.
Many Data In Memory options are available in z/OS; most of them have existed for many years.
Because of current disk performance, most sites don't care about the number of I/Os they perform.
To understand whether the system I/O load is excessive, we suggest using the IOC index (calculated by dividing the AVERAGE DISK I/O RATE by the AVERAGE MIPS USED).
Values higher than 3 should be investigated.
The Best I/O is no I/O
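The IOC index just described can be sketched as a simple calculation (an illustrative sketch: the formula and the threshold of 3 come from the text above, while the function name and the sample figures are hypothetical):

```python
def ioc_index(avg_disk_io_rate: float, avg_mips_used: float) -> float:
    """IOC index = AVERAGE DISK I/O RATE / AVERAGE MIPS USED."""
    return avg_disk_io_rate / avg_mips_used

# Hypothetical system: 12,000 disk I/Os per second on 3,000 MIPS used on average.
idx = ioc_index(12_000, 3_000)
print(f"IOC index: {idx:.1f}")             # IOC index: 4.0
print("investigate" if idx > 3 else "ok")  # investigate (above the suggested threshold of 3)
```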
13. Most common reasons for excessive I/Os:
Library not included in LLA/VLF or not frozen
The Best I/O is no I/O
14.
HOUR SSID VOLSER DEVNR HPAV UCBS IORATE DS ALLOC %WRITE
8 309 IMS10A 1947 Y 2,1 686 4 0,0
9 309 IMS10A 1947 Y 1,4 1.148 4 0,0
10 309 IMS10A 1947 Y 1,5 1.184 4 0,0
11 309 IMS10A 1947 Y 1,6 1.332 4 0,0
12 309 IMS10A 1947 Y 1,2 873 4 0,0
13 309 IMS10A 1947 Y 1,1 603 4 0,0
14 309 IMS10A 1947 Y 1,3 649 4 0,0
15 309 IMS10A 1947 Y 1,3 1.026 4 0,0
16 309 IMS10A 1947 Y 1,1 622 4 0,0
17 309 IMS10A 1947 Y 1 463 4 0,0
8 412 IMS20A 122D Y 3,1 1.099 4 0,0
9 412 IMS20A 122D Y 4,3 1.623 4 0,0
10 412 IMS20A 122D Y 4,4 1.783 4 0,0
11 412 IMS20A 122D Y 4,4 1.901 4 0,0
12 412 IMS20A 122D Y 4,2 1.306 4 0,0
13 412 IMS20A 122D Y 3,1 985 4 0,0
14 412 IMS20A 122D Y 3,2 1.041 4 0,0
15 412 IMS20A 122D Y 4,2 1.628 4 0,0
16 412 IMS20A 122D Y 3,1 882 4 0,0
17 412 IMS20A 122D Y 2 656 4 0,0
The Best I/O is no I/O
15. Most common reasons for excessive I/Os:
Library not included in LLA/VLF or not frozen
Small DB2 Buffer Pools
The Best I/O is no I/O
16.
HOUR SSID VOLSER DEVNR HPAV UCBS IORATE DS ALLOC %WRITE
8 325 DB1111 9D0C Y 9,3 14.696 160 0,0
9 325 DB1111 9D0C Y 11,9 14.379 125 0,0
10 325 DB1111 9D0C Y 11,5 13.852 136 0,0
11 325 DB1111 9D0C Y 15 16.619 126 0,0
12 325 DB1111 9D0C Y 9,7 11.784 166 0,0
13 325 DB1111 9D0C Y 7,2 9.323 220 0,0
14 325 DB1111 9D0C Y 13,2 11.294 200 0,0
15 325 DB1111 9D0C Y 11,7 15.884 203 0,0
16 325 DB1111 9D0C Y 5,8 7.324 225 0,0
17 325 DB1111 9D0C Y 3,3 3.622 197 0,1
The Best I/O is no I/O
17. Most common reasons for excessive I/Os:
Library not included in LLA/VLF or not frozen
Small DB2 Buffer Pools
Bad access paths
Bad SQL
...
The Best I/O is no I/O
18. How much CPU does an I/O cost?
Our study (some years ago) estimated 1 MIPS for every 50 directory-read I/Os per second:
1000 I/Os per second = 1000 / 50 = 20 MIPS
A more recent IBM study (February 2015) estimated 35 CPU microseconds (on a 2827-712) per DB2 synchronous I/O:
1000 I/Os per second = 0.035 * 14166 / 12 = 41 MIPS
The Best I/O is no I/O
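The two estimates above can be expressed as code (a sketch: the 1-MIPS-per-50-I/Os rule, the 35-microsecond figure per DB2 synchronous I/O, and the 2827-712 capacity of 14166 MIPS over 12 CPs are all from the slide; the function names are ours):

```python
def mips_old_estimate(io_per_sec: float) -> float:
    """EPV's earlier estimate: 1 MIPS for every 50 directory-read I/Os per second."""
    return io_per_sec / 50

def mips_ibm_estimate(io_per_sec: float,
                      cpu_sec_per_io: float = 35e-6,   # 35 CPU microseconds per I/O
                      machine_mips: float = 14166,     # 2827-712 capacity used in the slide
                      machine_cps: int = 12) -> float:
    """IBM's Feb 2015 estimate for DB2 synchronous I/O on a 2827-712."""
    mips_per_cp = machine_mips / machine_cps
    return io_per_sec * cpu_sec_per_io * mips_per_cp

print(mips_old_estimate(1000))          # 20.0
print(round(mips_ibm_estimate(1000)))   # 41
```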
20. Virtual memory above 2 GB can only be allocated by using memory objects.
A memory object is a contiguous range of virtual addresses allocated in units of megabytes on a megabyte boundary.
Memory objects can be backed by 4 KB, 1 MB and 2 GB pages (the latter available since zEC12).
1 MB and 2 GB pages are called large memory pages.
Exploiting Large Pages
21. From “ABCs of z/OS System Programming - Volume 1”
64-bit addressing. In addition to Segment and Page tables:
• Region 3 tables map 2048 segment tables (up to 4 TB)
• Region 2 tables map 2048 Region 3 tables (up to 8 PB)
• Region 1 tables map 2048 Region 2 tables (up to 16 EB)
Exploiting Large Pages
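The address ranges in the table above follow from the fact that each table level maps 2048 entries of the level below, starting from a segment table that maps 2 GB; a quick check of the arithmetic:

```python
# Each segment table maps a 2 GB range; every region table level multiplies
# the mapped range by 2048 (2**11), as listed on the slide.
GB, TB, PB, EB = 2**30, 2**40, 2**50, 2**60

segment_table = 2 * GB              # one segment table: 2 GB
region3 = 2048 * segment_table      # Region 3 table: 2048 segment tables
region2 = 2048 * region3            # Region 2 table: 2048 Region 3 tables
region1 = 2048 * region2            # Region 1 table: 2048 Region 2 tables

assert region3 == 4 * TB            # up to 4 TB
assert region2 == 8 * PB            # up to 8 PB
assert region1 == 16 * EB == 2**64  # up to 16 EB, the full 64-bit address space
```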
23. As a general rule, large pages may provide performance value to long-running, memory-access-intensive applications.
First large memory page exploiters:
the z/OS nucleus (since z/OS 1.12)
DB2 buffer pools (since V10) when the PGFIX=YES parameter is specified
the JVM, which can use large memory pages (both for code cache and heap) by specifying the -Xlp option; more recent JVM versions automatically use large memory pages if they are available
ADABAS
Exploiting Large Pages
24. Additional exploiters:
DB2 executable code (since V11)
IMS CQS (since V12)
Various IMS pools (since V13)
IMS OLDS (since V13)
System Logger (since z/OS 1.13)
USS
Exploiting Large Pages
26. WLC Checks for Managers
Customers have the primary responsibility for preventing
uncontrolled loops, operator errors, or unwanted
utilization spikes. However, IBM understands that,
occasionally, situations that could not be prevented
(especially situations related to disaster recovery) might
cause exceptional utilization values. In these situations,
IBM does not normally expect customers to pay for the
increased utilization associated with the unusual
situation. Use your best judgement to determine if an
unusual situation has occurred. IBM does not publish a
list of unusual situations because, by their nature, they
will be unpredictable.
From the “Using the Sub-Capacity Reporting Tool” manual.
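The 4-hour rolling average that drives sub-capacity charging can be sketched as follows (a simplified illustration only: real SCRT processing works from SMF interval data per LPAR and applies further rules; the sample workload and the function name are hypothetical):

```python
from collections import deque

def four_hour_rolling_average(msu_samples, interval_minutes=5):
    """Yield the 4-hour rolling average for a series of MSU samples
    taken every interval_minutes (48 samples cover 4 hours at 5 minutes)."""
    window = deque(maxlen=(4 * 60) // interval_minutes)
    for msu in msu_samples:
        window.append(msu)
        yield sum(window) / len(window)

# Hypothetical day: steady 300 MSUs with a one-hour spike to 900 MSUs.
samples = [300] * 48 + [900] * 12 + [300] * 48
peak = max(four_hour_rolling_average(samples))
print(round(peak))   # 450: the billed peak stays far below the 900-MSU spike
```

This is why a short spike costs much less than a sustained one, and why a multi-hour loop lifts the monthly 4-hour rolling average peak so visibly.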
27. Not a “beautiful” day?
• Machine is a 2097-717 valued 1,329 MSUs
• Report refers to February 2012
• 4-hour rolling average monthly peak is 1,309 MSUs
• It happened on a Sunday
• Note the big difference with the second peak value (354 MSUs)
28. Not a “beautiful” day?
Bad news: at this customer site Saturday and Sunday are not business days, so such a high value on a Sunday has to be considered abnormal.
In this case it was caused by a long recovery activity needed to fix a data corruption issue following a migration to new storage processors, which had happened the previous day.
29. • Machine is a 2827-711 valued 1.593 MSUs
• Report refers to December 2014
• 4-hour rolling average monthly peak is 1.017 MSUs
• It happened on a Friday
• The difference with the second peak value is 97 MSUs
(un)Happy Hour
DATE DAY TYPE MODEL MSU USED
19/12/2014 Fri 2827 711 1.593 1.017
03/12/2014 Wed 2827 711 1.593 914
04/12/2014 Thu 2827 711 1.593 866
15/12/2014 Mon 2827 711 1.593 836
30/12/2014 Tue 2827 711 1.593 827
29/12/2014 Mon 2827 711 1.593 824
16/12/2014 Tue 2827 711 1.593 824
18/12/2014 Thu 2827 711 1.593 823
23/12/2014 Tue 2827 711 1.593 809
17/12/2014 Wed 2827 711 1.593 782
24/12/2014 Wed 2827 711 1.593 774
02/12/2014 Tue 2827 711 1.593 738
22/12/2014 Mon 2827 711 1.593 728
05/12/2014 Fri 2827 711 1.593 722
31/12/2014 Wed 2827 711 1.593 702
19/12/2014 Fri 2827 711 1.593 621
01/01/2015 Thu 2827 711 1.593 584
06/12/2014 Sat 2827 711 1.593 574
20/12/2014 Sat 2827 711 1.593 572
25/12/2014 Thu 2827 711 1.593 532
13/12/2014 Sat 2827 711 1.593 261
22/12/2014 Mon 2827 711 1.593 257
28/12/2014 Sun 2827 711 1.593 218
21/12/2014 Sun 2827 711 1.593 213
30. Looking at the different systems’ contributions, it was clear that the peak was due to something running inside the SYS2 system.
Our customer asked the technical team for a deeper analysis.
(un)Happy Hour
SYSTEM 12 13 14 15 16 17 18 19 20 21 22 23
SYS1 96 103 120 130 130 125 106 87 75 69 56 21
SYS2 699 720 746 538 549 594 736 878 898 867 746 580
SYS3 4 4 3 4 5 3 4 4 4 4 3 3
SYS4 44 43 38 35 38 43 49 48 40 39 30 23
TOTAL 843 870 907 707 722 765 895 1017 1017 979 835 627
31. The late-afternoon peak was caused by a TSO user running in a loop.
As you can see in the report, TSO001 used almost all the MSUs of one CP continuously for about 5 hours.
(un)Happy Hour
WKL ADDRESS SPACE SRVCLASS MEAN 12 13 14 15 16 17 18 19 20 21 22 23
TSO TSO001 TSO 71 97 138 143 142 141 142 47
JOB BATCH001 BATCHHI 31 4,8 56,3
JOB BATCH002 BATCHHI 25 27,3 49,2 8,6 14,5
JOB BATCH005 BATCHHI 24 31,3 38,8 0,5
JOB BATCH006 BATCHHI 23 29,9 38,8 0,5
JOB BATCH008 BATCHHI 22 8,3 18,8 28,2 49,4 29 19,1 22,7 1,3
DB2 DB2DIST DDFDB2 22 29,5 22,7 33,4 52,4 63 30,7 6,6 5,6 3,3 8,7 3,8 1,2
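A quick sanity check of the numbers in the report above (assuming, as is standard for 7xx models, that the last two digits of the 2827-711 model number give the count of general-purpose CPs, i.e. 11):

```python
machine_msu = 1593          # 2827-711 rating from the report (written 1.593)
cps = 11                    # assumption: a 711 model has 11 general-purpose CPs
msu_per_cp = machine_msu / cps
print(round(msu_per_cp))    # 145 MSUs per CP

# TSO001's steady hourly readings hover around 142 MSUs, i.e. roughly
# 98% of one CP -- consistent with a TSO user looping on a single CP.
print(round(142 / msu_per_cp, 2))   # 0.98
```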
32. • The ZNET workload used to be very stable
• Something happened on October 24th
• It was a Monday!
• The first idea was to check for maintenance activities performed over the weekend
The system you don’t expect
DATE DAY MSU SYSA TST1 TST2 ZNET TOT
16/10/2011 Sun 1.139 395 5 5 18 423
17/10/2011 Mon 1.139 886 7 7 43 942
18/10/2011 Tue 1.139 896 8 7 43 954
19/10/2011 Wed 1.139 869 9 8 43 928
20/10/2011 Thu 1.139 851 8 7 45 910
21/10/2011 Fri 1.139 796 7 7 41 850
22/10/2011 Sat 1.139 684 5 5 24 718
23/10/2011 Sun 1.139 376 5 5 16 402
24/10/2011 Mon 1.139 863 7 7 79 955
25/10/2011 Tue 1.139 891 9 7 78 985
26/10/2011 Wed 1.139 900 10 8 78 996
27/10/2011 Thu 1.139 892 8 8 79 987
28/10/2011 Fri 1.139 842 7 7 75 931
29/10/2011 Sat 1.139 698 5 5 40 748
30/10/2011 Sun 1.139 385 5 5 38 433
31/10/2011 Mon 1.139 979 7 7 84 1077
01/11/2011 Tue 1.139 988 10 8 86 1092
33. The system you don’t expect
A more detailed ZNET workload analysis showed a corresponding CPU increase in the session manager address space.
The new version of the session manager had caused this big increase (about 40 MSUs). In this case most of these MSUs were recovered thanks to some PTFs.
Being able to measure and report this issue gave the customer the opportunity to discuss the October and November monthly bills with IBM in order to reduce them.
34. DATE CURR NO IIPCP
2014-10 1267 1192
2014-09 1218 1092
2014-08 1182 1076
2014-07 1206 1146
2014-06 1200 1140
2014-05 1194 1134
2014-04 1188 1129
2014-03 1152 1094
2014-02 1134 1077
2014-01 1128 1128
2013-12 1140 1140
2013-11 1110 1110
2013-10 1120 1120
• IIPCP was always substantially less than CURR
• In the October 2014 peak hour the difference is 75 MSUs
Could we save more money with zIIP?