The purpose of this presentation is to help storage administrators by providing a high-level overview of DS8000 performance characteristics, storage performance analysis processes, and storage tools.
The goal of this presentation is to provide some practical tips for storage administrators. Author intro: end-to-end performance support for a Managed Storage Service: 17 DS8000 systems (as of 5/25), over 125 ESS Model 800 systems, over 2 petabytes of managed storage, and proactive and reactive support for all customers.
The DS8000 is not likely to suffer back-end DA (device adapter) saturation, because the looped SSA architecture is gone: back-end access is 2 Gb/s switched fibre! Back-end DA saturation on the 200 MB/s fibre will only happen at rates that were not attainable with 40 MB/s SSA. However, since it is easier to generate disk load than it is to create faster disks, there will still be contention: disk contention on arrays can and will exist, and port contention can still occur. LPAR isolation is limited to model 9A2. The new cache algorithm (SARC) is faster, but perceived performance improvements will vary depending on the actual workload. SARC provides up to a 25% improvement in I/O response time compared with the LRU algorithm. It provides some batch/online isolation and is a platform for more isolation. The quoted SARC improvement percentage says more about the talent of the person who ran the measurements than about the product; people want a number, and 25% is okay. I'm glad to have 256 GB.
Various tests were run with differing block sizes and the same number of I/Os, disk size, disk speed, and workload (70/30/50: 70% reads, 30% writes, 50% read hit ratio). For any given number of I/Os per second, the block size only seems to affect the utilization of the host adapters and the I/O bus; it doesn't appear to affect the utilization of the disks or the processors. A mixed-size workload (4 KB, 8 KB, 16 KB, 32 KB) will perform exactly the same as a fixed-size workload of, say, all 16 KB blocks for the same number of I/Os per second and the same storage configuration. In other words, don't worry about a mixture of large and small block sizes. Overall I/O throughput (MB/s) seems to be limited by the number of transactions, not the block size (up to a point!).
Various tests were run with the same block size and differing numbers of I/Os, keeping the disk size, disk speed, and 70/30/50 workload the same. Increasing the number of I/Os per second with a fixed block size does increase disk utilization as well as SMP (processor) utilization. Side note: other tests showed a cache hit service time of 0.5 ms. Disk size affects capacity, not necessarily performance! Per Siebo: "I did 18 GB vs. 36 GB comparisons on DS4000 for various workloads. There is no difference due to disk size. If the disks seek the same, spin the same, and transfer the same, you get the same response for any I/O rate."
Process steps: gather storage server performance and configuration data; gather SAN fabric configuration and exception data (if there is port saturation, contact the SAN design team); analyze the storage server configuration and performance data; if a DS8000 issue exists, recommend corrective actions. Why do we use I/O response time? On most systems with virtualized, multipath storage, disk utilization numbers are misleading. A device might show 100% busy and still have excellent response time: it is not actually 100% busy, because it is not really a device but a path to a logical storage unit located on multiple physical devices. The most telling I/O metric is the I/O response time, which should be viewed together with activity rates to understand how much the response time affects overall performance. Limit the analysis to the disks used by the application that has the problem. LUN serial numbers can be used to correlate the storage server performance data with the server's physical device information.
Different people have different ways of identifying the ripest fruit to harvest. The key is to make sure that the resource you identify and eventually act upon is the one with the best balance between providing the biggest performance benefit and minimizing the impact of the change.
PDCU requirements: Root (AIX, Linux) or Administrator (Windows) privileges are not required for installation or execution of PDCU. Note: file sizes are large (40-60 MB). Windows note: do not install PDCU into a path that contains the character sequence "pd" (a common choice!). Open a Tech Support Request to engage a specialist: http://dalnotes1.sl.dfw.ibm.com/atss/techxpress.nsf/request?OpenForm. There are ways to gather performance data, from a server perspective, about I/Os issued to volumes residing on external storage subsystems. MVS probably has the most comprehensive set of measurements available; distributed systems also have this information, and it varies from platform to platform.
With utilizations > 50%, the response time starts to increase noticeably. Port weirdness: you can attach 4 channels (ports) to a host adapter. With one channel you can get 200 MB/s; with 2 channels, 400 MB/s; with 3 channels, 540 MB/s; and with 4 channels, still 540 MB/s. The response time thresholds are not completely arbitrary. For ports, the write response time should be < 1 ms, because writes should be cache hits 100% of the time. Another measurement available for ports is the population: Busy = Population / (1 + Population). For volume response time, in my experience customers will not typically complain if the average response time for volume I/Os (including both cache hits and cache misses) is < 6 ms. If the average response time is > 6 ms, it typically indicates contention on the back end at the DA or DDM: typically the DDM for high I/O rates with small transfer sizes, and the DA if the transfer size is > 256 KB. For arrays, a response time > 6 ms is where you see DDM utilization start to go beyond 50%, although 10 ms is a more reasonable threshold.
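The population-to-busy relationship above can be sketched in a few lines of Python. This is a minimal illustration, and the function names are hypothetical. Read as queueing theory, Busy = P / (1 + P) is the single-server (M/M/1) relation P = ρ / (1 − ρ) solved for ρ, which is one way to see why a population of 1 corresponds to 50% busy, the point where response time starts to climb.

```python
def busy_from_population(population: float) -> float:
    """Port busy fraction from the average population: Busy = P / (1 + P)."""
    if population < 0:
        raise ValueError("population must be non-negative")
    return population / (1.0 + population)


def population_from_busy(busy: float) -> float:
    """Inverse of the relation above: P = Busy / (1 - Busy), for 0 <= Busy < 1."""
    if not 0.0 <= busy < 1.0:
        raise ValueError("busy must be in the range [0, 1)")
    return busy / (1.0 - busy)


if __name__ == "__main__":
    # Show how the busy fraction flattens out as the population grows.
    for p in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(f"population {p:4.2f} -> busy {busy_from_population(p):.0%}")
```

A population of 1 gives 50% busy, while doubling it to 2 only raises busy to about 67%: near saturation the queue (population) grows much faster than the utilization figure suggests.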
In this case the read response times for several of the ports were > 6 ms. This could indicate that one or more servers zoned to the HBA are accessing a volume or set of volumes located on a constrained array. During this period there were no corresponding spikes in read I/O rate or throughput. Drill down to the volumes to see if there is any correlation.
There appears to be a correlation between the average read response time for all volumes and the average read response time for the ports. The next step is to drill down to the volumes that had high average read response times during intervals 7 to 16 on the x-axis.
Put the volume statistics in an Excel table. Create a pivot table with volumes in the row field and time in the column field (limit the time to the desired range). Place Average Read Response Time in the data field, and set the grand total to an average. Copy the volume IDs, along with their average response times, to a new analysis table.
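If you prefer a script to Excel, the same pivot can be approximated in Python. This is a minimal sketch that assumes the volume statistics have already been exported as (volume ID, interval number, average read response time) rows; the row layout and function name are hypothetical, not a PDCU export format.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, float]  # (volume ID, interval number, avg read RT in ms)


def average_read_rt_by_volume(rows: Iterable[Row], first: int, last: int) -> Dict[str, float]:
    """Pivot-style summary: one entry per volume, limited to intervals first..last,
    where the 'grand total' is the average of the per-interval read response times."""
    per_volume: Dict[str, List[float]] = defaultdict(list)
    for volume, interval, avg_read_rt in rows:
        if first <= interval <= last:
            per_volume[volume].append(avg_read_rt)
    return {volume: mean(values) for volume, values in per_volume.items()}


if __name__ == "__main__":
    sample = [("0x5000", 7, 8.0), ("0x5000", 8, 10.0), ("0x5000", 20, 50.0), ("0x4e02", 7, 6.0)]
    summary = average_read_rt_by_volume(sample, first=7, last=16)
    # List volumes from highest to lowest average read response time.
    for volume, rt in sorted(summary.items(), key=lambda kv: kv[1], reverse=True):
        print(volume, round(rt, 2))
```

Like Excel's Average, this gives every interval equal weight regardless of its I/O rate, which is why weighting the response time by I/O rate is still worthwhile.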
In the pivot table, add the sum of the read I/O rate and copy it to the analysis table. Create a Total I/O Response Time field: Total I/O RT = (Avg Read RT × Sum of Read I/O Rate) + (Avg Write RT × Sum of Write I/O Rate). In this example I did not include the writes, as they were negligible. I like to create a % Total column, which is the Total I/O RT for an individual volume divided by the sum of the Total I/O RT for all volumes. This provides an I/O metric that gives a sense of the overall impact and is a good metric to sort on.
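The Total I/O RT, % Total, and cumulative % columns can be computed with a short Python sketch. The class and function names are hypothetical, and the write terms default to zero to mirror the example above, where writes were negligible.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VolumeSummary:
    volume_id: str
    avg_read_rt: float          # ms
    read_io_rate: float         # read I/O rate taken from the pivot table
    avg_write_rt: float = 0.0   # ms; leave at 0 if writes are negligible
    write_io_rate: float = 0.0

    @property
    def total_io_rt(self) -> float:
        """Total I/O RT = Avg Read RT x Read I/O Rate + Avg Write RT x Write I/O Rate."""
        return self.avg_read_rt * self.read_io_rate + self.avg_write_rt * self.write_io_rate


def rank_by_total_rt(volumes: List[VolumeSummary]) -> List[Tuple[str, float, float, float]]:
    """Return (volume ID, total I/O RT, % total, cumulative %), highest total I/O RT first."""
    grand_total = sum(v.total_io_rt for v in volumes)
    ranked = sorted(volumes, key=lambda v: v.total_io_rt, reverse=True)
    rows = []
    cumulative = 0.0
    for v in ranked:
        share = 100.0 * v.total_io_rt / grand_total if grand_total else 0.0
        cumulative += share
        rows.append((v.volume_id, v.total_io_rt, share, cumulative))
    return rows


if __name__ == "__main__":
    vols = [
        VolumeSummary("vol_a", 3.0, 10.0),            # hypothetical sample values
        VolumeSummary("vol_b", 1.0, 10.0, 1.0, 10.0),
        VolumeSummary("vol_c", 1.0, 50.0),
    ]
    for vol, total, pct, cum in rank_by_total_rt(vols):
        print(f"{vol}: total={total:.1f} pct={pct:.0f}% cumulative={cum:.0f}%")
```

Sorting by Total I/O RT and reading down the cumulative % column shows how few volumes account for most of the response time.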
To issue the CLI commands, you must have installed the DSCLI component on a system with connectivity to the DS8000 management console (HMC), and you must have a user ID with read privileges. The column headers provide a sample of the output. In this case we are looking to identify the extent pool (extpool) ID of the hot volumes previously identified, and then use the extent pool ID to identify the arrays in the lsarray output. Drop the first character of the rank ID and convert the remaining rank number to hex to match the ‘lsarray’ output with the rank information gathered in the SDD output.
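The rank conversion and the volume -> extent pool -> rank lookup can be scripted. This Python sketch assumes the ‘lsfbvol’ output has already been reduced to a volume -> extent pool dictionary and the array/rank listing to an extent pool -> rank dictionary; both layouts and the function names are hypothetical. An uppercase hex format reproduces IDs in the 0xA / 0x10 style.

```python
from typing import Dict, Tuple


def rank_to_hex(rank_id: str) -> str:
    """Drop the leading 'R' of a rank ID such as 'R16' and render the number in hex ('0x10')."""
    if len(rank_id) < 2 or rank_id[0].upper() != "R" or not rank_id[1:].isdigit():
        raise ValueError(f"unexpected rank ID: {rank_id!r}")
    return f"0x{int(rank_id[1:]):X}"


def map_volumes_to_ranks(
    volume_pools: Dict[str, str], pool_ranks: Dict[str, str]
) -> Dict[str, Tuple[str, str, str]]:
    """Join volume -> extent pool with extent pool -> rank, returning
    volume -> (extent pool, rank, hex rank). Volumes whose pool is unknown are skipped."""
    mapping: Dict[str, Tuple[str, str, str]] = {}
    for volume, pool in volume_pools.items():
        rank = pool_ranks.get(pool)
        if rank is None:
            continue  # pool missing from the rank listing; check it by hand
        mapping[volume] = (pool, rank, rank_to_hex(rank))
    return mapping


if __name__ == "__main__":
    volumes = {"0x5000": "P16", "0x4f0b": "P9", "0x4e02": "P10"}
    pools = {"P16": "R16", "P9": "R9", "P10": "R10"}
    for vol, (pool, rank, hex_rank) in map_volumes_to_ranks(volumes, pools).items():
        print(vol, pool, rank, hex_rank)
```

In this example configuration the extent pool and rank numbers happen to match (P16 and R16); keeping the pool -> rank mapping as explicit data means the script still works where they differ.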
Now that you have identified the arrays in question, look at the array response time and see if there is any correlation between the array response time and the volume response time. In this case there is a slight correlation between the spike in volume read response time and the response time on the array. In practice, you should also look at the other metrics available for the array, including I/O rates, transfer sizes, etc.
TPC 3.1.1, the current GA version, has several views that may be useful in the bottom-up approach. The first view is the Array view, which lets you chart an array metric historically for up to 10 arrays within a single view. You can export the data to CSV, view additional arrays by selecting Next, and modify the time window by checking the Limit days check box and specifying start and stop times. Graph the array utilization for the subsystem you are interested in. By hovering over the spikes in the array utilization you can determine when they occurred, and by referencing the legend you can identify the array with the spike. In this case the spikes were on array 18 at 2:35 pm, 4:18 pm, 7:05 pm, 11:32 pm, 1:14 am, 5:22 am and 12:24 pm. By going back to the list of arrays, you can drill down on the array to see the volumes stored on it. The Array view provides information on the lowest level of the disk storage subsystem (the physical disks) and can be very useful in problem determination (PD). A list of metrics can be found in the following publication: IBM TotalStorage Productivity Center User’s Guide GC32-1775.
After you drill down from the array to the volumes you are provided a list of volumes. At this point you can either chart the volume performance data (10 per chart) or export the data.
This requires that the CLI component be installed, either locally on the TPC server or remotely. If it is remote, access will need to be set up and you will need a user ID and password; the user ID and password you use to access the GUI will work. To gather statistics for more than one component, you need to issue the command multiple times.
Process steps: confirm that the issue is NOT with server resources (verify that host CPU utilization, paging I/O, and local HBA saturation are not the source of the performance issue); identify any host disks with high I/O response time; map the host disks to storage server device names; gather storage server performance and configuration data; gather SAN fabric configuration and exception data (if there is port saturation, contact the SAN design team); analyze the storage server configuration and performance data; if a DS8000 issue exists, recommend corrective actions. LUN serial numbers can be used to correlate the storage server performance data with the server's physical device information.
The AIX filemon tool is a trace-based facility and should only be run for a couple of minutes at a time. The other UNIX flavors provide I/O response time data that can be gathered continuously at reasonable intervals, because their tools are not trace based (see Appendix B for other flavors). Read sizes are always reported in 512-byte blocks. So in this case there were 620 reads, and the average read size was 8 blocks (512-byte blocks), or 4096 bytes (4 KB). These are random I/Os, as the number of read sequences is the same as the number of I/Os. The minimum information you need to pull from this output is: the time when filemon started, volume, reads, average read time, writes, and average write time. I would filter out any records that have 1 I/O or less. For the LUN -> hdisk mapping, run the SDD command ‘datapath query essmap’. At a minimum you will want to pull the hostname and hdisk information on a daily basis, either directly if you have access to the servers, or by installing an agent that FTPs the information to a location where you can load it into a configuration database. After the data has been formatted, sort by highest average response time. It is helpful to create a pivot table that averages the I/O response times for each LUN, and to create a sorted list of the LUNs with the highest response time.
The first view sorts and sums the total response time (RT) for all I/Os associated with each LSS. In this case, I/Os to volumes on LSS 7, 10, 8, 9 and 12 make up 81% of the total I/O RT; these are the ones you should drill down on. I/Os are evenly distributed across the volumes on rank ‘ffff’. By sorting on different DS8000 components you can drill down to different physical levels. This is a good method for finding saturated components when they are consistently saturated. Another potentially useful view is the hot ranks/volumes over time, which can show spikes that would otherwise not be observed in the summary tables.
In addition to the host connection information, the output of the ‘datapath query essmap’ command contains the rank and LSS. While there is no fixed association between LSSs and physical placement on the DS8000, it is helpful to know that LUNs reside on different servers (even-numbered LSSs belong to server0 and odd-numbered LSSs to server1) and on different ranks. Knowing the rank -> LUN association can help in determining which LUNs should be chosen for striped or spread logical volumes (LVs).
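The LSS view and the even/odd server rule can be combined in a short Python sketch. It assumes rows of (LSS, I/O count, average response time in ms) taken from the joined filemon and ‘datapath query essmap’ data; the row layout and function names are hypothetical. The LSS is parsed as hexadecimal, but because 10 and 16 are both even bases, the even/odd test gives the same answer even if the column is written in decimal.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def lss_server(lss: str) -> str:
    """Even-numbered LSSs belong to server0, odd-numbered LSSs to server1."""
    return "server0" if int(lss, 16) % 2 == 0 else "server1"


def total_rt_by_lss(rows: Iterable[Tuple[str, int, float]]) -> List[Tuple[str, str, float, float]]:
    """Sum I/O count x average response time per LSS and return
    (LSS, owning server, total RT, % of total RT), highest total first."""
    totals: Dict[str, float] = defaultdict(float)
    for lss, io_count, avg_rt_ms in rows:
        totals[lss.lower()] += io_count * avg_rt_ms
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (lss, lss_server(lss), total, 100.0 * total / grand_total if grand_total else 0.0)
        for lss, total in ranked
    ]
```

A lopsided server column in the result (most of the RT on server0 or on server1) is a quick hint that LUN placement is not balanced across the two servers.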
The LUN ID provided by SDD is 11 characters: the first 7 are the serial number of the DS8000 image, and the last 4 are the volume ID. The first two columns of the ‘lsfbvol’ output are Name and ID; the ID correlates to the volume ID. There is also an extent pool column. The name is used in TPC for Data to identify the volume, the volume ID is used in PDCU, and the extent pool can be used as a key to identify the rank.
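Splitting the SDD LUN ID and normalizing volume IDs can be sketched in Python. The function names are hypothetical. The sketch assumes IDs arrive either PDCU-style (for example 0x4e0b) or SDD/DS CLI-style (for example 4E01), and it relies on the DS8000 convention that the first two hex digits of a volume ID identify its LSS (LUNs ‘0703’ and ‘0709’ are on LSS 07).

```python
from typing import Tuple


def split_lun_serial(lun_sn: str) -> Tuple[str, str]:
    """Split an 11-character SDD LUN serial into (DS8000 image serial, volume ID)."""
    lun_sn = lun_sn.strip()
    if len(lun_sn) != 11:
        raise ValueError(f"expected an 11-character LUN serial, got {lun_sn!r}")
    return lun_sn[:7], lun_sn[7:].upper()


def normalize_volume_id(volume_id: str) -> str:
    """Bring IDs such as '0x4e0b' and '4E0B' to one upper-case, 4-digit form."""
    vid = volume_id.strip()
    if vid.lower().startswith("0x"):
        vid = vid[2:]
    return vid.upper().zfill(4)


def lss_of(volume_id: str) -> str:
    """The first two hex digits of a DS8000 volume ID are its LSS."""
    return normalize_volume_id(volume_id)[:2]


if __name__ == "__main__":
    image_sn, vol = split_lun_serial("75977014E01")
    print(image_sn, vol, lss_of(vol))
```

With a dictionary built from the ‘lsfbvol’ ID and Name columns (a hypothetical lsfbvol_names), lsfbvol_names[normalize_volume_id(vol)] then gives the volume name used by TPC for Data.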
VG = Volume Group; Connection = the Shark HBA location; Device Name = OS device; LCU = Logical Control Unit; ChPID = Channel Path ID (roughly an HBA or path).
Generally speaking, the I/O response time is the time from the point where the I/O request hits the device driver until the I/O is returned from the device driver. The iostat package for Linux is only valid with 2.4 and 2.6 kernels.
Trademarks & Disclaimer
- The following terms are trademarks of the IBM Corporation:
  - Enterprise Storage Server® (abbreviated: ESS)
  - TotalStorage® Expert (TSE)
  - FAStT/DS4000/DS8000
  - AIX®
  - z/OS®
- Other trademarks appearing in this report may be considered trademarks of their respective companies.
  - SANavigator, EFCM: McDATA
  - UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Limited.
  - Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
  - EMC is a registered trademark of EMC Inc.
  - HP-UX is a registered trademark of HP Inc.
  - Solaris is a registered trademark of SUN Microsystems, Inc.
  - Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
  - UNIX is a registered trademark of The Open Group in the United States and other countries.
  - Disk Magic is a trademark of IntelliMagic (http://www.intellimagic.net).
- Disclaimer: The views in this presentation are those of the author and are not necessarily those of IBM.
DS8000 Performance Enhancers
[Figure: DS8000 architecture: host connections; two 2-way POWER5 570 servers, each with memory DIMMs and P5 L3 cache; RIO-G and switched fibre interconnect; 2 Gb fibre links to switched FC disk packs]
- RIO-G
  - 1 GB/s per link, full duplex
  - Spatial reuse (uses all links)
  - At 50% utilization a loop supports 2 GB/s sustained data transfer
- Host adapter: a Fibre Channel host port can sustain a 206 MB/s data transfer
- Back end
  - XOR partials remain in the adapter, so no cache bandwidth is consumed
  - Switched fibre: two concurrent operations per loop
- POWER5
  - Near-linear SMP scaling
  - Simultaneous Multi-Threading
  - Large L1, L2 and L3 caches
  - L3 cache directory on die
- Cache: SARC provides up to 100% improvement in cache hits over LRU
Disk Magic Introduction
- Disk Magic is a tool that models current and future performance of disk subsystems, IBM or other, attached to Open, iSeries or zSeries servers.
- Disk Magic is a product of IntelliMagic (http://www.intellimagic.net), developed in close cooperation with the IBM performance team in Tucson (AZ), and is licensed to the IBM Server Group to be used for marketing support purposes.
- Techline provides the service: Sales Support Connect (SPC) at 1-877-707-2727 for US and Canada, or 506-646-7498 for Latin America.
Disk Magic Observations – 70/30/50 – Varying Block Size
Disk Magic Observations – 70/30/50 – Varying # I/Os
Performance Analysis Process – I/O – Bottom Up
[Flowchart labels: Host Analysis (Phase 1); Storage Server Analysis (Phase 2); Host resource issue? (Y/N); Fix it; ID hot/slow host disks; Storage server performance data; Fix it]
Storage Subsystem Performance Analysis Process
[Flowchart labels: Always collect performance data!; Application problem?; Look at performance data; Disk contention? (Yes: identify ripest fruit and harvest, fix!; No: identify other resource)]
Collecting Performance Data – Storage Subsystem
| Tool | Pros | Cons |
| Performance Data Collection Utility (PDCU) | Free; Excel macro for post-processing provided; Low collection system overhead; Collects port, array, volume data; Easy installation/usage | No longer available for customer download; Limited performance data; Limited documentation |
| TPC 3.1 | Scalable; Documented; Supported; Broad range of data collected | Requires infrastructure; Complicated; Expensive; Limited views/analysis |
Bubba Numbers
| Component | Metric | Threshold | Comment |
| Port | Avg Write RT | 1 msec | |
| Port | Avg Read RT | 6 msec | |
| Port | Population | 1 | SUM(Read Times + Write Times) / Interval length |
| Port | Utilization | 50% | Not available, but derived by the greater of Avg Read KB/200,000 or Avg Write KB/200,000 |
| Array | Utilization | 50% | Not available in PDCU |
| Array | Avg Read RT | 10 msec | Backend Disk Response Time |
| Array | Avg Write RT | 10 msec | Backend Disk Response Time |
| Array | NVS Delayed DFW I/Os | 5% | |
| Volume | Average RT | 6 msec | |
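Bubba numbers are rules of thumb, and the summary encourages you to develop your own. This Python sketch shows one way to keep them in a table and flag values that exceed them. It assumes port throughput is in KB/s and reads the port utilization comment as "the greater of Avg Read KB/200,000 and Avg Write KB/200,000"; the threshold values come from the speaker notes, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed usable bandwidth of one 2 Gb FC port direction, in KB/s (~200 MB/s).
PORT_KB_PER_SEC = 200_000

# (component, metric) -> threshold; example values, replace with your own Bubba numbers.
THRESHOLDS: Dict[Tuple[str, str], float] = {
    ("port", "avg_write_rt_ms"): 1.0,
    ("port", "avg_read_rt_ms"): 6.0,
    ("port", "utilization"): 0.50,
    ("array", "avg_read_rt_ms"): 10.0,
    ("array", "avg_write_rt_ms"): 10.0,
    ("volume", "avg_rt_ms"): 6.0,
}


def port_utilization(avg_read_kb_per_sec: float, avg_write_kb_per_sec: float) -> float:
    """Derived port utilization: the busier direction divided by the port bandwidth."""
    return max(avg_read_kb_per_sec, avg_write_kb_per_sec) / PORT_KB_PER_SEC


def port_population(total_read_time: float, total_write_time: float, interval_length: float) -> float:
    """Population = (sum of read times + sum of write times) / interval length (same time units)."""
    return (total_read_time + total_write_time) / interval_length


def exceeded(component: str, metrics: Dict[str, float]) -> List[str]:
    """Names of metrics whose value is above the threshold for this component."""
    return sorted(
        name for name, value in metrics.items()
        if (component, name) in THRESHOLDS and value > THRESHOLDS[(component, name)]
    )
```

The derived population plugs into Busy = Population / (1 + Population), so a population of 1 lines up with the 50% utilization threshold.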
Analyzing Volume Data – PDCU/Excel
[Screenshots: Pivot Table; Volume Data Summary Table]
Analyzing Volume Data – PDCU/Excel Continued
| Volume ID | Avg R RT (ms) | Avg Read I/O Rate (I/O/s) | Total I/O RT | % Total | Cumulative % |
| 0x5000 | 8.10 | 102.32 | 828.30 | 21% | 21% |
| 0x4f0b | 8.12 | 101.84 | 826.57 | 20% | 41% |
| 0x4e02 | 8.14 | 97.87 | 796.91 | 20% | 61% |
| 0x4e0b | 8.06 | 96.86 | 780.99 | 19% | 80% |
| 0x5001 | 8.14 | 93.23 | 759.37 | 19% | 99% |
Analyzing Volume Configuration - Map Volumes to Arrays
- Within DS CLI, issue ‘lsfbvol’ and save the output. Columns: Name, ID, accstate, datastate, configstate, deviceMTM, datatype, extpool, sam, captype, cap (2^30B), cap (10^9B), cap (blocks), volgrp
- Within DS CLI, issue ‘lsarray’ and save the output. Columns: Array, State, Data, RAIDtype, arsite, Rank, DA Pair, DDMcap (10^9B)
- Correlate, and convert the rank ID to hexadecimal:

| Volume ID | Extent Pool ID | Rank ID | Hex Rank ID |
| 0x5000 | P16 | R16 | 0x10 |
| 0x4f0b | P9 | R9 | 0x9 |
| 0x4e02 | P10 | R10 | 0xA |
| 0x4e0b | P10 | R10 | 0xA |
| 0x5001 | P16 | R16 | 0x10 |
Analyze the Arrays Associated with the Hot Volumes - PDCU
Performing Bottom Up Analysis using TPC for Disk – Array Utilization Report
Drill Down from the Array Table - TPC
- Select the magnifying glass icon to drill down to volumes
- From the volumes table you can chart all volumes
Getting Performance Data - tpctool
- Syntax is very particular: read the documentation
- Prior to 3.1.2, output did not contain the component ID!
- CLI Guide
Performance Analysis Process – I/O – Top Down
[Flowchart labels: Host Analysis (Phase 1); Storage Server Analysis (Phase 2); Host resource issue? (Y/N); Fix it; ID hot/slow host disks; Storage server performance data; Fix it]
Host I/O Analysis - Example of AIX Server
- Gather LUN -> hdisk information (see Appendix A):
  | Disk | Path | P | Location | adapter | LUN SN | Type |
  | vpath197 | hdisk42 | | 09-08-01[FC] | fscsi0 | 75977014E01 | IBM 2107-900 |
- Gather response time data with ‘filemon’ (see Appendix B):
  Detailed Physical Volume Stats (512 byte blocks)
  VOLUME: /dev/hdisk42  description: IBM FC 2107
  reads: 1723 (0 errs)
  read sizes (blks): avg 180.9 min 8 max 512 sdev 151.0
  read times (msec): avg 4.058 min 0.163 max 39.335 sdev 4.284
- Format the data (email me for the filemon-DS8000map.pl script):
  | DATE | TIME | SERVER | DS8000 | LUN | HDISK | #READS | READ_TIMES | AVG_READ_KB |
  | May/30/2006 | 18:04:05 | test1 | 7597701 | 75977014E01 | hdisk42 | 1605 | 3.832 | 93.3 |
  | May/30/2006 | 18:04:05 | test1 | 7597701 | 75977010604 | hdisk1278 | 1978 | 2.868 | 91.8 |
Note: The formatted data can be used in Excel pivot tables to perform a top-down examination of I/O subsystem performance.
Host I/O Analysis – Helpful Views – Pivot tables from filemon data and ‘datapath query essmap’
- LSS View: I/Os to LSS 7 & 10 make up 47% of total RT
- Rank View: on rank ‘ffff’, LUNs ‘0703’ & ‘0709’ make up 46% of total RT to LSS 7 & 10
DS8000 Port Layout -> ‘datapath query essmap’
Excerpt from ‘datapath query essmap’:
| Disk | hdisk | Connection | port |
| vpath5 | hdisk42 | R1-B4-H1-ZA | 300 |
- Connection = 2107 port information
- B4 = 4th I/O enclosure (I/O Enclosure 3 in zero-based numbering)
- H1 = 1st 4-port slot (H0 in zero-based numbering)
- ZA = Fabric A in a dual-fabric configuration
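Decoding the Connection field can be automated. This Python sketch assumes every Connection value follows the R<n>-B<n>-H<n>-Z<letter> pattern of the excerpt above; the slide does not explain the R field, so the sketch passes it through unchanged. The zero-based values follow the "B4 = I/O Enclosure 3" and "H1 = H0" notes, and the function and key names are hypothetical.

```python
import re
from typing import Dict, Union

CONNECTION_RE = re.compile(r"^R(\d+)-B(\d+)-H(\d+)-Z([A-Z])$")


def decode_connection(connection: str) -> Dict[str, Union[int, str]]:
    """Split an essmap Connection value such as 'R1-B4-H1-ZA' into its parts."""
    m = CONNECTION_RE.match(connection.strip())
    if m is None:
        raise ValueError(f"unrecognized connection string: {connection!r}")
    r_field, enclosure, slot, fabric = m.groups()
    return {
        "r_field": int(r_field),                         # not explained on the slide
        "io_enclosure": int(enclosure),                  # B4 = 4th I/O enclosure
        "io_enclosure_zero_based": int(enclosure) - 1,   # ... i.e. I/O Enclosure 3
        "host_adapter_slot": int(slot),                  # H1 = 1st 4-port slot
        "host_adapter_slot_zero_based": int(slot) - 1,   # ... i.e. H0
        "fabric": fabric,                                # ZA = fabric A
    }


if __name__ == "__main__":
    print(decode_connection("R1-B4-H1-ZA"))
```

Grouping hot hdisks by I/O enclosure, host adapter slot, and fabric is a quick way to spot port or adapter imbalance before opening the SAN tools.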
Correlating LUN from SDD with DS8000 Volume
- SDD ‘datapath query essmap’: DS8000 SN = 7597701, Volume ID = 0000
- CLI ‘lsfbvol’: Name = NX3DA0001, Volume ID = 0000
Summary
- The top-down approach is the most efficient way to identify hot LUNs
- PDCU is no longer available to customers, but it works!
- The TPC GUI is limited in analysis views
- tpctool is the best way to get raw data
- Use the extent pool as the key to correlate volumes with arrays
- Develop your own Bubba numbers!
- Look for hot arrays, especially those with large-capacity DDMs
- The number of I/Os drives disk utilization more than the transfer size does
- Spreading data and load balancing still need to be done!
Appendix A: Configuration - Getting LUN Serial Numbers for DS8000 Devices
| OS | Tool | Command | Key | Other Metrics |
| HP-UX, Solaris | SDD | datapath query device | LUN SN | Device Name |
| Wintel | SDD | datapath query device | Serial | Device Name |
| Linux | SDD | datapath query device | LUN SN | Device Name, vpath |
| AIX | SDD 1.6.X | datapath query essmap | LUN SN | VG, hostname, Connection, hdisk, LSS |
| z/OS | RMF | RMF PP and online displays | VOLSER | LCU ID, ChPID, devnum |
Appendix B - Measure End-to-End Host Disk I/O Response Time
| OS | Native Tool | Command/Object | Metric(s) |
| AIX 5.x–5.2 | filemon | filemon -o /tmp/filemon.log -O all | read time (ms), write time (ms) |
| HP-UX | sar | sar -d | avserv (ms) |
| Solaris | iostat | iostat -xcn 2 5 | svc_t (ms) |
| Linux | *iostat | iostat -dx 2 5 | svctm (ms) |
| NT/Wintel | perfmon | Physical Disk | Avg. Disk sec/Read |
| AIX 5.3 | iostat | iostat -D | avgserv |
| z/OS | RMF | RMF Mon3 DEVR, etc. | RespTime, ActRate |
Appendix C: Resources
- AIX Documentation
  - http://www-1.ibm.com/servers/aix/library/index.html
- Linux – iostat
  - http://linux.inet.hr/
- HP-UX Documentation
  - http://docs.hp.com/
- Solaris Documentation
  - http://docs.sun.com/app/docs
- DS8000 Redbooks (www.redbooks.ibm.com)
  - IBM TotalStorage DS8000 Series: Performance Monitoring and Tuning
  - IBM TotalStorage DS8000 Series: Concepts and Architecture, SG24-6452
- TPC Documentation (www.redbooks.ibm.com)
  - Managing Disk Subsystems using IBM TotalStorage Productivity Center, SG24-7097
  - IBM TotalStorage Productivity Center Installation and Configuration Guide, GC32-1774
  - IBM TotalStorage Productivity Center User’s Guide, GC32-1775
Biography: Brett Allison has been doing distributed systems performance work since 1997, including J2EE application analysis, UNIX/NT, and storage technologies. His current role is team lead for Performance and Capacity Management in ITDS. He has developed tools, processes, and service offerings to support storage performance and capacity. He has spoken at a number of conferences and is the author of several white papers on performance.