CPU bottleneck issues netapp


HIGH CPU Utilization Issues on NetApp Filer

Published in: Technology

  1. NetApp CPU Bottleneck Issues
     Some help when dealing with CPU bottleneck issues.

     A general strategy for analyzing bottlenecks is to use both service metrics (protocol, volume, and LUN latency) and component metrics (CPU, disk I/O, network I/O). Together they give a holistic view of the system and reduce the chance of drawing a false conclusion.

     To begin with, it helps to understand how Data ONTAP makes use of multiple CPUs. Data ONTAP implements coarse-grained symmetric multiprocessing (CSMP): processes are grouped into domains, and although different domains can run simultaneously on different processors, each individual domain can run on only a single CPU at any one time. This is useful, because it means that any domain showing 100% usage indicates a CPU bottleneck for that bundle of related processes.

     Reference: FAQ: CPU utilization in Data ONTAP: Scheduling and Monitoring [KB ID: 3014084]

     When you run 'sysstat -M 1' you can see CPU statistics across these domains:
     - Network
     - Protocol
     - Cluster
     - Storage
     - Raid
     - Target
     - Kahuna
     - WAFL_Ex(Kahu)

     A domain bottleneck is reached when a single domain reaches 100% utilization (for example Network, Storage, Raid, Target, or Kahuna).
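The domain rule above can be sketched in a few lines of Python. This is only an illustration: the per-domain percentages are assumed to have already been read off a `sysstat -M 1` sample by hand or by some other tool, and the dictionary layout and the function name `saturated_domains` are hypothetical, not part of any NetApp tooling.

```python
# Domains named in the bottleneck rule above; the sample values are made up.
SINGLE_CPU_DOMAINS = ("Network", "Storage", "Raid", "Target", "Kahuna")


def saturated_domains(sample, threshold=100.0):
    """Return the domains in one sample whose utilization is at or above threshold."""
    return [name for name in SINGLE_CPU_DOMAINS
            if sample.get(name, 0.0) >= threshold]


# One hypothetical sample: percentage busy per domain.
sample = {
    "Network": 35.0,
    "Protocol": 12.0,
    "Storage": 100.0,
    "Raid": 20.0,
    "Target": 0.0,
    "Kahuna": 64.0,
}

print(saturated_domains(sample))
```

A single saturated sample is not proof of a problem; the check is meant to be applied to many consecutive samples.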
  2. HIGH CPU does not always suggest a problem in the filer.
     For example, on a multi-processor filer the output of sysstat -x 1 may be quite deceiving, because it does not show the AVG utilization percentage, which is a truer indicator of system performance.

     What is processor utilization? Processor utilization is simply the percentage of time the processor is busy.

     For example, sysstat -x 1 may show a very high CPU percentage, whereas sysstat -m 1 shows rather normal figures.

     As long as the AVG CPU percentage is less than 80%, it is fine. Realistically, though, it should be around 50%, so that the controller has headroom if a failover happens for any reason.
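As a rough sketch, the two rules of thumb above (below 80% is fine, around 50% leaves room for a failover) can be written as a small classifier. The function name, the returned labels, and the `ha_pair` flag are assumptions made for illustration; the thresholds are the ones quoted in the slide.

```python
def classify_avg_cpu(avg_percent, ha_pair=True):
    """Classify an AVG CPU percentage using the slide's rules of thumb."""
    if avg_percent >= 80:
        return "investigate"
    if ha_pair and avg_percent > 50:
        # Below 80% but above the ~50% failover guideline.
        return "ok, but limited failover headroom"
    return "ok"


for value in (40, 65, 85):
    print(value, classify_avg_cpu(value))
```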
  3. USEFUL KBs
     - Block reclamation scanners cause kahuna bottleneck.
       http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=210480
     - What is the 'wafl scan status' command?
       https://kb.netapp.com/support/index?page=content&id=3011346
     - How does Data ONTAP make use of multiple CPUs?
       https://kb.netapp.com/support/index?page=content&id=3010150
       [Apparently this KB: 3010150 is removed from the NetApp Support site]
     - What causes High CPU during disk scrub although raid.scrub.perf_impact is set to low?
       https://kb.netapp.com/support/index?page=content&id=3011323
     - Data ONTAP 8: sysstat shows high CPU utilization on multiple processor system
       https://kb.netapp.com/support/index?page=content&id=2013653
     - How does Data ONTAP schedule work across multiple physical CPUs?
       https://kb.netapp.com/support/index?page=content&id=3010118
       [Apparently this KB: 3010150 is removed from the NetApp Support site]
     - If the filer acts as a SnapMirror destination, it is busy running the deswizzler after a SnapMirror update, which can cause high CPU usage. By the way, what is the deswizzler, or deswizzling?
       https://kb.netapp.com/support/index?page=content&actp=LIST&id=3011866
     - You can monitor the deswizzler work with the command wafl scan status:
       https://kb.netapp.com/support/index?page=content&id=3011346
     - Diagnosing NetApp CPU Issues: Kahuna Bottlenecks
       http://dosysadminsdream.wordpress.com/2013/01/24/diagnosing-netapp-cpu-issues-kahuna-bottlenecks/
  4. Nice to know
     FACT: "A high CPU on a storage controller does not always mean a CPU bottleneck or a performance problem. In Data ONTAP, a high CPU means only that it is doing a lot of work. If the storage controller is not busy with user protocol workload, it is doing background work such as deswizzling or disk scrubbing. But if user workload is introduced into this system, Data ONTAP is able to throttle this scanner work down in order to dedicate the CPU to the user workload."

     FACT: "During disk scrubbing, the system checks the disk blocks of all disks for media errors and parity consistency. If Data ONTAP finds media errors or inconsistencies, it fixes them by reconstructing the data from other disks and rewriting the data, and that is the reason you see the CPU load high at that time. To minimise the performance impact, you can schedule the disk scrub for non-peak hours or change your RAID scrub speed to low by using:"
       filer>options raid.scrub.perf_impact low

     WAFL SCAN
     There are many background WAFL scans for internal filesystem maintenance. As a result, one might "see" read/write activity in the sysstat -x 1 command output. wafl scan is one of them; it is always on and prioritized to run when the filer is idle.
       Volume vol0:
       Scan id  Type of scan                  progress
       213      active bitmap rearrangement   fbn 1513 of 2230 w/ max_chain_len 3
     This is normal!
  5. NetApp performance diagnosis commands
     Note: Don't forget to turn session logging on in the PuTTY session, as the output will often exceed the screen length. Also note that certain commands may not be available at the admin prompt [priv set admin]; you may have to go to a higher privilege level, [priv set advanced] or [priv set diag].

     TIP: If you are not sure or confident about running these commands on the production filer, always keep a SIMULATOR running by your side. You can run the commands on the SIMULATOR first and build your confidence before running them on the production filer.

     This command gives you overall stats per second. [You can change the interval by providing a different value such as 2, 3, 5, 6 etc., for example sysstat -x 5]
     - filer>sysstat -x 1
     For summarized results:
     - filer>sysstat -x -s 1
     This gives you a second-by-second readout of the filer's performance. In particular, look at the CP time and CP type columns: if you are constantly hitting 100% CP time and the CP type is showing lots of B's (back-to-back CPs), the NVRAM is being flooded and the filer is struggling to write all the incoming data quickly enough. A lowercase b marks deferred back-to-back CPs (CP generated CP), which probably indicates that the condition is getting worse.

     - filer>priv set diag
     - filer>statit -b
     Then wait 5 seconds, then:
     - filer>statit -e
     This command gives detailed stats of filer disk performance. The first command begins (-b) the performance snapshot and the second ends (-e) it. The output can indicate which disks are being hammered.

     You may also refer to the following PDFs:
     - Monitoring Storage Performance using NetApp Operations Manager
       http://media.netapp.com/documents/tr-4090.pdf
     - NetApp Storage Monitoring Using HP OpenView
       http://www.netapp.com/us/media/tr-3688.pdf
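To put a number on "lots of B's", one option is to count how many samples carry a back-to-back CP code. The sketch below assumes the CP type values have already been copied out of the sysstat -x output into a list of strings; it does not parse real sysstat output, and the function name and sample values are hypothetical.

```python
def back_to_back_ratio(cp_types):
    """Fraction of samples whose CP type starts with 'B' or 'b'.

    'B' = back-to-back CP, 'b' = deferred back-to-back CP.
    """
    if not cp_types:
        return 0.0
    hits = sum(1 for cp in cp_types if cp[:1] in ("B", "b"))
    return hits / len(cp_types)


# Made-up CP type codes from five consecutive one-second samples.
observed = ["T", "B", "B", "b", "F"]
print(f"{back_to_back_ratio(observed):.0%} of samples were back-to-back CPs")
```

A high ratio sustained over many minutes matters far more than a short burst.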
  6. Average CPU HIGH Bottleneck
     - To check how all the CPUs are doing:
       filer>priv set diag
       filer>sysstat -m 1
     sysstat -m displays per-processor and average utilization. The ANY column in the sysstat -m output shows the percentage of the time that one or more CPUs were busy. In addition, the utilization of each individual processor is displayed, as well as the average (AVG).

     As long as the average CPU is not 100%, there is nothing to worry about. NetApp OnCommand Performance Advisor might show CPU as high as 100% consistently, but do not panic: it is just plotting the percentage of the time that one or more CPUs were busy. In the author's example, the AVG CPU is pretty NORMAL. Only if you see ANY CPU percentage at 100% consistently for 3 to 4 days do you need to be concerned, talk to NetApp, and check whether you are hitting a BUG.
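A small worked example shows why ANY can read 100% while AVG stays low. The sketch below models four CPUs over four time slices with made-up busy/idle flags; it illustrates the definitions given above and is not how Data ONTAP computes the columns internally.

```python
def any_and_avg(busy_samples):
    """Compute ANY %, AVG %, and per-CPU % from busy flags.

    busy_samples: one tuple per time slice, holding one bool per CPU.
    """
    n_slices = len(busy_samples)
    n_cpus = len(busy_samples[0])
    # ANY: share of slices in which at least one CPU was busy.
    any_pct = 100.0 * sum(any(s) for s in busy_samples) / n_slices
    # Per-CPU: share of slices in which that CPU was busy.
    per_cpu = [100.0 * sum(s[i] for s in busy_samples) / n_slices
               for i in range(n_cpus)]
    avg_pct = sum(per_cpu) / n_cpus
    return any_pct, avg_pct, per_cpu


# In every slice exactly one (different) CPU is busy.
slices = [
    (True, False, False, False),
    (False, True, False, False),
    (False, False, True, False),
    (False, False, False, True),
]
any_pct, avg_pct, per_cpu = any_and_avg(slices)
print(f"ANY={any_pct}%  AVG={avg_pct}%  per-CPU={per_cpu}")
```

Here some CPU is always busy, so ANY is 100%, yet each CPU is busy only a quarter of the time, so AVG is 25%.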
  7. Kahuna bottleneck
     A Kahuna bottleneck is reached when the sum of the Kahuna domain and the (Kahu) value from the WAFL_Ex domain reaches 100% utilization.
     - To check how all the CPUs are doing across all domains:
       filer>priv set diag
       filer>sysstat -M 1
     For summarized results:
     - filer>sysstat -M -s 1

     In the author's example screenshot, the Kahuna domain is circled and (Kahu) is squared to make it clear. There, Kahuna + (Kahu) adds up to 95 and 96 percent, which is quite high but not at the 100% mark yet.

     IMP: Kahuna processes and (Kahu) processes cannot run simultaneously, so a potential Kahuna bottleneck occurs when the Kahuna value and the (Kahu) value add up to 100%.
     - It is important to keep a watch on this domain percentage; it is a matter of concern if it consistently remains at 100% for days [3 to 4 days] together. In most cases it returns to normal within a few hours, so do not panic.
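The Kahuna rule can be expressed as a check over a series of samples. In this sketch each sample is a (Kahuna, Kahu) pair of percentages assumed to have been read from sysstat -M output; the split of the individual values is invented for illustration (the slide only gives the sums 95 and 96), and the function name is hypothetical.

```python
def sustained_kahuna_bottleneck(samples, threshold=100.0):
    """True if Kahuna + (Kahu) is at or above threshold in every sample.

    samples: list of (kahuna_pct, kahu_pct) tuples.
    """
    return bool(samples) and all(
        kahuna + kahu >= threshold for kahuna, kahu in samples
    )


# Made-up splits that add up to 95 and 96, as in the slide's example.
high_but_ok = [(60.0, 35.0), (62.0, 34.0)]
saturated = [(70.0, 30.0), (55.0, 45.0)]

print(sustained_kahuna_bottleneck(high_but_ok))
print(sustained_kahuna_bottleneck(saturated))
```

Requiring every sample to hit the threshold mirrors the advice to worry only when the condition persists rather than on a single spike.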
  8. Reach Out to NetApp Support
     If you are unable to make sense of all this, do not worry: just contact NetApp technical support by phone or email; they are really good. In most cases they will ask you to collect logs and upload them to the NetApp Support site. To help you do this, NetApp support will direct you to the following tools for log collection:

     - Tool: Perfstat
       C:\>perfstat -f [filer] -t 5 -i 6 > [case number].perfstat.out
       Download the perfstat tool from the NetApp Support Site:
       http://support.netapp.com/NOW/download/tools/perfstat/

       perfstat -f FILERNAME -t 3 -i 6 -l root -S pw: password > CASENUMBER.FILERNAME.PERFSTAT.OUT
       INFO:
       -i 6: How many iterations (scans) to run on the filer.
       -t 3: The delay between each scan.
       Note: This sample is good enough for a quick review during your own performance testing. However, NetApp might ask you to provide a 24-hour sample for detailed analysis. In that case, you just need to alter the iteration count and time delay as shown in the example below.
       perfstat -f filer1 -t 30 -i 48
       This example captures 48 samples at 30-minute intervals, for a total of [48*30/60] 24 hours.

     - Tool: NSanity
       Collects details of all SAN-related components for end-to-end diagnosis. For full command info, check the NSanity page on the NOW site.
       http://support.netapp.com/NOW/download/tools/nsanity/

     - How to upload a file to NetApp
       https://kb.netapp.com/support/index?page=content&id=1010090
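The iteration arithmetic above (48 samples x 30 minutes / 60 = 24 hours) can be turned around to work out the -t and -i values for any capture window. This is a minimal sketch that only builds the two option values as described in the slide; the function name is hypothetical and it does not run perfstat.

```python
def perfstat_args(total_hours, minutes_per_iteration):
    """Return the '-t ... -i ...' option string for a capture window.

    Both arguments are whole numbers; the window must divide evenly.
    """
    total_minutes = total_hours * 60
    if total_minutes % minutes_per_iteration != 0:
        raise ValueError("window is not a whole number of iterations")
    iterations = total_minutes // minutes_per_iteration
    return f"-t {minutes_per_iteration} -i {iterations}"


# 24-hour capture with 30-minute iterations, as in the slide.
print("perfstat -f filer1", perfstat_args(24, 30))
```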
  9. BUGs that are linked to HIGH CPU Utilization
     IMPORTANT TIP: Whenever you open a bug page on the NetApp Support site, always go to the link at the bottom of the 'Fixed-In Version' section, titled: "A complete list of releases where this bug is fixed is available here." This is because the Fixed-In Version section may not contain the complete list of Data ONTAP versions in which the bug is fixed.

     - BUG: 698798: High CPU utilization with many concurrent 'block ownership' and 'blocks used' scanners
       http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=648017
       http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=698798
       [Note: BUG 648017 is fixed from release 8.1.2P3 onwards, which indicates that the bug is present in 8.1.2. That said, it does not mean that you are hitting this BUG.]
     - BUG: 91653: Volume SnapMirror source has high CPU usage
       http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=91653
     - BUG: 110630: Wildcard searches from CIFS on large directories are CPU-intensive
       http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=110630
     - C-MODE BUG: 595957: High CPU utilization on Cluster-Mode storage systems that have a high number of SAS shelves and disks
       http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=595957
  10. BUGs that are linked to HIGH CPU Utilization (continued)
     - BUG: 590193: WAFL background file system scanner may cause high CPU usage.
       http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=590193
     - BUG: 164124: Kerberos replay cache can cause high CPU usage
       http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=164124

     Courtesy: NetApp
     ashwinwriter@gmail.com
     Jan, 2014