Page 2
• This document shows examples of BVQ MDisk Group cache partition analysis.
• I have added some text from an IBM Redpaper. Thanks to Barry Whyte.
• Each page with text taken from this Redpaper is marked with the IBM Redpaper logo.
Page 3
Terminology
• Cache: Cache is a combination of the actual physical cache medium and the algorithms that are managing the data within the cache.
• Demote-Ready: A track that is in the demote-ready list is either old read cache data, or write cache data that is successfully destaged to disk.
• Destage: The act of removing modified data from the cache and writing it to disk.
• Modified Data: Write data in cache that is not destaged to disk.
• Non-owner Node: The non-preferred node for a given VDisk.
• Owner Node: The preferred node for a given VDisk.
• Partner Node: The other node in an SVC I/O Group.
• Page: The unit of data held in the cache. In the SVC, this is 4 KB. The data in the cache is managed at the page level. A page belongs to one track.
• Track: The unit of locking and destage granularity in the cache. In the SVC, a track is 32 KB in size (eight pages). A track might only be partially populated with valid pages.
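The page and track sizes above imply a simple address arithmetic. The following is a minimal sketch, assuming plain byte offsets within a VDisk; the helper name `locate` is hypothetical and this is not SVC code.

```python
PAGE_SIZE = 4 * 1024                        # 4 KB page, as stated above
TRACK_SIZE = 32 * 1024                      # 32 KB track, as stated above
PAGES_PER_TRACK = TRACK_SIZE // PAGE_SIZE   # eight pages per track


def locate(offset_bytes):
    """Return (track index, page index within that track) for a byte offset.

    Hypothetical helper for illustration only.
    """
    if offset_bytes < 0:
        raise ValueError("offset must be non-negative")
    track = offset_bytes // TRACK_SIZE
    page_in_track = (offset_bytes % TRACK_SIZE) // PAGE_SIZE
    return track, page_in_track


if __name__ == "__main__":
    print(PAGES_PER_TRACK)      # 8
    print(locate(40 * 1024))    # (1, 2)
```

A write at offset 40 KB therefore touches the third page of the second track, and that track is the unit that gets locked and destaged.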
Page 4
Node Performance / SVC Write Cache Statistics
Write cache fullness is at a very high level. Because this chart shows the mean value across all MDisk groups, it is very likely that one or several cache partitions are overloaded. This overload will cause performance issues!
Chart series: CPU %, Write Cache Full %, Read Cache Full %
Chart annotations:
• Write cache up to 80% full, with peaks up to 90%: too high
• R/W cache max 77.5%: OK
• CPU 20% to 35%: OK
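Why a high mean points to overloaded partitions can be shown with a few lines of arithmetic. The per-partition values and the 90% cut-off below are invented for illustration; they are not measurements from this customer.

```python
from statistics import mean

# Hypothetical write cache fullness per MDisk group cache partition, in percent.
partition_fullness = {"mdg_a": 100.0, "mdg_b": 95.0, "mdg_c": 70.0, "mdg_d": 55.0}

# The node-level statistic is the mean across all partitions.
node_mean = mean(list(partition_fullness.values()))

# Partitions at or above an assumed 90% cut-off are flagged as overloaded.
overloaded = sorted(name for name, pct in partition_fullness.items() if pct >= 90.0)

if __name__ == "__main__":
    print(node_mean)     # 80.0
    print(overloaded)    # ['mdg_a', 'mdg_b']
```

A node-level value of 80% is compatible with one partition already sitting at 100%, which is why the per-partition charts on the following pages are needed.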
Page 5
MDG Cache Partition Examples Overview
Overview of all MDisk Group cache partitions of the customer
[Figure: one cache partition chart each for DS3512 (two charts), XIV Gen1, and XIV Gen2. Chart annotations that survive extraction, not assignable to a specific chart: 80%-100%; 400 ms/op, 600 ms/op, 1500 ms/op, 1700 ms/op, 1750 ms/op, 2400 ms/op.]
Page 6
MDG Cache Partition 3512
• This cache partition is very heavily used. It sometimes reaches 100% max fullness, but the write destage rate stays in the normal range when this happens.
Chart panels: Cache Partition Fullness (min, avg, max), Track Access, Track Lock, Write Destage
Page 7
MDG Cache Partition 3512-02
• This cache partition looks overloaded: there are long periods at 100% full and even more than 100%. Write destage rates have phases of extremely high activity.
Chart annotations:
• High destage rates: panic destage!
• Long periods of 100% max write cache full
Page 8
MDG Cache Partition XIV01
• This XIV generally still seems to have some performance reserve. But there are peaks up to 90%, and we should try to figure out where they are coming from. Most likely they are caused by some monster volumes that write big amounts of data into the SVC. These volumes should not all start writing at the same moment.
Page 9
MDG Cache Partition XIV02
Chart annotation: This peak is the reference!
• The XIV is working very hard but is not yet overloaded. What happens when we add more load to this system? Write Cache Full will start to rise, and problems will start when we reach the 100% limit more often. There is one peak up to 90%; this peak marks the upper level, not the "flat line" at 83%.
Page 10
Performant System: No Cache Issue
• This is the system of another customer with only one MDisk Group in the SVC. This one MDisk Group can use all the existing node cache. The back-end storage systems are also very fast.
Page 11
Write Life Cycle to SVC
Excerpt from the IBM Redpaper
• When a write is issued to the SVC, it passes through the upper layers in the software stack and is added to the cache. Before the cache returns completion to the host, the write must be mirrored to the partner node. After the I/O is written to the partner node and acknowledged back to the owner node, the I/O is completed back to the host.
• The LRU algorithm places a pointer to this data at the top of the LRU list. As subsequent I/Os are requested, they are placed at the top of the LRU list. Over time, our initial write moves down the list and eventually reaches the bottom.
• When certain conditions are met, the cache decides that it needs to free a certain amount of data. This data is taken from the bottom of the LRU list and, in the case of write data, is committed to disk. This destage operation is only performed by the owner node.
• When the owner node receives confirmation that the write to disk is successful, the control blocks associated with the track are modified to mark that the track now contains read cache data, and the track is added to the demote-ready list. Subsequent reads of recently written data are returned from cache. The owner node notifies the partner node that the write data is complete, and the partner node discards the data. The data is discarded on the non-owner (partner) node because reads are not expected to occur to non-owner nodes.
Summary:
• Write
  - put in cache as tracks
  - copy to partner node
  - acknowledge back
• Cache LRU algorithm
  - new data goes to the top of the list
  - tracks move down the list when new data arrives in cache
• Cache becomes full
  - cache space needs to be freed
  - destage LRU tracks
  - only the owner node destages data!
• Destage operation
  - write track to disk system
  - receive ACK
  - track modified to read cache
  - track added to demote-ready list
  - partner node is informed
  - partner node discards data because a mirror is not needed for read cache data
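The write life cycle described on this page can be sketched as a toy model. This is an illustration under simplifying assumptions (whole-track writes, one dictionary standing in for the back-end controller, no failure handling); the class and method names are hypothetical and do not reflect SVC internals.

```python
from collections import OrderedDict


class CacheNode:
    """Toy model of one node's cache; all names are hypothetical."""

    def __init__(self, name):
        self.name = name
        self.partner = None
        self.lru = OrderedDict()           # modified tracks; last item = top of LRU
        self.mirror = {}                   # copies held on behalf of the partner
        self.demote_ready = OrderedDict()  # clean tracks kept as read cache
        self.disk = {}                     # stands in for the back-end controller

    def write(self, track, data):
        """Host write: cache locally, mirror to the partner, then complete."""
        self.demote_ready.pop(track, None)    # track is modified again
        self.lru[track] = data
        self.lru.move_to_end(track)           # newest write goes to the top
        self.partner.mirror[track] = data     # mirror before completing to host
        return "completed"

    def destage_one(self):
        """Owner node destages the track at the bottom of the LRU list."""
        track, data = self.lru.popitem(last=False)
        self.disk[track] = data               # write to disk, treated as ACKed
        self.demote_ready[track] = data       # track now holds read cache data
        self.partner.mirror.pop(track, None)  # partner discards its copy
        return track


def make_io_group():
    """Create two nodes that are each other's partner."""
    a, b = CacheNode("node1"), CacheNode("node2")
    a.partner, b.partner = b, a
    return a, b
```

After two writes to tracks 7 and 9 on the owner node, `destage_one()` picks track 7 (the oldest), moves it to the demote-ready list, and removes it from the partner's mirror, while track 9 stays mirrored.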
Page 12
Read Life Cycle
Excerpt from the IBM Redpaper
• When a read is issued to the SVC, it passes through the upper layers in the software stack, and the cache checks to see if the required data resides in cache.
• If a read is made to a track already in the cache (and that track is populated with enough data to satisfy the read), the read is completed instantly, and the track is moved to the top of the demote-ready list.
• If a read is satisfied with data held in the demote-ready list, the control blocks associated with the track are modified to denote that the data is at the top of the LRU list. Any reference to the demote-ready list is removed.
• There is a distinction made between actual host read I/O requests and the speculative nature of old writes that are turned into read cache data. It is for this reason that the demote-ready list is emptied first (before the LRU list) when the cache algorithms decide they need more free space.
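Two points from this page lend themselves to a short sketch: a read served from the demote-ready list promotes the track to the top of the LRU list, and free space is taken from the demote-ready list before the LRU list. The model below assumes a single node and whole-track reads; all names are hypothetical.

```python
from collections import OrderedDict


class ReadCache:
    """Toy read path; hypothetical names, single node, whole-track reads only."""

    def __init__(self, backend):
        self.backend = backend             # dict standing in for the controller
        self.lru = OrderedDict()           # last item = top of the LRU list
        self.demote_ready = OrderedDict()  # old writes turned into read cache

    def read(self, track):
        if track in self.demote_ready:
            # Hit on a demote-ready track: drop that reference and
            # place the track at the top of the LRU list.
            data = self.demote_ready.pop(track)
            self.lru[track] = data
            return data, "hit"
        if track in self.lru:
            # Hit on a track already in the LRU list; list bookkeeping
            # for this case is left out of the sketch.
            return self.lru[track], "hit"
        data = self.backend[track]         # miss: fetch from the controller
        self.lru[track] = data
        return data, "miss"

    def free_one(self):
        """Free one track: empty the demote-ready list before the LRU list."""
        if self.demote_ready:
            return self.demote_ready.popitem(last=False)[0]
        return self.lru.popitem(last=False)[0]
```

The ordering in `free_one` mirrors the distinction made above: data that is only speculatively useful (old writes) is given up before data a host has actually read.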
Page 13
Cache Algorithm Life Cycle
Excerpt from the IBM Redpaper
• The cache algorithms attempt to maintain a steady state of optimal population that is not too full and not too empty. To achieve this, the cache maintains a count of how much data it contains, and how this relates to the available capacity.
• As the cache reaches a predefined high capacity threshold level, it starts to free space at a rate known as trickle. Data is removed from the bottom of the LRU list at a slow rate. If the data is a write, it is destaged. If the data is a read, it is discarded.
• Destage operations are therefore the limiting factor in how quickly the cache is emptied, because writes are at the mercy of the latency of the actual disk writes. In the case of the SVC this is not as bad as it sounds, as the disks in this case are controller LUNs. Almost every controller supported by the SVC has some form of internal cache. When the I/O rate being submitted by the SVC to a controller is within acceptable limits for that controller, you expect writes to complete within a few milliseconds.
• However, problems can arise because the SVC can generally sustain much greater data rates than most storage controllers can sustain. This includes large enterprise controllers with very large caches.
• SVC Version 4.2.0 added additional monitoring of the response time being measured for destage operations. This response time is used to ramp up, or down, the number of concurrent destage operations the SVC node submits. This allows the SVC to dynamically match the characteristics of the environment in which it is deployed. Up to 1024 destage operations can be submitted in each batch, and it is this batch that is monitored and dynamically adjusted.
Summary:
• Cache
  - maintains a steady state of optimal population
• High capacity threshold
  - free space (rate is trickle)
  - from the bottom of the LRU list
  - destage for write data
  - discard for read data
• Back-end performance is key
• SVC measures the response time of the storage system and dynamically adjusts the number of destage operations per batch
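The response-time feedback described above can be illustrated with a small control loop. The source only states that the batch is ramped up or down and capped at 1024; the target latency and the double/halve policy in this sketch are assumptions chosen for illustration, not the SVC's actual control law.

```python
MAX_BATCH = 1024   # upper limit per batch, from the text above


def next_batch_size(current, response_ms, target_ms=10.0):
    """Return the destage batch size for the next round.

    target_ms and the double/halve policy are assumptions for illustration;
    the source does not describe the actual algorithm.
    """
    if response_ms <= target_ms:
        proposed = current * 2      # back end keeps up: ramp up
    else:
        proposed = current // 2     # back end is slow: ramp down
    return max(1, min(MAX_BATCH, proposed))
```

With these assumed parameters, a batch of 64 grows to 128 after a 3 ms response and shrinks to 32 after a 50 ms response, and it never exceeds 1024 or drops below 1.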
Page 14
Cache Algorithm Life Cycle
Excerpt from the IBM Redpaper
• If the SVC incoming I/O rate continues, and the trickle of data from the LRU list does not reduce the cache below the high capacity threshold, trickle continues.
• If the trickle rate is not keeping the cache usage in equilibrium, and the cache usage continues to grow, a second high capacity threshold is reached.
• This results in two simultaneous operations:
  - Any data in the demote-ready list is discarded. This is done in batches of 1024 tracks of data. However, because this is a discard operation, it does not suffer from any latency issues and a large amount of data is discarded quickly. This might drop the cache usage below both high capacity thresholds.
  - The LRU list begins to drop entries off the bottom at a rate much faster than trickle.
• The combination of these two operations usually results in the cache usage reaching an equilibrium, and the cache maintains itself between the first and second high usage thresholds. If the incoming I/O rate continues until the cache reaches the third and final threshold, the destage rate increases to its maximum.
• Note: The destage rate and the number of concurrent destaged tracks are two different attributes. The rate determines how long to wait between each batch. The number of concurrent tracks determines how many elements to build into a batch.
• Note: If the back-end disk controllers cannot cope with the amount of data being sent from the SVC, the cache might reach 100% full. This results in a one-in, one-out situation where the host I/O is only serviced as quickly as the back-end controllers can complete the I/O, essentially negating the benefits of the SVC cache. Too much I/O is being driven from the host for the environment in which the SVC is deployed.
Summary:
• Cache
  - what to do when the cache cannot be reduced quickly enough
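The escalation through the three thresholds can be summarised as a lookup from cache fullness to behaviour. The percentages below are placeholders, since the source does not state the real threshold values; only the ordering of the stages is taken from the text.

```python
# Threshold percentages are placeholders; the source does not give real values.
FIRST, SECOND, THIRD = 60.0, 80.0, 95.0


def cache_stage(fullness_pct):
    """Map overall cache fullness (percent) to the stage described above."""
    if fullness_pct >= 100.0:
        return "one-in, one-out"
    if fullness_pct >= THIRD:
        return "maximum destage rate"
    if fullness_pct >= SECOND:
        return "discard demote-ready data and destage faster than trickle"
    if fullness_pct >= FIRST:
        return "trickle"
    return "steady state"
```

In the healthy case described above, the cache oscillates between the "trickle" and "discard demote-ready data" stages; reaching "one-in, one-out" means the back end, not the cache, sets the host write rate.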
Page 15
Cache Partitioning
Excerpt from the IBM Redpaper
• SVC Version 4.2.1 first introduced cache partitioning to the SVC code base. The decision was made to provide flexible partitioning, rather than hard coding a specific number of partitions. This flexibility is provided on a Managed Disk Group (MDG) boundary. That is, the cache automatically partitions the available resources on an MDG basis.
• Most users create a single MDG from the Logical Unit Numbers (LUNs) provided by a single disk controller, or a subset of a controller/collection of the same controllers, based on the characteristics of the LUNs themselves. For example, RAID-5 compared to RAID-10, 10K RPM compared to 15K RPM, and so on.
• The overall strategy is provided to protect the individual controller from overloading or faults.
• If many controllers (or in this case, MDGs) are overloaded, then the overall cache can still suffer.
• Table 1 shows the upper limit of write cache data that any one partition, or MDG, can occupy.
Note: Due to the relationship between partitions and MDGs, you must be careful when creating large numbers of MDGs from a single controller. This is especially true when the controller is a low or mid-range controller. Enterprise controllers are likely to have some form of internal cache partitioning and are unlikely to suffer from overload in the same manner as entry or mid-range controllers.
Summary:
• Cache partitioning
  - cache is divided equally into partitions
  - each MDG receives one equal part of the SVC cache (an upper limit on its write data)
  - protects the controller from overload
  - isolates performance problems
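Table 1 is not reproduced on this page. As a hedged sketch, the lookup below encodes the per-partition upper limits as recalled from IBM's SVC documentation; only the 30% limit for four MDGs is confirmed by the example later in this document, so the remaining values should be checked against Table 1 of the Redpaper before relying on them. The function name is hypothetical.

```python
# Upper limit of write cache per partition, in percent of the node cache.
# Only the 4-MDG value (30%) is confirmed by the example in this document;
# the other values are recalled from IBM's SVC documentation and should be
# verified against Table 1 of the Redpaper.
PARTITION_LIMIT_PCT = {1: 100, 2: 66, 3: 40, 4: 30}
FIVE_OR_MORE_PCT = 25


def partition_limit(mdg_count):
    """Return the assumed upper limit (percent) for one partition."""
    if mdg_count < 1:
        raise ValueError("need at least one MDisk group")
    return PARTITION_LIMIT_PCT.get(mdg_count, FIVE_OR_MORE_PCT)
```

Under these values the limits for several MDGs add up to more than 100% (four partitions at 30% each, for example), so they act as caps on a single partition rather than as reserved shares.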
Page 16
Example
Excerpt from the IBM Redpaper
• Partition 1 is performing very little I/O, and is only 20% full (20% of its 30% limit, so a very small percentage of the overall cache resource).
• Partition 2 is being written to heavily. When it reaches the defined high capacity threshold within its 30% limit, write data begins to destage. The cache itself is below any of its overall thresholds. However, we are destaging data for the second partition.
• We see that the controller servicing partition 2 is struggling and cannot cope with the write data that is being destaged. The partition is 100% full, and it is occupying 30% of the available cache resource. In this case, incoming write data is slowed down to the same rate as the controller itself is completing writes.
• Partition 3 begins to perform heavy I/O, goes above its high capacity threshold limit, and starts destaging. This controller, however, is capable of handling the I/O being sent to it, and therefore the partition stays around its threshold level. The overall cache is still under threshold, so only partitions 2 and 3 are destaging. Partition 2 is being limited; partition 3 is destaging well within its capabilities.
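The percentages in this example are relative to the partition limit, not to the whole cache. The conversion is simple arithmetic; the helper name below is hypothetical and the 30% default is the limit used in this example.

```python
def share_of_total_cache(partition_fullness_pct, partition_limit_pct=30.0):
    """Convert partition fullness into a percentage of the whole cache."""
    return partition_fullness_pct * partition_limit_pct / 100.0


if __name__ == "__main__":
    print(share_of_total_cache(20.0))    # 6.0  (partition 1)
    print(share_of_total_cache(100.0))   # 30.0 (partition 2)
```

Partition 1 at 20% full therefore holds about 6% of the overall cache, while partition 2 at 100% full holds the full 30% it is allowed.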
Page 17
Example
Excerpt from the IBM Redpaper
• Partition 4 begins to perform heavy I/O. When it reaches just over a third of its partition limit, the overall cache goes over its first threshold limit. Destage begins for all partitions that have write data, in this case all four partitions. When the cache returns under the first threshold, only the partitions that are over their individual threshold allocation limits continue to destage.
Summary:
• Cache partitioning
  - Partition 1: no problem
  - Partition 2: controller is not fast enough to destage incoming data; cache partition 100% full; performance degradation
  - Partition 3: controller is fast enough to handle destage data; no performance degradation
• When the overall SVC cache reaches the first threshold limit
  - destage starts for all partitions
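The rule for which partitions destage can be written as a small decision function. This is a sketch of the rule as stated in the example, with hypothetical names and a simplified state per partition (whether it holds write data and whether it is over its own threshold).

```python
def destaging_partitions(partitions, overall_over_first_threshold):
    """Return the names of the partitions that destage.

    partitions maps a name to (has_write_data, over_own_threshold).
    """
    if overall_over_first_threshold:
        # Overall cache over its first threshold: every partition
        # that holds write data destages.
        return sorted(n for n, (has_write, _) in partitions.items() if has_write)
    # Otherwise only partitions over their individual threshold destage.
    return sorted(n for n, (_, over) in partitions.items() if over)


# State of the four partitions in the example above.
EXAMPLE = {
    "partition1": (True, False),
    "partition2": (True, True),
    "partition3": (True, True),
    "partition4": (True, False),
}
```

With the overall cache under its first threshold, only partitions 2 and 3 destage; once partition 4 pushes the overall cache over that threshold, all four do.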
Page 18
German websites
• BVQ website and BVQ wiki
  http://www.bvq-software.de/
  http://bvqwiki.sva.de
• BVQ videos on the SVA YouTube channel
  http://www.youtube.com/user/SVAGmbH
• BVQ web page of SVA GmbH
  http://www.sva.de/sva_prod_bvq.php
International websites
• Developer Works BVQ Community Blog
  https://www.ibm.com/developerworks/...
  http://tinyurl.com/bvqblog
• Developer Works Documents and Presentations
  https://www.ibm.com/developerworks/...
  http://tinyurl.com/BVQ-Documents
Page 19
For more information about BVQ, visit the website www.bvq-software.de
If you are interested in BVQ, please contact the following e-mail address: mailto:firstname.lastname@example.org
BVQ is a product of SVA System Vertrieb Alexander GmbH