June 21, 2009 Hanging By a Thread: Using Capacity Planning to Survive  Session 2240 Surf F 08:00 Wednesday  Paul O’Sullivan
Topics Up for Discussion Introduction Current Status Case Study 1 – Capacity Planning Case Study 2 – Performance Analysis Findings Future
Introduction Paul O’Sullivan Capacity Management Consultant Capacity Planning/Performance Analyst since 1994 Infrastructure and Fixed Income Investment Banking/Insurance applications PerfCap Corporation
Current State of Performance Analysis and Capacity Planning Capacity planning faces a different climate today than even 5 years ago Massive proliferation of servers Multi-platform and multi-tier Management disinterest High-level data only Capacity planning: ‘too difficult to do, so we will not bother’ Buy more servers – (not any more)
Issues Lack of specialists Too much data to collect Hard to correlate different platforms and treat application as an entity Top down approach Processes first, data later Diffused Responsibility … and....
Issues Lack of specialists Too much data to collect Hard to correlate different platforms and treat application as an entity Top down approach Processes first, data later Diffused Responsibility … and....
Falling hardware costs The following is a quotation for a typical 4-way database server: 4 x CPU – GBP 8,000 1 x storage array – GBP 13,235 3 x power supplies – GBP 750 15 x drives for array – GBP 4,500 2 x 1GB memory – GBP 10,000 Total GBP 35,500 Year: 2000. Refurbished!
OK, anyone can complain… but how can we fix it? Two examples of recent work: Capacity planning – Itanium Performance analysis – SQL Server and EVA Futures
Capacity Planning – Oracle RAC on Itanium Linux
A Sample Study – Oracle RAC Capacity Planning Currently a 3-node RAC running on IA64 Linux Expect 3x the workload on the current Oracle RAC within the next two years Must evaluate the capacity of the current cluster Examine upgrade alternatives if the current configuration cannot sustain the expected load
RAC Node CPU Utilizations, July-Sept 2008
Selection of Peak Benchmark Load
CPU by Image / Disk I/O Rate
CPU Utilization by Core Reasonable core load balance at heavy loads.
Overall Disk I/O Rates
Overall Disk Data Rate
Disk Response  Times
Memory Allocation
eCAP Workload Definition
Workload Characteristics
Primary response-time components (CPU and disk I/O) for oracleNDSPRD1, oracleLockProcs, oracleProcs, and asmProcs.

| Workload Class | Process Count | Multi-Processing Level | Process Creation Rate (/sec) | CPU Utilization | Disk I/O Rate (/sec) |
|---|---|---|---|---|---|
| oracleNDSPRD1 | 1110 | 547.1 | 0.925 | 73% | 639 |
| oracleLockProcs | 8 | 3.2 | 0.007 | 5% | 277 |
| oracleWorkProcs | 46 | 31.8 | 0.038 | 1% | 14 |
| ASM processes | 20 | 9.7 | 0.017 | 0.2% | 10 |
| daemons | 6 | 2.4 | 0.005 | 0.05% | 4 |
| data collector | 1 | 0.4 | 0.001 | 0.3% | 26 |
| root processes | 1161 | 266.0 | 0.968 | 3% | 233 |
| other processes | 774 | 47.5 | 0.645 | 2% | 311 |
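As a concrete illustration of how a workload definition like the one above can be built, the sketch below rolls per-process samples up into workload classes. The class names follow the table, but the matching rules, sample values, and 300-second interval are assumptions made for this example; this is not the actual eCAP collector or workload definition.

```python
# Minimal sketch: roll per-process samples up into capacity-planning workload
# classes. Matching rules and sample data are illustrative assumptions only.
from collections import defaultdict

INTERVAL_SECS = 300  # assumed sample interval

# Each sample: (process name, CPU seconds used, disk I/Os issued) over the interval.
samples = [
    ("oracleNDSPRD1 (LOCAL=NO)", 210.0, 180_000),
    ("ora_lms0_NDSPRD1",          14.0,  80_000),
    ("asm_pmon_+ASM1",             0.5,   3_000),
    ("kjournald",                  0.2,   1_200),
]

def classify(name: str) -> str:
    """Map a process name to a workload class (illustrative rules only)."""
    if name.startswith("oracleNDSPRD1"):
        return "oracleNDSPRD1"
    if name.startswith("ora_lm"):
        return "oracleLockProcs"
    if name.startswith("asm_") or "+ASM" in name:
        return "ASM processes"
    return "other processes"

totals = defaultdict(lambda: {"cpu": 0.0, "ios": 0})
for name, cpu_secs, ios in samples:
    cls = classify(name)
    totals[cls]["cpu"] += cpu_secs
    totals[cls]["ios"] += ios

for cls, t in sorted(totals.items()):
    cpu_util = t["cpu"] / INTERVAL_SECS      # single-core-equivalent utilization
    io_rate = t["ios"] / INTERVAL_SECS
    print(f"{cls:18s} CPU {cpu_util:6.1%}   disk I/O {io_rate:8.1f}/sec")
```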
Current System Response Time Curve – 9% headroom
Current System Headroom – headroom 9%, capacity 100%
Findings – Current System At peak sustained load, 9% headroom; CPU is the primary resource bottleneck Possible solutions: horizontal scaling, Integrity upgrade, alternate hardware platform
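The 9% headroom figure comes from the response-time curve above: the modelled load is grown until response time climbs past an acceptable limit. As a rough, generic illustration of that idea (not the eCAP model actually used in the study), the sketch below treats a node's CPUs as an M/M/m queue; the core count, baseline utilization, and the 2x response-time threshold are all assumptions for the example.

```python
# Illustrative headroom estimate from a plain M/M/m CPU model. This is a
# sketch of the general technique, not the eCAP model used in the study.
import math

def erlang_c(m: int, rho: float) -> float:
    """Probability that an arriving job must queue in an M/M/m system."""
    a = m * rho
    queued = (a ** m) / (math.factorial(m) * (1.0 - rho))
    served = sum((a ** k) / math.factorial(k) for k in range(m))
    return queued / (served + queued)

def response_time(m: int, rho: float, service_time: float = 1.0) -> float:
    """Mean response time (in service-time units) at per-core utilization rho."""
    wait = erlang_c(m, rho) * service_time / (m * (1.0 - rho))
    return service_time + wait

cores = 16          # assumed cores per node
base_util = 0.70    # assumed measured peak CPU utilization
limit = 2.0 * response_time(cores, base_util)   # assumed acceptable response time

growth = 0.0
while response_time(cores, min(base_util * (1.0 + growth), 0.999)) < limit:
    growth += 0.01
print(f"Estimated CPU headroom before hitting the limit: ~{growth:.0%} extra load")
```

A curve like this, rerun with candidate core counts and per-core speeds, is the kind of comparison that sits behind the platform evaluation on the following slides.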
Platform Alternatives (3 or 4 nodes) HP rx7620 (1.1 GHz, Itanium 2) – current configuration HP rx8640 (1.6 GHz, 24MB L3 cache), 16 core HP rx8640 (1.6 GHz, 25MB L3 cache), 32 core IBM p 570 (2.2 GHz, Power 5), 16 core IBM p 570 (2.2 GHz, Power 5), 32 core IBM p 570 (4.7 GHz, Power 6), 16 core Sun SPARC Enterprise M8000 (2.4 GHz), 16 core Sun SPARC Enterprise M8000 (2.4 GHz), 32 core Configuration must support 200% workload growth
Response Time  vs  Workload Growth 3-node RAC Note:  CPU is primary resource bottleneck;  disk and memory will support 200% growth
Response Time  vs  Workload Growth 4-node RAC
Qualifying Platforms The following platform configurations support the required growth: HP rx8640 (1.6 GHz, 25MB L3 cache), 32 core IBM p 570 (2.2 GHz, Power 5), 32 core IBM p 570 (4.7 GHz, Power 6), 16 core Sun SPARC Enterprise M8000 (2.4 GHz), 32 core Horizontal scaling to 4 nodes will not change the qualifying platforms.
Response Time  vs  Workload Growth (reduced core, 3-node configurations)
Response Time  vs  Workload Growth (reduced core, 4-node configurations)
Optimized Configurations Final choice based on cost and management issues.

| Platform | 3-node | 4-node |
|---|---|---|
| Sun SPARC Enterprise M8000 (2.4 GHz) | 32 | 24 |
| HP rx8640 (1.6 GHz, 25MB L3 cache) | 30 | 24 |
| IBM p 570 (2.2 GHz, Power 5) | 26 | 20 |
| IBM p 570 (4.7 GHz, Power 6) | 12 | 10 |
Performance Analysis – SQL Server on HP Blades and EVA
Performance Analysis 1 – Large insurance firm acquisition Migrating applications Requirement of 10x growth Much new hardware purchased 160 servers in the environment Application still slow SQL developers under the microscope
Performance Analysis Asked to examine a SQL Server application The theory was that the EVA 6000 could not cope with the IO load generated by SQL Server Used the PAWZ Performance Analysis and Capacity Planning tool to find the performance issues EVA performance data was ‘unavailable’, so used the SAN modeling capability of the PAWZ Capacity Planner
Hardware Configuration 16-way quad-core HP BL460c blade 2 x 4Gb FC fibre cards SQL Server 2000 EVA 6000 with a 96-disk disk group, 300GB 15k drives Shared with other Windows servers
Initial Analysis SQL Server processes were generating very high response times on the SAN drives SQL Server processes were themselves paging (flushing data to disk) at regular intervals Overall IO rates were low: ~1000 IO/sec CPU usage was low (10%) for a server of this type (?) Memory usage was low (15%) for a server of this type (?)
IO Rates – not really high IO counts these days….
Disk Response Time – very high D: drive response time….
IO Sizes – very large IO sizes on the D: drive….
Process-based IO Rates – the SQL Server process is generating all the IO. Obviously, something wrong with the application, right?
SQL Server Memory – 1.7GB. Excuse me? But the server has 24GB of memory.
SQL Server Paging – soft paging into the free list.
SQL Server Paging – soft paging into the free list; a huge IO load is generated as data is moved to and from the SQL Server process.
So what happened? Although SQL Server Enterprise can be configured to use all available memory, it will not use more than 1.7GB of actual memory until Address Windowing Extensions (AWE) is enabled. AWE has to be configured through the sp_configure utility (with ‘show advanced options’): it must be enabled and then given a maximum memory size. AWE will not operate if there is less than 3GB of free memory on the server: SQL Server will disable it.
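For reference, the sp_configure steps described above look like the sketch below, here run from Python via pyodbc. The server name, credentials, and the 20,480MB cap are assumptions for the example, not the customer's actual values; on SQL Server 2000 the 'awe enabled' option only takes effect after the service is restarted, and the service account also needs the Windows 'Lock pages in memory' privilege (not shown).

```python
# Sketch: enable AWE for SQL Server 2000 Enterprise via sp_configure.
# Connection details and the memory cap are illustrative assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=sqlblade01;DATABASE=master;"  # hypothetical server
    "UID=sa;PWD=change_me;",
    autocommit=True,
)
cur = conn.cursor()

for statement in (
    "EXEC sp_configure 'show advanced options', 1",   # expose advanced options
    "RECONFIGURE",
    "EXEC sp_configure 'awe enabled', 1",              # needs a service restart to apply
    "RECONFIGURE",
    # With AWE on, SQL Server will not trim its memory dynamically, so cap it
    # below physical memory (value in MB; 20GB on a 24GB server is an example).
    "EXEC sp_configure 'max server memory', 20480",
    "RECONFIGURE",
):
    cur.execute(statement)

conn.close()
```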
Production: IO Before
Production: IO After
Production: IO Q Before
Production: IO Q After
Production: Disk Busy Q Before
Production: Disk Busy Q After – huge reduction in disk busy
Result CPU utilization increased The application could handle more concurrent users in test Customer very happy: no hardware purchase, no project, no application change Rapid resolution – took 2 hours to work out a problem that had been bad since January Relieved pressure on the SAN… until another SQL Server with the same problem….
Lessons Even though a performance tool was already in place, few people were using it well A blame game without looking at the facts (the data) Need to improve fault-finding capabilities: better ways to correlate data, and automatic alerting on the real problem and its nature A classic case of the ‘cause behind the cause’
So what do we need? 1st hurdle overcome – obtaining data 2nd hurdle overcome – presenting data efficiently 3rd hurdle overcome – scalability of performance data from clients 4th hurdle overcome – automatic capacity planning data 5th hurdle – to do – making sense of the data: expert reports, just showing the issues, removing the need for manual analysis
Want to know more? Booth Number 631 http://www.perfcap.com [email_address] [email_address] [email_address]

Editor's Notes

  • #4 Note that having experience of the other side of the fence – (almost adversarial) Compaq/DEC background.
  • #5 Server numbers peaked 2005-2007 Windows/Blades/Virtualisation All platforms (worse with Solaris x86) Not seen as a value-add
  • #6 CP and Performance Specialists are almost extinct Replaced by ITIL Capacity Management Specialists – not the same thing! CP in 99% of cases sits only under infrastructure budgets – not aligned to the business Experience with ITIL: good for developing processes, bad for developing budget Suits management not to have an overall department with responsibility for Infrastructure and Applications
  • #7 CP and Performance Specialists are almost extinct Replaced by ITIL Capacity Management Specialists – not the same thing! CP in 99% of cases sits only under infrastructure budgets – not aligned to the business Experience with ITIL: good for developing processes, bad for developing budget Suits management not to have an overall department with responsibility for Infrastructure and Applications
  • #8 This was for a 4-way Sybase server which today could be replaced by a single blade server on the end of a SAN Point here: with a server costing so much you NEED to make sure that it is correctly sized – today you get better performance for less than ¼ of the price – is that why many sites have 4x the servers?
  • #9 This was for a 4-way Sybase server which today could be replaced by a single blade server on the end of a SAN Point here: with a server costing so much you NEED to make sure that it is correctly sized – today you get better performance for less than ¼ of the price – is that why many sites have 4x the servers?
  • #35 Clearly, something odd is happening here
  • #36 Clearly, something odd is happening here
  • #37 Server was a BL460c with 4Gb FC cards and 24GB of memory
  • #38 Asked the question: what was the EVA configuration? EVA6000, 300GB 15k drives, 96 disks, shared Modelled the EVA to confirm the issues….
  • #39 Ah, first clue, large sizes of IO 80,000kB/sec = 8000MB size, 8Gb xfers !!!!!
  • #40 All SQL Server, mostly during on-line day.
  • #41 SQL Server has 1.7GB, is Enterprise Edition, and SQL Server memory has been set to use all the memory it can.
  • #42 So, what happens when SQL cannot get enough memory – it will soft fault…
  • #43 So, what happens when SQL cannot get enough memory – it will soft fault…
  • #44 ALL SQL servers had this issue. Looks like the customer forgot to implement the feature…. But what happened next?
  • #45 So, what happens when SQL cannot get enough memory – it will soft fault…
  • #46 So, what happens when SQL cannot get enough memory – it will soft fault…
  • #47 So, what happens when SQL cannot get enough memory – it will soft fault…
  • #48 So, what happens when SQL cannot get enough memory – it will soft fault…
  • #49 We put the change on a stress test system
  • #50 We put the change on a stress test system
  • #51 Since this work, the fix went in on another SQL Server – disk read queue of 34m peak down to 300. The analysis wasn’t hard to do, just no-one had done it before.
  • #52 ALL SQL servers had this issue. But what happened next?
  • #53 To start with, just getting decent performance data was a problem Then came the issue of logging into each system and looking at the graphs Then came the issue of looking at 100s of systems Then came the issue of modelling