Dave Wagner, TeamQuest Advocate, and Chris Lynn, manager of Safeway's Capacity and Performance team, cover the application of automatic, exception-oriented analytics to a wide variety of IT and business metrics in order to simultaneously optimize service performance and IT cost. Multiple conceptual approaches are presented, along with their pros and cons. Most of the presentation consists of real examples of how Safeway has integrated performance, capacity, business, and power data into an automated optimization process spanning thousands of physical and virtual servers and their applications.
11. Background
• Manager of Safeway Capacity and Performance Team
• ChrisLynn@usa.com
• http://www.linkedin.com/pub/chris-lynn/2/65/309/
• Environment Supported
• ~4000 servers (~1700 physical)
• ~200 significant applications
• Unix, Windows, Mainframe, Teradata, Tandem, etc.
• Thousands of internal IT customers and millions of shoppers
14. Automated Storage Forecasting: File System Exceptions
• Weekly automated prioritized scan
• 4500 servers
• 45000 filesystems
• Focused on meaningful exceptions
• A proactive shift from find to fix
• Was: 50 minutes looking for potential problems, 10 minutes to fix
• Now: 5 minutes looking for potential problems, 55 minutes fixing them
• Impossible to do manually
15. New Automated (global exception) File System Forecast Analytic Details
• Complex multi-level thresholds
1. Is file system utilization above 90% AND growing by >0.2% for the interval?
2. Is file system utilization above 75% AND growing by >2% for the interval?
3. Is the file system utilization above 15% AND growing by >15% for the interval?
4. Is /appl/patrol above 90% AND growing for the interval?
• Individual exclusions and special cases
• Physical and virtual in the same report, but can be treated uniquely
• Sorted by the date/time each file system is most likely to fill up
• Shows all candidates for a single server together (sorted by the highest one), to minimize the time for operations to respond
• Includes the historical trend, compared to just a point in time (e.g. df)
• Forecasts the utilization trend into the future (multiple statistical options)
• Requires more than 24 hours of data, to avoid temporary file systems
• Requires recent data, to avoid shut-down servers
• The final measured number must not be below the threshold
• A final number >99.5% is also flagged, to catch very full file systems that might not be growing
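The exception rules on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the TeamQuest or Safeway implementation: the `Sample` type, the function names, and the `MAX_AGE` value are hypothetical, growth is assumed to be measured in percentage points of utilization over the interval, samples are assumed to be sorted by time, and a plain least-squares line stands in for the "multiple statistical options" mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Sample:
    when: datetime    # measurement time
    used_pct: float   # file system utilization, 0-100


# (utilization floor %, minimum growth over the interval), taken from the slide.
# Assumption: growth is in percentage points of utilization.
TIERS = [(90.0, 0.2), (75.0, 2.0), (15.0, 15.0)]
# Per-path special case from the slide: above 90% and growing at all.
SPECIAL = {"/appl/patrol": (90.0, 0.0)}
VERY_FULL_PCT = 99.5
MIN_SPAN = timedelta(hours=24)   # "more than 24 hours of data"
MAX_AGE = timedelta(hours=24)    # hypothetical: the slide gives no number


def is_exception(path: str, samples: List[Sample], now: datetime) -> bool:
    """Return True if the file system belongs in the exception report."""
    if len(samples) < 2:
        return False
    first, last = samples[0], samples[-1]
    if last.when - first.when <= MIN_SPAN:   # skip temporary file systems
        return False
    if now - last.when > MAX_AGE:            # skip shut-down servers
        return False
    if last.used_pct > VERY_FULL_PCT:        # very full, even if not growing
        return True
    growth = last.used_pct - first.used_pct
    tiers = TIERS + ([SPECIAL[path]] if path in SPECIAL else [])
    # The final measurement must itself be above the tier's utilization floor.
    return any(last.used_pct > floor and growth > g for floor, g in tiers)


def forecast_full(samples: List[Sample]) -> Optional[datetime]:
    """Fit a least-squares line and return when it reaches 100%, else None."""
    t0 = samples[0].when
    xs = [(s.when - t0).total_seconds() / 3600.0 for s in samples]
    ys = [s.used_pct for s in samples]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    if slope <= 0:                            # flat or shrinking: no fill date
        return None
    intercept = mean_y - slope * mean_x
    return t0 + timedelta(hours=(100.0 - intercept) / slope)
```

A report could then keep only the file systems for which `is_exception` is true and sort them by `forecast_full`, with undated entries last, which mirrors the "sorted by date/time most likely to fill up" bullet.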
18. Application Capacity Analysis
• Automated Application Triage
• All relevant metrics
• Embedded expertise
• Enterprise perspective of true capacity
[Chart: Capacity Risk Candidates (OS, # of systems); vertical axis 0% to 100%; categories: Under Used, Well Used, Stressed, Highly Stressed]
19. Integration With Business Metrics
System/Platform capacity data:
• Physical servers
• Virtual servers
• Tandem capacity systems
• Teradata capacity systems
• Datacenter facilities
Business perspective:
• Business transaction volumes
• Resource utilization
23. Lessons Learned / Value Gained
• Reduced service risk
• More proactive, less reactive
• Established a baseline to optimize capacity, and a mechanism to measure the progress
• Business and IT alignment
• Performance and capacity to the business
• Management and technical personnel
• Launch slowly in phases to not overwhelm the groups
• People really do care about formatting and color choice, not just content
Editor's Notes
Our way is better, faster, and more cost-effective than the alternatives:
• Build a huge “data mart” (i.e. PMDB)
• Complexity = (data ETL) x (# sources) x (maint. effort) x (SDDC variability/dynamism) x …
• + Compliance: data duplication, privacy, audit, etc.
• + “Lock-in”
• = Very costly and time-consuming
• Or apply general-purpose BI analytics to IT challenges they aren’t designed to handle
• Answers business questions, but…
• Not focused on IT resource optimization, performance, capacity
• Agility? Core competence?
• 4,000 systems (physical and virtual) evaluated in 1 day for 6-17 metrics each against platform-specific thresholds for a 30-day history
• 16,000 capacity risk indicators from 40,000 metric checks on 4,000 systems
• 16,000 capacity risk checks resulting in 3% highly stressed and 6% stressed
• 5 CPU metrics
• 7 memory metrics
• 3 IO metrics
• 2 network metrics
• Evaluated at 4 levels of criticality per metric group
• Unique thresholds per platform
• Aggregated to an overall server capacity rating
• Only systems with a concern have detailed charts created
• Visually obvious which metrics are of concern
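The notes describe rating each metric group at four levels and aggregating to one server rating, but they do not give the aggregation rule or the thresholds. The sketch below is therefore only one plausible reading: it assumes the four levels are the chart categories (Under Used, Well Used, Stressed, Highly Stressed), that a server takes the worst rating among its metric groups, and that each platform supplies three ascending cut points. All names and numbers are hypothetical.

```python
from enum import IntEnum
from typing import Dict, Tuple


class Rating(IntEnum):
    # Assumed to correspond to the four chart categories; higher is worse.
    UNDER_USED = 0
    WELL_USED = 1
    STRESSED = 2
    HIGHLY_STRESSED = 3


def rate_metric(value: float, cuts: Tuple[float, float, float]) -> Rating:
    """Map a measured value to a rating using three ascending cut points.

    `cuts` would differ per platform; the values used here are made up.
    """
    well, stressed, high = cuts
    if value >= high:
        return Rating.HIGHLY_STRESSED
    if value >= stressed:
        return Rating.STRESSED
    if value >= well:
        return Rating.WELL_USED
    return Rating.UNDER_USED


def rate_server(group_ratings: Dict[str, Rating]) -> Rating:
    """Assumed aggregation rule: the server is as stressed as its worst group."""
    return max(group_ratings.values())


def needs_detail_charts(overall: Rating) -> bool:
    """Only systems with a concern get detailed charts, per the notes."""
    return overall >= Rating.STRESSED
```

For example, a server whose CPU group is Well Used and whose memory group is Stressed would be rated Stressed overall and would get detailed charts under these assumptions.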
• Phvg06 - Prod DMZ
• Phvg07 - Prod Server Farm
• Phvg08 - NonProd DMZ
• Phvg11 - NonProd DMZ
• Phvg09 - NonProd Server Farm
• Phvg12-14 = WISE