Stop Losing Sleep V1.0 20100414



  1. 1. Stop Losing Sleep: How we deal with our trickiest availability problems and how you can use the techniques, regardless of your size. Russell Girten, Vice President, Process Transformation & Information Technology, Alaska Communications Systems
  2. 2. The Environment
  3. 3. The Old User Experience
       • Users expected outages.
       • A long-duration outage: email was out of service for four days in July 2008.
       • Users were better prepared for outages than IT was.
       (Image caption: Not an actual user. Our users weren’t this relaxed.)
  4. 4. The Experience of IT Staff
       • Little credibility
       • A staff of firefighters and heroes
       • Not enablers: perceived as a drain on the business
       • A fragile infrastructure, to be sure.
       (Image caption: Not an actual member of IT. They were much more beaten up.)
  5. 5. Today’s IT Environment
  6. 6. What Did We Change?
  7. 7. KEY #1: HOSTING
  8. 8. Remote Desktop
       • Standard call center & remote footprint: thin client with Citrix
       • Easy client exchange for call centers.
       • Approximately 25% of our desktops are served through Citrix.
       • We do not yet publish applications, but have a strong desire to.
  9. 9. Storage
       • We match storage type to the application.
       • Virtualized storage remains close to the server in a private cloud.
       • RAID storage when practical
       • Outside:
           • ASP applications (HR)
           • Failover
  10. 10. Backup
       • Tape Backup
       • Redundant Images
           • Anchorage
           • Hillsboro
       • High Availability
           • Midrange
           • Unix
           • Core 20 Windows Apps
       • Two Production Restores:
           • Bad PTF
           • Disk Reconfiguration
  11. 11. Failover
       • Cluster when possible
           • SQL Server
           • Email
           • Core 20 Applications
       • Nodes hosted in our Hillsboro Customer Data Center
       • Traffic balanced and redirected via F5 and DNS
  12. 12. Cloud Processing
       • Private cloud services for:
           • Processor
           • Storage
       • Public cloud services for:
           • Management of selected applications
           • Invoice Print
           • HRIS Footprint
       • Aggressive investigation:
           • Google Apps
           • Off-site backup
  13. 13. KEY #2: CONNECTIVITY
  14. 14. Connectivity Inside Alaska
       • Use the most appropriate connectivity for the job.
       • Heavily Metro Ethernet-based
       • Metro Ethernet is meshed: highly available, very flexible.
       • Branch offices supported through Metro Ethernet and DSL
       • PCI DSS Compliant
       • SAS 70 Ready
  15. 15. Connectivity Outside Alaska
       • Dual paths of connectivity
       • Dual Internet ingress
       • Dual entrances for critical network segments
       • ACS provides end-to-end service management.
       • Heavy build of VPN services to connect with vendors and partners
  16. 16. The Wireless Option
       • Wireless Connectivity
       • For Remote Employees
       • For High-Speed Backup
  17. 17. Network Management & Redundancy
       • Dual entrances
       • Dual modes of connectivity
       • Heavy focus on scope management and DNS cleanup
       • Branch office print services are important
       • Software distributed via SCCM
  18. 18. KEY #3: MANAGEMENT
  19. 19. Established Availability Windows
       • Our commitments to the business are very clear.
       • Heavily focused on the Core 20 applications
       • Three IT change windows:
           • Fri/Sat Overnight
           • Sat/Sun Overnight
           • Midweek
  20. 20. Core 20 Stoplight
       • Stoplights let us quickly assess the state of our environment.
       • Focused on:
           • Resources
           • Age
           • Support
       • Where Red/Yellow exists, Service Improvement Plans (SIPs) are required.
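A minimal sketch of how a stoplight rating like the one on this slide could be recorded and checked. The three dimensions (resources, age, support) and the rule that red or yellow requires a SIP come from the slide. The `Light` enum, the function names, and the "overall status is the worst light" rule are assumptions made for illustration, not the tooling ACS used.

```python
from enum import IntEnum


class Light(IntEnum):
    GREEN = 0
    YELLOW = 1
    RED = 2


# The three dimensions named on the slide.
DIMENSIONS = ("resources", "age", "support")


def dimensions_needing_sip(ratings):
    """Return the dimensions rated yellow or red; each of these calls for a SIP."""
    return [d for d in DIMENSIONS if ratings[d] != Light.GREEN]


def overall_status(ratings):
    """Assumed roll-up rule: an application is as healthy as its worst dimension."""
    return max(ratings[d] for d in DIMENSIONS)


# Hypothetical rating for one Core 20 application.
billing = {"resources": Light.GREEN, "age": Light.YELLOW, "support": Light.GREEN}
print(overall_status(billing).name, dimensions_needing_sip(billing))
```

Keeping one rating per dimension, rather than a single color per application, makes it clear which SIP has to be written.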
  21. 21. Change Management
       • Implementation Plan
       • Results of User Acceptance Testing
       • Communication Plan
       • Post-Implementation Test Plan
       • Rollback Triggers and Plan
       Culture change is required to make this work. Management discipline is necessary, and the staff must buy in.
  22. 22. High Usage Devices
       • We are aggressive about managing high-bandwidth utilization on low-speed connections. We use NetFlow.
       • We allow most types of Internet traffic, including streaming media, but…
       • We will contact high-usage personnel and advise “We noticed your service might not be working well. What can we do to help improve things…” This gets the point across.
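One way to find the high-usage devices this slide describes is to total bytes per source address and list the heaviest senders. The sketch below assumes the flow data has already been exported from a NetFlow collector and reduced to `(source_address, bytes)` pairs. It does not parse NetFlow itself, and the function name, addresses, and threshold are hypothetical.

```python
from collections import Counter


def top_talkers(flows, threshold_bytes):
    """Total bytes per source and return sources at or above the threshold, heaviest first."""
    totals = Counter()
    for source, nbytes in flows:
        totals[source] += nbytes
    return [(src, total) for src, total in totals.most_common() if total >= threshold_bytes]


# Hypothetical flow records for one low-speed branch link.
flows = [("10.0.0.5", 400), ("10.0.0.9", 100), ("10.0.0.5", 700)]
print(top_talkers(flows, threshold_bytes=1000))
```

The resulting list is the set of people to call with the "what can we do to help" conversation.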
  23. 23. High Bandwidth Consumption
       • For Each Terminating Location:
           • Busy Hour Throughput
       • Standard Business Day Management
       • Data collected with Intermapper and processed with Excel
       • Managed in 5-minute increments.
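The busy-hour figure can be derived from 5-minute samples by grouping them into clock hours and taking the hour with the highest mean. The slide says the data came from Intermapper and was processed in Excel. The Python below is only an illustrative equivalent, and the sample layout (`(timestamp, mbps)` pairs) and the function name are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def busy_hour(samples):
    """Return (hour_start, mean_mbps) for the clock hour with the highest mean throughput.

    `samples` is an iterable of (timestamp, mbps) pairs taken at 5-minute intervals.
    """
    by_hour = defaultdict(list)
    for ts, mbps in samples:
        by_hour[ts.replace(minute=0, second=0, microsecond=0)].append(mbps)
    means = {hour: sum(vals) / len(vals) for hour, vals in by_hour.items()}
    hour = max(means, key=means.get)
    return hour, means[hour]


# Hypothetical data for one location: 10 Mbps from 09:00 to 09:55, 20 Mbps from 10:00 to 10:55.
start = datetime(2010, 4, 14, 9, 0)
samples = [(start + timedelta(minutes=5 * i), 10.0 if i < 12 else 20.0) for i in range(24)]
print(busy_hour(samples))
```

Running this once per terminating location gives the per-site busy-hour number the slide refers to.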
  24. 24. Service Volume & Throughput
       • Number of tickets:
           • In the queue each day
           • Closed during week
           • Open at end of week
       • For:
           • All IT
           • Each Team
           • Each Application
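The three ticket counts on this slide can be computed from each ticket's open and close dates. This is a minimal sketch, assuming a ticket is "in the queue" on a day if it was opened on or before that day and not closed by the end of it. The `Ticket` record and its field names are hypothetical, not the schema of any particular ticketing system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Ticket:
    team: str
    opened: date
    closed: Optional[date] = None  # None means the ticket is still open


def in_queue(ticket, day):
    """True if the ticket was open at the end of `day` (assumed definition)."""
    return ticket.opened <= day and (ticket.closed is None or ticket.closed > day)


def weekly_summary(tickets, week_start):
    """Compute the slide's three counts for the 7-day week beginning at `week_start`."""
    days = [week_start + timedelta(days=i) for i in range(7)]
    week_end = days[-1]
    return {
        "in_queue_each_day": {d: sum(in_queue(t, d) for t in tickets) for d in days},
        "closed_during_week": sum(
            1 for t in tickets if t.closed is not None and week_start <= t.closed <= week_end
        ),
        "open_at_end_of_week": sum(in_queue(t, week_end) for t in tickets),
    }


# Hypothetical tickets for the week of Monday, 12 April 2010.
tickets = [
    Ticket("network", date(2010, 4, 12), date(2010, 4, 14)),
    Ticket("desktop", date(2010, 4, 13)),
    Ticket("network", date(2010, 4, 5), date(2010, 4, 12)),
]
summary = weekly_summary(tickets, date(2010, 4, 12))
print(summary["closed_during_week"], summary["open_at_end_of_week"])
```

Filtering `tickets` by team (or by an application field, if one is recorded) before calling `weekly_summary` gives the per-team and per-application views.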
  25. 25. RCAs & Service Improvement Plans
       • For each urgent incident (and other selected topics)
       • What can we do to put in a foolproof solution to ensure this doesn’t happen again?
       • Would it be helpful to have a change buddy?
       • Initiated from data/expert level
  26. 26. The Experience of IT Staff Today
  27. 27. The Experience of Our Users Today
       • Urgent outages are much reduced
       • Reduced ticket load
       • More time for project-oriented activities
       • We rarely fail in the same place or in the same way more than once.
       • We react quickly to irreversibly correct the causes of outages.
  28. 28. Stop Losing Sleep: How we deal with our trickiest availability problems and how you can use the techniques, regardless of your size. Russell Girten, Vice President, Process Transformation & Information Technology, Alaska Communications Systems
  29. 40. Virus Protection
       • Vipre, Sunbelt Systems
       • Lower Overhead
       • Lower Cost
       • Fewer Servers Required
       • Simplified Management
  30. 41. Unix Processing
       • Moderate usage of *nix
       • Often when we use open source apps.
       • Generally outside our Core 20, with one notable exception
       • Implementing High Availability, ETA May 2010
  31. 42. Power & HVAC
       • UPS-backed power
       • Generator-backed power
       • Hot aisles, cold aisles
       • Improving density management
  32. 43. Midrange Processing
       • Core of the business
       • Multiple Systems
       • Redundant components in cabinet
       • Failover to alternate data center
  33. 44. Throughput Management
       • We manage for 24/7 performance and availability
       • Key Network Segments
           • Key field locations
           • Retail locations
       • We look for:
           • Incessant Chat
           • Spikes of Utilization
           • Time of day variation
  34. 45. Server Farms
       • Heavily virtualized environment
       • Dedicated hardware with virtualization for Core 20 systems.
       • If not on the Core 20, probably on cloud processor and storage
       • Turkey Soup, anyone?