Stop Losing Sleep V1.0 20100414
Stop Losing Sleep V1.0 20100414 Presentation Transcript

  • Stop Losing Sleep: How we deal with our trickiest availability problems and how you can use the techniques, regardless of your size. Russell Girten, Vice President, Process Transformation & Information Technology, Alaska Communications Systems
  • The Environment
  • The Old User Experience
    • Users expected outages.
    • A long duration outage: email out of service for four days, July 2008
    • Users were better prepared for outage than IT.
    Not an actual user. Our users weren’t this relaxed.
  • The Experience of IT Staff
    • Little credibility
    • A staff of firefighters and heroes
    • Not enablers: Perceived as a drain on the business
    • A fragile infrastructure, to be sure.
    Not an actual member of IT. They were much more beaten up.
  • Today’s IT Environment
  • What Did We Change?
    • Standard call center & remote footprint: thin client with Citrix
    • Easy client exchange for call centers.
    • Approximately 25% of our desktops are served through Citrix
    • We do not yet publish applications, but have a strong desire to do so
    Remote Desktop
    • We match storage type to the application.
    • Virtualized storage remains close to the server in a private cloud
    • RAID storage when practical
    • Outside:
      • ASP applications (HR)
      • Failover
    • Tape Backup
    • Redundant Images
      • Anchorage
      • Hillsboro
    • High Availability
      • Midrange
      • Unix
      • Core 20 Windows Apps
    • Two Production Restores:
      • Bad PTF
      • Disk Reconfiguration
    • Cluster when possible
      • SQL Server
      • eMail
      • Core 20 Applications
    • Nodes hosted in our Hillsboro Customer Data Center
    • Traffic balanced and redirected via F5 and DNS
  • Cloud Processing
    • Private cloud services for:
      • Processor
      • Storage
    • Public cloud services for:
      • Management of selected applications
      • Invoice Print
      • HRIS Footprint
    • Aggressive investigation:
      • Google Apps
      • Off-site backup
  • Connectivity Inside Alaska
    • Use the most appropriate connectivity for the job.
    • Heavily Metro Ethernet-based
    • Metro Ethernet is meshed – highly available, very flexible.
    • Branch offices supported through Metro Ethernet and DSL
    • PCI DSS Compliant
    • SAS 70 Ready
  • Connectivity Outside Alaska
    • Dual paths of connectivity
    • Dual Internet ingress
    • Dual entrances for critical network segments
    • ACS provides end-to-end service management.
    • Heavy build of VPN services to connect with vendors and partners
  • The Wireless Option
    • Wireless Connectivity
    • For Remote Employees
    • For High-Speed Backup
  • Network Management & Redundancy
    • Dual entrances
    • Dual modes of connectivity
    • Heavy focus on scope management and DNS cleanup
    • Branch office print services are important
    • Software via SCCM
  • Established Availability Windows
    • Our commitments to the business are very clear.
    • Heavily focused on the Core 20 applications
    • Three IT change windows:
      • Fri/Sat Overnight
      • Sat/Sun Overnight
      • Midweek
  • Core 20 Stoplight
    • Stoplights let us quickly assess the state of our environment.
    • Focused on:
      • Resources
      • Age
      • Support
    • Where Red/Yellow exists, SIPs are required.
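The stoplight rule above can be sketched in a few lines of Python. This is only an illustration: the application names are hypothetical, and rolling the three ratings (Resources, Age, Support) up to the worst color is an assumption, not something the slide states.

```python
from dataclasses import dataclass

GREEN, YELLOW, RED = "green", "yellow", "red"


@dataclass
class AppStatus:
    """One Core 20 application with its three stoplight ratings."""
    name: str
    resources: str
    age: str
    support: str

    def overall(self) -> str:
        # Assumption: the overall color is the worst of the three ratings.
        ratings = (self.resources, self.age, self.support)
        if RED in ratings:
            return RED
        if YELLOW in ratings:
            return YELLOW
        return GREEN


def apps_needing_sip(apps):
    """Any Red or Yellow rating means a Service Improvement Plan is required."""
    return [app.name for app in apps if app.overall() != GREEN]


# Hypothetical applications, for illustration only.
core = [
    AppStatus("billing", GREEN, GREEN, GREEN),
    AppStatus("email", GREEN, YELLOW, GREEN),
    AppStatus("provisioning", RED, GREEN, YELLOW),
]
print(apps_needing_sip(core))
```

A list like this is small enough to keep in a spreadsheet; the point is that the SIP requirement follows mechanically from the colors.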
  • Change Management
    • Implementation Plan
    • Results of User Acceptance Testing
    • Communication Plan
    • Post-Implementation Test Plan
    • Rollback Triggers and Plan
    Culture change is required to make this work. Management discipline is necessary, and the staff must buy in.
  • High Usage Devices
    • We are aggressive about managing high-bandwidth utilization on low-speed connections. We use NetFlow.
    • We allow most types of Internet traffic, including streaming media, but…
    • We contact high-usage personnel and advise: “We noticed your service might not be working well. What can we do to help improve things?” This gets the point across.
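One way to find the people to call is to total flow bytes per host on each low-speed link and flag anyone using a large share of it. The sketch below assumes NetFlow records have already been exported and parsed into dictionaries; the field names (`link`, `host`, `bytes`), the link name, and the 50% threshold are all hypothetical.

```python
from collections import defaultdict


def heavy_users(flows, link_capacity_bps, interval_s, share=0.5):
    """Return (link, host, fraction) for hosts whose average rate over the
    interval exceeds `share` of the link's capacity, largest first."""
    bytes_by_host = defaultdict(int)
    for flow in flows:
        bytes_by_host[(flow["link"], flow["host"])] += flow["bytes"]

    flagged = []
    for (link, host), total_bytes in bytes_by_host.items():
        avg_bps = total_bytes * 8 / interval_s
        fraction = avg_bps / link_capacity_bps[link]
        if fraction > share:
            flagged.append((link, host, round(fraction, 2)))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical five-minute sample from a 1.5 Mbit/s branch DSL link.
flows = [
    {"link": "branch-dsl", "host": "10.0.0.5", "bytes": 30_000_000},
    {"link": "branch-dsl", "host": "10.0.0.5", "bytes": 10_000_000},
    {"link": "branch-dsl", "host": "10.0.0.9", "bytes": 5_000_000},
]
capacity = {"branch-dsl": 1_500_000}
flagged = heavy_users(flows, capacity, interval_s=300)
print(flagged)
```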
  • High Bandwidth Consumption
    • For Each Terminating Location:
      • Busy Hour Throughput
    • Standard Business Day Management
    • Data collected with Intermapper and processed with Excel
    • Managed in 5-minute increments.
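The busy-hour figure can be derived from the 5-minute samples with a sliding window. The sketch below is a Python equivalent of what could be done in Excel, assuming "busy hour" means the 60 consecutive minutes (12 samples) with the highest mean throughput; the sample data is invented.

```python
def busy_hour(samples, window=12):
    """samples: list of (label, bits_per_second) in time order, one per
    5 minutes. Returns (start_label, mean_bps) of the busiest window."""
    if len(samples) < window:
        raise ValueError("need at least one full window of samples")
    values = [bps for _, bps in samples]

    # Slide a fixed-size window across the day, keeping a running sum.
    running = sum(values[:window])
    best_sum, best_start = running, 0
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        if running > best_sum:
            best_sum, best_start = running, i - window + 1
    return samples[best_start][0], best_sum / window


# Hypothetical two hours of samples: quiet, one busy hour, quiet again.
rates = [100] * 6 + [400] * 12 + [100] * 6
samples = [
    (f"{(i * 5) // 60:02d}:{(i * 5) % 60:02d}", rate)
    for i, rate in enumerate(rates)
]
print(busy_hour(samples))
```

Running this once per terminating location gives one busy-hour number per site per day, which is easy to trend.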
  • Service Volume & Throughput
    • Number of tickets:
      • In the queue each day
      • Closed during week
      • Open at end of week
    • For:
      • All IT
      • Each Team
      • Each Application
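The three ticket counts above can be computed from a plain export of ticket records. The sketch below is a minimal example: the field names (`opened`, `closed`, `team`) and the sample tickets are hypothetical, and a ticket is counted as "in the queue" on a day if it was opened on or before that day and not closed by the end of it.

```python
from datetime import date, timedelta


def weekly_ticket_stats(tickets, week_start, days=7):
    """Count tickets in the queue each day, closed during the week,
    and still open at the end of the week."""
    week_days = [week_start + timedelta(days=i) for i in range(days)]
    week_end = week_days[-1]

    def open_on(ticket, day):
        closed = ticket["closed"]
        return ticket["opened"] <= day and (closed is None or closed > day)

    return {
        "in_queue_by_day": {
            day.isoformat(): sum(open_on(t, day) for t in tickets)
            for day in week_days
        },
        "closed_during_week": sum(
            1 for t in tickets
            if t["closed"] is not None and week_start <= t["closed"] <= week_end
        ),
        "open_at_end_of_week": sum(open_on(t, week_end) for t in tickets),
    }


def stats_by_team(tickets, week_start):
    """Same counts, broken out per team."""
    teams = sorted({t["team"] for t in tickets})
    return {
        team: weekly_ticket_stats(
            [t for t in tickets if t["team"] == team], week_start
        )
        for team in teams
    }


# Hypothetical tickets for the week starting 2010-04-05.
tickets = [
    {"team": "network", "opened": date(2010, 4, 1), "closed": date(2010, 4, 6)},
    {"team": "desktop", "opened": date(2010, 4, 5), "closed": None},
    {"team": "network", "opened": date(2010, 4, 7), "closed": date(2010, 4, 7)},
    {"team": "network", "opened": date(2010, 4, 9), "closed": None},
]
stats = weekly_ticket_stats(tickets, date(2010, 4, 5))
per_team = stats_by_team(tickets, date(2010, 4, 5))
print(stats["closed_during_week"], stats["open_at_end_of_week"])
```

The same grouping works per application by swapping the `team` field for an application field.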
  • RCAs & Service Improvement Plans
    • For each urgent incident (and other selected topics)
    • What can we do to put in a foolproof solution to ensure this doesn’t happen again?
    • Would it be helpful to have a change buddy?
    • Initiated from data/expert level
  • The Experience of IT Staff Today (image: http://imgs.xkcd.com/comics/devotion_to_duty.png)
  • The experience of our users today
    • Urgent outages are much reduced
    • Reduced ticket load
    • More time for project-oriented activities
    • We (rarely) fail at the same place/in the same way more than once
    • We react quickly to permanently correct the causes of outages.
  • Virus Protection
    • VIPRE, Sunbelt Software
    • Lower Overhead
    • Lower Cost
    • Fewer Servers Required
    • Simplified Management
  • Unix Processing
    • Moderate usage of *nix
    • Often when we use open source apps.
    • Generally out of our Core 20, with one notable exception
    • Implementing High Availability, ETA May 2010
  • Power & HVAC
    • UPS-backed power
    • Generator-backed power
    • Hot aisles, cold aisles
    • Improving density management
  • Midrange Processing
    • Core of the business
    • Multiple Systems
    • Redundant components in cabinet
    • Failover to alternate data center
  • Throughput Management
    • We manage for 24/7 performance and availability
    • Key Network Segments
      • Key field locations
      • Retail locations
    • We look for:
      • Incessant Chat
      • Spikes of Utilization
      • Time of day variation
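Two of the patterns above, spikes of utilization and time-of-day variation, are easy to pull out of a series of utilization samples. The sketch below is one simple approach under stated assumptions: a spike is any sample more than twice the mean of the six samples before it, and the lookback, the factor, and the sample data are all hypothetical choices.

```python
from statistics import mean


def find_spikes(utilization, lookback=6, factor=2.0):
    """Return indexes of samples that exceed `factor` times the mean of
    the `lookback` samples immediately before them."""
    spikes = []
    for i in range(lookback, len(utilization)):
        baseline = mean(utilization[i - lookback:i])
        if baseline > 0 and utilization[i] > factor * baseline:
            spikes.append(i)
    return spikes


def hourly_profile(samples):
    """samples: list of (hour_of_day, utilization).
    Returns {hour: mean utilization} to show time-of-day variation."""
    by_hour = {}
    for hour, value in samples:
        by_hour.setdefault(hour, []).append(value)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


# Hypothetical percent-utilization samples for one network segment.
utilization = [10, 12, 11, 9, 10, 11, 40, 12, 11, 10]
spikes = find_spikes(utilization)
profile = hourly_profile([(9, 20), (9, 40), (14, 60)])
print(spikes, profile)
```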
  • Server Farms
    • Heavily virtualized environment
    • Dedicated hardware with virtualization for Core 20 systems.
    • If not on the Core 20, probably cloud processor and storage
    • Turkey Soup, anyone?