Data centre incident Nov 2010 v3

University of Glamorgan's data centre incident.

  • Whilst we had a Disaster Recovery Plan, it had not been tested at this scale, and there was an element of truth in this slide.
  • Lots of standalone physical servers on the floor under the tabletops. Lots more servers around the University hidden away in cupboards.
  • Due to be commissioned on the Tuesday following the incident.
  • 3. Show the System Dependencies Spreadsheet.

Presentation Transcript

  • Disaster and Recovery
    By Alan Davies
    Gregynog Colloquium 17th June 2011
  • Topics
    Before the Flood
    The “Disaster” !
    The Recovery
  • Before Server Virtualisation: How the room looked in 2009
  • Servers
    Over 200 standalone
    Virtualisation – 200 into 20 will go !
    9 new Host Servers, holding 155 Virtual Servers
    Power Savings
    Space Savings
    Resilience ??
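The consolidation figures on this slide can be checked with a little arithmetic. The sketch below assumes the 155 virtual servers are spread as evenly as possible across the 9 hosts, which the slides do not state; the function name is illustrative only.

```python
import math

HOSTS = 9               # host servers, from the slide
VIRTUAL_SERVERS = 155   # virtual servers, from the slide


def busiest_host_load(virtual_servers: int, hosts: int) -> int:
    """VMs on the busiest host when VMs are spread as evenly as possible."""
    return math.ceil(virtual_servers / hosts)


# ASSUMPTION: even spread; the real placement is not given in the slides.
normal = busiest_host_load(VIRTUAL_SERVERS, HOSTS)
one_host_down = busiest_host_load(VIRTUAL_SERVERS, HOSTS - 1)

print(f"All hosts up:  up to {normal} virtual servers per host")
print(f"One host down: up to {one_host_down} virtual servers per host")
```

Consolidation saves power and space, but each host now carries many services, which is the resilience question the slide raises.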
  • Storage
    60TB of data
    (100,000 CDs)
    10GB per staff
    Resilience ??
  • Data Backup
    40TB Disk capacity
    Tape cartridges 1.6TB
    48 Cartridge Tape Library
     Secure Fireproof Safes
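The storage and backup figures on the two slides above can be cross-checked. This is a rough sketch that assumes decimal units (1 TB = 1000 GB) and treats 1.6TB as the usable capacity of each cartridge; the slides do not say whether that figure is native or compressed. The function names are illustrative.

```python
# Figures from the slides, expressed in GB to keep the arithmetic in integers.
DATA_GB = 60_000        # 60TB of data
CD_COUNT = 100_000      # "(100,000 CDs)"
CARTRIDGE_GB = 1_600    # 1.6TB tape cartridges
LIBRARY_SLOTS = 48      # 48 cartridge tape library


def implied_cd_size_mb(data_gb: int, cd_count: int) -> float:
    """Size each CD would need to be for the comparison to hold."""
    return data_gb * 1000 / cd_count


def library_capacity_gb(cartridge_gb: int, slots: int) -> int:
    """Capacity of a fully loaded tape library."""
    return cartridge_gb * slots


cd_mb = implied_cd_size_mb(DATA_GB, CD_COUNT)
library_gb = library_capacity_gb(CARTRIDGE_GB, LIBRARY_SLOTS)
full_copies = library_gb / DATA_GB

print(f"Implied CD size: {cd_mb:.0f} MB")
print(f"Full tape library: {library_gb / 1000:.1f} TB")
print(f"Full copies of the data that fit on one load: {full_copies:.2f}")
```

Under these assumptions the CD comparison works out at about 600 MB per disc, and one full load of the library holds a little over one complete copy of the data.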
  • Environment Control
    Diesel Generator
    Humidity !!
  • Secondary Data Centre
  • The Disaster: Sunday 28 November
    Freezing Temperatures
    Rooftop Air Handler
    Water, Water, Everywhere !!
  • Water Trashed our lovely Server Room !
    Backup device survived!!
    But not the overnight tapes
    Library Servers
  • Let's Build Another One..!
    Boxes x 300
  • Let's Build Another One..!
    Luverly ! 
    Production Line 
    .. bit by bit ....
  • Now to Restore Services !
    University Gold Team (Chaired by the VC)
    Business Continuity and Recovery
    Prioritising Services
    Tracking Progress
    Regular meetings, 29 Nov to 15 Dec
    ISD Contingency Team
    Recovery and Business Continuity
    Mapping Service Dependencies
    Managing Resources (people, procurement, time)
    Directing operations
    Dealing with Insurance Claim
    Lots of staff involved
    Everyone in the department had a part to play.
  • Now to Restore Services !
    Scale of Operation
    165 Servers destroyed
    121 Live Services
    Core Services – 39 (Telephone, Web Site, Email, VLE...)
    Non Core Services – 82 (Tills, HR, Invoicing...)
    20 Test & Development Environments
    Cleaning the room and salvaging equipment
    Limiting further risk by removing the cause
    Identifying what services were working (not working)
    Recovering services by alternative means (where we could)
    Procuring equipment prior to the rebuild
    Building a new server infrastructure
    Recovering services by priority
    Keeping the Gold Team informed
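The counts on this slide are consistent (39 core plus 82 non-core gives 121 live services), and "recovering services by priority" can be sketched as a simple sorted queue. The service names below come from the slide, but the `Service` type, the two-tier ordering rule and the restored flags are assumptions made for illustration, not the team's actual tracking method.

```python
from dataclasses import dataclass

CORE_SERVICES = 39       # from the slide
NON_CORE_SERVICES = 82   # from the slide
LIVE_SERVICES = 121      # from the slide


@dataclass(frozen=True)
class Service:
    name: str
    core: bool
    restored: bool = False


def recovery_queue(services):
    """Outstanding services: core first, then alphabetical within each tier."""
    pending = [s for s in services if not s.restored]
    return sorted(pending, key=lambda s: (not s.core, s.name))


def progress(services):
    """Return (restored, total) for reporting to the Gold Team."""
    return sum(s.restored for s in services), len(services)


# HYPOTHETICAL status snapshot using service names from the slide.
snapshot = [
    Service("Tills", core=False),
    Service("Email", core=True),
    Service("HR", core=False),
    Service("Telephone", core=True, restored=True),
    Service("VLE", core=True),
]

queue_names = [s.name for s in recovery_queue(snapshot)]
done, total = progress(snapshot)
print("Next to restore:", ", ".join(queue_names))
print(f"Restored {done} of {total}")
```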
  • What Next ?
    Options Paper  DISAG
    Independent Review
    Prof David Baker
    Secondary Server Room
    External Services?
  • Lessons Learnt – Management Perspective.
    Successful recovery is based on staff goodwill, commitment, professionalism.
    Having and maintaining good relationships with suppliers.
    Having a strong recovery team with management, operational and administration experience.
    Having the Gold team to agree priorities.
    Everyone wants to help!
    Having a contacts list to get hold of key staff, and key suppliers.
    People are patient and will wait for their systems if they understand the situation.
    The value of having a staff and student portal (especially when you don’t have it!)
    The value of Facebook to get messages out to staff and students.
    Sharing personal emails and mobile phone numbers to ease communication.
    Communicating ‘what is happening with the recovery process’ is important for your own department staff.
    Tempering expectations by communicating the right message to the organisation and customers.
  • Lessons Learnt – Management Perspective.
    Keeping an itemised list of parts of equipment held in your Data Centre will allow you to replace equipment quickly.
    Having a list of core services and their dependencies so that you can agree priorities for restoring.
    Don’t put all your eggs in one basket.
    Don’t keep your backup/restore device in the same building as the systems it protects.
    Never put equipment in front of a room cooling system which has a fan that is capable of blowing water across the room.
    Never assume that, because there is no water in the data centre, water cannot find a way into the building.
    Having the ability to raise orders quickly.
    Using existing framework agreements to reduce the time spent on procurement and European competition.
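A list of services and their dependencies, as recommended above, can be turned into a restore order with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The dependency map is entirely hypothetical: Email, VLE and Web Site are named in the slides, while Network, Storage and Directory are placeholder infrastructure names invented for the example.

```python
from graphlib import TopologicalSorter

# HYPOTHETICAL map: each service -> the services it needs running first.
DEPENDS_ON = {
    "Network": set(),
    "Storage": set(),
    "Directory": {"Network"},
    "Email": {"Network", "Directory", "Storage"},
    "VLE": {"Network", "Directory", "Storage"},
    "Web Site": {"Network", "Storage"},
}


def restore_order(depends_on):
    """Return services in an order where every dependency comes first.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = restore_order(DEPENDS_ON)
print(" -> ".join(order))
```

Several valid orders usually exist, so a sort like this only rules out impossible sequences. Choosing between the valid ones is still a priority decision for people to agree.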
  • Lessons Learnt – Management Perspective.
    Keep a log of all decisions and actions taken.
    If there is a risk, don’t delay in dealing with it.
    Ensure that every system is backed up.
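One way to check that every system is backed up is to compare the equipment inventory against the list of systems that have a backup job. This is a minimal sketch; the system names and the `unprotected` helper are hypothetical, and a real check would read both lists from the inventory and the backup software.

```python
def unprotected(inventory, backed_up):
    """Systems in the inventory that have no backup job, sorted by name."""
    return sorted(set(inventory) - set(backed_up))


# HYPOTHETICAL data for illustration only.
inventory = ["email-01", "vle-01", "tills-01", "hr-01"]
backed_up = ["email-01", "vle-01", "hr-01"]

missing = unprotected(inventory, backed_up)
print("No backup configured for:", ", ".join(missing) or "none")
```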
  • The Future - How it looks today.
  • An IT Infrastructure Incident
    Any Questions?