Data centre incident nov 2010 v3


University of Glamorgan's data centre incident.

Published in: Education, Technology, Business

  • Whilst we had a Disaster Recovery Plan, it had not been tested at this scale, and there was an element of truth in this slide.
  • Lots of standalone physical servers on the floor under the tabletops. Lots more servers around the University, hidden away in cupboards.
  • Due to be commissioned on the Tuesday following the incident.
  • 3. Show the System Dependencies Spreadsheet.

    1. Disaster and Recovery
        • By Alan Davies
        • Gregynog Colloquium, 17th June 2011
    2.
    3. Topics
        • Before the Flood
        • The “Disaster”!
        • The Recovery
        • Future
    4. Before Server Virtualisation
        • How the room looked in 2009
    5. Servers
        • Over 200 standalone
        • Virtualisation – 200 into 20 will go!
        • 9 new Host Servers, holding 155 Virtual Servers
        • Power Savings
        • Space Savings
        • Resilience??
    6. Storage
        • 60TB of data (100,000 CDs)
        • 10GB per staff
        • Resilience??
    7. Data Backup
        • Disk-to-Disk-to-Tape
        • 40TB Disk capacity
        • Tape cartridges 1.6TB
        • 48 Cartridge Tape Library
        • Secure Fireproof Safes
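As a rough check on the figures in slides 6 and 7, the sketch below works out the capacity of a fully loaded tape library and how many cartridges one complete copy of the data would need. It assumes decimal units (1TB = 1000GB), that 1.6TB is the usable capacity of each cartridge, and no compression or deduplication. None of these assumptions is stated on the slides, and the function names are purely illustrative.

```python
import math

# Figures from the slides, converted to GB (assumption: 1 TB = 1000 GB).
CARTRIDGE_GB = 1600   # 1.6TB per tape cartridge (assumed usable capacity)
LIBRARY_SLOTS = 48    # 48-cartridge tape library
DATA_GB = 60_000      # 60TB of data


def library_capacity_gb(slots: int, cartridge_gb: int) -> int:
    """Total capacity of a tape library with every slot filled."""
    return slots * cartridge_gb


def cartridges_needed(data_gb: int, cartridge_gb: int) -> int:
    """Whole cartridges required to hold one full copy of the data."""
    return math.ceil(data_gb / cartridge_gb)


if __name__ == "__main__":
    print(library_capacity_gb(LIBRARY_SLOTS, CARTRIDGE_GB))  # 76800 (GB)
    print(cartridges_needed(DATA_GB, CARTRIDGE_GB))          # 38
```

Under these assumptions a full library (76.8TB) holds one complete copy of the 60TB, using 38 of the 48 slots.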
    8. Environment Control
        • Power
        • UPS
        • Diesel Generator
        • Cooling
        • Humidity!!
    9. Secondary Data Centre
    10. The Disaster
        • Sunday 28 November
        • Freezing Temperatures
        • Rooftop Air Handler
        • Water, Water, Everywhere!!
    11. Water Trashed our lovely Server Room!
    12. Water Trashed our lovely Server Room!
    13. Water Trashed our lovely Server Room!
    14. Water Trashed our lovely Server Room!
        • Backup Device survived!!
        • But not the overnight tapes
        • Library Servers
    15. Let’s Build Another One..!
    16. Let’s Build Another One..!
        • Boxes x 300
    17. Let’s Build Another One..!
        • Luverly!
        • Production Line
        • .. bit by bit ....
    18. Now to Restore Services!
        • University Gold Team (Chaired by the VC)
            • Business Continuity and Recovery
            • Prioritising Services
            • Tracking Progress
            • Communicating
            • Regular meetings, 29 Nov to 15 Dec
        • ISD Contingency Team
            • Recovery and Business Continuity
            • Mapping Service Dependencies
            • Managing Resources (people, procurement, time)
            • Directing operations
            • Dealing with Insurance Claim
        • Lots of staff involved
            • Everyone in the department had a part to play.
    19. Now to Restore Services!
        • Scale of Operation
            • 165 Servers destroyed
            • 121 Live Services
                • Core Services – 39 (Telephone, Web Site, Email, VLE...)
                • Non Core Services – 82 (Tills, HR, Invoicing...)
            • 20 Test & Development Environments
        • Process
            • Cleaning the room and salvaging equipment
            • Limiting further risk by removing the cause
            • Identifying what services were working (not working)
            • Recovering services by alternative means (where we could)
            • Procuring equipment prior to the rebuild
            • Building a new server infrastructure
            • Recovering services by priority
            • Keeping the Gold Team informed
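The "Mapping Service Dependencies" and "Recovering services by priority" steps can be combined into a single restore order: a service is only restored once everything it depends on is back, and among the services that are ready, core ones go first. The sketch below is one possible way to compute such an order, not the method the team actually used. Email, VLE and Tills are named on the slides, but the dependency map, the infrastructure entries ("network", "directory", "storage") and the priority numbers are hypothetical.

```python
import heapq
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists what must be restored first.
DEPENDS_ON = {
    "network": set(),
    "directory": {"network"},
    "storage": {"network"},
    "email": {"directory", "storage"},
    "vle": {"directory", "storage"},
    "tills": {"network"},
}

# Hypothetical priorities: 0 = core service, 1 = non-core service.
PRIORITY = {
    "network": 0, "directory": 0, "storage": 0,
    "email": 0, "vle": 0, "tills": 1,
}


def recovery_order(depends_on, priority):
    """Return a restore order that respects dependencies and, among the
    services currently ready, picks the lowest priority number first."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    ready = []        # heap of (priority, name) for services ready to restore
    order = []
    while sorter.is_active():
        for service in sorter.get_ready():
            heapq.heappush(ready, (priority[service], service))
        _, service = heapq.heappop(ready)
        order.append(service)
        sorter.done(service)  # may make dependent services ready
    return order


if __name__ == "__main__":
    print(recovery_order(DEPENDS_ON, PRIORITY))
```

With this example data the non-core tills are restored last even though they depend only on the network, because core services that become ready are always taken first. Ties within the same priority are broken alphabetically.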
    20. Now to Restore Services!
        • Timeline
    21. What Next?
        • Options Paper DISAG
        • Independent Review
        • Prof David Baker
        • Secondary Server Room
        • External Services?
    22. Lessons Learnt – Management Perspective.
        • People
            • Successful recovery is based on staff goodwill, commitment, professionalism.
            • Having and maintaining good relationships with suppliers.
            • Having a strong recovery team with management, operational and administration experience.
            • Having the Gold team to agree priorities.
            • Everyone wants to help!
        • Communications
            • Having a contacts list to get hold of key staff, and key suppliers.
            • People are patient and will wait for their systems if they understand the situation.
            • The value of having a staff and student portal (especially when you don’t have it!)
            • The value of Facebook to get messages out to staff and students.
            • Sharing personal emails and mobile phone numbers to ease communication.
            • Communicating ‘what is happening with the recovery process’ is important for your own department staff.
            • Tempering expectations by communicating the right message to the organisation and customers.
    23. Lessons Learnt – Management Perspective.
        • Inventory
            • Keeping an itemised list of the parts and equipment held in your Data Centre will allow you to replace equipment quickly.
            • Having a list of core services and their dependencies so that you can agree priorities for restoring.
        • Resilience
            • Don’t put all your eggs in one basket.
            • Don’t keep your backup/restore device in the same building.
            • Never put equipment in front of a room cooling system which has a fan that is capable of blowing water across the room.
            • Never assume that, because there is no water in the data centre, water cannot find a way into the building.
        • Procurement
            • Having the ability to raise orders quickly.
            • Using existing framework agreements to reduce time for procurements and European competition.
    24. Lessons Learnt – Management Perspective.
        • Operations
            • Keep a log of all decisions and actions taken.
            • If there is a risk, don’t delay in dealing with it.
            • Ensure that every system is backed up.
    25. The Future - How it looks today.
    26. How it looks today.
    27. How it looks today.
    28. An IT Infrastructure Incident
        • Any Questions?