
Avoiding Data Center Disasters

Data centers are mission critical sites with complex components that are susceptible to a wide range of risks and hazards. When a data center outage occurs at another organization it is a "lesson learned." When it happens to you - it is a DISASTER. Watch the webinar: http://go.italerting.com/data-center-risks.html?trk=jimnelson_social

Published in: Business

  1. 1. Avoiding Data Center Disasters: How to Address IT Service Continuity Risk Factors | June 23, 2016
  2. 2. Agenda
      • Introduction and Housekeeping
      • Avoiding Data Center Disasters with Jim Nelson
      • How to Respond to a Data Center Disaster with Vincent Geffray
      • Q&A Session with the Speakers
      The slides and recording will be sent to all registrants early next week.
  3. 3. Questions @ITAlerting #ITAWebinar
  4. 4. Speakers
      • Vincent Geffray, Senior Director, Product Marketing, Everbridge
      • Jim Nelson, President, Business Continuity Service, Inc. (BCS)
  5. 5. Avoiding Data Center Disasters
  6. 6. Avoiding Data Center Disasters (© 2016 BCS)
      • Main causes of downtime
      • Challenges we face: complexity and risks
      • Some warning signs / indicators
      • Some suggestions: quick fixes, interventions, mitigations
      • References
  7. 7. Who has...
      • A data center that is 10 years old?
      • A data center that is brand new (less than 2 years old)?
      • A data center that is 5 years old?
      • A data center that is 15 years old?
      • A data center that is 20 years old?
      • A data center that is more than 20 years old?
  8. 8. Who has...
      • Experienced a scheduled outage in the last 12 months?
      • Experienced an unplanned outage in the last 12 months?
      • Completed a limited recovery exercise in the last 12 months?
      • Completed a comprehensive recovery exercise, including business units, in the last 12 months?
      • Updated your BIA / RA in the last 12 months?
  9. 9. Ponemon Institute
      • According to our new study, the average cost of a data center outage has steadily increased from:
          - $505,502 in 2010
          - $690,204 in 2013
          - $740,357 in 2016 (or a 38 percent net change).
      • Maximum downtime costs increased 32 percent since 2013 and 81 percent since 2010.
      • Maximum downtime costs for 2016 are $2,409,991.
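
The percentages on this slide can be checked with a few lines of arithmetic. The sketch below is not taken from the Ponemon report. It assumes two common definitions of percent change (relative to the starting value, and relative to the midpoint of the two values) and applies both to the average costs quoted on the slide. The function names are illustrative.

```python
def simple_pct_change(old: float, new: float) -> float:
    """Percent change relative to the starting value."""
    return (new - old) / old * 100


def midpoint_pct_change(old: float, new: float) -> float:
    """Percent change relative to the midpoint of the two values."""
    return (new - old) / ((old + new) / 2) * 100


# Average outage costs quoted on the slide (USD).
avg_cost = {2010: 505_502, 2013: 690_204, 2016: 740_357}

simple = simple_pct_change(avg_cost[2010], avg_cost[2016])
midpoint = midpoint_pct_change(avg_cost[2010], avg_cost[2016])

print(f"2010 -> 2016, relative to 2010:     {simple:.1f}%")
print(f"2010 -> 2016, relative to midpoint: {midpoint:.1f}%")
```

Under these assumptions the midpoint-based figure rounds to 38 percent, which matches the slide, while the change measured against the 2010 value alone comes out closer to 46 percent. Which definition the study actually used is not stated in the deck.
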
  10. 10. Ponemon
      • UPS failures are the #1 cause of unplanned data center outages, accounting for one-quarter of all such events.
      • Cybercrime is the fastest-growing cause of data center outages: 2 percent in 2010, 18 percent in 2013, 22 percent in the latest study. Cost impact: $981k.
      • Human error: 22%. Cost impact: $489k.
      • IT equipment failure: root cause of 4%. Cost impact: $995k.
  11. 11. Gartner Report
      • 80% of outages impacting mission-critical services will be caused by people and process issues, and
      • more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues.
  12. 12. Gartner
      The study highlights 7 items; I highlight 3:
      1. How well are standards defined and followed?
      2. How well are IT services documented or tracked?
      3. What is the degree of business risk that IT organizations will tolerate?
  13. 13. D&B: According to Dun & Bradstreet, 59% of Fortune 500 companies experience a minimum of 1.6 hours of downtime per week. Source: http://www.businesscomputingworld.co.uk/assessing-the-financial-impact-of-downtime/
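
To put 1.6 hours per week in the terms usually used for service levels, the sketch below converts it to an availability percentage and an annual total. It assumes a 168-hour week and a 52-week year; the function names are illustrative and not from the deck.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours, assuming round-the-clock service


def availability_pct(downtime_hours_per_week: float) -> float:
    """Share of the week the service is up, as a percentage."""
    return (1 - downtime_hours_per_week / HOURS_PER_WEEK) * 100


def annual_downtime_hours(downtime_hours_per_week: float,
                          weeks_per_year: int = 52) -> float:
    """Total downtime over a year at a constant weekly rate."""
    return downtime_hours_per_week * weeks_per_year


weekly_downtime = 1.6  # hours per week, the figure quoted on the slide

print(f"Availability:    {availability_pct(weekly_downtime):.2f}%")
print(f"Annual downtime: {annual_downtime_hours(weekly_downtime):.1f} hours")
```

Under these assumptions, 1.6 hours per week works out to roughly 99.05% availability and about 83 hours of downtime per year.
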
  14. 14. Challenges
      • BIG DATA (structured, unstructured)
      • “Cloud” issues
      • Green, PUE, efficiencies
      • Competitive intelligence, analytics, dashboards
      • Social media, wikis, blogs, video
      • Regulatory, compliance
      • IA, data security, cyber
      • Customers, budget, resources, $$$, time
  15. 15. 15© 2013 BCS
  16. 16. Enterprise data center architecture: DC architecture’s three-tier structure adds cost / complexity
      • Front end: web serving, web content access
      • Mid tier: application business logic; business transactions, commit/update
      • Back end: database query/response
      • Workload characteristics
          - Web: few resources other than networking; transient data
          - Application: large amounts of CPU and memory
          - Database: large amounts of CPU and memory, fast storage, low-latency networking
      • Diagram labels: workload isolation, NAS filer, FC SAN
  17. 17. 17© 2016 BCS
  18. 18. Is all DOWNTIME the same?
      • Unplanned downtime: The Oopsie!...
      • Scheduled downtime: backups, hardware, software upgrades, preventative maintenance, capacity upgrades, EOL, retiring old systems, network upgrades, cabling changes...
  19. 19. The great majority of system and data unavailability is the result of planned downtime that occurs due to required maintenance. Although unplanned downtime accounts for only about 10% of all downtime, its unexpected nature means that any single downtime incident may be more damaging to the enterprise, physically and financially, than many occurrences of planned downtime.
  20. 20. Warning Signs
      • Data center location and design: building codes, CRAC/CRAH units on the raised floor, cleaning, vibration, UPS, fire detection / suppression, architectural ceilings, raised floor, walls
      • Data center access: staff in room, packing materials, combustibles, service personnel, vendors, customer tours…
  21. 21. Location hazards, risks evaluation
      • Sample list of potential natural hazards (Lions and tigers and bears, oh my!): severe weather, Sandy, lightning, flood, hurricanes, typhoons, super storms, tornado, brush / forest fires, drought, earthquake, Nor'easter, blizzard, ice storm, tsunami, slapping wires, etc.
      • Potential man-made hazards: violence in the workplace, Boston lockdown (SIP), Fukushima, Gulf oil spills, terrorism, flight paths, neighbors, rail, hazmat, chemical spills, pollution, EMF, human error, denied access, occupy “everything”, cyber attacks, RAID controller failure, bad power supplies, UPS failure, battery failures, cooling failure, change control problem, bridge collapse, etc.
  22. 22. Proximity Evaluation (map labels: Tank Farm, Tank Farm, Tank Farm, Power Plant)
  23. 23. Building Evaluations
      • Rent / Buy / Build / Cloud Hybrids / Consolidation: building codes, existing building history, capacity, # of floors, stacking, security, wind speed, high-speed network access, power quality, expansion space and %, architectural features, floor loading, open vertical / slab height, transportation, parking, detention / retention ponds, backup water, cooling capacity, heat sinks, acoustical, bargaining unit or right to work, access to technical talent, access to physical plant / facilities talent, tax breaks, tax incentives, etc.
      • TCO: Design, Build, Operate, Decommission or repurpose
  24. 24. Causes of Outages
      • Human error: reaching design limits, poor change control, weak documentation, lack of standards / processes / best practices, no training, tours and visitor access, accidents, water incursion…
      • Power quality issues: poor voltage / current / frequency regulation, common mode noise, grounding problems, harmonic issues, EMF, RFI, wireless…
      • Design and operation: most designs are fine, but operations and changes compromise the design intent; poor or deferred maintenance; generator, UPS and batteries; mechanical moving parts, transfer switches, breakers, belts, pumps, capacitors, filters
      • Environment: temperature / humidity, contamination, corrosion, pollution…
  25. 25. Cabling Standards?
  26. 26. Cabling Standards?
  27. 27. OK so… Ideas, suggestions and approaches to address the integration and pervasiveness of technologies throughout the organization.
  28. 28. Get help!
      • The risk of positioning yourselves (ICT) as the “experts” on all things is that you will be viewed as accountable and held accountable (read: blamed!). This can be a “resume updating event”.
      • I did NOT say it was your fault. I said I am going to BLAME you.
  29. 29. Suggestions and strategies
      • Engage top management
      • Timing is appropriate
      • Engage the “Business”
      • Establish a Steering Committee
      • Update or conduct a Business Impact Analysis (BIA) and Risk Assessment (RA)
      • Use a standards approach: ISO, TIA, BCSI, etc.
  30. 30. Have a Plan
      • Align to standards!
      • Backups are fine
      • Restores are minimum
      • Testing is critical
      • Alignment with Business: BCM, RM, IA, Resilience
      • The “traditional” BIA and RA fall short unless done well!
  31. 31. Review
      • Location
      • Process control
      • Documentation
      • Train your people
      • Mind the little things
      • Project management / change control
      • Engage the Business
      • KYP: know your personnel
  32. 32. Organizational approach
      • Leverage other organizational areas (such as Risk, Information Assurance, BCM, Internal Audit, EH&S, Security, Insurance)
      • Hire it done: engage some experts
      • DIY: do it yourself
      • What are the trade-offs?
      • Disruption vs. cost benefit
      • Document!
  33. 33. Some suggestions
      • A tidy shop is a happy shop: keep a clean facility
      • Hire a technical writer
      • Train your people: cost versus benefit
      • Noah (2x2): escort everyone
      • Clearly define expectations and procedures that spell out “what to do” and “what not to do”
      • Manage / monitor: immediate response
  34. 34. Some suggestions
      • Take readings and monitor the environment
      • Integrate with CAB
      • Keep copper communication cables away from power cables, motors, transformers
      • Handheld devices: keep distance from EDP equipment
  35. 35. Some suggestions
      • Define and ENFORCE policies & procedures
      • No emergency deliveries: keep spares
      • No packing / unpacking inside the data center
      • No food, drinks, or people, if possible
      • AVOID rushing, running and moving too quickly
      • Contractor selection & specifications before and after work
      • Walk-off mats
      • Anti-static procedures: USE them!
  36. 36. Some suggestions
      • Communicate with your teams
      • If you do not define expectations, they will make them up
      • Leverage your vendors
      • Talk with PEOPLE (yes, that is allowed)
      • Go outside your “comfort zone”
      • Look for choices
  37. 37. References
      • Gartner: http://www.rbiassets.com/getfile.ashx/42112626510
      • Ponemon: http://www.emersonnetworkpower.com/en-US/Brands/Liebert/Documents/White%20Papers/data-center-costs_24659-R02-11.pdf
  38. 38.
      • ICOR: http://theicor.org/ and http://www.theicor.org/art/pdfs/ICORCEBrochure-Web.pdf
      • Business Computing World: http://www.businesscomputingworld.co.uk/assessing-the-financial-impact-of-downtime/
      • Symantec: Dennis Wenk
  39. 39. Thank you
      • Jim Nelson, BCS, Inc.: 866-629-6327, www.businesscontinuitysvcs.com
      • International Consortium for Organizational Resilience (ICOR): 866-765-8321, www.theicor.org
  40. 40. How to Respond to Data Center Disasters | Vincent Geffray | Everbridge | @Vgeffray
  41. 41. What are the Root Causes of an unplanned outage? 69% of the ROOT CAUSE
  42. 42. What’s the cost of an unplanned data center outage? In 2016: an average of $7,000 per minute
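
A per-minute figure like this is easiest to appreciate when multiplied out. The sketch below assumes cost grows linearly with outage duration, which is a simplification the slide does not claim; real outage costs often have fixed and nonlinear components. The function name and sample durations are illustrative.

```python
COST_PER_MINUTE_USD = 7_000  # average quoted on the slide


def outage_cost(minutes: float, rate: float = COST_PER_MINUTE_USD) -> float:
    """Estimated cost of an outage, assuming a constant per-minute rate."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * rate


# Sample durations chosen for illustration only.
for duration in (15, 60, 240):
    print(f"{duration:>4} min -> ${outage_cost(duration):,.0f}")
```

At this rate a one-hour outage comes to $420,000, which is a useful order-of-magnitude figure when discussing response-time targets with the business.
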
  43. 43. Evolution of Enterprise IT: An unplanned outage is a Business issue (© 2016 Everbridge - Critical Communications for IT. Reproduction Prohibited)
  44. 44. Define Major Incidents with your business
      • Disruption of service? → Severe disruption or interruption
      • Impact to the business operations? → Large
      • Urgency to restore → High
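
The three criteria on this slide can be written down as a simple rule so that everyone classifies incidents the same way. The sketch below is one possible encoding, not the speaker's method: the three-level scale and the "all three must be at the top level" rule are assumptions, and the actual thresholds should be agreed with the business.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level rating scale."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Incident:
    disruption: Level       # severity of the service disruption
    business_impact: Level  # impact on business operations
    urgency: Level          # urgency to restore service


def is_major(incident: Incident) -> bool:
    """Treat an incident as major only if all three ratings are HIGH."""
    ratings = (incident.disruption, incident.business_impact, incident.urgency)
    return all(rating == Level.HIGH for rating in ratings)


outage = Incident(Level.HIGH, Level.HIGH, Level.HIGH)
slow_report = Incident(Level.MEDIUM, Level.LOW, Level.LOW)

print(is_major(outage))       # True
print(is_major(slow_report))  # False
```

A stricter or looser rule (for example, any two of three) is a policy choice; the point is that the definition is explicit before an incident happens.
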
  45. 45. Assess the business impact of each mission-critical service
  46. 46. Create a Major Incident Response Virtual Team
      • The Major Incident Response team may include members from:
          - Network engineering, application support, DB support, middleware team, server/infrastructure team, ERP support, EMR support team
          - Service Desk manager
          - Customer service manager
          - Change manager
          - IT Service Director
          - 3rd-party vendor (technical contact)
      • Assign someone to be the Major Incident Manager (diagram labels: Primary, Secondary)
  47. 47. Create a Communication Strategy
      • Who will you be contacting?
          - Your IT experts
          - Senior management
          - Impacted customers
      • How/where do you keep contact information up to date?
          - Individuals
          - On-call personnel
          - Static groups
          - Dynamic groups
          - Subscriptions
      • How will you be communicating?
          - Text, SMS
          - Voice, text to voice
          - Emails
          - Mobile app, etc.
      • How often will you be communicating during the crisis?
      • What content will be communicated, and to whom?
          - Notification templates
      • What’s the escalation rule if your IT experts don’t respond?
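
The last question on this slide, the escalation rule, is the one most easily left vague. The sketch below shows one way to write such a rule down, assuming a fixed ordered chain of contacts and a wait time per step; the contact names, wait times, and function name are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    wait_minutes: int  # time allowed for an acknowledgement before escalating


def contacts_to_notify(steps: list[EscalationStep],
                       minutes_elapsed: int) -> list[str]:
    """Contacts who should have been notified by now, assuming nobody has acknowledged."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    notified = []
    notify_at = 0  # minute at which the current step is notified
    for step in steps:
        if minutes_elapsed < notify_at:
            break
        notified.append(step.contact)
        notify_at += step.wait_minutes
    return notified


# Hypothetical policy: names and timings are examples only.
policy = [
    EscalationStep("primary on-call engineer", wait_minutes=5),
    EscalationStep("secondary on-call engineer", wait_minutes=10),
    EscalationStep("Major Incident Manager", wait_minutes=15),
]

for minute in (0, 5, 20):
    print(minute, contacts_to_notify(policy, minute))
```

With this policy the primary is notified immediately, the secondary after 5 unacknowledged minutes, and the Major Incident Manager after 15. Writing the rule as data makes it easy to review with the team and to load into whatever notification tool is in use.
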
  48. 48. Identify Your Contacts and Stakeholders: Cyber attack leading to data breach
  49. 49. Have Virtual Crisis Rooms Available
      • You need at least 2 conference bridges:
          - Your IT experts
          - Senior management
      • Provide collaboration tools (Lync, Skype for Business, Slack…)
      • How will you be contacting your IT experts?
          - Manually
          - From your ticketing system
          - From your monitoring tools
  50. 50. What You Need To Do:
      ✓ Definition of Major Incident
      ✓ Assess the Business Impact
      ✓ Create Major Incident Response Team
      ✓ Define Communication Strategy
      ✓ Have 2 Virtual Crisis Rooms
  51. 51. All-inclusive IT Communications solution
      ✓ Unlimited Global Voice & SMS
      ✓ Intelligent Notification Templates
      ✓ On-call Schedules and Rotations
      ✓ Automatic Escalation
      ✓ 1-click Conference Call
      ✓ Integration with Ticketing Systems
      ✓ 99.99% true uptime
      ✓ Open APIs
  52. 52. Questions @ITAlerting #ITAWebinar
  53. 53. Thank you! www.ITAlerting.com
