Managing a Major Incident


Presented by Ms Mayda Lim, Head of Implementation and Support, Thomson Reuters, at the NUS-ISS ITSM CoP on 24 April 2014.

  1. Managing a Major Incident: Case Study in Thomson Reuters Realtime Technology Operations. Mayda Lim, Head of Implementation & Support, Technology Operations. 24 April 2014. Version: Final
  2. AGENDA • Thomson Reuters • Our Service Management Journey • Managing a Major Incident
  3. INTRODUCTION: Thomson Reuters
  4. INTRODUCTION: Thomson Reuters (organisation chart)
     • Business units: Financial & Risk; Legal; IP & Science; Tax & Accounting; Reuters News
     • Segments shown: Trading; Investors; Marketplaces; Governance, Risk & Compliance; Large Law Firms; Small Law Firms; General Counsels; Government; Intellectual Property; Scientific & Scholarly Research; Life Sciences; Corporate; Professional; Knowledge Solutions; Government; Media
  5. INTRODUCTION: Financial & Risk
     • We serve more than 40,000 customers and 400,000 end users in over 155 countries, with a strong presence in North America and Europe and a growing presence in emerging markets. At least 50,000 customer applications also use our information
     • Our customers include energy companies, investment management firms, brokerage houses, industrial conglomerates, the world’s top corporations and the 25 largest global banks
     • We have the number 1 or 2 position in every segment we serve. We anticipate strongest growth in Governance Risk & Compliance, Commodities & Energy, Marketplaces, Transactions and Enterprise Content, Buy-side and Corporations and Global Markets
  6. Real-Time Technology Infrastructure
     • Probably the largest private commercial network in the world, delivering news & content to desktops and trading applications across 155 countries
     • Connecting to ~250 exchanges, with over 7,000 non-exchange sources
     • 350,000 customer end-points across 50,000 customer sites
     • 10 million Reuters Instrument Codes in its head-end database, with Editorial and 3rd party news providing >50,000 stories a day
     • Delivery options available depending on content, latency needs and geography
     • 2.6 million updates per second
     • Real-time, mission critical traffic
     [Diagram labels: Exchanges, News, Contributors, Client]
  7. Resilience
     ◘ Resilience is a key aspect of our design, development and builds
     ◘ At the shared infrastructure level and at the Service level
     Service
     ◘ Dual System Installations: automated switching
     ◘ Dual Power: either at server/device level or through the use of Power finders
     ◘ Dual International Communications lines: utilising multiple Telecom providers
     ◘ Dual Illumination: Dual Uplinks, Dual Receivers
     [Diagram: Network Resiliency Topology, showing Data sources, SDCs and Clients]
  8. Service Management: Our Journey in Service Management
  9. SERVICE MANAGEMENT: Our Journey in Service Management
     Programme’s transformation objectives, laid out in 2004:
     • An organisation with a customer oriented proactive culture
     • A full implementation of appropriate ITIL Service Management processes in line with business requirements
     • Staff fully trained and motivated to provide great customer service
     • An integrated tool set to provide seamless end to end processing
     • A single managed source of reliable trusted data
     [Graphic labels: Improve, Optimize, Automate]
  10. SERVICE MANAGEMENT: ITIL Processes Adoption (table columns: Process, Detail, Status)
     • Incident (Status: Complete)
       ◘ Severity levels, prioritization framework, escalation procedures, improved data capture, improved customer communications
       ◘ Standard Process, Standard tools & Process governance & roles in place
     • Problem (Status: Complete)
       ◘ Problem classifications, root cause analysis process and problem database
       ◘ Standard Process, Standard tools, Process governance & roles in place
     • Change (Status: Complete)
       ◘ Improved risk assessment and reporting, enhanced alignment with assets database
       ◘ Standard Process, Standard tool, Process governance & roles in place
     • Release (Status: Complete)
       ◘ Release policy, standardization of release documentation templates and guidelines, improved resource management via Forward Schedule of Release
       ◘ Standard Process, Process governance & roles in place
     • Capacity (Status: Complete)
       ◘ Systems under watch increased, capacity risk dashboard developed
       ◘ Standard Process, Standard tool, Process governance & roles in place
  11. SERVICE MANAGEMENT: ITIL Processes Adoption, continued (table columns: Process, Detail, Status)
     • Configuration (Status: Complete)
       ◘ SM tools rollout following a complete audit, process supported by Change Management
       ◘ Standard Process, Standard tools & Process governance & roles in place
     • Financial (Status: Complete)
       ◘ Technology operations is fully aligned to business
       ◘ Accountability of CTO
     • Service Level (Status: Complete)
       ◘ Central Sourcing function
       ◘ Back-to-back internal and external SLA
       ◘ Service Target agreed
     • Knowledge (Status: Complete)
       ◘ Formal Process defined and mapped to tool
       ◘ Commissioned since Feb 2009
     • Business Continuity (Status: Complete)
       ◘ Comprehensive documentation
       ◘ Perform regular exercises
       ◘ Reviews and Updates
  12. SERVICE MANAGEMENT: Tools
     • Service Manager
       ◘ Consolidated Service Desk solution providing best practices based on industry standards
       ◘ Incident Management
       ◘ Problem Management
       ◘ Inventory & Configuration Mgt
       ◘ Change Management
       ◘ Scheduled Maintenance
       ◘ Request Management
       ◘ Service Level Agreement Mgt
       ◘ Contract Management
       ◘ Diagnostic Aids
     • AssetCenter
       ◘ Asset Management solution providing the greatest depth of procurement, inventory, financial and contract management functionality
       ◘ Portfolio
       ◘ Procurement
       ◘ Financials
       ◘ Cable & Circuit
       ◘ Contracts
       ◘ AssetCenter Web
     • ITIL Ready Tools
       ◘ ITIL processes in their own right can progress an organisation’s maturity and performance; coupled with an ITIL-ready toolset, major improvements can be noted
       ◘ An integrated toolset ensures clear process flows, consistency and efficiency
  13. Managing a Major Incident: Incident Control Centre (ICC)
  14. What is a Major Incident?
     An incident is considered Major when
     • there is a complete or partial service failure (unavailability)
     • impact on business is extreme
  15. What Is An Incident Control Centre (ICC)?
     WHAT
     • Process called to manage Major Incidents
     • A focal point accountable for coordinating efforts, ensuring clear and concise customer communication
     ACTIONS
     • Communicate with all relevant stakeholders
     • Communicate effectively and professionally to our customers
     • Escalate to the Management team as appropriate
     • Coordinate diagnosis and recovery
     • Prioritize key activities
     • Continuously analyze and minimize service restoration timeframes
     • Manage all technical recovery activities through the IRT
     • Outline resourcing and escalation
     • Undertake risk and impact assessments
     • Determine follow-up actions
  16. So, what does the typical life cycle of an ICC look like?
  17. ICC Attributes
     • The ICC operates on a 24 x 7 x 365 basis
     • It is essential to escalate appropriately at all times, day or night
  18. The Benefits Of ICC
     • Customer focused
     • Consistent approach and methodology
     • Effective communication
     • Appropriate resource is guided and focused
     • Manages Risks associated with Major Incidents
  19. What Can Go Wrong If An ICC Is Not Called?
     • Increased customer pain
     • Increased brand damage
     • Poorly or incorrectly understood Incidents
     • Inappropriate and indeed harmful actions may be initiated
     • Poor or no coordination of resources
     • Incorrect prioritization
     • Poor or no communication
     • Inconsistency in approach, management, actions and output
     • In simple terms: the situation escalates and creates more damage and pain
  20. ICC Process
  21. ICC DEFCON Levels: Defense readiness condition (DEFCON)
  22. Service Alert & Notification: Internal Communication, External Communication
  23. Key Roles: Incident Recovery Team (IRT), Management Team Meeting (MTM), Incident Management Group (IMG)
  24. ICC Team Layout
  25. Incident Recovery Team (IRT)
     WHEN
     • Whenever an incident’s severity is upgraded to a DEFCON level
     • Service impacting incident with unclear recovery path
     ROLE
     • The IRT is responsible for all technical recovery activities
     • It is the IRT’s role to provide and drive the ‘technical solution’
     • The team is created at the request of the Incident Manager / Technical Recovery Manager
     • The Technical Recovery Manager will appoint an Incident Recovery Team Lead (IRTL)
     • Membership will vary depending upon the nature of the Incident, but will typically have an Incident Recovery Leader and a number of subject matter experts
     • The IRTL can change or supplement the team membership
     • The IRT meeting will remain open until service is restored
  26. Incident Management Group (IMG)
     WHEN
     • An IMG meeting is called for all DEFCON levels within 30 minutes of an ICC being initiated
     • Meetings will occur hourly thereafter, although the frequency can be adjusted with agreement from the Incident Manager
     • The IMG meeting will last for no longer than 20 minutes and will be based in the ‘War Room’
     ROLE
     • Act as a focal point for communication to ensure effective and professional communication occurs
     • Coordinate the activities of the most appropriate staff and teams solving the Incident, ensuring that the IRT has the right skills and leadership in place and that progress is being made as effectively as possible
     • The IMG can suggest and make membership changes to the Incident Recovery Team as they feel appropriate (DEFCON 2 and above)
     • It is not the IMG’s role to drop into ‘technical solution’ mode; this is the responsibility of the Incident Recovery Team
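The IMG timing rules on this slide (first meeting within 30 minutes of the ICC being initiated, hourly thereafter, no longer than 20 minutes each) can be sketched in Python. This is a minimal illustration only, not Thomson Reuters tooling: the function `img_meeting_slots`, the constant names, and the choice to count the hourly cadence from the 30-minute deadline are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Values taken from the slide; the names themselves are hypothetical.
IMG_FIRST_MEETING_WITHIN = timedelta(minutes=30)
IMG_DEFAULT_INTERVAL = timedelta(hours=1)   # adjustable with the Incident Manager's agreement
IMG_MAX_DURATION = timedelta(minutes=20)


def img_meeting_slots(icc_initiated_at, count, interval=IMG_DEFAULT_INTERVAL):
    """Return (latest_start, latest_end) pairs for the first `count` IMG meetings.

    Assumption: follow-up meetings are spaced `interval` apart, counted from
    the latest permitted start of the first meeting.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    first_start = icc_initiated_at + IMG_FIRST_MEETING_WITHIN
    slots = []
    for i in range(count):
        start = first_start + i * interval
        slots.append((start, start + IMG_MAX_DURATION))
    return slots
```

Passing a different `interval` models the case where the Incident Manager agrees to change the meeting frequency.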
  27. Management Team Meeting (MTM)
     WHEN
     • Held for ‘Full ICC DEFCON 2’ or ‘Severity 0 DEFCON 1’
     • Follows an IMG meeting within 60 minutes of the ICC being initiated
     • Subsequent meeting times will be agreed with the Incident Manager but may typically occur hourly thereafter (only hourly if the incident escalates to DEFCON 1 (Emergency Management Committee))
     • The MTM will last for no longer than 20 minutes
     ROLE
     • Ensure communication to both customers and senior managers is maintained
     • Make decisions based upon information provided by the IMG, providing support and guidance as appropriate, and whenever necessary escalate to the EMC
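The MTM rules on this slide can be sketched the same way. This is an illustrative sketch under stated assumptions, not the presenter's tooling: DEFCON levels are represented as plain integers, no MTM is planned for levels other than 1 and 2, and the names `MtmPlan` and `plan_mtm` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Values taken from the slide; names and the integer encoding are assumptions.
MTM_DEFCON_LEVELS = frozenset({1, 2})
MTM_FIRST_MEETING_WITHIN = timedelta(minutes=60)
MTM_MAX_DURATION = timedelta(minutes=20)


@dataclass(frozen=True)
class MtmPlan:
    first_meeting_deadline: datetime
    max_duration: timedelta
    # None means follow-up times are to be agreed with the Incident Manager.
    repeat_interval: Optional[timedelta]


def plan_mtm(icc_initiated_at: datetime, defcon_level: int) -> Optional[MtmPlan]:
    """Return an MTM plan, or None when this DEFCON level does not call for an MTM."""
    if defcon_level not in MTM_DEFCON_LEVELS:
        return None
    # The slide ties an hourly cadence to DEFCON 1 only.
    repeat = timedelta(hours=1) if defcon_level == 1 else None
    return MtmPlan(
        first_meeting_deadline=icc_initiated_at + MTM_FIRST_MEETING_WITHIN,
        max_duration=MTM_MAX_DURATION,
        repeat_interval=repeat,
    )
```

Returning `None` for the repeat interval at DEFCON 2 keeps the "agreed with the Incident Manager" rule explicit rather than inventing a default cadence.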
  28. Stand Down The ICC
     • Clear path of recovery
     • Service restored
     • A conscious and recorded decision will be made to stand down all ICCs
     • The Service Alert must be updated to reflect the fact that the ICC has been closed
     • Root Cause Analysis (RCA) will be initiated by Problem Management
  29. Any questions?
  30. Connect with me: @MaydaLim