Managing a Major Incident

Presented by Ms Mayda Lim, Head of Implementation and Support, Thomson-Reuters at NUS-ISS ITSM CoP on 24 Apr.


Managing a Major Incident: Presentation Transcript

  • Managing a Major Incident: Case Study in Thomson Reuters Realtime Technology Operations. Mayda Lim, Head of Implementation & Support, Technology Operations. 24 April 2014. Version: Final
  • AGENDA • Thomson Reuters • Our Service Management Journey • Managing a Major Incident
  • INTRODUCTION Thomson Reuters
  • INTRODUCTION Thomson Reuters (organisation diagram). Business units: Legal; Financial & Risk; IP & Science; Tax & Accounting; Reuters News. Segments shown: Trading; Investors; Marketplaces; Governance Risk & Compliance; Large Law Firms; Small Law Firms; General Counsels; Government; Intellectual Property; Scientific & Scholarly Research; Life Sciences; Corporate; Professional; Knowledge Solutions; Government; Media
  • INTRODUCTION Finance & Risk • We serve more than 40,000 customers and 400,000 end users in over 155 countries with a strong presence in North America and Europe and a growing presence in emerging markets. At least 50,000 customer applications also use our information. • Our customers include energy companies, investment management firms, brokerage houses, industrial conglomerates, the world’s top corporations and the 25 largest global banks. • We have the number 1 or 2 position in every segment we serve. We anticipate strongest growth in Governance Risk & Compliance, Commodities & Energy, Marketplaces, Transactions and Enterprise Content, Buy-side and Corporations and Global Markets.
  • Real-Time Technology Infrastructure • Probably the largest private commercial network in the world, delivering news & content to desktops and trading applications across 155 countries • Connecting to ~250 exchanges and over 7,000 non-exchange sources • 350,000 customer end-points across 50,000 customer sites • 10 million Reuters Instrument Codes in its head-end database, with Editorial and 3rd party news providing >50,000 stories a day • Delivery options available, depending upon content, latency needs and geography • 2.6 million updates per second • Real-time, mission-critical traffic (Diagram labels: Exchanges, News, Contributors, Client.)
  • Resilience ◘ Resilience is a key aspect of our design, development and builds ◘ At the shared infrastructure level and at the Service level ◘ Dual System Installations: Automated switching ◘ Dual Power: Either at server/device level or through the use of Power finders ◘ Dual International Communications lines: Utilising multiple Telecom providers ◘ Dual Illumination: Dual Uplinks, Dual Receivers (Diagram: Network Resiliency Topology, with Data, SDC and Client nodes.)
  • Service Management Our Journey in Service Management
  • SERVICE MANAGEMENT Our Journey in Service Management. Programme’s Transformation Objectives laid out in 2004: • An organisation with a customer oriented proactive culture • A full implementation of appropriate ITIL Service Management processes in line with business requirements • Staff fully trained and motivated to provide great customer service • An integrated tool set to provide seamless end to end processing • A single managed source of reliable trusted data (Diagram labels: Improve, Optimize, Automate.)
  • SERVICE MANAGEMENT ITIL Processes Adoption
    Process | Detail | Status
    Incident | Severity levels, prioritization framework, escalation procedures, improved data capture, improved customer communications; Standard Process, Standard tools, Process governance & roles in place | Complete
    Problem | Problem classifications, root cause analysis process and problem database; Standard Process, Standard tools, Process governance & roles in place | Complete
    Change | Improved risk assessment and reporting, enhanced alignment with assets database; Standard Process, Standard tool, Process governance & roles in place | Complete
    Release | Release policy, standardization of release documentation templates and guidelines, improved resource management via Forward Schedule of Release; Standard Process, Process governance & roles in place | Complete
    Capacity | Systems under watch increased, capacity risk dashboard developed; Standard Process, Standard tool, Process governance & roles in place | Complete
  • SERVICE MANAGEMENT ITIL Processes Adoption (continued)
    Process | Detail | Status
    Configuration | SM tools rollout following a complete audit, process supported by Change Management; Standard Process, Standard tools, Process governance & roles in place | Complete
    Financial | Technology operations fully aligned to business; Accountability of CTO | Complete
    Service Level | Central Sourcing function; Back to back internal and external SLA; Service Target agreed | Complete
    Knowledge | Formal Process defined and mapped to tool; Commissioned since Feb 2009 | Complete
    Business Continuity | Comprehensive documentation; Perform regular exercises; Reviews and Updates | Complete
  • SERVICE MANAGEMENT Tools
    Service Manager: Consolidated Service Desk solution providing best practices based on industry standards ◘ Incident Management ◘ Problem Management ◘ Inventory & Configuration Mgt ◘ Change Management ◘ Scheduled Maintenance ◘ Request Management ◘ Service Level Agreement Mgt ◘ Contract Management ◘ Diagnostic Aids
    AssetCenter: Asset Management solution providing the greatest depth of procurement, inventory, financial and contract management functionality ◘ Portfolio ◘ Procurement ◘ Financials ◘ Cable & Circuit ◘ Contracts ◘ AssetCenter Web
    ITIL Ready Tools ◘ ITIL processes in their own right can progress an organisation’s maturity and performance; when you couple them with an ITIL-ready toolset, major improvements can be noted ◘ An integrated toolset ensures clear process flows, consistency and efficiency
  • Managing a Major Incident Incident Control Centre (ICC)
  • What is a Major Incident? An incident is considered Major when • there is a complete or partial service failure (unavailability) • impact on business is extreme
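The two criteria on this slide can be read as a simple classification rule. The sketch below is only an illustration: the `Incident` type, the state and impact labels, and the choice to require both criteria at once are assumptions, since the slide does not say whether the conditions are combined with "and" or "or" and does not define a rating scale.

```python
from dataclasses import dataclass

# Hypothetical labels; the slide does not define a rating scale.
SERVICE_STATES = ("available", "partial_failure", "complete_failure")
IMPACT_LEVELS = ("low", "moderate", "high", "extreme")


@dataclass
class Incident:
    service_state: str    # one of SERVICE_STATES
    business_impact: str  # one of IMPACT_LEVELS


def is_major(incident: Incident) -> bool:
    """Return True when the incident meets both slide criteria.

    Assumption: a complete or partial service failure AND an extreme
    business impact are both required.
    """
    if incident.service_state not in SERVICE_STATES:
        raise ValueError(f"unknown service state: {incident.service_state}")
    if incident.business_impact not in IMPACT_LEVELS:
        raise ValueError(f"unknown impact level: {incident.business_impact}")
    service_failed = incident.service_state != "available"
    return service_failed and incident.business_impact == "extreme"
```

If the organisation treats either criterion alone as sufficient, the final line would use `or` instead of `and`.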
  • What Is An Incident Control Centre (ICC)? WHAT • Process called to manage Major Incidents • A focal point accountable for coordinating efforts, ensuring clear and concise customer communication ACTIONS • Communicate with all relevant stakeholders • Communicate effectively and professionally to our customers • Escalate to the Management team as appropriate • Coordinate diagnosis and recovery • Prioritize key activities • Continuously analyze and minimize service restoration timeframes • Manage all technical recovery activities through the IRT • Outline resourcing and escalation • Undertake risk and impact assessments • Determine follow-up actions
  • So, what does the typical life-cycle of an ICC look like?
  • ICC Attributes • The ICC operates on a 24 x 7 x 365 basis • It is essential to escalate appropriately at all times, day or night
  • The Benefits Of ICC • Customer focused • Consistent approach and methodology • Effective communication • Appropriate resource is guided and focused • Manages Risks associated with Major Incidents
  • What Can Go Wrong If An ICC Is Not Called? • Increased customer pain • Increased brand damage • Poorly or incorrectly understood Incidents • Inappropriate and indeed harmful actions may be initiated • Poor or no coordination of resources • Incorrect prioritization • Poor or no communication • Inconsistency in approach, management, actions and output • In simple terms: the situation escalates and creates more damage and pain
  • ICC Process
  • ICC DEFCON Levels: Defense readiness condition (DEFCON)
  • Service Alert & Notification: Internal Communication, External Communication
  • Key Roles: Incident Recovery Team (IRT), Management Team Meeting (MTM), Incident Management Group (IMG)
  • ICC Team Layout
  • Incident Recovery Team (IRT) WHEN • Whenever an incident's severity is upgraded to a DEFCON level • Service impacting incident with unclear recovery path ROLE • The IRT is responsible for all technical recovery activities • It is the IRT’s role to provide and drive the ‘technical solution’ • The team is created at the request of the Incident Manager / Technical Recovery Manager • The Technical Recovery Manager will appoint an Incident Recovery Team Lead (IRTL) • Membership will vary depending upon the nature of the Incident, but will typically have an Incident Recovery Leader and a number of subject matter experts • The IRTL can change or supplement the team membership • The IRT meeting will remain open until service is restored
  • Incident Management Group (IMG) WHEN • An IMG meeting is called for all DEFCON levels within 30 minutes of an ICC being initiated • Meetings will occur hourly thereafter, although the frequency can be adjusted with agreement from the Incident Manager • The IMG will last for no longer than 20 minutes and will be based in the ‘War Room’ ROLE • Act as a focal point for communication to ensure effective and professional communication occurs • Coordinate the activities of the most appropriate staff and teams solving the Incident, ensuring that the IRT has the right skills and leadership in place and that progress is being made as effectively as possible • The IMG can suggest and make membership changes to the Incident Recovery Team as they feel appropriate (DEFCON 2 and above) • It is not the IMG’s role to drop into ‘technical solution’ mode; this is the responsibility of the Incident Recovery Team
  • Management Team Meeting (MTM) WHEN • Held for ‘Full ICC DEFCON 2’ or ‘Severity 0 DEFCON 1’ • Follows an IMG meeting, within 60 minutes of the ICC being initiated • Subsequent meeting times will be agreed with the Incident Manager but may typically occur hourly thereafter (hourly only if the incident escalates to DEFCON 1, Emergency Management Committee) • The MTM will last for no longer than 20 minutes ROLE • Ensure communication to both customers and senior managers is maintained • Make decisions based upon information provided by the IMG, providing support and guidance as appropriate, and whenever necessary escalate to the EMC
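The IMG and MTM timing rules can be laid out as a schedule of latest meeting start times. This is a minimal sketch under stated assumptions, not the process as run at Thomson Reuters: the function name `meeting_schedule`, the integer `defcon` argument (with 1 as the most severe level) and the `horizon_hours` window are hypothetical, every meeting is placed at the latest time the slides allow, and any adjustment agreed with the Incident Manager is ignored. It also reads the MTM slide as meaning that MTMs repeat hourly only at DEFCON 1.

```python
from datetime import datetime, timedelta

IMG_FIRST_WITHIN = timedelta(minutes=30)  # IMG slide: within 30 minutes
MTM_FIRST_WITHIN = timedelta(minutes=60)  # MTM slide: within 60 minutes
REPEAT_EVERY = timedelta(hours=1)         # "hourly thereafter"


def meeting_schedule(icc_start, defcon, horizon_hours=3):
    """Return (meeting, latest_start) pairs sorted by time.

    Assumptions: `defcon` is an integer where 1 is the most severe level;
    only meetings starting within `horizon_hours` of `icc_start` are listed.
    """
    end = icc_start + timedelta(hours=horizon_hours)
    events = []

    # IMG: called for all DEFCON levels, then hourly.
    t = icc_start + IMG_FIRST_WITHIN
    while t <= end:
        events.append(("IMG", t))
        t += REPEAT_EVERY

    # MTM: held only for DEFCON 2 or DEFCON 1.
    if defcon in (1, 2):
        t = icc_start + MTM_FIRST_WITHIN
        if t <= end:
            events.append(("MTM", t))
        # Assumed reading: hourly repeats apply only at DEFCON 1.
        if defcon == 1:
            t += REPEAT_EVERY
            while t <= end:
                events.append(("MTM", t))
                t += REPEAT_EVERY

    return sorted(events, key=lambda event: event[1])
```

For an ICC initiated at 09:00 at DEFCON 1, `meeting_schedule(datetime(2014, 4, 24, 9, 0), defcon=1)` yields IMG at 09:30, 10:30 and 11:30, interleaved with MTM at 10:00, 11:00 and 12:00. Each meeting is capped at 20 minutes on the slides, so consecutive meetings in this layout do not overlap.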
  • Stand Down The ICC • Clear path of recovery • Service restored • A conscious and recorded decision will be made to stand down all ICCs • The Service Alert must be updated to reflect the fact that the ICC has been closed • Root Cause Analysis (RCA) will be initiated by Problem Management
  • Connect with me: @MaydaLim