The document describes Thomson Reuters' management of major incidents through an Incident Control Centre (ICC). The ICC aims to coordinate recovery efforts, communicate with stakeholders, and minimize outage times. It follows defined processes involving an Incident Recovery Team addressing the technical issues, an Incident Management Group overseeing communication and resources, and a Management Team Meeting for escalation and customer updates. The ICC operates continuously and stands down only when service is fully restored and a root cause analysis is initiated.
1. Managing a Major Incident
Case Study in Thomson Reuters Realtime Technology Operations
Mayda Lim
Head of Implementation & Support, Technology Operations
24 April 2014
Version: Final
4. INTRODUCTION
[Diagram: Thomson Reuters business units and the customer segments they serve]
Business units: Financial & Risk, Legal, IP & Science, Tax & Accounting
Segments shown: Trading; Investors; Marketplaces; Governance Risk & Compliance; Large Law Firms; Small Law Firms; General Counsels; Government; Intellectual Property; Scientific & Scholarly Research; Life Sciences; Corporate; Professional; Knowledge Solutions; Government; Reuters News; Media
5. INTRODUCTION
Financial & Risk
• We serve more than 40,000 customers and 400,000 end users in over 155
countries with a strong presence in North America and Europe and a growing
presence in emerging markets. At least 50,000 customer applications also
use our information
• Our customers include energy companies, investment management firms,
brokerage houses, industrial conglomerates, the world’s top
corporations and the 25 largest global banks
• We have the number 1 or 2 position in every segment we serve. We
anticipate strongest growth in Governance Risk & Compliance,
Commodities & Energy, Marketplaces, Transactions and Enterprise Content,
Buy-side and Corporations and Global Markets.
6. Real-Time Technology Infrastructure
• Probably the largest private commercial
network in the world, delivering news and
content to desktops and trading applications
across 155 countries
• Connects to ~250 exchanges and over
7,000 non-exchange sources
• 350,000 customer end-points across 50,000
customer sites
• 10 million Reuters Instrument Codes in its
head-end database, with Editorial and 3rd
party news providing >50,000 stories a day
• Delivery options available, depending on
content, latency needs and geography
• 2.6 million updates per second
• Real-time, mission critical traffic
[Diagram labels: Exchanges, News, Contributors, Client]
7. Resilience
[Diagram labels: Data, SDC, Client]
◘ Resilience is a key aspect of our design,
development and builds
◘ At the shared infrastructure level and at
the Service level
Service
◘ Dual System Installations
Automated switching
◘ Dual Power
Either at server/device level or
through the use of Power finders
◘ Dual International Communications lines
Utilising multiple Telecom
providers
◘ Dual Illumination
Dual Uplinks
Dual Receivers
Network Resiliency Topology
9. SERVICE MANAGEMENT
Our Journey in Service Management
Programme’s transformation objectives, laid out in 2004:
• An organisation with a customer oriented proactive culture
• A full implementation of appropriate ITIL Service Management processes in line with
business requirements
• Staff fully trained and motivated to provide great customer service
• An integrated tool set to provide seamless end to end processing
• A single managed source of reliable trusted data
Improve Optimize Automate
10. SERVICE MANAGEMENT
ITIL Processes Adoption
Process (Status), with Detail listed beneath
Incident (Complete)
◘ Severity levels, prioritization framework, escalation procedures, improved data capture, improved customer communications
◘ Standard Process, Standard tools & Process governance & roles in place
Problem (Complete)
◘ Problem classifications, root cause analysis process and problem database
◘ Standard Process, Standard tools, Process governance & roles in place
Change (Complete)
◘ Improved risk assessment and reporting, enhanced alignment with assets database
◘ Standard Process, Standard tool, Process governance & roles in place
Release (Complete)
◘ Release policy, standardization of release documentation templates and guidelines, improved resource management via Forward Schedule of Release
◘ Standard Process, Process governance & roles in place
Capacity (Complete)
◘ Systems under watch increased, capacity risk dashboard developed
◘ Standard Process, Standard tool, Process governance & roles in place
11. SERVICE MANAGEMENT
ITIL Processes Adoption
Process (Status), with Detail listed beneath
Configuration (Complete)
◘ SM tools rollout following a complete audit, process supported by Change Management
◘ Standard Process, Standard tools & Process governance & roles in place
Financial (Complete)
◘ Technology operations fully aligned to business
◘ Accountability of CTO
Service Level (Complete)
◘ Central Sourcing function
◘ Back-to-back internal and external SLA
◘ Service Target agreed
Knowledge (Complete)
◘ Formal Process defined and mapped to tool
◘ Commissioned since Feb 2009
Business Continuity (Complete)
◘ Comprehensive documentation
◘ Perform regular exercises
◘ Reviews and Updates
12. SERVICE MANAGEMENT
Tools
◘ Service Manager
◘ Consolidated Service Desk solution providing
best practices based on industry standards
◘ Incident Management
◘ Problem Management
◘ Inventory & Configuration Mgt
◘ Change Management
◘ Scheduled Maintenance
◘ Request Management
◘ Service Level Agreement Mgt
◘ Contract Management
◘ Diagnostic Aids
AssetCenter
◘ Asset Management solution providing the
greatest depth of procurement, inventory,
financial and contract management functionality
◘ Portfolio
◘ Procurement
◘ Financials
◘ Cable & Circuit
◘ Contracts
◘ AssetCenter Web
ITIL Ready Tools
◘ ITIL processes in their own right can progress an organisation’s maturity and performance; when you
couple them with an ITIL-ready toolset, major improvements can be noted
◘ An integrated toolset ensures clear process flows, consistency and efficiency
14. What is a Major Incident?
An incident is considered Major when
• there is a complete or partial service failure (unavailability)
• impact on business is extreme
15. What Is An Incident Control Centre (ICC)?
WHAT
• Process called to manage Major Incidents
• A focal point accountable for coordinating efforts, ensuring
clear and concise customer communication
ACTIONS
• Communicate with all relevant stakeholders
• Communicate effectively and professionally to our
customers
• Escalate to the Management team as appropriate
• Coordinate diagnosis and recovery
• Prioritize key activities
• Continuously analyze and minimize service restoration
timeframes
• Manage all technical recovery activities through the IRT
• Outline resourcing and escalation
• Undertake risk and impact assessments
• Determine follow-up actions
16. So, what does the typical life-cycle of an ICC look like?
17. ICC Attributes
• The ICC operates on a 24 x 7 x 365 basis
• It is essential to escalate appropriately at all times
day or night
18. The Benefits Of ICC
• Customer focused
• Consistent approach and methodology
• Effective communication
• Appropriate resource is guided and focused
• Manages Risks associated with Major Incidents
19. What Can Go Wrong If An ICC Is Not Called?
• Increased customer pain
• Increased brand damage
• Poorly or incorrectly understood Incidents
• Inappropriate and indeed harmful actions may be initiated
• Poor or no coordination of resources
• Incorrect prioritization
• Poor or no communication
• Inconsistency in approach, management, actions and output
• In simple terms
– The situation escalates and creates more damage and pain
25. Incident Recovery Team (IRT)
WHEN
• Whenever an incident's severity is upgraded to a DEFCON level
• A service-impacting incident with an unclear recovery path
ROLE
• The IRT are responsible for all technical recovery activities
• It is the IRT’s role to provide and drive the ‘technical solution’
• The team is created at the request of the Incident Manager / Technical
Recovery Manager
• The Technical Recovery Manager will appoint an Incident Recovery
Team Lead (IRTL)
• Membership will vary depending upon the nature of the Incident, but will
typically have an Incident Recovery Leader and a number of subject
matter experts
• The IRTL can change or supplement the team membership
• The IRT meeting will remain open until service is restored
26. Incident Management Group (IMG)
WHEN
• An IMG meeting is called for all DEFCON levels within 30 minutes of an ICC
being initiated
• Meetings will occur hourly thereafter although the frequency can be adjusted
with agreement from the Incident Manager
• The IMG will last for no longer than 20 minutes and will be based in the ‘War
Room’
ROLE
• Act as a focal point for communication to ensure effective and professional
communication occurs
• Coordinate the activities of the most appropriate staff and teams solving the
Incident – ensuring that the IRT has the right skills and leadership in place
and that progress is being made as effectively as possible
• The IMG can suggest and make membership changes to the Incident
Recovery Team as it feels appropriate (DEFCON 2 and above)
• It is not the IMG’s role to drop into ‘technical solution’ mode – this is the
responsibility of the Incident Recovery Team
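The IMG meeting cadence above can be sketched in a few lines of Python. This is only an illustration of the timings on the slide (first meeting within 30 minutes of ICC initiation, hourly thereafter, 20 minutes maximum). The function name `img_meeting_slots` is hypothetical, and the sketch assumes the first meeting starts exactly at the 30-minute deadline.

```python
from datetime import datetime, timedelta

# Timings taken from the slide.
FIRST_MEETING_DEADLINE = timedelta(minutes=30)  # after ICC initiation
DEFAULT_INTERVAL = timedelta(hours=1)           # adjustable by the Incident Manager
MAX_DURATION = timedelta(minutes=20)


def img_meeting_slots(icc_started, count, interval=DEFAULT_INTERVAL):
    """Return (start, latest_end) pairs for the first `count` IMG meetings.

    Hypothetical helper: assumes the first meeting starts at the 30-minute
    deadline and later meetings follow at a fixed interval.
    """
    first = icc_started + FIRST_MEETING_DEADLINE
    return [
        (first + i * interval, first + i * interval + MAX_DURATION)
        for i in range(count)
    ]


if __name__ == "__main__":
    started = datetime(2014, 4, 24, 9, 0)
    for start, end in img_meeting_slots(started, 3):
        print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"))
```

Passing a different `interval` models the Incident Manager agreeing to change the meeting frequency.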
27. Management Team Meeting (MTM)
WHEN
• Held for ‘Full ICC DEFCON 2’ or ‘Severity 0 DEFCON 1’
• Follows an IMG meeting within 60 minutes of the ICC being initiated
• Subsequent meeting times will be agreed with the Incident Manager but
may typically occur hourly thereafter (Only hourly if the incident escalates
to DEFCON 1 (Emergency Management Committee))
• The MTM will last for no longer than 20 minutes
ROLE
• Ensure communication to both customers and senior managers
is maintained
• Make decisions based upon information provided by the IMG, providing
support and guidance as appropriate, and whenever necessary escalate
to the EMC
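As a rough sketch, the MTM trigger rules above could be encoded as follows. The slide names only DEFCON 1 and DEFCON 2, so representing levels as integers is an assumption, and all function names are hypothetical.

```python
from datetime import datetime, timedelta

# Levels for which the slide says an MTM is held:
# 'Severity 0 DEFCON 1' and 'Full ICC DEFCON 2'.
# Representing DEFCON levels as integers is an assumption of this sketch.
MTM_LEVELS = {1, 2}
MTM_DEADLINE = timedelta(minutes=60)  # after ICC initiation
MTM_MAX_DURATION = timedelta(minutes=20)


def mtm_required(defcon_level):
    """True if a Management Team Meeting is held at this DEFCON level."""
    return defcon_level in MTM_LEVELS


def first_mtm_deadline(icc_started):
    """Latest start time for the first MTM."""
    return icc_started + MTM_DEADLINE


def mtm_is_hourly(defcon_level):
    """Per the slide, hourly MTMs apply only once the incident is at DEFCON 1."""
    return defcon_level == 1


if __name__ == "__main__":
    started = datetime(2014, 4, 24, 9, 0)
    print(mtm_required(2), first_mtm_deadline(started), mtm_is_hourly(2))
```

At other levels, subsequent meeting times are left to agreement with the Incident Manager, so the sketch does not schedule them.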
28. Stand Down The ICC
• Clear path of recovery
• Service restored
• A conscious and recorded decision will be made to
stand down all ICCs
• The Service Alert must be updated to reflect the fact that
the ICC has been closed
• Root Cause Analysis (RCA) will be initiated by
Problem Management
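The stand-down conditions above lend themselves to a simple checklist. The sketch below is illustrative only: the class and field names are hypothetical, and treating all five bullets as items to tick off before closing the ICC is an assumption.

```python
from dataclasses import dataclass, fields


@dataclass
class StandDownChecklist:
    """One flag per bullet on the slide (names are hypothetical)."""
    clear_path_of_recovery: bool = False
    service_restored: bool = False
    decision_recorded: bool = False
    service_alert_updated: bool = False
    rca_initiated: bool = False

    def outstanding(self):
        """Names of the items not yet satisfied, in slide order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def can_stand_down(self):
        """True only when every item has been ticked."""
        return not self.outstanding()


if __name__ == "__main__":
    checklist = StandDownChecklist(clear_path_of_recovery=True, service_restored=True)
    print(checklist.can_stand_down())
    print(checklist.outstanding())
```

Keeping the outstanding items as a list makes it easy to record why a stand-down decision was deferred.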