4. ProblemManagementFoundation
Best practices for the Crisis
Management Operations Centre
(CMOC)
• History of Mission Control
• Three level CMOC
• Command and Control
• Air traffic control (ATC)
• Tiger Teams
• “The Trenches”
• Managing the schedules
• Recording
• Dashboards and metrics
5. ProblemManagementFoundation
Objectives of CMOC
• Provide a co-ordinated, coherent and effective response to managing a
crisis
• This is a physical location and is one of the crisis control points
• Contains the requisite tools and technology to assist in managing the
crisis
• Manages, monitors or diverts a crisis threatening the organisation or its
stakeholders
• Deals with Major Incidents or other potential threats
• Houses the Crisis Management teams including the command and
control structure
• Communicates during a crisis including being the hub for internal and
external notifications and escalations
7. The Three level CMOC
[Figure: floor plan of the three level CMOC (the "fishbowl"), with doors on either side. From front to back: service corridor; video wall with a main screen flanked by left and right screens; the trenches; tiger teams (labels e, w, d, a, b, r); command and control (positions ①, ②, ③, ④). Trench disciplines: Security (PhySec, InfoSec); Data centre (Electrical, Cooling, Fire-Safety); Infrastructure (Servers, Storage, Voice, Networks); Apps (Applications); Support (Operations).]
9. ProblemManagementFoundation
CMOC: Command and
Control
Lessons from Air Traffic
Control (ATC)
Summary of the more traditional ATC
from Wikipedia:
• ATC is a service provided by ground-
based controllers who direct aircraft on
the ground and through controlled
airspace, and can provide advisory
services to aircraft in non-controlled
airspace. The primary purpose of ATC
worldwide is to prevent collisions,
organize and expedite the flow of
traffic, and provide information and
other support for pilots. In some
countries, ATC plays a security or
defensive role, or is operated by the
military.
10. ProblemManagementFoundation
CMOC: Command and Control
Lessons from ATC
• To prevent collisions, ATC enforces traffic separation rules, which ensure each
aircraft maintains a minimum amount of empty space around it at all times. Many
aircraft also have collision avoidance systems, which provide additional safety by
warning pilots when other aircraft get too close.
• In many countries, ATC provides services to all private, military, and commercial
aircraft operating within its airspace. Depending on the type of flight and the
class of airspace, ATC may issue instructions that pilots are required to obey, or
advisories (known as flight information in some countries) that pilots may, at their
discretion, disregard. Generally the pilot in command is the final authority for the
safe operation of the aircraft and may, in an emergency, deviate from ATC
instructions to the extent required to maintain safe operation of their aircraft.
11. ProblemManagementFoundation
Command and Control
(1) CMOC Command and Control manager (NCCM)
(2) IT managers (ITM)
(3) Shift Supervisors (SS)
(4) Major incident Manager (MiM)
Use an HD video phone to connect via
video conferencing to the WAR ROOM.
Shift supervisors have extension boards to
quick-dial positions or contacts.
12. ProblemManagementFoundation
Fishbowl
• Interactive screens on outside positions of video wall
• Use android phones on consoles (multi purpose, extra display for
functions such as surveillance)
• Service corridor behind video wall for emergency power, network
points, switches, firewalls, broadband, UCM, mini-PCs for video walls
• Use automatic hdmi switches to increase PC count per screen
• All consoles use standardized mini-PCs and screens
• “The trenches” use desks with screens embedded in surface, tiger
teams use normal desks, Command and Control on elevated platform
13. ProblemManagementFoundation
Fishbowl
• Rear wall of CMOC behind NCCM
has clocks relevant to each time zone
that CMOC services
• All consoles are screen recorded; the
alpha tiger team uses the recordings to
analyse process optimization after a
major incident
• The CMOC itself requires a PTZ
surveillance camera (or fisheye)
• Install satellite DSTV/Openview feed
into CMOC for news updates
14. ProblemManagementFoundation
Voice communications
• All phones connected to a UCM and
recorded
• The FXS port of the UCM is
connected to a bell ringer which
rings when an incoming call remains
unanswered or when skeleton staff is
on duty
• Use the intercom function of the UCM to
automatically push voice
communications to a console, level,
discipline or the whole CMOC.
• Zello can be used to contact mobile
users and has a full audit trail.
• A UCM client such as Wave can be
used on smartphones
15. ProblemManagementFoundation
Major incident
traffic light LED
• Installed above tiger team row with
switch ADDON box on major incident
manager’s console
• Red – major incident in progress
• Amber – heightened potential threat of
major incident
• Green – Normal operations
• * Could also potentially be installed in
the trenches above each discipline if
required.
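A minimal Python sketch of the traffic light logic on this slide. The function name and its two boolean inputs are hypothetical; the slide defines only the three colours, not how the switch box is driven.

```python
from enum import Enum


class RagStatus(Enum):
    """The three states of the major incident traffic light."""
    RED = "Major incident in progress"
    AMBER = "Heightened potential threat of major incident"
    GREEN = "Normal operations"


def tower_status(major_incident_active: bool, threat_raised: bool) -> RagStatus:
    """Pick the colour to show; an active major incident always wins."""
    if major_incident_active:
        return RagStatus.RED
    if threat_raised:
        return RagStatus.AMBER
    return RagStatus.GREEN
```

The same enum could be reused per discipline if a light is installed above each row in the trenches.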
16. ProblemManagementFoundation
“The trenches”, level 1 CMOC
• Multiple positions per console (scale by size of enterprise) and
multiple rows per CMOC level (scale by size of enterprise)
• Level 1 of CMOC is the equivalent of a Network Operations Centre
(NOC) with permanent operator allocations and hot seats
• Centralization of resources
• Separate hubs of detection and operations should not be encouraged
• Rotate experts through the CMOC shift to provide training including
• IT Managers, Escalation/3rd level engineers, Service and delivery managers
• Use as Genchi Genbutsu
• Escalation point and not a duplication of the control room
• Disciplines
• Security: Physical and Information
• Data centre: Electrical, cooling and Fire-safety
• Infrastructure: VoIP/voice, servers, storage and networks
• Apps: Applications
• Support: Operations
• Social media monitoring
17. ProblemManagementFoundation
Skill level in “the trenches”
• Basic: monitors mission critical internet servers and services, email
notification, web based reports.
• Intermediate: device Identification, 24x7x365 network monitoring,
problem detection, notification to contacts, web based reports.
• Advanced: network map, real time fault detection, problem diagnosis and
resolution, escalation to telcos, hardware service providers or dispatchers,
network performance reports via the Internet, network engineer analysis
and reports on network status, complete network management.
• Supervisor: responsible for prioritizing tasks, assigning work to staff in “the
trenches” based on their skills, verifying that tickets are opened properly
and that relevant personnel are notified when required, escalating
problems, communicating with operational management if required,
responsible for change management compliance. (candidate/backup for
MiM)
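A minimal Python sketch of how a supervisor might assign work by skill level. The operator names and the rule of choosing the least-skilled operator who still qualifies are assumptions for illustration; the slide says only that work is assigned based on skills.

```python
from enum import IntEnum
from typing import Optional


class Skill(IntEnum):
    """Skill levels in the trenches, ordered so they can be compared."""
    BASIC = 1
    INTERMEDIATE = 2
    ADVANCED = 3


def assign(required: Skill, operators: dict[str, Skill]) -> Optional[str]:
    """Return the least-skilled operator who meets the required level.

    Returns None when nobody qualifies, in which case the supervisor
    would escalate the problem instead.
    """
    eligible = [(level, name) for name, level in operators.items()
                if level >= required]
    if not eligible:
        return None
    return min(eligible)[1]
```

Keeping advanced operators free for harder problems is one possible design choice; a real roster would also consider who is already busy.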
18. ProblemManagementFoundation
CMOC resilience
• Requires multiple switches and network paths
• Separate broadband connection to use for failover and Internet feeds
• VoIP connection for UCM and PSTN local failover
• Consider alternative methods of communications
• Emergency power local to CMOC – don’t just rely on building
generator
• Multiple electrical distribution feeds: A-red and B-blue
• Backup mini-PCs for consoles and video wall
19. ProblemManagementFoundation
Dashboards and metrics
• Dashboards provide metrics of the operating status of the work
environment.
• Dashboards are effective when visual.
• Metrics allow you to determine that a problem is occurring.
• Dashboards can be simplistic, like those found in motor vehicles
• Dashboards provide proactive and visual status information.
• Attributes of a good dashboard:
• Simple
• Easy to access and maintain
• Interactive
• Provides trending
• Thresholds and alerting
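The thresholds and trending attributes above can be sketched in a few lines of Python. The threshold values and sample figures in the usage note are invented for illustration, and the trend rule (last sample compared with the first) is a deliberately simple assumption.

```python
def rag_for(value: float, amber_at: float, red_at: float) -> str:
    """Map a metric to Red, Amber or Green using two thresholds."""
    if value >= red_at:
        return "Red"
    if value >= amber_at:
        return "Amber"
    return "Green"


def trend(samples: list[float]) -> str:
    """Report the direction of travel between the first and last sample."""
    if len(samples) < 2:
        return "flat"
    delta = samples[-1] - samples[0]
    if delta > 0:
        return "rising"
    if delta < 0:
        return "falling"
    return "flat"
```

For example, a link at 85% utilisation with hypothetical thresholds of 70 and 90 would show Amber, and the samples 40, 55, 85 would show a rising trend.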
20. ProblemManagementFoundation
Video wall
• Screen to monitor service desk statistics and call volumes.
• Increase in calls to the service desk could reflect an unknown
problem.
• Dashboard required from command and control (example use
common office tools)
• Status of daily checklists
• Areas of monitoring - OPACS
• (O) Outages
• (P) Performance (including capacity)
• (A) Accounting
• (C) Configuration
• (S) Security
21. ProblemManagementFoundation
CMOC textual dashboard
• Used to show status of CMOC
• Continuously displayed on the left most screen of video wall
• Updated by CMOC manager or designated shift leader
• Displays most important events and relevant CMOC information
22. ProblemManagementFoundation
CMOC textual Dashboard (info)
Example
• Ongoing SLA or contract violations
• Last 10 maintenance tasks completed
• Next 10 maintenance tasks scheduled
• Planned continuity tests scheduled such
as inverter or generator tests, network
path protection tests, business continuity
or application high availability tests.
• Changes including emergency ones
completed during the past week (so if it
is Wednesday, then all the changes
completed since the previous Thursday).
This includes the status on whether they
have been successful or failed.
• Changes scheduled for the next week
(thus if it is Tuesday, then all the changes
scheduled up to the next Monday).
• Top 10 congested network links
• Top 10 devices with temperature alerts
• Top 10 devices with cooling alerts
• Top 10 devices with storage (such as RAID
failures) or capacity alerts
• Systems/devices with known problems or
symptoms of degradations
• Resources available to the CMOC
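The two change windows in this example are rolling seven-day ranges that include today. A minimal Python sketch of the date arithmetic, with hypothetical function names:

```python
from datetime import date, timedelta


def past_week_window(today: date) -> tuple[date, date]:
    """Changes completed: from six days ago up to and including today."""
    return today - timedelta(days=6), today


def next_week_window(today: date) -> tuple[date, date]:
    """Changes scheduled: from today up to and including six days ahead."""
    return today, today + timedelta(days=6)
```

Called on a Wednesday, `past_week_window` starts on the previous Thursday; called on a Tuesday, `next_week_window` ends on the next Monday, matching the examples on the slide.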
23. "If they could get a washing machine to fly, my Jimmy could land it."
The mother of Apollo 13’s commander, Jim Lovell
24. ProblemManagementFoundation
Review
• There is a best practice for how a CMOC
needs to be established
• Logical setup is more important than
physical
• Physical may have constraints
• CMOC needs appropriate tools and
resources
• Multi-disciplinary resources are preferred
especially when tiger teams deal with
escalations
• Accurate and up to date information is
required for effective and efficient
operations. Visual feedback is most
effective.
• Resilience in components and operations is
crucial (including people, technology and
location)
• Home of command and control
Editor's Notes
Objectives
Refer video https://lnkd.in/ejJYP_r
Most often business continuity is described as a separate process that caters for a disaster. However, a separate process creates inefficiencies because, in reality, business continuity is a special case of the major incident process whereby a full workaround is required. The trigger for business continuity or any disaster recovery initiation will always be the result of an escalation from a major incident.
So if the trigger for the implementation of the disaster recovery plan is via the major incident process, where is the trigger for major incidents? The trigger is definitely not via the service desk, because if that is the case then the Information Technology (IT) processes have failed! If an event has occurred that has severe negative business consequences, and the mechanism by which this becomes known is a reactive call ticketing system, then it is obvious that there are no proactive measures in place.
The trigger for a major incident should be escalated from the Crisis Management Operations Centre (CMOC). The CMOC should have all disciplines of IT represented within it, and the command structure of the CMOC should be within the area and not external. The boss should be there at the coalface and not in a separate room. A great example for the CMOC is the Mission Control used during the Apollo programme. The layout is below and shows the Director of Flight Operations in the prime spot.
The Apollo Mission Control went live on 3rd June 1965, nearly 50 years ago. Mission Control’s finest moments include not only the first time man walked on the moon but also the “successful failure” of Apollo 13. The configuration has proven itself in space flight, and over time the vintage consoles and switches have been replaced by modern computers such as laptops and PCs.
Refer https://lnkd.in/evTNtJX
Let us discuss the modern day CMOC. The CMOC manager assumes the position of Director of Flight Operations at position (1), and the Service Delivery Manager, who is responsible for customer interactions, assumes position (4). Positions (2) and (3) are taken up by other IT and data centre management as well as the major incident manager, who also has his own console, typically the furthest location to the right rear at position (3). Position (2) is also often used by shift supervisors in the CMOC. The CMOC thus has three functional areas or stages, namely command and control, tiger teams and the trenches. This might consist of three rows of consoles as a minimum, but each functional area might have more rows in larger organizations, especially in the trenches. This type of CMOC is commonly called a three level CMOC. Each of the sections is signposted with a sign suspended from the ceiling and hanging perpendicular to the video wall. Alternatively, the consoles can be labelled using a Brother label printer.
Command and control
CMOC Command and Control Manager (NCCM)
IT Managers (ITM)
Shift leaders (SL)
Major Incident Manager (MIM)
The consoles in Command and Control use HD video phones such as the GXV3275.
http://www.grandstream.com/products/ip-video-telephony/ip-video-phones-android/product/gxv3275
The GXV3275 IP Video Phone for Android™ delivers a powerful voice, video and multimedia business communications experience to keep workers in-touch and up-to-date. This one-of-a-kind 6-line IP phone features a tablet-like 7 inch touch screen and, in addition to voice calls allows users to keep in touch with co-workers and clients through a variety of video calling platforms (Grandstream’s free IPVideoTalk service in addition to any Android app, such as Skype, Google Hangouts, and more). This IP Video Phone for Android offers full access to the Google Play Store and the millions of Android apps – including powerful business productivity apps like SalesForce1, GoToMeeting, service provider apps and more. Additional features include integrated Bluetooth for pairing of headset and mobile devices for contact book/calendar exchange and call transferring, Gigabit ports, integrated WiFi, and more. In addition to its value as an IP voice and video phone, the GXV3275 is also a great addition to any IP surveillance or door access solution. Pair it with third party door phones, SIP door openers or IP cameras to allow users to control these third-party devices in an office or apartment right from the GXV3275. From a surveillance standpoint, the GXV3275 can make and receive SIP video calls from IP surveillance cameras for security alerts/alarms or for checking a camera's live video feed.
Most CMOC locations are “fishbowls”, meaning they are enclosed in glass in the central office space of a company, with the video wall on one side of the room and all consoles facing this wall in a typical classroom setup. A good setup should have 5 LED TVs on the video wall: one 65” in the middle with two 55” screens on either side. The two LED screens on the far left and far right can be interactive ones. Depending on the size of the CMOC, the size and number of screens may increase.
Behind the video wall should be a service corridor where the mini PCs for the wall are racked (typically Intel mini PCs). In this area a UPS for the CMOC can be installed, typically a 3 kW, 5 kW or 10 kW unit (these units are available from Powafull). These units can also be attached to PV solar panels, as the overall CMOC power requirements make it a perfect candidate for alternative green energy solutions. Since the backup power solutions are installed in this area, the power distribution boards should also be in this location. The service corridor also contains the cabling racks and network switches. Within the service corridor the PCs are on racks and include an HDMI computer screen together with a wireless keyboard and mouse to troubleshoot the video wall.
All consoles have a minimum of two network cables, and the CMOC itself has two resilient paths to the data centre. It is recommended that the rear row of the CMOC, where the command and control area is situated, is on a raised floor. The trenches should use recessed screens in the desks. Each console should be equipped with two 22” to 24” LED screens connected to an Intel mini PC. The CMOC should have a dual band Wi-Fi AP to which smartphones are attached. Each console will also have individuals operating on their smartphones.
To cater for more than 5 PCs displaying information in the CMOC, multiple mini PCs are used. It is not suitable for different applications to share a single PC, so each functional application is installed on its own mini PC. Thus each screen can have an HDMI switch that supports 3 sources (here is an example), providing a maximum of 15 mini PCs, which will cater for extremely large environments. One of the slots on the central main screen of the video wall will be assigned to a satellite TV service such as DSTV. This isn’t for entertainment purposes but for access to news and weather bulletins. Alternatively, the status and map screens from Weather SA can be displayed. The PCs are usually installed with Windows 10 and accessed via RDP using either RDP manager or mRemoteNG. In the CMOC an SSH tool such as PuTTY or KiTTY is also used for CLI access to networking infrastructure.
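The video wall capacity described above is a simple multiplication of screens by HDMI switch inputs. A minimal Python sketch, with hypothetical function and parameter names:

```python
def max_sources(screens: int, inputs_per_switch: int) -> int:
    """Total sources the wall can show when every screen has one switch."""
    return screens * inputs_per_switch


def mini_pc_slots(screens: int, inputs_per_switch: int, reserved: int = 0) -> int:
    """Sources left for mini PCs after reserving inputs for other feeds."""
    free = max_sources(screens, inputs_per_switch) - reserved
    if free < 0:
        raise ValueError("more inputs reserved than the wall provides")
    return free
```

With 5 screens and 3 inputs per switch this gives the 15 sources quoted above; reserving one input on the main screen for the satellite TV feed leaves 14 for mini PCs.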
It is crucial for the CMOC to have video surveillance of itself, as that footage is needed for review when it comes to lessons learnt (this is the topic of a separate article). It is also best practice to record all activities on the PCs using a screen recording tool. ShareX is such a tool, and it will assist in recording all activity, which can then be saved to a location on a shared resource like OneDrive. In a CMOC time is important, and most PCs have suitable clocks and timers, but a suggestion is to have a clock located behind the CMOC manager in position (1) with its own independent power source, such as a battery. If the CMOC deals with multiple time zones, a new clock is added for each new time zone.
Sample surveillance camera for CMOC
http://www.vivotek.com/fe8182/
The CMOC should have a SIP based voice infrastructure where each console has a video phone with headset ability. These phones are typically Android based and have software that can be used to view surveillance cameras. Thus the CMOC operators can view surveillance on the video wall, desktop phones or smartphones. Additionally, the Android phone can run Zello, an application that provides crystal clear voice communications with an audit trail. The VoIP system in the CMOC needs to support call detail records, auto attendant and multiple level IVR, call recording, voice and fax to email, plus centralized contact lists. The CMOC would use a system such as a UCM, with a bell ringer for night-time operations connected to one of its FXS ports or to an ATA. The UCM has paging functions which automatically play a message via the handset speaker. In the CMOC various intercom groups are created:
An intercom for all console operators
An intercom group for each level in the CMOC namely command and control, tiger teams and “the trenches.”
Each functional discipline in each of the levels has its own intercom.
Thus if a major incident is declared then the announcement is made to the tiger team intercom. Another example is a shift leader change, when the new shift leader announces himself to the whole CMOC via the intercom. Additionally, a permanently nailed-up audio conference bridge is active on the UCM for the tiger teams.
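The intercom grouping above (one group for everyone, one per level, one per discipline within a level) can be sketched in Python. The group name format is hypothetical and this is not UCM configuration. Treating each tiger team as a discipline, and leaving command and control without sub-groups, are assumptions.

```python
# Levels and their disciplines, taken from the slides; the mapping of
# tiger teams to "disciplines" is an assumption for illustration.
LEVELS = {
    "command and control": [],
    "tiger teams": ["Echo", "Delta", "Romeo", "Whisky", "Bravo", "Alpha"],
    "the trenches": ["Security", "Data centre", "Infrastructure",
                     "Apps", "Support"],
}


def intercom_groups(levels: dict[str, list[str]]) -> list[str]:
    """Build the list of intercom groups: all, per level, per discipline."""
    groups = ["all consoles"]
    for level, disciplines in levels.items():
        groups.append(f"level: {level}")
        for discipline in disciplines:
            groups.append(f"discipline: {level} / {discipline}")
    return groups
```

A major incident announcement would then target the group "level: tiger teams", while a shift leader change would target "all consoles".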
The doors to the CMOC have SIP door phones. Example here. These remote door phones can be used in other areas of the data centre as well for providing remote physical access.
Android VoIP client – Grandstream Wave - http://www.grandstream.com/products/ip-voice-telephony/softphone-app/product/grandstream-wave
The middle row of the CMOC is taken up by the tiger teams. Here is a brief summary of those teams:
Echo: Identify and handle communications
Delta: Diagnose and process information detected and presented
Romeo: Repair, restore and recover the services based on input presented by diagnostics.
Whisky: Deliver a temporary workaround if required.
Bravo: Provide input on business continuity and action disaster recovery plan if required.
Alpha: Analyse the crisis and implement any lessons learnt and mitigations.
The tiger teams form an important part of crisis management and have been addressed separately in the article on tiger teams here. One of the items the Major Incident Manager has on his desk is an AddonBOX which activates a Red, Amber and Green LED signal tower above the tiger team rows (obtain these units here). This is commonly referred to as the RAG tower.
The front row, “the trenches”, is where the steely-eyed rocket men sit. The CMOC is the location for all command and control of IT. Some companies have a Security Operations Centre (SOC). The CMOC is the single point of contact and control, and a separate location such as a SOC should not exist. Physical, infrastructure and organizational security exist as positions in the trenches.
Besides InfoSec and PhySec mentioned above, the other consoles should be manned by specialists in networks, storage, servers and data centre environmentals (electrical and cooling), including a safety console for fire and water. There are also consoles for faults/outages and capacity/performance. The CMOC does not duplicate or handle the functions of the service desk. The position to the front right is taken up by the CMOC support resource, who is responsible for the technology used in the CMOC.
Skill level in “the trenches”
CMOC resilience
Dashboards and metrics
A metric in a motor vehicle is fuel. You use the fuel gauge to determine when is the most appropriate time to refuel.
An extreme example is NASA's mission control. You are not required to implement a NASA type Mission Control but choose an appropriate dashboard that will provide visual feedback of your business.
As an example in the emergency section of a hospital, a monitor displaying the number of patients in each type of emergency category and the waiting time for each is extremely useful and provides confidence to both staff and patients.
Video wall
There are many complex dashboards and visualisations available for a CMOC. They all serve a purpose, even the clock behind the CMOC Manager and the RAG tower above the tiger teams. One of the simple tools (available here) is the CMOC PPT dashboard, a.k.a. the Uberfingers dashboard. The dashboard is a PowerPoint slide deck that loops continuously, with each slide displayed for 19 seconds. The CMOC Manager or one of the designated shift leaders creates the slide deck, which is configured to loop constantly on the video wall in the furthest position to the left. The slide deck has a naming convention of <CMOC_dashboard_date_daily-increment>. Thus as the day begins the increment is 1, and it increases each time the slide deck is modified. It consists of these components:
Ongoing SLA or contract violations
Last 10 maintenance tasks completed
Next 10 maintenance tasks scheduled
Planned continuity tests scheduled such as inverter or generator tests, network path protection tests, business continuity or application high availability tests.
Resources available to the CMOC
Resources unavailable to the CMOC
Changes completed during the past week (so if it is Wednesday, then all the changes completed since the previous Thursday). This includes the status on whether they have been successful or failed.
Changes scheduled for the next week (thus if it is Tuesday, then all the changes scheduled up to the next Monday).
Emergency changes completed and in progress.
RAG status of teams in “the trenches”. Each discipline has a status of Red, Amber or Green.
Top 10 ongoing projects of which the CMOC must be aware.
Top 10 congested network links
Top 10 devices with temperature alerts
Top 10 devices with cooling alerts
Top 10 devices with storage (such as RAID failures) or capacity alerts
Systems/devices with known problems or symptoms of degradations
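The slide deck naming convention and loop timing described in these notes can be sketched in Python. The YYYYMMDD date format is an assumption, since the notes do not state one, and the function names are hypothetical.

```python
from datetime import date

SECONDS_PER_SLIDE = 19  # display time per slide, as stated in the notes


def deck_name(day: date, increment: int) -> str:
    """Build a name of the form CMOC_dashboard_<date>_<daily-increment>."""
    if increment < 1:
        raise ValueError("the daily increment starts at 1")
    return f"CMOC_dashboard_{day:%Y%m%d}_{increment}"


def next_increment(existing_names: list[str], day: date) -> int:
    """Find the next increment for the day from the names already used."""
    prefix = f"CMOC_dashboard_{day:%Y%m%d}_"
    used = [int(name[len(prefix):]) for name in existing_names
            if name.startswith(prefix) and name[len(prefix):].isdigit()]
    return max(used, default=0) + 1


def loop_seconds(slide_count: int) -> int:
    """Time for one full pass of the looping deck."""
    return slide_count * SECONDS_PER_SLIDE
```

The loop time is worth checking when the deck grows: a deck of 10 slides takes 190 seconds to come back round to any given item.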