3. ProblemManagementFoundation
Avoidance elements
• Resilience - can deal with errors
• Redundancy - can deal with failures
• Fail-over - can deal with loss
• Documentation - cannot be created in a crisis; it needs to be
available in advance
• Correct implementation
4. ProblemManagementFoundation
Factors that determine redundancy
Redundancy requires an alternative component to be available
• Complexity: Degree of complexity based on the number of items and
interconnects required to provide service
• Hardware age: Measured against the stated lifecycle from the vendor
• Software age: Based on the number of items and how many versions
each is behind the current release
• Supportability: Supportability factor based on in-house capability (key
man dependencies), reliance on third parties and contractual
arrangements
5. ProblemManagementFoundation
Factors that determine redundancy
(cont.)
• Single points of failure: Based on number of infrastructure single
points of failure
• Disaster recovery time: Based on time to implement full DR plan
• Capacity/Performance: Utilised capacity at the tightest bottleneck
• Environmental: Factor based on the risk of virus attack, user breakage,
physical damage, power failure etc.
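As a rough illustration, the eight factors above could be combined into a single score per service. This is a minimal sketch, not a method taken from the course: the 1 to 5 rating scale, the equal default weights, and the names `FACTORS` and `risk_score` are all assumptions made for the example.

```python
# Hypothetical scoring sketch: the 1-5 scale and the weights are assumptions.
FACTORS = (
    "complexity",
    "hardware_age",
    "software_age",
    "supportability",
    "single_points_of_failure",
    "disaster_recovery_time",
    "capacity_performance",
    "environmental",
)


def risk_score(ratings, weights=None):
    """Return the weighted average of the factor ratings (1 = low risk, 5 = high risk)."""
    weights = weights or {name: 1.0 for name in FACTORS}
    missing = [name for name in FACTORS if name not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {', '.join(missing)}")
    for name in FACTORS:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name} must be rated from 1 to 5")
    total_weight = sum(weights[name] for name in FACTORS)
    return sum(ratings[name] * weights[name] for name in FACTORS) / total_weight


if __name__ == "__main__":
    # Illustrative ratings for an imaginary mail service.
    mail_service = dict.fromkeys(FACTORS, 2)
    mail_service["single_points_of_failure"] = 5  # e.g. one unmirrored mail store
    print(f"risk score: {risk_score(mail_service):.2f}")
```

A weighted average keeps the score on the same 1 to 5 scale as the inputs, which makes services easy to compare; a real assessment might instead let a single very poor factor dominate.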
6. ProblemManagementFoundation
Example for redundancy
• Spare tyre in car
• If a tyre punctures it is possible to stop the car, replace the tyre with a spare
from the boot and continue the journey.
• A full working alternative of the failed component is available.
7. ProblemManagementFoundation
Factors that determine resilience
Resilience is the ability of a component to continue to operate even
though a failure has occurred.
• Factors that determine resilience are similar to those for redundancy
• Budget required for resilience will be different from that for redundancy
and fail-over.
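One way to picture resilience in software is a component that keeps answering, in a degraded mode, after something it depends on has failed. The sketch below is only an illustration under assumptions: the class name `ResilientLookup`, the "live"/"stale" labels, and the choice to fall back on the last known good value are all hypothetical, not part of the course material.

```python
class ResilientLookup:
    """Keeps answering from the last known good value when the live source fails."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that may raise when the source is down
        self._last_good = {}     # key -> most recent successful result

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            # The failure has occurred, but we can still operate if we have
            # seen this key before; otherwise there is nothing to fall back on.
            if key in self._last_good:
                return self._last_good[key], "stale"
            raise
        self._last_good[key] = value
        return value, "live"
```

The caller is told whether the answer is live or stale, so the degraded state stays visible rather than being silently hidden.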
10. Overview of an ISP
[Figure: Overview of an ISP. Labelled elements: Internet; peering partners (JINX/CINX); transits; tiered SPs; gateways; peering distribution; primary data centre and secondary data centre; core; caches; network management systems; PDSN; PPPoE; OSS/BSS systems; value added services; Telkom IPC; ADSL; fixed line; mobile; interconnects; Metro Ethernet; fibre rings; RF rings; satellite; RF high sites; customer CPEs.]
11. ProblemManagementFoundation
Factors that determine fail-over
A component is able to swap over to another component without
interruption.
• Similar to redundancy and resilience
• Budget required would be different to resilience and redundancy
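A minimal sketch of fail-over, assuming two interchangeable components that are modelled as plain callables. The class name `FailoverPair` and the rule "switch on any exception" are assumptions chosen to keep the example short; real fail-over usually also involves health checks and state synchronisation.

```python
class FailoverPair:
    """Routes requests to the active component and swaps over when it fails."""

    def __init__(self, primary, secondary):
        self._components = [primary, secondary]
        self._active = 0  # index of the component currently serving requests

    def call(self, request):
        # Give each component one chance, starting with the active one.
        for _ in range(len(self._components)):
            component = self._components[self._active]
            try:
                return component(request)
            except Exception:
                # Swap over to the other component and retry the same request.
                self._active = (self._active + 1) % len(self._components)
        raise RuntimeError("all components failed")
```

Because the failed request is retried on the other component inside `call`, the caller sees no interruption as long as one component is healthy.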
13. ProblemManagementFoundation
Documentation
• Why do you need documentation?
• Advantages of having documentation before a crisis
• Types of documentation required
• Impact of lack of documentation
• Keeping documents updated
• Documentation standards
14. ProblemManagementFoundation
Documentation
• Advantages of having documentation before a crisis
• When a crisis occurs it is time consuming to start the diagnosis if
documentation of the systems is not available
• An understanding of the system needs to be created and there should be
a set of up-to-date, fully documented procedures and processes that are
available and easy to implement
• New staff members require a reference for processes. A process can only
be as good as its documentation. Correctly used processes avoid errors.
In the event of a crisis, uncertainty is reduced and time to resolution
is shortened.
• When a failure occurs, processes and documentation need to be
changed to avoid a recurrence. It could be as simple as a more
detailed sanity check before running a process that can destroy some
part of the system.
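The "more detailed sanity check" mentioned above might look like the following for a process that deletes a directory. This is a sketch under assumptions: the function name `checked_purge`, the idea of an `allowed_root`, and the dry-run default are hypothetical choices for illustration, not a procedure from the course.

```python
from pathlib import Path
import shutil


def checked_purge(target, allowed_root, dry_run=True):
    """Delete `target` only if it is a directory strictly inside `allowed_root`."""
    target = Path(target).resolve()
    allowed_root = Path(allowed_root).resolve()
    # Sanity check 1: never touch the root itself or anything outside it.
    if target == allowed_root or allowed_root not in target.parents:
        raise ValueError(f"refusing to purge {target}: not inside {allowed_root}")
    # Sanity check 2: only directories are expected here.
    if not target.is_dir():
        raise ValueError(f"refusing to purge {target}: not a directory")
    # Default to reporting what would happen rather than doing it.
    if dry_run:
        return f"would purge {target}"
    shutil.rmtree(target)
    return f"purged {target}"
```

Making the dry run the default means the destructive path has to be requested explicitly, which is the kind of small process change a post-failure review tends to produce.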
19. ProblemManagementFoundation
Correct implementation
• Use a structured approach and plan; do not “fly by the seat of your
pants”.
• Understand the deliverables and measure/gauge progress to the
target.
• David Allen, a productivity guru, frequently asserts that anything
that takes more than two steps and two minutes to accomplish is
a project.
• David Ruiz, director of IT at DIC Entertainment Corp, states that
nine out of 10 times taking the extra time to create a plan will
save you time and money.
• Refer to Appendix for project management resources.
20. ProblemManagementFoundation
Review:
Bottom line
In order to avoid a crisis, ensure you
have redundancy and resilience
implemented correctly, supported by
appropriate documentation and
measurements.
A crisis can be mitigated if systems have
been engineered with foresight about
how failures are handled.
Editor's Notes
Crisis engineering
Objectives
Crisis engineering
Component Failure Impact Analysis - http://www.itsmsolutions.com/newsletters/DITYvol1iss4.htm
ITIL suggests
“Component Failure Impact Analysis” aka ‘single point of failure’ analysis
Complexity: Degree of complexity based on the number of items and interconnects required to provide service
Hardware age: Measured against the stated lifecycle from the vendor
Software age: Based on the number of items and how many versions each is behind the current release
Supportability: Supportability factor based on in-house capability (key man dependencies), reliance on third parties and contractual arrangements
Single points of failure: Based on number of infrastructure single points of failure
Disaster recovery time: Based on time to implement full DR plan
Capacity/Performance: Utilised capacity at the tightest bottleneck
Environmental: Factor based on the risk of virus attack, user breakage, physical damage, power failure etc.
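A very simplified take on single point of failure analysis can be sketched in code. The example below is not the ITIL CFIA procedure itself: the data layout (service, then role, then the components able to fill that role), the function name, and the sample inventory are assumptions made for illustration.

```python
def single_points_of_failure(service_map):
    """Map each component with no alternative to the services that depend on it.

    service_map: service name -> {role: [components able to fill that role]}
    """
    impact = {}
    for service, roles in service_map.items():
        for role, components in roles.items():
            # A role with exactly one component has no redundancy.
            if len(components) == 1:
                impact.setdefault(components[0], set()).add(service)
    return impact


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "email": {"mail store": ["store-1"], "gateway": ["gw-1", "gw-2"]},
        "web": {"database": ["store-1"], "front end": ["web-1", "web-2"]},
    }
    for component, services in single_points_of_failure(inventory).items():
        print(component, "->", sorted(services))
```

Listing the affected services per component shows where a single failure would have the widest impact, which is where redundancy budget is usually best spent first.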
Example of redundancy
Demonstrate
The Swiss cheese experiment
Take three pieces of paper and punch a random hole in each with a pen.
Each sheet is a system with multiple components. If you line the sheets up together, the holes will usually not align, and until they do, services will not fail. This demonstrates the concept of resilience in systems.
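The paper experiment can be approximated numerically. The sketch below assumes each sheet has exactly one hole, placed uniformly and independently at one of a fixed number of positions; those assumptions, and the function names, are mine and simplify the physical demonstration considerably.

```python
import random


def aligned_probability(layers, positions):
    """Exact chance that one random hole per layer lands on the same position."""
    # The first hole can be anywhere; each further layer must match it.
    return (1 / positions) ** (layers - 1)


def simulate(layers, positions, trials, seed=0):
    """Estimate the same chance by repeating the experiment `trials` times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        holes = {rng.randrange(positions) for _ in range(layers)}
        hits += len(holes) == 1  # all holes share one position
    return hits / trials


if __name__ == "__main__":
    print("exact:    ", aligned_probability(3, 10))
    print("simulated:", simulate(3, 10, 100_000))
```

Under these assumptions, every extra layer divides the chance of a full alignment by the number of positions, which is why stacking independent safeguards is so effective.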
Best practice network design for a service provider
The diagram is an example of engineering a solution for redundancy, resilience and failover.
Examples of fail-over
Naming conventions
Documents that follow a consistent naming convention and structure make the systems they describe easier to understand.
Basic project management can, within reason, be applied to anything. The alternative method would be to "fly by the seat of your pants" and then "crash and burn"; Eskom’s Medupi implementation is an example.
The methods for approaching large projects are well documented, but that is not what we are talking about here. Often a person is tasked with completing a set of deliverables in a few days, so there needs to be a methodical way to approach these short pieces of work: a lightweight version of doing projects. David Allen, a productivity guru, frequently asserts that anything that takes more than two steps and two minutes to accomplish is a project. David Ruiz, director of IT at DIC Entertainment Corp, states that nine out of 10 times taking the extra time to create a plan will save you time and money.