4. Engineering to avoid a crisis
ProblemManagementFoundation
Objectives
• Redundancy
• Resilience
• Fail-over
• Documentation
Avoidance elements
• Resilience - can deal with errors
• Redundancy - can deal with failures
• Fail-over - can deal with loss
• Documentation - cannot be created in a crisis; it needs to be
available in advance
• Correct implementation
Factors that determine redundancy
Redundancy requires an alternative component to be available.
• Complexity: Degree of complexity based on the number of items and
interconnects required to provide service
• Hardware age: Measured against the stated lifecycle from the vendor
• Software age: Based on the number of items and how many versions
each one is behind the current release
• Supportability: Based on in-house capability (key-person
dependencies), reliance on third parties and contractual
arrangements
Factors that determine redundancy (cont.)
• Single points of failure: Based on number of infrastructure single
points of failure
• Disaster recovery time: Based on time to implement full DR plan
• Capacity/Performance: Utilised capacity at the tightest bottleneck
• Environmental: Based on the risk of virus attack, user breakage,
physical damage, power failure, etc.
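The single-points-of-failure factor can be checked mechanically once the service is written down as a dependency map. The following is a minimal sketch, not a prescribed method: the topology, the component names and the helper functions are all hypothetical. It removes each component in turn and tests whether the customer can still reach the Internet.

```python
from collections import deque


def reachable(graph, start, removed):
    """Return the set of nodes reachable from start while `removed` is down."""
    if start == removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def single_points_of_failure(graph, source, target):
    """List components whose loss alone cuts every path from source to target."""
    candidates = set(graph) - {source, target}
    return sorted(
        node for node in candidates
        if target not in reachable(graph, source, removed=node)
    )


# Hypothetical topology: two gateways, but only one core switch.
topology = {
    "customer": ["gateway_a", "gateway_b"],
    "gateway_a": ["core"],
    "gateway_b": ["core"],
    "core": ["internet"],
    "internet": [],
}

print(single_points_of_failure(topology, "customer", "internet"))  # ['core']
```

Either gateway can fail without loss of service, so only the core is reported. The count of reported components is one way to arrive at the "number of infrastructure single points of failure" named above.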
Example of redundancy
• Spare tyre in car
• If a tyre punctures, it is possible to stop the car, replace the tyre with the spare
from the boot and continue the journey.
• A full working alternative of the failed component is available.
Factors that determine resilience
Resilience is the ability of a component to continue to operate even
though a failure has occurred.
• The factors that determine resilience are similar to those for redundancy
• The budget required for resilience will differ from that for redundancy
and fail-over.
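In software, resilience often takes the form of graceful degradation: the component keeps answering, possibly with older data, while part of it is broken. The sketch below is illustrative only; `fetch_with_fallback`, the cache and the two backends are hypothetical names.

```python
def fetch_with_fallback(fetch, cache, key):
    """Return (value, degraded); serve the last known value if fetch fails."""
    try:
        value = fetch(key)
    except Exception:
        if key in cache:
            # The failure has occurred, but the component keeps operating.
            return cache[key], True
        raise  # nothing to fall back on
    cache[key] = value
    return value, False


def healthy_backend(key):
    return f"fresh value for {key}"


def broken_backend(key):
    raise ConnectionError("backend unavailable")  # hypothetical failure


cache = {}
print(fetch_with_fallback(healthy_backend, cache, "status"))
print(fetch_with_fallback(broken_backend, cache, "status"))
```

The second call still returns a value, flagged as degraded, much as a run-flat tyre keeps the car moving at reduced speed.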
Example of resilience
• A BMW car with run flats
• A mountain bike (MTB) with gel that self-seals a hole, commonly
known as sludge or slime
• The radials in tubeless tyres
Resilience: The Swiss Cheese experiment
[Figure: three sheets numbered 1, 2 and 3]
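The Swiss cheese experiment described in the editor's notes can be put into numbers. As a rough sketch, assume each sheet (layer) fails independently with some probability; the service fails only when the holes in every layer line up, so the probabilities multiply. The 10% figures and the function names below are hypothetical, and real layers are rarely fully independent.

```python
import math
import random


def all_layers_fail_probability(layer_failure_probs):
    """Probability that every layer fails at once, assuming independence."""
    return math.prod(layer_failure_probs)


def simulate(layer_failure_probs, trials=100_000, seed=1):
    """Estimate the same probability by random sampling."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A trial counts only if the "hole" appears in every layer.
        if all(rng.random() < p for p in layer_failure_probs):
            hits += 1
    return hits / trials


layers = [0.1, 0.1, 0.1]  # hypothetical: each sheet has a 10% hole
print(all_layers_fail_probability(layers))  # about 0.001
print(simulate(layers))
```

Three layers that each fail one time in ten line up roughly one time in a thousand, which is why adding an independent layer tends to help more than polishing an existing one.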
Overview of an ISP
[Figure: network diagram showing the Internet, peering partners (JINX/CINX), transits, tiered SPs, gateways, peering distribution, primary and secondary data centres (each with core and caches), network management systems, PDSN/PPPoE, OSS/BSS systems, fibre rings, RF rings, satellite, RF high sites, customer CPEs, value added services, Telkom IPC, ADSL, fixed line, mobile, interconnects and Metro Ethernet]
Factors that determine fail-over
Fail-over means a component is able to swap over to another component
without interruption.
• The determining factors are similar to those for redundancy and resilience
• The budget required will differ from that for resilience and redundancy
Example of fail-over
• Trucks with multiple wheels per hub
• Electrical supply from the utility, with an automatic transfer switch
(ATS) to a generator that starts automatically.
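The ATS example maps onto a common software pattern: requests go to a primary, and on failure the caller is switched to a standby automatically. The sketch below is illustrative only; `FailoverPair` and the two supply functions are hypothetical names.

```python
class FailoverPair:
    """Route calls to the active component; switch to the standby on failure."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def call(self, request):
        try:
            return self.active(request)
        except Exception:
            # Swap roles and retry once, so the caller sees no interruption.
            self.active, self.standby = self.standby, self.active
            return self.active(request)


def utility_power(request):
    raise RuntimeError("utility supply lost")  # hypothetical outage


def generator(request):
    return f"generator served {request}"


supply = FailoverPair(utility_power, generator)
print(supply.call("rack 1"))  # generator served rack 1
print(supply.call("rack 2"))  # generator served rack 2
```

Unlike the spare tyre, nobody has to stop and swap anything by hand: the switch happens inside `call`, which is what separates fail-over from plain redundancy.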
Documentation
• Why do you need documentation?
• Advantages of having documentation before a crisis
• Types of documentation required
• Impact of lack of documentation
• Keeping documents updated
• Documentation standards
Documentation
• Advantages of having documentation before a crisis
• When a crisis occurs, starting the diagnosis is time-consuming if
documentation of the systems is not available.
• An understanding of the system needs to exist beforehand, together with
a set of up-to-date, fully documented procedures and processes that are
available and easy to implement.
• New staff members require a reference for processes. A process can only
be as good as its documentation. Correctly used processes avoid errors.
In the event of a crisis, uncertainty and time to resolution are both
reduced.
• When a failure occurs, processes and documentation need to be
changed to avoid a recurrence. It could be as simple as a more
detailed sanity check before running a process that wipes some part
of the system.
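A sanity check of that kind can be small. The sketch below is one possible shape, under the assumption that the destructive step is wrapped in a function; the specific checks (environment, verified backup, overly broad target) and all names are hypothetical.

```python
def sanity_check(target, expected_env, current_env, backup_verified):
    """Return a list of reasons not to proceed; an empty list means go."""
    problems = []
    if current_env != expected_env:
        problems.append(f"running in {current_env!r}, expected {expected_env!r}")
    if not backup_verified:
        problems.append("no verified backup")
    if target in ("", "/", "*"):
        problems.append(f"target {target!r} is too broad")
    return problems


def run_destructive(action, **checks):
    """Run the action only if every sanity check passes."""
    problems = sanity_check(**checks)
    if problems:
        raise RuntimeError("refusing to run: " + "; ".join(problems))
    return action()


# A run that should be blocked: wrong environment, no backup, broad target.
print(sanity_check(target="/", expected_env="test",
                   current_env="production", backup_verified=False))
```

Each new incident can add one more line to `sanity_check`, which keeps the lesson in the process itself rather than in someone's memory.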
Documentation
• Types of documentation required
• Inventory listings
• Rack and floor plan capacity
• Rack layout diagrams
• Patch panel connections
• Network switch connections
• Power strip connections
• Network diagrams
• Storage diagrams
• Domain diagrams
• Capacity reports
• Change audit trails
Documentation
• Impact of lack of documentation
Documentation
• Keeping documents updated
Documentation
• Documentation standards
Correct implementation
• Use a structured approach and a plan; do not “fly by the seat of
your pants”.
• Understand the deliverables and measure/gauge progress to the
target.
• David Allen, a productivity guru, frequently asserts that anything
that takes more than two steps and two minutes to accomplish is
a project.
• David Ruiz, director of IT at DIC Entertainment Corp, states that
nine out of 10 times taking the extra time to create a plan will
save you time and money.
• Refer to Appendix for project management resources.
Review:
Bottom line
In order to avoid a crisis, ensure you
have redundancy and resilience
implemented correctly, supported by
appropriate documentation and
measurements.
A crisis can be mitigated if systems have
been engineered with foresight about
how failures are handled.

Problem management foundation - Engineering


Editor's Notes

  • #2 Crisis engineering
  • #3 Objectives
  • #4 Crisis engineering. ITIL suggests “Component Failure Impact Analysis”, also known as single point of failure analysis: http://www.itsmsolutions.com/newsletters/DITYvol1iss4.htm
  • #5 Complexity: degree of complexity based on the number of items and interconnects required to provide service. Hardware age: measured against the stated lifecycle from the vendor. Software age: based on the number of items and how many versions each one is behind the current release. Supportability: based on in-house capability (key-person dependencies), reliance on third parties and contractual arrangements. Single points of failure: based on the number of infrastructure single points of failure. Disaster recovery time: based on the time to implement the full DR plan. Capacity/Performance: utilised capacity at the tightest bottleneck. Environmental: based on the risk of virus attack, user breakage, physical damage, power failure, etc.
  • #7 Example of redundancy
  • #10 Demonstrate the Swiss cheese experiment: take three pieces of paper and punch a random hole in each with a pen. Each sheet is a system with multiple components. If you line the sheets up, the holes will not align, and until they do, services will not fail. This demonstrates the concept of resilience in systems.
  • #11 Best practice network design for a service provider The diagram is an example of engineering a solution for redundancy, resilience and failover.
  • #13 Examples of fail-over
  • #14 Naming conventions: systems are easier to understand when similar documents describe them in a consistent manner.
  • #20 Basic project management can, within reason, be applied to anything. The alternative is to “fly by the seat of your pants” and then “crash and burn”. Example: Eskom’s Medupi implementation. The methods for approaching large projects are well documented, but that is not what we are talking about. Often a person is tasked with completing a set of deliverables in a few days, so there needs to be a methodical way to approach these short pieces of work: a lightweight version of doing projects. David Allen, a productivity guru, frequently asserts that anything that takes more than two steps and two minutes to accomplish is a project. David Ruiz, director of IT at DIC Entertainment Corp, states that nine out of 10 times taking the extra time to create a plan will save you time and money.