Information Technology - Discover the Root Cause and Develop a solution through structured processes

The presentation was compiled by Thinking Dimensions Global in November 2012 for the ITSMF conference held in London. The content relates to the KEPNERandFOURIE process for dealing with incidents and problems in IT and in particular a means of determining the Root Cause and providing the best solution.
The presentation was co-presented by Dr Matthys Fourie and John Hudson of Thinking Dimensions Global.

  • In 2011 we were represented in 20 countries and worked in 12 different languages. TD has grown steadily over the last 10 years. As the list shows, TD and its network were already working with a formidable list of global clients. 2011 was also the year TD formally decided, as part of its strategy, to specialise exclusively in the IT market.
  • The procedure for problem solving is as follows. First, state the problem situation. Once you have the correct statement, and thus the correct “entry point” into the problem situation, you can gather the most relevant information about the problem. Once you have that information, analyze it and come to a mutually agreed answer.

    1. 1. Go Direct to the Root Cause – itRCA the solution? John Hudson & Matt Fourie, 5 November 2012
    2. 2. Agenda “Most incident investigators ask the wrong questions, so do not change your people but change the questions they are asking” Matt Fourie • Introduction • Current situation • Components of a credible approach • Minimalistic information, being specific and knowledge (wisdom) creation • The three critical investigation skills: 1. Service Recovery Analysis 2. Technical Cause Analysis 3. Root Cause Analysis • Client outcomes • Questions & answers
    3. 3. Thinking Dimensions • Thinking Dimensions International: operating KEPNERandFOURIE company initiatives for the last 25 years • Specialising in RCA Methodology for IT Incident and Problem Management • Some of our recent clients: Barclays IT, ANZ IT Division, Macquarie ITG, Unisys, Polypore IT, Medtronic IT, SITA Global, BT Financial, Westpac IT, McDonalds IT, Queensland Police IT, Lockheed Martin Space Systems, SPARQ IT
    4. 4. Global Presence • Clients: Baxter International, Blue Cross Blue Shield, Bosch, Caltex Oil, Carraro, Crown Cork and Seal, Dometic, Electrolux, Federal Judiciary Center, General Dynamics IT, Hollister, Inc, Infineon, BASF, Macquarie Bank IT, BT Financial IT, Stihl, Westpac IT, Maersk, Norfolk Naval Shipyard, Selig, Siemens, SITA, SKF • Americas: Canada, Chile, Peru, USA • EMEA: Germany, Italy, Netherlands, Poland, Saudi Arabia, South Africa, Spain, Turkey, United Kingdom • Asia Pacific: Australia, China, India, South Korea, Thailand, Singapore
    5. 5. The Current Dilemma PAST NOW FUTURE STANDARD itTCA® – TECHNICAL CAUSE ANALYSIS itSRA® – SERVICE RECOVERY ANALYSIS itRCA® – ROOT CAUSE ANALYSIS
    6. 6. The Three Skills… (Incident) 1. itSRA®: Service Recovery Analysis; Recovery & Containment; Tools & Templates 2. itTCA®: Technical Cause Analysis; Technical Cause; Process and Techniques 3. itRCA®: Root Cause Analysis; Root Cause & FIX; Checklist & Templates
    7. 7. Current Default Root Causes • Hardware • Software • “Human Error” • Environment Technical Cause Root Cause
    8. 8. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • Internet Banking Degrading
    9. 9. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • Internet Banking Degrading | New browser configuration issue
    10. 10. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • Internet Banking Degrading | New browser configuration issue | Integrative testing not done properly
    11. 11. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • Internet Banking Degrading | New browser configuration issue | Integrative testing not done properly
    12. 12. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • Internet Banking Degrading | New browser configuration issue | Integrative testing not done properly • Encrypted “hello” message not returned
    13. 13. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • Internet Banking Degrading | New browser configuration issue | Integrative testing not done properly • Encrypted “hello” message not returned | ‘Beta’ certificate used
    14. 14. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • Internet Banking Degrading | New browser configuration issue | Integrative testing not done properly • Encrypted “hello” message not returned | ‘Beta’ certificate used | Policy requirements for “production” environment not adhered to
    15. 15. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • G-Force System Freezing
    16. 16. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • G-Force System Freezing | High volume
    17. 17. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • G-Force System Freezing | High volume | Too many users allowed access
    18. 18. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • G-Force System Freezing | High volume | Too many users allowed access • G-Force SQL DB thread count exceeding maximum
    19. 19. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • G-Force System Freezing | High volume | Too many users allowed access • G-Force SQL DB thread count exceeding maximum | G-Force program not closing out threads
    20. 20. Incisive Thinking (Incident Statement | Technical Cause | Root Cause) • G-Force System Freezing | High volume | Too many users allowed access • G-Force SQL DB thread count exceeding maximum | G-Force program not closing out threads | Vendor implemented an untested program update
    21. 21. Basic phases of problem solving Procedure for addressing an Incident 1. State the purpose Divergent Thinking 2. Gather incident/problem detail 3. Evaluate for causes Convergent Thinking 4. Confirm technical/root cause 1. Testing 2. Verifying cause
    22. 22. Basic phases of problem solving Procedure for addressing an Incident 1. State the purpose Divergent Thinking 2. Gather incident/problem detail 3. Evaluate for causes Convergent Thinking 4. Confirm technical/root cause 1. Testing 2. Verifying cause
    23. 23. Good RCA… YOU NEED TO SOLVE AN INCIDENT: • QUICKLY [Service Recovery] • ACCURATELY [Technical Cause] • PERMANENTLY [Root Cause]
    24. 24. Factors in minimalistic approach • “I keep six honest serving-men (They taught me all I knew); Their names are What and Why and When And How and Where and Who. I send them over land and sea, I send them east and west; But after they have worked for me, I give them all a rest.” Rudyard Kipling • Factor (What, Where, When, How, Why, Who) | IS | BUT NOT
    25. 25. Extreme Focus With “Specificity” • Object: Servers • Fault: Not communicating • “The key to success is to be insistent about specificity – the more specific you are the better your chances to solve an incident.” KEPNERandFOURIE • Specificity Rules: One object one fault; Single-minded & simplistic; Highly focused; Must find the correct entry point; Ask a question – expect an answer
    26. 26. Extreme Focus With “Specificity” • Object: Servers • Fault: Not communicating; Data not transferred • Specificity Rules: One object one fault; Single-minded & simplistic; Highly focused; Must find the correct entry point; Ask a question – expect an answer
    27. 27. Extreme Focus With “Specificity” • Object: Servers • Fault: Not communicating; Data not transferred; Sent but not received by receiving servers • Specificity Rules: One object one fault; Single-minded & simplistic; Highly focused; Must find the correct entry point; Ask a question – expect an answer
    28. 28. Extreme Focus With “Specificity” • Object: Servers • Fault: Not communicating; Data not transferred. Sent but not received by receiving servers • Data for Large Outlets: Not received • Specificity Rules: One object one fault; Single-minded & simplistic; Highly focused; Must find the correct entry point; Ask a question – expect an answer
    29. 29. Extreme Focus With “Specificity” • Object: Servers • Fault: Not communicating; Data not transferred. Sent but not received by receiving servers • Data for Large Outlets: Not received • Sales turnover numbers for Large Outlets: Not received • Specificity Rules: One object one fault; Single-minded & simplistic; Highly focused; Must find the correct entry point; Ask a question – expect an answer
    30. 30. Creating Intelligence (DATA, INFORMATION, KNOWLEDGE)
        IS | BUT NOT | WHY NOT
        Internet Banking | Intranet Banking | Different routing, SSL handshake
        Slow | Freezing | Volume?
        APAC users | USA, UK | ADSL lines
        Started Oct 1 | Before | New passwords
        Continuous | After 4pm | Different routing
        Unexpected Outcomes • “BUT NOT” clarifies the facts • Creates a curious “contrast” • Looking at answers at a “granular level” • Stimulates deductive reasoning
    31. 31. The Current Dilemma PAST NOW FUTURE STANDARD itTCA® – TECHNICAL CAUSE ANALYSIS itSRA ® – SERVICE RECOVERY ANALYSIS itRCA ® – ROOT CAUSE ANALYSIS
    32. 32. Service Recovery [MTR]
        FACTOR | IS | BUT NOT | REQUIREMENT
        OBJECT | Mobile website access | PC website access | WHAT TO RESTORE
        FAULT | Denied – not authorized | Slow/freezing | WHAT PROBLEMS TO REMOVE
        WHO | Blackberry users | Other smart phones | WHO
        WHERE | Asia | ANZ, UK, USA | WHERE
        IMPACT | Customer complaints | | TO WHAT EXTENT
        PATTERN | Sporadic | Continuous | FOR HOW LONG
        ACTIONS TO CONSIDER
    33. 33. Service Recovery [MTR]
        Statement: Restore website access to customers
        Key Solution Requirements | Various actions to meet key requirements: 1 | 2 | 3 | 4 | 5
        1. Provide access to client to at least receive interim non-availability notice | 0 | 3 | 2 | 1 | 3
        2. No loss of data | 3 | 3 | 0 | 0 | 1
        3. Should not impact system performance | 1 | 0 | 3 | 1 | 0
        4. ADSL compatible for Asia | 1 | 2 | 0 | 0 | 0
        5. Improve reliability | 3 | 0 | 3 | 1 | 1
        6. Implementation within the hour | 1 | 3 | 3 | 1 | 2
        Possible Actions: 1. Upload or switch on simple site maintenance page 2. Set up or start up back up service 3. Reroute 20/80 service all to back up service 4. Restrict access to low load tasks only 5. Allow access based on region
    35. 35. The Current Dilemma PAST NOW FUTURE STANDARD itTCA® – TECHNICAL CAUSE ANALYSIS itSRA ® – SERVICE RECOVERY ANALYSIS itRCA ® – ROOT CAUSE ANALYSIS
    36. 36. Technical Cause Analysis [TCA - MTTR] (columns: IS | BUT NOT | WHY NOT) • OBJECT: What object and which other object(s) not? • FAULT: What fault and which other typical faults not? • USERS: Who has the problem and who does not? • WHERE: Where are these users and where could they have been but are not? • TIMING: When did it happen first time and when not? • PATTERN: What is the pattern of faults and what could it have been but is not? • CYCLE: In which cycle does the problem occur and in which cycle does it not occur?
    37. 37. Technical Cause Analysis [TCA]
        DIMENSION | IS | BUT NOT | WHY NOT
        Object | Fireburst V2.0 connection | E-Express, Mango connections | F/B upgrade from V1 to V2, Poor testing issue
        Fault | Dropping | Freezing, slow | Time out settings, configuration of drivers
        Location of Object | ANZ, USA, UK | Asia | LAN, Proxy server issues, F/Wall rules
        Timing | Monday, Sept 2nd with SOB | Any time earlier than Sept 2nd | Java upgrade, Netscape upgrade
        Pattern | Continuous | Sporadic, Periodic | Don't know
        Life Cycle | When doing a transaction | “x” time into transaction | Operator error, Code error on a specific page
        Phase of Work | Just after logging in | Logging in or out | OS configuration issue, DNS issue
        Possible Causes & Testing
    38. 38. Technical Cause Analysis [TCA]
        DIMENSION | IS | BUT NOT | WHY NOT
        Object | Fireburst V2.0 connection | E-Express, Mango connections | F/B upgrade from V1 to V2, Poor testing issue
        Fault | Dropping | Freezing, slow | Time out settings, configuration of drivers
        Location of Object | ANZ, USA, UK | Asia | LAN, Proxy server issues, F/Wall rules
        Timing | Monday, Sept 2nd with SOB | Any time earlier than Sept 2nd | Java upgrade, Netscape upgrade
        Pattern | Continuous | Sporadic, Periodic | Don't know
        Life Cycle | When doing a transaction | “x” time into transaction | Operator error, Code error on a specific page
        Phase of Work | Just after logging in | Logging in or out | OS configuration issue, DNS issue
        Possible Causes & Testing
        1. Proxy server tampered with during the Java upgrade on the LAN
        2. Java upgrade caused driver incompatibility with Fireburst website V2.0
        3. Netscape upgrade caused driver incompatibility with Fireburst website V2.0
    39. 39. Technical Cause Analysis [TCA]
        DIMENSION | IS | BUT NOT | WHY NOT
        Object | Fireburst V2.0 connection | E-Express, Mango connections | F/B upgrade from V1 to V2, Poor testing issue
        Fault | Dropping | Freezing, slow | Time out settings, configuration of drivers
        Location of Object | ANZ, USA, UK | Asia | LAN, Proxy server issues, F/Wall rules
        Timing | Monday, Sept 2nd with SOB | Any time earlier than Sept 2nd | Java upgrade, Netscape upgrade
        Pattern | Continuous | Sporadic, Periodic | Don't know
        Life Cycle | When doing a transaction | “x” time into transaction | Operator error, Code error on a specific page
        Phase of Work | Just after logging in | Logging in or out | OS configuration issue, DNS issue
        Possible Causes & Testing
        1. Proxy server tampered with during the Java upgrade on the LAN: X
        2. Java upgrade caused driver incompatibility with Fireburst website V2.0: √ √ X
        3. Netscape upgrade caused driver incompatibility with Fireburst website V2.0: √ √ A1 √ √ √ √
        A1: Only if the staff in Asia did not upgrade to Netscape
    40. 40. The Current Dilemma PAST NOW FUTURE STANDARD itTCA® – TECHNICAL CAUSE ANALYSIS itSRA ® – SERVICE RECOVERY ANALYSIS itRCA ® – ROOT CAUSE ANALYSIS
    41. 41. A Case of a good thinking process • Deviation Statement • Factor Analysis • Possible causal factors • Testing the causal hypotheses • Find the underlying reason(s) for incident 'The truth, if it exists, is in the details' “Bartlett – Familiar Quotations”
    42. 42. The Right Starting Point • Find the technical cause first • Do 5 Whys to get to the systemic level • Find the root cause(s) • Fix the incident/problem for good “If a team has not solved an incident, the person with the information was not invited” Chuck Kepner
    43. 43. Four Questions to get Started • Is the object deviation within the control of your own system? Can you fix the root cause with actions under your control? • Is the technical cause deviation in the vendor's system? Can you only fix the root cause with the vendor's help? ITRCA Max4 ITRCA Max4 Is the object deviation within the control of your own system? Can you only fix the root cause with the vendor's help? • RiskWise • Is the technical cause deviation in the vendor's system? We would only be able to take avoiding actions.
    44. 44. Root Cause Analysis [RCA] (columns: DIMENSION | IS | BUT NOT) • APPLICATION: What application and which other applications not? • DEVIATION: What deviation do we have and which ones not? • FUNCTION: Which job/function/process is involved and which ones not? • WHO (USERS): Who has the problem and who does not? • WHERE: Where are these users and where could they have been but are not? • TIMING: When did it happen first time and when not? • FREQUENCY: How frequent is the fault occurring?
    45. 45. Root Cause Analysis [RCA] • COMPONENT: CAUSAL FACTORS • Decision Making: Process and Collaboration for inputs • Implementation issues: Resources and Scope & Definition of project • Standard Operating Procedures: Applicability of SOP and Awareness of SOP • Management: Management of Work and Staff • Measurement: KPIs and Roles & Responsibilities • CAUSAL ELEMENTS: Critical stakeholder requirements not consulted for this task; Inadequate authority levels for making good decisions; Poor decision process and documentation for this task; Inadequate standards guiding the decision making; Time zone difficulties hampering effective decision making; Unrealistic time, cost and performance expectations; Poor initial estimation of resources needed for the project; Poor updated approval data making the procedure unclear; Poor work guidance/coaching for correct performance; Work standards for this task are not enforced; Poor management support in getting this task done; KPI and metrics regarding this output not clear or absent; Poor feedback on this KPI; Duplication and gaps making roles and responsibilities difficult
    46. 46. Root Cause Analysis 2 cont. [RCA] • COMPONENT: CAUSAL FACTORS • Support: Internal and External Vendor support • Communications: Clarity of communications and instructions • Work Environment: Task Interference and consequences • Skills: Complexity and applicability • Testing Practices: Procedures and requirements • Personal: Aptitude and Attitude • CAUSAL ELEMENTS: Overuse of the SME causing sub-standard work; Poor continual vendor support for this output; Continual interruptions in performing the task; Task performance request not properly understood; Work environment not conducive for the demands of the task; Unrealistic task and performance expectation for this task; Not having enough experience with similar tasks; No vendor training provided for new product and or service; Poor risk analysis and decision pressure during testing; Not all aspects tested and the test was incomplete; Inadequate problem solving ability for this type of task; Incumbent does not follow instructions or Standard Procedure
    47. 47. Root Cause Analysis [RCA] • COMPONENT: CAUSAL FACTORS • Decision Making: Process and Collaboration for inputs • Implementation issues: Resources and Scope & Definition of project • Standard Operating Procedures: Applicability of SOP and Awareness of SOP • Management: Management of Work and Staff • Measurement: KPIs and Roles & Responsibilities • CAUSAL ELEMENTS: Critical stakeholder requirements not consulted for this task; Inadequate authority levels for making good decisions; Poor decision process and documentation for this task; Inadequate standards guiding the decision making; Time zone difficulties hampering effective decision making; Unrealistic time, cost and performance expectations; Poor initial estimation of resources needed for the project; Poor updated approval data making the procedure unclear; Poor work guidance/coaching for correct performance; Work standards for this task are not enforced; Poor management support in getting this task done; KPI and metrics regarding this output not clear or absent; Poor feedback on this KPI; Duplication and gaps making roles and responsibilities difficult
    48. 48. Root Cause Analysis [RCA] • COMPONENT: CAUSAL FACTORS • Support: Internal and External Vendor support • Communications: Clarity of communications and instructions • Work Environment: Task Interference and consequences • Skills: Complexity and applicability • Testing Practices: Procedures and requirements • Personal: Aptitude and Attitude • CAUSAL ELEMENTS: Overuse of the SME causing sub-standard work; Poor continual vendor support for this output; Continual interruptions in performing the task; Task performance request not properly understood; Work environment not conducive for the demands of the task; Unrealistic task and performance expectation for this task; Not having enough experience with similar tasks; No vendor training provided for new product and or service; Poor risk analysis and decision pressure during testing; Not all aspects tested and the test was incomplete; Inadequate problem solving ability for this type of task; Incumbent does not follow instructions or Standard Procedure
    49. 49. Testing the Hypothesis The decision making process is too cumbersome to allow for own initiatives and the staff member must make a choice with given alternatives which is not most optimal for the situation Final Conclusion and Action Plan: 1. The job incumbent did not get the necessary support to do his job under a pressure situation adding to task interference ✗ 2. External vendor support for certain technical decisions was not available and that resulted in a less optimized decision choice. 3.
    50. 50. Additional Resources “SOLVE IT” – Find a way to solve incidents quickly, accurately and permanently.
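
The scoring matrix on slide 33 (Service Recovery) can be sketched in a few lines of Python. The requirements, actions and scores below are copied from the slide. Summing each action's column and ranking by the total is an assumption made here for illustration, because the slide does not state how the scores are combined. The function names `total_scores` and `rank_actions` are hypothetical.

```python
# Requirements (rows) and possible actions (columns) as listed on slide 33.
REQUIREMENTS = [
    "Provide access to at least receive interim non-availability notice",
    "No loss of data",
    "Should not impact system performance",
    "ADSL compatible for Asia",
    "Improve reliability",
    "Implementation within the hour",
]
ACTIONS = [
    "Upload or switch on simple site maintenance page",
    "Set up or start up back up service",
    "Reroute 20/80 service all to back up service",
    "Restrict access to low load tasks only",
    "Allow access based on region",
]
# SCORES[r][a] is the slide's score for action a against requirement r.
SCORES = [
    [0, 3, 2, 1, 3],
    [3, 3, 0, 0, 1],
    [1, 0, 3, 1, 0],
    [1, 2, 0, 0, 0],
    [3, 0, 3, 1, 1],
    [1, 3, 3, 1, 2],
]


def total_scores(scores):
    """Sum each action's column. Assumed aggregation, not stated on the slide."""
    return [sum(column) for column in zip(*scores)]


def rank_actions(actions, scores):
    """Return (action, total) pairs, highest total first; ties keep slide order."""
    totals = total_scores(scores)
    return sorted(zip(actions, totals), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for action, total in rank_actions(ACTIONS, SCORES):
        print(f"{total:2d}  {action}")
```

Under this unweighted sum, actions 2 and 3 tie on the highest total (11 each), so the matrix alone would not pick a single winner; weighting the requirements or applying further judgment would be needed to separate them.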

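The cause testing on slide 39 (Technical Cause Analysis) can be sketched the same way: each possible cause is checked against the IS / BUT NOT facts, one dimension at a time, and dropped as soon as it fails to explain one of them. The marks below are copied from the slide, with "Y" standing in for √. Reading them left to right in the table's row order is an assumption, since the flattened slide does not label which mark belongs to which dimension. The `evaluate` function is a hypothetical helper.

```python
# Dimensions in the order they appear in the slide 39 table.
DIMENSIONS = [
    "Object", "Fault", "Location of Object", "Timing",
    "Pattern", "Life Cycle", "Phase of Work",
]

# "Y"  = the cause explains both the IS and the BUT NOT fact,
# "X"  = it does not, "A1" = it does only under assumption A1.
# Mapping marks to dimensions left to right is an assumption.
MARKS = {
    "Proxy server tampered with during the Java upgrade on the LAN":
        ["X"],
    "Java upgrade caused driver incompatibility with Fireburst website V2.0":
        ["Y", "Y", "X"],
    "Netscape upgrade caused driver incompatibility with Fireburst website V2.0":
        ["Y", "Y", "A1", "Y", "Y", "Y", "Y"],
}
ASSUMPTIONS = {"A1": "Only if the staff in Asia did not upgrade to Netscape"}


def evaluate(marks):
    """Return (status, failed_dimension, assumptions_needed) for one cause."""
    needed = []
    for dimension, mark in zip(DIMENSIONS, marks):
        if mark == "X":
            # First fact the cause cannot explain: stop testing it.
            return "eliminated", dimension, needed
        if mark in ASSUMPTIONS:
            needed.append(mark)
    return "survives", None, needed


if __name__ == "__main__":
    for cause, marks in MARKS.items():
        status, failed_at, needed = evaluate(marks)
        print(cause)
        if status == "eliminated":
            print(f"  eliminated at: {failed_at}")
        else:
            for key in needed:
                print(f"  survives, to verify {key}: {ASSUMPTIONS[key]}")
```

Only the Netscape upgrade survives every test in this reading, and it does so conditionally, so assumption A1 is the fact to verify before treating it as the technical cause.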