Identifying the Root Cause of Failures in IT Changes: Novel Strategies and Trade-offs (IM 2013)



  1. 1. Identifying the Root Cause of Failures in IT Changes: Novel Strategies and Trade-offs
     Ricardo L. dos Santos, Juliano A. Wickboldt, Bruno L. Dalmazo, Lisandro Z. Granville and Luciano P. Gaspary (Federal University of Rio Grande do Sul, Brazil); Roben C. Lunardi (Federal Institute of Rio Grande do Sul, Brazil)
  2. 2. Outline
     • Introduction
     • Proposed Solution
       • Diagnosis Process
       • Conceptual Architecture
       • Root Cause Analyzer
       • Strategies for Selecting Questions
     • Case Study
     • Final Considerations
     • Future Work
  3. 3. Introduction
     • Context
       • The complexity of IT infrastructures makes IT processes mission-critical
       • ITIL (Information Technology Infrastructure Library) became the most widely accepted approach to IT process management all over the world
     • IT Change Management
       • Defines how the IT infrastructure must evolve in a consistent and safe way
       • Defines how changes should be conducted
  4. 4. Introduction
     • IT Problem Management
       • Defines the lifecycle of IT problems
       • The primary goals are
         • To eliminate recurrent incidents
         • To prevent the occurrence of IT problems
         • To minimize the impact of problems which cannot be prevented
     • To achieve these goals, identifying the root cause of failures and reusing the operator’s knowledge is fundamental
       • To simplify the procedures
       • To minimize financial losses
       • To reduce maintenance costs
  5. 5. Introduction
     • Current Scenario
       • Changes and failures have been explored by several research efforts
       • However, this research has some limitations, such as
         • Previous data are often not considered
         • Root causes of failures are not identified
         • Solutions are specific to detecting software failures
  6. 6. Introduction
     • Our Goals
       • Propose strategies that help in the identification process while keeping the interactive approach
       • The developed strategies must select a question and explore different criteria
       • Compare the diagnostics generated by each strategy
  7. 7. Proposed Solution – Diagnosis Process (Our Approach)
     (figure: interactive diagnosis loop) The Help Desk sends a Problem Report (PR) to the Root Cause Analyzer, which selects a question for the Operator; each answered question feeds back into question selection until the Root Cause (RC) is identified
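The interactive loop on this slide can be sketched in a few lines. This is a minimal, self-contained illustration with assumed names (`KNOWLEDGE`, `select_question`, `identify_root_cause`, `diagnose`) and a toy knowledge base, not the authors' implementation:

```python
# Toy knowledge base: each root cause maps to the (question, answer)
# pairs that confirm it. Purely illustrative data.
KNOWLEDGE = {
    "RC1": {("Q1", "A1"), ("Q2", "A3")},
    "RC2": {("Q1", "A2"), ("Q3", "A5")},
}

def select_question(answered):
    """Pick the next unanswered question (here: in a fixed order)."""
    asked = {q for q, _ in answered}
    for rc_pairs in KNOWLEDGE.values():
        for q, _ in sorted(rc_pairs):
            if q not in asked:
                return q
    return None

def identify_root_cause(answered):
    """Return an RC whose question/answer set is fully matched, else None."""
    given = set(answered)
    for rc, pairs in KNOWLEDGE.items():
        if pairs <= given:
            return rc
    return None

def diagnose(ask_operator):
    """Loop: select a question, record the answer, stop on an RC."""
    answered = []
    while True:
        rc = identify_root_cause(answered)
        if rc is not None:
            return rc
        q = select_question(answered)
        if q is None:
            return None  # frustrated diagnosis: no RC identified
        answered.append((q, ask_operator(q)))

# An operator whose answers point at RC2.
operator = {"Q1": "A2", "Q2": "A4", "Q3": "A5"}.get
```

Running `diagnose(operator)` asks Q1, Q2, and Q3 in turn and returns `"RC2"` once both of its confirming answers have been collected.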
  8. 8. Proposed Solution – Conceptual Architecture
     (figure): a Change Management System (Change Designer, Change Planner) backed by the Config. Mgmt. Database; the Operator submits an RFC, which flows to the Deployment System
  9. 9. Proposed Solution – Conceptual Architecture
     (figure, continued): a Diagnosis System is added, with a Diagnosis Log Recorder that records the identified RC
  10. 10. Proposed Solution – Conceptual Architecture
     (figure, continued): the Root Cause Analyzer exchanges the PR, CIs, logs, and candidate RCs between the Diagnosis System, the Config. Mgmt. Database, and the Operator
  11. 11. Proposed Solution – Strategies for Selecting Questions
     • The developed strategies use the same inputs and return a single question as result
     • 4 different proposed strategies
       • Strategy 1 – Only completed diagnostics
       • Strategy 2 – All diagnostics
       • Strategy 3 – Age of diagnostics
       • Strategy 4 – Questions’ popularity
  12. 12. Proposed Solution – Strategies for Selecting Questions
     • Strategy 1 – Only completed diagnostics
       • Only completed diagnostics are considered
       • The calculated weights suffer no penalty
       • The element weight is computed as the sum of completed diagnostics in which the RC was correctly identified

       Root Causes | Questions | Answers | Completed Diagnostics
       RC1         | Q1, Q2    | A1, A3  | 20
       RC2         | Q1, Q3    | A2, A5  | 30
  13. 13. (same slide, with weights annotated): RC1 = 20, RC2 = 30; parent category = 20 + 30 = 50
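Strategy 1 reduces to a plain sum. A minimal sketch, assuming a simple data layout (the dictionary and function names are illustrative, and the counts are the slide's example values):

```python
# Completed-diagnostic counts per root cause, from the slide's example.
COMPLETED = {"RC1": 20, "RC2": 30}

def element_weight_s1(associated_rcs):
    """Strategy 1: weight = sum of completed diagnostics, no penalty."""
    return sum(COMPLETED[rc] for rc in associated_rcs)

rc1_weight = element_weight_s1(["RC1"])              # 20
category_weight = element_weight_s1(["RC1", "RC2"])  # 20 + 30 = 50
```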
  14. 14. Proposed Solution – Strategies for Selecting Questions
     • Strategy 2 – All diagnostics
       • Completed and frustrated diagnostics are considered
       • The element weight is calculated as the sum of the completed diagnostics minus the sum of the frustrated diagnostics
       • A diagnostic is frustrated when the system uses at least one question associated with an RC, but at the end of the process another RC is identified
  15. 15. Proposed Solution – Strategies for Selecting Questions
     • Strategy 2 – All diagnostics

       Root Causes | Questions | Answers | Completed Diagnostics | Frustrated Diagnostics
       RC1         | Q1, Q2    | A1, A3  | 20                    | 10
       RC2         | Q1, Q3    | A2, A5  | 30                    | 15
  16. 16. (same slide, with weights annotated): RC1 = 20 – 10 = 10, RC2 = 30 – 15 = 15; parent category = (20 + 30) – (10 + 15) = 25
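Strategy 2 only changes the accumulation: frustrated diagnostics subtract from the weight. A sketch with an assumed layout, using the slide's example counts:

```python
# (completed, frustrated) counts per root cause, from the slide's example.
DIAGNOSTICS = {"RC1": (20, 10), "RC2": (30, 15)}

def element_weight_s2(associated_rcs):
    """Strategy 2: weight = completed minus frustrated diagnostics."""
    return sum(c - f for c, f in (DIAGNOSTICS[rc] for rc in associated_rcs))

rc1_weight = element_weight_s2(["RC1"])              # 20 - 10 = 10
category_weight = element_weight_s2(["RC1", "RC2"])  # 25
```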
  17. 17. Proposed Solution – Strategies for Selecting Questions
     • Strategy 3 – Age of diagnostics
       • Considers completed and frustrated diagnostics
       • The element weights are penalized by the age of the diagnostics

       Age group | Diagnostic age        | Penalty
       1st       | Up to 120 days        | Not applicable
       2nd       | From 121 to 150 days  | 10%
       3rd       | From 151 to 180 days  | 20%
       4th       | From 181 to 210 days  | 30%
       5th       | From 211 to 240 days  | 40%
       6th       | From 241 to 270 days  | 50%
       7th       | From 271 to 300 days  | 60%
       8th       | From 301 to 330 days  | 70%
       9th       | From 331 to 360 days  | 80%
       10th      | Over 360 days         | 90%
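The age table above is a step function, so it can be expressed as a small lookup: ages up to 120 days keep the full weight, each further 30-day bracket adds a 10% penalty, and anything older than 360 days is capped at a 90% penalty. The function name is assumed, not from the paper:

```python
def age_weight_fraction(age_days):
    """Fraction of the weight kept for a diagnostic of this age (beta_i)."""
    if age_days <= 120:
        return 1.0  # 1st age group: no penalty
    # Brackets 2nd..10th, each 30 days wide; the 10th is open-ended.
    bracket = min((age_days - 121) // 30 + 1, 9)
    return round(1.0 - 0.10 * bracket, 2)

# e.g. a 200-day-old diagnostic falls in the 4th group (30% penalty),
# so 70% of its weight is kept.
```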
  18. 18. Proposed Solution – Strategies for Selecting Questions
     • Strategy 3 – Age of diagnostics

       elementWeight = Σ (i = 1 to 10) βi × (αi − ωi)

       i – age group of the diagnostics
       βi – percentage of the weight to be used
       αi – the amount of completed diagnostics in an age group
       ωi – the amount of frustrated diagnostics in an age group
  19. 19. Proposed Solution – Strategies for Selecting Questions
     • Strategy 3 – Age of diagnostics

       Root Causes | Questions | Answers | Completed (1st age / 10th age) | Frustrated (1st age / 10th age)
       RC1         | Q1, Q2    | A1, A3  | 1 / 24                         | 4 / 8
       RC2         | Q1, Q3    | A2, A5  | 4 / 15                         | 1 / 2
  20. 20. (same slide, with weights annotated): RC1 = 100% (1 − 4) + 10% (24 − 8) = 1.6; RC2 = 100% (4 − 1) + 10% (15 − 2) = 4.3; parent category = 4.3 + 1.6 = 5.9
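A literal sketch of the reconstructed Strategy 3 formula, with only the 1st (β = 100%) and 10th (β = 10%) age groups carrying data as in the slide's example. One caveat: applied literally, RC1 gives 1.0 × (1 − 4) + 0.1 × (24 − 8) = −1.4, while the slide annotates 1.6, which suggests the authors' implementation may floor negative age-group terms at zero; the sketch below applies the formula as written, and RC2 reproduces the slide's 4.3:

```python
# Fraction of weight kept per age group (beta_i), from the penalty table.
BETA = {1: 1.0, 10: 0.10}

def element_weight_s3(completed_by_age, frustrated_by_age):
    """Strategy 3: sum over age groups of beta_i * (alpha_i - omega_i)."""
    return sum(BETA[i] * (completed_by_age[i] - frustrated_by_age.get(i, 0))
               for i in completed_by_age)

# RC2 from the slide: 100% * (4 - 1) + 10% * (15 - 2) = 4.3
rc2_weight = element_weight_s3({1: 4, 10: 15}, {1: 1, 10: 2})
```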
  21. 21. Proposed Solution – Strategies for Selecting Questions
     • Strategy 4 – Questions’ popularity
       • The weights of RCs and categories are calculated according to Strategy 2
       • The question’s weight considers the weight of the associated RCs and the question’s popularity
       • A question’s popularity is the ratio between the amount of occurrences of the question and the amount of diagnostic sets selected
  22. 22. Proposed Solution – Strategies for Selecting Questions
     • Strategy 4 – Questions’ popularity

       questionWeight(x) = (αx / n + Σi βRCi × αRCi,x) / 2

       αx – amount of occurrences of the question x in the diagnostic sets
       n – amount of diagnostic sets
       βRCi – probability of identifying an RC
       αRCi,x – amount of occurrences of question x in the diagnostic set of an RC
  23. 23. Proposed Solution – Strategies for Selecting Questions
     • Strategy 4 – Questions’ popularity

       Root Causes | Questions | Answers | Completed (1st age / 10th age) | Frustrated (1st age / 10th age)
       RC1         | Q1, Q2    | A1, A3  | 1 / 24                         | 4 / 8
       RC2         | Q1, Q3    | A2, A5  | 4 / 15                         | 1 / 2
  24. 24. (same slide, with question weights annotated):
       Q1: (2/2 + ((13/29 × 1) + (16/29 × 1))) / 2 = 1
       Q2: (1/2 + ((13/29 × 1) + (16/29 × 0))) / 2 = 0.4741
       Q3: (1/2 + ((13/29 × 0) + (16/29 × 1))) / 2 = 0.5259
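The reconstructed Strategy 4 formula averages a question's popularity with the weighted contribution of its associated RCs. In this sketch the function name and data layout are assumed; the β values 13/29 and 16/29 are taken directly from the slide's worked example (Strategy 2-style RC weights of 13 and 16 out of a total of 29):

```python
def question_weight(occurrences, n_sets, rc_terms):
    """Strategy 4: (alpha_x / n + sum of beta_RCi * alpha_RCi_x) / 2.

    rc_terms: list of (beta_RC, occurrences of the question in that RC's set).
    """
    popularity = occurrences / n_sets
    rc_part = sum(beta * occ for beta, occ in rc_terms)
    return (popularity + rc_part) / 2

# Reproducing the slide's example (Q1 appears in both diagnostic sets):
w_q1 = question_weight(2, 2, [(13/29, 1), (16/29, 1)])
w_q2 = question_weight(1, 2, [(13/29, 1), (16/29, 0)])
w_q3 = question_weight(1, 2, [(13/29, 0), (16/29, 1)])
```

Q1 scores highest (1.0 versus roughly 0.474 and 0.526), matching the slide's annotations.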
  25. 25. Case Study
     • In this case study, some constraints were defined
       • There are no changes during the executions
       • The operator will always provide the same answer
     • One company provides some services on the Web
       • The infrastructure consists of a DB Server and a Web Server
     • In order to meet growing demand, 2 new servers will be installed
       • Hosting Server – will be used to host the clients’ websites
       • Mail Server – will be used to host the email services
  26. 26. Case Study
     • The CP below aims to install 2 new servers and to migrate existing services (figure: change plan)
  27. 27. (same slide, annotated): a failure occurs
  28. 28. Case Study
     • IT infrastructure state in the company (figure, built up over slides 28–30)
  31. 31. Case Study
     • Calculated weights

       Category        | Level | Strat. 1 | Strat. 2 | Strat. 3 | Strat. 4
       Service         | 1     | 1083     | 242      | 157.30   | 242
       Web Page Server | 2     | 558      | 82       | 33.20    | 82
       DataBase        | 2     | 519      | 195      | 127.60   | 195
       Network         | 1     | 1058     | 345      | 188.10   | 345
       Services        | 2     | 512      | 189      | 113.40   | 189
       Devices         | 2     | 485      | 136      | 66.20    | 136
       System          | 1     | 603      | 167      | 54.30    | 167
       Computer System | 2     | 545      | 153      | 52.90    | 153
       Hosting Server  | 3     | 319      | 175      | 49.90    | 175
       DB Server       | 3     | 192      | -22      | 3.00     | -22
       Software        | 1     | 1115     | 343      | 126.60   | 343
       Web Server      | 2     | 607      | 138      | 86.80    | 138
       DB Server       | 2     | 443      | 169      | 36.20    | 169
  32. 32. Case Study
     • Diagnostic workflows generated (figure)
  33. 33. (same slide, annotated): the PHP configuration does not allow the use of the language in users’ websites
  34. 34. Case Study
     • Diagnostic workflows generated (figure, continued)
  35. 35. (same slide, annotated): the PHP configuration does not allow the use of the language in users’ websites
  36. 36. Final Considerations
     • The proposed solution makes it possible to identify a failure’s root cause with the following features
       • Reuse of the operator’s knowledge
       • Interactivity between solution and operator
       • Flexibility of the generated diagnostic
       • System compatibility with the standards used by companies
     • The modular structure of the solution allows organizations to adapt the system to their specific needs
  37. 37. Final Considerations
     • The proposed strategies generate different diagnostic workflows for the same infrastructure and failure
     • Analyzing the obtained results, we have the following recommendations for IT operators
       • Strategy 1 – histories with a small amount of records
       • Strategy 2 – bulky and recent histories
       • Strategy 3 – histories that span at least 10 months
       • Strategy 4 – data sets with a great amount of popular questions
  38. 38. Future Work
     • Explore new criteria for the selection of questions
       • Confidence
       • False positive and false negative rates
     • Extend the process to identify root causes in other scopes
     • Investigate the use of CIM classes (actions and checks) in order to improve the system bootstrapping
     • Automate root cause identification for certain kinds of failures
  39. 39. Thank you for your attention! Questions?
  43. 43. Proposed Solution – Root Cause Analyzer
     (figure): the Input Processor identifies candidate RCs based on the PR, on categories, and on recorded RCs; the Question Selector calculates the weights according to the strategy, selects the category that has the greatest weight, then selects the question that has the greatest weight/level; the Question Verifier checks whether a question is obvious (threshold: 80% of records with the same answer)
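The Question Verifier's "obvious question" check on this slide can be sketched directly: if at least 80% of the recorded answers to a question are identical, the dominant answer can be reused instead of asking the operator again. The function name and data shape are assumptions for illustration:

```python
from collections import Counter

OBVIOUS_THRESHOLD = 0.80  # 80% of records with the same answer

def obvious_answer(recorded_answers):
    """Return the dominant answer if it meets the threshold, else None."""
    if not recorded_answers:
        return None
    answer, count = Counter(recorded_answers).most_common(1)[0]
    if count / len(recorded_answers) >= OBVIOUS_THRESHOLD:
        return answer
    return None
```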
  44. 44. Case Study
     • Identified CIs and associated categories

       CI                      | Categories
       Hosted Sites            | Service → Web Page Server
       DataBase Access         | Service → DataBase
       Web Page Access         | Service → Web Page Server
       PHP Interpreter         | Service → Web Page Server
       CMS Service             | Service → Web Page Server
       Logical Connection      | Network → Services
       Joomla                  | Software → Web Server
       PHP                     | Software → Web Server
       Apache                  | Software → Web Server
       MySQL                   | Software → Web Server
       DB Server               | System → Computer System → DB Server
       Hosting Server          | System → Computer System → Hosting Server
       Switch                  | Network → Devices
  45. 45. Proposed Solution – Information Model
     (UML class diagram): classes Problem, Question, Answer, RootCause, Category (with parent/child associations), Service, Incident, ManagedElement, ExchangeElement, SolutionElement, and SolutionCategory; a Problem determines possible Answers, and Answers determine other Questions
  46. 46. Proposed Solution – Information Model
     (UML class diagram, continued): classes LogicalElement, EnabledLogicalElement, MessageLog, LogRecord, and Log, linking recorded Questions, recorded Answers, and the recorded Problem to the RootCause
