Identifying the Root Cause of Failures in IT Changes: Novel Strategies and Trade-offs (IM 2013)
1. Identifying the Root Cause of Failures in IT Changes:
Novel Strategies and Trade-offs
Ricardo L. dos Santos, Juliano A. Wickboldt, Bruno L. Dalmazo, Lisandro Z. Granville and
Luciano P. Gaspary
Federal University of Rio Grande do Sul, Brazil
Roben C. Lunardi
Federal Institute of Rio Grande do Sul, Brazil
2. Outline
• Introduction
• Proposed Solution
• Diagnosis Process
• Conceptual Architecture
• Root Cause Analyzer
• Strategies for Selecting Questions
• Case Study
• Final Considerations
• Future Work
3. Introduction
• Context
• The complexity of IT infrastructures makes IT processes
mission-critical
• ITIL (Information Technology Infrastructure Library) has become
the most widely accepted approach to IT process
management worldwide
• IT Change Management
• Defines how the IT infrastructure must evolve in a
consistent and safe way
• Defines how changes should be conducted
4. Introduction
• IT Problem Management
• Defines the lifecycle of IT problems
• The primary goals are
• To eliminate recurrent incidents
• To prevent the occurrence of IT problems
• To minimize the impact of problems which cannot be prevented
• To achieve these goals, identifying the root cause of failures
and reusing the operator’s knowledge are fundamental
• To simplify the procedures
• To minimize financial losses
• To reduce maintenance costs
5. Introduction
• Current Scenario
• Changes and failures have been explored in several
research efforts
• However, these efforts have some limitations, such as
• Previous data are often not considered
• They do not identify the root cause of failures
• They are specific to detecting software failures
6. Introduction
• Our Goals
• Propose strategies that support the identification process
while keeping the approach interactive
• The developed strategies must select a question and explore
different criteria
• Compare the diagnostics generated by each strategy
9. Proposed Solution
Conceptual Architecture
[Figure: conceptual architecture. Components shown: Operator; Change Management System (Change Designer, Change Planner); Deployment System; Config. Mgmt. Database; Diagnosis System (Diagnosis Log Recorder, Root Cause Analyzer). Data items shown: RFC, RC.]
10. Proposed Solution
Conceptual Architecture
[Figure: conceptual architecture with the Root Cause Analyzer expanded. Components shown: Operator; Change Management System (Change Designer, Change Planner); Deployment System; Config. Mgmt. Database; Diagnosis System (Diagnosis Log Recorder, Root Cause Analyzer). Root Cause Analyzer modules: Input Processor, Question Selector, Question Verifier. Data items shown: RFC, CI, RC, PR, Log.]
11. Proposed Solution
Strategies for Selecting Questions
• The developed strategies use the same inputs and return
a single question as the result
• 4 different strategies are proposed
• Strategy 1 – Only completed diagnostics
• Strategy 2 – All diagnostics
• Strategy 3 – Age of diagnostics
• Strategy 4 – Questions’ popularity
12. Proposed Solution
Strategies for Selecting Questions
• Strategy 1 – Only completed diagnostics
• Only completed diagnostics are considered
• The calculated weights suffer no penalty
• The element weight is the sum of the completed
diagnostics in which the RC was correctly identified
Root Causes | Questions | Answers | Completed Diagnostics
RC1 | Q1, Q2 | A1, A3 | 20
RC2 | Q1, Q3 | A2, A5 | 30
13. Proposed Solution
Strategies for Selecting Questions
• Strategy 1 – Only completed diagnostics
• Example weights for the table of RC1 and RC2: Q1 = 20 + 30 = 50 (Q1 is associated with both RCs), Q3 = 30, Q2 = 20
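The Strategy 1 example can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes that a question's weight is the sum of the completed diagnostics of every RC the question is associated with (which is how the "20 + 30 = 50" annotation reads), and the function and variable names are hypothetical.

```python
from collections import defaultdict

# Data from the Strategy 1 table: completed diagnostics per root cause
# and the questions associated with each root cause.
completed = {"RC1": 20, "RC2": 30}
questions_of = {"RC1": ["Q1", "Q2"], "RC2": ["Q1", "Q3"]}


def strategy1_question_weights(completed, questions_of):
    """Sum, for each question, the completed diagnostics of its root causes.

    Assumption: a question inherits the weight of every RC it belongs to.
    """
    weights = defaultdict(int)
    for rc, questions in questions_of.items():
        for question in questions:
            weights[question] += completed[rc]
    return dict(weights)


weights = strategy1_question_weights(completed, questions_of)
selected = max(weights, key=weights.get)  # question with the greatest weight
print(weights)   # {'Q1': 50, 'Q2': 20, 'Q3': 30}
print(selected)  # Q1
```

Under this reading, Q1 would be selected first because it is shared by both root causes.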
14. Proposed Solution
Strategies for Selecting Questions
• Strategy 2 – All diagnostics
• Completed and frustrated diagnostics are considered
• The element weight is the sum of the completed
diagnostics minus the sum of the frustrated
diagnostics
• A diagnostic is frustrated when the system uses at least one
question associated with an RC, but at the end of the process
another RC is identified
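The definition of a frustrated diagnostic can be made concrete with a short sketch. The history below is hypothetical (the slides give no per-diagnostic data), and the record layout (`asked`, `identified`) is an assumption made only for illustration.

```python
from collections import Counter

# Questions associated with each root cause (same layout as the slides' tables).
questions_of = {"RC1": {"Q1", "Q2"}, "RC2": {"Q1", "Q3"}}

# Hypothetical history: the questions the system asked in each past
# diagnostic and the root cause that was finally identified.
history = [
    {"asked": {"Q1", "Q2"}, "identified": "RC1"},
    {"asked": {"Q1", "Q3"}, "identified": "RC2"},
    {"asked": {"Q1", "Q3"}, "identified": "RC2"},
    {"asked": {"Q3"}, "identified": "RC2"},
]


def strategy2_weights(history, questions_of):
    """Weight of each RC = completed diagnostics - frustrated diagnostics."""
    completed, frustrated = Counter(), Counter()
    for diagnostic in history:
        for rc, questions in questions_of.items():
            if rc == diagnostic["identified"]:
                completed[rc] += 1
            elif diagnostic["asked"] & questions:
                # A question of this RC was used, but another RC was identified.
                frustrated[rc] += 1
    return {rc: completed[rc] - frustrated[rc] for rc in questions_of}


rc_weights = strategy2_weights(history, questions_of)
print(rc_weights)  # {'RC1': -1, 'RC2': 2}
```

In this made-up history RC1 is completed once and frustrated twice (Q1 was asked in two diagnostics that ended in RC2), so its weight becomes negative.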
17. Proposed Solution
Strategies for Selecting Questions
• Strategy 3 – Age of diagnostics
• Considers completed and frustrated diagnostics
• Element weights are penalized according to the age of the diagnostics
Age group | Age of diagnostics | Penalty
1st | Up to 120 days | Not applicable
2nd | From 121 days to 150 days | 10%
3rd | From 151 days to 180 days | 20%
4th | From 181 days to 210 days | 30%
5th | From 211 days to 240 days | 40%
6th | From 241 days to 270 days | 50%
7th | From 271 days to 300 days | 60%
8th | From 301 days to 330 days | 70%
9th | From 331 days to 360 days | 80%
10th | More than 360 days | 90%
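The penalty table maps directly onto a small lookup function. The sketch below assumes that a diagnostic exactly 360 days old falls in the 9th group (the slide's last two rows overlap at 360 days); the function names are hypothetical.

```python
def age_group(age_days: int) -> int:
    """Return the age group (1..10) for a diagnostic that is `age_days` old."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 120:
        return 1
    if age_days > 360:
        return 10
    # Groups 2..9 are consecutive 30-day windows starting at day 121.
    return (age_days - 121) // 30 + 2


def penalty_percent(age_days: int) -> int:
    """Penalty from the table: 0% for group 1, then 10% more per group."""
    return (age_group(age_days) - 1) * 10


print(age_group(150), penalty_percent(150))  # 2 10
print(age_group(200), penalty_percent(200))  # 4 30
print(age_group(400), penalty_percent(400))  # 10 90
```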
18. Proposed Solution
Strategies for Selecting Questions
• Strategy 3 – Age of diagnostics
$$\mathit{elementWeight}(x) = \sum_{i=1}^{10} \beta_i \, (\alpha_i - \omega_i)$$
$i$ – age group of the diagnostics
$\beta_i$ – percentage of the weight to be used
$\alpha_i$ – the number of completed diagnostics in age group $i$
$\omega_i$ – the number of frustrated diagnostics in age group $i$
19. Proposed Solution
Strategies for Selecting Questions
• Strategy 3 – Age of diagnostics
$$\mathit{elementWeight}(x) = \sum_{i=1}^{10} \beta_i \, (\alpha_i - \omega_i)$$
Root Causes | Questions | Answers | Completed (1st age) | Completed (10th age) | Frustrated (1st age) | Frustrated (10th age)
RC1 | Q1, Q2 | A1, A3 | 1 | 24 | 4 | 8
RC2 | Q1, Q3 | A2, A5 | 4 | 15 | 1 | 2
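The table of completed and frustrated diagnostics can be worked through numerically. This sketch rests on two assumptions that the slides do not state outright: the weight is the sum over the ten age groups of β_i × (α_i − ω_i), and β_i is the share of the weight left after the penalty (100% for the 1st age, 10% for the 10th). Age groups not listed in the table are treated as empty. The Strategy 2 value (completed minus frustrated, without penalty) is computed alongside for comparison.

```python
# Counts from the table, keyed by age group (1 = newest, 10 = oldest).
completed = {"RC1": {1: 1, 10: 24}, "RC2": {1: 4, 10: 15}}
frustrated = {"RC1": {1: 4, 10: 8}, "RC2": {1: 1, 10: 2}}


def usable_percent(group: int) -> int:
    """Assumed beta_i: 100% minus the 10%-per-group penalty."""
    return 100 - (group - 1) * 10


def strategy3_weight(completed_by_age, frustrated_by_age):
    """Sum over age groups of beta_i * (alpha_i - omega_i)."""
    total = 0
    for group in range(1, 11):
        alpha = completed_by_age.get(group, 0)
        omega = frustrated_by_age.get(group, 0)
        total += usable_percent(group) * (alpha - omega)
    return total / 100  # percentages back to fractions


def strategy2_weight(completed_by_age, frustrated_by_age):
    """Completed minus frustrated diagnostics, regardless of age."""
    return sum(completed_by_age.values()) - sum(frustrated_by_age.values())


for rc in ("RC1", "RC2"):
    s2 = strategy2_weight(completed[rc], frustrated[rc])
    s3 = strategy3_weight(completed[rc], frustrated[rc])
    print(rc, s2, s3)
# RC1 13 -1.4
# RC2 16 4.3
```

Under these assumptions the age penalty widens the gap between the two root causes: most of RC1's completed diagnostics are old, so they count for little, while its recent diagnostics are mostly frustrated.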
21. Proposed Solution
Strategies for Selecting Questions
• Strategy 4 – Questions’ popularity
• The weights of RCs and categories are calculated according
to Strategy 2
• The question’s weight considers the weight of the associated
RCs and the question’s popularity
• A question’s popularity is the ratio between the
number of occurrences of the question and the number of
diagnostic sets selected
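The popularity ratio on its own is easy to illustrate. The sketch below assumes that each RC has one diagnostic set, namely the list of questions associated with it, and uses the RC1/RC2 question sets from the deck's examples. It computes only the popularity; how popularity is combined with the RC weights is not covered here.

```python
from collections import Counter

# Assumed layout: one diagnostic set (list of questions) per root cause.
diagnostic_sets = {"RC1": ["Q1", "Q2"], "RC2": ["Q1", "Q3"]}


def question_popularity(diagnostic_sets):
    """Occurrences of each question divided by the number of diagnostic sets."""
    n = len(diagnostic_sets)
    occurrences = Counter(
        question
        for questions in diagnostic_sets.values()
        for question in questions
    )
    return {question: count / n for question, count in occurrences.items()}


popularity = question_popularity(diagnostic_sets)
print(popularity)  # {'Q1': 1.0, 'Q2': 0.5, 'Q3': 0.5}
```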
22. Proposed Solution
Strategies for Selecting Questions
• Strategy 4 – Questions’ popularity
[Equation: questionWeight(x), expressed in terms of α_x, n, β_RCi and α_RCi,x, with a sum over i = 1, …, n]
α_x – number of occurrences of question x in the diagnostic sets
n – number of diagnostic sets
β_RCi – probability of identifying an RC
α_RCi,x – number of occurrences of question x in the diagnostic set
of an RC
23. Proposed Solution
Strategies for Selecting Questions
• Strategy 4 – Questions’ popularity
[Equation: questionWeight(x), expressed in terms of α_x, n, β_RCi and α_RCi,x, with a sum over i = 1, …, n]
Root Causes | Questions | Answers | Completed (1st age) | Completed (10th age) | Frustrated (1st age) | Frustrated (10th age)
RC1 | Q1, Q2 | A1, A3 | 1 | 24 | 4 | 8
RC2 | Q1, Q3 | A2, A5 | 4 | 15 | 1 | 2
25. Case Study
• In this case study some constraints were defined
• There are no changes during the executions
• The operator always provides the same answer
• A company provides services on the Web
• The infrastructure consists of a DB Server and a Web Server
• To meet growing demand, 2 new servers will be
installed
• Hosting Server – will be used to host the clients’ websites
• Mail Server – will be used to host the email services
26. Case Study
• The CP below aims to install 2 new servers and to
migrate existing services
27. Case Study
• A failure occurs
35. Case Study
• Diagnostic workflows generated
• The PHP configuration does not allow the
use of the language in users’ websites
36. Final Considerations
• The proposed solution identifies the root cause of
failures and offers the following features
• Reuse of the operator’s knowledge
• Interactivity between the solution and the operator
• Flexibility of the generated diagnostics
• System compatibility with the standards used by companies
• The modular structure of the solution allows organizations
to adapt the system to their specific needs
37. Final Considerations
• The proposed strategies generate different diagnostic
workflows for the same infrastructure and
failure
• Based on the results obtained, we make the following
recommendations for IT operators
• Strategy 1 – histories with a small number of records
• Strategy 2 – large, recent histories
• Strategy 3 – histories spanning at least 10 months
• Strategy 4 – data sets with many popular questions
38. Future Work
• Explore new criteria for the selection of questions
• Confidence
• False positive and false negative rates
• Extend the process to identify root causes in other
scopes
• Investigate the use of CIM classes (actions and checks)
in order to improve the system bootstrapping
• Automate root cause identification for certain kinds of
failures
43. Proposed Solution
Root Cause Analyzer
• Input Processor
• Identification based on categories
• Identification based on PR
• Identification based on RCs
• Question Selector
• Calculates the weights according to the strategy
• Selects the category that has the greatest weight
• Selects the question that has the greatest weight/level
• Question Verifier
• Obvious? Threshold: 80% with the same answer
[Figure: Root Cause Analyzer modules. Data items shown: CI, RC, Log.]
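The Question Selector and Question Verifier steps can be sketched as two small functions. This is an illustration under stated assumptions, not the system's actual code: the weights are hypothetical, the "level" criterion is ignored, the 80% threshold is treated as inclusive, and the sketch only flags an obvious question because the slide does not say what the system does with it.

```python
from collections import Counter


def select_question(category_weights, question_weights):
    """Pick the heaviest category, then the heaviest question inside it."""
    category = max(category_weights, key=category_weights.get)
    questions = question_weights[category]
    return category, max(questions, key=questions.get)


def is_obvious(past_answers, threshold=0.8):
    """True when at least `threshold` of the past answers are identical."""
    if not past_answers:
        return False
    most_common_count = Counter(past_answers).most_common(1)[0][1]
    return most_common_count / len(past_answers) >= threshold


# Hypothetical weights, as a strategy might have produced them.
category_weights = {"Web Page Server": 16, "DataBase": 13}
question_weights = {
    "Web Page Server": {"Q1": 50, "Q3": 30},
    "DataBase": {"Q2": 20},
}

print(select_question(category_weights, question_weights))
# ('Web Page Server', 'Q1')
print(is_obvious(["yes"] * 9 + ["no"]))         # True  (90% identical)
print(is_obvious(["yes", "yes", "no", "no"]))   # False (50% identical)
```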
44. Case Study
• Identified CIs and associated categories
CI | Categories
Hosted Sites | Service, Web Page Server
DataBase Access | Service, DataBase
Web Page Access | Service, Web Page Server
PHP Interpreter | Service, Web Page Server
CMS Service | Service, Web Page Server
Logical Connection | Network Services
Joomla | Software, Web Server
PHP | Software, Web Server
Apache | Software, Web Server
MySQL | Software, Web Server
DB Server System | Computer System, DB Server
Hosting Server System | Computer System, Hosting Server
Switch | Network Devices