Identifying the Root Cause of Failures in IT Changes:
Novel Strategies and Trade-offs
Ricardo L. dos Santos, Juliano A. Wickboldt, Bruno L. Dalmazo, Lisandro Z. Granville and
Luciano P. Gaspary
Federal University of Rio Grande do Sul, Brazil
Roben C. Lunardi
Federal Institute of Rio Grande do Sul, Brazil
Outline
• Introduction
• Proposed Solution
• Diagnosis Process
• Conceptual Architecture
• Root Cause Analyzer
• Strategies for Selecting Questions
• Case Study
• Final Considerations
• Future Work
Introduction
• Context
• The complexity of IT infrastructures makes IT processes
mission-critical
• ITIL (Information Technology Infrastructure Library) has become
the most widely accepted approach to IT process
management worldwide
• IT Change Management
• Defines how the IT infrastructure must evolve in a
consistent and safe way
• Defines how changes should be conducted
Introduction
• IT Problem Management
• Defines the lifecycle of IT problems
• The primary goals are
• To eliminate recurrent incidents
• To prevent the occurrence of IT problems
• To minimize the impact of problems which cannot be prevented
• To achieve these goals, identifying the root cause of failures
and reusing the operator’s knowledge are fundamental
• To simplify the procedures
• To minimize financial losses
• To reduce maintenance costs
Introduction
• Current Scenario
• Changes and failures have been explored in several
research efforts
• However, this research has some limitations, such as
• Previous data are often not considered
• The root cause of failures is not identified
• Solutions are specific to detecting software failures
Introduction
• Our Goals
• Propose strategies that help in the identification process
while keeping the approach interactive
• Each developed strategy must select a question, and the strategies
explore different criteria
• Compare the diagnostics generated by each strategy
Proposed Solution
Diagnosis Process – Our Approach
[Figure: interactive diagnosis process. Labels: Help Desk, Problem Report (PR), Root Cause Analyzer, Question Selection, Operator, Answered Question, Root Cause (RC)]
Proposed Solution
Conceptual Architecture
[Figure: conceptual architecture. Components: Operator, Change Management System (Change Designer, Change Planner), Deployment System, Configuration Management Database, Diagnosis System, Diagnosis Log Recorder, Root Cause Analyzer (Input Processor, Question Selector, Question Verifier). Flow labels: RFC, PR, CI, RC, Log]
Proposed Solution
Strategies for Selecting Questions
• The developed strategies use the same inputs and return
a single question as the result
• 4 proposed strategies
• Strategy 1 – Only completed diagnostics
• Strategy 2 – All diagnostics
• Strategy 3 – Age of diagnostics
• Strategy 4 – Questions’ popularity
Proposed Solution
Strategies for Selecting Questions
• Strategy 1 – Only completed diagnostics
• Only completed diagnostics are considered
• The calculated weights suffer no penalty
• The element weight is computed as the sum of the completed
diagnostics in which the RC was correctly identified

Root Causes   Questions   Answers   Completed Diagnostics
RC1           Q1, Q2      A1, A3    20
RC2           Q1, Q3      A2, A5    30

• Resulting weights
• Element associated with both RCs (Q1): 20 + 30 = 50
• Elements associated only with RC2: 30
• Elements associated only with RC1: 20
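The Strategy 1 example can be sketched in a few lines of Python. The dictionary layout and the function name are assumptions made for illustration and are not taken from the original system; only the counts come from the example table.

```python
# Hypothetical data layout: each root cause lists its associated questions
# and its number of completed diagnostics (values from the example table).
ROOT_CAUSES = {
    "RC1": {"questions": ["Q1", "Q2"], "completed": 20},
    "RC2": {"questions": ["Q1", "Q3"], "completed": 30},
}


def strategy1_question_weights(root_causes):
    """Sum the completed diagnostics of every RC a question is associated with."""
    weights = {}
    for rc in root_causes.values():
        for question in rc["questions"]:
            weights[question] = weights.get(question, 0) + rc["completed"]
    return weights


weights = strategy1_question_weights(ROOT_CAUSES)
best_question = max(weights, key=weights.get)
print(weights, best_question)
```

Under these assumptions Q1 accumulates the counts of both root causes, so it is the question selected first.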
Proposed Solution
Strategies for Selecting Questions
• Strategy 2 – All diagnostics
• Completed and frustrated diagnostics are considered
• The element weight is the sum of the
completed diagnostics minus the sum of the frustrated
diagnostics
• A diagnostic is frustrated when the system uses at least one
question associated with an RC, but at the end of the process
another RC is identified
Proposed Solution
Strategies for Selecting Questions
• Strategy 2 – All diagnostics

Root Causes   Questions   Answers   Completed Diagnostics   Frustrated Diagnostics
RC1           Q1, Q2      A1, A3    20                      10
RC2           Q1, Q3      A2, A5    30                      15

• Resulting weights
• Element associated with both RCs (Q1): (20 + 30) – (10 + 15) = 25
• Elements associated only with RC2: 30 – 15 = 15
• Elements associated only with RC1: 20 – 10 = 10
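A minimal sketch of the Strategy 2 computation, assuming the history is available as (completed, frustrated) pairs per root cause. The names `HISTORY`, `QUESTION_TO_RCS` and `strategy2_weight` are hypothetical; the counts are those of the example table.

```python
# (completed, frustrated) diagnostics per root cause, from the example table.
HISTORY = {"RC1": (20, 10), "RC2": (30, 15)}

# Which root causes each question is associated with (from the same table).
QUESTION_TO_RCS = {"Q1": ["RC1", "RC2"], "Q2": ["RC1"], "Q3": ["RC2"]}


def strategy2_weight(rcs, history):
    """Sum of completed diagnostics minus sum of frustrated diagnostics."""
    completed = sum(history[rc][0] for rc in rcs)
    frustrated = sum(history[rc][1] for rc in rcs)
    return completed - frustrated


question_weights = {
    question: strategy2_weight(rcs, HISTORY)
    for question, rcs in QUESTION_TO_RCS.items()
}
print(question_weights)
```

Unlike Strategy 1, a weight computed this way can become negative when frustrated diagnostics outnumber completed ones.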
Proposed Solution
Strategies for Selecting Questions
• Strategy 3 – Age of diagnostics
• Considers completed and frustrated diagnostics
• Element weights are penalized according to the age of the diagnostics

Age    Diagnostics Time          Penalty
1st    Up to 120 days            Not applicable
2nd    From 121 to 150 days      10%
3rd    From 151 to 180 days      20%
4th    From 181 to 210 days      30%
5th    From 211 to 240 days      40%
6th    From 241 to 270 days      50%
7th    From 271 to 300 days      60%
8th    From 301 to 330 days      70%
9th    From 331 to 360 days      80%
10th   More than 360 days        90%
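The age table can be expressed as a small lookup. This is a sketch under stated assumptions: the function names are illustrative, and a diagnostic that is exactly 360 days old is assigned to the 9th age, since the table lists day 360 in both the 9th and the 10th row.

```python
def age_group(days_old: int) -> int:
    """Map the age of a diagnostic (in days) to its age group, 1 to 10."""
    if days_old < 0:
        raise ValueError("age cannot be negative")
    if days_old <= 120:
        return 1
    if days_old > 360:
        return 10
    # Groups 2 to 9 are consecutive 30-day bands starting at day 121.
    return 2 + (days_old - 121) // 30


def weight_fraction(days_old: int) -> float:
    """Fraction of the weight that is kept: 100% minus the penalty."""
    penalty_percent = (age_group(days_old) - 1) * 10
    return (100 - penalty_percent) / 100


print(age_group(130), weight_fraction(130))
```

The returned fraction corresponds to the percentage of weight to be used for an age group, for example 10% for the 10th age.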
Proposed Solution
Strategies for Selecting Questions
• Strategy 3 – Age of diagnostics

\[ \mathit{elementWeight}(x) = \sum_{i=1}^{10} \beta_i \, (\alpha_i - \omega_i) \]

i – age of diagnostics
βi – percentage of weight to be used
αi – the number of completed diagnostics in an age group
ωi – the number of frustrated diagnostics in an age group
Proposed Solution
Strategies for Selecting Questions
• Strategy 3 – Age of diagnostics

               Questions   Answers   Completed Diagnostics   Frustrated Diagnostics
Root Causes                          1st age    10th age     1st age    10th age
RC1            Q1, Q2      A1, A3    1          24           4          8
RC2            Q1, Q3      A2, A5    4          15           1          2

• RC2: 100% (4 - 1) + 10% (15 - 2) = 4.3
• RC1: 100% (1 - 4) + 10% (24 - 8) = 1.6
• Element associated with both RCs (Q1): 4.3 + 1.6 = 5.9
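A minimal sketch of the Strategy 3 formula, assuming each age group is given as a (β, completed, frustrated) tuple; the function name and data layout are illustrative and not from the source. Applied as written, the formula reproduces 4.3 for RC2. For RC1 it yields -1.4, whereas the worked example reports 1.6, which is the value of the 10th-age term alone. The slides do not state how a negative age-group term is handled, so the sketch does not try to reproduce that value.

```python
def element_weight(age_groups):
    """Sum of beta_i * (alpha_i - omega_i) over the supplied age groups.

    age_groups: iterable of (beta, completed, frustrated) tuples, where beta
    is the fraction of the weight kept for that age group.
    """
    return sum(
        beta * (completed - frustrated)
        for beta, completed, frustrated in age_groups
    )


# 1st age keeps 100% of the weight, 10th age keeps 10% (90% penalty).
rc2_weight = element_weight([(1.0, 4, 1), (0.1, 15, 2)])
rc1_weight = element_weight([(1.0, 1, 4), (0.1, 24, 8)])
print(round(rc2_weight, 2), round(rc1_weight, 2))
```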
Proposed Solution
Strategies for Selecting Questions
• Strategy 4 – Questions’ popularity
• The weights of RCs and categories are calculated according to
Strategy 2
• The question’s weight considers the weight of the associated
RCs and the question’s popularity
• A question’s popularity is the ratio between the
number of occurrences of the question and the number of
diagnostic sets selected
Proposed Solution
Strategies for Selecting Questions
• Strategy 4 – Questions’ popularity

\[ \mathit{questionWeight}(x) = \frac{\dfrac{\alpha_x}{n} + \sum_{i=1}^{n} \beta_{RC_i} \, \alpha_{RC_i,x}}{2} \]

αx – number of occurrences of question x in the diagnostic sets
n – number of diagnostic sets
βRCi – probability of identifying an RC
αRCi,x – number of occurrences of question x in the diagnostic set
of an RC
Proposed Solution
Strategies for Selecting Questions
• Strategy 4 – Questions’ popularity

               Questions   Answers   Completed Diagnostics   Frustrated Diagnostics
Root Causes                          1st age    10th age     1st age    10th age
RC1            Q1, Q2      A1, A3    1          24           4          8
RC2            Q1, Q3      A2, A5    4          15           1          2

• Q1: (2/2 + ((13/29 * 1) + (16/29 * 1))) / 2 = 1
• Q2: (1/2 + ((13/29 * 1) + (16/29 * 0))) / 2 = 0.4741
• Q3: (1/2 + ((13/29 * 0) + (16/29 * 1))) / 2 = 0.5259
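The three question weights can be reproduced with the sketch below. It assumes that the probability of identifying an RC is its Strategy 2 weight divided by the sum of all RC weights; this is inferred from the example numbers (13 = (1 + 24) - (4 + 8), 16 = (4 + 15) - (1 + 2), 13 + 16 = 29) rather than stated in the slides. The data layout and names are hypothetical.

```python
# One diagnostic set per root cause; counts are the sums over both age columns.
DIAGNOSTIC_SETS = {
    "RC1": {"questions": ["Q1", "Q2"], "completed": 1 + 24, "frustrated": 4 + 8},
    "RC2": {"questions": ["Q1", "Q3"], "completed": 4 + 15, "frustrated": 1 + 2},
}


def question_weight(question, sets):
    """Average of the question's popularity and its RC-weighted occurrences."""
    n = len(sets)
    # Strategy 2 weight of each RC (assumed basis for the RC probability).
    rc_weight = {rc: s["completed"] - s["frustrated"] for rc, s in sets.items()}
    total = sum(rc_weight.values())  # assumed non-zero in this sketch
    occurrences = {rc: s["questions"].count(question) for rc, s in sets.items()}
    popularity = sum(occurrences.values()) / n
    rc_term = sum(rc_weight[rc] / total * occurrences[rc] for rc in sets)
    return (popularity + rc_term) / 2


q_weights = {q: round(question_weight(q, DIAGNOSTIC_SETS), 4) for q in ("Q1", "Q2", "Q3")}
print(q_weights)
```

Because popularity and the RC term are averaged, a question shared by every diagnostic set reaches the maximum weight of 1 in this example.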
Case Study
• In this case study, some constraints were defined
• No changes occur during the executions
• The operator always provides the same answer
• A company provides some services on the Web
• The infrastructure consists of a DB Server and a Web Server
• To meet growing demand, 2 new servers will be
installed
• Hosting Server – Will be used to host the clients’ websites
• Mail Server – Will be used to host the email services
Case Study
• The CP below aims to install 2 new servers and
migrate existing services
[Figure: CP workflow; not recoverable from the extracted text]
• A failure occurs
Case Study
• IT infrastructure state in the company
[Figure: IT infrastructure state; not recoverable from the extracted text]
Case Study

                            Calculated Weights
Categories        Level     Strat. 1   Strat. 2   Strat. 3   Strat. 4
Service           1         1083       242        157.30     242
Web Page Server   2         558        82         33.20      82
DataBase          2         519        195        127.60     195
Network           1         1058       345        188.10     345
Services          2         512        189        113.40     189
Devices           2         485        136        66.20      136
System            1         603        167        54.30      167
Computer System   2         545        153        52.90      153
Hosting Server    3         319        175        49.90      175
DB Server         3         192        -22        3.00       -22
Software          1         1115       343        126.60     343
Web Server        2         607        138        86.80      138
DB Server         2         443        169        36.20      169
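The effect of the strategies on the first selection step can be illustrated with the weights above. The sketch assumes that the selection starts among level-1 categories and picks the one with the greatest weight; the restriction to level 1 and the helper name are assumptions, since the slides only state that the category with the greatest weight is selected.

```python
# (category, level, strat. 1, strat. 2, strat. 3, strat. 4) from the table.
ROWS = [
    ("Service", 1, 1083, 242, 157.30, 242),
    ("Web Page Server", 2, 558, 82, 33.20, 82),
    ("DataBase", 2, 519, 195, 127.60, 195),
    ("Network", 1, 1058, 345, 188.10, 345),
    ("Services", 2, 512, 189, 113.40, 189),
    ("Devices", 2, 485, 136, 66.20, 136),
    ("System", 1, 603, 167, 54.30, 167),
    ("Computer System", 2, 545, 153, 52.90, 153),
    ("Hosting Server", 3, 319, 175, 49.90, 175),
    ("DB Server", 3, 192, -22, 3.00, -22),
    ("Software", 1, 1115, 343, 126.60, 343),
    ("Web Server", 2, 607, 138, 86.80, 138),
    ("DB Server", 2, 443, 169, 36.20, 169),
]


def top_category(rows, strategy, level=1):
    """Name of the category with the greatest weight at the given level."""
    candidates = [row for row in rows if row[1] == level]
    # Strategy k is stored in column index 1 + k of each row.
    return max(candidates, key=lambda row: row[1 + strategy])[0]


first_choice = {s: top_category(ROWS, s) for s in (1, 2, 3, 4)}
print(first_choice)
```

With these numbers, Strategy 1 starts from Software while the other three start from Network, which is consistent with the strategies producing different diagnostic workflows for the same failure.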
Case Study
• Diagnostic workflows generated
[Figure: diagnostic workflows generated by the strategies; not recoverable from the extracted text]
The PHP configuration does not allow the
use of the language in users’ websites
Final Considerations
• The proposed solution makes it possible to identify the root
cause of failures, with the following features
• Reuse of the operator’s knowledge
• Interactivity between the solution and the operator
• Flexibility of the generated diagnostics
• Compatibility with the standards used by companies
• The modular structure of the solution allows organizations
to adapt the system to their specific needs
Final Considerations
• The proposed strategies generate different diagnostic
workflows for the same infrastructure and
failure
• Based on the obtained results, we make the following
recommendations for IT operators
• Strategy 1 – histories with a small number of records
• Strategy 2 – large and recent histories
• Strategy 3 – histories that span at least 10 months
• Strategy 4 – data sets with a large number of popular questions
Future Work
• Explore new criteria for the selection of questions
• Confidence
• False positive and false negative rates
• Extend the process to identify root causes in other
scopes
• Investigate the use of CIM classes (actions and checks)
to improve system bootstrapping
• Automate root cause identification for certain kinds of
failures
Thank you for your attention!
Questions?
Proposed Solution
Root Cause Analyzer
• Input Processor
• Identification based on PR
• Identification based on categories
• Identification based on RCs
• Question Selector
• Calculates the weights according to the strategy
• Selects the category that has the greatest weight
• Selects the question that has the greatest weight/level
• Question Verifier
• Obvious? Threshold: 80% with the same answer
[Figure: Root Cause Analyzer internals. Flow labels: CI, RC, Log]
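The Question Verifier check can be sketched as follows. The slide only states "Obvious? Threshold 80% with the same answer"; reading this as "a question is obvious when at least 80% of its logged answers are identical" is an assumption, as are the function name and the input format. What the verifier does with an obvious question is not stated in the slides and is not modeled here.

```python
from collections import Counter


def is_obvious(past_answers, threshold=0.8):
    """True when the most common logged answer reaches the threshold share."""
    if not past_answers:
        return False  # no history, so nothing can be called obvious
    _, count = Counter(past_answers).most_common(1)[0]
    return count / len(past_answers) >= threshold


print(is_obvious(["A1"] * 8 + ["A2"] * 2))
print(is_obvious(["A1"] * 7 + ["A2"] * 3))
```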
Case Study
• Identified CIs and associated categories

CI                   Categories
Hosted Sites         Service → Web Page Server
DataBase Access      Service → DataBase
Web Page Access      Service → Web Page Server
PHP Interpreter      Service → Web Page Server
CMS Service          Service → Web Page Server
Logical Connection   Network → Services
Joomla               Software → Web Server
PHP                  Software → Web Server
Apache               Software → Web Server
MySQL                Software → Web Server
DB Server            System → Computer System → DB Server
Hosting Server       System → Computer System → Hosting Server
Switch               Network → Devices
Proposed Solution
Information Model
[Figure: UML class diagram. Classes: ManagedElement, ExchangeElement, SolutionElement, Category, SolutionCategory, QuestionCategory, Question, Answer, RootCause, Problem, ServiceProblem, ServiceIncident. Associations: determinesProblem, possibleAnswers, determinesOthersQuestions, CategoryParentChild. Multiplicities are not recoverable from the extracted text]
Proposed Solution
Information Model
[Figure: UML class diagram, log-related classes. Classes: LogicalElement, EnabledLogicalElement, MessageLog, RecordLog, Question, Answer, RootCause, Problem. Associations: determinesProblem, possiblesAnswers, determinesOthersQuestions, recordedProblem, recordedQuestions, recordedAnswers. Multiplicities are not recoverable from the extracted text]

Identifying the Root Cause of Failures in IT Changes: Novel Strategies and Trade-offs (IM 2013)
