A Machine Learning approach to predict Software
Defects
Chetan Hireholi
February 19, 2017
Abstract
Software engineering teams are not only involved in developing new versions of a
product but are often involved in fixing customer reported defects. Customers report issues
faced by a particular software and some of these issues may actually require the engineering
team to analyze and potentially provide a fix for the issue. The number of defects that a
software engineering team has to analyze is significant and teams often prioritize the order
of the defects based on the customer’s priority and to the extent the defect impacts the
business operations of the customer. Often, it is likely that the engineering team may not
truly understand the business impact that a defect is likely to have and this results in the
customer escalating the defect up the engineering team’s management chain seeking more
immediate attention to their problem.
Such escalated defects tend to consume a lot of engineering bandwidth
and increase defect handling costs; further, such escalations impact existing plans as all crit-
ical resources are diverted to handle these cases. Software escalations are classified under
three categories: Red, Yellow and Green. A defect report with high business impact is
prioritized and marked Red; Yellow is the neutral category, which may be escalated if
appropriate attention is not given; and Green reports have the lowest priority.
The engineering team understands the nature of the software escalation and allocates the
resources appropriately. The objective of this project is to be able to analyze software defects
and predict the defects that are likely to be escalated by the customer. This would permit
the engineering team to be alerted about the situation and take proactive measures that
will give better support to the customers. For the purpose of our analysis, we have used the
defects database provided by Hewlett-Packard (HP) India. We used R to clean and
pre-process the database. We then extracted keywords using natural language processing
and applied machine learning (J48 decision tree, Naïve Bayes and Simple K-Means) to
predict the escalations. Thus, by combining the extracted keywords with the tickets received
by the team, we can predict the nature of an escalation and alert the engineering team so
that they can take appropriate steps.
Acknowledgment
I would like to take this opportunity to thank the many eminent personalities without
whose constant encouragement and support this endeavor of mine would not have been successful.
Firstly, I would like to thank PES University for having the ”Final Year Project”
as a part of my curriculum, which gave me a wonderful opportunity to work on my research
and presentation abilities, and for providing excellent facilities, without which this project
could not have acquired the orientation it has now.
At the outset, I would like to thank Prof. Nitin V. Pujari, Chairperson, PES
University, who shaped my attitude towards the subject of this work.
It gives me immense pleasure to thank Dr. K. V. Subramaniam, Department
of Computer Science and Engineering, PES University for his continuous support, advice
and guidance.
I would also like to thank Dr. Jayashree R., Department of Computer Sci-
ence and Engineering, PES University for her initiative and support which made this work
possible.
Contents
Abstract i
Acknowledgment ii
List of Figures v
1 Introduction 1
1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Literature Survey 4
2.1 A Probabilistic Model for Software Defect Prediction . . . . . . . . . . . . . 4
2.2 Predicting Bugs from History . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3 Exploring the Dataset 8
3.1 Incident Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Change Requests(CR) Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Cleaning the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Defect Escalation Analysis 14
4.1 Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1.1 Analyzing the Incidents Dataset . . . . . . . . . . . . . . . . . . . . . 14
4.1.2 Analyzing Change Requests Dataset . . . . . . . . . . . . . . . . . . 21
4.2 Applying Machine Learning on the Dataset . . . . . . . . . . . . . . . . . . . 26
4.2.1 Classifying the Incidents Data . . . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Clustering the Incidents Data . . . . . . . . . . . . . . . . . . . . . . 36
4.2.3 Text Mining and Natural Language Tool Kit (NLTK) . . . . . . . . . 42
5 Results and Conclusions 51
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Bibliography 53
List of Figures
1 Predicting Bugs from History- Commonly used complexity metrics . . . . . . 5
2 Life Cycle of a Defect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Cleansing in OpenRefine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Data Transformation in Microsoft Excel . . . . . . . . . . . . . . . . . . . . 12
5 Distribution of Incident Escalations . . . . . . . . . . . . . . . . . . . . . . . 14
6 Analyzing RED Incidents: Customers vs Escalations . . . . . . . . . . . . . 15
7 Analyzing RED Incidents: Modules vs Escalations . . . . . . . . . . . . . . . 17
8 Analyzing RED Incidents: Software release vs Escalations . . . . . . . . . . 17
9 S/w release vs Escalations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
10 Analyzing RED Incidents: OS vs Escalations . . . . . . . . . . . . . . . . . . 18
11 Analyzing RED Incidents: Developer vs Escalations . . . . . . . . . . . . . . 19
12 Other observations made on Incidents . . . . . . . . . . . . . . . . . . . . . . 20
13 Analyzing CR data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
14 Analyzing CR data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
15 Analyzing CRs: Customers vs Escalations . . . . . . . . . . . . . . . . . . . 22
16 Analyzing CRs: Modules vs Escalations . . . . . . . . . . . . . . . . . . . . . 23
17 Analyzing CRs: S/w release vs Escalations . . . . . . . . . . . . . . . . . . . 23
18 Analyzing CRs: OS vs Escalations . . . . . . . . . . . . . . . . . . . . . . . 24
19 Analyzing CRs: Developer vs Escalations . . . . . . . . . . . . . . . . . . . . 25
20 Classifying using: J48 Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
21 Classifying using: J48 Tree: Prefuse Tree . . . . . . . . . . . . . . . . . . . . 28
22 Module as root node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
23 Probability for MODULE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
24 Probability distribution for SEVERITY SHORT . . . . . . . . . . . . . . 31
25 ESCALATION as the root node . . . . . . . . . . . . . . . . . . . . . . . . . 31
26 Probability distribution table for ESCALATION . . . . . . . . . . . . . . . . 32
27 Probability distribution table for EXPECTATION . . . . . . . . . . . . . . . 33
28 SEVERITY SHORT as the root node . . . . . . . . . . . . . . . . . . . . . . 33
29 Probability distribution for SEVERITY SHORT . . . . . . . . . . . . . . . . 34
30 Probability distribution for Customer Expectations . . . . . . . . . . . . . . 35
31 Final Cluster Centroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
32 Model and evaluation on training set . . . . . . . . . . . . . . . . . . . . . . 37
33 Cluster Centroids I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
34 Cluster Centroids II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
35 Total number of Incidents and their Escalation count . . . . . . . . . . . 42
36 Words with highest frequency mined on GREEN tickets escalated to RED . 44
37 GREEN tickets escalated to RED . . . . . . . . . . . . . . . . . . . . . . . . 44
38 Words with highest frequency mined on GREEN tickets escalated to YELLOW 45
39 GREEN ticket escalated to YELLOW . . . . . . . . . . . . . . . . . . . . . . 45
40 Words with highest frequency mined on YELLOW tickets escalated to RED 46
41 YELLOW ticket escalated to RED . . . . . . . . . . . . . . . . . . . . . . . 46
42 Observations made on RED tickets which were escalated . . . . . . . . . 47
43 Plotting the highest mined words . . . . . . . . . . . . . . . . . . . . . . . . 47
44 Words with highest frequency mined . . . . . . . . . . . . . . . . . . . . . . 48
45 Plotting the words with highest frequency mined . . . . . . . . . . . . . . . . 48
46 Words with highest frequency mined . . . . . . . . . . . . . . . . . . . . . . 49
47 Plotting the words with highest frequency mined . . . . . . . . . . . . . . . . 49
48 Output of the program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1 Introduction
In spite of diligent planning, documentation, and proper process adherence in
software development, occurrences of defects are inevitable. In today’s cutting edge com-
petition, it is important to make conscious efforts to control and minimize these defects by
using techniques to allow in-process quality monitoring and control. Predicting the total
number of defects before testing begins improves the quality of the product being delivered
and helps in planning and decision making for future project releases.
Defect prediction in software is viewed as one of the most useful and cost-efficient
operations. Software developers see it as a vital phase on which the quality of the product
being developed depends. It has played a major part in countering the allegation that the
software industry is incapable of delivering requirements within budget and on time. Besides
this, clients’ response regarding product quality has shown a large shift from unsatisfactory
to satisfactory.
Today, many data-mining approaches have replaced the earlier statistical approaches for defect
prediction. The basis of data mining is the classification model, which places a component
in one of two classes: fault-prone or non-fault-prone.
Defects reported to the engineering team carry important information that may
lead to various important decisions. These defects are reported as tickets. Each ticket
contains the nature of the escalation, which is based on the business value.
These tickets may also give valuable information regarding the manner in which:
• Engineering defects are collected as part of the Quality Assurance (QA) cycle of
a software product or software application
• Product management, which manages the usage and direction of the product
• Escalations: how often does an escalation happen? Which modules of the application
are buggy and require more attention?
• Who has worked on the code-base during development, QA and defect/ticket/inci-
dent fixing, etc.
1.1 Problem Definition
• Determine causes for Defects during the Engineering phase which may lead
to Escalation of Customer Support Cases
While most software defects are corrected and tested as part of the prolonged software
development cycle, enterprise software vendors often have to release software products
before all reported defects are corrected, due to deadlines and limited resources.
A small number of these reported defects will be escalated by customers whose busi-
nesses are seriously impacted. Escalated defects must be resolved immediately and
individually by the software vendors at a very high cost. The total costs can be even
greater, including loss of reputation, satisfaction, loyalty, and repeat revenue.
• Build a Recommendation Engine to alert on Escalation based on the nature
of defects, creating an alert system for the team, given one such escalation
that has happened
With this alerting mechanism, the team can take proactive steps to prevent an
escalation before it happens.
1.2 Motivation
• Market research companies, notably Gartner and Forrester, have conducted
surveys predicting that 80% of IT budgets go towards the maintenance of
applications
Of the new projects (funded from the remaining 20% of the IT budget), there is
a dismal success rate of only 2%
This research hopes to change the way ticket information (in our opinion a
wealth of information, which has been neglected so far) is looked at
• Mine information that is hidden and lost in the tickets dump. The mined infor-
mation will then be useful to deduce important details which would help the Project
Manager of the team to plan out the activities appropriately.
2 Literature Survey
In this process we found many works related to software bug prediction, which
helped us understand the kind of knowledge that can be captured from bugs. The following
is the work carried out in the area of Software Defect Prediction:
2.1 A Probabilistic Model for Software Defect Prediction
Although a number of approaches have been taken to quality prediction for software, none
have achieved widespread applicability. The author’s aim here is to produce a single model
to combine the diverse forms of, often causal, evidence available in software development
in a more natural and efficient way than done previously. The authors use graphical prob-
ability models (also known as Bayesian Belief Networks) as the appropriate formalism for
representing this evidence. The authors have used the subjective judgments of experienced
project managers to build the probability model and use this model to produce forecasts
about the software quality throughout the development life cycle. Moreover, the causal or
influence structure of the model more naturally mirrors the real world sequence of events and
relations than can be achieved with other formalisms. We used WEKA in order to apply the
Bayesian Network Classifier. We selected the attributes: Escalation, Expectation, Modules
& Severity. Then, by rotating the attributes as the root nodes, results were captured.
A disadvantage of a reliability model of this complexity is the amount of data that is needed
to support a statistically significant validation study. A more detailed description of the
application of Bayesian classification is covered in section 4.2.1. [PMDM]
2.2 Predicting Bugs from History
Version and bug databases contain a wealth of information about software failures — how
the failure occurred, who was affected, and how it was fixed. Such defect information can be
automatically mined from software archives; and it frequently turns out that some modules
are far more defect-prone than others. How do these differences come to be?
The authors have researched how code properties like (a) code complexity, (b) the problem
domain, (c) past history, or (d) process quality affect software defects, and how their cor-
relation with defects in the past can be used to predict future software properties — where
the defects are, how to fix them, as well as the associated cost.[PRBD]
Figure 1: Predicting Bugs from History - Commonly used complexity metrics
Learning from history means learning from successes and failures — and how to make the
right decisions in the future. In our case, the history of successes and failures is provided
by the bug database: systematic mining uncovers which modules are most prone to defects
and failures. Correlating defects with complexity metrics or the problem domain is useful
in predicting problems for new or evolved components. Learning from history has one big
advantage: one can focus on the aspect of history that is most relevant for the current situ-
ation. Thus the history data provided to us by Hewlett-Packard (HP), which consisted of
the incident and the change request data, proved helpful during the statistical analysis. More
descriptive coverage of the statistical analysis is given in section 4.1. The dataset given to
us by HP is explained in detail in section 3.
Some more research work carried out on similar problem domains:
• Work done using the Machine Learning approach:
Predicting Effort to Fix Software Bugs [PEFS]:
The authors have used the K-Nearest Neighbor approach to predict the effort
put in by an engineer to fix software bugs. In this study, their technique leverages
existing issue tracking systems: given a new issue report, they search for similar,
earlier reports and use their average time as a prediction. This technique is not
of much use in our problem domain, since the work focuses on the assignment of
a job to a resource for which the effort is estimated.
Cost-Sensitive Learning for Defect Escalation [CSDE]:
The authors have established that Cost-Sensitive Decision Trees are the best
method for producing the highest positive net profit. Our approach to the defect
reports is different, since we are not focusing on the cost aspect but on the
escalation of a defect report. We have used Apriori algorithms to derive rules
with respect to Escalation.
Predicting Failures with Hidden Markov Models [PHMM]:
The authors have come up with an approach using Hidden Markov Models (HMM)
to recognize the patterns in failures. Since HMMs give accurate outputs only on
fewer attributes, we could not use this approach to recognize the defect patterns
in the defect reports.
Data Mining Based Social Network Analysis from Online Behavior [DSNA]: The
authors have used neural networks and performed sentiment analysis on social
networks to predict the online behavior of people. The approach used in sentiment
analysis gave me insights into the Natural Language Tool Kit (NLTK) and how
NLTK can be used to find the sentiments in user data. This motivated me
to pick up NLTK to analyze the tickets dataset provided by Hewlett-Packard.
Detailed information on the application of NLTK is given in section 4.2.3.
3 Exploring the Dataset
This section describes exploring the dataset acquired from Hewlett-Packard (HP)
and closely analyzing it. This is the beginning phase of the project, where the data is under-
stood and meaningful analysis is done. It gives a high-level overview of the datasets.
HP provided two datasets: the Customer Incident dataset and the Change Requests (CR)
dataset.
The Incidents dataset contains the customer cases. These cases include troubleshooting
errors, field issues, installation issues, environment issues and all other cases related to
the software. When a customer logs a unique case that the team identifies as a Change
Request, a corresponding entry is made in the CR dataset.
Below are the characteristics of the Incident & CR datasets:
3.1 Incident Dataset
• Contains 153 headers with 6,433 entries
Few important characteristics are:
Each entry has the owner assigned to respective entries
Each entry has a unique ID called ISSUEID
The life line of each entry is captured and represented numerically (e.g.
DAYS TO OPEN, DAYS TO FIXED, etc.)
Each entry has an Escalation set by the CPE Support Team on consulting the
customer. The Escalation comes in 3 categories: RED, YELLOW & GREEN.
RED is the highest-priority ticket, YELLOW is a potentially important
ticket and GREEN is a ticket with lesser business impact compared to the
Red and Yellow tickets.
Each entry has an Expectation set by the CPE Support team based on the inputs
from the customer (e.g. Investigate Issue & Hotfix requested, Answer Question,
Create Enhancement, Investigate Issue, etc.)
The mail communication between the Developer and Customer can be found
under: NOTE CUSTOMER. This field contains all the information about the
Defect being tracked with the team.
Each entry has a Severity set by the CPE Support Team on consulting the cus-
tomer. The Severity comes in 3 stages: Low, Medium & High.
Each entry has a date describing when the case was escalated, called
QUIXY ESCALATED ON DATE & QUIXY ESCALATED YELLOW DATE.
Each entry has a RESOLUTION attribute, which describes the resolution of
the case.
On observing the Incident dataset, we needed to come up with a life cycle of how a defect
is tracked by the team.
The figure below illustrates the life cycle of a defect. This process is currently in use by
the team.
Figure 2: Life Cycle of a Defect
3.2 Change Requests(CR) Dataset
The CRs dataset contains the Incident Cases which were identified as Change Requests.
• Contains 153 headers with 11960 entries
Few important characteristics are:
Each entry has the owner assigned to respective entries
Each entry has a unique ID called ISSUEID
The life line of each entry is captured and represented numerically (e.g.
DAYS TO OPEN, DAYS TO FIXED, etc.)
Each entry has an Escalation set by the CPE Support Team on consulting the
customer. The Escalation comes as Y(Escalated) and N(Not Escalated)
Each entry has an Expectation set by the CPE Support team based on the inputs
from the customer (e.g. Investigate Issue & Hotfix requested, Answer Question,
Create Enhancement, Investigate Issue, etc.)
The mail communication between the Developer and Customer can be found
under: NOTE CUSTOMER. This field contains all the information about the
Defect being tracked with the team.
Each entry has a Severity set by the CPE Support Team on consulting the cus-
tomer. The Severity comes in 3 stages: Low, Medium & High.
Each entry has a date describing when the case was escalated, called
QUIXY ESCALATED ON DATE.
Each entry has a RESOLUTION attribute, which describes the resolution of
the case.
3.3 Cleaning the Dataset
After receiving the huge dataset, the next step was to clean the data. There were many
discrepancies in the dataset, viz. the presence of non-numeric values in the date fields, and
rows whose data were shifted to the left by 2-4 columns; due to this shift, the data were not
aligned to their specific headers. The following steps were performed to clean the dataset.
• Removing Discrepancies
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data:
cleaning it; transforming it from one format into another; and extending it with web
services and external data.
This tool greatly reduced the cleaning effort. OpenRefine helped to explore
large data sets with ease. It provides functions to transform the data to make it
uniform. E.g., in the ”Customer” column, there were different names for a single
company. The tool helps in organizing varied names into a single one, thus unifying
the words ”Vodafone”, ”Vodafone Inc” and ”vodafone” to the single name ”Vodafone”. It
also helps in removing special characters to make the text readable, and it takes care
of the case sensitivity of the content; i.e., we can edit the contents of multiple rows
using the ”Text Facet” feature, shown in Figure 3.
• Removing the Unwanted Data
Microsoft Excel helped in converting the whole dataset into a table. Then, by applying
filters to the table, all the noise was excluded. It was further possible to extract
pivot tables, which aided the statistical analysis of the dataset.
Figure 3: Cleansing in OpenRefine
Figure 4: Data Transformation in Microsoft Excel
• Removing Stop Words
The Text Mining (tm) package of the R language helped convert the dataset into
a corpus. The pre-processing of the data is efficiently done by the tm package in
R. The various text transformations offered by the tm package are: “removeNumbers”,
“removePunctuation”, “removeWords”, “stemDocument” and “stripWhitespace”.
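As an illustration, a minimal sketch of this pre-processing pipeline in R (the file name and
the NOTE CUSTOMER column follow the dataset description in section 3.1; the exact
names in the actual dump are assumptions):

    library(tm)

    # Load the incident dump; file and column names are assumptions
    # based on the dataset description in section 3.1.
    incidents <- read.csv("incidents.csv", stringsAsFactors = FALSE)
    corpus <- VCorpus(VectorSource(incidents$NOTE_CUSTOMER))

    # The tm transformations listed above, applied in sequence.
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stemDocument)
    corpus <- tm_map(corpus, stripWhitespace)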
4 Defect Escalation Analysis
After the data was cleaned, the next step was a statistical analysis of the data, to further
unveil some under-the-hood facts.
4.1 Statistical Analysis
The initial statistical analysis of the data received from HP (incidents.csv and crs.csv)
was carried out in Microsoft Excel (MS Excel). The pivot charts obtained from MS Excel
helped in graphically analyzing the huge datasets.
4.1.1 Analyzing the Incidents Dataset
The incidents.csv file contained all the customer incidents reported to the team.
Figure 5: Distribution of Incident Escalations in incidents.csv
There are in total 125 RED escalated incidents, 3831 GREEN incidents and 329 YELLOW
incidents in the dataset.
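These counts can be reproduced directly from the cleaned dump; a one-line sketch in R
(the ESCALATION column name is an assumption based on the attribute description in
section 3.1):

    # Tabulate incidents by escalation colour (column name assumed).
    table(incidents$ESCALATION)
    #  GREEN    RED YELLOW
    #   3831    125    329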
• Analyzing RED Incidents: Customers vs Escalations
In certain situations, the escalation of a task is necessary. For example, a user is performing
a task and is unable to complete it within a certain period of time. In such cases, where
the user/customer is completely blocked, a case is RED escalated. These RED escalated
incident cases are high-priority cases and have to be addressed with high severity.
The companies RHEINENERGIE, HEWLETT PACKARD and DEUTSCHE BANK had
the highest number of RED escalations.
Figure 6:
Analyzing RED Incidents: Customers vs Escalations
• Company behavior analysis: RHEINENERGIE
RHEINENERGIE had the maximum RED escalations among all customers. There
were around 28 incident cases registered with the Operations Team. The patterns ob-
served in those 28 incidents are:
Out of the 28 incidents, 6 were RED escalated.
There is a 21.28% chance that an incident logged in will be a RED escalation
Most reported modules:
Ops - Monitor Agent (opcmona) (7 cases), where 3 of them were RED
Installation (6 cases)
Perf – Collector (3 cases)
Average number of days a single incident was handled: 73.5 days
Number of incidents which moved to CR: 15; 53.57% of the incidents moved to CRs
• Analyzing RED Incidents: Modules vs Escalations
The software modules Ops - Action Agent (opcacta) & Installation had the
highest number of RED escalations reported to the Operations Team.
Figure 7:
Analyzing RED Incidents: Modules vs Escalations
• Analyzing RED Incidents: Software Release vs Escalations
Out of all the 125 RED escalated cases, Operations Agent version 8.6 had the
maximum number of incident reports, followed by versions 11.14 & 11.02. These
software release versions had the maximum escalations among the incidents.
Figure 8:
Analyzing RED Incidents: Software release vs Escalations
The incident frequencies of the other software versions are shown below:
Figure 9:
S/w release vs Escalations
• Analyzing RED Incidents: Operating System(OS) vs Escalations
It can be observed that 83 entries out of the 125 Red escalations had a ’blank’ OS field.
After talking to an HP developer, we got the input that customers tend
to skip the OS field. We further observed that no specific OS version, which
could aid the troubleshooting, was given: customers tend to enter just general OS
names, viz. Windows, Solaris, etc., instead of mentioning the entire name with the
version details.
Figure 10:
Analyzing RED Incidents: OS vs Escalations
• Analyzing RED Incidents: Developer vs Escalations
This describes the developer associated with each incident case. Below is the distribu-
tion of incidents among the developers. The developer who was assigned the highest
number of incident cases is prasad.m.k@hp.com
Figure 11:
Analyzing RED Incidents: Developer vs Escalations
• Other observations made on Incidents
Calculating the age of a single ticket was challenging. The data dump had two
columns, OPEN-IN-DATE and CLOSED-IN-DATE; the difference of those two
dates should give the total days taken to close the ticket. But this contradicted
the DAYS SUPPORT TO CPE column present in the table: the two values
did not match. Below is a pictorial representation of the same:
Figure 12:
Other observations made on Incidents
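A sketch of the age computation that exposed this mismatch, in R; the column names
follow the description above, and the date format is an assumption:

    # Ticket age = difference between close and open dates.
    open_d  <- as.Date(incidents$OPEN_IN_DATE,   format = "%d-%m-%Y")  # format assumed
    close_d <- as.Date(incidents$CLOSED_IN_DATE, format = "%d-%m-%Y")
    age     <- as.numeric(close_d - open_d)

    # Rows where the computed age disagrees with the pre-computed column,
    # as observed in Figure 12.
    mismatch <- which(age != incidents$DAYS_SUPPORT_TO_CPE)
    length(mismatch)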
4.1.2 Analyzing Change Requests Dataset
The second dataset provided by HP was the CR data (incident cases which were
identified as Change Requests). These are the cases which are added to the product backlog.
Each entry had an escalation attached to it. The three escalation values were:
Showstopper, Yes and No. Below is the distribution of CRs and the nature of the
escalations they carried. There were in total 10,387 CR entries. Of these, 10,219 cases did
not escalate, 75 cases escalated and 93 of them were marked as ”Showstopper”.
Figure 13:
Analyzing CR data
The Showstopper escalations were High Priority cases.
Figure 14: Analyzing CR data
• Analyzing CRs: Customers vs Escalations
The company TATA CONSULTANCY SERVICES LTD. had the maximum
Showstopper escalations, whereas Allegis, NORTHROP GRUMMAN and PepperWeed
were the companies with the highest ”Y” (Yes) escalations.
Figure 15:
Analyzing CRs: Customers vs Escalations
• Analyzing CRs: Modules vs Escalations
The software modules Ops - Monitor Agent (opcmona) & Installation had the
highest ”Showstopper” escalations, whereas the modules Installation & Lcore
had the highest number of ”Y” (Yes) escalations.
Figure 16:
Analyzing CRs: Modules vs Escalations
• Analyzing CRs: Software Release vs Escalations
Software Release Version 11 had the highest ”Showstopper” and ”Y” (Yes) escalations,
with Software Release Version 8.6 second highest in ”Showstopper” and ”Y” (Yes)
escalations.
Figure 17:
Analyzing CRs: S/w release vs Escalations
• Analyzing CRs: OS vs Escalations
The software running on Windows OS had the maximum number of both ”Showstop-
per” and ”Y” (Yes) escalations. Note: submitters of these tickets tend to fill the OS
field as they want; some choose the exact version on which the issue was seen or
reported, while others choose just a high-level name. No strict rules were observed.
Figure 18:
Analyzing CRs: OS vs Escalations
• Analyzing CRs: Developer vs Escalations
While analyzing the developers and the escalated tickets assigned to them,
swati.sinha@hp.com was assigned the highest number of ”Showstopper” escalations,
whereas umesh.sharoff@hp.com was assigned the highest number of ”Y” (Yes)
escalations.
Figure 19:
Analyzing CRs: Developer vs Escalations
After the statistical analysis, we used WEKA to apply a few machine learning algorithms
to the dataset. In the next section we use a few data mining concepts along with
machine learning algorithms to draw meaningful conclusions from the dataset.
4.2 Applying Machine Learning on the Dataset
4.2.1 Classifying the Incidents Data
In this phase, classification and clustering are applied to the data to identify the main
attributes responsible for triggering an escalation. By using the WEKA tool, which offers
various machine learning algorithms to use on the dataset, certain informative conclusions
have been drawn. Classification is a data mining function that assigns items in a collection
to target categories or classes. The goal of classification is to accurately predict the target
class for each case in the data.
Here the target class is the Escalation attribute of each ticket. The class assignments are
known, viz. the severity of a bug, the expectation on the defect resolution, the modules of
the software, etc. By computing these important attributes of a ticket, the classification
algorithm finds the relationship between these values and predicts the value of the target.
In this work, I have chosen the J48 Decision Tree algorithm and the Bayes Network
Classifier algorithm to predict the target class: Escalation.
Classifying using: J48 Tree
The J48 decision tree classifier follows a simple algorithm. In order to classify a
new item, it first creates a decision tree based on the attribute values of the avail-
able training data. Whenever it encounters a set of items (a training set), it identifies the
attribute that discriminates the various instances most clearly. This feature, which is able to
tell the most about the data instances so that they can be classified best, is said to have
the highest information gain. Among the possible values of this feature, if there is any
value for which there is no ambiguity, that is, for which the data instances falling within
its category have the same value for the target variable, then it terminates that branch and
assigns to it the target value that it has obtained.
We used WEKA to apply the decision tree on the Data set.
Attributes selected:
• Escalation (Yellow, Red)
• Expectation (contains the customer expectation regarding the resolution of the ticket
by the support team)
• Modules
• Severity
Results:
SEVERITY SHORT = Urgent
|   ESCALATION = Red: Ops - Monitor Agent (opcmona) (17.0/11.0)*
|   ESCALATION = Yellow: Ops - Logfile Encapsulator (opcle) (21.0/17.0)*
SEVERITY SHORT = High: Installation (92.0/74.0)*
SEVERITY SHORT = Medium: Installation (23.0/20.0)*
SEVERITY SHORT = Low: Ops - Message Agent (opcmsga) (1.0)
• Number of Leaves : 5
• Size of the tree : 7
• Correctly Classified Instances = 32 (20.7792 %)
• Incorrectly Classified Instances = 122 (79.2208 %)
• Kappa statistic = 0.0626
• Mean absolute error= 0.0595
• Root mean squared error= 0.1725
• Relative absolute error= 96.4179
• Root relative squared error= 98.5439
• Total Number of Instances= 154
* The first number is the total number of instances (weight of instances) reaching the leaf.
The second number is the number (weight) of those instances that are misclassified.
Figure 20:
Classifying using: J48 Tree
Figure 21:
Classifying using: J48 Tree: Prefuse Tree
Since the incorrectly classified instances outnumbered the cor-
rectly classified instances, the J48 decision tree did not yield the required answers.
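For reproducibility, the same experiment can be scripted outside the WEKA GUI, for
instance through the RWeka interface to WEKA; a sketch assuming the column names
described earlier:

    library(RWeka)

    # Keep only the attributes selected above; WEKA treats factors as
    # nominal attributes. Column names are assumptions.
    d <- incidents[, c("ESCALATION", "EXPECTATION", "MODULE", "SEVERITY_SHORT")]
    d[] <- lapply(d, as.factor)

    # J48 is WEKA's implementation of the C4.5 decision tree, with
    # ESCALATION as the target class.
    tree <- J48(ESCALATION ~ ., data = d)
    print(tree)
    evaluate_Weka_classifier(tree, numFolds = 10)  # cross-validated accuracy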
Bayes Network Classifier, a Supervised Learning Method
Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier
with strong assumptions of independence among features, called naive Bayes, is competi-
tive with state-of-the-art classifiers such as C4.5. This fact raises the question of whether a
classifier with less restrictive assumptions can perform even better. The referenced work
evaluates approaches for inducing classifiers from data, based on the theory of learning
Bayesian networks. These networks are factored representations of probability distributions
that generalize the naive Bayesian classifier and explicitly represent statements about
independence.
We used WEKA to apply this classifier to the dataset.
Using this classifier, the probability distributions were found:
Attributes selected:
• ESCALATION (Yellow, Red)
• EXPECTATION (Contains the customer expectation from the support team)
• MODULES
• SEVERITY SHORT
Results I: MODULE as the root node
• Red escalation:
Highest probable modules: Lcore - Control, Ops - Message Interceptor (opcmsgi)
• Yellow escalation:
Highest probable modules: Ops - Trap Interceptor, Other, Lcore - BBC
Figure 22: Module as root node
Figure 23: Probability for MODULE
• The modules Ops - Action Agent (opcacta) & Installation have the highest
number of RED escalations reported to the Operations Team.
• When the SEVERITY is:
URGENT: Most probable module is : LCore BBC
HIGH: Most probable module is : Installation
MEDIUM: Most probable module is : LCore- XPL
LOW: Most probable module is : LCore-control, Opcmsgi, LCore-Security,
LCore-Deploy, Operation Agent, Agent Framework
The probability distribution of Modules vs Severity is shown below:
Figure 24: Probability distribution for SEVERITY SHORT
Results II: ESCALATION as the root node
• Correctly Classified Instances= 80; 51.9481%
• Incorrectly Classified Instances= 74; 48.0519%
Figure 25:
ESCALATION as the root node
• Probability that it would be a RED Escalation : 24.20%
• Probability that it would be a YELLOW Escalation : 75.80%
Figure 26:
Probability distribution table for ESCALATION
• Probability of the EXPECTATION from Customer when it’s a RED Escalation:
Answer Question: 64%
Investigation & Hotfix requested: 47.4%
Investigate issue: 39.7%
Create Enhancement: 64%
• Probability of the EXPECTATION from Customer when it’s a YELLOW Escala-
tion:
Answer Question: 63%
Investigation & Hotfix requested: 56.7%
Investigate issue: 32.4%
Create Enhancement: 46%
Figure 27:
Probability distribution table for EXPECTATION
• Probability of the SEVERITY of the incident when it’s a RED Escalation:
URGENT: 44.90%
HIGH: 44.90%
MEDIUM: 09.00%
LOW: 13.00%
• Probability of the SEVERITY of the incident when it’s a YELLOW Escalation:
URGENT: 18.10%
HIGH: 63.40%
MEDIUM: 17.20%
LOW: 13.00%
Results III: SEVERITY SHORT as the root node
• Correctly Classified Instances= 63; 40.9091%
• Incorrectly Classified Instances= 91; 59.0909%
Figure 28:
SEVERITY SHORT as the root node
• Probability distribution for SEVERITY SHORT
Probability that it would be URGENT is 24.7%
Probability that it would be HIGH is 59.3%
Probability that it would be MEDIUM is 15.1%
Probability that it would be LOW is 0.01%
Figure 29:
Probability distribution for SEVERITY SHORT
• Probability distribution for RED escalation:
URGENT: 44.9%
HIGH: 18.8%
MEDIUM: 14.6%
LOW: 25%
• Probability distribution for YELLOW escalation:
URGENT: 55.1%
HIGH: 81.2%
MEDIUM: 85.4%
LOW: 75%
• Probability distribution for Customer Expectations
URGENT: Investigate Issue & Hotfix request
HIGH: Investigate Issue & Hotfix request
MEDIUM: Investigate Issue
LOW: Investigate Issue
Figure 30:
Probability distribution for Customer Expectations
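As with J48, the Bayes network runs can also be scripted. RWeka does not pre-register
BayesNet, but any WEKA classifier can be wrapped by its Java class name; a sketch,
reusing the data frame d from the J48 example:

    library(RWeka)

    # Wrap WEKA's Bayes network classifier by its Java class name.
    BayesNet <- make_Weka_classifier("weka/classifiers/bayes/BayesNet")

    bn <- BayesNet(ESCALATION ~ EXPECTATION + MODULE + SEVERITY_SHORT, data = d)
    print(bn)                                    # network and probability tables
    evaluate_Weka_classifier(bn, numFolds = 10)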
4.2.2 Clustering the Incidents Data
Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense or another) to
each other than to those in other groups (clusters).
The Simple K-Means method is a method of vector quantization, originally from signal
processing, that is popular for cluster analysis in data mining. k-means clustering aims to
partition n observations into k clusters in which each observation belongs to the cluster
with the nearest mean, which serves as a prototype of the cluster. This results in a
partitioning of the data space into Voronoi cells.
We used WEKA to apply this method. The following are the results obtained from it:
• Number of iterations: 3
• Within cluster sum of squared errors: 261.0
• Cluster 0: Yellow,’Investigate Issue & Hotfix requested’, ’Ops - Trap Inter-
ceptor (opctrapi)’, Urgent
• Cluster 1: Red,’Investigate Issue & Hotfix requested’, Perf,High
Figure 31:
Final Cluster Centroids
Figure 32:
Model and evaluation on training set
In the cluster centroids below, the instances are divided according to the Escalation
type of the tickets:
Figure 33:
Cluster Centroids I
In the cluster centroids below, the instances are divided according to the Severity nature
of the tickets:
Figure 34:
Cluster Centroids II
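A sketch of the clustering step via RWeka; SimpleKMeans handles the nominal attributes
used here by taking the mode as a centroid value, and N sets the number of clusters
(data frame d as in the J48 example):

    library(RWeka)

    # Two clusters, matching the experiment above.
    km <- SimpleKMeans(d, control = Weka_control(N = 2))
    print(km)  # prints the final cluster centroids, as in Figure 31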
Predictive Apriori - An Apriori variant
The Predictive Apriori algorithm trades higher support against higher confidence
and calculates the expected accuracy in a Bayesian framework. The result of this algo-
rithm maximizes the expected accuracy of the association rules on future data.
We used WEKA to apply this algorithm to the incident Dataset.
Below are the findings:
(Try I)Attributes selected:
ESCALATION (Yellow, Red)
CUSTOMER ENTITLEMENT
SEVERITY SHORT
Best rules found:
1. CUSTOMER ENTITLEMENT = Premier & SEVERITY SHORT
= Medium (8) ==> ESCALATION = Yellow (8); Accuracy: 95.49%
2. CUSTOMER ENTITLEMENT = Premier & SEVERITY SHORT
= High (38) ==> ESCALATION = Yellow (34); Accuracy: 83.22%
The above rules describe that when the Customer Entitlement is Premier and
the Severity set on the ticket is Medium, there is a 95% chance that the Esca-
lation might be Yellow.
We again used WEKA to apply this algorithm on the Incidents dataset.
(Try II)Attributes selected:
ESCALATION (Yellow, Red)
CUSTOMER ENTITLEMENT
SEVERITY SHORT
MODULE
OPERATING SYSTEM
Best rules found:
1. CUSTOMER ENTITLEMENT = Premier & SEVERITY SHORT
= Medium (11) ==> ESCALATION = Yellow (11); Accuracy: 98.84%
2. CUSTOMER ENTITLEMENT = Premier & OS = Linux (11) ==> ES-
CALATION = Yellow (11); Accuracy: 98.46%
3. MODULE = Ops - Logfile Encapsulator [opcle] (10) ==> ESCALA-
TION = Yellow (10); Accuracy: 98.70%
The above rules describe that:
When the module is Ops - Logfile Encapsulator [opcle], there is a 98.70% chance
that the case will be Yellow escalated.
When the Customer Entitlement is Premier and the Severity of the ticket is
Medium, there is a 98.84% chance that it will be Yellow escalated.
Simple Apriori Algorithm - The Apriori algorithm is an association rule mining
algorithm that was introduced in 1994.
The Apriori algorithm works in several steps. First, the candidate item sets are generated.
Then the database is scanned to check the support of these item sets, which generates
the frequent 1-item sets: in this first scan, the 1-item sets are generated by eliminating
item sets with support below the threshold value. In later passes, the candidates become
k-item sets, generated from the frequent (k-1)-item sets found in the previous pass. The
iteration of database scanning and support counting yields the support and confidence of
each association rule found.
Attributes selected:
ESCALATION (Yellow, Red)
CUSTOMER ENTITLEMENT
MODULES
SEVERITY SHORT
OS
Best rules found:
1. OS is Linux (19); ESCALATION observed is Yellow (18). The con-
fidence achieved is 95%
2. CUSTOMER ENTITLEMENT is Premier & SEVERITY SHORT
is High (38); ESCALATION observed is Yellow (35). The confidence
achieved is 92%
The above rules describe that:
When the OS is Linux, there is 95% confidence that the Escalation is Yellow.
When the Customer Entitlement is Premier and the Severity of the ticket is High,
there is 92% confidence that the Escalation is Yellow.
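These rule-mining runs can likewise be scripted; a sketch via RWeka's Apriori interface
(PredictiveApriori can be wrapped the same way through make_Weka_associator; column
names are assumptions as before):

    library(RWeka)

    # Nominal attributes used in the rule-mining runs above.
    d2 <- incidents[, c("ESCALATION", "CUSTOMER_ENTITLEMENT", "MODULE",
                        "SEVERITY_SHORT", "OS")]
    d2[] <- lapply(d2, as.factor)

    # WEKA's Apriori: C = minimum confidence, N = number of rules to keep.
    rules <- Apriori(d2, control = Weka_control(C = 0.9, N = 10))
    print(rules)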
4.2.3 Text Mining and Natural Language Tool Kit (NLTK)
After evaluating the results acquired from the previous phases, a final conclusion could not
be drawn: they did not answer what actually triggers an incident to escalate. This phase
describes the use of text mining and natural language processing to determine the
triggering factor of an incident.
Figure 35: Total number of Incidents and their Escalation count
The purpose of Text Mining is to process unstructured (textual) information, extract
meaningful numeric indices from the text, and, thus, make the information contained
in the text accessible to the various data mining (statistical and machine learning)
algorithms. Information can be extracted to derive summaries for the words contained
in the documents or to compute summaries for the documents based on the words
contained in them.
We used the tm package in R for text mining the Incident tickets.
The tm package provides the methods for data import, corpus handling, pre-processing,
metadata management and creation of term-document matrices.
The main crux lay in finding out what made an Incident ticket get
RED escalated from other escalation states.
The following is the step-by-step process for finding out what might help to identify
the reason for an escalation. We took the dataset (Incidents.csv) and performed the
following tasks using R:
Data Import: Load the text to a corpus
Inspecting Corpora: to get a concise overview of the corpus
Transformations: Modify the corpus - e.g., stemming, stop-word removal, etc.
Creating Term-Document Matrices
Operations on Term-Document Matrices e.g.: Calculating the word fre-
quencies, Plot the word frequencies, word cloud, etc.
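Continuing the pre-processing sketch from section 3.3, the last two steps might look like
this in R (the frequency threshold is illustrative):

    # Build the term-document matrix from the pre-processed corpus.
    tdm <- TermDocumentMatrix(corpus)

    # Word frequencies, sorted in decreasing order.
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    head(freq, 20)
    findFreqTerms(tdm, lowfreq = 50)  # words above an illustrative threshold

    # Plot the most frequent words, as in the figures that follow.
    barplot(head(freq, 15), las = 2)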
Observations made on GREEN tickets which were RED escalated: It was
observed that opcle was the most talked-about module. It can also be observed that the
words please, hotfix & support were used the most in the mail chain exchanged between
the customer and the developer.
Figure 36:
Words with highest frequency mined on GREEN tickets escalated to RED
Figure 37:
GREEN tickets escalated to RED
Observations made on GREEN tickets which were YELLOW escalated: It
was observed that support was the most frequent word. It can also be observed that the
words issue & time were used the most in the mail chain exchanged between the
customer and the developer.
Figure 38:
Words with highest frequency mined on GREEN tickets escalated to YELLOW
Figure 39:
GREEN ticket escalated to YELLOW
Observations made on YELLOW tickets which were RED escalated: This
corpus did not yield anything notable, but it did bring out the name of the
developer most associated with the resolution of the tickets.
Figure 40:
Words with highest frequency mined on YELLOW tickets escalated to RED
Figure 41:
YELLOW ticket escalated to RED
Observations made on RED tickets which were escalated: It was observed that opc-
mona was the most talked-about module. It can also be observed that the words waiting,
hotfix & issue were used the most in the mail chain exchanged between the customer
and the developer.
Figure 42: Observations made on RED tickets which were escalated
Figure 43:
Plotting the highest mined words
Observations made on the Whole RED Escalated Tickets: There were in total
125 RED escalated entries in the Incidents. Text mining on all of the 125 entries
revealed the details below:
Figure 44:
Words with highest frequency mined
The words issue, please, support & escalation were used the most in the mail
chain exchanged between the customer and the team.
Figure 45:
Plotting the words with highest frequency mined
Observations made on the Whole GREEN Tickets: There were in total
3831 GREEN entries in the Incidents. Text mining on all of these entries
revealed the details below:
Figure 46:
Words with highest frequency mined
Mining the whole GREEN Escalated tickets did not yield valuable information.
Figure 47:
Plotting the words with highest frequency mined
The above observations show that the mail chains in which the keywords please,
hotfix & support appear most frequently are the ones most likely to be converted to a RED
escalation. We then used these keywords and built a program which takes the Incident
data dump as input and scans the email chains. As the frequency of these keywords
increases, the program alerts the user when it crosses the threshold limit. The threshold
limit can be adjusted by the developer based on the ongoing trend.
Figure 48:
Output of the program
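The report's figure shows only the program's output; below is a minimal sketch of such a
keyword-threshold alert in R, with the keyword list taken from the observations above and
the threshold an adjustable assumption:

    # Alert when escalation-associated keywords dominate a ticket's mail chain.
    keywords  <- c("please", "hotfix", "support", "escalation", "waiting")
    threshold <- 0.05  # fraction of words that are keywords; tune to the trend

    alert_score <- function(text) {
      words <- tolower(unlist(strsplit(text, "[^[:alnum:]]+")))
      words <- words[nchar(words) > 0]
      score <- sum(words %in% keywords) / max(length(words), 1)
      if (score > threshold)
        message(sprintf("ALERT: keyword density %.1f%% exceeds threshold",
                        100 * score))
      score
    }

    scores <- vapply(incidents$NOTE_CUSTOMER, alert_score, numeric(1))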
5 Results and Conclusions
By text mining and applying machine learning algorithms on the incident dataset, we ob-
tained the below results:
• The mail chain of a ticket which is going to be escalated to Red will contain these
words in high occurrences : please, hotfix, support.
• The software modules: Ops - Logfile Encapsulator (Opcle), Ops - Action Agent (op-
cacta) & Installation had the highest number of red escalations reported to the team
as incidents. Whereas Ops - Monitor Agent (opcmona) & Installation had the highest
showstopper escalations for change requests.
• By applying the Predictive Apriori algorithm on the Incident dataset, we observed the
following:
We got a confidence of 98.70% when a module reported was ’Opcle’ and the
Escalation of the case was ’Yellow’.
We got a confidence of 98.84% for Escalation as ’Yellow’, when the Customer
Entitlement was ’Premier’ and Severity was ’Medium’
• We got two clusters formed by using Simple K - Means method:
Cluster 1: Escalation is ’Yellow’, Customer expectation is ’Investigate issue &
Hotfix required’, Software module is ’Ops - Trap Interceptor’ and Severity is
’Urgent’.
Cluster 2: Escalation is ’Red’, Customer expectation is ’Investigate issue & Hotfix
required’, Software module is ’Perf’ and Severity is ’High’.
For an engineering team it is really important to avoid any major Red escalations. The
team gets a lot of incidents which need to be resolved in a limited time. Since the number of
incidents is large, it becomes hard for the team to keep track of all the incident issues with
respect to the criticality and the severity of each incident. By implementing such predictive
mechanisms, an incident which will turn RED can be flagged to the team. This would
help the manager to allocate appropriate resources based on the criticality of the tickets
coming in. This would help in the resolution of the incident ticket within the stipulated
time, avoiding any unwanted escalations, and would indeed help in maintaining the trust
of the customer as well.
More accurate and varied results could have been achieved, but the missing data in the
dataset limited us. Due to the discrepancies in the data (data being shifted by 3-4
columns), we had to ignore such inconsistent entries in the dataset when performing the
statistical analysis.
5.1 Future Work
The use of NLTK proved to be very helpful in extracting meaningful conclusions
from the tickets dataset. NLTK can be used to analyze the real-time behavior of the tickets
coming in to the team. This analysis can be used to provide proactive resolutions to the
customers, thus preventing the tickets from getting escalated.
References
[BAG] Leo Breiman, Bagging Predictors, Machine Learning, 1996
[SFTWR] Ishani Arora, Vivek Tetarwal, Anju Saha, Software Defect Prediction
[PRBD] Thomas Zimmermann, Nachiappan Nagappan, Predicting Bugs from History
[SFRC] Stamatia Bibi, Grigorios Tsoumakas, Ioannis Vlahavas, Ioannis Stamelos, Prediction
Using Regression via Classification
[HIDM] Felix Salfner, Predicting Failures with Hidden Markov Models
[IEMD] Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning
Tools and Techniques, Third Edition, Elsevier Inc., 2011
[EFRB] Eibe Frank (Computer Science Department, University of Waikato, New Zealand) and
Remco R. Bouckaert (Xtal Mountain Information Technology, Auckland, New Zealand),
Naive Bayes for Text Classification with Unbalanced Classes
[JMJ11] Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques,
Third Edition, 2011
[IT2008] Irina Tudor, “Association Rule Mining as a Data Mining Technique”, 2008
[PMDM] Norman Fenton, Paul Krause and Martin Neil, A Probabilistic Model for Software
Defect Prediction, 2006
[BLMK] Billy Edward Hunt, Jr., Jennifer J. Kirkpatrick, Richard Allan Kloss et al., Software
Defect Prediction, US patent, 2014
[WEKA] WEKA Online, www.cs.waikato.ac.nz/ml/weka
[PEFS] Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, Andreas Zeller, Predicting Effort
to Fix Software Bugs, 2006
[CSDE] Victor S. Sheng, Bin Gu, Wei Fang, Jian Wu, Cost-Sensitive Learning for Defect
Escalation, 2001
[DSNA] Jaideep Srivastava, Muhammad A. Ahmad, Nishith Pathak, David Kuo-Wei Hsu, Data
Mining Based Social Network Analysis from Online Behavior, 2008
[PHMM] Felix Salfner, Predicting Failures with Hidden Markov Models, 2005

 
Machine_translation_for_low_resource_Indian_Languages_thesis_report
Machine_translation_for_low_resource_Indian_Languages_thesis_reportMachine_translation_for_low_resource_Indian_Languages_thesis_report
Machine_translation_for_low_resource_Indian_Languages_thesis_reportTrushita Redij
 
Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...Lorenzo D'Eri
 
Abstract contents
Abstract contentsAbstract contents
Abstract contentsloisy28
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Priyanka Kapoor
 
Specification of the Linked Media Layer
Specification of the Linked Media LayerSpecification of the Linked Media Layer
Specification of the Linked Media LayerLinkedTV
 

Similar to A Machine Learning approach to predict Software Defects (20)

E.M._Poot
E.M._PootE.M._Poot
E.M._Poot
 
Work Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel BelaskerWork Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel Belasker
 
QBD_1464843125535 - Copy
QBD_1464843125535 - CopyQBD_1464843125535 - Copy
QBD_1464843125535 - Copy
 
Chat Application [Full Documentation]
Chat Application [Full Documentation]Chat Application [Full Documentation]
Chat Application [Full Documentation]
 
Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...
 
Agentless Monitoring with AdRem Software's NetCrunch 7
Agentless Monitoring with AdRem Software's NetCrunch 7Agentless Monitoring with AdRem Software's NetCrunch 7
Agentless Monitoring with AdRem Software's NetCrunch 7
 
diss
dissdiss
diss
 
AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdf
 
Security in mobile banking apps
Security in mobile banking appsSecurity in mobile banking apps
Security in mobile banking apps
 
bkremer-report-final
bkremer-report-finalbkremer-report-final
bkremer-report-final
 
Machine_translation_for_low_resource_Indian_Languages_thesis_report
Machine_translation_for_low_resource_Indian_Languages_thesis_reportMachine_translation_for_low_resource_Indian_Languages_thesis_report
Machine_translation_for_low_resource_Indian_Languages_thesis_report
 
Knapp_Masterarbeit
Knapp_MasterarbeitKnapp_Masterarbeit
Knapp_Masterarbeit
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
 
Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...
 
Abstract contents
Abstract contentsAbstract contents
Abstract contents
 
Thesis_Report
Thesis_ReportThesis_Report
Thesis_Report
 
Malware Analysis
Malware Analysis Malware Analysis
Malware Analysis
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)
 
Specification of the Linked Media Layer
Specification of the Linked Media LayerSpecification of the Linked Media Layer
Specification of the Linked Media Layer
 
CS4099Report
CS4099ReportCS4099Report
CS4099Report
 

Recently uploaded

Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 

Recently uploaded (20)

9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 

A Machine Learning approach to predict Software Defects

  • 1. A Machine Learning approach to predict Software Defects Chetan Hireholi February 19, 2017
  • 2. Abstract Software engineering teams are not only involved in developing new versions of a product but are often involved in fixing customer reported defects. Customers report issues faced by a particular software and some of these issues may actually require the engineering team to analyze and potentially provide a fix for the issue. The number of defects that a software engineering team has to analyze is significant and teams often prioritize the order of the defects based on the customer’s priority and to the extent the defect impacts the business operations of the customer. Often, it is likely that the engineering team may not truly understand the business impact that a defect is likely to have and this results in the customer escalating the defect up the engineering team’s management chain seeking more immediate attention to their problem. Such escalated defects tend to consume a lot of engineering bandwidth and increase defect handling costs; further such escalations impact existing plans as all crit- ical resources are used to handle these cases. Software escalations are classified under three categories: Red, Yellow and Green. The software defect report containing a high business value is prioritized and is marked as Red, Yellow being the neutral one, which might be escalated if appropriate attention is not given and the Green reports have the low priority. The engineering team understands the nature of the software escalation and allocates the resources appropriately. The objective of this project is to be able to analyze software defects and predict the defects that are likely to be escalated by the customer. This would permit the engineering team to be alerted about the situation and take proactive measures that will give better support to the customers. For the purpose of our analysis, we have used the defects database provided by Hewlett-Packard (HP) India. We have used the concepts of R programming for cleaning the database and to pre-process it. We then extracted keywords using natural language processing and then used machine learning (J 48 decision tree, Na¨ıve Bayes and Simple K Means) to predict the escalations. Thus by combining the key words and the tickets received to the team, we can predict the nature of the Escalation and alert the engineering team so that they can take respective steps appropriately. i
  • 3. Acknowledgment I would like to take this to thank a lot of eminent personalities, without whose constant encouragement and support, the endeavor of mine would not have been successful. Firstly, I would like to thank the PES University, for having ”Final Year Project” as a part of my curriculum, which gave me a wonderful opportunity to work on research and presentation abilities, and provided excellent facilities, without which, this project could not have acquired the orientation, it has now. At the outset I would like to venerate Prof. Nitin V. Pujari, Chairperson, PES University, who toned me encompassing the attitude towards the subject implied in this literary work. It gives me immense pleasure to thank Dr. K. V. Subramaniam, Department of Computer Science and Engineering, PES University for his continuous support, advice and guidance. I would also like to thank Dr. Jayashree R., Department of Computer Sci- ence and Engineering, PES University for her initiative and support which made this work possible. ii
Abstract i
Acknowledgment ii
List of Figures v
1 Introduction 1
1.1 Problem Definition 2
1.2 Motivation 3
2 Literature Survey 4
2.1 A Probabilistic Model for Software Defect Prediction 4
2.2 Predicting Bugs from History 5
3 Exploring the Dataset 8
3.1 Incident Dataset 8
3.2 Change Requests (CR) Dataset 10
3.3 Cleaning the Dataset 11
4 Defect Escalation Analysis 14
4.1 Statistical Analysis 14
4.1.1 Analyzing the Incidents Dataset 14
4.1.2 Analyzing Change Requests Dataset 21
4.2 Applying Machine Learning on the Dataset 26
4.2.1 Classifying the Incidents Data 26
4.2.2 Clustering the Incidents Data 36
4.2.3 Text Mining and Natural Language Tool Kit (NLTK) 42
5 Results and Conclusions 51
5.1 Future Work 52
Bibliography 53
List of Figures

1 Predicting Bugs from History: commonly used complexity metrics 5
2 Life Cycle of a Defect 9
3 Cleansing in OpenRefine 12
4 Data Transformation in Microsoft Excel 12
5 Distribution of Incident Escalations 14
6 Analyzing RED Incidents: Customers vs Escalations 15
7 Analyzing RED Incidents: Modules vs Escalations 17
8 Analyzing RED Incidents: Software release vs Escalations 17
9 S/w release vs Escalations 18
10 Analyzing RED Incidents: OS vs Escalations 18
11 Analyzing RED Incidents: Developer vs Escalations 19
12 Other observations made on Incidents 20
13 Analyzing CR data 21
14 Analyzing CR data 21
15 Analyzing CRs: Customers vs Escalations 22
16 Analyzing CRs: Modules vs Escalations 23
17 Analyzing CRs: S/w release vs Escalations 23
18 Analyzing CRs: OS vs Escalations 24
19 Analyzing CRs: Developer vs Escalations 25
20 Classifying using: J48 Tree 28
21 Classifying using: J48 Tree: Prefuse Tree 28
22 Module as root node 29
23 Probability for MODULE 30
24 Probability distribution for SEVERITY SHORT 31
25 ESCALATION as the root node 31
26 Probability distribution table for ESCALATION 32
27 Probability distribution table for EXPECTATION 33
28 SEVERITY SHORT as the root node 33
29 Probability distribution for SEVERITY SHORT 34
30 Probability distribution for Customer Expectations 35
31 Final Cluster Centroids 36
32 Model and evaluation on training set 37
33 Cluster Centroids I 37
34 Cluster Centroids II 38
35 Total number of Incidents and their Escalation count 42
36 Words with highest frequency mined on GREEN tickets escalated to RED 44
37 GREEN tickets escalated to RED 44
38 Words with highest frequency mined on GREEN tickets escalated to YELLOW 45
39 GREEN ticket escalated to YELLOW 45
40 Words with highest frequency mined on YELLOW tickets escalated to RED 46
41 YELLOW ticket escalated to RED 46
42 Observations made on RED tickets that were escalated 47
43 Plotting the highest mined words 47
44 Words with highest frequency mined 48
45 Plotting the words with highest frequency mined 48
46 Words with highest frequency mined 49
47 Plotting the words with highest frequency mined 49
48 Output of the program 50
1 Introduction

In spite of diligent planning, documentation and proper process adherence in software development, occurrences of defects are inevitable. In today's cutting-edge competition, it is important to make conscious efforts to control and minimize these defects by using techniques that allow in-process quality monitoring and control. Predicting the total number of defects before testing begins improves the quality of the product being delivered, and helps in planning and decision making for future project releases.

Defect prediction in software is viewed as one of the most useful and cost-efficient operations. Software developers see it as a vital phase on which the quality of the product being developed depends. It has played a major part in countering the allegation that the software industry is incapable of delivering requirements within budget and on time. Besides this, client feedback on product quality has shown a large shift from unsatisfactory to satisfactory. Today, many data miners have replaced the earlier statistical approaches to defect prediction. The basis of data mining is the classification model, which places a component in one of two classes: fault-prone or non-fault-prone.

Defects reported to the engineering team carry important information that may lead to various important decisions. These defects are reported as tickets, and each ticket carries the nature of the escalation, which is based on the business value. These tickets may also give valuable information regarding:

• how engineering defects are collected as part of the Quality Assurance (QA) cycle of a software product or software application;
• product management, which manages the usage and direction of the product;
• escalations: how often an escalation happens, and which modules of the application are buggy and require more attention;
• who has worked on the code base during development, QA and defect/ticket/incident fixing.
1.1 Problem Definition

• Determine the causes of defects during the engineering phase that may lead to the escalation of customer support cases.
While most software defects are corrected and tested as part of the prolonged software development cycle, enterprise software vendors often have to release software products before all reported defects are corrected, due to deadlines and limited resources. A small number of these reported defects will be escalated by customers whose businesses are seriously impacted. Escalated defects must be resolved immediately and individually by the software vendors, at a very high cost. The total costs can be even greater when the loss of reputation, satisfaction, loyalty and repeat revenue is included.

• Build a recommendation engine that alerts on escalation, based on the nature of the defects, to create an alert system for the team given one such escalation that has happened.
With this alerting mechanism, the team can take proactive steps to prevent an escalation before it happens.
1.2 Motivation

• Market research companies, notably Gartner and Forrester, have conducted surveys predicting that 80% of IT budgets go toward the maintenance of applications, and that new projects (funded from the remaining 20% of the budget allocated to IT) have a dismal success rate of 2%. This research hopes to change the way ticket information, in our opinion a wealth of information that has been neglected so far, is looked at.

• Mine the information that is hidden and lost in the ticket dump. The mined information will then be useful to deduce important details, which would help the team's project manager to plan activities appropriately.
2 Literature Survey

During this survey we found many works related to software bug prediction, which helped us understand the kind of knowledge that can be captured from bugs. The following is the work carried out in the area of software defect prediction.

2.1 A Probabilistic Model for Software Defect Prediction

Although a number of approaches have been taken to quality prediction for software, none have achieved widespread applicability. The authors' aim here is to produce a single model that combines the diverse forms of, often causal, evidence available in software development in a more natural and efficient way than done previously. The authors use graphical probability models (also known as Bayesian Belief Networks) as the appropriate formalism for representing this evidence. They use the subjective judgments of experienced project managers to build the probability model, and use this model to produce forecasts about software quality throughout the development life cycle. Moreover, the causal or influence structure of the model mirrors the real-world sequence of events and relations more naturally than can be achieved with other formalisms. We used WEKA to apply the Bayesian network classifier: we selected the attributes Escalation, Expectation, Modules and Severity, and captured the results by rotating the attributes as the root nodes. A disadvantage of a reliability model of this complexity is the amount of data needed to support a statistically significant validation study. A more detailed description of the application of Bayesian classification is given in section 4.2.1. [PMDM]
2.2 Predicting Bugs from History

Version and bug databases contain a wealth of information about software failures: how the failure occurred, who was affected, and how it was fixed. Such defect information can be automatically mined from software archives, and it frequently turns out that some modules are far more defect-prone than others. How do these differences come to be? The authors have researched how code properties such as (a) code complexity, (b) the problem domain, (c) past history and (d) process quality affect software defects, and how their correlation with defects in the past can be used to predict future software properties: where the defects are, how to fix them, and the associated cost. [PRBD]

Figure 1: Commonly used complexity metrics
Learning from history means learning from successes and failures, and how to make the right decisions in the future. In our case, the history of successes and failures is provided by the bug database: systematic mining uncovers which modules are most prone to defects and failures. Correlating defects with complexity metrics or the problem domain is useful in predicting problems for new or evolved components. Learning from history has one big advantage: one can focus on the aspect of history that is most relevant to the current situation. Thus the history data provided to us by Hewlett-Packard (HP), consisting of the incident and change request data, proved helpful during the statistical analysis. A more descriptive coverage of the statistical analysis is given in section 4.1; the dataset given to us by HP is explained in detail in section 3.

Some more research work carried out on a similar problem domain, using the machine learning approach:

• Predicting Effort to Fix Software Bugs [PEFS]: The authors used the k-nearest-neighbor approach to predict the effort put in by an engineer to fix software bugs. Their technique leverages existing issue-tracking systems: given a new issue report, they search for similar, earlier reports and use their average time as a prediction. This technique is not very useful in our problem domain, since the work focuses on assigning the job to a resource for which the effort is estimated.

• Cost-Sensitive Learning for Defect Escalation [CSDE]: The authors established that cost-sensitive decision trees are the best method for producing the highest positive net profit. Our approach to the defect reports is different, since we focus not on the cost aspect but on the escalation of a defect report. We have used decision trees in the form of Apriori algorithms to derive rules with respect to escalation.
• Predicting Failures with Hidden Markov Models [PHMM]: The authors have come up with an approach using Hidden Markov Models (HMMs) to recognize the patterns in failures. Since HMMs give accurate outputs only on smaller attribute sets, we cannot use this approach to recognize the defect patterns in the defect reports.

• Data Mining Based Social Network Analysis from Online Behavior [DSNA]: The authors used neural networks and performed sentiment analysis on social networks to predict people's online behavior. The approach used in the sentiment analysis gave me insights into the Natural Language Toolkit (NLTK) and how it can be used to find the sentiment of user data. This motivated me to pick up NLTK to analyze the tickets dataset provided by Hewlett-Packard. Detailed information on the application of NLTK is given in section 4.2.3.
3 Exploring the Dataset

This section describes the exploration of the dataset acquired from Hewlett-Packard (HP) and its close analysis. This is the beginning phase of the project, where the data is understood and meaningful analysis is done; it gives a high-level overview of the datasets. HP provided two datasets: a customer Incident dataset and a Change Requests (CR) dataset. The Incident dataset held the customer cases, including troubleshooting errors, field issues, installation issues, environment issues and all other cases related to the software. When a customer logs a unique case that the team identifies as a change request, a corresponding entry is made in the CR dataset. The characteristics of the Incident and CR datasets are given below.

3.1 Incident Dataset

• Contains 153 headers with 6,433 entries. A few important characteristics are:
– Each entry has an owner assigned to it.
– Each entry has a unique ID called ISSUEID.
– The lifeline of each entry is captured and represented numerically (e.g. DAYS TO OPEN, DAYS TO FIXED, etc.).
– Each entry has an Escalation set by the CPE Support Team after consulting the customer. The escalation comes in three categories: RED, YELLOW and GREEN. RED is the highest-priority ticket, YELLOW is a potentially important ticket, and GREEN is a ticket with less business impact than the red and yellow tickets.
– Each entry has an Expectation set by the CPE Support Team based on the customer's inputs (e.g. Investigate Issue & Hotfix requested, Answer Question, Create Enhancement, Investigate Issue, etc.).
– The mail communication between the developer and the customer can be found under NOTE CUSTOMER. This field contains all the information about the defect being tracked by the team.
– Each entry has a Severity set by the CPE Support Team after consulting the customer. Severity comes in three stages: Low, Medium and High.
– Each entry has a date describing when the case was escalated, called QUIXY ESCALATED ON DATE and QUIXY ESCALATED YELLOW DATE.
– Each entry has a RESOLUTION attribute, which describes the resolution of the case.

On observing the Incident dataset, we needed to come up with a life cycle of how a defect is tracked by the team. Figure 2 illustrates the life cycle of a defect; this process is currently in use by the team.

Figure 2: Life Cycle of a Defect
3.2 Change Requests (CR) Dataset

The CR dataset contains the incident cases that were identified as change requests.

• Contains 153 headers with 11,960 entries. A few important characteristics are:
– Each entry has an owner assigned to it.
– Each entry has a unique ID called ISSUEID.
– The lifeline of each entry is captured and represented numerically (e.g. DAYS TO OPEN, DAYS TO FIXED, etc.).
– Each entry has an Escalation set by the CPE Support Team after consulting the customer; it is recorded as Y (escalated) or N (not escalated).
– Each entry has an Expectation set by the CPE Support Team based on the customer's inputs (e.g. Investigate Issue & Hotfix requested, Answer Question, Create Enhancement, Investigate Issue, etc.).
– The mail communication between the developer and the customer can be found under NOTE CUSTOMER. This field contains all the information about the defect being tracked by the team.
– Each entry has a Severity set by the CPE Support Team after consulting the customer. Severity comes in three stages: Low, Medium and High.
– Each entry has a date describing when the case was escalated, called QUIXY ESCALATED ON DATE.
– Each entry has a RESOLUTION attribute, which describes the resolution of the case.
3.3 Cleaning the Dataset

After receiving the huge dataset, the next step was to clean the data. There were many discrepancies in the dataset, viz. the presence of non-numeric values in the date fields, and rows whose data had shifted left by two to four columns; because of this shift, the data was not aligned with its headers. The following steps were performed to clean the dataset.

• Removing Discrepancies
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. This tool brought down the immense cleaning effort and made it easy to explore large datasets. It provides functions to transform the data and make it uniform. For example, in the "Customer" column there were different names for a single company; OpenRefine helps collapse the varied names into one, unifying "Vodafone", "Vodafone Inc" and "vodafone" into the single name "Vodafone". It also helps remove special characters, make the text readable, and take care of case sensitivity: the contents of multiple rows can be edited using the "Text Facet" feature, shown in Figure 3.

• Removing the Unwanted Data
Microsoft Excel helped in converting the whole dataset into a table. Applying filters on the table and filtering the data excluded all the noise. It was then also possible to extract pivot tables, which supported the statistical analysis of the dataset.
Figure 3: Cleansing in OpenRefine

Figure 4: Data Transformation in Microsoft Excel
• Removing Stop Words
The text mining (tm) package for the R language helped convert the dataset into a corpus. The pre-processing of the data is done efficiently by the tm package. The various text transformations offered by the package include removeNumbers, removePunctuation, removeWords, stemDocument and stripWhitespace.
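As a minimal sketch, these transformations chain together as follows, assuming the ticket mail text sits in a NOTE_CUSTOMER column of incidents.csv (the file and column names here are illustrative, not the actual dataset headers):

    # Pre-processing the ticket text with the tm package.
    library(tm)
    library(SnowballC)  # stemDocument() relies on this stemmer

    incidents <- read.csv("incidents.csv", stringsAsFactors = FALSE)
    corpus <- VCorpus(VectorSource(incidents$NOTE_CUSTOMER))

    corpus <- tm_map(corpus, content_transformer(tolower))       # normalize case
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop-word removal
    corpus <- tm_map(corpus, stemDocument)
    corpus <- tm_map(corpus, stripWhitespace)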
4 Defect Escalation Analysis

After cleaning the data, the next step was a statistical analysis of it, which unveils some under-the-hood facts.

4.1 Statistical Analysis

The initial statistical analysis of the data received from HP (incidents.csv and crs.csv) was carried out in Microsoft Excel (MS Excel). The pivot charts obtained from MS Excel helped in graphically analyzing the huge datasets.

4.1.1 Analyzing the Incidents Dataset

incidents.csv contained all the customer incidents reported to the team.

Figure 5: Distribution of Incident Escalations

There are in total 125 RED-escalated incidents, 3,831 GREEN incidents and 329 YELLOW incidents in the dataset.
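This distribution can be checked in one line of base R; the column name ESCALATION is an assumption, and the counts in the comment are those reported above:

    incidents <- read.csv("incidents.csv", stringsAsFactors = FALSE)
    table(incidents$ESCALATION)  # expected, per the figures above: Green 3831, Red 125, Yellow 329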
• Analyzing RED Incidents: Customers vs Escalations
In certain situations, the escalation of a task is necessary: for example, a user is doing a task and is unable to complete it within a certain period of time. In such cases, where the user/customer is completely blocked, a case is RED-escalated. These RED-escalated incident cases are high-priority cases and have to be addressed at high severity. The companies RHEINENERGIE, HEWLETT PACKARD and DEUTSCHE BANK had the highest numbers of RED escalations.

Figure 6: Analyzing RED Incidents: Customers vs Escalations
• Company behavior analysis: RHEINENERGIE
RHEINENERGIE had the maximum RED escalations among the customers. There were around 28 incident cases registered with the Operations Team. The patterns observed in those 28 incidents are:
– Out of the 28 incidents, 6 were RED-escalated: a 21.43% chance that an incident logged will be a RED escalation.
– Most reported modules: Ops - Monitor Agent (opcmona), 7 cases, of which 3 were RED; Installation, 6 cases; Perf - Collector, 3 cases.
– Average number of days a single incident was handled: 73.5 days.
– Number of incidents that moved to CRs: 15, i.e. 53.57% of the incidents moved to CRs.
• Analyzing RED Incidents: Modules vs Escalations
The software modules Ops - Action Agent (opcacta) and Installation had the highest numbers of RED escalations reported to the Operations Team.

Figure 7: Analyzing RED Incidents: Modules vs Escalations

• Analyzing RED Incidents: Software Release vs Escalations
Of all the 125 RED-escalated cases, Operations Agent version 8.6 had the maximum number of incident reports, followed by versions 11.14 and 11.02. These software release versions had the maximum escalations among the incidents.

Figure 8: Analyzing RED Incidents: Software release vs Escalations
The incident frequencies of the other software versions are shown below:

Figure 9: S/w release vs Escalations

• Analyzing RED Incidents: Operating System (OS) vs Escalations
It can be observed that 83 entries out of the 125 RED escalations had a blank OS field. After talking to an HP developer, we learned that customers tend to skip the OS field. We further observed that no specific OS version was given that could aid troubleshooting: customers tend to enter just the general OS name, viz. Windows, Solaris, etc., instead of the full name with version details.

Figure 10: Analyzing RED Incidents: OS vs Escalations
• Analyzing RED Incidents: Developer vs Escalations
This describes the developer associated with each incident case. Below is the distribution of incidents among the developers. The developer assigned the highest number of incident cases is prasad.m.k hp.com.

Figure 11: Analyzing RED Incidents: Developer vs Escalations

• Other observations made on Incidents
Calculating the age of a single ticket was challenging. The data dump had two columns, OPEN-IN-DATE and CLOSED-IN-DATE; the difference of these two dates should give the total days taken to close the ticket. But this contradicted the DAYS SUPPORT TO CPE column in the table: the two values did not match. Below is a pictorial representation of the same.
Figure 12: Other observations made on Incidents
4.1.2 Analyzing Change Requests Dataset

The second dataset provided by HP was the CR data (incident cases that were change requests). These are the cases that are added to the product backlog. Each entry had an escalation attached to it; the three escalation values were Showstopper, Yes and No. Below is the distribution of CRs and the nature of the escalation each carried. There were in total 10,387 CR entries: 10,219 cases did not escalate, 75 cases escalated and 93 were marked as Showstopper.

Figure 13: Analyzing CR data

The Showstopper escalations were high-priority cases.

Figure 14: Analyzing CR data
• Analyzing CRs: Customers vs Escalations
The company TATA CONSULTANCY SERVICES LTD. had the maximum Showstopper escalations, whereas Allegis, NORTHROP GRUMMAN and PepperWeed are the companies with the highest "Y" (yes) escalations.

Figure 15: Analyzing CRs: Customers vs Escalations

• Analyzing CRs: Modules vs Escalations
The software modules Ops - Monitor Agent (opcmona) and Installation had the highest Showstopper escalations, whereas the modules Installation and Lcore had the highest numbers of "Y" (yes) escalations.
Figure 16: Analyzing CRs: Modules vs Escalations

• Analyzing CRs: Software Release vs Escalations
Software release version 11 had the highest Showstopper and "Y" (yes) escalations, while software release version 8.6 was the second highest in both categories.

Figure 17: Analyzing CRs: S/w release vs Escalations
• Analyzing CRs: OS vs Escalations
The software running on the Windows OS had the maximum numbers of both Showstopper and "Y" (yes) escalations.
Note: submitters of these tickets tend to fill in the OS field as they please; some choose the exact version where the issue was seen or reported, and some choose only a high-level name. No strict rules were observed.

Figure 18: Analyzing CRs: OS vs Escalations

• Analyzing CRs: Developer vs Escalations
Analyzing the developers and the escalated tickets assigned to them: swati.sinha hp.com was assigned the highest number of Showstopper escalations, whereas umesh.sharoff hp.com was assigned the highest number of "Y" (yes) escalations.
Figure 19: Analyzing CRs: Developer vs Escalations

After the statistical analysis, we used WEKA to apply a few machine learning algorithms to the dataset. In the next section we use a few data mining concepts along with machine learning algorithms to draw meaningful conclusions from the dataset.
4.2 Applying Machine Learning on the Dataset

4.2.1 Classifying the Incidents Data

In this phase, classification and clustering are applied to the data to learn the main attributes responsible for triggering an escalation. Using the WEKA tool, which offers various machine learning algorithms to use on the dataset, certain informative conclusions have been drawn. Classification is a data mining function that assigns items in a collection to target categories or classes; the goal of classification is to accurately predict the target class for each case in the data. Here the target class is the Escalation attribute of each ticket. The class assignments are known, viz. the severity of a bug, the expectation on the defect resolution, the modules of the software, etc. By computing these important attributes of a ticket, the classification algorithm finds the relationship between their values and predicts the value of the target. In this work, I have chosen the J48 decision tree algorithm and the Bayes network classifier algorithm to predict the target class, Escalation.

Classifying using: J48 Tree

The J48 decision tree classifier follows a simple algorithm. To classify a new item, it first creates a decision tree based on the attribute values of the available training data. Whenever it encounters a training set, it identifies the attribute that discriminates the various instances most clearly; the feature that tells the most about the data instances, so that they can be classified best, is said to have the highest information gain. Among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then that branch is terminated and assigned the target value obtained. We used WEKA to apply the decision tree to the dataset.

Attributes selected:

• Escalation (Yellow, Red)
• Expectation (the customer's expectation of the support team on the resolution of the ticket)
• Modules
• Severity

Results:

• SEVERITY SHORT = Urgent
– ESCALATION = Red: Ops - Monitor Agent (opcmona) (17.0/11.0)*
– ESCALATION = Yellow: Ops - Logfile Encapsulator (opcle) (21.0/17.0)*
• SEVERITY SHORT = High: Installation (92.0/74.0)*
• SEVERITY SHORT = Medium: Installation (23.0/20.0)*
• SEVERITY SHORT = Low: Ops - Message Agent (opcmsga) (1.0)
• Number of leaves: 5
• Size of the tree: 7
• Correctly classified instances = 32 (20.7792%)
• Incorrectly classified instances = 122 (79.2208%)
• Kappa statistic = 0.0626
• Mean absolute error = 0.0595
• Root mean squared error = 0.1725
• Relative absolute error = 96.4179%
• Root relative squared error = 98.5439%
• Total number of instances = 154
* The first number is the total weight of instances reaching the leaf; the second number is the weight of those instances that are misclassified.

Figure 20: Classifying using: J48 Tree

Figure 21: Classifying using: J48 Tree: Prefuse Tree

Observing that the incorrectly classified instances outnumbered the correctly classified ones, the J48 decision tree did not yield the required answers.
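For reference, the same J48 experiment can be reproduced outside the WEKA GUI through the RWeka bindings for R. This is only a sketch: the file and column names are assumptions meant to match the four attributes selected above.

    # J48 (C4.5) decision tree over the selected ticket attributes.
    library(RWeka)

    tickets <- read.csv("incidents.csv", stringsAsFactors = TRUE)
    tickets <- tickets[, c("ESCALATION", "EXPECTATION", "MODULE", "SEVERITY_SHORT")]

    j48 <- J48(ESCALATION ~ ., data = tickets)
    print(j48)    # the induced tree; leaves annotated weight/misclassified, as above
    summary(j48)  # accuracy, kappa statistic and error measures on the training set
    evaluate_Weka_classifier(j48, numFolds = 10)  # 10-fold cross-validation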
Bayes Network Classifier: A Supervised Learning Approach

Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong independence assumptions among features, called naive Bayes, is competitive with state-of-the-art classifiers such as C4.5. This fact raises the question of whether a classifier with less restrictive assumptions can perform even better. The authors evaluate approaches for inducing classifiers from data based on the theory of learning Bayesian networks: networks that are factored representations of probability distributions, generalize the naive Bayesian classifier, and explicitly represent statements about independence. We used WEKA to apply this classifier to the dataset and find the probability distributions.

Attributes selected:
• ESCALATION (Yellow, Red)
• EXPECTATION (the customer's expectation of the support team)
• MODULES
• SEVERITY SHORT

Results I: MODULE as the root node
• Red escalation, highest-probability modules: Lcore - Control, Ops - Message Interceptor (opcmsgi)
• Yellow escalation, highest-probability modules: Ops - Trap Interceptor, Other, Lcore - BBC

Figure 22: Module as root node
Figure 23: Probability for MODULE

• The modules Ops - Action Agent (opcacta) and Installation have the highest numbers of RED escalations reported to the Operations Team.
• When the SEVERITY is:
– URGENT: the most probable module is LCore - BBC
– HIGH: the most probable module is Installation
– MEDIUM: the most probable module is LCore - XPL
– LOW: the most probable modules are LCore - Control, opcmsgi, LCore - Security, LCore - Deploy, Operation Agent, Agent Framework

The probability distribution of Modules vs Severity is shown in Figure 24.
Figure 24: Probability distribution for SEVERITY SHORT

Results II: ESCALATION as the root node
• Correctly classified instances = 80 (51.9481%)
• Incorrectly classified instances = 74 (48.0519%)

Figure 25: ESCALATION as the root node

• Probability that it would be a RED escalation: 24.20%
• Probability that it would be a YELLOW escalation: 75.80%
Figure 26: Probability distribution table for ESCALATION

• Probability of the EXPECTATION from the customer when it is a RED escalation:
– Answer Question: 64%
– Investigation & Hotfix requested: 47.4%
– Investigate issue: 39.7%
– Create Enhancement: 64%
• Probability of the EXPECTATION from the customer when it is a YELLOW escalation:
– Answer Question: 63%
– Investigation & Hotfix requested: 56.7%
– Investigate issue: 32.4%
– Create Enhancement: 46%
Figure 27: Probability distribution table for EXPECTATION

• Probability of the SEVERITY of the incident when it is a RED escalation:
– URGENT: 44.90%
– HIGH: 44.90%
– MEDIUM: 09.00%
– LOW: 13.00%
• Probability of the SEVERITY of the incident when it is a YELLOW escalation:
– URGENT: 18.10%
– HIGH: 63.40%
– MEDIUM: 17.20%
– LOW: 13.00%

Results III: SEVERITY SHORT as the root node
• Correctly classified instances = 63 (40.9091%)
• Incorrectly classified instances = 91 (59.0909%)

Figure 28: SEVERITY SHORT as the root node
• Probability distribution for SEVERITY SHORT:
– Probability that it would be URGENT: 24.7%
– Probability that it would be HIGH: 59.3%
– Probability that it would be MEDIUM: 15.1%
– Probability that it would be LOW: 0.01%

Figure 29: Probability distribution for SEVERITY SHORT

• Probability distribution for a RED escalation:
– URGENT: 44.9%
– HIGH: 18.8%
– MEDIUM: 14.6%
– LOW: 25%
• Probability distribution for a YELLOW escalation:
– URGENT: 55.1%
– HIGH: 81.2%
– MEDIUM: 85.4%
– LOW: 75%
• Probability distribution for customer expectations:
– URGENT: Investigate Issue & Hotfix requested
– HIGH: Investigate Issue & Hotfix requested
– MEDIUM: Investigate Issue
– LOW: Investigate Issue

Figure 30: Probability distribution for Customer Expectations
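The runs above used WEKA's Bayes network classifier; a closely related experiment, the naive Bayes special case, can be sketched in R with the e1071 package. The column names are again assumptions, and the factor levels in the prediction example are hypothetical.

    # Naive Bayes over the same four ticket attributes.
    library(e1071)

    tickets <- read.csv("incidents.csv", stringsAsFactors = TRUE)
    tickets <- tickets[, c("ESCALATION", "EXPECTATION", "MODULE", "SEVERITY_SHORT")]

    nb <- naiveBayes(ESCALATION ~ ., data = tickets)
    nb$apriori                # class priors (cf. the 24.2% Red / 75.8% Yellow split above)
    nb$tables$SEVERITY_SHORT  # severity distribution conditioned on the escalation class

    # Predicted escalation for a hypothetical urgent hotfix request:
    predict(nb, data.frame(EXPECTATION    = "Investigate Issue & Hotfix requested",
                           MODULE         = "Installation",
                           SEVERITY_SHORT = "Urgent"))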
4.2.2 Clustering the Incidents Data

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). The Simple K Means method is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. We used WEKA to apply this method; the following results were obtained:

• Number of iterations: 3
• Within-cluster sum of squared errors: 261.0
• Cluster 0: Yellow, 'Investigate Issue & Hotfix requested', 'Ops - Trap Interceptor (opctrapi)', Urgent
• Cluster 1: Red, 'Investigate Issue & Hotfix requested', Perf, High

Figure 31: Final Cluster Centroids
Figure 32: Model and evaluation on training set

In the cluster centroids below, the instances are divided according to the escalation type of the tickets:

Figure 33: Cluster Centroids I
In the cluster centroids below, the instances are divided according to the severity of the tickets:

Figure 34: Cluster Centroids II
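WEKA's Simple K Means handles nominal attributes natively; base R's kmeans() expects numeric input, so a sketch of the same clustering in R first one-hot encodes the factors. The encoding step is our substitution, not part of the original WEKA run, and the column names are assumed.

    # k = 2 clusters over one-hot encoded ticket attributes.
    tickets <- read.csv("incidents.csv", stringsAsFactors = TRUE)
    tickets <- tickets[, c("ESCALATION", "EXPECTATION", "MODULE", "SEVERITY_SHORT")]

    X <- model.matrix(~ . - 1, data = tickets)  # dummy-code the nominal attributes
    set.seed(42)                                # k-means results depend on the seed
    km <- kmeans(X, centers = 2)

    km$centers                             # cluster centroids (cf. Figure 31)
    table(km$cluster, tickets$ESCALATION)  # how the clusters align with escalation type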
Predictive Apriori: An Apriori Variant

The Predictive Apriori algorithm trades larger support against higher confidence and calculates the expected accuracy in a Bayesian framework. The result of this algorithm maximizes the expected accuracy of the association rules on future data. We used WEKA to apply this algorithm to the Incidents dataset. Below are the findings.

(Try I) Attributes selected:
• ESCALATION (Yellow, Red)
• CUSTOMER ENTITLEMENT
• SEVERITY SHORT

Best rules found:
1. CUSTOMER ENTITLEMENT = Premier & SEVERITY SHORT = Medium (8) ==> ESCALATION = Yellow (8); accuracy: 95.49%
2. CUSTOMER ENTITLEMENT = Premier & SEVERITY SHORT = High (38) ==> ESCALATION = Yellow (34); accuracy: 83.22%

These rules describe that when the customer entitlement is Premier and the severity set on the ticket is Medium, there is a 95% chance that the escalation will be Yellow.

(Try II) Attributes selected:
• ESCALATION (Yellow, Red)
• CUSTOMER ENTITLEMENT
• SEVERITY SHORT
• MODULE
• OPERATING SYSTEM
Best rules found:
1. CUSTOMER ENTITLEMENT = Premier & SEVERITY SHORT = Medium (11) ==> ESCALATION = Yellow (11); accuracy: 98.84%
2. CUSTOMER ENTITLEMENT = Premier & OS = Linux (11) ==> ESCALATION = Yellow (11); accuracy: 98.46%
3. MODULE = Ops - Logfile Encapsulator (opcle) (10) ==> ESCALATION = Yellow (10); accuracy: 98.70%

These rules describe that:
• When the module is Ops - Logfile Encapsulator (opcle), there is a 98.70% chance that the ticket will be Yellow-escalated.
• When the customer entitlement is Premier and the OS is Linux, there is a 98.46% chance that the ticket will be Yellow-escalated.
Simple Apriori Algorithm

The Apriori algorithm is an association rule mining algorithm introduced in 1994. It works in several steps. First, the candidate item sets are generated; then the database is scanned to check the support of these item sets, which yields the frequent 1-item sets by eliminating item sets with support below the threshold value. In later passes, the candidates become k-item sets, generated after the (k-1)-item sets meeting the threshold are found. Iterating this database scanning and support calculation yields the support and confidence of each association rule found.

Attributes selected:
• ESCALATION (Yellow, Red)
• CUSTOMER ENTITLEMENT
• MODULES
• SEVERITY SHORT
• OS

Best rules found:
1. OS = Linux (19) ==> ESCALATION = Yellow (18); confidence achieved: 95%
2. CUSTOMER ENTITLEMENT = Premier & SEVERITY SHORT = High (38) ==> ESCALATION = Yellow (35); confidence achieved: 92%

These rules describe that:
• When the OS is Linux, there is 95% confidence that the escalation is Yellow.
• When the customer entitlement is Premier and the severity of the ticket is High, there is 92% confidence that the escalation is Yellow.
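The association rule runs above can be approximated in R with the arules package. This is a sketch: the column names are assumed, and the support/confidence thresholds are illustrative (WEKA's Apriori and Predictive Apriori tune these differently).

    # Mining ESCALATION rules from the nominal ticket attributes.
    library(arules)

    tickets <- read.csv("incidents.csv", stringsAsFactors = TRUE)
    tickets <- tickets[, c("ESCALATION", "CUSTOMER_ENTITLEMENT", "MODULE",
                           "SEVERITY_SHORT", "OPERATING_SYSTEM")]

    trans <- as(tickets, "transactions")  # each factor level becomes an item
    rules <- apriori(trans,
                     parameter  = list(supp = 0.05, conf = 0.90),
                     appearance = list(rhs = c("ESCALATION=Yellow", "ESCALATION=Red"),
                                       default = "lhs"))
    inspect(head(sort(rules, by = "confidence"), 5))  # top rules by confidence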
4.2.3 Text Mining and Natural Language Tool Kit (NLTK)

After evaluating the results acquired from Phase I, a final conclusion could not be drawn: they did not answer what actually triggers an incident to escalate. This phase describes the use of Text Mining and Natural Language Processing to determine the triggering factor of an incident.

Figure 35: Total number of Incidents and their Escalation count

The purpose of Text Mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and thus make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms. Information can be extracted to derive summaries for the words contained in the documents, or to compute summaries for the documents based on the words contained in them. We used the tm package in R for text mining the incident tickets. The tm package provides methods for data import, corpus handling, pre-processing, metadata management and creation of term-document matrices.
The central question was: what made an Incident Ticket get RED escalated while tickets in other escalation states did not? We took the dataset (Incidents.csv) and performed the following step-by-step process in R to identify what might explain an Escalation:

• Data Import: load the text into a corpus
• Inspecting Corpora: get a concise overview of the corpus
• Transformations: modify the corpus - e.g., stemming, stop-word removal, etc.
• Creating Term-Document Matrices
• Operations on Term-Document Matrices - e.g., calculating word frequencies, plotting word frequencies, building a word cloud, etc.

A minimal R sketch of these steps follows.
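The sketch below uses the tm package as described above, assuming the mail-chain text lives in a column called NOTES (the column name is an assumption):

# tm workflow sketch: import, transform, and build the term-document
# matrix used for the frequency analyses that follow.
library(tm)          # text mining framework
library(SnowballC)   # provides the stemmer used by stemDocument

incidents <- read.csv("Incidents.csv", stringsAsFactors = FALSE)

# Data import: load the ticket text into a corpus.
corpus <- VCorpus(VectorSource(incidents$NOTES))

# Transformations: lower-case, strip punctuation and numbers,
# remove English stop words, stem, and squeeze whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Create the term-document matrix.
tdm <- TermDocumentMatrix(corpus)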
Observations made on GREEN tickets which were RED escalated: it was observed that opcle was the most discussed module. It can also be observed that the words please, hotfix & support were used the most in the mail chain exchanged between the customer and the developer.

Figure 36: Words with highest frequency mined on GREEN tickets escalated to RED

Figure 37: GREEN tickets escalated to RED
Observations made on GREEN tickets which were YELLOW escalated: it was observed that support was the most used word. It can also be observed that the words issue & time were used the most in the mail chain exchanged between the customer and the developer.

Figure 38: Words with highest frequency mined on GREEN tickets escalated to YELLOW

Figure 39: GREEN tickets escalated to YELLOW
Observations made on YELLOW tickets which were RED escalated: this corpus did not yield anything notable, but it did bring out the name of the developer most frequently associated with the resolution of the tickets.

Figure 40: Words with highest frequency mined on YELLOW tickets escalated to RED

Figure 41: YELLOW ticket escalated to RED
Observations made on RED tickets which were escalated: it was observed that opcmona was the most discussed module. It can also be observed that the words waiting, hotfix & issue were used the most in the mail chain exchanged between the customer and the developer.

Figure 42: Observations made on RED tickets which were escalated

Figure 43: Plotting the highest-frequency mined words
Observations made on the whole set of RED escalated tickets: there were in total 125 RED escalated entries in the Incidents. Text mining all 125 entries revealed the details below:

Figure 44: Words with highest frequency mined

The words issue, please, support & escalation were used the most in the mail chains exchanged between the customer and the team.

Figure 45: Plotting the words with highest frequency mined
Observations made on the whole set of GREEN escalated tickets: there were in total 3831 GREEN escalated entries in the Incidents. Text mining all of these entries revealed the details below:

Figure 46: Words with highest frequency mined

Mining the whole set of GREEN escalated tickets did not yield valuable information.

Figure 47: Plotting the words with highest frequency mined
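The frequency counts and plots behind the figures above can be derived from the term-document matrix built earlier. A short sketch, reusing tdm from the previous snippet on whichever subset of tickets is being examined:

# Rank terms by total occurrences across the selected tickets.
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)   # the most frequent mined words

# Bar plot of the top terms, analogous to the frequency plots above.
barplot(head(freq, 10), las = 2, ylab = "Occurrences",
        main = "Words with highest frequency")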
The above observations show that mail chains in which the key words please, hotfix & support occur frequently are the most likely to be converted to a RED Escalation. We then used these key words to build a program which takes the Incident data dump as input and scans the email chains. As the frequency of these key words increases, the program alerts the user once it crosses a threshold limit. The threshold limit can be adjusted by the developer based on the ongoing trend.

Figure 48: Output of the program
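A minimal R sketch of such an alerting program, under stated assumptions: the mail-chain text sits in a NOTES column, the ticket identifier in INCIDENT_ID, and the threshold value is a placeholder to be tuned by the developer.

# Keyword-based escalation alert: flag tickets whose mail chains are
# dense in the trigger words mined above. Column names and the
# threshold are assumptions, not the project's exact implementation.
trigger_words <- c("please", "hotfix", "support")
threshold     <- 0.05   # fraction of words that are trigger words

keyword_density <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)
  sum(words %in% trigger_words) / length(words)
}

incidents <- read.csv("Incidents.csv", stringsAsFactors = FALSE)
incidents$risk  <- vapply(incidents$NOTES, keyword_density, numeric(1))
incidents$alert <- incidents$risk > threshold

# Tickets the team should look at proactively.
subset(incidents, alert, select = c(INCIDENT_ID, risk))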
5 Results and Conclusions

By text mining and applying machine learning algorithms to the incident dataset, we obtained the following results:

• The mail chain of a ticket that is going to be escalated to Red will contain the words please, hotfix and support with high frequency.
• The software modules Ops - Logfile Encapsulator (opcle), Ops - Action Agent (opcacta) & Installation had the highest number of Red escalations reported to the team as incidents, whereas Ops - Monitor Agent (opcmona) & Installation had the highest number of showstopper escalations for change requests.
• By applying the Predictive Apriori algorithm to the incident dataset, we observed the following: we got a confidence of 98.70% for Escalation 'Yellow' when the reported module was 'opcle', and a confidence of 98.84% for Escalation 'Yellow' when the Customer Entitlement was 'Premier' and the Severity was 'Medium'.
• Using the Simple K-Means method we got two clusters. Cluster 1: Escalation is 'Yellow', customer expectation is 'Investigate issue & Hotfix requested', software module is 'Ops - Trap Interceptor' and Severity is 'Urgent'. Cluster 2: Escalation is 'Red', customer expectation is 'Investigate issue & Hotfix requested', software module is 'Perf' and Severity is 'High'.

For an engineering team it is really important to avoid any major Red escalations. The team receives a lot of incidents which need to be resolved in a limited time. Since the number of incidents is large, it becomes hard for the team to keep track of all the incident issues with respect to the criticality and the severity of each incident.
By implementing predictive mechanisms such as these, an incident that is going to turn RED can be flagged to the team in advance. This would help the manager allocate appropriate resources based on the criticality of the incoming tickets, help resolve incident tickets within the stipulated time, and avoid unwanted escalations. This would indeed help in maintaining the trust of the customer as well.

More accurate and varied results could have been achieved, but missing data in the dataset prevented this. Due to discrepancies in the data (entries being shifted by 3-4 columns), we had to ignore such inconsistent entries in the dataset when performing the statistical analysis.

5.1 Future Work

The use of NLTK proved to be very helpful in extracting meaningful conclusions from the tickets dataset. NLTK can be used to analyze the real-time behavior of the tickets coming in to the team. This analysis can be used to provide proactive resolutions to the customers, thus preventing tickets from getting escalated.
References

[BAG] Leo Breiman, Bagging Predictors, Machine Learning, 24(2), 1996.
[SFTWR] Ishani Arora, Vivek Tetarwal, Anju Saha, Software Defect Prediction.
[PRBD] Thomas Zimmermann, Nachiappan Nagappan, Predicting Bugs from History.
[SFRC] Stamatia Bibi, Grigorios Tsoumakas, Ioannis Vlahavas, Ioannis Stamelos, Prediction Using Regression via Classification.
[HIDM] Felix Salfner, Predicting Failures with Hidden Markov Models.
[IEMD] Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, Elsevier, 2011.
[EFRB] Eibe Frank (Computer Science Department, University of Waikato, New Zealand) and Remco R. Bouckaert (Xtal Mountain Information Technology, Auckland, New Zealand), Naive Bayes for Text Classification with Unbalanced Classes.
[JMJ11] Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann, 2011.
[IT2008] Irina Tudor, "Association Rule Mining as a Data Mining Technique", 2008.
[PMDM] Norman Fenton, Paul Krause, Martin Neil, A Probabilistic Model for Software Defect Prediction, 2006.
[BLMK] Billy Edward Hunt, Jr. (Overland Park, KS, US); Jennifer J. Kirkpatrick (Olathe, KS, US); Richard Allan Kloss, Software Defect Prediction (US patent), 2014.
[WEKA] WEKA Online, www.cs.waikato.ac.nz/ml/weka.
[PEFS] Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, Andreas Zeller, Predicting Effort to Fix Software Bugs, 2006.
[CSDE] Victor S. Sheng, Bin Gu, Wei Fang, Jian Wu, Cost-Sensitive Learning for Defect Escalation.
[DSNA] Jaideep Srivastava, Muhammad A. Ahmad, Nishith Pathak, David Kuo-Wei Hsu, Data Mining Based Social Network Analysis from Online Behavior, 2008.
[PHMM] Felix Salfner, Predicting Failures with Hidden Markov Models, 2005.