The document provides information about an exam, including admittance details, exam regulations, and seminar and assignment information. It then discusses using data mining to predict the most defect-prone source code entities by analyzing past bug and version control data, as well as source code metrics. The process involves defining the problem, preparing the data, exploring the data to understand relationships, building a prediction model using machine learning techniques, and validating the model on test data. The goal is to prioritize testing of the most defect-prone entities identified by the model.
2. Exam Admittance
50%
ROOM ...
Questions or problems: kim@cs.uni-saarland.de
3. After-Exam Registration
Not registered = No after exam
But please only register if you plan to participate
4. Exam Regulations
‣ Single-sided cheat sheet
‣ No dictionaries
‣ Ask supervision
‣ Bags to be left at entrance
‣ Hand in exam & cheat sheet
‣ Student ID on desk
‣ Additional paper only from supervision
‣ Name + MatNr. on every sheet (incl. cheat sheet)
‣ Stick to one language per exercise (German or English)
5. Seminar on Code Modification at Runtime by Frank Padberg
Topics
‣ Runtime optimization of byte code
‣ On-the-fly creation of classes
‣ Self-modifying code
‣ ... AND MORE!
Initial Meeting (Vorbesprechung): July 22
http://www.st.cs.uni-saarland.de/edu/codemod09/rcm09.html
8. Books
‣ Data Mining: Concepts and Techniques by Jiawei Han & Micheline Kamber
‣ Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten & Eibe Frank
10. Imagine
You as Quality Manager. Your product:
‣ 30,000 classes
‣ ~5.5 million lines of code
‣ ~3,000 defects per release
‣ 700 developers
11. Your Boss
Test the system!
You have 6 months, $500,000.
And don’t miss any bug!
13. The Problem
‣ Not enough time to test everything
‣ What to test? What to test first?
‣ Not enough money to pay enough testers
‣ To what extent?
Central question:
Where are the most defect-prone entities in my system?
21. What is data mining?
Data mining is the process of discovering
actionable information from large sets of data.
22. The Mining Model
A cycle: Defining the problem → Preparing data → Exploring data → Building models → Validating models → Deploying and updating models
http://technet.microsoft.com/en-us/library/ms174949.aspx
23. Step 1: Defining the Problem
‣ Clearly define the problem
‣ What are you looking for?
‣ Scope of problem
‣ Types of relationships
‣ Define how to evaluate models
‣ Prediction, recommendation, or just patterns
24. Defect Prediction Problem
Step 1: Define the problem
Step 2: Prepare Data
Step 3: Explore Data
Step 4: Building the Model
Step 5: Validating the Model
Which source code entities should we test most?
27. Defect Prediction Problem
Which source code entities should we test most?
Which are the most defect-prone entities in my system?
In the past, which entities had the most defects?
Which properties of source code entities correlate with defects?
30. Data Sources
‣ Bug Database + Version Archive → past defects per entity (quality)
‣ Source Code → source code properties (metrics)
31. Data Sources: Heuristics
Bug Database + Version Archive → past defects per entity (quality)
“... commit messages that contain fix and bug id ...”
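The quoted heuristic can be sketched as a small script. The regular expressions and the commit-message formats below are illustrative assumptions, not the lecture's actual implementation.

```python
import re

# Illustrative patterns (assumptions): "fix"/"fixes"/"fixed" as a word, or a
# bug-tracker reference such as "Bug 4711" / "bug #123".
FIX_PATTERN = re.compile(r"\bfix(e[sd])?\b", re.IGNORECASE)
BUG_ID_PATTERN = re.compile(r"\b[Bb]ug\s*#?\d+\b")

def is_fix_commit(message):
    """Heuristically flag a commit message as a bug fix."""
    return bool(FIX_PATTERN.search(message) or BUG_ID_PATTERN.search(message))

messages = [
    "Fixed null pointer in Parser (Bug 4711)",
    "Refactor build scripts",
    "bug #123: off-by-one in loop bound",
]
print([is_fix_commit(m) for m in messages])  # → [True, False, True]
```

Fix commits found this way are then mapped to the entities they touched, giving the past-defects-per-entity counts.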
32. Data Sources: Metrics
Source Code → source code properties (metrics)
‣ Complexity metrics
‣ McCabe, FanIn, FanOut, Couplings
‣ (see Lecture “Metrics and Estimation”)
‣ Time metrics
‣ How many changes
‣ How many different authors
‣ Age of code
34. Step 2: Prepare Data
‣ Highly distributed data:
‣ Version repository, bug database, time trackers, ...
‣ Data integration
‣ Excel, CSV, SQL, ARFF, ...
‣ Data cleaning
‣ missing values, noise, inter-correlations
39. Example Mining File
Careful! Large files! e.g. 5 million lines, 300 columns
Rows: entities; columns: data points, plus one output column
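One way to picture such a mining file is as a CSV table with one row per entity. The entity names and metric values below are invented illustration data.

```python
import csv
import io

# One row per source code entity; metric columns as input data points and the
# past defect count as the output column. All values here are made up.
header = ["entity", "loc", "mccabe", "authors", "defects"]
rows = [
    ("Parser.java",    812, 34,  5,  9),
    ("Util.java",      120,  4,  1,  0),
    ("Compiler.java", 2300, 87, 12, 15),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerows(rows)
print(buf.getvalue())
```

A real mining file of the size named above would have millions of such rows and hundreds of metric columns.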
40. Step 3: Explore Data
You cannot validate the output if you don’t know the input.
‣ Descriptive data summary
‣ max, min, mean, pareto, distribution
‣ Data selection
‣ Relevance of data
‣ Data reduction
‣ aggregation, subset selection
41. Descriptive Data Summary
‣ How good can a prediction possibly be?
‣ Does it make sense to predict the top 20%?
20% of entities contain 80% of defects
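The 80/20 claim can be checked directly on the past defect counts. The counts below are invented illustration data, so the resulting share is not the slide's 80%.

```python
# Sort entities by past defect count and compute which share of all defects
# the top 20% of entities hold. The counts are made-up illustration data.
defects = [15, 9, 7, 3, 2, 1, 1, 0, 0, 0]   # one count per entity

counts = sorted(defects, reverse=True)
top = counts[: max(1, len(counts) // 5)]     # top 20% of entities
share = sum(top) / sum(counts)
print(f"top 20% of entities hold {share:.0%} of all defects")
```

If the share is high, predicting only the top 20% of entities is a sensible goal.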
42. Step 3: Explore Data
Data sufficiency
‣ Maybe the data will not help to solve the problem
‣ Redefine problem
‣ Search for alternatives
‣ Access different data
45. Step 3: Explore Data
Does complexity (source code properties) correlate with past defects per entity? YES!
46. Step 4: Build Model
‣ The mining model is only a container
‣ parameters and mining structure
‣ output value
‣ Now we need some statistics / machine learners
48. Building the Model
‣ Regression
‣ Predicting concrete, continuous values
‣ Difficult and very imprecise
‣ But desirable
‣ Classification
‣ Predicting class labels (e.g. more than X defects or not)
‣ Easier and more precise
‣ Vague information (how many defects in code?)
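The classification framing can be sketched with a simple threshold. The threshold X = 5 is an arbitrary illustration value, not one from the lecture.

```python
# Turn the hard regression target (exact defect count) into a class label:
# "more than X defects or not". The threshold X = 5 is an assumption.
X = 5

def label(defect_count):
    return "defect-prone" if defect_count > X else "not defect-prone"

print([label(d) for d in [0, 3, 9, 15]])
# → ['not defect-prone', 'not defect-prone', 'defect-prone', 'defect-prone']
```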
50. Building the Model
Rule-Based Classification, Support Vector Machine, Linear Regression, Lazy Learners, Decision Tree, Bayesian Network, Logistic Regression
51. Training and Testing
‣ Training set
‣ The data set to train the model
‣ Which columns correlate with output values?
‣ Which columns correlate with each other?
‣ Testing set
‣ A data set independent of the training data set
‣ used to fine-tune the estimates of the model parameters
53. Training and Testing
Random split: DATA SET → training data (2/3) + testing data (1/3)
+ Only one version needed
+ No overlaps between training and testing entities
- Does not reflect real life
- Which random set is the best one? (because they are all different)
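A minimal sketch of the random 2/3 to 1/3 split. Fixing the seed pins down one particular split; any other seed gives a different one, which is exactly the drawback named above.

```python
import random

# Shuffle the entities once, then cut off the first 2/3 as training data and
# keep the remaining 1/3 as testing data. The entity ids are placeholders.
entities = list(range(30))
random.seed(42)                      # a different seed gives a different split
random.shuffle(entities)

cut = (2 * len(entities)) // 3
training = entities[:cut]            # training data (2/3)
testing = entities[cut:]             # testing data (1/3)
print(len(training), len(testing))   # → 20 10
```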
55. Training and Testing
Forward estimation: DATA SET version N → training data; DATA SET version N+1 → testing data
+ Reflects real life
+ Reproducible result
- Two versions needed
63. Step 4: Build Model
training set → (input) → machine learner (black box) → (output) → Prediction Model
testing set → (input) → Prediction Model → (output) → Prediction
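The pipeline can be mimicked with a toy "black box": the learner here (splitting on the mean metric value and predicting group averages) is an arbitrary stand-in, not an algorithm from the lecture, and all data is made up.

```python
# Training rows are (metric value, defect count) pairs; the returned model
# maps a metric value to a predicted defect count.
def train(training_set):
    threshold = sum(m for m, _ in training_set) / len(training_set)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    high = avg([d for m, d in training_set if m >= threshold])
    low = avg([d for m, d in training_set if m < threshold])
    return lambda metric: high if metric >= threshold else low

model = train([(10, 1), (80, 9), (90, 12), (20, 0)])   # training set → model
print(model(85), model(15))                            # testing set → prediction
# → 10.5 0.5
```

Swapping in a real learner (decision tree, regression, ...) changes only the black box, not the pipeline.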
64. Step 5: Validating Model
‣ Test data has same structure but different content
‣ Goal is to use model to correctly estimate output values
‣ Compare estimation with real values (fine tuning)
66. Evaluation
Never predict concrete numbers! Because people will take them for real!
67. Evaluation
Real defects per entity vs. predicted defects per entity, sorted descending
68. Evaluation
Correctly predicted defect-prone modules (true positives): real defects per entity vs. predicted defects per entity
69. Recall, Precision, Accuracy
                          Predict defects?
                          Yes               No
Real defects?   Yes   true positives    false negatives
                No    false positives   true negatives
70. Recall, Precision, Accuracy
Precision = true positives / (true positives + false positives)
Predicted defect-prone entities will be defect-prone!
71. Recall, Precision, Accuracy
Recall = true positives / (true positives + false negatives)
All defect-prone entities get predicted as defect-prone.
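Both measures, plus accuracy, follow directly from the four confusion-matrix cells. The counts below are invented illustration values.

```python
# Confusion-matrix cells (made-up counts): true/false positives and negatives.
tp, fp, fn, tn = 30, 10, 20, 40

precision = tp / (tp + fp)                  # predicted defect-prone really are
recall = tp / (tp + fn)                     # defect-prone entities actually found
accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct predictions
print(precision, recall, accuracy)          # → 0.75 0.6 0.7
```

Note the trade-off: predicting everything defect-prone maximizes recall but ruins precision, and vice versa.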
73. Step 6: Deploying Model
‣ Integrate model into development or quality assurance process
‣ Update model frequently (because change happens)
‣ Frequently validate the precision of your model
74. Step 6: Deploying Model
Careful with cross-project models! Many models depend highly on project data!
91. Assistance
Future environments will
• mine patterns from program + process
• apply rules to make predictions
• provide assistance in all development decisions
• adapt advice to project history
93. Wikis
(tag cloud) Joy of Use, Participation, Usability, Recommendation, Social Software, Collaboration, Perpetual Beta, Simplicity, Empirical SE 2.0, Trust, Economy, Remixability, The Long Tail, Data-Driven