The document provides information about an exam, including admittance details, exam regulations, and seminar and assignment information. It then discusses using data mining to predict the most defect-prone source code entities by analyzing past bug and version control data, as well as source code metrics. The process involves defining the problem, preparing the data, exploring the data to understand relationships, building a prediction model using machine learning techniques, and validating the model on test data. The goal is to prioritize testing of the most defect-prone entities identified by the model.
2. Exam Admittance
50%
ROOM ...
Questions or problems: kim@cs.uni-saarland.de
3. After-Exam Registration
Not registered = No after exam
But please only register if you plan to participate
4. Exam Regulations
‣ Single-sided cheat sheet
‣ No dictionaries
‣ Ask supervision
‣ Bags to be left at entrance
‣ Hand in exam & cheat sheet
‣ Student ID on desk
‣ Additional paper only from supervision
‣ Name + MatNr. on every sheet (incl. cheat sheet)
‣ Stick to one language per exercise (German or English)
5. Seminar on Code Modification at Runtime by Frank Padberg
Topics
‣ Runtime optimization of byte code
‣ On-the-fly creation of classes
‣ Self-modifying code
‣ ... AND MORE!
Initial Meeting (Vorbesprechung): July 22
http://www.st.cs.uni-saarland.de/edu/codemod09/rcm09.html
8. Books
‣ Data Mining: Concepts and Techniques by Jiawei Han & Micheline Kamber
‣ Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten & Eibe Frank
10. Imagine
You as Quality Manager. Your product:
‣ 30,000 classes
‣ ~5.5 million lines of code
‣ ~3,000 defects per release
‣ 700 developers
11. Your Boss
Test the system!
You have 6 months, $500,000.
And don’t miss any bug!
13. The Problem
‣ Not enough time to test everything
‣ What to test? What to test first?
‣ Not enough money to pay enough testers
‣ To what extent?
Central question:
Where are the most defect-prone entities in my system?
21. What is data mining?
Data mining is the process of discovering
actionable information from large sets of data.
22. The Mining Model
A cycle: Defining the problem → Preparing data → Exploring data → Building models → Validating models → Deploying and updating models
http://technet.microsoft.com/en-us/library/ms174949.aspx
23. Step 1: Defining the Problem
‣ Clearly define the problem
‣ What are you looking for?
‣ Scope of problem
‣ Types of relationships
‣ Define how to evaluate models
‣ Prediction, recommendation, or just patterns
24. Defect Prediction Problem
Step 1: Define the problem
Step 2: Prepare Data
Step 3: Explore Data
Step 4: Building the Model
Step 5: Validating the Model
Which source code entities should we test most?
27. Defect Prediction Problem
Which source code entities should we test most?
Which are the most defect-prone entities in my system?
In the past, which entities had the most defects?
Which properties of source code entities correlate with defects?
30. Data Sources
‣ Bug Database + Version Archive → past defects per entity (quality)
‣ Source Code → source code properties (metrics)
31. Data Sources: Heuristics
Bug Database + Version Archive → past defects per entity (quality)
“... commit messages that contain fix and bug id ...”
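The quoted heuristic can be sketched as a small script. The regular expressions and the commit-message formats below are illustrative assumptions, not the lecture's actual implementation.

```python
import re

# Illustrative patterns (assumptions): "fix"/"fixes"/"fixed" as a word, or a
# bug-tracker reference such as "Bug 4711" / "bug #123".
FIX_PATTERN = re.compile(r"\bfix(e[sd])?\b", re.IGNORECASE)
BUG_ID_PATTERN = re.compile(r"\b[Bb]ug\s*#?\d+\b")

def is_fix_commit(message):
    """Heuristically flag a commit message as a bug fix."""
    return bool(FIX_PATTERN.search(message) or BUG_ID_PATTERN.search(message))

messages = [
    "Fixed null pointer in Parser (Bug 4711)",
    "Refactor build scripts",
    "bug #123: off-by-one in loop bound",
]
print([is_fix_commit(m) for m in messages])  # → [True, False, True]
```

Fix commits found this way are then mapped to the entities they touched, giving the past-defects-per-entity counts.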
32. Data Sources: Metrics
Source Code → source code properties (metrics)
‣ Complexity metrics
‣ McCabe, FanIn, FanOut, Couplings
‣ (see Lecture “Metrics and Estimation”)
‣ Time metrics
‣ How many changes
‣ How many different authors
‣ Age of code
34. Step 2: Prepare Data
‣ Highly distributed data:
‣ Version repository, bug database, time trackers, ...
‣ Data integration
‣ Excel, CSV, SQL, ARFF, ...
‣ Data cleaning
‣ missing values, noise, inter-correlations
39. Example Mining File
Careful! Large files! e.g. 5 million lines, 300 columns
Rows: entities; columns: data points, plus one output column
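One way to picture such a mining file is as a CSV table with one row per entity. The entity names and metric values below are invented illustration data.

```python
import csv
import io

# One row per source code entity; metric columns as input data points and the
# past defect count as the output column. All values here are made up.
header = ["entity", "loc", "mccabe", "authors", "defects"]
rows = [
    ("Parser.java",    812, 34,  5,  9),
    ("Util.java",      120,  4,  1,  0),
    ("Compiler.java", 2300, 87, 12, 15),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerows(rows)
print(buf.getvalue())
```

A real mining file of the size named above would have millions of such rows and hundreds of metric columns.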
40. Step 3: Explore Data
You cannot validate the output if you don’t know the input.
‣ Descriptive data summary
‣ max, min, mean, pareto, distribution
‣ Data selection
‣ Relevance of data
‣ Data reduction
‣ aggregation, subset selection
41. Descriptive Data Summary
‣ How good can a prediction possibly be?
‣ Does it make sense to predict the top 20%?
20% of entities contain 80% of defects
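The 80/20 claim can be checked directly on the past defect counts. The counts below are invented illustration data, so the resulting share is not the slide's 80%.

```python
# Sort entities by past defect count and compute which share of all defects
# the top 20% of entities hold. The counts are made-up illustration data.
defects = [15, 9, 7, 3, 2, 1, 1, 0, 0, 0]   # one count per entity

counts = sorted(defects, reverse=True)
top = counts[: max(1, len(counts) // 5)]     # top 20% of entities
share = sum(top) / sum(counts)
print(f"top 20% of entities hold {share:.0%} of all defects")
```

If the share is high, predicting only the top 20% of entities is a sensible goal.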
42. Step 3: Explore Data
Data sufficiency
‣ Maybe the data will not help to solve the problem
‣ Redefine problem
‣ Search for alternatives
‣ Access different data
45. Step 3: Explore Data
Does complexity (source code properties) correlate with past defects per entity? YES!
46. Step 4: Build Model
‣ The mining model is only a container
‣ parameters and mining structure
‣ output value
‣ Now we need some statistics / machine learners
48. Building the Model
‣ Regression
‣ Predicting concrete, continuous values
‣ Difficult and very imprecise
‣ But desirable
‣ Classification
‣ Predicting class labels (e.g. more than X defects or not)
‣ Easier and more precise
‣ Vague information (how many defects in code?)
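The classification framing can be sketched with a simple threshold. The threshold X = 5 is an arbitrary illustration value, not one from the lecture.

```python
# Turn the hard regression target (exact defect count) into a class label:
# "more than X defects or not". The threshold X = 5 is an assumption.
X = 5

def label(defect_count):
    return "defect-prone" if defect_count > X else "not defect-prone"

print([label(d) for d in [0, 3, 9, 15]])
# → ['not defect-prone', 'not defect-prone', 'defect-prone', 'defect-prone']
```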
50. Building the Model
Rule-Based Classification, Support Vector Machine, Linear Regression, Lazy Learners, Decision Tree, Bayesian Network, Logistic Regression
51. Training and Testing
‣ Training set
‣ The data set to train the model
‣ Which columns correlate with output values?
‣ Which columns correlate with each other?
‣ Testing set
‣ A data set independent of the training data set
‣ used to fine-tune the estimates of the model parameters
53. Training and Testing
Random split: DATA SET → training data (2/3) + testing data (1/3)
+ Only one version needed
+ No overlaps between training and testing entities
- Does not reflect real life
- Which random set is the best one? (because they are all different)
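A minimal sketch of the random 2/3 to 1/3 split. Fixing the seed pins down one particular split; any other seed gives a different one, which is exactly the drawback named above.

```python
import random

# Shuffle the entities once, then cut off the first 2/3 as training data and
# keep the remaining 1/3 as testing data. The entity ids are placeholders.
entities = list(range(30))
random.seed(42)                      # a different seed gives a different split
random.shuffle(entities)

cut = (2 * len(entities)) // 3
training = entities[:cut]            # training data (2/3)
testing = entities[cut:]             # testing data (1/3)
print(len(training), len(testing))   # → 20 10
```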
55. Training and Testing
Forward estimation: DATA SET version N → training data; DATA SET version N+1 → testing data
+ Reflects real life
+ Reproducible result
- Two versions needed
63. Step 4: Build Model
training set → (input) → machine learner (black box) → (output) → Prediction Model
testing set → (input) → Prediction Model → (output) → Prediction
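The pipeline can be mimicked with a toy "black box": the learner here (splitting on the mean metric value and predicting group averages) is an arbitrary stand-in, not an algorithm from the lecture, and all data is made up.

```python
# Training rows are (metric value, defect count) pairs; the returned model
# maps a metric value to a predicted defect count.
def train(training_set):
    threshold = sum(m for m, _ in training_set) / len(training_set)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    high = avg([d for m, d in training_set if m >= threshold])
    low = avg([d for m, d in training_set if m < threshold])
    return lambda metric: high if metric >= threshold else low

model = train([(10, 1), (80, 9), (90, 12), (20, 0)])   # training set → model
print(model(85), model(15))                            # testing set → prediction
# → 10.5 0.5
```

Swapping in a real learner (decision tree, regression, ...) changes only the black box, not the pipeline.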
64. Step 5: Validating Model
‣ Test data has same structure but different content
‣ Goal is to use model to correctly estimate output values
‣ Compare estimation with real values (fine tuning)
66. Evaluation
Never predict concrete numbers! Because people will take them for real!
67. Evaluation
Real defects per entity vs. predicted defects per entity, sorted descending
68. Evaluation
Correctly predicted defect-prone modules (true positives): real defects per entity vs. predicted defects per entity
69. Recall, Precision, Accuracy
                          Predict defects?
                          Yes               No
Real defects?   Yes   true positives    false negatives
                No    false positives   true negatives
70. Recall, Precision, Accuracy
Precision = true positives / (true positives + false positives)
Predicted defect-prone entities will be defect-prone!
71. Recall, Precision, Accuracy
Recall = true positives / (true positives + false negatives)
All defect-prone entities get predicted as defect-prone.
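Both measures, plus accuracy, follow directly from the four confusion-matrix cells. The counts below are invented illustration values.

```python
# Confusion-matrix cells (made-up counts): true/false positives and negatives.
tp, fp, fn, tn = 30, 10, 20, 40

precision = tp / (tp + fp)                  # predicted defect-prone really are
recall = tp / (tp + fn)                     # defect-prone entities actually found
accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct predictions
print(precision, recall, accuracy)          # → 0.75 0.6 0.7
```

Note the trade-off: predicting everything defect-prone maximizes recall but ruins precision, and vice versa.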
73. Step 6: Deploying Model
‣ Integrate model into development or quality assurance process
‣ Update model frequently (because change happens)
‣ Frequently validate the precision of your model
74. Step 6: Deploying Model
Careful with cross-project models! Many models depend highly on project data!
91. Assistance
Future environments will
• mine patterns from program + process
• apply rules to make predictions
• provide assistance in all development decisions
• adapt advice to project history
93. Wikis
(tag cloud) Joy of Use, Participation, Usability, Recommendation, Social Software, Collaboration, Perpetual Beta, Simplicity, Empirical SE 2.0, Trust, Economy, Remixability, The Long Tail, Data-Driven