Parameterizing and Assembling
IR-based Solutions for SE Tasks using
Genetic Algorithms
Annibale Panichella, Bogdan Dit, Rocco Oliveto,
Max Di Penta, Denys Poshyvanyk and Andrea De Lucia
Information Retrieval
600
papers in 10 years
30
applications in S.E.
Use Case Insert Laboratory Data
Description The user inserts the data of a specific
laboratory
Events 1. The user opens the Laboratory GUI
2. The user inserts the Laboratory data
.
.
.
/* *This class implements the GUI for
managing laboratories data */
public class GUILaboratoryData {
private jFrame window;
private jButton insert;
...
public GUILaboratoryData(){
window = new JFrame();
insert = new JButton();
...
}
...
}
GUILaboratoryData.java
Information Retrieval
Use Case UC 11.4
Use Case Insert Laboratory Data
Description The user inserts the data of a specific
laboratory
Events 1. The user opens the Laboratory
GUI
2. The user inserts the Laboratory
data
.
.
.
/* *This class implements the GUI for
managing laboratories data */
public class GUILaboratoryData {
private jFrame window;
private jButton insert;
...
public GUILaboratoryData(){
window = new JFrame();
insert = new JButton();
...
}
...
}
GUILaboratoryData.java
Use Case UC 11.4
Information Retrieval
Use Case Insert Laboratory Data
Description The user inserts the data of a specific
laboratory
Events 1. The user opens the Laboratory
GUI
2. The user inserts the Laboratory
data
.
.
.
/* *This class implements the GUI for
managing laboratories data */
public class GUI Laboratory Data {
private j Frame window;
private j Button insert;
...
public GUI Laboratory Data(){
window = new JFrame();
insert = new JButton();
...
}
...
}
GUILaboratoryData.java
Information Retrieval
Use Case UC 11.4
Use Case Insert Laboratory Data
Description The user inserts the data of a specific
laboratory
Events 1. The user opens the Laboratory
GUI
2. The user inserts the Laboratory
data
.
.
.
/* *This class implements the GUI for
managing laboratories data */
public class GUI Laboratory Data {
private j Frame window;
private j Button insert;
...
public GUI Laboratory Data(){
window = new JFrame();
insert = new JButton();
...
}
...
}
GUILaboratoryData.java
Information Retrieval
Use Case UC 11.4
Textual
Similarity
= 42%
IR Process
Term
Extraction
Stop-Word
List / Function
Morphological
Analysis
Term
Weighting
IR Methods
Distance
Function
Software Maintenance task:
- Traceability Recovery
- Source code labelling
- Bug duplication
- Feature Location
- …
Software Artifacts
IR Process
Special Chars.
Digits
White space
..
No stop word
Stop-word function
Java stop-word list
English stop-word list
Italian stop-word list
…
No Stemmer
Porter
English Snowball
Italian Snowball
….
Cosine Similarity
Hellinger Distance
Jaccard Dist.
Euclidean
LSI (k)
LDA (ɑ, β, n, k)
VSM
PLSI
…
Boolean
Term Freq
TF-IDF
Log(TF+1)
Entropy
Term
Extraction
Stop-Word
List / Function
Morphological
Analysis
Term
Weighting
IR Methods
Distance
Function
Software Maintenance task:
- Traceability Recovery
- Source code labelling
- Bug duplication
- Feature Location
- …
Software Artifacts
IR Process
Special Chars.
Digits
White space
..
No stop word
Stop-word function
Java stop-word list
English stop-word list
Italian stop-word list
…
No Stemmer
Porter
English Snowball
Italian Snowball
….
Cosine Similarity
Hellinger Distance
Jaccard Dist.
Euclidean
LSI (k)
LDA (ɑ, β, n, k)
VSM
PLSI
…
Boolean
Term Freq
TF-IDF
Log(TF+1)
Entropy
Term
Extraction
Stop-Word
List / Function
Morphological
Analysis
Term
Weighting
IR Methods
Distance
Function
Software Maintenance task:
- Traceability Recovery
- Source code labelling
- Bug duplication
- Feature Location
- …
Which
configuration
to use?
Software Artifacts
What is the “best” IR process?
It is not possible to build a set of
guidelines for assembling IR-based
solutions for a given data set
What is the “best” IR process?
It is not possible to build a set of
guidelines for assembling IR-based
solutions for a given data set
Different datasets and different SE tasks
require different IR parameters
What is the “best” IR process?
It is not possible to build a set of
guidelines for assembling IR-based
solutions for a given data set
Different datasets and different SE tasks
require different IR parameters
If not well calibrated, IR techniques
perform worst than simple heuristics.
A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella, S. Panichella.
“Labeling Source Code with Information Retrieval
Methods: An Empirical Study”. EMSE 2014
What is the “best” IR process?
It is not possible to build a set of
guidelines for assembling IR-based
solutions for a given data set
Different datasets and different SE tasks
require different IR parameters
If not well calibrated, IR techniques
perform worst than simple heuristics.
A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella, S. Panichella.
“Labeling Source Code with Information Retrieval
Methods: An Empirical Study”. EMSE 2014
for the project and
for the SE task
Strategy to find the best configuration…
Strategy to find the best configuration…
It must be unsupervised..
Strategy to find the best configuration…
It must be unsupervised..
and general…
Our Idea
Information Retrieval
Clustering Analysis
Conjecture: there is a relationship between quality of clusters
and IR process performances
Predicting the performances?
Term 1
Term 2
Doc 1
Doc 2
Special Chars.
Digits
White space
..
No stop word
Stop-word function
Java stop-word list
English stop-word list
Italian stop-word list
…
No Stemmer
Porter
English Snowball
Italian Snowball
….
Cosine Similarity
Hellinger Distance
Jaccard Dist.
Euclidean
LSI (k)
LDA (ɑ, β, n, k)
VSM
Boolean
Term Frequency
TF-IDF
Log(TF+1)
Entropy
Term
Extraction
Stop-Word
List / Function
Morphological
Analysis
Term
Weighting
IR Methods
Distance
Function
Predicting the performances?
Term 1
Term 2
Doc 1
Doc 2
Special Chars.
Digits
White space
..
No stop word
Stop-word function
Java stop-word list
English stop-word list
Italian stop-word list
…
No Stemmer
Porter
English Snowball
Italian Snowball
….
Cosine Similarity
Hellinger Distance
Jaccard Dist.
Euclidean
LSI (k = 10)
LDA (ɑ, β, n, k)
VSM
Boolean
Term Frequency
TF-IDF
Log(TF+1)
Entropy
Term
Extraction
Stop-Word
List / Function
Morphological
Analysis
Term
Weighting
IR Methods
Distance
Function
Predicting the performances?
Term 1
Term 2
Doc 1
Doc 2
Special Chars.
Digits
White space
..
No stop word
Stop-word function
Java stop-word list
English stop-word list
Italian stop-word list
…
No Stemmer
Porter
English Snowball
Italian Snowball
….
Cosine Similarity
Hellinger Distance
Jaccard Dist.
Euclidean
LSI (k = 5)
LDA (ɑ, β, n, k)
VSM
Boolean
Term Frequency
TF-IDF
Log(TF+1)
Entropy
Term
Extraction
Stop-Word
List / Function
Morphological
Analysis
Term
Weighting
IR Methods
Distance
Function
Predicting the performances?
Term 1
Term 2
Doc 1
Doc 2
Special Chars.
Digits
White space
..
No stop word
Stop-word function
Java stop-word list
English stop-word list
Italian stop-word list
…
No Stemmer
Porter
English Snowball
Italian Snowball
….
Cosine Similarity
Hellinger Distance
Jaccard Dist.
Euclidean Dist.
LSI (k = 3)
LDA (ɑ, β, n, k)
VSM
Boolean
Term Frequency
TF-IDF
Log(TF+1)
Entropy
Term
Extraction
Stop-Word
List / Function
Morphological
Analysis
Term
Weighting
IR Methods
Distance
Function
Silhouette Coefficient
Term 1
Term 2
Doc 1
Doc 2
a
a = Cluster Cohesion
Silhouette Coefficient
a = Cluster Cohesion
Term 1
Term 2
Doc 1
Doc 2
b = Cluster Separation
a
b
Good Cluster:
Separation >> Cohesion
b - a
max{a, b}
Silhouette = ∈ [-1; 1]
Search-Based Solution (GA-IR)
1) Problem Reformulation: Finding the IR process which maximises the quality
of clusters
2) Solution Encoding:
3) Fitness Function: Silhouette Coefficient
4) Solver: Genetic Algorithms
- Uniform mutation with probability of 1/n
- Population size = 50
- Single -point crossover with probability 0.80
- N. generations = 100
X =
b - a
max{a, b}
Character
Pruning
Identifier
Splitting
Stop word
Removal
Morph.
Analysis
IR Method
Similarity
Function
No pruning Camel
Case
No Stop
word
Porter
Stemmer
LSI (k=10) Cosine
Empirical Evaluation
Task1: Traceability Recovery Task2: Duplicate Bug Report
Identification
Project Version # Reports # Dupl.
Eclipse 3.0 224 44
Project LOC Source Target
EasyClinic 20k Use Case
Java
Classes
eTour 40K Use Case Java
Classes
i-Trus 10k Use Case JSP
Performance metrics:
• Precision
• Recall
• Average Precision
Performance metrics:
• Recall Rate
• List size
Performance metrics:
• Recall Rate
• List size
Empirical Evaluation
Task1: Traceability Recovery Task2: Duplicate Bug Report
Identification
Project Version # Reports # Dupl.
Eclipse 3.0 224 44
Project LOC Source Target
EasyClinic 20k Use Case
Java
Classes
eTour 40K Use Case Java
Classes
i-Trus 10k Use Case JSP
Performance metrics:
• Precision
• Recall
• Average Precision
Task 1: Traceability Recovery
System = eTour
Size = 40K LOC
# IR Config. = 89,640
Task 1: Traceability Recovery
System = eTour
Size = 40K LOC
# IR Config. = 89,640
AveragePrecision
Frequency
High Results Variability
Task 1: Traceability Recovery
System = eTour
Size = 40K LOC
# IR Config. = 89,640
AveragePrecision
Frequency
RQ1: How do the settings of IR-based techniques
instantiated by GA-IR compare to an ideal configuration?
Task 1: Traceability Recovery
System = eTour
Size = 40K LOC
# IR Config. = 89,640
AveragePrecision
Frequency
Average Precision:
- Ideal Config. = 47.01%
- GA-IR = 46.94%
GA-IR
RQ1: How do the settings of IR-based techniques instantiated
by GA-IR compare to an ideal configuration?
Mean over 30
independent runs
Task 1: Traceability Recovery
System = eTour
Size = 40K LOC
# IR Config. = 89,640
AveragePrecision
Frequency
Average Precision:
- Ideal Config. = 47.01%
- GA-IR = 46.94%
GA-IR
RQ2: How do the IR techniques and configurations
instantiated by GA-IR compare to those previously used in
literature for the same tasks?
Task 1: Traceability Recovery
System = eTour
Size = 40K LOC
# IR Config. = 89,640
AveragePrecision
Frequency
Average Precision:
- Ideal Config. = 47.01%
- GA-IR = 46.94%
- Ref. LSI = 30.93%
- Ref. VSM = 29.94%
RQ2: How do the IR techniques and configurations instantiated by GA-IR
compare to those previously used in literature for the same tasks?
GA-IR
LSI
VSM
Ideal GA-IR Ref. LSI Ref. VSM
Task 1: Traceability Recovery
Ideal GA-IR Ref. LSI Ref. VSM
Task 1: Traceability Recovery
GA-IR outperforms baseline (p-value < 0.05)
GA-IR is statistically equivalent to the
ideal configuration (p-value = 1.0)
Clustering Hypothesis
Silhouette Coefficient
AveragePrecision
0 1
50
100
10
0.2- 0.2 0.4 0.6 0.8
eTour
Clustering Hypothesis
AveragePrecision
eTour
0 1
50
100
10
0.2- 0.2 0.4 0.6 0.8
Silhouette Coefficient
Thanks! Question?

Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic Algorithms

  • 1.
    Parameterizing and Assembling IR-basedSolutions for SE Tasks using Genetic Algorithms Annibale Panichella, Bogdan Dit, Rocco Oliveto, Max Di Penta, Denys Poshyvanyk and Andrea De Lucia
  • 2.
    Information Retrieval 600 papers in10 years 30 applications in S.E.
  • 3.
    Use Case InsertLaboratory Data Description The user inserts the data of a specific laboratory Events 1. The user opens the Laboratory GUI 2. The user inserts the Laboratory data . . . /* *This class implements the GUI for managing laboratories data */ public class GUILaboratoryData { private jFrame window; private jButton insert; ... public GUILaboratoryData(){ window = new JFrame(); insert = new JButton(); ... } ... } GUILaboratoryData.java Information Retrieval Use Case UC 11.4
  • 4.
    Use Case InsertLaboratory Data Description The user inserts the data of a specific laboratory Events 1. The user opens the Laboratory GUI 2. The user inserts the Laboratory data . . . /* *This class implements the GUI for managing laboratories data */ public class GUILaboratoryData { private jFrame window; private jButton insert; ... public GUILaboratoryData(){ window = new JFrame(); insert = new JButton(); ... } ... } GUILaboratoryData.java Use Case UC 11.4 Information Retrieval
  • 5.
    Use Case InsertLaboratory Data Description The user inserts the data of a specific laboratory Events 1. The user opens the Laboratory GUI 2. The user inserts the Laboratory data . . . /* *This class implements the GUI for managing laboratories data */ public class GUI Laboratory Data { private j Frame window; private j Button insert; ... public GUI Laboratory Data(){ window = new JFrame(); insert = new JButton(); ... } ... } GUILaboratoryData.java Information Retrieval Use Case UC 11.4
  • 6.
    Use Case InsertLaboratory Data Description The user inserts the data of a specific laboratory Events 1. The user opens the Laboratory GUI 2. The user inserts the Laboratory data . . . /* *This class implements the GUI for managing laboratories data */ public class GUI Laboratory Data { private j Frame window; private j Button insert; ... public GUI Laboratory Data(){ window = new JFrame(); insert = new JButton(); ... } ... } GUILaboratoryData.java Information Retrieval Use Case UC 11.4 Textual Similarity = 42%
  • 7.
    IR Process Term Extraction Stop-Word List /Function Morphological Analysis Term Weighting IR Methods Distance Function Software Maintenance task: - Traceability Recovery - Source code labelling - Bug duplication - Feature Location - … Software Artifacts
  • 8.
    IR Process Special Chars. Digits Whitespace .. No stop word Stop-word function Java stop-word list English stop-word list Italian stop-word list … No Stemmer Porter English Snowball Italian Snowball …. Cosine Similarity Hellinger Distance Jaccard Dist. Euclidean LSI (k) LDA (ɑ, β, n, k) VSM PLSI … Boolean Term Freq TF-IDF Log(TF+1) Entropy Term Extraction Stop-Word List / Function Morphological Analysis Term Weighting IR Methods Distance Function Software Maintenance task: - Traceability Recovery - Source code labelling - Bug duplication - Feature Location - … Software Artifacts
  • 9.
    IR Process Special Chars. Digits Whitespace .. No stop word Stop-word function Java stop-word list English stop-word list Italian stop-word list … No Stemmer Porter English Snowball Italian Snowball …. Cosine Similarity Hellinger Distance Jaccard Dist. Euclidean LSI (k) LDA (ɑ, β, n, k) VSM PLSI … Boolean Term Freq TF-IDF Log(TF+1) Entropy Term Extraction Stop-Word List / Function Morphological Analysis Term Weighting IR Methods Distance Function Software Maintenance task: - Traceability Recovery - Source code labelling - Bug duplication - Feature Location - … Which configuration to use? Software Artifacts
  • 10.
    What is the“best” IR process? It is not possible to build a set of guidelines for assembling IR-based solutions for a given data set
  • 11.
    What is the“best” IR process? It is not possible to build a set of guidelines for assembling IR-based solutions for a given data set Different datasets and different SE tasks require different IR parameters
  • 12.
    What is the“best” IR process? It is not possible to build a set of guidelines for assembling IR-based solutions for a given data set Different datasets and different SE tasks require different IR parameters If not well calibrated, IR techniques perform worst than simple heuristics. A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella, S. Panichella. “Labeling Source Code with Information Retrieval Methods: An Empirical Study”. EMSE 2014
  • 13.
    What is the“best” IR process? It is not possible to build a set of guidelines for assembling IR-based solutions for a given data set Different datasets and different SE tasks require different IR parameters If not well calibrated, IR techniques perform worst than simple heuristics. A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella, S. Panichella. “Labeling Source Code with Information Retrieval Methods: An Empirical Study”. EMSE 2014 for the project and for the SE task
  • 14.
    Strategy to findthe best configuration…
  • 15.
    Strategy to findthe best configuration… It must be unsupervised..
  • 16.
    Strategy to findthe best configuration… It must be unsupervised.. and general…
  • 17.
    Our Idea Information Retrieval ClusteringAnalysis Conjecture: there is a relationship between quality of clusters and IR process performances
  • 18.
    Predicting the performances? Term1 Term 2 Doc 1 Doc 2 Special Chars. Digits White space .. No stop word Stop-word function Java stop-word list English stop-word list Italian stop-word list … No Stemmer Porter English Snowball Italian Snowball …. Cosine Similarity Hellinger Distance Jaccard Dist. Euclidean LSI (k) LDA (ɑ, β, n, k) VSM Boolean Term Frequency TF-IDF Log(TF+1) Entropy Term Extraction Stop-Word List / Function Morphological Analysis Term Weighting IR Methods Distance Function
  • 19.
    Predicting the performances? Term1 Term 2 Doc 1 Doc 2 Special Chars. Digits White space .. No stop word Stop-word function Java stop-word list English stop-word list Italian stop-word list … No Stemmer Porter English Snowball Italian Snowball …. Cosine Similarity Hellinger Distance Jaccard Dist. Euclidean LSI (k = 10) LDA (ɑ, β, n, k) VSM Boolean Term Frequency TF-IDF Log(TF+1) Entropy Term Extraction Stop-Word List / Function Morphological Analysis Term Weighting IR Methods Distance Function
  • 20.
    Predicting the performances? Term1 Term 2 Doc 1 Doc 2 Special Chars. Digits White space .. No stop word Stop-word function Java stop-word list English stop-word list Italian stop-word list … No Stemmer Porter English Snowball Italian Snowball …. Cosine Similarity Hellinger Distance Jaccard Dist. Euclidean LSI (k = 5) LDA (ɑ, β, n, k) VSM Boolean Term Frequency TF-IDF Log(TF+1) Entropy Term Extraction Stop-Word List / Function Morphological Analysis Term Weighting IR Methods Distance Function
  • 21.
    Predicting the performances? Term1 Term 2 Doc 1 Doc 2 Special Chars. Digits White space .. No stop word Stop-word function Java stop-word list English stop-word list Italian stop-word list … No Stemmer Porter English Snowball Italian Snowball …. Cosine Similarity Hellinger Distance Jaccard Dist. Euclidean Dist. LSI (k = 3) LDA (ɑ, β, n, k) VSM Boolean Term Frequency TF-IDF Log(TF+1) Entropy Term Extraction Stop-Word List / Function Morphological Analysis Term Weighting IR Methods Distance Function
  • 22.
    Silhouette Coefficient Term 1 Term2 Doc 1 Doc 2 a a = Cluster Cohesion
  • 23.
    Silhouette Coefficient a =Cluster Cohesion Term 1 Term 2 Doc 1 Doc 2 b = Cluster Separation a b Good Cluster: Separation >> Cohesion b - a max{a, b} Silhouette = ∈ [-1; 1]
  • 24.
    Search-Based Solution (GA-IR) 1)Problem Reformulation: Finding the IR process which maximises the quality of clusters 2) Solution Encoding: 3) Fitness Function: Silhouette Coefficient 4) Solver: Genetic Algorithms - Uniform mutation with probability of 1/n - Population size = 50 - Single -point crossover with probability 0.80 - N. generations = 100 X = b - a max{a, b} Character Pruning Identifier Splitting Stop word Removal Morph. Analysis IR Method Similarity Function No pruning Camel Case No Stop word Porter Stemmer LSI (k=10) Cosine
  • 25.
    Empirical Evaluation Task1: TraceabilityRecovery Task2: Duplicate Bug Report Identification Project Version # Reports # Dupl. Eclipse 3.0 224 44 Project LOC Source Target EasyClinic 20k Use Case Java Classes eTour 40K Use Case Java Classes i-Trus 10k Use Case JSP Performance metrics: • Precision • Recall • Average Precision Performance metrics: • Recall Rate • List size
  • 26.
    Performance metrics: • RecallRate • List size Empirical Evaluation Task1: Traceability Recovery Task2: Duplicate Bug Report Identification Project Version # Reports # Dupl. Eclipse 3.0 224 44 Project LOC Source Target EasyClinic 20k Use Case Java Classes eTour 40K Use Case Java Classes i-Trus 10k Use Case JSP Performance metrics: • Precision • Recall • Average Precision
  • 27.
    Task 1: TraceabilityRecovery System = eTour Size = 40K LOC # IR Config. = 89,640
  • 28.
    Task 1: TraceabilityRecovery System = eTour Size = 40K LOC # IR Config. = 89,640 AveragePrecision Frequency High Results Variability
  • 29.
    Task 1: TraceabilityRecovery System = eTour Size = 40K LOC # IR Config. = 89,640 AveragePrecision Frequency RQ1: How do the settings of IR-based techniques instantiated by GA-IR compare to an ideal configuration?
  • 30.
    Task 1: TraceabilityRecovery System = eTour Size = 40K LOC # IR Config. = 89,640 AveragePrecision Frequency Average Precision: - Ideal Config. = 47.01% - GA-IR = 46.94% GA-IR RQ1: How do the settings of IR-based techniques instantiated by GA-IR compare to an ideal configuration? Mean over 30 independent runs
  • 31.
    Task 1: TraceabilityRecovery System = eTour Size = 40K LOC # IR Config. = 89,640 AveragePrecision Frequency Average Precision: - Ideal Config. = 47.01% - GA-IR = 46.94% GA-IR RQ2: How do the IR techniques and configurations instantiated by GA-IR compare to those previously used in literature for the same tasks?
  • 32.
    Task 1: TraceabilityRecovery System = eTour Size = 40K LOC # IR Config. = 89,640 AveragePrecision Frequency Average Precision: - Ideal Config. = 47.01% - GA-IR = 46.94% - Ref. LSI = 30.93% - Ref. VSM = 29.94% RQ2: How do the IR techniques and configurations instantiated by GA-IR compare to those previously used in literature for the same tasks? GA-IR LSI VSM
  • 33.
    Ideal GA-IR Ref.LSI Ref. VSM Task 1: Traceability Recovery
  • 34.
    Ideal GA-IR Ref.LSI Ref. VSM Task 1: Traceability Recovery GA-IR outperforms baseline (p-value < 0.05) GA-IR is statistically equivalent to the ideal configuration (p-value = 1.0)
  • 35.
  • 36.
  • 37.