Information Retrieval (IR) methods have been widely adopted to identify traceability links based on the textual similarity of software artifacts. However, noise due to word usage in software artifacts might negatively affect the recovery accuracy. We propose the use of smoothing filters to reduce the effect of noise in software artifacts and improve the performances of traceability recovery methods. An empirical evaluation performed on two repositories indicates that the usage of a smoothing filter is able to significantly improve the performances of Vector Space Model and Latent Semantic Indexing. Such a result suggests that, besides being used for traceability recovery, the proposed filter can be used to improve the performances of various other software engineering approaches based on textual analysis.
Cross-project defect prediction is very appealing because (i) it allows predicting defects in projects for which the availability of data is limited, and (ii) it allows producing generalizable prediction models. However, existing research suggests that cross-project prediction is particularly challenging and, due to heterogeneity of projects, prediction accuracy is not always very good. This paper proposes a novel, multi-objective approach for cross-project defect prediction, based on a multi-objective logistic regression model built using a genetic algorithm. Instead of providing the software engineer with a single predictive model, the multi-objective approach allows software engineers to choose predictors achieving a compromise between number of likely defect-prone artifacts (effectiveness) and LOC to be analyzed/tested (which can be considered as a proxy of the cost of code inspection). Results of an empirical evaluation on 10 datasets from the Promise repository indicate the superiority and the usefulness of the multi-objective approach with respect to single-objective predictors. Also, the proposed approach outperforms an alternative approach for cross-project prediction, based on local prediction upon clusters of similar classes.
How the Evolution of Emerging Collaborations Relates to Code Changes: An Empirical Study (Sebastiano Panichella)
Developers contributing to open source projects spontaneously group into "emerging" teams, reflected by messages exchanged over mailing lists, issue trackers and other communication means. Previous studies suggested that such teams somewhat mirror the software modularity. This paper empirically investigates how, when a project evolves, emerging teams re-organize themselves, e.g., by splitting or merging. We relate the evolution of teams to the files they change, to investigate whether teams split to work on cohesive groups of files. Results of this study conducted on the evolution history of four open source projects, namely Apache HTTPD, Eclipse JDT, Netbeans, and Samba, provide indications of what happens in the project when teams reorganize. Specifically, we found that emerging team splits imply working on more cohesive groups of files, and emerging team merges imply working on groups of files that are cohesive from a structural perspective. Such indications serve to better understand the evolution of software projects. More importantly, the observation of how emerging teams change can serve to suggest software remodularization actions.
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Extraction
We present two terminology extraction tools to compare a knowledge-poor and a knowledge-rich approach. Both tools process SWT and MWT and are designed to handle multilingualism. We run an evaluation on 6 languages and 2 different domains using crawled comparable corpora and hand-crafted reference term lists (RTL). We discuss the 3 main results achieved for terminology extraction. The first two evaluation scenarios concern the knowledge-rich framework. Scenario 1 (S1) compares performances for each of the languages depending on the ranking that is applied: specificity score vs. the number of occurrences. Scenario 2 (S2) examines the relevancy of the term variant identification to increase the precision ranking for any of the languages. Scenario 3 (S3) compares both tools and demonstrates that a probabilistic term extraction approach, developed with minimal effort, achieves satisfactory results when compared to a rule-based method.
Conference: CICLing 2013, Samos
Word2Vec model to generate synonyms on the fly in Apache Lucene (Sease)
If you want to expand your queries/documents with synonyms in Apache Lucene, you need a predefined file containing the list of terms that share the same semantics. It’s not always easy to find a list of basic synonyms for a language and, even if you find one, it doesn’t necessarily match your contextual domain.
The term “daemon” in the domain of operating system articles is not a synonym of “devil” but it’s closer to the term “process”.
Word2Vec is a two-layer neural network that takes as input a text and outputs a vector representation for each word in the dictionary. Two words with similar meanings are identified with two vectors close to each other.
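As a minimal illustration of "close vectors", here is a stdlib-only cosine-similarity sketch. The 3-dimensional vectors below are invented for the example; real Word2Vec embeddings have hundreds of dimensions and are learned from a corpus:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Invented 3-dimensional "embeddings" for illustration only.
vectors = {
    "daemon":  [0.9, 0.1, 0.2],
    "process": [0.8, 0.2, 0.3],
    "devil":   [0.1, 0.9, 0.8],
}

# In a corpus of operating-system articles we would expect "daemon"
# to land near "process" and far from "devil".
sim_process = cosine(vectors["daemon"], vectors["process"])
sim_devil = cosine(vectors["daemon"], vectors["devil"])
```

In a real setup these similarities would come from a trained model, and the top-k nearest words would be injected as synonyms at query or index time.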
On how to change the utility curve of deep learning to make deep learning projects deliver an ROI no matter how accurate the machine learning system is - presented at the Nasscom Analytics Summit 2018.
The presentation was given at the Rivier Scala / Clojure User Group meeting on 10.6.2013. It is a half-baked presentation; I will upload the final version when ready.
The first part is about DSLs in general, complexities in software engineering, and abstraction. The second part presents a quick overview of DSLs in Scala and touches on some of the technologies used for deep embedding.
Maliheh (Mali) Izadi, PhD, Andrea Di Sorbo, and Sebastiano Panichella co-chaired the 3rd Intl. Workshop on NL-based Software Engineering, April 20, 2024, Lisbon, Portugal.
Diversity-guided Search Exploration for Self-driving Cars Test Generation through Frenet Space Encoding (Sebastiano Panichella)
Timo Blattner, Christian Birchler, Timo Kehrer, Sebastiano Panichella: Diversity-guided Search Exploration for Self-driving Cars Test Generation through Frenet Space Encoding. Intl. Workshop on Search-Based and Fuzz Testing (SBFT). 2024
SBFT Tool Competition 2024 -- Python Test Case Generation Track (Sebastiano Panichella)
Nicolas Erni, Al-Ameen, Mohammed, Christian Birchler, Pouria Derakhshanfar, Stephan Lukasczyk, Sebastiano Panichella: SBFT Tool Competition 2024 -- Python Test Case Generation Track. 17th International Workshop on Search-Based and Fuzz Testing
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track (Sebastiano Panichella)
Sajad Khatiri, Prasun Saurabh, Timothy Zimmermann, Charith Munasinghe, Christian Birchler, Sebastiano Panichella: SBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track. 17th International Workshop on Search-Based and Fuzz Testing
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist (Sebastiano Panichella)
Sajad Khatiri, Sebastiano Panichella, Paolo Tonella: Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist. International Conference on Software Engineering. 2024
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective Test Generation and Selection (Sebastiano Panichella)
Lecture entitled "Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective Test Generation and Selection" at the International Summer School on Search- and Machine Learning-based Software Engineering
June 22-24, 2022 - Córdoba, Spain
Sebastiano Panichella and Christian Birchler
COSMOS: DevOps for Complex Cyber-physical Systems
Sebastiano Panichella
Zurich University of Applied Sciences (ZHAW)
Workshop on Adaptive CPSoS (WASOS) 2023
Testing and Development Challenges for Complex Cyber-Physical Systems: Insights from the COSMOS H2020 Project (Sebastiano Panichella)
Keynote presentation at ICST (AIST workshop) entitled "Testing and Development Challenges for Complex Cyber-Physical Systems: Insights from the COSMOS H2020 Project"
An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical Systems (Sebastiano Panichella)
Presentation at the 16th IEEE International Conference on Software Testing, Verification and Validation (ICST): An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical Systems. Journal of Systems & Software (JSS).
Automated Identification and Qualitative Characterization of Safety Concerns Reported in UAV Software Platforms (Sebastiano Panichella)
Presentation at the IEEE/ACM International Conference on Automated Software Engineering (ASE 2023): “Automated Identification and Qualitative Characterization of Safety Concerns Reported in UAV Software Platforms”. Transactions on Software Engineering and Methodology.
Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Neighborhood of Real Flights (Sebastiano Panichella)
Here are the slides of the presentation of the paper entitled "Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Neighborhood of Real Flights". It was presented at the IEEE International Conference on Software Testing, Verification, and Validation (ICST) 2023.
The presentation concerns the ongoing research in the COSMOS H2020 project (https://www.cosmos-devops.org/), as outlined by the ICST Program (https://conf.researchr.org/program/icst-2023/program-icst-2023/?past=Show%20upcoming%20events%20only).
Exposed! A case study on the vulnerability-proneness of Google Play Apps (Sebastiano Panichella)
Title: Exposed! A case study on the vulnerability-proneness of Google Play Apps
Authors:
Andrea Di Sorbo, Sebastiano Panichella
Venue:
ESEC/FSE - Journal First Presentation
14-18 November 2022, Singapore
Video:
https://www.youtube.com/watch?v=9lv3WGuNM0A&ab_channel=Sebastiano
Search-based Software Testing (SBST) '22
Workshop Co-Chairs:
Giovani Guizzo
UNIVERSITY COLLEGE LONDON, UNITED KINGDOM
Sebastiano Panichella
ZURICH UNIVERSITY OF APPLIED SCIENCE, SWITZERLAND
Competition Co-Chairs:
Alessio Gambi
UNIVERSITY OF PASSAU, GERMANY
Gunel Jahangirova
UNIVERSITÀ DELLA SVIZZERA ITALIANA, SWITZERLAND
Vincenzo Riccio
UNIVERSITÀ DELLA SVIZZERA ITALIANA, SWITZERLAND
Fiorella Zampetti
UNIVERSITY OF SANNIO, ITALY
Website Chair:
Rebecca Moussa
UNIVERSITY COLLEGE LONDON, UNITED KINGDOM
Program Committee:
Nazareno Aguirre, Universidad Nacional de Río Cuarto - CONICET, Argentina
Aldeida Aleti, Monash University, Australia
Giuliano Antoniol, Ecole Polytechnique de Montréal, Canada
Kate Bowers, Oakland University, USA
Jose Campos, University of Washington, USA
Thelma E. Colanzi, State University of Maringá, Brazil
Byron DeVries, Grand Valley State University, USA
Gordon Fraser, University of Passau, Germany
Erik Fredericks, Oakland University, USA
Gregory Gay, Chalmers and the University of Gothenburg, Sweden
Alessandra Gorla, IMDEA Software Institute, Spain
Gregory Kapfhammer, Allegheny College, USA
Yiling Lou, Peking University, China
Mitchell Olsthoorn, Delft University of Technology, Netherlands
Justyna Petke, University College London, UK
Silvia R. Vergilio, Universidade Federal do Paraná, Brazil
Simone do Rocio Senger de Souza, University of São Paulo, Brazil
Thomas Vogel, Humboldt-Universität zu Berlin, Germany
Jie Zhang, University College London, UK
Tool Competition
Introduction
NLP-based approaches and tools have been proposed to improve the efficiency of software engineers, processes, and products, by automatically processing natural language artifacts (issues, emails, commits, etc.).
We believe that the availability of accurate tools is becoming increasingly necessary to improve Software Engineering (SE) processes. One important process is issue management and prioritization where developers have to understand, classify, prioritize, assign, etc. incoming issues reported by end-users and developers.
This year, we are pleased to announce the first edition of the NLBSE’22 tool competition on issue report classification, an important task in issue management and prioritization.
For the competition, we provide a dataset encompassing more than 800k labeled issue reports (as bugs, enhancements, and questions) extracted from real open-source projects. You are invited to leverage this dataset for evaluating your classification approaches and compare the achieved results against a proposed baseline approach (based on FastText).
Competition overview
We created a Colab notebook with detailed information about the competition (provided data, baseline approach, paper submission, paper format, etc.).
If you want to participate, you must:
Train and tune a multi-class classifier using the provided training set. The classifier should assign exactly one label (bug, enhancement, or question) to each issue.
Evaluate your classifier on the provided test set
Write a paper (4 pages max.) describing:
The architecture and details of the classifier
The procedure used to pre-process the data
The procedure used to tune the classifier on the training set
The results of your classifier on the test set
Additional info.: provide a link to your code/tool with proper documentation on how to run it
Submit the paper by emailing the tool competition organizers (see below)
Submissions will be evaluated and accepted based on correctness and reproducibility, defined by the following criteria:
Clarity and detail of the paper content
Availability of the code/tool, released as open-source
Correct training/tuning/evaluation of your code/tool on the provided data
Clarity of the code documentation
The accepted submissions will be published at the workshop proceedings.
The submissions will be ranked based on the F1 score achieved by the proposed classifiers on the test set, as indicated in the papers.
The submission with the highest F1 score will be the winner of the competition.
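As an illustration of the ranking metric, here is a stdlib-only sketch of how an F1 score can be computed from per-class confusion counts. The counts below and the micro-averaging choice are assumptions made for the example, not the competition's official evaluation procedure:

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical per-class (tp, fp, fn) counts on a test set.
counts = {"bug": (80, 10, 20),
          "enhancement": (60, 15, 10),
          "question": (30, 5, 15)}

# Micro-averaging pools the counts over all classes before computing
# F1 (macro-averaging would instead average per-class F1 scores).
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1_score(tp, fp, fn)
```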
How to participate?
Email your paper to Oscar Chaparro (oscarch@wm.edu) and Rafael Kallis (rk@rafaelkallis.com) by the submission deadline.
ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters
1. Improving IR-based Traceability Recovery Using Smoothing Filters
Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella
2. Software traceability
“The degree to which a relationship can be established between two products of a software development process” [IEEE Glossary for Software Terminology]
[Diagram: traceability links between use cases, source code, and test cases]
Important for: program comprehension, requirement tracing, impact analysis, software reuse, …
Up-to-date traceability links rarely exist → need to recover them
3. IR-based traceability recovery
Antoniol et al., 2002 (VSM + probabilistic model)
Marcus and Maletic, 2003 (LSI)
4. Traditional IR vs. IR applied to Software Engineering
Traditional IR deals with documents that are heterogeneous for what concerns linguistic choices, syntax, and semantics; we just live with those differences.
IR applied to SE deals with sets of documents that are homogeneous for what concerns syntax and linguistic choices. Examples: use cases, test documents, and design documents follow a common template and contain recurrent words.
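This homogeneity is what IR-based recovery exploits: artifacts are turned into term vectors and compared by similarity. A minimal VSM (tf-idf + cosine) sketch, with invented toy artifacts standing in for real use cases and code files:

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """Turn tokenized documents into tf-idf dicts sharing one idf."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf * log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented toy artifacts (real studies use full use cases and code).
use_cases = [["modify", "visit", "patient"], ["print", "report"]]
code_files = [["visit", "patient", "date", "change"],
              ["report", "printer", "queue"]]

vecs = tfidf_vectors(use_cases + code_files)   # one shared vocabulary
uc_vecs, code_vecs = vecs[:len(use_cases)], vecs[len(use_cases):]
# Candidate links, ranked by similarity: one row per use case.
sims = [[cosine(u, c) for c in code_vecs] for u in uc_vecs]
```

Candidate links are then presented to the analyst in decreasing order of similarity.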
5. Problem
Different kinds of software artifacts require specific preprocessing. Example test case:
Test case: C51, "Change the date for a visit" (Version: 0 02 000)
Use case: UcModVis, "Satisfies the request to modify a visit for a patient"
Priority: High
....
Test description:
Input: Select a visit: 26/09/2003 11:00 First visit; Change: 03/10/2003 11:00
Oracle: Invalid sequence: The system does not allow to change a booking
Coverage: Valid classes: CE1 CE8 CE14 CE19 CE21; Invalid classes: None
6. Problem
Different kinds of software artifacts require specific preprocessing: in the test case above, artifact-specific words do not bring useful information.
8. Noisy images
Noise appears as pixels with peaks of low or high color intensity.
9. Reducing noise using smoothing filters
Mean filter: g(x, y) = (1/M) · Σ_{f(n,m) ∈ S} f(n, m), where S is the neighborhood of pixel (x, y) and M is the number of pixels in S.
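A minimal sketch of such a mean filter on a 2D grid; clipping the window at the image borders is an implementation choice, not prescribed by the slide:

```python
def mean_filter(img, radius=1):
    """Replace each pixel with the mean of its (2r+1) x (2r+1)
    neighborhood S; the window is clipped at the image borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[n][m]
                      for n in range(max(0, y - radius), min(h, y + radius + 1))
                      for m in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

# A flat image with one noisy spike: smoothing pulls the spike
# toward its neighbours.
noisy = [[10, 10, 10],
         [10, 100, 10],
         [10, 10, 10]]
smoothed = mean_filter(noisy)   # smoothed[1][1] == 20.0
```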
10. Image vs. traceability noise
Image noise: pixels with high or low color intensity; pixels are position dependent.
Traceability noise: terms and linguistic patterns occurring in many artifacts of a given category (use cases, test cases, …); artifacts (columns) are position independent.
11. Representing the noise
Source documents s1, s2, …, sk and target documents t1, t2, …, tz are represented as term-by-document matrices, where entry v[i,j] is the weight of word_i in document j. Each matrix contains linguistic information strictly belonging to its own set (source or target), plus common information shared by all documents of that set.
12. Representing the noise
The common information of each set is captured by its mean vector:
Mean source vector: S̄[i] = (1/k) · Σ_{j=1..k} v[i,j]
Mean target vector: T̄[i] = (1/z) · Σ_{j=k+1..k+z} v[i,j]
The mean vectors are like the continuous component of a signal…
13. Representing the noise
The filter subtracts the mean vector of each set from every document of that set: subtracting S̄ (mean source vector) from each source column yields the filtered source set; subtracting T̄ (mean target vector) from each target column yields the filtered target set.
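The subtraction step described in slide 13 can be sketched as follows. For readability the matrix is stored as documents x terms (the transpose of the slides' term-by-document layout), and the same function would be applied separately to the source set and to the target set:

```python
def smooth_docs(matrix):
    """Subtract the per-term mean over the set from every document.
    `matrix` holds one list of term weights per document."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    mean = [sum(doc[i] for doc in matrix) / n_docs for i in range(n_terms)]
    return [[doc[i] - mean[i] for i in range(n_terms)] for doc in matrix]

# Toy source set: term 0 has a similar weight in every document
# ("common information"); the filter pushes it toward zero.
source_docs = [[0.9, 0.1, 0.0],
               [1.0, 0.0, 0.5],
               [1.1, 0.2, 0.1]]
filtered_source = smooth_docs(source_docs)
```

After filtering, each term's weights are centered on zero within the set, so terms occurring uniformly across all artifacts of a category no longer dominate the similarity computation.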
14. Empirical Study
Goal: analyze the effect of the smoothing filter
Purpose: investigating how the filter affects traceability recovery
Quality focus: traceability recovery performance (precision and recall)
Perspective: researchers (evaluating the novel technique) and project managers (adopting a better traceability recovery technique)
Context: artifacts from two systems, EasyClinic and Pine
15. Context
                 EasyClinic                       Pine
Description      Medical doctor office mgmt.     Text-based email client
Language         Java                            C
Files/Classes    37                              31
KLOC             20                              130
Documents        113                             100
Doc. language    Italian                         English
Artifacts        Use cases, interaction          Requirements, use cases
                 diagrams, source code,
                 test cases
16. Research Questions and Factors
RQ1: Does the smoothing filter improve the recovery performances of VSM-based traceability recovery?
RQ2: Does the smoothing filter improve the recovery performances of LSI-based traceability recovery?
RQ3: How do the performances vary for different types of artifacts?
Factors:
Use of filter: YES, NO
Technique: VSM, LSI
Artifact: Req., UC, Int. Diagrams, Code, TC
System: EasyClinic, Pine
17. Analysis Method
Performances evaluated by precision and recall:
precision = |correct ∩ retrieved| / |retrieved|
recall = |correct ∩ retrieved| / |correct|
We statistically compare the number of false positives of different methods (e.g., M1 vs. M2, per-link false-positive counts) for each correct link identified, using:
Wilcoxon Rank Sum test
Cliff’s delta effect size
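The two metrics can be computed directly from link sets; the link identifiers below are hypothetical:

```python
def precision_recall(correct, retrieved):
    """Precision and recall of retrieved links against an oracle."""
    correct, retrieved = set(correct), set(retrieved)
    hits = correct & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    return precision, recall

# Hypothetical link sets (pairs: use-case id, code-file id).
oracle = {("UC1", "C1"), ("UC2", "C2"), ("UC3", "C3"), ("UC4", "C4")}
retrieved = {("UC1", "C1"), ("UC2", "C2"), ("UC3", "C3"),
             ("UC2", "C7"), ("UC5", "C9")}
p, r = precision_recall(oracle, retrieved)   # p = 3/5, r = 3/4
```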
21. EasyClinic: test cases onto source code (LSI)
[Precision/recall curves: Filtered vs. Not Filtered]
Test cases are: short documents, with a limited vocabulary, mostly consistent with source code
22. Pine: use cases onto requirements (LSI)
[Precision/recall curves: Filtered vs. Not Filtered]
24. Link precision improvement
Login Patient vs. Person: poor vocabulary overlap (10%)
25. Threats to validity
Construct validity: mainly related to our oracle, provided by developers and, for EasyClinic, also peer-reviewed
Internal validity: improvements could be due to other reasons; however, we compared different techniques (VSM, LSI), and the approach works well regardless of stop word removal/stemming and use of tf-idf
Conclusion validity: conclusions based on proper (non-parametric) statistics
External validity: we considered systems with different characteristics and artifacts, but further studies are desirable
26. Conclusions
We proposed the use of a smoothing filter to improve the performances of IR-based traceability recovery; the idea is inspired by digital signal processing
The filter significantly improves IR-based traceability recovery based on VSM (RQ1) and LSI (RQ2)
The filter is particularly suitable for artifacts having a higher verbosity (RQ3), e.g., requirements and use cases
It is less useful for artifacts composed of short sentences and using a limited vocabulary, e.g., test cases
27. Work-in-progress
Study replication: different systems and artifacts
Use of relevance feedback
More sophisticated smoothing techniques: non-linear filters
Use in other applications of IR to software engineering, e.g., impact analysis or feature location