ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Information Retrieval methods have been widely adopted to identify traceability links based on the textual similarity of software artifacts. However, noise due to word usage in software artifacts might negatively affect recovery accuracy. We propose the use of smoothing filters to reduce the effect of noise in software artifacts and improve the performance of traceability recovery methods. An empirical evaluation performed on two repositories indicates that a smoothing filter significantly improves the performance of the Vector Space Model and Latent Semantic Indexing. This result suggests that, beyond traceability recovery, the proposed filter can be used to improve the performance of other software engineering approaches based on textual analysis.


  1. 1. Improving IR-based Traceability Recovery Using Smoothing Filters. Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella
  2. 2. Software traceability: “The degree to which a relationship can be established between two products of a software development process” [IEEE Glossary for Software Terminology]
      • Example artifacts to be linked: use case, test case, source code
      • Important for: program comprehension, requirement tracing, impact analysis, software reuse, …
      • Up-to-date traceability links rarely exist → need to recover them
  3. 3. IR-based traceability recovery
      • Antoniol et al., 2002 (VSM + probabilistic model)
      • Marcus and Maletic, 2003 (LSI)
  4. 4. Traditional IR vs. IR applied to Software Engineering
      • Traditional IR: deals with documents that are heterogeneous in linguistic choices, syntax, and semantics; we just live with those differences
      • IR applied to SE: we have sets of documents that are homogeneous in syntax and linguistic choices. Examples: use cases, test documents, and design documents follow a common template and contain recurrent words
  5. 5. Problem Different kinds of software artifacts require specific preprocessing Test case Change the date for a visit: Test case Change the date for a visit: C51 C51 Version: 0 02 000 Version: 0 02 000 Use case Use case Satisfies the request to modify a visit for a patient request to modify a visit Satisfies the for a patient UcModVis UcModVis Priority Priority High High .... .... Test description Test description Input Input Select a visit: Select a visit: 26/09/2003 11:00 First visit 26/09/2003 11:00 First visit Change: 03/10/2003 11:00 Change: 03/10/2003 11:00 Oracle Oracle Invalid sequence: The system does not allow Invalid sequence: The system does not allow to change a booking to change a booking Coverage Coverage Valid classes: CE1 CE8 CE14 CE19 CE21 Valid classes: CE1 CE8 CE14 CE19 CE21 Invalid classes: None Invalid classes: None
  6. 6. Problem Different kinds of software artifacts require specific preprocessing Test case Change the date for a visit: Test case Change the date for a visit: C51 C51 Version: 0 02 000 Version: 0 02 000 Use case Use case Satisfies the request to modify a visit for a patient request to modify a visit Satisfies the for a patient UcModVis UcModVis Priority Priority High High Artifact-specific words do .... .... not bring useful Test description Test description Input Select a visit: information Input Select a visit: 26/09/2003 11:00 First visit 26/09/2003 11:00 First visit Change: 03/10/2003 11:00 Change: 03/10/2003 11:00 Oracle Oracle Invalid sequence: The system does not allow Invalid sequence: The system does not allow to change a booking to change a booking Coverage Coverage Valid classes: CE1 CE8 CE14 CE19 CE21 Valid classes: CE1 CE8 CE14 CE19 CE21 Invalid classes: None Invalid classes: None
  7. 7. A similar problem: image processing
  8. 8. Noisy images. Noise: pixels with peaks of low color intensity and pixels with peaks of high color intensity
  9. 9. Reducing noise using smoothing filters. Mean filter: $g(x, y) = \frac{1}{M} \sum_{f(n, m) \in S} f(n, m)$
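The mean filter on this slide can be illustrated with a small sketch. The slide does not define its symbols, so the code assumes the usual reading: S is the neighbourhood of pixel (x, y) and M is the number of pixels in it. The function name `mean_filter`, the `radius` parameter, and the border handling (clipping the window at the image edge) are illustrative choices, not taken from the slides.

```python
def mean_filter(image, radius=1):
    """Replace each pixel with the mean of its neighbourhood S.

    `image` is a list of rows of numeric intensities. The neighbourhood
    is a square window of the given radius, clipped at the borders
    (an assumption; the slides do not say how borders are handled).
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            # Collect the pixels f(n, m) that belong to the neighbourhood S.
            window = [
                image[n][m]
                for n in range(max(0, x - radius), min(rows, x + radius + 1))
                for m in range(max(0, y - radius), min(cols, y + radius + 1))
            ]
            # g(x, y) = (1 / M) * sum of f(n, m) over S, with M = len(window).
            out[x][y] = sum(window) / len(window)
    return out


# A flat image with a single high-intensity noise peak in the centre.
noisy = [
    [10, 10, 10],
    [10, 100, 10],
    [10, 10, 10],
]
smoothed = mean_filter(noisy)
print(smoothed[1][1])  # the peak is pulled towards its neighbours
```

The peak is averaged with its eight neighbours, so its intensity drops sharply while the surrounding pixels rise only slightly.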
  10. 10. Image vs. traceability noise
      • Image noise: pixels with high or low color intensity; pixels are position dependent
      • Traceability noise: terms and linguistic patterns occurring in many artifacts of a given category (use cases, test cases, ...); artifacts (columns) are position independent, e.g., the column order d1, d2 is equivalent to d2, d1
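  11. 11. Representing the noise. Term-by-document matrix with words as rows, source documents $s_1, \dots, s_k$ and target documents $t_1, \dots, t_z$ as columns:

      $$
      \begin{array}{c|ccccc|ccccc}
       & s_1 & s_2 & s_3 & \cdots & s_k & t_1 & t_2 & t_3 & \cdots & t_z \\
      \hline
      \text{word}_1 & v_{1,1} & v_{1,2} & v_{1,3} & \cdots & v_{1,k} & v_{1,1} & v_{1,2} & v_{1,3} & \cdots & v_{1,z} \\
      \text{word}_2 & v_{2,1} & v_{2,2} & v_{2,3} & \cdots & v_{2,k} & v_{2,1} & v_{2,2} & v_{2,3} & \cdots & v_{2,z} \\
      \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
      \text{word}_n & v_{n,1} & v_{n,2} & v_{n,3} & \cdots & v_{n,k} & v_{n,1} & v_{n,2} & v_{n,3} & \cdots & v_{n,z}
      \end{array}
      $$

      • Source block, annotated with: linguistic information strictly belonging to source documents; common information for source documents
      • Target block, annotated with: linguistic information strictly belonging to target documents; common information for target documents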
  12. 12. Representing the noise. Same term-by-document matrix as on the previous slide, with the common information of each block captured by a mean vector (in the sums, the target columns are indexed $j = k+1, \dots, m$ within the full matrix, so $m = k + z$):

      $$
      S = \begin{bmatrix}
      \frac{1}{k} \sum_{j=1}^{k} v_{1,j} \\
      \frac{1}{k} \sum_{j=1}^{k} v_{2,j} \\
      \vdots \\
      \frac{1}{k} \sum_{j=1}^{k} v_{n,j}
      \end{bmatrix}
      \qquad
      T = \begin{bmatrix}
      \frac{1}{z} \sum_{j=k+1}^{m} v_{1,j} \\
      \frac{1}{z} \sum_{j=k+1}^{m} v_{2,j} \\
      \vdots \\
      \frac{1}{z} \sum_{j=k+1}^{m} v_{n,j}
      \end{bmatrix}
      $$

      • $S$: mean source vector (common information for source documents)
      • $T$: mean target vector (common information for target documents)
      • The mean vectors are like the continuous component of a signal…
  13. 13. Representing the noise. Same term-by-document matrix as on the previous slides, now filtered:
      • Subtracting $S$ (mean source vector) from each source column $s_1, \dots, s_k$ gives the filtered source set
      • Subtracting $T$ (mean target vector) from each target column $t_1, \dots, t_z$ gives the filtered target set
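The filtering step on slides 12 and 13 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the names `mean_vector` and `smooth`, the toy vocabulary, and the term weights are all hypothetical. The slides do not say whether negative weights are kept or clipped after subtraction; this sketch keeps them.

```python
def mean_vector(columns):
    """Per-term mean over a set of document vectors (S or T on the slides)."""
    n_terms = len(columns[0])
    return [sum(col[i] for col in columns) / len(columns) for i in range(n_terms)]


def smooth(columns):
    """Subtract the set's mean vector from every document vector in the set."""
    mean = mean_vector(columns)
    return [[v - mu for v, mu in zip(col, mean)] for col in columns]


# Hypothetical vocabulary: ["visit", "patient", "priority"].
# Each inner list is one document (one column of the term-by-document matrix).
sources = [[2, 0, 1], [0, 2, 1]]   # e.g. two use cases
targets = [[3, 1, 0], [1, 1, 0]]   # e.g. two code classes

# Source and target sets are filtered separately, each with its own mean.
filtered_sources = smooth(sources)
filtered_targets = smooth(targets)
print(filtered_sources)
print(filtered_targets)
```

In the toy data, "priority" has the same weight in every source document, as a template word would, so its weight becomes zero in the filtered source set, while the terms that distinguish the documents keep non-zero weights.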
  14. 14. Empirical Study
      • Goal: analyze the effect of the smoothing filter
      • Purpose: investigate how the filter affects traceability recovery
      • Quality focus: traceability recovery performance (precision and recall)
      • Perspective: researchers (evaluating the novel technique) and project managers (adopting a better traceability recovery technique)
      • Context: artifacts from two systems, EasyClinic and Pine
  15. 15. Context

      | | EasyClinic | Pine |
      |---|---|---|
      | Description | Medical doctor office management | Text-based email client |
      | Language (programming) | Java | C |
      | Files/Classes | 37 | 31 |
      | KLOC | 20 | 130 |
      | Documents | 113 | 100 |
      | Language (artifacts) | Italian | English |
      | Artifacts | Use cases, interaction diagrams, source code, test cases | Requirements, use cases |
  16. 16. Research Questions and Factors
      • RQ1: Does the smoothing filter improve the recovery performance of VSM-based traceability recovery?
      • RQ2: Does the smoothing filter improve the recovery performance of LSI-based traceability recovery?
      • RQ3: How does the performance vary for different types of artifacts?
      • Factors: use of filter (YES, NO); technique (VSM, LSI); artifact (Req., UC, Int. Diagrams, Code, TC); system (EasyClinic, Pine)
  17. 17. Analysis Method
      • Performance evaluated by precision and recall: $\text{precision} = \frac{|\text{correct} \cap \text{retrieved}|}{|\text{retrieved}|}$, $\text{recall} = \frac{|\text{correct} \cap \text{retrieved}|}{|\text{correct}|}$
      • We statistically compare the number of false positives of different methods (M1, M2) for each correct link identified, using the Wilcoxon Rank Sum test and Cliff’s delta effect size
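The measures named on this slide can be sketched as follows. The function names, the example links, and the false-positive counts are hypothetical and only illustrate the definitions; they are not data from the study. Cliff's delta is computed from its standard definition (the share of pairs where one sample exceeds the other, minus the share where it is lower). The Wilcoxon Rank Sum test is left out of this sketch.

```python
def precision_recall(retrieved, correct):
    """Precision and recall of a set of retrieved links against an oracle."""
    retrieved, correct = set(retrieved), set(correct)
    hits = len(retrieved & correct)  # |correct ∩ retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (|xs| * |ys|)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    smaller = sum(1 for x in xs for y in ys if x < y)
    return (greater - smaller) / (len(xs) * len(ys))


# Hypothetical links between use cases and classes.
retrieved = {("UC1", "A"), ("UC1", "B"), ("UC2", "C"), ("UC3", "D")}
correct = {("UC1", "A"), ("UC2", "C"), ("UC2", "E")}
precision, recall = precision_recall(retrieved, correct)

# Hypothetical false positives seen before reaching each correct link.
fp_not_filtered = [3, 5, 4, 6]
fp_filtered = [1, 2, 2, 3]
delta = cliffs_delta(fp_not_filtered, fp_filtered)
print(precision, recall, delta)
```

A delta close to 1 means the unfiltered method almost always produces more false positives per correct link than the filtered one. A delta near 0 means the two distributions largely overlap.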
  18. 18. Results
  19. 19. EasyClinic: Use cases into source (VSM). [Plot: precision vs. recall, filtered vs. not filtered]
  20. 20. EasyClinic: Use cases into source (LSI). [Plot: precision vs. recall, filtered vs. not filtered]
  21. 21. EasyClinic: Test cases into source (LSI). [Plot: precision vs. recall, filtered vs. not filtered]
      • Test cases are: short documents; limited vocabulary; mostly consistent with source code
  22. 22. Pine: Use cases into requirements (LSI). [Plot: precision vs. recall, filtered vs. not filtered]
  23. 23. Statistical Comparison

      | Data set | Traced artifacts | VSM p-value | VSM effect size | LSI p-value | LSI effect size |
      |---|---|---|---|---|---|
      | EasyClinic | UC → Code | <0.01 | 0.50 (large) | <0.01 | 0.50 (large) |
      | EasyClinic | Int. Diag. → Code | <0.01 | 0.52 (large) | <0.01 | 0.34 (medium) |
      | EasyClinic | TC → Code | 1.00 | - (negligible) | 1.00 | - (negligible) |
      | Pine | Req. → UC | <0.01 | 0.58 (large) | <0.01 | 0.58 (large) |
  24. 24. Link precision improvement. Example: Login Patient vs. Person, with poor vocabulary overlap (10%)
  25. 25. Threats to validity
      • Construct validity: mainly related to our oracle, which was provided by developers and, for EasyClinic, also peer-reviewed
      • Internal validity: improvements could be due to other reasons; however, we compared different techniques (VSM, LSI), and the approach works well regardless of stop word removal/stemming and use of tf-idf
      • Conclusion validity: conclusions based on proper (non-parametric) statistics
      • External validity: we considered systems with different characteristics and artifacts, but further studies are desirable
  26. 26. Conclusions
      • We proposed the use of a smoothing filter to improve the performance of IR-based traceability recovery
      • Idea inspired by digital signal processing
      • The filter significantly improves IR-based traceability recovery based on VSM (RQ1) and LSI (RQ2)
      • The filter is particularly suitable for artifacts with higher verbosity (RQ3), e.g., requirements and use cases
      • Less useful for artifacts composed of short sentences and using a limited vocabulary, e.g., test cases
  27. 27. Work-in-progress
      • Study replication: different systems and artifacts; use of relevance feedback
      • More sophisticated smoothing techniques: non-linear filters
      • Use in other applications of IR to software engineering, e.g., impact analysis or feature location
  28. 28. That’s all folks!
