Late Propagation in Software Clones
Liliane Barbour, Foutse Khomh, and Ying Zou

Late Propagation (LP)
• Definition: an inconsistent change that causes a clone
  pair to diverge, followed later by a consistent change
  that re-synchronizes the pair.
• LP can be risky because failing to propagate changes
  between the clones in a clone pair can lead to faults.
• In our work, we found that 8-21% of genealogies
  contain a late propagation.

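The definition can be sketched in code. This is a minimal illustration, not the authors' tooling. It assumes a clone pair's genealogy has already been reduced to one boolean per revision (true when the two clones are still consistent after that revision) and that the pair starts out consistent. The class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical sketch: a genealogy is reduced to one boolean per revision,
// true when the two clones are still consistent after that revision.
final class LatePropagationDetector {

    // Returns true if the pair diverges at some revision and is consistent
    // again at a later one (diverging change, then re-synchronizing change).
    static boolean containsLatePropagation(List<Boolean> consistentAfterRevision) {
        boolean diverged = false;
        for (boolean consistent : consistentAfterRevision) {
            if (!consistent) {
                diverged = true;   // inside a period of divergence
            } else if (diverged) {
                return true;       // re-synchronized after diverging
            }
        }
        return false;              // never diverged, or never re-synchronized
    }

    public static void main(String[] args) {
        // Consistent, then diverged, then re-synchronized: a late propagation.
        System.out.println(containsLatePropagation(List.of(true, false, true)));
        // Diverged and never re-synchronized: not a late propagation.
        System.out.println(containsLatePropagation(List.of(true, false, false)));
    }
}
```

A genealogy that diverges and never re-synchronizes is deliberately not reported, since the definition requires the later re-synchronizing change.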
LP With Propagation Example from ArgoUML
//Clone A, Revision 595
addField(new UMLComboBox(typeModel),1,0,0);

//Clone B, Revision 595
addField(new UMLComboBox(classifierModel),2,0,0);

//Diverging Change: Clone A, Revision 602
addField(new UMLComboBoxNavigator(this,"NavClass",
         new UMLComboBox(typeModel)),1,0,0);

//Re-synchronizing Change: Clone B, Revision 604
addField(new UMLComboBoxNavigator(this,"NavClass",
         new UMLComboBox(classifierModel)),2,0,0);

[Figure: timeline of Clone A and Clone B across Revision 595, Revision 602 (diverging change), and Revision 604 (re-synchronizing change).]

LP Without Propagation Example from Ant
//Clone A, Revision 270250
if( destFile == null )
{
   destFile = new File(destDir,file.getName());
}

//Clone B, Revision 270250
if (destFile == null ) {
   destFile = new File(destDir,file.getName());
}

// Diverging Change: Clone A, Revision 270264
if ( m_destFile == null )
{
   m_destFile = new File(m_destDir,m_file.getName());
}

//Re-synchronizing Change: Clone A, Revision 271109
if ( destFile == null ) {
   destFile = new File(destDir,file.getName());
}

[Figure: timeline of Clone A and Clone B across Revision 270250, Revision 270264 (diverging change), and Revision 271109 (re-synchronizing change).]

Types of Late Propagation
Propagation Category              LP Type   Modified During     Modified During the     Modified During
                                            Diverging Change    Period of Divergence    Re-synchronizing Change
Propagation Always Occurs         LP1       A                   A                       B
                                  LP2       A                   A and B                 B
                                  LP3       A                   A                       A and B
Propagation May or May Not Occur  LP4       A                   A and B                 A
                                  LP5       A                   A and B                 A and B
                                  LP6       A and B             A and B                 A or B
                                  LP7       A and B             A and B                 A and B
Propagation Never Occurs          LP8       A                   A                       A

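The table can be read as a lookup from "which clones were modified in each phase" to an LP type. The sketch below encodes exactly the eight rows of the table. The class, enum, and method names are hypothetical, and it assumes the input is already normalized so that, when only one clone is touched by the diverging change, that clone is labelled A.

```java
import java.util.Optional;

// Hypothetical names throughout. Clone "A" is, by convention, the clone
// touched by the diverging change when only one clone is touched.
final class LpTypeClassifier {

    enum Modified { A, B, BOTH }

    // Looks up the LP type from which clones were modified during the
    // diverging change, the period of divergence, and the re-synchronizing
    // change. Returns empty if the combination is not a row of the table.
    static Optional<String> classify(Modified diverging, Modified period, Modified resync) {
        if (diverging == Modified.A && period == Modified.A) {
            switch (resync) {
                case B:    return Optional.of("LP1");
                case BOTH: return Optional.of("LP3");
                default:   return Optional.of("LP8"); // only A ever changes
            }
        }
        if (diverging == Modified.A && period == Modified.BOTH) {
            switch (resync) {
                case B:    return Optional.of("LP2");
                case A:    return Optional.of("LP4");
                default:   return Optional.of("LP5");
            }
        }
        if (diverging == Modified.BOTH && period == Modified.BOTH) {
            // "A or B" in the table: exactly one clone is modified.
            return Optional.of(resync == Modified.BOTH ? "LP7" : "LP6");
        }
        return Optional.empty();
    }
}
```

Under this encoding, a pair where A diverges and B later re-synchronizes (as in the ArgoUML example) maps to LP1, and a pair where only A is ever modified (as in the Ant example) maps to LP8.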
Research Questions
RQ1: Are there different types of LP?

RQ2: Are some types of LP more fault-prone than others?

RQ3: Which type of LP experiences the highest proportion of faults?

Subject Systems


System    # LOC   # Revisions   # Genealogies   # LP          # Genealogies   # LP
                                (CCFinder)      (CCFinder)    (Simian)        (Simian)
ArgoUML   3.1M    18k           14k             1.1k          111             23
Ant       2.3M    1.0M          30k             4.7k          461             80

Our Approach

Mining the SVN
• Use J-Rex to mine the SVN repository
• Use heuristics (Mockus et al., 2000) to identify the
  reason for each commit
• Store snapshots of all revisions of each Java file
  in an XML file
• Remove test files

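As an illustration of the commit-reason step, the sketch below flags commit messages that look like fault fixes by matching keywords. The keyword list, class name, and method name are assumptions made for this example. The slides do not list the heuristics actually applied, which are described in Mockus et al. (2000).

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a keyword heuristic for fault-fixing commits.
final class CommitReasonHeuristic {

    // Assumed keyword list, matched as whole words and case-insensitively.
    private static final Pattern FAULT_FIX = Pattern.compile(
            "\\b(fix(e[sd])?|bugs?|defects?|faults?|errors?|fail(s|ed|ures?)?)\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns true if the commit message contains any of the keywords.
    static boolean looksLikeFaultFix(String commitMessage) {
        return FAULT_FIX.matcher(commitMessage).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeFaultFix("Fixed NPE when destFile is null"));
        System.out.println(looksLikeFaultFix("Add javadoc for copy task"));
    }
}
```

Whole-word matching keeps words such as "prefix" from being counted as fixes. A keyword heuristic like this will still misclassify some commits.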
Clone Detection
• Extract the contents of each method revision into
  an individual file
• Perform clone detection once, on all snapshots
• Use two existing clone detection tools
   – Simian (text-based) and CCFinder (token-based)

Building Clone Genealogies
• Build clone genealogies from the existing clone list
• Query the SVN using diff to track changes to each
  clone in a clone pair over time
• If a change modifies one of the clones in a clone
  pair, query the clone list for a matching clone

RQ1: Are there different types of LP?
[Figure: Breakdown of LP Type by System. Bar chart of the percentage of all LP occurrences (y-axis, 0% to 80%) for each LP type, LP1 to LP8 (x-axis), with one series per system and tool: ArgoUML - Simian, ArgoUML - CCFinder, Ant - Simian, Ant - CCFinder.]

Multiple types of LP are represented, across all categories of LP.

RQ2: Are some types of LP more fault-prone than others?

Part 1: Is Late Propagation fault-prone?

Part 2: Are specific types of late propagation more fault-prone?

Part 1: Is Late Propagation Fault-prone?
[Figure: LP vs. Non-LP Odds Ratios. Bar chart of the odds ratio (y-axis, 0 to 4) for Ant - Simian, ArgoUML - CCFinder, and Ant - CCFinder. ArgoUML - Simian is omitted because it is not statistically significant.]

In all statistically significant cases, the odds ratio is greater than 1.
Therefore, LP genealogies are more fault-prone than non-LP genealogies.

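For readers unfamiliar with odds ratios, the sketch below computes one from a 2x2 table of genealogy counts: LP or non-LP, with or without faults. The class and parameter names are hypothetical and the counts in `main` are made up for illustration. They are not data from the study, and the sketch does not cover the significance test.

```java
// Hypothetical sketch: odds ratio of faults for LP vs. non-LP genealogies.
final class OddsRatio {

    // OR = (lpFaulty / lpClean) / (nonLpFaulty / nonLpClean)
    //    = (lpFaulty * nonLpClean) / (lpClean * nonLpFaulty)
    static double oddsRatio(long lpFaulty, long lpClean, long nonLpFaulty, long nonLpClean) {
        if (lpClean == 0 || nonLpFaulty == 0) {
            throw new IllegalArgumentException("odds ratio is undefined when a denominator cell is zero");
        }
        return ((double) lpFaulty * nonLpClean) / ((double) lpClean * nonLpFaulty);
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        System.out.println(oddsRatio(30, 70, 20, 80));
    }
}
```

With these made-up counts the ratio is 2400 / 1400, roughly 1.71. A value above 1 means the odds of a fault are higher in the LP group than in the non-LP group.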
Part 2: Are specific types of late propagation more fault-prone?
[Figure: Odds Ratios Between Each LP Type and Non-LP Genealogies. Bar chart of the odds ratio (y-axis, 0 to 16) for each LP type, LP1 to LP8 (x-axis), with series Ant - Simian, ArgoUML - CCFinder, and Ant - CCFinder.]

Note: ArgoUML - Simian is omitted because it is not statistically significant.

RQ2 Observations
• Some LP types are not more fault-prone than
  non-LP genealogies (i.e., odds ratio < 1)
• Some types that make up a small proportion of LP
  instances have a very high odds ratio
• LP7 and LP8 occur frequently but have low odds
  ratios

Each type of LP has a different level of fault-proneness.

RQ3: Which type of LP experiences the highest proportion of faults?
[Figure: Percentage of Fault Occurrences Broken Down by LP Type. Bar chart of the percentage of fault occurrences (y-axis, 0% to 80%) for each LP type, LP1 to LP8 (x-axis), with series Ant - Simian, ArgoUML - CCFinder, and Ant - CCFinder.]

Note: ArgoUML - Simian is omitted because it is not statistically significant.

RQ3 Observations
• LP7 and LP8 contribute a large proportion of the
  faults but have lower odds ratios (RQ2)
   – When faults occur, they occur in large numbers
• Overall, LP7 and LP8 are the most dangerous; the
  fault-proneness of the other types is
  system-dependent

The proportion of faults is different for each LP type.

Conclusion
• In general, LP genealogies are more fault-prone than
  non-LP genealogies
• LP7 and LP8 are the riskiest, in terms of both their
  fault-proneness and the magnitude of their faults
   – LP8 contains no propagation of changes
   – LP7 may or may not contain any propagation of
     changes
• Fault-proneness and fault occurrence depend on both
  the LP type and the system