Can Better Identifier Splitting Techniques Help Feature Location?
Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol




     SEMERU

19th IEEE International Conference on Program Comprehension
             (ICPC’11) – Kingston, Ontario, Canada
Textual information embeds domain knowledge

About 70% of source code consists of identifiers*

Identifiers are an important source of information for maintenance tasks:
  • traceability link recovery
  • feature location

* Deissenboeck, F. and Pizka, M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
[Figure: cumulative number of feature location papers based on textual information]
Related Work on Identifiers
• Takang et al. (JPL’96)
  – programs with full-word identifiers are more understandable than those with abbreviated ones
• Lawrie et al. (ICPC’06)
  – full words and recognizable abbreviations lead to better comprehension
• Binkley et al. (ICPC’09)
  – CamelCase style is easier to recognize than underscore
Related Work on Identifiers
• Enslen et al. (MSR’09)
  – Samurai: algorithm for splitting identifiers (using tables of identifier frequencies)
• Guerrouj et al. (JSME’11)
  – TIDIER: algorithm for splitting identifiers (using contextual information)
• Other related work
  – Deissenboeck and Pizka (SQJ’06), Antoniol et al. (ICSM’07), Haiduc and Marcus (ICPC’08), etc.
Splitting Identifiers Correctly is Challenging
Identifier Splitting Algorithms
Original Identifier   Camel Case
userId                user Id
setGID                set GID
print_file2device     print file 2 device
SSLCertificate        SSL Certificate
MINstring             MI Nstring
USERID                USERID
currentsize           currentsize
readadapterobject     readadapterobject
tolocale              tolocale
imitating             imitating
DEFMASKBit            DEFMASK Bit

Camel Case:
• Handles underscore and digits
• Fails at mixed cases
• Fails at same case identifiers
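The Camel Case column can be reproduced with a few splitting rules. The following is a minimal sketch, assuming a regular-expression splitter; the function name `camel_case_split` and the pattern are illustrative assumptions, not the implementation used in the study.

```python
import re

# One alternative per token shape, tried left to right:
#   1. a run of capitals that ends just before a Capitalized word (SSL|Certificate)
#   2. an optional capital followed by lowercase letters (user, Id)
#   3. any remaining run of capitals (GID, USERID)
#   4. a run of digits (2)
# Underscores match none of the alternatives, so they act as separators.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def camel_case_split(identifier):
    """Split an identifier on underscores, digits, and case changes."""
    return _TOKEN.findall(identifier)


for name in ["userId", "print_file2device", "MINstring", "currentsize"]:
    print(name, "->", " ".join(camel_case_split(name)))
```

Same-case identifiers such as `currentsize` come back unsplit, and `MINstring` is split in the wrong place, which matches the failure cases listed on the slide.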
Identifier Splitting Algorithms
Original Identifier   Camel Case            Samurai
userId                user Id               user Id
setGID                set GID               set GID
print_file2device     print file 2 device   print file 2 device
SSLCertificate        SSL Certificate       SSL Certificate
MINstring             MI Nstring            MIN string
USERID                USERID                USER ID
currentsize           currentsize           current size
readadapterobject     readadapterobject     read adapter object
tolocale              tolocale              tol ocal e
imitating             imitating             imi ta ting
DEFMASKBit            DEFMASK Bit           DEF MASK Bit

Samurai:
• Splits some cases where CamelCase cannot
• Oversplits
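To illustrate why frequency information helps with same-case identifiers, here is a small sketch of a frequency-based splitter. It is not the Samurai algorithm: it is a simplified dynamic-programming segmentation, and the `WORD_FREQ` table, its counts, and the function name are hypothetical.

```python
import math

# Hypothetical word frequencies; a real table would be mined from source code.
WORD_FREQ = {"current": 40, "size": 120, "read": 90, "adapter": 25,
             "object": 150, "user": 200, "id": 180}
TOTAL = sum(WORD_FREQ.values())


def split_same_case(token):
    """Return the best-scoring split of a same-case token into known words.

    The score of a split is the sum of the log relative frequencies of its
    words. The token is lowercased first and returned unsplit when no
    complete segmentation into known words exists.
    """
    s = token.lower()
    n = len(s)
    best = [None] * (n + 1)  # best[i] = (score, words) for the prefix s[:i]
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            word = s[j:i]
            if best[j] is not None and word in WORD_FREQ:
                score = best[j][0] + math.log(WORD_FREQ[word] / TOTAL)
                if best[i] is None or score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1] if best[n] else [s]


print(split_same_case("currentsize"))
print(split_same_case("readadapterobject"))
```

A table that also contains short fragments would make such a splitter cut words into pieces, which is the oversplitting behavior noted on the slide.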
[Figure: cumulative number of feature location papers based on textual information]
Existing feature location techniques use Camel Case splitting
Information Retrieval FLT
• Generate corpus
• Preprocessing
   – Remove non-literals
   – Remove stop words
   – Split identifiers
   – Stemming
• Indexing
   – Term-by-document matrix
   – Singular Value Decomposition
• User formulates query
• Generate results
• Ranked list

Example method:
synchronized void print(TestResult result, long runTime) throws IOException{
  printHeader(runTime);
  printErrors(result);
  printFailures(result);
  printFooter(result);
}

After removing non-literals:
synchronized void print TestResult result long runTime throws IOException printHeader runTime printErrors result printFailures result printFooter result

After removing stop words:
print TestResult result runTime IOException printHeader runTime printErrors result printFailures result printFooter result
Information Retrieval FLT (example, continued)

After splitting identifiers:
print Test Result result run Time IO Exception print Header run Time print Errors result print Failures result print Footer result

After stemming:
print test result result run time io exception print head run time print error result print fail result print foot result

Term-by-document matrix:
        print   test   result   ...
m1      5       1      3        ...
m2      ...     ...    ...      ...
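The preprocessing and indexing steps can be sketched in a few lines. This is a rough illustration under stated assumptions: the stop-word list holds only the Java keywords dropped in the example, stemming is omitted (so "header" is not reduced to "head"), and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list: the Java keywords removed in the slide example.
STOP_WORDS = {"synchronized", "void", "long", "throws"}


def split_camel_case(identifier):
    """Split on underscores, digits, and case changes."""
    return re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+",
                      identifier)


def preprocess(source):
    """Turn the source code of one method into a list of lowercase terms."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)  # drop non-literals
    tokens = [t for t in tokens if t not in STOP_WORDS]     # drop stop words
    terms = []
    for token in tokens:                                    # split identifiers
        terms.extend(part.lower() for part in split_camel_case(token))
    return terms


def term_by_document(documents):
    """Build a term-by-document count matrix from {method name: source}."""
    counts = {name: Counter(preprocess(src)) for name, src in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {name: [c[t] for t in vocabulary] for name, c in counts.items()}
    return vocabulary, matrix


METHOD = """synchronized void print(TestResult result, long runTime) throws IOException{
  printHeader(runTime); printErrors(result);
  printFailures(result); printFooter(result); }"""

vocabulary, matrix = term_by_document({"m1": METHOD})
print(dict(zip(vocabulary, matrix["m1"])))
```

Swapping `split_camel_case` for another splitter changes the vocabulary, and therefore the matrix that the later indexing step works on.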
IR and Dynamic Information FLT
• Generate corpus
• Preprocessing
   – Remove non-literals
   – Remove stop words
   – Split identifiers
   – Stemming
• Indexing
   – Term-by-document matrix
   – Singular Value Decomposition
• Collect execution trace
• User formulates query
• Generate results
• Ranked list of executed methods
Research Goal
Evaluate how advanced splitting techniques impact the performance of feature location techniques
Information Retrieval FLT
• Generate corpus
• Preprocessing
   – Remove non-literals
   – Remove stop words
   – Split identifiers: replace Camel Case with:
        • Samurai
        • “Perfect” splitting algorithm (Oracle)
   – Stemming
• Indexing
   – Term-by-document matrix
   – Singular Value Decomposition
• User formulates query
• Generate results
• Ranked list
[Figure: scale from Better to Worst]
Building the Oracle
• Extract identifiers (all identifiers)
• Same split? (CamelCase, Samurai, TIDIER)
   – YES: concordant split identifiers
        • Assume they are correct
        • Manually verified a sample
        • Threat to validity
   – NO: discordant split identifiers
        • Manual split, giving manually split identifiers
        • Consensus between authors
        • Checked source code
• Identifiers that could not be split
   – Examples: DT, i3, P754, zzz, etc.
   – Left unchanged
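The concordant/discordant decision can be sketched as a simple partition. This is an illustrative sketch only: the function name is hypothetical, and the two lookup tables reuse the Camel Case and Samurai splits shown on the earlier slides (TIDIER is left out because its outputs are not listed in the slides).

```python
def partition_by_agreement(identifiers, splitters):
    """Separate identifiers on which all splitters agree from the rest.

    `splitters` maps a splitter name to a function that returns a list of
    words. Concordant identifiers keep the agreed split; discordant ones
    are returned for manual splitting.
    """
    concordant, discordant = {}, []
    for identifier in identifiers:
        splits = {tuple(split(identifier)) for split in splitters.values()}
        if len(splits) == 1:
            concordant[identifier] = list(splits.pop())
        else:
            discordant.append(identifier)
    return concordant, discordant


# Splits copied from the Camel Case and Samurai columns of the slides.
CAMEL_CASE = {"userId": ["user", "Id"], "setGID": ["set", "GID"],
              "MINstring": ["MI", "Nstring"], "currentsize": ["currentsize"]}
SAMURAI = {"userId": ["user", "Id"], "setGID": ["set", "GID"],
           "MINstring": ["MIN", "string"], "currentsize": ["current", "size"]}

splitters = {"CamelCase": lambda name: CAMEL_CASE[name],
             "Samurai": lambda name: SAMURAI[name]}
concordant, discordant = partition_by_agreement(list(CAMEL_CASE), splitters)
print("concordant:", concordant)
print("discordant (manual split needed):", discordant)
```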
Design of the Case Study
• RQ: Does an FLT with an advanced splitting algorithm produce better results than the same FLT using the CamelCase splitting algorithm?
How to Compare two FLTs?
• Effectiveness measure for each feature: the rank of the gold set method in the ranked list

IR
Method   LSI score
M121     0.92
M64      0.89
M15      0.86
M39      0.80
M7       0.74   (gold set method)
M152     0.65
M234     0.56
M12      0.54
M78      0.52

Effectiveness = 5
How to Compare two FLTs?
• Effectiveness measure for each feature

IRDyn keeps only the executed methods (from the trace) of the IR ranked list:

IRDyn
Method   LSI score
M15      0.86
M7       0.74   (gold set method)
M234     0.56
M12      0.54

Effectiveness = 2
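The two effectiveness values can be reproduced with a short sketch. It assumes that the gold set method in the example is M7 (the only choice consistent with both reported values) and uses hypothetical function names.

```python
def effectiveness(ranked_methods, gold_set):
    """Return the 1-based rank of the first gold set method, or None."""
    for rank, method in enumerate(ranked_methods, start=1):
        if method in gold_set:
            return rank
    return None


def filter_by_trace(ranked_methods, executed_methods):
    """Keep only the methods that appear in the execution trace."""
    return [m for m in ranked_methods if m in executed_methods]


# Ranked list from the IR example, ordered by decreasing LSI score.
ir_ranking = ["M121", "M64", "M15", "M39", "M7", "M152", "M234", "M12", "M78"]
gold_set = {"M7"}                      # assumption: M7 is the gold set method
executed = {"M15", "M7", "M234", "M12"}

irdyn_ranking = filter_by_trace(ir_ranking, executed)
print("IR effectiveness:", effectiveness(ir_ranking, gold_set))
print("IRDyn effectiveness:", effectiveness(irdyn_ranking, gold_set))
```

A lower value is better, since it means fewer methods to inspect before reaching a relevant one.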
Which FLTs are we Comparing?
Software Systems
  • Rhino 1.6R5
  • 138 classes, 1,870 methods, 32K LOC
  • Eaddy et al.’s data*
  • 2 datasets

Dataset         Size   Queries                                Gold Sets             Execution Information
RhinoFeatures   241    Sections of ECMAScript documentation   Eaddy et al.*         Full Execution Traces (from unit tests)
RhinoBugs       143    Bug title and description              Eaddy et al.* (CVS)   N/A

* http://www.cs.columbia.edu/~eaddy/concerntagger/
Software Systems
  • jEdit 4.3
  • 483 classes, 6.4K methods, 109K LOC
  • 2 datasets

Dataset         Size   Queries                                    Gold Sets   Execution Information
jEditFeatures   64     Feature (or Patch) title and description   SVN         Marked Execution Traces
jEditBugs       86     Bug title and description                  SVN         Marked Execution Traces

Datasets available at:
http://www.cs.wm.edu/semeru/data/icpc11-identifier-splitting/
Generating the jEdit Datasets
• SVN commits between v4.2 and v4.3
• SVN commit message: Title + Description = Query
• Changed files: compare the previous version and the current version using Eclipse AST
• Modified methods (gold set)
Presenting the Results
• Box plot of all effectiveness measures in the datasets (e.g., 241 data points for RhinoFeatures)
[Figure: example box plot with the average and the median marked]
IR FLTs
[Figure: box plots for RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
• Similar median and average
• Datasets with features have better results than datasets with bugs
IRDyn FLTs
[Figure: box plots for RhinoFeatures, jEditFeatures, jEditBugs; RhinoBugs is N/A]
• Similar median and average
• Datasets with features have better results than datasets with bugs
Compare FLTs by Percentages
IROracle   IRCamelCase (Baseline)
  10            17
  20            15
  18            18
   5             9
   4            16
  19             7
  12            28
  14            15

• IROracle better than the baseline (lower value): 5/8
• IROracle worse than the baseline (higher value): 2/8
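The 5/8 and 2/8 figures follow from a pairwise comparison of the two columns. A minimal sketch, assuming that a lower effectiveness value is better and using a hypothetical function name:

```python
def compare_by_percentages(technique, baseline):
    """Return the fractions of cases where the technique is better or worse.

    Values are effectiveness measures (ranks), so lower is better.
    Ties count as neither better nor worse.
    """
    n = len(technique)
    better = sum(t < b for t, b in zip(technique, baseline))
    worse = sum(t > b for t, b in zip(technique, baseline))
    return better / n, worse / n


ir_oracle = [10, 20, 18, 5, 4, 19, 12, 14]
ir_camel_case = [17, 15, 18, 9, 16, 7, 28, 15]

better, worse = compare_by_percentages(ir_oracle, ir_camel_case)
print(f"better: {better:.1%}, worse: {worse:.1%}")
```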
IR
[Figure: percentages for RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
• Datasets with features vs. datasets with bugs
IRDyn
[Figure: percentages for RhinoFeatures, jEditFeatures, jEditBugs; RhinoBugs is N/A]
• Similar trend
Statistical Results
• Wilcoxon signed-rank test
• Null hypothesis
  – There is no statistically significant difference in terms of effectiveness between IRSamurai/IROracle and IRCamelCase
• Alternative hypothesis
  – IRSamurai/IROracle has statistically significantly higher effectiveness than IRCamelCase
• alpha = 0.05
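For readers unfamiliar with the test, the following sketch computes a one-sided Wilcoxon signed-rank test with an exact null distribution. It is an illustration only: the input is the small eight-pair example from the "Compare FLTs by Percentages" slide, not the study data, the function names are hypothetical, and it assumes that lower effectiveness values (ranks) mean better results.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def wilcoxon_one_sided(technique, baseline):
    """Return (W+, p) for the alternative 'technique values are lower'.

    Zero differences are dropped. The p-value is exact: it enumerates every
    sign assignment, so it is only practical for small samples.
    """
    diffs = [t - b for t, b in zip(technique, baseline) if t != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = at_most = 0
    for signs in product((0, 1), repeat=len(ranks)):
        total += 1
        if sum(r for r, s in zip(ranks, signs) if s) <= w_plus:
            at_most += 1
    return w_plus, at_most / total


ir_oracle = [10, 20, 18, 5, 4, 19, 12, 14]
ir_camel_case = [17, 15, 18, 9, 16, 7, 28, 15]
w_plus, p_value = wilcoxon_one_sided(ir_oracle, ir_camel_case)
print("W+ =", w_plus, "p =", p_value, "reject H0:", p_value < 0.05)
```

A small W+ means that the cases where the technique ranks worse than the baseline are few and small in magnitude; the null hypothesis is rejected only when the resulting p-value falls below alpha.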
IR
[Figure: results for RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
• The only statistically significant result (p=0.05)
Qualitative Results
• Vocabulary mismatch between queries and code:
  – Names of developers (e.g., Slava, Carlos)
  – Identifiers specific to communication (e.g., thanks, greetings, annoying)
• Features are more “descriptive” than bugs
  – Example: the words “join” and “line” are not mentioned
Threats to Validity
• External
  – 2 Java applications (different domains)
  – More systems needed
• Construct
  – Errors may be present in Oracle and gold sets
  – We used data produced by other researchers
• Internal
  – Subjectivity and bias in building the Oracle
• Conclusion
  – Non-parametric test: Wilcoxon signed-rank
Research Questions
• RQ1 Does IRSamurai outperform IRCamelCase in terms of effectiveness? NO

• RQ2 Does IRSamuraiDyn outperform IRCamelCaseDyn in terms of effectiveness? NO

• RQ3 Does IROracle outperform IRCamelCase in terms of effectiveness? In some cases (Rhino)

• RQ4 Does IROracleDyn outperform IRCamelCaseDyn in terms of effectiveness? NO
Future Work
• More systems and datasets
• Different maintenance tasks
  – Traceability link recovery
• Consider other splitting algorithms
Conclusions
• Advanced splitting techniques could improve FLTs
  – We found some empirical evidence
• Splitting has more impact on IR FLT
• If execution information is available, it is not necessary to use an advanced splitting technique
Thank you! Questions?
SEMERU @ William and Mary
http://www.cs.wm.edu/semeru/
bdit@cs.wm.edu
References
• Takang et al. (1996) Takang, A., Grubb, P., and Macredie, R., "The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Investigation", Journal of Programming Languages, vol. 4, no. 3, 1996, pp. 143-167
• Lawrie et al. (2006) Lawrie, D., Morrell, C., Feild, H., and Binkley, D., "What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June 14-16 2006, pp. 3-12
• Binkley et al. (2009) Binkley, D., Davis, M., Lawrie, D., and Morrell, C., "To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May 17-19 2009, pp. 158-167
• Enslen et al. (2009) Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K., "Mining Source Code to Automatically Split Identifiers for Software Analysis", in Proc. of IEEE MSR'09, May 16-17 2009, pp. 71-80
• Guerrouj et al. (2011) Guerrouj, L., Di Penta, M., Antoniol, G., and Guéhéneuc, Y.-G., "TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques", JSME, to appear, 2011

ICPC11b.ppt

  • 1.
    Can Better IdentifierSplitting Techniques H l F T h i Help Feature LLocation? i ? Bogdan Dit, Latifa Guerrouj, D B d Di L if G j Denys P h Poshyvanyk, Gi li k Giuliano A Antoniol i l SEMERU 19th IEEE International Conference on Program Comprehension (ICPC’11) – Kingston, Ontario, Canada
  • 2.
  • 3.
    Textual information embeds domain k d i knowledge l d 3
  • 4.
    Textual information embeds domain k d i knowledge l d About 70% of source code consists of identifiers* * Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software 4 Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
  • 5.
    Textual information embeds domain k d i knowledge l d About 70% of source code consists of identifiers* Identifiers are important source of information for maintenance tasks: • traceability link recovery • feature location * Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software 5 Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
  • 6.
    # of CumulativeFeature Location Papers based on Textual Information 6
  • 7.
    Related Work onIdentifiers • Takang et al. (JPL 96) al (JPL’96) – programs with full-word identifiers are more understandable than those with abbreviated ones • Lawrie et al. (ICPC 06) al (ICPC’06) – full words and recognizable abbreviations lead to better comprehension • Binkley et al. (ICPC’09) – CamelCase style is easier to recognize than underscore 7
  • 8.
    Related Work onIdentifiers • Enslen et al. (MSR 09) al (MSR’09) – Samurai: algorithm for splitting identifiers (using tables of identifier frequencies) • Guerrouj et al. (JSME’11) – TIDIER: algorithm for splitting identifiers (using contextual information) • Other related work – Deissenboeck and Pizka (SQJ’06), Antoniol et al. (ICSM’07), al (ICSM’07) Haiduc and Marcus (ICPC’08) (ICPC’08), etc. 8
  • 9.
    Splitting Identifiers Correctlyis Challenging h ll 9
  • 10.
    Identifier Splitting Algorithms OriginalIdentifier userId setGID print_file2device print file2device SSLCertificate MINstring USERID currentsize readadapterobject p j tolocale imitating DEFMASKBit 10
  • 11.
    Identifier Splitting Algorithms OriginalIdentifier Camel Case userId user Id setGID set GID print_file2device print file2device print file 2 device SSLCertificate SSL Certificate MINstring MI Nstring USERID USERID currentsize currentsize readadapterobject p j readadapterobject p j tolocale tolocale imitating imitating DEFMASKBit DEFMASK Bit 11
  • 12.
    Identifier Splitting Algorithms Handles Original Identifier Camel Case underscore and userId user Id digits setGID set GID print_file2device print file2device print file 2 device SSLCertificate SSL Certificate MINstring MI Nstring USERID USERID currentsize currentsize readadapterobject p j readadapterobject Fails p j at mixed cases tolocale tolocale imitating imitating DEFMASKBit DEFMASK Bit Fails at same case identifiers 12
  • 13.
    Identifier Splitting Algorithms OriginalIdentifier Camel Case userId user Id setGID set GID print_file2device print file2device print file 2 device SSLCertificate SSL Certificate MINstring MI Nstring USERID USERID currentsize currentsize readadapterobject p j readadapterobject p j tolocale tolocale imitating imitating DEFMASKBit DEFMASK Bit 13
  • 14.
    Identifier Splitting Algorithms OriginalIdentifier Camel Case Samurai userId user Id user Id setGID set GID set GID print_file2device print file2device print file 2 device print file 2 device SSLCertificate SSL Certificate SSL Certificate MINstring MI Nstring MIN string USERID USERID USER ID currentsize currentsize current size readadapterobject p j readadapterobject p j read adapter object p j tolocale tolocale tol ocal e imitating imitating imi ta ting DEFMASKBit DEFMASK Bit DEF MASK Bit 14
  • 15.
    Identifier Splitting Algorithms OriginalIdentifier Camel Case Samurai userId user Id Splits some cases user Id setGID set GID set GID where CamelCase print_file2device print file2device print file 2 device print file 2 device SSLCertificate cannot SSL Certificate SSL Certificate MINstring MI Nstring MIN string USERID USERID USER ID currentsize currentsize current size readadapterobject p j readadapterobject p j read adapter object p j tolocale tolocale tol ocal e imitating imitating imi ta ting DEFMASKBit DEFMASK Bit DEF MASK Bit Oversplits 15
  • 16.
    # of CumulativeFeature Location Papers based on Textual Information 16
  • 17.
    # of CumulativeFeature Location Papers based on Textual Information Existing feature location techniques use Camel Case splitting 17
  • 18.
    Information Retrieval FLT •Generate corpus synchronized void print(TestResult result, • Preprocessing long runTime) throws IOE l Ti ) h printHeader(runTime); IOException{ i { – Remove non-literals printErrors(result); – Remove stop words p printFailures(result); ( ); printFooter(result); – Split identifiers } – Stemming synchronized void print TestResult result • I d i Indexing long runTime throws IOException – Term-by-document matrix printHeader runTime printErrors result printFailures result printFooter result – Singular Value g Decomposition • User formulate query print TestResult result runTime IOException printHeader runTime O cept o p t eade u e • G Generate results t lt printErrors result printFailures result • Ranked list printFooter result 18
  • 19.
    Information Retrieval FLT
    • Generate corpus
    • Preprocessing
      – Remove non-literals
      – Remove stop words
      – Split identifiers
      – Stemming
    • Indexing
      – Term-by-document matrix
      – Singular Value Decomposition
    • User formulates query
    • Generate results
    • Ranked list

    After splitting identifiers:
        print Test Result result run Time IO Exception print Header run Time print Errors result print Failures result print Footer result
    After stemming (lowercased):
        print test result result run time io exception print head run time print error result print fail result print foot result
    Term-by-document matrix:
              print  test  result  ...
        m1    5      1     3       ...
        m2    ...    ...   ...     ...
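The preprocessing steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tooling used in the study: the stop-word list is a tiny hypothetical one (a few Java keywords and types), identifiers are split with the Camel Case rule, and stemming is left as a pluggable function that defaults to the identity, so the output is lowercased but not stemmed.

```python
import re

# Assumed, deliberately tiny stop-word list (Java keywords and primitive types).
STOP_WORDS = {"synchronized", "void", "long", "throws"}
CASE_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def preprocess(source, stem=lambda term: term):
    """Return the list of terms for one method (illustrative helper)."""
    # 1. Remove non-literals: keep identifier-like tokens, drop punctuation.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    # 2. Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Split identifiers on underscores and case transitions.
    terms = []
    for token in tokens:
        terms.extend(CASE_BOUNDARY.sub(" ", token.replace("_", " ")).split())
    # 4. Lowercase, then stem (identity by default; plug in a real stemmer here).
    return [stem(t.lower()) for t in terms]


if __name__ == "__main__":
    method = ("synchronized void print(TestResult result, long runTime) "
              "throws IOException { printHeader(runTime); printErrors(result); "
              "printFailures(result); printFooter(result); }")
    print(" ".join(preprocess(method)))
```

Swapping step 3 for a different splitter is the only change needed to compare splitting techniques while keeping the rest of the pipeline fixed.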
  • 20.
    Information Retrieval FLT
    • Generate corpus
    • Preprocessing
      – Remove non-literals
      – Remove stop words
      – Split identifiers
      – Stemming
    • Indexing
      – Term-by-document matrix
      – Singular Value Decomposition
    • User formulates query
    • Generate results
    • Ranked list

    Term-by-document matrix:
              print  test  result  ...
        m1    5      1     3       ...
        m2    ...    ...   ...     ...
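The indexing and ranking steps can be illustrated with a simplified sketch. It builds a term-by-document matrix (one row per method, one column per term) and ranks methods by cosine similarity to the query in the raw term space. It deliberately omits the Singular Value Decomposition step that LSI applies before computing similarities, so it is a plain vector-space stand-in rather than LSI itself. All function names are illustrative.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """docs maps a method id to its list of preprocessed terms."""
    vocabulary = sorted({t for terms in docs.values() for t in terms})
    rows = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        rows[doc_id] = [counts[t] for t in vocabulary]
    return vocabulary, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(docs, query_terms):
    """Return (method id, similarity) pairs, most similar first."""
    vocabulary, rows = term_document_matrix(docs)
    counts = Counter(query_terms)
    query = [counts[t] for t in vocabulary]
    scored = [(doc_id, cosine(row, query)) for doc_id, row in rows.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {"m1": ["print", "test", "result"], "m2": ["read", "file"]}
    for doc_id, score in rank(docs, ["print", "result"]):
        print(doc_id, round(score, 2))
```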
  • 21.
    IR and Dynamic Information FLT
    • Generate corpus
    • Preprocessing
      – Remove non-literals
      – Remove stop words
      – Split identifiers
      – Stemming
    • Indexing
      – Term-by-document matrix
      – Singular Value Decomposition
    • User formulates query
    • Collect execution trace
    • Generate results
    • Ranked list of executed methods
  • 22.
    Research Goal
    Evaluate how advanced splitting techniques impact the performance of feature location techniques
  • 23.
    Information Retrieval FLT
    • Generate corpus
    • Preprocessing
      – Remove non-literals
      – Remove stop words
      – Split identifiers
      – Stemming
    • Indexing
      – Term-by-document matrix
      – Singular Value Decomposition
    • User formulates query
    • Generate results
    • Ranked list

    Replace Camel Case with:
    • Samurai
    • “Perfect” splitting algorithm (Oracle)
    Scale shown on the slide: Better / Worst
  • 24.
    Building the Oracle
    Extract Identifiers -> All Identifiers
  • 25.
    Building the Oracle
    Extract Identifiers -> All Identifiers -> Same split? (CamelCase, Samurai, TIDIER)
      YES -> Concordant Split Identifiers
  • 26.
    Building the Oracle
    Extract Identifiers -> All Identifiers -> Same split? (CamelCase, Samurai, TIDIER)
      YES -> Concordant Split Identifiers
        • Assume they are correct
        • Manually verified a sample
        • Threat to validity
  • 27.
    Building the Oracle
    Extract Identifiers -> All Identifiers -> Same split? (CamelCase, Samurai, TIDIER)
      YES -> Concordant Split Identifiers
      NO  -> Discordant Split Identifiers -> Manual Split -> Manually Split Identifiers
  • 28.
    Building the Oracle
    Extract Identifiers -> All Identifiers -> Same split? (CamelCase, Samurai, TIDIER)
      YES -> Concordant Split Identifiers
      NO  -> Discordant Split Identifiers -> Manual Split -> Manually Split Identifiers
        • Consensus between authors
        • Checked source code
  • 29.
    Building the Oracle
    Extract Identifiers -> All Identifiers -> Same split? (CamelCase, Samurai, TIDIER)
      YES -> Concordant Split Identifiers
      NO  -> Discordant Split Identifiers -> Manual Split -> Manually Split Identifiers
    Identifiers that could not be split
    • Examples: DT, i3, P754, zzz, etc.
    • Left unchanged
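The oracle-building flow can be summarized as a small procedure. The sketch below is an illustration of the flowchart, not the authors' tooling: `build_oracle`, its parameters, and the representation of manual splits as a dictionary are all assumptions. Identifiers on which every splitter agrees are accepted as concordant; the rest are discordant and take the manual split, or stay unchanged when no split could be found.

```python
def build_oracle(identifiers, splitters, manual_split):
    """Return (oracle, discordant) for a collection of identifiers.

    splitters: functions mapping an identifier to a list of terms.
    manual_split: dict of human-provided splits for discordant identifiers.
    """
    oracle, discordant = {}, []
    for identifier in identifiers:
        splits = {tuple(split(identifier)) for split in splitters}
        if len(splits) == 1:
            # Concordant: all techniques agree, so the split is assumed correct.
            oracle[identifier] = list(splits.pop())
        else:
            # Discordant: use the manual split; leave unchanged if none exists.
            discordant.append(identifier)
            oracle[identifier] = manual_split.get(identifier, [identifier])
    return oracle, discordant


if __name__ == "__main__":
    # Two toy splitters standing in for CamelCase, Samurai, and TIDIER.
    def by_underscore(name):
        return name.split("_")

    def by_lookup(name):
        known = {"currentsize": ["current", "size"], "P754": ["P", "754"]}
        return known.get(name, name.split("_"))

    oracle, discordant = build_oracle(
        ["user_id", "currentsize", "P754"],
        [by_underscore, by_lookup],
        {"currentsize": ["current", "size"]},
    )
    print(oracle)
    print(discordant)
```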
  • 30.
    Design of the Case Study
  • 31.
    Design of the Case Study
    • RQ: Does a FLT with an advanced splitting algorithm produce better results than the same FLT using the CamelCase splitting algorithm?
  • 32.
    How to Compare two FLTs?
  • 33.
    How to Compare two FLTs?
    • Effectiveness measure for each feature

    IR
    Method   LSI score
    M121     0.92
    M64      0.89
    M15      0.86
    M39      0.80
    M7       0.74   <- gold set method
    M152     0.65
    M234     0.56
    M12      0.54
    M78      0.52
    Effectiveness = 5
  • 34.
    How to Compare two FLTs?
    • Effectiveness measure for each feature

    IR
    Method   LSI score
    M121     0.92
    M64      0.89
    M15      0.86   (executed)
    M39      0.80
    M7       0.74   (executed)  <- gold set method
    M152     0.65
    M234     0.56   (executed)
    M12      0.54   (executed)
    M78      0.52

    IRDyn
    Method   LSI score
    M15      0.86
    M7       0.74   <- gold set method
    M234     0.56
    M12      0.54
    Effectiveness = 2

    Legend: executed method (from trace)
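The two effectiveness values on the slide can be recomputed with a short sketch. It assumes the measure is the 1-based position of the highest-ranked gold set method, which is consistent with the values shown (5 for IR, 2 for IRDyn); the function names are illustrative. The IRDyn ranking is obtained by keeping only the methods that appear in the execution trace.

```python
def effectiveness(ranked_methods, gold_set):
    """1-based rank of the first gold set method, or None if none is ranked."""
    for position, method in enumerate(ranked_methods, start=1):
        if method in gold_set:
            return position
    return None


def filter_by_trace(ranked_methods, executed):
    """Keep only executed methods, preserving the original ranking order."""
    return [m for m in ranked_methods if m in executed]


if __name__ == "__main__":
    ir_ranking = ["M121", "M64", "M15", "M39", "M7", "M152", "M234", "M12", "M78"]
    executed = {"M15", "M7", "M234", "M12"}
    gold_set = {"M7"}

    print(effectiveness(ir_ranking, gold_set))                             # 5
    print(effectiveness(filter_by_trace(ir_ranking, executed), gold_set))  # 2
```

A lower value means the developer inspects fewer methods before reaching a relevant one.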
  • 35.
    Which FLTs are we Comparing?
  • 36.
    Software Systems
    • Rhino 1.6R5
    • 138 classes, 1,870 methods, 32K LOC
    • Eaddy et al.’s data*
    • 2 datasets

    Dataset         Size   Queries                                Gold Sets             Execution Information
    RhinoFeatures   241    Sections of ECMAScript documentation   Eaddy et al.*         Full Execution Traces (from unit tests)
    RhinoBugs       143    Bug title and description              Eaddy et al.* (CVS)   N/A

    * http://www.cs.columbia.edu/~eaddy/concerntagger/
  • 37.
    Software Systems
    • jEdit 4.3
    • 483 classes, 6.4K methods, 109K LOC
    • 2 datasets

    Dataset         Size   Queries                                    Gold Sets   Execution Information
    jEditFeatures   64     Feature (or Patch) title and description   SVN         Marked Execution Traces
    jEditBugs       86     Bug title and description                  SVN         Marked Execution Traces

    Datasets available at: http://www.cs.wm.edu/semeru/data/icpc11-identifier-splitting/
  • 38.
    Generating the jEdit Datasets
    SVN commits between v4.2-v4.3
  • 39.
    Generating the jEdit Datasets SVN commit message Title + Description = Query
  • 40.
    Generating the jEdit Datasets
    Changed files: Previous Version vs. Current Version
    Compare using Eclipse AST
    Modified methods (gold set)
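The gold set extraction step can be illustrated with a simplified sketch. The slides state that the two versions are compared using the Eclipse AST; the code below swaps that for a plain comparison of method bodies that are assumed to have been extracted already into dictionaries keyed by method signature. The function name and the decision to count only methods present in both versions are assumptions.

```python
def modified_methods(previous, current):
    """Methods present in both versions whose body text differs.

    previous, current: dicts mapping a method signature to its body text.
    """
    return sorted(
        name
        for name in previous.keys() & current.keys()
        if previous[name].strip() != current[name].strip()
    )


if __name__ == "__main__":
    # Hypothetical method signatures and bodies, for illustration only.
    previous = {
        "Buffer.save()": "write(path);",
        "Buffer.load()": "read(path);",
    }
    current = {
        "Buffer.save()": "backup(path); write(path);",
        "Buffer.load()": "read(path);",
        "Buffer.reload()": "load();",
    }
    print(modified_methods(previous, current))
```

Together with the commit message (title plus description) used as the query, the resulting list of modified methods forms one data point of the dataset.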
  • 41.
  • 42.
    Presenting the Results
    Box plot of all effectiveness measures in a dataset (e.g., 241 data points for RhinoFeatures)
    Markers: Average, Median
  • 43.
  • 44.
    IR FLTs
    Similar median and average
    [Box plot: RhinoFeatures]
  • 45.
    IR FLTs
    Similar median and average
    [Box plots: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
  • 46.
    IR FLTs
    Similar median and average
    Datasets with features have better results than datasets with bugs
    [Box plots: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
  • 47.
    IRDyn FLTs
    Similar median and average
    [Box plots: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs]
  • 48.
    IRDyn FLTs
    Similar median and average
    Datasets with features have better results than datasets with bugs
    [Box plots: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs]
  • 49.
    Compare FLTs by Percentages
    IROracle vs. IRCamelCase (Baseline)
    [Chart; values as extracted: 10 17 20 15 18 18 5 9 4 16 19 7 12 28 14 15]
  • 50.
    Compare FLTs by Percentages
    IROracle vs. IRCamelCase (Baseline)
    Annotations: 5/8, 2/8
    [Chart; values as extracted: 10 17 20 15 18 18 5 9 4 16 19 7 12 28 14 15]
  • 51.
    IR
    [Charts: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
  • 52.
    IR
    Datasets with features vs. datasets with bugs
    [Charts: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
  • 53.
    IRDyn
    Similar trend
    [Charts: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs]
  • 54.
    Statistical Results
    • Wilcoxon signed-rank test
    • Null hypothesis
      – There is no statistically significant difference in terms of effectiveness between IRSamurai/IROracle and IRCamelCase
    • Alternative hypothesis
      – IRSamurai/IROracle has statistically significantly higher effectiveness than IRCamelCase
    • alpha = 0.05
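For readers unfamiliar with the test, the following sketch computes the Wilcoxon signed-rank statistic for paired effectiveness values. It is a simplified illustration, not the statistical package used in the study: the one-sided p-value uses the normal approximation without tie or continuity correction, which is only reasonable for larger samples. Assuming a lower rank position means better effectiveness, pass the baseline values as `x` and the advanced technique's values as `y`, so that a small p-value supports the alternative hypothesis.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (w_plus, w_minus, one_sided_p) for paired samples x and y.

    The p-value tests whether x tends to be larger than y, using the
    normal approximation without tie or continuity correction.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal Z
    return w_plus, w_minus, p


if __name__ == "__main__":
    # Hypothetical effectiveness values for five queries, for illustration only.
    baseline = [5, 8, 12, 20, 7]
    advanced = [2, 8, 10, 25, 3]
    print(wilcoxon_signed_rank(baseline, advanced))
```

The null hypothesis is rejected when the returned p-value is below the chosen alpha (0.05 on the slide).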
  • 55.
    IR
    The only statistically significant result (p=0.05)
    [Charts: RhinoFeatures, RhinoBugs, jEditFeatures, jEditB]
  • 56.
    Qualitative Results
    • Vocabulary mismatch between queries and code:
      – Names of developers (e.g., Slava, Carlos)
      – Identifiers specific to communication (e.g., thanks, greetings, annoying)
  • 57.
    Qualitative Results
    • Features are more “descriptive” than bugs
  • 58.
    Qualitative Results
    • Features are more “descriptive” than bugs
    Words “join” and “line” are not mentioned
  • 59.
    Threats to Validity
    • External
      – 2 Java applications (different domains)
      – More systems needed
    • Construct
      – Errors may be present in Oracle and gold sets
      – We used data produced by other researchers
    • Internal
      – Subjectivity and bias in building the Oracle
    • Conclusion
      – Non-parametric test: Wilcoxon signed-rank
  • 60.
    Research Questions
    • RQ1: Does IRSamurai outperform IRCamelCase in terms of effectiveness? NO
    • RQ2: Does IRSamuraiDyn outperform IRCamelCaseDyn in terms of effectiveness? NO
    • RQ3: Does IROracle outperform IRCamelCase in terms of effectiveness? In some cases (Rhino)
    • RQ4: Does IROracleDyn outperform IRCamelCaseDyn in terms of effectiveness? NO
  • 61.
    Future Work
    • More systems and datasets
    • Different maintenance tasks
      – Traceability link recovery
    • Consider other splitting algorithms
  • 62.
    Conclusions
    • Advanced splitting techniques could improve FLTs
      – We found some empirical evidence
    • Splitting has more impact on IR FLT
    • If execution information is available, it is not necessary to use an advanced splitting technique
  • 63.
    Thank you! Questions?
    SEMERU @ William and Mary
    http://www.cs.wm.edu/semeru/
    bdit@cs.wm.edu
  • 64.
    References
    • Takang et al. (1996): Takang, A., Grubb, P., and Macredie, R., "The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Investigation", Journal of Programming Languages, vol. 4, no. 3, 1996, pp. 143-167
    • Lawrie et al. (2006): Lawrie, D., Morrell, C., Feild, H., and Binkley, D., "What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June 14-16 2006, pp. 3-12
    • Binkley et al. (2009): Binkley, D., Davis, M., Lawrie, D., and Morrell, C., "To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May 17-19 2009, pp. 158-167
    • Enslen et al. (2009): Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K., "Mining Source Code to Automatically Split Identifiers for Software Analysis", in Proc. of IEEE MSR'09, May 16-17 2009, pp. 71-80
    • Guerrouj et al. (2011): Guerrouj, L., Di Penta, M., Antoniol, G., and Guéhéneuc, Y.-G., "TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques", JSME, vol. to appear, 2011