Can Better Identifier Splitting Techniques Help Feature Location?
Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol




     SEMERU

19th IEEE International Conference on Program Comprehension
             (ICPC’11) – Kingston, Ontario, Canada
Textual information embeds domain knowledge
About 70% of source code consists of identifiers*
Identifiers are an important source of information for maintenance tasks:
• traceability link recovery
• feature location
* Deissenboeck, F. and Pizka, M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
# of Cumulative Feature Location Papers based on Textual Information
[Chart]
Related Work on Identifiers
• Takang et al. (JPL’96)
  – programs with full-word identifiers are more understandable than those with abbreviated ones
• Lawrie et al. (ICPC’06)
  – full words and recognizable abbreviations lead to better comprehension
• Binkley et al. (ICPC’09)
  – CamelCase style is easier to recognize than underscore
Related Work on Identifiers
• Enslen et al. (MSR’09)
  – Samurai: algorithm for splitting identifiers (using tables of identifier frequencies)
• Guerrouj et al. (JSME’11)
  – TIDIER: algorithm for splitting identifiers (using contextual information)
• Other related work
  – Deissenboeck and Pizka (SQJ’06), Antoniol et al. (ICSM’07), Haiduc and Marcus (ICPC’08), etc.
Splitting Identifiers Correctly is Challenging
Identifier Splitting Algorithms
Original Identifier   Camel Case
userId                user Id
setGID                set GID
print_file2device     print file 2 device
SSLCertificate        SSL Certificate
MINstring             MI Nstring
USERID                USERID
currentsize           currentsize
readadapterobject     readadapterobject
tolocale              tolocale
imitating             imitating
DEFMASKBit            DEFMASK Bit
Camel Case:
• Handles underscore and digits
• Fails at mixed cases
• Fails at same case identifiers
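The Camel Case column can be reproduced with a small regular-expression splitter. This is a minimal sketch, not the implementation used in the study; the function name `camel_case_split` and the exact pattern are assumptions chosen to match the splits listed on the slide.

```python
import re

# Assumed pattern: an uppercase run that precedes a capitalized word,
# a (possibly capitalized) lowercase word, a remaining uppercase run, or digits.
# Underscores match none of the alternatives, so they act as separators.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def camel_case_split(identifier):
    """Split an identifier at underscores, digits, and case changes."""
    return _TOKEN.findall(identifier)


for name in ["userId", "print_file2device", "MINstring", "USERID", "currentsize"]:
    print(name, "->", " ".join(camel_case_split(name)))
```

Because the pattern only looks at character classes, same-case identifiers such as USERID or currentsize come back as a single token, which is the failure mode noted on the slide.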
Identifier Splitting Algorithms
Original Identifier   Camel Case            Samurai
userId                user Id               user Id
setGID                set GID               set GID
print_file2device     print file 2 device   print file 2 device
SSLCertificate        SSL Certificate       SSL Certificate
MINstring             MI Nstring            MIN string
USERID                USERID                USER ID
currentsize           currentsize           current size
readadapterobject     readadapterobject     read adapter object
tolocale              tolocale              tol ocal e
imitating             imitating             imi ta ting
DEFMASKBit            DEFMASK Bit           DEF MASK Bit
Samurai:
• Splits some cases where CamelCase cannot
• Oversplits
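Same-case identifiers such as USERID or currentsize can only be split with some knowledge of the vocabulary. The sketch below is not Samurai; it only illustrates the general idea of scoring candidate splits against a frequency table. The table `FREQ`, its counts, and the function name `split_by_frequency` are hypothetical.

```python
import math

# Hypothetical word frequencies; a real tool would mine them from source code.
FREQ = {"user": 50, "id": 40, "current": 30, "size": 35,
        "read": 45, "adapter": 10, "object": 25}


def split_by_frequency(identifier, freq=FREQ):
    """Return the highest-scoring segmentation into known words, if any."""
    word = identifier.lower()
    n = len(word)
    # best[i] holds (score, tokens) for the best segmentation of word[:i].
    best = [(0.0, [])] + [None] * n
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is None or piece not in freq:
                continue
            score = best[start][0] + math.log(freq[piece])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    # Fall back to the unsplit identifier when no segmentation is found.
    return best[n][1] if best[n] else [identifier]


for name in ["USERID", "currentsize", "readadapterobject", "tolocale"]:
    print(name, "->", " ".join(split_by_frequency(name)))
```

In this sketch every known word adds to the score, so a table that contains very short entries would favor splits with many fragments. That is similar in spirit to the oversplitting shown for tolocale and imitating, although the slides do not say why Samurai produces those splits.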
# of Cumulative Feature Location Papers based on Textual Information
[Chart]
Existing feature location techniques use Camel Case splitting
Information Retrieval FLT
• Generate corpus
• Preprocessing
   – Remove non-literals
   – Remove stop words
   – Split identifiers
   – Stemming
• Indexing
   – Term-by-document matrix
   – Singular Value Decomposition
• User formulates query
• Generate results
• Ranked list

Example method:
synchronized void print(TestResult result, long runTime) throws IOException {
  printHeader(runTime);
  printErrors(result);
  printFailures(result);
  printFooter(result);
}
After removing non-literals:
synchronized void print TestResult result long runTime throws IOException printHeader runTime printErrors result printFailures result printFooter result
After removing stop words:
print TestResult result runTime IOException printHeader runTime printErrors result printFailures result printFooter result
Information Retrieval FLT
After splitting identifiers:
print Test Result result run Time IO Exception print Header run Time print Errors result print Failures result print Footer result
After stemming:
print test result result run time io exception print head run time print error result print fail result print foot result
Term-by-document matrix:
        print   test   result   ...
m1      5       1      3        ...
m2      ...     ...    ...      ...
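The preprocessing steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tooling used in the study: the stop-word list contains only the Java keywords of the example method, stemming is left out, and the matrix holds raw term counts, so its values need not match the ones shown on the slide (an LSI pipeline may apply weighting). The names `preprocess` and `term_by_document` are hypothetical.

```python
import re
from collections import Counter

JAVA_SOURCE = """
synchronized void print(TestResult result, long runTime) throws IOException {
    printHeader(runTime);
    printErrors(result);
    printFailures(result);
    printFooter(result);
}
"""

# Assumed stop-word list: only the Java keywords that occur in this method.
STOP_WORDS = {"synchronized", "void", "long", "throws"}
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def preprocess(source):
    """Keep identifiers, drop stop words, split on case changes, lowercase."""
    words = [w for w in _IDENT.findall(source) if w not in STOP_WORDS]
    return [part.lower() for w in words for part in _TOKEN.findall(w)]


def term_by_document(documents):
    """Map each document (method) name to a Counter of raw term frequencies."""
    return {name: Counter(preprocess(text)) for name, text in documents.items()}


matrix = term_by_document({"m1": JAVA_SOURCE})
print(preprocess(JAVA_SOURCE))
print(matrix["m1"]["print"], matrix["m1"]["test"])
```

Swapping `_TOKEN.findall` for another splitter is the only change needed to compare splitting techniques in such a pipeline.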
IR and Dynamic Information FLT
• Generate corpus
• Preprocessing
   – Remove non-literals
   – Remove stop words
   – Split identifiers
   – Stemming
• Indexing
   – Term-by-document matrix
   – Singular Value Decomposition
• Collect execution trace
• User formulates query
• Generate results
• Ranked list of executed methods
Research Goal
Evaluate how advanced splitting techniques impact the performance of feature location techniques
Information Retrieval FLT
• Generate corpus
• Preprocessing
   – Remove non-literals
   – Remove stop words
   – Split identifiers: replace Camel Case with:
       • Samurai
       • “Perfect” splitting algorithm (Oracle)
   – Stemming
• Indexing
   – Term-by-document matrix
   – Singular Value Decomposition
• User formulates query
• Generate results
• Ranked list: Better / Worst
Building the Oracle
• Extract identifiers (all identifiers)
• Same split? (CamelCase, Samurai, TIDIER)
   – YES: Concordant Split Identifiers
       • Assume they are correct
       • Manually verified a sample
       • Threat to validity
   – NO: Discordant Split Identifiers, then Manual Split, giving Manually Split Identifiers
       • Consensus between authors
       • Checked source code
• Identifiers that could not be split
   – Examples: DT, i3, P754, zzz, etc.
   – Left unchanged
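The concordant/discordant step of building the Oracle can be sketched as follows. The function name `partition_by_agreement` is hypothetical, and the sample data uses only the Camel Case and Samurai splits shown earlier in the slides; TIDIER outputs are not listed there, so they are left out rather than guessed.

```python
def partition_by_agreement(identifiers, *splitter_outputs):
    """Separate identifiers on which all splitters agree from the rest.

    Each element of splitter_outputs maps an identifier to its list of tokens.
    """
    concordant, discordant = {}, []
    for ident in identifiers:
        splits = {tuple(output[ident]) for output in splitter_outputs}
        if len(splits) == 1:
            # All splitters agree: accept the split (assumed correct).
            concordant[ident] = list(splits.pop())
        else:
            # Disagreement: queue the identifier for manual splitting.
            discordant.append(ident)
    return concordant, discordant


camel = {"userId": ["user", "Id"], "USERID": ["USERID"],
         "MINstring": ["MI", "Nstring"]}
samurai = {"userId": ["user", "Id"], "USERID": ["USER", "ID"],
           "MINstring": ["MIN", "string"]}

concordant, discordant = partition_by_agreement(camel, camel, samurai)
print(concordant)
print(discordant)
```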
Design of the Case Study
• RQ: Does a FLT with an advanced splitting algorithm produce better results than the same FLT using the CamelCase splitting algorithm?
How to Compare two FLTs?
• Effectiveness measure for each feature

IR
Method   LSI score
M121     0.92
M64      0.89
M15      0.86
M39      0.80
M7       0.74
M152     0.65
M234     0.56
M12      0.54
M78      0.52
Effectiveness = 5

IRDyn
Method   LSI score
M15      0.86
M7       0.74
M234     0.56
M12      0.54
Effectiveness = 2

Legend: Gold set method; Method; Executed method (from trace)
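The two effectiveness values can be reproduced with a short sketch. It assumes that effectiveness is the rank of the highest-ranked gold set method, and that the gold set method is M7; both are inferred from the stated values (5 for IR, 2 for IRDyn) because the highlighting of the original slide did not survive. The function names are hypothetical.

```python
def effectiveness(ranked_methods, gold_set):
    """Return the 1-based rank of the first gold set method, or None."""
    for rank, method in enumerate(ranked_methods, start=1):
        if method in gold_set:
            return rank
    return None


def filter_by_trace(ranked_methods, executed):
    """Keep only the methods that appear in the execution trace (IRDyn)."""
    return [m for m in ranked_methods if m in executed]


ir_ranking = ["M121", "M64", "M15", "M39", "M7", "M152", "M234", "M12", "M78"]
executed = {"M15", "M7", "M234", "M12"}
gold_set = {"M7"}  # assumption inferred from "Effectiveness = 5"

print(effectiveness(ir_ranking, gold_set))                             # 5
print(effectiveness(filter_by_trace(ir_ranking, executed), gold_set))  # 2
```

A lower value is better: the developer inspects fewer methods before reaching one that belongs to the feature.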
Which FLTs are we Comparing?
Software Systems
  • Rhino 1.6R5
  • 138 classes, 1,870 methods, 32K LOC
  • Eaddy et al.’s data*
  • 2 datasets
Dataset        Size  Queries                               Gold Sets            Execution Information
RhinoFeatures  241   Sections of ECMAScript documentation  Eaddy et al.*        Full Execution Traces (from unit tests)
RhinoBugs      143   Bug title and description             Eaddy et al.* (CVS)  N/A
* http://www.cs.columbia.edu/~eaddy/concerntagger/
Software Systems
  • jEdit 4.3
  • 483 classes, 6.4K methods, 109K LOC
  • 2 datasets
Dataset        Size  Queries                                   Gold Sets  Execution Information
jEditFeatures  64    Feature (or Patch) title and description  SVN        Marked Execution Traces
jEditBugs      86    Bug title and description                 SVN        Marked Execution Traces
Datasets available at: http://www.cs.wm.edu/semeru/data/icpc11-identifier-splitting/
Generating the jEdit Datasets
• SVN commits between v4.2 and v4.3
• SVN commit message: Title + Description = Query
• Changed files: compare previous version and current version using Eclipse AST
• Modified methods (gold set)
Presenting the Results
Box plot of all effectiveness measures in datasets (e.g., 241 data points for RhinoFeatures)
[Box plot legend: Average, Median]
IR FLTs
[Box plots: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
• Similar median and average
• Datasets with features have better results than datasets with bugs
IRDyn FLTs
[Box plots: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs]
• Similar median and average
• Datasets with features have better results than datasets with bugs
Compare FLTs by Percentages
IROracle   IRCamelCase (Baseline)
  10            17
  20            15
  18            18
   5             9
   4            16
  19             7
  12            28
  14            15
• IROracle better than the baseline: 5/8
• IROracle worse than the baseline: 2/8
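The 5/8 and 2/8 fractions follow from a pairwise comparison of the two columns, assuming that a lower effectiveness value is better. A minimal sketch (the function name `compare` is hypothetical):

```python
ir_oracle = [10, 20, 18, 5, 4, 19, 12, 14]
ir_camel_case = [17, 15, 18, 9, 16, 7, 28, 15]  # baseline


def compare(technique, baseline):
    """Count the features where the technique is better or worse (lower is better)."""
    better = sum(t < b for t, b in zip(technique, baseline))
    worse = sum(t > b for t, b in zip(technique, baseline))
    return better, worse, len(technique)


better, worse, total = compare(ir_oracle, ir_camel_case)
print(f"better: {better}/{total}, worse: {worse}/{total}")
```

The remaining feature (18 vs. 18) is a tie and counts toward neither fraction.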
IR
[Charts: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
• Datasets with features vs. datasets with bugs
IRDyn
[Charts: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs]
• Similar trend
Statistical Results
• Wilcoxon signed-rank test
• Null hypothesis
  – There is no statistically significant difference in terms of effectiveness between IRSamurai/IROracle and IRCamelCase
• Alternative hypothesis
  – IRSamurai/IROracle has statistically significantly higher effectiveness than IRCamelCase
• alpha = 0.05
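As a rough illustration of what the Wilcoxon signed-rank test computes, the sketch below derives the two rank sums for paired effectiveness values. The data is the illustrative eight-pair example from the "Compare FLTs by Percentages" slide, not the study's datasets, and no p-value is computed; the function name `signed_rank_sums` is hypothetical.

```python
def signed_rank_sums(sample_a, sample_b):
    """Return (W+, W-) for paired samples, dropping zero differences."""
    diffs = [a - b for a, b in zip(sample_a, sample_b) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Group equal absolute differences so that ties share an average rank.
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        ranks[abs(ordered[i])] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w_plus = sum(ranks[abs(d)] for d in diffs if d > 0)
    w_minus = sum(ranks[abs(d)] for d in diffs if d < 0)
    return w_plus, w_minus


ir_oracle = [10, 20, 18, 5, 4, 19, 12, 14]
ir_camel_case = [17, 15, 18, 9, 16, 7, 28, 15]

w_plus, w_minus = signed_rank_sums(ir_oracle, ir_camel_case)
print(w_plus, w_minus)
```

Differences are taken as IROracle minus IRCamelCase, so negative differences are the features where IROracle ranks the gold set method higher. The smaller of the two sums is the usual test statistic, which is then compared against a critical value or converted to a p-value.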
IR
[Charts: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
• The only statistically significant result (p=0.05)
Qualitative Results
• Vocabulary mismatch between queries and code:
  – Names of developers (e.g., Slava, Carlos)
  – Identifiers specific to communication (e.g., thanks, greetings, annoying)
Qualitative Results
• Features are more “descriptive” than bugs
• Words “join” and “line” are not mentioned
Threats to Validity
• External
  – 2 Java applications (different domains)
  – More systems needed
• Construct
  – Errors may be present in Oracle and gold sets
  – We used data produced by other researchers
• Internal
  – Subjectivity and bias in building the Oracle
• Conclusion
  – Non-parametric test: Wilcoxon signed-rank
Research Questions
• RQ1 Does IRSamurai outperform IRCamelCase in terms of effectiveness? NO
• RQ2 Does IRSamuraiDyn outperform IRCamelCaseDyn in terms of effectiveness? NO
• RQ3 Does IROracle outperform IRCamelCase in terms of effectiveness? In some cases (Rhino)
• RQ4 Does IROracleDyn outperform IRCamelCaseDyn in terms of effectiveness? NO
Future Work
• More systems and datasets
• Different maintenance tasks
  – Traceability link recovery
• Consider other splitting algorithms
Conclusions
• Advanced splitting techniques could improve FLTs
  – We found some empirical evidence
• Splitting has more impact on IR FLT
• If execution information is available, it is not necessary to use an advanced splitting technique
Thank you! Questions?
SEMERU @ William and Mary
http://www.cs.wm.edu/semeru/
bdit@cs.wm.edu
References
• Takang et al. (1996) Takang, A., Grubb, P., and Macredie, R., "The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Investigation", Journal of Programming Languages, vol. 4, no. 3, 1996, pp. 143-167
• Lawrie et al. (2006) Lawrie, D., Morrell, C., Feild, H., and Binkley, D., "What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June 14-16 2006, pp. 3-12
• Binkley et al. (2009) Binkley, D., Davis, M., Lawrie, D., and Morrell, C., "To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May 17-19 2009, pp. 158-167
• Enslen et al. (2009) Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K., "Mining Source Code to Automatically Split Identifiers for Software Analysis", in Proc. of IEEE MSR'09, May 16-17 2009, pp. 71-80
• Guerrouj et al. (2011) Guerrouj, L., Di Penta, M., Antoniol, G., and Guéhéneuc, Y.-G., "TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques", JSME, vol. to appear, 2011

More Related Content

Similar to ICPC11c.ppt

CPK Cryptosystem In Solaris
CPK Cryptosystem In SolarisCPK Cryptosystem In Solaris
CPK Cryptosystem In SolarisZhi Guan
 
Bypass Security Checking with Frida
Bypass Security Checking with FridaBypass Security Checking with Frida
Bypass Security Checking with FridaSatria Ady Pradana
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineeringjtdudley
 
Emoji Encryption Using AES Algorithm
Emoji Encryption Using AES AlgorithmEmoji Encryption Using AES Algorithm
Emoji Encryption Using AES Algorithmijtsrd
 
ISSA: Next Generation Tokenization for Compliance and Cloud Data Protection
ISSA: Next Generation Tokenization for Compliance and Cloud Data ProtectionISSA: Next Generation Tokenization for Compliance and Cloud Data Protection
ISSA: Next Generation Tokenization for Compliance and Cloud Data ProtectionUlf Mattsson
 
Categorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonCategorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonJanu Jahnavi
 
BLE as Active RFID
BLE as Active RFIDBLE as Active RFID
BLE as Active RFIDreelyActive
 
iPhonical and model-driven software development for the iPhone
iPhonical and model-driven software development for the iPhoneiPhonical and model-driven software development for the iPhone
iPhonical and model-driven software development for the iPhoneHeiko Behrens
 
NanoSec Conference 2019: Code Execution Analysis in Mobile Apps - Abdullah Jo...
NanoSec Conference 2019: Code Execution Analysis in Mobile Apps - Abdullah Jo...NanoSec Conference 2019: Code Execution Analysis in Mobile Apps - Abdullah Jo...
NanoSec Conference 2019: Code Execution Analysis in Mobile Apps - Abdullah Jo...Hafez Kamal
 
BDD - Buzzword Driven Development - Build the next cool app for fun and for.....
BDD - Buzzword Driven Development - Build the next cool app for fun and for.....BDD - Buzzword Driven Development - Build the next cool app for fun and for.....
BDD - Buzzword Driven Development - Build the next cool app for fun and for.....Alessandro Cinelli (cirpo)
 
BDD - Buzzword Driven Development - Build the next cool app for fun and for.....
BDD - Buzzword Driven Development - Build the next cool app for fun and for.....BDD - Buzzword Driven Development - Build the next cool app for fun and for.....
BDD - Buzzword Driven Development - Build the next cool app for fun and for.....Michele Orselli
 
Categorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonCategorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonJanu Jahnavi
 
The Belgian E Id Hacker Vs Developer
The Belgian E Id Hacker Vs DeveloperThe Belgian E Id Hacker Vs Developer
The Belgian E Id Hacker Vs Developerbeires
 
Bypass Security Checking with Frida
Bypass Security Checking with FridaBypass Security Checking with Frida
Bypass Security Checking with FridaSatria Ady Pradana
 
BCS_PKI_part1.ppt
BCS_PKI_part1.pptBCS_PKI_part1.ppt
BCS_PKI_part1.pptUskuMusku1
 






ICPC11c.ppt

  • 1. Can Better Identifier Splitting Techniques Help Feature Location? Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol. SEMERU. 19th IEEE International Conference on Program Comprehension (ICPC’11), Kingston, Ontario, Canada
  • 3. Textual information embeds domain knowledge
  • 4. Textual information embeds domain knowledge. About 70% of source code consists of identifiers.* * Deissenboeck, F. and Pizka, M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
  • 5. Textual information embeds domain knowledge. About 70% of source code consists of identifiers.* Identifiers are an important source of information for maintenance tasks: • traceability link recovery • feature location. * Deissenboeck, F. and Pizka, M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
  • 6. # of Cumulative Feature Location Papers based on Textual Information
  • 7. Related Work on Identifiers • Takang et al. (JPL’96) – programs with full-word identifiers are more understandable than those with abbreviated ones • Lawrie et al. (ICPC’06) – full words and recognizable abbreviations lead to better comprehension • Binkley et al. (ICPC’09) – CamelCase style is easier to recognize than underscore
  • 8. Related Work on Identifiers • Enslen et al. (MSR’09) – Samurai: algorithm for splitting identifiers (using tables of identifier frequencies) • Guerrouj et al. (JSME’11) – TIDIER: algorithm for splitting identifiers (using contextual information) • Other related work – Deissenboeck and Pizka (SQJ’06), Antoniol et al. (ICSM’07), Haiduc and Marcus (ICPC’08), etc.
  • 9. Splitting Identifiers Correctly is Challenging
  • 10. Identifier Splitting Algorithms. Original identifiers: userId, setGID, print_file2device, SSLCertificate, MINstring, USERID, currentsize, readadapterobject, tolocale, imitating, DEFMASKBit
  • 11. Identifier Splitting Algorithms. Original identifier → Camel Case: userId → user Id; setGID → set GID; print_file2device → print file 2 device; SSLCertificate → SSL Certificate; MINstring → MI Nstring; USERID → USERID; currentsize → currentsize; readadapterobject → readadapterobject; tolocale → tolocale; imitating → imitating; DEFMASKBit → DEFMASK Bit
  • 12. Identifier Splitting Algorithms. Camel Case table as on slide 11, annotated: handles underscores and digits; fails at mixed cases; fails at same-case identifiers
  • 13. Identifier Splitting Algorithms. Camel Case table as on slide 11
  • 14. Identifier Splitting Algorithms. Original identifier → Camel Case / Samurai: userId → user Id / user Id; setGID → set GID / set GID; print_file2device → print file 2 device / print file 2 device; SSLCertificate → SSL Certificate / SSL Certificate; MINstring → MI Nstring / MIN string; USERID → USERID / USER ID; currentsize → currentsize / current size; readadapterobject → readadapterobject / read adapter object; tolocale → tolocale / tol ocal e; imitating → imitating / imi ta ting; DEFMASKBit → DEFMASK Bit / DEF MASK Bit
  • 15. Identifier Splitting Algorithms. Camel Case and Samurai table as on slide 14, annotated: Samurai splits some cases where CamelCase cannot; oversplits
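The Camel Case column on the slides above can be approximated with a single regular expression. The sketch below is an illustration only: the pattern and the function name `camel_case_split` are assumptions, not the splitter used in the study. It treats underscores as separators, isolates digit runs, and breaks a run of capitals just before a capitalized word, which is why it yields "MI Nstring" for MINstring and leaves same-case identifiers such as USERID or currentsize untouched.

```python
import re

# Hypothetical pattern (an assumption, not the study's implementation):
#   1. a run of capitals that is followed by a capitalized word
#   2. a capitalized or all-lowercase word
#   3. any remaining run of capitals
#   4. a run of digits
# Underscores match none of the alternatives, so they act as separators.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def camel_case_split(identifier: str) -> list[str]:
    """Split an identifier on case changes, underscores, and digits."""
    return _TOKEN.findall(identifier)


for name in ["userId", "print_file2device", "MINstring", "USERID", "currentsize"]:
    print(name, "->", " ".join(camel_case_split(name)))
```

A frequency-based splitter such as Samurai needs identifier frequency tables, so it is not sketched here.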
  • 16. # of Cumulative Feature Location Papers based on Textual Information
  • 17. # of Cumulative Feature Location Papers based on Textual Information. Existing feature location techniques use Camel Case splitting
  • 18. Information Retrieval FLT • Generate corpus • Preprocessing – Remove non-literals – Remove stop words – Split identifiers – Stemming • Indexing – Term-by-document matrix – Singular Value Decomposition • User formulates query • Generate results • Ranked list. Example method: synchronized void print(TestResult result, long runTime) throws IOException { printHeader(runTime); printErrors(result); printFailures(result); printFooter(result); } After removing non-literals: synchronized void print TestResult result long runTime throws IOException printHeader runTime printErrors result printFailures result printFooter result. After removing stop words: print TestResult result runTime IOException printHeader runTime printErrors result printFailures result printFooter result
  • 19. Information Retrieval FLT • Generate corpus • Preprocessing – Remove non-literals – Remove stop words – Split identifiers – Stemming • Indexing – Term-by-document matrix – Singular Value Decomposition • User formulates query • Generate results • Ranked list. After splitting identifiers: print Test Result result run Time IO Exception print Header run Time print Errors result print Failures result print Footer result. After stemming: print test result result run time io exception print head run time print error result print fail result print foot result. Term-by-document matrix (columns: print, test, result, ...): m1: 5, 1, 3, ...; m2: ..., ..., ..., ...
  • 20. Information Retrieval FLT • Generate corpus • Preprocessing – Remove non-literals – Remove stop words – Split identifiers – Stemming • Indexing – Term-by-document matrix – Singular Value Decomposition • User formulates query • Generate results • Ranked list. Term-by-document matrix (columns: print, test, result, ...): m1: 5, 1, 3, ...; m2: ..., ..., ..., ...
  • 21. IR and Dynamic Information FLT • Generate corpus • Preprocessing – Remove non-literals – Remove stop words – Split identifiers – Stemming • Indexing – Term-by-document matrix – Singular Value Decomposition • User formulates query • Collect execution trace • Generate results • Ranked list of executed methods
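The preprocessing and indexing steps listed on these slides can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stop-word list contains only the Java keywords from the slide's example, the splitting pattern is a hypothetical Camel Case regular expression, and the names `preprocess` and `term_by_document` are invented here. Stemming and the Singular Value Decomposition step are left out, so the counts it produces need not match the illustrative values in the slide's matrix.

```python
import re
from collections import Counter

# Assumed stop-word list: only the Java keywords that occur in the slide's example.
STOP_WORDS = {"synchronized", "void", "long", "throws"}

_WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Hypothetical Camel Case pattern, used here in place of a real splitter.
_PART = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def preprocess(source: str) -> list[str]:
    """Keep identifier-like tokens, drop stop words, split and lowercase the rest."""
    terms = []
    for token in _WORD.findall(source):      # drops punctuation and operators
        if token in STOP_WORDS:              # drops stop words
            continue
        terms.extend(part.lower() for part in _PART.findall(token))
    return terms


def term_by_document(corpus: dict[str, str]) -> dict[str, Counter]:
    """Map each document (method) id to its term counts, one matrix row per method."""
    return {doc_id: Counter(preprocess(text)) for doc_id, text in corpus.items()}


method = """synchronized void print(TestResult result, long runTime) throws IOException {
    printHeader(runTime); printErrors(result); printFailures(result); printFooter(result); }"""

matrix = term_by_document({"m1": method})
print(matrix["m1"].most_common(3))
```

Swapping `_PART` for a different splitter changes the vocabulary of the matrix, which is the effect the study measures.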
  • 22. Research Goal: Evaluate how advanced splitting techniques impact the performance of feature location techniques
  • 23. Information Retrieval FLT • Generate corpus • Preprocessing – Remove non-literals – Remove stop words – Split identifiers – Stemming • Indexing – Term-by-document matrix – Singular Value Decomposition • User formulates query • Generate results • Ranked list. Replace Camel Case with: • Samurai • “Perfect” splitting algorithm (Oracle). Ranked list annotated: Better, Worst
  • 24. Building the Oracle: Extract Identifiers → All Identifiers
  • 25. Building the Oracle: Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER) → YES → Concordant Split Identifiers
  • 26. Building the Oracle: Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER) → YES → Concordant Split Identifiers • Assume they are correct • Manually verified a sample • Threat to validity
  • 27. Building the Oracle: Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER) → YES → Concordant Split Identifiers; NO → Discordant Split Identifiers → Manual Split → Manually Split Identifiers
  • 28. Building the Oracle: Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER) → YES → Concordant Split Identifiers; NO → Discordant Split Identifiers → Manual Split → Manually Split Identifiers. Manual split: consensus between authors; checked source code
  • 29. Building the Oracle: Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER) → YES → Concordant Split Identifiers; NO → Discordant Split Identifiers → Manual Split → Manually Split Identifiers. Identifiers that could not be split • Examples: DT, i3, P754, zzz, etc. • Left unchanged
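The "Same split?" decision in the Oracle-building flow amounts to partitioning identifiers by whether every splitter agrees. The sketch below assumes each splitter is a function from an identifier to a space-separated split; the name `partition_by_agreement` is hypothetical. The lookup tables copy a few rows from the Camel Case and Samurai columns shown earlier in the deck. TIDIER output is not given on the slides, so only two splitters are used in the demonstration.

```python
from typing import Callable, Iterable

Splitter = Callable[[str], str]


def partition_by_agreement(
    identifiers: Iterable[str], splitters: list[Splitter]
) -> tuple[dict[str, str], list[str]]:
    """Return (concordant splits, discordant identifiers needing a manual split)."""
    concordant: dict[str, str] = {}
    discordant: list[str] = []
    for ident in identifiers:
        splits = {split(ident) for split in splitters}
        if len(splits) == 1:
            concordant[ident] = splits.pop()   # all splitters agree
        else:
            discordant.append(ident)           # goes to manual splitting
    return concordant, discordant


# Rows copied from the slides' Camel Case and Samurai columns.
CAMEL = {"userId": "user Id", "MINstring": "MI Nstring",
         "USERID": "USERID", "setGID": "set GID"}
SAMURAI = {"userId": "user Id", "MINstring": "MIN string",
           "USERID": "USER ID", "setGID": "set GID"}

concordant, discordant = partition_by_agreement(
    CAMEL, [CAMEL.__getitem__, SAMURAI.__getitem__]
)
print(concordant)
print(discordant)
```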
  • 30. Design of the Case Study
  • 31. Design of the Case Study • RQ: Does an FLT with an advanced splitting algorithm produce better results than the same FLT using the CamelCase splitting algorithm?
  • 32. How to Compare two FLTs?
  • 33. How to Compare two FLTs? • Effectiveness measure for each feature. IR ranked list (method, LSI score): M121 0.92; M64 0.89; M15 0.86; M39 0.80; M7 0.74 (gold set method); M152 0.65; M234 0.56; M12 0.54; M78 0.52. Effectiveness = 5
  • 34. How to Compare two FLTs? • Effectiveness measure for each feature. IR ranked list (method, LSI score): M121 0.92; M64 0.89; M15 0.86; M39 0.80; M7 0.74; M152 0.65; M234 0.56; M12 0.54; M78 0.52. IRDyn ranked list, restricted to executed methods (from trace): M15 0.86; M7 0.74 (gold set method); M234 0.56; M12 0.54. Effectiveness = 2
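The two effectiveness values on these slides are consistent with reading effectiveness as the rank of the first gold set method in the ranked list, with IRDyn first discarding methods that do not appear in the execution trace. The sketch below works under that reading. Taking M7 as the gold set method is an inference from the reported values (5 and 2), and the function name `effectiveness` is hypothetical.

```python
from typing import Iterable, Optional


def effectiveness(
    ranked: list[str], gold_set: Iterable[str], executed: Optional[set[str]] = None
) -> Optional[int]:
    """Rank (1-based) of the first gold set method; lower is better.

    If `executed` is given, methods outside the execution trace are removed
    before ranking, as in the IRDyn variant.
    """
    gold = set(gold_set)
    if executed is not None:
        ranked = [m for m in ranked if m in executed]
    for rank, method in enumerate(ranked, start=1):
        if method in gold:
            return rank
    return None  # no gold set method was retrieved


# Ranked list from the slides, ordered by decreasing LSI score.
ir_ranking = ["M121", "M64", "M15", "M39", "M7", "M152", "M234", "M12", "M78"]
trace = {"M15", "M7", "M234", "M12"}   # executed methods shown on the slide
gold = {"M7"}                           # assumed gold set method

print(effectiveness(ir_ranking, gold))                   # 5
print(effectiveness(ir_ranking, gold, executed=trace))   # 2
```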
  • 35. Which FLTs are we Comparing?
  • 36. Software Systems • Rhino 1.6R5 • 138 classes, 1,870 methods, 32K LOC • Eaddy et al.’s data* • 2 datasets. RhinoFeatures: size 241; queries: sections of ECMAScript documentation; gold sets: Eaddy et al.*; execution information: full execution traces (from unit tests). RhinoBugs: size 143; queries: bug title and description; gold sets: Eaddy et al.* (CVS); execution information: N/A. * http://www.cs.columbia.edu/~eaddy/concerntagger/
  • 37. Software Systems • jEdit 4.3 • 483 classes, 6.4K methods, 109K LOC • 2 datasets. jEditFeatures: size 64; queries: feature (or patch) title and description; gold sets: SVN; execution information: marked execution traces. jEditBugs: size 86; queries: bug title and description; gold sets: SVN; execution information: marked execution traces. Datasets available at: http://www.cs.wm.edu/semeru/data/icpc11-identifier-splitting/
  • 38. Generating the jEdit Datasets: SVN commits between v4.2-v4.3
  • 39. Generating the jEdit Datasets: SVN commit message; Title + Description = Query
  • 40. Generating the jEdit Datasets: changed files; previous version and current version compared using Eclipse AST; modified methods (gold set)
  • 42. Presenting the Results: box plot of all effectiveness measures in datasets (e.g., 241 datapoints for RhinoFeatures), showing average and median
  • 44. IR FLTs: similar median and average. Panel: RhinoFeatures
  • 45. IR FLTs: similar median and average. Panels: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs
  • 46. IR FLTs: similar median and average. Datasets with features have better results than datasets with bugs. Panels: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs
  • 47. IRDyn FLTs: similar median and average. Panels: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs
  • 48. IRDyn FLTs: similar median and average. Datasets with features have better results than datasets with bugs. Panels: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs
  • 49. Compare FLTs by Percentages. Effectiveness pairs (IROracle, IRCamelCase (Baseline)): (10, 17), (20, 15), (18, 18), (5, 9), (4, 16), (19, 7), (12, 28), (14, 15)
  • 50. Compare FLTs by Percentages. Effectiveness pairs (IROracle, IRCamelCase (Baseline)): (10, 17), (20, 15), (18, 18), (5, 9), (4, 16), (19, 7), (12, 28), (14, 15). 5/8 and 2/8
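The fractions 5/8 and 2/8 can be reproduced by counting, per feature, which technique ranks the gold set method higher. The sketch below assumes the sixteen numbers on the slide form eight (IROracle, IRCamelCase) pairs read row by row, and that a lower effectiveness value is better. Under that reading IROracle wins 5 of 8 comparisons, IRCamelCase wins 2, and one is a tie. The function name `compare_by_percentages` is hypothetical.

```python
def compare_by_percentages(pairs: list[tuple[int, int]]) -> dict[str, float]:
    """Fraction of features where each technique has the better (lower) rank."""
    n = len(pairs)
    first_better = sum(1 for a, b in pairs if a < b)
    second_better = sum(1 for a, b in pairs if a > b)
    ties = n - first_better - second_better
    return {
        "first_better": first_better / n,
        "second_better": second_better / n,
        "ties": ties / n,
    }


# Illustrative (IROracle, IRCamelCase) effectiveness pairs from the slide,
# under the assumed row-by-row column order.
pairs = [(10, 17), (20, 15), (18, 18), (5, 9), (4, 16), (19, 7), (12, 28), (14, 15)]

result = compare_by_percentages(pairs)
print(result)   # {'first_better': 0.625, 'second_better': 0.25, 'ties': 0.125}
```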
  • 51. IR. Panels: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs
  • 52. IR: datasets with features vs. datasets with bugs. Panels: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs
  • 53. IRDyn: similar trend. Panels: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs
  • 54. Statistical Results • Wilcoxon signed-rank test • Null hypothesis – There is no statistically significant difference in terms of effectiveness between IRSamurai/IROracle and IRCamelCase • Alternative hypothesis – IRSamurai/IROracle has statistically significantly higher effectiveness than IRCamelCase • alpha = 0.05
  • 55. IR: the only statistically significant result (p=0.05). Panels: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs
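The statistic behind the Wilcoxon signed-rank test can be computed by hand. The sketch below is a simplified illustration, not the study's analysis: it drops zero differences, assigns average ranks to tied absolute differences, and returns the positive and negative rank sums. Turning those sums into a p-value (via exact tables or a normal approximation) is omitted. The input reuses the eight illustrative effectiveness pairs from the percentages slide under an assumed column order, and the function name `signed_rank_sums` is hypothetical.

```python
def signed_rank_sums(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # zero differences are dropped
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average = (i + 1 + j) / 2          # mean of ranks i+1 .. j for a tie group
        for k in range(i, j):
            ranks[k] = average
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Illustrative values only (assumed column order), not the study's data.
ir_oracle = [10, 20, 18, 5, 4, 19, 12, 14]
ir_camel_case = [17, 15, 18, 9, 16, 7, 28, 15]

w_plus, w_minus = signed_rank_sums(ir_oracle, ir_camel_case)
print(w_plus, w_minus)   # 8.5 19.5
```

Because effectiveness is a rank, a larger negative rank sum here means the first technique tends to place the gold set method nearer the top of the list.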
  • 56. Qualitative Results • Vocabulary mismatch between queries and code: – Names of developers (e.g., Slava, Carlos) – Identifiers specific to communication (e.g., thanks, greetings, annoying)
  • 57. Qualitative Results • Features are more “descriptive” than bugs
  • 58. Qualitative Results • Features are more “descriptive” than bugs. Words “join” and “line” are not mentioned
  • 59. Threats to Validity • External – 2 Java applications (different domains) – More systems needed • Construct – Errors may be present in Oracle and gold sets – We used data produced by other researchers • Internal – Subjectivity and bias in building the Oracle • Conclusion – Non-parametric test: Wilcoxon signed-rank
  • 60. Research Questions • RQ1 Does IRSamurai outperform IRCamelCase in terms of effectiveness? NO • RQ2 Does IRSamuraiDyn outperform IRCamelCaseDyn in terms of effectiveness? NO • RQ3 Does IROracle outperform IRCamelCase in terms of effectiveness? In some cases (Rhino) • RQ4 Does IROracleDyn outperform IRCamelCaseDyn in terms of effectiveness? NO
  • 61. Future Work • More systems and datasets • Different maintenance tasks – Traceability link recovery • Consider other splitting algorithms
  • 62. Conclusions • Advanced splitting techniques could improve FLTs – We found some empirical evidence • Splitting has more impact on IR FLT • If execution information is available, it is not necessary to use an advanced splitting technique
  • 63. Thank you! Questions? SEMERU @ William and Mary http://www.cs.wm.edu/semeru/ bdit@cs.wm.edu
  • 64. References • Takang et al. (1996) Takang, A., Grubb, P., and Macredie, R., "The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Investigation", Journal of Programming Languages, vol. 4, no. 3, 1996, pp. 143-167 • Lawrie et al. (2006) Lawrie, D., Morrell, C., Feild, H., and Binkley, D., "What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June 14-16 2006, pp. 3-12 • Binkley et al. (2009) Binkley, D., Davis, M., Lawrie, D., and Morrell, C., "To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May 17-19 2009, pp. 158-167 • Enslen et al. (2009) Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K., "Mining Source Code to Automatically Split Identifiers for Software Analysis", in Proc. of IEEE MSR'09, May 16-17 2009, pp. 71-80 • Guerrouj et al. (2011) Guerrouj, L., Di Penta, M., Antoniol, G., and Guéhéneuc, Y.-G., "TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques", JSME, vol. to appear, 2011