Recognizing Words from Source Code
Identifiers using Speech Recognition
             Techniques
                                              CSMR 2010, Madrid




 Nioosha Madani, Latifa Guerrouj, Massimiliano Di Penta,
      Yann-Gaël Guéhéneuc, and Giuliano Antoniol
Content
                    Problem Statement

                    Aligning Strings and Words

                    Meta-heuristic Inspired Approach

                    Technologies

                    Case Study - Research Questions

                    Case Study - Results

                    Conclusion and Future Work
Problem Statement

                                                   The Challenge
                    A few years after deployment, documentation may
                    no longer exist.

                    If it exists, it will be almost surely outdated.

                    My customers desire to change the system, add
                    new functionalities or fix a defect.

                    The only available source of information is the
                    code:
                               Identifiers;
                               Comments.
Problem Statement

                                            Identifier Semantics

                    Researchers agree that the identifier semantics are
                    important:
                          Help program comprehension;

                          Suggest clues.



                    Composed identifiers:
                          Camel Case or explicit separators: MyLocalAccount, User_Address

                          Contraction based: pntrctr, usrAdrss, imagEdge

                          Good and possibly known to the developers:
                          hmmm, ixoth, pqrstuvwxyz
Problem Statement

                     Words, Terms, Soft, and Hard Words

                    Term: any substring in a compound identifier.


                    Word: an entry in a dictionary (e.g., the English
                    dictionary).


                    Hard words: terms composing an identifier that reflect
                    domain concepts and are clearly demarcated:
                                baseAddress, user_file
                    Soft words: terms different from the identifier and not
                    clearly demarcated (e.g., abbreviations, contractions):
                                userarea, ptrcntr, userGid
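Hard words can be recovered from explicit markers alone. The following is a minimal sketch of such a splitter, assuming underscores and case changes are the only demarcations; the function name `split_hard_words` and the acronym rule are illustrative assumptions, not part of the presented approach.

```python
import re


def split_hard_words(identifier):
    """Split an identifier on separators and case changes (hypothetical helper)."""
    parts = []
    # Underscores and any other non-alphanumeric characters act as separators.
    for chunk in re.split(r"[_\W]+", identifier):
        # Break between a lower-case letter or digit and an upper-case letter:
        # baseAddress -> base Address
        chunk = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", chunk)
        # Break before the last capital of an upper-case run (assumed acronym rule).
        chunk = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", chunk)
        parts.extend(part.lower() for part in chunk.split())
    return parts


print(split_hard_words("baseAddress"))  # hard words are found
print(split_hard_words("userarea"))     # soft words stay fused
```

A splitter of this kind leaves soft words such as userarea or ptrcntr untouched, which is the gap the rest of the deck addresses.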
Problem Statement

                                                   Current Practices

                    Camel Case-based approaches plus greedy
                    algorithms, e.g., Lawrie et al. 2006, 2007.

                    Samurai by Enslen et al., 2009:
                          Lexicon plus a greedy algorithm;

                          If a contraction is used somewhere in the code, then it is
                          likely used in the same context as the original term;

                          Frequency tables of contractions and terms to split
                          composed identifiers.


                    Limitations: abbreviations are not treated, and there is no
                    quantification of how close the match is to the
                    unknown string.
Problem Statement

                                    Our Approach in Essence
                    Developers compose identifiers:
                          Using terms and words reflecting domain concepts,
                          the developer's experience, and knowledge.

                    Developers generate contractions via a finite set of
                    transformation rules:
                          Drop all vowels, drop prefix, drop suffix, etc.

                    Our approach mimics the developer's identifier generation process:
                          Dictionaries capturing terms and words;

                          A search-based technique to split exactly any unknown
                          string;
                          A distance using Dynamic Time Warping (DTW) for
                          continuous speech recognition [H. Ney, 1984].
Aligning Strings and Words
                                                               Modified H. Ney DTW

[Figure: DTW distance matrix aligning the identifier to split, pntrctrusr (horizontal axis), against a dictionary of 3 words, pntr, ctr, and usr (vertical axis); cells hold distance values.]
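The idea of matching an identifier against a sequence of dictionary words can be illustrated with a much simpler stand-in. The sketch below is not the modified Ney DTW: it swaps in a plain Levenshtein distance combined with a segmentation dynamic program, and the names `edit_distance` and `best_split` are hypothetical. It only shows how a total distance of zero signals an exact split.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (stand-in for the DTW cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def best_split(identifier, dictionary):
    """Return (total distance, words) for the cheapest segmentation."""
    n = len(identifier)
    # best[i] holds the cheapest (cost, words) covering identifier[:i].
    best = [(0, [])] + [None] * n
    for end in range(1, n + 1):
        candidates = []
        for start in range(end):
            segment = identifier[start:end]
            # Closest dictionary word to this segment.
            cost, word = min((edit_distance(segment, w), w) for w in dictionary)
            prev_cost, prev_words = best[start]
            candidates.append((prev_cost + cost, prev_words + [word]))
        best[end] = min(candidates, key=lambda c: c[0])
    return best[n]


print(best_split("pntrctrusr", ["pntr", "ctr", "usr"]))
```

With the three-word dictionary from the slide, the identifier pntrctrusr is covered at distance zero by pntr, ctr, and usr.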
Meta-heuristic Inspired Approach
                                      Word Transformation Rules
                          Constraint: the string must remain at least 3 characters long

                            Drop all vowels                            pointer → pntr
                            Drop a random vowel                        user → usr
                            Drop a random character                    pntr → ptr
                            Drop suffix (ing, tion, ed, ment, able)    available → avail
                            Drop the last m characters                 rectangle → rect
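The transformation rules above can be sketched as small string functions. This is an illustrative reading of the table, not the authors' implementation: the function names are hypothetical, and returning the word unchanged when a rule would violate the 3-character constraint is an assumption.

```python
import random

VOWELS = "aeiou"
SUFFIXES = ("ing", "tion", "ed", "ment", "able")
MIN_LENGTH = 3  # constraint from the slide: result must keep at least 3 characters


def _keep_if_valid(original, candidate):
    # Assumption: a rule that would shorten the word too much is a no-op.
    return candidate if len(candidate) >= MIN_LENGTH else original


def drop_all_vowels(word):
    return _keep_if_valid(word, "".join(c for c in word if c not in VOWELS))


def drop_random_vowel(word, rng):
    positions = [i for i, c in enumerate(word) if c in VOWELS]
    if not positions:
        return word
    i = rng.choice(positions)
    return _keep_if_valid(word, word[:i] + word[i + 1:])


def drop_random_character(word, rng):
    if not word:
        return word
    i = rng.randrange(len(word))
    return _keep_if_valid(word, word[:i] + word[i + 1:])


def drop_suffix(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return _keep_if_valid(word, word[:-len(suffix)])
    return word


def drop_last(word, m):
    if m <= 0:
        return word
    return _keep_if_valid(word, word[:-m])


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(drop_all_vowels("pointer"), drop_suffix("available"), drop_last("rectangle", 5))
print(drop_random_vowel("user", rng), drop_random_character("pntr", rng))
```

The deterministic rules reproduce the examples in the table; the random rules return one of several possible contractions, which is why the search procedure applies them repeatedly.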
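Meta-heuristic Inspired Approach
Technologies
                                Overall Splitting (Hill Climbing) Procedure

[Flowchart, read as steps:]
  1. Match the identifier against the current dictionary with DTW and take the best matching.
  2. If the distance is zero: success.
  3. Otherwise, randomly select a word with a minimal non-zero distance.
  4. Apply a random transformation to the chosen word and add the transformed word to a temporary dictionary.
  5. Run the DTW match again and take the best matching.
  6. If the distance is reduced, add the word to the current dictionary; otherwise, discard the word from the temporary dictionary.
  7. If other transformations remain to apply, repeat from step 4.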
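The loop can be sketched for the simpler case of a single contracted term. This is a simplified illustration under stated assumptions: it uses Levenshtein distance instead of DTW, enumerates a reduced set of transformations, and the names `contractions` and `hill_climb` are hypothetical. A transformed word is kept only when it reduces the distance, as in the flowchart.

```python
import random

VOWELS = "aeiou"


def edit_distance(a, b):
    """Levenshtein distance (stand-in for the DTW distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def contractions(word):
    """Candidate transformed forms of word, each at least 3 characters long."""
    out = {"".join(c for c in word if c not in VOWELS)}              # drop all vowels
    out.update(word[:i] + word[i + 1:] for i in range(len(word)))    # drop one character
    out.update(word[:-m] for m in range(1, len(word)))               # drop last m characters
    return sorted(w for w in out if len(w) >= 3 and w != word)


def hill_climb(term, dictionary, max_iterations=10000, seed=0):
    """Return the dictionary word whose contraction matches term, or None."""
    rng = random.Random(seed)
    # Every known form (original or transformed) maps back to its dictionary word.
    origin = {word: word for word in dictionary}
    for _ in range(max_iterations):
        distances = {form: edit_distance(term, form) for form in origin}
        best = min(distances.values())
        closest = sorted(f for f, d in distances.items() if d == best)
        if best == 0:
            return origin[closest[0]]          # zero distance: success
        chosen = rng.choice(closest)           # random word at minimal distance
        candidates = contractions(chosen)
        if not candidates:
            continue
        candidate = rng.choice(candidates)     # random transformation
        # Keep the transformed word only if it reduces the distance.
        if candidate not in origin and edit_distance(term, candidate) < best:
            origin[candidate] = origin[chosen]
    return None


print(hill_climb("pntr", ["pointer", "counter", "user"]))
```

Because rejected candidates are never stored, the dictionary only grows with forms that move closer to the unknown term, which is what makes this a hill-climbing search rather than an exhaustive one.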
Case Study - Research Questions

                        RQ1: What is the percentage of identifiers
                               correctly split by the proposed approach?



                        RQ2: How does the proposed approach perform
                              compared with the Camel Case splitter?



                        RQ3: What percentage of identifiers containing
                              word abbreviations is the approach able to
                              map to dictionary words?
Case Study - Results

                       JHotDraw - Java

                           16 KLOC
                           155 files
                           2,348 identifiers (longer than 2 chars)
                           957 manually segmented identifiers


                       Lynx - C

                           174 KLOC
                           247 files
                           12,194 identifiers (longer than 2 chars)
                           3,085 manually segmented identifiers

Case Study - Results

                       RQ1 - Percentage of Correct Classifications

                       Systems     Ids     Single iteration   Multiple iterations   Errors
                       JHotDraw    957     891 (93%)          920 (95%)             37
                       Lynx        3,085   2,169 (70%)        2,901 (94%)           271

                            Typical cases where the approach failed:

                                             afaik, ihmo, foobar, fsize …

Case Study - Results

                                                 RQ2 - Camel Case Split

                              Systems     Ids     Correct Split   Errors
                              JHotDraw    957     874 (91%)       83
                              Lynx        3,085   561 (18%)       2,524

                       Statistical comparison (Fisher's exact test) with our approach:

                       Null hypothesis (H0): the proportions of correct splittings
                       obtained by the two approaches are not significantly different.

                               • JHotDraw: Odds Ratio = 1.3, p-value = 0.1

                               • Lynx: Odds Ratio = 60, p-value < 0.001
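An odds ratio for a 2x2 table of correct and incorrect splits is the cross-product ratio of its cells. The sketch below computes it for JHotDraw, assuming the comparison uses the single-iteration count of 891 correct splits out of 957; the slides do not say which counts were used, so this pairing is an assumption, and `odds_ratio` is a hypothetical helper rather than the test procedure itself.

```python
def odds_ratio(correct_a, errors_a, correct_b, errors_b):
    """Sample odds ratio of approach A versus approach B (cross-product ratio)."""
    return (correct_a * errors_b) / (errors_a * correct_b)


# Assumed inputs: proposed approach, single iteration (891 correct of 957),
# versus the Camel Case splitter (874 correct, 83 errors).
jhotdraw = odds_ratio(891, 957 - 891, 874, 83)
print(round(jhotdraw, 2))
```

Under that assumption the value comes out near 1.28, close to the reported 1.3; an odds ratio near 1 means the two approaches split JHotDraw identifiers about equally well, consistent with the non-significant p-value.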
Case Study - Results

                       RQ3 - Percentage of Correctly Split Identifiers

                         Systems     Ids     Correct Split   Errors
                         JHotDraw    957     920 (95%)       37
                         Lynx        3,085   2,901 (94%)     271

                        The novel identifier splitting approach performs
                        better than the Camel Case splitter.
Case Study - Results

                       Multiple Possible Splits - Successes

                       borddec         bord decimal             bord decision
                       anchorlen       anchor length            anchor lender
                       drawrect        draw rectangle
                       drawroundrect   draw round rectangle
                       fillrect        fill rectangle
                       javadrawapp     java draw apply          java draw append
                       netapp          net apply                net append
                       newlen          new length               new lender
                       nothingapp      nothing apply            nothing application
                       addcolumninfo   add column information   add column inform
                       addlbl          add label
                       casecomp        case compare             case complete


                                         Max of 10000 iterations
Case Study - Results

                          Multiple Possible Splits - Failures

                       serialversionuid              serial version did
                       selectionzordered             selection ordered
                       removefrfigurerequestremove   remove figure request remove
                       jhotdraw                      hot draw
                       getvadjustable                get bad just able
                       fimagewidth                   him age width
                       fimageheight                  him age height
                       writeref                      write red



                                            Max of 10000 iterations


                           DTW does not account for context, syntax, or semantics.
Case Study - Results

                                     Discussion - Challenges

                        How can we expand fwrite or pdraw?

                        How can we avoid expanding FileLen into File
                        Lender rather than File Length?

                        How can we recognize that ImagEdit has a correct
                        split at distance 1 and not 0?

                        How can we expand/split pqrstuvwxyz?
Case Study - Results

                                                         Threats to Validity
                        External validity:
                                 We analyzed only two systems;
                                 However, they come from different domains and use different programming languages.

                        Construct validity: errors may be present in the oracle!
                                 We detected 1% error in the first oracle release;
                                 We did our best to guess the programmers' intention, but we cannot
                                 exclude errors.

                        Reliability validity: replication package available.

                        Internal validity: subjectivity and bias in building the oracle:
                                 The same researcher built both oracles;
                                 The oracles were validated by two other researchers;
                                 The oracles are large enough that a few percent of errors would not
                                 change the conclusions.
Conclusion and Future Work
                                                          Conclusion

                        We presented a search-based approach to
                        automatically segment source code identifiers.

                        The novel approach is inspired by how developers
                        behave when composing identifiers.

                        The approach uses a dictionary, a distance computed
                        via DTW, and a set of word transformations.

                        Results on JHotDraw and Lynx show the superiority
                        of the approach over a simple Camel Case splitter.
Conclusion and Future Work
                                                        Future Work

                        We plan to:

                            Expand the evaluation to other systems.

                            Introduce enhanced heuristics for term selection
                            and word transformations.

                            Contextualize our search by coupling our
                            algorithm with the approach of Enslen et al.
                            [ELK, 2009] (restrict the search to the words used
                            in the same method, class, or package).
Finally… Questions




                    Thank you for your attention

References

                    [ELK, 2009] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker,
                    "Mining source code to automatically split identifiers for software
                    analysis," Mining Software Repositories, International Workshop on,
                    vol. 0, pp. 71-80, 2009.

                    [H. Ney, 1984] H. Ney, "The use of a one-stage dynamic programming
                    algorithm for connected word recognition," Acoustics, Speech and
                    Signal Processing, IEEE Transactions on, vol. 32, no. 2, pp. 263-271,
                    Apr 1984.

                    D. Lawrie, C. Morrell, H. Feild, and D. Binkley, "Effective identifier
                    names for comprehension and memory," Innovations in Systems and
                    Software Engineering, vol. 3, no. 4, pp. 303-318, 2007.

                    D. Lawrie, C. Morrell, H. Feild, and D. Binkley, "What's in a name? A
                    study of identifiers," in Proc. of the International Conference on
                    Program Comprehension (ICPC), 2006, pp. 3-12.
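Overall Splitting (Hill Climbing) Procedure

[Flowchart: the identifier and the dictionary feed a DTW match, which yields the best matching and a ranked word list. "Zero Dist?" leads to Success! when the distance is zero. Otherwise, words from the ranked word list are tried through a temporary dictionary: if the match improved, save the word and create a new dictionary; if not, discard the word and create a new dictionary.]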

More Related Content

Similar to CSMR10c.ppt

Csmr10c.ppt
Csmr10c.pptCsmr10c.ppt
CSMR10a.ppt
CSMR10a.pptCSMR10a.ppt
CSMR10a.ppt
Ptidej Team
ย 
Csmr10a.ppt
Csmr10a.pptCsmr10a.ppt
Wcre12b.ppt
Wcre12b.pptWcre12b.ppt
20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform
Andrey Karpov
ย 
20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform
PVS-Studio
ย 
Program errors occurring while porting C++ code from 32-bit platforms on 64-b...
Program errors occurring while porting C++ code from 32-bit platforms on 64-b...Program errors occurring while porting C++ code from 32-bit platforms on 64-b...
Program errors occurring while porting C++ code from 32-bit platforms on 64-b...
Andrey Karpov
ย 
Fase08.ppt
Fase08.pptFase08.ppt
130817 latifa guerrouj - context-aware source code vocabulary normalization...
130817   latifa guerrouj - context-aware source code vocabulary normalization...130817   latifa guerrouj - context-aware source code vocabulary normalization...
130817 latifa guerrouj - context-aware source code vocabulary normalization...
Ptidej Team
ย 

Similar to CSMR10c.ppt (9)

Csmr10c.ppt
Csmr10c.pptCsmr10c.ppt
Csmr10c.ppt
ย 
CSMR10a.ppt
CSMR10a.pptCSMR10a.ppt
CSMR10a.ppt
ย 
Csmr10a.ppt
Csmr10a.pptCsmr10a.ppt
Csmr10a.ppt
ย 
Wcre12b.ppt
Wcre12b.pptWcre12b.ppt
Wcre12b.ppt
ย 
20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform
ย 
20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform
ย 
Program errors occurring while porting C++ code from 32-bit platforms on 64-b...
Program errors occurring while porting C++ code from 32-bit platforms on 64-b...Program errors occurring while porting C++ code from 32-bit platforms on 64-b...
Program errors occurring while porting C++ code from 32-bit platforms on 64-b...
ย 
Fase08.ppt
Fase08.pptFase08.ppt
Fase08.ppt
ย 
130817 latifa guerrouj - context-aware source code vocabulary normalization...
130817   latifa guerrouj - context-aware source code vocabulary normalization...130817   latifa guerrouj - context-aware source code vocabulary normalization...
130817 latifa guerrouj - context-aware source code vocabulary normalization...
ย 

More from Ptidej Team

From IoT to Software Miniaturisation
From IoT to Software MiniaturisationFrom IoT to Software Miniaturisation
From IoT to Software Miniaturisation
Ptidej Team
ย 
Presentation
PresentationPresentation
Presentation
Ptidej Team
ย 
Presentation
PresentationPresentation
Presentation
Ptidej Team
ย 
Presentation
PresentationPresentation
Presentation
Ptidej Team
ย 
Presentation by Lionel Briand
Presentation by Lionel BriandPresentation by Lionel Briand
Presentation by Lionel Briand
Ptidej Team
ย 
Manel Abdellatif
Manel AbdellatifManel Abdellatif
Manel Abdellatif
Ptidej Team
ย 
Azadeh Kermansaravi
Azadeh KermansaraviAzadeh Kermansaravi
Azadeh Kermansaravi
Ptidej Team
ย 
Mouna Abidi
Mouna AbidiMouna Abidi
Mouna Abidi
Ptidej Team
ย 
CSED - Manel Grichi
CSED - Manel GrichiCSED - Manel Grichi
CSED - Manel Grichi
Ptidej Team
ย 
Cristiano Politowski
Cristiano PolitowskiCristiano Politowski
Cristiano Politowski
Ptidej Team
ย 
Will io t trigger the next software crisis
Will io t trigger the next software crisisWill io t trigger the next software crisis
Will io t trigger the next software crisis
Ptidej Team
ย 
MIPA
MIPAMIPA
MIPA
Ptidej Team
ย 
Thesis+of+laleh+eshkevari.ppt
Thesis+of+laleh+eshkevari.pptThesis+of+laleh+eshkevari.ppt
Thesis+of+laleh+eshkevari.ppt
Ptidej Team
ย 
Thesis+of+nesrine+abdelkafi.ppt
Thesis+of+nesrine+abdelkafi.pptThesis+of+nesrine+abdelkafi.ppt
Thesis+of+nesrine+abdelkafi.ppt
Ptidej Team
ย 
Medicine15.ppt
Medicine15.pptMedicine15.ppt
Medicine15.ppt
Ptidej Team
ย 
Qrs17b.ppt
Qrs17b.pptQrs17b.ppt
Qrs17b.ppt
Ptidej Team
ย 
Icpc11c.ppt
Icpc11c.pptIcpc11c.ppt
Icpc11c.ppt
Ptidej Team
ย 
Icsme16.ppt
Icsme16.pptIcsme16.ppt
Icsme16.ppt
Ptidej Team
ย 
Msr17a.ppt
Msr17a.pptMsr17a.ppt
Msr17a.ppt
Ptidej Team
ย 
Icsoc15.ppt
Icsoc15.pptIcsoc15.ppt
Icsoc15.ppt
Ptidej Team
ย 

More from Ptidej Team (20)

From IoT to Software Miniaturisation
From IoT to Software MiniaturisationFrom IoT to Software Miniaturisation
From IoT to Software Miniaturisation
ย 
Presentation
PresentationPresentation
Presentation
ย 
Presentation
PresentationPresentation
Presentation
ย 
Presentation
PresentationPresentation
Presentation
ย 
Presentation by Lionel Briand
Presentation by Lionel BriandPresentation by Lionel Briand
Presentation by Lionel Briand
ย 
Manel Abdellatif
Manel AbdellatifManel Abdellatif
Manel Abdellatif
ย 
Azadeh Kermansaravi
Azadeh KermansaraviAzadeh Kermansaravi
Azadeh Kermansaravi
ย 
Mouna Abidi
Mouna AbidiMouna Abidi
Mouna Abidi
ย 
CSED - Manel Grichi
CSED - Manel GrichiCSED - Manel Grichi
CSED - Manel Grichi
ย 
Cristiano Politowski
Cristiano PolitowskiCristiano Politowski
Cristiano Politowski
ย 
Will io t trigger the next software crisis
Will io t trigger the next software crisisWill io t trigger the next software crisis
Will io t trigger the next software crisis
ย 
MIPA
MIPAMIPA
MIPA
ย 
Thesis+of+laleh+eshkevari.ppt
Thesis+of+laleh+eshkevari.pptThesis+of+laleh+eshkevari.ppt
Thesis+of+laleh+eshkevari.ppt
ย 
Thesis+of+nesrine+abdelkafi.ppt
Thesis+of+nesrine+abdelkafi.pptThesis+of+nesrine+abdelkafi.ppt
Thesis+of+nesrine+abdelkafi.ppt
ย 
Medicine15.ppt
Medicine15.pptMedicine15.ppt
Medicine15.ppt
ย 
Qrs17b.ppt
Qrs17b.pptQrs17b.ppt
Qrs17b.ppt
ย 
Icpc11c.ppt
Icpc11c.pptIcpc11c.ppt
Icpc11c.ppt
ย 
Icsme16.ppt
Icsme16.pptIcsme16.ppt
Icsme16.ppt
ย 
Msr17a.ppt
Msr17a.pptMsr17a.ppt
Msr17a.ppt
ย 
Icsoc15.ppt
Icsoc15.pptIcsoc15.ppt
Icsoc15.ppt
ย 

Recently uploaded

Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
TechSoup
ย 
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdfREASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
giancarloi8888
ย 
Jemison, MacLaughlin, and Majumder "Broadening Pathways for Editors and Authors"
Jemison, MacLaughlin, and Majumder "Broadening Pathways for Editors and Authors"Jemison, MacLaughlin, and Majumder "Broadening Pathways for Editors and Authors"
Jemison, MacLaughlin, and Majumder "Broadening Pathways for Editors and Authors"
National Information Standards Organization (NISO)
ย 
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptxNEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
iammrhaywood
ย 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
Krassimira Luka
ย 
Standardized tool for Intelligence test.
Standardized tool for Intelligence test.Standardized tool for Intelligence test.
Standardized tool for Intelligence test.
deepaannamalai16
ย 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
Celine George
ย 
Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"
National Information Standards Organization (NISO)
ย 
HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.
deepaannamalai16
ย 
SWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptxSWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptx
zuzanka
ย 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
History of Stoke Newington
ย 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
RAHUL
ย 
Lifelines of National Economy chapter for Class 10 STUDY MATERIAL PDF
Lifelines of National Economy chapter for Class 10 STUDY MATERIAL PDFLifelines of National Economy chapter for Class 10 STUDY MATERIAL PDF
Lifelines of National Economy chapter for Class 10 STUDY MATERIAL PDF
Vivekanand Anglo Vedic Academy
ย 
Leveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit InnovationLeveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit Innovation
TechSoup
ย 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
MJDuyan
ย 
Bร€I TแบฌP Bแป” TRแปข TIแบพNG ANH LแปšP 9 Cแบข Nฤ‚M - GLOBAL SUCCESS - Nฤ‚M HแปŒC 2024-2025 - ...
Bร€I TแบฌP Bแป” TRแปข TIแบพNG ANH LแปšP 9 Cแบข Nฤ‚M - GLOBAL SUCCESS - Nฤ‚M HแปŒC 2024-2025 - ...Bร€I TแบฌP Bแป” TRแปข TIแบพNG ANH LแปšP 9 Cแบข Nฤ‚M - GLOBAL SUCCESS - Nฤ‚M HแปŒC 2024-2025 - ...
Bร€I TแบฌP Bแป” TRแปข TIแบพNG ANH LแปšP 9 Cแบข Nฤ‚M - GLOBAL SUCCESS - Nฤ‚M HแปŒC 2024-2025 - ...
Nguyen Thanh Tu Collection
ย 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
Nicholas Montgomery
ย 
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptxBIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
RidwanHassanYusuf
ย 
Bร€I TแบฌP Bแป” TRแปข TIแบพNG ANH 8 Cแบข Nฤ‚M - GLOBAL SUCCESS - Nฤ‚M HแปŒC 2023-2024 (Cร“ FI...
Bร€I TแบฌP Bแป” TRแปข TIแบพNG ANH 8 Cแบข Nฤ‚M - GLOBAL SUCCESS - Nฤ‚M HแปŒC 2023-2024 (Cร“ FI...Bร€I TแบฌP Bแป” TRแปข TIแบพNG ANH 8 Cแบข Nฤ‚M - GLOBAL SUCCESS - Nฤ‚M HแปŒC 2023-2024 (Cร“ FI...
Bร€I TแบฌP Bแป” TRแปข TIแบพNG ANH 8 Cแบข Nฤ‚M - GLOBAL SUCCESS - Nฤ‚M HแปŒC 2023-2024 (Cร“ FI...

CSMR10c.ppt

  • 1. Recognizing Words from Source Code Identifiers using Speech Recognition Techniques
      CSMR 2010, Madrid
      Nioosha Madani, Latifa Guerrouj, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol
  • 2. Content
      Problem Statement
      Aligning Strings and Words
      Meta-heuristic Inspired Approach
      Technologies
      Case Study – Research Questions
      Case Study – Results
      Conclusion and Future Work
  • 3. Problem Statement: The Challenge
      A few years after deployment, documentation may no longer exist. If it exists, it is almost surely outdated.
      Customers want to change the system, add new functionality, or fix a defect.
      The only available source of information is the code: identifiers and comments.
  • 4. Problem Statement: Identifier Semantics
      Researchers agree that identifier semantics are important: they help program comprehension and suggest clues.
      Composed identifiers:
        Camel Case: MyLocalAccount, User_Address
        Contraction based: pntrctr, usrAdrss, imagEdge
        Good and possibly known to the developers: hmmm, ixoth, pqrstuvwxyz
  • 5. Problem Statement: Words, Terms, Soft, and Hard Words
      Term: any substring in a compound identifier.
      Word: an entry in a dictionary (e.g., the English dictionary).
      Hard words: terms composing an identifier that reflect domain concepts and are clearly demarcated: baseAddress, user_file
      Soft words: terms different from the identifier and not clearly demarcated (e.g., abbreviations, contractions): userarea, ptrcntr, userGid
  • 6. Problem Statement: Current Practices
      Camel Case-based approaches plus greedy algorithms, e.g., Lawrie et al. 2006, 2007.
      Samurai by Enslen et al., 2009:
        Lexicon plus a greedy algorithm;
        If a contraction is used somewhere in the code, then it is likely used in the same context as the original term;
        Frequency tables of contractions and terms to split composed identifiers.
      Limitations: abbreviations are not treated, and there is no quantification of how close the match is to the unknown string.
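As an illustration of the Camel Case baseline mentioned on this slide, here is a minimal Python sketch. The function name `camel_case_split` and the regular expressions are assumptions made for this example, not the splitter evaluated in the talk. It breaks on non-alphanumeric separators and on lower-to-upper case changes, so a contraction-based identifier such as pntrctr comes back unsplit.

```python
import re


def camel_case_split(identifier: str) -> list[str]:
    """Split on separators (e.g. '_') and on lower-to-upper case changes."""
    terms = []
    # First cut on anything that is not a letter or a digit.
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        # Then cut before an uppercase letter that follows a lowercase letter or digit.
        terms.extend(t for t in re.split(r"(?<=[a-z0-9])(?=[A-Z])", part) if t)
    return terms


print(camel_case_split("MyLocalAccount"))  # hard words are recovered
print(camel_case_split("User_Address"))
print(camel_case_split("pntrctr"))         # soft words stay in one piece
```

The last call shows the limitation listed above: without a case change or separator there is nothing for this kind of splitter to cut on.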
  • 7. Problem Statement: Our Approach in Essence
      Developers compose identifiers using terms and words that reflect domain concepts and the developer's experience and knowledge.
      Developers generate contractions via a finite set of transformation rules: drop all vowels, drop prefix, drop suffix, etc.
      The approach mimics the developer's identifier generation process:
        Dictionaries capturing terms and words;
        A search-based technique to split exactly any unknown string;
        A distance using Dynamic Time Warping (DTW) for continuous speech recognition [H. Ney, 1984].
  • 8. Aligning Strings and Words: Modified H. Ney DTW
      [Figure: DTW distance matrix aligning the identifier to split, pntrctrusr (horizontal axis), against a dictionary of 3 words (vertical axis).]
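The idea of aligning an identifier against a small dictionary can be sketched in Python as follows. This is a simplified stand-in, not the modified Ney DTW of the slide: it swaps in plain Levenshtein edit distance and tries every split point by dynamic programming. The names `edit_distance` and `best_split` are hypothetical, and the three dictionary words (pntr, ctr, usr) are assumed for the example; the dictionary must be non-empty.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def best_split(identifier: str, dictionary: list[str]):
    """Return (total distance, words) for the cheapest split of the identifier."""
    best = [(0, [])]  # best[i] is the cheapest cover of identifier[:i]
    for end in range(1, len(identifier) + 1):
        options = []
        for start in range(end):
            chunk = identifier[start:end]
            for word in dictionary:
                cost = best[start][0] + edit_distance(chunk, word)
                options.append((cost, best[start][1] + [word]))
        best.append(min(options, key=lambda o: o[0]))
    return best[-1]


print(best_split("pntrctrusr", ["pntr", "ctr", "usr"]))
```

A total distance of zero means the identifier is an exact concatenation of dictionary words; with a dictionary holding only full words such as pointer, counter, and user, the same call returns a positive distance, which is the quantity the search procedure tries to drive down.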
  • 9. Meta-heuristic Inspired Approach: Word Transformation Rules
      Constraint: the string must keep at least 3 characters.
      Rule                                         Example
      Drop all vowels                              pointer → pntr
      Drop a random vowel                          user → usr
      Drop a random character                      pntr → ptr
      Drop suffix (ing, tion, ed, ment, able)      available → avail
      Drop the last m characters                   rectangle → rect
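The rules on this slide can be sketched as small Python functions. Function names, the `MIN_LEN` constant, and the choice to return the word unchanged when a rule would violate the length constraint are assumptions of this example; the slides do not say how a rejected transformation is handled.

```python
import random

VOWELS = set("aeiou")
SUFFIXES = ("ing", "tion", "ed", "ment", "able")
MIN_LEN = 3  # constraint from the slide: keep at least 3 characters


def _accept(original: str, candidate: str) -> str:
    """Keep the candidate only if it respects the length constraint (assumed policy)."""
    return candidate if len(candidate) >= MIN_LEN else original


def drop_all_vowels(word: str) -> str:
    return _accept(word, "".join(c for c in word if c.lower() not in VOWELS))


def drop_random_vowel(word: str, rng: random.Random) -> str:
    spots = [i for i, c in enumerate(word) if c.lower() in VOWELS]
    if not spots:
        return word
    i = rng.choice(spots)
    return _accept(word, word[:i] + word[i + 1:])


def drop_random_char(word: str, rng: random.Random) -> str:
    if len(word) <= MIN_LEN:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def drop_suffix(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return _accept(word, word[: -len(suffix)])
    return word


def drop_last(word: str, m: int) -> str:
    return _accept(word, word[:-m]) if m > 0 else word


print(drop_all_vowels("pointer"), drop_suffix("available"), drop_last("rectangle", 5))
```

The two random rules take an explicit `random.Random` instance so that a run can be reproduced by fixing the seed.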
  • 10. Meta-heuristic Inspired Approach – Technologies: Overall Splitting (Hill Climbing) Procedure
      [Flowchart]
      1. Match the identifier against the current dictionary with DTW and take the best matching.
      2. Zero distance? Yes: success. No: continue.
      3. Randomly select a word with a minimal non-zero distance.
      4. Apply a random transformation to the chosen word and add the transformed word to a temporary dictionary.
      5. Run the DTW match again and take the best matching.
      6. Distance reduced? Yes: the word goes into the current dictionary. No: discard the word from the temporary dictionary and, if other transformations remain to apply, try another.
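The loop in this flowchart can be sketched in Python as below. This is a simplified reading under stated assumptions, not the authors' implementation: Levenshtein distance stands in for the DTW distance, only two of the transformation rules are included, and all names (`segment`, `hill_climb`, `origin`) are hypothetical. Words tied at the minimal non-zero distance are chosen at random, and a transformed word is kept only when it strictly lowers the total distance.

```python
import random

VOWELS = set("aeiou")


def edit_distance(a, b):
    """Levenshtein distance, standing in for the DTW distance of the slides."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def segment(identifier, words):
    """Cheapest split: (total distance, [(chunk, distance to its closest word), ...])."""
    best = [(0, [])]
    for end in range(1, len(identifier) + 1):
        options = []
        for start in range(end):
            chunk = identifier[start:end]
            d = min(edit_distance(chunk, w) for w in words)
            options.append((best[start][0] + d, best[start][1] + [(chunk, d)]))
        best.append(min(options, key=lambda o: o[0]))
    return best[-1]


def drop_all_vowels(word, rng):
    return "".join(c for c in word if c not in VOWELS)


def drop_random_vowel(word, rng):
    spots = [i for i, c in enumerate(word) if c in VOWELS]
    if not spots:
        return word
    i = rng.choice(spots)
    return word[:i] + word[i + 1:]


TRANSFORMS = [drop_all_vowels, drop_random_vowel]


def hill_climb(identifier, dictionary, max_iterations=10000, seed=0):
    rng = random.Random(seed)
    origin = {w: w for w in dictionary}  # known word -> dictionary word it came from
    total, parts = segment(identifier, origin)
    for _ in range(max_iterations):
        if total == 0:
            break  # exact match: success
        # Chunk with the smallest non-zero distance, and the words tied at that distance.
        chunk, d = min((p for p in parts if p[1] > 0), key=lambda p: p[1])
        word = rng.choice([w for w in origin if edit_distance(chunk, w) == d])
        candidate = rng.choice(TRANSFORMS)(word, rng)
        if len(candidate) < 3 or candidate in origin:
            continue  # violates the length constraint or brings nothing new
        trial = {**origin, candidate: origin[word]}  # temporary dictionary
        new_total, new_parts = segment(identifier, trial)
        if new_total < total:  # keep the word only if the distance went down
            origin, total, parts = trial, new_total, new_parts
    closest = lambda chunk: min(origin, key=lambda w: edit_distance(chunk, w))
    return total, [origin[closest(chunk)] for chunk, _ in parts]


print(hill_climb("pntrusr", ["pointer", "user"]))
```

Tracking where each transformed word came from (`origin`) is what lets the sketch report dictionary words rather than contractions once the distance reaches zero. With edit distance as the stand-in, other inputs can stall on a local minimum, for example when a chunk happens to be closer to an unrelated word.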
  • 11. Case Study – Research Questions
      RQ1: What is the percentage of identifiers correctly split by the proposed approach?
      RQ2: How does the proposed approach perform compared with the Camel Case splitter?
      RQ3: What percentage of identifiers containing word abbreviations is the approach able to map to dictionary words?
  • 12. Case Study – Results
      JHotDraw (Java): 16 KLOC, 155 files, 2,348 identifiers (longer than 2 chars), 957 manually segmented identifiers.
      Lynx (C): 174 KLOC, 247 files, 12,194 identifiers (longer than 2 chars), 3,085 manually segmented identifiers.
  • 13. Case Study – Results: RQ1 - Percentage of Correct Classifications
      System     Split Ids   Single iteration   Multiple iterations   Errors
      JHotDraw   957         891 (93%)          920 (95%)             37
      Lynx       3,085       2,169 (70%)        2,901 (94%)           271
      Typical cases where the approach failed: afaik, ihmo, foobar, fsize, …
  • 14. Case Study – Results: RQ2 - Camel Case Split
      System     Split Ids   Correct Split   Errors
      JHotDraw   957         874 (91%)       83
      Lynx       3,085       561 (18%)       2,524
      Statistical comparison (Fisher's exact test) with our approach:
      Null hypothesis (H0): the proportions of correct splits obtained by the two approaches are not significantly different.
        • JHotDraw: Odds Ratio = 1.3, p-value = 0.1
        • Lynx: Odds Ratio = 60, p-value < 0.001
  • 15. Case Study – Results: RQ3 - Percentage of Correctly Split Id(s)
      System     Split Ids   Correct Split   Errors
      JHotDraw   957         920 (95%)       37
      Lynx       3,085       2,901 (94%)     271
      The novel identifier splitting approach performs better than the Camel Case splitter.
  • 16. Case Study – Results: Multiple Possible Splits - Successes
      Identifier       Proposed splits
      borddec          bord decimal; bord decision
      anchorlen        anchor length; anchor lender
      drawrect         draw rectangle
      drawroundrect    draw round rectangle
      fillrect         fill rectangle
      javadrawapp      java draw apply; java draw append
      netapp           net apply; net append
      newlen           new length; new lender
      nothingapp       nothing apply; nothing application
      addcolumninfo    add column information; add column inform
      addlbl           add label
      casecomp         case compare; case complete
      Max of 10000 iterations.
  • 17. Case Study – Results: Multiple Possible Splits - Failures
      Identifier                     Proposed split
      serialversionuid               serial version did
      selectionzordered              selection ordered
      removefrfigurerequestremove    remove figure request remove
      jhotdraw                       hot draw
      getvadjustable                 get bad just able
      fimagewidth                    him age width
      fimageheight                   him age height
      writeref                       write red
      Max of 10000 iterations.
      DTW does not account for context, syntax, or semantics.
  • 18. Case Study – Results: Discussion - Challenges
      How can we expand fwrite or pdraw?
      How can we avoid expanding FileLen into File Lender rather than File Length?
      How can we recognize that ImagEdit has a correct split at distance 1 and not 0?
      How can we expand/split pqrstuvwxyz?
  • 19. Case Study – Results: Threats to Validity
      External validity: we analyzed only two systems; however, they come from different domains and use different programming languages.
      Construct validity: errors may be present in the oracle. We detected 1% error in the first oracle release; we did our best to guess programmer intention, but we cannot exclude errors.
      Reliability validity: replication package available.
      Internal validity: subjectivity and bias in building the oracle:
        The same researcher built both oracles;
        Oracles were validated by two other researchers;
        The oracle is large enough that an error rate of a few percent would not change the conclusions.
  • 20. Conclusion and Future Work: Conclusion
      We presented a search-based approach to automatically segment source code identifiers.
      The novel approach is inspired by developer behavior when composing identifiers.
      The approach uses a dictionary, a distance computed via DTW, and a set of word transformations.
      Results on JHotDraw and Lynx show the superiority of the approach over a simple Camel Case splitter.
  • 21. Conclusion and Future Work: Future Work
      We plan to:
        Expand the evaluation to other systems.
        Introduce enhanced heuristics for term selection and word transformations.
        Contextualize our search by coupling our algorithm with the approach of Enslen et al. [ELK, 2009] (restrict the search to the words used in the same method, class, or package).
  • 22. Finally… Questions
      Thank you for your attention.
  • 23. References
      [ELK, 2009] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, "Mining source code to automatically split identifiers for software analysis," Mining Software Repositories, International Workshop on, vol. 0, pp. 71-80, 2009.
      [H. Ney, 1984] H. Ney, "The use of a one-stage dynamic programming algorithm for connected word recognition," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 32, no. 2, pp. 263-271, Apr 1984.
      D. Lawrie, C. Morrell, H. Feild, and D. Binkley, "Effective identifier names for comprehension and memory," Innovations in Systems and Software Engineering, vol. 3, no. 4, pp. 303-318, 2007.
      D. Lawrie, C. Morrell, H. Feild, and D. Binkley, "What's in a name? A study of identifiers," in Proc. of the International Conference on Program Comprehension (ICPC), 2006, pp. 3-12.
  • 24. Overall Splitting (Hill Climbing) Procedure
      [Flowchart] Identifier → DTW match against the dictionary → best matching. Zero distance? Yes: success. No: take the ranked word list and build a temporary dictionary. Improved? Yes: save the word and create a new dictionary. No: discard the word and create a new dictionary.