Software code, identifiers, expansion, splitting


  1. 1. TRIS: a Fast and Accurate Identifiers Splitting and Expansion Algorithm WCRE 2012, Kingston, Canada Latifa Guerrouj, Philippe Galinier, Yann-Gaël Guéhéneuc, Giuliano Antoniol, and Massimiliano Di Penta
  2. 2. Content: Research Context; Current Practices; TRIS in Essence; Case Study; Threats to Validity; Conclusion. WCRE’12, Kingston 2/25
  3. 3. Research Context. Researchers agree that identifier semantics are important. Identifiers help program comprehension [CT99, DP05, LFB06], improve software quality [MPF08, LMFB06, JH06], and suggest clues [TGM96, LMFB07]. Composed identifiers: Camel Case: MyLocalAccount, User_Address. Contraction based: pntr, ctr, usrAdrss, imagEdge. Good and possibly known to the developers: hmmm, ixoth, pqrstuvwxyz.
  4. 4. Current Practices. Camel Case-based approaches [LFB06]. Samurai by Enslen et al. [ELK09]: mines term frequencies in large source code bases and relies on two assumptions: 1) a substring composing an identifier is also likely to be used in other parts of the program or in other programs; 2) given two possible splits of an identifier, the split that most likely represents the developer’s intent partitions the identifier into terms occurring more often in the program. Abbreviations are not treated, and there is no quantification of how close the match is to the unknown string.
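The second assumption can be illustrated with a minimal sketch. It assumes a toy frequency table and a mean log-frequency score; the names `TERM_FREQ`, `split_score`, and `best_split` are hypothetical, and this is not Samurai's actual scoring function.

```python
import math

# Hypothetical term frequencies, as if mined from a large code base.
TERM_FREQ = {"user": 120, "address": 80, "use": 200, "rad": 3, "dress": 5}

def split_score(terms):
    """Mean log-frequency of the terms; unseen terms contribute log(1) = 0."""
    return sum(math.log(1 + TERM_FREQ.get(t, 0)) for t in terms) / len(terms)

def best_split(candidates):
    """Return the candidate split whose terms occur most often on average."""
    return max(candidates, key=split_score)

candidates = [["user", "address"], ["use", "rad", "dress"]]
print(best_split(candidates))  # ['user', 'address']
```

A score like this rewards splits made of frequent terms, but it has no notion of contractions such as `usr`, which is the limitation the slide points out.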
  5. 5. Current Practices. TIDIER [GMA11]: inspired by speech recognition techniques, it uses Dynamic Time Warping to match terms to domain concepts. It exploits context in the form of specialized dictionaries and mimics the process of transforming words via contraction rules. GenTest/Normalize [LBM10, LB11]: GenTest generates all possible splittings and evaluates a scoring function against each proposed splitting. Normalize uses a machine translation technique, the maximum coherence model. TIDIER, GenTest, and Normalize are demanding in terms of computation time.
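To see why generating all possible splittings is costly, the following sketch enumerates every way to cut an identifier into contiguous parts. The function name `all_splittings` is hypothetical and this is not GenTest's implementation; it only shows that an identifier of length n has 2^(n-1) candidate splittings.

```python
from itertools import combinations

def all_splittings(identifier):
    """Yield every way to cut the identifier into contiguous, non-empty parts."""
    n = len(identifier)
    for k in range(n):  # k = number of cut points, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [identifier[a:b] for a, b in zip(bounds, bounds[1:])]

# An 11-character identifier already has 2**10 candidate splittings.
print(len(list(all_splittings("callableint"))))  # 1024
```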
  6. 6. TRIS in Essence. TRIS assumes that developers compose identifiers using terms and words reflecting domain concepts and the developer’s experience and knowledge. TRIS uses a set of transformation rules: drop vowels, drop prefix, drop suffix, etc. TRIS treats the problem as an optimization problem where the cost function is: C(wOrig → w) = α * Freq(wOrig) + C(type(wOrig → w)), where Freq(wOrig) is the frequency of wOrig in the source code and C(type(wOrig → w)) is the cost of the transformation type.
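A minimal sketch of the transformation rules and the cost function follows. The slide names the rules but gives no numeric values, so the per-type costs in `TYPE_COST`, the value of `alpha`, and the helper names `transformations` and `cost` are all assumptions made for illustration.

```python
VOWELS = set("aeiou")

# Hypothetical per-type costs; the slide does not give numeric values.
TYPE_COST = {"identity": 0.0, "drop_vowels": 1.0,
             "drop_suffix": 2.0, "drop_prefix": 2.0}

def transformations(word):
    """Return {transformed_form: transformation_type} for a non-empty word."""
    forms = {word: "identity"}
    # Drop every vowel after the first letter, e.g. "pointer" -> "pntr".
    no_vowels = word[0] + "".join(c for c in word[1:] if c not in VOWELS)
    forms.setdefault(no_vowels, "drop_vowels")
    # Drop a suffix: keep proper prefixes of length >= 2, e.g. "integer" -> "int".
    for i in range(2, len(word)):
        forms.setdefault(word[:i], "drop_suffix")
    # Drop a prefix: keep proper suffixes of length >= 2.
    for i in range(1, len(word) - 1):
        forms.setdefault(word[i:], "drop_prefix")
    return forms

def cost(word, form_type, freq, alpha):
    """C(wOrig -> w) = alpha * Freq(wOrig) + C(type(wOrig -> w)), as on the slide."""
    return alpha * freq[word] + TYPE_COST[form_type]

print(transformations("pointer")["pntr"])                      # drop_vowels
print(cost("integer", "drop_suffix", {"integer": 10}, 0.5))    # 7.0
```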
  7. 7. TRIS in Essence. TRIS applies a two-phase strategy. 1) Building dictionary transformations: computation of the frequency of dictionary words; construction of the set of transformations; construction of the arborescence. 2) Identifier processing algorithm: construction of the auxiliary graph of the identifier; search for a shortest path in the auxiliary graph.
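The second phase can be sketched as a shortest-path search over character positions. This is a simplified illustration under stated assumptions: nodes are positions 0..n in the identifier, an edge i → j exists when `identifier[i:j]` is a known form, and a plain dictionary lookup stands in for the arborescence. The names `split_identifier` and `FORMS`, and all costs, are hypothetical.

```python
import math

def split_identifier(identifier, form_table):
    """Shortest path from position 0 to len(identifier) in the auxiliary graph.

    form_table maps a (possibly contracted) form to (original_word, cost).
    Returns (total_cost, [original words]), or (inf, []) if no path exists.
    """
    n = len(identifier)
    best = [math.inf] * (n + 1)  # best[j]: cheapest cost to reach position j
    back = [None] * (n + 1)      # back[j]: (previous position, word used)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            entry = form_table.get(identifier[i:j])
            if entry is not None and best[i] + entry[1] < best[j]:
                best[j] = best[i] + entry[1]
                back[j] = (i, entry[0])
    if math.isinf(best[n]):
        return math.inf, []
    words, j = [], n
    while j > 0:  # walk the back-pointers from the end to the start
        j, word = back[j]
        words.append(word)
    return best[n], words[::-1]

# Hypothetical table of forms for the identifier "callableint".
FORMS = {
    "callable": ("callable", 1.0),
    "call": ("call", 1.0),
    "able": ("able", 1.5),
    "int": ("integer", 2.0),
}
print(split_identifier("callableint", FORMS))  # (3.0, ['callable', 'integer'])
```

Because edges only go forward, the graph is acyclic and a single left-to-right pass over the positions is enough; the loops examine each pair of positions (i, j) once.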
  8. 8. TRIS in Essence. TRIS phases, example. [Figure: Dictionary Transformations Building; Information for the Identifier callableint]
  9. 9. TRIS in Essence. [Figures: Arborescence of Transformations for the Dictionary D; Auxiliary Graph for the Identifier callableint]
  10. 10. Case Study - Research Question. Quality focus: accuracy of TRIS with respect to oracles, compared with state-of-the-art approaches: 1) Camel Case Splitter, 2) Samurai, 3) TIDIER, and 4) GenTest. Research question: What is the accuracy of the TRIS identifier splitting and expansion approach compared with alternative state-of-the-art approaches?
  11. 11. Case Study - Analyzed Systems. JHotDraw (Java): 16 KLOC, 155 files, 2,348 identifiers (longer than 2 chars), 957 manually segmented identifiers. Lynx (C): 174 KLOC, 247 files, 12,194 identifiers (longer than 2 chars), 3,085 manually segmented identifiers. Lawrie et al. data set: 186 programs, 26 MLOC of C, 15 MLOC of C++, 7 MLOC of Java. 489 C/C++ identifiers sampled from 430 GNU projects.
  12. 12. Case Study - Results. Precision, recall, and F-measure of TRIS, Camel Case, Samurai, and TIDIER on JHotDraw.
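For readers unfamiliar with the metrics, here is a minimal sketch of precision, recall, and F-measure for one identifier. It assumes a set-based comparison between the terms an approach produces and the terms in the oracle; the slides do not spell out the exact definition, and the function name `precision_recall_f` is hypothetical.

```python
def precision_recall_f(predicted, oracle):
    """Set-based precision, recall, and F-measure for one identifier."""
    pred, gold = set(predicted), set(oracle)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# One of three predicted terms matches one of two oracle terms.
p, r, f = precision_recall_f(["call", "able", "integer"], ["callable", "integer"])
print(round(p, 4), round(r, 4), round(f, 4))  # 0.3333 0.5 0.4
```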
  13. 13. Case Study - Results. Precision, recall, and F-measure of TRIS, Camel Case, Samurai, and TIDIER on Lynx.
  14. 14. Case Study - Results. Comparison among approaches: results of the Wilcoxon paired test and Cliff’s delta effect size.
      On JHotDraw:
      Approach 1 | Approach 2 | Adj. p-value | Cliff’s delta
      TRIS | Camel Case | 0.431 | 0.041
      TRIS | Samurai | 0.894 | 0.001
      TRIS | TIDIER | 0.024 | 0.043
      On Lynx:
      Approach 1 | Approach 2 | Adj. p-value | Cliff’s delta
      TRIS | Camel Case | <0.001 | 0.743
      TRIS | Samurai | <0.001 | 0.684
      TRIS | TIDIER | <0.001 | 0.204
      Cliff’s delta interpretation: small for 0.148 <= d < 0.33; medium for 0.33 <= d < 0.474; large for d >= 0.474.
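Cliff's delta and the interpretation thresholds can be sketched as follows. The function names are hypothetical, and the "negligible" label for values below 0.148 is the usual convention rather than something stated on the slide.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (|xs| * |ys|)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def interpret(d):
    """Map |d| to the effect-size labels listed on the slide."""
    d = abs(d)
    if d >= 0.474:
        return "large"
    if d >= 0.33:
        return "medium"
    if d >= 0.148:
        return "small"
    return "negligible"  # below the slide's smallest threshold

for d in (0.743, 0.684, 0.204, 0.041):
    print(d, interpret(d))
```

Applied to the values reported above, the Lynx comparisons with Camel Case and Samurai fall in the large band and the comparison with TIDIER in the small band, while the JHotDraw values fall below 0.148.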
  15. 15. Case Study - Results. Precision, recall, and F-measure of TRIS and TIDIER on the 489 C sampled identifiers:
      Metric | Approach | 1Q | Median | Mean | 3Q | σ
      Precision | TIDIER | 0.4000 | 0.6667 | 0.6368 | 1.0000 | 0.3681
      Precision | TRIS | 1.0000 | 1.0000 | 0.8933 | 1.0000 | 0.2471
      Recall | TIDIER | 0.5000 | 0.6667 | 0.6496 | 1.0000 | 0.3654
      Recall | TRIS | 1.0000 | 1.0000 | 0.8720 | 1.0000 | 0.2606
      F-measure | TIDIER | 0.4000 | 0.6667 | 0.6409 | 1.0000 | 0.3650
      F-measure | TRIS | 1.0000 | 1.0000 | 0.8790 | 1.0000 | 0.2524
  16. 16. Case Study - Results. Precision, recall, and F-measure of TRIS on the data set from Lawrie et al.:
      Metric | Approach | 1Q | Median | Mean | 3Q | σ
      Precision | TRIS | 1.0000 | 1.0000 | 0.9763 | 1.0000 | 0.1184
      Recall | TRIS | 1.0000 | 1.0000 | 0.9439 | 1.0000 | 0.1565
      F-measure | TRIS | 1.0000 | 1.0000 | 0.9559 | 1.0000 | 0.1358
      Correctness of the splitting provided using the data set from Lawrie et al.:
      Approach | Identifier Splitting Correctness
      Samurai | 70%
      GenTest | 82%
      TRIS | 86%
  17. 17. Conclusion. TRIS maps the identifier splitting/expansion problem to a graph optimization problem: finding the optimal path in an acyclic weighted graph built for the identifier. TRIS was applied to Java, C, and C++ samples and compared to Camel Case, Samurai, TIDIER, and GenTest. Results show that TRIS outperforms other approaches with medium to large effect size. TRIS is also efficient in terms of computation time: quadratic complexity in the length of the identifier.
  18. 18. Finally… Questions? Thank you for your attention.
  19. 19. References
      [LB11] D. Lawrie and D. Binkley, “Expanding identifiers to normalize source code vocabulary,” International Conference on Software Maintenance, 2011, pp. 113–122.
      [LBM10] D. Lawrie, D. Binkley, and C. Morrell, “Normalizing source code vocabulary,” Working Conference on Reverse Engineering, 2010, pp. 112–122.
      [GMA11] L. Guerrouj, M. Di Penta, G. Antoniol, and Y.-G. Guéhéneuc, “TIDIER: An identifier splitting approach using speech recognition techniques,” Journal of Software Maintenance and Evolution: Research and Practice, 2011.
      [MGD10] N. Madani, L. Guerrouj, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, “Recognizing words from source code identifiers using speech recognition techniques,” 14th European Conference on Software Maintenance and Reengineering, March 2010.
      [ELK09] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, “Mining source code to automatically split identifiers for software analysis,” International Workshop on Mining Software Repositories, 2009, pp. 71–80.
  20. 20. References
      [MPF08] A. Marcus, D. Poshyvanyk, and R. Ferenc, “Using the conceptual cohesion of classes for fault prediction in object-oriented systems,” IEEE Transactions on Software Engineering, 34(2):287–300, 2008.
      [LMFB07] D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “Effective identifier names for comprehension and memory,” Innovations in Systems and Software Engineering, 3(4):303–318, 2007.
      [LFB06] D. Lawrie, H. Feild, and D. Binkley, “Syntactic identifier conciseness and consistency,” 6th International Workshop on Source Code Analysis and Manipulation, pp. 139–148, Sept. 27-29, 2006.
      [LMFB06] D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “What’s in a name? A study of identifiers,” 14th International Conference on Program Comprehension, pp. 3–12, Athens, Greece, 2006.
      [JH06] Z. M. Jiang and A. E. Hassan, “Examining the evolution of code comments in PostgreSQL,” 2006 International Workshop on Mining Software Repositories, pp. 179–180, 2006.
  21. 21. References
      [DP05] F. Deissenbock and M. Pizka, “Concise and consistent naming,” Proceedings of the International Workshop on Program Comprehension, May 2005.
      [Ian00] I. Sommerville, Software Engineering, sixth edition, Addison-Wesley, 2000.
      [TGM96] A. Takang, P. A. Grubb, and R. D. Macredie, “The effects of comments and identifier names on program comprehensibility: an experiential study,” Journal of Programming Languages, 4(3):143–167, 1996.
      [HNey84] H. Ney, “The use of a one-stage dynamic programming algorithm for connected word recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, 32(2):263–271, Apr. 1984.