2010 CRC PhD Student Conference

Analysing semantic networks of identifier names to improve source code maintainability and quality

Simon Butler
sjb792@student.open.ac.uk

Supervisors: Michel Wermelinger, Yijun Yu & Helen Sharp
Department/Institute: Centre for Research in Computing
Status: Part-time
Probation viva: After
Starting date: October 2008

Source code is the written expression of a software design consisting of identifier names – natural language phrases that represent concepts being manipulated by the program – embedded in a framework of keywords and operators provided by the programming language. Identifiers are crucial for program comprehension [9], a necessary activity in the development and maintenance of software. Despite their importance, there is little understanding of the relationship between identifier names and source code quality and maintainability. Neither is there automated support for identifier management or the selection of relevant natural language content for identifiers during software development.

We will extend current understanding of the relationship between identifier name quality and source code quality and maintainability by developing techniques to analyse identifiers for meaning, modelling the semantic relationships between identifiers and empirically validating the models against measures of maintainability and software quality. We will also apply the analysis and modelling techniques in a tool to support the selection and management of identifier names during software development, and concept identification and location for program comprehension.

The consistent use of clear identifier names is known to aid program comprehension [4, 7, 8].
However, despite the advice given in programming conventions and the popular programming literature on the use of meaningful identifier names in source code, the reality is that identifier names are not always meaningful, may be selected in an ad hoc manner, and do not always follow conventions [5, 1, 2].

Researchers in the reverse engineering community have constructed models to support program comprehension. The models range in complexity from textual search systems [11], to RDF-OWL ontologies created either solely from source code and identifier names [8], or with the inclusion of supporting documentation and source code comments [13]. The ontologies typically focus on
class and method names, and are used for concept identification and location based on the lexical similarity of identifier names. The approach, however, does not directly address the quality of the identifier names used.

The development of detailed identifier name analysis has focused on method names because their visibility and reuse in APIs implies a greater need for them to contain clear information about their purpose [10]. Caprile and Tonella [3] derived both a grammar and vocabulary for C function identifiers, sufficient for the implementation of automated name refactoring. Høst and Østvold [5] have since analysed Java method names looking for a common vocabulary that could form the basis of a naming scheme for Java methods. Their analysis of the method names used in multiple Java projects found common grammatical forms; however, there were enough degenerate forms that they were unable to derive a grammar for Java method names.

The consequences of identifier naming problems have been considered to be largely confined to the domain of program comprehension. However, Deißenböck and Pizka observed an improvement in maintainability when their rules of concise and consistent naming were applied to a project [4], and our recent work found statistical associations between identifier name quality and source code quality [1, 2]. Our studies, however, only looked at the construction of the identifier names in isolation, and not at the relationships between the meanings of the natural language content of the identifiers. We hypothesise that a relationship exists between the quality of identifier names, in terms of their natural language content and semantic relationships, and the quality of source code, which can be understood in terms of the functionality, reliability, and usability of the resulting software, and its maintainability [6].
Accordingly, we seek to answer the following research question:

    How are the semantic relationships between identifier names, inferred from their natural language content and programming language structure, related to source code maintainability and quality?

We will construct models of source code as semantic networks predicated on both the semantic content of identifier names and the relationships between identifier names inferred from the programming language structure. For example, the simple class Car in Figure 1 may be represented by the semantic network in Figure 2. Such models can be applied to support empirical investigations of the relationship between identifier name quality and source code quality and maintainability. The models may also be used in tools to support the management and selection of identifier names during software development, and to aid concept identification and location during source code maintenance.

    public class Car extends Vehicle {
        Engine engine;
    }

Figure 1: The class Car

We will analyse identifier names mined from open source Java projects to create a catalogue of identifier structures to understand the mechanisms employed by developers to encode domain information in identifiers. We will build
on the existing analyses of C function and Java method identifier names [3, 5, 8], and anticipate the need to develop additional techniques to analyse identifiers, particularly variable identifier names.

Figure 2: A semantic network of the class Car (nodes: Car, Vehicle, Engine, engine; edge labels: extends, has a, has instance named)

Modelling of both the structural and semantic relationships between identifiers can be accomplished using Gellish [12], an extensible controlled natural language with dictionaries for natural languages – Gellish English being the variant for the English language. Unlike a conventional dictionary, a Gellish dictionary includes human- and machine-readable links between entries to define relationships between concepts – thus making Gellish a semantic network – and to show hierarchical linguistic relationships such as meronymy, an entity–component relationship. Gellish dictionaries also permit the creation of multiple conceptual links for individual entries to define polysemic senses.

The natural language relationships catalogued in Gellish can be applied to establish whether the structural relationship between two identifiers implied by the programming language is consistent with the conventional meaning of the natural language found in the identifier names. For example, a field is implicitly a component of the containing class, allowing the inference of a conceptual and linguistic relationship between class and field identifier names. Any inconsistency between the two relationships could indicate potential problems with either the design or with the natural language content of the identifier names.

We have assumed a model of source code development and comprehension predicated on the idea that it is advantageous for coherent and relevant semantic relationships to exist between identifier names based on their natural language content.
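
As a minimal sketch of the consistency check described above, the Java below stores a network for the class Car as labelled edges and tests each "has a" edge against a small hand-written part-of table. The table only stands in for dictionary lookups and is not the Gellish API. The names SemanticNetworkSketch, Relation, PARTS, carNetwork, isConsistent and inconsistencies are hypothetical, and the edge directions are our assumed reading of Figure 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SemanticNetworkSketch {

    /** A labelled, directed edge between two identifier names. */
    public static final class Relation {
        final String source;
        final String label;
        final String target;

        public Relation(String source, String label, String target) {
            this.source = source;
            this.label = label;
            this.target = target;
        }

        @Override
        public String toString() {
            return source + " --" + label + "--> " + target;
        }
    }

    /** Hypothetical part-of table (whole -> parts); a stand-in for dictionary lookups. */
    static final Map<String, Set<String>> PARTS = Map.of(
            "car", Set.of("engine", "wheel"));

    /** Edges for the class Car; the directions are an assumed reading of Figure 2. */
    public static List<Relation> carNetwork() {
        return List.of(
                new Relation("Car", "extends", "Vehicle"),
                new Relation("Car", "has a", "Engine"),
                new Relation("Engine", "has instance named", "engine"));
    }

    /** A "has a" edge is consistent if the table lists the target as a part of the source. */
    public static boolean isConsistent(Relation r) {
        if (!r.label.equals("has a")) {
            return true; // this sketch only checks composition edges
        }
        Set<String> parts = PARTS.getOrDefault(r.source.toLowerCase(), Set.of());
        return parts.contains(r.target.toLowerCase());
    }

    /** Collects the edges whose structural meaning the table does not support. */
    public static List<Relation> inconsistencies(List<Relation> network) {
        List<Relation> flagged = new ArrayList<>();
        for (Relation r : network) {
            if (!isConsistent(r)) {
                flagged.add(r);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Edges of the Car network that the part-of table does not support.
        System.out.println(inconsistencies(carNetwork()));
        // A field type that is not a known part of a car.
        System.out.println(isConsistent(new Relation("Car", "has a", "Banana")));
    }
}
```

In this sketch a flagged edge only suggests that the design or the wording of a name deserves a closer look; a fuller version would replace PARTS with lookups in a real dictionary of linguistic relationships.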
To assess the relevance of our model to real-world source code we will validate the underlying assumption empirically. We intend to mine both software repositories and defect reporting systems to identify source code implicated in defect reports and evaluate the source code in terms of the coherence and consistency of models of its identifiers. To assess maintainability we will investigate how source code implicated in defect reports develops in successive versions – e.g. is the code a continuing source of defects? – and monitor areas of source code modified between versions to determine how well our model predicts defect-prone and defect-free regions of source code.

We will apply the results of our research to develop a tool to support the selection and management of identifier names during software development, as well as modelling source code to support software maintenance. We will evaluate and validate the tool with software developers – both industry partners and FLOSS developers – to establish the value of identifier naming support. While intended for software developers, the visualisations of source code presented by
the tool will enable stakeholders (e.g. domain experts) who are not literate in programming or modelling languages (like Java and UML) to examine, and give feedback on, the representation of domain concepts in source code.

References

[1] S. Butler, M. Wermelinger, Y. Yu, and H. Sharp. Relating identifier naming flaws and code quality: an empirical study. In Proc. of the Working Conf. on Reverse Engineering, pages 31–35. IEEE Computer Society, 2009.
[2] S. Butler, M. Wermelinger, Y. Yu, and H. Sharp. Exploring the influence of identifier names on code quality: an empirical study. In Proc. of the 14th European Conf. on Software Maintenance and Reengineering, pages 159–168. IEEE Computer Society, 2010.
[3] B. Caprile and P. Tonella. Restructuring program identifier names. In Proc. Int’l Conf. on Software Maintenance, pages 97–107. IEEE, 2000.
[4] F. Deißenböck and M. Pizka. Concise and consistent naming. Software Quality Journal, 14(3):261–282, Sep 2006.
[5] E. W. Høst and B. M. Østvold. The Java programmer’s phrase book. In Software Language Engineering, volume 5452 of LNCS, pages 322–341. Springer, 2008.
[6] International Standards Organisation. ISO/IEC 9126-1: Software engineering – product quality, 2001.
[7] D. Lawrie, H. Feild, and D. Binkley. An empirical study of rules for well-formed identifiers. Journal of Software Maintenance and Evolution: Research and Practice, 19(4):205–229, 2007.
[8] D. Raţiu. Intentional Meaning of Programs. PhD thesis, Technische Universität München, 2009.
[9] V. Rajlich and N. Wilde. The role of concepts in program comprehension. In Proc. 10th Int’l Workshop on Program Comprehension, pages 271–278. IEEE, 2002.
[10] M. Robillard. What makes APIs hard to learn? Answers from developers. IEEE Software, 26(6):27–34, Nov.-Dec. 2009.
[11] G. Sridhara, E. Hill, L. Pollock, and K. Vijay-Shanker. Identifying word relations in software: a comparative study of semantic similarity tools. In Proc. Int’l Conf. on Program Comprehension, pages 123–132. IEEE, June 2008.
[12] A. S. H. P. van Renssen. Gellish: a generic extensible ontological language. Delft University Press, 2005.
[13] R. Witte, Y. Zhang, and J. Rilling. Empowering software maintainers with semantic web technologies. In European Semantic Web Conf., pages 37–52, 2007.
