On the Use of Domain Terms in Source Code


  1. On the Use of Domain Terms in Source Code
     Sonia Haiduc, Andrian Marcus
     ICPC 2008, Amsterdam, The Netherlands
  2. Importance of Domain Terms in Program Comprehension
     • Requirements documents, conversations between stakeholders, etc. -> expressed using domain terms
     • Source code: a representation of the domain -> domain concepts appear in the source code
     • Using domain terms in source code is essential for reuse -> conveys the intent of the software
     • Concept/concern location, traceability, etc.
     • XP -> system metaphor -> using the same words to describe the same concepts is desirable
  3. What is the Code about?
         void setAccount(…)
         {
           int x;
           int y;
           String z;
           …
         }

         void getCarPart()
         {
           int x;
           float y;
           String z;
           …
         }
  4. Lexical Agreement
     • Furnas: 20% probability of two people choosing the same word to express the same concept
     • Summarization: 20% lexical agreement between summaries of the same document
     -> catastrophe for comprehending source code
  5. Research Questions
     RQ1. To what degree are domain terms found in the source code of software from a particular problem domain?
     RQ2. Which is the preponderant source of domain terms: identifiers or comments?
     RQ3. To what level do programmers agree in choosing domain terms across systems from the same problem domain?
  6. Case Study
     • 6 graph theory libraries (2 C++, 4 Java)
     • 135 domain concepts, 193 domain terms (http://www.cs.wayne.edu/~severe/icpc2008)
  7. Lexica
     • Domain vocabulary
     • Software vocabulary
     • Software domain vocabulary
     • Filtering and stemming
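The slide only names the lexica, so the following is a minimal sketch of one way such vocabularies could be built, not the authors' procedure. It assumes that the software domain vocabulary is the intersection of the domain and software vocabularies, that identifiers are split on case and underscores, and that multi-word terms are compared word by word. The stop-word list, `naive_stem`, and all sample terms are hypothetical stand-ins; a real study would use a proper stemmer.

```python
import re

# Hypothetical filter list: programming keywords and common verbs to discard.
STOP_WORDS = {"get", "set", "the", "a", "of", "to", "in", "is", "int", "void"}


def split_identifier(name):
    """Split camelCase / snake_case text into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def naive_stem(word):
    """Stand-in for a real stemmer: only strips a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_vocabulary(tokens):
    """Split, filter, and stem raw tokens into a set of words."""
    words = set()
    for token in tokens:
        for word in split_identifier(token):
            if word not in STOP_WORDS and not word.isdigit():
                words.add(naive_stem(word))
    return words


# Hypothetical inputs: a few graph theory terms and a few code tokens.
domain_vocabulary = build_vocabulary(
    ["vertex", "edge", "shortest path", "spanning tree"]
)
software_vocabulary = build_vocabulary(
    ["addEdge", "getVertices", "shortest_path", "buffer"]
)

# Assumed definition: domain words that also occur in the software.
software_domain_vocabulary = domain_vocabulary & software_vocabulary
```

In this toy input `getVertices` does not match `vertex`, because the plural-stripping stand-in yields `vertice`. That kind of miss is why the stemming step matters for the counts reported on the following slides.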
  8. RQ1. Degree of Domain Terms in Source Code
     • On average, 42% of the domain terms appear in the software domain vocabulary of one library
     • 77% of the domain terms were used in at least one of the six libraries
     • The size of the software domain vocabularies is correlated with the size of the software systems and software vocabularies
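The two percentages on this slide can be read as a per-library coverage averaged over libraries, and a coverage of the union of all libraries. A small sketch under that reading follows; the library names and term sets are hypothetical and far smaller than the 193 terms and six libraries of the case study.

```python
def coverage(domain_terms, library_terms):
    """Fraction of the domain terms that occur in a library's vocabulary."""
    return len(domain_terms & library_terms) / len(domain_terms)


# Hypothetical data for illustration only.
domain_terms = {"vertex", "edge", "path", "cycle"}
libraries = {
    "lib_a": {"vertex", "edge", "buffer"},
    "lib_b": {"edge", "path", "iterator"},
}

# Coverage of each library, then the average (the slide's "on average" figure).
per_library = {
    name: coverage(domain_terms, vocab) for name, vocab in libraries.items()
}
average_coverage = sum(per_library.values()) / len(per_library)

# Coverage of all libraries taken together ("used in at least one library").
used_anywhere = coverage(domain_terms, set().union(*libraries.values()))
```

The union figure can never be lower than the average, which matches the relation between the 42% and 77% reported above.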
  9. RQ2. Domain Terms in Identifiers and Comments
     • On average, 90% of the domain terms found in a software library are found in comments, whereas only 78% are found in identifiers
       -> comments are a richer source of domain terms and should not be ignored
     • 23% of domain terms found in a software library are found only in comments, whereas only 11% are found only in identifiers
       -> comments complete the domain information when it is missing from identifiers
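The four figures on this slide are all shares of the domain terms found in a library, split by where each term occurs. A sketch of that breakdown is given here, assuming the identifier and comment vocabularies have already been extracted as sets; the function name and the sample data are hypothetical.

```python
def source_breakdown(domain_terms, identifier_terms, comment_terms):
    """Shares of the found domain terms by where they occur in a library."""
    found = domain_terms & (identifier_terms | comment_terms)
    if not found:
        return {}
    total = len(found)
    return {
        "in_comments": len(found & comment_terms) / total,
        "in_identifiers": len(found & identifier_terms) / total,
        "only_in_comments": len(found - identifier_terms) / total,
        "only_in_identifiers": len(found - comment_terms) / total,
    }


# Hypothetical vocabularies for one library.
breakdown = source_breakdown(
    domain_terms={"vertex", "edge", "path", "cycle", "tree"},
    identifier_terms={"vertex", "edge", "node"},
    comment_terms={"edge", "path", "cycle", "todo"},
)
```

Because a term can occur in both places, the first two shares may sum to more than 1, as the 90% and 78% on the slide do.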
  10. RQ3. Lexical Agreement
     • Software libraries: partial summaries of a problem domain
     • Pair-wise lexical agreement measure from summarization
     • Agreement of 63% between pairs of libraries (compared to 24% in document summaries)
     • 18 domain terms used in all libraries
     • Up to 98% of domain terms reused between pairs of libraries
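The slide does not define the pair-wise agreement measure, so the sketch below substitutes two common set-overlap measures as assumptions: the Jaccard index for agreement, and the share of the smaller vocabulary that reappears in the other for reuse. Neither is claimed to be the measure used in the study, and the library vocabularies are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Assumed agreement measure: shared terms over all terms of the pair."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reuse(a, b):
    """Assumed reuse measure: share of the smaller vocabulary found in the other."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical software domain vocabularies of three libraries.
vocabularies = {
    "lib_a": {"vertex", "edge", "path"},
    "lib_b": {"vertex", "edge", "cycle"},
    "lib_c": {"vertex", "edge"},
}

# Agreement for every unordered pair of libraries, then the mean over pairs.
pairwise = {
    (x, y): jaccard(vocabularies[x], vocabularies[y])
    for x, y in combinations(sorted(vocabularies), 2)
}
mean_agreement = sum(pairwise.values()) / len(pairwise)

# Terms used in all libraries (the slide reports 18 such terms).
shared_by_all = set.intersection(*vocabularies.values())
```

With six libraries there are 15 unordered pairs, so a mean over pairs and a maximum over pairs correspond to the "between pairs" and "up to" phrasing on the slide.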
  11. Threats to Validity
     • Only one domain considered
     • List of domain terms and concepts manually picked
     • Verification in source code of term meaning
     • Considered the software libraries as partial summaries of a domain
  12. Future Work
     • More case studies, different domains
     • Verification of the meaning of terms in source code: consider word relationships to determine the meaning of words automatically
     • Analyze other software artifacts (requirement documents, user manuals, bug reports, etc.)
  13. Conclusion
     • 42% of domain terms are found in the source code
       -> domain ontologies constructed from the source code will be far from complete
     • Comments are a richer source of domain terms than identifiers and contain extra domain terms
       -> comments should not be ignored by tools or by programmers
     • High lexical agreement between programmers
       -> developers familiar with the domain will have an easier time understanding source code in the same domain, even when written by others
       -> domain-specific tools
