Supporting Program
Comprehension with Source
   Code Summarization
     Sonia Haiduc*, Jairo Aponte**, Andrian Marcus*

                    ICSE NIER 2010



Developers read source code

• Before performing maintenance on a
  system, developers need to understand
  its source code

• During comprehension, programmers
  search and browse the code
Skimming vs. reading code
• Skimming (Starke’09): quickly reading the names of
  software artifacts
  + Fast
  – Insufficient information
  – Shallow understanding

• Reading in depth
   – Slow
   – Too much information
   + Deeper understanding
Code summaries

• Automatically generated, short, yet accurate
  descriptions of source code entities

• They give more information than just the
  header or the name of an artifact

• Significantly shorter and faster to read than
  the source code they summarize
What should we summarize?
• Code
   –   Packages
   –   Classes
   –   Methods
   –   Method sequences
   –   Etc.

• Other artifacts
   – Bug reports (ICSE 2010 - S. Rastkar, G. Murphy, G. Murray)
   – E-mails
   – Etc.
What should we include
         in code summaries?

• Semantic information
  – What does the source code do?
  – Identifiers and comments that capture the main concepts


• Structural information
  – How does the code work?
  – Class relationships, callers and callees, members of a
    class, etc.
Description: VFS virtual file system read write
              mkdir directory path save      +
Internal classes: DirectoryEntry             +
Methods: listDirectory, mkdir, constructPath +
Fields: WRITE_CAP, READ_CAP, lock            +
Sub-classes: FileVFS, FavoritesVFS           +
Other: ...
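
A summary like the one above pairs a term-based description with structural facts about the class. As a rough sketch of the structural half only, the snippet below uses Python's standard `ast` module on a small, made-up Python stand-in for the VFS class. The slides do not show the tooling or the source language involved, so `structural_summary` and the `EXAMPLE` source are hypothetical names chosen for illustration.

```python
import ast


def structural_summary(source: str, class_name: str) -> dict:
    """Collect structural facts about one class via static analysis (no execution)."""
    tree = ast.parse(source)
    summary = {"Internal classes": [], "Methods": [], "Fields": [], "Sub-classes": []}
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        if node.name == class_name:
            # Members declared directly in the class body.
            for item in node.body:
                if isinstance(item, ast.ClassDef):
                    summary["Internal classes"].append(item.name)
                elif isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    summary["Methods"].append(item.name)
                elif isinstance(item, ast.Assign):
                    summary["Fields"].extend(
                        t.id for t in item.targets if isinstance(t, ast.Name)
                    )
        elif any(isinstance(b, ast.Name) and b.id == class_name for b in node.bases):
            # Any other class that lists class_name as a direct base.
            summary["Sub-classes"].append(node.name)
    return summary


# Hypothetical stand-in source, loosely modeled on the example summary.
EXAMPLE = '''
class VFS:
    READ_CAP = 1
    WRITE_CAP = 2

    class DirectoryEntry:
        pass

    def listDirectory(self, path):
        pass

    def mkdir(self, path):
        pass


class FileVFS(VFS):
    pass
'''

summary = structural_summary(EXAMPLE, "VFS")
for label, names in summary.items():
    print(f"{label}: {', '.join(names)}")
```

Only direct members and direct subclasses are collected here; callers, callees, and deeper inheritance would need a fuller analysis.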
How should we generate
        code summaries?

• Semantic information: automatic text
  summarization
  – Machine Learning
  – Discourse-based approaches
  – Term-based Text Retrieval techniques


• Structural information: static analysis
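
To make the term-based option concrete, here is a minimal sketch that ranks the identifier terms of one method by tf-idf against a small corpus and keeps the top few. This is one common Text Retrieval weighting, used here as an assumption; the slides do not say which technique or weighting was applied. The names `split_terms`, `top_terms`, and the `METHODS` corpus are hypothetical.

```python
import math
import re
from collections import Counter


def split_terms(code: str) -> list[str]:
    """Extract identifier words, splitting camelCase and snake_case, lowercased."""
    terms = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code):
        # Acronym runs (e.g. "VFS") or capitalized/lowercase words (e.g. "Path").
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", ident):
            terms.append(part.lower())
    return terms


def top_terms(documents: dict[str, str], name: str, k: int = 5) -> list[str]:
    """Rank the terms of one document by tf-idf against the whole corpus."""
    term_lists = {doc: split_terms(text) for doc, text in documents.items()}
    doc_freq = Counter()
    for terms in term_lists.values():
        doc_freq.update(set(terms))
    tf = Counter(term_lists[name])
    n_docs = len(documents)
    scores = {t: count * math.log(n_docs / doc_freq[t]) for t, count in tf.items()}
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: each entry is the text of one method.
METHODS = {
    "mkdir": "void mkdir(String path) { File directory = new File(path); directory.mkdirs(); }",
    "save": "void save(Buffer buffer, String path) { buffer.write(path); }",
    "load": "void load(Buffer buffer, String path) { buffer.read(path); }",
}

print(top_terms(METHODS, "mkdir"))
```

Terms that occur in every method (such as `void` or `path` in this toy corpus) get a weight of zero, which is why language keywords tend to drop out without an explicit stop list.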
How can we evaluate code
          summaries?

• How good are the automatic summaries
  when compared to manual ones?

• How useful are the automatic code
  summaries for SE tasks?
Preliminary evaluation

• Compared automatic code summaries
  with developer code summaries

• 6 developers, 12 methods in aTunes

• Used only lexical information – 5 most
  relevant terms
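
The slides do not state how the automatic and developer summaries were compared. Purely as an illustration, one simple possibility is to measure how many of the automatic summary's terms also appear in a developer's summary. The function `term_overlap` and the two term lists below are hypothetical and are not data from the study.

```python
def term_overlap(automatic: list[str], manual: list[str]) -> float:
    """Fraction of the automatic summary's terms that also occur in the manual one."""
    auto_set = {t.lower() for t in automatic}
    manual_set = {t.lower() for t in manual}
    if not auto_set:
        return 0.0
    return len(auto_set & manual_set) / len(auto_set)


# Made-up example: a 5-term automatic summary vs. a developer's terms.
automatic = ["directory", "file", "mkdir", "path", "create"]
manual = ["create", "directory", "path", "vfs", "session"]

score = term_overlap(automatic, manual)
print(f"overlap: {score:.2f}")
```

A set-based overlap ignores term order and synonyms, so it would only be a coarse first signal next to human judgment.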
Results
• Automatic source code summaries are good at
  reflecting developers’ summaries

• Text Retrieval techniques work as well on
  source code as on natural language in reflecting
  human summaries

• Developers make use of structural information in
  their code summaries:
  – Method name terms
  – Class name terms
  – Formal parameter type terms
What are we doing now?

• What type and how much structural
  information should be included in code
  summaries?
• How do developers generate summaries?
• Are different summaries needed for
  different tasks?
• How useful are the code summaries for
  SE tasks?, etc.
In summary…
• Automatic code summaries:
  –   Short yet accurate descriptions of source code
  –   Can reduce the effort of program comprehension
  –   Embed both semantic and structural information
  –   Can be generated for a variety of software entities

• Visit my poster
  (HINT: look for the huge and colorful one)
• www.cs.wayne.edu/~severe and
  www.cs.wayne.edu/~shaiduc
• sonja@wayne.edu
