2. Contents
Why Code Summarization?
Thesis Statement
Research Questions about summary
Research Questions about tool
Automatic Code Summarization
Evaluation
Experiments Conducted
Pyramid Method
Important Findings
My Observations & Future Work
3. Why Code Summarization?
Program comprehension accounts for 50% of all maintenance work
Two extreme approaches – skimming the code and reading it thoroughly
Skimming – leads to misunderstanding
Reading thoroughly – time consuming
An intermediate solution – source code entities accompanied by comprehensive textual descriptions
4. Thesis Statement
New idea: code summarization to help in program comprehension (PC)
Applying text retrieval (TR) methods such as Latent Semantic Indexing (LSI) to source code summarization
Combining structural information with the retrieved code summary to make it effective for realistic purposes
5. Research Questions of Code Summarization
Summaries should be generated automatically
Generate summaries at different granularity levels – class, method, package, etc.
Summaries should be shorter than the source code
Capture and preserve code semantics and structure – both the text and the structure of the code
Consistent structure – important items first
6. Research Questions of Code Summarization
The summary should reflect the developer’s understanding of the code
The tool should let the user change a summary and should remember the user’s choices in future summaries
The tool should rebuild the summary if the code changes or developers provide feedback
7. Research Questions about Summarizer Tool
Which summarization technique works best for source code?
What type of structural info is necessary in a summary?
Will the summary differ for different types of maintenance tasks?
How long should the summary be?
How closely will it resemble an actual summary?
How do developers generate summaries?
9. Automatic Code Summarization
Two types of info are extracted – lexical and structural
Lexical info – identifiers and comments are extracted
Common English words and programming language (PL) keywords are removed
Identifiers are split into their constituent words and stemming is performed
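The lexical steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual pipeline: the two word lists are tiny hypothetical samples, and `naive_stem` is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny word lists; a real tool would use fuller ones.
ENGLISH_STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in", "for"}
JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return", "new", "class"}


def split_identifier(identifier):
    """Split camelCase, PascalCase and snake_case text into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Acronym runs, capitalized or lowercase words, leftover capitals, digits.
        words += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
    return [w.lower() for w in words]


def naive_stem(word):
    """Crude suffix stripping (assumption: stands in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lexical_terms(identifiers_and_comments):
    """Split, filter stop words and keywords, then stem what is left."""
    terms = []
    for text in identifiers_and_comments:
        for word in split_identifier(text):
            if word in ENGLISH_STOP_WORDS or word in JAVA_KEYWORDS:
                continue
            terms.append(naive_stem(word))
    return terms
```

For example, `lexical_terms(["getPlayLists"])` yields the words `get`, `play` and `list`.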
10. Automatic Code Summarization
The extracted lexical info forms the text corpus of the code, to which TR methods (e.g. LSI) are applied to get the n most important words
Once retrieved, the n words are combined with structural info such as class name, method name, package name, parameter names and types, etc.
How to apply structural info to the automatically generated summary is an important part
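One possible way to attach structural info to the n retrieved words is sketched below. The `MethodContext` type, its field names, and the output layout are all assumptions made for illustration; the slides do not say how the combination is done.

```python
from dataclasses import dataclass, field


@dataclass
class MethodContext:
    # Structural facts about one method; all field names are hypothetical.
    package: str
    class_name: str
    method_name: str
    parameters: list = field(default_factory=list)  # (type, name) pairs


def render_summary(context, top_terms):
    """Put the method's structural signature above the n retrieved terms."""
    params = ", ".join(f"{ptype} {pname}" for ptype, pname in context.parameters)
    lines = [
        f"{context.package}.{context.class_name}.{context.method_name}({params})",
        "terms: " + ", ".join(top_terms),
    ]
    return "\n".join(lines)
```

Keeping the signature on its own line leaves the retrieved terms untouched, so the same terms can be re-rendered if a different layout is wanted.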
11. Automatic Code Summarization
A method name reflects the description of what the method does
If the method name is ignored by TR, the tool can introduce it automatically
Additional info, such as user tags, can be added
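A small sketch of how a tool might reintroduce method-name words that TR left out, and append user tags. The function name, the ordering (name words first, tags last), and the duplicate handling are assumptions, not something the slides specify.

```python
import re


def ensure_method_name_terms(method_name, retrieved_terms, user_tags=()):
    """Return method-name words first, then TR terms, then user tags."""
    name_words = [
        w.lower()
        for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", method_name)
    ]
    summary = []
    for term in [*name_words, *retrieved_terms, *user_tags]:
        if term not in summary:  # keep only the first occurrence of each term
            summary.append(term)
    return summary
```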
12. Evaluation
Two types – intrinsic and extrinsic
Intrinsic – content evaluation: how closely the summary depicts the document, or how close it is to a manually generated summary
Metrics – precision, recall, pyramid method
Extrinsic – how much utility and usability the summary has in supporting SE tasks such as concept location, impact analysis, software reuse, traceability link recovery, etc.
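For term-based summaries, precision and recall can be computed by comparing the automatic terms with a manually produced reference set. The sketch below assumes both summaries are plain lists of terms; the function name is illustrative.

```python
def precision_recall(automatic_terms, reference_terms):
    """Precision and recall of an automatic term summary against a reference."""
    automatic, reference = set(automatic_terms), set(reference_terms)
    overlap = len(automatic & reference)
    precision = overlap / len(automatic) if automatic else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall
```

With 5 automatic terms, 4 reference terms, and 2 terms in common, precision is 2/5 and recall is 2/4.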
13. Experiments Conducted
Pyramid method
ATunes open source (OS) project, 12 methods
6 developers from different demographic locations, undergraduate students, with 3 years of Java programming experience
Developers were given a list of terms and had to choose the 5 terms that best suit each method; 60 minutes total time
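One common formulation of a pyramid score for term summaries is sketched here, assuming each term is weighted by the number of developers who chose it and the automatic summary is compared with the best achievable summary of the same size. The exact scoring used in the study may differ.

```python
from collections import Counter


def pyramid_score(automatic_terms, developer_choices):
    """Weight achieved by the automatic terms, relative to the best possible."""
    # A term's weight is the number of developers who selected it.
    weights = Counter(term for choice in developer_choices for term in set(choice))
    chosen = set(automatic_terms)
    achieved = sum(weights[term] for term in chosen)
    # Best case: a summary of the same size made of the heaviest terms.
    best = sum(sorted(weights.values(), reverse=True)[: len(chosen)])
    return achieved / best if best else 0.0
```

A score of 1.0 means the automatic summary contains only the terms the developers agreed on most; a score near 0 means it shares almost nothing with their choices.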
14. Experiments Conducted
Corpus containing the whole code vocabulary
Each method is a separate document
LSI indexes the corpus against each method’s terms
Cosine similarity is measured between the corpus words and the method, and the corpus words are ranked accordingly
The top 5 words from the corpus are chosen
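The ranking step can be sketched with a truncated SVD, which is the core of LSI. This is a minimal illustration assuming NumPy, a raw term-count matrix (rows are terms, columns are methods), and term and document vectors both scaled by the singular values; the study's weighting scheme and number of dimensions are not given in the slides.

```python
import numpy as np


def top_terms_lsi(term_doc, vocabulary, doc_index, k=2, n=5):
    """Rank vocabulary terms by cosine similarity to one method in LSI space."""
    matrix = np.asarray(term_doc, dtype=float)
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))                      # keep at most k latent dimensions
    term_vecs = u[:, :k] * s[:k]            # one row per term
    doc_vec = vt[:k, doc_index] * s[:k]     # the chosen method (document)
    norms = np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(doc_vec)
    cosines = np.divide(
        term_vecs @ doc_vec, norms,
        out=np.zeros(len(vocabulary)), where=norms > 0,
    )
    ranked = np.argsort(-cosines, kind="stable")[:n]
    return [vocabulary[i] for i in ranked]
```

Calling `top_terms_lsi(matrix, vocabulary, doc_index=j, n=5)` once per method gives the 5-word summary for method j.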
17. Important Findings
Pyramid scores were between 0.1 and 0.5, which was marked as encouraging
Words chosen by developers – 98.7% in method names, 88.9% in class names and 84.6% in parameter names
Automatic summary terms – 20% in method names, 12.9% in class names and 30.7% in parameter names
Structural info should be properly taken into account in automatic summaries
Comment text is not included in the summary
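Percentages like those above can be obtained by checking how many summary terms also occur among the words of a structural name. The sketch below shows one plausible way to compute such a share; how the study actually counted the overlap is not stated, so treat the function as an assumption.

```python
def share_in_names(summary_terms, name_words):
    """Fraction of summary terms that also occur among the words of a name."""
    if not summary_terms:
        return 0.0
    names = set(name_words)
    return sum(term in names for term in summary_terms) / len(summary_terms)
```

For a 5-term summary of which 2 terms appear in the method name, the share is 2/5, i.e. 40%.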
18. My Observations & Future Work
The corpus construction technique is not well specified – nothing is said about protection against redundancy
LSI focuses on term frequency rather than structural info, which produces poor scores
When computing the cosine measure, the structural info of a term in the method could be taken into account to get better results
There should be some heuristic measure for structural info
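One hypothetical form such a heuristic could take is to multiply each term's cosine score by a boost that depends on where the term occurs in the method. The role names and boost values below are invented purely for illustration and would need to be tuned empirically.

```python
def rerank_with_structure(cosine_by_term, structural_roles, boosts=None):
    """Re-rank terms by cosine score multiplied by a per-role boost."""
    # Hypothetical boost values, not taken from any study.
    boosts = boosts or {"method_name": 2.0, "class_name": 1.5, "parameter_name": 1.5}

    def score(term):
        roles = structural_roles.get(term, ())
        # Use the strongest boost among the roles the term plays; 1.0 if none.
        factor = max((boosts.get(role, 1.0) for role in roles), default=1.0)
        return cosine_by_term[term] * factor

    return sorted(cosine_by_term, key=score, reverse=True)
```

Here a term with a modest cosine score that appears in the method name can outrank a term with a higher score that appears only in the method body.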