Mining Source Code Descriptions
from Developer Communications

Sebastiano Panichella
Jairo Aponte
Massimiliano Di Penta
Andrian Marcus
Gerardo Canfora

  1. Mining Source Code Descriptions from Developer Communications. Sebastiano Panichella, Jairo Aponte, Massimiliano Di Penta, Andrian Marcus, Gerardo Canfora
  2. Context: Software Project. [Diagram; labels: Documentation (sequence diagram, class diagram), Source Code, Developer, Program Comprehension, Maintenance Tasks]
  3. Context: Software Project. [Same diagram, with the added label "Difficult understanding"]
  4. Context: Software Project. [Same diagram, with the added label "describes"]
  5. Context: Software Project. Coming back to reality... [Diagram; labels: Source Code, Developer, Program Comprehension, Maintenance Tasks, "Difficult understanding"]
  6. Idea
     We argue that messages exchanged among contributors/developers are a useful source of information to help understand source code.
     In such situations, developers need to infer knowledge from:
     • the source code itself
     • source code descriptions in external artifacts
  7. Idea (example)
     Excerpt from a developer discussion:
       "When call the method IndexSplitter.split(File destDir, String[] segs) from the Lucene cotrib directory (contrib/misc/src/java/org/apache/lucene/index) it creates an index with segments descriptor file with wrong data. Namely wrong is the number representing the name of segment that would be created next in this index."
     CLASS: IndexSplitter; METHOD: split
  8. A Five-Step Approach for Mining Method Descriptions
  9. Step 1: Downloading emails/bug reports and tracing them onto classes
     Two heuristics:
     • the discussion contains a fully-qualified class name (e.g., org.apache.lucene.analysis.MappingCharFilter); or
     • the email contains a file name (e.g., MappingCharFilter.java)
     For bug reports, we complement the above heuristics by matching the bug ID of each closed bug to the commit notes, thereby tracing the bug report to the files changed in that commit.
     Example developer discussion:
       "When call the method IndexSplitter.split(File destDir, String[] segs) from the Lucene cotrib directory (contrib/misc/src/java/org/apache/lucene/index) it creates an index with segments descriptor file with wrong data. Namely wrong is the number representing the name of segment that would be created next in this index."
       public void split(File destDir, String[] segs) throws IOException { destDir.mkdirs(); FSDirectory destFSDir = FSDirectory.open(destDir); SegmentInfos destInfos = new SegmentInfos }
       "If some of the segments of the index already has this name this results either to impossibility to create new segment or in crating of an corrupted segment."
  10. Step 1 (continued): the example discussion above is traced to CLASS: IndexSplitter.
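The two Step 1 heuristics can be illustrated with a minimal sketch. It assumes the list of fully-qualified class names of the system is already available; the function name `trace_to_classes` and the regular expressions are hypothetical and are not taken from the slides. The bug-ID-to-commit matching for bug reports is not covered.

```python
import re


def trace_to_classes(discussion: str, known_classes: list[str]) -> set[str]:
    """Return the known classes that a discussion mentions (hypothetical helper)."""
    traced = set()
    for fqn in known_classes:
        simple_name = fqn.rsplit(".", 1)[-1]
        # Heuristic 1: the fully-qualified class name appears in the text.
        by_fqn = re.search(r"(?<![\w.])" + re.escape(fqn) + r"(?!\w)", discussion)
        # Heuristic 2: the file name (e.g., MappingCharFilter.java) appears.
        by_file = re.search(r"(?<!\w)" + re.escape(simple_name) + r"\.java\b", discussion)
        if by_fqn or by_file:
            traced.add(fqn)
    return traced
```

A discussion that mentions neither form of the name is left untraced, even if it refers to the class by its simple name alone.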
  11. Step 2: Extracting paragraphs (two heuristics)
     We consider as paragraphs the text sections separated by one or more blank lines.
     We prune source code fragments and/or stack traces out of the paragraph descriptions, using an approach inspired by the work of Bacchelli et al.
     Example developer discussion:
       PAR 1: "When call the method IndexSplitter.split(File destDir, String[] segs) from the Lucene cotrib directory (contrib/misc/src/java/org/apache/lucene/index) it creates an index with segments descriptor file with wrong data. Namely wrong is the number representing the name of segment that would be created next in this index."
       PAR 2: public void split(File destDir, String[] segs) throws IOException { destDir.mkdirs(); FSDirectory destFSDir = FSDirectory.open(destDir); SegmentInfos destInfos = new SegmentInfos }
       PAR 3: "If some of the segments of the index already has this name this results either to impossibility to create new segment or in crating of an corrupted segment."
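Step 2 can be sketched as follows. Splitting on blank lines follows the slide; the code and stack-trace detector is only a crude stand-in, since the slides do not describe the approach inspired by Bacchelli et al. All names and patterns here (`CODE_LINE`, `STACK_LINE`, `looks_like_code`) are assumptions made for illustration.

```python
import re

# Assumed patterns: a line ending in ';', '{' or '}' is treated as code,
# and a line shaped like "at pkg.Class.method(File.java:12)" as a stack frame.
CODE_LINE = re.compile(r"(;|\{|\})\s*$")
STACK_LINE = re.compile(r"^\s*at\s+[\w.$]+\(.*\)\s*$")


def split_paragraphs(text):
    """Split a message into paragraphs separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def looks_like_code(paragraph):
    """Flag a paragraph when more than half of its lines look like code or stack frames."""
    lines = [line for line in paragraph.splitlines() if line.strip()]
    flagged = sum(1 for line in lines if CODE_LINE.search(line) or STACK_LINE.match(line))
    return flagged * 2 > len(lines)


def extract_text_paragraphs(text):
    """Keep only the paragraphs that do not look like code or stack traces."""
    return [p for p in split_paragraphs(text) if not looks_like_code(p)]
```

On a message shaped like the example discussion, this keeps the two natural-language paragraphs and drops the code fragment between them.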
  13. Step 3: Tracing paragraphs onto methods
     These paragraphs must respect the following two conditions:
     A) a valid paragraph must contain the keyword "method"; and
     B) the method name must be followed by an open parenthesis, i.e., we match "foo("
     Example developer discussion:
       PAR 1: "When call the method IndexSplitter.split(File destDir, String[] segs) from the Lucene cotrib directory it creates an index with segments descriptor file with wrong data. Namely wrong is the number representing the name of segment that would be created next in this index."
     CLASS: IndexSplitter; METHOD: split(
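Conditions A and B of Step 3 amount to two pattern checks. The sketch below assumes the method names of the traced class are known in advance. The function name `trace_to_methods` is hypothetical, and the case-insensitive match on "method" is an assumption, since the slide does not say how the keyword is matched.

```python
import re


def trace_to_methods(paragraph, method_names):
    """Return the method names a paragraph is traced to (hypothetical helper)."""
    # Condition A: the paragraph must contain the keyword "method"
    # (matched case-insensitively here, which is an assumption).
    if not re.search(r"\bmethod", paragraph, re.IGNORECASE):
        return []
    # Condition B: the method name must be followed by an open parenthesis, e.g. "foo(".
    return [
        name
        for name in method_names
        if re.search(r"(?<!\w)" + re.escape(name) + r"\(", paragraph)
    ]
```

A paragraph that names a method without the parenthesis, or uses the parenthesis without the keyword, is not traced.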
  14. Step 4: Heuristic-based filtering
     We defined a set of heuristics that assign each paragraph a score, in order to further filter the paragraphs associated with methods:
     a) Method parameters: s1 = % of parameters mentioned in the paragraph. Value between 0 and 1; 1 if the method does not have parameters.
     b) Syntactic descriptions (mentioning return values): s2 = 1 if the paragraph contains the keyword "return", 0 otherwise. Equal to 1 if the method is void.
     c) Overriding/Overloading: s3 = 1 if any of the "overload" or "override" keywords appears in the paragraph, 0 otherwise.
     d) Method invocations: s4 = 1 if any of the "call" or "execute" keywords appears in the paragraph, 0 otherwise.
     Example paragraph:
       "Problem seems to come from MainMethodeSearchEngine in org.eclipse.jdt.internal.ui.launcher The Method searchMainMethods(IProgressMonitor, IJavaSearchScope, boolean) ,there's a call to addSubTypes(List, IProgressMonitor, IJavaSearchScope) Method if includesSubtypes flag is ON. This method add all types subtypes as soon as the given scope encloses them without testing if sub-types have a main! After return IType[] before the excecution"
     CLASS: MainMethodSearchEngine; METHOD: searchMainMethods
     % parameter = 100% -> s1 = 1; SCORE = s2 + s3 + s4 = 1 + 0 + 1 = 2
  18. Step 4: Heuristic-based filtering (selection)
     We selected paragraphs that have:
     1. s1 ≥ thP = 0.5
     2. s2 + s3 + s4 ≥ thH = 1
     For the example above: s1 = 1 ≥ 0.5 and SCORE = 1 + 0 + 1 = 2 ≥ 1, so the paragraph is selected (OK).
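The four heuristics a) to d) and the two thresholds can be written down directly. This is a minimal sketch under stated assumptions: a parameter counts as "mentioned" when its token (type or name) occurs as a substring of the paragraph, and keywords are matched as case-insensitive substrings. The slides do not specify either choice, and the function names are hypothetical.

```python
def heuristic_scores(paragraph, param_tokens, returns_void):
    """Compute (s1, s2, s3, s4) for one paragraph/method pair."""
    text = paragraph.lower()
    # a) Fraction of parameters mentioned; 1 if the method has no parameters.
    if param_tokens:
        mentioned = sum(1 for token in param_tokens if token.lower() in text)
        s1 = mentioned / len(param_tokens)
    else:
        s1 = 1.0
    # b) Mentions a return value; equal to 1 if the method is void.
    s2 = 1 if returns_void or "return" in text else 0
    # c) Overriding/overloading keywords.
    s3 = 1 if ("overload" in text or "override" in text) else 0
    # d) Method invocation keywords.
    s4 = 1 if ("call" in text or "execute" in text) else 0
    return s1, s2, s3, s4


def is_selected(scores, th_p=0.5, th_h=1):
    """Apply the selection rule: s1 >= thP and s2 + s3 + s4 >= thH."""
    s1, s2, s3, s4 = scores
    return s1 >= th_p and (s2 + s3 + s4) >= th_h
```

Applied to the searchMainMethods paragraph with the parameter tokens IProgressMonitor, IJavaSearchScope and boolean, the sketch yields s1 = 1 and s2 + s3 + s4 = 1 + 0 + 1 = 2, in line with the worked example on the slide.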
  19. Step 5: Similarity-based filtering
     We rank filtered paragraphs through their textual similarity with the method they are likely describing.
     Removing:
     • English stop words
     • programming language keywords
     Using:
     • camel case splitting on the remaining words
     • Vector Space Model
     Threshold: th >= 0.80

     METHOD     PARAGRAPH     SCORE   SIMILARITY
     Method_3   Paragraph_4   2.5     96.1%
     Method_1   Paragraph_1   2.5     95.6%
     Method_2   Paragraph_2   1.5     97.4%
     Method_3   Paragraph_3   1.5     86.2%
     Method_1   Paragraph_3   1.5     79.0%
     Method_3   Paragraph_2   1.5     77.5%
     Method_2   Paragraph_4   1.5     64.3%
     Method_2   Paragraph_3   1.3     83.2%
     Method_3   Paragraph_1   1.3     73.9%
     Method_2   Paragraph_1   1.3     68.7%
     Method_1   Paragraph_4   1.3     53.6%
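The Step 5 pipeline can be sketched as tokenization followed by a cosine similarity in a vector space. The slides do not state the term weighting, the stop-word list, or how the 0.80 threshold is applied, so this sketch assumes raw term frequencies, a tiny illustrative stop-word and keyword list, and a filter that keeps paragraphs at or above the threshold. The ordering by Step 4 score shown in the table is not reproduced.

```python
import math
import re
from collections import Counter

# Tiny illustrative lists (assumptions); a real run would use complete lists.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "it", "and", "with", "this"}
JAVA_KEYWORDS = {"public", "private", "void", "new", "return", "throws", "static", "class"}


def tokenize(text):
    """Drop stop words and keywords, then split the remaining words on camel case."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in STOP_WORDS or word.lower() in JAVA_KEYWORDS:
            continue
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", word)
        tokens.extend(part.lower() for part in parts)
    return tokens


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two term-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def rank_paragraphs(method_source, paragraphs, threshold=0.80):
    """Return (similarity, paragraph) pairs at or above the threshold, best first."""
    method_tokens = tokenize(method_source)
    scored = [(cosine_similarity(method_tokens, tokenize(p)), p) for p in paragraphs]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Raw term frequency is the simplest weighting that still fits a Vector Space Model; a tf-idf weighting over the whole corpus of paragraphs would be a natural alternative, but the slides do not say which was used.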
  21. Empirical Study
     • Goal: analyze source code descriptions in developer discussions
     • Purpose: investigate how developer discussions describe methods of Java source code
     • Quality focus: find good method descriptions in messages exchanged among contributors/developers
     • Context: bug reports and mailing lists of two Java projects, Apache Lucene and Eclipse
  22. Context
  23. Research Questions
     • RQ1 (method coverage): How many methods from the analyzed software systems are described by the paragraphs identified by the proposed approach?
     • RQ2 (precision): How precise is the proposed approach in identifying method descriptions?
     • RQ3 (missing descriptions): How many potentially good method descriptions are missed by the approach?
  24. RQ1: How many methods from the analyzed software systems are described by the paragraphs identified by the proposed approach?
  27. RQ2: How precise is the proposed approach in identifying method descriptions? We sampled 250 descriptions from each project.
  30. RQ3: How many potentially good method descriptions are missed by the approach? We sampled 100 descriptions from each project.
     TABLE III. The analysis of a sample of 100 paragraphs traced to methods, but not satisfying the Step 4 heuristic

     System          True Negatives   False Negatives
     Eclipse         78               22
     Apache Lucene   67               33
  31. Conclusion