The document discusses evaluating how advanced techniques for splitting identifiers impact the performance of feature location techniques. It describes using different identifier splitting algorithms like CamelCase, Samurai, and a manually developed "perfect" splitting algorithm as inputs to information retrieval (IR) based feature location techniques. The study compares the effectiveness of IR feature location using different splitting techniques on two software systems, Rhino and jEdit, across four datasets containing features and bugs. Results are presented using box plots to show median and average effectiveness for each technique and dataset.
1. Can Better Identifier Splitting Techniques Help Feature Location?
Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol
SEMERU
19th IEEE International Conference on Program Comprehension
(ICPC’11) – Kingston, Ontario, Canada
4. Textual information embeds domain knowledge
About 70% of source code consists of identifiers*
* Deissenboeck, F. and Pizka, M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
5. Textual information embeds domain knowledge
About 70% of source code consists of identifiers*
Identifiers are an important source of information for maintenance tasks:
• traceability link recovery
• feature location
* Deissenboeck, F. and Pizka, M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
6. # of Cumulative Feature Location Papers based on Textual Information
7. Related Work on Identifiers
• Takang et al. (JPL’96)
– programs with full-word identifiers are more understandable than those with abbreviated ones
• Lawrie et al. (ICPC’06)
– full words and recognizable abbreviations lead to better comprehension
• Binkley et al. (ICPC’09)
– CamelCase style is easier to recognize than underscore
8. Related Work on Identifiers
• Enslen et al. (MSR’09)
– Samurai: algorithm for splitting identifiers (using tables of identifier frequencies)
• Guerrouj et al. (JSME’11)
– TIDIER: algorithm for splitting identifiers (using contextual information)
• Other related work
– Deissenboeck and Pizka (SQJ’06), Antoniol et al. (ICSM’07), Haiduc and Marcus (ICPC’08), etc.
11. Identifier Splitting Algorithms
Original Identifier | Camel Case
userId | user Id
setGID | set GID
print_file2device | print file 2 device
SSLCertificate | SSL Certificate
MINstring | MI Nstring
USERID | USERID
currentsize | currentsize
readadapterobject | readadapterobject
tolocale | tolocale
imitating | imitating
DEFMASKBit | DEFMASK Bit
12. Identifier Splitting Algorithms
Camel Case splitting:
• Handles underscores and digits
• Fails at mixed cases
• Fails at same-case identifiers
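The Camel Case column of the table can be reproduced with a few boundary rules. The following is a minimal sketch, assuming the conventional rules (underscores, letter/digit changes, lower-to-upper changes, and an acronym followed by a capitalized word); the function name `camel_case_split` is illustrative and not taken from the slides.

```python
import re

# Zero-width boundaries where a conventional Camel Case splitter cuts.
_BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])"          # lower -> upper: userId -> user Id
    r"|(?<=[A-Z])(?=[A-Z][a-z])"    # acronym then word: SSLCertificate -> SSL Certificate
    r"|(?<=[A-Za-z])(?=[0-9])"      # letter -> digit: file2 -> file 2
    r"|(?<=[0-9])(?=[A-Za-z])"      # digit -> letter: 2device -> 2 device
)

def camel_case_split(identifier):
    """Split on underscores, digits, and case changes only."""
    parts = []
    for chunk in identifier.split("_"):
        # Insert a space at every boundary, then split on whitespace.
        parts.extend(_BOUNDARY.sub(" ", chunk).split())
    return parts

for name in ["userId", "print_file2device", "MINstring", "USERID", "currentsize"]:
    print(name, "->", " ".join(camel_case_split(name)))
```

Because the rules look only at character classes, same-case identifiers such as USERID or currentsize come back unsplit, and MINstring is cut in the wrong place, which matches the failure modes listed above.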
14. Identifier Splitting Algorithms
Original Identifier | Camel Case | Samurai
userId | user Id | user Id
setGID | set GID | set GID
print_file2device | print file 2 device | print file 2 device
SSLCertificate | SSL Certificate | SSL Certificate
MINstring | MI Nstring | MIN string
USERID | USERID | USER ID
currentsize | currentsize | current size
readadapterobject | readadapterobject | read adapter object
tolocale | tolocale | tol ocal e
imitating | imitating | imi ta ting
DEFMASKBit | DEFMASK Bit | DEF MASK Bit
15. Identifier Splitting Algorithms
Samurai:
• Splits some cases where CamelCase cannot
• Oversplits
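To illustrate why a frequency table helps with same-case identifiers, here is a simplified frequency-driven splitter. It is not the Samurai algorithm: it is a plain dynamic-programming word segmentation, and the table `WORD_FREQ` and the function `frequency_split` are hypothetical names with made-up counts.

```python
import math

# Hypothetical frequency table; a real one would be mined from source code.
WORD_FREQ = {"current": 50, "size": 80, "read": 60, "adapter": 20,
             "object": 70, "user": 90, "id": 40}

def frequency_split(token, freq=WORD_FREQ):
    """Segment a same-case token into known words, maximizing the sum of
    log-frequencies; return the token unchanged if no full segmentation
    into known words exists. Parts are returned in lower case."""
    s = token.lower()
    n = len(s)
    best = [None] * (n + 1)   # best[i] = (score, parts) for the prefix s[:i]
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            word = s[j:i]
            if best[j] is None or word not in freq:
                continue
            score = best[j][0] + math.log(freq[word])
            if best[i] is None or score > best[i][0]:
                best[i] = (score, best[j][1] + [word])
    return best[n][1] if best[n] else [token]

for name in ["currentsize", "readadapterobject", "USERID", "zzz"]:
    print(name, "->", " ".join(frequency_split(name)))
```

A splitter of this kind can oversplit when short fragments are frequent in the table, which is the kind of behaviour the slide flags for tolocale and imitating.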
17. # of Cumulative Feature Location Papers based on Textual Information
Existing feature location techniques use Camel Case splitting
18. Information Retrieval FLT
• Generate corpus
• Preprocessing
– Remove non-literals
– Remove stop words
– Split identifiers
– Stemming
• Indexing
– Term-by-document matrix
– Singular Value Decomposition
• User formulates query
• Generate results
• Ranked list
Example method:
    synchronized void print(TestResult result, long runTime) throws IOException {
        printHeader(runTime);
        printErrors(result);
        printFailures(result);
        printFooter(result);
    }
After removing non-literals:
    synchronized void print TestResult result long runTime throws IOException printHeader runTime printErrors result printFailures result printFooter result
After removing stop words:
    print TestResult result runTime IOException printHeader runTime printErrors result printFailures result printFooter result
19. Information Retrieval FLT
• Generate corpus
• Preprocessing
– Remove non-literals
– Remove stop words
– Split identifiers
– Stemming
• Indexing
– Term-by-document matrix
– Singular Value Decomposition
• User formulates query
• Generate results
• Ranked list
After splitting identifiers:
    print Test Result result run Time IO Exception print Header run Time print Errors result print Failures result print Footer result
After stemming:
    print test result result run time io exception print head run time print error result print fail result print foot result
Term-by-document matrix:
   | print | test | result | ...
m1 | 5 | 1 | 3 | ...
m2 | ... | ... | ... | ...
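The four preprocessing steps can be sketched end to end on the slide's example method. This is an illustration under assumptions: the stop list contains only the Java keywords that occur in the example, and `toy_stem` is a deliberately tiny suffix stripper that covers just the words in this example, standing in for a real stemmer.

```python
import re

# Assumed stop list: only the Java keywords present in the example method.
STOP_WORDS = {"synchronized", "void", "long", "throws"}

SOURCE = """
synchronized void print(TestResult result, long runTime) throws IOException {
    printHeader(runTime);
    printErrors(result);
    printFailures(result);
    printFooter(result);
}
"""

def split_camel(token):
    # Cut at lower->upper and acronym->word boundaries.
    return re.sub(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", token).split()

def toy_stem(word):
    # Toy suffix stripper for this example only; not a real stemmer.
    if word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    for suffix in ("ure", "er"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(source):
    tokens = re.findall(r"[A-Za-z]+", source)            # remove non-literals
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    words = [w for t in tokens for w in split_camel(t)]  # split identifiers
    return [toy_stem(w.lower()) for w in words]          # stemming

print(" ".join(preprocess(SOURCE)))
```

Swapping `split_camel` for another splitter is the only change needed to study the effect of the splitting step, which is the variable the study manipulates.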
20. Information Retrieval FLT
• Generate corpus
• Preprocessing
– Remove non-literals
– Remove stop words
– Split identifiers
– Stemming
• Indexing
– Term-by-document matrix
– Singular Value Decomposition
• User formulates query
• Generate results
• Ranked list
Term-by-document matrix:
   | print | test | result | ...
m1 | 5 | 1 | 3 | ...
m2 | ... | ... | ... | ...
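As a rough illustration of the indexing and ranking steps, the sketch below builds a term-by-document matrix of raw counts and ranks documents by cosine similarity to a query. It intentionally skips the Singular Value Decomposition step, so it is a plain vector space model rather than LSI; the documents in `DOCS` and all function names are hypothetical.

```python
import math
from collections import Counter

def build_matrix(docs):
    """Return (terms, {doc_id: [count per term]}) from tokenized documents."""
    terms = sorted({t for tokens in docs.values() for t in tokens})
    rows = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        rows[doc_id] = [counts[t] for t in terms]
    return terms, rows

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query_tokens, docs):
    """Rank documents (methods) by cosine similarity to the query."""
    terms, rows = build_matrix(docs)
    counts = Counter(query_tokens)
    q = [counts[t] for t in terms]
    scored = [(doc_id, cosine(q, row)) for doc_id, row in rows.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical preprocessed methods, one token list per method.
DOCS = {
    "m1": ["print", "test", "result", "print", "result"],
    "m2": ["read", "file", "buffer"],
    "m3": ["print", "file"],
}

for method, score in rank(["print", "result"], DOCS):
    print(method, round(score, 2))
```

In an LSI setting the same cosine comparison would be applied after projecting the matrix and the query into a reduced space obtained from the decomposition.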
21. IR and Dynamic Information FLT
• Generate corpus
• Preprocessing
– Remove non-literals
– Remove stop words
– Split identifiers
– Stemming
• Indexing
– Term-by-document matrix
– Singular Value Decomposition
• Collect execution trace
• User formulates query
• Generate results
• Ranked list of executed methods
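One way to read the last step is as a filter: the IR ranking is kept in order, but only methods that appear in the execution trace survive. The sketch below assumes exactly that; `ir_dyn_rank` is a hypothetical name, and the scores and executed methods are the ones shown later in the deck's comparison example.

```python
def ir_dyn_rank(ranked, executed):
    """Keep only methods that appear in the execution trace, preserving IR order."""
    executed = set(executed)
    return [(method, score) for method, score in ranked if method in executed]

# IR ranking (method, similarity score) and the methods seen in the trace.
IR_RANKING = [("M121", 0.92), ("M64", 0.89), ("M15", 0.86), ("M39", 0.80),
              ("M7", 0.74), ("M152", 0.65), ("M234", 0.56), ("M12", 0.54),
              ("M78", 0.52)]
TRACE = ["M15", "M7", "M234", "M12"]

for method, score in ir_dyn_rank(IR_RANKING, TRACE):
    print(method, score)
```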
22. Research Goal
Evaluate how advanced splitting techniques impact the performance of feature location techniques
23. Information Retrieval FLT
• Generate corpus
• Preprocessing
– Remove non-literals
– Remove stop words
– Split identifiers
– Stemming
• Indexing
– Term-by-document matrix
– Singular Value Decomposition
• User formulates query
• Generate results
• Ranked list (annotated "Better" / "Worst")
In the "Split identifiers" step, replace Camel Case with:
• Samurai
• “Perfect” splitting algorithm (Oracle)
25. Building the Oracle
Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER)
YES → Concordant Split Identifiers
26. Building the Oracle
Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER)
YES → Concordant Split Identifiers
• Assume they are correct
• Manually verified a sample
• Threat to validity
27. Building the Oracle
Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER)
YES → Concordant Split Identifiers
NO → Discordant Split Identifiers → Manual Split → Manually Split Identifiers
28. Building the Oracle
Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER)
YES → Concordant Split Identifiers
NO → Discordant Split Identifiers → Manual Split → Manually Split Identifiers
Manual Split:
• Consensus between authors
• Checked source code
29. Building the Oracle
Extract Identifiers → All Identifiers → Same split? (CamelCase, Samurai, TIDIER)
YES → Concordant Split Identifiers
NO → Discordant Split Identifiers → Manual Split → Manually Split Identifiers
Identifiers that could not be split:
• Examples: DT, i3, P754, zzz, etc.
• Left unchanged
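The oracle-building flow can be summarized in a few lines: accept a split when all splitters agree, and fall back to a manual decision otherwise. In this sketch the splitter outputs are lookup tables rather than real tools; the CamelCase and Samurai entries for userId and USERID follow the table shown earlier, while the TIDIER entries, the zzz entries, and the `MANUAL` table are placeholders invented for illustration.

```python
# Stand-ins for the three splitters (lookup tables instead of real tools).
CAMEL = {"userId": ["user", "Id"], "USERID": ["USERID"], "zzz": ["zzz"]}
SAMURAI = {"userId": ["user", "Id"], "USERID": ["USER", "ID"], "zzz": ["z", "zz"]}
TIDIER = {"userId": ["user", "Id"], "USERID": ["USER", "ID"], "zzz": ["zzz"]}

# Hypothetical manual decisions; identifiers missing here are left unchanged.
MANUAL = {"USERID": ["USER", "ID"]}

def manual_split(identifier):
    return MANUAL.get(identifier, [identifier])

def build_oracle(identifiers, splitters, manual):
    """Return (oracle, discordant) where oracle maps identifier -> split."""
    oracle, discordant = {}, []
    for ident in identifiers:
        splits = {tuple(split(ident)) for split in splitters}
        if len(splits) == 1:
            oracle[ident] = list(splits.pop())   # concordant: assumed correct
        else:
            discordant.append(ident)
            oracle[ident] = manual(ident)        # discordant: manual split
    return oracle, discordant

oracle, discordant = build_oracle(
    ["userId", "USERID", "zzz"], [CAMEL.get, SAMURAI.get, TIDIER.get], manual_split)
print(oracle)
print("manually resolved:", discordant)
```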
31. Design of the Case Study
• RQ: Does a FLT with an advanced
splitting algorithm produce better results
than the same FLT using the CamelCase
splitting algorithm?
33. How to Compare two FLTs?
• Effectiveness measure for each feature
IR
Method | LSI score
M121 | 0.92
M64 | 0.89
M15 | 0.86
M39 | 0.80
M7 | 0.74
M152 | 0.65
M234 | 0.56
M12 | 0.54
M78 | 0.52
Legend: Gold set method
Effectiveness = 5
34. How to Compare two FLTs?
• Effectiveness measure for each feature
IR
Method | LSI score
M121 | 0.92
M64 | 0.89
M15 | 0.86
M39 | 0.80
M7 | 0.74
M152 | 0.65
M234 | 0.56
M12 | 0.54
M78 | 0.52
IRDyn
Method | LSI score
M15 | 0.86
M7 | 0.74
M234 | 0.56
M12 | 0.54
Effectiveness = 2
Legend: Gold set method; Executed method (from trace)
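The two effectiveness values can be reproduced with a short sketch, assuming that effectiveness is the 1-based rank of the first gold set method in the ranked list (smaller is better) and that the gold set method in this example is M7. The second assumption is an inference: M7 is the only method whose position is 5 in the IR list and 2 in the IRDyn list.

```python
def effectiveness(ranked_methods, gold_set):
    """Best (lowest) 1-based rank of any gold set method, or None if absent."""
    for position, method in enumerate(ranked_methods, start=1):
        if method in gold_set:
            return position
    return None

IR = ["M121", "M64", "M15", "M39", "M7", "M152", "M234", "M12", "M78"]
EXECUTED = {"M15", "M7", "M234", "M12"}       # methods found in the trace
IR_DYN = [m for m in IR if m in EXECUTED]     # IR order, executed methods only
GOLD = {"M7"}                                 # assumption, see the note above

print("IR effectiveness:", effectiveness(IR, GOLD))
print("IRDyn effectiveness:", effectiveness(IR_DYN, GOLD))
```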
36. Software Systems
• Rhino 1.6R5
• 138 classes, 1,870 methods, 32K LOC
• Eaddy et al.’s data*
• 2 datasets
Dataset | Size | Queries | Gold Sets | Execution Information
RhinoFeatures | 241 | Sections of ECMAScript documentation | Eaddy et al.* | Full Execution Traces (from unit tests)
RhinoBugs | 143 | Bug title and description | Eaddy et al.* (CVS) | N/A
* http://www.cs.columbia.edu/~eaddy/concerntagger/
37. Software Systems
• jEdit 4.3
• 483 classes, 6.4K methods, 109K LOC
• 2 datasets
Dataset | Size | Queries | Gold Sets | Execution Information
jEditFeatures | 64 | Feature (or Patch) title and description | SVN | Marked Execution Traces
jEditBugs | 86 | Bug title and description | SVN | Marked Execution Traces
Datasets available at:
http://www.cs.wm.edu/semeru/data/icpc11-identifier-splitting/
40. Generating the jEdit Datasets
Changed files (Previous Version, Current Version) → Compare using Eclipse AST → Modified methods (gold set)
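The idea of deriving a gold set by comparing two versions at the syntax-tree level can be illustrated with Python's own `ast` module. This is only an analogy: the study compared Java files using the Eclipse AST, whereas the sketch parses Python source, and it assumes that a method counts as modified when it exists in both versions and its tree differs.

```python
import ast

def method_fingerprints(source):
    """Map each function name to a dump of its AST (line numbers excluded)."""
    tree = ast.parse(source)
    return {node.name: ast.dump(node)
            for node in ast.walk(tree)
            if isinstance(node, ast.FunctionDef)}

def modified_methods(previous, current):
    """Names of functions present in both versions whose AST changed."""
    old, new = method_fingerprints(previous), method_fingerprints(current)
    return sorted(name for name in old.keys() & new.keys() if old[name] != new[name])

PREVIOUS = """
def save(buf):
    return write(buf)

def load(path):
    return read(path)
"""

CURRENT = """
# A comment-only change does not alter the AST.
def save(buf):
    flush(buf)
    return write(buf)

def load(path):
    return read(path)
"""

print(modified_methods(PREVIOUS, CURRENT))
```

Comparing trees rather than text means that formatting and comment changes do not add methods to the gold set.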
44. IR FLTs
[Box plot: RhinoFeatures]
Similar median and average
45. IR FLTs
[Box plots: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
Similar median and average
46. IR FLTs
[Box plots: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
Similar median and average
Datasets with features have better results than datasets with bugs
47. IRDyn FLTs
[Box plots: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs]
Similar median and average
48. IRDyn FLTs
[Box plots: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs]
Similar median and average
Datasets with features have better results than datasets with bugs
52. IR
[Box plots: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
Datasets with features vs. datasets with bugs
53. IRDyn
[Box plots: RhinoFeatures, RhinoBugs (N/A), jEditFeatures, jEditBugs]
Similar trend
54. Statistical Results
• Wilcoxon signed-rank test
• Null hypothesis
– There is no statistically significant difference in terms of effectiveness between IRSamurai/IROracle and IRCamelCase
• Alternative hypothesis
– IRSamurai/IROracle has statistically significantly higher effectiveness than IRCamelCase
• alpha = 0.05
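For readers unfamiliar with the test, the sketch below computes the two rank sums that the Wilcoxon signed-rank test is built on, using paired effectiveness values for the same features under two techniques. The data are invented for illustration, the p-value computation is omitted (in practice a library routine such as SciPy's `wilcoxon` would supply it), and effectiveness is treated as a rank, so a positive difference means the second technique did better.

```python
def signed_rank_sums(baseline, treatment):
    """Return (W_plus, W_minus) for paired samples, dropping zero differences
    and giving tied absolute differences their average rank."""
    diffs = [b - t for b, t in zip(baseline, treatment) if b != t]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average = (i + 1 + j) / 2   # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus

# Hypothetical effectiveness values (ranks) for six features.
IR_CAMEL_CASE = [5, 12, 3, 40, 7, 9]
IR_ORACLE = [2, 12, 4, 25, 7, 6]

print(signed_rank_sums(IR_CAMEL_CASE, IR_ORACLE))
```

A large W_plus relative to W_minus is evidence for the alternative hypothesis; whether it is significant at alpha = 0.05 depends on the number of non-zero pairs.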
55. IR
[Box plots: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]
The only statistically significant result (p=0.05)
56. Qualitative Results
• Vocabulary mismatch between queries and code:
– Names of developers (e.g., Slava, Carlos)
– Identifiers specific to communication (e.g., thanks, greetings, annoying)
59. Threats to Validity
• External
– 2 Java applications (different domains)
– More systems needed
• Construct
– Errors may be present in Oracle and gold sets
– We used data produced by other researchers
• Internal
– Subjectivity and bias in building the Oracle
• Conclusion
– Non-parametric test: Wilcoxon signed-rank
60. Research Questions
• RQ1 Does IRSamurai outperform IRCamelCase in terms of effectiveness? NO
• RQ2 Does IRSamuraiDyn outperform IRCamelCaseDyn in terms of effectiveness? NO
• RQ3 Does IROracle outperform IRCamelCase in terms of effectiveness? In some cases (Rhino)
• RQ4 Does IROracleDyn outperform IRCamelCaseDyn in terms of effectiveness? NO
61. Future Work
• More systems and datasets
• Different maintenance tasks
– Traceability link recovery
• Consider other splitting algorithms
62. Conclusions
• Advanced splitting techniques could improve FLTs
– We found some empirical evidence
• Splitting has more impact on IR FLT
• If execution information is available, it is not necessary to use an advanced splitting technique
63. Thank you! Questions?
SEMERU @ William and Mary
http://www.cs.wm.edu/semeru/
bdit@cs.wm.edu
64. References
• Takang et al. (1996) Takang, A., Grubb, P., and Macredie, R., "The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Investigation", Journal of Programming Languages, vol. 4, no. 3, 1996, pp. 143-167
• Lawrie et al. (2006) Lawrie, D., Morrell, C., Feild, H., and Binkley, D., "What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June 14-16 2006, pp. 3-12
• Binkley et al. (2009) Binkley, D., Davis, M., Lawrie, D., and Morrell, C., "To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May 17-19 2009, pp. 158-167
• Enslen et al. (2009) Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K., "Mining Source Code to Automatically Split Identifiers for Software Analysis", in Proc. of IEEE MSR'09, May 16-17 2009, pp. 71-80
• Guerrouj et al. (2011) Guerrouj, L., Di Penta, M., Antoniol, G., and Guéhéneuc, Y.-G., "TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques", JSME, vol. to appear, 2011