ICPC11c.ppt

Can Better Identifier Splitting Techniques Help Feature Location?
Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol

SEMERU

19th IEEE International Conference on Program Comprehension
(ICPC’11) – Kingston, Ontario, Canada

Textual information embeds domain knowledge

About 70% of source code consists of identifiers*

Identifiers are an important source of information for maintenance tasks:
• traceability link recovery
• feature location

* Deissenboeck, F. and Pizka, M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282

Cumulative Number of Feature Location Papers Based on Textual Information
(figure: chart not recoverable from the extracted text)

Related Work on Identifiers
• Takang et al. (JPL’96)
– programs with full-word identifiers are more understandable than those with abbreviated ones
• Lawrie et al. (ICPC’06)
– full words and recognizable abbreviations lead to better comprehension
• Binkley et al. (ICPC’09)
– CamelCase style is easier to recognize than underscore

Related Work on Identifiers
• Enslen et al. (MSR’09)
– Samurai: algorithm for splitting identifiers (using tables of identifier frequencies)
• Guerrouj et al. (JSME’11)
– TIDIER: algorithm for splitting identifiers (using contextual information)
• Other related work
– Deissenboeck and Pizka (SQJ’06), Antoniol et al. (ICSM’07), Haiduc and Marcus (ICPC’08), etc.

Splitting Identifiers Correctly is Challenging

Identifier Splitting Algorithms
Original Identifier
userId
setGID
print_file2device
SSLCertificate
MINstring
USERID
currentsize
readadapterobject
tolocale
imitating
DEFMASKBit

Original Identifier    Camel Case
userId                 user Id
setGID                 set GID
print_file2device      print file 2 device
SSLCertificate         SSL Certificate
MINstring              MI Nstring
USERID                 USERID
currentsize            currentsize
readadapterobject      readadapterobject
tolocale               tolocale
imitating              imitating
DEFMASKBit             DEFMASK Bit
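The Camel Case column can be reproduced with a few regular-expression rules. The following is a minimal sketch, assuming the usual rule set (split on underscores, at letter/digit boundaries, at lower-to-upper transitions, and before the last capital of an uppercase run that is followed by a lowercase letter). The function name `split_camel_case` is hypothetical and is not taken from the presentation.

```python
import re


def split_camel_case(identifier):
    """Split an identifier using simple Camel Case rules (illustrative sketch)."""
    # Underscores become separators.
    text = identifier.replace("_", " ")
    # Separate letters from digits in both directions.
    text = re.sub(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", text)
    # Split at a lowercase-to-uppercase transition: userId -> user Id.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Split before the last capital of an uppercase run followed by lowercase:
    # SSLCertificate -> SSL Certificate, but also MINstring -> MI Nstring.
    text = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", text)
    return text.split()


for name in ["userId", "print_file2device", "MINstring", "USERID"]:
    print(name, "->", " ".join(split_camel_case(name)))
```

The last rule is the one that mis-splits mixed-case identifiers such as MINstring, and no rule fires at all for same-case identifiers such as USERID or currentsize.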

Camel Case splitting:
• Handles underscore and digits (print_file2device → print file 2 device)
• Fails at mixed cases (MINstring → MI Nstring)
• Fails at same case identifiers (USERID, readadapterobject, tolocale, imitating are left unsplit)


Original Identifier    Camel Case             Samurai
userId                 user Id                user Id
setGID                 set GID                set GID
print_file2device      print file 2 device    print file 2 device
SSLCertificate         SSL Certificate        SSL Certificate
MINstring              MI Nstring             MIN string
USERID                 USERID                 USER ID
currentsize            currentsize            current size
readadapterobject      readadapterobject      read adapter object
tolocale               tolocale               tol ocal e
imitating              imitating              imi ta ting
DEFMASKBit             DEFMASK Bit            DEF MASK Bit

Samurai:
• Splits some cases where CamelCase cannot (MINstring → MIN string, USERID → USER ID, currentsize → current size, readadapterobject → read adapter object, DEFMASKBit → DEF MASK Bit)
• Oversplits (tolocale → tol ocal e, imitating → imi ta ting)
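To illustrate why same-case identifiers need more than case rules, here is a small vocabulary-based segmentation sketch. It is not Samurai and not TIDIER: it simply searches for the segmentation with the fewest words from a hypothetical word list, and the function name `split_same_case` and the vocabulary are assumptions made for this example.

```python
def split_same_case(identifier, vocabulary):
    """Segment a same-case identifier into the fewest vocabulary words.

    Returns the identifier unchanged (as a one-element list) when no
    segmentation into known words exists.
    """
    text = identifier.lower()
    # best[i] holds the shortest word list covering text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [identifier]


# Hypothetical vocabulary; short fragments such as "a" or "ter" are the kind
# of entries that can lead a splitter to oversplit.
vocabulary = {"read", "adapter", "object", "current", "size", "a", "dap", "ter"}
print(split_same_case("readadapterobject", vocabulary))
print(split_same_case("zzz", vocabulary))
```

Preferring the fewest words keeps "adapter" whole even though "a", "dap" and "ter" are all in the word list; a splitter without such a preference would be prone to the oversplitting shown for tolocale and imitating.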




Existing feature location techniques
use Camel Case splitting


Information Retrieval FLT
• Generate corpus
• Preprocessing
– Remove non-literals
– Remove stop words
– Split identifiers
– Stemming
• Indexing
– Term-by-document matrix
– Singular Value Decomposition
• User formulates query
• Generate results
• Ranked list

Example method:
synchronized void print(TestResult result, long runTime) throws IOException {
    printHeader(runTime);
    printErrors(result);
    printFailures(result);
    printFooter(result);
}

After removing non-literals:
synchronized void print TestResult result long runTime throws IOException printHeader runTime printErrors result printFailures result printFooter result

After removing stop words:
print TestResult result runTime IOException printHeader runTime printErrors result printFailures result printFooter result

After splitting identifiers (Camel Case):
print Test Result result run Time IO Exception print Header run Time print Errors result print Failures result print Footer result

After stemming:
print test result result run time io exception print head run time print error result print fail result print foot result

Indexing: term-by-document matrix
      print  test  result  ...
m1    5      1     3       ...
m2    ...    ...   ...     ...
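The preprocessing steps can be sketched in a few lines. This is an illustrative approximation only: the stop-word list is a hypothetical four-word set chosen to match the example method, the splitter is the plain Camel Case rule, and stemming is left out (the slide's "head", "error", "fail" and "foot" come from a stemmer that is not reproduced here).

```python
import re

# Hypothetical stop-word list, just large enough for the example method.
STOP_WORDS = {"synchronized", "void", "long", "throws"}


def preprocess(source):
    """Turn a method's source text into a list of lowercase terms."""
    # Remove non-literals: keep only alphabetic tokens.
    tokens = re.findall(r"[A-Za-z]+", source)
    # Remove stop words (programming-language keywords in this sketch).
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Split identifiers with Camel Case rules.
    words = []
    for token in tokens:
        spaced = re.sub(
            r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", token
        )
        words.extend(spaced.split())
    # Lowercase; a real pipeline would apply a stemmer at this point.
    return [w.lower() for w in words]


method = """
synchronized void print(TestResult result, long runTime) throws IOException {
    printHeader(runTime);
    printErrors(result);
    printFailures(result);
    printFooter(result);
}
"""
print(" ".join(preprocess(method)))
```

Swapping the Camel Case step for another splitter changes only the loop that fills `words`, which is the substitution the study evaluates.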


IR and Dynamic Information FLT
• Generate corpus
• Preprocessing
– Remove non-literals
– Stemming
• Indexing
– Singular Value Decomposition
• Collect execution trace
• User formulates query
• Generate results
• Ranked list of executed methods

Research Goal
Evaluate how advanced splitting techniques impact the performance of feature location techniques

• Generate corpus
• Preprocessing
– Remove non-literals
– Remove stop words
– Split identifiers: replace Camel Case with:
  • Samurai
  • “Perfect” splitting algorithm (Oracle)
– Stemming
• Indexing
– Singular Value Decomposition
• Generate results
• Ranked list (better or worse?)

Building the Oracle
• Extract identifiers (All Identifiers)
• Same split? (CamelCase, Samurai, TIDIER)
– YES: Concordant Split Identifiers
  • Assume they are correct
  • Manually verified a sample
  • Threat to validity
– NO: Discordant Split Identifiers
  • Manual split
  • Checked source code
  • Consensus between authors
  • Result: Manually Split Identifiers
• Identifiers that could not be split
– Examples: DT, i3, P754, zzz, etc.
– Left unchanged
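The concordant/discordant decision can be sketched as follows. This is a minimal illustration under assumptions: `build_oracle`, `splitters` and `manual_split` are hypothetical names, and the manual step (which in the study involved checking the source code and reaching consensus between authors) is represented by a callback.

```python
def build_oracle(identifiers, splitters, manual_split):
    """Build an identifier -> split mapping.

    splitters: functions that each return a list of words for an identifier
               (stand-ins for CamelCase, Samurai and TIDIER).
    manual_split: function used when the splitters disagree.
    """
    oracle = {}
    for identifier in identifiers:
        # Collect the distinct splits proposed by the automatic splitters.
        proposals = {tuple(split(identifier)) for split in splitters}
        if len(proposals) == 1:
            # Concordant: all splitters agree, so the split is assumed correct.
            oracle[identifier] = list(proposals.pop())
        else:
            # Discordant: fall back to the manually determined split.
            oracle[identifier] = manual_split(identifier)
    return oracle
```

A `manual_split` that returns `[identifier]` models identifiers such as DT or zzz that are left unchanged.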

Design of the Case Study
• RQ: Does a FLT with an advanced splitting algorithm produce better results than the same FLT using the CamelCase splitting algorithm?

How to Compare two FLTs?
• Effectiveness measure for each feature

IR
Method   LSI score
M121     0.92
M64      0.89
M15      0.86
M39      0.80
M7       0.74    (gold set method)
M152     0.65
M234     0.56
M12      0.54
M78      0.52

Effectiveness = 5

• Effectiveness measure for each feature

IR
Method   LSI score
M121     0.92
M64      0.89
M15      0.86    (executed)
M39      0.80
M7       0.74    (executed, gold set method)
M152     0.65
M234     0.56    (executed)
M12      0.54    (executed)
M78      0.52

IRDyn (executed methods only, from trace)
Method   LSI score
M15      0.86
M7       0.74    (gold set method)
M234     0.56
M12      0.54

Effectiveness = 2
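The effectiveness measure in the two examples is the rank of the first gold set method in the ranked list, optionally after filtering the list to executed methods. A minimal sketch, assuming that reading; the function name and signature are hypothetical:

```python
def effectiveness(ranked_methods, gold_set, executed=None):
    """Rank (1-based) of the first gold set method in the ranked list.

    ranked_methods: method names ordered by decreasing similarity score.
    executed: optional set of methods seen in the execution trace; when
              given, all other methods are dropped before ranking (IRDyn).
    Returns None if no gold set method is in the (filtered) list.
    """
    if executed is not None:
        ranked_methods = [m for m in ranked_methods if m in executed]
    for rank, method in enumerate(ranked_methods, start=1):
        if method in gold_set:
            return rank
    return None


ranking = ["M121", "M64", "M15", "M39", "M7", "M152", "M234", "M12", "M78"]
gold = {"M7"}
trace = {"M15", "M7", "M234", "M12"}
print(effectiveness(ranking, gold))          # IR
print(effectiveness(ranking, gold, trace))   # IRDyn
```

Lower values are better: the developer inspects fewer methods before reaching one that belongs to the feature.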

Which FLTs are we Comparing?


Software Systems
• Rhino 1.6R5
• 138 classes, 1,870 methods, 32K LOC
• Eaddy et al.’s data*
• 2 datasets

Dataset        Size  Queries                               Gold Sets             Execution Information
RhinoFeatures  241   Sections of ECMAScript documentation  Eaddy et al.*         Full Execution Traces (from unit tests)
RhinoBugs      143   Bug title and description             Eaddy et al.* (CVS)   N/A

* http://www.cs.columbia.edu/~eaddy/concerntagger/

Software Systems
• jEdit 4.3
• 483 classes, 6.4K methods, 109K LOC
• 2 datasets

Dataset        Size  Queries                                   Gold Sets  Execution Information
jEditFeatures  64    Feature (or Patch) title and description  SVN        Marked Execution Traces
jEditBugs      86    Bug title and description                 SVN        Marked Execution Traces

Datasets available at:
http://www.cs.wm.edu/semeru/data/icpc11-identifier-splitting/

Generating the jEdit Datasets
• SVN commits between v4.2 and v4.3
• SVN commit message: Title + Description = Query
• Changed files: compare Previous Version and Current Version using Eclipse AST
• Modified methods (gold set)

Presenting the Results
• Box plot of all effectiveness measures in each dataset (e.g., 241 data points for RhinoFeatures)
• Each box plot marks the average and the median

IR FLTs
(figure: box plots of effectiveness for RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs)
• Similar median and average
• Datasets with features have better results than datasets with bugs

IRDyn FLTs
(figure: box plots of effectiveness per dataset; one panel is marked N/A)
• Similar median and average
• Datasets with features have better results than datasets with bugs

Compare FLTs by Percentages
IROracle   IRCamelCase (Baseline)
10         17
20         15
18         18
5          9
4          16
19         7
12         28
14         15

• IROracle has the better (lower) effectiveness value in 5/8 cases
• IRCamelCase has the better (lower) effectiveness value in 2/8 cases
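The 5/8 and 2/8 figures follow from a pairwise comparison of the effectiveness values, where a lower value is better. A short sketch using the eight example pairs from the slide (the function name is hypothetical):

```python
def compare_by_percentages(candidate, baseline):
    """Percentage of features where the candidate FLT is better or worse.

    Both arguments are lists of effectiveness values for the same features;
    lower values are better. Ties count toward neither percentage.
    """
    pairs = list(zip(candidate, baseline))
    better = sum(1 for c, b in pairs if c < b)
    worse = sum(1 for c, b in pairs if c > b)
    return 100 * better / len(pairs), 100 * worse / len(pairs)


ir_oracle = [10, 20, 18, 5, 4, 19, 12, 14]
ir_camel_case = [17, 15, 18, 9, 16, 7, 28, 15]
print(compare_by_percentages(ir_oracle, ir_camel_case))  # (62.5, 25.0)
```

The remaining 12.5% is the single tie (18 vs. 18).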

IR
(figure: percentage comparison per dataset; panels RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs)
• Datasets with features vs. datasets with bugs

IRDyn
(figure: percentage comparison per dataset; one panel is marked N/A)
• Similar trend

Statistical Results
• Wilcoxon signed-rank test
• Null hypothesis
– There is no statistically significant difference in terms of effectiveness between IRSamurai/IROracle and IRCamelCase
• Alternative hypothesis
– IRSamurai/IROracle has statistically significantly higher effectiveness than IRCamelCase
• alpha = 0.05
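For readers unfamiliar with the test, here is a small pure-Python sketch of a one-sided Wilcoxon signed-rank test with an exact p-value obtained by enumerating sign assignments. It is an illustration only: the function name is hypothetical, the enumeration is practical only for small samples, and the eight illustrative pairs from the percentages slide are reused as toy input, not as the study's data.

```python
from itertools import product


def wilcoxon_one_sided(baseline, candidate):
    """One-sided Wilcoxon signed-rank test (exact, small samples only).

    Tests whether the candidate tends to have lower (better) effectiveness
    values than the baseline. Returns (W+, p-value).
    """
    # Positive difference = candidate is better; zero differences are dropped.
    diffs = [b - c for b, c in zip(baseline, candidate) if b != c]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share the average rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely.
    patterns = list(product((0, 1), repeat=len(ranks)))
    extreme = sum(
        1 for signs in patterns
        if sum(r for s, r in zip(signs, ranks) if s) >= w_plus
    )
    return w_plus, extreme / len(patterns)


ir_camel_case = [17, 15, 18, 9, 16, 7, 28, 15]
ir_oracle = [10, 20, 18, 5, 4, 19, 12, 14]
w_plus, p_value = wilcoxon_one_sided(ir_camel_case, ir_oracle)
print(w_plus, p_value < 0.05)
```

With only eight toy pairs the null hypothesis is not rejected at alpha = 0.05, which shows that winning 5 of 8 comparisons is not by itself statistically significant.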

IR
(figure: statistical results per dataset; panels RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs)
• The only statistically significant result (p=0.05)

Qualitative Results
• Vocabulary mismatch between queries and code:
– Names of developers (e.g., Slava, Carlos)
– Identifiers specific to communication (e.g., thanks, greetings, annoying)

Qualitative Results
• Features are more “descriptive” than bugs
– Words “join” and “line” are not mentioned

Threats to Validity
• External
– 2 Java applications (different domains)
– More systems needed
• Construct
– Errors may be present in Oracle and gold sets
– We used data produced by other researchers
• Internal
– Subjectivity and bias in building the Oracle
• Conclusion
– Non-parametric test: Wilcoxon signed-rank

Research Questions
• RQ1 Does IRSamurai outperform IRCamelCase in terms of effectiveness? NO

• RQ2 Does IRSamuraiDyn outperform IRCamelCaseDyn in terms of effectiveness? NO

• RQ3 Does IROracle outperform IRCamelCase in terms of effectiveness? In some cases (Rhino)

• RQ4 Does IROracleDyn outperform IRCamelCaseDyn in terms of effectiveness? NO

Future Work
• More systems and datasets
• Different maintenance tasks
– Traceability link recovery
• Consider other splitting algorithms

Conclusions
• Advanced splitting techniques could improve FLTs
– We found some empirical evidence
• Splitting has more impact on IR FLT
• If execution information is available, it is not necessary to use an advanced splitting technique

Thank you! Questions?
SEMERU @ William and Mary
http://www.cs.wm.edu/semeru/
bdit@cs.wm.edu

References
• Takang et al. (1996) Takang, A., Grubb, P., and Macredie, R., "The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Investigation", Journal of Programming Languages, vol. 4, no. 3, 1996, pp. 143-167
• Lawrie et al. (2006) Lawrie, D., Morrell, C., Feild, H., and Binkley, D., "What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June 14-16 2006, pp. 3-12
• Binkley et al. (2009) Binkley, D., Davis, M., Lawrie, D., and Morrell, C., "To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May 17-19 2009, pp. 158-167
• Enslen et al. (2009) Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K., "Mining Source Code to Automatically Split Identifiers for Software Analysis", in Proc. of IEEE MSR'09, May 16-17 2009, pp. 71-80
• Guerrouj et al. (2011) Guerrouj, L., Di Penta, M., Antoniol, G., and Guéhéneuc, Y.-G., "TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques", JSME, vol. to appear, 2011
