SlideShare a Scribd company logo
Can Better Identifier Splitting
T h i H l F L i ?
B d Di L if G j D P h k Gi li A i l
Techniques Help Feature Location?
Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol
SEMERU
19th IEEE International Conference on Program Comprehension
(ICPC’11) – Kingston, Ontario, Canada
2
Textual information embeds
d i k l ddomain knowledge
3
Textual information embeds
d i k l ddomain knowledge
About 70% of source codeAbout 70% of source code
consists of identifiers*
* Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software
Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
4
Textual information embeds
d i k l ddomain knowledge
About 70% of source codeAbout 70% of source code
consists of identifiers*
Identifiers are important source ofIdentifiers are important source of
information for maintenance tasks:information for maintenance tasks:
• traceability link recovery
• feature location
* Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software
Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
feature location
5
# of Cumulative Feature Location
Papers based on Textual InformationPapers based on Textual Information
6
Related Work on IdentifiersRelated Work on Identifiers
• Takang et al (JPL’96)• Takang et al. (JPL 96)
– programs with full-word identifiers are more
understandable than those with abbreviatedunderstandable than those with abbreviated
ones
• Lawrie et al (ICPC’06)• Lawrie et al. (ICPC 06)
– full words and recognizable abbreviations
lead to better comprehensionlead to better comprehension
• Binkley et al. (ICPC’09)
CamelCase style is easier to recognize than– CamelCase style is easier to recognize than
underscore 7
Related Work on IdentifiersRelated Work on Identifiers
• Enslen et al (MSR’09)• Enslen et al. (MSR 09)
– Samurai: algorithm for splitting identifiers
(using tables of identifier frequencies)(using tables of identifier frequencies)
• Guerrouj et al. (JSME’11)
TIDIER: algorithm for splitting identifiers– TIDIER: algorithm for splitting identifiers
(using contextual information)
Other related work• Other related work
– Deissenboeck and Pizka (SQJ’06), Antoniol et
al (ICSM’07) Haiduc and Marcus (ICPC’08)al. (ICSM’07), Haiduc and Marcus (ICPC’08),
etc. 8
Splitting Identifiers Correctly is
h llChallenging
9
Identifier Splitting AlgorithmsIdentifier Splitting Algorithms
Original Identifier
userId
setGID
print file2deviceprint_file2device
SSLCertificate
MINstring
USERID
currentsize
readadapterobjectp j
tolocale
imitating
DEFMASKBitDEFMASKBit
10
Identifier Splitting AlgorithmsIdentifier Splitting Algorithms
Original Identifier Camel Case
userId user Id
setGID set GID
print file2device print file 2 deviceprint_file2device print file 2 device
SSLCertificate SSL Certificate
MINstring MI Nstring
USERID USERID
currentsize currentsize
readadapterobject readadapterobjectp j p j
tolocale tolocale
imitating imitating
DEFMASKBit DEFMASK BitDEFMASKBit DEFMASK Bit
11
Identifier Splitting AlgorithmsIdentifier Splitting Algorithms
Original Identifier Camel Case
Handles
underscore and
userId user Id
setGID set GID
print file2device print file 2 device
underscore and
digits
print_file2device print file 2 device
SSLCertificate SSL Certificate
MINstring MI Nstring
USERID USERID
currentsize currentsize
readadapterobject readadapterobject Fails at mixed casesp j p j
tolocale tolocale
imitating imitating
DEFMASKBit DEFMASK BitDEFMASKBit DEFMASK Bit
12
Fails at same case
identifiers
Identifier Splitting AlgorithmsIdentifier Splitting Algorithms
Original Identifier Camel Case
userId user Id
setGID set GID
print file2device print file 2 deviceprint_file2device print file 2 device
SSLCertificate SSL Certificate
MINstring MI Nstring
USERID USERID
currentsize currentsize
readadapterobject readadapterobjectp j p j
tolocale tolocale
imitating imitating
DEFMASKBit DEFMASK BitDEFMASKBit DEFMASK Bit
13
Identifier Splitting AlgorithmsIdentifier Splitting Algorithms
Original Identifier Camel Case Samurai
userId user Id user Id
setGID set GID set GID
print file2device print file 2 device print file 2 deviceprint_file2device print file 2 device print file 2 device
SSLCertificate SSL Certificate SSL Certificate
MINstring MI Nstring MIN string
USERID USERID USER ID
currentsize currentsize current size
readadapterobject readadapterobject read adapter objectp j p j p j
tolocale tolocale tol ocal e
imitating imitating imi ta ting
DEFMASKBit DEFMASK Bit DEF MASK BitDEFMASKBit DEFMASK Bit DEF MASK Bit
14
Identifier Splitting AlgorithmsIdentifier Splitting Algorithms
Original Identifier Camel Case Samurai
userId user Id user Id
setGID set GID set GID
print file2device print file 2 device print file 2 device
Splits some cases
where CamelCase
print_file2device print file 2 device print file 2 device
SSLCertificate SSL Certificate SSL Certificate
MINstring MI Nstring MIN string
cannot
USERID USERID USER ID
currentsize currentsize current size
readadapterobject readadapterobject read adapter objectp j p j p j
tolocale tolocale tol ocal e
imitating imitating imi ta ting
DEFMASKBit DEFMASK Bit DEF MASK BitDEFMASKBit DEFMASK Bit DEF MASK Bit
15
Oversplits
# of Cumulative Feature Location
Papers based on Textual InformationPapers based on Textual Information
16
# of Cumulative Feature Location
Papers based on Textual InformationPapers based on Textual Information
Existing feature location techniques
use Camel Case splittinguse Camel Case splitting
17
Information Retrieval FLTInformation Retrieval FLT
• Generate corpus synchronized void print(TestResult result,
l Ti ) h IOE i {
• Preprocessing
– Remove non-literals
– Remove stop words
long runTime) throws IOException{
printHeader(runTime);
printErrors(result);
printFailures(result);Remove stop words
– Split identifiers
– Stemming
I d i
p ( );
printFooter(result);
}
synchronized void print TestResult result
• Indexing
– Term-by-document matrix
– Singular Value
synchronized void print TestResult result
long runTime throws IOException
printHeader runTime printErrors result
printFailures result printFooter result
g
Decomposition
• User formulate query
G t lt
print TestResult result runTime
IOException printHeader runTime
• Generate results
• Ranked list 18
O cept o p t eade u e
printErrors result printFailures result
printFooter result
Information Retrieval FLTInformation Retrieval FLT
• Generate corpus print Test Result result run Time IO
E ti i t H d Ti i t
• Preprocessing
– Remove non-literals
– Remove stop words
Exception print Header run Time print
Errors result print Failures result print
Footer result
Remove stop words
– Split identifiers
– Stemming
I d i
print test result result run time io
exception print head run time print error
• Indexing
– Term-by-document matrix
– Singular Value
result print fail result print foot result
g
Decomposition
• User formulate query
G t lt
print test result ...
m1 5 1 3 ...
• Generate results
• Ranked list 19
m2 ... ... ... ...
Information Retrieval FLTInformation Retrieval FLT
• Generate corpus
print test result ...
• Preprocessing
– Remove non-literals
– Remove stop words
m1 5 1 3 ...
m2 ... ... ... ...
Remove stop words
– Split identifiers
– Stemming
I d i• Indexing
– Term-by-document matrix
– Singular Valueg
Decomposition
• User formulate query
G t lt• Generate results
• Ranked list 20
IR and Dynamic Information FLTIR and Dynamic Information FLT
• Generate corpus
• Preprocessing
– Remove non-literals
– Remove stop wordsRemove stop words
– Split identifiers
– Stemming
I d i• Indexing
– Term-by-document matrix
– Singular Value Decompositiong p
• User formulate query
Collect execution
trace
• Generate results
• Ranked list of executed methods
21
R h G lResearch Goal
Evaluate how advanced splitting techniques impact
th f f f t l ti t h ithe performance of feature location techniques
22
Information Retrieval FLTInformation Retrieval FLT
• Generate corpus
• Preprocessing
– Remove non-literals
– Remove stop words
Replace Camel Case with :
•SamuraiRemove stop words
– Split identifiers
– Stemming
I d i
g ( )
•“Perfect” Splitting
algorithm (Oracle)
• Indexing
– Term-by-document matrix
– Singular Value
B
g
Decomposition
• User formulate query
G t lt
etterW
• Generate results
• Ranked list 23
Worst
Extract Identifiers
AllAll
Identifiers
B ildiBuilding
the Oracle
24
Extract Identifiers
All
Same
split?All
Identifiers
p
(CamelCase
Samurai
TIDIER)
B ildi
Concordant
YES
Building
the Oracle
Concordant
Split
Identifiers 25
Extract Identifiers
All
Same
split?All
Identifiers
p
(CamelCase
Samurai
TIDIER)
• Assume they are
correct
B ildi
Concordant
YES
• Manually verified a
sample
Building
the Oracle
Concordant
Split
Identifiers 26
sample
• Threat to validity
Manually
Split
IdentifiersIdentifiers
Extract Identifiers Manual Split
All
Discordant
Split
Same
split?
NO
All
Identifiers
Split
Identifiers
p
(CamelCase
Samurai
TIDIER)
B ildi
Concordant
YES
Building
the Oracle
Concordant
Split
Identifiers 27
Manually
Split
IdentifiersIdentifiers
Extract Identifiers Manual Split
Consensus
between authors
All
Discordant
Split
Same
split?
NO
Checked
source codeAll
Identifiers
Split
Identifiers
p
(CamelCase
Samurai
TIDIER)
Concordant
YES
B ildi
Concordant
Split
Identifiers
Building
the Oracle
28
Manually
Split
Identifiers
Identifiers
that could
not be split
• Examples: DT, i3,
Identifiersnot be split
P754, zzz, etc.
• Left unchanged
Extract Identifiers Manual Split
Left unchanged
All
Discordant
Split
Same
split?
NO
All
Identifiers
Split
Identifiers
p
(CamelCase
Samurai
TIDIER)
Concordant
YES
B ildi
Concordant
Split
Identifiers
Building
the Oracle
29
Design of the Case StudyDesign of the Case Study
30
Design of the Case StudyDesign of the Case Study
• RQ: Does a FLT with an advanced• RQ: Does a FLT with an advanced
splitting algorithm produce better results
than the same FLT using the CamelCasethan the same FLT using the CamelCase
splitting algorithm?
31
How to Compare two FLTs?How to Compare two FLTs?
32
How to Compare two FLTs?
• Effectiveness measure for each feature
How to Compare two FLTs?
• Effectiveness measure for each feature
Method LSI
IR
G ld t th d
Method LSI
score
M121 0.92
M64 0.89
Gold set methodM15 0.86
M39 0.80
M7 0.74
M 0 65M152 0.65
M234 0.56
M12 0.54
M 0 52
Effectiveness = 5
M78 0.52
33
How to Compare two FLTs?
• Effectiveness measure for each feature
How to Compare two FLTs?
• Effectiveness measure for each feature
Method LSI
IR
IRDynMethod LSI
score
M121 0.92
M64 0.89
Method LSI
score
M15 0.86Gold set method
y
M15 0.86
M39 0.80
M7 0.74
M 0 65
M7 0.74
M234 0.56
M12 0.54
Effectiveness = 2M152 0.65
M234 0.56
M12 0.54
M 0 52
ect e ess
M78 0.52
34Method
Executed method (from trace)
Which FLTs are we Comparing?Which FLTs are we Comparing?
35
Software SystemsSoftware Systems
• Rhino 1 6R5• Rhino 1.6R5
• 138 classes, 1,870 methods, 32K LOC
E dd l ’ d *• Eaddy et al.’s data*
• 2 datasets
Dataset Size Queries Gold Sets Execution
Information
RhinoFeatures 241 Sections of
ECMAScript
documentation
Eaddy et al.* Full Execution Traces
(from unit tests)
Rhi 143 B titl d E dd t l * N/A
36
* http://www.cs.columbia.edu/~eaddy/concerntagger/
RhinoBugs 143 Bug title and
description
Eaddy et al.*
(CVS)
N/A
Software SystemsSoftware Systems
• jEdit 4 3• jEdit 4.3
• 483 classes, 6.4K methods, 109K LOC
2 d t t• 2 datasets
Dataset Size Queries Gold Sets Execution
Information
jEditFeatures 64 Feature (or Patch)
title and description
SVN Marked Execution
Traces
jEditBugs 86 Bug title and
description
SVN Marked Execution
Traces
37
Datasets available at:
http://www.cs.wm.edu/semeru/data/icpc11-identifier-splitting/
Generating the jEdit
Datasets
SVN Commits between
v4.2-v4.3
38
Generating the jEdit
Datasets
SVN commit
message
Title
Description
+
Query
=
Query
Generating the jEdit
Datasets
Changed files
Previous
V i
Current
V i
Compare
using
Eclipse AST
Version Version
c pse S
Modified methods
(gold set) 40
Presenting the Results
41
Presenting the Results
Box plot of all effectiveness measure in
datasets
(e.g., 241 datapoints for RhinoFeatures)
Average
Median
42
IR FLTs
RhinoFeaturesRhinoFeatures
IR FLTs
S
RhinoFeatures
Similar median
and average
RhinoFeatures
IR FLTs
S
RhinoFeatures RhinoBugs
Similar median
and average
RhinoFeatures RhinoBugs
45jEditFeatures jEditBugs
IR FLTs
S
RhinoFeatures RhinoBugs
Similar median
and average
RhinoFeatures RhinoBugs
Datasets with
features have
better results
than datasets
with bugs
46jEditFeatures jEditBugs
IRDyn FLTs
S
N/A
RhinoFeatures RhinoBugs
Similar median
and average
RhinoFeatures RhinoBugs
47jEditFeatures jEditBugs
IRDyn FLTs
S
N/A
RhinoFeatures RhinoBugs
Similar median
and average
RhinoFeatures RhinoBugs
Datasets withatasets t
features have
better results
than datasetsthan datasets
with bugs
48jEditFeatures jEditBugs
Compare FLTs by PercentagesCompare FLTs by Percentages
IROracle IRCamelCase
(Baseline)(Baseline)
10 17
20 1520 15
18 18
5 9
4 16
19 7
12 28
49
14 15
Compare FLTs by PercentagesCompare FLTs by Percentages
IROracle IRCamelCase
(Baseline)
5/8
(Baseline)
10 17
20 1520 15
18 18
2/8
5 9
4 16
2/8
19 7
12 28
50
14 15
IRIR
RhinoFeatures RhinoBugs
51
jEditFeatures jEditBugs
IRIR
Datasets with features
RhinoFeatures RhinoBugs
Datasets with features
vs.
Datasets with bugs
52
jEditFeatures jEditBugs
IRDyn
N/A
Similar trend
RhinoFeatures RhinoBugs
53
jEditFeatures jEditBugs
Statistical ResultsStatistical Results
• Wilcoxon signed-rank test• Wilcoxon signed rank test
• Null hypothesis
Th i t ti ti l i ifi diff– There is no statistical significance difference
in terms of effectiveness between
IR /IR l and IR lIRSamurai/IROracle and IRCamelCase
• Alternative hypothesis
IR /IR h t ti ti ll i ifi tl– IRSamurai/IROracle has statistically significantly
higher effectiveness than IRCamelCase
l h 0 05• alpha = 0.05
54
IRIR
RhinoFeatures RhinoBugsThe only
statistical
significant resultsignificant result
(p=0.05)
55
jEditFeatures jEditB
Qualitative ResultsQualitative Results
• Vocabulary mismatch between queries
and code:and code:
– Name of developers (e.g., Slava, Carlos)
Id ifi ifi i i (– Identifiers specific to communication (e.g.,
thanks, greetings, annoying)
56
Qualitative ResultsQualitative Results
• Features are more “descriptive” than
bugsbugs
57
Qualitative ResultsQualitative Results
• Features are more “descriptive” than
bugsbugs
Words “join” and
“line” are not
mentioned
58
Threats to ValidityThreats to Validity
• External
– 2 Java applications (different domains)
– More systems needed
• Construct
– Errors may be present in Oracle and gold setsy p g
– We used data produced by other researchers
• Internal• Internal
– Subjectivity and bias in building the Oracle
Conclusion• Conclusion
– Non-parametric test: Wilcoxon signed-rank 59
Research Questions
• RQ1 Does IRSamurai outperform IRCamelCase in
terms of effectiveness? NO
• RQ2 Does IRS iDyn outperform IRC lC Dyn• RQ2 Does IRSamuraiDyn outperform IRCamelCaseDyn
in terms of effectiveness? NO
• RQ3 Does IROracle outperform IRCamelCase in terms
f ff ti ? I (Rhi )of effectiveness? In some cases (Rhino)
• RQ4 Does IROracleDyn outperform IRCamelCaseDyn
in terms of effectiveness? NO
60
Future WorkFuture Work
• More systems and datasets• More systems and datasets
• Different maintenance tasks
T bilit li k– Traceability link recovery
• Consider other splitting algorithms
61
ConclusionsConclusions
• Advanced splitting technique could• Advanced splitting technique could
improve FLTs
We found some empirical evidence– We found some empirical evidence
• Splitting has more impact on IR FLT
f f l bl• If execution information is available, it is
not necessary to use an advance splitting
h itechnique
62
Thank you! Questions?Thank you! Questions?
SEMERU @ William and MarySEMERU @ William and Mary
http://www.cs.wm.edu/semeru/
bdi dbdit@cs.wm.edu
SEMERU
63
ReferencesReferences
• Takang et al. (1996) Takang, A., Grubb, P., and Macredie, R., "The
Effects of Comments and Identifier Names on Program Comprehensibility:
An Experimental Investigation", Journal of Programming Languages, vol. 4,
no. 3, 1996, pp. 143-167
• Lawrie et al (2006) Lawrie D Morrell C Feild H and Binkley D• Lawrie et al. (2006) Lawrie, D., Morrell, C., Feild, H., and Binkley, D.,
"What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June
14-16 2006, pp. 3-12
• Binkley et al. (2009) Binkley, D., Davis, M., Lawrie, D., and Morrell, C.,y ( ) y, , , , , , , ,
"To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May 17-19 2009,
pp. 158-167
• Enslen et al. (2009) Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K.,
"Mi i S C d A i ll S li Id ifi f S f"Mining Source Code to Automatically Split Identifiers for Software
Analysis", in Proc. of IEEE MSR'09, May 16-17 2009, pp. 71-80
• Guerrouj et al. (2011) Guerrouj, L., Di Penta, M., Antoniol, G., and
Guéhéneuc Y -G "TIDIER: An Identifier Splitting Approach using SpeechGuéhéneuc, Y. G., TIDIER: An Identifier Splitting Approach using Speech
Recognition Techniques", JSME, vol. to appear, 2011
64

More Related Content

Similar to Icpc11b.ppt

C program compiler presentation
C program compiler presentationC program compiler presentation
C program compiler presentation
Rigvendra Kumar Vardhan
 
From logs to metrics
From logs to metricsFrom logs to metrics
From logs to metrics
Leonardo Di Donato
 
Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987
乐群 陈
 
Programming in C [Module One]
Programming in C [Module One]Programming in C [Module One]
Programming in C [Module One]
Abhishek Sinha
 
What uses for observing operations of Configuration Management?
What uses for observing operations of Configuration Management?What uses for observing operations of Configuration Management?
What uses for observing operations of Configuration Management?
RUDDER
 
Overloading in Overdrive: A Generic Data-Centric Messaging Library for DDS
Overloading in Overdrive: A Generic Data-Centric Messaging Library for DDSOverloading in Overdrive: A Generic Data-Centric Messaging Library for DDS
Overloading in Overdrive: A Generic Data-Centric Messaging Library for DDS
Sumant Tambe
 
Clean Code 2
Clean Code 2Clean Code 2
Clean Code 2
Fredrik Wendt
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
jtdudley
 
Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010
Christopher Biow
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
BlueHat v18 || The hitchhiker's guide to north korea's malware galaxy
BlueHat v18 || The hitchhiker's guide to north korea's malware galaxyBlueHat v18 || The hitchhiker's guide to north korea's malware galaxy
BlueHat v18 || The hitchhiker's guide to north korea's malware galaxy
BlueHat Security Conference
 
Code Analysis-run time error prediction
Code Analysis-run time error predictionCode Analysis-run time error prediction
Code Analysis-run time error predictionNIKHIL NAWATHE
 
Euro python 2015 writing quality code
Euro python 2015   writing quality codeEuro python 2015   writing quality code
Euro python 2015 writing quality code
radek_j
 
Does Your IBM i Security Meet the Bar for GDPR?
Does Your IBM i Security Meet the Bar for GDPR?Does Your IBM i Security Meet the Bar for GDPR?
Does Your IBM i Security Meet the Bar for GDPR?
Precisely
 
Skiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DSkiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in D
Mithun Hunsur
 
Generation of Random EMF Models for Benchmarks
Generation of Random EMF Models for BenchmarksGeneration of Random EMF Models for Benchmarks
Generation of Random EMF Models for Benchmarks
Markus Scheidgen
 
Jvm profiling under the hood
Jvm profiling under the hoodJvm profiling under the hood
Jvm profiling under the hood
RichardWarburton
 
Application Security
Application SecurityApplication Security
Application Securityflorinc
 
Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006
pogil
 

Similar to Icpc11b.ppt (20)

ICPC11c.ppt
ICPC11c.pptICPC11c.ppt
ICPC11c.ppt
 
C program compiler presentation
C program compiler presentationC program compiler presentation
C program compiler presentation
 
From logs to metrics
From logs to metricsFrom logs to metrics
From logs to metrics
 
Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987
 
Programming in C [Module One]
Programming in C [Module One]Programming in C [Module One]
Programming in C [Module One]
 
What uses for observing operations of Configuration Management?
What uses for observing operations of Configuration Management?What uses for observing operations of Configuration Management?
What uses for observing operations of Configuration Management?
 
Overloading in Overdrive: A Generic Data-Centric Messaging Library for DDS
Overloading in Overdrive: A Generic Data-Centric Messaging Library for DDSOverloading in Overdrive: A Generic Data-Centric Messaging Library for DDS
Overloading in Overdrive: A Generic Data-Centric Messaging Library for DDS
 
Clean Code 2
Clean Code 2Clean Code 2
Clean Code 2
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
 
Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
BlueHat v18 || The hitchhiker's guide to north korea's malware galaxy
BlueHat v18 || The hitchhiker's guide to north korea's malware galaxyBlueHat v18 || The hitchhiker's guide to north korea's malware galaxy
BlueHat v18 || The hitchhiker's guide to north korea's malware galaxy
 
Code Analysis-run time error prediction
Code Analysis-run time error predictionCode Analysis-run time error prediction
Code Analysis-run time error prediction
 
Euro python 2015 writing quality code
Euro python 2015   writing quality codeEuro python 2015   writing quality code
Euro python 2015 writing quality code
 
Does Your IBM i Security Meet the Bar for GDPR?
Does Your IBM i Security Meet the Bar for GDPR?Does Your IBM i Security Meet the Bar for GDPR?
Does Your IBM i Security Meet the Bar for GDPR?
 
Skiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DSkiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in D
 
Generation of Random EMF Models for Benchmarks
Generation of Random EMF Models for BenchmarksGeneration of Random EMF Models for Benchmarks
Generation of Random EMF Models for Benchmarks
 
Jvm profiling under the hood
Jvm profiling under the hoodJvm profiling under the hood
Jvm profiling under the hood
 
Application Security
Application SecurityApplication Security
Application Security
 
Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006
 

More from Yann-Gaël Guéhéneuc

Some Pitfalls with Python and Their Possible Solutions v1.0
Some Pitfalls with Python and Their Possible Solutions v1.0Some Pitfalls with Python and Their Possible Solutions v1.0
Some Pitfalls with Python and Their Possible Solutions v1.0
Yann-Gaël Guéhéneuc
 
Advice for writing a NSERC Discovery grant application v0.5
Advice for writing a NSERC Discovery grant application v0.5Advice for writing a NSERC Discovery grant application v0.5
Advice for writing a NSERC Discovery grant application v0.5
Yann-Gaël Guéhéneuc
 
Ptidej Architecture, Design, and Implementation in Action v2.1
Ptidej Architecture, Design, and Implementation in Action v2.1Ptidej Architecture, Design, and Implementation in Action v2.1
Ptidej Architecture, Design, and Implementation in Action v2.1
Yann-Gaël Guéhéneuc
 
Evolution and Examples of Java Features, from Java 1.7 to Java 22
Evolution and Examples of Java Features, from Java 1.7 to Java 22Evolution and Examples of Java Features, from Java 1.7 to Java 22
Evolution and Examples of Java Features, from Java 1.7 to Java 22
Yann-Gaël Guéhéneuc
 
Consequences and Principles of Software Quality v0.3
Consequences and Principles of Software Quality v0.3Consequences and Principles of Software Quality v0.3
Consequences and Principles of Software Quality v0.3
Yann-Gaël Guéhéneuc
 
Some Pitfalls with Python and Their Possible Solutions v0.9
Some Pitfalls with Python and Their Possible Solutions v0.9Some Pitfalls with Python and Their Possible Solutions v0.9
Some Pitfalls with Python and Their Possible Solutions v0.9
Yann-Gaël Guéhéneuc
 
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
Yann-Gaël Guéhéneuc
 
An Explanation of the Halting Problem and Its Consequences
An Explanation of the Halting Problem and Its ConsequencesAn Explanation of the Halting Problem and Its Consequences
An Explanation of the Halting Problem and Its Consequences
Yann-Gaël Guéhéneuc
 
Are CPUs VMs Like Any Others? v1.0
Are CPUs VMs Like Any Others? v1.0Are CPUs VMs Like Any Others? v1.0
Are CPUs VMs Like Any Others? v1.0
Yann-Gaël Guéhéneuc
 
Informaticien(ne)s célèbres (v1.0.2, 19/02/20)
Informaticien(ne)s célèbres (v1.0.2, 19/02/20)Informaticien(ne)s célèbres (v1.0.2, 19/02/20)
Informaticien(ne)s célèbres (v1.0.2, 19/02/20)
Yann-Gaël Guéhéneuc
 
Well-known Computer Scientists v1.0.2
Well-known Computer Scientists v1.0.2Well-known Computer Scientists v1.0.2
Well-known Computer Scientists v1.0.2
Yann-Gaël Guéhéneuc
 
On Java Generics, History, Use, Caveats v1.1
On Java Generics, History, Use, Caveats v1.1On Java Generics, History, Use, Caveats v1.1
On Java Generics, History, Use, Caveats v1.1
Yann-Gaël Guéhéneuc
 
On Reflection in OO Programming Languages v1.6
On Reflection in OO Programming Languages v1.6On Reflection in OO Programming Languages v1.6
On Reflection in OO Programming Languages v1.6
Yann-Gaël Guéhéneuc
 
ICSOC'21
ICSOC'21ICSOC'21
Vissoft21.ppt
Vissoft21.pptVissoft21.ppt
Vissoft21.ppt
Yann-Gaël Guéhéneuc
 
Service computation20.ppt
Service computation20.pptService computation20.ppt
Service computation20.ppt
Yann-Gaël Guéhéneuc
 
Serp4 iot20.ppt
Serp4 iot20.pptSerp4 iot20.ppt
Serp4 iot20.ppt
Yann-Gaël Guéhéneuc
 
Msr20.ppt
Msr20.pptMsr20.ppt
Iwesep19.ppt
Iwesep19.pptIwesep19.ppt
Icsoc20.ppt
Icsoc20.pptIcsoc20.ppt

More from Yann-Gaël Guéhéneuc (20)

Some Pitfalls with Python and Their Possible Solutions v1.0
Some Pitfalls with Python and Their Possible Solutions v1.0Some Pitfalls with Python and Their Possible Solutions v1.0
Some Pitfalls with Python and Their Possible Solutions v1.0
 
Advice for writing a NSERC Discovery grant application v0.5
Advice for writing a NSERC Discovery grant application v0.5Advice for writing a NSERC Discovery grant application v0.5
Advice for writing a NSERC Discovery grant application v0.5
 
Ptidej Architecture, Design, and Implementation in Action v2.1
Ptidej Architecture, Design, and Implementation in Action v2.1Ptidej Architecture, Design, and Implementation in Action v2.1
Ptidej Architecture, Design, and Implementation in Action v2.1
 
Evolution and Examples of Java Features, from Java 1.7 to Java 22
Evolution and Examples of Java Features, from Java 1.7 to Java 22Evolution and Examples of Java Features, from Java 1.7 to Java 22
Evolution and Examples of Java Features, from Java 1.7 to Java 22
 
Consequences and Principles of Software Quality v0.3
Consequences and Principles of Software Quality v0.3Consequences and Principles of Software Quality v0.3
Consequences and Principles of Software Quality v0.3
 
Some Pitfalls with Python and Their Possible Solutions v0.9
Some Pitfalls with Python and Their Possible Solutions v0.9Some Pitfalls with Python and Their Possible Solutions v0.9
Some Pitfalls with Python and Their Possible Solutions v0.9
 
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
 
An Explanation of the Halting Problem and Its Consequences
An Explanation of the Halting Problem and Its ConsequencesAn Explanation of the Halting Problem and Its Consequences
An Explanation of the Halting Problem and Its Consequences
 
Are CPUs VMs Like Any Others? v1.0
Are CPUs VMs Like Any Others? v1.0Are CPUs VMs Like Any Others? v1.0
Are CPUs VMs Like Any Others? v1.0
 
Informaticien(ne)s célèbres (v1.0.2, 19/02/20)
Informaticien(ne)s célèbres (v1.0.2, 19/02/20)Informaticien(ne)s célèbres (v1.0.2, 19/02/20)
Informaticien(ne)s célèbres (v1.0.2, 19/02/20)
 
Well-known Computer Scientists v1.0.2
Well-known Computer Scientists v1.0.2Well-known Computer Scientists v1.0.2
Well-known Computer Scientists v1.0.2
 
On Java Generics, History, Use, Caveats v1.1
On Java Generics, History, Use, Caveats v1.1On Java Generics, History, Use, Caveats v1.1
On Java Generics, History, Use, Caveats v1.1
 
On Reflection in OO Programming Languages v1.6
On Reflection in OO Programming Languages v1.6On Reflection in OO Programming Languages v1.6
On Reflection in OO Programming Languages v1.6
 
ICSOC'21
ICSOC'21ICSOC'21
ICSOC'21
 
Vissoft21.ppt
Vissoft21.pptVissoft21.ppt
Vissoft21.ppt
 
Service computation20.ppt
Service computation20.pptService computation20.ppt
Service computation20.ppt
 
Serp4 iot20.ppt
Serp4 iot20.pptSerp4 iot20.ppt
Serp4 iot20.ppt
 
Msr20.ppt
Msr20.pptMsr20.ppt
Msr20.ppt
 
Iwesep19.ppt
Iwesep19.pptIwesep19.ppt
Iwesep19.ppt
 
Icsoc20.ppt
Icsoc20.pptIcsoc20.ppt
Icsoc20.ppt
 

Recently uploaded

A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Enterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptxEnterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptx
QuickwayInfoSystems3
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Yara Milbes
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 

Recently uploaded (20)

A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Enterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptxEnterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptx
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 

Icpc11b.ppt

  • 1. Can Better Identifier Splitting T h i H l F L i ? B d Di L if G j D P h k Gi li A i l Techniques Help Feature Location? Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol SEMERU 19th IEEE International Conference on Program Comprehension (ICPC’11) – Kingston, Ontario, Canada
  • 2. 2
  • 3. Textual information embeds d i k l ddomain knowledge 3
  • 4. Textual information embeds d i k l ddomain knowledge About 70% of source codeAbout 70% of source code consists of identifiers* * Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282 4
  • 5. Textual information embeds d i k l ddomain knowledge About 70% of source codeAbout 70% of source code consists of identifiers* Identifiers are important source ofIdentifiers are important source of information for maintenance tasks:information for maintenance tasks: • traceability link recovery • feature location * Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282 feature location 5
  • 6. # of Cumulative Feature Location Papers based on Textual InformationPapers based on Textual Information 6
  • 7. Related Work on IdentifiersRelated Work on Identifiers • Takang et al (JPL’96)• Takang et al. (JPL 96) – programs with full-word identifiers are more understandable than those with abbreviatedunderstandable than those with abbreviated ones • Lawrie et al (ICPC’06)• Lawrie et al. (ICPC 06) – full words and recognizable abbreviations lead to better comprehensionlead to better comprehension • Binkley et al. (ICPC’09) CamelCase style is easier to recognize than– CamelCase style is easier to recognize than underscore 7
  • 8. Related Work on IdentifiersRelated Work on Identifiers • Enslen et al (MSR’09)• Enslen et al. (MSR 09) – Samurai: algorithm for splitting identifiers (using tables of identifier frequencies)(using tables of identifier frequencies) • Guerrouj et al. (JSME’11) TIDIER: algorithm for splitting identifiers– TIDIER: algorithm for splitting identifiers (using contextual information) Other related work• Other related work – Deissenboeck and Pizka (SQJ’06), Antoniol et al (ICSM’07) Haiduc and Marcus (ICPC’08)al. (ICSM’07), Haiduc and Marcus (ICPC’08), etc. 8
  • 9. Splitting Identifiers Correctly is h llChallenging 9
  • 10. Identifier Splitting AlgorithmsIdentifier Splitting Algorithms Original Identifier userId setGID print file2deviceprint_file2device SSLCertificate MINstring USERID currentsize readadapterobjectp j tolocale imitating DEFMASKBitDEFMASKBit 10
  • 11. Identifier Splitting AlgorithmsIdentifier Splitting Algorithms Original Identifier Camel Case userId user Id setGID set GID print file2device print file 2 deviceprint_file2device print file 2 device SSLCertificate SSL Certificate MINstring MI Nstring USERID USERID currentsize currentsize readadapterobject readadapterobjectp j p j tolocale tolocale imitating imitating DEFMASKBit DEFMASK BitDEFMASKBit DEFMASK Bit 11
  • 12. Identifier Splitting AlgorithmsIdentifier Splitting Algorithms Original Identifier Camel Case Handles underscore and userId user Id setGID set GID print file2device print file 2 device underscore and digits print_file2device print file 2 device SSLCertificate SSL Certificate MINstring MI Nstring USERID USERID currentsize currentsize readadapterobject readadapterobject Fails at mixed casesp j p j tolocale tolocale imitating imitating DEFMASKBit DEFMASK BitDEFMASKBit DEFMASK Bit 12 Fails at same case identifiers
  • 13. Identifier Splitting AlgorithmsIdentifier Splitting Algorithms Original Identifier Camel Case userId user Id setGID set GID print file2device print file 2 deviceprint_file2device print file 2 device SSLCertificate SSL Certificate MINstring MI Nstring USERID USERID currentsize currentsize readadapterobject readadapterobjectp j p j tolocale tolocale imitating imitating DEFMASKBit DEFMASK BitDEFMASKBit DEFMASK Bit 13
  • 14. Identifier Splitting AlgorithmsIdentifier Splitting Algorithms Original Identifier Camel Case Samurai userId user Id user Id setGID set GID set GID print file2device print file 2 device print file 2 deviceprint_file2device print file 2 device print file 2 device SSLCertificate SSL Certificate SSL Certificate MINstring MI Nstring MIN string USERID USERID USER ID currentsize currentsize current size readadapterobject readadapterobject read adapter objectp j p j p j tolocale tolocale tol ocal e imitating imitating imi ta ting DEFMASKBit DEFMASK Bit DEF MASK BitDEFMASKBit DEFMASK Bit DEF MASK Bit 14
  • 15. Identifier Splitting AlgorithmsIdentifier Splitting Algorithms Original Identifier Camel Case Samurai userId user Id user Id setGID set GID set GID print file2device print file 2 device print file 2 device Splits some cases where CamelCase print_file2device print file 2 device print file 2 device SSLCertificate SSL Certificate SSL Certificate MINstring MI Nstring MIN string cannot USERID USERID USER ID currentsize currentsize current size readadapterobject readadapterobject read adapter objectp j p j p j tolocale tolocale tol ocal e imitating imitating imi ta ting DEFMASKBit DEFMASK Bit DEF MASK BitDEFMASKBit DEFMASK Bit DEF MASK Bit 15 Oversplits
  • 16. # of Cumulative Feature Location Papers based on Textual InformationPapers based on Textual Information 16
  • 17. # of Cumulative Feature Location Papers based on Textual InformationPapers based on Textual Information Existing feature location techniques use Camel Case splittinguse Camel Case splitting 17
  • 18. Information Retrieval FLTInformation Retrieval FLT • Generate corpus synchronized void print(TestResult result, l Ti ) h IOE i { • Preprocessing – Remove non-literals – Remove stop words long runTime) throws IOException{ printHeader(runTime); printErrors(result); printFailures(result);Remove stop words – Split identifiers – Stemming I d i p ( ); printFooter(result); } synchronized void print TestResult result • Indexing – Term-by-document matrix – Singular Value synchronized void print TestResult result long runTime throws IOException printHeader runTime printErrors result printFailures result printFooter result g Decomposition • User formulate query G t lt print TestResult result runTime IOException printHeader runTime • Generate results • Ranked list 18 O cept o p t eade u e printErrors result printFailures result printFooter result
  • 19. Information Retrieval FLTInformation Retrieval FLT • Generate corpus print Test Result result run Time IO E ti i t H d Ti i t • Preprocessing – Remove non-literals – Remove stop words Exception print Header run Time print Errors result print Failures result print Footer result Remove stop words – Split identifiers – Stemming I d i print test result result run time io exception print head run time print error • Indexing – Term-by-document matrix – Singular Value result print fail result print foot result g Decomposition • User formulate query G t lt print test result ... m1 5 1 3 ... • Generate results • Ranked list 19 m2 ... ... ... ...
  • 20. Information Retrieval FLTInformation Retrieval FLT • Generate corpus print test result ... • Preprocessing – Remove non-literals – Remove stop words m1 5 1 3 ... m2 ... ... ... ... Remove stop words – Split identifiers – Stemming I d i• Indexing – Term-by-document matrix – Singular Valueg Decomposition • User formulate query G t lt• Generate results • Ranked list 20
  • 21. IR and Dynamic Information FLTIR and Dynamic Information FLT • Generate corpus • Preprocessing – Remove non-literals – Remove stop wordsRemove stop words – Split identifiers – Stemming I d i• Indexing – Term-by-document matrix – Singular Value Decompositiong p • User formulate query Collect execution trace • Generate results • Ranked list of executed methods 21
  • 22. R h G lResearch Goal Evaluate how advanced splitting techniques impact th f f f t l ti t h ithe performance of feature location techniques 22
  • 23. Information Retrieval FLTInformation Retrieval FLT • Generate corpus • Preprocessing – Remove non-literals – Remove stop words Replace Camel Case with : •SamuraiRemove stop words – Split identifiers – Stemming I d i g ( ) •“Perfect” Splitting algorithm (Oracle) • Indexing – Term-by-document matrix – Singular Value B g Decomposition • User formulate query G t lt etterW • Generate results • Ranked list 23 Worst
  • 26. Extract Identifiers All Same split?All Identifiers p (CamelCase Samurai TIDIER) • Assume they are correct B ildi Concordant YES • Manually verified a sample Building the Oracle Concordant Split Identifiers 26 sample • Threat to validity
  • 27. Manually Split IdentifiersIdentifiers Extract Identifiers Manual Split All Discordant Split Same split? NO All Identifiers Split Identifiers p (CamelCase Samurai TIDIER) B ildi Concordant YES Building the Oracle Concordant Split Identifiers 27
  • 28. Manually Split IdentifiersIdentifiers Extract Identifiers Manual Split Consensus between authors All Discordant Split Same split? NO Checked source codeAll Identifiers Split Identifiers p (CamelCase Samurai TIDIER) Concordant YES B ildi Concordant Split Identifiers Building the Oracle 28
  • 29. Manually Split Identifiers Identifiers that could not be split • Examples: DT, i3, Identifiersnot be split P754, zzz, etc. • Left unchanged Extract Identifiers Manual Split Left unchanged All Discordant Split Same split? NO All Identifiers Split Identifiers p (CamelCase Samurai TIDIER) Concordant YES B ildi Concordant Split Identifiers Building the Oracle 29
  • 30. Design of the Case StudyDesign of the Case Study 30
  • 31. Design of the Case StudyDesign of the Case Study • RQ: Does a FLT with an advanced• RQ: Does a FLT with an advanced splitting algorithm produce better results than the same FLT using the CamelCasethan the same FLT using the CamelCase splitting algorithm? 31
  • 32. How to Compare two FLTs?How to Compare two FLTs? 32
  • 33. How to Compare two FLTs? • Effectiveness measure for each feature How to Compare two FLTs? • Effectiveness measure for each feature Method LSI IR G ld t th d Method LSI score M121 0.92 M64 0.89 Gold set methodM15 0.86 M39 0.80 M7 0.74 M 0 65M152 0.65 M234 0.56 M12 0.54 M 0 52 Effectiveness = 5 M78 0.52 33
  • 34. How to Compare two FLTs? • Effectiveness measure for each feature How to Compare two FLTs? • Effectiveness measure for each feature Method LSI IR IRDynMethod LSI score M121 0.92 M64 0.89 Method LSI score M15 0.86Gold set method y M15 0.86 M39 0.80 M7 0.74 M 0 65 M7 0.74 M234 0.56 M12 0.54 Effectiveness = 2M152 0.65 M234 0.56 M12 0.54 M 0 52 ect e ess M78 0.52 34Method Executed method (from trace)
  • 35. Which FLTs are we Comparing?Which FLTs are we Comparing? 35
  • 36. Software SystemsSoftware Systems • Rhino 1 6R5• Rhino 1.6R5 • 138 classes, 1,870 methods, 32K LOC E dd l ’ d *• Eaddy et al.’s data* • 2 datasets Dataset Size Queries Gold Sets Execution Information RhinoFeatures 241 Sections of ECMAScript documentation Eaddy et al.* Full Execution Traces (from unit tests) Rhi 143 B titl d E dd t l * N/A 36 * http://www.cs.columbia.edu/~eaddy/concerntagger/ RhinoBugs 143 Bug title and description Eaddy et al.* (CVS) N/A
  • 37. Software SystemsSoftware Systems • jEdit 4 3• jEdit 4.3 • 483 classes, 6.4K methods, 109K LOC 2 d t t• 2 datasets Dataset Size Queries Gold Sets Execution Information jEditFeatures 64 Feature (or Patch) title and description SVN Marked Execution Traces jEditBugs 86 Bug title and description SVN Marked Execution Traces 37 Datasets available at: http://www.cs.wm.edu/semeru/data/icpc11-identifier-splitting/
  • 38. Generating the jEdit Datasets SVN Commits between v4.2-v4.3 38
  • 39. Generating the jEdit Datasets SVN commit message Title Description + Query = Query
  • 40. Generating the jEdit Datasets Changed files Previous V i Current V i Compare using Eclipse AST Version Version c pse S Modified methods (gold set) 40
  • 42. Presenting the Results Box plot of all effectiveness measure in datasets (e.g., 241 datapoints for RhinoFeatures) Average Median 42
  • 45. IR FLTs S RhinoFeatures RhinoBugs Similar median and average RhinoFeatures RhinoBugs 45jEditFeatures jEditBugs
  • 46. IR FLTs S RhinoFeatures RhinoBugs Similar median and average RhinoFeatures RhinoBugs Datasets with features have better results than datasets with bugs 46jEditFeatures jEditBugs
  • 47. IRDyn FLTs S N/A RhinoFeatures RhinoBugs Similar median and average RhinoFeatures RhinoBugs 47jEditFeatures jEditBugs
  • 48. IRDyn FLTs S N/A RhinoFeatures RhinoBugs Similar median and average RhinoFeatures RhinoBugs Datasets withatasets t features have better results than datasetsthan datasets with bugs 48jEditFeatures jEditBugs
  • 49. Compare FLTs by PercentagesCompare FLTs by Percentages IROracle IRCamelCase (Baseline)(Baseline) 10 17 20 1520 15 18 18 5 9 4 16 19 7 12 28 49 14 15
  • 50. Compare FLTs by PercentagesCompare FLTs by Percentages IROracle IRCamelCase (Baseline) 5/8 (Baseline) 10 17 20 1520 15 18 18 2/8 5 9 4 16 2/8 19 7 12 28 50 14 15
  • 52. IRIR Datasets with features RhinoFeatures RhinoBugs Datasets with features vs. Datasets with bugs 52 jEditFeatures jEditBugs
  • 54. Statistical ResultsStatistical Results • Wilcoxon signed-rank test• Wilcoxon signed rank test • Null hypothesis Th i t ti ti l i ifi diff– There is no statistical significance difference in terms of effectiveness between IR /IR l and IR lIRSamurai/IROracle and IRCamelCase • Alternative hypothesis IR /IR h t ti ti ll i ifi tl– IRSamurai/IROracle has statistically significantly higher effectiveness than IRCamelCase l h 0 05• alpha = 0.05 54
  • 55. IRIR RhinoFeatures RhinoBugsThe only statistical significant resultsignificant result (p=0.05) 55 jEditFeatures jEditB
  • 56. Qualitative ResultsQualitative Results • Vocabulary mismatch between queries and code:and code: – Name of developers (e.g., Slava, Carlos) Id ifi ifi i i (– Identifiers specific to communication (e.g., thanks, greetings, annoying) 56
  • 57. Qualitative ResultsQualitative Results • Features are more “descriptive” than bugsbugs 57
  • 58. Qualitative ResultsQualitative Results • Features are more “descriptive” than bugsbugs Words “join” and “line” are not mentioned 58
  • 59. Threats to ValidityThreats to Validity • External – 2 Java applications (different domains) – More systems needed • Construct – Errors may be present in Oracle and gold setsy p g – We used data produced by other researchers • Internal• Internal – Subjectivity and bias in building the Oracle Conclusion• Conclusion – Non-parametric test: Wilcoxon signed-rank 59
  • 60. Research Questions • RQ1 Does IRSamurai outperform IRCamelCase in terms of effectiveness? NO • RQ2 Does IRS iDyn outperform IRC lC Dyn• RQ2 Does IRSamuraiDyn outperform IRCamelCaseDyn in terms of effectiveness? NO • RQ3 Does IROracle outperform IRCamelCase in terms f ff ti ? I (Rhi )of effectiveness? In some cases (Rhino) • RQ4 Does IROracleDyn outperform IRCamelCaseDyn in terms of effectiveness? NO 60
  • 61. Future WorkFuture Work • More systems and datasets• More systems and datasets • Different maintenance tasks T bilit li k– Traceability link recovery • Consider other splitting algorithms 61
  • 62. ConclusionsConclusions • Advanced splitting technique could• Advanced splitting technique could improve FLTs We found some empirical evidence– We found some empirical evidence • Splitting has more impact on IR FLT f f l bl• If execution information is available, it is not necessary to use an advance splitting h itechnique 62
  • 63. Thank you! Questions?Thank you! Questions? SEMERU @ William and MarySEMERU @ William and Mary http://www.cs.wm.edu/semeru/ bdi dbdit@cs.wm.edu SEMERU 63
  • 64. ReferencesReferences • Takang et al. (1996) Takang, A., Grubb, P., and Macredie, R., "The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Investigation", Journal of Programming Languages, vol. 4, no. 3, 1996, pp. 143-167 • Lawrie et al (2006) Lawrie D Morrell C Feild H and Binkley D• Lawrie et al. (2006) Lawrie, D., Morrell, C., Feild, H., and Binkley, D., "What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June 14-16 2006, pp. 3-12 • Binkley et al. (2009) Binkley, D., Davis, M., Lawrie, D., and Morrell, C.,y ( ) y, , , , , , , , "To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May 17-19 2009, pp. 158-167 • Enslen et al. (2009) Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K., "Mi i S C d A i ll S li Id ifi f S f"Mining Source Code to Automatically Split Identifiers for Software Analysis", in Proc. of IEEE MSR'09, May 16-17 2009, pp. 71-80 • Guerrouj et al. (2011) Guerrouj, L., Di Penta, M., Antoniol, G., and Guéhéneuc Y -G "TIDIER: An Identifier Splitting Approach using SpeechGuéhéneuc, Y. G., TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques", JSME, vol. to appear, 2011 64