Context-Aware Source Code Vocabulary Normalization
for
Software Maintenance
Presentation of the Ph.D. Defense
August 19, 2013
DGIGL - SOCCER Lab, Ptidej Team
École Polytechnique de Montréal, Québec, Canada

Latifa GUERROUJ
latifa.guerrouj@polymtl.ca
Outline
- Research Context & Problem Statement
- Thesis
- Context-Awareness for Source Code Vocabulary Normalization
- Context-Aware Approaches for Vocabulary Normalization
- Impact of Advanced Identifier Splitting on Traceability Recovery
- Impact of Advanced Identifier Splitting on Feature Location
- Conclusion and Future Work

2/59
Textual information embeds domain
knowledge

About 70% of source code consists of
identifiers*

Identifiers are an important source of
information for maintenance tasks such as:
- Traceability link recovery
- Feature location

4/59

* Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming",
Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
Enslen et al. (MSR’09):
Samurai: splits identifiers by mining term frequencies in
a large corpus of programs.
Lawrie et al. (WCRE’10, ICSM’11):
GenTest: generates all splittings and evaluates a scoring
function against each one.
Normalize: a refinement of GenTest towards expansion based
on a machine-translation technique.

Example of Java code using
meaningful identifiers - ibatis

Example of Feature Location results - ibatis
Research Context & Problem Statement

Vocabulary mismatch
Requirements

Example of C code
identifiers - (gcl-2.6.7)

Normalizing Source
Code Vocabulary !?

Normalization:
- Splitting: bfd abs section ptr
- Expansion: binary file descriptor absolute section pointer

6/59
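
The two normalization steps above can be sketched as follows. This is a minimal illustration, not the thesis tooling: it assumes the identifier is written with underscores (`bfd_abs_section_ptr`), and the `EXPANSIONS` dictionary is a hypothetical hand-made table that only covers this example.

```python
# Hypothetical abbreviation table; entries mirror the slide's example only.
EXPANSIONS = {
    "bfd": "binary file descriptor",
    "abs": "absolute",
    "ptr": "pointer",
}

def split_identifier(identifier):
    """Splitting step: cut on underscores (assumed separator for this example)."""
    return [term for term in identifier.split("_") if term]

def expand_terms(terms, expansions=EXPANSIONS):
    """Expansion step: replace known abbreviations, keep other terms unchanged."""
    return [expansions.get(term, term) for term in terms]

terms = split_identifier("bfd_abs_section_ptr")
print(" ".join(terms))                 # splitting result
print(" ".join(expand_terms(terms)))   # expansion result
```

Real identifiers rarely come with a ready-made table; choosing the right expansion for a term such as "abs" is exactly where context is expected to help.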
Thesis

Overarching Research Question of the Thesis
Can we automatically resolve the vocabulary mismatch between source code
and other software artifacts, using context, to support software
maintenance tasks such as feature location and traceability recovery?

7/59
Thesis
Thesis Phases

Context-Awareness
for Source Code
Vocabulary
Normalization

Context-Aware
Normalization
Approaches
(TIDIER & TRIS)

Context is relevant
(EMSE’13)

TIDIER: Inspired by
Speech Recognition
(CSMR’10, JSEP’13)
TRIS: Fast Solution Dealing
with normalization as an
Optimization Problem
(WCRE’12)

Impact of Advanced
Identifier
Splitting on
Traceability Recovery

Advanced
Identifier Splitting
Can Help Traceability
Recovery

Impact of Advanced
Identifier
Splitting on
Feature Location

Advanced
Identifier Splitting
Can Help Feature
Location
(ICPC’11)
Contribution 1:
Context-Awareness for Source
Code Vocabulary Normalization
Context-Awareness for Normalization
Experiments’ Definition and Planning

Two experiments (Exp I and II) with 63 participants asked to split/expand
identifiers from C programs with different contexts to investigate:

- Accuracy in dealing with identifiers’ terms consisting of plain English
words, abbreviations, and acronyms;
- Effect of contextual information;
- Effect of factors: participants’ background, programming expertise,
domain knowledge, and English proficiency.

10/59
Context-Awareness for Normalization
Exp I & II Subjects
Characteristic

Master

9

6

Ph.D.

28

10

1

2

Basic

11

6

Medium

23

5

9

10

Bad

8

1

Good

8

9

18

6

Excellent

8 (7)

11(6)

Occasional

12

10

Basic usage

13

6

Knowledgeable but
not expert

17

5

Expert

11/59

3

Very good

Linux Knowledge

5

Expert
English
Proficiency

# of participants
Exp II (21)

Post-doc
C Programming
Experience

# of participants
Exp I (42)

Bachelor
Program of
studies

Level

0

0

Participants’ characteristics and background (63 participants in total).
Context-Awareness for Normalization
Objects: identifiers from # open-source C applications &…
GNU Projects (337 Projects)

FreeBSD

C

C++

.h

Files

57, 268

13,445

39,257

Size
(KLOCs)

25,442

2,846

Identifiers

1,154,280

Oracle

927

C

C++

.h

Files

13,726

128

7,846

6,062

Size
(KLOCs)

1,800

128

8,016

-

619,652

Identifiers

634,902

-

278,659

-

26

Oracle

20

-

0

Linux Kernel

Apache Web Server

C
12,581

-

11,166

Size
(KLOCs)

8,474

-

1,994

Identifiers

845,335

-

352,850

Oracle

73

-

4

C

C++

.h

Files

559

-

254

Size
(KLOCs)

293

-

44

Identifiers

33,062

-

11,549

Oracle

11

-

0

.h

Files

12/59

C++

Main characteristics of the 340 projects for the sampled identifiers.
Context-Awareness for Normalization
Context (Internal & External) made available to participants.
Context Levels

Exp I

Exp II

no context (control group)





function



file





file plus AF





application



application plus AF



Context levels provided during Exp I and Exp II (AF = Acronym Finder).

Experimental Design: Randomized Block Procedure
13/59
Context-Awareness for Normalization
Research Questions


RQ1: To what extent does context impact splitting/expansion of
identifiers?



RQ2: To what extent do the characteristics of identifiers’ terms
affect the normalization performances?



RQ3: To what extent do level of experience, programming language
(C), domain knowledge, and English proficiency impact the
normalization?

14/59
Context-Awareness for Normalization

Experiments’ Results – RQ1 (Context Relevance)

[Figure: boxplots of F-measure per context level. Exp I: noContext, function, file, file+AF. Exp II: noContext, file, file+AF, app, app+AF.]

Boxplots of F-measure: Exp I and II context levels.
15/59
Context-Awareness for Normalization
Experiments’ Results – RQ1 (Context Relevance)

Exp I
- Context significantly increases participants’ performance.
- File-level context yields better performance than function-level context.

Exp II
- Application-level context does not improve performance further.

16/59
Context-Awareness for Normalization
Experiments’ Results – RQ2 (Effect of Kind of Terms)
Exp I

Context        Kind of Terms   #Matched   #Unmatched   Accuracy (%)
file plus AF   abbreviation    523        169          75.58
               acronym         112        31           78.32
               plain           336        50           87.05
file           abbreviation    542        164          76.77
               acronym         94         32           74.60
               plain           346        50           87.37
function       abbreviation    582        161          78.33
               acronym         97         36           72.93
               plain           374        52           87.79
no context     abbreviation    467        248          65.31
               acronym         82         47           63.57
               plain           326        75           81.30
OVERALL        abbreviation    2114       742          74.02
               acronym         385        146          72.50
               plain           1382       227          85.89

Exp I: Proportions of kind of identifiers’ terms correctly expanded per context level.
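
The accuracy column is consistent with matched / (matched + unmatched). The slide does not state the formula, so this reading is an assumption, checked here against the Exp I "OVERALL" rows.

```python
# (matched, unmatched) counts copied from the Exp I "OVERALL" rows.
OVERALL = {
    "abbreviation": (2114, 742),
    "acronym": (385, 146),
    "plain": (1382, 227),
}

def accuracy(matched, unmatched):
    """Percentage of terms that were correctly expanded (assumed definition)."""
    return 100.0 * matched / (matched + unmatched)

for kind, (matched, unmatched) in OVERALL.items():
    print(f"{kind}: {accuracy(matched, unmatched):.2f}%")
```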
17/59
Context-Awareness for Normalization
Experiments’ Results – RQ2 (Effect of Kind of Terms)
Exp II
Kind of Terms

application plus
AF

abbreviation
acronyms
plain

274
57
181

69
13
17

79.88
81.43
91.41

application

abbreviation
acronyms
plain

542
94
346

164
32
50

75.35
82.61
90.45

file plus AF

abbreviation
acronyms
plain

582
97
374

161
36
52

82.87
86.30
91.67

file

abbreviation
acronyms
plain

467
82
326

248
47
75

76.60
85.07
92.57

no context

abbreviation
acronym
plain

2114
385
1382

742
146
227

67.98
76.12
83.94

OVERALL

18/59

#Matched

#Unmatched

Accuracy
(%)

Context

abbreviation
acronym
plain

1349
285
861

415
61
96

76.47
82.37
89.97

Exp II: Proportions of kind of identifiers’ terms correctly expanded per context level.
Context-Awareness for Normalization
Experiments’ Results – RQ3 (Effect of Part. Characteristics)
Exp II

Factor          p-value
Context         <0.001
Linux           0.037
Context:Linux   0.988

F-measure: two-way permutation test by context &
knowledge of Linux.

Factor            p-value (Exp I)   p-value (Exp II)
Context           <0.001            <0.001
English           0.032             0.044
Context:English   0.054             0.698

F-measure: two-way permutation test by context & English
Proficiency.
19/59
Context-Awareness for Normalization
Conclusion
- Context is relevant for vocabulary normalization;
- No significant difference in the accuracy of splitting/expanding
abbreviations and acronyms;
- Participants exploit context better when they have a good level of English;
- English is used besides domain knowledge (Exp II) to normalize
identifiers.

Context is useful for
source code vocabulary normalization
20/59
Contribution 2:
Context-Aware Source Code
Vocabulary Normalization
Approaches: TIDIER & TRIS
TIDIER Overview
Developers generate identifiers and contractions using:


Terms and words reflecting domain concepts, developers’ experience
or knowledge;



A finite set of transformation rules:

- Dropping all vowels: pointer → pntr
- Dropping a random vowel: user → usr
- Dropping a random character: pntr → ptr
- Dropping suffix (ing, tion, ment...): available → avail
- Dropping the last m characters: rectangle → rect

22/59
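
The transformation rules above can be sketched as small string functions. This is an illustrative sketch, not TIDIER's implementation: the function names are hypothetical, keeping the first letter is an assumption, and "able" is added to the suffix list only because of the available → avail example.

```python
import random

VOWELS = set("aeiou")
# "ing", "tion", "ment" come from the slide; "able" is an assumption.
SUFFIXES = ("ing", "tion", "ment", "able")

def drop_all_vowels(word):
    """Remove every vowel except a leading one (assumed convention)."""
    return word[0] + "".join(c for c in word[1:] if c not in VOWELS)

def drop_random_vowel(word, rng=random):
    """Remove one randomly chosen vowel, never the first character."""
    positions = [i for i, c in enumerate(word) if i > 0 and c in VOWELS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + word[i + 1:]

def drop_random_char(word, rng=random):
    """Remove one randomly chosen character, never the first one."""
    if len(word) < 2:
        return word
    i = rng.randrange(1, len(word))
    return word[:i] + word[i + 1:]

def drop_suffix(word):
    """Remove the first matching suffix from SUFFIXES, if any."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

def drop_last_chars(word, m):
    """Remove the last m characters, keeping at least one character."""
    return word[:-m] if 0 < m < len(word) else word

print(drop_all_vowels("pointer"))       # pntr
print(drop_random_vowel("user"))        # usr (only one droppable vowel)
print(drop_suffix("available"))         # avail
print(drop_last_chars("rectangle", 5))  # rect
```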
TIDIER Overview
TIDIER is novel and uses context in the form of:
Context-aware dictionaries enriched by the use of domain
knowledge.

TIDIER relies on a search-based technique to normalize
identifiers:
- A distance computed with Dynamic Time Warping (DTW), as used
for connected word recognition in speech (Ney, IEEE Trans. ASSP’84);
- Hill Climbing.
23/59
TIDIER Normalization Strategy
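1. DTW match: match the identifier against the current dictionary and keep the best matching.
2. Zero distance? If yes: success.
3. If no: randomly select a word with a minimal distance <> 0.
4. Apply a random transformation to the chosen word.
5. Add the transformed word to a temporary dictionary and run the DTW match again.
6. Reduced distance for the best matching? If yes: add the word to the current dictionary.
   If no: discard the word from the temporary dictionary and, if there are other transformations to apply, go back to step 4.

24/59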
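
The loop above can be sketched as a small hill climber. This is a simplified stand-in, not TIDIER itself: plain Levenshtein distance replaces the DTW distance, the only transformation is "drop one random vowel", and all function names are hypothetical.

```python
import random

VOWELS = "aeiou"

def edit_distance(a, b):
    """Levenshtein distance, used here as a simple stand-in for DTW."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def best_match(identifier, dictionary):
    """Cheapest segmentation of identifier into dictionary entries: (cost, entries)."""
    best = [(0, [])]
    for end in range(1, len(identifier) + 1):
        options = []
        for start in range(end):
            chunk = identifier[start:end]
            entry = min(dictionary, key=lambda w: edit_distance(chunk, w))
            options.append((best[start][0] + edit_distance(chunk, entry),
                            best[start][1] + [entry]))
        best.append(min(options, key=lambda o: o[0]))
    return best[-1]

def drop_random_vowel(word, rng):
    """One contraction rule: drop a random vowel, never the first letter."""
    positions = [i for i, c in enumerate(word) if i > 0 and c in VOWELS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + word[i + 1:]

def hill_climb(identifier, words, steps=200, seed=0):
    rng = random.Random(seed)
    origin = {w: w for w in words}              # dictionary entry -> original word
    cost, entries = best_match(identifier, origin)
    for _ in range(steps):
        if cost == 0:                           # zero distance: success
            break
        chosen = rng.choice(entries)
        candidate = drop_random_vowel(chosen, rng)
        if candidate in origin:                 # nothing new to try
            continue
        trial = dict(origin)                    # temporary dictionary
        trial[candidate] = origin[chosen]
        new_cost, new_entries = best_match(identifier, trial)
        if new_cost < cost:                     # keep only improving words
            origin, cost, entries = trial, new_cost, new_entries
    return cost, [origin[e] for e in entries]

print(hill_climb("pntrusr", ["pointer", "user"]))
```

The sketch accepts a transformed word only when it strictly reduces the matching distance, which mirrors the keep-or-discard decision in the flow above.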
TIDIER Case Study
Research Questions
- RQ1: How does TIDIER compare with alternatives when C
identifiers must be split?

- RQ2: How sensitive is the performance of TIDIER to the use of
context and specialized knowledge?

- RQ3: What percentage of identifiers with abbreviations is TIDIER
able to map to dictionary words?
Analyzed Systems (Benchmark used in Context study)
25/59
Identifier Splitting for Traceability Recovery
Camel Case & Samurai Techniques
Original Identifier   Camel Case            Samurai
userId                user Id               user Id
setGID                set GID               set GID
print_file2device     print file 2 device   print file 2 device
SSLCertificate        SSL Certificate       SSL Certificate
MINstring             MI Nstring            MIN string
USERID                USERID                USER ID
currentsize           currentsize           current size
readadapterobject     readadapterobject     read adapter object
tolocale              tolocale              tol ocal e
imitating             imitating             imi ta ting
DEFMASKBit            DEFMASK Bit           DEF MASK Bit

- Samurai splits some cases where CamelCase cannot (e.g., USERID, currentsize, readadapterobject).
- Samurai oversplits others (e.g., tolocale, imitating).
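
The Camel Case column can be reproduced with a few regular-expression rules. This is a minimal sketch of a convention-based splitter; the function name `camel_case_split` is hypothetical and the rules are one common reading of camel-case splitting, not necessarily the exact tool used in the study.

```python
import re

def camel_case_split(identifier):
    """Split on underscores, digit boundaries, and case changes."""
    s = identifier.replace("_", " ")
    # lower-to-upper boundary: userId -> user Id
    s = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", s)
    # end of an upper-case run before a capitalized word: SSLCertificate -> SSL Certificate
    s = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", s)
    # letter/digit boundaries: file2device -> file 2 device
    s = re.sub(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", s)
    return s.split()

for name in ["userId", "setGID", "print_file2device", "SSLCertificate",
             "MINstring", "USERID", "currentsize", "DEFMASKBit"]:
    print(name, "->", " ".join(camel_case_split(name)))
```

Because the rules only look at case, digits, and underscores, same-case identifiers such as USERID or currentsize are left unsplit, which is the limitation the table illustrates.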
28/59
TIDIER Results
Results

Performances of Camel Case, Samurai, and TIDIER when using different dictionaries.

29/59

TIDIER outperforms previous approaches on C and is the first to
produce a correct mapping for abbreviations: 48% (35/73).
Contribution 2:
Context-Aware Source Code
Vocabulary Normalization
Approaches: TIDIER & TRIS
TRIS Overview
TRIS is a novel approach that treats normalization as an
optimization (minimization) problem.
The aim is to minimize the following cost function:

C(wOrig → w) = α * Freq(wOrig) + C(type(wOrig → w))

- Freq(wOrig): frequency of wOrig in the source code
- C(type(wOrig → w)): cost of the transformation type

31/59
TRIS Normalization Strategy
Phase 1: Building Transformations
Inputs: 1. Source Code, 2. Dictionaries
- Computation of dictionary word frequencies
- Building the set of possible transformations
- Construction of the arborescence of transformations

Phase 2: Identifier Processing
Inputs: 1. Identifier, 2. Arborescence
- Identifier auxiliary graph creation
- Optimal split/expansion search

32/59
TRIS Case Study
Research Question
RQ: What is the accuracy of TRIS compared with alternative state-of-the-art approaches?

Analyzed Systems

System     Language   Files   Size (KLOC)   Identifiers   Oracle
JHotDraw   Java       155     16            2,348         957
Lynx       C          247     174           12,194        3,085

Lawrie et al. Data Set
Programs   C (MLOC)   C++ (MLOC)   Java (MLOC)
186        26         15           7

489 C/C++ identifiers sampled from the projects used in TIDIER
33/59

Main characteristics of the systems analyzed using TRIS.
TRIS Results
Results

Mean of F-measure on Lynx (C system).
Approach 1   Approach 2   Adj p-value   Cliff's d
TRIS         Camel Case   <0.001        0.743
TRIS         Samurai      <0.001        0.684
TRIS         TIDIER       <0.001        0.204

Results of Wilcoxon paired test & Cliff’s Delta effect size on Lynx.
Cliff’s delta Interpretation:
- small: 0.148 <= d < 0.33, medium: 0.33 <= d < 0.474, and large: d >= 0.474

34/59
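
Cliff's delta and the interpretation thresholds above can be sketched as follows. The pairwise definition is the standard one; the label "negligible" for values below 0.148 is an assumption, since the slide only names small, medium, and large.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (#pairs)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def interpret(d):
    """Map |d| to the effect-size labels listed on the slide."""
    d = abs(d)
    if d >= 0.474:
        return "large"
    if d >= 0.33:
        return "medium"
    if d >= 0.148:
        return "small"
    return "negligible"  # assumed label, not given on the slide

# Effect sizes reported for TRIS on Lynx.
for other, d in [("Camel Case", 0.743), ("Samurai", 0.684), ("TIDIER", 0.204)]:
    print(f"TRIS vs {other}: d = {d} ({interpret(d)})")
```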
TRIS Results
Results

Identifier splitting correctness on the data set from Lawrie et al.

- TRIS performs better than the other approaches, with a medium to large
effect size on C;
- TRIS outperforms Samurai by 16% and GenTest by 4%.

35/59
TRIS Results
Results

Mean of F-measure on the 489 C sampled identifiers.

Statistically significant difference using Wilcoxon:
- p-value < 0.001;
- Cliff’s d effect size is medium (d = 0.456).
36/59
Contribution 3:
Impact of Advanced Identifier Splitting on
Traceability Recovery
Identifier Splitting for Traceability Recovery

Research Question
RQ: How do different identifier splitting strategies (CamelCase,
Samurai and Oracle) impact Traceability Recovery?

Traceability Recovery Techniques Configurations
Splitting strategy   LSI            VSM
CamelCase            LSICamelCase   VSMCamelCase
Samurai              LSISamurai     VSMSamurai
Oracle               LSIOracle      VSMOracle

Configurations of the studied Traceability Recovery techniques.

38/59
Identifier Splitting for Traceability Recovery

Analyzed Systems
Systems (Java)   Version   # Requirements   # Classes
iTrust           10        35               218
Pooka            2.0       90               298

System (C)   Version   # Files   Size (KLOCs)   # Methods
Lynx         2.8.5     247       174            2,067

Main characteristics of the studied systems.

39/59
Identifier Splitting for Traceability Recovery
Results (%)
          Precision                                 Recall
Systems   LSICamelCase   LSISamurai   LSIOracle    LSICamelCase   LSISamurai   LSIOracle
iTrust    36.49          36.49        28.39        36.61          36.61        34.23
Pooka     14.06          14.14        15.64        22.81          22.37        22.36
Lynx      45.43          39.08        39.40        41.99          40.82        41.55

          Precision                                 Recall
Systems   VSMCamelCase   VSMSamurai   VSMOracle    VSMCamelCase   VSMSamurai   VSMOracle
iTrust    48.99          48.99        25.81        23.77          23.77        23.07
Pooka     40.54          40.54        42.07        11.59          11.63        12.19
Lynx      64.26          57.84        49.91        37.66          37.05        40.16

40/59

Precision and Recall of the Traceability Recovery techniques configurations
for iTrust, Pooka, and Lynx.
Identifier Splitting for Traceability Recovery
Results and Discussion
- Potential benefits of developing advanced vocabulary normalization
approaches.
- Mismatch resulting from the requirements (presence of acronyms in
requirements).
- Case of Lynx (noise in data): requirement 534 is “the browser should be able to
manage store erase session I information”, whereas the C method
LYMain.c.i__nobrowse_fun is related to the browse directories functionality.
- Baseline splitting keeps “nobrowse”, and thus creates no link between requirement 534 and
LYMain.c.i__nobrowse_fun.txt.
- Samurai and the manual oracle split the identifier “nobrowse” into “no browse” and link
the file LYMain.c.i__nobrowse_fun.txt.

Potential benefits of developing advanced
normalization approaches
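
The Lynx case can be illustrated with a tiny vector-space comparison. This is a sketch under stated assumptions: raw term counts instead of a weighted VSM, a toy `stem` function instead of a real stemmer, and the requirement text quoted as it appears on the slide.

```python
import math
from collections import Counter

def stem(term):
    """Toy stemmer (assumption): strip one common ending from longer terms."""
    term = term.lower()
    for ending in ("er", "e", "s"):
        if len(term) > 4 and term.endswith(ending):
            return term[: -len(ending)]
    return term

def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of stemmed terms."""
    a, b = Counter(map(stem, terms_a)), Counter(map(stem, terms_b))
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

requirement = ("the browser should be able to manage store erase "
               "session I information").split()
baseline_terms = ["nobrowse", "fun"]      # identifier left unsplit
split_terms = ["no", "browse", "fun"]     # Samurai / oracle split

print(cosine(requirement, baseline_terms))  # no shared term
print(cosine(requirement, split_terms))     # "browser" and "browse" share a stem
```

With the unsplit identifier the two documents share no term, so no link is retrieved; once "nobrowse" is split, the shared stem produces a non-zero similarity and the link appears.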
41/59
Contribution 4:
Impact of Advanced Identifier Splitting on
Feature Location
Identifier Splitting for Feature Location
Research Question
RQ: How do different identifier splitting strategies (CamelCase,
Samurai and Oracle) impact Feature Location?

Feature Location Techniques (FLTs) Configurations
Splitting strategy   IR FLT        IRDyn FLT
CamelCase            IRCamelCase   IRCamelCaseDyn
Samurai              IRSamurai     IRSamuraiDyn
Oracle               IROracle      IROracleDyn

Feature Location techniques configurations studied.

43/59
Identifier Splitting for Feature Location
Analyzed Systems
System
Version

Size
(KLOC)

Classes

Methods

# Data Sets

32

138

1,870

Eaddy et al.’s data* (2)

109

483

6.4

2

Rhino 1.6R5
jEdit 4.3
Dataset         Size   Queries                                    Gold Sets             Execution Information
RhinoFeatures   241    Sections of ECMAScript                     Eaddy et al.*         Full Execution Traces (from unit tests)
RhinoBugs       143    Bug title and description                  Eaddy et al.* (CVS)   N/A
jEditFeatures   64     Feature (or Patch) title and description   SVN                   Marked Execution Traces
jEditBugs       86     Bug title and description                  SVN                   Marked Execution Traces

Characteristics of the main analyzed systems.
44/59

* http://www.cs.columbia.edu/~eaddy/concerntagger/
Identifier Splitting for Feature Location
Results
[Figure: effectiveness measure of the IR FLTs; panels: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs]

- Similar median & average of the effectiveness measure.
- Datasets with features have better results than datasets with bugs.

45/59
Identifier Splitting for Feature Location
Results

RhinoFeatures

46/59

RhinoBugs

jEditFeatures

jEditBugs
Identifier Splitting for Feature Location
Results

RhinoFeatures

47/59

jEditFeatures

Statistically
significant result
(p=0.05)

RhinoBugs

jEditBugs
Identifier Splitting for Feature Location

Results and Discussion
- Samurai and CamelCase produced similar results;

- IROracle outperforms IRCamelCase in terms of the effectiveness measure
on the RhinoFeatures dataset;
- When only textual information is available, an improved splitting technique
can help improve the effectiveness of feature location.

- Samurai oversplits identifiers into many meaningless terms. In Rhino:
debugAccelerators becomes debug Ac ce le r at o rs (CamelCase is better in such cases).

48/59
Identifier Splitting for Feature Location
Vocabulary mismatch between queries and code
- Inconsistencies between the identifiers used in the queries and the
identifiers used in the code.

- The mismatch is less noticeable for features and more severe for bugs.

- The description of jEdit’s feature #16084869 (“Support “thick” caret”) contains
identifiers found in the names of the methods (e.g., thick,
caret, text, area, etc.).
- Names of developers (e.g., Slava, Carlos) and identifiers specific to
communication (e.g., thanks, greetings, annoying) appeared only in the
query vocabulary and did not appear in the source code vocabulary.
49/59
Identifier Splitting for Feature Location
Features are more “descriptive” than bugs

Words “join”
and “line” are
not mentioned
Example of query (bugs)

Binkley et al. (ICSM’12): Normalization improves Feature Location

Potential benefits of developing advanced
normalization approaches
50/59

Conclusion
Context is relevant for source code
vocabulary normalization.
Source code files are the most helpful
A limited context such as functions does not
help
A wider context such as applications does
not improve further.
Domain knowledge improves normalization.
TRIS is novel and brings improvements
on state-of-the-art approaches on C:
92.06% vs. 85.25% for TIDIER (Lynx- C)
vs. 46.34% for Samurai
vs. 38.51% for CamelCase
86% vs. 82% for GenTest on Lawrie et al. data
vs. 70% for Samurai.
87.90% vs. 64.09% for TIDIER on the
identifiers from the 340 projects.
51/59

TIDIER is novel and performs better
than previous approaches (CamelCase &
Samurai):
54.29% of splitting correctness vs. 31.14%
for Samurai & 30.08% for Camel Case
with an application-level dictionary
augmented with domain knowledge.
TIDIER was the first to produce a correct
mapping for 48% of abbreviations.
Advanced identifier splitting strategies
improve the average of precision and recall
of some systems: Pooka & Lynx.
Advanced splitting improves feature
location using LSI: Rhino (features).
The quality of the requirements and the
expressiveness of the queries also have an impact.
Future Work
Impact of Vocabulary Normalization on Maintenance Tasks

- Evaluate our work on other systems, such as systems written in C, C++ or COBOL;
- Compare it to other works such as Normalize (Lawrie et al., ICSM’11);
- Study the impact of the quality of IR queries (Haiduc et al., ICSE’13).

Context-Aware Vocabulary Normalization Approaches
- Extend the evaluation of TIDIER and TRIS to larger systems;
- Compare the results to more recent approaches such as Normalize (Lawrie et al.,
ICSM’11) and LINSEN (Corazza et al., ICSM’12).

52/59
Future Work
Context-Awareness for Vocabulary Normalization
- Replicate our studies using eye-tracking tools;
- Implement a context model within an IDE to support program
understanding;
- Involve participants from industry.

Mining Software Repositories to Study the Impact of
Identifier Style on Software Quality
- Infer the identifier styles in open-source projects using HMM;
- Analyze whether open-source developers adapt/bring their style;
- Analyze whether identifier style can introduce bugs and/or impact internal
quality metrics such as semantic coupling & cohesion.
53/59
Publications
Articles in journals
1. Latifa Guerrouj, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol. An
Experimental Investigation on the Effects of Contexts on Source Code Identifiers Splitting
and Expansion. Empirical Software Engineering Journal (EMSE’13).
2. Latifa Guerrouj, Massimiliano Di Penta, Giuliano Antoniol, and Yann-Gaël Guéhéneuc.
TIDIER: An Identifier Splitting Approach Using Speech Recognition Techniques. Journal of
Software Evolution and Process (JSEP’13). 25(6): 569-661.

Conference Articles
3. Latifa Guerrouj, Philippe Galinier, Yann-Gaël Guéhéneuc, Giuliano Antoniol, and
Massimiliano Di Penta. TRIS: a Fast and Accurate Identifiers Splitting and Expansion
Algorithm. Proceedings of the 19th IEEE Working Conference on Reverse Engineering
(WCRE), October 2012.

54/59

4. Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol. Can Better Identifier
Splitting Techniques Help Feature Location? Proceedings of the 19th IEEE International
Conference on Program Comprehension (ICPC), June 2011.
Publications
Conference Articles
5. Nioosha Madani, Latifa Guerrouj, Massimiliano Di Penta, Yann-Gaël Guéhéneuc,
Giuliano Antoniol. Recognizing Words from Source Code Identifiers Using Speech
Recognition Techniques. Proceedings of the 14th IEEE European Conference on
Software Maintenance and Reengineering (CSMR), March 2010. Best Paper award
of CSMR’10.
6. Latifa Guerrouj. Normalizing Source Code Vocabulary to Enhance Program
Comprehension and Software Quality. Proceedings of the 35th ACM International
Conference on Software Engineering (ICSE), May 2013.
7. Latifa Guerrouj. Automatic Derivation of Concepts Based on the Analysis of Source
Code Identifiers. Proceedings of the 17th Working Conference on Reverse Engineering
(WCRE), October 2010.

55/59

8. Alberto Bacchelli, Nicolas Bettenburg, Latifa Guerrouj. Mining Unstructured Data
because “Mining Unstructured Data is Like Fishing in Muddy Waters!”. Proceedings of
the 19th Working Conference on Reverse Engineering (WCRE), October 2012.
References
LAWRIE, D., FEILD, H. and BINKLEY, D. (2006). Syntactic Identifier Conciseness and Consistency. Proceedings of the 6th
International Workshop on Source Code Analysis and Manipulation. pp. 139–148.
MAYRHAUSER, A. V. and VANS, A. M. (1995). Program Comprehension During Software Maintenance and Evolution. Computer, vol.
28, pp. 44–55.
STOREY, M.-A. D., FRACCHIA, F. D. and MÜLLER, H. A. (1999). Cognitive Design Elements to Support the Construction of a Mental Model During
Software Exploration. Journal of Systems and Software, vol. 44, pp. 171–185.
ROBILLARD, M. P., COELHO, W. and MURPHY, G. C. (2004). How Effective Developers Investigate Source Code: An Exploratory
Study. IEEE Transactions on Software Engineering, vol. 30, pp. 889–903.
KERSTEN, M. and MURPHY, G. C. (2006). Using Task Context to Improve Programmer Productivity. Proceedings of the 14th
International Symposium on Foundations of Software Engineering. pp. 1–11.
SILLITO, J., MURPHY, G. C. and VOLDER, K. D. (2008). Asking and Answering Questions during a Programming Change Task.
IEEE Transactions on Software Engineering, vol. 34, pp. 434–451.
BINKLEY, D., DAVIS, M., LAWRIE, D. and MORRELL, C. (2009). To CamelCase or Under_score. Proceedings of the 17th
International Conference on Program Comprehension. pp. 158–167.
ENSLEN, E., HILL, E., POLLOCK, L. and SHANKER, K. V. (2009). Mining Source Code to Automatically Split Identifiers for
Software Analysis. Proceedings of the 6th International Working Conference on Mining Software Repositories. pp. 16–17.
LAWRIE, D. J., BINKLEY, D. and MORRELL, C. (2010). Normalizing Source Code Vocabulary. Proceedings of the 17th Working
Conference on Reverse Engineering. pp. 112–122.

57/59
References
LAWRIE, D. and BINKLEY, D. (2011). Expanding Identifiers to Normalize Source Code Vocabulary. Proceedings of the 27th
International Conference on Software Maintenance. pp. 113–122.
CORAZZA, A., MARTINO, S. D. and MAGGIO, V. (2012). LINSEN: An Efficient Approach to Split Identifiers and Expand
Abbreviations. Proceedings of the 28th International Conference on Software Maintenance. pp. 233–242.
EISENBARTH, T., KOSCHKE, R. and SIMON, D. (2003). Locating Features in Source Code. IEEE Transactions on Software
Engineering, vol. 29, pp. 210–224.
POSHYVANYK, D., GUÉHÉNEUC, Y.-G., MARCUS, A., ANTONIOL, G. and RAJLICH, V. (2007). Feature Location Using
Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval. IEEE Transactions on Software Engineering,
vol. 33, pp. 420–432.
EADDY, M., AHO, A., ANTONIOL, G. and GUÉHÉNEUC, Y.-G. (2008a). CERBERUS: Tracing Requirements to Source Code Using
Information Retrieval, Dynamic Analysis, and Program Analysis. Proceedings of the 16th International Conference on Program
Comprehension. pp. 53–62.
BINKLEY, D., LAWRIE, D. and UEHLINGER, C. (2012). Vocabulary Normalization Improves
IR-Based Concept Location. Proceedings of the 28th International Conference on Software Maintenance, vol. 41, pp. 588–591.
ANTONIOL, G., CANFORA, G., CASAZZA, G., LUCIA, A. D. and MERLO, E. (2002). Recovering Traceability Links Between Code
and Documentation. IEEE Transactions on Software Engineering, vol. 28, pp. 970–983.
MALETIC, J. I. and COLLARD, M. L. (2009). TQL: A Query Language to Support Traceability. Proceedings of the 2009 ICSE Workshop
on Traceability in Emerging Forms of Software Engineering. pp. 16–20.

58/59
References
DE LUCIA, A., DI PENTA, M. and OLIVETO, R. (2010). Improving Source Code Lexicon via Traceability and Information
Retrieval. IEEE Transactions on Software Engineering, vol. 37, pp. 205–226.
GUERROUJ, L., DI PENTA, M., GUÉHÉNEUC, Y.-G. and ANTONIOL, G. (2013b). An Experimental Investigation on
the Effects of Context on Source Code Identifiers Splitting and Expansion. Empirical Software Engineering. Doi:
10.1016/S0164-1212(00)00029-7.
GUERROUJ, L., DI PENTA, M., ANTONIOL, G. and GUÉHÉNEUC, Y.-G. (2013a). TIDIER: An Identifier Splitting
Approach using Speech Recognition Techniques. Journal of Software Evolution and Process, vol. 25, pp. 569–661.
DIT, B., GUERROUJ, L., POSHYVANYK, D. and ANTONIOL, G. (2011). Can Better Identifier Splitting Techniques Help
Feature Location? Proceedings of the 19th International Conference on Program Comprehension. pp. 11–20.
GUERROUJ, L., GALINIER, P., GUÉHÉNEUC, Y.-G., ANTONIOL, G. and DI PENTA, M. (2012). TRIS: A Fast and
Accurate Identifiers Splitting and Expansion Algorithm. Proceedings of the 19th Working Conference on Reverse Engineering.
pp. 103–112.
MADANI, N., GUERROUJ, L., DI PENTA, M., GUÉHÉNEUC, Y.-G. and ANTONIOL, G. (2010). Recognizing Words
from Source Code Identifiers using Speech Recognition Techniques. Proceedings of the 14th European Conference on
Software Maintenance and Reengineering. pp. 68–77.
NEY, H. (1984). The Use of a One-stage Dynamic Programming Algorithm for Connected Word Recognition. IEEE
Transactions on Acoustics Speech and Signal Processing, vol. 32, pp. 263–271.

59/59
60/59
TIDIER Normalization Strategy

[Figure: DTW distance matrices matching the identifier against a dictionary of 3 words]

Identifier to split: pntrctrusr

61/59
TRIS Normalization Strategy
Information for the Dictionary Transformations Building Phase

Dictionary Words (D)   Word Frequencies   Transformations Set
d1=“able”              f1=0.1             t1=(d1, abl, 0.55), t2=(d1, able, -0.2)
d2=“call”              f2=0.2             t3=(d2, cal, 0.55), t4=(d2, call, -0.2), t5=(d2, cll, -0.2)
d3=“callable”          f3=0.6             t6=(d3, calla, 0.55), t7=(d3, callable, -0.2), t8=(d3, cllbl, -0.2)
d4=“interface”         f4=0.1             t9=(d4, int, 0.55), t10=(d4, inte, -0.2), t11=(d4, inter, -0.2),
                                          t12=(d4, interface, -0.2), t13=(d4, intrfc, 0.8)

62/59

[Figure: Arborescence of Transformations for the Dictionary D]

Dictionary Transformations Building Information for the
Identifier callableint

Auxiliary Graph for the Identifier callableint
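
The search over the auxiliary graph can be sketched as a shortest-path computation over character positions of the identifier. This is an illustrative sketch, not the TRIS implementation: it uses the third element of each transformation triple as the whole edge cost and ignores the word frequencies, because the slides do not show how the two are combined for this example, so the actual TRIS result may differ.

```python
# (dictionary word, transformed form, cost) triples copied from the slide.
TRANSFORMATIONS = [
    ("able", "abl", 0.55), ("able", "able", -0.2),
    ("call", "cal", 0.55), ("call", "call", -0.2), ("call", "cll", -0.2),
    ("callable", "calla", 0.55), ("callable", "callable", -0.2),
    ("callable", "cllbl", -0.2),
    ("interface", "int", 0.55), ("interface", "inte", -0.2),
    ("interface", "inter", -0.2), ("interface", "interface", -0.2),
    ("interface", "intrfc", 0.8),
]

def cheapest_expansion(identifier, transformations=TRANSFORMATIONS):
    """Minimum-cost path from position 0 to the end of the identifier.

    Nodes are character positions; an edge start -> end exists when a
    transformed form matches identifier[start:end]. Edges only go forward,
    so one left-to-right pass is enough, even with negative costs.
    """
    n = len(identifier)
    best = [None] * (n + 1)          # best[i] = (cost, words) for identifier[:i]
    best[0] = (0.0, [])
    for start in range(n):
        if best[start] is None:      # position not reachable
            continue
        for word, form, cost in transformations:
            if identifier.startswith(form, start):
                end = start + len(form)
                candidate = (best[start][0] + cost, best[start][1] + [word])
                if best[end] is None or candidate[0] < best[end][0]:
                    best[end] = candidate
    return best[n]                   # None when no full cover exists

print(cheapest_expansion("callableint"))
```

Modeling positions as nodes keeps the search linear in the identifier length times the number of transformations, which fits the "fast solution" framing of TRIS.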
Related Work & Contributions
Context Relevance for Program Comprehension
Mayrhauser and Vans (Computer’95)
Cognitive models rely on the programmers’ knowledge, source code and
software documentation.
M-A.D. Storey (JSS’99)
Disparities between comprehension models can be explained in terms of
differences in programmer characteristics, program characteristics &
task characteristics.
Robillard et al. (IEEE TSE’04)
A methodical and structured approach to program investigation when
performing a software change task is the most effective.
Kersten and Murphy (FSE’06)
Mylar reduces information overload and focuses a programmer’s work.
Sillito et al. (IEEE TSE’08)
Need more contexts and support to work with larger groups of entities
and relationships.

63/59

Lack of empirical evidence on the extent to which
context helps normalize source code vocabulary
Related Work & Contributions
Vocabulary Normalization Approaches
CamelCase (ICPC’09)
CamelCase: splits identifiers based on naming conventions.
Enslen et al. (MSR’09)
Samurai: splits identifiers by mining term frequencies in a large corpus of programs.

CamelCase and Samurai have the drawback of relying
on naming conventions and term frequencies, respectively.
Lawrie et al. (WCRE’10)
GenTest: generates all splittings and evaluates a scoring function against each one.
Lawrie and Binkley (ICSM’11)
Normalize: a refinement of GenTest towards expansion based on machine translation.
Corazza et al. (ICSM’12)
LINSEN: expands identifiers based on a graph model and an approximate string
matching algorithm, and exploits context.

64/59
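
To make the limitation of convention-based splitting concrete, here is a minimal Python sketch. The function name `camel_case_split` and its regular expressions are assumptions made for illustration; this is not the CamelCase tool cited on the slide, only a splitter of the same family that relies on underscores and case changes.

```python
import re
from typing import List

# Hypothetical helper (not from the thesis): boundaries at case transitions.
_BOUNDARY = re.compile(
    r"(?<=[a-z0-9])(?=[A-Z])"      # fooBar    -> foo | Bar
    r"|(?<=[A-Z])(?=[A-Z][a-z])"   # XMLParser -> XML | Parser
)

def camel_case_split(identifier: str) -> List[str]:
    """Split on separators and case changes; return lower-cased terms."""
    terms = []
    for chunk in re.split(r"[_\W]+", identifier):
        if chunk:
            terms.extend(t.lower() for t in _BOUNDARY.split(chunk))
    return terms

print(camel_case_split("bfd_abs_section_ptr"))  # separators are present
print(camel_case_split("getXMLParser"))         # case changes are present
print(camel_case_split("callableint"))          # no convention to exploit
```

An identifier such as callableint carries no separator or case change, so a splitter of this kind returns it unsplit. That is the gap the frequency-based, dictionary-based, and context-aware approaches listed above try to close.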
Related Work & Contributions
Feature Location
Eisenbarth et al. (IEEE TSE’03)
A technique that applies formal concept analysis to traces to generate a
mapping between features and methods.
Poshyvanyk et al. (IEEE TSE’07)
Feature location finds the source code elements that implement a feature.
Eaddy et al. (ICPC’08)
Cerberus: a hybrid technique that combines static, dynamic, and textual analysis.
Binkley et al. (ICSM’12)
Normalization improves the ranks of relevant documents because it recovers key
domain terms. The improvement holds for shorter, more natural queries.

Little empirical evidence on the impact of identifier
splitting/expansion on feature location

65/59
Related Work & Contributions
Traceability Recovery
Antoniol et al. (IEEE TSE’02)
Approaches to recover links between requirements and source code.
Maletic et al. (ICSE’09)
TQL, an XML-based traceability query language that supports queries across multiple
artifacts and multiple traceability link types.
De Lucia et al. (IEEE TSE’10)
An approach to help developers keep identifiers and comments consistent with
high-level artifacts.

Little empirical evidence on the impact of identifier
splitting/expansion on traceability recovery

66/59
Impact of Normalization on Feature Location
LSI-based Feature Location
1. Generate corpus
2. Preprocessing
- Remove non-literals
- Remove stop words
- Split identifiers
- Stemming
3. Indexing
- Term-by-document matrix
- Singular Value Decomposition
4. User formulates query
5. Generate results
6. Ranked list

Splitting algorithms (from worst to better):
- Camel Case
- Samurai
- “Perfect” (Oracle using TIDIER)

67/59
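
The preprocessing, indexing, and ranking steps of this pipeline can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the setup used in the thesis: the stop-word list, the toy corpus, and all function names are hypothetical, stemming is skipped, and documents are ranked by cosine similarity in the raw term space. The Singular Value Decomposition that LSI applies to the term-by-document matrix is deliberately left out to keep the sketch dependency-free.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "for"}  # illustrative list

def split_identifier(token):
    """Convention-based split (underscores and case changes), lower-cased."""
    parts = re.split(r"_+|(?<=[a-z0-9])(?=[A-Z])", token)
    return [p.lower() for p in parts if p]

def preprocess(text):
    """Keep word-like tokens, split identifiers, drop stop words (no stemming)."""
    terms = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        terms.extend(t for t in split_identifier(token) if t not in STOP_WORDS)
    return terms

def index(corpus):
    """Sparse term-by-document matrix: one term-frequency column per document."""
    return {name: Counter(preprocess(text)) for name, text in corpus.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank(query, matrix):
    """Return document names ordered by decreasing similarity to the query."""
    q = Counter(preprocess(query))
    return sorted(matrix, key=lambda name: cosine(q, matrix[name]), reverse=True)

# Toy corpus: each "document" is the text of one method.
corpus = {
    "saveUserAccount": "void saveUserAccount(Account user_account) { store(user_account); }",
    "drawWindow": "void drawWindow(Window main_window) { paint(main_window); }",
}
matrix = index(corpus)
ranked = rank("save the user account", matrix)
print(ranked)
```

Without the identifier-splitting step, the token saveUserAccount would share no term with the query "save the user account", which is why the quality of the splitter can influence the ranked list.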
Impact of Normalization on Feature Location
Building the Oracle
Extract Identifiers -> All Identifiers -> Same split? (CamelCase, Samurai, TIDIER)
YES -> Concordant Split Identifiers
• Assume they are correct
• Manually verified a sample
• Threat to validity
NO -> Discordant Split Identifiers -> Manual Split
Manual Split -> Manually Split Identifiers / Identifiers that could not be split
Identifiers that could not be split:
• Examples: DT, i3, P754, zzz, etc.
• Left unchanged

68/59

Impact of Normalization on Feature Location
Building the Oracle
Extract Identifiers -> All Identifiers -> Same split? (CamelCase, Samurai, TIDIER)
YES -> Concordant Split Identifiers
• Assume they are correct
• Manually verified a sample
• Threat to validity
NO -> Discordant Split Identifiers -> Manual Split
• Consensus between authors
• Checked source code
Manual Split -> Manually Split Identifiers / Identifiers that could not be split

69/59
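
The agreement check at the heart of this oracle-building process can be sketched as follows. This is a minimal illustration under assumptions: `partition_by_agreement` and the three toy splitters are hypothetical stand-ins and do not reproduce CamelCase, Samurai, or TIDIER. Identifiers on which every splitter agrees are treated as concordant; the others are set aside for manual splitting.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Splitter = Callable[[str], List[str]]

def partition_by_agreement(
    identifiers: Sequence[str], splitters: Sequence[Splitter]
) -> Tuple[Dict[str, List[str]], Dict[str, List[List[str]]]]:
    """Return (concordant, discordant).

    concordant maps an identifier to the single split all tools agree on;
    discordant maps an identifier to the competing splits, which a human
    would then resolve manually.
    """
    concordant: Dict[str, List[str]] = {}
    discordant: Dict[str, List[List[str]]] = {}
    for ident in identifiers:
        splits = [tuple(split(ident)) for split in splitters]
        if len(set(splits)) == 1:
            concordant[ident] = list(splits[0])
        else:
            discordant[ident] = [list(s) for s in splits]
    return concordant, discordant

# Toy stand-ins for the three tools (NOT the real CamelCase, Samurai, TIDIER).
KNOWN = {"callableint": ["callable", "int"]}  # tiny illustrative dictionary

def toy_a(ident: str) -> List[str]:
    return ident.split("_")

def toy_b(ident: str) -> List[str]:
    return [part for part in ident.split("_") if part]

def toy_c(ident: str) -> List[str]:
    return KNOWN.get(ident, ident.split("_"))

concordant, discordant = partition_by_agreement(
    ["bfd_abs_section_ptr", "callableint", "zzz"], [toy_a, toy_b, toy_c]
)
print(concordant)
print(discordant)
```

In this sketch an identifier such as zzz ends up concordant simply because no toy splitter changes it. On the slides, unsplittable identifiers are a separate outcome and are left unchanged.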

130817 latifa guerrouj - context-aware source code vocabulary normalization for software maintenance

  • 1. Context-Aware Source Code Vocabulary Normalization for Software Maintenance Presentation of the Ph.D. Defense August 19, 2013 DGIGL - SOCCER Lab, Ptidej Team École Polytechnique de Montréal, Québec, Canada Latifa GUERROUJ latifa.guerrouj@polymtl.ca
  • 2. Outline   Thesis  Context-Awareness for Source Code Vocabulary Normalization  Conext-Aware Approaches for Vocabulary Normalization  Impact of Advanced Identifier Splitting on Traceability recovery  Impact of Advanced Identifier Splitting on Feature Location  2/59 Research Context & Problem Statement Conclusion and Future Work
  • 3. Textual information embeds domain knowledge 3/59 * Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
  • 4. Textual information embeds domain knowledge About 70% of source code consists of identifiers* Identifiers are important source of information for maintenance tasks such as: - Traceability link recovery - Feature location 4/59 * Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282
  • 5. Enslen et al. (MSR’09): Samurai: splits identifiers by mining terms frequencies in a large corpus of programs. Lawrie et al. (WCRE’10, ICSM’11): GenTest : generates all splittings and evaluates a scoring function against each one. Nomalize: a refinement of GenTest towards expansion based on a machine-translation technique. Example of Java code using meaningful identifiers - ibatis Example of Feature Location results - ibatis
  • 6. Research Context & Problem Statement Voc abul ar y mi s ma tch Requirements Example of C code identifiers - (gcl-2.6.7) Normalizing Source Code Vocabulary !? Normalization: - Splitting: bfd abs section ptr 6/59 - Expansion: binary file descriptor absolute section pointer
  • 7. Thesis Overarching Research Question of the Thesis Can we automatically resolve the mismatch between source code software artifacts, using context, software maintenance tasks such location and traceability recovery? 7/59 vocabulary and other to support as feature
  • 8. Thesis Thesis Phases Context-Awareness for Source Code Vocabulary Normalization Context-Aware Normalization Approaches (TIDIER & TRIS) Context is relevant (EMSE’13) TIDIER: Inspired by Speech Recognition (CSMR10, JSEP’13) TRIS: Fast Solution Dealing with normalization as an Optimization Problem (WCRE’12) Impact of Advanced Identifier Splitting on Traceability Recovery Advanced Identifier Splitting Can Help Traceability Recovery Impact of Advanced Identifier Splitting on Feature Location Advanced Identifier Splitting Can Help Feature Location (ICPC’11)
  • 9. Contribution 1: Context-Awareness for Source Code Vocabulary Normalization
  • 10. Context-Awareness for Normalization Experiments’ Definition and Planning Two experiments (Exp I and II) with 63 participants asked to split/expand identifiers from C programs with different contexts to investigate:   Accuracy in dealing with identifiers’ terms consisting of plain English words, abbreviations, and acronyms;  10/59 Effect of contextual information; Effect of factors: participants’ background, programming expertise, domain knowledge, and English proficiency.
  • 11. Context-Awareness for Normalization Exp I & II Subjects Characteristic Master 9 6 Ph.D. 28 10 1 2 Basic 11 6 Medium 23 5 9 10 Bad 8 1 Good 8 9 18 6 Excellent 8 (7) 11(6) Occasional 12 10 Basic usage 13 6 Knowledgeable but not expert 17 5 Expert 11/59 3 Very good Linux Knowledge 5 Expert English Proficiency # of participants Exp II (21) Post-doc C Programming Experience # of participants Exp I (42) Bachelor Program of studies Level 0 0 Participants’ characteristics and background (63 participants in total).
  • 12. Context-Awareness for Normalization Objects: identifiers from # open-source C applications &… GNU Projects (337 Projects) FreeBSD C C++ .h Files 57, 268 13,445 39,257 Size (KLOCs) 25,442 2,846 Identifiers 1,154,280 Oracle 927 C C++ .h Files 13,726 128 7,846 6,062 Size (KLOCs) 1,800 128 8,016 - 619,652 Identifiers 634,902 - 278,659 - 26 Oracle 20 - 0 Linux Kernel Apache Web Server C 12,581 - 11,166 Size (KLOCs) 8,474 - 1,994 Identifiers 845,335 - 352,850 Oracle 73 - 4 C C++ .h Files 559 - 254 Size (KLOCs) 293 - 44 Identifiers 33,062 - 11,549 Oracle 11 - 0 .h Files 12/59 C++ Main characteristics of the 340 projects for the sampled identifiers.
  • 13. Context-Awareness for Normalization Context (internal & external) made available to participants. Context levels provided during Exp I and Exp II (AF = Acronym Finder): no context (control group): Exp I and Exp II; function: Exp I; file: Exp I and Exp II; file plus AF: Exp I and Exp II; application: Exp II; application plus AF: Exp II. Experimental Design: Randomized Block Procedure
  • 14. Context-Awareness for Normalization Research Questions - RQ1: To what extent does context impact the splitting/expansion of identifiers? - RQ2: To what extent do the characteristics of identifiers’ terms affect normalization performance? - RQ3: To what extent do level of experience, programming language (C), domain knowledge, and English proficiency impact normalization?
  • 15. Context-Awareness for Normalization Experiments’ Results – RQ1 (Context Relevance) [Figure: boxplots of F-measure per context level. Exp I: file, file+AF, function, noContext. Exp II: app, app+AF, file, file+AF, noContext.]
  • 16. Context-Awareness for Normalization Experiments’ Results – RQ1 (Context Relevance), Exp I and Exp II: - Context significantly increases participants’ performance. - File-level context exhibits better performance than function-level context. - Application-level context does not improve further.
  • 17. Context-Awareness for Normalization Experiments’ Results – RQ2 (Effect of Kind of Terms) Exp I: proportions of kinds of identifiers’ terms correctly expanded per context level, given as #Matched / #Unmatched / Accuracy (%). file plus AF: abbreviation 523 / 169 / 75.58; acronym 112 / 31 / 78.32; plain 336 / 50 / 87.05. file: abbreviation 542 / 164 / 76.77; acronym 94 / 32 / 74.60; plain 346 / 50 / 87.37. function: abbreviation 582 / 161 / 78.33; acronym 97 / 36 / 72.93; plain 374 / 52 / 87.79. no context: abbreviation 467 / 248 / 65.31; acronym 82 / 47 / 63.57; plain 326 / 75 / 81.30. OVERALL: abbreviation 2114 / 742 / 74.02; acronym 385 / 146 / 72.50; plain 1382 / 227 / 85.89.
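The accuracy column on this slide is consistent with the ratio of matched terms to all terms (matched plus unmatched). A minimal sketch of that arithmetic follows; the function name `accuracy` is an illustrative assumption, not something taken from the study's tooling.

```python
def accuracy(matched, unmatched):
    """Percentage of terms correctly expanded: matched / (matched + unmatched)."""
    return 100.0 * matched / (matched + unmatched)

# Exp I, context "file plus AF", abbreviations: 523 matched, 169 unmatched
print(round(accuracy(523, 169), 2))
# Exp I, OVERALL, abbreviations: 2114 matched, 742 unmatched
print(round(accuracy(2114, 742), 2))
```

Applied to the Exp I counts, this reproduces the reported percentages, for example 75.58 for abbreviations under "file plus AF".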
  • 18. Context-Awareness for Normalization Experiments’ Results – RQ2 (Effect of Kind of Terms) Exp II: proportions of kinds of identifiers’ terms correctly expanded per context level, given as #Matched / #Unmatched / Accuracy (%). application plus AF: abbreviation 274 / 69 / 79.88; acronym 57 / 13 / 81.43; plain 181 / 17 / 91.41. application: abbreviation 542 / 164 / 75.35; acronym 94 / 32 / 82.61; plain 346 / 50 / 90.45. file plus AF: abbreviation 582 / 161 / 82.87; acronym 97 / 36 / 86.30; plain 374 / 52 / 91.67. file: abbreviation 467 / 248 / 76.60; acronym 82 / 47 / 85.07; plain 326 / 75 / 92.57. no context: abbreviation 2114 / 742 / 67.98; acronym 385 / 146 / 76.12; plain 1382 / 227 / 83.94. OVERALL: abbreviation 1349 / 415 / 76.47; acronym 285 / 61 / 82.37; plain 861 / 96 / 89.97.
  • 19. Context-Awareness for Normalization Experiments’ Results – RQ3 (Effect of Participants’ Characteristics) F-measure, two-way permutation test by context & knowledge of Linux (Exp II p-values): Context <0.001; Linux 0.037; Context:Linux 0.988. F-measure, two-way permutation test by context & English proficiency (p-values, Exp I / Exp II): Context <0.001 / <0.001; English 0.032 / 0.044; Context:English 0.054 / 0.698.
  • 20. Context-Awareness for Normalization Conclusion - Context is relevant for vocabulary normalization; - No significant difference in the accuracy of splitting/expanding abbreviations and acronyms; - Participants make better use of context when they have a good level of English; - English proficiency is used alongside domain knowledge (Exp II) to normalize identifiers. Context is useful for source code vocabulary normalization.
  • 21. Contribution 2: Context-Aware Source Code Vocabulary Normalization Approaches: TIDIER & TRIS
  • 22. TIDIER Overview Developers generate identifiers and contractions using: - Terms and words reflecting domain concepts, developers’ experience, or knowledge; - A finite set of transformation rules: dropping all vowels (pointer → pntr); dropping a random vowel (user → usr); dropping a random character (pntr → ptr); dropping a suffix such as ing, tion, ment... (available → avail); dropping the last m characters (rectangle → rect).
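The transformation rules on this slide can be sketched as small string functions. This is an illustrative sketch, not TIDIER's implementation: the function names are hypothetical, the random choices named on the slide are replaced by explicit indices so the results are reproducible, and the suffix "able" is added to the suffix list as an assumption based on the slide's example available → avail.

```python
VOWELS = set("aeiou")
# "ing", "tion", "ment" come from the slide; "able" is an assumption
# inferred from the example available -> avail.
SUFFIXES = ("ing", "tion", "ment", "able")

def drop_all_vowels(word):
    """Remove every vowel, e.g. pointer -> pntr."""
    return "".join(c for c in word if c not in VOWELS)

def drop_vowel(word, k):
    """Remove the k-th vowel (0-based); the slide picks it at random."""
    positions = [i for i, c in enumerate(word) if c in VOWELS]
    i = positions[k]
    return word[:i] + word[i + 1:]

def drop_char(word, i):
    """Remove the character at index i; the slide picks it at random."""
    return word[:i] + word[i + 1:]

def drop_suffix(word):
    """Remove the first matching suffix from SUFFIXES, if any."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

def drop_last(word, m):
    """Remove the last m characters, e.g. rectangle -> rect for m = 5."""
    return word[:-m] if 0 < m < len(word) else word

print(drop_all_vowels("pointer"), drop_vowel("user", 1), drop_char("pntr", 1))
print(drop_suffix("available"), drop_last("rectangle", 5))
```

Chaining two rules reproduces the slide's progression pointer → pntr → ptr.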
  • 23. TIDIER Overview TIDIER is novel and uses context in the form of context-aware dictionaries enriched with domain knowledge. TIDIER relies on a search-based technique to normalize identifiers: - A distance computed with Dynamic Time Warping (DTW), as used for continuous speech recognition (Ney, IEEE Trans. ASSP’84); - Hill Climbing.
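To give an intuition for distance-based matching of a contracted term against dictionary words, here is a minimal sketch. It deliberately swaps in the plain Levenshtein edit distance as a stand-in; it is not the DTW formulation that TIDIER uses, and the function names and the toy dictionary are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions).
    Stand-in for TIDIER's DTW-based distance, not the same measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def closest_word(term, dictionary):
    """Return the dictionary word with the smallest distance to term
    (ties broken alphabetically)."""
    return min(dictionary, key=lambda w: (edit_distance(term, w), w))

# Hypothetical dictionary, for illustration only
print(closest_word("pntr", ["pointer", "counter", "control"]))
```

A zero distance would correspond to an exact dictionary match; a non-zero minimum is the situation in which the search applies transformations to dictionary words and tries again.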
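  • 24. TIDIER Normalization Strategy [Flowchart] The identifier is matched against the current dictionary with DTW to obtain the best matching. Zero distance? Yes: success. No: select randomly a word with a minimal non-zero distance; apply a random transformation to the chosen word; add the transformed word to a temporary dictionary; run the DTW match again. Does the best matching reduce the distance? Yes: the transformed word goes into the current dictionary. No: discard the word from the temporary dictionary and, if other transformations remain to apply, try another one.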
  • 25. TIDIER Case Study Research Questions - RQ1: How does TIDIER compare with alternatives when C identifiers must be split? - RQ2: How sensitive is the performance of TIDIER to the use of context and specialized knowledge? - RQ3: What percentage of identifiers with abbreviations is TIDIER able to map to dictionary words? Analyzed Systems (benchmark used in the context study)
  • 26. Identifier Splitting for Traceability Recovery Camel Case & Samurai Techniques Original identifier → Camel Case split: userId → user Id; setGID → set GID; print_file2device → print file 2 device; SSLCertificate → SSL Certificate; MINstring → MI Nstring; USERID → USERID; currentsize → currentsize; readadapterobject → readadapterobject; tolocale → tolocale; imitating → imitating; DEFMASKBit → DEFMASK Bit
  • 27. Identifier Splitting for Traceability Recovery Camel Case & Samurai Techniques Original identifier → Camel Case split | Samurai split: userId → user Id | user Id; setGID → set GID | set GID; print_file2device → print file 2 device | print file 2 device; SSLCertificate → SSL Certificate | SSL Certificate; MINstring → MI Nstring | MIN string; USERID → USERID | USER ID; currentsize → currentsize | current size; readadapterobject → readadapterobject | read adapter object; tolocale → tolocale | tol ocal e; imitating → imitating | imi ta ting; DEFMASKBit → DEFMASK Bit | DEF MASK Bit
  • 28. Identifier Splitting for Traceability Recovery Camel Case & Samurai Techniques Original identifier → Camel Case split | Samurai split: userId → user Id | user Id; setGID → set GID | set GID; print_file2device → print file 2 device | print file 2 device; SSLCertificate → SSL Certificate | SSL Certificate; MINstring → MI Nstring | MIN string; USERID → USERID | USER ID; currentsize → currentsize | current size; readadapterobject → readadapterobject | read adapter object; tolocale → tolocale | tol ocal e; imitating → imitating | imi ta ting; DEFMASKBit → DEFMASK Bit | DEF MASK Bit. Samurai splits some cases where CamelCase cannot (e.g., MINstring, USERID, currentsize) but oversplits others (e.g., tolocale, imitating).
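The Camel Case column of the table above can be reproduced with a short regular-expression splitter. This is a sketch of a common convention-based splitter, written so that it matches the rows shown on the slides; it is not claimed to be the exact tool used in the study, and the name `camel_case_split` is an assumption.

```python
import re

# An uppercase run not followed by a lowercase letter, or an optional
# capital followed by lowercase letters, or a run of digits.
# Underscores match none of the alternatives and act as separators.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def camel_case_split(identifier):
    """Split an identifier on underscores, digits, and case changes."""
    return _TOKEN.findall(identifier)

for name in ["userId", "print_file2device", "MINstring", "currentsize"]:
    print(name, "->", " ".join(camel_case_split(name)))
```

Because the splitter only looks at case changes and separators, it cannot split single-case identifiers such as currentsize or USERID, which is the limitation the slides point out.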
  • 29. TIDIER Results [Figure: performance of Camel Case, Samurai, and TIDIER when using different dictionaries.] TIDIER outperforms previous approaches on C and is the first to produce a correct mapping for 48% (35/73) of abbreviations.
  • 30. Contribution 2: Context-Aware Source Code Vocabulary Normalization Approaches: TIDIER & TRIS
  • 31. TRIS Overview TRIS is a novel approach dealing with normalization as an optimization (minimization) problem. The aim is to minimize the following cost function: C(wOrig → w) = α * Freq(wOrig) + C(type(wOrig → w)), where Freq(wOrig) is the frequency of wOrig in the source code and C(type(wOrig → w)) is the cost of the transformation type.
  • 32. TRIS Normalization Strategy Phase 1: Building Transformations (inputs: source code, dictionaries): computation of dictionary word frequencies; building the set of possible transformations; construction of the arborescence of transformations. Phase 2: Identifier Processing (inputs: identifier, arborescence): identifier auxiliary graph creation; optimal split/expansion search.
  • 33. TRIS Case Study Research Question RQ: What is the accuracy of TRIS compared with alternative state-of-the-art approaches? Analyzed Systems: main characteristics of the systems analyzed using TRIS. JHotDraw (Java): Files 155; Size (KLOC) 16; Identifiers 2,348; Oracle 957. Lynx (C): Files 247; Size (KLOC) 174; Identifiers 12,194; Oracle 3,085. Lawrie et al. data set: Programs 186; C (MLOC) 26; C++ (MLOC) 15; Java (MLOC) 7. 489 C/C++ identifiers sampled from the projects used in TIDIER.
  • 34. TRIS Results [Figure: mean of F-measure on Lynx (C system).] Results of Wilcoxon paired test & Cliff’s delta effect size on Lynx (Approach 1 vs. Approach 2: adjusted p-value, Cliff’s d): TRIS vs. Camel Case: <0.001, 0.743; TRIS vs. Samurai: <0.001, 0.684; TRIS vs. TIDIER: <0.001, 0.204. Cliff’s delta interpretation: small: 0.148 <= d < 0.33; medium: 0.33 <= d < 0.474; large: d >= 0.474.
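For readers unfamiliar with the effect size used here, the following sketch computes Cliff's delta for two samples and applies the thresholds listed on the slide. The function names are illustrative assumptions, and the "negligible" label for values below 0.148 is the customary complement of the slide's scale rather than something the slide states.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y  -  #pairs with x < y) / (|xs| * |ys|)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def interpret(d):
    """Map |d| to the categories on the slide; 'negligible' is an assumption."""
    d = abs(d)
    if d >= 0.474:
        return "large"
    if d >= 0.33:
        return "medium"
    if d >= 0.148:
        return "small"
    return "negligible"

# Effect sizes reported on the slide for TRIS vs. Camel Case, Samurai, TIDIER
for d in (0.743, 0.684, 0.204):
    print(d, interpret(d))
```

With these thresholds, the differences against Camel Case and Samurai are large, and the difference against TIDIER is small.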
  • 35. TRIS Results [Figure: identifier splitting correctness on the data set from Lawrie et al.] - TRIS performs better than the other approaches, with a medium to large effect size on C; - TRIS outperforms Samurai by 16 percentage points and GenTest by 4 percentage points.
  • 36. TRIS Results [Figure: mean of F-measure on the 489 sampled C identifiers.] Statistically significant difference using the Wilcoxon test: - p-value < 0.001; - Cliff’s d effect size is medium (d = 0.456).
  • 37. Contribution 3: Impact of Advanced Identifier Splitting on Traceability Recovery
  • 38. Identifier Splitting for Traceability Recovery Research Question RQ: How do different identifier splitting strategies (CamelCase, Samurai, and Oracle) impact traceability recovery? Configurations of the studied traceability recovery techniques (splitting strategy: LSI configuration, VSM configuration): CamelCase: LSICamelCase, VSMCamelCase; Samurai: LSISamurai, VSMSamurai; Oracle: LSIOracle, VSMOracle.
  • 39. Identifier Splitting for Traceability Recovery Analyzed Systems: main characteristics of the studied systems. Systems (Java): iTrust, version 10, 35 requirements, 218 classes; Pooka, version 2.0, 90 requirements, 298 classes. System (C): Lynx, version 2.8.5, 247 files, 174 KLOCs, 2,067 methods.
  • 40. Identifier Splitting for Traceability Recovery Results (%): precision and recall of the traceability recovery technique configurations for iTrust, Pooka, and Lynx, given in the order CamelCase / Samurai / Oracle. LSI precision: iTrust 36.49 / 36.49 / 28.39; Pooka 14.06 / 14.14 / 15.64; Lynx 45.43 / 39.08 / 39.40. LSI recall: iTrust 36.61 / 36.61 / 34.23; Pooka 22.81 / 22.37 / 22.36; Lynx 41.99 / 40.82 / 41.55. VSM precision: iTrust 48.99 / 48.99 / 25.81; Pooka 40.54 / 40.54 / 42.07; Lynx 64.26 / 57.84 / 49.91. VSM recall: iTrust 23.77 / 23.77 / 23.07; Pooka 11.59 / 11.63 / 12.19; Lynx 37.66 / 37.05 / 40.16.
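As a reminder of how the two measures on this slide are defined for recovered traceability links, here is a minimal sketch using the standard definitions. The link pairs and the function name are hypothetical and serve only as an illustration; they are not data from the study.

```python
def precision_recall(retrieved, relevant):
    """Precision = correct / retrieved; recall = correct / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical (requirement, source file) links, for illustration only
retrieved = {("R1", "A.java"), ("R1", "B.java"), ("R2", "C.java"), ("R3", "A.java")}
relevant = {("R1", "A.java"), ("R2", "C.java"), ("R2", "D.java")}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2%} recall={r:.2%}")
```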
  • 41. Identifier Splitting for Traceability Recovery Results and Discussion - Potential benefits of developing advanced vocabulary normalization approaches. - Mismatch resulting from the requirements (presence of acronyms in requirements). - Case of Lynx (noise in data): requirement 534 is “the browser should be able to manage store erase session I information”, whereas the C method LYMain.c.i__nobrowse_fun relates to directory-browsing functionality. - Baseline splitting keeps “nobrowse” as a single term, and thus produces no link between requirement 534 and LYMain.c.i_nobrowse_fun.txt. - Samurai and the manual oracle split the identifier “nobrowse” into “no browse” and therefore link the file LYMain.c.i__nobrowse_fun.txt. Potential benefits of developing advanced normalization approaches.
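The effect of splitting "nobrowse" on term matching can be illustrated with a minimal vector-space sketch using raw term frequencies and cosine similarity. This is a simplified stand-in for the VSM configuration in the study (no stemming or term weighting), and the query terms, document names, and function names are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query_terms, documents):
    """Rank documents (name -> list of terms) by similarity to the query."""
    q = Counter(query_terms)
    scores = {name: cosine(q, Counter(terms)) for name, terms in documents.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# The same identifier indexed without and with splitting (hypothetical corpus)
documents = {
    "unsplit": ["nobrowse", "fun"],
    "split": ["no", "browse", "fun"],
}
print(rank(["browse", "directory"], documents))
```

The unsplit document shares no term with the query and scores zero, while the split one is retrieved. That is the mechanism by which a different split can add, or remove, a candidate link.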
  • 42. Contribution 4: Impact of Advanced Identifier Splitting on Feature Location
  • 43. Identifier Splitting for Feature Location Research Question RQ: How do different identifier splitting strategies (CamelCase, Samurai, and Oracle) impact feature location? Feature location technique (FLT) configurations studied (splitting strategy: IR FLT, IRDyn FLT): CamelCase: IRCamelCase, IRCamelCaseDyn; Samurai: IRSamurai, IRSamuraiDyn; Oracle: IROracle, IROracleDyn.
  • 44. Identifier Splitting for Feature Location Analyzed Systems: characteristics of the main analyzed systems. Rhino 1.6R5: 32 KLOC, 138 classes, 1,870 methods, data sets: Eaddy et al.’s data* (2). jEdit 4.3: 109 KLOC, 483 classes, 6.4K methods, 2 data sets. Datasets (size; queries; gold sets; execution information): RhinoFeatures: 241; sections of ECMAScript; Eaddy et al.*; full execution traces (from unit tests). RhinoBugs: 143; bug title and description; Eaddy et al.* (CVS); N/A. jEditFeatures: 64; feature (or patch) title and description; SVN; marked execution traces. jEditBugs: 86; bug title and description; SVN; marked execution traces. * http://www.cs.columbia.edu/~eaddy/concerntagger/
  • 45. Identifier Splitting for Feature Location Results (IR FLTs) [Figure panels: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs.] Similar median and average of the effectiveness measure. Datasets with features have better results than datasets with bugs.
  • 46. Identifier Splitting for Feature Location Results [Figure panels: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs.]
  • 47. Identifier Splitting for Feature Location Results [Figure panels: RhinoFeatures, RhinoBugs, jEditFeatures, jEditBugs.] Statistically significant result (p = 0.05).
  • 48. Identifier Splitting for Feature Location Results and Discussion - Samurai and CamelCase produced similar results; - IROracle outperforms IRCamelCase in terms of the effectiveness measure on the RhinoFeatures dataset; - When only textual information is available, an improved splitting technique can help improve the effectiveness of feature location; - Samurai oversplits identifiers into many meaningless terms. In Rhino: debugAccelerators becomes debug Ac ce le r at o rs (CamelCase is better in such cases).
  • 49. Identifier Splitting for Feature Location Vocabulary mismatch between queries and code - Inconsistencies between the identifiers used in the queries and the identifiers used in the code. - The mismatch is less noticeable for features and more severe for bugs. - The description of jEdit’s feature #16084869 (“Support “thick” caret”) contains identifiers found in the names of the methods (e.g., thick, caret, text, area, etc.). - Names of developers (e.g., Slava, Carlos) and identifiers specific to communication (e.g., thanks, greetings, annoying) appeared only in the query vocabulary and did not appear in the source code vocabulary.
  • 50. Identifier Splitting for Feature Location Features are more “descriptive” than bugs. [Figure: example of a query (bugs); the words “join” and “line” are not mentioned.] Binkley et al. (ICSM’12): normalization improves feature location. Potential benefits of developing advanced normalization approaches.
  • 51. Conclusion Context is relevant for source code vocabulary normalization: source code files are the most helpful; a limited context such as functions does not help; a wider context such as applications does not improve further; domain knowledge improves normalization. TIDIER is novel and performs better than previous approaches (CamelCase & Samurai): 54.29% splitting correctness vs. 31.14% for Samurai and 30.08% for Camel Case, with an application-level dictionary augmented with domain knowledge. TIDIER was the first to produce a correct mapping for 48% of abbreviations. TRIS is novel and brings improvements over state-of-the-art approaches on C: 92.06% vs. 85.25% for TIDIER vs. 46.34% for Samurai vs. 38.51% for CamelCase (Lynx, C); 86% vs. 82% for GenTest vs. 70% for Samurai on the Lawrie et al. data; 87.90% vs. 64.09% for TIDIER on the identifiers from the 340 projects. Advanced identifier splitting strategies improve the average precision and recall for some systems: Pooka & Lynx. Advanced splitting improves feature location using LSI: Rhino (features). The quality of the requirements and the expressiveness of the queries also have an impact.
  • 52. Future Work Impact of Vocabulary Normalization on Maintenance Tasks - Evaluate our work on systems written in other languages such as C, C++, or COBOL; - Compare it to other works such as Normalize (Lawrie et al., ICSM’11); - Study the impact of the quality of IR queries (Haiduc et al., ICSE’13). Context-Aware Vocabulary Normalization Approaches - Extend the evaluation of TIDIER and TRIS to larger systems; - Compare the results to more recent approaches such as Normalize (Lawrie et al., ICSM’11) and LINSEN (Corazza et al., ICSM’12).
  • 53. Future Work Context-Awareness for Vocabulary Normalization - Replicate our studies using eye-tracking tools; - Implement a context model within an IDE to support program understanding; - Involve participants from industry. Mining Software Repositories to Study the Impact of Identifier Style on Software Quality - Infer the identifier styles in open-source projects using HMM; - Analyze whether open-source developers adapt/bring their style; - Analyze whether identifier style can introduce bugs and/or impact internal quality metrics such as semantic coupling & cohesion.
  • 54. Publications Articles in journals 1. Latifa Guerrouj, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol. An Experimental Investigation on the Effects of Contexts on Source Code Identifiers Splitting and Expansion. Empirical Software Engineering Journal (EMSE’13). 2. Latifa Guerrouj, Massimiliano Di Penta, Giuliano Antoniol, and Yann-Gaël Guéhéneuc. TIDIER: An Identifier Splitting Approach Using Speech Recognition Techniques. Journal of Software Evolution and Process (JSEP’13). 25(6): 569-661. Conference Articles 3. Latifa Guerrouj, Philippe Galinier, Yann-Gaël Guéhéneuc, Giuliano Antoniol, and Massimiliano Di Penta. TRIS: a Fast and Accurate Identifiers Splitting and Expansion Algorithm. Proceedings of the 19th IEEE Working Conference on Reverse Engineering (WCRE), October 2012. 4. Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol. Can Better Identifier Splitting Techniques Help Feature Location? Proceedings of the 19th IEEE International Conference on Program Comprehension (ICPC), June 2011.
  • 55. Publications Conference Articles 5. Nioosha Madani, Latifa Guerrouj, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, Giuliano Antoniol. Recognizing Words from Source Code Identifiers Using Speech Recognition Techniques. Proceedings of the 14th IEEE European Conference on Software Maintenance and Reengineering (CSMR), March 2010. Best Paper award of CSMR’10. 6. Latifa Guerrouj. Normalizing Source Code Vocabulary to Enhance Program Comprehension and Software Quality. Proceedings of the 35th ACM International Conference on Software Engineering (ICSE), May 2013. 7. Latifa Guerrouj. Automatic Derivation of Concepts Based on the Analysis of Source Code Identifiers. Proceedings of the 17th Working Conference on Reverse Engineering (WCRE), October 2010. 8. Alberto Bacchelli, Nicolas Bettenburg, Latifa Guerrouj. Mining Unstructured Data because “Mining Unstructured Data is Like Fishing in Muddy Waters!”. Proceedings of the 19th Working Conference on Reverse Engineering (WCRE), October 2012.
  • 57. References LAWRIE, D., FEILD, H. and BINKLEY, D. (2006). Syntactic Identifier Conciseness and Consistency. Proceedings of the 6th International Workshop on Source Code Analysis and Manipulation. pp. 139–148. MAYRHAUSER, A. V. and VANS, A. M. (1995). Program Comprehension During Software Maintenance and Evolution. Computer, vol. 28, pp. 44–55. STOREY, M.-A. D., FRACCHIA, F. D. and MÜLLER, H. A. (1999). Cognitive Design Elements to Support the Construction of a Mental Model During Software Exploration. Journal of Systems and Software, vol. 44, pp. 171–185. ROBILLARD, M. P., COELHO, W. and MURPHY, G. C. (2004). How Effective Developers Investigate Source Code: An Exploratory Study. IEEE Transactions on Software Engineering, vol. 30, pp. 889–903. KERSTEN, M. and MURPHY, G. C. (2006). Using Task Context to Improve Programmer Productivity. Proceedings of the 14th International Symposium on Foundations of Software Engineering. pp. 1–11. SILLITO, J., MURPHY, G. C. and VOLDER, K. D. (2008). Asking and Answering Questions during a Programming Change Task. IEEE Transactions on Software Engineering, vol. 34, pp. 434–451. BINKLEY, D., DAVIS, M., LAWRIE, D. and MORRELL, C. (2009). To CamelCase or Under_score. Proceedings of the 17th International Conference on Program Comprehension. pp. 158–167. ENSLEN, E., HILL, E., POLLOCK, L. and SHANKER, K. V. (2009). Mining Source Code to Automatically Split Identifiers for Software Analysis. Proceedings of the 6th International Working Conference on Mining Software Repositories. pp. 16–17. LAWRIE, D. J., BINKLEY, D. and MORRELL, C. (2010). Normalizing Source Code Vocabulary. Proceedings of the 17th Working Conference on Reverse Engineering. pp. 112–122.
  • 58. References LAWRIE, D. and BINKLEY, D. (2011). Expanding Identifiers to Normalize Source Code Vocabulary. Proceedings of the 27th International Conference on Software Maintenance. pp. 113–122. CORAZZA, A., MARTINO, S. D. and MAGGIO, V. (2012). LINSEN: An Efficient Approach to Split Identifiers and Expand Abbreviations. Proceedings of the 28th International Conference on Software Maintenance. pp. 233–242. EISENBARTH, T., KOSCHKE, R. and SIMON, D. (2003). Locating Features in Source Code. IEEE Transactions on Software Engineering, vol. 29, pp. 210–224. POSHYVANYK, D., GUÉHÉNEUC, Y.-G., MARCUS, A., ANTONIOL, G. and RAJLICH, V. (2007). Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval. IEEE Transactions on Software Engineering, vol. 33, pp. 420–432. EADDY, M., AHO, A., ANTONIOL, G. and GUÉHÉNEUC, Y.-G. (2008a). CERBERUS: Tracing Requirements to Source Code Using Information Retrieval, Dynamic Analysis, and Program Analysis. Proceedings of the 16th International Conference on Program Comprehension. pp. 53–62. BINKLEY, D., LAWRIE, D. and UEHLINGER, C. (2012). Vocabulary Normalization Improves IR-Based Concept Location. Proceedings of the 28th International Conference on Software Maintenance, vol. 41, pp. 588–591. ANTONIOL, G., CANFORA, G., CASAZZA, G., LUCIA, A. D. and MERLO, E. (2002). Recovering Traceability Links Between Code and Documentation. IEEE Transactions on Software Engineering, vol. 28, pp. 970–983. MALETIC, J. I. and COLLARD, M. L. (2009). TQL: A Query Language to Support Traceability. Proceedings of the 2009 ICSE Workshop on Traceability in Emerging Forms of Software Engineering. pp. 16–20.
  • 59. References DE LUCIA, A., DI PENTA, M. and OLIVETO, R. (2010). Improving Source Code Lexicon via Traceability and Information Retrieval. IEEE Transactions on Software Engineering, vol. 37, pp. 205–226. GUERROUJ, L., DI PENTA, M., GUÉHÉNEUC, Y.-G. and ANTONIOL, G. (2013b). An Experimental Investigation on the Effects of Context on Source Code Identifiers Splitting and Expansion. Empirical Software Engineering. Doi: 10.1016/S0164-1212(00)00029-7. GUERROUJ, L., DI PENTA, M., ANTONIOL, G. and GUÉHÉNEUC, Y.-G. (2013a). TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques. Journal of Software Evolution and Process, vol. 25, pp. 569–661. DIT, B., GUERROUJ, L., POSHYVANYK, D. and ANTONIOL, G. (2011). Can Better Identifier Splitting Techniques Help Feature Location? Proceedings of the 19th International Conference on Program Comprehension. pp. 11–20. GUERROUJ, L., GALINIER, P., GUÉHÉNEUC, Y.-G., ANTONIOL, G. and DI PENTA, M. (2012). TRIS: A Fast and Accurate Identifiers Splitting and Expansion Algorithm. Proceedings of the 19th Working Conference on Reverse Engineering. pp. 103–112. MADANI, N., GUERROUJ, L., DI PENTA, M., GUÉHÉNEUC, Y.-G. and ANTONIOL, G. (2010). Recognizing Words from Source Code Identifiers using Speech Recognition Techniques. Proceedings of the 14th European Conference on Software Maintenance and Reengineering. pp. 68–77. NEY, H. (1984). The Use of a One-stage Dynamic Programming Algorithm for Connected Word Recognition. IEEE Transactions on Acoustics Speech and Signal Processing, vol. 32, pp. 263–271.
  • 61. TIDIER Normalization Strategy [Figure: DTW distance matrix matching the identifier to split, pntrctrusr, against a dictionary of 3 words.]
  • 62. TRIS Normalization Strategy Dictionary Transformations Building Phase: information for the dictionary (dictionary words D, word frequencies, transformations set). d1=“able”, f1=0.1: t1=(d1, abl, 0.55), t2=(d1, able, -0.2). d2=“call”, f2=0.2: t3=(d2, cal, 0.55), t4=(d2, call, -0.2), t5=(d2, cll, -0.2). d3=“callable”, f3=0.6: t6=(d3, calla, 0.55), t7=(d3, callable, -0.2), t8=(d3, cllbl, -0.2). d4=“interface”, f4=0.1: t9=(d4, int, 0.55), t10=(d4, inte, -0.2), t11=(d4, inter, -0.2), t12=(d4, interface, -0.2), t13=(d4, intrfc, 0.8). [Figures: arborescence of transformations for the dictionary D; auxiliary graph for the identifier callableint.]
  • 63. Related Work & Contributions Context Relevance for Program Comprehension Mayrhauser and Vans (Computer’95): cognitive models rely on the programmers’ knowledge, source code, and software documentation. Storey et al. (JSS’99): disparities between comprehension models can be explained in terms of differences in programmer characteristics, program characteristics, and task characteristics. Robillard et al. (IEEE TSE’04): a methodical and structured approach to program investigation is the most effective when performing a software change task. Kersten and Murphy (FSE’06): Mylar, to reduce information overload and focus a programmer’s work. Sillito et al. (IEEE TSE’08): need for more context and support to work with larger groups of entities and relationships. Gap: lack of empirical evidence on the extent to which context helps normalizing source code vocabulary.
  • 64. Related Work & Contributions Vocabulary Normalization Approaches CamelCase (ICPC’09): splits identifiers based on naming conventions. Enslen et al. (MSR’09), Samurai: splits identifiers by mining term frequencies in a large corpus of programs. CamelCase and Samurai have the drawback of relying on naming conventions and term frequencies, respectively. Lawrie et al. (WCRE’10), GenTest: generates all splittings and evaluates a scoring function against each one. Lawrie and Binkley (ICSM’11), Normalize: a refinement of GenTest towards expansion, based on machine translation. Corazza et al. (ICSM’12), LINSEN: expands identifiers based on a graph model and an approximate string matching algorithm, and exploits context.
  • 65. Related Work & Contributions Feature Location Eisenbarth et al. (IEEE TSE’03): a technique that applies formal concept analysis to traces to generate a mapping between features and methods. Poshyvanyk et al. (IEEE TSE’07): feature location finds the source code elements that implement a feature. Eaddy et al. (ICPC’08), Cerberus: a hybrid technique that combines static, dynamic, and textual analysis. Binkley et al. (ICSM’12): normalization improves the ranks of relevant documents as it recovers key domain terms; this improvement is for shorter, more natural queries. Gap: little empirical evidence on the impact of identifier splitting/expansion on feature location.
  • 66. Related Work & Contributions Traceability Recovery Antoniol et al. (IEEE TSE’02): approaches to recover links between requirements and source code. Maletic et al. (ICSE’09): TQL, an XML-based traceability query language that supports queries across multiple artefacts and multiple traceability link types. De Lucia et al. (IEEE TSE’10): an approach to help developers keep identifiers and comments consistent with high-level artifacts. Gap: little empirical evidence on the impact of identifier splitting/expansion on traceability recovery.
  • 67. Impact of Normalization on Feature Location LSI-based Feature Location: - Generate corpus; - Preprocessing: remove non-literals, remove stop words, split identifiers, stemming; - Indexing: term-by-document matrix, Singular Value Decomposition; - User formulates query; - Generate results: ranked list. Splitting algorithms, ordered from worst to better: Camel Case, Samurai, “Perfect” (Oracle using TIDIER).
  • 68. Impact of Normalization on Feature Location Building the Oracle [Flowchart] Extract identifiers → all identifiers → same split by CamelCase, Samurai, and TIDIER? YES: concordant split identifiers (assume they are correct; manually verified a sample; threat to validity). NO: discordant split identifiers → manual split → manually split identifiers, or identifiers that could not be split.
  • 69. Impact of Normalization on Feature Location Building the Oracle [Flowchart] Extract identifiers → all identifiers → same split by CamelCase, Samurai, and TIDIER? YES: concordant split identifiers (assume they are correct; manually verified a sample; threat to validity). NO: discordant split identifiers → manual split (consensus between authors; checked source code) → manually split identifiers, or identifiers that could not be split (examples: DT, i3, P754, zzz, etc.; left unchanged).
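The first decision in the oracle-building flowchart (same split across techniques or not) can be sketched as a simple partition. This is an illustrative sketch only: the function name is an assumption, and the splits listed under "TIDIER" below are hypothetical placeholders, since the slides do not list TIDIER's output for these identifiers.

```python
def partition_by_agreement(splits):
    """splits maps identifier -> {technique: tuple of terms}.
    Returns (concordant, discordant): concordant maps each identifier to
    the split all techniques agree on; discordant lists identifiers that
    need a manual split."""
    concordant, discordant = {}, []
    for identifier, by_technique in splits.items():
        distinct = {tuple(terms) for terms in by_technique.values()}
        if len(distinct) == 1:
            concordant[identifier] = distinct.pop()
        else:
            discordant.append(identifier)
    return concordant, discordant

splits = {
    "userId": {
        "CamelCase": ("user", "Id"),
        "Samurai": ("user", "Id"),
        "TIDIER": ("user", "Id"),          # hypothetical
    },
    "currentsize": {
        "CamelCase": ("currentsize",),
        "Samurai": ("current", "size"),
        "TIDIER": ("current", "size"),     # hypothetical
    },
}
concordant, discordant = partition_by_agreement(splits)
print(concordant, discordant)
```

Only the discordant identifiers go to the manual step, which keeps the manual effort focused on the cases where the techniques disagree.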