CSMR10c.ppt

Recognizing Words from Source Code Identifiers using Speech Recognition Techniques
CSMR 2010, Madrid

Nioosha Madani, Latifa Guerrouj, Massimiliano Di Penta,
Yann-Gaël Guéhéneuc, and Giuliano Antoniol

Content
Problem Statement

Aligning Strings and Words

Meta-heuristic Inspired Approach

Technologies

Case Study – Research Questions

Case Study – Results

Conclusion and Future Work

Problem Statement

The Challenge
A few years after deployment, documentation may
no longer exist.

If it exists, it will almost surely be outdated.

Customers want to change the system, add
new functionality, or fix a defect.

The only available source of information is the
code:
Identifiers;
Comments.

Problem Statement

Identifier Semantics

Researchers agree that identifier semantics are
important:
Help program comprehension;

Suggest clues.

Composed identifiers:
Camel Case or underscore-separated: MyLocalAccount, User_Address

Contraction-based: pntrctr, usrAdrss, imagEdge

Good and possibly known to the developers:
hmmm, ixoth, pqrstuvwxyz

Problem Statement

Words, Terms, Soft, and Hard Words

Term: any substring in a compound identifier.

Word: an entry in a dictionary (e.g., the English
dictionary).

Hard words: terms composing an identifier that reflect
domain concepts and are clearly demarcated:
baseAddress, user_file
Soft words: terms that differ from the identifier and are not
clearly demarcated (e.g., abbreviations, contractions,
etc.):
userarea, ptrcntr, userGid
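Hard words can be recovered mechanically because their boundaries are explicit. The following is a minimal sketch, assuming the boundaries are underscores and lower-to-upper case changes. The function name `split_hard_words` is hypothetical and is not part of the presented approach.

```python
import re


def split_hard_words(identifier: str) -> list[str]:
    """Split an identifier on underscores and camel-case boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        # Three alternatives: an uppercase run not followed by a lowercase
        # letter (acronym), an optionally capitalized lowercase word,
        # or a run of digits.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts


print(split_hard_words("baseAddress"))  # hard words: explicit boundary
print(split_hard_words("user_file"))    # hard words: explicit boundary
print(split_hard_words("userarea"))     # soft words: no boundary to find
```

Soft words such as userarea come back in one piece, which is the case the rest of the talk addresses.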

Problem Statement

Current Practices

Camel Case-based approaches plus greedy
algorithms, e.g., Lawrie et al. 2006, 2007.

Samurai by Enslen et al., 2009:
Lexicon plus a greedy algorithm;

If a contraction is used somewhere in the code, then it is
likely used in the same context as the original term;

Frequency tables of contractions and terms to split
composed identifiers.

Limitations: abbreviations are not treated; no
quantification of how close the match is to the
unknown string.

Problem Statement

Our Approach in Essence
Developers compose identifiers:
Using terms and words reflecting domain concepts,
the developers' experience, and knowledge.

Developers generate contractions via a finite set of
transformation rules:
Drop all vowels, drop prefix, drop suffix, etc.

Our approach mimics the developers' identifier generation process:
Dictionaries capturing terms and words;

A search-based technique to split exactly any unknown
string;
A distance using Dynamic Time Warping (DTW) for
continuous speech recognition [H. Ney, 1984].

Aligning Strings and Words
Modified H. Ney DTW

[Figure: DTW distance matrix. Horizontal axis: the identifier to split, pntrctrusr. Vertical axis: a dictionary of 3 words, whose labels appear to read Pntr, Ctr, and Usr.]
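The alignment idea can be sketched with a simpler stand-in for the modified Ney DTW: a dynamic program that picks the sequence of dictionary words whose concatenation has the smallest total Levenshtein distance to the identifier. The use of Levenshtein distance and the names `levenshtein` and `best_split` are assumptions made for illustration, not the distance used in the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def best_split(identifier: str, dictionary: list[str]) -> tuple[float, list[str]]:
    """Return (total distance, words) for the cheapest segmentation."""
    s = identifier.lower()
    n = len(s)
    # best[i] holds the cheapest way to cover the prefix s[:i].
    best = [(0, [])] + [(float("inf"), [])] * n
    for i in range(1, n + 1):
        for j in range(i):
            for word in dictionary:
                cost = best[j][0] + levenshtein(s[j:i], word)
                if cost < best[i][0]:
                    best[i] = (cost, best[j][1] + [word])
    return best[n]


print(best_split("pntrctrusr", ["pntr", "ctr", "usr"]))
print(best_split("usr", ["user", "pointer"]))
```

A total distance of zero means the identifier is an exact concatenation of dictionary words; a non-zero distance quantifies how far the best split is from the unknown string.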

Meta-heuristic Inspired Approach
Word Transformation Rules
Constraint: the string must remain at least 3 characters long

Drop all vowels: pointer → pntr

Drop a random vowel: user → usr

Drop a random character: pntr → ptr

Drop suffix (ing, tion, ed, ment, able): available → avail

Drop the last m characters: rectangle → rect
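The rules above can be sketched as small functions. This is one possible reading: the function names are hypothetical, and the handling of the length constraint (returning the word unchanged when the result would be shorter than 3 characters) is an assumption.

```python
import random

VOWELS = set("aeiou")
SUFFIXES = ("ing", "tion", "ed", "ment", "able")
MIN_LEN = 3  # constraint from the slide: keep at least 3 characters


def _keep(original: str, candidate: str) -> str:
    """Accept the candidate only if it respects the length constraint."""
    return candidate if len(candidate) >= MIN_LEN else original


def drop_all_vowels(word: str) -> str:
    return _keep(word, "".join(c for c in word if c not in VOWELS))


def drop_random_vowel(word: str, rng: random.Random) -> str:
    positions = [i for i, c in enumerate(word) if c in VOWELS]
    if not positions:
        return word
    i = rng.choice(positions)
    return _keep(word, word[:i] + word[i + 1:])


def drop_random_char(word: str, rng: random.Random) -> str:
    if not word:
        return word
    i = rng.randrange(len(word))
    return _keep(word, word[:i] + word[i + 1:])


def drop_suffix(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return _keep(word, word[:-len(suffix)])
    return word


def drop_last(word: str, m: int) -> str:
    return _keep(word, word[:-m]) if m > 0 else word


rng = random.Random(0)
print(drop_all_vowels("pointer"))
print(drop_suffix("available"))
print(drop_last("rectangle", 5))
print(drop_random_vowel("user", rng))
```

The random rules take an explicit `random.Random` instance so that a run can be reproduced from a seed.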

Meta-heuristic Inspired Approach
Overall Splitting (Hill Climbing) Procedure

[Flowchart; steps as recoverable from the boxes:]

Match the identifier against the current dictionary with DTW and take the best matching.

Zero distance? Yes: success! No: continue.

Randomly select a word with a minimal distance <> 0.

Apply a random transformation to the chosen word.

Add the transformed word to a temporary dictionary and run the DTW match again.

Distance reduced? Yes: keep the best matching. No: discard the word from the temporary dictionary and, if another transformation can be applied, try it.
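The loop can be sketched end to end as follows. This is a simplified reading of the flowchart, not the authors' implementation: Levenshtein distance stands in for the DTW distance, only three of the transformation rules are included, and all names (`best_match`, `random_transformation`, `hill_climb_split`) are hypothetical. Dictionary words are assumed to be non-empty.

```python
import random

VOWELS = "aeiou"


def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_match(identifier, words):
    """Return (total distance, [(word, distance), ...]) for the cheapest split."""
    ordered = sorted(words)
    n = len(identifier)
    best = [(0, [])] + [(float("inf"), [])] * n
    for i in range(1, n + 1):
        for j in range(i):
            for w in ordered:
                d = levenshtein(identifier[j:i], w)
                if best[j][0] + d < best[i][0]:
                    best[i] = (best[j][0] + d, best[j][1] + [(w, d)])
    return best[n]


def random_transformation(word, rng):
    i = rng.randrange(len(word))
    candidates = [
        "".join(c for c in word if c not in VOWELS),  # drop all vowels
        word[:i] + word[i + 1:],                      # drop a random character
        word[:-1],                                    # drop the last character
    ]
    shorter = rng.choice(candidates)
    return shorter if len(shorter) >= 3 else word     # length constraint


def hill_climb_split(identifier, dictionary, max_iter=1000, seed=0):
    if not dictionary:
        raise ValueError("dictionary must not be empty")
    rng = random.Random(seed)
    identifier = identifier.lower()
    origin = {w: w for w in dictionary}  # transformed word -> dictionary word
    dist, match = best_match(identifier, origin)
    for _ in range(max_iter):
        if dist == 0:
            break  # exact split found
        nonzero = [(w, d) for w, d in match if d > 0]
        low = min(d for _, d in nonzero)
        word = rng.choice([w for w, d in nonzero if d == low])
        new_word = random_transformation(word, rng)
        if new_word in origin:
            continue  # nothing new to try
        trial = dict(origin)              # temporary dictionary
        trial[new_word] = origin[word]
        new_dist, new_match = best_match(identifier, trial)
        if new_dist < dist:               # keep only strict improvements
            origin, dist, match = trial, new_dist, new_match
    return dist, [origin[w] for w, _ in match]


print(hill_climb_split("pntrusr", ["pointer", "user"], max_iter=200))
```

Each matched term is reported as the dictionary word it was derived from, so a contraction is mapped back to its expansion whenever the search reaches it. Because only strict improvements are kept, the final distance never exceeds the initial one, but the search may stop at a non-zero distance.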

Case Study – Research Questions

RQ1: What is the percentage of identifiers
correctly split by the proposed approach?

RQ2: How does the proposed approach perform
compared with the Camel Case splitter?

RQ3: What percentage of identifiers containing
word abbreviations is the approach able to
map to dictionary words?


Case Study - Results

JHotDraw – Java

16 KLOC
155 files
2,348 identifiers (longer than 2 chars)
957 manually segmented identifiers

Lynx – C

174 KLOC
247 files
12,194 identifiers (longer than 2 chars)
3,085 manually segmented identifiers



RQ1 - Percentage of Correct Classifications

Systems     Split Ids   Single iteration   Multiple iterations   Errors
JHotDraw    957         891 (93%)          920 (95%)             37
Lynx        3,085       2,169 (70%)        2,901 (94%)           271

Typical cases where the approach failed:

afaik, ihmo, foobar, fsize …


RQ2 - Camel Case Split

Systems     Split Ids   Correct Split   Errors
JHotDraw    957         874 (91%)       83
Lynx        3,085       561 (18%)       2,524

Statistical comparison (Fisher’s exact test) with our approach:

Null Hypothesis (H0): The proportions of correct splittings
obtained by the two approaches are not significantly different.

• JHotDraw: Odds Ratio = 1.3, p-value = 0.1

• Lynx: Odds Ratio = 60, p-value < 0.001
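For readers who want to see how such a comparison is computed, here is a small pure-Python sketch of a two-sided Fisher's exact test on a 2x2 table of correct and incorrect splits. The slides do not say which counts fed the test. Using the JHotDraw single-iteration count (891 correct, hence 957 - 891 = 66 errors) against the Camel Case counts (874 correct, 83 errors) is an assumption, although the sample odds ratio it yields, about 1.28, is in line with the reported 1.3. The function names are hypothetical.

```python
from math import exp, lgamma


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the
    hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def log_p(x: int) -> float:  # probability of x in the top-left cell
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(n, col1)

    observed = log_p(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(exp(log_p(x)) for x in range(lo, hi + 1)
            if log_p(x) <= observed + 1e-7)
    return odds_ratio, min(1.0, p)


# Rows: approach (search-based, Camel Case); columns: correct, errors.
odds, p_value = fisher_exact(891, 66, 874, 83)
print(f"odds ratio = {odds:.2f}, p-value = {p_value:.3f}")
```

Statistical packages may report a conditional maximum-likelihood odds ratio rather than the sample odds ratio computed here, so small differences from published values are expected.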


RQ3 - Percentage of Correctly Split Identifiers

Systems     Split Ids   Correct Split   Errors
JHotDraw    957         920 (95%)       37
Lynx        3,085       2,901 (94%)     271

The novel identifier splitting approach performs
better than the Camel Case splitter.


Multiple Possible Splits - Successes

borddec          bord decimal             bord decision
anchorlen        anchor length            anchor lender
drawrect         draw rectangle
drawroundrect    draw round rectangle
fillrect         fill rectangle
javadrawapp      java draw apply          java draw append
netapp           net apply                net append
newlen           new length               new lender
nothingapp       nothing apply            nothing application
addcolumninfo    add column information   add column inform
addlbl           add label
casecomp         case compare             case complete

Max of 10000 iterations


Multiple Possible Splits - Failures

serialversionuid               serial version did
selectionzordered              selection ordered
removefrfigurerequestremove    remove figure request remove
jhotdraw                       hot draw
getvadjustable                 get bad just able
fimagewidth                    him age width
fimageheight                   him age height
writeref                       write red

Max of 10000 iterations

DTW does not account for context, syntax, or semantics


Discussion - Challenges

How can we expand fwrite or pdraw?

How can we avoid expanding FileLen into File
Lender rather than File Length?

How can we recognize that ImagEdit has a correct
split at distance 1 and not 0?

How can we expand/split pqrstuvwxyz?


Threats to Validity
External validity:
We analyzed only two systems;
However, they come from different domains and use different programming languages.

Construct validity: errors may be present in the oracle!
We detected 1% of errors in the first oracle release;
We did our best to guess the programmers' intentions, but we cannot
exclude errors.

Reliability validity: a replication package is available.

Internal validity: subjectivity and bias in building the oracle:

The same researcher built both oracles;
The oracles were validated by two other researchers;
The oracles are large enough that a few percent of errors would not change the
conclusions.

Conclusion and Future Work
Conclusion

We presented a search-based approach to
automatically segment source code identifiers.

The novel approach is inspired by developers'
behavior when composing identifiers.

The approach uses a dictionary, a distance computed
via DTW, and a set of word transformations.

Results on JHotDraw and Lynx show the superiority
of the approach over a simple Camel Case splitter.

Conclusion and Future Work
Future Work

We plan to:

Expand the evaluation to other systems.

Introduce enhanced heuristics for term selection
and word transformations.

Contextualize our search by coupling our
algorithm with the approach of Enslen et al.
[ELK, 2009] (restrict the search to the words used
in the same method, class, or package).

Finally… Questions

Thank you for your attention


References

[ELK, 2009] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker,
“Mining source code to automatically split identifiers for software
analysis,” Mining Software Repositories, International Workshop on,
vol. 0, pp. 71 - 80, 2009.

[H. Ney, 1984] H. Ney, “The use of a one-stage dynamic programming
algorithm for connected word recognition,” Acoustics, Speech and
Signal Processing, IEEE Transactions on, vol. 32, no. 2, pp. 263 - 271,
Apr 1984.

D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “Effective identifier
names for comprehension and memory,” Innovations in Systems and
Software Engineering, vol. 3, no. 4, pp. 303 - 318, 2007.

D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “What’s in a name? a
study of identifiers,” in Proc. of the International Conference on
Program Comprehension (ICPC), 2006, pp. 3 - 12.

Overall Splitting (Hill Climbing) Procedure

[Flowchart: the identifier is matched by DTW against the dictionary, producing a ranked word list and a best matching. Zero distance? Yes: success! No: a temporary dictionary is tried. Improved? Yes: save the word and create a new dictionary. No: discard the word and create a new dictionary.]
