Your SlideShare is downloading. ×
0
LINSEN
AN EFFICIENT APPROACH
TO SPLIT IDENTIFIERS AND
EXPAND ABBREVIATIONS
Anna Corazza, Sergio Di Martino, Valerio Maggio...
MOTIVA
TIONS IR FOR SE
MOTIVA
TIONS IR FOR SE
IRFORNATURALLANGUAGE
1. Tokenization
IRFORNATURALLANGUAGE
1. Tokenization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
di...
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
di...
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
di...
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
di...
Implicit assumption: The “same” words are used whenever a
particular concept is described
1. Tokenization
2. Normalization...
Implicit assumption: The “same” words are used whenever a
particular concept is described
1. Tokenization
2. Normalization...
1. Tokenization
IRFORSOURCECODE
2. Normalization
1.5 Identifier
Splitting
1. Tokenization
IRFORSOURCECODE
2. Normalization
1.5 Identifier
Splitting
• snake_case Splitter: r’(?<=w)_’
• display_box =...
IRFORSOURCECODE
IRFORSOURCECODE
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
IRFORSOURCECODE
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
IRFORSOURCECODE ...
Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
IRF...
Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
• r...
Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
• r...
1. Tokenization
IDENTIFIERMAPPING
2. Normalization
1.5 Identifier
Mapping
• SAMURAI (Enslen, et.al , 2011)
• TIDIER (Guerro...
LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
• Based on an efficient String Matching...
LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
• Based on an efficient String Matching...
LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
• Based on an efficient String Matching...
There could be multiple and equally correct splitting or
expansion solutions
THEAMBIGUITYPROBLEM
There could be multiple and equally correct splitting or
expansion solutions
• r as for Rectangle OR red
THEAMBIGUITYPROBL...
There could be multiple and equally correct splitting or
expansion solutions
• r as for Rectangle OR red
THEAMBIGUITYPROBL...
DICTIONARIES
DICTIONARIES
DICTIONARIES
Application-aware
Dictionaries
DICTIONARIES
Application-aware
Dictionaries(108,315 Entries)
DICTIONARIES
Application-aware
Dictionaries
(22,940 Entries)
(108,315 Entries)
DICTIONARIES
Application-aware
Dictionaries
(22,940 Entries)
(108,315 Entries)
(588 Entries)
GRAPHMODELModel: Weighted Directed Graph Example: drawXORRect identifier
GRAPHMODEL
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect ...
GRAPHMODEL
• ARCS corresponds to matchings between identifier substrings
and dictionary words
• NODES correspond to charac...
GRAPHMODEL
• ARCS corresponds to matchings between identifier substrings
and dictionary words
• Application of the String ...
GRAPHMODEL
• ARCS corresponds to matchings between identifier substrings
and dictionary words
• Application of the String ...
GRAPHMODEL
• Every Arc is Labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
...
GRAPHMODEL
• The final Mapping Solution corresponds to the sequence of labels in the
path with the minimum cost (Djikstra ...
STRING MATCHING
• Application of the Baeza-Yates and Perlberg (BYP) Algorithm
• Signature: BYP(identifier, word, φ(word))
...
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
..
draw,
the,
are,...
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
..
draw,
the,
are,...
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
..
draw,
the,
are,...
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
...
echo,
testing,...
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
...
abort
absolute...
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
0
11
1
9
4
“r”,C-M...
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
0
11
1
9
4
“r”,C-M...
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
0
11
1
9
4
“r”,C-M...
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“...
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“...
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“...
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“...
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“...
EMPIRI
CAL
E V A L U
A T I O N
RESEARCH
QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches as for t...
EMPIRI
CAL
E V A L U
A T I O N
RESEARCH
QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches as for t...
EMPIRI
CAL
E V A L U
A T I O N
RESEARCH
QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches as for t...
CASE STUDIES
DTW (Madani et. al 2010)
GenTest+Normalize (Lawrie and Binkley, 2011)
RQ1 and
RQ2}
RQ1 only
LUDISO Dataset
(2...
EVALUATION METRICS
Comparability of Results:
Accuracy rate
Qualitative Evaluation:
Precision/Recall/F-1
[Guerrouj, et.al ,...
EVALUATION METRICS
Comparability of Results:
Accuracy rate
Qualitative Evaluation:
Precision/Recall/F-1
[Guerrouj, et.al ,...
EVALUATION METRICS
Comparability of Results:
Accuracy rate
• Identifier Level evaluation:
Each mapping result must be comp...
RQ1: SPLITTING
Accuracy Rates for the
comparison with
DTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5
D...
Accuracy Rates for the
comparison with GenTest
(Lawrie and Binkley, 2011)
0
0.175
0.35
0.525
0.7
which 2.20 a2ps 4.14
Iden...
Accuracy Rates for the
comparison with
DTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5
DTW
LINSEN
RQ2: ...
Accuracy Rates for the
comparison with Normalize
(Lawrie and Binkley, 2011)
0
0.15
0.3
0.45
0.6
which 2.20 a2ps 4.14
Ident...
Accuracy Rates for the
comparison with
AMAP (Hill and Pollock, 2008)
0
0.225
0.45
0.675
0.9
CW DL OO AC PR SL
AMAP
LINSEN
...
CONCLUSIONS
CONCLUSIONS
CONCLUSIONS
CONCLUSIONS
CONCLUSIONS
FUTURE WORKS
• Evaluation of the impact of each adopted dictionary on
the performance
• Improve or change or add dictionar...
THANK YOU
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
Valerio Maggio
Ph.D. Student, University of Naples “Fede...
Upcoming SlideShare
Loading in...5
×

LINSEN an efficient approach to split identifiers and expand abbreviations

137

Published on

"Linsen an efficient approach to split identifiers and expand abbreviations"

Slides presented at the International Conference of Software Maintenance (ICSM) 2012, Riva del Garda (TN), Italy

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
137
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "LINSEN an efficient approach to split identifiers and expand abbreviations"

  1. 1. LINSEN AN EFFICIENT APPROACH TO SPLIT IDENTIFIERS AND EXPAND ABBREVIATIONS Anna Corazza, Sergio Di Martino, Valerio Maggio Università di Napoli “Federico II” 26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
  2. 2. MOTIVA TIONS IR FOR SE
  3. 3. MOTIVA TIONS IR FOR SE
  4. 4. IRFORNATURALLANGUAGE
  5. 5. 1. Tokenization IRFORNATURALLANGUAGE
  6. 6. 1. Tokenization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
  7. 7. 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
  8. 8. 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...
  9. 9. 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...2.Remove StopWords
  10. 10. 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords 3.Apply Stemming 4. ...
  11. 11. Implicit assumption: The “same” words are used whenever a particular concept is described 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords 3.Apply Stemming 4. ...
  12. 12. Implicit assumption: The “same” words are used whenever a particular concept is described 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords 3.Apply Stemming 4. ...
  13. 13. 1. Tokenization IRFORSOURCECODE 2. Normalization 1.5 Identifier Splitting
  14. 14. 1. Tokenization IRFORSOURCECODE 2. Normalization 1.5 Identifier Splitting • snake_case Splitter: r’(?<=w)_’ • display_box ==> display | box • camelCase/PascalCase Splitter: r’(?<!^)([A-Z][a-z]+)’ • displayBox ==> display | Box draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
  15. 15. IRFORSOURCECODE
  16. 16. IRFORSOURCECODE
  17. 17. • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • drawXORRect ==> drawXOR | Rect • drawxorrect ==> NO SPLIT IRFORSOURCECODE
  18. 18. • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • drawXORRect ==> drawXOR | Rect • drawxorrect ==> NO SPLIT IRFORSOURCECODE Splitting algorithms based on naming conventions are not robust enough
  19. 19. Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code IRFORSOURCECODE
  20. 20. Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code • rect as for Rectangle • r as for Rectangle IRFORSOURCECODE
  21. 21. Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code • rect as for Rectangle • r as for Rectangle IRFORSOURCECODE
  22. 22. 1. Tokenization IDENTIFIERMAPPING 2. Normalization 1.5 Identifier Mapping • SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011) • GenTest+Normalize (Lawrie and Binkley, 2011) • AMAP (Hill and Pollock, 2008) • ... • LINSEN draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
  23. 23. LINSEN A L G O R I T H M CONTRIBUTION • Novel technique for the Identifier Mapping
  24. 24. LINSEN A L G O R I T H M CONTRIBUTION • Novel technique for the Identifier Mapping • Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)
  25. 25. LINSEN A L G O R I T H M CONTRIBUTION • Novel technique for the Identifier Mapping • Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP) • Applied on a Graph-based model
  26. 26. LINSEN A L G O R I T H M CONTRIBUTION • Novel technique for the Identifier Mapping • Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP) • Applied on a Graph-based model • Able to both Split Identifiers and Expand possible occurring abbreviations
  27. 27. There could be multiple and equally correct splitting or expansion solutions THEAMBIGUITYPROBLEM
  28. 28. There could be multiple and equally correct splitting or expansion solutions • r as for Rectangle OR red THEAMBIGUITYPROBLEM
  29. 29. There could be multiple and equally correct splitting or expansion solutions • r as for Rectangle OR red THEAMBIGUITYPROBLEM •nsISupport ==> ns|IS|up|ports OR ==> ns|I|Supports
  30. 30. DICTIONARIES
  31. 31. DICTIONARIES
  32. 32. DICTIONARIES Application-aware Dictionaries
  33. 33. DICTIONARIES Application-aware Dictionaries(108,315 Entries)
  34. 34. DICTIONARIES Application-aware Dictionaries (22,940 Entries) (108,315 Entries)
  35. 35. DICTIONARIES Application-aware Dictionaries (22,940 Entries) (108,315 Entries) (588 Entries)
  36. 36. GRAPHMODELModel: Weighted Directed Graph Example: drawXORRect identifier
  37. 37. GRAPHMODEL • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 5 7 8 2 63 10
  38. 38. GRAPHMODEL • ARCS corresponds to matchings between identifier substrings and dictionary words • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 5 7 8 2 63 10 “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”)
  39. 39. GRAPHMODEL • ARCS corresponds to matchings between identifier substrings and dictionary words • Application of the String Matching Algorithm (BYP) • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 5 7 8 2 63 10 “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”)
  40. 40. GRAPHMODEL • ARCS corresponds to matchings between identifier substrings and dictionary words • Application of the String Matching Algorithm (BYP) • Padding Arcs to ensure the Graph always connected • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”) “R”,C-MAX
  41. 41. GRAPHMODEL • Every Arc is Labelled with the corresponding dictionary word • Weights represent the “cost” of each matching • Cost function [c(“word”)] favors longest words and words coming from the application-aware dictionaries Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”) “R”,C-MAX
  42. 42. GRAPHMODEL • The final Mapping Solution corresponds to the sequence of labels in the path with the minimum cost (Djikstra Algorithm) Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”) “R”,C-MAX
  43. 43. STRING MATCHING • Application of the Baeza-Yates and Perlberg (BYP) Algorithm • Signature: BYP(identifier, word, φ(word)) • identifier: target string • word: string to match • φ(·): Tolerance (Error) function • Bounds the length of acceptable matchings Advantage: Use the same algorithm for both the splitting and the expansion step with different input Tolerance function
  44. 44. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... ... abort absolute abstract ... or ... raw ... DFile DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  45. 45. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... ... abort absolute abstract ... or ... raw ... 0 11 1 9 4 5 7 8 2 63 10 DFile DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  46. 46. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... ... abort absolute abstract ... or ... raw ... 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “R”,C-MAX DFile DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  47. 47. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING ... echo, testing, threading, xpm, xor, .... ... abort absolute abstract ... or ... raw ... 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  48. 48. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING ... abort absolute abstract ... or ... raw ... 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  49. 49. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish Identifier: drawXORRect
  50. 50. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish Identifier: drawXORRect
  51. 51. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish Identifier: drawXORRect
  52. 52. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglishDFile Identifier: drawXORRect
  53. 53. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglishDFile Identifier: drawXORRect
  54. 54. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish “red”,c(“red”) .. draw, the, are, null, handle, box, red, rectangle, ... DFile Identifier: drawXORRect
  55. 55. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish “red”,c(“red”) “rectangle”,c(“rectangle”) .. draw, the, are, null, handle, box, red, rectangle, ... DFile Identifier: drawXORRect
  56. 56. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish “red”,c(“red”) “rectangle”,c(“rectangle”) .. draw, the, are, null, handle, box, red, rectangle, ... DFile Identifier: drawXORRect
  57. 57. EMPIRI CAL E V A L U A T I O N RESEARCH QUESTIONS • RQ1: How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers?
  58. 58. EMPIRI CAL E V A L U A T I O N RESEARCH QUESTIONS • RQ1: How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers? • RQ2: How does LINSEN compare with state-of-the-art approaches as for the mapping of identifiers to dictionary words?
  59. 59. EMPIRI CAL E V A L U A T I O N RESEARCH QUESTIONS • RQ1: How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers? • RQ2: How does LINSEN compare with state-of-the-art approaches as for the mapping of identifiers to dictionary words? • RQ3: What is the ability of the LINSEN approach in dealing with different types of abbreviations?
  60. 60. CASE STUDIES DTW (Madani et. al 2010) GenTest+Normalize (Lawrie and Binkley, 2011) RQ1 and RQ2} RQ1 only LUDISO Dataset (2012) AMAP (Hill and Pollock, 2008) RQ3 only 15 out of 750 software systems Covering the 58% of total identifiers EMPIRI CAL E V A L U A T I O N
  61. 61. EVALUATION METRICS Comparability of Results: Accuracy rate Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011] EMPIRI CAL E V A L U A T I O N
  62. 62. EVALUATION METRICS Comparability of Results: Accuracy rate Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011] EMPIRI CAL E V A L U A T I O N
  63. 63. EVALUATION METRICS Comparability of Results: Accuracy rate • Identifier Level evaluation: Each mapping result must be completely correct • Soft-word Level evaluation: “Partial credit” given to each word correctly mapped Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011] As for the comparison with GenTest+Normalize (Lawrie and Binkley, 2011) EMPIRI CAL E V A L U A T I O N
  64. 64. RQ1: SPLITTING Accuracy Rates for the comparison with DTW (Madani et. al 2010) 0 0.25 0.5 0.75 1 JhotDraw 5.1 Lynx 2.8.5 DTW LINSEN DTW LINSEN DTW LINSEN RE SULTS R Q 1
  65. 65. Accuracy Rates for the comparison with GenTest (Lawrie and Binkley, 2011) 0 0.175 0.35 0.525 0.7 which 2.20 a2ps 4.14 Identifier Level 0 0.2 0.4 0.6 0.8 which 2.20 a2ps 4.14 Soft-word Level GenTest LINSEN GenTest LINSEN RE SULTS R Q 1 RQ1: SPLITTING
  66. 66. Accuracy Rates for the comparison with DTW (Madani et. al 2010) 0 0.25 0.5 0.75 1 JhotDraw 5.1 Lynx 2.8.5 DTW LINSEN RQ2: MAPPING RE SULTS R Q 2
  67. 67. Accuracy Rates for the comparison with Normalize (Lawrie and Binkley, 2011) 0 0.15 0.3 0.45 0.6 which 2.20 a2ps 4.14 Identifier Level 0 0.225 0.45 0.675 0.9 which 2.20 a2ps 4.14 Soft-word Level Normalize LINSEN Normalize LINSEN RE SULTS R Q 2 RQ2: MAPPING
  68. 68. Accuracy Rates for the comparison with AMAP (Hill and Pollock, 2008) 0 0.225 0.45 0.675 0.9 CW DL OO AC PR SL AMAP LINSEN RQ3: EXPANSION RE SULTS R Q 3 CW: Combination Words DL: Dropped Letters OO: Others AC: Acronyms PR: Prefix SL: Single Letters
  69. 69. CONCLUSIONS
  70. 70. CONCLUSIONS
  71. 71. CONCLUSIONS
  72. 72. CONCLUSIONS
  73. 73. CONCLUSIONS
  74. 74. FUTURE WORKS • Evaluation of the impact of each adopted dictionary on the performance • Improve or change or add dictionaries • Improve the implementation of the prototype to speed up the computation • Make use of parallel computation to process each identifier in isolation
  75. 75. THANK YOU 26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy Valerio Maggio Ph.D. Student, University of Naples “Federico II” valerio.maggio@unina.it
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×