
ICSM 2012 ERA


Kouters, E, Vasilescu, B, Serebrenik, A and van den Brand, MGJ (2012), "Who's who in GNOME: using LSA to merge software repository identities", In Proceedings of the 28th IEEE International Conference on Software Maintenance - Early Research Achievements (ICSM 2012 ERA), pp. 592-595. IEEE.


  1. Who’s who? (in GNOME)
     Erik Kouters, Bogdan Vasilescu, Alexander Serebrenik, Mark G.J. van den Brand
  2. It’s all about communication
     "Test #14 fails sometimes…" / "The error should be somewhere here…" / "I know how to fix it!"
     (Mathematics and Computer Science, 9/26/12)
  3. One person - multiple aliases
     Contributors sign off using a <name, email> alias:
     <John Doe, john.doe@gmail.com>
     <John J. Doe, john.doe@yahoo.com>
  4. One person - multiple aliases
     Names:
     •  John Doe
     •  Doe John
     •  J Doe
     •  John
     •  John Dooe
     •  John J. Doe
     •  John Joseph Doe
     •  John “Bone” Doe
     Emails:
     •  j.doe@domainA
     •  john.doe@domainB
     •  john DOT doe AT domainC
     •  jdoe@domainD
     •  john@domainE
     Identity merge algorithms:
     •  the “noisier” the data, the worse they perform!
  5. Latent Semantic Analysis
     Aliases sharing the email johnd@domainA form one document:
     <John Doe, johnd@domainA>, <John Joseph Doe, johnd@domainA>  ->  {john, johnd, joseph, doe}
     Document-term matrix (column for johnd@domainA; the other columns are elided on the slide):
     john     1    ..  ..  ..
     johnd    1    ..  ..  ..
     joseph   1    ..  ..  ..
     jdoe     3/4  ..  ..  ..
     doe      1    ..  ..  ..
     The entry for the term jdoe is its maximum similarity to the terms of the document:
     max similarity(jdoe, {john, johnd, joseph, doe})
       = similarity(jdoe, doe)
       = 1 – Levenshtein(jdoe, doe) / max(length(jdoe), length(doe))
       = 1 – 1/4 = 3/4
  6. Latent Semantic Analysis
     Example of aliases that share a common term but belong to different people:
     <John Smith, john@domainA>
     <John Brown, john@domainB>
     Steps applied to the document-term matrix of the previous slide:
     •  Inverse document frequency
     •  Singular value decomposition
     •  Rank (noise) reduction
     •  Cosine between documents
     •  Merge similar documents
  7. Empirical evaluation
     GNOME – all projects (Git):
     •  8618 different aliases
     •  only 4989 unique!
     [Figure: F-measure (axis from 0.75 to 1.00) of the Simple, Bird et al. and LSA approaches; panels "Average case" and "Worst case".]
     Fig. 1. The f-measures for the competing approaches. The f-measure ranges between 0 and 1 (the higher the value, the better). LSA performs as well as […]
     Paper excerpt visible on the slide: "[…] training was performed on one third of the data and testing on the other two thirds. All algorithms have been implemented in Python and, as well as the data, can be made available upon request."
  8. Summary
     The slide overlays the matrix and figure of the previous slides on excerpts from the paper; text lost in extraction is marked […].
     Experimental setup: "[…] varying the remaining. After the sensitivity analysis […] restricted the range of minLen to {2, 3, 4}, levThr to {[…] 0.75}, cosThr to {0.65, 0.70, 0.75}, and k was fixed to […] of the number of terms. In the average case, for each of the ten repetitions, training was performed on one tenth of the GNOME aliases (about 860), and testing on ten random subsets of the same size from the remaining aliases. Samples were […] instead of the entire remaining data for computational efficiency reasons. In the worst case, because of fewer aliases in the dataset (673), for each of the ten repetitions, training was performed on one third of the data and testing on the other two thirds. All algorithms have been implemented in Python and, as well as the data, can be made available upon request."
     Conclusions: "[…] algorithm based on LSA, robust […] discrepancies in VCS aliases. Empirical […] Git repositories has shown equally […] algorithm as the state of the art in […] performance in the worst case. […] approach presented in FRASR, […] software repositories [10]."
     References (as far as visible):
     [1] C. Bird et al. “Mining emai[…]”. ACM, 2006, pp. 137–143.
     [2] A. Capiluppi, A. Serebrenik […]. “[…]oping an H-Index for OSS D[…]”. 2012, pp. 251–254.
     [3] P. Christen. “A comparison […] Techniques and practical […]”. 2006, pp. 290–294.
     [4] S.T. Dumais. “Improving […] from external sources”. In: […] 23.2 (1991), pp. 229–236.
     [5] D.M. German. “The GNO[…] of open source, global softw[…]”. […]ware Process 8.4 (2003), p[…]
     [6] M. Goeminne and T. Mens. […]
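
The similarity measure of slide 5 and the "cosine between documents" step of slide 6 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`levenshtein`, `similarity`, `term_weight`, `cosine`) are made up for this example, the term sets are hard-coded from the slide because the slides do not say how aliases are split into terms, and the inverse document frequency, singular value decomposition and rank reduction steps are left out entirely.

```python
from math import sqrt


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Slide 5: 1 - Levenshtein(a, b) / max(length(a), length(b))."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def term_weight(term, doc_terms):
    """Matrix entry: best similarity between `term` and any term of the document."""
    return max(similarity(term, t) for t in doc_terms)


def cosine(u, v):
    """Cosine of the angle between two document vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Terms (matrix rows) and the document for johnd@domainA, both taken from slide 5.
vocabulary = ["john", "johnd", "joseph", "jdoe", "doe"]
doc_johnd = {"john", "johnd", "joseph", "doe"}

column = [term_weight(t, doc_johnd) for t in vocabulary]
print(column)  # [1.0, 1.0, 1.0, 0.75, 1.0]
```

In the approach outlined on slide 6, `cosine` would be applied to the document vectors after weighting and rank reduction, and documents whose cosine exceeds a threshold (the slides call it cosThr) would be merged. Applying it to raw columns as built here only illustrates the last two steps.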
