
Using IR methods for labeling source code artifacts: Is it worthwhile?

Information Retrieval (IR) techniques have been used for various software engineering tasks, including the labeling of software artifacts by extracting “keywords” from them. Such techniques include Vector Space Models, Latent Semantic Indexing, Latent Dirichlet Allocation, as well as customized heuristics that extract words from specific source code elements. This paper investigates how source code artifact labeling performed by IR techniques overlaps with (and differs from) labeling performed by humans. This has been done by asking a group of subjects to label 20 classes from two Java software systems, JHotDraw and eXVantage. Results indicate that, in most cases, automatic labeling is more similar to human-based labeling when simpler techniques are used - e.g., using words from class and method names - as these better reflect how humans behave. Clustering-based approaches (LSI and LDA), instead, are more worthwhile for source code artifacts with high verbosity, as well as for artifacts that require more effort to be labeled manually.



  1. 1. Using IR Methods for Labeling Source Code Artifacts: is it Worthwhile? Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella
  2. 2. Context • Source code is text too! • Lexicon quality impacts software quality • IR techniques used to analyze software • Emerging application: label software artifacts • Labeling packages [Kuhn et al., 2007] • Labeling changes [Thomas et al., 2010] • Relating topics in high-level artifacts and source code [Gethers et al., 2011]
  3. 3. ok... but...
  4. 4. Are these automatic labelings meaningful?
  5. 5. Related study: Haiduc et al., 2010
  6. 6. Empirical Study Goal: compare human-generated source code labelings with automatically generated ones Quality focus: quality of automatically generated source code labelings Perspective: researchers interested in developing source code labeling techniques
  7. 7. Research Questions RQ1: How large is the overlap between the keywords identified by developers when describing a source code artifact and those identified by an automatic technique? RQ2: Which characteristics of source code artifacts affect the overlap of automatic labeling techniques with the human-generated labels?
  8. 8. Context Objects: 10 classes from eXVantage (industrial test data generation tool), 10 classes from JHotDraw Subjects: 17 software engineering students (second-year Bachelor's students in CS, Univ. of Molise)
  9. 9. Study Procedure
  10. 10. Procedure Overview 1. Participants’ training on the system 2. Presentation of the experiment procedure 3. Manual labeling by participants 4. Automatic labeling 5. Comparison
  11. 11. Manual Labeling • Subjects label each class by selecting 10 words from it • Time spent on each class annotated • Offline study, lasted 2 weeks [Slide figure: a source code file labeled with example words such as book, hotel, room, reservation, arrival, departure, smoking, double, card, breakfast]
  12. 12. Aggregating manual labeling Each artifact is labeled using the terms selected by at least 50% of the subjects [Slide example: three subjects’ word lists are merged into term counts - room 3, arrival 3, book 2, hotel 2, reservation 2, departure 2, double 2, card 2] (see the aggregation sketch after the transcript)
  13. 13. Automatic Labeling
  14. 14. Text Processing • Extracted words from • source code + comments • comments only • Identifier splitting (camel case) • Pruned stop words and programming language keywords • Stemming (Porter) • Term indexing using: tf or tf-idf (see the preprocessing sketch after the transcript)
  15. 15. Labeling techniques • Simple signature: words from class name, method names and params, attribute names • VSM: terms ranked according to tf or tf-idf • Latent Semantic Indexing (LSI) • Class methods considered as documents • Words having the highest weight in the LSI space • Latent Dirichlet Allocation (LDA) • Different number of topics: 2, #Methods/2, #Methods • Core words: having highest probability on the overall set of topics • Core topics: words from the topic with highest probability (see the LSI sketch after the transcript)
  16. 16. Measurements: RQ1 Asymmetric Jaccard, used to avoid penalizing the automatic approaches: overlap_mi(Ci) = |K(Ci) ∩ K_mi(Ci)| / |K_mi(Ci)|, where K(Ci) = {t1, ..., tm} is the manual labeling of Ci and K_mi(Ci) = {t1, ..., th} is the automatic labeling of Ci produced by technique mi (see the overlap sketch after the transcript)
  17. 17. Measurements: RQ2 A) Ability of LDA and LSI to cluster related classes
  18. 18. [Slide image: example source code fragments shown as screenshots, plus ICPC 2012 (20th IEEE International Conference on Program Comprehension) banner; not recoverable as text]
  19. 19. Measurements: RQ2 A) Ability of LDA and LSI to cluster related classes, measured as the entropy of the terms in a class: H(Ci) = Σ_{j=1..m} (tf_j / n) · log(n / tf_j), with n = Σ_{k=1..m} tf_k, normalized as Ĥ(Ci) = H(Ci) / log(m) (see the entropy sketch after the transcript) B) Correlation between overlap and time spent by subjects to label artifacts
  20. 20. Results
  21. 21. RQ1: eXVantage [Slide chart: overlap between manual and automatic labelings for Signature, VSM, LSI, and LDA variants (tf / tf-idf weighting; core-topic vs. core-terms selection; n = 2, M/2, M topics), comparing Comments+Code against Comments only]
  22. 22. RQ1: JHotDraw [Slide chart: the same overlap comparison as the previous slide, for the JHotDraw classes]
  23. 23. Why LDA does not work well [Slide chart: class distances in eXVantage plotted over topic 1 vs. topic 2]
  24. 24. RQ2: Entropy Comments do not contain clearly dominant words [Slide charts: entropy of terms for eXVantage and JHotDraw, Code+comments vs. Comments only]
  25. 25. RQ2: Effort to label artifacts Pearson correlation of labeling effort with class size and comment verbosity - JHotDraw: class size 0.6, comment verbosity -0.25; eXVantage: class size 0, comment verbosity -0.13. Different comment verbosity in JHotDraw (6) and eXVantage (14) (see the correlation sketch after the transcript)
  26. 26. RQ2: VSM vs. LSI [Slide charts: overlap vs. effort needed to label a class (Low vs. High), for VSM and LSI, on JHotDraw and eXVantage]
  27. 27. Conclusions
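
As an illustration of the aggregation step on slide 12, the following Python sketch (hypothetical data and helper name, not the study's scripts) keeps only the terms chosen by at least half of the subjects:

```python
from collections import Counter

def aggregate_labels(subject_labels, threshold=0.5):
    """Keep the terms selected by at least `threshold` of the subjects."""
    votes = Counter(term for label in subject_labels for term in set(label))
    needed = threshold * len(subject_labels)
    return {term for term, count in votes.items() if count >= needed}

# Three hypothetical subjects labeling the same class
labels = [
    {"book", "hotel", "room", "reservation", "arrival", "departure"},
    {"book", "hotel", "room", "refund", "arrival", "check"},
    {"room", "reservation", "arrival", "departure", "date", "card"},
]
print(aggregate_labels(labels))  # terms picked by at least 2 of the 3 subjects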
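```

A minimal sketch of the text-processing pipeline on slide 14, assuming simple regular-expression tokenization and a hand-picked stop-word list (my simplification, not the authors' implementation); the study additionally applies Porter stemming and, optionally, tf-idf weighting:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "for", "this", "that"}
JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return", "new", "class", "final"}

def split_identifier(token):
    """Split a camelCase (or underscore) identifier into lower-case words, e.g. getRoomRate -> get, room, rate."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)
    return [p.lower() for p in parts]

def term_frequencies(source_text):
    """Extract terms from source code plus comments and index them by term frequency (tf)."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text)
    words = [w for t in tokens for w in split_identifier(t)]
    # Prune stop words, programming language keywords, and single characters
    words = [w for w in words if w not in STOP_WORDS and w not in JAVA_KEYWORDS and len(w) > 1]
    # The study also applies Porter stemming here (e.g. via a stemming library); omitted to stay dependency-free
    return Counter(words)  # tf; tf-idf would additionally weight terms by inverse document frequency

snippet = "/** Computes the room rate for a reservation. */ public int getRoomRate(Reservation res) { return res.rate; }"
print(term_frequencies(snippet).most_common(10))
```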
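
One way the LSI labeling on slide 15 could be realized, as a rough sketch using a plain NumPy SVD (the slides do not show the actual implementation or parameters): methods are treated as documents, and a term's "weight in the LSI space" is taken here as the norm of its coordinates in the reduced space, which is one reasonable reading of the slide.

```python
import numpy as np

def lsi_label(method_term_counts, k=2, top_n=10):
    """Rank a class's terms by their weight in a k-dimensional LSI space.

    method_term_counts: list of {term: frequency} dicts, one per method
    (methods are treated as documents, as in the study).
    """
    vocab = sorted({t for counts in method_term_counts for t in counts})
    # Term-by-document matrix
    A = np.array([[counts.get(t, 0) for counts in method_term_counts] for t in vocab],
                 dtype=float)
    # Truncated SVD: A ~= U_k * S_k * Vt_k
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(S))
    # Weight of each term = norm of its scaled coordinates in the reduced space
    term_weights = np.linalg.norm(U[:, :k] * S[:k], axis=1)
    ranked = sorted(zip(vocab, term_weights), key=lambda x: -x[1])
    return [term for term, _ in ranked[:top_n]]

# Hypothetical class with three methods, already reduced to term frequencies
methods = [
    {"room": 3, "rate": 2, "reservation": 1},
    {"reservation": 2, "arrival": 1, "departure": 1},
    {"room": 1, "payment": 2, "card": 2},
]
print(lsi_label(methods, k=2, top_n=5))
```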
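
The asymmetric overlap measure on slide 16 translates directly into code; a small sketch:

```python
def asymmetric_overlap(manual_terms, automatic_terms):
    """|manual ∩ automatic| / |automatic|: the automatic technique is not
    penalized for suggesting fewer terms than the human labelers agreed on."""
    manual, automatic = set(manual_terms), set(automatic_terms)
    if not automatic:
        return 0.0
    return len(manual & automatic) / len(automatic)

print(asymmetric_overlap({"room", "hotel", "arrival"}, {"room", "hotel", "payment"}))  # 0.666...
```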
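
A direct encoding of the normalized term entropy defined on slide 19 (a sketch; term frequencies are assumed to be precomputed by the text-processing step):

```python
import math

def normalized_term_entropy(term_frequencies):
    """H(C) = sum_j (tf_j / n) * log(n / tf_j), with n = sum_k tf_k,
    normalized by log(m) so that values lie in [0, 1]."""
    tfs = [tf for tf in term_frequencies if tf > 0]
    n = sum(tfs)
    m = len(tfs)
    if m <= 1:
        return 0.0
    h = sum((tf / n) * math.log(n / tf) for tf in tfs)
    return h / math.log(m)

print(normalized_term_entropy([10, 1, 1]))  # one dominant term -> lower entropy
print(normalized_term_entropy([4, 4, 4]))   # evenly spread terms -> 1.0
```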
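
The values on slide 25 are plain Pearson coefficients; a sketch with made-up numbers, using Python's statistics module (3.10+), just to show the computation:

```python
from statistics import correlation  # Pearson's r

# Hypothetical per-class measurements (not the study's data)
label_time        = [12, 18, 7, 25, 14]          # minutes spent labeling each class
class_size        = [120, 300, 80, 410, 150]     # e.g. lines of code
comment_verbosity = [0.30, 0.10, 0.45, 0.05, 0.25]

print(correlation(label_time, class_size))         # r between labeling effort and class size
print(correlation(label_time, comment_verbosity))  # r between labeling effort and comment verbosity
```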
