ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search



Paper: A Comparison of Stemmers on Source Code Identifiers for Software Search

Authors: Andrew Wiese, Valerie Ho, Emily Hill.

Session: ERA1 - Linguistic Analysis of Software Artifacts

Published in: Technology, Education

  1. A Comparison of Stemmers on Source Code Identifiers for Software Search
       • Andrew Wiese, Valerie Ho, Emily Hill
       • Montclair State University
       • Thursday, October 6, 2011
  2. Problem: Source Code Search
       • Challenge: query words may not exactly match source code words, which can hurt search
       • Example: “add item” query should match
           • add, adds, adding, added
           • item, items
       • Stemming is used by Information Retrieval (IR) systems to strip suffixes
           • reduces all words to a root form, or stem
           • a.k.a. word conflation
  3. What makes stemming source code different from traditional IR?
       • Word choice is more restrictive in naming identifiers than in natural language (NL) documents
           • NL: stem, stems, stemmer, stemming, stemmed
           • Code: stem, stemmer
       • Classes that encapsulate actions have names with nominalized verbs:
           • play → player
           • compile → compiler
       • Traditional IR prefers the light Porter stemmer
           • tends not to stem across parts of speech
           • E.g., noun ‘player’ will not stem to verb ‘play’
  4. Stemming Challenges
       • Understemming
           • stemmer assigns different stems to words in the same concept
           • reduces number of relevant results in search (i.e., reduces recall)
       • Overstemming
           • stemmer assigns the same stem for words with different meanings (e.g., business conflated with busy, university with universe)
           • increases number of irrelevant results (i.e., reduces precision)
       • Stemmers categorized by type of error
           • Light stemmers: understem
           • Heavy stemmers: overstem
  5. A Brief History of Stemming
       • Light Stemmers (tend not to stem across parts of speech)
           • Porter (1980): rule-based, simple & efficient
               • Most popular stemmer in IR & SE
           • Snowball (2001): minor rule improvements
           • KStem (1993): morphology-based
               • based on word’s structure & hand-tuned dictionary
               • in experiments shown to outperform Porter’s
       • Heavy Stemmers
           • Lovins (1968): rule-based
           • Paice (1990): rule-based
           • MStem: morphological (PC-Kimmo), specialized for source code using word frequencies
  6. Our Contribution
       • Compare performance of 5 stemmers on source code identifiers
       • Evaluation 1: compare conflated word classes
           • started from 100 most frequently occurring words in 9,000 open source Java programs
           • analyzed by 2 human Java programmers in terms of accuracy & completeness
       • Evaluation 2: compare effect of using 5 stemmers vs not stemming on 8 search tasks
  7. Stemmer Word Classes Comparison
       • accurate: word class contains no unrelated words
       • complete: word class not missing related words (rely on greediness & diversity of stemmers)
       • context sensitive (CS): multiple senses or disagreement
       • [Bar chart. Y-axis: No. Accurate & Complete (0 to 100). X-axis: None, Context Sensitive, Porter, Paice, Snowball, KStem, MStem. Bar labels visible in the extraction: 58%, 53%, 37%, 32%, 29%]
  8. Word Classes Example
       • Stemmer comparison for 2 examples
       • Underlined words in all stemmer classes
       • Table I: Stemmer word class comparisons for 4 examples (underlined words are in the word classes for all stemmers). The label in parentheses after each word is the table’s A & C entry.
           • element (MStem)
               • Porter, Snowball: element, elemental, elemente, elements
               • KStem: element
               • MStem: element, elemental, elements
               • Paice: el, ela, ele, element, elemental, elementary, elemente, elementen, elements, elen, eles, eli, elif, elise, elist, ell, elle, ellen, eller, els, else, elseif, elses, elsif
           • import (KStem)
               • Porter: import, importable, importance, important, imported, importer, importers, importing, imports
               • Snowball: import, importable, importance, important, importantly, imported, importer, importers, importing, imports
               • KStem: import, importable, imported, importer, importers, importing, imports
               • MStem: import, importable, importance, important, importantly, imported, importer, importers, importing, imports
               • Paice: import, importable, importance, important, importantly, importar, imported, importer, importers, importing, imports
           • add (CS)
               • Porter, Snowball: add, adde, addes, adds
               • KStem: add, addable, added, addes, adding, adds
               • MStem: add, addable, added, adder, adding, addition, additional, additionally, additions, additive, additivity, adds
               • Paice: ad, ada, add, addable, adde, added, adder, addes, adding, adds, ade, ads
           • name
               • Porter, Snowball: name, named, namely, names, naming
               • KStem: name, nameable, named, namer, names, …
  9. Stemming and Source Code Search
       • search technique: tf-idf
       • search tasks: 8 with 48 queries from prior study [Shepherd, et al. ’07]
       • Paice: overstemming & understemming mistakes improved results for 2 tasks (e.g., textfield report element)
       • [Chart. Y-axis: Area Under the Curve (0.5 to 1.0). X-axis: NoStem, Porter, Snowball, KStem, MStem, Paice]
  10. Conclusion
       • Morphological stemmers appear to be more accurate & complete than rule-based
       • In search, stemming more consistently produces relevant results than not stemming
       • Heavy stemmers like MStem & Paice appear to be more effective in searching source code than light stemmers like Porter
       • Future work: more examples (less frequent & more domain-specific), more human judgements, more search tasks, other SE tasks beyond search
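
The understemming and overstemming definitions on slide 4, together with the accurate/complete judgments on slide 7, can be illustrated with a small sketch. The two stemmers below are toy stand-ins written for this example; they are not Porter, Paice, or any other stemmer from the study. The vocabulary and the hand-labeled set of related words are also assumptions.

```python
def light_stem(word):
    """Toy light stemmer (assumption): strips a few inflectional suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def heavy_stem(word):
    """Toy heavy stemmer (assumption): keeps only the first three letters."""
    return word[:3]


def word_class(stem, target, vocabulary):
    """All vocabulary words that the stemmer conflates with `target`."""
    target_stem = stem(target)
    return {w for w in vocabulary if stem(w) == target_stem}


def judge(cls, related):
    """Return (accurate, complete) for one word class.

    accurate: the class contains no unrelated words.
    complete: the class is not missing any related word.
    """
    return cls <= related, related <= cls


# Hypothetical vocabulary and human judgment of which words belong together.
VOCABULARY = {"add", "adds", "adding", "added", "adder", "address"}
RELATED_TO_ADD = {"add", "adds", "adding", "added", "adder"}

for name, stem in (("light", light_stem), ("heavy", heavy_stem)):
    cls = word_class(stem, "add", VOCABULARY)
    accurate, complete = judge(cls, RELATED_TO_ADD)
    print(name, sorted(cls), "accurate:", accurate, "complete:", complete)
```

With this data the light stemmer's class is accurate but incomplete (it misses "adder", an understemming error), while the heavy stemmer's class is complete but inaccurate (it pulls in "address", an overstemming error).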
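
Slide 9 names tf-idf as the search technique, and slide 2 motivates stemming with the “add item” query. The sketch below combines the two under stated assumptions: each identifier is treated as its own tiny document, the stemmer is a toy suffix stripper rather than one of the stemmers compared in the study, and the score is a plain sum of term frequency times log inverse document frequency. The method names are hypothetical.

```python
import math
import re
from collections import Counter


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
    return [p.lower() for p in parts]


def light_stem(word):
    """Toy light stemmer (assumption): strips a few inflectional suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def rank(query, identifiers, stem):
    """Rank identifiers against the query by a simple tf-idf score."""
    docs = {name: Counter(stem(w) for w in split_identifier(name))
            for name in identifiers}
    n_docs = len(docs)
    query_terms = [stem(w) for w in query.lower().split()]
    scores = {}
    for name, counts in docs.items():
        score = 0.0
        for term in query_terms:
            # Document frequency: how many identifiers contain the term.
            df = sum(1 for c in docs.values() if term in c)
            if df:
                score += counts[term] * math.log(n_docs / df)
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical method names standing in for a code base.
METHODS = ["addItems", "addingItem", "removeItem", "parseConfig"]

print("stemmed:  ", rank("add item", METHODS, light_stem))
print("unstemmed:", rank("add item", METHODS, lambda w: w))
```

Without stemming, "addingItem" matches only on "item" and ties with "removeItem"; with stemming, "addItems" and "addingItem" both match the full query and share the top score.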