Assisting Code Search with Automatic Query Reformulation for Bug Localization


Published on

MSR 2013, San Francisco, CA

Published in: Technology
  • Despite the naming conventions of programming languages, the textual content of software is built from arbitrary abbreviations and concatenations.
  • Among the different approaches to QR, Relevance Feedback has been shown to be an effective method. It has three steps.
  • It has been shown that Relevance Feedback is an effective method for Query Reformulation.
  • First, we index the source code. When a query arrives, we perform an initial first-pass retrieval to get the highest-ranked documents. The QR module then takes the initial query and the first-pass retrieval results and reformulates the query. With the reformulated query, we perform a second retrieval to obtain the final results.
  • There are several previous studies on QR, the most important of which are …
  • The fact that the developer concatenated those terms to declare a variable or class name indicates that those terms are conceptually related. So, for a given term, nearby terms are more likely to be relevant.
  • The query plays no role at this point.
  • I am showing you an indexed file. Let's consider a query that consists of two terms. We will use a window of size 3 to capture the spatial proximity effects in this file vis-à-vis this particular query. We have the query term "browser" in the first line. So, using a window, we increase the weights of the nearby terms!
  • While all three methods degrade about the same number of queries, SCP improves the largest number of queries. For more than 200 queries, SCP does better than the next best model, RM. In the paper, you will also find an analysis of the queries for which QR does not lead to improvements.
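The window-based weighting described in the notes above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`split_identifier`, `tokenize`, `proximity_weights`, `reformulate`) are hypothetical, the weight is a plain co-occurrence count rather than whatever formula the paper uses, and "window of size 3" is assumed to mean three positions on each side of a query term.

```python
import re
from collections import Counter


def split_identifier(token):
    """Split an identifier on underscores and camel-case boundaries."""
    parts = []
    for chunk in token.split("_"):
        # Uppercase runs (e.g. "HTML"), capitalized or lowercase words, digit runs.
        parts += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk)
    return [p.lower() for p in parts]


def tokenize(source):
    """Return the terms of a source file in order; list index = term position."""
    terms = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        terms.extend(split_identifier(ident))
    return terms


def proximity_weights(terms, query_terms, window=3):
    """Count how often each non-query term falls within `window` positions
    of a query term (assumption: a simple count stands in for the weight)."""
    query = {q.lower() for q in query_terms}
    weights = Counter()
    for pos, term in enumerate(terms):
        if term not in query:
            continue
        lo = max(0, pos - window)
        hi = min(len(terms), pos + window + 1)
        for neighbor in terms[lo:pos] + terms[pos + 1:hi]:
            if neighbor not in query:
                weights[neighbor] += 1
    return weights


def reformulate(query_terms, top_files, window=3, n_new_terms=2):
    """Expand the query with the highest-weighted nearby terms found in the
    top-ranked files of the first-pass retrieval."""
    total = Counter()
    for source in top_files:
        total.update(proximity_weights(tokenize(source), query_terms, window))
    expansion = [term for term, _ in total.most_common(n_new_terms)]
    return [q.lower() for q in query_terms] + expansion


if __name__ == "__main__":
    # Hypothetical file content built from the identifiers shown in the slides.
    top_files = ["browser_tab_strip kPinnedTabAnimationDurationMs"]
    print(reformulate(["Browser", "Animation"], top_files))
```

Splitting identifiers before indexing positions is what lets a query term such as "animation" pull in its neighbors inside `kPinnedTabAnimationDurationMs`; without the split, the whole identifier would be a single opaque term.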

    1. Assisting Code Search with Automatic Query Reformulation for Bug Localization
       Bunyamin Sisman & Avinash | kak@purdue.edu
       Purdue University
       MSR’13
    2. Outline
       I. Motivation
       II. Past Work on Automatic Query Reformulation
       III. Proposed Approach: Spatial Code Proximity (SCP)
       IV. Data Preparation
       V. Results
       VI. Conclusions
    3. Main Motivation
       It is extremely common for software developers to use arbitrary abbreviations & concatenations in software. These are generally difficult to predict when searching the code base of a project.
       The question is: “Is there some way to automatically reformulate a user’s query so that all such relevant terms are also used in retrieval?”
    4. Summary of Our Contribution
       • We show how a query can be automatically reformulated for superior retrieval accuracy
       • We propose a new framework for Query Reformulation, which leverages the spatial proximity of the terms in files
       • The approach leads to significant improvements over the baseline and the competing Query Reformulation approaches
    5. Summary of Our Contribution
       • Our approach preserves or improves the retrieval accuracy for 76% of the 4,393 bugs we analyzed for Eclipse and Chrome projects
       • Our approach improves the retrieval accuracy for 42% of the 4,393 bugs
       • Improvements are 66% for Eclipse and 90% for Chrome in terms of MAP (Mean Average Precision)
       • We also describe the conditions under which Query Reformulation may perform poorly.
    6. Query Reformulation with Relevance Feedback
       1. Perform an initial retrieval with the original query
       2. Analyze the set of top retrieved documents vis-à-vis the query
       3. Reformulate the query
    7. Acquiring Relevance Feedback
       • Implicitly: infer feedback from user interactions
       • Explicitly: user provides feedback [Gay et al. 2009]
       • Pseudo Relevance Feedback (PRF): Automatic QR. This is our work!
    8. Data Flow in the Proposed Retrieval Framework
    9. Automatic Query Reformulation
       • No user involvement!
       • It takes less than a second to reformulate a query on ordinary desktop hardware!
       • It is cheap!
       • It is effective!
    10. Previous Work on Automatic QR (for Text Retrieval)
       • Rocchio’s Formula (ROCC)
       • Relevance Model (RM)
    11. The Proposed Approach to QR: Spatial Code Proximity (SCP)
       • Spatial Code Proximity is an elegant approach to giving greater weights to terms in source code that occur in the vicinity of the terms in a user’s query
       • Proximities may be created through commonly used concatenations: punctuation characters, camel casing, etc.
       • Underscores: tab_strip_gtk
       • Camel casing: kPinnedTabAnimationDurationMs
    12. Spatial Code Proximity (SCP) (Cont’d)
       • Tokenize source files and index the positions of the terms in each source file
       • Use the distance between terms to find relevant terms vis-à-vis a query!
    13. SCP: Bringing the Query into the Picture
       • Example Query: “Browser Animation”
       • First perform an initial retrieval with the original query
       • Increase the weights of those nearby terms!
    14. Research Questions
       • Question 1: Does the proposed QR approach improve the accuracy of source code retrieval? If so, to what extent?
       • Question 2: How do the QR techniques that are currently in the literature perform for source code retrieval?
       • Question 3: How does the initial retrieval performance affect the performance of QR?
       • Question 4: What are the conditions under which QR may perform poorly?
    15. Data Preparation
       • For evaluation, we need a set of queries and the relevant files
       • We use the titles of the bug reports as queries
       • We have to link the repository commits to the bug tracking database!
       • Used regular expressions to detect Bug Fix commits based on commit messages
    16. Data Preparation (Cont’d)
                                   Eclipse v3.1   Chrome v4.0
       # Bugs                      4,035          358
       Avg. # Relevant Files       2.76           3.82
       Avg. # Commits              1.36           1.23
       Resulting dataset: BUGLinks
    17. Evaluation Framework
       • We use Precision and Recall based metrics to evaluate the retrieval accuracy.
       • Determine the query sets for which the proposed QR approaches lead to
         1. improvements in the retrieval accuracy
         2. degradation in the retrieval accuracy
         3. no change in the retrieval accuracy
       • Analyze these sets to understand the characteristics of the queries each set contains
    18. Evaluation Framework (Cont’d)
       • For comparison of these sets, we used the following Query Performance Prediction (QPP) metrics [Haiduc et al. 2012, He et al. 2004]:
         • Average Inverse Document Frequency (avgIDF)
         • Average Inverse Collection Term Frequency (avgICTF)
         • Query Scope (QS)
         • Simplified Clarity Score (SCS)
       • Additionally, we analyzed
         • Query Lengths
         • Number of Relevant files per bug
    19. QR with Bug Report Titles
       [Bar chart: # Bugs (axis 0 to 2000) for ROCC, RM, and SCP (Proposed)]
    20. Improvements in Retrieval Accuracy (% Increase in MAP)
       [Bar chart: 0% to 100% for Eclipse and Chrome; series ROCC, RM, and SCP (Proposed)]
    21. Conclusions & Future Work
       • Our framework can use a weak initial query as a jumping off point for a better query.
       • No user input is necessary
       • We obtained significant improvements over the baseline and the well-known Automatic QR methods.
       • Future Work includes evaluation of different term proximity metrics in source code for QR
    22. References
       [1] B. Sisman and A. Kak, “Incorporating version histories in information retrieval based bug localization,” in Proceedings of the 9th Working Conference on Mining Software Repositories (MSR’12). IEEE, 2012, pp. 50–59.
       [2] G. Gay, S. Haiduc, A. Marcus, and T. Menzies, “On the use of relevance feedback in IR-based concept location,” in International Conference on Software Maintenance (ICSM’09), Sept. 2009, pp. 351–360.
       [3] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, “An information retrieval approach to concept location in source code,” in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE’04). IEEE Computer Society, 2004, pp. 214–223.
    23. References
       [4] S. Haiduc, G. Bavota, R. Oliveto, A. De Lucia, and A. Marcus, “Automatic query performance assessment during the retrieval of software artifacts,” in Proceedings of the 27th International Conference on Automated Software Engineering (ASE’12). ACM, 2012, pp. 90–99.
       [5] B. He and I. Ounis, “Inferring query performance using pre-retrieval predictors,” in Proc. Symposium on String Processing and Information Retrieval. Springer Verlag, 2004, pp. 43–54.
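
The slides report improvements in terms of MAP (Mean Average Precision) without spelling the metric out. As a reminder, the sketch below computes it in the conventional information-retrieval way: for each query (here, a bug report title), precision is averaged over the ranks at which relevant files appear, and those per-query values are then averaged. The function names and the sample file names are hypothetical, and the slides do not say which exact variant the authors used.

```python
def average_precision(ranked_files, relevant_files):
    """Average of precision@k over the ranks k where a relevant file appears,
    divided by the total number of relevant files for the query."""
    relevant = set(relevant_files)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, name in enumerate(ranked_files, start=1):
        if name in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """`runs` is a list of (ranked_files, relevant_files) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical rankings for two bug reports.
    runs = [
        (["a.java", "b.java", "c.java", "d.java"], {"a.java", "c.java"}),
        (["x.cc", "y.cc"], {"y.cc"}),
    ]
    print(mean_average_precision(runs))
```

A relevant file that is never retrieved contributes zero to the sum but still counts in the denominator, so a query is penalized for fixed files it misses as well as for ranking them low.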