Despite the naming conventions in programming languages, the textual content of software is built from arbitrary abbreviations and concatenations.
Among the different approaches to Query Reformulation (QR), Relevance Feedback has been shown to be an effective method. It has three basic steps.
First, we index the source code. When a query arrives, we do an initial first-pass retrieval to get the highest-ranked documents. The QR module then takes the initial query and the first-pass retrieval results and reformulates the query. With the reformulated query, we perform the second retrieval to obtain the final results.
There are several previous studies on QR, the most important of which are …
The fact that the developer concatenated those terms to declare a variable or class name indicates that those terms are conceptually related. So, for a given term, nearby terms are more likely to be relevant.
This step has nothing to do with a query at this point.
I am showing you an indexed file. Let's consider a query that consists of two terms. We will use a window of size 3 to capture the spatial proximity effects in this file vis-à-vis this particular query. We have the query term "browser" in the first line. So, using a window, we increase the weights of the nearby terms!
While all three methods degrade about the same number of queries, SCP improves the largest number of queries. For more than 200 queries, SCP does better than the next best model, RM. In the paper, you will also find an analysis of the queries for which QR does not lead to improvements.
Assisting Code Search with Automatic Query Reformulation for Bug Localization

Outline
I. Motivation
II. Past Work on Automatic Query Reformulation
III. Proposed Approach: Spatial Code Proximity (SCP)
IV. Data Preparation
V. Results
VI. Conclusions
MSR13

Main Motivation
It is extremely common for software developers to use arbitrary abbreviations & concatenations in software. These are generally difficult to predict when searching the code base of a project.
The question is “Is there some way to automatically reformulate a user’s query so that all such relevant terms are also used in retrieval?”

Summary of Our Contribution
- We show how a query can be automatically reformulated for superior retrieval accuracy
- We propose a new framework for Query Reformulation, which leverages the spatial proximity of the terms in files
- The approach leads to significant improvements over the baseline and the competing Query Reformulation approaches

Summary of Our Contribution
- Our approach preserves or improves the retrieval accuracy for 76% of the 4,393 bugs we analyzed for the Eclipse and Chrome projects
- Our approach improves the retrieval accuracy for 42% of the 4,393 bugs
- Improvements are 66% for Eclipse and 90% for Chrome in terms of MAP (Mean Average Precision)
- We also describe the conditions under which Query Reformulation may perform poorly.

Query Reformulation with Relevance Feedback
1. Perform an initial retrieval with the original query
2. Analyze the set of top retrieved documents vis-à-vis the query
3. Reformulate the query
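
The three steps above, followed by a second retrieval, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scorer (weighted term frequency), the expansion rule (add the most frequent new terms from the top documents), and all function names and sample file contents are hypothetical and are not taken from the talk.

```python
from collections import Counter

def score(query_weights, doc_tokens):
    """Toy scorer (assumption): sum of query-term weight times term frequency."""
    tf = Counter(doc_tokens)
    return sum(w * tf[t] for t, w in query_weights.items())

def retrieve(query_weights, docs, k):
    """Return the names of the k highest-scoring documents."""
    ranked = sorted(docs, key=lambda name: score(query_weights, docs[name]),
                    reverse=True)
    return ranked[:k]

def reformulate(query_weights, docs, top_docs, n_new_terms=2, new_weight=0.5):
    """Toy expansion (assumption): add the most frequent non-query terms
    found in the top-ranked documents, with a smaller weight."""
    counts = Counter()
    for name in top_docs:
        counts.update(t for t in docs[name] if t not in query_weights)
    expanded = dict(query_weights)
    for term, _ in counts.most_common(n_new_terms):
        expanded[term] = new_weight
    return expanded

# Hypothetical, already tokenized files.
docs = {
    "tab_strip.cc": ["tab", "strip", "browser", "animation", "tab", "pinned"],
    "browser_view.cc": ["browser", "view", "tab", "window"],
    "net_util.cc": ["url", "socket", "host"],
}
query = {"browser": 1.0, "animation": 1.0}

first_pass = retrieve(query, docs, k=2)           # step 1: initial retrieval
new_query = reformulate(query, docs, first_pass)  # steps 2 and 3
final = retrieve(new_query, docs, k=3)            # second retrieval
```

No user input enters between the two retrievals, which is what makes the feedback "pseudo".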

Acquiring Relevance Feedback
- Implicitly: infer feedback from user interactions
- Explicitly: user provides feedback [Gay et al. 2009]
- Pseudo Relevance Feedback (PRF): Automatic QR
  This is our work!

Data Flow in the Proposed Retrieval Framework

Automatic Query Reformulation
- No user involvement!
- It takes less than a second to reformulate a query on ordinary desktop hardware!
- It is cheap!
- It is effective!

Previous Work on Automatic QR (for Text Retrieval)
- Rocchio’s Formula (ROCC)
- Relevance Model (RM)
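
For readers unfamiliar with Rocchio's formula, the sketch below shows the form commonly used with pseudo relevance feedback: the new query vector is `alpha * q + beta * centroid(top documents)`. The negative term of the classic formula is left out because no explicit non-relevant set exists in this setting. The parameter values, the function name, and the truncation to `n_terms` expansion terms are assumptions for illustration, not the configuration evaluated in the talk.

```python
from collections import defaultdict

def rocchio_prf(query_vec, top_doc_vecs, alpha=1.0, beta=0.75, n_terms=5):
    """Rocchio-style update using only pseudo-relevant documents.

    query_vec and each entry of top_doc_vecs map term -> weight.
    alpha, beta and n_terms are illustrative defaults (assumptions).
    """
    # Centroid of the top-ranked (pseudo-relevant) document vectors.
    centroid = defaultdict(float)
    for vec in top_doc_vecs:
        for term, w in vec.items():
            centroid[term] += w / len(top_doc_vecs)

    # new_q = alpha * q + beta * centroid
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for term, w in centroid.items():
        new_q[term] += beta * w

    # Keep the original terms plus the n_terms strongest expansion terms.
    expansion = sorted((t for t in new_q if t not in query_vec),
                       key=lambda t: new_q[t], reverse=True)[:n_terms]
    keep = set(query_vec) | set(expansion)
    return {t: w for t, w in new_q.items() if t in keep}

reformulated = rocchio_prf(
    {"browser": 1.0},
    [{"browser": 1.0, "tab": 2.0}, {"tab": 2.0, "strip": 1.0}],
    n_terms=1,
)
```

Note that this update looks only at term weights in whole documents; it does not use where the terms occur inside a file.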

The Proposed Approach to QR: Spatial Code Proximity (SCP)
- Spatial Code Proximity is an elegant approach to giving greater weights to terms in source code that occur in the vicinity of the terms in a user’s query
- Proximities may be created through commonly used concatenations
  - Punctuation characters
  - Camel casing
  - etc…
- Underscores: tab_strip_gtk
- Camel casing: kPinnedTabAnimationDurationMs
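
The two example identifiers above can be split into their constituent terms with a small tokenizer. The following is a minimal sketch, assuming a regular-expression based splitter; the function name and the exact splitting rules are illustrative and not necessarily those used by the authors.

```python
import re

def split_identifier(identifier):
    """Split an identifier on punctuation/underscores and camel-case boundaries.

    Illustrative rules (assumptions): runs of capitals such as "HTML" stay
    together, digits form their own token, and output is lower-cased.
    """
    tokens = []
    # First cut on anything that is not a letter or digit (underscores, dots...).
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        # Then cut each piece at camel-case boundaries.
        tokens.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
        )
    return [t.lower() for t in tokens]

print(split_identifier("tab_strip_gtk"))
print(split_identifier("kPinnedTabAnimationDurationMs"))
```

Because the pieces of one identifier end up at adjacent positions in the token stream, they are automatically close to each other in any position-based proximity measure.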

Spatial Code Proximity (SCP) (Cont’d)
- Tokenize source files and index the positions of the terms in each source file
- Use the distance between terms to find relevant terms vis-à-vis a query!
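
A positional index of this kind can be sketched as a nested mapping from term to file to token positions. The data layout, the function names, and the sample file below are assumptions made for illustration; the talk does not specify the index format.

```python
from collections import defaultdict

def build_positional_index(files):
    """Map term -> file name -> list of token positions (in increasing order)."""
    index = defaultdict(lambda: defaultdict(list))
    for name, tokens in files.items():
        for pos, term in enumerate(tokens):
            index[term][name].append(pos)
    return index

def distance(index, name, term_a, term_b):
    """Smallest gap in token positions between two terms in one file.

    Returns None if either term does not occur in the file.
    """
    pos_a = index.get(term_a, {}).get(name, [])
    pos_b = index.get(term_b, {}).get(name, [])
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

# Hypothetical tokenized file.
files = {
    "tab_strip_gtk.cc": ["tab", "strip", "gtk", "browser",
                         "pinned", "tab", "animation"],
}
index = build_positional_index(files)
```

The index is built once per code base and does not depend on any query, so the cost of looking up distances at query time stays small.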

SCP: Bringing the Query into the Picture
- Example Query: “Browser Animation”
- First perform an initial retrieval with the original query
- Increase the weights of those nearby terms!
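
The idea of raising the weights of terms near the query terms can be illustrated with a simple window count over one retrieved file. This is a sketch only: the window is assumed to extend 3 positions on either side of each query-term occurrence, and every hit adds a flat 1.0. The actual SCP weighting is defined in the paper and may differ.

```python
from collections import defaultdict

def proximity_weights(tokens, query_terms, window=3):
    """Count, for each non-query term, how many query-term occurrences
    lie within `window` positions of it (flat increments, an assumption)."""
    query_terms = set(query_terms)
    weights = defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok not in query_terms:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if tokens[j] not in query_terms:
                weights[tokens[j]] += 1.0
    return dict(weights)

# Hypothetical token stream of a top-ranked file.
tokens = ["browser", "tab", "strip", "model",
          "observer", "animation", "duration", "ms"]
weights = proximity_weights(tokens, ["browser", "animation"], window=3)
```

Terms such as `strip` and `model` fall inside the windows of both query terms, so they receive a higher weight than terms close to only one of them.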

Research Questions
- Question 1: Does the proposed QR approach improve the accuracy of source code retrieval? If so, to what extent?
- Question 2: How do the QR techniques that are currently in the literature perform for source code retrieval?
- Question 3: How does the initial retrieval performance affect the performance of QR?
- Question 4: What are the conditions under which QR may perform poorly?

Data Preparation
- For evaluation, we need a set of queries and the relevant files
- We use the titles of the bug reports as queries
- We have to link the repository commits to the bug tracking database!
- We used regular expressions to detect bug-fix commits based on commit messages
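
The slide does not give the regular expressions that were used. The sketch below shows what such a detector could look like, with a purely hypothetical pattern: a "fix"-like keyword or the word "bug", followed within a few characters by a number of at least three digits.

```python
import re

# Hypothetical pattern (assumption, not the authors' expression):
# keyword, up to 20 non-digit characters, optional '#', then the bug number.
BUG_FIX = re.compile(
    r"\b(?:fix(?:e[sd])?|bug)\b[^\d\n]{0,20}?#?(\d{3,})",
    re.IGNORECASE,
)

def linked_bug_id(message):
    """Return the bug number referenced by a commit message, or None."""
    match = BUG_FIX.search(message)
    return int(match.group(1)) if match else None

for msg in ["Fix for bug 123456: crash on tab close",
            "Fixes #4567",
            "BUG=98765",
            "Update README"]:
    print(msg, "->", linked_bug_id(msg))
```

Each commit matched this way links a bug report (whose title serves as the query) to the files changed in the commit (which serve as the relevant files).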

Evaluation Framework
- We use Precision and Recall based metrics to evaluate the retrieval accuracy.
- Determine the query sets for which the proposed QR approaches lead to
  1. improvements in the retrieval accuracy
  2. degradation in the retrieval accuracy
  3. no change in the retrieval accuracy
- Analyze these sets to understand the characteristics of the queries each set contains
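
As a concrete example of a precision-based metric, the sketch below computes Average Precision per query and Mean Average Precision (MAP) over a set of queries, then places a query in one of the three sets by comparing its score before and after reformulation. The function names and the tolerance `eps` are assumptions for illustration.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at the ranks of the relevant files."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, name in enumerate(ranked, start=1):
        if name in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def classify(before, after, eps=1e-9):
    """Assign a query to the improved / degraded / unchanged set."""
    if after > before + eps:
        return "improved"
    if after < before - eps:
        return "degraded"
    return "unchanged"

ap_before = average_precision(["x", "a", "c", "d"], {"a", "c"})
ap_after = average_precision(["a", "b", "c", "d"], {"a", "c"})
outcome = classify(ap_before, ap_after)
```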

Evaluation Framework (Cont’d)
- For comparison of these sets, we used the following Query Performance Prediction (QPP) metrics [Haiduc et al. 2012, He et al. 2004]:
  - Average Inverse Document Frequency (avgIDF)
  - Average Inverse Collection Term Frequency (avgICTF)
  - Query Scope (QS)
  - Simplified Clarity Score (SCS)
- Additionally, we analyzed
  - Query Lengths
  - Number of Relevant files per bug
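
The four pre-retrieval predictors can be computed from collection statistics alone. The sketch below uses one common textbook form of each definition; the logarithm base, the handling of terms missing from the collection, and the function name are assumptions, and the cited papers may use slightly different variants.

```python
import math
from collections import Counter

def qpp_metrics(query, docs):
    """Pre-retrieval predictors for `query` (list of terms) over `docs`
    (list of token lists). Returns None if no query term is in the collection."""
    n_docs = len(docs)
    coll_tf = Counter(t for d in docs for t in d)        # term freq. in collection
    coll_len = sum(coll_tf.values())                     # total tokens
    df = Counter(t for d in docs for t in set(d))        # document frequency
    terms = [t for t in query if coll_tf[t] > 0]         # skip unseen terms
    if not terms:
        return None

    avg_idf = sum(math.log2(n_docs / df[t]) for t in terms) / len(terms)
    avg_ictf = sum(math.log2(coll_len / coll_tf[t]) for t in terms) / len(terms)

    # Query Scope: share of documents containing at least one query term.
    matching = sum(1 for d in docs if set(d) & set(terms))
    qs = -math.log2(matching / n_docs)

    # Simplified Clarity Score: divergence of the query language model
    # from the collection language model.
    scs = 0.0
    for t, c in Counter(terms).items():
        p_q = c / len(terms)
        p_c = coll_tf[t] / coll_len
        scs += p_q * math.log2(p_q / p_c)

    return {"avgIDF": avg_idf, "avgICTF": avg_ictf, "QS": qs, "SCS": scs}

docs = [["browser", "tab"], ["tab", "strip"], ["url", "host"], ["tab", "view"]]
metrics = qpp_metrics(["browser", "tab"], docs)
```

Higher avgIDF, avgICTF, QS, and SCS values all indicate a more specific query, which is why these metrics are useful for comparing the improved, degraded, and unchanged query sets.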

QR with Bug Report Titles
[Bar chart: number of bugs (#Bugs, axis from 0 to 2000) for ROCC, RM, and SCP (Proposed)]

Improvements in Retrieval Accuracy (% Increase in MAP)
[Bar chart: % increase in MAP (axis from 0% to 100%) on Eclipse and Chrome for ROCC, RM, and SCP (Proposed)]

Conclusions & Future Work
- Our framework can use a weak initial query as a jumping-off point for a better query.
- No user input is necessary
- We obtained significant improvements over the baseline and the well-known Automatic QR methods.
- Future Work includes evaluation of different term proximity metrics in source code for QR

References
- B. Sisman and A. Kak, “Incorporating version histories in information retrieval based bug localization,” in Proceedings of the 9th Working Conference on Mining Software Repositories (MSR’12). IEEE, 2012, pp. 50–59.
- G. Gay, S. Haiduc, A. Marcus, and T. Menzies, “On the use of relevance feedback in IR-based concept location,” in International Conference on Software Maintenance (ICSM’09), Sept. 2009, pp. 351–360.
- A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, “An information retrieval approach to concept location in source code,” in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE’04). IEEE Computer Society, 2004, pp. 214–223.

References
- S. Haiduc, G. Bavota, R. Oliveto, A. De Lucia, and A. Marcus, “Automatic query performance assessment during the retrieval of software artifacts,” in Proceedings of the 27th International Conference on Automated Software Engineering (ASE’12). ACM, 2012, pp. 90–99.
- B. He and I. Ounis, “Inferring query performance using pre-retrieval predictors,” in Proc. Symposium on String Processing and Information Retrieval. Springer Verlag, 2004, pp. 43–54.