An effective query reformulation technique that adopts crowdsourced knowledge and large-scale data analytics from the Stack Overflow Q&A site to improve source code search.
Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics
1. EFFECTIVE REFORMULATION OF QUERY FOR CODE SEARCH USING CROWDSOURCED KNOWLEDGE AND EXTRA-LARGE DATA ANALYTICS
Masud Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan, Canada
International Conference on Software Maintenance and
Evolution (ICSME 2018), Madrid, Spain
2. IDEAL SCENARIO OF CODE SEARCH
Convert image to gray scale without losing transparency
14. RQ4: CAN NLP2API OUTPERFORM THE STATE-OF-THE-ART IN QUERY REFORMULATION?
Method   | Improved | Mean | Q1  | Q2  | Q3  | Min | Max
QECK     | 72       | 139  | 02  | 11  | 74  | 01  | 1,861
RACK     | 105      | 75   | 02  | 08  | 60  | 01  | 971
COCABU   | 113      | 191  | 02  | 14  | 103 | 01  | 2,607
Baseline | --       | --   | 07  | 25  | 145 | 02  | 1,460
NLP2API  | *152     | *172 | *02 | *10 | *61 | 01  | 1,926
QE = Rank of the first relevant code example, Qi = i-th quartile of QE
15. RQ5: CAN NLP2API IMPROVE TRADITIONAL CODE SEARCH RESULTS?
[Slide figure: two-stage evaluation (Stage-I, Stage-II) of GitHub code search results]
18. THANK YOU! QUESTIONS?
Replication Package of NLP2API:
http://www.usask.ca/~masud.rahman/nlp2api
Contact: masud.rahman@usask.ca
Masud Rahman (@masud2336)
Editor's Notes
Good morning, everyone.
My name is Masud Rahman. I am a PhD Student from University of Saskatchewan, Canada.
I work with Prof. Dr. Chanchal Roy.
My research area is code search and query reformulation.
Today, I am going to talk about a code search approach where we used query reformulation.
And for query reformulation, we used data mining from Stack Overflow, and we also used large-scale data analytics with word embeddings.
First, we will see some scenarios.
This is an ideal scenario for code search.
You provide a natural language query, and you expect a code segment that solves your problem exactly.
But this does not happen in practice.
In real life, you get a lot of search results.
You have to analyze the results, and look for such code segments in those pages.
If the query is good enough, you might get lucky and get the Hit very quickly.
For example, Google is quite good at this. But it really depends on the query you choose.
Unfortunately, other search engines are failing to keep up with Google.
For example, GitHub code search does not work with such natural language query.
It does keyword matching, but that is not sufficient if the query is NOT good.
In fact, several code search engines have disappeared from the web, such as Koders and Google Code, which is a bit strange.
So, we basically try to improve code search.
Now, how can we beat the status quo of code search?
Well, one possible way is to improve the query through query reformulation.
Since keyword search is a kind of universal idea, we cannot avoid it.
So, what can we do?
We will improve the keyword search by providing more appropriate keywords.
Now what are those?
Well, source code is different from natural language text. It has a smaller vocabulary.
So, we have to deal with it carefully.
One possible way is to provide relevant API classes as the keywords for expansion.
For example, when the baseline query returns the correct result at the 115th position, the reformulated query returns it at the 2nd position.
So, here is our contribution: NLP2API, that is, Natural Language Phrase to API.
We translate a natural language query into relevant API classes for query reformulation and then we improve the code search in the process.
First, we take a generic natural language query and submit it to a search engine.
It retrieves relevant questions and answers from Stack Overflow.
We then mine the code segments posted in those threads using two term-weighting methods: PageRank and TF-IDF.
Thus, we get a list of candidate API classes from those threads that are used by millions of people.
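The mining step described above can be sketched as follows. This is a minimal, self-contained illustration with toy snippets and hypothetical Java API class names, not the actual NLP2API implementation; it shows only the TF-IDF half of the term weighting (PageRank would add a graph-based score on top).

```python
import math
from collections import Counter

def tfidf_scores(snippets):
    """Score candidate API classes across code snippets with TF-IDF.

    snippets: list of lists of API class names extracted from
    Stack Overflow code segments (toy stand-in for the real corpus).
    """
    n = len(snippets)
    # Document frequency: in how many snippets each API class appears.
    df = Counter()
    for s in snippets:
        df.update(set(s))
    scores = Counter()
    for s in snippets:
        tf = Counter(s)
        for api, f in tf.items():
            # Accumulate a smoothed TF-IDF weight over all snippets.
            scores[api] += f * math.log((1 + n) / (1 + df[api]))
    return scores

# Hypothetical snippets for a "convert image to gray scale" query.
snippets = [
    ["BufferedImage", "Color", "ImageIO"],
    ["BufferedImage", "Graphics2D"],
    ["ImageIO", "File"],
]
ranked = tfidf_scores(snippets).most_common()
```

APIs that recur across threads but are not ubiquitous end up with the highest weights, which is the intuition behind using TF-IDF here.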
Now, the big question is, which candidates are the most appropriate for the query at hand?
Well, we proposed two metrics – Borda count and Semantic proximity.
The essence of Borda count is this: if API A is more frequent than API B in the relevant Q&A threads from Stack Overflow, A is more appropriate than B.
So, it’s a kind of likelihood of A over B for the target query.
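A generic Borda count over per-thread API rankings can be sketched like this; the thread rankings and API names are hypothetical, and the exact scoring formula in the paper may differ.

```python
from collections import Counter

def borda_count(ranked_lists):
    """Aggregate per-thread API rankings with a Borda count.

    Each list ranks candidate API classes for one Stack Overflow
    thread; an API at position i in a list of length k earns k - i
    points, so frequent, highly ranked APIs accumulate more points.
    """
    scores = Counter()
    for ranking in ranked_lists:
        k = len(ranking)
        for i, api in enumerate(ranking):
            scores[api] += k - i
    return scores

# Hypothetical per-thread candidate rankings.
threads = [
    ["BufferedImage", "ImageIO", "Color"],
    ["BufferedImage", "Graphics2D"],
    ["ImageIO", "BufferedImage"],
]
scores = borda_count(threads)
# BufferedImage appears often and near the top, so it scores highest:
# 3 + 2 + 1 = 6 points.
```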
For the second metric, we preprocess the Stack Overflow corpus and develop a skip-gram model using FastText, an improved version of Word2Vec.
Then we determine, how close an API is to the given query keywords within the semantic space.
So, if API A is semantically closer to query Q than B, then A is more appropriate than B for the query.
So, we then combine these two metrics for each candidate API class, do the ranking, and return the Top-K classes as our reformulation terms.
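The semantic proximity metric can be illustrated with cosine similarity over word vectors; the 3-dimensional embeddings below are toy stand-ins for the real FastText skip-gram vectors, and the averaging scheme is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def proximity(api_vec, query_vecs):
    """Mean cosine similarity between an API class and the query keywords."""
    return sum(cosine(api_vec, q) for q in query_vecs) / len(query_vecs)

# Toy 3-d embeddings standing in for the FastText skip-gram vectors.
emb = {
    "convert":       [0.9, 0.1, 0.0],
    "gray":          [0.8, 0.3, 0.1],
    "BufferedImage": [0.85, 0.2, 0.05],
    "Socket":        [0.0, 0.1, 0.9],
}
query = [emb["convert"], emb["gray"]]
a = proximity(emb["BufferedImage"], query)
b = proximity(emb["Socket"], query)
# BufferedImage lies much closer to the query in the semantic space.
```

In the actual pipeline, a score like this would be normalized and combined with the Borda score before the Top-K ranking.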
So, we stand on the shoulders of two giants:
The massive developer crowd: we use their API relevance judgments through data mining.
Large-scale data analytics: we determine the semantic proximity between query keywords and candidate API classes.
We evaluate our approach from two dimensions:
API suggestion: we check our performance against the ground truth, to see whether we are suggesting correctly. Otherwise, the rest of the pipeline does not work.
Query reformulation/code search: We check whether our reformulation actually improves the query or not in terms of code search performance.
For the API suggestion, we collect natural language queries from four tutorial sites such as KodeJava and others.
We collect 300+ queries, along with the ground truth API classes from them.
Then we try to determine whether our approach can suggest appropriate API classes for those queries by mining crowd knowledge from Stack Overflow.
For the query reformulation part, we collect 4K code examples from GitHub and combine them with our ground truth code segments from the tutorial sites.
Then we determine whether our reformulated query actually works or not.
We answered five research questions in this paper.
The first research question: How does our tool, NLP2API, perform in API class suggestion?
We achieve 70%+ Top-5 accuracy with 50% precision, which is pretty good for an automatic approach.
That is, half of the suggested API classes are true positives, and the tool succeeds 70% of the time.
We also get an MRR of 0.55, which suggests that the first relevant API class generally appears between the 1st and 2nd positions, which is promising.
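The reciprocal-rank metric behind that number works like this, using the standard MRR definition and hypothetical per-query ranks:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries, given the rank of the first relevant API class
    for each query (None if nothing relevant was suggested)."""
    rr = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]
    return sum(rr) / len(rr)

# Hypothetical ranks of the first correct API class for four queries.
ranks = [1, 2, 2, None]
mrr = mean_reciprocal_rank(ranks)  # (1 + 0.5 + 0.5 + 0) / 4 = 0.5
```

An MRR around 0.5-0.55 thus means the first relevant suggestion typically sits near the top two positions.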
We also see that our two metrics, Borda count and semantic proximity, perform pretty well on their own.
But obviously, we combined them due to their orthogonal strengths, and then achieved the highest performance.
The second research question compares our approach with the state-of-the-art.
For Top-1, we see that our approach doubled the performance in all three metrics which is interesting.
For Top-5 results, we see that NLP2API also improves over the state-of-the-art by 38% in precision and 46% in reciprocal rank.
So, our approach is advancing the state-of-the-art which is highly expected.
In the third research question, we investigate whether our reformulation actually improves the baseline query or not.
Well, it does!
When the baseline natural language query is used, we achieve an accuracy of 50%.
However, when we keep adding the API classes suggested by our tool, we see performance improvement, which justifies our whole hypothesis.
For example, we get around 65% accuracy when we add 10-15 API classes, which is a fairly decent performance improvement.
We also get the same picture in the case of reciprocal rank.
So, yes, the query reformulation works!
In the fourth research question, we compare our query reformulation performance with three other approaches from the literature.
In particular, we determine query effectiveness, that is, the rank of the first correct result returned by a query.
We collect such ranks for all queries, determine their quartiles, and then compare with other approaches.
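The quartile comparison above can be computed as follows; the ranks are hypothetical, and the nearest-rank quartile convention here is an assumption (the paper may interpolate differently).

```python
import math

def quartiles(ranks):
    """Q1, Q2 (median), Q3 of query-effectiveness ranks, using the
    nearest-rank method on the sorted list."""
    s = sorted(ranks)
    def q(p):
        idx = max(0, math.ceil(p * len(s)) - 1)
        return s[idx]
    return q(0.25), q(0.50), q(0.75)

# Hypothetical rank of the first correct result for eight queries.
qe = [2, 60, 5, 1, 900, 10, 200, 2]
q1, q2, q3 = quartiles(qe)
# Sorted ranks: [1, 2, 2, 5, 10, 60, 200, 900] -> quartiles (2, 5, 60).
```

Lower quartiles mean relevant code surfaces earlier, which is how the approaches in the table are compared.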
Here, we see that our reformulation improves 50% of the queries which is the highest obviously.
However, these are the baseline quartiles, and these are our quartiles.
Well, our reformulations improved the ranks, advancing the state-of-the-art.
In the fifth research question, we investigate whether our reformulated queries can improve the results of traditional code search engines.
So, we first collect results from Google, Stack Overflow, and GitHub for the baseline queries.
Then we manually analyze them, compare them with our gold set, and set up a baseline performance. This is Stage-I.
In the second stage, we repeat the experiments with our reformulated queries.
Then we compare the performance of the two stages.
We see that Google obviously performs better than the other two, which is pretty much expected.
It achieves around 65% precision which is pretty good.
However, our reformulated queries can make it even better to like 75%.
So, although this approach is designed not for Google but for code search engines like GitHub, it can significantly improve the precision of Google in code search, which is great.
We also got a significant performance improvement in terms of NDCG, another state-of-the-art ranking metric, which supports our hypothesis.
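NDCG can be computed like this for one query, using binary relevance for simplicity; the relevance lists below are hypothetical, not results from the paper.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for one query; relevances lists the graded relevance of
    the results in ranked order (binary here for simplicity)."""
    def dcg(rels):
        # Standard discounted cumulative gain with log2 position discount.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical binary relevance of the top-5 results for one query,
# before and after reformulation.
baseline     = [0, 1, 0, 0, 1]
reformulated = [1, 1, 0, 1, 0]
```

Because NDCG discounts results by position, pushing relevant hits toward the top raises the score even when the number of hits barely changes.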
However, we faced some issues while comparing with Google, which are discussed in the paper.
So, these are the take-home messages.
Code search engines are NOT working well.
However, keyword search is a kind of universal idea.
So, we tried to improve the keyword search by providing more appropriate keywords for code search.
Our approach stands on the shoulders of two giants: (1) crowd-generated knowledge, and (2) large-scale data analytics.
We conducted experiments using 300+ queries, and answered 5 research questions.
Our approach outperformed the state-of-the-art in API suggestion, query reformulation and code search.
We have a replication package publicly available. It's on GitHub.
You can simply clone it and use it for your work.
Go ahead and develop the next best tool!
Thanks for your time and attention.
I am happy to take a few questions.