An effective query reformulation technique that adopts crowdsourced knowledge and large-scale data analytics from the Stack Overflow Q&A site to improve source code search.
Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics
1. EFFECTIVE REFORMULATION OF QUERY FOR CODE SEARCH USING CROWDSOURCED KNOWLEDGE AND EXTRA-LARGE DATA ANALYTICS
Masud Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan, Canada
International Conference on Software Maintenance and
Evolution (ICSME 2018), Madrid, Spain
2. IDEAL SCENARIO OF CODE SEARCH
Convert image to gray scale without losing transparency
14. RQ4: CAN NLP2API OUTPERFORM THE STATE-OF-THE-ART IN QUERY REFORMULATION?
Method   | Improved | Mean | Q1  | Q2  | Q3  | Min | Max
QECK     | 72       | 139  | 02  | 11  | 74  | 01  | 1,861
RACK     | 105      | 75   | 02  | 08  | 60  | 01  | 971
COCABU   | 113      | 191  | 02  | 14  | 103 | 01  | 2,607
Baseline | --       | --   | 07  | 25  | 145 | 02  | 1,460
NLP2API  | *152     | *172 | *02 | *10 | *61 | 01  | 1,926
QE = Rank of the first relevant code example, Qi = i-th quartile of QE
15. RQ5: CAN NLP2API IMPROVE TRADITIONAL CODE SEARCH RESULTS?
[Slide figure: two-stage evaluation (Stage-I, Stage-II) of GitHub code search results]
18. THANK YOU! QUESTIONS?
Replication Package of NLP2API:
http://www.usask.ca/~masud.rahman/nlp2api
Contact: masud.rahman@usask.ca
Masud Rahman (@masud2336)
Editor's Notes
Good morning, everyone.
My name is Masud Rahman. I am a PhD Student from University of Saskatchewan, Canada.
I work with Prof. Dr. Chanchal Roy.
My research area is code search and query reformulation.
Today, I am going to talk about a code search approach where we used query reformulation.
And for query reformulation, we used data mining from Stack Overflow, and we also used large-scale data analytics with word embeddings.
First, we will see some scenarios.
This is an ideal scenario for code search.
You provide a natural language query, and you expect a code segment that solves your problem exactly.
But this does not happen in practice.
In real life, you get a lot of search results.
You have to analyze the results, and look for such code segments in those pages.
If the query is good enough, you might get lucky and get the Hit very quickly.
For example, Google is quite good at this. But it really depends on the query you choose.
Unfortunately, other search engines are failing to keep up with Google.
For example, GitHub code search does not work with such natural language query.
It does keyword matching, but that is not sufficient if the query is NOT good.
In fact, several code search engines have disappeared from the web, such as Koders and Google Code, which is a bit strange.
So, we basically try to improve code search.
Now, how can we beat the status quo of code search?
Well, one possible way is to improve the query through query reformulation.
Since keyword search is a kind of universal idea, we cannot avoid it.
So, what can we do?
We will improve the keyword search by providing more appropriate keywords.
Now what are those?
Well, source code is different from natural language text. It has a smaller vocabulary.
So, we have to deal with it carefully.
One possible way is to provide relevant API classes as the keywords for expansion.
For example, when the baseline query returns the correct result at the 115th position, the reformulated query returns it at the 2nd position.
So, here is our contribution: NLP2API, that is, Natural Language Phrase to API.
We translate a natural language query into relevant API classes for query reformulation and then we improve the code search in the process.
First, we take a generic natural language query and submit it to a search engine.
It retrieves relevant questions and answers from Stack Overflow.
We then mine the code segments posted in those threads using two term-weighting methods: PageRank and TF-IDF.
Thus, we get a list of candidate API classes from those threads that are used by millions of people.
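The mining step described above can be sketched as follows. This is a minimal, self-contained illustration with toy snippets and hypothetical Java API class names, not the actual NLP2API implementation; it shows only the TF-IDF half of the term weighting (PageRank would add a graph-based score on top).

```python
import math
from collections import Counter

def tfidf_scores(snippets):
    """Score candidate API classes across code snippets with TF-IDF.

    snippets: list of lists of API class names extracted from
    Stack Overflow code segments (toy stand-in for the real corpus).
    """
    n = len(snippets)
    # Document frequency: in how many snippets each API class appears.
    df = Counter()
    for s in snippets:
        df.update(set(s))
    scores = Counter()
    for s in snippets:
        tf = Counter(s)
        for api, f in tf.items():
            # Accumulate a smoothed TF-IDF weight over all snippets.
            scores[api] += f * math.log((1 + n) / (1 + df[api]))
    return scores

# Hypothetical snippets for a "convert image to gray scale" query.
snippets = [
    ["BufferedImage", "Color", "ImageIO"],
    ["BufferedImage", "Graphics2D"],
    ["ImageIO", "File"],
]
ranked = tfidf_scores(snippets).most_common()
```

APIs that recur across threads but are not ubiquitous end up with the highest weights, which is the intuition behind using TF-IDF here.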
Now, the big question is, which candidates are the most appropriate for the query at hand?
Well, we proposed two metrics – Borda count and Semantic proximity.
The essence of Borda count is this: if API A is more frequent than API B in the relevant Q&A threads from Stack Overflow, A is more appropriate than B.
So, it’s a kind of likelihood of A over B for the target query.
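A generic Borda count over per-thread API rankings can be sketched like this; the thread rankings and API names are hypothetical, and the exact scoring formula in the paper may differ.

```python
from collections import Counter

def borda_count(ranked_lists):
    """Aggregate per-thread API rankings with a Borda count.

    Each list ranks candidate API classes for one Stack Overflow
    thread; an API at position i in a list of length k earns k - i
    points, so frequent, highly ranked APIs accumulate more points.
    """
    scores = Counter()
    for ranking in ranked_lists:
        k = len(ranking)
        for i, api in enumerate(ranking):
            scores[api] += k - i
    return scores

# Hypothetical per-thread candidate rankings.
threads = [
    ["BufferedImage", "ImageIO", "Color"],
    ["BufferedImage", "Graphics2D"],
    ["ImageIO", "BufferedImage"],
]
scores = borda_count(threads)
# BufferedImage appears often and near the top, so it scores highest:
# 3 + 2 + 1 = 6 points.
```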
For the second metric, we preprocess the Stack Overflow corpus and develop a skip-gram model using FastText, an improved version of Word2Vec.
Then we determine, how close an API is to the given query keywords within the semantic space.
So, if API A is semantically closer to query Q than B, then A is more appropriate than B for the query.
So, we then combine these two metrics for each candidate API class, do the ranking, and return the Top-K classes as our reformulation terms.
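The semantic proximity metric can be illustrated with cosine similarity over word vectors; the 3-dimensional embeddings below are toy stand-ins for the real FastText skip-gram vectors, and the averaging scheme is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def proximity(api_vec, query_vecs):
    """Mean cosine similarity between an API class and the query keywords."""
    return sum(cosine(api_vec, q) for q in query_vecs) / len(query_vecs)

# Toy 3-d embeddings standing in for the FastText skip-gram vectors.
emb = {
    "convert":       [0.9, 0.1, 0.0],
    "gray":          [0.8, 0.3, 0.1],
    "BufferedImage": [0.85, 0.2, 0.05],
    "Socket":        [0.0, 0.1, 0.9],
}
query = [emb["convert"], emb["gray"]]
a = proximity(emb["BufferedImage"], query)
b = proximity(emb["Socket"], query)
# BufferedImage lies much closer to the query in the semantic space.
```

In the actual pipeline, a score like this would be normalized and combined with the Borda score before the Top-K ranking.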
So, we stand on the shoulders of two giants:
The massive developer crowd: we use their API relevance judgments through data mining.
Large-scale data analytics: we determine the semantic proximity between query keywords and candidate API classes.
We evaluate our approach from two dimensions:
API suggestion: we check our performance against the ground truth, to see whether we are suggesting correctly. Otherwise, the rest of the pipeline does not work.
Query reformulation/code search: We check whether our reformulation actually improves the query or not in terms of code search performance.
For the API suggestion, we collect natural language queries from four tutorial sites such as KodeJava and others.
We collect 300+ queries, along with the ground truth API classes from them.
Then we try to determine whether our approach can suggest appropriate API classes for those queries by mining crowd knowledge from Stack Overflow.
For the query reformulation part, we collect 4K code examples from GitHub and combine them with our ground truth code segments from the tutorial sites.
Then we determine whether our reformulated query actually works or not.
We answered five research questions in this paper.
The first research question: How does our tool, NLP2API, perform in API class suggestion?
We achieve 70%+ Top-5 accuracy with 50% precision, which is pretty good for an automatic approach.
That is, half of the suggested API classes are true positives, and the tool succeeds 70% of the time.
We also get an MRR of 0.55, which suggests that the first relevant API class generally appears between the 1st and 2nd positions, which is promising.
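The reciprocal-rank metric behind that number works like this, using the standard MRR definition and hypothetical per-query ranks:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries, given the rank of the first relevant API class
    for each query (None if nothing relevant was suggested)."""
    rr = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]
    return sum(rr) / len(rr)

# Hypothetical ranks of the first correct API class for four queries.
ranks = [1, 2, 2, None]
mrr = mean_reciprocal_rank(ranks)  # (1 + 0.5 + 0.5 + 0) / 4 = 0.5
```

An MRR around 0.5-0.55 thus means the first relevant suggestion typically sits near the top two positions.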
We also see that our two metrics, Borda count and semantic proximity, perform pretty well on their own.
But obviously, we combined them due to their orthogonal strengths, and then achieved the highest performance.
The second research question compares our approach with the state-of-the-art.
For Top-1, we see that our approach doubled the performance in all three metrics which is interesting.
For Top-5 results, we see that NLP2API also improves over the state-of-the-art by 38% in precision and 46% in reciprocal rank.
So, our approach is advancing the state-of-the-art which is highly expected.
In the third research question, we investigate whether our reformulation actually improves the baseline query or not.
Well, it does!
When the baseline natural language query is used, we achieve an accuracy of 50%.
However, when we keep adding the API classes suggested by our tool, we see performance improvement, which justifies our whole hypothesis.
For example, we get around 65% accuracy when we add 10-15 API classes, which is a fairly decent performance improvement.
We also get the same picture in the case of reciprocal rank.
So, yes, the query reformulation works!
In the fourth research question, we compare our query reformulation performance with three other approaches from the literature.
In particular, we determine query effectiveness, that is, the rank of the first correct result returned by a query.
We collect such ranks for all queries, determine their quartiles, and then compare with other approaches.
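The quartile comparison above can be computed as follows; the ranks are hypothetical, and the nearest-rank quartile convention here is an assumption (the paper may interpolate differently).

```python
import math

def quartiles(ranks):
    """Q1, Q2 (median), Q3 of query-effectiveness ranks, using the
    nearest-rank method on the sorted list."""
    s = sorted(ranks)
    def q(p):
        idx = max(0, math.ceil(p * len(s)) - 1)
        return s[idx]
    return q(0.25), q(0.50), q(0.75)

# Hypothetical rank of the first correct result for eight queries.
qe = [2, 60, 5, 1, 900, 10, 200, 2]
q1, q2, q3 = quartiles(qe)
# Sorted ranks: [1, 2, 2, 5, 10, 60, 200, 900] -> quartiles (2, 5, 60).
```

Lower quartiles mean relevant code surfaces earlier, which is how the approaches in the table are compared.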
Here, we see that our reformulation improves 50% of the queries which is the highest obviously.
However, these are the baseline quartiles, and these are our quartiles.
Well, our reformulations improved the ranks, advancing the state-of-the-art.
In the fifth research question, we investigate whether our reformulated queries can improve the results of traditional code search engines.
So, we first collect results from Google, Stack Overflow, and GitHub for the baseline queries.
Then we manually analyze them, compare them with our gold set, and set up a baseline performance. This is Stage-I.
In the second stage, we repeat the experiments with our reformulated queries.
Then we compare the performance of the two stages.
We see that Google obviously performs better than the other two, which is pretty much expected.
It achieves around 65% precision which is pretty good.
However, our reformulated queries can make it even better to like 75%.
So, although this approach is designed not for Google but for code search engines like GitHub, it can significantly improve the precision of Google in code search, which is great.
We also got a significant performance improvement in terms of NDCG, another state-of-the-art ranking metric, which supports our hypothesis.
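NDCG can be computed like this for one query, using binary relevance for simplicity; the relevance lists below are hypothetical, not results from the paper.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for one query; relevances lists the graded relevance of
    the results in ranked order (binary here for simplicity)."""
    def dcg(rels):
        # Standard discounted cumulative gain with log2 position discount.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical binary relevance of the top-5 results for one query,
# before and after reformulation.
baseline     = [0, 1, 0, 0, 1]
reformulated = [1, 1, 0, 1, 0]
```

Because NDCG discounts results by position, pushing relevant hits toward the top raises the score even when the number of hits barely changes.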
However, we faced some issues while comparing with Google, which are discussed in the paper.
So, these are the take-home messages.
Code search engines are NOT working well.
However, keyword search is a kind of universal idea.
So, we tried to improve the keyword search by providing more appropriate keywords for code search.
Our approach stands on the shoulders of two giants: (1) crowd-generated knowledge, and (2) large-scale data analytics.
We conducted experiments using 300+ queries, and answered 5 research questions.
Our approach outperformed the state-of-the-art in API suggestion, query reformulation and code search.
We have a replication package publicly available. It's on GitHub.
You can simply clone it and use it for your work.
Go ahead and develop the next best tool!
Thanks for your time and attention.
I am happy to take a few questions.