EFFECTIVE REFORMULATION OF QUERY FOR
CODE SEARCH USING CROWDSOURCED
KNOWLEDGE AND EXTRA-LARGE DATA
ANALYTICS
Masud Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan, Canada
International Conference on Software Maintenance and
Evolution (ICSME 2018), Madrid, Spain
IDEAL SCENARIO OF CODE SEARCH
2
Convert image to gray scale without losing transparency
REAL LIFE SCENARIO: GOOGLE
3
QUERY MATTERS!
REAL LIFE SCENARIO: GITHUB SEARCH
4
NOT WORKING!!
SOLUTION: QUERY REFORMULATION
5
Baseline query (first correct result at rank 115): Convert image to gray scale without losing transparency
Reformulated query (first correct result at rank 02): Convert image to gray scale without losing transparency BufferedImage Grayscale ImageEdit ColorConvertOp File Transparency ColorSpace BufferedImageOp Graphics ImageEffects
CONTRIBUTION
NLP2API: PROPOSED QUERY
REFORMULATION FOR CODE SEARCH
6
PageRank
TF-IDF
STEPS OF NLP2API
7
BORDA Count: A>B
if ∑rank(A) > ∑rank(B)
Semantic Proximity: A>B
if proximity(Q,A) > proximity(Q,B)
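The two ranking rules on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the NLP2API implementation: the toy rankings, the two-dimensional vectors, the max-normalization, and the unweighted sum used to combine the two scores are all assumptions made for the example. In the talk, the vectors come from a FastText skip-gram model trained on Stack Overflow.

```python
import math
from collections import defaultdict


def borda_scores(ranked_lists):
    """Sum Borda points per candidate: in a list of n items the top item
    earns n points, the next n - 1, and so on."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, api in enumerate(ranking):
            scores[api] += n - position
    return dict(scores)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def proximity_scores(query_terms, candidates, vectors):
    """Mean (non-negative) cosine similarity between a candidate API class
    and the query keywords; candidates without a vector score 0."""
    scores = {}
    for api in candidates:
        sims = [
            max(0.0, cosine(vectors[api], vectors[term]))
            for term in query_terms
            if api in vectors and term in vectors
        ]
        scores[api] = sum(sims) / len(sims) if sims else 0.0
    return scores


def normalize(scores):
    """Scale scores into [0, 1] by dividing by the largest score."""
    top = max(scores.values(), default=0.0)
    return {k: (v / top if top else 0.0) for k, v in scores.items()}


def suggest_apis(query_terms, ranked_lists, vectors, k=3):
    """Combine both scores (assumed: plain sum) and return the top-k classes."""
    borda = normalize(borda_scores(ranked_lists))
    prox = normalize(proximity_scores(query_terms, borda.keys(), vectors))
    combined = {api: borda[api] + prox[api] for api in borda}
    return sorted(combined, key=combined.get, reverse=True)[:k]


# Hypothetical candidate rankings, e.g. one per term-weighting method.
ranked_lists = [
    ["BufferedImage", "ColorConvertOp", "File"],
    ["BufferedImage", "File", "ColorConvertOp"],
]
# Hypothetical 2-D embeddings standing in for learned word vectors.
vectors = {
    "image": [1.0, 0.0],
    "gray": [0.8, 0.6],
    "BufferedImage": [0.9, 0.1],
    "ColorConvertOp": [0.7, 0.7],
    "File": [0.0, 1.0],
}
top_classes = suggest_apis(["image", "gray"], ranked_lists, vectors, k=2)
```

Normalizing both scores before adding them keeps either one from dominating purely because of its scale; a weighted sum would be an equally plausible choice.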
NLP2API: TWO PILLARS
8
NLP2API
Developer Crowd Data Analytics
EXPERIMENT: EVALUATION SCENARIOS
9
NLP2API
API Suggestion Query Reformulation
EXPERIMENT: DATASET COLLECTION
10
Java2s
CodeJava
310 Queries & Ground truth
4K Code segments
RQ1: HOW DOES NLP2API PERFORM IN API
CLASS SUGGESTION?
11
70%
50%
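The accuracy, precision and reciprocal-rank figures reported for API suggestion follow standard ranking metrics. The sketch below shows one common way to compute Hit@K, reciprocal rank and average precision for a ranked list of suggested API classes. The exact definitions used in the paper may differ (average precision in particular has several variants), and the two example queries are hypothetical.

```python
def hit_at_k(suggested, relevant, k):
    """1.0 if any of the top-k suggestions is in the ground truth, else 0.0."""
    return 1.0 if any(api in relevant for api in suggested[:k]) else 0.0


def reciprocal_rank_at_k(suggested, relevant, k):
    """1 / rank of the first relevant suggestion within the top k, else 0.0."""
    for rank, api in enumerate(suggested[:k], start=1):
        if api in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(suggested, relevant, k):
    """Mean of precision@rank taken at each relevant suggestion in the top k."""
    hits = 0
    precisions = []
    for rank, api in enumerate(suggested[:k], start=1):
        if api in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_over_queries(metric, results, k):
    """Average a per-query metric over (suggested, relevant) pairs."""
    return sum(metric(s, r, k) for s, r in results) / len(results)


# Hypothetical suggestions and ground-truth API classes for two queries.
results = [
    (["BufferedImage", "File", "ColorConvertOp"], {"BufferedImage", "ColorConvertOp"}),
    (["Scanner", "FileReader", "BufferedReader"], {"BufferedReader"}),
]
hit_1 = mean_over_queries(hit_at_k, results, 1)
mrr_3 = mean_over_queries(reciprocal_rank_at_k, results, 3)
map_3 = mean_over_queries(average_precision_at_k, results, 3)
```

Read this way, a mean reciprocal rank of 0.5 corresponds to the first relevant class appearing at rank 2 on average, which is how a value slightly above 0.5 is interpreted in the speaker notes.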
RQ2: CAN NLP2API OUTPERFORM THE
STATE-OF-THE-ART?
12
Metric   RACK (SANER 2016)   NLP2API   Improved (%)
Hit@1    20.97%              41.94%    *100%
MRR@1    0.21                0.42      *100%
MAP@1    20.97%              41.94%    *100%
Hit@5    64.19%              72.90%    14%
MRR@5    0.37                0.54      *46%
MAP@5    36.76%              50.56%    *38%
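The "Improved (%)" column is consistent with a relative improvement of NLP2API over RACK, that is, (new - baseline) / baseline. The short check below recomputes it from the values on this slide; the function name is illustrative.

```python
def relative_improvement(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (RACK, NLP2API) pairs copied from the slide.
rows = {
    "Hit@1": (20.97, 41.94),
    "MRR@1": (0.21, 0.42),
    "Hit@5": (64.19, 72.90),
    "MRR@5": (0.37, 0.54),
    "MAP@5": (36.76, 50.56),
}
improvements = {
    metric: round(relative_improvement(rack, nlp2api))
    for metric, (rack, nlp2api) in rows.items()
}
```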
RQ3: CAN REFORMULATED QUERIES
OUTPERFORM BASELINE NL QUERIES?
13
30%
RQ4: CAN NLP2API OUTPERFORM THE STATE-OF-
THE-ART IN QUERY REFORMULATION?
14
Method     Improved   Mean   Q1    Q2    Q3    Min   Max
QECK       72         139    02    11    74    01    1,861
RACK       105        75     02    08    60    01    971
COCABU     113        191    02    14    103   01    2,607
Baseline   -          -      07    25    145   02    1,460
NLP2API    *152       *172   *02   *10   *61   01    1,926
QE = Rank of the first relevant code example, Qi = i-th quartile of QE
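Query effectiveness (QE) as defined on this slide, together with its quartiles and the count of improved queries, can be sketched as follows. The rank values are made up for illustration, and `statistics.quantiles` uses its default "exclusive" method, which may not match how the quartiles in the table were computed.

```python
from statistics import quantiles


def query_effectiveness(results, relevant):
    """1-based rank of the first relevant result, or None if none is returned."""
    for rank, item in enumerate(results, start=1):
        if item in relevant:
            return rank
    return None


def count_improved(baseline_ranks, reformulated_ranks):
    """Number of queries whose first relevant result moved to a smaller rank.
    Assumes every query has a rank (no None values) in both lists."""
    return sum(1 for b, r in zip(baseline_ranks, reformulated_ranks) if r < b)


# Hypothetical QE values for six queries, before and after reformulation.
baseline_ranks = [115, 7, 25, 2, 145, 40]
reformulated_ranks = [2, 7, 10, 3, 61, 12]

improved = count_improved(baseline_ranks, reformulated_ranks)
q1, q2, q3 = quantiles(reformulated_ranks, n=4)
```

Lower quartiles mean that relevant code examples surface earlier in the result list, so a reformulation technique is better when it both improves more queries and yields smaller Q1 to Q3 values.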
RQ5: CAN NLP2API IMPROVE TRADITIONAL
CODE SEARCH RESULTS?
15
Stage-I
Stage-II
GitHub
RQ5: CAN NLP2API IMPROVE TRADITIONAL
CODE SEARCH RESULTS?
16
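The speaker notes for this slide mention NDCG as a second ranking metric. A minimal sketch of NDCG@K with binary relevance labels is shown below. The relevance lists are hypothetical, and the ideal ordering is derived from the returned results only, which is a common simplification and not necessarily the setup used in the paper.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(relevances, k):
    """DCG of the top-k results divided by the DCG of the best possible ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal else 0.0


# Hypothetical relevance labels (1 = relevant) for the top five results.
baseline_results = [0, 1, 0, 1, 0]
reformulated_results = [1, 1, 0, 0, 0]

baseline_ndcg = ndcg_at_k(baseline_results, 5)
reformulated_ndcg = ndcg_at_k(reformulated_results, 5)
```

Unlike precision, NDCG rewards placing the relevant results near the top, so two result lists with the same number of relevant items can still score differently.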
TAKE-HOME MESSAGES
17
NOT WORKING!!
NLP2API
API Suggestion
Query Reformulation
Code Search
THANK YOU !!! QUESTIONS?
18
Replication Package of NLP2API:
http://www.usask.ca/~masud.rahman/nlp2api
Contact: masud.rahman@usask.ca
Masud Rahman (@masud2336)

Editor's Notes

  1. Good morning, everyone. My name is Masud Rahman. I am a PhD student at the University of Saskatchewan, Canada, where I work with Prof. Chanchal Roy. My research area is code search and query reformulation. Today, I am going to talk about a code search approach that uses query reformulation. For the reformulation, we mine data from Stack Overflow and apply large-scale data analytics with word embeddings.
  2. First, let us look at some scenarios. This is the ideal scenario for code search: you provide a natural language query and expect a code segment that solves your problem exactly. But this does not happen in practice.
  3. In real life, you get a lot of search results. You have to analyze them and look for relevant code segments in those pages. If the query is good enough, you might get lucky and get a hit very quickly. Google, for example, is quite good at this. But it really depends on the query you choose.
  4. Unfortunately, other search engines are failing to keep up with Google. For example, GitHub code search does not work with such natural language queries. It does keyword matching, which is not sufficient if the query is not good. In fact, several code search engines, such as Koders and GoogleCode, are disappearing from the web, which is a bit strange. So, we try to improve code search.
  5. Now, how can we beat the status quo of code search? One possible way is to improve the query through query reformulation. Since keyword search is a kind of universal idea, we cannot avoid it. So what can we do? We improve the keyword search by providing more appropriate keywords. Now, what are those? Source code is different from natural language text. It has a smaller vocabulary, so we have to deal with it carefully. One possible way is to provide relevant API classes as the keywords for expansion. For example, where the baseline query returns the correct result at the 115th position, the reformulated query returns it at the 2nd position.
  6. So, here is our contribution: NLP2API, which stands for Natural Language Phrase to API. We translate a natural language query into relevant API classes for query reformulation, and we improve code search in the process.
  7. First, we take a generic natural language query and submit it to a search engine, which retrieves relevant questions and answers from Stack Overflow. We then mine the code segments posted in those threads using two term weighting methods, PageRank and TF-IDF. Thus, we get a list of candidate API classes from threads that are used by millions of people. Now, the big question is: which candidates are the most appropriate for the query at hand? We propose two metrics, Borda count and semantic proximity. The essence of Borda count is this: if API A is more frequent than API B in the relevant Q&A threads from Stack Overflow, A is more appropriate than B. So, it is a kind of likelihood of A over B for the target query. For the second metric, we preprocess the Stack Overflow corpus and develop a skip-gram model using FastText, an improved version of Word2Vec. Then we determine how close an API is to the given query keywords within the semantic space. If A is semantically closer to query Q than B, then A is more appropriate than B for the query. We then combine these two metrics for each candidate API class, do the ranking, and return the Top-K classes as our reformulation terms.
  8. So, we stand on the shoulders of two giants. (1) The massive developer crowd: we use their API relevance judgments through data mining. (2) Large-scale data analytics: we determine the semantic proximity between the query keywords and the candidate API classes.
  9. We evaluate our approach along two dimensions. API suggestion: we check our performance against the ground truth to see whether we are suggesting the correct API classes. Otherwise, the rest does not work. Query reformulation/code search: we check whether our reformulation actually improves the query in terms of code search performance.
  10. For the API suggestion, we collect natural language queries from four tutorial sites such as KodeJava and others. We collect 300+ queries, and we also collect the ground truth API classes from them. Then we determine whether our approach can suggest appropriate API classes for those queries by mining crowd knowledge from Stack Overflow. For the query reformulation part, we collect 4K code examples from GitHub and combine them with our ground truth code segments from the tutorial sites. Then we determine whether our reformulated queries actually work.
  11. We answered five research questions in this paper. The first research question: how does our tool, NLP2API, perform in API class suggestion? We achieve 70%+ Top-5 accuracy with 50% precision, which is pretty good for an automatic approach. That is, half of the suggested API classes are true positives, and the tool succeeds 70% of the time. We also get an MRR of 0.55, which suggests that the first relevant API class generally appears between the 1st and 2nd positions, which is promising. We also see that the two metrics, Borda count and semantic proximity, each perform pretty well. We combined them because their strengths are orthogonal, and thereby achieved the highest performance.
  12. The second research question compares our approach with the state of the art. For Top-1, we see that our approach doubles the performance in all three metrics, which is interesting. For Top-5 results, NLP2API also improves over the state of the art by 38% in precision and 46% in reciprocal rank. So, our approach advances the state of the art, as expected.
  13. In the third research question, we investigate whether our reformulation actually improves the baseline query. Well, it does! When the baseline natural language query is used, we achieve an accuracy of 50%. However, as we keep adding the API classes suggested by our tool, performance improves, which supports our hypothesis. For example, we get around 65% accuracy when we add 10-15 API classes, which is a fairly decent improvement. We see the same picture in the case of reciprocal rank. So, yes, the query reformulation works!
  14. In the fourth research question, we compare our query reformulation performance with three other approaches from the literature. In particular, we determine query effectiveness, that is, the rank of the first correct result returned by a query. We collect such ranks for all queries, determine their quartiles, and then compare with the other approaches. Here, we see that our reformulation improves 50% of the queries, which is the highest. These are the baseline quartiles, and these are our quartiles. Our reformulations improve the ranks and advance the state of the art.
  15. In the fifth research question, we investigate whether our reformulated queries can improve the results of traditional code search engines. We first collect results from Google, Stack Overflow and GitHub for the baseline queries. We then manually analyze them, compare them with our gold set, and establish a baseline performance. This is Stage-I. In the second stage, we repeat the experiments with our reformulated queries. Then we compare the performance of the two stages.
  16. We see that Google performs better than the other two, which is pretty much expected. It achieves around 65% precision, which is pretty good. However, our reformulated queries can raise it to about 75%. So, although this approach is designed not for Google but for code search engines like GitHub, it can significantly improve the precision of Google in code search, which is great. We also get a significant performance improvement in terms of NDCG, another state-of-the-art ranking metric, which supports our hypothesis. However, we faced some issues while comparing with Google, which are discussed in the paper.
  17. So, these are the take-home messages. Code search engines are not working well. However, keyword search is a kind of universal idea. So, we tried to improve the keyword search by providing more appropriate keywords for code search. Our approach stands on the shoulders of two giants: (1) crowd-generated knowledge, and (2) large-scale data analytics. We conducted experiments using 300+ queries and answered five research questions. Our approach outperformed the state of the art in API suggestion, query reformulation and code search.
  18. We have a replication package publicly available. It's on GitHub. You can simply clone it and use it for your work. Go ahead and develop the next best tool. Thanks for your time and attention. I am ready to take a few questions.