Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics

•Download as PPTX, PDF•

1 like•99 views

An effective query reformulation technique that adopts crowd sourced knowledge and large-scale data analytics from Stack Overflow Q&A site, and then improves source code search.

Technology

EFFECTIVE REFORMULATION OF QUERY FOR
CODE SEARCH USING CROWDSOURCED
KNOWLEDGE AND EXTRA-LARGE DATA
ANALYTICS
Masud Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan, Canada
International Conference on Software Maintenance and
Evolution (ICSME 2018), Madrid, Spain

IDEAL SCENARIO OF CODE SEARCH
2
Convert image to gray scale without losing transparency

REAL LIFE SCENARIO: GOOGLE
3
QUERY MATTERS!

REAL LIFE SCENARIO: GITHUB SEARCH
4
NOT WORKING!!

SOLUTION: QUERY REFORMULATION
5
Convert image to gray scale without losing transparency 115
BufferedImage Grayscale ImageEdit ColorConvertOp File
Transparency ColorSpace BufferedImageOp Graphics
ImageEffects
02
Convert image to gray scale without losing transparency
CONTRIBUTION

NLP2API: PROPOSED QUERY
REFORMULATION FOR CODE SEARCH
6

PageRank
TF-IDF
STEPS OF NLP2API
7
BORDA Count: A>B
if ∑rank(A) > ∑rank(B)
Semantic Proximity: A>B
if proximity(Q,A) > proximity(Q,B)

NLP2API: TWO PILLARS
8
NLP2API
Developer Crowd Data Analytics

EXPERIMENT: EVALUATION SCENARIOS
9
NLP2API
API Suggestion Query Reformulation

EXPERIMENT: DATASET COLLECTION
10
Java2s
CodeJava
310 Queries & Ground truth
4K Code segments

RQ1: HOW DOES NLP2API PERFORM IN API
CLASS SUGGESTION?
11
70%
50%

RQ2: CAN NLP2API OUTPERFORM THE
STATE-OF-THE-ART?
12
Metric RACK,
SANER 2016
NLP2API Improved(%)
Hit@1 20.97% 41.94% *100%
MRR@1 0.21 0.42 *100%
MAP@1 20.97% 41.94% *100%
Hit@5 64.19% 72.90% 14%
MRR@5 0.37 0.54 *46%
MAP@5 36.76% 50.56% *38%

RQ3: CAN REFORMULATED QUERIES
OUTPERFORM BASELINE NL QUERIES?
13
30%

RQ4: CAN NLP2API OUTPERFORM THE STATE-OF-
THE-ART IN QUERY REFORMULATION?
14
Method Improved Mean Q1 Q2 Q3 Min Max
QECK 72 139 02 11 74 01 1,861
RACK 105 75 02 08 60 01 971
COCABU 113 191 02 14 103 01 2,607
Baseline 07 25 145 02 1,460
NLP2API *152 *172 *02 *10 *61 01 1,926
QE = Rank of the first relevant code
example, Qi = i-th quartile of QE
QE = Rank of the first relevant code
example, Qi = i-th quartile of QE

RQ5: CAN NLP2API IMPROVE TRADITIONAL
CODE SEARCH RESULTS?
15
Stage-I
Stage-II
GitHub

RQ5: CAN NLP2API IMPROVE TRADITIONAL
CODE SEARCH RESULTS?
16

TAKE-HOME MESSAGES
17
NOT WORKING!!
NLP2API
API Suggestion
Query Reformulation
Code Search

THANK YOU !!! QUESTIONS?
18
Replication Package of NLP2API:
http://www.usask.ca/~masud.rahman/nlp2api
Contact: masud.rahman@usask.ca
Masud Rahman (@masud2336)

Investigating logs is getting more and more important as more of our lives get recorded, and graph techniques promise to help us to reveal the connections in our data. However, scale challenges forensics in many enterprise and federal settings. By focusing on the fundamentals around the pure math, GPU accelerated implementation, and the experts performing the process, we can go quite far. Demos span security, fraud, & crime, and cover concepts such as UMAP/K-NN/DL, hypergraphs, and low-code investigation automation via visual graph-based record & replay.

100X Investigations - Graphistry / Microsoft BlueHat

graphistry

Graphistry presented techniques for accelerating security investigations using graph technologies. They demonstrated how generating a virtual hypergraph from multiple data sources allows analysts to easily pivot over the data. They also discussed how automating common investigation tasks using the graph model can scale workflows. Graphistry uses GPUs to enable interactive analysis of large datasets. Their goal is to "100X" productivity by enabling analysts to more quickly extract insights and forage for relevant data through virtual hypergraph queries and automation.

numPYNQ @ NGCLE@e-Novia 15.11.2017

NECST Lab @ Politecnico di Milano

numPYNQ is a hardware library that offers an accelerated version of NumPy core functions to be used transparently from data science applications. It implements these functions on an FPGA to provide better performance, energy efficiency, and flexibility compared to GPUs. Experimental results show speedups for tasks like matrix multiplication and cross-correlation. The library uses runtime input analysis and adaptation to optimize implementations. It has potential in the growing big data market, and the team plans partnerships and a freemium business model to commercialize numPYNQ.

NLP2API: Replication package accepted by ICSME 2018

Masud Rahman

Hadoop dev 01

Vivian S. Zhang

This document discusses big data analysis and Hadoop. It begins by describing different stages of data analysis and roles of various personnel. It then discusses challenges of analyzing big data using traditional tools and how Hadoop addresses these challenges through its distributed architecture and MapReduce programming model. Several case studies are presented where companies have used Hadoop to perform large-scale data analysis. Key components of Hadoop like MapReduce, Pig, Hive and Mahout are also introduced.

HereWeCode 2022: Dalhousie University

Masud Rahman

Digital image processing

juangp3

The document outlines the fundamental steps for digital image processing projects, including image acquisition, preprocessing, segmentation, representation and description, recognition and interpretation, and postprocessing. It discusses improving images for human or machine use, and describes common image processing techniques like enhancement, thresholding, representation, description, recognition, and interpretation. The overall methodology presented is meant to increase the likelihood of success for image processing projects.

This document describes research on using region-oriented convolutional neural networks for object retrieval. It discusses using local CNNs like CaffeNet, Fast R-CNN, and SDS to extract visual features from object candidates in images. These features are used to match against query descriptors. Pooled regional features are ranked to retrieve relevant shots. Fine-tuning pre-trained networks on larger datasets like COCO can improve retrieval accuracy. Combining global and local approaches through re-ranking provides an additional boost in performance.

Keeping Identity Graphs In Sync With Apache Spark

Databricks

The online advertising industry is based on identifying users with cookies, and showing relevant ads to interested users. But there are many data providers, many places to target ads and many people browsing online. How can we identify users across data providers? The first step in solving this is by cookie mapping: a chain of server calls that pass identifiers across providers. Sadly, chains break, servers break, providers can be flaky or use caching and you may never see the whole of the chain. The solution to this problem is constructing an identity graph with the data we see: in our case, cookie ids are nodes, edges are relations and connected components of the graph are users. In this talk I will explain how Hybrid Theory leverages Spark and GraphFrames to construct and maintain a 2000 million node identity graph with minimal computational cost.

62316925 dip-digital-image-processing-digital-communication-cdma-medical-imag...

Pantech Solutions Pvt Ltd

The document lists 79 topics related to digital image processing, digital signal processing, and communications. The topics cover areas such as noise removal, image fusion, image segmentation, image retrieval, watermarking, steganography, and medical image analysis. Many of the topics are aimed at applications in digital cameras, Photoshop, defense, satellites, surveillance systems, web applications, and medical imaging.

Big Data in the Cloud

Amazon Web Services

AWS Summit 2014 Brisbane - Breakout 5 Most organisations are facing ever growing volumes of data that need to be stored and processed but most importantly analysed to bring value to the business. Big Data appears to have solutions to address these challenges but the landscape is littered with acronyms and obscure naming conventions such as MPP, NoSQL, Hadoop, Hive and HBase. Attend this Session to find out - What is the value proposition for each of these technologies - How do they fit with more traditional Big Data solutions such as data warehouses? - How AWS can help organisations get maximum value from their data Presenter: Russell Nash, Solutions Architect, APAC, Amazon Web Services

AbhijitTripathy

Abhijit Tripathy

This document is a resume for Abhijit Tripathy summarizing his education, skills, experience, projects and teaching experience. He is currently pursuing an MS in Computer Science from UC San Diego with a GPA of 3.75/4 and has a BTech in ECE from NIT Rourkela in India with a CGPA of 8.51/10. His skills include programming in Python, C/C++, Java and Matlab as well as frameworks like Keras, Tensorflow and OpenCV. He has work experience as a Software Engineering Intern at Experian and as a Software Engineer at Samsung R&D, and has conducted research projects in areas like visual question answering, sentiment analysis, face recognition

Generative models in the arts

Jorge Davila-Chacon

The document discusses using generative models like VAEs and GANs to explore creative possibilities in the arts with underspecified objectives. It describes an experiment using a customized VAEGAN architecture that combines a VAE and WGAN-GP to generate high resolution images in an interactive GUI. The results included face generation and tracking that could be composed into videos. References are provided for the machine learning techniques used.

3. _dsp_2012-13_titles

Pantech ProEd Pvt Ltd

This document lists 135 topics related to digital signal processing, digital image processing, and communication systems. The topics cover a wide range of fields including image enhancement, compression, watermarking, steganography, biometrics, medical imaging, remote sensing, and wireless communication systems. Many of the topics involve implementing algorithms on DSP processors like the Blackfin and TMS320Cxx.

IEEE 2012 DIP & dsp_2012-13_titles

Srinivasan Natarajan

IEEE 2012 Projects,academic projects in .net,academic projects in java,b tech mini projects,btech projects,electrical projects for students,electronic engineering final year project,electronic engineering final year projects,electronic final year project,electronics students projects,embedded in chennai,embedded projects chennai,engineering final project,engineering final projects,engineering projects in chennai,engineering projects in java,final year embedded projects,final year engineering projects,final year engineering projects chennai,final year engineering projects in chennai,final year ieee projects chennai,final year it projects,final year project chennai,final year project in chennai,final year project in electronics,final year project of electronics,final year projects for it,final year projects for mca,final year projects in .net,final year projects in chennai,final year projects in electronics,final year projects in embedded systems,final year projects in it,final year projects on embedded systems,final year student project,final year student projects,ieee embedded projects,ieee projects,ieee projects chennai,ieee projects for mca,ieee projects in .net,ieee projects in chennai,ieee projects in java,ieee projects in vlsi,ieee projects on embedded systems,ieee projects titles,ieee students projects,mca academic projects,mca final project,mca final year project,mca final year project in chennai,mca projects,mca projects chennai,mca projects titles,project in vlsi,project of mca,projects for mca,projects in vlsi,student project chennai,student projects in java,vlsi in chennai,year projects,Real Time IEEE Projects,Live Projects,Embedded Live Projects,Power Electronics Projects,Power System Projects,ME Projects,M.Tech Projects,VLSI Final Year projects,Embedded final Year Projects,Real Time Embedded Projects,Real Time Software Projects,Live Java Projects,Dot net Projects in Chennai,.Net Projects,B.tech projects,BE Projects,Real Time Project MBA, Real Time Project BE,Project Work BE,Real Time Project MCA,Real Time Project BE Electronic,Computer Software Training Embedded Systems, Real Time Project,Computer Project Work,Real Time Project IT,Embedded Training,Real Time Project Me,Project Work Ieee Based,Real Time Project B Tech,Project Work MCA,Project Work Computer Science,Project Work M E,Engineering Project Consultants,Real Time Project MSC,Real Time Project M Tech,Real Time Project Bio Medical,Project Consultants For Electronic,Project Work B Tech,Real Time Project BE Electrical,Real Time Project Dot Net,Real Time Project BCA,Project Work M Phil,Real Time Project M Phil,Project Work Embedded System,Real Time Project Embedded System,Project Work M Tech,Project Engineering,Real Time Project Java,Real Time Project PHD,Project Work IT,Real Time Project Networking,Real Time Project BSc,Real Time Project Matlab,Computer Software Training Embedded Network,Project Work Java,Real Time Project Vlsi,Real Time Project Animation,Project Work HTML,Real

The Potential of GPU-driven High Performance Data Analytics in Spark

Spark Summit

This document discusses Andy Steinbach's presentation at Spark Summit Brussels on using GPUs to drive high performance data analytics in Spark. It summarizes that GPUs can help scale up compute intensive tasks and scale out data intensive tasks. Deep learning is highlighted as a new computing model that is being applied beyond just computer vision to areas like medicine, robotics, self-driving cars, and predictive analytics. GPU-powered systems like NVIDIA's DGX-1 are able to achieve superhuman performance for deep learning tasks by providing high memory bandwidth and FLOPS.

Performance evaluation of GANs in a semisupervised OCR use case

Florian Wilhelm

This document discusses using generative adversarial networks (GANs) for a semi-supervised optical character recognition (OCR) use case involving vehicle identification numbers (VINs). It describes the text spotting pipeline, challenges with limited training data, data augmentation techniques, and implementing a GAN for character detection. Evaluation shows the semi-supervised GAN approach outperforms other methods, achieving over 99% accuracy on VIN detection and recognition from images using only 85 labeled examples. Key learnings include that custom solutions can outperform off-the-shelf tools for specific tasks, and GANs are well-suited for problems with limited labeled data when combined with data augmentation.

Performance evaluation of GANs in a semisupervised OCR use case

inovex GmbH

Online vehicle marketplaces are embracing artificial intelligence to ease the process of selling a vehicle on their platform. The tedious work of copying information from the vehicle registration document into some web form can be automated with the help of smart text-spotting systems, in which the seller takes a picture of the document, and the necessary information is extracted automatically. Florian Wilhelm details the components of a text-spotting system, including the subtasks of object detection and optical character recognition (OCR). Florian elaborates on the challenges of OCR in documents with various distortions and artifacts, which rule out off-the-shelf products for this task. After offering an overview of semisupervised learning based on generative adversarial networks (GANs), Florian evaluates the performance gains of this method compared to supervised learning. More specifically, for a varying amount of labeled data, he compares the accuracy of a convolution neural network (CNN) to a GANthat uses additional unlabeled data during the training phase, showing that GANs significantly outperform classical CNNs in use cases with a lack of labeled data. What you'll learn: Understand how semisupervised learning with GANs works Explore beneficial semisupervised methods based on GANs for use cases with a limited amount of labeled data Gain insight into an interesting OCR use case of an online vehicle marketplace Event: O'Reilly Artificial Intelligence Conference, London, 11.10.2018 Speaker: Dr. Florian Wilhelm Mehr Tech-Vorträge: www.inovex.de/vortraege Mehr Tech-Artikel: www.inovex.de/blog

Modern OpenGL scientific visualization

Nicolas Rougier

This document discusses using shader-based rendering for modern and interactive scientific visualization in Python. It outlines limitations of existing Python visualization libraries like Matplotlib, and how shader programming in OpenGL can address these through higher quality rendering of text, dashed lines, image filters, grids and other primitives. The approach improves rendering quality and speeds through GPU processing while maintaining an easy to use Python interface. Several open source projects aim to integrate these techniques into interactive visualization libraries.

My Projects & My Stories

Justin Cui

Justin Cui introduces himself and summarizes his work experience. He has 12 years of R&D experience in the industry instruments and communication systems fields. He then lists several projects he worked on at different companies, including developing test compilers, hardware simulators, and drivers for various testing and imaging equipment. For each project, he outlines his role, team size, and key tasks involved in requirements analysis, design, implementation, and verification.

Obscenity Detection in Images

Anil Kumar Gupta

小數據如何實現電腦視覺，微軟AI研究首席剖析關鍵

CHENHuiMei

1) Deep learning has achieved great success in many computer vision tasks such as image classification, object detection, and segmentation. Convolutional neural networks (CNNs) are often used. 2) The size and quality of training datasets is crucial, as deep learning models require large amounts of labeled data to learn meaningful patterns. Data augmentation and synthesis can help increase data quantity and quality. 3) Semi-supervised and transfer learning techniques can help address the challenge of limited labeled data by making use of unlabeled data as well. Generative adversarial networks (GANs) have also been used for data augmentation.

Android based application for graph analysis final report

Pallab Sarkar

This document describes an Android application for graph analysis using image processing techniques. The application allows users to select points on an image of a graph and obtain the coordinate values. It uses OpenCV for preprocessing images, including converting to grayscale, edge detection and contour finding. The application controller directs the preprocessing, postprocessing and coordinate calculation modules. The postprocessing module identifies graph features to allow interpolation of values from pixel locations. The coordinate calculation module uses known scale factors to convert pixel coordinates to value coordinates. The application provides an interactive way to analyze graphical data on a smartphone.

Resume_Vignesh_ThulasiDass

VigneshThulasiDass

This document is a resume for Vignesh Thulasi Dass summarizing his education and experience. He has a Master's degree in Data Analytics from Northeastern University and a Bachelor's degree in Computer Science. His skills include programming languages like R, Python, and SQL as well as tools like Hadoop, Tableau, and PowerBI. He has work experience as a Software Developer at Just Dial India where he performed website and data analysis. His academic projects include predicting Airbnb user bookings using R and Tableau and analyzing household energy consumption using PySpark and PowerBI. He also has leadership experience co-founding an NGO and being a member of clubs at Northeastern and in Bangalore.

A Hands-on Intro to Data Science and R Presentation.ppt

Sanket Shikhar

AI in the Financial Services Industry

Alison B. Lowndes

Accelerate AI w/ Synthetic Data using GANs

Renee Yao

Renee Yao from NVIDIA gave a presentation on using generative adversarial networks (GANs) to generate synthetic data. She discussed how GANs work by having two neural networks, a generator and discriminator, compete against each other. She then provided several examples of real-world applications of GANs, including generating images, video, and medical data. She concluded by discussing NVIDIA's DGX systems for powering large-scale deep learning and GAN projects.

The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...

Masud Rahman

The document summarizes a study on improving search queries for bug localization using natural language text from bug reports. The study evaluated different keyword selection techniques, generated optimal search queries using a genetic algorithm, and compared optimal versus non-optimal queries. Key findings include: 1) Current approaches failed to identify keywords for 34% of bug reports, 2) A genetic algorithm produced optimal queries that achieved up to 80% higher performance than baselines, and 3) Optimal queries differed in using less frequent, less ambiguous, noun-heavy keywords located in bug report bodies.

PhD Seminar - Masud Rahman, University of Saskatchewan

Masud Rahman

Similar to Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics

Region-oriented Convolutional Networks for Object Retrieval

Universitat Politècnica de Catalunya

Keeping Identity Graphs In Sync With Apache Spark

Databricks

62316925 dip-digital-image-processing-digital-communication-cdma-medical-imag...

Pantech Solutions Pvt Ltd

Big Data in the Cloud

Amazon Web Services

AbhijitTripathy

Abhijit Tripathy

Generative models in the arts

Jorge Davila-Chacon

3. _dsp_2012-13_titles

Pantech ProEd Pvt Ltd

IEEE 2012 DIP & dsp_2012-13_titles

Srinivasan Natarajan

The Potential of GPU-driven High Performance Data Analytics in Spark

Spark Summit

Performance evaluation of GANs in a semisupervised OCR use case

Florian Wilhelm

Performance evaluation of GANs in a semisupervised OCR use case

inovex GmbH

Modern OpenGL scientific visualization

Nicolas Rougier

My Projects & My Stories

Justin Cui

Obscenity Detection in Images

Anil Kumar Gupta

小數據如何實現電腦視覺，微軟AI研究首席剖析關鍵

CHENHuiMei

Android based application for graph analysis final report

Pallab Sarkar

Resume_Vignesh_ThulasiDass

VigneshThulasiDass

A Hands-on Intro to Data Science and R Presentation.ppt

Sanket Shikhar

AI in the Financial Services Industry

Alison B. Lowndes

Accelerate AI w/ Synthetic Data using GANs

Renee Yao

Similar to Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics (20)

Region-oriented Convolutional Networks for Object Retrieval

Keeping Identity Graphs In Sync With Apache Spark

62316925 dip-digital-image-processing-digital-communication-cdma-medical-imag...

Big Data in the Cloud

AbhijitTripathy

Generative models in the arts

3. _dsp_2012-13_titles

IEEE 2012 DIP & dsp_2012-13_titles

The Potential of GPU-driven High Performance Data Analytics in Spark

Performance evaluation of GANs in a semisupervised OCR use case

Modern OpenGL scientific visualization

My Projects & My Stories

Obscenity Detection in Images

小數據如何實現電腦視覺，微軟AI研究首席剖析關鍵

Android based application for graph analysis final report

Resume_Vignesh_ThulasiDass

A Hands-on Intro to Data Science and R Presentation.ppt

AI in the Financial Services Industry

Accelerate AI w/ Synthetic Data using GANs

More from Masud Rahman

The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...

Masud Rahman

PhD Seminar - Masud Rahman, University of Saskatchewan

Masud Rahman

PhD proposal of Masud Rahman

Masud Rahman

The document outlines Masud Rahman's PhD thesis proposal on supporting source code search with context-aware, analytics-driven query reformulation. The proposal discusses three research questions: 1) evaluating term weighting techniques for keyword selection from source code and bug reports, 2) incorporating bug report quality for local code search, and 3) leveraging crowd knowledge and data analytics to deliver query keywords. The contribution summary highlights techniques for term dependence, quality-aware bug localization, and using crowd knowledge and large data analytics.

PhD Comprehensive exam of Masud Rahman

Masud Rahman

This document presents a systematic literature review of automated query reformulations for source code search. It discusses seven research questions explored in the review, including the methods, algorithms, data sources, evaluation metrics, challenges, publication trends, and comparisons between local and internet-scale code search queries. The review analyzed over 50 primary studies identified through a multi-database search and filtering process. Key findings include the predominant use of term weighting, query expansion and reduction techniques, evaluations based on standard information retrieval metrics, and various challenges like vocabulary mismatch that remain unsolved. Opportunities for future work are also identified, such as leveraging bug reports for keyword selection and using semantic representations to address vocabulary issues.

Doctoral Symposium of Masud Rahman

Masud Rahman

This document summarizes a talk given by Masud Rahman, a PhD candidate at the University of Saskatchewan. The talk focused on Rahman's PhD thesis research, which aims to improve code search by generating context-aware, analytics-driven queries through effective reformulation. The talk outlined three research questions around improving keyword selection, incorporating bug report quality, and using crowd knowledge and data analytics. It provided an overview of Rahman's PhD thesis and publications addressing the research questions. Evaluation methods for the proposed approaches were also discussed.

Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...

Masud Rahman

ICSE2018-Poster-Bug-Localization

Masud Rahman

This document summarizes a study on improving bug localization through considering the quality of bug reports and reformulating bug report queries. The study analyzes 5,500 bug reports from eight projects and finds that existing bug localization techniques perform poorly when bug reports lack useful information or contain excessive stack traces. Preliminary findings suggest context-aware query reformulation may help address these limitations by improving the quality and relevance of the queries used.

MSR2017-Challenge

Masud Rahman

This document summarizes research into the impact of continuous integration (CI) on code reviews. The researchers studied over 500,000 pull requests and builds from open source projects to answer three questions: 1) Whether build status influences code review participation, 2) If frequent builds improve review quality, and 3) Predicting if a build will trigger new reviews. They found that passed builds were more associated with new reviews and comments. Projects with frequent builds received more review comments that remained steady over time, unlike less frequently built projects. Their machine learning model could predict if a build would trigger new reviews with up to 64% accuracy.

MSR2017-RevHelper

Masud Rahman

This document presents research on predicting the usefulness of code review comments using textual features and developer experience. The researchers analyzed 1,482 code review comments, manually classified as useful or non-useful. They found non-useful comments had more stop words and less code elements, while useful comments had higher conceptual similarity to changed code. More experienced reviewers provided more useful comments. The researchers also built a Random Forest model that predicts comment usefulness with 66% accuracy, outperforming baselines. Their work provides the first automated approach to assess code review comment usefulness.

STRICT-SANER2017

Masud Rahman

The document describes a technique called STRICT that uses TextRank and POSRank algorithms to identify important terms from a software change task description to generate an effective initial search query. An experiment on 1,939 change tasks from 8 open source projects found that STRICT improved the query effectiveness in 57.84% of cases compared to baseline queries like title alone. STRICT also showed better retrieval performance based on metrics like mean average precision and mean recall compared to state-of-the-art techniques. The approach validates the use of graph-based ranking algorithms to address the challenge of generating relevant initial search queries from natural language change task descriptions.

MSR2015-Challenge

Masud Rahman

The document analyzes why some questions on Stack Overflow remain unresolved and explores whether machine learning can predict which questions will be unresolved. It finds that unresolved questions have higher topic entropy, meaning they are less specific. Owners of unresolved questions reject answers more often, have lower reputation, and are less active on Stack Overflow. Models using features like topic entropy, answer rejection ratio, and owner reputation achieved up to 78% accuracy at predicting unresolved questions. The study aims to help improve question quality on Stack Overflow.

MSR2014-Challenge

Masud Rahman

This document analyzes data from over 78,000 pull requests on GitHub to understand why pull request failure rates are high. It finds that 57.05% of pull requests failed, most often due to issues with recursion/refactoring, database queries, arrays/functions. Programming languages like Java, JavaScript and Ruby saw more failed pull requests on average than PHP. Projects in IDE and framework domains had the most pull request activity. Older projects, projects with more forks/developers, and projects where developers had 20-50 months of experience saw the highest numbers of pull requests and failures. The study aims to help understand and address common reasons for pull request failures on GitHub.

CodeInsight-SCAM2015

Masud Rahman

The document describes a technique called CodeInsight that mines insightful code comments from crowdsourced knowledge on Stack Overflow. An exploratory study of Stack Overflow discussions found that around 22% of comments discuss tips, bugs, or warnings related to code examples. CodeInsight uses heuristics like popularity, relevance, comment rank, sentiment, and word count to retrieve these insightful comments for a given code segment. An empirical evaluation showed the technique could recall over 80% of relevant comments on average. A user study with professional developers found that 80% of the comments recommended by CodeInsight were accurate and useful.

STRICT-SANER2015

Masud Rahman

This document proposes using TextRank to identify initial search terms for software change tasks. It adapts TextRank, originally used for keyword extraction and text summarization, to build a graph of terms from development artifacts and rank them. An evaluation on 349 change tasks from two systems identifies search terms, which outperform an existing approach in solving more tasks with higher precision and recall. The approach recommends initial search queries to help developers find relevant code artifacts when performing change tasks.

CMPT-842-BRACK

Masud Rahman

This document discusses a method called BRACK for identifying bug-prone API methods using crowdsourced knowledge from Stack Overflow. BRACK ranks API method invocations based on two heuristics: API Context-Susceptibility (ACS) which estimates how context can impact an invocation, and API Error-Associativity (AEA) which calculates the co-occurrence of an invocation in defective and corrected code segments. An evaluation of BRACK on 8 open source systems found that it achieved a top-3 accuracy of 75.93% in identifying bug-prone invocations, and that ACS was more effective than AEA. The evaluation also showed BRACK had no significant bias towards system size or API package and performed comparably

RACK-Tool-ICSE2017

Masud Rahman

The document presents research on RACK, a tool that uses crowdsourced knowledge from Stack Overflow to reformulate natural language code search queries into relevant API names. The researchers analyzed Stack Overflow data to find that answers frequently refer to APIs by name and cover a high percentage of core APIs. They also found question titles contain terms relevant to real code search queries. RACK maps query terms to API names using this data, then searches GitHub code examples. An evaluation showed RACK returns relevant examples with 79% top-10 accuracy, outperforming existing techniques.

RACK-SANER2016

Masud Rahman

RACK is an approach that automatically recommends relevant APIs for code search queries using crowdsourced knowledge from Stack Overflow questions, answers, and titles. An exploratory study found that accepted Stack Overflow answers frequently mention API names and cover a large percentage of standard APIs. Question titles often contain keywords relevant to code search. RACK constructs an API-token mapping database from Stack Overflow and ranks APIs for a given query based on heuristics measuring keyword-API co-occurrence and coherence. An evaluation found RACK achieved around 79% top-10 accuracy and outperformed existing techniques, demonstrating the potential of leveraging crowdsourced technical knowledge for API recommendation.

QUICKAR-ASE2016-Singapore

Masud Rahman

QUICKAR is a technique for automatically reformulating code search queries using crowdsourced knowledge from Stack Overflow. It constructs an adjacency list database of terms from Stack Overflow question titles. For an initial search query, it identifies reformulation candidates by comparing the query terms to terms in the adjacency list database and project source code. In experiments, QUICKAR significantly outperformed a baseline technique, improving over 50% of queries while worsening less than 50%, by leveraging vocabulary from Stack Overflow to address mismatches between developer queries and code.

CORRECT-ToolDemo-ASE2016

Masud Rahman

CORRECT is a code reviewer recommendation tool that: - Recommends appropriate code reviewers automatically by mining developers' contributions across projects - Provides recommendation rationales that fit within developers' workflows - Achieves over 90% accuracy in recommending reviewers based on library and technology experience - Outperforms an existing technique (RevFinder) with 92.15% top-5 accuracy, 85.93% mean precision and 81.39% mean recall - Performs similarly on open source projects with 85.20% top-5 accuracy, demonstrating effectiveness for public and private codebases

CORRECT-ICSE2016

Masud Rahman

The document describes CORRECT, a technique for recommending code reviewers for pull requests on GitHub based on developers' cross-project and technology experience. It evaluates CORRECT using codebases from both a commercial software company and open source projects. The results show that CORRECT achieves over 90% accuracy in recommending reviewers, outperforming a baseline technique. Library and technology experience are also found to be good proxies for code review skills. CORRECT performs equally well on both private and public codebases without bias toward any development framework.

More from Masud Rahman (20)

The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...

PhD Seminar - Masud Rahman, University of Saskatchewan

PhD proposal of Masud Rahman

PhD Comprehensive exam of Masud Rahman

Doctoral Symposium of Masud Rahman

Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...

ICSE2018-Poster-Bug-Localization

MSR2017-Challenge

MSR2017-RevHelper

STRICT-SANER2017

MSR2015-Challenge

MSR2014-Challenge

CodeInsight-SCAM2015

STRICT-SANER2015

CMPT-842-BRACK

RACK-Tool-ICSE2017

RACK-SANER2016

QUICKAR-ASE2016-Singapore

CORRECT-ToolDemo-ASE2016

CORRECT-ICSE2016

Recently uploaded

Session 1 - Intro to Robotic Process Automation.pdf

UiPathCommunity

👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program: https://bit.ly/Automation_Student_Kickstart In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC. 📕 Detailed agenda: What is RPA? Benefits of RPA? RPA Applications The UiPath End-to-End Automation Platform UiPath Studio CE Installation and Setup 💻 Extra training through UiPath Academy: Introduction to Automation UiPath Business Automation Platform Explore automation development with UiPath Studio 👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/

"Scaling RAG Applications to serve millions of users", Kevin Goedecke

Fwdays

GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph

Neo4j

Essentials of Automations: Exploring Attributes & Automation Parameters

Safe Software

Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they? Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality. You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.

Main news related to the CCS TSI 2023 (2023/1695)

Jakub Marek

An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers. The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 . The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .

Introduction of Cybersecurity with OSS at Code Europe 2024

Hiroshi SHIBATA

I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems. The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS. Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application. I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.

LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...

DanBrown980551

This LF Energy webinar took place June 20, 2024. It featured: -Alex Thornton, LF Energy -Hallie Cramer, Google -Daniel Roesler, UtilityAPI -Henry Richardson, WattTime In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms. This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups. Three primary specifications will be discussed: -Discovery and client registration, emphasizing transparent processes and secure and private access -Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure -Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data

GNSS spoofing via SDR (Criptored Talks 2024)

Javier Junquera

In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security. This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing. The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...

saastr

Christine's Supplier Sourcing Presentaion.pptx

christinelarrosa

"$10 thousand per minute of downtime: architecture, queues, streaming and fin...

Fwdays

Direct losses from downtime in 1 minute = $5-$10 thousand dollars. Reputation is priceless. As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency. We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.

Apps Break Data

Ivo Velitchkov

How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?

PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx

christinelarrosa

The Microsoft 365 Migration Tutorial For Beginner.pptx

operationspcvita

Dandelion Hashtable: beyond billion requests per second on a commodity server

Antonios Katsarakis

This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).

"Choosing proper type of scaling", Olena Syrota

Fwdays

Must Know Postgres Extension for DBA and Developer during Migration

Mydbops

Mydbops Opensource Database Meetup 16 Topic: Must-Know PostgreSQL Extensions for Developers and DBAs During Migration Speaker: Deepak Mahto, Founder of DataCloudGaze Consulting Date & Time: 8th June | 10 AM - 1 PM IST Venue: Bangalore International Centre, Bangalore Abstract: Discover how PostgreSQL extensions can be your secret weapon! This talk explores how key extensions enhance database capabilities and streamline the migration process for users moving from other relational databases like Oracle. Key Takeaways: * Learn about crucial extensions like oracle_fdw, pgtt, and pg_audit that ease migration complexities. * Gain valuable strategies for implementing these extensions in PostgreSQL to achieve license freedom. * Discover how these key extensions can empower both developers and DBAs during the migration process. * Don't miss this chance to gain practical knowledge from an industry expert and stay updated on the latest open-source database trends. Mydbops Managed Services specializes in taking the pain out of database management while optimizing performance. Since 2015, we have been providing top-notch support and assistance for the top three open-source databases: MySQL, MongoDB, and PostgreSQL. Our team offers a wide range of services, including assistance, support, consulting, 24/7 operations, and expertise in all relevant technologies. We help organizations improve their database's performance, scalability, efficiency, and availability. Contact us: info@mydbops.com Visit: https://www.mydbops.com/ Follow us on LinkedIn: https://in.linkedin.com/company/mydbops For more details and updates, please follow up the below links. Meetup Page : https://www.meetup.com/mydbops-databa... Twitter: https://twitter.com/mydbopsofficial Blogs: https://www.mydbops.com/blog/ Facebook(Meta): https://www.facebook.com/mydbops/

Choosing The Best AWS Service For Your Website + API.pptx

Brandon Minnick, MBA

Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API? Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose? Which one is cheapest? Which one is fastest? Which one will scale to meet our needs? Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!

Columbus Data & Analytics Wednesdays - June 2024

Jason Packer

[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...

Jason Yip

The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.

Recently uploaded (20)

Session 1 - Intro to Robotic Process Automation.pdf

"Scaling RAG Applications to serve millions of users", Kevin Goedecke

GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph

Essentials of Automations: Exploring Attributes & Automation Parameters

Main news related to the CCS TSI 2023 (2023/1695)

Introduction of Cybersecurity with OSS at Code Europe 2024

LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...

GNSS spoofing via SDR (Criptored Talks 2024)

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...

Christine's Supplier Sourcing Presentaion.pptx

"$10 thousand per minute of downtime: architecture, queues, streaming and fin...

Apps Break Data

PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx

The Microsoft 365 Migration Tutorial For Beginner.pptx

Dandelion Hashtable: beyond billion requests per second on a commodity server

"Choosing proper type of scaling", Olena Syrota

Must Know Postgres Extension for DBA and Developer during Migration

Choosing The Best AWS Service For Your Website + API.pptx

Columbus Data & Analytics Wednesdays - June 2024

[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...

Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics

1. EFFECTIVE REFORMULATION OF QUERY FOR CODE SEARCH USING CROWDSOURCED KNOWLEDGE AND EXTRA-LARGE DATA ANALYTICS Masud Rahman, Chanchal K. Roy Department of Computer Science University of Saskatchewan, Canada International Conference on Software Maintenance and Evolution (ICSME 2018), Madrid, Spain

2. IDEAL SCENARIO OF CODE SEARCH 2 Convert image to gray scale without losing transparency

3. REAL LIFE SCENARIO: GOOGLE 3 QUERY MATTERS!

4. REAL LIFE SCENARIO: GITHUB SEARCH 4 NOT WORKING!!

5. SOLUTION: QUERY REFORMULATION 5 Convert image to gray scale without losing transparency 115 BufferedImage Grayscale ImageEdit ColorConvertOp File Transparency ColorSpace BufferedImageOp Graphics ImageEffects 02 Convert image to gray scale without losing transparency CONTRIBUTION

6. NLP2API: PROPOSED QUERY REFORMULATION FOR CODE SEARCH 6

7. PageRank TF-IDF STEPS OF NLP2API 7 BORDA Count: A>B if ∑rank(A) > ∑rank(B) Semantic Proximity: A>B if proximity(Q,A) > proximity(Q,B)

8. NLP2API: TWO PILLARS 8 NLP2API Developer Crowd Data Analytics

9. EXPERIMENT: EVALUATION SCENARIOS 9 NLP2API API Suggestion Query Reformulation

10. EXPERIMENT: DATASET COLLECTION 10 Java2s CodeJava 310 Queries & Ground truth 4K Code segments

11. RQ1: HOW DOES NLP2API PERFORM IN API CLASS SUGGESTION? 11 70% 50%

12. RQ2: CAN NLP2API OUTPERFORM THE STATE-OF-THE-ART? 12 Metric RACK, SANER 2016 NLP2API Improved(%) Hit@1 20.97% 41.94% *100% MRR@1 0.21 0.42 *100% MAP@1 20.97% 41.94% *100% Hit@5 64.19% 72.90% 14% MRR@5 0.37 0.54 *46% MAP@5 36.76% 50.56% *38%

13. RQ3: CAN REFORMULATED QUERIES OUTPERFORM BASELINE NL QUERIES? 13 30%

14. RQ4: CAN NLP2API OUTPERFORM THE STATE-OF- THE-ART IN QUERY REFORMULATION? 14 Method Improved Mean Q1 Q2 Q3 Min Max QECK 72 139 02 11 74 01 1,861 RACK 105 75 02 08 60 01 971 COCABU 113 191 02 14 103 01 2,607 Baseline 07 25 145 02 1,460 NLP2API *152 *172 *02 *10 *61 01 1,926 QE = Rank of the first relevant code example, Qi = i-th quartile of QE QE = Rank of the first relevant code example, Qi = i-th quartile of QE

15. RQ5: CAN NLP2API IMPROVE TRADITIONAL CODE SEARCH RESULTS? 15 Stage-I Stage-II GitHub

16. RQ5: CAN NLP2API IMPROVE TRADITIONAL CODE SEARCH RESULTS? 16

17. TAKE-HOME MESSAGES 17 NOT WORKING!! NLP2API API Suggestion Query Reformulation Code Search

18. THANK YOU !!! QUESTIONS? 18 Replication Package of NLP2API: http://www.usask.ca/~masud.rahman/nlp2api Contact: masud.rahman@usask.ca Masud Rahman (@masud2336)

Editor's Notes

Good morning, everyone. My name is Masud Rahman. I am a PhD Student from University of Saskatchewan, Canada. I work with Prof. Dr. Chanchal Roy. My research area is code search and query reformulation. Today, I am going to talk about a code search approach where we used query reformulation. And for query reformulation, we used data mining from Stack Overflow, and we also used large-scale data analytics with word embeddings.
First, we will see some scenarios. This is an ideal scenario for code search. If you provide a natural language query, and you would expect a code segment that solves your problem exactly. But this does not happen in practice.
In real life, you get a lot of search results. You have to analyze the results, and look for such code segments in those pages. If the query is good enough, you might get lucky and get the Hit very quickly. For example, Google is quite good at this. But it really depends on the query you choose.
Unfortunately, other search engines are failing to keep up with Google. For example, GitHub code search does not work with such natural language query. It does keyword matching, but that is not sufficient enough if the query is NOT good. In fact, several code search engines are disappearing from the web, such as Koders, GoogleCode, which is a bit strange. So, we try to improve basically the code search.
Now, how can we beat the status quo’ of code search? Well, one possible way is to improve the query through query reformulation. Since keyword search is a kind of universal idea, we cannot avoid it. So what we can do? We will improve the keyword search by providing more appropriate keywords. Now what are those? Well, source code is different from natural language texts. It has less vocabulary. So, we have to deal with it carefully. One possible way is to provide -- relevant API classes as the keywords for expansion. For example, when the baseline query returns correct the result at 115th position, the reformulated query returns that at the 2nd position.
So, here is our contribution: NLP2API == Natural Language Phrase to API. We translate a natural language query into relevant API classes for query reformulation and then we improve the code search in the process.
First we take a generic natural language query and submit to a search engine. It retrieves relevant questions and answers from Stack Overflow. We then mine the code segments posted in those threads using two term weighting methods – PageRank and TF-IDF. Thus, we get a list of candidate API classes from those threads that are used by millions of people. Now, the big question is, which candidates are the most appropriate for query at hand? Well, we proposed two metrics – Borda count and Semantic proximity. The essence of Borda count is -- If API A is more frequent than API B in the relevant Q & A threads from Stack Overflow, A is more appropriate than B. So, it’s a kind of likelihood of A over B for the target query. For the second metric, we preprocess Stack Overflow corpus, develop a Skip-gram model using FastText, an improved version of Word2Vec. Then we determine, how close an API is to the given query keywords within the semantic space. So, we A is more semantically close to query Q than B, then A is more appropriate than B for the query. So, we then combine these two metrics for each candidate API class, do the ranking, and return the Top-K classes as our reformulation terms.
So, we stand on the shoulder of two giants the massive developer crowd : We use their API relevance judgment through data mining. Large-scale data analytics: We determine the semantic proximity between keywords and candidate API class.
We evaluate our approach from two dimensions: API suggestion: We check our performance against ground truth whether we are doing it correctly. Otherwise, the rest part does not work. Query reformulation/code search: We check whether our reformulation actually improves the query or not in terms of code search performance.
For the API suggestion, we natural language queries from four tutorial sites such as KodeJava and others. We collect 300+ queries, we also collect the ground truth API classes from them. Then we try to determine our approach can suggest appropriate API classes for those queries by mining crowd knowledge from Stack Overflow. For the query reformulation part, we collect 4K code examples from GitHub, combine with our ground truth code segments from tutorial site. Then we determine whether our reformulated query actually works or not.
We answered five research questions in this paper. The first research question: How does our tool, NLP2API, perform in API class suggestion? We achieve 70%+ Top-5 accuracy with 50% precision which is pretty good for an automatic approach. That is, half of the suggested API classes are true positive, and the tool succeeds for 70% of the times. We also get a MRR of 0.55 which suggests that the first relevant API class generally appears between 1st to 2nd position, which is promising. We also see that two of the metrics – Borda and Semantic Proximity perform pretty well. But obviously, we combined them due to their orthogonal aspects of strength, and then achieved the highest performance.
The second research question compares our approach with the state-of-the-art. For Top-1, we see that our approach doubled the performance in all three metrics which is interesting. For Top-5 results, we see that NLP2API also improves over the state-of-the-art by 38% in precision and 46% in reciprocal rank. So, our approach is advancing the state-of-the-art which is highly expected.
In the third research question, we investigate whether our reformulation actually improves the baseline query or not. Well, it does! When the baseline natural language query is used, we achieved an accuracy of 50% However, when we keep adding the API classes suggested by our tool, we see performance improvement, which justifies our whole hypothesis. For example, we get around 65% accuracy when add 10-15 API classes which is a fairly descent performance improvement. We also get the same picture in the case of reciprocal rank. So, yes, the query reformulation works!
In the fourth research question, we compare our query reformulation performance with three other approaches from the literature. In particular, what we did, we determine query effectiveness. That is, the rank of the first correct result returned by a query. We collect such ranks for all queries, determine their quartiles, and then compare with other approaches. Here, we see that our reformulation improves 50% of the queries which is the highest obviously. However, these are the baseline quartiles, and these are our quartiles. Well, our reformulations improved the ranks, and is advancing the state-of-the-art.
In the fifth research question, we investigate whether our reformulated queries can improve the results of traditional code search engines. So, what we did, we collect results from Google, Stack Overflow and GitHub for the baseline queries first. Then manually analyze them, compare them with our goldset, and setup a baseline performance. This is step-I. In the second stage, we repeat the experiments with our reformulate queries. Then we compare the performance of these two steps.
We see that Google obviously performs better than the other two, which is pretty much expected. It achieves around 65% precision which is pretty good. However, our reformulated queries can make it even better to like 75%. So, yes, although, this approach is not designed for Google, rather code search engines like GitHub. it can significantly improve the precision of Google in the code search which is great. We also got significant performance improvement in terms of NDCG, another state-of-the-art ranking metric, which proves our hypothesis to be true. However, we faced some issues while comparing with Google, which is discussed in the paper 
So, these are the take-home messages. Code search engines are NOT working well. However, keyword search is a kind of universal idea. So, we tried to improve the keyword search by providing more appropriate keywords for code search. Our approach stands on the shoulder of two giants: (1) crowd generated knowledge, and (2) large-scale data analytics. We conducted experiments using 300+ queries, and answered 5 research questions. Our approach outperformed the state-of-the-art in API suggestion, query reformulation and code search.
We have a replication package publicly available. Its on GitHub. You can simply clone it, and use it for you work. Go ahead and develop the next best tool  Thanks for your time and attention. I am ready to have a few questions.

Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics

Recommended

Recommended

More Related Content

Similar to Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics

Similar to Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics (20)

More from Masud Rahman

More from Masud Rahman (20)

Recently uploaded

Recently uploaded (20)

Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics

Editor's Notes