SUPPORTING CODE SEARCH WITH
CONTEXT-AWARE, ANALYTICS-DRIVEN, EFFECTIVE
QUERY REFORMULATION
Masud Rahman, PhD Candidate
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal Roy
@masud2336
TALK OUTLINE
MasudRahman,PhDCandidate,UofS
Part 1: Research Problem
Part 2: PhD Thesis
Part 3: Q&A + Discussions
Part 1: Research Problem
MCAS: A SOFTWARE BUG THAT KILLS
Boeing 737 MAX 8
MCAS
THE SEARCH FOR THE BUGGY CODE
Boeing Customer → MCAS Bug report → Boeing Developer → Code search (Query Suggestion, Query Reformulation) → Boeing Codebase
SYSTEMATIC LITERATURE REVIEW
ACM DL
CrossRef
DBLP
Mendeley
Google Scholar
IEEE Xplore
ProQuest
ScienceDirect
SpringerLink
Web of Science
Wiley Online Lib
Selection steps: Initial results → Impurity removal → Filter by Title → Filter by Abstract → Merging & Duplicate removal → Filter by Full texts → Primary studies
Counts shown across the steps: 2871, 2317, 562, 195, 93, 56
Query reformulation, query expansion, query reduction, query formulation,
query refinement, automated query expansion, AQE, query suggestion,
query recommendation, term selection, query replacement, query difficulty,
query quality, keyword selection, keyword extraction, search term
identification, search query, search term, and search keyword.
I1: INAPPROPRIATE TERM WEIGHTING
$\text{TF-IDF}(t, d) = (1 + \log f_{t,d}) \times \log \frac{|D|}{n_t}$
• Different syntax
• Different semantics
• Different structures
RQ1: Can TF-IDF deliver appropriate search keywords
either from source code or from bug reports? If not, how
can we improve the keyword selection?
I2: LOW QUALITY OF BUG REPORTS
5000+
Poor, Noisy, Rich
RQ2: Can we deliver appropriate keywords for IR-based
bug localization (a.k.a. local code search)
by incorporating the bug report quality?
Traditional Practices
I3: WORDNET FOR SEMANTIC SIMILARITY
W1 ⇔ W2
RQ3: Can we deliver appropriate query keywords for
the code search using crowd knowledge (Stack
Overflow) and data analytics (FastText)?
Part 2: PhD Thesis
PHD THESIS OVERVIEW
Studies: S1 (SANER 2017), S2 (ASE 2017), S3 (ESEC/FSE 2018), S4, S5 (EMSE 2019), S6 (ICSME 2018)
Thesis: RQ1, RQ2, RQ3
Dimensions: Graph-based Term Weighting, Bug Report Quality, Crowd Knowledge, Data Analytics
TF-IDF: TERM IMPORTANCE (TRADITIONAL)
University of Saskatchewan
The Saskatchewan Huskies football team
represents the University of Saskatchewan
in U Sports football that competes in the
Canada West Universities Athletic
Association conference of U Sports. The
program has won the Vanier Cup national
championship three times, in 1990, 1996
and 1998.
The Saskatchewan Huskies
became only the second U Sports team to
advance to three consecutive Vanier Cup
games, after the Saint Mary's Huskies, but
lost all three games from 2004-2006. The
team has won the most Hardy Trophy
titles in Canada West, having won a total
of 20 times. The 2006 Saskatchewan
Huskies became only the third team to
play in a Vanier Cup that their school was
hosting, when the University of
Saskatchewan hosted the 42nd Vanier
Cup. The Toronto Varsity Blues were the
first when they won two Vanier Cups in
1965 and 1993. Saskatchewan also
became the first western school to host
the national championship game.
Term | TF | IDF | TF x IDF
Saskatchewan | 6 | 0.01 | 0.06
Vanier | 5 | 0.1 | 0.5
Won | 4 | 0.1 | 0.4
Huskies | 4 | 0.1 | 0.4
Cup | 4 | 0.01 | 0.04
Team | 4 | 0.01 | 0.04
Sports | 3 | 0.02 | 0.06
Times | 2 | 0.03 | 0.06
School | 2 | 0.05 | 0.1
Championship | 2 | 0.03 | 0.06
IDF = log (N / DF)
Saskatchewan Huskies
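As a rough illustration of the weighting on this slide, the sketch below computes TF x IDF for a toy corpus. The corpus, the function name `tf_idf`, the whitespace tokenizer, and the use of raw term counts with a natural-log IDF = log(N / DF) are all assumptions made for illustration; they are not the setup behind the numbers shown above.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using raw TF and IDF = log(N / DF)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # DF: number of documents that contain the term at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus (not from the talk)
docs = [
    "huskies won the vanier cup",
    "the team won the game",
    "the school hosted the cup",
]
scores = tf_idf(docs)
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
print(top[:2])
```

A term that occurs in every document (here "the") gets a weight of zero, while terms unique to one document rise to the top. Each term is scored independently of its neighbours.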
TEXTRANK: TERM IMPORTANCE USING CO-OCCURRENCES (MIHALCEA ET AL., EMNLP 2004)
IResource … IJavaElement
(Term Co-occurrence)
POSRANK: TERM IMPORTANCE USING SYNTACTIC
DEPENDENCE (BLANCO & LIOMA, INF. RETR. 2012)
Noun Verb Adjective
Element …reported, element …plain
Jespersen Rank Theory
(Syntactic Dependence)
S1: QUERY KEYWORD SELECTION WITH
PAGERANK (BRIN & PAGE, 1998)
$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad (0 \le \phi \le 1)$
• Element
• Iresource
• Provider
• Level
• Tree
Candidate Query 1
Candidate Query 2
Sergey Brin, Larry Page: PageRank Algorithm
Best Query
RQ1: Keywords selected by PageRank are more
effective for local code searches (e.g., IR-based bug
localization) than those selected by TF-IDF.
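The PageRank-style scoring on this slide can be sketched over a term co-occurrence graph as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the window size, the damping value of 0.85, the fixed iteration count, and the token sequence are all hypothetical. The graph is undirected here, so a term's incoming and outgoing links are the same set of neighbours.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link each term to the terms that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return graph


def rank_terms(graph, damping=0.85, iterations=50):
    """Iterate S(v) = (1 - damping) + damping * sum(S(u) / degree(u)) over neighbours u."""
    scores = {v: 1.0 for v in graph}
    for _ in range(iterations):
        scores = {
            v: (1 - damping)
            + damping * sum(scores[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
    return scores


# Hypothetical token sequence standing in for a preprocessed bug report
tokens = "custom search results view element iresource provider search results tree".split()
scores = rank_terms(cooccurrence_graph(tokens))
print(sorted(scores, key=scores.get, reverse=True)[:3])
```

Unlike TF-IDF, a term's score here depends on the scores of the terms it co-occurs with, which is the term dependence the slide contrasts with term independence.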
S3: QUALITY-AWARE SEARCH QUERIES
Noisy, Poor, Rich
Equality vs. Equity
RQ2: Incorporating bug report quality into the query
construction process significantly improves the
performance of the queries in code search.
Semantic
Hyperspace
S4: QUERY REFORMULATION WITH CROWD
KNOWLEDGE & DATA ANALYTICS
Stack Overflow
(Crowd Knowledge)
Data
preprocessing
Neural Text classifier
FastText model
(skip-gram)
SEMANTIC HYPERSPACE
Word 1: P(1, 5, 6, 7, …, N)
Word 2: P(2, 4, 6, 9, …, N)
CLUSTERING TENDENCY WITH DATA ANALYTICS
Plotted words: channel, join, spam, entered, connect, invitation, message, room, chat, handle, mask, remote, synd, admin
Plot labels: Q, C1, C2
• Hopkins Statistic (HS)
• Polygon Area (PA)
C1 is better than C2
RQ3: Appropriate query keywords can be delivered for the
code search using Stack Overflow and FastText.
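To make the clustering-tendency idea concrete, here is a minimal sketch of one common form of the Hopkins statistic. Which variant the thesis uses is not stated on the slide, so the formula, the function name `hopkins`, the sample size, and the 2-D toy points (real word embeddings have many more dimensions) are assumptions for illustration only.

```python
import math
import random


def hopkins(points, sample_size, rng):
    """Basic Hopkins statistic: sum(u) / (sum(u) + sum(w)).

    u: distance from a uniformly random location to its nearest data point.
    w: distance from a sampled data point to its nearest other data point.
    Values near 1 suggest clustered data; values near 0.5 suggest no structure.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def nearest(q, skip=None):
        return min(math.dist(q, p) for i, p in enumerate(points) if i != skip)

    u_total = sum(
        nearest([rng.uniform(lows[d], highs[d]) for d in range(dims)])
        for _ in range(sample_size)
    )
    w_total = sum(
        nearest(points[i], skip=i)
        for i in rng.sample(range(len(points)), sample_size)
    )
    return u_total / (u_total + w_total)


# Hypothetical data: two tight, well-separated groups of 2-D points
rng = random.Random(0)
cluster_a = [(rng.gauss(0, 0.05), rng.gauss(0, 0.05)) for _ in range(20)]
cluster_b = [(rng.gauss(5, 0.05), rng.gauss(5, 0.05)) for _ in range(20)]
print(round(hopkins(cluster_a + cluster_b, sample_size=10, rng=rng), 2))
```

Under this reading, a candidate keyword set whose vectors score closer to 1 clusters more tightly, which is one way a claim such as "C1 is better than C2" could be quantified.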
EVALUATION METHODOLOGY
Evaluation Paradigms
IR-Based Bug Localization
Query Reformulation
1. Hit@K
2. MAP@K
3. MRR@K
Query Effectiveness (QE)
5K+ 8
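The metrics named on this slide can be sketched per query as below; MAP@K and MRR@K are then the means of the per-query values over all queries. The function names and file names are hypothetical. Two definitions are assumptions: average precision is normalised by the number of relevant hits within the top K (other normalisations exist), and QE is taken to be the rank of the first relevant result, in line with the "Lower QE is better" note elsewhere in the deck.

```python
def hit_at_k(ranked, relevant, k):
    """1 if any relevant item appears in the top k results, else 0."""
    return int(any(item in relevant for item in ranked[:k]))


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked, relevant, k):
    """Mean of the precision values at each rank (<= k) holding a relevant item."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def query_effectiveness(ranked, relevant):
    """Rank of the first relevant item (lower is better); None if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return rank
    return None


# Hypothetical result list for one query and its ground-truth buggy files
ranked = ["A.java", "B.java", "C.java", "D.java"]
relevant = {"B.java", "D.java"}
print(hit_at_k(ranked, relevant, 1), hit_at_k(ranked, relevant, 3))
print(reciprocal_rank_at_k(ranked, relevant, 3))
print(average_precision_at_k(ranked, relevant, 4))
print(query_effectiveness(ranked, relevant))
```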
CROWD KNOWLEDGE & DATA ANALYTICS FOR QUERY
EXPANSION
Convert image to gray scale without losing transparency
BufferedImage Grayscale ImageEdit ColorConvertOp File
Transparency ColorSpace BufferedImageOp Graphics ImageEffects
RQ3: Appropriate query keywords can be delivered for the
code search using Stack Overflow and FastText.
TAKE-HOME MESSAGES
Traditional | Proposed
Term Independence (TF-IDF) | Term Dependence (PageRank)
Reliance on Auxiliary Resources (e.g., history mining) | Efficient Use of Primary Resource (e.g., Bug Reports)
Bug Report Quality (Overlooked) | Reporting Quality-Aware Bug Localization
Thesaurus-Based Similar Keyword Suggestion | Crowdsourced Knowledge & Large Data Analytics
Cosine Similarity for Semantic Distance | Semantic Hyperspace & Clustering Tendency
http://www.usask.ca/~masud.rahman
https://github.com/masud-technope
Contact: masud.rahman@usask.ca
@masud2336
Masud Rahman
Part III: Q & A
TAKE-HOME MESSAGES
Thesis
RQ1: TF-IDF → PageRank
RQ2: Equality → Equity
RQ3: WordNet → Stack Overflow, FastText
SEMANTIC HYPERSPACE
x: P(1, 5, 6, 7, …, N)
y: P(2, 4, 6, 9, …, N)
y = mx + c
x^2 + y^2 = r^2
ax^2 + bx + c = 0
TWO WORKING CONTEXTS: LOCAL & GLOBAL
Local code search (e.g., bug localization): Boeing codebase
Internet-scale code search: GitHub
S2: KEYWORDS SELECTION FROM SOURCE
CODE WITH CODERANK
resolveRuntimeClasspathEntry
Resolve Runtime Classpath Entry
$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad (0 \le \phi \le 1)$
RQ1 [Source Code]: Keywords selected by PageRank
are more effective for local code searches (e.g., concept
location) than those selected by TF-IDF.
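The identifier split shown on this slide (resolveRuntimeClasspathEntry into Resolve Runtime Classpath Entry) can be sketched with a regular expression. The function name `split_identifier` and the exact pattern are assumptions for illustration; the slide does not say how the splitting is actually done.

```python
import re


def split_identifier(identifier):
    """Split a camelCase or PascalCase identifier into capitalised words."""
    # Alternatives: an acronym run, a (possibly capitalised) lowercase word, or digits
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.capitalize() if p.islower() else p for p in parts]


print(" ".join(split_identifier("resolveRuntimeClasspathEntry")))
# -> Resolve Runtime Classpath Entry
```

The resulting words can then serve as nodes of a term graph to be ranked.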
HOW DID WE DO?
RQ3: Appropriate query keywords can be delivered for the
code search using Stack Overflow and FastText.
R3: SOLVE VOCABULARY MISMATCH ISSUE
Customer
Developer
Past Developer
Bug Report
Codebase
SOLUTION: SEMANTIC HYPERSPACE
Word 1: P(1, 5, 6, 7, …, N)
Word 2: P(2, 4, 6, 9, …, N)
Cosine distance = Semantic relevance
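As a small worked example of the cosine measure named on this slide, the sketch below compares two short vectors. The 4-dimensional vectors are hypothetical stand-ins built from the first coordinates shown above; real FastText embeddings are much longer. Cosine similarity is computed first, and cosine distance is taken as 1 minus that similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional word vectors
word1 = [1.0, 5.0, 6.0, 7.0]
word2 = [2.0, 4.0, 6.0, 9.0]
similarity = cosine_similarity(word1, word2)
print(round(similarity, 3), round(1 - similarity, 3))  # similarity, cosine distance
```

A higher similarity (equivalently, a smaller distance) is read as stronger semantic relevance between the two words.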
R4: GENETIC ALGORITHM FOR QUERIES
Method | Search Query | QE
Baseline | {title + description} | 25
STRICT[140] | {tab classpath enabled buttons user entry} | 86
TF-IDF | {button entry bootstrap enabled incorrectly moving} | 177
GA | {open reflect tab bottom entry classpath} | 01
Title
Description
Lower QE is better
SEARCH QUERY FROM NOISY BUG REPORT
Bug 31637 – should be able to cast null
NullPointerException
Ci Cj Mk Mn Cp
53 01
DICE, ROCCHIO, RSV
VOCABULARY MISMATCH PROBLEM
Both are correct and wrong!
Boeing Customer
Boeing Developer
KEYWORDS FROM A BUG REPORT
Title
Description
ID | Query | QE
1 | Custom search results view iresource | 1331
2 | Custom search results search results view | 636
3 | element iresource provider level tree | 01
4 | Custom search results hierarchically java search results | 570
Lower QE is better
PROBABILISTIC TERM WEIGHTING
KLD

Doctoral Symposium of Masud Rahman

  • 1. SUPPORTING CODE SEARCH WITH CONTEXT-AWARE, ANALYTICS-DRIVEN, EFFECTIVE QUERY REFORMULATION Masud Rahman, PhD Candidate Department of Computer Science University of Saskatchewan, Canada Advisor: Dr. Chanchal Roy @masud233 6
  • 2. TALK OUTLINE MasudRahman,PhDCandidate,UofS Part 2: PhD Thesis Part 1: Research Problem Part 3: Q&A + Discussions 2
  • 4. MCAS: A SOFTWARE BUG THAT KILLS MasudRahman,PhDCandidate,UofS P1 P2 P3 Boeing 737 MAX 8 4 MCAS
  • 5. THE SEARCH FOR THE BUGGY CODE MasudRahman,PhDCandidate,UofS Boeing Customer MCAS Bug report Boeing Developer Code search Query Suggestion Query Reformulation Boeing Codebase P1 P2 P3 5
  • 6. SYSTEMATIC LITERATURE REVIEW MasudRahman,PhDCandidate,UofS ACM DL CrossRef DBLP Mendeley Google Scholar IEEE Xplore ProQuest ScienceDirect SpringerLink Web of Science Wiley Online Lib 2871 2317 562 Initial results Impurity removal Filter by Title 195 Filter by Abstract 93 Merging & Duplicate removal 56 Primary studies P1 P2 P3 Filter by Full texts Query reformulation, query expansion, query reduction, query formulation, query refinement, automated query expansion, AQE, query suggestion, query recommendation, term selection, query replacement, query difficulty, query quality, keyword selection, keyword extraction, search term identification, search query, search term, and search keyword. 6 3
  • 7. I1: INAPPROPRIATE TERM WEIGHTING   RFDd t t n D dftIDFTF log)),log(1()( • Different syntax • Different semantics • Different structures P1 P2 P3 MasudRahman,PhDCandidate,UofS 7 RQ1: Can TF-IDF deliver appropriate search keywords either from source code or from bug reports? If not, how can we improve the keyword selection?
  • 8. I2: LOW QUALITY OF BUG REPORTS 5000+ bug reports: Poor, Noisy, Rich. Traditional Practices. RQ2: Can we deliver appropriate keywords for IR-based bug localization (a.k.a. local code search) by incorporating the bug report quality?
  • 9. I3: WORDNET FOR SEMANTIC SIMILARITY W1, W2 RQ3: Can we deliver appropriate query keywords for the code search using crowd knowledge (Stack Overflow) and data analytics (FastText)?
  • 11. PHD THESIS OVERVIEW S1 (SANER 2017), S2 (ASE 2017), S3 (ESEC/FSE 2018), S6 (ICSME 2018), S5 (EMSE 2019), S4. Thesis: RQ1, RQ2, RQ3. Graph-based Term Weighting; Bug Report Quality Dimension; Crowd Knowledge; Data Analytics
  • 12. TF-IDF: TERM IMPORTANCE (TRADITIONAL) Saskatchewan Huskies, University of Saskatchewan. The Saskatchewan Huskies football team represents the University of Saskatchewan in U Sports football that competes in the Canada West Universities Athletic Association conference of U Sports. The program has won the Vanier Cup national championship three times, in 1990, 1996 and 1998. The Saskatchewan Huskies became only the second U Sports team to advance to three consecutive Vanier Cup games, after the Saint Mary's Huskies, but lost all three games from 2004-2006. The team has won the most Hardy Trophy titles in Canada West, having won a total of 20 times. The 2006 Saskatchewan Huskies became only the third team to play in a Vanier Cup that their school was hosting, when the University of Saskatchewan hosted the 42nd Vanier Cup. The Toronto Varsity Blues were the first when they won two Vanier Cups in 1965 and 1993. Saskatchewan also became the first western school to host the national championship game. TF: Saskatchewan: 6, Vanier: 5, Won: 4, Huskies: 4, Cup: 4, Team: 4, Sports: 3, Times: 2, School: 2, Championship: 2. IDF: Saskatchewan: 0.01, Vanier: 0.1, Won: 0.1, Huskies: 0.1, Cup: 0.01, Team: 0.01, Sports: 0.02, Times: 0.03, School: 0.05, Championship: 0.03. TF x IDF: Vanier: 0.5, Won: 0.4, Huskies: 0.4, School: 0.1, Saskatchewan: 0.06, Championship: 0.06, Sports: 0.06, Times: 0.06, Cup: 0.04, Team: 0.04. IDF = log (N / DF)
  • 13. TEXTRANK: TERM IMPORTANCE USING CO-OCCURRENCES (MIHALCEA ET AL, EMNLP 2004) IResource … IJavaElement (Term Co-occurrence)
  • 14. POSRANK: TERM IMPORTANCE USING SYNTACTIC DEPENDENCE (BLANCO & LIOMA, INF. RETR. 2012) Noun, Verb, Adjective. Element …reported, element …plain. Jespersen Rank Theory (Syntactic Dependence)
  • 15. S1: QUERY KEYWORD SELECTION WITH PAGERANK (BRIN & PAGE, 1998) $S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad 0 \le \phi \le 1$ • Element • Iresource • Provider • Level • Tree. Candidate Query 1, Candidate Query 2, Best Query. PageRank Algorithm: Sergey Brin, Larry Page. RQ1: Keywords selected by PageRank are more effective for local code searches (e.g., IR-based bug localization) than those selected by TF-IDF
  • 16. S3: QUALITY-AWARE SEARCH QUERIES Noisy, Poor, Rich. Equality vs. Equity. RQ2: Incorporation of bug report quality into the query construction process significantly improves the performance of the queries in the code search.
  • 17. S4: QUERY REFORMULATION WITH CROWD KNOWLEDGE & DATA ANALYTICS Stack Overflow (Crowd Knowledge), Data preprocessing, Neural Text classifier, FastText model (skip-gram), Semantic Hyperspace
  • 18. SEMANTIC HYPERSPACE Word 1 P (1, 5, 6, 7, ….., N) Word 2 P (2, 4, 6, 9, ….., N)
  • 19. CLUSTERING TENDENCY WITH DATA ANALYTICS channel join spam entered connect invitation message room chat handle mask remote synd admin. Q, C1, C2 • Hopkins Statistic (HS) • Polygon Area (PA). C1 is better than C2. RQ3: Appropriate query keywords can be delivered for the code search using Stack Overflow and FastText.
  • 20. EVALUATION METHODOLOGY Evaluation Paradigms: IR-Based Bug Localization, Query Reformulation. 1. Hit@K 2. MAP@K 3. MRR@K, Query Effectiveness (QE). 5K+ 8
  • 21. CROWD KNOWLEDGE & DATA ANALYTICS FOR QUERY EXPANSION Convert image to gray scale without losing transparency: BufferedImage, Grayscale, ImageEdit, ColorConvertOp, File, Transparency, ColorSpace, BufferedImageOp, Graphics, ImageEffects. RQ3: Appropriate query keywords can be delivered for the code search using Stack Overflow and FastText.
  • 22. TAKE-HOME MESSAGES Traditional vs. Proposed: Term Independence (TF-IDF) vs. Term Dependence (PageRank); Reliance on Auxiliary Resources (e.g., history mining) vs. Efficient Use of Primary Resource (e.g., Bug Reports); Bug Report Quality (Overlooked) vs. Reporting Quality-Aware Bug Localization; Thesaurus-Based Similar Keyword Suggestion vs. Crowdsourced Knowledge & Large Data Analytics; Cosine Similarity for Semantic Distance vs. Semantic Hyperspace & Clustering Tendency
  • 25. SEMANTIC HYPERSPACE x P (1, 5, 6, 7, ….., N) y P (2, 4, 6, 9, ….., N). y = mx + c, x^2 + y^2 = r^2, ax^2 + bx + c = 0
  • 26. TWO WORKING CONTEXTS: LOCAL & GLOBAL Local code search (e.g., bug localization): Boeing codebase. Internet-scale code search: GitHub
  • 27. S2: KEYWORDS SELECTION FROM SOURCE CODE WITH CODERANK resolveRuntimeClasspathEntry: Resolve, Runtime, Classpath, Entry. $S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad 0 \le \phi \le 1$ RQ1 [Source Code]: Keywords selected by PageRank are more effective for local code searches (e.g., concept location) than those selected by TF-IDF
  • 28. HOW DID WE DO? RQ3: Appropriate query keywords can be delivered for the code search using Stack Overflow and FastText.
  • 29. R3: SOLVE VOCABULARY MISMATCH ISSUE Customer, Developer, Past Developer, Bug Report, Codebase
  • 30. SOLUTION: SEMANTIC HYPERSPACE Word 1 P (1, 5, 6, 7, ….., N) Word 2 P (2, 4, 6, 9, ….., N). Cosine distance = Semantic relevance
  • 31. R4: GENETIC ALGORITHM FOR QUERIES Title, Description. Method, Search Query, QE: Baseline, {title + description}, 25; STRICT[140], {tab classpath enabled buttons user entry}, 86; TF-IDF, {button entry bootstrap enabled incorrectly moving}, 177; GA, {open reflect tab bottom entry classpath}, 01. Lower QE is better
  • 32. SEARCH QUERY FROM NOISY BUG REPORT Bug 31637 – should be able to cast null. NullPointerException. Ci Cj Mk Mn Cp. 53, 01
  • 34. VOCABULARY MISMATCH PROBLEM Both are correct and wrong! Boeing Customer, Boeing Developer
  • 35. KEYWORDS FROM A BUG REPORT Title, Description. ID, Query, QE: 1, Custom search results view iresource, 1331; 2, Custom search results search results view, 636; 3, element iresource provider level tree, 01; 4, Custom search results hierarchically java search results, 570. Lower QE is better
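The graph-based keyword selection on slides 13 to 15 can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: it builds an undirected term co-occurrence graph and iterates the PageRank-style score S(v_i) = (1 - phi) + phi * sum over neighbours of S(v_j) / |Out(v_j)|. The function names, the window size, the damping value phi = 0.85, the iteration cap, and the tolerance are all assumptions chosen for the example.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Connect each term to the terms that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != term:
                # Undirected edge: both terms vote for each other.
                graph[term].add(other)
                graph[other].add(term)
    return graph


def rank_terms(graph, phi=0.85, max_iterations=100, tolerance=1e-4):
    """Return terms sorted by PageRank-style score, highest first.

    phi, max_iterations and tolerance are illustrative values (assumptions).
    """
    if not graph:
        return []
    scores = {term: 1.0 for term in graph}
    for _ in range(max_iterations):
        updated = {}
        for term, neighbours in graph.items():
            # Each neighbour splits its current score among its own edges.
            votes = sum(scores[n] / len(graph[n]) for n in neighbours)
            updated[term] = (1 - phi) + phi * votes
        change = max(abs(updated[t] - scores[t]) for t in graph)
        scores = updated
        if change < tolerance:
            break
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, already-preprocessed bug report tokens.
report = "custom search results view element iresource provider level tree element".split()
print(rank_terms(build_cooccurrence_graph(report))[:5])
```

A term's score grows with the number of its neighbours and with their scores, which is the recursive "voting" behaviour the PageRank slide describes. A POS-based graph, as on slide 14, could be ranked by the same function if only the edge construction were swapped out.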

Editor's Notes

  1. Hello everyone! Good afternoon! My name is Masud Rahman. I am a PhD Candidate at the University of Saskatchewan, Canada. I work with Dr. Chanchal K. Roy. Today, I will be talking about automated query reformulations for code search.
  2. Today, my talk will be divided into three sections. In the first section, I will discuss the research problem I am trying to solve in my PhD. In the second section, I will discuss my PhD thesis, which addresses the research problem. Finally, we will have a Q&A session and interesting discussions.
  3. Part 1: Research Problem
  4. You are looking at two aircraft, one from Ethiopian Airlines and one from Lion Air of Indonesia. This is called the nose-down situation. Due to these nose-down situations, we had two fatal crashes within five months of each other. These crashes took 346 precious human lives and cost billions of dollars. Now, the culprit is MCAS. This is a software component that was added to the Boeing 737 MAX 8. The bottom-line conclusion is that this is a faulty component, not well designed, which ultimately led to the crashes. That is why Boeing 737 MAX planes are grounded right now.
  5. Now, let's say a Boeing customer has submitted a bug report. A Boeing developer is then responsible for locating and repairing the faulty code that triggers that bug. As a common practice, the developer chooses a few important keywords and attempts to locate the buggy code within the Boeing codebase. But the study shows that 88% of the keywords chosen by the developer could be incorrect. That is, they do not return the buggy code. So, the obvious next step is to reformulate the query through automated tool support, so that the buggy code can be located. There are also tools that take a bug report and suggest appropriate search queries in the first place. So, we are interested in this part of the process, and my PhD focuses on it.
  6. So, what did we do? We conducted a systematic literature review of 56 primary studies on query reformulation for code search. During this study, we found 3 major issues in the literature.
  7. Now, this is a metric that has been in play since the last century. It was proposed in the 70s. It is a good metric, but it was actually proposed for regular texts such as news articles. On the other hand, we are dealing with source code here. Now, regular texts and source code have different semantics and different structures. They are not the same. So, metrics for regular texts are not appropriate for source code; this is our hypothesis. So, here is our first research question: how does TF-IDF perform? If it does not perform well, can we propose something new?
  8. We did an empirical study with 5K+ bug reports in our ICSE poster. And we discovered that bug reports can be very different in terms of quality. There are different types of bug reports. A report could be noisy, with stack traces (16%). It could be really poor, containing no structured entities (30%). Or it could be a rich bug report that includes source code, test cases and other items (54%). Now, what do the existing studies do? They treat all these different types of bug reports the same. So, in their approach, not everybody gets a chance to watch the game. So, here is our second research question: can we incorporate reporting quality into bug localization and deliver better queries?
  9. Identifying similar words is very important during query reformulation. We found that WordNet has been extensively used in the literature for finding similar words. Now, it is good for regular texts. But again, we are dealing with source code here. Evidence suggests that WordNet might not work well for source code. However, those were the old days. Now we have Stack Overflow and advanced tools like FastText for semantic similarity calculation. So, here is our third RQ: can we deliver appropriate keywords during code search using Stack Overflow and FastText?
  10. Now, we are done with the background concepts, Part 1. Next, we are going into Part 2, the PhD thesis.
  11. So, this is our thesis statement. We hypothesize that we can improve query reformulation using (1) graph-based term weighting rather than TF-IDF, (2) bug report quality and document contexts, and (3) crowdsourced knowledge, i.e., Stack Overflow, together with data analytics such as word embeddings from FastText. So, to evaluate these hypotheses, we conduct six studies in the PhD. The first and second studies address RQ1, the third study addresses RQ2, and the rest answer RQ3.
  12. Similarly, we can see the phrases and dependencies among the terms in the bug report texts as well. Our job is to identify the keywords from these texts, right? So, what did we do? We consider the co-occurrences among the terms. That is, how terms occur with other terms within a certain context. We encode such co-occurrences as edges, and transform the texts into a graph like this.
  13. Besides term co-occurrences, we consider another aspect called syntactic dependencies. For this, we used Jespersen's Rank Theory, a theory developed back in 1925. According to this theory, the parts of speech of a sentence can be divided into three ranks: nouns in the first rank, verbs and adjectives in the second rank, and the rest in the third rank. According to Jespersen, verbs and adjectives modify nouns. That is, there are syntactic dependencies between "element" and "reported", and between "element" and "plain", that convey the overall meaning of the sentence. Now, we capture such syntactic dependencies as well, and transform the report texts into a POS graph.
  14. So, we have created two graphs, right? They are developed from the bug report based on two different dimensions: word co-occurrence and syntactic dependence. Once we have the graphs, we apply the famous PageRank algorithm. This is the backbone of Google search. Now, the algorithmic details are a bit complex, but I will try to provide an overview here. Why do you think this guy is laughing? Because he is getting the most votes. Similarly, in the graph, the node that is connected to the most nodes is the winner. That is, a term's importance is determined by its connectivity with other nodes. More importantly, since this is a recursive algorithm, the importance depends on the weights of the connected nodes as well. Once the computation is done, we get a reformulation candidate from each graph. What is a reformulation candidate? It is a ranked list of keywords like this. So, we collect two candidates from the two graphs, apply machine learning, and suggest the best one as our query for the bug report.
  15. So, first we take a bug report as input. Then we apply regular expressions to identify the structured components. We then classify the report as a noisy report containing stack traces, a poor bug report containing only regular texts, or a rich bug report containing source code and texts. Once the quality level is identified, what's the next step? Well, unlike the earlier studies, we reformulate the query according to that quality level. We separate the signal from the noise in a noisy report, and we supplement a poor bug report with appropriate keywords. We mostly keep a rich bug report as is. So, that is the equity approach.
  16. First, we construct a semantic hyperspace using a Stack Overflow corpus. What is a hyperspace? If a space has more than 3 dimensions, we call it a hyperspace. How do we do it? First, we use a Stack Overflow data dump that contains software-specific texts. Our corpus contains about 2.1 million questions and answers. We do pre-processing and feed the contents to FastText. Now, FastText generates a three-layer neural network model. This model essentially represents the whole vocabulary like this in a hyperspace. Now, how does it help?
  17. Here we see that burger is close to sandwich. Why? Because they are eaten together? I do that all the time. Well, that is not the reason. They are mentioned in similar contexts by people across the whole corpus. The model recognizes such occurrences and thus puts burger and sandwich close together. Similarly, dumpling and ramen are close to each other. Now, we propose this. This is the original query, and this is the reformulated query. A good reformulated query will cluster together with the original query. A bad reformulated query will NOT be able to cluster with the original query. So, clustering tendency within the hyperspace is our weapon here. We calculate the Hopkins statistic and the Polygon Area to measure the clustering tendency.
  18. Since we used query reformulation in the context of bug localization, we performed the evaluation in two different dimensions: bug localization and query reformulation. Our approach contributes to both dimensions. We use four standard metrics: Hit@K, mean average precision, mean reciprocal rank and Query Effectiveness. We answer 7 research questions in our work.
  19. Now, I am not going to discuss those studies in detail. But here is a glimpse. Developers generally look for relevant code on the web using natural language queries. Please note that we are not talking about simple web search, but rather about source code repositories such as GitHub. Now, GitHub provides this result. You see, it tries to match the query keywords with comments and identifiers. But we are dealing with source code, right? So, we need a source-code-friendly query for a better result. So, we identify relevant API classes for this natural language query through extensive data mining and data analytics. And once again, Stack Overflow is our friend in this grand challenge.
  20. OK! Now we are done with the literature survey. Now, we will focus on the third part, the future research opportunities.
  22. Now let's expand and generalize the problem a bit. So far, we have discussed code search within a local codebase. The search could also be in a large-scale open source repository such as GitHub. Now, based on these contexts, there are different challenges in query reformulation. The local codebase is small, domain specific and organized. On the contrary, GitHub is huge, cross-domain and very noisy. So, yes, they need different strategies for suggesting queries.
  23. Now, once such items are extracted, we split them. As we see, these single terms share some kind of semantics to convey a broader meaning. That is, they complement each other in this context. Now, we capture such semantic dependencies in the source code, and develop a term graph like this.
  24. That is, each of the three people (the customer, the past developer and Joe) has their own vocabulary to describe a certain problem or concept. In fact, the probability that any two people will describe the same problem with the same vocabulary is only 15%-20%. So, naturally, developer Joe finds it a great challenge to make a connection between the bug report and the buggy code. This costs development time, money and valuable effort.
  26. So, from a noisy report, we extract the report title, the encountered exception, and the most important keywords from the stack traces. Then we do the search with this newly constructed query. For example, the baseline noisy query returns the correct result at the 53rd position, whereas our query returns it at the topmost position.
  27. Now the question is, why is this so challenging? The answer is the vocabulary mismatch problem. In fact, this is a common problem for any type of document search. Here we see both guys are looking at the same object, but they are explaining it differently. That is, they are both correct from their own perspective, but wrong from the other guy's perspective. This actually happens with bug reports as well. The probability that a customer and a developer will explain the same problem using the same terminology is only 15%. That is why selecting appropriate keywords from the bug report is very challenging.
  28. Let us see an example. This is a bug report; this is the title and this is the description. Now, developer Joe would use this bug report to localize the bug in the source code. He chose some ad hoc queries. Which one do you think is the best here? PAUSE! Well, let's see. This one returns the correct result at this position. That means the developer needs to check 1300+ results before reaching the correct result. Then he tries this query. … Oh, this one is the best. So, selecting appropriate keywords from the bug report is not that simple.
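The rank-based measures mentioned in these notes (Query Effectiveness, Hit@K, reciprocal rank) are simple to state in code. The sketch below is only an illustration under stated assumptions: QE is taken to be the 1-based rank of the first correct result, which matches the "lower QE is better" reading on the slides, and the function names and file names are hypothetical.

```python
def query_effectiveness(ranked, relevant):
    """1-based rank of the first relevant result, or None if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return position
    return None


def hit_at_k(ranked, relevant, k=10):
    """True if at least one relevant result appears in the top k."""
    qe = query_effectiveness(ranked, relevant)
    return qe is not None and qe <= k


def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1 / rank of the first relevant result within the top k, else 0."""
    qe = query_effectiveness(ranked, relevant)
    return 1.0 / qe if qe is not None and qe <= k else 0.0


def mean_reciprocal_rank_at_k(runs, k=10):
    """Average reciprocal rank over (ranked, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical search results for one query; C.java is the buggy file.
results = ["A.java", "B.java", "C.java"]
buggy = {"C.java"}
print(query_effectiveness(results, buggy), hit_at_k(results, buggy, k=2))
```

Under this reading, a reformulated query improves on the baseline when its QE is lower, for instance when the first buggy file moves from position 53 to position 1.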