Doctoral Symposium of Masud Rahman

SUPPORTING CODE SEARCH WITH
CONTEXT-AWARE, ANALYTICS-DRIVEN, EFFECTIVE
QUERY REFORMULATION
Masud Rahman, PhD Candidate
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal Roy
@masud2336

TALK OUTLINE
Masud Rahman, PhD Candidate, UofS
Part 1: Research Problem
Part 2: PhD Thesis
Part 3: Q&A + Discussions

Part 1: Research Problem

MCAS: A SOFTWARE BUG THAT KILLS
P1 P2 P3
Boeing 737 MAX 8
4
MCAS

THE SEARCH FOR THE BUGGY CODE
Boeing
Customer
MCAS Bug report
Boeing Developer Code search
Query Suggestion Query Reformulation
Boeing Codebase

SYSTEMATIC LITERATURE REVIEW
Sources: ACM DL, CrossRef, DBLP, Mendeley, Google Scholar, IEEE Xplore, ProQuest, ScienceDirect, SpringerLink, Web of Science, Wiley Online Lib
Selection: Initial results (2871) -> Impurity removal (2317) -> Filter by Title (562) -> Filter by Abstract (195) -> Merging & Duplicate removal (93) -> Filter by Full texts -> Primary studies (56)
Search keywords:
Query reformulation, query expansion, query reduction, query formulation,
query refinement, automated query expansion, AQE, query suggestion,
query recommendation, term selection, query replacement, query difficulty,
query quality, keyword selection, keyword extraction, search term
identification, search query, search term, and search keyword.

I1: INAPPROPRIATE TERM WEIGHTING


$\text{TF-IDF}(t, d) = (1 + \log f_{t,d}) \times \log\left(\frac{D}{n_t}\right)$
• Different syntax
• Different semantics
• Different structures
RQ1: Can TF-IDF deliver appropriate search keywords
either from source code or from bug reports? If not, how
can we improve the keyword selection?

I2: LOW QUALITY OF BUG REPORTS
8
5000+
P1 P2 P3
Poor Noisy Rich
RQ2: Can we deliver appropriate keywords for IR-
based bug localization (a.k.a., local code search)
by incorporating the bug report quality?
Traditional Practices

I3: WORDNET FOR SEMANTIC SIMILARITY
9
P1 P2 P3
W1  W2
RQ3: Can we deliver appropriate query keywords for
the code search using crowd knowledge (Stack
Overflow) and data analytics (FastText)?

Part 2: PhD Thesis

PHD THESIS OVERVIEW
S1 (SANER 2017)
S2 (ASE 2017)
S3 (ESEC/FSE 2018) S6 (ICSME 2018)
S5 (EMSE 2019)
S4
Thesis
RQ1
RQ2
RQ3
Graph-based Term
Weighting
Bug Report Quality
Dimension
Crowd Knowledge Data Analytics

TF-IDF: TERM IMPORTANCE (TRADITIONAL)
University of Saskatchewan
The Saskatchewan Huskies football team
represents the University of Saskatchewan
in U Sports football that competes in the
Canada West Universities Athletic
Association conference of U Sports. The
program has won the Vanier Cup national
championship three times, in 1990, 1996
and 1998.
The Saskatchewan Huskies
became only the second U Sports team to
advance to three consecutive Vanier Cup
games, after the Saint Mary's Huskies, but
lost all three games from 2004-2006. The
team has won the most Hardy Trophy
titles in Canada West, having won a total
of 20 times. The 2006 Saskatchewan
Huskies became only the third team to
play in a Vanier Cup that their school was
hosting, when the University of
Saskatchewan hosted the 42nd Vanier
Cup. The Toronto Varsity Blues were the
first when they won two Vanier Cups in
1965 and 1993. Saskatchewan also
became the first western school to host
the national championship game.
Term | TF | IDF | TF x IDF
Saskatchewan | 6 | 0.01 | 0.06
Vanier | 5 | 0.1 | 0.5
Won | 4 | 0.1 | 0.4
Huskies | 4 | 0.1 | 0.4
Cup | 4 | 0.01 | 0.04
Team | 4 | 0.01 | 0.04
Sports | 3 | 0.02 | 0.06
Times | 2 | 0.03 | 0.06
School | 2 | 0.05 | 0.1
Championship | 2 | 0.03 | 0.06
IDF = log (N / DF)
Saskatchewan Huskies
S5 S6
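The TF x IDF column on this slide is the product of each term's frequency and its inverse document frequency. A minimal sketch of that arithmetic, using the TF and IDF values shown on the slide (the function and variable names are illustrative and not taken from the thesis):

```python
# TF and IDF values copied from the slide; names are illustrative only.
tf = {"saskatchewan": 6, "vanier": 5, "won": 4, "huskies": 4, "cup": 4,
      "team": 4, "sports": 3, "times": 2, "school": 2, "championship": 2}
idf = {"saskatchewan": 0.01, "vanier": 0.1, "won": 0.1, "huskies": 0.1,
       "cup": 0.01, "team": 0.01, "sports": 0.02, "times": 0.03,
       "school": 0.05, "championship": 0.03}


def tf_idf(tf, idf):
    """Return (term, TF x IDF) pairs sorted by score, highest first."""
    scores = {term: tf[term] * idf[term] for term in tf}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranked = tf_idf(tf, idf)
```

The most frequent term (Saskatchewan) falls down the ranking because its IDF is low, while Vanier moves to the top.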

TEXTRANK: TERM IMPORTANCE USING CO-
OCCURRENCES (MIHALCEA ET AL, EMNLP 2004)
13
IResource … IJavaElement
IResource … IJavaElement
P1 P2 P3
(Term Co-occurrence)
S1 S2 S3 S4 S5 S6
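Term co-occurrence can be captured as a graph in which terms that appear close together are linked. The sketch below is one possible construction, assuming a sliding window over an already tokenized text; the window size and the token list are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        # With window=2 this links each token to its immediate successor.
        for other in tokens[i + 1 : i + window]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return graph


# Hypothetical token sequence, for illustration only.
tokens = ["custom", "search", "results", "view", "iresource", "element"]
graph = cooccurrence_graph(tokens)
```

The resulting undirected graph is the kind of input a graph-based ranking algorithm can score.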

POSRANK: TERM IMPORTANCE USING SYNTACTIC
DEPENDENCE (BLANCO & LIOMA, INF. RETR. 2012)
14
Noun Verb Adjective
Element …reported, element …plain
P1 P2 P3
Jespersen Rank Theory
(Syntactic Dependence)
S1 S2 S3 S4 S5 S6

S1: QUERY KEYWORD SELECTION WITH
PAGERANK (BRIN & PAGE, 1998)
15
 
 )(
)10(
|)(|
)(
)1()(
ivInj
j
j
i
vOut
vS
vS 
• Element
• IResource
• Provider
• Level
• Tree
Candidate
Query 1
Candidate
Query 2
P1 P2 P3
Sergey
Brin
Larry
Page
PageRank
Algorithm
Best Query
RQ1: Keywords selected by PageRank are more
effective for local code searches (e.g., IR-based bug
localization) than those selected by TF-IDF.
S1 S2 S3 S4 S5 S6
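The PageRank recurrence gives each term a score based on the scores of the terms that link to it. The following is a minimal iterative sketch, not the thesis implementation: the damping value of 0.85, the fixed iteration count, and the toy term graph are assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Iterate S(v) = (1 - d) + d * sum(S(u) / |Out(u)|) over all u linking to v."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    scores = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        new_scores = {}
        for v in nodes:
            incoming = sum(
                scores[u] / len(out_links[u])
                for u in nodes
                if v in out_links.get(u, ())
            )
            new_scores[v] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores


# Hypothetical term graph: "element" co-occurs with both other terms.
term_graph = {
    "element": ["iresource", "provider"],
    "iresource": ["element"],
    "provider": ["element"],
}
scores = pagerank(term_graph)
top_term = max(scores, key=scores.get)
```

Terms with the highest scores would then be candidates for the search query.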

S3: QUALITY-AWARE SEARCH QUERIES
16
Noisy Poor Rich
P1 P2 P3 S1 S2 S3 S4 S5 S6
Poor Noisy Rich
Rich
Noisy
Poor
Equality Equity
RQ2: Incorporating bug report quality into the query
construction process significantly improves the
performance of queries in code search.

Semantic
Hyperspace
S4: QUERY REFORMULATION WITH CROWD
KNOWLEDGE & DATA ANALYTICS
17
P1 P2 P3
Stack Overflow
(Crowd Knowledge)
Data
preprocessing
Neural Text classifier
FastText model
(skip-gram)
S1 S2 S3 S4 S5 S6

SEMANTIC HYPERSPACE
18
P1 P2 P3
Word 1 P (1, 5, 6, 7, ….., N)
Word 2 P (2, 4, 6, 9, ….., N)
Word 2
S1 S2 S3 S4 S5 S6
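In a semantic hyperspace each word is a point (a vector), and the closeness of two words can be measured by the cosine of the angle between them. The sketch below uses only the first four coordinates visible on the slide, since the remaining dimensions are elided; the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Only the coordinates shown on the slide; the real vectors have N dimensions.
word1 = [1.0, 5.0, 6.0, 7.0]
word2 = [2.0, 4.0, 6.0, 9.0]
similarity = cosine_similarity(word1, word2)
```

A value close to 1 indicates that the two words point in nearly the same direction in the space.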

CLUSTERING TENDENCY WITH DATA ANALYTICS
[Figure: query Q and two candidate clusters, C1 and C2, over the terms channel, join, spam, entered, connect, invitation, message, room, chat, handle, mask, remote, synd, admin]
• Hopkins Statistic (HS)
• Polygon Area (PA)
C1 is better than C2
RQ3: Appropriate query keywords can be delivered for the
code search using Stack Overflow and FastText.
S1 S2 S3 S4 S5 S6
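The Hopkins Statistic measures how strongly a set of points tends to form clusters. The sketch below shows one common formulation, assuming plain Euclidean distances and the convention that values near 0.5 suggest uniform data and values near 1 suggest clustered data. Other formulations exist, and the slides do not say which one the thesis uses. The sample points are synthetic.

```python
import math
import random


def hopkins_statistic(points, sample_size, rng):
    """Return sum(u) / (sum(u) + sum(w)) for nearest-neighbour distances u and w."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def nearest(q, exclude=None):
        return min(math.dist(q, p) for i, p in enumerate(points) if i != exclude)

    # u: distances from uniformly random points to their nearest data point.
    u = [nearest([rng.uniform(lows[d], highs[d]) for d in range(dims)])
         for _ in range(sample_size)]
    # w: distances from sampled data points to their nearest other data point.
    sampled = rng.sample(range(len(points)), sample_size)
    w = [nearest(points[i], exclude=i) for i in sampled]
    return sum(u) / (sum(u) + sum(w))


# Synthetic data: two tight clusters around (0, 0) and (1, 1).
rng = random.Random(0)
clustered = [(rng.gauss(cx, 0.01), rng.gauss(cy, 0.01))
             for cx, cy in [(0, 0), (1, 1)] for _ in range(20)]
h = hopkins_statistic(clustered, 10, rng)
```

Under this convention, a candidate keyword cluster with a higher statistic would be considered more tightly grouped.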

EVALUATION METHODOLOGY
20
Evaluation Paradigms
IR-Based Bug
Localization
Query
Reformulation
1. Hit@K
2. MAP@K
3. MRR@K
Query
Effectiveness
(QE)
P1 P2 P3 S1 S2 S3 S4 S5 S6
5K+ 8
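The metrics listed on this slide can be computed from a ranked result list and a set of relevant (e.g., buggy) files. The sketch below shows the per-query values; MAP@K and MRR@K are the means of the per-query average precision and reciprocal rank. The normalisation of average precision (by the number of relevant items found in the top K) is an assumption, since variants exist, and the file names are hypothetical.

```python
def hit_at_k(ranked, relevant, k):
    """1 if any relevant item appears in the top k results, else 0."""
    return int(any(item in relevant for item in ranked[:k]))


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant item in the top k, or 0.0 if none."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked, relevant, k):
    """Mean of precision values at each relevant item found in the top k."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def query_effectiveness(ranked, relevant):
    """Rank of the first relevant item (lower is better), or None if absent."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return rank
    return None


# Hypothetical ranked list and ground truth.
ranked = ["A.java", "B.java", "C.java", "D.java"]
relevant = {"B.java", "D.java"}
```

For this example the first relevant file is at rank 2, so the reciprocal rank is 0.5 and the query effectiveness is 2.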

CROWD KNOWLEDGE & DATA ANALYTICS FOR QUERY
EXPANSION
Convert image to gray scale without losing transparency
BufferedImage Grayscale ImageEdit ColorConvertOp File
Transparency ColorSpace BufferedImageOp Graphics ImageEffects
P1 P2 P3
21
S1 S2 S3 S4 S5 S6

TAKE-HOME MESSAGES
22
P1 P2 P3 S1 S2 S3 S4 S5 S6
Traditional | Proposed
Term Independence (TF-IDF) | Term Dependence (PageRank)
Reliance on Auxiliary Resources (e.g., history mining) | Efficient Use of Primary Resource (e.g., Bug Reports)
Bug Report Quality (Overlooked) | Reporting Quality-Aware Bug Localization
Thesaurus-Based Similar Keyword Suggestion | Crowdsourced Knowledge & Large Data Analytics
Cosine Similarity for Semantic Distance | Semantic Hyperspace & Clustering Tendency

http://www.usask.ca/~masud.rahman
https://github.com/masud-technope
Contact: masud.rahman@usask.ca
@masud2336
Masud Rahman
Part III: Q & A

TAKE-HOME MESSAGES
24
RQ1
RQ2 RQ3
TF-IDF
PageRank
Equality
Equity
Stack Overflow
FastText
WordNet
Thesis
P1 P2 P3

SEMANTIC HYPERSPACE
25
P1 P2 P3
x P (1, 5, 6, 7, ….., N)
y P (2, 4, 6, 9, ….., N)
y
S1 S2 S3 S4 S5 S6
y = mx + c,
x^2 +y^2 = r^2
ax^2+bx+c=0

TWO WORKING CONTEXTS: LOCAL & GLOBAL
Local code search
(e.g., bug localization)
Internet-scale
code search
Boeing
codebase GitHub
P1 P2 P3
26

S2: KEYWORDS SELECTION FROM SOURCE
CODE WITH CODERANK
27
resolveRuntimeClasspathEntry
Resolve Runtime Classpath Entry
P1 P2 P3
 
 )(
)10(
|)(|
)(
)1()(
ivInj
j
j
i
vOut
vS
vS 
RQ1 [Source Code]: Keywords selected by PageRank
are more effective for local code searches (e.g., concept
location) than those selected by TF-IDF.
S1 S2 S3 S4 S5 S6
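Extracting keywords from source code starts by splitting identifiers such as resolveRuntimeClasspathEntry into their component words. A minimal sketch of that step using a regular expression; the function name and the exact splitting rules are assumptions, not taken from the slides.

```python
import re


def split_identifier(identifier):
    """Split a camelCase identifier into its component words."""
    # Matches acronyms, capitalised or lowercase words, and digit runs.
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)


parts = split_identifier("resolveRuntimeClasspathEntry")
label = " ".join(part.capitalize() for part in parts)
```

Here `label` reproduces the split form shown on the slide; the individual words can then become nodes in a term graph.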

HOW DID WE DO?
28
P1 P2 P3 S1 S2 S3 S4 S5 S6
3

R3: SOLVE VOCABULARY MISMATCH ISSUE
Customer
Developer
Past
Developer
Bug Report
Codebase
P1 P2 P3 P4
29

SOLUTION: SEMANTIC HYPERSPACE
Word 1 P (1, 5, 6, 7, ….., N)
Word 2 P (2, 4, 6, 9, ….., N)
Word 2
Cosine distance = Semantic
relevance

R4: GENETIC ALGORITHM FOR QUERIES
Method | Search Query | QE
Baseline | {title + description} | 25
STRICT[140] | {tab classpath enabled buttons user entry} | 86
TF-IDF | {button entry bootstrap enabled incorrectly moving} | 177
GA | {open reflect tab bottom entry classpath} | 01
Title
Description
Lower QE is better

SEARCH QUERY FROM NOISY BUG REPORT
32
Bug 31637 – should be able to cast null
NullPointerException
Ci Cj Mk Mn Cp
53 01
S1 S2 S3 S4P1 P2 P3

DICE, ROCCHIO, RSV
33

VOCABULARY MISMATCH PROBLEM
P1 P2 P3
Both are correct and wrong!
Boeing
Customer Boeing
Developer
34

KEYWORDS FROM A BUG REPORT
Title
Description
ID | Query | QE
1 | Custom search results view iresource | 1331
2 | Custom search results search results view | 636
3 | element iresource provider level tree | 01
4 | Custom search results hierarchically java search results | 570
Lower QE is better

PROBABILISTIC TERM WEIGHTING
KLD
36
