The document outlines Masud Rahman's PhD thesis proposal on supporting source code search with context-aware, analytics-driven query reformulation. The proposal discusses three research questions: 1) evaluating term weighting techniques for keyword selection from source code and bug reports, 2) incorporating bug report quality for local code search, and 3) leveraging crowd knowledge and data analytics to deliver query keywords. The contribution summary highlights techniques for term dependence, quality-aware bug localization, and using crowd knowledge and large data analytics.
Semantics and optimisation of the SPARQL 1.1 federation extension (Oscar Corcho)
Presentation given at ESWC 2011 for the paper "Semantics and optimisation of the SPARQL 1.1 federation extension". Buil-Aranda C, Arenas M, Corcho O. ESWC 2011, May 2011, Hersonissos, Greece.
Information access over linked data requires determining the subgraph(s) in linked data's underlying graph that correspond to the required information need. Usually, an information access framework can retrieve richer information by checking a large number of possible subgraphs. However, checking a large number of possible subgraphs increases information access complexity, which makes such frameworks less effective. Many contemporary linked data information access frameworks reduce this complexity by introducing different heuristics, but they then suffer in retrieving richer information; other frameworks do not address the complexity at all. A practically usable framework, however, should retrieve richer information with lower complexity. We hypothesize that pre-processed statistics of linked data can be used to efficiently check a large number of possible subgraphs, which helps retrieve comparatively richer information with lower data access complexity. A preliminary evaluation of our hypothesis shows promising performance.
Many Linked Data datasets model elements in their domains in the form of lists: a countable number of ordered resources.
When publishing these lists in RDF, an important concern is making them easy to consume.
Therefore, a well-known recommendation is to find an existing list modelling solution, and reuse it.
However, a specific domain model can be implemented in different ways and vocabularies may provide alternative solutions.
In this paper, we argue that a wrong decision could have a significant impact in terms of performance and, ultimately, the availability of the data.
We take the case of RDF Lists and make the hypothesis that the efficiency of retrieving sequential linked data depends primarily on how they are modelled (triple-store invariance hypothesis).
To demonstrate this, we survey different solutions for modelling sequences in RDF, and propose a pragmatic approach for assessing their impact on data availability.
Finally, we derive good (and bad) practices on how to publish lists as linked open data.
By doing this, we sketch the foundations of an empirical, task-oriented methodology for benchmarking linked data modelling solutions.
SSN-TC workshop talk at ISWC 2015 on Emrooz (Markus Stocker)
Slides for the talk describing the paper on Emrooz, a scalable database for sensor observations with semantics according to the Semantic Sensor Network ontology.
Towards efficient processing of RDF data streams (Alejandro Llaves)
Presentation of short paper submitted to OrdRing workshop, held at ISWC 2014 - http://streamreasoning.org/events/ordring2014.
In recent years, the amount of real-time data generated has increased substantially. Sensors attached to things are transforming how we interact with our environment. Extracting meaningful information from these data streams is essential for some application areas and requires processing systems that scale to varying conditions in data sources, complex queries, and system failures. This paper describes ongoing research on the development of a scalable RDF streaming engine.
Scientific Applications and Heterogeneous Architectures (insideBigData.com)
In this deck from ATPESC 2019, Michela Taufer from UT Knoxville presents: Scientific Applications and Heterogeneous Architectures.
"This talk discusses two emerging trends in computing (i.e., the convergence of data generation and analytics, and the emergence of edge computing) and how these trends can impact heterogeneous applications. Next-generation supercomputers, with their extremely heterogeneous resources and dramatically higher performance than current systems, will generate more data than we need or, even, can handle. At the same time, more and more data is generated at the "edge," requiring computing and storage to move closer and closer to data sources. The coordination of data generation and analysis across the spectrum of heterogeneous systems including supercomputers, cloud computing, and edge computing adds additional layers of heterogeneity to applications' workflows. More importantly, the coordination can neither rely on manual, centralized approaches as it is predominantly done today in HPC nor exclusively be delegated to be just a problem for commercial Clouds. This talk presents case studies of heterogeneous applications in precision medicine and precision farming that expand scientist workflows beyond the supercomputing center and shed our reliance on large-scale simulations exclusively, for the sake of scientific discovery."
Watch the video: https://wp.me/p3RLHQ-lq2
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Linked lists represent a countable number of ordered values, and are among the most important abstract data types in computer science. With the advent of RDF as a highly expressive knowledge representation language for the Web, various implementations for RDF lists have been proposed. Yet, there is no benchmark so far dedicated to evaluating the performance of triple stores and SPARQL query engines on dealing with ordered linked data. Moreover, essential tasks for evaluating RDF lists, like generating datasets containing RDF lists of various sizes, or generating the same RDF list using different modelling choices, are cumbersome and unprincipled. In this paper, we propose List.MID, a systematic benchmark for evaluating systems serving RDF lists. List.MID consists of a dataset generator, which creates RDF list data in various models and of different sizes, and a set of SPARQL queries. The RDF list data is coherently generated from a large, community-curated base collection of Web MIDI files, rich in lists of musical events of arbitrary length. We describe the List.MID benchmark, and discuss its impact and adoption, reusability, design, and availability.
With the increasing adoption of NoSQL database systems like MongoDB or CouchDB, more and more applications store structured data according to a non-relational, document-oriented model. Exposing this structured data as Linked Data is currently inhibited by a lack of standards as well as tools, and requires the implementation of custom solutions. While recent efforts aim at expressing transformations of such data models into RDF in a standardized manner, there is a lack of approaches that facilitate SPARQL execution over mapped non-relational data sources. With SparqlMap-M we show how dynamic SPARQL access to non-relational data can be achieved. SparqlMap-M is an extension to our SPARQL-to-SQL rewriter SparqlMap that performs a (partial) transformation of SPARQL queries by using a relational abstraction over a document store. Further, duplicate data in the document store is used to reduce the number of joins, and custom optimizations are introduced. Our showcase scenario employs the Berlin SPARQL Benchmark (BSBM) with different adaptations to a document data model. We use this scenario to demonstrate the viability of our approach and compare it to different MongoDB setups and native SQL.
Jörg Unbehauen | AKSW, Universität Leipzig
Presentation at Semantics 2016 in Leipzig on the results of the LEDS project.
RAISE Lab at Dalhousie University
The lab aims to develop tools and technologies for intelligent automation in software engineering. An overview is presented by Dr. Masud Rahman, Assistant Professor, Faculty of Computer Science, Dalhousie University, Canada.
The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric... (Masud Rahman)
Being lightweight and cost-effective, IR-based approaches for bug localization have shown promise in finding software bugs. However, the accuracy of these approaches heavily depends on the bug reports they use. A significant number of bug reports contain only plain natural language texts. According to existing studies, IR-based approaches cannot perform well when they use these bug reports as search queries. On the other hand, recent evidence suggests that even these natural language-only reports contain enough good keywords that could help localize the bugs successfully. These findings suggest that natural language-only bug reports might be a sufficient source for good query keywords, but they also cast serious doubt on the query selection practices in IR-based bug localization. In this article, we attempt to clarify this issue by conducting an in-depth empirical study that critically examines the state-of-the-art query selection practices in IR-based bug localization. In particular, we use a dataset of 2,320 bug reports, employ ten existing approaches from the literature, exploit a Genetic Algorithm-based approach to construct optimal or near-optimal search queries from these bug reports, and then answer three research questions. We confirmed that the state-of-the-art query construction approaches are indeed not sufficient for constructing appropriate queries (for bug localization) from certain natural language-only bug reports. However, these bug reports do contain high-quality search keywords in their texts even though they might not contain explicit hints for localizing bugs (e.g., stack traces). We also demonstrate that optimal queries and non-optimal queries chosen from bug report texts are significantly different in terms of several keyword characteristics (e.g., frequency, entropy, position, part of speech).
Such an analysis has led us to four actionable insights on how to choose appropriate keywords from a bug report. Furthermore, we demonstrate 27%–34% improvement in the performance of non-optimal queries through the application of our actionable insights to them. Finally, we summarize our study findings with future research directions (e.g., machine intelligence in keyword selection).
Preprint: https://bit.ly/39nAoun
Publication URL: https://bit.ly/3xVUxlq
Replication package: https://bit.ly/36T8oxL
More details: https://web.cs.dal.ca/~masud
A Strategic Approach: GenAI in Education (Peter Windle)
Artificial Intelligence (AI) technologies such as Generative AI, image generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to academic integrity, with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, and policies were put in place. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessment, leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Introduction to AI for Nonprofits with Tapp Network (TechSoup)
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Macroeconomics - Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
This presentation covers the basics of PCOS, its pathology and treatment, along with the Ayurvedic correlation of PCOS and the Ayurvedic line of treatment mentioned in the classics.
How to Build a Module in Odoo 17 Using the Scaffold Method (Celine George)
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
Thinking of getting a dog? Be aware that breeds like Pit Bulls, Rottweilers, and German Shepherds can be loyal and dangerous. Proper training and socialization are crucial to preventing aggressive behaviors. Ensure safety by understanding their needs and always supervising interactions. Stay safe, and enjoy your furry friends!
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
PhD proposal of Masud Rahman
1. SUPPORTING SOURCE CODE SEARCH WITH
CONTEXT-AWARE, ANALYTICS-DRIVEN
QUERY REFORMULATION
Masud Rahman
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal Roy
@masud233
2. TALK OUTLINE
Part 1: Research Problem
Part 2: PhD Thesis
Part 3: Contribution Summary
Part 4: Q&A + Discussions
Masud Rahman, UofS
4. MCAS: A SOFTWARE BUG THAT KILLS
Boeing 737 MAX 8
MCAS
5. THE SEARCH FOR THE BUGGY CODE
[Diagram: a Boeing customer submits an MCAS bug report; a Boeing developer searches the Boeing codebase for the buggy code, supported by query suggestion and query reformulation]
6. SYSTEMATIC LITERATURE REVIEW
ACM DL
CrossRef
DBLP
Mendeley
Google Scholar
IEEE Xplore
ProQuest
ScienceDirect
SpringerLink
Web of Science
Wiley Online Lib
Initial results: 2871 → Impurity removal: 2317 → Filter by title: 562 → Filter by abstract: 195 → Filter by full texts: 93 → Merging & duplicate removal: 56 primary studies
Query reformulation, query expansion, query reduction, query formulation,
query refinement, automated query expansion, AQE, query suggestion,
query recommendation, term selection, query replacement, query difficulty,
query quality, keyword selection, keyword extraction, search term
identification, search query, search term, and search keyword.
7. I1: INAPPROPRIATE TERM WEIGHTING
TF-IDF(t, d) = (1 + log(tf(t, d))) × log(|D| / df(t))
• Different syntax
• Different semantics
• Different structures
RQ1: Can TF-IDF deliver appropriate search keywords either from source code or from bug reports? If not, how can we improve the keyword selection?
8. I2: LOW QUALITY OF BUG REPORTS
5000+ bug reports
Poor / Noisy / Rich
RQ2: Can we deliver appropriate keywords for IR-based bug localization (a.k.a. local code search) by incorporating the bug report quality?
Traditional Practices
9. I3: WORDNET FOR SEMANTIC SIMILARITY
RQ3: Can we deliver appropriate query keywords for the code search using crowd knowledge (Stack Overflow) and large data analytics (FastText)?
12. Graph-based Term Weighting
RQ1: Can TF-IDF deliver appropriate search keywords either from source code or from bug reports? If not, how can we improve the keyword selection?
13. TF-IDF: TERM IMPORTANCE (TRADITIONAL)
University of Saskatchewan
The Saskatchewan Huskies football team
represents the University of Saskatchewan
in U Sports football that competes in the
Canada West Universities Athletic
Association conference of U Sports. The
program has won the Vanier Cup national
championship three times, in 1990, 1996
and 1998.
The Saskatchewan Huskies
became only the second U Sports team to
advance to three consecutive Vanier Cup
games, after the Saint Mary's Huskies, but
lost all three games from 2004-2006. The
team has won the most Hardy Trophy
titles in Canada West, having won a total
of 20 times. The 2006 Saskatchewan
Huskies became only the third team to
play in a Vanier Cup that their school was
hosting, when the University of
Saskatchewan hosted the 42nd Vanier
Cup. The Toronto Varsity Blues were the
first when they won two Vanier Cups in
1965 and 1993. Saskatchewan also
became the first western school to host
the national championship game.
Term           TF    IDF     TF × IDF
Saskatchewan   6     0.01    0.06
Vanier         5     0.1     0.5
Won            4     0.1     0.4
Huskies        4     0.1     0.4
Cup            4     0.01    0.04
Team           4     0.01    0.04
Sports         3     0.02    0.06
Times          2     0.03    0.06
School         2     0.05    0.1
Championship   2     0.03    0.06
IDF = log(N / DF)
Saskatchewan Huskies
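The traditional weighting above can be sketched in a few lines of Python. This is a minimal illustration of the scheme shown on the slide, not the thesis tooling; the corpus and tokenization are toy assumptions:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Slide's scheme: TF-IDF(t, d) = (1 + log tf) * log(N / df)."""
    tf = Counter(doc_tokens)[term]
    if tf == 0:
        return 0.0
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    return (1 + math.log(tf)) * math.log(len(corpus) / df)
```

Terms that appear often in one document but rarely across the corpus (like "Vanier" above) score highest, which is exactly the effect shown in the table.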
20. SEARCH QUERY FROM NOISY BUG REPORT
Bug 31637 – should be able to cast null
NullPointerException
[Figure: candidate classes/methods (Ci, Cj, Mk, Mn, Cp) with result ranks 53, 01, 20]
RQ2: High-quality keywords can be provided for IR-based bug localization (a.k.a. local code search) by considering bug report quality.
25. EXPERIMENT, DATASET & METRICS
Dataset: 5K+ bug reports, version history, ground truth
Metrics:
1. Hit@K
2. MAP@K
3. MRR@K
4. QE (Query Effectiveness)
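The metrics listed above can be computed directly from a ranked result list. This is a generic sketch of Hit@K, MRR@K, and QE (taken here as the rank of the first correct result), not the study's evaluation code:

```python
def hit_at_k(ranked, relevant, k):
    """1 if any ground-truth item appears within the top-k results."""
    return int(any(r in relevant for r in ranked[:k]))

def mrr_at_k(queries, k):
    """Mean reciprocal rank; queries is a list of (ranked, relevant) pairs."""
    total = 0.0
    for ranked, relevant in queries:
        for i, r in enumerate(ranked[:k], start=1):
            if r in relevant:
                total += 1.0 / i
                break
    return total / len(queries)

def query_effectiveness(ranked, relevant):
    """QE: rank of the first correct result (lower is better); None if absent."""
    for i, r in enumerate(ranked, start=1):
        if r in relevant:
            return i
    return None
```

MAP@K follows the same pattern, averaging precision at each relevant hit within the top-k.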
26. SEARCH CONTEXTS: LOCAL & INTERNET-SCALE
Local code search (e.g., bug localization): Boeing codebase
Internet-scale code search: GitHub
76%
27. CROWD KNOWLEDGE & DATA ANALYTICS FOR QUERY EXPANSION
Convert image to gray scale without losing transparency
BufferedImage Grayscale ImageEdit ColorConvertOp File Transparency ColorSpace BufferedImageOp Graphics ImageEffects
28. WHAT IS CROWD KNOWLEDGE?
29. RACK: QUERIES USING CROWD KNOWLEDGE
Example: query keywords {generate, MD5, hash} → suggested API class: MessageDigest
RQ3: Appropriate query keywords (e.g., relevant API classes) can be delivered for the code search using crowd knowledge (Stack Overflow).
Q* = Q + C (C from the Keyword-API Mapping DB)
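The idea Q* = Q + C can be illustrated with a toy keyword-API map. The mapping below is entirely hypothetical; RACK mines the real one from Stack Overflow questions and their accepted answers:

```python
from collections import Counter

# Hypothetical keyword-to-API co-occurrence map (RACK builds this
# from Stack Overflow Q&A; these entries are illustrative only).
KEYWORD_API_DB = {
    "generate": ["SecureRandom"],
    "md5": ["MessageDigest"],
    "hash": ["MessageDigest", "HashMap"],
}

def expand_query(query_keywords, db, top=3):
    """Q* = Q + C: vote for API classes across all query keywords."""
    votes = Counter()
    for kw in query_keywords:
        votes.update(db.get(kw, []))
    return list(query_keywords) + [api for api, _ in votes.most_common(top)]
```

API classes that co-occur with several query keywords collect more votes, so MessageDigest outranks the single-keyword candidates here.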
30. NLP2API: QUERIES WITH DATA ANALYTICS
Semantic Proximity: if proximity(Q, A) > proximity(Q, B), then Q* = Q + A
RQ3: Appropriate query keywords (e.g., relevant API classes) can be delivered for the code search using large-scale data analytics (FastText).
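The proximity test can be sketched with cosine similarity over word vectors. The 2-D vectors here are toys; NLP2API computes proximity over FastText embeddings trained on Stack Overflow text:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def choose_expansion(query_vec, candidates):
    """Expand with A rather than B when proximity(Q, A) > proximity(Q, B)."""
    name, _ = max(candidates, key=lambda c: cosine(query_vec, c[1]))
    return name
```

In practice the query vector would be an aggregate (e.g., the mean) of the embeddings of the query's keywords.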
40. TWO WORKING CONTEXTS: LOCAL & GLOBAL
Local code search (e.g., bug localization): Boeing codebase
Internet-scale code search: GitHub
41. S2: KEYWORD SELECTION FROM SOURCE CODE WITH CODERANK
resolveRuntimeClasspathEntry
Resolve Runtime Classpath Entry
S(v_i) = (1 - d) + d × Σ_{v_j ∈ In(v_i)} S(v_j) / |Out(v_j)|, where 0 < d < 1
RQ1 [Source Code]: Keywords selected by PageRank are more effective for local code searches (e.g., concept location) than those selected by TF-IDF.
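The recurrence above can be iterated to a fixed point. Here is a compact sketch over a token co-occurrence graph; building that graph from source code identifiers is simplified to a plain adjacency map:

```python
def code_rank(graph, d=0.85, iters=50):
    """Iterate S(v_i) = (1 - d) + d * sum over j in In(v_i) of S(v_j)/|Out(v_j)|.

    graph maps each token to the list of tokens it links to
    (e.g., co-occurrence edges between identifier words).
    """
    nodes = set(graph) | {w for outs in graph.values() for w in outs}
    incoming = {v: [] for v in nodes}
    for v, outs in graph.items():
        for w in outs:
            incoming[w].append(v)
    scores = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        scores = {
            v: (1 - d) + d * sum(scores[j] / len(graph[j]) for j in incoming[v])
            for v in nodes
        }
    return scores
```

Tokens split from an identifier such as resolveRuntimeClasspathEntry would form the nodes; highly connected tokens accumulate score and become query keywords.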
42. HOW DID WE DO?
RQ3: Appropriate query keywords can be delivered for code search using Stack Overflow and FastText.
45. R4: GENETIC ALGORITHM FOR QUERIES
Method       Search Query                                           QE
Baseline     {title + description}                                   25
STRICT[140]  {tab classpath enabled buttons user entry}              86
TF-IDF       {button entry bootstrap enabled incorrectly moving}    177
GA           {open reflect tab bottom entry classpath}               01
Lower QE is better
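As a rough illustration of how a genetic algorithm can pick query keywords, here is a minimal sketch: individuals are bitmasks over candidate keywords, evolved with single-point crossover and point mutation. The fitness function, population size, and operators are illustrative assumptions, not R4's actual configuration (a real fitness would be retrieval effectiveness, e.g., QE, against the codebase).

```python
import random

def genetic_query_selection(keywords, fitness, pop_size=20, gens=30, seed=1):
    """Evolve a bitmask over candidate keywords, maximizing a fitness score."""
    rng = random.Random(seed)
    n = len(keywords)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    def decode(mask):
        return [k for k, bit in zip(keywords, mask) if bit]
    for _ in range(gens):
        pop.sort(key=lambda m: fitness(decode(m)), reverse=True)
        parents = pop[: pop_size // 2]           # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)            # single-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] ^= 1         # point mutation
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda m: fitness(decode(m)))
    return decode(best)

# Toy fitness (hypothetical): reward overlap with terms known to retrieve
# the buggy file, lightly penalize query length.
relevant = {"tab", "classpath", "entry"}
fitness = lambda q: len(set(q) & relevant) - 0.1 * len(q)
query = genetic_query_selection(
    ["open", "tab", "classpath", "button", "entry", "moving"], fitness)
```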
46. SEARCH QUERY FROM NOISY BUG REPORT
Bug 31637 – should be able to cast null
NullPointerException
Result rank: 53 (baseline query) → 01 (reformulated query)
Hello everyone! Good afternoon! Thanks for attending this meeting.
My name is Masud Rahman. I am a PhD Candidate from Software Research Lab.
I work with Dr. Chanchal K. Roy.
Today, I will be talking about automated query reformulations for code search.
Today, my talk will be divided into four sections.
In the first section, I will discuss the research problem I am trying to solve in my PhD.
In the second section, I will discuss about my PhD Thesis proposals to solve that research problem.
In the third section, I will summarize my PhD contributions.
Finally, we will have a Q&A session and interesting discussions.
Part 1: Research Problem
You are looking at two aircraft -- Ethiopian Airlines and Lion Air (Indonesia).
Both are in what is called a nose-down situation. Due to these nose-down situations, there were two fatal crashes in a single calendar year.
These crashes took 346 precious human lives and cost billions of dollars.
Now, the culprit is MCAS. This is a software component that was added to the Boeing 737 MAX 8.
The bottom line is this: the component was faulty and poorly designed, and it ultimately led to the crashes.
That is why Boeing 737 MAX planes are grounded right now.
Now, let's say a Boeing customer has submitted a bug report.
A Boeing developer is now responsible for locating and repairing the faulty code triggering that bug.
As a frequent practice, the developer chooses a few important keywords and attempts to locate the buggy code within the Boeing codebase.
But the study shows that 88% of the keywords chosen by developers could be incorrect. That is, they do not return the buggy code.
So, the obvious next step is to reformulate the query through automated tool supports, so that the buggy code could be located.
There are also tools that take a bug report and suggest appropriate search queries in the first place.
So, we are interested in this part of the process, and my PhD focuses on it.
As you can also see that, Google does not have any jurisdiction in this case.
So, what did we do? We conducted a systematic literature survey of 56 primary studies on query reformulation for code search.
During this study, we found 3 major issues in the literature.
Now, this is a metric that has been in play since the last century; it was proposed in the 70s.
It is a good metric, but it was actually proposed for regular texts such as news articles or plain texts.
On the other hand, we are dealing with source code here.
Now, regular texts and source code have different semantics and different structures.
They are not the same
So, metrics for regular texts are not appropriate for source code -- this is our hypothesis.
So, here is our first research question: how does TF-IDF perform? And if not well, can we propose something better?
We did an empirical study with 5K+ bug reports in our ICSE poster.
And we discovered that bug reports could be very different in terms of quality.
There could be different types of bug reports.
It could be noisy, containing stack traces -- that is 16%.
It could be really poor, not containing any structured entities -- that is 30%.
Or it could be a rich bug report that includes source code, test cases and other artifacts -- that is 54%.
Now, what do existing studies do? They treat all these different types of bug reports the same.
So, in their approach, not everybody gets a chance to watch the game.
So, here is our second research question. Can we incorporate reporting quality into bug localization and deliver better queries?
Identifying similar words is very important during query reformulation.
We found that WordNet has been used extensively in the literature for finding similar words.
Now, it is good for regular texts. But again, we are dealing with source code here.
Evidence suggests that WordNet might not work well for source code.
However, those were the old days. Now we have Stack Overflow and advanced tools like FastText for semantic similarity calculation.
So, here is our third RQ. Can we deliver appropriate keywords during code search using Stack Overflow and FastText?
Now, we are done with Background concepts, Part 1.
Now, we are going into Part 2 -- PhD Thesis
So, this is our thesis statement. We hypothesize that we can improve query reformulation using:
graph-based term weighting rather than TF-IDF,
bug report quality and document contexts, and
crowdsourced knowledge (i.e., Stack Overflow) and data analytics such as word embeddings from FastText.
So, to evaluate these hypotheses, we conduct six studies in the PhD.
The first and second studies address RQ1, the third study addresses RQ2, and the rest answer RQ3.
Similarly, we can see the phrases and dependencies among the terms in the bug report texts as well.
Our job is to identify the keywords from these texts, right?
So, what did we do?
We consider the co-occurrences among the terms. That is, how terms occur with other terms within a certain context.
We encode such co-occurrences as edges, and transform the texts into a graph like this.
Besides term co-occurrences, we consider another aspect called syntactic dependencies.
For this, we used Jespersen's Rank Theory, a theory developed back in 1925.
According to this theory, the parts of speech of a sentence can be divided into three ranks:
nouns in the first rank, verbs and adjectives in the second rank, and the rest in the third rank.
According to Jespersen, verbs and adjectives modify nouns. That is, there are syntactic dependencies between these elements that convey the overall meaning of the sentence.
Now, we capture such syntactic dependencies as well, and transform the report texts into a POS graph as well.
So, we have created two graphs, right?
Now, we have two graphs developed from the bug report based on two different dimensions
--Word co-occurrence and syntactic dependence.
Once we have the graphs, we apply the famous PageRank algorithm, the backbone of Google search.
Now, the algorithmic details are a bit complex, but I will try to provide an overview here.
Why do you think this guy is laughing? Because he is getting the maximum votes.
Similarly, in the graph, the node that is connected to most of the nodes is the winner.
That is, a term’s importance will be determined by its connectivity with other nodes.
More importantly, since this is a recursive algorithm, a node's importance depends on the weights of the connected nodes as well.
Once the computation is done, we get a reformulation candidate from each graph.
What is the reformulation candidate? – a ranked list of keywords like this.
So, we collect two candidates from the two graphs, apply machine learning, and suggest the best one as our suggested query for the bug report.
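The pipeline described above -- build a term graph from co-occurrences within a sliding window, then rank terms with PageRank -- can be sketched as follows. The window size and damping factor here are common defaults, not necessarily the study's exact settings, and the syntactic-dependency graph is omitted for brevity.

```python
from collections import defaultdict

def build_term_graph(tokens, window=2):
    """Connect terms that co-occur within a sliding window."""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[max(0, i - window): i]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return graph

def pagerank(graph, d=0.85, iters=50):
    """Iterative PageRank: a term's score depends on its neighbours' scores."""
    scores = {v: 1.0 for v in graph}
    for _ in range(iters):
        scores = {
            v: (1 - d) + d * sum(scores[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
    return scores

# Toy token stream (hypothetical bug-report terms).
tokens = "classpath entry resolve runtime classpath entry tab".split()
ranks = pagerank(build_term_graph(tokens))
top = sorted(ranks, key=ranks.get, reverse=True)  # reformulation candidate
```

Terms that co-occur with many other terms (here, "classpath" and "entry") end up with the highest scores, which is exactly the "most votes wins" intuition from the slides.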
Now once such items are extracted, we split them.
Now, as we see, these single terms carry partial semantics that combine to convey a broader meaning.
That is, they complement each other in this context.
Now, we capture such semantic dependencies in the source code, and develop a term graph like this.
So, first we take a bug report as input.
Then we apply regular expressions to identify the structured components.
We then classify whether this is:
a noisy report containing stack traces,
a poor bug report containing only regular texts, or
a rich bug report containing both source code and texts.
Once the quality level is identified, what’s the next step?
Well, we do query reformulation unlike the earlier studies.
We separate the signal from the noise in the noisy report, and feed the poor bug report with appropriate keywords.
We mostly keep the rich bug report as is.
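A minimal sketch of this regex-based classification might look like the following; the patterns are illustrative approximations, not the study's actual expressions.

```python
import re

# Illustrative patterns (assumptions, not the study's real regexes):
# a Java-style stack frame, and a rough code-snippet heuristic.
STACK_TRACE = re.compile(r'^\s+at\s+[\w.$]+\([\w.]*:?\d*\)', re.MULTILINE)
CODE_SNIPPET = re.compile(r'\{[^}]*;[^}]*\}|\b\w+\s*\([^)]*\)\s*;')

def classify_bug_report(text):
    """Classify a report as noisy (stack traces), rich (code), or poor (plain text)."""
    if STACK_TRACE.search(text):
        return "noisy"
    if CODE_SNIPPET.search(text):
        return "rich"
    return "poor"
```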
So, that is the equity approach.
So, from a noisy report, we extract
The report title
The encountered exception
The most important keywords from the stack traces.
Then we do the search with this newly constructed query.
For example, the baseline noisy query returns the result at 53rd position.
Whereas our query returns the correct result at the topmost position.
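The query construction for a noisy report -- title, plus the encountered exception, plus important stack-trace terms -- can be sketched as follows. Term weighting is simplified here to plain frequency counting over trace tokens; the real approach's weighting is more sophisticated.

```python
import re
from collections import Counter

def query_from_noisy_report(title, stack_trace, top_k=3):
    """Build a query from the title, the exception, and frequent trace terms."""
    exception = re.search(r'\b(\w+(?:Exception|Error))\b', stack_trace)
    # Split dotted stack frames into package/class/method tokens and count them.
    frames = re.findall(r'at\s+([\w.$]+)\(', stack_trace)
    tokens = Counter(tok for frame in frames for tok in frame.split('.'))
    keywords = [tok for tok, _ in tokens.most_common(top_k)]
    return title.split() + ([exception.group(1)] if exception else []) + keywords

# Hypothetical trace loosely modelled on the Bug 31637 slide.
trace = """java.lang.NullPointerException
    at org.eclipse.jdt.Cast.resolve(Cast.java:53)
    at org.eclipse.jdt.Cast.check(Cast.java:90)"""
q = query_from_noisy_report("should be able to cast null", trace)
```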
First, we construct a semantic hyperspace using Stack Overflow corpus.
What is hyperspace?
Well, if a space has more than three dimensions, we call it a hyperspace.
How do we do it?
First, we take the Stack Overflow data dump, which contains software-specific texts. Our corpus contains about 2.1 million questions and answers.
We do pre-processing and feed the contents to FastText. Now FastText generates a three-layer neural network model.
This model essentially represents the whole vocabulary like this in a hyperspace.
Now how does it help?
Here we see that burger is close to sandwich. Why? Because they are eaten together? I do that all the time.
Well, that is not the case.
They are mentioned in the similar contexts by the people across the whole corpus.
The model recognizes such occurrences and thus puts burger and sandwich close together.
Similarly, dumpling and ramen are close to each other.
Now, we propose this. This is the original query, and this is the reformulated query.
A good reformulated query will cluster together with the original query.
A bad reformulated query will NOT cluster with the original query.
So, clustering tendency within the hyperspace is our weapon here.
We calculate the Hopkins statistic and the Polygon Area to measure this clustering tendency.
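The Hopkins statistic itself can be sketched as follows: compare nearest-neighbour distances from uniformly random probe points against those from sampled data points. Values near 1 indicate strong clustering; values near 0.5 indicate a random spread. The sample size and seed below are illustrative choices.

```python
import numpy as np

def hopkins(points, m=None, seed=0):
    """Hopkins statistic: H near 1 means the points cluster strongly."""
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)
    n, d = X.shape
    m = m or max(1, n // 10)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def nn_dists(probes, data):
        # Distance from each probe point to its nearest neighbour in `data`.
        dists = np.linalg.norm(data[None, :, :] - probes[:, None, :], axis=2)
        return np.sort(dists, axis=1)[:, 0]

    U = rng.uniform(lo, hi, size=(m, d))                 # uniform probe points
    idx = rng.choice(n, m, replace=False)                # sampled data points
    u = nn_dists(U, X).sum()                             # probe -> data
    w = nn_dists(X[idx], np.delete(X, idx, 0)).sum()     # data -> other data
    return u / (u + w)
```

For two tight, well-separated blobs, the probe-to-data distances dominate the data-to-data distances, so H comes out close to 1.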
Now, for the experiments, we chose 8 subject systems from Apache and Eclipse.
We collect about 3,000 bug reports and map them to the version control history on GitHub.
Through such mapping we extract the ground truth for the bug reports.
This is a standard process followed by the existing literature.
Now lets expand and generalize the problem a bit.
So far, we have discussed code search within a local codebase.
It could also take place in a large-scale open-source repository such as GitHub.
Now, based on these contexts, there are different challenges in query reformulation.
The local codebase is small, domain specific and organized.
On the contrary, GitHub is huge, cross-domain and very noisy.
So, yes, they need different strategies to suggest queries for them.
Now, I am not going to discuss those studies in detail.
But here is the glimpse.
Developers generally look for relevant code on the web using natural language query.
Please note that we are not talking about simple web search; rather, we are talking about source code repositories such as GitHub.
Now, GitHub provides this result. You see, it tries to match the query keywords with comments and identifiers.
But we are dealing with source code, right? So, we need a source-code-friendly query for a better result.
So, we identify relevant API classes against this natural language query through extensive data mining and data analytics.
And once again, Stack Overflow is our friend in this grand challenge.
Thanks for your time and attention.
Now, I am ready to take your questions.
For the API suggestion, we collect natural language queries from four tutorial sites such as KodeJava.
We collect 300+ queries, and we also collect the ground truth API classes for them.
Then we try to determine whether our approach can suggest appropriate API classes for those queries by mining crowd knowledge from Stack Overflow.
For the query reformulation part, we collect 4K code examples from GitHub and combine them with our ground truth code segments from the tutorial sites.
Then we determine whether our reformulated query actually works or not.
Now let me explain the metrics a bit since we will be using these a lot.
Hit@K is the percentage of the queries for which at least one ground truth is found within the top K results.
MAP combines precision with result position. The details are more complex, which I can discuss later.
MRR is the inverse of the rank of the first ground truth within the results.
QE, which stands for query effectiveness, is essentially the opposite of MRR: the rank of the first correct result, so lower is better.
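These four metrics can be sketched as follows. Here `runs` is a list of (ranked results, ground-truth set) pairs per query, and QE is taken as the rank of the first correct result (lower is better), consistent with the slides.

```python
def hit_at_k(runs, k):
    """Percentage of queries with a ground-truth doc in the top-k results."""
    hits = sum(1 for results, truth in runs if set(results[:k]) & truth)
    return 100.0 * hits / len(runs)

def reciprocal_rank(results, truth, k):
    for rank, doc in enumerate(results[:k], start=1):
        if doc in truth:
            return 1.0 / rank
    return 0.0

def mrr_at_k(runs, k):
    """Mean reciprocal rank of the first correct result."""
    return sum(reciprocal_rank(r, t, k) for r, t in runs) / len(runs)

def avg_precision(results, truth, k):
    hits, total = 0, 0.0
    for rank, doc in enumerate(results[:k], start=1):
        if doc in truth:
            hits += 1
            total += hits / rank
    return total / max(1, min(len(truth), k))

def map_at_k(runs, k):
    """Mean average precision over all queries."""
    return sum(avg_precision(r, t, k) for r, t in runs) / len(runs)

def query_effectiveness(results, truth):
    """QE: rank of the first correct result (lower is better)."""
    for rank, doc in enumerate(results, start=1):
        if doc in truth:
            return rank
    return len(results) + 1
```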
That is, each of the three people -- the customer, the past developer, and JOE -- has their own vocabulary to describe a certain problem or concept.
In fact, the probability that any two people will describe the same problem with the same vocabulary is only 15%-20%.
So, naturally, developer JOE finds it a great challenge to make a connection between the bug report and the buggy code.
This costs development time, money and valuable effort.
Now the question is, why is this so challenging?
The answer is the vocabulary mismatch problem. In fact, this is a common problem for any type of document search.
Here we see both guys are looking at the same object, but they are explaining it differently.
That is, they are both correct from their own perspective, but wrong from the other guy's perspective.
This actually happens with bug reports as well.
The probability that both the customer and the developer will explain the same problem using the same terminology is only 15%.
That is why selecting appropriate keywords from the bug report is very challenging.
Let us see an example.
This is a bug report; this is the title, and this is the description.
Now, developer JOE would use this bug report to localize the bug from source code.
Now, he chooses some ad hoc queries.
Which one is the best do you think, here? PAUSE!
Well, let's see. This one returns the correct result at this position. That means the developer needs to check 1300+ results before reaching the correct result if he tries this query.
… oh… this one is the best.
So, selecting appropriate keywords from the bug report is not that simple.