Programming errors and exceptions are inherent in software development and maintenance, and in today's Internet era, software developers often search the web for working solutions. They use a search engine to retrieve relevant pages, and then look for an appropriate solution by manually going through the pages one by one. However, both checking a page's content against a given exception (and its context) and then working out an appropriate solution are non-trivial tasks. They are even more complex and time-consuming given the bulk of irrelevant (i.e., off-topic) and noisy (e.g., advertisements) content in a web page. In this paper, we propose an IDE-based and context-aware page content recommendation approach that locates and recommends relevant sections of a web page by exploiting the technical details, in particular the context, of an exception encountered in the IDE. A preliminary evaluation with 250 web pages related to 80 programming errors and exceptions, and a comparison against one existing approach, show that the proposed approach is highly promising in terms of precision, recall and F1-measure.
ContentSuggest: Recommendation of Relevant Sections from a Webpage about Errors & Exceptions
1. RECOMMENDING RELEVANT SECTIONS FROM A WEBPAGE ABOUT PROGRAMMING ERRORS AND EXCEPTIONS
Mohammad Masudur Rahman, and Chanchal K. Roy
Software Research Lab, Department of Computer Science
University of Saskatchewan, Canada
25th Center for Advanced Studies Conference (CASCON 2015)
3. SOLVING EXCEPTION (STEP I: QUERY SELECTION)
Selection of traditional search query
Switching to web browser for web search
This query may not be sufficient for most of the exceptions
4. SOLVING EXCEPTION (STEP II: WEB SEARCH)
The browser does NOT know the context (i.e., details) of the exception.
Not much helpful ranking
Forces the developer to SWITCH back and forth between IDE and browser.
Trial and error in searching
19% of development time in web search
Switching is often distracting
5. SOLVING EXCEPTION (STEP III: MAPPING TO PAGE SECTIONS)
Mapping between the exception & relevant page sections: non-trivial
Automated mapping between exception & relevant page sections
IDE-based web page content suggestion for review
6. OUTLINE OF THIS TALK
ContentSuggest Architecture
Metrics & Algorithm
Empirical evaluation & validation (using webpages)
Validation with IR techniques (using SO posts)
Conclusion
13. EXPERIMENTS
Dataset: 80 exceptions + 250 web pages
Manual analysis (25 hours) → Gold sections
Evaluation: ContentSuggest vs. baseline (Sun et al.)
Validation: SO posts from the Stack Overflow crowd, ContentSuggest vs. IR (VSM, LSI)
14. PERFORMANCE METRICS
Precision (P): % of the retrieved content (a) that belongs to the gold content (b) of the page.
Recall (R): % of the gold content (b) that is retrieved (a) by the technique.
F1-measure (F1): Combination of Precision (P) & Recall (R).

P = |LCS(a, b)| / |a|
R = |LCS(a, b)| / |b|
F1 = (2 * P * R) / (P + R)
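The three formulas above can be sketched directly in code. A minimal Python sketch of the word-level LCS metrics (function names here are illustrative, not from the paper):

```python
def lcs_words(a, b):
    """Length of the longest common subsequence of words between two texts."""
    x, y = a.split(), b.split()
    # Classic dynamic-programming LCS over word sequences.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def prf1(retrieved, gold):
    """Precision, recall and F1 based on word-level LCS, as on the slide."""
    overlap = lcs_words(retrieved, gold)
    p = overlap / len(retrieved.split()) if retrieved.split() else 0.0
    r = overlap / len(gold.split()) if gold.split() else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, `prf1("null pointer exception fix", "fix the null pointer exception")` gives P = 3/4 and R = 3/5, since the longest common word subsequence is "null pointer exception".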
15. RESEARCH QUESTIONS (4)
RQ1: How effective is ContentSuggest in
recommending relevant content from a web page?
RQ2: How effective are the proposed metrics in
identifying relevant page content?
RQ3: Can ContentSuggest outperform the baseline
technique?
RQ4: Does ContentSuggest perform better than IR
techniques (VSM, LSI) in identifying relevant content?
17. ANSWERING RQ3: COMPARISON WITH BASELINE TECHNIQUE

Content Extractor                      Metric   SO Pages   Non-SO Pages   All Pages
Sun et al. (SIGIR 2011)                MP       52.63%     38.89%         44.44%
                                       MR       86.49%     41.84%         59.88%
                                       MF       62.57%     34.49%         45.84%
ContentSuggest (Proposed Technique)    MP       92.64%     74.60%         81.96%
                                       MR       74.17%     78.51%         76.74%
                                       MF       80.95%     73.09%         76.30%

[ SO = Stack Overflow, MP = Mean Precision, MR = Mean Recall, MF = Mean F1-measure ]

Performed better for all 3 sets of pages: SO pages, Non-SO pages, and All pages
Performed better for all metrics: precision, recall and F1-measure
21. THREATS TO VALIDITY
Gold content preparation: Despite cross-validation, it may contain subjective bias.
Limited training dataset: Metric weights were trained on a limited dataset.
Usability concern: A full-fledged user study is required to validate the applicability of the technique. A limited study was performed with 6 participants.
22. TAKE-HOME MESSAGE
19% of development time is spent simply in web search (Brandt et al., SIGCHI 2009)
Mapping between information in the IDE and in a web page can be non-trivial and time-consuming.
ContentSuggest automates such mapping in the context of exception handling.
Content Density and Content Relevance are found effective in identifying relevant sections from a web page.
ContentSuggest outperforms one baseline technique and two IR techniques (VSM, LSI).
24. REFERENCES
[1] J. Brandt, P. J. Guo, J. Lewenstein, M. Dontcheva, and S. R. Klemmer. Two Studies of Opportunistic Programming: Interleaving Web Foraging, Learning, and Writing Code. In Proc. SIGCHI, pages 1589-1598, 2009.
[2] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering Traceability Links between Code and Documentation. TSE, 28(10):970-983, 2002.
[3] A. Marcus and J. I. Maletic. Recovering Documentation-to-Source-Code Traceability Links Using Latent Semantic Indexing. In Proc. ICSE, pages 125-135, 2003.
[4] F. Sun, D. Song, and L. Liao. DOM Based Content Extraction via Text Density. In Proc. SIGIR, pages 245-254, 2011.
[5] L. Ponzanelli, A. Bacchelli, and M. Lanza. Seahawk: Stack Overflow in the IDE. In Proc. ICSE, pages 1295-1298, 2013.
[6] M. M. Rahman, S. Yeasmin, and C. K. Roy. Towards a Context-Aware IDE-Based Meta Search Engine for Recommendation about Programming Errors and Exceptions. In Proc. CSMR-WCRE, pages 194-203, 2014.
[7] ContentSuggest Web Portal. URL http://www.usask.ca/~mor543/contentsuggest
[8] C. K. Roy and J. R. Cordy. NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization. In Proc. ICPC, pages 172-181, 2008.
Editor's Notes
Introduce yourself +introductory statements.
Today, I am going to talk about how to identify and extract relevant sections automatically from a web page for programming errors and exceptions.
A programming exception is a very frequent and common experience for software programmers and developers.
Once an exception is encountered, the developer identifies the target source line that triggers the exception.
They may do some debugging.
However, often they go for a web search for a quick solution.
In case of web search, the first step is query selection.
Often people choose the exception message, the very first line of the stack trace, as a search query, although it might not be sufficient most of the time.
The next step is, of course, context switching: they switch from the IDE to the web browser.
The second step is performing the web search itself.
However, this search is often not very productive. Studies showed that developers spend about 19% of their development time in web search.
Besides, there are some practical challenges.
The browser does not know the detailed context of the exception in the IDE, thus the returned pages are not very effective.
There is a constant switching between IDE and web browser, which is time-consuming.
However, the most time-consuming step is probably the mapping between the problem details in the IDE and the information in the web page.
Such mapping is non-trivial.
It also becomes difficult since they are in two different contexts, the IDE and the browser, which are not connected.
In this paper, we provide automation support for such mapping between error details and relevant sections of the web page.
Our tool provides the whole support inside the IDE, which resolves the context-switching problem as well.
This is the outline of today's talk.
I will first focus on the architecture of our proposed tool– ContentSuggest.
Then we will dive into the proposed metrics and the algorithm for relevant content identification.
Then we perform two types of experiments:-- (1) evaluation using web pages, and (2) evaluation using SO posts.
Then, I will conclude the talk with take-home messages.
This is the system architecture of our tool– ContentSuggest.
Suppose we have a search engine embedded in the IDE, and it returns a list of results for an exception encountered.
Now, a developer wants to explore the search results. So, our process starts when the developer clicks a web page.
Once clicked, the page URL is sent to the content extractor module.
The extractor module then collects the page content and analyzes the DOM tree of the page.
It determines content quality and relevance using the exception details from the IDE, and applies the different proposed metrics, which we will discuss in a minute.
The different sections of the page are ranked, and the top-ranked section in terms of content quality & relevance is returned to the IDE.
The IDE then shows that section.
The idea is that the developer would check only the most relevant part of a page.
If satisfied, she can check the whole page for further analysis. This way, she doesn't need to go through a number of pages all the time.
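The flow just described could be sketched roughly as follows; every function name and parameter here is a hypothetical placeholder, not the tool's actual API:

```python
# Hypothetical sketch of the suggestion pipeline described above.
# fetch, parse_dom, and score stand in for the content collection,
# DOM analysis, and metric computation steps of the real tool.

def suggest_section(page_url, exception_context, fetch, parse_dom, score):
    """Return the top-ranked section of a clicked page for an exception."""
    html = fetch(page_url)        # collect the page content
    sections = parse_dom(html)    # analyze the DOM tree into sections
    ranked = sorted(sections,
                    key=lambda s: score(s, exception_context),
                    reverse=True) # rank by content quality & relevance
    return ranked[0] if ranked else None
```

Only the best section is returned to the IDE; the developer can still open the full page if that section is not enough.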
These are the metrics we used for ranking of different sections from a web page.
We consider content density, which refers to the purity of the content. So, if it is only text, then the content is pure.
But if it contains hyperlinks like ads, widgets, or anything that sounds like noise, then the content is noisy.
We also consider content relevance, which indicates whether the section in the page discusses the relevant exception or not.
The content should contain relevant tokens, method calls, or a relevant code snippet, possibly similar to the code in the IDE.
We finally combine these two aspects, content density and content relevance, to derive a content score for each of the sections of the page.
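The two aspects can be illustrated with a small sketch. The formulas and the weight below are assumptions for illustration only, not the paper's exact metric definitions:

```python
# Illustrative sketch of the two aspects combined into a content score.
# The exact formulas and the weight alpha are assumptions, not the
# paper's actual definitions.

def content_density(text_len, link_text_len):
    """Fraction of a node's text that is NOT hyperlink (ad/widget) text."""
    if text_len == 0:
        return 0.0
    return (text_len - link_text_len) / text_len

def content_relevance(node_tokens, exception_tokens):
    """Token overlap between a node's text and the exception context."""
    if not node_tokens:
        return 0.0
    return len(set(node_tokens) & set(exception_tokens)) / len(set(node_tokens))

def content_score(density, relevance, alpha=0.5):
    """Weighted combination of the two aspects (alpha is an assumed weight)."""
    return alpha * density + (1 - alpha) * relevance
```

A section that is pure text but off-topic, or on-topic but buried in links, scores lower than one that is both pure and relevant.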
Now, lets take a look into our technique that determines the most relevant section from a given web page.
Suppose, the web page has a structure like this, and our task is to identify the most relevant node.
This is the DOM structure of an HTML page, which means a document can be represented as a tree.
The first step is to delete the non-text items such as style, script or img tags.
The next step is to determine the content score for each of the nodes of the tree.
These are the direct children of the body node.
Now, delete any child of body whose score falls below a threshold; here, we use the score of body as the threshold.
For example, this one is deleted.
Now, we go through each of the remaining children and look for the maximum score holder.
For example, this TD contains the purest and most relevant content for the exception.
What our technique does is mark that TD as content and the rest of the siblings as noise.
And this process backtracks up to the body node.
Thus, we get a DOM tree where each node is annotated as either content or noise.
Here, the bold colored nodes are content nodes.
Now we just keep the content nodes and discard the noisy nodes.
This way, we keep the page structure but isolate the purest and most relevant content for the encountered exception.
This is the most relevant section from the web page.
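The walk described in these notes can be sketched roughly as follows. The Node class, the scores, and the greedy descent are simplified assumptions; the real tool operates on an HTML DOM using the density and relevance metrics:

```python
# Simplified sketch of the content/noise annotation walk described above.
# Scores are given directly; the real tool derives them from content
# density and relevance on an HTML DOM.

class Node:
    def __init__(self, tag, score=0.0, children=None):
        self.tag = tag
        self.score = score
        self.children = children or []
        self.is_content = False

def mark_content(node):
    """Mark the best-scoring child chain as content; siblings stay noise."""
    node.is_content = True
    if not node.children:
        return
    # Prune children scoring below the parent's score (the threshold).
    kept = [c for c in node.children if c.score >= node.score]
    if not kept:
        return
    # The maximum score holder is content; its siblings are noise.
    best = max(kept, key=lambda c: c.score)
    mark_content(best)

def extract(node):
    """Collect the tags on the content path (the relevant section)."""
    path = [node.tag] if node.is_content else []
    for c in node.children:
        path += extract(c)
    return path
```

Running this on a toy tree with a high-scoring TD under a table keeps only the body → table → tr → td chain and discards the sibling nodes as noise.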
Now comes the experiments.
We conducted experiments with 250 web pages related to 80 programming exceptions.
The gold set, the most relevant sections of each page, was created from 25 hours of manual work, and it is used for both evaluation and validation purposes.
We also conduct experiments with Stack Overflow posts and compare our performance with 2 information retrieval techniques.
We used these three performance metrics for evaluation and validation.
The key idea here is the longest common subsequence of words.
That means we compute the word overlap between the retrieved content and the gold content to derive these metrics.
The same idea was used by earlier techniques in the relevant literature.
We try to answer these four research questions from our experiments.
How does our technique perform in general?
How effective are our proposed metrics– content density and content relevance ?
Can our technique perform better than the baseline technique from the literature?
Can it perform better than established IR techniques such as the Vector Space Model and Latent Semantic Indexing?
Here we note that when only content density is considered, the performance is not very good.
Content relevance also does not work well alone.
However, when both metrics are combined, the statistics are interesting.
We get about 82% precision, 77% recall and about 76% F1-measure, which are promising according to literature.
We compared with a closely-related technique from literature.
We also divide the dataset into different sets, and conduct experiments.
Our technique performs better than the baseline technique for all sets and all metrics.
One possible explanation for this performance is probably our content relevance paradigm.
We also used the box plot to analyze the comparison results.
We note that our performance measures are undoubtedly higher than the competing technique.
Our measures have less variance, and the medians are close to 90% or above.
On the other hand, the baseline technique has relatively lower measures.
Since the gold set in the first experiment was manually developed, it might contain some subjective bias.
We thus also conducted experiments using Stack Overflow posts, where we try to find out the accepted solution as well as the top-voted answer from the page.
Then we also compare with two established IR techniques for the same task.
We found that our technique performs relatively better than those 2 techniques.
Our technique performs better especially in recall, and therefore in F-measure metrics.
We do more investigation using the box plot.
Here, we see our measures have relatively more variance, but better performance in recall and f-measure.
Mapping such posts from an SO page is an especially real challenge, since it involves a lot of factors. However, our technique is found relatively promising compared to the traditional alternatives we have so far.
We identified three threats to the validity of our findings.
The gold set might contain some subjective bias, and it is hard to remove it completely. However, we performed a second experiment to handle this threat.
The usability of the technique is not properly evaluated yet. However, we did that evaluation on a limited scale.
So, these are take-home messages.
Developers spend about 19% of their time on web search, and mapping information from the IDE to the web browser can be a real challenge.
Our technique addresses that concern, and maps exception information from IDE to a web page.
We consider purity and relevance of the content for extracting the most relevant sections from a page.
Our technique performs better than a baseline technique and 2 IR techniques.
The tool can be found online for testing.