The document proposes a web page segmentation algorithm called VisHue to identify semantic segments in web pages. VisHue uses visual and heuristics-based features to generate a hierarchical structure of segments. It aims to improve query interfaces by allowing users to query segments treated as attributes. The performance of querying segments (Query by Tag) is evaluated using the MedlinePlus medical encyclopedia, showing benefits over traditional keyword search. The algorithm and querying interface are presented as promising ways to facilitate deep querying of web page content.
Web Page Segmentation for Querying Healthcare Repository
to Advance Knowledge for Humanity
Aastha Madaan, Wanming Chu, Subhash Bhalla
University of Aizu
VisHue: Web Page Segmentation for an Improved Query Interface for MedlinePlus Medical Encyclopedia
11/12/16
Outline
1. Introduction
2. Background
a) Hierarchical structure
b) Page-Level Segmentation
3. Web Page Segmentation Algorithms
a) Features
b) Main focus
c) Comparison
4. The Proposal: The VisHue Algorithm
5. Query by Segment
6. Performance Analysis
7. Discussions
8. Summary and Conclusions
1. Introduction
The WWW is a common source of information and the largest one
Deep Querying → Gaining importance
Understanding web page semantics → Improves the user's search experience
Within a web page → Identify semantic groups
Important → Discovering these semantic blocks
1(i). The Statement [1]
A. Large variety of HTML pages → suitable query and search?
B. Basic Requirements → searching and querying
Simple querying and searching → semantic querying and searching
C. Significant → Recognize the semantic and coherent segments
Page-level → Segment Level
D. Case Example → Medical Encyclopedia
MedlinePlus → various choices of medical encyclopedias
1(i). The Statement [2]
[Figure: UML class diagram]
2. Background: MedlinePlus
Web page:
i. Relevant content ii. Irrelevant content
a. Relevant Content:
i. Topic headings ii. Topic wise contents
b. Irrelevant Content:
Navigation bars, header, footer, advertisements
Headings → Identify hierarchical structure
Distinct blocks → What a user’s perception identifies
Main focus → Skilled and Semi-skilled users
i. Assumption → Headings → Query attributes
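The idea that headings identify structure while navigation bars, headers and footers are irrelevant can be sketched as follows. This is a minimal illustration using only the heading heuristic, not the VisHue algorithm itself (which also uses visual features that are not described in these slides). The names `HeadingSegmenter`, `segment` and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags stand in for "irrelevant content".
IRRELEVANT = {"nav", "header", "footer", "aside", "script", "style"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingSegmenter(HTMLParser):
    """Collect one [level, label, content] entry per heading."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._skip_depth = 0      # > 0 while inside an irrelevant region
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in IRRELEVANT:
            self._skip_depth += 1
        elif tag in HEADINGS and self._skip_depth == 0:
            self._in_heading = True
            self.segments.append([int(tag[1]), "", ""])

    def handle_endtag(self, tag):
        if tag in IRRELEVANT and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth or not self.segments:
            return
        # Text inside a heading extends the label, other text the content.
        idx = 1 if self._in_heading else 2
        seg = self.segments[-1]
        seg[idx] = (seg[idx] + " " + text).strip()


def segment(html):
    parser = HeadingSegmenter()
    parser.feed(html)
    parser.close()
    return [tuple(s) for s in parser.segments]


page = """
<nav>Home | Health Topics</nav>
<h1>Asthma</h1><p>Overview text.</p>
<h2>Causes</h2><p>Airway inflammation.</p>
<h2>Symptoms</h2><p>Wheezing and cough.</p>
<footer>Contact us</footer>
"""
segments = segment(page)
```

Each resulting triple carries the heading level, the heading text (a candidate query attribute) and the text that follows it; the navigation bar and footer are dropped.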
2(a). Hierarchical Structure
1. Hierarchical structure → logical structure within the page (document)
2. Indicates the binary relationships (belongingness) → between a pair of segments
3. Accurate Hierarchical Representation → User Level Query Attributes (in segments)
4. Proposed hierarchical structure → based on domain knowledge (skilled and semi-skilled users)
Captures the user's perception
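The binary belongingness relationship between segments can be illustrated with a small tree built from heading levels. This is a sketch under the assumption that heading depth alone decides nesting; `Node`, `build_hierarchy` and `belongs_to` are hypothetical names, and the proposed structure in the slides additionally relies on domain knowledge.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    level: int
    children: list = field(default_factory=list)


def build_hierarchy(headings):
    """headings: list of (level, label) in document order, level >= 1."""
    root = Node("ROOT", 0)
    stack = [root]
    for level, label in headings:
        # Pop until the top of the stack is a strictly shallower heading.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(label, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def belongs_to(root):
    """Return (child label, parent label) pairs for the whole tree."""
    pairs = []
    todo = [root]
    while todo:
        parent = todo.pop()
        for child in parent.children:
            pairs.append((child.label, parent.label))
            todo.append(child)
    return pairs


root = build_hierarchy(
    [(1, "Asthma"), (2, "Causes"), (2, "Symptoms"), (3, "Emergency symptoms")]
)
pairs = sorted(belongs_to(root))
```

Each pair states that one segment belongs to another, which is the relationship the hierarchical structure is meant to record.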
2(a).(i). Segmentation ↔ Semantic Query
[Figure: common Web user → semantic query and search (in future)]
2(b). Page-Level Segmentation
Definition
A segment is a self-contained logical region within a Web page that is:
(i) not nested within any other segment;
(ii) represented by a pair (l; c)
Where, l → label of the segment
c → portion of text of the segment [1].
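The pair (l; c) maps naturally onto a small record type. The sketch below is only an illustration of the definition; `Segment` and `as_attributes` are hypothetical names, and concatenating the text of repeated labels is an assumption.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    label: str    # l: label of the segment
    content: str  # c: portion of text of the segment


def as_attributes(segments):
    """Map each label to its text so labels can serve as query attributes."""
    attrs = {}
    for seg in segments:
        # Assumption: text of repeated labels is concatenated.
        attrs[seg.label] = (attrs.get(seg.label, "") + " " + seg.content).strip()
    return attrs


page = [
    Segment("Causes", "Airway inflammation."),
    Segment("Symptoms", "Wheezing and cough."),
]
attributes = as_attributes(page)
```

Because page-level segments are not nested, a flat list of such pairs is enough to represent one page.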
3. Segmentation Algorithms
i. History → “segmentation” traces back to the year 2001 (continues till 2011)
ii. Various application domains
iii. Various techniques for segmenting
iv. Various terminologies used
v. Proposed → MedlinePlus → items of user’s focus → Query Attributes
3(a). Features of Segmentation Algorithm
A. Match and Identify → a user’s points of focus
B. Discover informative segments →
i. Better search and query
ii. Segments become query-able attributes
iii. Skilled users aim to query the informative areas (only)
C. Generate → True hierarchical structure
D. Segmentation Process → Low space and time complexity
3(b). Main Focus
Find an algorithm best suited to:
1. Generate hierarchical structure
2. Convert segments to attributes in database
3. Facilitate in-depth querying
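Converting segments to database attributes can be sketched with an in-memory SQLite table holding one row per (page, label) pair. The table name `segment`, its columns and the sample pages are assumptions made for illustration; the slides do not specify the actual schema.

```python
import sqlite3


def store_pages(conn, pages):
    """pages: {page title: {segment label: segment text}}."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS segment ("
        "page TEXT NOT NULL, label TEXT NOT NULL, content TEXT NOT NULL, "
        "PRIMARY KEY (page, label))"
    )
    rows = [
        (title, label, text)
        for title, segs in pages.items()
        for label, text in segs.items()
    ]
    conn.executemany("INSERT INTO segment VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_pages(
    conn,
    {
        "Asthma": {"Causes": "Airway inflammation.", "Symptoms": "Wheezing and cough."},
        "Common cold": {"Causes": "Viral infection.", "Symptoms": "Runny nose and cough."},
    },
)

# In-depth query: restrict the keyword to a single segment label.
hits = [
    row[0]
    for row in conn.execute(
        "SELECT page FROM segment WHERE label = ? AND content LIKE ? ORDER BY page",
        ("Symptoms", "%cough%"),
    )
]
```

Storing the label as data rather than as a fixed column keeps the sketch independent of which headings a given page contains.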
3(b).(i). Segmentation Methods ↔ Web Technologies
3(b).(ii). Classification of Algorithms
3(b).(iii). Timeline ↔ Techniques
Technique                         Algorithm          Year
Template Detection                [9], [6]           2002, 2007
DOM-Node Recognition              [8], [11], [10]    2001, 2002, 2006
Visual-DOM based Rendering        [2]                2003
Visual-Heuristics based Method    Proposed           -
Graph-theoretic Method            [3]                2008
Linguistics based Method          [7]                2008
Image of the Web Page             [4], [5]           2010, 2009
Site-Oriented Method              [1]                2011
3(c). Comparison
3(c).(i). Main Focus
3(c).(ii). Comparison: Vision-based Methods
3(c).(iii). Content Structure by VisHue
4. The Proposal: VisHue Algorithm
4.(i). Query Interfaces
Querying vs. Searching
Searching: Recent Trends
1. Object based search
2. Block based search
3. Entity based search
Querying: Recent Trends
Very few efforts have been made
5. Query by Segment
Query by Segment as Query by Tag (Heading) → QBT
Based on → Content Structure (VisHue algorithm): Query by Attributes
MedlinePlus medical encyclopedia → 3886 web pages
Target → Focused and explicit querying
i. Beneficial → skilled and semi-skilled users
ii. Medical encyclopedia → result of → years of efforts by experts
5.(i). The QBT Interface
[Figure: traditional search on MedlinePlus medical encyclopedia vs. QBT interface; labels shown: Title, Causes, Symptoms, Post-Care, …, DB]
5.(ii). QBT Interface ↔ Hierarchical Structure
Labels → Query Attributes
QBT interface: Search and Query
Child nodes → search attributes
Left siblings → limit the scope of search of right siblings in the interface
Segments → Attributes for Deep Query over all pages of MedlinePlus
6. Performance Analysis
i. Qualitative comparison with traditional keyword search
ii. Query formulation and interpretation
iii. Quantitative performance analysis of the interface
6.(i). QBT vs. Keyword Search
6.(ii). Query Formulation: A Comparison
6.(iii). Query Example
Query 1: Cases where the patient has hypertension but not high blood pressure
QBT query:
Symptoms: “Hypertension”
Symptoms: NOT “High Blood Pressure”
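One way to read the QBT query above is as a conjunction of per-segment phrase conditions, some of them negated. The sketch below evaluates such a query over toy data; the matching rule (case-insensitive substring), the function names and the pages "Page A" to "Page C" are assumptions for illustration, not the interface's actual implementation or real MedlinePlus content.

```python
def matches(page, conditions):
    """page: {label: text}; conditions: list of (label, phrase, negated)."""
    for label, phrase, negated in conditions:
        found = phrase.lower() in page.get(label, "").lower()
        # A positive condition needs the phrase present, a NOT condition absent.
        if found == negated:
            return False
    return True


def run_query(pages, conditions):
    """Return the titles of all pages that satisfy every condition."""
    return sorted(title for title, page in pages.items() if matches(page, conditions))


# Toy pages, invented for this example.
pages = {
    "Page A": {"Symptoms": "Hypertension and headache."},
    "Page B": {"Symptoms": "Hypertension (high blood pressure) and dizziness."},
    "Page C": {"Symptoms": "Fever."},
}

query = [
    ("Symptoms", "Hypertension", False),
    ("Symptoms", "High Blood Pressure", True),  # Symptoms: NOT "High Blood Pressure"
]
result = run_query(pages, query)
```

Under this reading the query is purely textual: it keeps pages whose Symptoms segment mentions the first phrase and does not mention the second.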
6.(iv). Query Attributes
6.(v). Query Results
6.(vi). Quantitative Performance Analysis
QBT Query
Symptom: “Hypertension”
Symptom: NOT “High Blood Pressure”
Before Procedure: “Stop”
After Procedure: “Normal”
Cause: “High Blood Pressure”
Symptom: “Heart Attack”
Food Source: “Fish”
Side Effect: “Poisoning”
7. Discussions
Content fragments as perceived by skilled and semi-skilled domain users → determined by web page segmentation process
Proposed effort → Formulating a generic algorithm based on heuristic design rules and visual features
The QBT interface → Query over user-identified segments (attributes)
Aim → Convert MedlinePlus pages → DB
Contention → web page → good source → easy-to-use new query language interface for segments
8. Summary and Conclusions
A. Heuristics + visual features based segmentation → turning point:
i. Provides → independent solution
ii. Improves → Query interfaces for chosen domain
B. The medical domain → need to make the information accessible to the end-users
C. Query by Segment or Tag (QBT) → An attempt
i. Aim → return query-able attributes to the users
References
1. "A Site Oriented Method for Segmenting Web Pages", David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga, SIGIR'11, July 24-28, 2011.
2. "Extracting Content Structure for Web Pages based on Visual Representation", Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xian, China, April 23-25, 2003, Proceedings (2003), pp. 596-596.
3. "Graph-Theoretic Approach to Webpage Segmentation", Deepayan Chakrabarti, Ravi Kumar, Kunal Punera, WWW 2008 / Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China.
4. "A segmentation method for web page analysis using shrinking and dividing", Jiuxin Cao, Bo Mao and Junzhou Luo, International Journal of Parallel, Emergent and Distributed Systems, 25:2, 93-104, 2010.
5. "Web Page Layout via Visual Segmentation", Ayelet Pnueli, Ruth Bergman, Sagi Schein, Omer Barkol, HP Laboratories, 2009.
6. "Page-level template detection via isotonic smoothing", D. Chakrabarti, R. Kumar, and K. Punera, In 16th WWW, pages 61–70, 2007.
7. "A Densitometric Approach to Web Page Segmentation", Christian Kohlschütter, Wolfgang Nejdl, CIKM'08, October 26–30, 2008.
8. "HTML Page Analysis Based on Visual Cues", Yudong Yang and HongJiang Zhang, IEEE 2001.
9. "Template Detection via Data Mining and its Applications", Ziv Bar-Yossef, Sridhar Rajagopalan, In Proceedings of WWW'02, May 7–11, 2002, Honolulu, Hawaii, USA.
10. "DeSeA: A Page Segmentation based Algorithm for Information Extraction", He Juan, Gao Zhiqiang, Xu Hui, Qu Yuzhong, Proceedings of the First International Conference on Semantics, Knowledge, and Grid (SKG 2005).
11. "Reverse Engineering for Web Data: From Visual to Semantic Structures", Christina Yip Chung, Michael Gertz, Neel Sundaresan, In Proceedings of the 18th International Conference on Data Engineering (ICDE'02).
Thank you
Questions