to Advance Knowledge for Humanity
Aastha Madaan, Wanming Chu, Subhash Bhalla
University of Aizu
VisHue: Web Page Segmentation for an Improved Query Interface for MedlinePlus Medical Encyclopedia
11/12/16
Outline
1. Introduction
2. Background
a) Hierarchical structure
b) Page-Level Segmentation
3. Web Page segmentation Algorithms
a) Features
b) Main focus
c) Comparison
4. The Proposal: The VisHue Algorithm
5. Query by Segment
6. Performance Analysis
7. Discussions
8. Summary and Conclusions
1. Introduction
• The WWW is the largest and most commonly used source of information
• Deep querying → gaining importance
• Understanding web page semantics → improves the user’s search experience
• Within a web page → identify semantic groups
• Important → discovering these semantic blocks
1(i). The Statement [1]
A. Large variety of HTML pages → suitable query and search?
B. Basic requirements → searching and querying
 • Simple querying and searching → semantic querying and searching
C. Significant → recognize the semantic and coherent segments
 • Page level → segment level
D. Case example → medical encyclopedia
 • MedlinePlus → various choices of medical encyclopedias
1(i). The Statement [2]
[Figure: UML class diagram]
2. Background: MedlinePlus
• Web page:
  i. Relevant content
  ii. Irrelevant content
  a. Relevant content:
     i. Topic headings
     ii. Topic-wise contents
  b. Irrelevant content:
     Navigation bars, headers, footers, advertisements
• Headings → identify the hierarchical structure
• Distinct blocks → what a user’s perception identifies
• Main focus → skilled and semi-skilled users
  i. Assumption → headings → query attributes
2(a). Hierarchical Structure
1. Hierarchical structure → logical structure within the page (document)
2. Indicates the binary relationships (belongingness) → between a pair of segments
3. Accurate hierarchical representation → user-level query attributes (in segments)
4. Proposed hierarchical structure → based on domain knowledge (skilled and semi-skilled users)
 • Captures users’ perception
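The belongingness relation between pairs of segments can be illustrated with a small sketch. It assumes that each heading already carries a nesting level (for example 1 for the page title and 2 for a topic heading). The `Segment` class, the function names and the sample headings are hypothetical, and this is not the VisHue algorithm itself, which also relies on visual features.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One node of the hierarchy: a heading label plus its child segments."""
    label: str
    level: int
    children: list["Segment"] = field(default_factory=list)


def build_hierarchy(headings: list[tuple[int, str]]) -> Segment:
    """Nest headings by level: each heading belongs to the nearest shallower heading before it."""
    root = Segment(label="page", level=0)
    stack = [root]
    for level, label in headings:
        # Pop until the top of the stack is strictly shallower, i.e. a valid parent.
        while stack[-1].level >= level:
            stack.pop()
        node = Segment(label=label, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def belongs_to(node: Segment) -> list[tuple[str, str]]:
    """List the binary (child, parent) relations in document order."""
    pairs = []
    for child in node.children:
        pairs.append((child.label, node.label))
        pairs.extend(belongs_to(child))
    return pairs


# Hypothetical headings of one encyclopedia page, given as (level, label); levels start at 1.
headings = [
    (1, "Hypertension"),
    (2, "Causes"),
    (2, "Symptoms"),
    (3, "Emergency symptoms"),
    (2, "Treatment"),
]
tree = build_hierarchy(headings)
relations = belongs_to(tree)
```

Each pair in `relations` is one binary relationship between two segments, so the labels on the child side are candidates for user-level query attributes.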
2(a).(i). Segmentation ↔ Semantic Query
[Figure: common Web user → semantic query and search (in future)]
2(b). Page-Level Segmentation
• Definition
A segment is a self-contained logical region within a Web page that is:
(i) not nested within any other segment;
(ii) represented by a pair (l, c),
where l → label of the segment
      c → portion of text of the segment [1].
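A minimal sketch of the (l, c) representation, assuming that HTML heading tags supply the labels and that the text up to the next heading is the content. The class name, the skipped-tag list and the sample page are assumptions made for illustration; the method in [1] and the VisHue algorithm use richer cues than heading tags alone.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: these containers hold irrelevant content (navigation, footers, ...).
SKIPPED_TAGS = {"nav", "header", "footer", "script", "style"}


class SegmentExtractor(HTMLParser):
    """Collect flat (label, content) pairs: a heading opens a segment, later text fills it."""

    def __init__(self) -> None:
        super().__init__()
        self.segments: list[tuple[str, str]] = []
        self._label_parts: list[str] = []
        self._content_parts: list[str] = []
        self._in_heading = False
        self._skip_depth = 0
        self._has_open_segment = False

    def _flush(self) -> None:
        """Store the segment collected so far, with whitespace normalised."""
        if self._has_open_segment:
            label = " ".join(" ".join(self._label_parts).split())
            content = " ".join(" ".join(self._content_parts).split())
            self.segments.append((label, content))
        self._label_parts, self._content_parts = [], []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in HEADING_TAGS and self._skip_depth == 0:
            self._flush()  # a new heading closes the previous segment
            self._in_heading = True
            self._has_open_segment = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_heading:
            self._label_parts.append(data)
        elif self._has_open_segment:
            self._content_parts.append(data)

    def close(self):
        super().close()
        self._flush()
        self._has_open_segment = False


# Hypothetical page fragment, not taken from MedlinePlus.
page = """
<nav><a href="/">Home</a></nav>
<h2>Causes</h2><p>High salt intake.</p>
<h2>Symptoms</h2><p>Headache.</p><p>Nosebleeds.</p>
<footer>Contact us</footer>
"""
extractor = SegmentExtractor()
extractor.feed(page)
extractor.close()
segments = extractor.segments
```

The resulting list is flat, so no pair is nested within another, which matches condition (i) of the definition.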
3. Segmentation algorithms
i. History → work on “segmentation” traces back to 2001 (and continues through 2011)
ii. Various application domains
iii. Various techniques for segmenting
iv. Various terminologies used
v. Proposed → MedlinePlus → items of user’s focus → query attributes
3(a). Features of Segmentation Algorithm
A. Match and identify → a user’s points of focus
B. Discover informative segments →
 i. Better search and query
 ii. Segments become query-able attributes
 iii. Skilled users aim to query the informative areas (only)
C. Generate → true hierarchical structure
D. Segmentation process → low space and time complexity
3(b). Main Focus
Find an algorithm best suited to:
1. Generate the hierarchical structure
2. Convert segments to attributes in a database
3. Facilitate in-depth querying
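Step 2 can be sketched with an in-memory SQLite database. The table layout, the `query_attribute` helper and the sample texts are assumptions made for illustration; the slides do not state which database or schema the authors used.

```python
import sqlite3

# Hypothetical segmentation output: page title -> {segment label: segment text}.
pages = {
    "Hypertension": {
        "Causes": "High salt intake, obesity.",
        "Symptoms": "Headache, nosebleeds.",
    },
    "Food poisoning": {
        "Causes": "Contaminated fish or meat.",
        "Symptoms": "Nausea, vomiting.",
    },
}

conn = sqlite3.connect(":memory:")
# One row per (page, label) pair keeps the schema stable when pages use different headings.
conn.execute(
    "CREATE TABLE segment (title TEXT, label TEXT, content TEXT, PRIMARY KEY (title, label))"
)
conn.executemany(
    "INSERT INTO segment (title, label, content) VALUES (?, ?, ?)",
    [(title, label, text) for title, segs in pages.items() for label, text in segs.items()],
)


def query_attribute(label: str, keyword: str) -> list[str]:
    """Return titles of pages whose segment `label` mentions `keyword`.

    LIKE is case-insensitive for ASCII text in SQLite; wildcard characters
    inside `keyword` are not escaped in this sketch.
    """
    rows = conn.execute(
        "SELECT title FROM segment WHERE label = ? AND content LIKE ? ORDER BY title",
        (label, f"%{keyword}%"),
    )
    return [title for (title,) in rows]


fish_pages = query_attribute("Causes", "fish")
```

Here the segment label acts as the attribute name, so a keyword is matched only inside the chosen segment rather than anywhere on the page.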
3(b).(i). Segmentation Methods ↔ Web Technologies
3(b).(ii). Classification of Algorithms
3(b).(iii). Timeline ↔ Techniques
Technique                        Algorithm          Year
Template Detection               [9], [6]           2002, 2007
DOM-Node Recognition             [8], [11], [10]    2001, 2002, 2006
Visual-DOM based Rendering       [2]                2003
Visual-Heuristics based Method   Proposed           -
Graph-theoretic Method           [3]                2008
Linguistics based Method         [7]                2008
Image of the Web Page            [4], [5]           2010, 2009
Site-Oriented Method             [1]                2011
3(c). Comparison
3(c).(i). Main Focus
3(c).(ii). Comparison: Vision-based Methods
3(c).(iii). Content Structure by VisHue
4. The Proposal: VisHue Algorithm
4.(i). Query Interfaces
Querying vs. Searching
Searching: Recent Trends
1. Object-based search
2. Block-based search
3. Entity-based search
Querying: Recent Trends
• Very few efforts have been made
5. Query by Segment
• Query by Segment as Query by Tag (heading) → QBT
• Based on → content structure (VisHue algorithm): query by attributes
• MedlinePlus medical encyclopedia → 3886 web pages
• Target → focused and explicit querying
  i. Beneficial → skilled and semi-skilled users
  ii. Medical encyclopedia → result of → years of effort by experts
5.(i). The QBT interface
[Figure: traditional search on the MedlinePlus medical encyclopedia compared with the QBT interface, which queries a DB with the attributes Title, Causes, Symptoms, Post-Care, …]
5.(ii). QBT Interface ↔ Hierarchical Structure
• Labels → query attributes
• QBT interface: search and query
• Child nodes → search attributes
• Left siblings → limit the search scope of right siblings in the interface
• Segments → attributes for deep query over all pages of MedlinePlus
6. Performance Analysis
i. Qualitative comparison with traditional keyword search
ii. Query formulation and interpretation
iii. Quantitative performance analysis of the interface
6.(i). QBT vs. Keyword Search
6.(ii). Query Formulation: A Comparison
6.(iii). Query Example
Query 1: Cases where the patient has hypertension but not high blood pressure
QBT query :
Symptoms: “Hypertension”
Symptoms: NOT “High Blood Pressure”
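How such a query might be evaluated can be sketched as follows. The sketch assumes that each line is a case-insensitive phrase match inside one segment and that the lines are combined with AND; the slides do not spell out these semantics, and the `Condition` class, the `run_query` function and the sample pages are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One QBT-style line: an attribute (segment label), a phrase, and an optional NOT."""
    attribute: str
    phrase: str
    negated: bool = False

    def matches(self, page: dict[str, str]) -> bool:
        # Assumption: a phrase matches when it occurs anywhere in the segment text.
        found = self.phrase.lower() in page.get(self.attribute, "").lower()
        return not found if self.negated else found


def run_query(pages: dict[str, dict[str, str]], conditions: list[Condition]) -> list[str]:
    """Return titles of pages that satisfy every condition (conditions are AND-ed)."""
    return [
        title
        for title, page in pages.items()
        if all(condition.matches(page) for condition in conditions)
    ]


# Hypothetical segment texts; not taken from MedlinePlus.
pages = {
    "Page A": {"Symptoms": "Hypertension and dizziness."},
    "Page B": {"Symptoms": "Hypertension, also called high blood pressure."},
    "Page C": {"Symptoms": "Fever."},
}
query_1 = [
    Condition("Symptoms", "Hypertension"),
    Condition("Symptoms", "High Blood Pressure", negated=True),
]
result = run_query(pages, query_1)
```

Under these assumptions only "Page A" is returned: "Page B" is excluded by the NOT line and "Page C" never mentions the first phrase.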
6.(iv). Query Attributes
6.(v). Query Results
6.(vi). Quantitative Performance Analysis
QBT Query
Symptom: “Hypertension”
Symptom: NOT “High Blood Pressure”
Before Procedure: “Stop”
After Procedure: “Normal”
Cause: “High Blood Pressure”
Symptom: “Heart Attack”
Food Source: “Fish”
Side Effect: “Poisoning”
7. Discussions
• Content fragments as perceived by skilled and semi-skilled domain users → determined by the web page segmentation process
• Proposed effort → formulating a generic algorithm based on heuristic design rules and visual features
• The QBT interface → query over user-identified segments (attributes)
• Aim → convert MedlinePlus pages → DB
• Contention → web pages are a good source → easy-to-use new query language interface for segments
8. Summary and Conclusions
A. Heuristics + visual features based segmentation → turning point:
   i. Provides → independent solution
   ii. Improves → query interfaces for the chosen domain
B. The medical domain → need to make the information accessible to end users
C. Query by Segment or Tag (QBT) → an attempt
   i. Aim → return query-able attributes to the users
References
1. “A Site Oriented Method for Segmenting Web Pages”, David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga, SIGIR’11, July 24-28, 2011.
2. “Extracting Content Structure for Web Pages based on Visual Representation”, Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma, Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xian, China, April 23-25, 2003, Proceedings, pp. 596-596.
3. “Graph-Theoretic Approach to Webpage Segmentation”, Deepayan Chakrabarti, Ravi Kumar, Kunal Punera, WWW 2008, Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China.
4. “A segmentation method for web page analysis using shrinking and dividing”, Jiuxin Cao, Bo Mao, Junzhou Luo, International Journal of Parallel, Emergent and Distributed Systems, 25:2, 93-104, 2010.
5. “Web Page Layout via Visual Segmentation”, Ayelet Pnueli, Ruth Bergman, Sagi Schein, Omer Barkol, HP Laboratories, 2009.
6. “Page-level template detection via isotonic smoothing”, D. Chakrabarti, R. Kumar, K. Punera, In 16th WWW, pages 61–70, 2007.
7. “A Densitometric Approach to Web Page Segmentation”, Christian Kohlschütter, Wolfgang Nejdl, CIKM’08, October 26–30, 2008.
8. “HTML Page Analysis Based on Visual Cues”, Yudong Yang, HongJiang Zhang, IEEE, 2001.
9. “Template Detection via Data Mining and its Applications”, Ziv Bar Yossef, Sridhar Rajagopalan, In Proceedings of WWW’02, May 7–11, 2002, Honolulu, Hawaii, USA.
10. “DeSeA: A Page Segmentation based Algorithm for Information Extraction”, He Juan, Gao Zhiqiang, Xu Hui, Qu Yuzhong, Proceedings of the First International Conference on Semantics, Knowledge, and Grid (SKG 2005).
11. “Reverse Engineering for Web Data: From Visual to Semantic Structures”, Christina Yip Chung, Michael Gertz, Neel Sundaresan, In Proceedings of the 18th International Conference on Data Engineering (ICDE’02).
Thank you
Questions

More Related Content

Similar to Web Page Segmentation for Querying Healthcare Repository

Smart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web HarvestingSmart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web Harvestingpaperpublications3
 
Research Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities ClassResearch Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities ClassAaron Collie
 
User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)amytaylor
 
Michalis Vafopoulos: Initial thoughts about existence in the Web
Michalis Vafopoulos: Initial thoughts about existence in the WebMichalis Vafopoulos: Initial thoughts about existence in the Web
Michalis Vafopoulos: Initial thoughts about existence in the WebPhiloWeb
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 
Adaptable Information Workshop slides
Adaptable Information Workshop slidesAdaptable Information Workshop slides
Adaptable Information Workshop slidesLouis Rosenfeld
 
Integrating content search with structure analysis for hypermedia retrieval a...
Integrating content search with structure analysis for hypermedia retrieval a...Integrating content search with structure analysis for hypermedia retrieval a...
Integrating content search with structure analysis for hypermedia retrieval a...unyil96
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Sören Auer
 
A Multimodal Approach to Incremental User Profile Building
A Multimodal Approach to Incremental User Profile Building A Multimodal Approach to Incremental User Profile Building
A Multimodal Approach to Incremental User Profile Building dannyijwest
 
Library discovery: past, present and some futures
Library discovery: past, present and some futuresLibrary discovery: past, present and some futures
Library discovery: past, present and some futureslisld
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs inventionjournals
 
LIS688_Group1
LIS688_Group1 LIS688_Group1
LIS688_Group1 e_chae
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Scienceresearchinventy
 
Mapping the content ecosystem
Mapping the content ecosystemMapping the content ecosystem
Mapping the content ecosystemRob Hanna, ECMs
 

Similar to Web Page Segmentation for Querying Healthcare Repository (20)

Smart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web HarvestingSmart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web Harvesting
 
Research Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities ClassResearch Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities Class
 
User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)
 
Semantic web Santhosh N Basavarajappa
Semantic web   Santhosh N BasavarajappaSemantic web   Santhosh N Basavarajappa
Semantic web Santhosh N Basavarajappa
 
Hahn "Wikidata as a hub to library linked data re-use"
Hahn "Wikidata as a hub to library linked data re-use"Hahn "Wikidata as a hub to library linked data re-use"
Hahn "Wikidata as a hub to library linked data re-use"
 
Michalis Vafopoulos: Initial thoughts about existence in the Web
Michalis Vafopoulos: Initial thoughts about existence in the WebMichalis Vafopoulos: Initial thoughts about existence in the Web
Michalis Vafopoulos: Initial thoughts about existence in the Web
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 
Adaptable Information Workshop slides
Adaptable Information Workshop slidesAdaptable Information Workshop slides
Adaptable Information Workshop slides
 
Integrating content search with structure analysis for hypermedia retrieval a...
Integrating content search with structure analysis for hypermedia retrieval a...Integrating content search with structure analysis for hypermedia retrieval a...
Integrating content search with structure analysis for hypermedia retrieval a...
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...
 
A Multimodal Approach to Incremental User Profile Building
A Multimodal Approach to Incremental User Profile Building A Multimodal Approach to Incremental User Profile Building
A Multimodal Approach to Incremental User Profile Building
 
Library discovery: past, present and some futures
Library discovery: past, present and some futuresLibrary discovery: past, present and some futures
Library discovery: past, present and some futures
 
Ibrahim ramadan paper
Ibrahim ramadan paperIbrahim ramadan paper
Ibrahim ramadan paper
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 
Distributed Databases Overview
Distributed Databases OverviewDistributed Databases Overview
Distributed Databases Overview
 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs
 
LIS688_Group1
LIS688_Group1 LIS688_Group1
LIS688_Group1
 
11 info architecture-2014
11 info architecture-201411 info architecture-2014
11 info architecture-2014
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
Mapping the content ecosystem
Mapping the content ecosystemMapping the content ecosystem
Mapping the content ecosystem
 

More from Aastha Madaan

Components of openEHR based EHRs
Components of openEHR based EHRsComponents of openEHR based EHRs
Components of openEHR based EHRsAastha Madaan
 
Risk and Credentials based Access Control
Risk and Credentials based Access ControlRisk and Credentials based Access Control
Risk and Credentials based Access ControlAastha Madaan
 
Promise of web science
Promise of web sciencePromise of web science
Promise of web scienceAastha Madaan
 
Domain-specific Multi-stage Query Language for Medical Document Repositories
Domain-specific Multi-stage Query Language for Medical Document RepositoriesDomain-specific Multi-stage Query Language for Medical Document Repositories
Domain-specific Multi-stage Query Language for Medical Document RepositoriesAastha Madaan
 
A Quasi Relational Query Language for Persistent Standardized EHRs: Using NoS...
A Quasi Relational Query Language for Persistent Standardized EHRs: Using NoS...A Quasi Relational Query Language for Persistent Standardized EHRs: Using NoS...
A Quasi Relational Query Language for Persistent Standardized EHRs: Using NoS...Aastha Madaan
 

More from Aastha Madaan (7)

Components of openEHR based EHRs
Components of openEHR based EHRsComponents of openEHR based EHRs
Components of openEHR based EHRs
 
Risk and Credentials based Access Control
Risk and Credentials based Access ControlRisk and Credentials based Access Control
Risk and Credentials based Access Control
 
Promise of web science
Promise of web sciencePromise of web science
Promise of web science
 
Domain-specific Multi-stage Query Language for Medical Document Repositories
Domain-specific Multi-stage Query Language for Medical Document RepositoriesDomain-specific Multi-stage Query Language for Medical Document Repositories
Domain-specific Multi-stage Query Language for Medical Document Repositories
 
A Quasi Relational Query Language for Persistent Standardized EHRs: Using NoS...
A Quasi Relational Query Language for Persistent Standardized EHRs: Using NoS...A Quasi Relational Query Language for Persistent Standardized EHRs: Using NoS...
A Quasi Relational Query Language for Persistent Standardized EHRs: Using NoS...
 
Observlets
Observlets Observlets
Observlets
 
IoT Observatory
IoT ObservatoryIoT Observatory
IoT Observatory
 

Recently uploaded

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Recently uploaded (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Web Page Segmentation for Querying Healthcare Repository

  • 1. to Advance Knowledge for Humanity. VisHue: Web Page Segmentation for an Improved Query Interface for MedlinePlus Medical Encyclopedia. Aastha Madaan, Wanming Chu, Subhash Bhalla, University of Aizu. 11/12/16
  • 2. Outline
      1. Introduction
      2. Background
         a) Hierarchical structure
         b) Page-Level Segmentation
      3. Web Page Segmentation Algorithms
         a) Features
         b) Main focus
         c) Comparison
      4. The Proposal: The VisHue Algorithm
      5. Query by Segment
      6. Performance Analysis
      7. Discussions
      8. Summary and Conclusions
  • 3. 1. Introduction
      - The WWW is the largest and most commonly used source of information
      - Deep querying → gaining importance
      - Understanding web page semantics → improves the user's search experience
      - Within a web page → identify semantic groups
      - Important → discovering these semantic blocks
  • 4. 1(i). The Statement [1]
      A. Large variety of HTML pages → suitable query and search?
      B. Basic requirements → searching and querying
         - Simple querying and searching → semantic querying and searching
      C. Significant → recognize the semantic and coherent segments
         - Page level → segment level
      D. Case example → medical encyclopedia
         - MedlinePlus → various choices of medical encyclopedias
  • 5. 1(i). The Statement [2]
      UML Class Diagram
  • 7. 2. Background: MedlinePlus
      - Web page: (i) relevant content; (ii) irrelevant content
         a. Relevant content: (i) topic headings; (ii) topic-wise contents
         b. Irrelevant content: navigation bars, header, footer, advertisements
      - Headings → identify hierarchical structure
      - Distinct blocks → what a user's perception identifies
      - Main focus → skilled and semi-skilled users
         i. Assumption → headings → query attributes
  • 8. 2(a). Hierarchical Structure
      1. Hierarchical structure → logical structure within the page (document)
      2. Indicates the binary relationships (belongingness) between a pair of segments
      3. Accurate hierarchical representation → user-level query attributes (in segments)
      4. Proposed hierarchical structure → based on domain knowledge (skilled and semi-skilled users)
         - Captures users' perception
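The relationship between headings and the hierarchical structure can be sketched in a few lines of Python. This is only an illustration, not the VisHue algorithm itself: it assumes the headings have already been extracted as (level, label) pairs with positive integer levels, and the names `Segment`, `build_hierarchy`, `belongs_to_pairs` and the sample headings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    label: str
    level: int
    children: list["Segment"] = field(default_factory=list)


def build_hierarchy(headings):
    """Nest (level, label) headings into a tree.

    Assumes levels are positive integers; a smaller number means a
    higher position in the hierarchy (like h1 above h2).
    """
    root = Segment(label="PAGE", level=0)
    stack = [root]  # path from the root to the most recent segment
    for level, label in headings:
        node = Segment(label=label, level=level)
        # Climb back up until the top of the stack can be the parent.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def belongs_to_pairs(node):
    """Return the binary (child, parent) 'belongingness' pairs in the tree."""
    pairs = []
    for child in node.children:
        pairs.append((child.label, node.label))
        pairs.extend(belongs_to_pairs(child))
    return pairs


# Hypothetical headings of one encyclopedia page.
headings = [
    (1, "Hypertension"),
    (2, "Causes"),
    (2, "Symptoms"),
    (3, "Severe symptoms"),
    (2, "Treatment"),
]
tree = build_hierarchy(headings)
pairs = belongs_to_pairs(tree)
for child, parent in pairs:
    print(f"{child} belongs to {parent}")
```

Each (child, parent) pair is one binary relationship between a pair of segments; the labels are what would later serve as query attributes.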
  • 9. 2(a)(i). Segmentation ↔ Semantic Query
      Figure labels: User → semantic query and search (in future); Common Web User
  • 10. 2(b). Page-Level Segmentation
      Definition: a segment is a self-contained logical region within a Web page that is
      (i) not nested within any other segment;
      (ii) represented by a pair (l, c), where l is the label of the segment and c is the portion of text of the segment [1].
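As a rough illustration of the (l, c) definition, the sketch below splits a page's text lines into flat, non-nested segments. It assumes the set of heading labels is known in advance and that a heading occupies a line of its own; `PageSegment`, `split_into_segments` and the sample lines are hypothetical, and real pages would need the visual and heuristic cues the deck describes.

```python
from typing import NamedTuple


class PageSegment(NamedTuple):
    label: str    # l: the label (heading) of the segment
    content: str  # c: the portion of text belonging to the segment


def split_into_segments(lines, labels):
    """Split text lines into flat (label, content) segments.

    Text before the first known label (for example a navigation bar)
    is dropped, and segments are never nested inside each other.
    """
    segments = []
    current_label, buffer = None, []
    for line in lines:
        text = line.strip()
        if text in labels:
            # A new label closes the segment that was being collected.
            if current_label is not None:
                segments.append(PageSegment(current_label, " ".join(buffer)))
            current_label, buffer = text, []
        elif current_label is not None and text:
            buffer.append(text)
    if current_label is not None:
        segments.append(PageSegment(current_label, " ".join(buffer)))
    return segments


# Hypothetical page text, one line per entry.
lines = [
    "Home | Health Topics",
    "Causes",
    "High salt intake.",
    "Family history.",
    "Symptoms",
    "Headache.",
]
segments = split_into_segments(lines, labels={"Causes", "Symptoms"})
for seg in segments:
    print(seg.label, "->", seg.content)
```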
  • 12. 3. Segmentation Algorithms
      i. History → work on “segmentation” dates back to 2001 (and continues through 2011)
      ii. Various application domains
      iii. Various techniques for segmenting
      iv. Various terminologies used
      v. Proposed → MedlinePlus → items of user's focus → query attributes
  • 13. 3(a). Features of Segmentation Algorithm
      A. Match and identify → a user's points of focus
      B. Discover informative segments →
         i. Better search and query
         ii. Segments become query-able attributes
         iii. Skilled users aim to query the informative areas (only)
      C. Generate → true hierarchical structure
      D. Segmentation process → low space and time complexity
  • 14. 3(b). Main Focus
      Find an algorithm best suited to:
      1. Generate a hierarchical structure
      2. Convert segments to attributes in a database
      3. Facilitate in-depth querying
  • 15. 3(b)(i). Segmentation Methods ↔ Web Technologies
  • 16. 3(b)(ii). Classification of Algorithms
  • 17. 3(b)(iii). Timeline ↔ Techniques
      Technique                          Algorithm           Year
      Template Detection                 [9], [6]            2002, 2007
      DOM-Node Recognition               [8], [11], [10]     2001, 2002, 2006
      Visual-DOM based Rendering         [2]                 2003
      Visual-Heuristics based Method     Proposed            -
      Graph-theoretic Method             [3]                 2008
      Linguistics based Method           [7]                 2008
      Image of the Web Page              [4], [5]            2010, 2009
      Site-Oriented Method               [1]                 2011
  • 18. 3(c). Comparison
  • 19. 3(c)(i). Main Focus
  • 20. 3(c)(ii). Comparison: Vision-based Methods
  • 21. 3(c)(iii). Content Structure by VisHue
  • 23. 4. The Proposal: VisHue Algorithm
  • 24. 4(i). Query Interfaces
      Querying vs. searching
      Searching: recent trends
         1. Object-based search
         2. Block-based search
         3. Entity-based search
      Querying: recent trends
         - Very few efforts have been made
  • 26. 5. Query by Segment
      - Query by Segment as Query by Tag (Heading) → QBT
      - Based on → content structure (VisHue algorithm): query by attributes
      - MedlinePlus medical encyclopedia → 3886 web pages
      - Target → focused and explicit querying
         i. Beneficial → skilled and semi-skilled users
         ii. Medical encyclopedia → result of years of effort by experts
  • 27. 5(i). The QBT Interface
      Figure labels: Traditional search on MedlinePlus medical encyclopedia; QBT interface; Title, Causes, Symptoms, Post-Care, …; DB
  • 28. 5(ii). QBT Interface ↔ Hierarchical Structure
      - Labels → query attributes
      - QBT interface: search and query
      - Child nodes → search attributes
      - Left siblings → limit the scope of search of right siblings in the interface
      - Segments → attributes for deep query over all pages of MedlinePlus
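One possible reading of "left siblings limit the scope of search of right siblings" is a left-to-right cascade of filters, where each attribute condition narrows the candidate pages seen by the next one. The sketch below follows that reading only; it is an assumption, not the documented behavior of the QBT interface, and `cascade_search` and the sample pages are hypothetical.

```python
def cascade_search(pages, conditions):
    """Apply (attribute, keyword) conditions from left to right.

    Each condition only searches the pages that survived the conditions
    to its left. Returns the final candidates and a trace of how many
    pages remained after each step.
    """
    candidates = list(pages)
    trace = []
    for attribute, keyword in conditions:
        candidates = [
            page for page in candidates
            if keyword.lower() in page.get(attribute, "").lower()
        ]
        trace.append((attribute, len(candidates)))
    return candidates, trace


# Hypothetical pages, already converted to attribute -> text records.
pages = [
    {"Title": "Hypertension", "Causes": "High salt intake", "Symptoms": "Headache"},
    {"Title": "Heart attack", "Causes": "High blood pressure", "Symptoms": "Chest pain"},
    {"Title": "Stroke", "Causes": "High blood pressure", "Symptoms": "Headache and weakness"},
]
matches, trace = cascade_search(
    pages,
    conditions=[("Causes", "high blood pressure"), ("Symptoms", "headache")],
)
print([page["Title"] for page in matches])
print(trace)
```

Because the conditions are conjunctive, the final result does not depend on their order; the order only affects how quickly the candidate set shrinks at each step.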
  • 30. 6. Performance Analysis
      i. Qualitative comparison with traditional keyword search
      ii. Query formulation and interpretation
      iii. Quantitative performance analysis of the interface
  • 31. 6(i). QBT vs. Keyword Search
  • 32. 6(ii). Query Formulation: A Comparison
  • 33. 6(iii). Query Example
      Query 1: Cases where the patient has hypertension but not high blood pressure
      QBT query: Symptoms: “Hypertension”; Symptoms: NOT “High Blood Pressure”
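If the segments are stored as columns of a database table, Query 1 could be expressed roughly as follows. The slides do not state the storage engine, the schema, or how QBT conditions are translated, so the in-memory SQLite table `pages`, its columns, the helper `run_qbt_query` and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical whitelist of segment columns; column names cannot be bound
# as SQL parameters, so they are checked before being placed in the query.
ALLOWED_ATTRIBUTES = {"title", "causes", "symptoms"}


def run_qbt_query(conn, attribute, include, exclude):
    """Return titles whose attribute text contains `include` but not `exclude`."""
    if attribute not in ALLOWED_ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute}")
    sql = (
        f"SELECT title FROM pages "
        f"WHERE {attribute} LIKE ? AND {attribute} NOT LIKE ? "
        f"ORDER BY title"
    )
    rows = conn.execute(sql, (f"%{include}%", f"%{exclude}%"))
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, causes TEXT, symptoms TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("Page A", "", "Symptoms include hypertension and headache."),
        ("Page B", "", "Hypertension, also called high blood pressure."),
        ("Page C", "", "Fever and cough."),
    ],
)

# Symptoms: "Hypertension"; Symptoms: NOT "High Blood Pressure"
titles = run_qbt_query(conn, "symptoms", "Hypertension", "High Blood Pressure")
print(titles)
```

SQLite's `LIKE` is case-insensitive for ASCII text by default, which is why the capitalized keywords still match the lowercase page text.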
  • 34. 6(iv). Query Attributes
  • 35. 6(v). Query Results
  • 36. 6(vi). Quantitative Performance Analysis
      QBT Query
      - Symptom: “Hypertension”
      - Symptom: NOT “High Blood Pressure”
      - Before Procedure: “Stop”
      - After Procedure: “Normal”
      - Cause: “High Blood Pressure”
      - Symptom: “Heart Attack”
      - Food Source: “Fish”
      - Side Effect: “Poisoning”
  • 37. 7. Discussions
      - Content fragments as perceived by skilled and semi-skilled domain users → determined by the web page segmentation process
      - Proposed effort → formulating a generic algorithm based on heuristic design rules and visual features
      - The QBT interface → query over user-identified segments (attributes)
      - Aim → convert MedlinePlus pages → DB
      - Contention → web pages are a good source → easy-to-use new query language interface for segments
  • 38. 8. Summary and Conclusions
      A. Segmentation based on heuristics + visual features → turning point:
         i. Provides → independent solution
         ii. Improves → query interfaces for the chosen domain
      B. The medical domain → need to make the information accessible to the end users
      C. Query by Segment or Tag (QBT) → an attempt
         i. Aim → return query-able attributes to the users
  • 39. References
      1. “A Site Oriented Method for Segmenting Web Pages”, David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga, SIGIR’11, July 24-28, 2011.
      2. “Extracting Content Structure for Web Pages based on Visual Representation”, Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xian, China, April 23-25, 2003. Proceedings (2003), pp. 596-596.
      3. “Graph-Theoretic Approach to Webpage Segmentation”, Deepayan Chakrabarti, Ravi Kumar, Kunal Punera, WWW 2008 / Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China.
      4. “A segmentation method for web page analysis using shrinking and dividing”, Jiuxin Cao, Bo Mao and Junzhou Luo, International Journal of Parallel, Emergent and Distributed Systems, 25:2, 93-104, 2010.
      5. “Web Page Layout via Visual Segmentation”, Ayelet Pnueli, Ruth Bergman, Sagi Schein, Omer Barkol, HP Laboratories, 2009.
      6. “Page-level template detection via isotonic smoothing”, D. Chakrabarti, R. Kumar, and K. Punera, In 16th WWW, pages 61–70, 2007.
      7. “A Densitometric Approach to Web Page Segmentation”, Christian Kohlschütter, Wolfgang Nejdl, CIKM’08, October 26–30, 2008.
      8. “HTML Page Analysis Based on Visual Cues”, Yudong Yang and HongJiang Zhang, IEEE, 2001.
      9. “Template Detection via Data Mining and its Applications”, Ziv Bar-Yossef, Sridhar Rajagopalan, In Proceedings of WWW’02, May 7–11, 2002, Honolulu, Hawaii, USA.
      10. “DeSeA: A Page Segmentation based Algorithm for Information Extraction”, He Juan, Gao Zhiqiang, Xu Hui, Qu Yuzhong, Proceedings of the First International Conference on Semantics, Knowledge, and Grid (SKG 2005).
      11. “Reverse Engineering for Web Data: From Visual to Semantic Structures”, Christina Yip Chung, Michael Gertz, Neel Sundaresan, In Proceedings of the 18th International Conference on Data Engineering (ICDE’02).
  • 40. Thank you. Questions