HMM-based Artificial Designer for
Search Interface Segmentation
Ritu Khare, Yuan An, Il-Yeol Song
ACCESSING THE DEEP WEB

Deep Web: Data that exist on the Web but are not
returned by search engines through traditional crawling
and indexing.

Accessing Deep Web contents: The primary way to
access this data (manually filling out HTML forms on
search interfaces) is not scalable.
Hence, more sophisticated solutions, such as designing
meta-search engines or creating dynamic page
repositories, are required. A prerequisite to these
solutions is an understanding of the search interfaces.
Interface segmentation is an important part of the
problem of search interface understanding.

HMM: ARTIFICIAL DESIGNER

An HMM (Hidden Markov Model) can act like a human designer
who has the ability to design an interface using acquired
knowledge and to determine (decode) the segment boundaries
and semantic labels of components.
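
To make the HMM-as-designer idea concrete, the following is a minimal sketch of an HMM "designing" an interface: it picks a semantic label, draws a component from that label's bag of components, and then moves to the next label. All probabilities and component types (`text`, `select`, `textbox`, and so on) are hypothetical values chosen for illustration; they are not the parameters learnt in this work.

```python
import random

# Hypothetical toy parameters, invented for illustration only;
# they are not the probabilities learnt in the poster.
LABELS = ["attribute-name", "operator", "operand"]

START = {"attribute-name": 0.8, "operator": 0.1, "operand": 0.1}

TRANSITION = {
    "attribute-name": {"attribute-name": 0.1, "operator": 0.3, "operand": 0.6},
    "operator": {"attribute-name": 0.1, "operator": 0.1, "operand": 0.8},
    "operand": {"attribute-name": 0.6, "operator": 0.1, "operand": 0.3},
}

# "Bag of components": each semantic label emits HTML component types.
EMISSION = {
    "attribute-name": {"text": 0.9, "select": 0.1},
    "operator": {"select": 0.5, "radio": 0.4, "text": 0.1},
    "operand": {"textbox": 0.6, "select": 0.3, "checkbox": 0.1},
}


def draw(distribution, rng):
    """Draw one key from a {key: probability} mapping."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def design_interface(length, seed=0):
    """Generate (label, component) pairs the way a 'designer' HMM would."""
    rng = random.Random(seed)
    label = draw(START, rng)
    interface = []
    for _ in range(length):
        # Place a component that fits the current semantic role.
        component = draw(EMISSION[label], rng)
        interface.append((label, component))
        # Decide which semantic role comes next.
        label = draw(TRANSITION[label], rng)
    return interface


if __name__ == "__main__":
    for label, component in design_interface(6):
        print(f"{component:10s} <- {label}")
```

Decoding runs this process in reverse: given only the placed components, it infers the hidden labels that most plausibly produced them.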

The designing process is similar to statistically choosing one
component from a bag of components (a superset of all possible
components) and placing it on the interface while keeping the
semantic role (attribute-name, operand, or operator) of the
component in mind. See Figure 2.

Fig 2. Simulating a Human Designer using HMMs
(diagram: DESIGNING takes a Bag of Components and Knowledge of
Semantic Labels to a Search Interface; DECODING takes the Search
Interface through the 2-Layered HMM to Segments & Tagged Components)

INTERFACE SEGMENTATION

Fig 1. Segmented Interface
(segments marked by dotted lines)
Two example segments:
  Marker Range: between [ ] and [ ]   e.g., between “D19Mit32” and “Tbx10”
  cM Position:  between [ ]           e.g., “10.0 -40.0”

While a user is naturally trained to perform
segmentation, a machine is unable to “see” a segment
due to the following reasons:
1. The components that are visually close to each other
might be located very far apart in the HTML source
code.
2. A machine does not implicitly have any search
experience that can be leveraged to identify a
segment's boundary.

Research Question: How can we make a machine learn
how to segment an interface?

RESULTS

Fig 4. Learnt Topology of semantic labels
(state-transition diagram over the labels Attribute-name, Operator,
Operand, and Text-trivial)

Semantic Label                 Accuracy
Segment / Logical Attribute    86.05 %
Operator                       85.10 %
Operand                        98.60 %
Attribute-name                 90.11 %

2-LAYERED HMM APPROACH
The decoding problem is twofold: 1) segmentation, 2) assignment
of semantic labels to components. Hence, a 2-layered HMM is
employed, as shown in Figure 3. The first layer, T-HMM, tags each
component with an appropriate semantic label (attribute-name,
operator, or operand). The second layer, S-HMM, segments the
interface into logical attributes.
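
As a rough illustration of how the two layers' outputs fit together, the sketch below groups tagged components into segments. It assumes, purely for illustration, that the second layer marks each component as `"B"` (begins a segment) or `"I"` (inside the current segment); the poster does not specify the S-HMM state set, and the tags and boundaries here are hand-written rather than produced by the authors' models.

```python
def group_into_segments(components, tags, boundaries):
    """Combine the layer outputs into one list of (component, tag) per segment."""
    segments = []
    for component, tag, boundary in zip(components, tags, boundaries):
        # Open a new segment at a boundary marker (or at the very start).
        if boundary == "B" or not segments:
            segments.append([])
        segments[-1].append((component, tag))
    return segments


# Components in source order, loosely modelled on the Fig 1 example.
components = ["Marker Range:", "between", "<textbox>", "and", "<textbox>",
              "cM Position:", "between", "<textbox>"]

# Hypothetical first-layer (T-HMM) output: one semantic label per component.
tags = ["attribute-name", "operator", "operand", "text-trivial", "operand",
        "attribute-name", "operator", "operand"]

# Hypothetical second-layer (S-HMM) output: segment boundary markers.
boundaries = ["B", "I", "I", "I", "I", "B", "I", "I"]

segments = group_into_segments(components, tags, boundaries)

if __name__ == "__main__":
    for index, segment in enumerate(segments, start=1):
        print(f"Segment {index}: {segment}")
```

Each resulting segment corresponds to one logical attribute together with its tagged components.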
Fig 3. 2-Layered HMM Architecture
(diagram: HTML-coded interfaces pass through the T-HMM and then the
S-HMM to yield segmented and tagged interfaces; from the training
interfaces, manually tagged sequences train the T-HMM and manually
segmented interfaces train the S-HMM)

EXPERIMENTATION
Data Set    200 interfaces from Biology Domain
Parsing     DOM-trees of components
Training    Maximum Likelihood Method
Testing     Viterbi Algorithm
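
The training and testing steps named above can be sketched generically. The code below is a textbook-style illustration, not the authors' implementation: maximum likelihood estimation is done by counting over manually tagged sequences, and Viterbi decoding recovers the most likely label sequence. The toy training data, the observation symbols (`text`, `select`, `textbox`), and the small probability floor for unseen events are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def normalize(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def train_mle(tagged_sequences):
    """Estimate start, transition, and emission probabilities by counting."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sequence in tagged_sequences:
        previous = None
        for observation, state in sequence:
            emit[state][observation] += 1
            if previous is None:
                start[state] += 1
            else:
                trans[previous][state] += 1
            previous = state
    return (normalize(start),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})


def log_prob(table, key, floor=1e-9):
    """Log-probability with a small floor for unseen events (an assumption)."""
    return math.log(table.get(key, 0.0) or floor)


def viterbi(observations, states, start, trans, emit):
    """Return the most likely state sequence for a non-empty observation list."""
    # best[t][s] = (log-probability of the best path ending in s, backpointer)
    best = [{s: (log_prob(start, s) + log_prob(emit.get(s, {}), observations[0]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans.get(p, {}), s), p) for p in states
            )
            column[s] = (score + log_prob(emit.get(s, {}), obs), prev)
        best.append(column)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical manually tagged sequences: (component type, semantic label).
TRAINING = [
    [("text", "attribute-name"), ("select", "operator"), ("textbox", "operand")],
    [("text", "attribute-name"), ("textbox", "operand")],
    [("text", "attribute-name"), ("select", "operator"), ("textbox", "operand"),
     ("text", "attribute-name"), ("textbox", "operand")],
]
STATES = ["attribute-name", "operator", "operand"]

if __name__ == "__main__":
    start, trans, emit = train_mle(TRAINING)
    print(viterbi(["text", "select", "textbox"], STATES, start, trans, emit))
```

Working in log space avoids numerical underflow on long component sequences; a real system would likely use a more principled smoothing scheme than the fixed floor used here.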

CONTRIBUTIONS
1. This approach outperforms LEX, a contemporary
heuristic-based method, and achieves a 10%
improvement in segmentation accuracy.
2. This is the first work to apply HMMs to deep Web
search interfaces. HMMs helped in incorporating the
first-hand knowledge of the designer to perform
interface understanding.

FUTURE WORK
1. To recover the schema of deep Web databases by
extracting finer details, such as the data types and
constraints of logical attributes.
2. To test this approach on interfaces from other
domains, given the diverse domain distribution of
the deep Web.
3. To investigate the use of the Baum-Welch
training algorithm to minimize the degree of
manual intervention.
