HMM-based Artificial Designer for
Search Interface Segmentation
Ritu Khare, Yuan An, Il-Yeol Song
ACCESSING THE DEEP WEB

Deep Web: Data that exist on the Web but are not returned by search engines through traditional crawling and indexing.

Accessing Deep Web contents: The primary way to access these data (manually filling in HTML forms on search interfaces) is not scalable. Hence, more sophisticated solutions, such as designing meta-search engines or creating dynamic page repositories, are required. A prerequisite to these solutions is an understanding of the search interfaces. Interface segmentation is an important part of the search interface understanding problem.

HMM: ARTIFICIAL DESIGNER

An HMM (Hidden Markov Model) can act like a human designer who has the ability to design an interface using acquired knowledge and to determine (decode) the segment boundaries and semantic labels of components.

The designing process is similar to statistically choosing one component from a bag of components (a superset of all possible components) and placing it on the interface while keeping the semantic role (attribute-name, operand, or operator) of the component in mind. See Figure 2.

Fig 2. Simulating a Human Designer using HMMs (DESIGNING: knowledge of semantic labels plus a bag of components yield a search interface; DECODING: a 2-layered HMM recovers segments and tagged components)
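As a hedged illustration of this analogy, the sketch below simulates the "designing" walk in Python: semantic roles are the hidden states, and each step samples the next role from a transition table and then draws a concrete component from the bag. All probabilities and component names here are invented for illustration; they are not the learnt parameters shown in Figure 4.

```python
import random

# Hidden states: semantic roles a designer keeps in mind (plus a
# "TextTrivial" role for decorative text, as in the learnt topology).
STATES = ["Attribute-name", "Operator", "Operand", "TextTrivial"]

# Hypothetical transition probabilities between roles (rows sum to 1).
TRANS = {
    "Attribute-name": {"Attribute-name": 0.05, "Operator": 0.45, "Operand": 0.40, "TextTrivial": 0.10},
    "Operator":       {"Attribute-name": 0.05, "Operator": 0.05, "Operand": 0.80, "TextTrivial": 0.10},
    "Operand":        {"Attribute-name": 0.60, "Operator": 0.10, "Operand": 0.20, "TextTrivial": 0.10},
    "TextTrivial":    {"Attribute-name": 0.70, "Operator": 0.10, "Operand": 0.10, "TextTrivial": 0.10},
}

# The "bag of components": possible emissions per role (also invented).
BAG = {
    "Attribute-name": ["label('Marker Range')", "label('cM Position')"],
    "Operator":       ["select('between', 'equals')"],
    "Operand":        ["textbox()", "radio_button()"],
    "TextTrivial":    ["text('e.g., 10.0-40.0')"],
}

def design_interface(n_components, start="Attribute-name"):
    """Simulate a designer: walk the role chain, emitting components."""
    state, layout = start, []
    for _ in range(n_components):
        layout.append((state, random.choice(BAG[state])))
        roles, probs = zip(*TRANS[state].items())
        state = random.choices(roles, weights=probs)[0]
    return layout

for role, comp in design_interface(6):
    print(f"{role:15s} -> {comp}")
```

Decoding reverses this walk: given the emitted components, the HMM recovers the most likely sequence of hidden semantic roles.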

INTERFACE SEGMENTATION

Fig 1. Segmented Interface (segments marked by dotted lines). Example segments: "Marker Range: between ... and ..." (e.g., between "D19Mit32" and "Tbx10") and "cM Position: between ..." (e.g., "10.0 - 40.0")

While a user is naturally trained to perform segmentation, a machine is unable to "see" a segment due to the following reasons:
1. Components that are visually close to each other might be located far apart in the HTML source code (see the sketch below).
2. A machine does not implicitly have any search experience that can be leveraged to identify a segment's boundary.

Research Question: How can we make a machine learn how to segment an interface?
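To make reason 1 concrete, here is a small, purely hypothetical example (an invented HTML fragment embedded in a Python string): a table-based layout renders a label right next to its input boxes, yet unrelated markup separates them in the source.

```python
# Hypothetical table-layout form: the "Marker Range:" label and its
# inputs render side by side, but are far apart in the source, with
# the unrelated "cM Position:" label sitting in between.
html = """
<table>
  <tr><td>Marker Range:</td><td>cM Position:</td></tr>
  <tr>
    <td>between <input name="marker_lo"> and <input name="marker_hi"></td>
    <td>between <input name="cm_range"></td>
  </tr>
</table>
"""

label_pos = html.index("Marker Range:")
input_pos = html.index('<input name="marker_lo"')
print(f"Source-code distance: {input_pos - label_pos} characters")
```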

2-LAYERED HMM APPROACH
The problem of decoding is two-fold: 1) segmentation, and 2) assignment of semantic labels to components. Hence, a 2-layered HMM is employed, as shown in Figure 3. The first layer, T-HMM, tags each component with the appropriate semantic labels (attribute-name, operator, and operand). The second layer, S-HMM, segments the interface into logical attributes.
Fig 3. 2-Layered HMM Architecture (training interfaces with manually tagged sequences train the T-HMM; manually segmented interfaces train the S-HMM; HTML-coded interfaces are then decoded into segmented and tagged interfaces)
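To make the training flow of Figure 3 concrete, here is a minimal sketch assuming the maximum-likelihood estimation named under Experimentation below: transition and emission probabilities are simply normalized counts over manually tagged sequences. The toy sequence and component names are hypothetical.

```python
from collections import Counter, defaultdict

def mle_train(tagged_sequences):
    """Maximum-likelihood HMM parameters from (component, label)
    sequences: probabilities are normalized co-occurrence counts."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        for (_, label), (_, next_label) in zip(seq, seq[1:]):
            trans[label][next_label] += 1          # label -> next label
        for comp, label in seq:
            emit[label][comp] += 1                 # label -> component

    def normalize(counters):
        return {s: {k: v / sum(c.values()) for k, v in c.items()}
                for s, c in counters.items()}
    return normalize(trans), normalize(emit)

# Toy manually tagged sequence (hypothetical component types).
tagged = [[("text", "Attribute-name"), ("select", "Operator"),
           ("textbox", "Operand"), ("textbox", "Operand")]]
trans, emit = mle_train(tagged)
print(trans["Operand"])   # {'Operand': 1.0}
print(emit["Operand"])    # {'textbox': 1.0}
```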

EXPERIMENTATION

Data Set    200 interfaces from the Biology domain
Parsing     DOM-trees of components
Training    Maximum Likelihood method
Testing     Viterbi algorithm
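And a compact sketch of the testing step: log-space Viterbi decoding, applied twice as in the 2-layered architecture. First the T-HMM maps components to semantic labels, then the S-HMM maps labels to segment-boundary states. The toy parameters and the BEGIN/INSIDE boundary states are assumptions for illustration; the poster does not specify the S-HMM's state names.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden-state sequence (log-space Viterbi)."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")
    V = [{s: lg(start.get(s, 0)) + lg(emit[s].get(obs[0], 0)) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + lg(trans[p].get(s, 0)))
            row[s] = V[-1][prev] + lg(trans[prev].get(s, 0)) + lg(emit[s].get(o, 0))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):          # follow back-pointers
        path.append(ptr[path[-1]])
    return path[::-1]

# Layer 1 (T-HMM): components -> semantic labels (toy parameters).
labels = ["Attribute-name", "Operator", "Operand"]
t_start = {"Attribute-name": 1.0}
t_trans = {"Attribute-name": {"Operator": 0.6, "Operand": 0.4},
           "Operator": {"Operand": 1.0},
           "Operand": {"Attribute-name": 0.7, "Operand": 0.3}}
t_emit = {"Attribute-name": {"text": 0.9, "textbox": 0.1},
          "Operator": {"select": 1.0},
          "Operand": {"textbox": 0.8, "select": 0.2}}
components = ["text", "select", "textbox", "text", "textbox"]
tags = viterbi(components, labels, t_start, t_trans, t_emit)

# Layer 2 (S-HMM): labels -> segment boundaries (BEGIN starts a new
# logical attribute; INSIDE continues it). Also toy parameters.
bounds = ["BEGIN", "INSIDE"]
s_start = {"BEGIN": 1.0}
s_trans = {"BEGIN": {"BEGIN": 0.2, "INSIDE": 0.8},
           "INSIDE": {"BEGIN": 0.4, "INSIDE": 0.6}}
s_emit = {"BEGIN": {"Attribute-name": 0.9, "Operator": 0.05, "Operand": 0.05},
          "INSIDE": {"Attribute-name": 0.1, "Operator": 0.4, "Operand": 0.5}}
print(list(zip(components, tags, viterbi(tags, bounds, s_start, s_trans, s_emit))))
```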

RESULTS

Semantic Label                 Accuracy
Segment / Logical Attribute    86.05 %
Operator                       85.10 %
Operand                        98.60 %
Attribute-name                 90.11 %

Fig 4. Learnt topology of semantic labels (transition structure over the states Attribute-name, Operator, Operand, and TextTrivial)

CONTRIBUTIONS
1. This approach outperforms LEX, a contemporary heuristic-based method, and achieves a 10% improvement in segmentation accuracy.
2. This is the first work to apply HMMs to deep Web search interfaces. HMMs helped incorporate the first-hand knowledge of the designer to perform interface understanding.

FUTURE WORK
1. To recover the schema of deep Web databases by extracting finer details, such as the data types and constraints of logical attributes.
2. To test this approach on interfaces from other domains, given the diverse domain distribution of the deep Web.
3. To investigate the use of the Baum-Welch training algorithm to increase the degree of automation in training.
