Neil andersonjunhong

Visually Extracting Data Records from the Deep Web

Neil Anderson and Jun Hong

Queen’s University Belfast, UK

Data Record Extraction

Given a query result page containing a set of data records, our goal is to group the data items and labels of each data record together.

Previous Approaches

•  Common theme is to identify repeated patterns

•  Source code and regular expressions

–  JavaScript makes this tricky

•  Supervised learning with annotated pages

–  Wrapper induction

•  Tag tree representation (DOM)

–  Hierarchical representation of the page, designed for the browser, not for humans

–  Doesn’t mirror the displayed structure - modern complex web pages make this difficult

Our Visual Approach

•  Mimic human intuition

•  To make use of the common sources of evidence on displayed pages that humans use, including

–  Structural regularity

–  Visual and content similarity between data records

Previous Approaches Need to Identify Data Rich Section

Pitfalls:

•  How to identify the Data Rich Section (DRS)

•  DRS does not contain all the records

•  DRS contains noise as well as records

Our Approach

•  We find records, not the Data Rich Section

•  Extract data records individually on displayed query result pages, while excluding noise items

•  Records in a grid or a column

•  Use clustering algorithms and a set of similarity measures to:

–  Identify records

–  Exclude noise

Our Approach

[Pipeline diagram]

Components, in the order shown: Web Page Renderer, Visual Block Modeller, Seed Block Selector, Data Record Block Selector, Record Boundary Drawer

Technologies labelled in the diagram: WebKit, JavaScript, jQuery

Green and blue blocks

Selecting Other Candidate Containers

Filter the set of all container blocks on the page (blue blocks): discard blocks that don’t match the width of any candidate container block (orange blocks).

Cluster the remaining blocks by width.

Why width?

Web pages are designed for vertical, not horizontal, scrolling.
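The filter-then-cluster step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Block` structure, the `cluster_by_width` helper, and the `tolerance` parameter are all assumptions introduced for the example, and the block coordinates are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered rectangle on the page (hypothetical structure)."""
    name: str
    x: int
    y: int
    width: int
    height: int


def cluster_by_width(containers, candidates, tolerance=0):
    """Keep containers whose width matches a candidate, grouped by that width."""
    candidate_widths = sorted({c.width for c in candidates})
    clusters = defaultdict(list)
    for block in containers:
        # Find the first candidate width this block matches, if any.
        match = next(
            (w for w in candidate_widths if abs(block.width - w) <= tolerance),
            None,
        )
        if match is not None:  # blocks with no matching width are discarded
            clusters[match].append(block)
    return dict(clusters)


# Made-up layout: one candidate (orange) block and four container (blue) blocks.
candidates = [Block("seed", 20, 100, 600, 120)]
containers = [
    Block("header", 0, 0, 960, 80),
    Block("result-1", 20, 100, 600, 120),
    Block("result-2", 20, 230, 600, 118),
    Block("sidebar", 640, 100, 300, 500),
]
clusters = cluster_by_width(containers, candidates)
print({w: [b.name for b in blocks] for w, blocks in clusters.items()})
```

Heights are ignored on purpose: records in a vertically scrolling list can differ in height while sharing a width.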

Selecting Record Containers

Block content similarity measure

Block A – Candidate record block (orange)

Block B – Container block (blue) with the same width as A

The cluster with the maximum number of similar blocks is the winner!
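A minimal sketch of the cluster selection follows. The slides do not define the block content similarity measure, so Jaccard similarity over feature sets is used here as a stand-in. The feature names, the `0.5` threshold, and the `select_record_cluster` helper are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (a stand-in, not the authors' measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_record_cluster(seed_features, clusters, threshold=0.5):
    """Return (width, similar_block_names) for the cluster with the most
    blocks whose similarity to the seed block meets the threshold."""
    best_width, best_similar = None, []
    for width, blocks in clusters.items():
        similar = [
            name for name, features in blocks
            if jaccard(seed_features, features) >= threshold
        ]
        if len(similar) > len(best_similar):
            best_width, best_similar = width, similar
    return best_width, best_similar


# Block A: the candidate record block, described by hypothetical content features.
seed = {"img", "link", "price", "title"}

# Width clusters of container blocks (Block B candidates), keyed by width.
clusters = {
    600: [
        ("result-1", {"img", "link", "price", "title"}),
        ("result-2", {"img", "link", "title"}),
        ("advert", {"img", "script"}),
    ],
    300: [("sidebar", {"link", "list"})],
}

winner = select_record_cluster(seed, clusters)
print(winner)  # (600, ['result-1', 'result-2'])
```

The advert shares the records' width but falls below the threshold, which is how noise inside a cluster is excluded in this sketch.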

Visual Block Model - Clean

Conclusions: Main Contributions

•  Visual approach to directly access a rendering engine to get positional and visual features rather than source code or tag trees

•  No need to identify data rich section

•  Use observations on visual and content similarity, and structural regularity to group data items into records

Future Work

•  Use a domain schema from schema.org, or a domain ontology to annotate data records

•  Use a domain schema or ontology to annotate query forms too

•  Solve label incompleteness and inconsistency issues

•  Similarity threshold

–  Set by machine learning
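One simple way a similarity threshold could be set from data is sketched below. The slides do not say which learning method is intended, so this is only an assumption: an exhaustive search for the cut-off that classifies the most labelled block pairs correctly. The `learn_threshold` helper and the labelled scores are hypothetical.

```python
def learn_threshold(scored_pairs):
    """Pick the cut-off that best separates labelled pairs.

    scored_pairs: list of (similarity_score, is_similar) tuples, where
    is_similar is a human label for the pair of blocks.
    """
    candidates = sorted({score for score, _ in scored_pairs})
    best_threshold, best_correct = None, -1
    for t in candidates:
        # A pair is predicted "similar" when its score reaches the cut-off.
        correct = sum((score >= t) == label for score, label in scored_pairs)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


# Made-up labelled examples: (similarity score, annotated as similar?)
labelled = [(0.9, True), (0.75, True), (0.6, True), (0.4, False), (0.2, False)]
threshold = learn_threshold(labelled)
print(threshold)  # 0.6
```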
