DIADEM WWW 2012

DIADEM domain-centric intelligent automated
data extraction methodology

DIADEM
Domain-centric, Intelligent, Automated
Data Extraction
Tim Furche
April 18th, 2012 @ WWW 2012 • Department of Computer Science,
Oxford University

DIADEM ›❯ What?

Data Extraction with DIADEM
fully automated, but domain-centric
based on extensive domain knowledge
no per site training at all
no user input other than the domain model

we aim for complete extraction of the domain
works on the vast majority of websites of a domain
extracts the vast majority of records of each site

main target: websites with structured records

Domain-Centric Data Extraction
Blackbox that
turns any of the thousands of websites of a domain
into structured data

<?xml version="1.0" encoding="UTF-8"?>
<results>
  <tyre>
    <brand>Star Performer</brand>
    <profile>HP</profile>
    <price>42.60</price>
  </tyre>
  <tyre>
    <brand>High Performer</brand>
    <profile>HS-3</profile>
  </tyre>
  ...
</results>
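
A minimal sketch of how a consumer might read output of this shape, assuming the element names from the slide (results, tyre, brand, profile, price). The function name read_records and the SAMPLE string are illustrative only and are not part of DIADEM.

```python
import xml.etree.ElementTree as ET

# Sample output modelled on the slide; the second record has no price.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<results>
  <tyre>
    <brand>Star Performer</brand>
    <profile>HP</profile>
    <price>42.60</price>
  </tyre>
  <tyre>
    <brand>High Performer</brand>
    <profile>HS-3</profile>
  </tyre>
</results>"""


def read_records(xml_text):
    """Turn each <tyre> element into a dict of attribute name -> text."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    records = []
    for tyre in root.findall("tyre"):
        records.append({child.tag: (child.text or "").strip() for child in tyre})
    return records


if __name__ == "__main__":
    for record in read_records(SAMPLE):
        print(record)
```

Records are kept as plain dicts so that a missing attribute, such as the price of the second tyre, is simply absent rather than filled with a guess.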


“Product” Search for Properties
[Screenshot: a product search results page (Sony VAIO laptops with prices, specifications, and price comparison across stores), overlaid with analogous results for rental properties (price per month, property type, available date, bedrooms, energy rating, nearby amenities).]

Web Data Extraction

Scenario ➀: Electronics retailer
electronics retailer: online market intelligence
comprehensive overview of the market
daily information on price, shipping costs, trends, product mix
by product, geographical region, or competitor
thousands of products
hundreds of competitors

nowadays: specialised companies
mostly manual, interpolation
large cost

Web Data Extraction › Scenarios

Scenario ➁: Supermarket chain
supermarket chain
competitors’ product prices
special offer or promotion (time sensitive)
new products, product formats & packaging


Scenario ➂: Hotel Agency
online travel agency
best price guarantee
prices of competing agencies
average market price
taken and report history


Scenario ➃: Hedge Fund
house price index
published in regular intervals by national statistics agency
affects share values of various industries
hedge fund:
online market intelligence to predict the house price index


Scenario ➄: Construction
tenders from all over the world
existing aggregators
expensive, often incomplete
yet need to be published (online) by law in most countries


Scenario ➅: Supporting Scientists
automatic document analysis and annotation
data extraction from scientific databases
improving search for scientific literature

About us …

DIADEM lab at Oxford University

[Timeline: 2010 to 2015]

How: Knowledge

DIADEM ›❯ Knowledge

Data Extraction
Three steps in data extraction:
finding the relevant pages: interaction (forms)
identifying the relevant objects: segmentation
extracting the relevant attributes: alignment

In all cases: derive patterns from examples

DIADEM ›❯ Automation in Data Extraction

Bad News: Nobody Can Do It Yet
Wrapper Induction (ML): high accuracy
Template Discovery: low supervision

Knowledge in Data Extraction

what’s “knowledge” here
observational:
what to observe, annotations
that a certain text is highlighted, that a certain keyword appears in it
phenomenological:
how observations become concepts
that a text “...:” to the close north-west of a field is that field’s label
ontological:
schema, concepts & constraints
e.g., “bathroom”, “every property must have a location”
orthogonal: script knowledge for web pages
both domain-independent and domain-dependent
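
The three kinds of knowledge can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the Box type, the price pattern, the 50-pixel distance threshold, and the function names are hypothetical and do not reproduce DIADEM's actual rules, which the slides describe as Datalog-based.

```python
import re
from dataclasses import dataclass


@dataclass
class Box:
    """A text node with the top-left corner of its rendered box (hypothetical)."""
    text: str
    left: float
    top: float


# Observational: a clue that some text "looks like a price".
PRICE = re.compile(r"£\s?\d[\d,]*(?:\.\d+)?")


def looks_like_price(text):
    return PRICE.search(text) is not None


def label_for(field, texts, max_dist=50.0):
    """Phenomenological: the closest text ending in ':' to the north-west
    of a field is taken as that field's label."""
    def dist(t):
        return (field.left - t.left) + (field.top - t.top)

    candidates = [
        t for t in texts
        if t.text.endswith(":")
        and t.left <= field.left
        and t.top <= field.top
        and dist(t) <= max_dist
    ]
    return min(candidates, key=dist) if candidates else None


def violates_ontology(record):
    """Ontological: every property must have a location."""
    return not record.get("location")
```

The split mirrors the slide: the first function only observes, the second turns observations into a concept (a label), and the third checks a domain constraint on the resulting record.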

[Slide annotations: observational ↔ phenomenon; phenomenological ↔ mapping; ontological ↔ idea/noumenon]


Trend: Towards Domain-
Observational only:
Su, Wang, Lochovsky. ODE, TODS 2009

Ontological only:
Fazzinga, Flesca, Tagarelli. Schema-based Web wrapping. K&IS 2011
shallow ontology, better for single attribute extraction

Observational & ontological:
Dalvi, Kumar, Soliman. Automatic Wrappers for Large Scale Web Extraction, VLDB 2011. (AutoWrapper in the following)
Venetis, Halevy, Madhavan, et al. Recovering Semantics of

DIADEM: Suffused by Knowledge
Key insight ➊: all three types of knowledge
every piece of DIADEM is driven by knowledge
exploration: script/interaction knowledge
block/form/result page/description analysis
all combine all three types
algorithms:
search for a “consistent” interpretation informed by domain knowledge
rather than uninformed as, e.g., in AutoWrapper

[Architecture diagram, steps ➊ to ➏: Browser → DOM → (observational knowledge) → Observed Facts → (phenomenological knowledge) → Interpretation → (ontological knowledge) → Model; the Explorer uses script/interaction knowledge. Annotations: the observer is imperfect (observed facts are incomplete and ambiguous); the interpretation is per-se consistent; the model is a consistent interpretation.]

All in one …
Finding the pages
:= crawling, web forms, etc.
form understanding (OPAL) and navigation (BERyL)

Segmentation
:= divide into records, cells, etc.
page segmentation (BERyL) and record segmentation (AMBER)

Alignment
:= class of a record, attribute, column, etc.
attribute alignment (AMBER) and attribute extraction (Oxtractor)

[Overlay labels added to the steps above, top to bottom: DEMO, PROFOUND, PAPER, DEMO]


All in … two …
All the analysis is integrated
but separated from the actual extraction
samples only enough pages to generate an exhaustive wrapper
script knowledge guides the exploration and “stop” strategy

Large-scale extraction: OXPath in the Cloud → OXLatin
separate, cloud-based extraction (DEMO)
efficient, highly scalable extraction language & analysis
SCOUT: provisioning and scheduling in cloud computing under external global constraints

DIADEM ›❯ Inside

A Journey into DIADEM
Examples of knowledge (and its representation) in DIADEM
observational:
clues for price (“looks like a price”) and location
representation: Gazetteers, JAPE rules, WEKA classifiers & Datalog¬,Agg rules
phenomenological:
a real estate record and its attributes
representation: Datalog¬,Agg,± rules
ontological:
constraints for real estate form
representation: template language on top of Datalog¬,Agg,± rules

BERyL: Navigation Blocks
feature model: derived from observed facts
through Datalog program with templates
less than two dozen lines of code

TEMPLATE annotated_by<Model,AType> {
  <Model>::annotated_by<AType>(X) ⇐ node_of_interest(X),
      gate::annotation(X, <AType>, _). }
TEMPLATE in_proximity<Model,Property(Close)> {
  <Model>::in_proximity<Property>(X) ⇐ node_of_interest(X),
      std::proximity(Y,X), <Property(Close)>. }
TEMPLATE num_in_proximity<Model,Property(Close)> {
  <Model>::in_proximity<Property>(X,Num) ⇐ node_of_interest(X),
      Num = #count(N: std::proximity(Close,X), <Property(Close)>). }
TEMPLATE relative_position<Model,Within(Height,Width)> {
  <Model>::relative_position<Within>(X, (PosH, PosV)) ⇐ node_of_interest(X),
      css::box(X, LeftX, TopX, _, _), <Within(Height,Width)>,
      PosH = 100·LeftX / Width, PosV = 100·TopX / Height. }
TEMPLATE contained_in<Model,Container(Left,Top,Bottom,Right)> {
  <Model>::contained_in<Container>(X) ⇐ node_of_interest(X),
      css::box(X,LeftX,TopX,RightX,BottomX), <Container(Left,Top,Right,Bottom)>,
      Left < LeftX < RightX < Right, Top < TopX < BottomX < Bottom. }
TEMPLATE closest<Model,Relation(Closest,X),Property(Closest),Test(Closest)> {
  <Model>::closest<Relation>_with<Property>_is<Test>(X) ⇐ node_of_interest(X),
      <Relation(Closest,X)>, <Property(Closest)>, <Test(Closest)>,
      ¬(<Relation(Y,X)>, <Property(Y)>, <Relation(Y,Closest)>). }

Fig. 4: BERyL feature templates

In a similar way, the second template defines a boolean feature that holds for nodes of interest, if there is another node in their proximity for which Property(Close) is true. To instantiate it to nodes that are annotated with PAGINATION, we write
INSTANTIATE in_proximity<Model,Property(Close)>

[Chart: Precision, Recall, and F1 for Real Estate, Cars, Retail, Forums, and Total; axis from 0.95 to 1.00]
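
As a rough illustration of what three of these feature templates compute, here is a sketch in Python rather than Datalog. The CssBox type, the function signatures, and the callback parameters are assumptions made for this example; only the formulas (100·LeftX / Width, 100·TopX / Height, and the strict containment chain) are taken from the templates.

```python
from typing import NamedTuple


class CssBox(NamedTuple):
    """Rendered box of a node, in pixels (hypothetical stand-in for css::box)."""
    left: float
    top: float
    right: float
    bottom: float


def relative_position(box, width, height):
    """Position of the box's top-left corner as percentages of the page size."""
    pos_h = 100 * box.left / width
    pos_v = 100 * box.top / height
    return (pos_h, pos_v)


def contained_in(box, container):
    """True if the box lies strictly inside the container on both axes."""
    return (container.left < box.left < box.right < container.right
            and container.top < box.top < box.bottom < container.bottom)


def num_in_proximity(node, others, is_close, has_property):
    """Count nodes close to `node` for which the given property holds."""
    return sum(1 for other in others if is_close(other, node) and has_property(other))
```

For example, a box whose top-left corner is at (200, 150) on a 1000 by 600 page has relative position (20.0, 25.0).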

Phenomenological: Record
How to find the boundaries of records in a page?
Record := representation of a single entity of the domain
values, structure, layout: similar to other records on the page
clearly separated from other records in a regular structure (data area)
content-rich (text, attributes)
Attribute := value of a certain attribute type of an entity
similar (content, structure, layout) to same attributes in other records
often labeled or with specific value type
Data area := area of repeated, regular records

Phenomenological: Record
Exhaustive search is inefficient and only addresses low precision
low recall is at least as much of an issue
+ contradicting annotations may be a clue per se
therefore: AMBER search informed by domain knowledge
use domain knowledge to guess data area & record segmentation
support alignment with domain knowledge

[Figure 3: Data area identification, with nodes D1, D2, D3, E, and M1,1 to M1,4]

consistent_cluster_members(C, N1, N2, N3) :- pivot(N1), pivot(N2), ...
    similar_depth(N1, N2), similar_depth(N2, N3), similar_depth(N1, N3),
    similar_tree_distance(N1, N2, N3).
cluster(C, N) :- continuous, lca, contains at least one of all mandatories

The pivot nodes in E are organized rather regularly, whereas the pivot nodes in D1 vary quite notably. However, their variation is small enough that M1,1 to M1,4 are depth and
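
One possible reading of the consistency check in the rule above, sketched in Python. The representation of a node as its path of child indices from the root, the tolerance of 1, and the definition of tree distance via the lowest common ancestor are all assumptions for illustration; the slides do not define similar_depth or similar_tree_distance.

```python
from itertools import combinations


def depth(path):
    """Depth of a node given as a tuple of child indices from the root."""
    return len(path)


def tree_distance(a, b):
    """Number of edges between two nodes via their lowest common ancestor."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return len(a) + len(b) - 2 * common


def similar_depth(a, b, tol=1):
    return abs(depth(a) - depth(b)) <= tol


def similar_tree_distance(a, b, c, tol=1):
    """Consecutive pivots should be roughly equally far apart in the tree."""
    return abs(tree_distance(a, b) - tree_distance(b, c)) <= tol


def consistent_cluster_members(n1, n2, n3):
    """Three pivot nodes are consistent if all pairs have similar depth
    and the distances between consecutive pivots are similar."""
    pairwise = all(similar_depth(x, y) for x, y in combinations((n1, n2, n3), 2))
    return pairwise and similar_tree_distance(n1, n2, n3)
```

With a nonzero tolerance, pivots that vary slightly in depth can still fall into one cluster, which matches the remark that small variation among the pivot nodes is acceptable.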

[Charts: precision and recall for data areas, records, and attributes on Real Estate (100 pages) and Used Car (100 pages), axis from 98 to 100; precision and recall per attribute (price, postcode, location, bathroom, bedroom, reception, legal, type), axis from 90 to 100]

Ontological: Constraints for real
Annotation schema: Λ = (A, <, ≺, (isLabel_a, isValue_a : a ∈ A))
set A of annotation types
a transitive, reflexive subclass relation <
a transitive, irreflexive, antisymmetric precedence relation ≺
and two characteristic functions isLabel_a and isValue_a on text nodes for each a ∈ A

Domain schema: Σ = (Λ, T, C_T, C_Λ)
annotation schema Λ
set of domain types T
C_T, C_Λ: map domain types to classification & structural constraints
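
The properties required of the two relations in the annotation schema can be checked mechanically. The sketch below represents a relation as a set of pairs; the example annotation types (price, min_price, max_price) and the helper names are made up for illustration and are not taken from the DIADEM domain models.

```python
from itertools import product


def is_transitive(rel):
    """(a, b) and (b, d) in rel implies (a, d) in rel."""
    return all((a, d) in rel for (a, b), (c, d) in product(rel, rel) if b == c)


def is_reflexive_on(rel, types):
    return all((t, t) in rel for t in types)


def is_irreflexive(rel):
    return all(a != b for a, b in rel)


def is_antisymmetric(rel):
    return all(a == b or (b, a) not in rel for a, b in rel)


# Hypothetical annotation types A with a subclass and a precedence relation.
TYPES = {"price", "min_price", "max_price"}
SUBCLASS = {(t, t) for t in TYPES} | {("min_price", "price"), ("max_price", "price")}
PRECEDES = {("min_price", "max_price")}


def valid_annotation_relations(types, subclass, precedes):
    """Check the relation properties stated for the annotation schema."""
    return (is_transitive(subclass) and is_reflexive_on(subclass, types)
            and is_transitive(precedes) and is_irreflexive(precedes)
            and is_antisymmetric(precedes))
```

Calling valid_annotation_relations(TYPES, SUBCLASS, PRECEDES) returns True for this toy schema; dropping the reflexive pairs from SUBCLASS makes it fail.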

[Figure: model of a real-estate form. Real-Estate Form → Buy/Rent Form; segments: Geographic Features, Location, Buy/Rent, Type of Use, Price; fields: Buy/Rent, Location, Area/Branch, Type of Use, Bedroom, Min-Price, Max-Price, Button; form labels: Buying, Renting, Local, National, Location/…, Office, Residential, Commercial, Min. Bedrooms, Price Range (£), Submit]

TEMPLATE segment<C> {
  segment<C>(G) ⇐ child(N1,G), not child(N2,G)
      not(concept<C>(N2) ∨ segment<C>(N2)) }

TEMPLATE segment_range<C,CM> {
  segment<C>(G) ⇐ concept<CM>(N1), concept<CM>(N2), N1 ≠ N2,
      child(N1,G), child(N2,G) }

TEMPLATE segment_with_unique<C,U> {
  segment<C>(G) ⇐ child(N1,G), concept<U>(N1,G), not child(N2,G),
      N1 ≠ N2, not(concept<C>(N2) ∨ segment<C>(N2)). }

TEMPLATE unique<C> {
  unique<C>(N1,G) ⇐ concept<C>(N1), child(N1,G),
      ¬(child(N2,G), N1 ≠ N2, concept<C>(N2)) }

Figure 9: OPAL-TL structural constraints
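
To make two of these constraints concrete, here is a sketch of how unique<C> and segment_range<C,CM> might be evaluated over a small form tree. This is one reading of the templates, not OPAL's implementation: the dictionaries children (group → child nodes) and concepts (node → set of concepts), and all node names, are hypothetical.

```python
def unique(concepts, children, c):
    """Pairs (node, group) where node is the only child of group with concept c."""
    result = []
    for group, kids in children.items():
        matching = [n for n in kids if c in concepts.get(n, set())]
        if len(matching) == 1:
            result.append((matching[0], group))
    return result


def segment_range(concepts, children, cm):
    """Groups with at least two distinct children carrying concept cm."""
    return [
        group for group, kids in children.items()
        if sum(1 for n in set(kids) if cm in concepts.get(n, set())) >= 2
    ]


# A toy form: a price group with min/max fields and a location group.
CHILDREN = {"price_group": ["min", "max", "lbl"], "loc_group": ["loc"]}
CONCEPTS = {"min": {"price"}, "max": {"price"}, "loc": {"location"}}
```

Here the price group qualifies as a range segment because it has two distinct price fields, while the location field is unique within its group.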

[Charts: Precision, Recall, and F-score on UK Real Estate (100), UK Used Car (100), ICQ (98), and Tel-8 (436), axis from 0.94 to 1; results on Airfare, Auto, Book, Job, and US R.E., axis from 0.9 to 1, compared with Dragut et al., VLDB, 2009]

Contribution of Scopes
[Chart: contribution of the field, segment, layout, and domain scopes for Real-estate and Used-car, axis from 0.6 to 1]

DIADEM ›❯ Future

Summary
Examples of knowledge (and its representation) in DIADEM
observational:
clues for price (“looks like a price”) and location
representation: Gazetteers, JAPE rules, WEKA classifiers & Datalog¬,Agg rules
phenomenological:
a real estate record and its attributes
representation: Datalog¬,Agg,± rules
ontological:
constraints for real estate form
representation: template language on top of Datalog¬,Agg,± rules

Where are we?
Known knowns: we know what and how
site-specific or supervised data extraction
Known unknowns: we know what
templates need to be discovered
but: what we are interested in is known
DIADEM 0.2 will mostly cover this
Unknown unknowns:
where we don’t even know what we are looking for
never-ending learning of domain concepts
semi-supervised
