Information Extraction from the Web - Algorithms and Tools

Benjamin Habegger
University of Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205
Seminar on Information Extraction from the Web
ENSIAS, Rabat, Morocco - June 19, 2013

About Me
@b_habegger
http://www.linkedin.com/in/benjaminhabegger
benjamin.habegger@insa-lyon.fr

Overview
● Fundamentals of information extraction from the web
– Document representations
– Approaches
● Algorithms to extract information from semi-structured web content
– Wien, Stalker, DIPRE, IERel
● Tools to describe and run web scrapers
– WetDL, WebSource
● Applications and extensions of information extraction
– Making our human web smarter
– Learning mappings for data integration

What types of data are we talking about?

Types of data on the Web
● Structured
● Unstructured
● Semi-structured

Semi-structured data
● Usually, but not limited to, data from a
database formatted as HTML
● Listings of entities
● Presented in a “regular” presentation format

Multiple possible representations
● (DOM) Tree
● Rendered page
● String / HTML string:
<tr class="participant">
  <td class="pname" id="part1968752570">
     […]
     <div class="pname">Benjamin</div>
  </td>
  […]
</tr>

What do we want to do with those documents ?

Information extraction from the web
monster.fr apec.fr remixjobs.com
Job Database

Information extraction from the web
● Extract data from one or more web sites
● Wrap it into a predefined target format

Wrappers (scrapers)
monster.fr apec.fr remixjobs.com
Job Database

Algorithms to learn wrappers
● Wien
● Stalker
● SoftMealy
● IEPad
● RoadRunner
● DIPRE
● IERel
● TreePat Miner
● Squirrel

Wrapper representations
● A program
● A transducer (string or tree)
● A regular expression
● A tree pattern
● A query

Document and wrapper representations
Algorithm                 Document Model    Query/Wrapper Model
Wien [Kushmerick]         String            LR-Patterns
Stalker [Muslea]          String            Delimiter-rules
SoftMealy [Hsu & Dung]    Analysed String   Transducer
IERel [Habegger]          HTML String       *-Patterns
Squirrel [Carme]          DOM Tree          Tree Automata
Habegger & Debarbieux     DOM Tree          Tree-Pattern Queries

SoftMealy
● Input:
– Completely labeled document
● Preprocessing:
– Tokenize input string
● Output:
– A transducer

SoftMealy: Document Representation
Symbol       Description
CAlph(x)     String composed of only capitals
C1Alph(x)    String starting with a capital
Num(x)       Numerical string
Html(x)      An HTML tag
OAlph(x)     String of alpha-numerical characters
Punc(x)      Punctuation symbol
NL(n)        n line feeds
Tab(n)       n tabulations
Spc(n)       n spaces
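
A minimal sketch of the kind of tokenization the symbol table above suggests. This is not SoftMealy's actual tokenizer: the regular expression, the precedence between classes, and the names `CHUNK`, `classify` and `tokenize` are assumptions made for illustration, and only ASCII letters and digits are treated as alphanumeric.

```python
import re

# Split the input into raw chunks: tags, whitespace runs, words, single symbols.
# (Assumed splitting rule, ASCII only; not taken from the SoftMealy paper.)
CHUNK = re.compile(r"<[^>]+>|\n+|\t+| +|[A-Za-z0-9]+|[^\sA-Za-z0-9]")

def classify(chunk):
    """Map one raw chunk to a symbol from the table."""
    if chunk.startswith("<") and chunk.endswith(">") and len(chunk) > 2:
        return f"Html({chunk})"
    if chunk[0] == "\n":
        return f"NL({len(chunk)})"
    if chunk[0] == "\t":
        return f"Tab({len(chunk)})"
    if chunk[0] == " ":
        return f"Spc({len(chunk)})"
    if chunk.isdigit():
        return f"Num({chunk})"
    if chunk.isalpha() and chunk.isupper():
        return f"CAlph({chunk})"
    if chunk[0].isupper():
        return f"C1Alph({chunk})"
    if chunk.isalnum():
        return f"OAlph({chunk})"
    return f"Punc({chunk})"

def tokenize(text):
    """Turn a string into the list of symbols a transducer would read."""
    return [classify(chunk) for chunk in CHUNK.findall(text)]

print(tokenize("<b>IBM</b> Lyon 42, ok"))
```

The transducer then works over these symbols rather than raw characters, which is what lets its patterns take format into account.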

SoftMealy: Conclusion
● String-based wrapper induction algorithm
● Patterns which take format into account
→ Improvement over WIEN
● Like WIEN and Stalker:
– requires a lot of labeling
– “batch” approach

RoadRunner
● Input:
– Collection of sample pages
● Algorithm
– Induce structural pattern from the pages
● Output
– A DTD-like schema structure for the documents

RoadRunner
● Wraps regularities into a page pattern
– Compacts structure
● Structural items of the induced schema are NOT
mapped to a target schema
● Option: use the output as input to a mapping
mining algorithm

DIPRE [Brin1998]
● Input:
– Example instances of a relation to be extracted
– A collection of web documents
● Output:
– Patterns to be applied to the collection
– (New) instances extracted using the patterns
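
A rough sketch of the bootstrapping loop implied by these inputs and outputs: known instances yield patterns, and patterns yield new instances. It is a deliberate simplification, not the published algorithm. Here a pattern is just a (prefix, middle, suffix) triple of literal strings, and the names `find_patterns`, `apply_patterns`, `dipre` and the context width `ctx` are hypothetical.

```python
import re

def find_patterns(docs, instances, ctx=3):
    """Collect (prefix, middle, suffix) contexts around known instances."""
    patterns = set()
    for doc in docs:
        for first, second in instances:
            i = doc.find(first)
            if i < 0:
                continue
            j = doc.find(second, i + len(first))
            if j < 0:
                continue
            prefix = doc[max(0, i - ctx):i]
            middle = doc[i + len(first):j]
            suffix = doc[j + len(second):j + len(second) + ctx]
            if prefix and middle and suffix:
                patterns.add((prefix, middle, suffix))
    return patterns

def apply_patterns(docs, patterns):
    """Use each pattern as a regular expression to extract candidate pairs."""
    found = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(
            re.escape(prefix) + r"(.+?)" + re.escape(middle)
            + r"(.+?)" + re.escape(suffix)
        )
        for doc in docs:
            found.update(regex.findall(doc))
    return found

def dipre(docs, seeds, max_rounds=5):
    """Alternate pattern induction and extraction until nothing new appears."""
    instances, patterns = set(seeds), set()
    for _ in range(max_rounds):
        patterns |= find_patterns(docs, instances)
        new = apply_patterns(docs, patterns) - instances
        if not new:
            break
        instances |= new
    return patterns, instances

# Toy collection (made-up data, one seed pair).
docs = ["<li><b>Isaac Asimov</b>, <i>Foundation</i></li>"
        "<li><b>Frank Herbert</b>, <i>Dune</i></li>"]
patterns, instances = dipre(docs, {("Isaac Asimov", "Foundation")})
print(sorted(instances))
```

Nothing in this loop checks how specific a pattern is, so a too-short context would happily match unrelated text.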

DIPRE: Relation extraction from a web cache
[Diagram: Web Cache, Relation Instances, Very Basic Extraction Patterns]

DIPRE
● Interesting cyclic process
● Very (too) simple patterns for IE
● Problem of over-generalization
● Pattern sets drift away from their extraction target

IERel
● Input:
– Examples of a relation to be extracted
● Algorithm
– Extract patterns & generalize them
● Output
– Extraction patterns

IERel: Document representation
HTML string:    […] <div class="pname"> B e n j a m i n </div> </td> […] </tr>
Token sequence: §1§ §2§ […] §3§ B e n j a m i n §4§ §5§ […] §6§

IERel: Generalization
HTML string:    […] <div class="pname"> M o h a m e d </div> </td> […] </tr>
Token sequence: §1§ §7§ […] §3§ M o h a m e d §4§ §5§ […] §6§

IERel: Generalization
Example:             §1§ §7§ […] §3§ M o h a m e d §4§ §5§ […] §6§
Example:             §1§ §2§ […] §3§ B e n j a m i n §4§ §5§ […] §6§
Generalized pattern: §1§ * […] §3§ * §4§ §5§ […] §6§
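
One way to picture this generalization step: align the two token sequences, keep the tokens they share, and replace each differing stretch with a single `*`. The sketch below uses Python's `difflib.SequenceMatcher` for the alignment, which is a stand-in and not IERel's own procedure. It also assumes each text run is one token (the slides show text character by character) and leaves out the elided `[…]` parts. The function name `generalize` is hypothetical.

```python
from difflib import SequenceMatcher

def generalize(seq_a, seq_b, wildcard="*"):
    """Keep the tokens shared by both sequences, wildcard the rest."""
    matcher = SequenceMatcher(a=seq_a, b=seq_b, autojunk=False)
    pattern = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            pattern.extend(seq_a[i1:i2])
        elif not pattern or pattern[-1] != wildcard:
            # Any replace/insert/delete stretch collapses to one wildcard.
            pattern.append(wildcard)
    return pattern

first = ["§1§", "§2§", "§3§", "Benjamin", "§4§", "§5§", "§6§"]
second = ["§1§", "§7§", "§3§", "Mohamed", "§4§", "§5§", "§6§"]
print(generalize(first, second))
```

With per-character text tokens, a plain alignment would also keep letters the two names happen to share, so a real implementation needs to treat the text between two tag tokens as one unit.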

IERel: Interactive Learning
[Diagram: interactive learning loop. Labels: Examples, Patterns, Extracted Results, New examples / Negate wrong ones, Refined Patterns, Results using refined patterns]

Coping with over-generalization
Learn a set of patterns, i.e. a disjunction of conjunctions
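
To make "a disjunction of conjunctions" concrete: a single pattern matches only if all of its token constraints hold (conjunction), and a pattern set matches if any one of its patterns does (disjunction). The sketch below assumes a simplified pattern form in which `*` stands for exactly one token. The names `matches` and `matches_any` are hypothetical.

```python
def matches(pattern, tokens):
    """Conjunction: every position must agree, '*' accepts any one token."""
    return len(pattern) == len(tokens) and all(
        p == "*" or p == t for p, t in zip(pattern, tokens)
    )

def matches_any(patterns, tokens):
    """Disjunction: it is enough that one pattern of the set matches."""
    return any(matches(pattern, tokens) for pattern in patterns)

# One over-general pattern versus a set of two more specific ones.
over_general = [["§1§", "*", "*"]]
pattern_set = [["§1§", "§2§", "*"], ["§1§", "§7§", "*"]]

wanted = ["§1§", "§2§", "Benjamin"]
unwanted = ["§1§", "§9§", "Advertisement"]

print(matches_any(over_general, unwanted))  # the single pattern lets it through
print(matches_any(pattern_set, wanted), matches_any(pattern_set, unwanted))
```

Keeping several specific patterns avoids merging them into one pattern that also accepts unwanted rows.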

IERel: Evaluation
● Multiple tested domains
– Online directories
– Search engine results
– Product catalogs

IERel: Conclusion
● Labeling can be limited
● Underlines the interest for interactive learning

Other algorithms on trees
● Carme et al.
– inducing node selecting tree automata
● Marty et al.
– Tabular descriptions of nodes to be selected
– Using classification techniques

We can extract data from the web.
Now what ?

WetDL
● Workflow description of web navigation patterns
● An execution model
● A collection of meta-operators
– Query
– Fetch
– Parse
– Extract
– Transform
– External

Semantics of a WetDL workflow
● Nodes are processors
– Receive messages through a queue
– Process and dispatch the result messages
● A processor may generate 0, 1 or n messages
● Workflow terminates when all queues are empty
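
The execution semantics above can be sketched as a small queue-driven loop. This is an illustration of the described behaviour only, not WebSource code: the `Processor` class, the `run` function, and the three stand-in processors (a fake fetch over an in-memory dictionary, a comma-splitting extract, and a store) are all hypothetical.

```python
from collections import deque

class Processor:
    """A workflow node: reads messages from its queue, emits 0, 1 or n results."""
    def __init__(self, name, func):
        self.name = name
        self.func = func          # message -> iterable of result messages
        self.queue = deque()
        self.successors = []

    def connect(self, other):
        self.successors.append(other)
        return other

def run(processors):
    """Process messages until every queue is empty."""
    while any(p.queue for p in processors):
        for p in processors:
            while p.queue:
                message = p.queue.popleft()
                for result in p.func(message):
                    for successor in p.successors:
                        successor.queue.append(result)

# Stand-in data and processors (no network access).
PAGES = {"page1": "a,b", "page2": ""}
results = []

def store_item(item):
    results.append(item)
    return []                     # a sink generates no further messages

fetch = Processor("fetch", lambda url: [PAGES[url]])
extract = Processor("extract", lambda page: [x for x in page.split(",") if x])
store = Processor("store", store_item)
fetch.connect(extract).connect(store)

fetch.queue.extend(["page1", "page2"])
run([fetch, extract, store])
print(results)
```

Here `page2` produces zero messages at the extract step and `page1` produces two, which covers the "0, 1 or n" case.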

WebSource: execute WetDL flows
● Each node can:
– enqueue data (push)
– generate data (pull)
● Processing can occur:
– on push (forward chaining)
– on pull (backward chaining)
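
The difference between processing on push and processing on pull can be shown with a toy contrast between an eager node and a lazy generator. This is only an analogy for the two modes listed above, not how WebSource implements them. `PushNode`, `pull_node` and `double` are hypothetical names.

```python
log = []

def double(x):
    log.append(f"double({x})")
    return 2 * x

class PushNode:
    """Push: work happens as soon as the producer hands over an item."""
    def __init__(self, func, downstream=None):
        self.func = func
        self.downstream = downstream
        self.outputs = []

    def push(self, item):
        result = self.func(item)
        if self.downstream is not None:
            self.downstream.push(result)
        else:
            self.outputs.append(result)

def pull_node(func, upstream):
    """Pull: work happens only when the consumer asks for the next item."""
    for item in upstream:
        yield func(item)

# Push mode: both items are processed immediately.
sink = PushNode(double)
for value in [1, 2]:
    sink.push(value)
pushed = list(sink.outputs)

# Pull mode: nothing is processed until next() is called.
log.clear()
lazy = pull_node(double, iter([1, 2]))
before = list(log)
first = next(lazy)
after = list(log)
print(pushed, before, first, after)
```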

WetDL
● Simple description of navigation patterns
– Straightforward operators in the context of IE
● Powerful expressiveness (in particular for IE)
– We can describe most (if not all) web information
extraction tasks

WebSource
Open-source WetDL interpreter
http://websource.sf.net/

Semabot: Motivation
What does the following query give ?
“lyon informatique emploi”

Semabot: Motivation
A list of documents containing the terms
“lyon”
“informatique”
“emploi”

Semabot: Objectives
The query “lyon informatique emploi”
should give:
A list of computer engineer job offers

Semabot
● Registry of “object” schemas and wrappers
● Wrappers generate “objects”
– Job offers, People, Products, etc.
● Crawler wraps pages and indexes objects

Semabot: Open problems
● Wrap the web into objects
– i.e. what we have seen in this seminar ;)
● Interpret (some of) the terms of the query
– “lyon” => http://en.wikipedia.org/wiki/Lyon
– “emploi” => http://en.wikipedia.org/wiki/Job_(role)

Information Extraction
● WHAT?
– Turn content adapted to human consumption into
content consumable by a target schema
● HOW?
– Using machine learning approaches

Data Integration
● WHAT?
– Turn content adapted to a source schema into
content consumable by a target schema
● HOW?
– Using machine learning approaches

Data Integration
[Diagram: DB 1 (Schema 1) and App A (Schema 2), linked by Mappings and Query Rewriting]

Extracting = Mapping
Data model         Query Super Model
String             Regular Expressions / Automata
Tree               XPath Expressions
Relational data    SQL/SPARQL Expressions
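
As a small illustration of the first two rows, the same value can be extracted with a regular expression over the string or with an XPath-style query over the tree. The row below is a simplified version of the participant example from earlier in the talk, without the `id` attribute and the elided parts. It is assumed to be well-formed so the standard-library XML parser can read it.

```python
import re
import xml.etree.ElementTree as ET

ROW = ('<tr class="participant"><td class="pname">'
       '<div class="pname">Benjamin</div></td></tr>')

# String data model: the "query" is a regular expression over the markup.
by_regex = re.findall(r'<div class="pname">(.*?)</div>', ROW)

# Tree data model: the "query" is a path expression over the parsed tree
# (ElementTree supports a limited XPath subset, enough for this case).
tree = ET.fromstring(ROW)
by_xpath = [div.text for div in tree.findall(".//div[@class='pname']")]

print(by_regex, by_xpath)
```

Both queries play the same role: they map a fragment of the source model onto a field of the target schema.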

Wrapping HTML to RDF
<li id="gs2">
<b>Samsung Galaxy S II</b>
<i>300 EUR</i> <br />
Vendor: charly@example.com
</li>
Rendered item:
● Samsung Galaxy S II 300 EUR
Vendor: charly@example.com
Resulting description:
http://phones.example.com/samsung/charly/#gs2
  name    Samsung Galaxy S II
  price   300 EUR
  vendor  charly@example.com
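
A hand-written wrapper for this one item might look like the sketch below. It assumes the fragment is well-formed enough for an XML parser, which real HTML often is not. It emits plain (subject, property, value) tuples rather than real RDF with vocabulary URIs. The function name `wrap` is hypothetical, and the base URL is the one shown on the slide.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<li id="gs2">
<b>Samsung Galaxy S II</b>
<i>300 EUR</i> <br />
Vendor: charly@example.com
</li>"""

BASE = "http://phones.example.com/samsung/charly/"

def wrap(fragment, base):
    """Turn one list item into (subject, property, value) triples."""
    item = ET.fromstring(fragment)
    subject = f"{base}#{item.get('id')}"
    # The vendor line is the text that follows the <br /> element.
    vendor_text = item.find("br").tail or ""
    return [
        (subject, "name", item.find("b").text.strip()),
        (subject, "price", item.find("i").text.strip()),
        (subject, "vendor", vendor_text.split(":", 1)[1].strip()),
    ]

triples = wrap(FRAGMENT, BASE)
for triple in triples:
    print(triple)
```

The wrapper-learning algorithms presented earlier aim to produce this kind of mapping from examples instead of by hand.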

Wrap-up
● Tour of information extraction
– Learning wrappers
– Building IE tasks
● Link with semantic web/open data
● Link with data integration

Perspectives
● Further explore the potential of interactive learning
● Learning navigation patterns
● Search of “objects” rather than documents
● Extension of interaction cycle
– pattern generation
– some form of automated pattern evaluation
– continuous (re)learning

Thank you
