Information Extraction from the Web - Algorithms and Tools
1. Information Extraction from the Web: Algorithms and Tools
Benjamin Habegger
University of Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205
Seminar on Information Extraction from the Web
ENSIAS, Rabat, Morocco - June 19, 2013
3. Overview
● Fundamentals of information extraction from the web
– Document representations
– Approaches
● Algorithms to extract information from semi-structured web content
– Wien, Stalker, DIPRE, IERel
● Tools to describe and run web scrapers
– WetDL, WebSource
● Applications and extensions of information extraction
– Making our human web smarter
– Learning mappings for data integration
5. Types of data on the Web
● Structured
● Unstructured
● Semi-structured
8. Semi-structured data
● Usually, but not limited to, data from a
database formatted as HTML
● Listings of entities
● Presented in a “regular” presentation format
16. Wrapper representations
● A program
● A transducer (string or tree)
● A regular expression
● A tree pattern
● A query
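As an illustration of the "regular expression" representation listed above, here is a minimal sketch of a wrapper written as a single regular expression. The page markup, the field names (name, city) and the extract helper are invented for this example and do not come from any of the systems discussed.

```python
import re

# Hypothetical listing markup, invented for this illustration.
PAGE = """
<ul>
<li><b>Alice</b> <i>Lyon</i></li>
<li><b>Bob</b> <i>Rabat</i></li>
</ul>
"""

# The wrapper itself: one regular expression with a named group per field.
WRAPPER = re.compile(
    r"<li><b>(?P<name>[^<]+)</b>\s*<i>(?P<city>[^<]+)</i></li>"
)

def extract(page):
    """Apply the wrapper and return one dict per matched record."""
    return [m.groupdict() for m in WRAPPER.finditer(page)]

print(extract(PAGE))
```

The same records could equally be described by a tree pattern or a query over the parsed page; the regular expression is simply the most compact form for string-based documents.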
17. Document and wrapper representations
Algorithm                 Document Model     Query/Wrapper Model
Wien [Kushmerick]         String             LR-Patterns
Stalker [Muslea]          String             Delimiter-rules
SoftMealy [Hsu & Dung]    Analysed String    Transducer
IERel [Habegger]          HTML String        *-Patterns
Squirrel [Carme]          DOM Tree           Tree Automata
Habegger & Debarbieux     DOM Tree           Tree-Pattern Queries
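To make the "String / LR-Patterns" row concrete, here is a rough sketch of how a left-right (LR) wrapper in the style of Wien can be executed: each attribute has a left and a right delimiter, and the page is scanned from left to right. The function name apply_lr and the sample page are assumptions made for this illustration; learning the delimiters from labeled pages is not shown.

```python
def apply_lr(page, delimiters):
    """Execute an LR wrapper.

    delimiters is a list of (left, right) string pairs, one per attribute.
    Returns a list of tuples, one tuple per extracted record.
    """
    if not delimiters:
        return []
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records          # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records          # incomplete record: drop it
            record.append(page[start:end])
            pos = end + len(right)      # continue scanning after the value
        records.append(tuple(record))

# Hypothetical sample page: country names in <b>, numeric codes in <i>.
SAMPLE = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
print(apply_lr(SAMPLE, [("<b>", "</b>"), ("<i>", "</i>")]))
```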
20. SoftMealy: Document Representation
Symbol       Description
CAlph(x)     String composed of only capitals
C1Alph(x)    String starting with a capital
Num(x)       Numerical string
Html(x)      An HTML tag
OAlph(x)     String of alpha-numerical characters
Punc(x)      Punctuation symbol
NL(n)        n line feeds
Tab(n)       n tabulations
Spc(n)       n spaces
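The symbol classes above can be approximated with a small tokenizer. This is only a sketch under stated assumptions: the regular expressions are my own ASCII-only reading of the table (OAlph is treated as the fallback for any other alphanumeric string), not the definitions used in SoftMealy itself, and the tokenize function is hypothetical.

```python
import re

# Order matters: earlier classes take priority over later ones.
TOKEN_CLASSES = [
    ("Html",   r"<[^>]+>"),
    ("NL",     r"\n+"),
    ("Tab",    r"\t+"),
    ("Spc",    r" +"),
    ("Num",    r"[0-9]+(?![A-Za-z0-9])"),
    ("CAlph",  r"[A-Z]+(?![A-Za-z0-9])"),
    ("C1Alph", r"[A-Z][A-Za-z0-9]*"),
    ("OAlph",  r"[A-Za-z0-9]+"),
    ("Punc",   r"[^\sA-Za-z0-9]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_CLASSES))

def tokenize(text):
    """Return (class, value) pairs; the value is a count n for NL, Tab and Spc."""
    tokens = []
    for m in MASTER.finditer(text):   # characters matching no class are skipped
        kind = m.lastgroup
        value = len(m.group()) if kind in ("NL", "Tab", "Spc") else m.group()
        tokens.append((kind, value))
    return tokens

print(tokenize("<b>IBM</b> Lyon 2013, ok\n\n"))
```

A transducer-based wrapper would then be defined over these symbols rather than over raw characters.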
23. SoftMealy: Conclusion
● String-based wrapper induction algorithm
● Patterns which take format into account
→ Improvement over WIEN
● Like WIEN and Stalker:
– requires a lot of labeling
– “batch” approach
25. RoadRunner
● Input:
– Collection of sample pages
● Algorithm
– Induce structural pattern from the pages
● Output
– A DTD-like schema structure for the documents
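The idea of inducing a structural pattern from sample pages can be sketched by aligning two pages token by token. The snippet below is a heavily simplified stand-in: it only generalizes text mismatches into #PCDATA fields using Python's difflib, and it does not handle the optional or repeated structures that RoadRunner discovers. The function names and sample pages are assumptions.

```python
import re
from difflib import SequenceMatcher

def tokens(page):
    """Split a page into tags and non-empty text chunks."""
    return [t for t in re.split(r"(<[^>]+>)", page) if t.strip()]

def induce_template(page_a, page_b):
    """Keep tokens shared by both pages; replace differing spans by #PCDATA."""
    a, b = tokens(page_a), tokens(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])
        else:
            template.append("#PCDATA")   # a field whose value varies
    return template

# Hypothetical sample pages generated from the same layout.
PAGE_A = "<html><b>Title A</b><i>10 EUR</i></html>"
PAGE_B = "<html><b>Title B</b><i>25 EUR</i></html>"
print(induce_template(PAGE_A, PAGE_B))
```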
28. RoadRunner
● Wraps regularities into a page pattern
– Compacts structure
● Structural items of the induced schema are NOT
mapped to a target schema
● Option: use the output as input to a mapping
mining algorithm
30. Dipre [Brin1998]
● Input:
– Example instances of a relation to be extracted
– A collection of web documents
● Output:
– Patterns to be applied to the collection
– (New) instances extracted using the patterns
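The input/output loop above can be sketched as a small bootstrapping procedure. This is a simplified illustration, not Brin's implementation: patterns here are just (prefix, middle, suffix) string triples with a fixed context length, and the function names, the ctx and rounds parameters, and the sample document are all assumptions.

```python
import re

def find_patterns(instances, docs, ctx=4):
    """Locate known (a, b) pairs in the documents and record their contexts."""
    patterns = set()
    for a, b in instances:
        occurrence = re.compile(re.escape(a) + r"(.{1,20}?)" + re.escape(b))
        for doc in docs:
            for m in occurrence.finditer(doc):
                prefix = doc[max(0, m.start() - ctx):m.start()]
                suffix = doc[m.end():m.end() + ctx]
                if prefix and suffix:
                    patterns.add((prefix, m.group(1), suffix))
    return patterns

def apply_patterns(patterns, docs):
    """Use each pattern as a regular expression to extract candidate pairs."""
    found = set()
    for prefix, middle, suffix in patterns:
        rx = re.compile(
            re.escape(prefix) + r"([^<>]+?)" + re.escape(middle)
            + r"([^<>]+?)" + re.escape(suffix)
        )
        for doc in docs:
            found.update(m.groups() for m in rx.finditer(doc))
    return found

def dipre(seeds, docs, rounds=2):
    """Alternate between pattern induction and instance extraction."""
    instances, patterns = set(seeds), set()
    for _ in range(rounds):
        patterns |= find_patterns(instances, docs)
        instances |= apply_patterns(patterns, docs)
    return patterns, instances

# Hypothetical collection with one seed (author, title) pair.
DOCS = ["<li><b>Isaac Asimov</b> wrote <i>Foundation</i></li>"
        "<li><b>Frank Herbert</b> wrote <i>Dune</i></li>"]
patterns, instances = dipre({("Isaac Asimov", "Foundation")}, DOCS)
print(patterns)
print(instances)
```

Because nothing filters the induced patterns, a sketch like this will happily keep overly general ones as the collection grows.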
32. Dipre
● Interesting cyclic process
● Very (too) simple patterns for IE
● Problem of over-generalizations
● The pattern set drifts away from its extraction target
48. Other algorithms on trees
● Carme et al.
– inducing node selecting tree automata
● Marty et al.
– Tabular descriptions of nodes to be selected
– Using classification techniques
51. WetDL
● Workflow description of web navigation patterns
● An execution model
● A collection of meta-operators
– Query
– Fetch
– Parse
– Extract
– Transform
– External
52. Semantics of a WetDL workflow
● Nodes are processors
– Receive messages through a queue
– Process and dispatch the result messages
● A processor may generate 0, 1 or n messages
● Workflow terminates when all queues are empty
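The execution semantics described above can be sketched with plain queues. This is a toy model, not WebSource: the Processor class and run function are hypothetical, the node names only borrow the WetDL operator names, and fetching is simulated with a dictionary instead of real HTTP requests.

```python
import re
from collections import deque

class Processor:
    """A workflow node: consumes messages from its queue, emits 0..n results."""
    def __init__(self, name, func):
        self.name = name
        self.func = func            # message -> list of result messages
        self.queue = deque()
        self.targets = []

    def connect(self, other):
        self.targets.append(other)
        return other                # allows chaining a.connect(b).connect(c)

def run(processors):
    """Process messages until every queue is empty."""
    progress = True
    while progress:
        progress = False
        for p in processors:
            while p.queue:
                progress = True
                message = p.queue.popleft()
                for result in p.func(message):
                    for target in p.targets:
                        target.queue.append(result)

# Stand-in for the web: a single hypothetical page.
FAKE_WEB = {"http://example.com/?q=lyon": "<b>Job A</b><b>Job B</b>"}
results = []

def collect(item):
    results.append(item)
    return []                       # a sink generates no messages

query = Processor("query", lambda kw: [f"http://example.com/?q={kw}"])
fetch = Processor("fetch", lambda url: [FAKE_WEB[url]] if url in FAKE_WEB else [])
extract = Processor("extract", lambda html: re.findall(r"<b>(.*?)</b>", html))
sink = Processor("sink", collect)
query.connect(fetch).connect(extract).connect(sink)

query.queue.extend(["lyon", "rabat"])   # "rabat" yields 0 messages at fetch
run([query, fetch, extract, sink])
print(results)
```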
53. WebSource: execute WetDL flows
● Each node can:
– enqueue data (push)
– generate data (pull)
● Processing can occur:
– on push (forward chaining)
– on pull (backward chaining)
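The difference between the two processing modes can be illustrated in a few lines. This is a generic sketch of push versus pull evaluation, assuming nothing about WebSource's actual API: generators stand in for pull (backward chaining) and a small PushNode class, invented here, stands in for push (forward chaining).

```python
def pull_source(items):
    """Pull: yields an item only when a consumer asks for the next one."""
    for item in items:
        yield item

def pull_map(func, upstream):
    """Pull: requests one item from upstream per item requested from it."""
    for item in upstream:
        yield func(item)

class PushNode:
    """Push: processes each item as soon as it arrives and forwards the result."""
    def __init__(self, func, downstream=None):
        self.func = func
        self.downstream = downstream
        self.received = []

    def push(self, item):
        result = self.func(item)
        self.received.append(result)
        if self.downstream is not None:
            self.downstream.push(result)

# Pull: nothing is computed until the consumer iterates.
pulled = list(pull_map(str.upper, pull_source(["lyon", "rabat"])))

# Push: the producer drives the computation.
sink = PushNode(lambda x: x)
head = PushNode(str.upper, downstream=sink)
for word in ["lyon", "rabat"]:
    head.push(word)

print(pulled, sink.received)
```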
54. WetDL
● Simple description of navigation patterns
– Straightforward operators in the context of IE
● Powerful expressiveness (in particular for IE)
– We can describe most (if not all) web information
extraction tasks
60. Semabot
● Registry of “object” schemas and wrappers
● Wrappers generate “objects”
– Job offers, People, Products, etc.
● Crawler wraps pages and indexes objects
61. Semabot: Open problems
● Wrap the web into objects
– i.e. what we have seen in this seminar ;)
● Interpret (some of) the terms of the query
– “lyon” => http://en.wikipedia.org/wiki/Lyon
– “emploi” => http://en.wikipedia.org/wiki/Job_(role)
62. Information Extraction
● WHAT ?
– Turn content adapted to human consumption into
content consumable by a target schema
● HOW ?
– Using machine learning approaches
63. Data Integration
● WHAT ?
– Turn content adapted to a source schema into
content consumable by a target schema
● HOW ?
– Using machine learning approaches
65. Extracting = Mapping
Data model         Query Super Model
String             Regular Expressions / Automata
Tree               XPath Expressions
Relational data    SQL/SPARQL Expressions
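To illustrate the table, the sketch below answers the same question (which product names are listed?) once per data model using only the Python standard library. The document, the phone table and its rows are invented for the example; ElementTree supports only a subset of XPath, which is enough here.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data, shown as a string, then parsed as a tree.
DOC = "<ul><li><b>Galaxy</b><i>300</i></li><li><b>Lumia</b><i>250</i></li></ul>"

# String model: a regular expression.
names_regex = re.findall(r"<b>([^<]+)</b>", DOC)

# Tree model: an XPath expression.
names_xpath = [b.text for b in ET.fromstring(DOC).findall(".//li/b")]

# Relational model: an SQL query over the same facts.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE phone (name TEXT, price INTEGER)")
db.executemany("INSERT INTO phone VALUES (?, ?)", [("Galaxy", 300), ("Lumia", 250)])
names_sql = [row[0] for row in db.execute("SELECT name FROM phone ORDER BY price DESC")]

print(names_regex, names_xpath, names_sql)
```

In each case the query plays the role of the wrapper: it maps the source representation onto the fields of interest.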
66. Wrapping HTML to RDF
<li id="gs2">
<b>Samsung Galaxy S II</b>
<i>300 EUR</i> <br />
Vendor: charly@example.com
</li>
● Samsung Galaxy S II 300 EUR
Vendor: charly@example.com
http://phones.example.com/samsung/charly/#gs2
name: Samsung Galaxy S II
price: 300 EUR
vendor: charly@example.com
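A minimal sketch of such a wrapper, assuming the list item always has the layout shown in this slide. The subject URL is built from the page address and the id attribute; the vocabulary namespace (VOCAB) and the wrap function are hypothetical, and the output only loosely follows N-Triples syntax.

```python
import re

HTML = '''<li id="gs2">
<b>Samsung Galaxy S II</b>
<i>300 EUR</i> <br />
Vendor: charly@example.com
</li>'''

BASE = "http://phones.example.com/samsung/charly/"
VOCAB = "http://example.com/vocab#"   # hypothetical property namespace

ITEM = re.compile(
    r'<li id="(?P<id>[^"]+)">\s*'
    r'<b>(?P<name>[^<]+)</b>\s*'
    r'<i>(?P<price>[^<]+)</i>\s*<br\s*/>\s*'
    r'Vendor:\s*(?P<vendor>\S+)\s*</li>'
)

def wrap(html):
    """Turn each matching list item into three subject-property-value triples."""
    triples = []
    for m in ITEM.finditer(html):
        subject = f"<{BASE}#{m['id']}>"
        for prop in ("name", "price", "vendor"):
            triples.append(f'{subject} <{VOCAB}{prop}> "{m[prop]}" .')
    return triples

for triple in wrap(HTML):
    print(triple)
```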
67. Wrap-up
● Tour of information extraction
– Learning wrappers
– Building IE tasks
● Link with semantic web/open data
● Link with data integration
68. Perspectives
● Further explore the potential of interactive learning
● Learning navigation patterns
● Search for “objects” rather than documents
● Extension of interaction cycle
– pattern generation
– some form of automated pattern evaluation
– continuous (re)learning