Wrapper induction is a technique to automatically generate wrappers to extract information from web sources. It involves learning extraction rules from labeled examples to construct a wrapper as a finite state machine or set of delimiters. Two main wrapper induction systems are WIEN, which defines wrapper classes including LR, and STALKER, which uses a more expressive model with extraction rules and landmarks to handle structure hierarchically. Remaining challenges include selecting informative examples, generating label pages automatically, and developing more expressive models.
Wrapper Induction: Construct wrappers automatically to extract information from web sources
Hongfei Qu
Computing Science Department
Simon Fraser University

CMPT 882 Presentation
March 28, 2001

Outline:
• What is wrapper
• Wrapper Induction
• WIEN
• STALKER
• Remaining Questions
• HTML DOM Tree
• Other Related Works
• References

What is wrapper
• A wrapper is a procedure that extracts data from a specific web source
• First find a vector of strings that delimit the text to be extracted
• Example page:
  <HTML><TITLE>Country Codes</TITLE>
  <BODY><B>Congo</B> <I>242</I><BR>
  <B>Spain</B> <I>34</I><BR>
  <HR><B>END</B></BODY></HTML>
• To extract (country, code) pairs, we find a vector of strings (<B>, </B>, <I>, </I>) that mark the left and right of the extracted text.

What is wrapper
• execLR(wrapper(<B>, </B>, <I>, </I>), page P):
    m = 0
    while there are more occurrences in P of <B>
      m = m + 1
      for each (lk, rk) in {(<B>, </B>), (<I>, </I>)}
        scan in P to the next occurrence of lk; save position as b(m,k)
        scan in P to the next occurrence of rk; save position as e(m,k)
    return label {…, (b(m,1), e(m,1)), (b(m,2), e(m,2)), …}
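
The execLR procedure above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the names exec_lr and PAGE are chosen here, the function returns the extracted strings rather than the (b, e) positions, and it stops when a row cannot be completed. That last behavior is an assumption added so that the trailing <B>END</B> does not yield a partial tuple.

```python
def exec_lr(delims, page):
    """Apply an LR wrapper. delims is a list of (left, right) delimiter pairs,
    one per attribute. Returns a list of tuples of extracted strings."""
    rows, pos = [], 0
    first_left = delims[0][0]
    while page.find(first_left, pos) != -1:
        row = []
        for left, right in delims:
            b = page.find(left, pos)
            if b == -1:
                return rows          # incomplete trailing row: stop (assumption)
            b += len(left)           # value begins just past the left delimiter
            e = page.find(right, b)
            if e == -1:
                return rows
            row.append(page[b:e])
            pos = e + len(right)     # continue scanning past the right delimiter
        rows.append(tuple(row))
    return rows


# The example page from the slide, joined into one string.
PAGE = ("<HTML><TITLE>Country Codes</TITLE>"
        "<BODY><B>Congo</B> <I>242</I><BR>"
        "<B>Spain</B> <I>34</I><BR>"
        "<HR><B>END</B></BODY></HTML>")

RESULT = exec_lr([("<B>", "</B>"), ("<I>", "</I>")], PAGE)
print(RESULT)
```

On the example page this prints [('Congo', '242'), ('Spain', '34')].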

Wrapper Induction
• Motivation: hand-coding wrappers is tedious and error-prone. What happens when web pages change?
• Wrapper induction (automatically generating a wrapper) is a typical machine learning task.
• Input: a set E of example pages Pn and the corresponding label pages Ln
• Output: a wrapper w such that w(Pn) = Ln

Wrapper Induction
• In practice we learn a vector of delimiters, which is used to instantiate a wrapper class (template) that describes the document structure
• Free text & Web pages
• A good wrapper induction system should offer:
  – Expressiveness: how well the wrapper class handles a particular web site
  – Efficiency: how many samples are needed? How much computation is required?

WIEN
• First wrapper induction system, implemented at U. Washington. Works for both web pages and free text.
• WIEN defines 6 wrapper classes (templates) to express the structure of web sites.
• The simplest yet powerful one is the LR (left-right) wrapper class. It uses left- and right-hand delimiters to extract the relevant information
• To extract tuples with K attributes from a set of examples E, the learning algorithm is:

WIEN
• Procedure learnLR(examples E)
    for each 1 <= k <= K
      for each u in Candl(k, E): if u is valid for the kth attribute in E, then lk = u and terminate the loop
    for each 1 <= k <= K
      for each u in Candr(k, E): if u is valid for the kth attribute in E, then rk = u and terminate the loop
    return LR wrapper(l1, r1, …, lK, rK)
• Procedure Candl(k, E) returns candidates for lk by enumerating the suffixes of the shortest string occurring to the left of each instance of attribute k
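
The learnLR loop can be illustrated with a simplified Python sketch. Everything here is an assumption made for illustration: the names learn_lr and contexts, the representation of a label page as rows of attribute strings, and the validity tests. A left candidate is accepted when its first occurrence in every left context is at the very end of that context. A right candidate is accepted when it starts every right context and does not occur inside any value. WIEN's actual validating constraints are more detailed. Right candidates are taken as prefixes of the shortest right context, mirroring Candl.

```python
def contexts(page, rows, k):
    """Text to the left and right of every instance of attribute k, plus values."""
    spans, pos = [], 0
    for row in rows:                      # locate labeled values in page order
        for value in row:
            b = page.index(value, pos)
            pos = b + len(value)
            spans.append((b, pos))
    K = len(rows[0])
    lefts, rights, values = [], [], []
    for i in range(k, len(spans), K):
        b, e = spans[i]
        prev_end = spans[i - 1][1] if i > 0 else 0
        next_begin = spans[i + 1][0] if i + 1 < len(spans) else len(page)
        lefts.append(page[prev_end:b])
        rights.append(page[e:next_begin])
        values.append(page[b:e])
    return lefts, rights, values


def learn_lr(examples):
    """examples: list of (page, rows). Returns one (left, right) pair per attribute.
    Raises StopIteration if no candidate is valid (LR cannot wrap the site)."""
    K = len(examples[0][1][0])
    delims = []
    for k in range(K):
        lefts, rights, values = [], [], []
        for page, rows in examples:
            l, r, v = contexts(page, rows, k)
            lefts += l
            rights += r
            values += v
        short_l, short_r = min(lefts, key=len), min(rights, key=len)
        # Candidate suffixes / prefixes, shortest first.
        cand_l = [short_l[i:] for i in range(len(short_l) - 1, -1, -1)]
        cand_r = [short_r[:i] for i in range(1, len(short_r) + 1)]
        left = next(u for u in cand_l
                    if all(s.find(u) == len(s) - len(u) for s in lefts))
        right = next(u for u in cand_r
                     if all(s.startswith(u) for s in rights)
                     and all(u not in v for v in values))
        delims.append((left, right))
    return delims


PAGE = ("<HTML><TITLE>Country Codes</TITLE>"
        "<BODY><B>Congo</B> <I>242</I><BR>"
        "<B>Spain</B> <I>34</I><BR>"
        "<HR><B>END</B></BODY></HTML>")
ROWS = [("Congo", "242"), ("Spain", "34")]

LEARNED = learn_lr([(PAGE, ROWS)])
print(LEARNED)
```

With shortest-first enumeration the sketch learns [('B>', '<'), ('I>', '<')] on this single example, shorter than the hand-picked (<B>, </B>, <I>, </I>) but consistent with the labels. Each delimiter is learned independently of the others, which is what keeps the search small.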

WIEN
• Procedure Candr(k, E) returns candidates for rk by enumerating the prefixes of the shortest string occurring to the right of each instance of attribute k
• Each wrapper class has a set of validating constraints
• Other wrapper classes:
  – HLRT: adds a head delimiter h & a tail delimiter t
  – OCLR: uses open and close delimiters to indicate the beginning and end of each tuple
  – HOCLRT: combination of HLRT and OCLR
  – N-LR and N-HLRT: handle nested structure
• Combined, the 6 classes can handle 70% of web sites

WIEN
• Which wrapper class do we choose for a web site?
• How many examples are required? PAC model
    N: number of examples
    e: accuracy parameter, 0 < e < 1
    a: confidence parameter, 0 < a < 1
  For a learned wrapper W, if we want error(W) < e with probability at least a, the PAC model for the LR class is:
    N >= 1/(1-a) * (2K*ln(R) - ln(1-a)), where R is the length of the shortest example.
• A way to terminate the learning procedure
• A loose bound compared with test results

STALKER
• A wrapper induction project at U. Southern California. Only works for web pages.
• More expressive and efficient than WIEN.
• Treats a web page as a tree-like structure and handles information extraction hierarchically
• Uses disjunctions to deal with variations. Disjunctive rules are ordered lists of individual disjuncts. The wrapper successively applies each disjunct in the list until it finds one that matches

STALKER
• Landmarks: a sequence of tokens, passed as the argument of some functions.
    SkipTo(<b>): start from the beginning, skip everything until the landmark <b> is found
    SkipTo(<b>)SkipTo(<I>)
• These functions represent the rules to extract the information
• Start rule: identifies the beginning of an attribute
• End rule: identifies the end of an attribute
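
A rough Python sketch of how a chain of SkipTo( ) landmarks can act as a start rule. The names skip_to and extract are hypothetical, landmarks are treated as literal strings rather than token sequences, and the end rule is a plain forward scan from the start position. These are simplifications chosen for this sketch, not STALKER's actual rule language.

```python
def skip_to(text, landmarks, pos=0):
    """Apply SkipTo(l1)SkipTo(l2)... starting at pos.
    Returns the index just past the last landmark, or None if one is missing."""
    for landmark in landmarks:
        i = text.find(landmark, pos)
        if i == -1:
            return None
        pos = i + len(landmark)
    return pos


def extract(text, start_rule, end_landmark):
    """Start rule: a list of landmarks. End rule: a single landmark (simplified)."""
    start = skip_to(text, start_rule)
    if start is None:
        return None
    end = text.find(end_landmark, start)
    return None if end == -1 else text[start:end]


DOC = "<p>Name:<b>Hongfei</b><p>ID:<b>1111</b>"

NAME = extract(DOC, ["<b>"], "</b>")            # SkipTo(<b>)
ID = extract(DOC, ["ID:", "<b>"], "</b>")       # SkipTo(ID:)SkipTo(<b>)
print(NAME, ID)
```

Chaining two landmarks lets the rule skip past the first <b> and land on the one that follows "ID:", which a single delimiter could not do.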

STALKER
• These SkipTo( ) functions represent a finite state machine model
• Extraction rules: get information
    [Diagram: states Si and Sj, transition labeled "landmark"]
• Iteration rules: handle nested structure
    [Diagram: state Si, transition labeled "landmark"]

STALKER
<body><p>Name:<b>Hongfei</b><p>ID:<b>1111</b>
<P>Address:<br><b>4000 Main St, Vancouver, BC, (604)333-3233</b><br><b>3000 Hastings St, LA, CA, 1-805-486-5675</b></body>
• Document
    Extraction rule: SkipTo(<br>) & SkipTo(</body>)
  • Name, ID, List of Address
      Iteration rule: SkipTo(<b>) & SkipTo(</b>)
    • St, city, province, area_code, phone
        Extraction rule: either SkipTo( ( ) or SkipTo( 1- )
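
The hierarchical extraction on this example can be traced with a short Python sketch. The helper names (between, iterate, area_code) are hypothetical. The sketch assumes that the disjunctive rule targets area_code and that the area code is the run of digits following the landmark. The slide gives no end rule for that field.

```python
import re

DOC = ("<body><p>Name:<b>Hongfei</b><p>ID:<b>1111</b>"
       "<P>Address:<br><b>4000 Main St, Vancouver, BC, (604)333-3233</b>"
       "<br><b>3000 Hastings St, LA, CA, 1-805-486-5675</b></body>")


def between(text, start, end, pos=0):
    """Return (text between the two landmarks, index past end), or None."""
    b = text.find(start, pos)
    if b == -1:
        return None
    b += len(start)
    e = text.find(end, b)
    if e == -1:
        return None
    return text[b:e], e + len(end)


def iterate(text, start, end):
    """Iteration rule: apply the start/end pair repeatedly to split a list."""
    items, pos = [], 0
    while True:
        hit = between(text, start, end, pos)
        if hit is None:
            return items
        item, pos = hit
        items.append(item)


def area_code(address):
    """Disjunctive rule: try SkipTo("(") first, then SkipTo("1-")."""
    for landmark in ("(", "1-"):
        i = address.find(landmark)
        if i == -1:
            continue                       # this disjunct does not match
        m = re.match(r"\d+", address[i + len(landmark):])
        if m:
            return m.group()               # digits after the landmark (assumption)
    return None


# Level 1: the address list.  Level 2: each address.  Level 3: the area code.
address_list, _ = between(DOC, "<br>", "</body>")
ADDRESSES = iterate(address_list, "<b>", "</b>")
CODES = [area_code(a) for a in ADDRESSES]
print(CODES)
```

Each level only sees the substring handed down by its parent, so the rules stay short even though the page mixes two address formats.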

STALKER
• Uses a sequential covering algorithm
• STALKER(examples)
    Set setRule to be empty
    While there are more examples
      Get a disjunct D by learning from the examples
      Remove all examples covered by D
      Add D to setRule
    Return setRule
• STALKER can handle 90% of web sites and is more efficient.
• May generate imperfect rules

Remaining Questions
• Find a more expressive model of document structure
• Select only the informative examples to learn a wrapper (active learning? data mining?)
• How to generate label pages automatically instead of by hand markup?
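
The sequential covering loop in the STALKER pseudocode can be sketched with a toy learner. Everything below is an assumption for illustration: an example is a (text, target index) pair, a disjunct is a single literal landmark, and learn_disjunct simply picks the suffix of the seed example's prefix that covers the most remaining examples. STALKER's real disjunct learner refines token-based landmarks and is considerably more involved.

```python
def covers(landmark, example):
    """A landmark covers an example if SkipTo(landmark) stops at the target."""
    text, target = example
    i = text.find(landmark)
    return i != -1 and i + len(landmark) == target


def learn_disjunct(examples):
    """Toy learner: candidates are suffixes of the text before the target in
    the first (seed) example, shortest first. Assumes target > 0."""
    text, target = examples[0]
    candidates = [text[i:target] for i in range(target - 1, -1, -1)]
    # max() keeps the first best candidate, so ties go to the shortest one.
    return max(candidates, key=lambda lm: sum(covers(lm, ex) for ex in examples))


def stalker(examples):
    rules, remaining = [], list(examples)
    while remaining:                       # while there are more examples
        d = learn_disjunct(remaining)      # get a disjunct D
        remaining = [ex for ex in remaining if not covers(d, ex)]
        rules.append(d)                    # add D to the rule set
    return rules


ADDR1 = "4000 Main St, Vancouver, BC, (604)333-3233"
ADDR2 = "3000 Hastings St, LA, CA, 1-805-486-5675"
EXAMPLES = [(ADDR1, ADDR1.index("604")), (ADDR2, ADDR2.index("805"))]

RULES = stalker(EXAMPLES)
print(RULES)
```

On these two addresses the toy learner returns two disjuncts, "(" and "-", one per address format. The loop always terminates because the full prefix of the seed covers at least the seed itself.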

HTML DOM Tree
• Uses a DOM-like tree model on HTML tags
    HTML
      Head
        Title
      Body
        LI
        LI
        LI
• The navigation methods are similar to those of the XML DOM tree. Only works for web pages.
• Uses the tree path to extract information
• Can also follow the document flow, like STALKER, to extract information
• Gets rid of imperfect rules and is more efficient

Other Related Works
• TrIAs: HTML tree
• SoftMealy: first to use disjunction rules and a finite state machine model
• WHISK: works for web pages and free text, more expressive than WIEN, decision-making is based on limited context. Slower.
• SRV
• CRYSTAL
• RAPIER
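
The tree-path idea from the HTML DOM Tree slide can be illustrated with Python's standard html.parser module. This is only a sketch: the class name PathExtractor, the sample page, and the choice to match on the exact tag path are assumptions made here, not part of the presentation.

```python
from html.parser import HTMLParser


class PathExtractor(HTMLParser):
    """Collect the text of every node whose tag path equals target_path."""

    def __init__(self, target_path):
        super().__init__()
        self.target = target_path
        self.stack = []      # tags currently open, root first
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack == self.target and data.strip():
            self.found.append(data.strip())


# Hypothetical page shaped like the tree on the slide.
PAGE = ("<html><head><title>Country Codes</title></head>"
        "<body><li>Congo 242</li><li>Spain 34</li></body></html>")

extractor = PathExtractor(["html", "body", "li"])
extractor.feed(PAGE)
FOUND = extractor.found
print(FOUND)
```

Here the path html/body/li selects both list items and ignores the title, without any string delimiters.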

References
• Nicholas Kushmerick, Wrapper Induction: Efficiency and expressiveness, Artificial Intelligence 118, 2000
• Ion Muslea, Steven Minton, Craig A. Knoblock, A Hierarchical Approach to Wrapper Induction, Conference on Autonomous Agents, Seattle, WA, 1999
• S. Soderland, Learning information extraction rules for semi-structured and free text, Machine Learning 34, 1999
• C. Hsu, M. Dung, Generating finite-state transducers for semistructured data extraction from the web, Information Systems 23, 1998
• M. Bauer, D. Dengler, TrIAs: An architecture for trainable information assistants, Workshop on AI and Information Integration, Madison, WI, 1998
• D. Freitag, Information extraction from HTML: Application of a general machine learning approach, AAAI-98, Madison, WI, 1998