Identifying useful information from the
World Wide Web is important in Web
mining and Information Agents.
Wrappers are software modules that
help capture the semi-structured data
on the web into a structured format.
Wrappers can be coded either manually
or learned from examples using a
technique called wrapper induction.
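As an illustration of the manual route, here is a minimal sketch of a hand-coded wrapper. It assumes the source page lists its records as table rows with one `<td>` per field; the class name `TableWrapper`, the function `extract_records`, and the field names are hypothetical and do not come from the slides.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hand-coded wrapper: turns each <tr> of <td> cells into a dict record."""

    def __init__(self, field_names):
        super().__init__()
        self.field_names = field_names  # assumed column order of the source
        self.records = []
        self._cells = None     # cells of the row being read, None outside a row
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Keep only rows whose cell count matches the expected fields.
            if len(self._cells) == len(self.field_names):
                values = (cell.strip() for cell in self._cells)
                self.records.append(dict(zip(self.field_names, values)))
            self._cells = None


def extract_records(html, field_names):
    """Run the wrapper over one page and return its structured records."""
    wrapper = TableWrapper(field_names)
    wrapper.feed(html)
    return wrapper.records
```

A wrapper like this depends on the page layout it was written for, which is why a layout change can silently invalidate it.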
Wrappers for semi-structured Web
› Wrappers need to perform two kinds of tasks:
Executing automated navigation sequences
through Web sites to access the pages
containing the required data.
Generating data extraction programs for
obtaining the structured records from the
retrieved HTML pages.
› The vast majority of work on
automatic and semi-automatic wrapper
generation has focused on the second task.
› The main problem with wrappers is that they can
become invalid when the Web sources change.
Wrapper maintenance can be divided into three main tasks:
› Detecting the changes on the source that
invalidate the current wrapper.
› Regenerating the automated navigation
sequences required to access the pages
containing the required data.
› Regenerating the data extraction programs
needed to extract the structured results from the
The first task is called wrapper verification.
[Figure: Runtime Gadget Execution. Labels recoverable from the diagram: Grab web, Web, Template, Extractor, Data, Desired, Unsupervised.]
Extract data from web pages by using
the pattern tree and previous web
› Compare with our schema on the terminal
paths in the DOM tree.
Find the same paths in the DOM tree.
Filter out the paths without a schema type (basic).
Finally, one or more paths may be obtained with
Input: P: a web page, T: pattern tree
Output: L: the terminal paths in P with their assigned IDs
Transfer P into XML format
Foreach TP: terminal path in P
IF ID not equal to empty then
Add (TP, Value, ID) to L
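The procedure above can be sketched in Python as follows. This is a rough sketch under stated assumptions: the page has already been converted to well-formed XML, and the pattern tree is represented as a plain dictionary from terminal path to ID. The slides do not show how the ID is looked up, so the dictionary lookup and the names `terminal_paths` and `assign_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET


def terminal_paths(xml_text):
    """Yield (path, value) for every leaf element of the XML document."""
    root = ET.fromstring(xml_text)

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        children = list(node)
        if not children:
            # A terminal path ends at an element with no child elements.
            yield path, (node.text or "").strip()
        for child in children:
            yield from walk(child, path)

    yield from walk(root, "")


def assign_ids(xml_text, pattern_tree):
    """Return (path, value, id) for terminal paths known to the pattern tree.

    pattern_tree is assumed to be a dict {terminal path: id}; paths that
    are missing from it get an empty ID and are skipped.
    """
    labelled = []
    for path, value in terminal_paths(xml_text):
        node_id = pattern_tree.get(path, "")
        if node_id != "":
            labelled.append((path, value, node_id))
    return labelled
```

Paths that do not occur in the pattern tree (for example advertising blocks) are dropped, which mirrors the "ID not equal to empty" test.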
Using XSD to check whether the template of
a web source has changed
› Using XSD (XML Schema Definition) to
validate the XML
Validating the tag-based structure of XML is
The method cannot validate the content of
Input: Pold: old web page, Pnew: new web page
Output: true or false
Xsd = XMLToXSD(XMLold)
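The verification step can be approximated as follows. This sketch does not generate or use a real XSD: as a simplified stand-in for `XMLToXSD`, it records the set of tag paths of the old page and accepts the new page only if it uses no other paths. The names `tag_paths` and `verify_structure` are hypothetical, and both pages are assumed to be well-formed XML already.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-element tag paths in an XML document."""
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def verify_structure(xml_old, xml_new):
    """True if the new page only uses tag paths seen in the old page.

    Stand-in for validating the new XML against a schema inferred from
    the old XML; it checks tag structure only, never the text content.
    """
    schema = tag_paths(xml_old)  # plays the role of Xsd = XMLToXSD(XMLold)
    return tag_paths(xml_new) <= schema
```

Because only tags are compared, a page whose layout is unchanged but whose text now means something different still passes. A real implementation would validate against a generated XSD with a schema-aware library.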
› On the verification of web wrappers
› WEWRA: An algorithm for Wrapper
Verification, 2009 March, ML
Roshni Mohapatra, Kanagasabai
Rajaraman, and Sung Sam Yuan.
Efficient Wrapper Reinduction from
Dynamic Web Sources. WI’04
Alberto Pan, Juan Raposo, Manuel
Álvarez, Víctor Carneiro, Fernando
Bellas. Automatically maintaining
navigation sequences for querying semi-
structured web sources. Data &
Knowledge Engineering Volume 63, Issue
3, December 2007, Pages 795-810
› XML → XSD
› Terminal value → Basic ID
› Transfer the XML file into Schema File (XSD
Verifying the changes of XML is done using XSD
› Assign a SetID to each terminal value
LetterDensity, DigitDensity, PunDensity,
Result: none or one setid number
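The slides name the density features but do not define them. A minimal sketch, assuming each density is the fraction of characters in the terminal value that fall into the given class; the function name `density_features` is hypothetical.

```python
import string


def density_features(value):
    """Character-class densities of a terminal value (assumed definition)."""
    total = len(value)
    if total == 0:
        # An empty value has no characters, so every density is zero.
        return {"LetterDensity": 0.0, "DigitDensity": 0.0, "PunDensity": 0.0}
    letters = sum(ch.isalpha() for ch in value)
    digits = sum(ch.isdigit() for ch in value)
    puncts = sum(ch in string.punctuation for ch in value)  # ASCII punctuation
    return {
        "LetterDensity": letters / total,
        "DigitDensity": digits / total,
        "PunDensity": puncts / total,
    }
```

Features like these could separate, say, prices (mostly digits) from titles (mostly letters) when deciding which SetID a value belongs to.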
XSD can detect changes in the tag-based structure.
XSD cannot detect semantic changes. See
› Assign basic id value
If the relation between two paths, one from the
web page and one from the pattern tree, is one-to-one,
the result may be either reject or accept.
If the relation is one-to-many, they will become a
For the first extracted data, some data belong to one
But these data were possibly divided among several basic IDs.
For assigning basic id value to terminal value, it’s a
Combine the number sequence of the path of each
terminal node into the feature set
Collect more web pages
› For each web site: 10 queries, N result pages.
XML partial path
› To resolve the gap between Pattern Tree and Web
Survey other papers
› Automatically maintaining wrappers for semi-
structured web sources. (Focus on generating a new
Juan Raposo, Alberto Pan, Manuel Álvarez, Justo
› Wrapper Maintenance: A Machine Learning
Kristina Lerman, Steven N. Minton, Craig A. Knoblock