4. Identifying useful information from the
World Wide Web is important in Web
mining and Information Agents.
Wrappers are software modules that
help capture the semi-structured data
on the web into a structured format.
Wrappers can be coded either manually
or learned from examples using a
technique called wrapper induction.
5. Wrappers for semi-structured Web
sources
› Wrappers need to perform two kinds of tasks:
Executing automated navigation sequences
through Web sites to access the pages
containing the required data.
Generating data extraction programs for
obtaining the structured records from the
retrieved HTML pages.
› The vast majority of works dealing with
automatic and semi-automatic wrapper
generation have focused on the second
task.
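As an illustration of the second task, here is a minimal sketch of a hand-coded wrapper that turns table rows in an HTML page into structured records. The class name `TableWrapper`, the helper `extract_records`, and the field names are hypothetical; the sketch assumes one record per `<tr>` and one field per `<td>`, which real sources rarely guarantee.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hand-coded wrapper: turns each <tr> of <td> cells into one record."""

    def __init__(self, fields):
        super().__init__()
        self.fields = fields    # hypothetical field names, one per column
        self.records = []       # structured output
        self._row = None        # cells of the row being read, or None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Keep only rows whose cell count matches the expected fields.
            if len(self._row) == len(self.fields):
                self.records.append(dict(zip(self.fields, self._row)))
            self._row = None


def extract_records(html, fields):
    parser = TableWrapper(fields)
    parser.feed(html)
    parser.close()
    return parser.records
```

A wrapper like this breaks as soon as the source changes its layout, which is what motivates wrapper maintenance.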
6. Wrapper maintenance
› The main problem with wrappers is that they can
become invalid when the Web sources change.
Wrapper maintenance can be divided into three main tasks:
› Detecting the changes on the source that
invalidate the current wrapper.
› Regenerating the automated navigation
sequences required to access the pages
containing the required data.
› Regenerating the data extraction programs
needed to extract the structured results from the
HTML pages.
The first task is called wrapper verification.
7. Runtime Gadget Execution
[Figure: flowchart of runtime gadget execution. Recoverable labels: Gadget's profile, Grab web pages, Web Pages, Template + Schema, Extractor, Template change (Yes / No), Extracted Data, Desired Data, Unsupervised WI, New Schema, Data Matching, Schema + Template.]
8. Extract data from web pages by using
the pattern tree and previous web
pages.
› Compare the terminal paths in the DOM
tree against our schema.
› Steps:
Find the paths in the DOM tree that also occur in the pattern tree.
Filter out the paths that have no schema type (basic).
The result is one or more paths with a
schema type (basic).
9. Input: P: a web page, T: pattern tree
Output: L: list that assigns an ID to each matched terminal path in P
Algorithm:
Transform P into XML format
FOR EACH TP: terminal path in P
ID := empty
CheckExist(TP, T, ID)
IF ID is not empty THEN
Add (TP, Value, ID) to L
END IF
END FOR
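The algorithm above can be sketched in Python as follows. This is only an illustration under stated assumptions: the page is assumed to be already converted to well-formed XML, the pattern tree is modelled as a plain dictionary from terminal path to basic ID, and `CheckExist` is replaced by a dictionary lookup. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def terminal_paths(xml_text):
    """Yield (path, value) for every leaf element that carries text."""
    root = ET.fromstring(xml_text)

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        children = list(node)
        if not children:
            value = (node.text or "").strip()
            if value:
                yield path, value
        for child in children:
            yield from walk(child, path)

    yield from walk(root, "")


def label_terminal_paths(xml_text, pattern_tree):
    """Return (path, value, id) for each terminal path found in the pattern tree."""
    labelled = []
    for path, value in terminal_paths(xml_text):
        basic_id = pattern_tree.get(path)  # assumed stand-in for CheckExist
        if basic_id is not None:
            labelled.append((path, value, basic_id))
    return labelled
```

Paths that do not occur in the pattern tree are simply dropped, which matches the filtering step of the algorithm.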
10. Using XSD to check whether the template of
a web source has changed
› Use XSD (XML Schema Definition) to
validate the XML.
XSD validation can check the tag-based
structure of the XML.
It cannot validate the content of the
XML.
11. Input: Pold: old web page, Pnew: new web page
Output: true or false
Algorithm:
XMLold=HtmlToXML(Pold)
XMLnew=HtmlToXML(Pnew)
Xsd = XMLToXSD(XMLold)
IF(Validate(XMLnew,Xsd))
Success
ELSE
Miss
END IF
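The sketch below illustrates the idea of this check without a real XSD. It swaps in a much cruder technique: instead of generating and validating a schema, it collects the set of root-to-element tag paths of the old page and accepts the new page if it uses only those paths. Both pages are assumed to be already converted to well-formed XML, and the function names are hypothetical. A real implementation would use an actual XSD validator.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-element tag paths in an XML document."""
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def template_unchanged(xml_old, xml_new):
    """True if the new page uses only tag paths seen in the old page."""
    return tag_paths(xml_new) <= tag_paths(xml_old)


old = "<div><h1>Dune</h1><span>9.99</span></div>"
renamed = "<div><h2>Dune</h2><span>9.99</span></div>"
swapped = "<div><h1>9.99</h1><span>Dune</span></div>"

print(template_unchanged(old, renamed))  # structural change is detected
print(template_unchanged(old, swapped))  # content change goes unnoticed
```

The `swapped` page shows the limitation noted on the previous slide: the tag structure is identical, so a purely structural check cannot see that the values have changed meaning.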
12. Paper:
› On the verification of web wrappers
› WEWRA: An algorithm for Wrapper
Verification, 2009 March, ML
Program:
13. Roshni Mohapatra, Kanagasabai
Rajaraman, and Sung Sam Yuan.
Efficient Wrapper Reinduction from
Dynamic Web Sources. WI’04
Alberto Pan, Juan Raposo, Manuel
Álvarez, Víctor Carneiro, Fernando
Bellas. Automatically maintaining
navigation sequences for querying
semi-structured web sources. Data &
Knowledge Engineering, Volume 63, Issue
3, December 2007, Pages 795-810
14. Ongoing Work
› XML → XSD
› Terminal value → Basic ID
Future Work
15. Completed
› Convert the XML file into a schema file (XSD
file)
Changes in the XML are verified using the XSD
› Assign a SetID to each terminal value
Features:
LetterDensity, DigitDensity, PunDensity,
UpperLetterDensity, MeanWordLength,
MeanNumberToken
Cosine relation between feature vectors
Result: zero or one SetID per terminal value
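A minimal sketch of how such a SetID assignment could work is given below. The slides name the features but do not define them, so every formula here is an assumption (for example, MeanNumberToken is taken to be the fraction of tokens made only of digits). The `prototypes` mapping, the `threshold` value, and the function names are also hypothetical.

```python
import math
import string


def features(text):
    """Feature vector for a terminal value (all definitions are assumed)."""
    n = len(text) or 1
    tokens = text.split()
    t = len(tokens) or 1
    return [
        sum(c.isalpha() for c in text) / n,              # LetterDensity
        sum(c.isdigit() for c in text) / n,              # DigitDensity
        sum(c in string.punctuation for c in text) / n,  # PunDensity
        sum(c.isupper() for c in text) / n,              # UpperLetterDensity
        sum(len(tok) for tok in tokens) / t,             # MeanWordLength
        sum(tok.isdigit() for tok in tokens) / t,        # MeanNumberToken
    ]


def cosine(u, v):
    """Cosine of the angle between two vectors, 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_set_id(value, prototypes, threshold=0.9):
    """Return the best-matching SetID, or None if nothing reaches the threshold."""
    best_id, best_sim = None, threshold
    for set_id, example in prototypes.items():
        sim = cosine(features(value), features(example))
        if sim >= best_sim:
            best_id, best_sim = set_id, sim
    return best_id
```

Note that the features are not scaled here, so MeanWordLength dominates the cosine; a real feature set would likely need normalization before the similarity is meaningful.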
16. Issues:
› Verification:
XSD can detect changes in the tag-based structure.
XSD cannot detect semantic changes. See
Figure
› Assigning basic ID values
If the relation between two paths, one from the
web page and one from the pattern tree, is one-to-one,
the result is either reject or accept.
If the relation is one-to-many, it becomes a
classification problem.
In the first extracted data, some values belong to one
field.
But these values may have been split across several basic IDs.
This makes assigning a basic ID to a terminal value a
problem.
17. Combine the numeric sequence of the path to each
terminal node into the feature set
Collect more web pages
› For each web site: 10 queries, N result pages.
XML partial paths
› To bridge the gap between the pattern tree and web
pages.
Survey other papers
› Automatically maintaining wrappers for
semi-structured web sources. (Focus on generating a new
training set.)
Juan Raposo, Alberto Pan, Manuel Álvarez, Justo
Hidalgo
› Wrapper Maintenance: A Machine Learning
Approach
Kristina Lerman, Steven N. Minton, Craig A. Knoblock