The document discusses wrapper induction and maintenance for extracting structured data from semi-structured web sources. It presents algorithms for:
1) Assigning basic IDs to terminal paths in a DOM tree by comparing them to patterns in a pattern tree and identifying features like word lengths.
2) Verifying if a wrapper is still valid by converting web pages to XML and using an XSD schema to check for changes in tag structure.
3) Future work ideas like combining path number sequences as features and collecting more pages from websites to improve the methods.
4. Identifying useful information from the
World Wide Web is important in Web
mining and Information Agents.
Wrappers are software modules that
help capture the semi-structured data
on the web into a structured format.
Wrappers can be coded either manually
or learned from examples using a
technique called wrapper induction.
5. Wrappers for semi-structured Web
sources
› Wrappers need to perform two kinds of tasks:
Executing automated navigation sequences
through Web sites to access the pages
containing the required data.
Generating data extraction programs for
obtaining the structured records from the
retrieved HTML pages.
› The vast majority of works dealing with
automatic and semi-automatic wrapper
generation have focused on the second
task.
6. Wrapper maintenance
› The main problem with wrappers is that they can
become invalid when the Web sources change.
Wrapper maintenance can be divided into three main tasks:
› Detecting the changes on the source that
invalidate the current wrapper.
› Regenerating the automated navigation
sequences required to access the pages
containing the required data.
› Regenerating the data extraction programs
needed to extract the structured results from the
HTML pages.
The first task is called wrapper verification.
7. Runtime Gadget Execution
[Figure: flowchart of runtime gadget execution. Recoverable labels: Gadget's profile; Grab web pages; Web Pages; Template + Extractor; Schema change (Yes / No); Extracted Data; Desired Data; Unsupervised WI; New Schema; Data Schema Matching; Schema + Template]
8. Extract data from web pages by using
the pattern tree and previous web
pages.
› Compare to our schema on the terminal
paths in the DOM tree.
› Steps:
Find the paths in the DOM tree that also occur in the pattern tree.
Filter out the paths without schematype (basic).
This leaves one or more paths with
schematype (basic).
9. Input: P: a web page, T: pattern tree
Output: L: the terminal paths in P with their assigned IDs
Algorithm:
Transfer P into XML format
FOR EACH TP: terminal path in P
ID := empty
CheckExist(TP, T, ID)
IF ID is not equal to empty THEN
Add (TP, Value, ID) to L
END IF
END FOR
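The ID-assignment loop above can be sketched in Python. This is only an illustration under stated assumptions: the page is already well-formed XML, the pattern tree is modeled as a plain dictionary from terminal path to basic ID, and a dictionary lookup stands in for CheckExist, whose real behavior the slides do not specify. The names terminal_paths and assign_ids are hypothetical.

```python
import xml.etree.ElementTree as ET


def terminal_paths(root):
    """Yield (path, text) for every leaf element that carries text."""
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                yield path, text
        # Push children in reverse so they are visited in document order.
        for child in reversed(children):
            stack.append((child, path + "/" + child.tag))


def assign_ids(xml_text, pattern_tree):
    """Return (path, value, id) triples for terminal paths known to the pattern tree.

    pattern_tree is assumed to be a dict {path: basic_id}; the lookup is a
    stand-in for CheckExist, not the authors' implementation.
    """
    root = ET.fromstring(xml_text)
    labeled = []
    for path, value in terminal_paths(root):
        basic_id = pattern_tree.get(path)
        if basic_id is not None:
            labeled.append((path, value, basic_id))
    return labeled
```

Terminal paths that are absent from the pattern tree are dropped, which mirrors the "ID not equal to empty" test in the pseudocode.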
10. Using XSD to check if the template of
web sources changes
› Using XSD (XML Schema Definition) to
validate the XML
Validation of the tag-based structure of the XML
succeeds.
The method cannot validate the content of the
XML.
11. Input: Pold: old web page, Pnew: new web page
Output: true or false
Algorithm:
XMLold=HtmlToXML(Pold)
XMLnew=HtmlToXML(Pnew)
Xsd = XMLToXSD(XMLold)
IF Validate(XMLnew, Xsd) THEN
Return true (success)
ELSE
Return false (miss)
END IF
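A rough Python sketch of the same check follows. It does not use XSD: as a simplifying assumption, the schema inferred from the old page is replaced by the set of tag paths found in it, and the new page passes if it introduces no tag path that the old page lacked. The names tag_paths and wrapper_still_valid are hypothetical, and both pages are assumed to be well-formed XML already.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Collect every root-to-element tag path in the document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def wrapper_still_valid(xml_old, xml_new):
    """True if the new page uses only tag paths already seen in the old page.

    This is a simplified stand-in for XSD validation, not an XSD validator.
    """
    return tag_paths(xml_new) <= tag_paths(xml_old)
```

Like the XSD approach, this comparison only looks at tags, so a page whose text changes meaning while its tag structure stays the same still passes.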
12. Paper:
› On the verification of web wrappers
› WEWRA: An algorithm for Wrapper
Verification, 2009 March, ML
Program:
13. Roshni Mohapatra, Kanagasabai
Rajaraman, and Sung Sam Yuan.
Efficient Wrapper Reinduction from
Dynamic Web Sources. WI’04
Alberto Pan, Juan Raposo, Manuel
Álvarez, Víctor Carneiro, Fernando
Bellas. Automatically maintaining
navigation sequences for querying
semi-structured web sources. Data &
Knowledge Engineering, Volume 63, Issue
3, December 2007, Pages 795-810
14. Ongoing Work
› XML → XSD
› Terminal value → Basic ID
Future Work
15. Completed
› Convert the XML file into a schema file (XSD
file)
Changes in the XML are verified using the XSD
› Assign a SetID to each terminal value
Six features:
LetterDensity, DigitDensity, PunDensity,
UpperLetterDensity, MeanWordLength,
MeanNumberToken
Compared using cosine similarity
Result: no SetID or exactly one SetID
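The feature comparison can be illustrated with a short Python sketch. The slides name the features but do not define them, so the definitions below are assumptions: each density is a character count divided by the string length, and MeanWordLength is the average length of whitespace-separated tokens. MeanNumberToken is left out because its definition cannot be inferred from the name. The threshold value and the names features, cosine, and assign_set_id are also hypothetical.

```python
import math
import string


def features(value):
    """Assumed definitions of five of the listed features for one terminal value."""
    n = len(value) or 1  # avoid division by zero on empty strings
    words = value.split()
    return [
        sum(c.isalpha() for c in value) / n,             # LetterDensity
        sum(c.isdigit() for c in value) / n,             # DigitDensity
        sum(c in string.punctuation for c in value) / n, # PunDensity
        sum(c.isupper() for c in value) / n,             # UpperLetterDensity
        sum(len(w) for w in words) / len(words) if words else 0.0,  # MeanWordLength
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_set_id(value, examples_by_set, threshold=0.95):
    """Return the best-matching set ID at or above the threshold, or None."""
    best_id, best_score = None, threshold
    vec = features(value)
    for set_id, examples in examples_by_set.items():
        for example in examples:
            score = cosine(vec, features(example))
            if score >= best_score:
                best_id, best_score = set_id, score
    return best_id
```

Because MeanWordLength is not scaled to the 0 to 1 range of the densities, it dominates the cosine score in this sketch; a real implementation would likely need to normalize the features first.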
16. Issues:
› Verification:
XSD can detect changes in the tag-based structure.
XSD cannot detect semantic changes. See
Figure
› Assigning basic ID values
If the relation between a path from the web page
and a path from the pattern tree is one-to-one,
the result is either reject or accept.
If the relation is one-to-many, it becomes a
classification problem.
In the first extracted data, some values belong to one
field,
but these values may have been split across several basic IDs.
This makes assigning a basic ID to a terminal value a
problem.
17. Combine the sequence of path numbers for each
terminal node into the feature set
Collect more web pages
› For each web site, 10 queries, N result pages.
XML partial path
› To resolve the gap between Pattern Tree and Web
pages.
Survey other papers
› Automatically maintaining wrappers for semi-
structured web sources. (Focus on generating a new
training set.)
Juan Raposo, Alberto Pan, Manuel Álvarez, Justo
Hidalgo
› Wrapper Maintenance: A Machine Learning
Approach
Kristina Lerman, Steven N. Minton, Craig A. Knoblock