Progress Report

 2009/10/09
 2009/11/24


 Introduction
 Ongoing work
 Future work


 Identifying useful information from the
World Wide Web is important in Web
mining and Information Agents.
 Wrappers are software modules that
help capture the semi-structured data
on the web into a structured format.
 Wrappers can be coded manually
or learned from examples using a
technique called wrapper induction.


 Wrappers for semi-structured Web
sources
› Wrappers need to perform two kinds of tasks:
 Executing automated navigation sequences
through Web sites to access the pages
containing the required data.
 Generating data extraction programs for
obtaining the structured records from the
retrieved HTML pages.
› The vast majority of work on
automatic and semi-automatic wrapper
generation has focused on the second
task.

 Wrapper maintenance
› The main problem with wrappers is that they can
become invalid when the Web sources change.
 Wrapper maintenance can be divided into three main tasks:
› Detecting the changes in the source that
invalidate the current wrapper.
› Regenerating the automated navigation
sequences required to access the pages
containing the required data.
› Regenerating the data extraction programs
needed to extract the structured results from the
HTML pages.
 The first task is called wrapper verification.

[Figure: Runtime Gadget Execution flowchart. Labels: Gadget's profile; Grab web pages; Web Pages; Template + Extractor; Schema change (Yes / No); Extracted Data; Desired Data; Unsupervised WI; New Schema; Data Schema Matching; Schema + Template.]

 Extract data from web pages by using
the pattern tree and previous web
pages.
› Compare the terminal paths in the DOM tree
against our schema.
› Steps:
 Find the paths in the DOM tree that also occur in the pattern tree.
 Filter out the paths that have no schema type (basic).
 The result may be one or more paths with a
schema type (basic).


 Input: P: a web page, T: pattern tree
 Output: L: a list that assigns an ID to the terminal paths in P
 Algorithm:
Convert P into XML format
FOR EACH TP: terminal path in P
    ID := empty
    CheckExist(TP, T, ID)
    IF ID is not empty THEN
        Add (TP, Value, ID) to L
    END IF
END FOR
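
The labelling loop above can be sketched in Python. This is only an illustration under stated assumptions: the page is already well-formed XML, a "terminal path" is taken to be the tag path from the root to an element that holds text, and the pattern tree T is modelled as a plain dictionary from path to ID. The names `terminal_paths`, `check_exist` and `label_page` are hypothetical and do not come from the slides.

```python
import xml.etree.ElementTree as ET


def terminal_paths(root):
    """Yield (path, text) for every element that holds non-empty text.

    Assumption: tail text (text after a closing child tag) is ignored.
    """
    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            yield path, text
        for child in node:
            yield from walk(child, path)
    yield from walk(root, "")


def check_exist(path, pattern_tree):
    """Return the ID stored for `path` in the pattern tree, or None."""
    return pattern_tree.get(path)


def label_page(xml_text, pattern_tree):
    """Build L: a list of (terminal path, value, ID) for paths known to T."""
    root = ET.fromstring(xml_text)
    labelled = []
    for path, value in terminal_paths(root):
        path_id = check_exist(path, pattern_tree)
        if path_id is not None:
            labelled.append((path, value, path_id))
    return labelled


# Hypothetical page and pattern tree, for illustration only.
page = "<html><body><h1>Title</h1><p><b>42</b></p><i>skip</i></body></html>"
pattern_tree = {"/html/body/h1": "name", "/html/body/p/b": "price"}
result = label_page(page, pattern_tree)
```

Here the `<i>` element is dropped because its path has no entry in the pattern tree, which corresponds to the "ID is empty" branch.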


 Using XSD to check whether the template of a
web source has changed
› Use XSD (XML Schema Definition) to
validate the XML.
 Validation of the tag-based structure of the XML
succeeds.
 The method cannot validate the content of the
XML.


 Input: Pold: old web page, Pnew: new web page
 Output: true or false
 Algorithm:
XMLold = HtmlToXML(Pold)
XMLnew = HtmlToXML(Pnew)
Xsd = XMLToXSD(XMLold)
IF Validate(XMLnew, Xsd) THEN
    RETURN true (Success)
ELSE
    RETURN false (Miss)
END IF
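
A rough, standard-library-only sketch of the same idea follows. It is not real XSD generation or validation: as a stand-in for XMLToXSD and Validate, it records which parent/child tag nestings occur in the old page and checks that the new page uses no others. It also assumes both pages are already well-formed XML (the HtmlToXML step is skipped). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def structure_signature(xml_text):
    """Collect the set of (parent tag, child tag) pairs seen in the document."""
    root = ET.fromstring(xml_text)
    pairs = {(None, root.tag)}
    for parent in root.iter():
        for child in parent:
            pairs.add((parent.tag, child.tag))
    return pairs


def template_unchanged(xml_old, xml_new):
    """True if the new page uses only nestings already seen in the old page."""
    return structure_signature(xml_new) <= structure_signature(xml_old)


# Hypothetical pages, for illustration only.
old = "<html><body><table><tr><td>A</td></tr></table></body></html>"
same = "<html><body><table><tr><td>Z</td></tr></table></body></html>"
changed = ("<html><body><div><table><tr><td>A</td></tr></table></div>"
           "</body></html>")

unchanged_result = template_unchanged(old, same)      # content differs, tags do not
changed_result = template_unchanged(old, changed)     # a new <div> level appears
```

Like the XSD check, this sketch only looks at tags: replacing "A" with "Z" goes unnoticed, while the extra `<div>` level is reported as a change.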


 Paper:
› On the verification of web wrappers
› WEWRA: An algorithm for Wrapper
Verification, 2009 March, ML

 Program:


 Roshni Mohapatra, Kanagasabai
Rajaraman, and Sung Sam Yuan.
Efficient Wrapper Reinduction from
Dynamic Web Sources. WI'04.
 Alberto Pan, Juan Raposo, Manuel
Álvarez, Víctor Carneiro, and Fernando
Bellas. Automatically maintaining
navigation sequences for querying
semi-structured web sources. Data &
Knowledge Engineering, Volume 63, Issue
3, December 2007, Pages 795-810.


 Ongoing Work
› XML → XSD
› Terminal value → Basic ID
 Future Work


 Completed
› Convert the XML file into a schema file (XSD
file)
 Changes in the XML are verified using the XSD
› Assign a SetID to each terminal value
 Features:
 LetterDensity, DigitDensity, PunDensity,
UpperLetterDensity, MeanWordLength,
MeanNumberToken
 Cosine similarity
 Result: none or one SetID number
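
The feature-and-cosine step might look like the sketch below. The slides give only the feature names, so every definition here is an assumption: densities are counts divided by string length, MeanNumberToken is read as the share of numeric tokens, and the "none or one SetID" result is modelled with a similarity threshold. The `prototypes` dictionary, the threshold value and all function names are hypothetical.

```python
import math
import string


def features(text):
    """Return a six-value feature vector for one terminal value."""
    tokens = text.split()
    n_chars = len(text) or 1
    n_tokens = len(tokens) or 1
    letters = sum(c.isalpha() for c in text)
    digits = sum(c.isdigit() for c in text)
    puncts = sum(c in string.punctuation for c in text)
    uppers = sum(c.isupper() for c in text)
    numeric_tokens = sum(t.replace(".", "", 1).isdigit() for t in tokens)
    return [
        letters / n_chars,                  # LetterDensity
        digits / n_chars,                   # DigitDensity
        puncts / n_chars,                   # PunDensity
        uppers / n_chars,                   # UpperLetterDensity
        sum(map(len, tokens)) / n_tokens,   # MeanWordLength
        numeric_tokens / n_tokens,          # MeanNumberToken (assumed meaning)
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_set_id(value, prototypes, threshold=0.95):
    """Return the best-matching SetID, or None if nothing clears the threshold."""
    vec = features(value)
    best_id, best_score = None, threshold
    for set_id, examples in prototypes.items():
        score = max(cosine(vec, features(e)) for e in examples)
        if score >= best_score:
            best_id, best_score = set_id, score
    return best_id


# Hypothetical example values per SetID, for illustration only.
prototypes = {
    "price": ["$12.50", "$8.99"],
    "title": ["The Old Man and the Sea"],
}
price_match = assign_set_id("$10.75", prototypes)
no_match = assign_set_id("x", prototypes)
```

Because the features are left unscaled, MeanWordLength dominates the vectors and pushes most similarities close to 1, so some normalisation would probably be needed before the threshold is meaningful.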


 Issues:
› Verification:
 XSD can detect changes in the tag-based structure.
 XSD cannot detect semantic changes. See
Figure.
› Assigning basic ID values
 If the relation between a path from the web page
and a path from the pattern tree is one-to-one,
 the result is either reject or accept.
 If the relation is one-to-many, it becomes a
classification problem.
 In the first extracted data, some values belong to one
field,
 but these values may have been split across several basic IDs.
 Assigning a basic ID to such terminal values is
therefore a problem.

 Add the number sequence of each terminal
node's path to the feature set
 Collect more web pages
› For each web site: 10 queries, N result pages.
 XML partial path
› To bridge the gap between the Pattern Tree and the Web
pages.
 Survey other papers
› Automatically maintaining wrappers for
semi-structured web sources. (Focus on generating a new
training set.)
 Juan Raposo, Alberto Pan, Manuel Álvarez, Justo
Hidalgo
› Wrapper Maintenance: A Machine Learning
Approach
 Kristina Lerman, Steven N. Minton, Craig A. Knoblock

Before:
<html>
<body>
<table>
<tr>
<td>A</td>
</tr>
<tr>
<td>
<strong>B</strong>
</td>
</tr>
</table>
</body>
</html>

After:
<html>
<body>
<table>
<tr>
<td>
<strong>A</strong>
</td>
</tr>
<tr>
<td>B</td>
</tr>
</table>
</body>
</html>
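
The Before and After pages can be compared in a short Python sketch to show why a structure-only check misses this change. It assumes the two pages are written with every tag closed so that they parse as XML, and it treats a "terminal path" as the tag path down to an element that holds text; `path_values` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

BEFORE = ("<html><body><table>"
          "<tr><td>A</td></tr>"
          "<tr><td><strong>B</strong></td></tr>"
          "</table></body></html>")
AFTER = ("<html><body><table>"
         "<tr><td><strong>A</strong></td></tr>"
         "<tr><td>B</td></tr>"
         "</table></body></html>")


def path_values(xml_text):
    """Map each terminal path to the sorted text values found under it."""
    found = {}

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            found.setdefault(path, []).append(text)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return {path: sorted(values) for path, values in found.items()}


before, after = path_values(BEFORE), path_values(AFTER)
same_paths = set(before) == set(after)   # the tag structure looks identical
same_values = before == after            # but the values have swapped paths
```

Both pages contain the same two terminal paths, so a tag-based check accepts the new page, yet "A" and "B" have swapped between the plain and the `<strong>` path.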
