Progress Report


Published in: Technology
  2. 2009/10/09, 2009/11/24
  3. • Introduction
     • Ongoing work
     • Future work
  4. • Identifying useful information on the World Wide Web is important in Web mining and information agents.
     • Wrappers are software modules that capture semi-structured Web data in a structured format.
     • A wrapper can be coded manually or learned from examples using a technique called wrapper induction.
  5. • Wrappers for semi-structured Web sources
       › Wrappers need to perform two kinds of tasks:
         • Executing automated navigation sequences through Web sites to access the pages containing the required data.
         • Generating data extraction programs that obtain the structured records from the retrieved HTML pages.
       › The vast majority of work on automatic and semi-automatic wrapper generation has focused on the second task.
  6. • Wrapper maintenance
       › The main problem with wrappers is that they can become invalid when the Web sources change.
     • Wrapper maintenance can be divided into three main tasks:
       › Detecting the changes in the source that invalidate the current wrapper.
       › Regenerating the automated navigation sequences required to access the pages containing the required data.
       › Regenerating the data extraction programs needed to extract the structured results from the HTML pages.
     • The first task is called wrapper verification.
  7. [Figure: flowchart of runtime gadget execution. Recoverable labels: Gadget's profile; Grab Web pages; Web pages; Template + Schema; Extractor; Template change (Yes/No); Extracted data; Desired data; Unsupervised WI; New schema; Data matching; Schema + Template.]
  8. • Extract data from Web pages using the pattern tree and previous Web pages.
       › Compare the terminal paths in the DOM tree against our schema.
       › Steps:
         • Find the paths in the DOM tree that also occur in the pattern tree.
         • Filter out the paths without a schema type (basic).
         • The result may be one or more paths with a schema type (basic).
  9. • Input: P, a Web page; T, a pattern tree
     • Output: L, the IDs assigned to the terminal paths in P
     • Algorithm:
         Transform P into XML format
         FOR EACH TP: terminal path in P
           ID := empty
           CheckExist(TP, T, ID)
           IF ID is not empty THEN
             Add (TP, Value, ID) to L
           END IF
         END FOR
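The loop on this slide can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' program: the pattern tree is modelled as a plain dictionary from terminal path to ID, `CheckExist` is reduced to a dictionary lookup, and the names `terminal_paths`, `assign_ids`, and the ID `"F1"` are hypothetical.

```python
import xml.etree.ElementTree as ET


def terminal_paths(xml_text):
    """Yield (path, value) for every leaf element that carries text."""
    root = ET.fromstring(xml_text)
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            value = (node.text or "").strip()
            if value:
                yield path, value
        # Push children in reverse so they are visited in document order.
        for child in reversed(children):
            stack.append((child, path + "/" + child.tag))


def assign_ids(xml_text, pattern_tree):
    """Return the list L of (terminal path, value, ID) triples.

    pattern_tree is assumed to map a terminal path to its ID; this lookup
    stands in for CheckExist(TP, T, ID) on the slide.
    """
    labelled = []
    for path, value in terminal_paths(xml_text):
        found_id = pattern_tree.get(path)
        if found_id is not None:  # "IF ID is not empty"
            labelled.append((path, value, found_id))
    return labelled


page = (
    "<html><body><table>"
    "<tr><td>A</td></tr>"
    "<tr><td><strong>B</strong></td></tr>"
    "</table></body></html>"
)
pattern_tree = {"/html/body/table/tr/td": "F1"}  # hypothetical pattern tree
result = assign_ids(page, pattern_tree)
```

Here only the value `A` receives an ID, because the path to `B` ends in `strong` and has no entry in the assumed pattern tree.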
  10. • Using XSD to check whether the template of a Web source has changed
        › Use XSD (XML Schema Definition) to validate the XML.
          • Validating the tag-based structure of the XML succeeds.
          • The method cannot validate the content of the XML.
  11. • Input: Pold, the old Web page; Pnew, the new Web page
      • Output: true or false
      • Algorithm:
          XMLold = HtmlToXML(Pold)
          XMLnew = HtmlToXML(Pnew)
          Xsd = XMLToXSD(XMLold)
          IF Validate(XMLnew, Xsd) THEN
            Success
          ELSE
            Miss
          END IF
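The verification flow can be illustrated with a small Python sketch. It does not use real XSD: as a deliberately simplified stand-in, `infer_structure` records which child tags were seen under each tag of the old page (playing the role of `XMLToXSD`), and `conforms` checks the new page against that record (playing the role of `Validate`). Both function names and the sample pages are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def infer_structure(xml_text):
    """Map each tag to the set of child tags seen under it in the old page."""
    allowed = {}
    for node in ET.fromstring(xml_text).iter():
        allowed.setdefault(node.tag, set()).update(child.tag for child in node)
    return allowed


def conforms(xml_text, allowed):
    """Return True if every tag and parent/child pair was seen in the old page."""
    for node in ET.fromstring(xml_text).iter():
        if node.tag not in allowed:
            return False
        if any(child.tag not in allowed[node.tag] for child in node):
            return False
    return True


old_page = (
    "<html><body><table>"
    "<tr><td>A</td></tr><tr><td><strong>B</strong></td></tr>"
    "</table></body></html>"
)
# Same tags, but the emphasis moved from B to A (a change in meaning only).
moved_emphasis = (
    "<html><body><table>"
    "<tr><td><strong>A</strong></td></tr><tr><td>B</td></tr>"
    "</table></body></html>"
)
# A redesigned template that no longer uses a table.
redesigned = "<html><body><div><ul><li>A</li><li>B</li></ul></div></body></html>"

structure = infer_structure(old_page)
same_template = conforms(moved_emphasis, structure)
changed_template = not conforms(redesigned, structure)
```

In this sketch the redesigned page is reported as a template change, while the page with the emphasis moved still passes, which matches the limitation that a purely tag-based check cannot see changes in content or meaning.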
  12. • Paper:
        › On the verification of web wrappers
        › WEWRA: An algorithm for Wrapper Verification, March 2009, ML
      • Program:
  13. • Roshni Mohapatra, Kanagasabai Rajaraman, and Sung Sam Yuan. Efficient Wrapper Reinduction from Dynamic Web Sources. WI'04.
      • Alberto Pan, Juan Raposo, Manuel Álvarez, Víctor Carneiro, and Fernando Bellas. Automatically maintaining navigation sequences for querying semi-structured web sources. Data & Knowledge Engineering, Volume 63, Issue 3, December 2007, Pages 795-810.
  14. • Ongoing Work
        › XML to XSD
        › Terminal value to Basic ID
      • Future Work
  15. • Completed
        › Transform the XML file into a schema file (XSD file)
          • Changes in the XML are verified using the XSD.
        › Assign a SetID to each terminal value
          • Features: LetterDensity, DigitDensity, PunDensity, UpperLetterDensity, MeanWordLength, MeanNumberToken
          • Cosine similarity
          • Result: zero or one SetID
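The feature-and-cosine step can be sketched as below. The slide names the features but does not define them, so every formula here is an assumption: densities are taken as character counts divided by text length, MeanWordLength as the mean length of whitespace-separated tokens, and MeanNumberToken as the fraction of tokens made only of digits. The threshold value and the names `features`, `cosine`, and `assign_set_id` are likewise hypothetical.

```python
import math
import string


def features(text):
    """Build the six-element feature vector (definitions are assumptions)."""
    n = len(text) or 1
    tokens = text.split()
    t = len(tokens) or 1
    return [
        sum(c.isalpha() for c in text) / n,             # LetterDensity
        sum(c.isdigit() for c in text) / n,             # DigitDensity
        sum(c in string.punctuation for c in text) / n,  # PunDensity
        sum(c.isupper() for c in text) / n,             # UpperLetterDensity
        sum(len(tok) for tok in tokens) / t,            # MeanWordLength
        sum(tok.isdigit() for tok in tokens) / t,       # MeanNumberToken
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_set_id(value, examples, threshold=0.95):
    """Return the best-matching SetID, or None if nothing reaches the threshold."""
    best_id, best_score = None, threshold
    vector = features(value)
    for set_id, sample in examples.items():
        score = cosine(vector, features(sample))
        if score >= best_score:
            best_id, best_score = set_id, score
    return best_id


# Hypothetical example values, one per SetID.
examples = {"date": "2009/11/24", "title": "Progress Report"}
chosen = assign_set_id("2009/10/09", examples)
```

With these definitions the unscaled MeanWordLength is much larger than the density features and tends to dominate the cosine score, so some normalisation of the features would probably be needed before the threshold becomes meaningful.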
  16. • Issues:
        › Verification:
          • XSD can detect changes in the tag-based structure.
          • XSD cannot detect semantic changes. See Figure.
        › Assigning basic ID values:
          • If the relation between a path from the Web page and a path from the pattern tree is one-to-one, the result is either accept or reject.
          • If the relation is one-to-many, the assignment becomes a classification problem.
          • In the first extracted data, several values belong to a single field, but these values may have been split across several basic IDs.
          • Assigning a basic ID value to each terminal value is therefore a problem.
  17. • Add the number sequence of the path to each terminal node to the feature set.
      • Collect more Web pages
        › For each Web site: 10 queries, N result pages.
      • XML partial paths
        › To bridge the gap between the pattern tree and the Web pages.
      • Survey other papers
        › Automatically maintaining wrappers for semi-structured web sources (focus on generating a new training set).
          • Juan Raposo, Alberto Pan, Manuel Álvarez, Justo Hidalgo
        › Wrapper Maintenance: A Machine Learning Approach
          • Kristina Lerman, Steven N. Minton, Craig A. Knoblock
  18. Before:
        <html>
        <body>
        <table>
        <tr>
        <td>A</td>
        </tr>
        <tr>
        <td>
        <strong>B</strong>
        </td>
        </tr>
        </table>
        </body>
        </html>
      After:
        <html>
        <body>
        <table>
        <tr>
        <td>
        <strong>A</strong>
        </td>
        </tr>
        <tr>
        <td>B</td>
        </tr>
        </table>
        </body>
        </html>