Progress Report 20091009


Published in: Technology

  1. 1. Progress Report<br />2009.10.09<br />Yen-Ling Lin<br />
  2. 2. Outline<br />Introduction<br />Ongoing work<br />Future work<br />
  3. 3. Introduction (1/3)<br />Identifying useful information on the World Wide Web is important in Web mining and information agents.<br />Wrappers are software modules that capture semi-structured data on the Web in a structured format.<br />Wrappers can be coded manually or learned from examples using a technique called wrapper induction.<br />
  4. 4. Introduction (2/3)<br />Wrappers for semi-structured Web sources<br />Wrappers need to perform two kinds of tasks:<br />Executing automated navigation sequences through Web sites to access the pages containing the required data.<br />Generating data extraction programs for obtaining the structured records from the retrieved HTML pages.<br />The vast majority of work on automatic and semi-automatic wrapper generation has focused on the second task.<br />
  5. 5. Introduction (3/3)<br />Wrapper maintenance<br />The main problem with wrappers is that they can become invalid when the Web sources change.<br />Wrapper maintenance can be divided into three main tasks:<br />Detecting the changes on the source that invalidate the current wrapper.<br />Regenerating the automated navigation sequences required to access the pages containing the required data.<br />Regenerating the data extraction programs needed to extract the structured results from the HTML pages.<br />The first task is called wrapper verification.<br />
  6. 6. Runtime Gadget Execution<br />[Flow diagram. Node labels: Gadget's profile; Grab web pages; Web Pages; Template + Schema; Extractor; Extracted Data; Desired Data; Unsupervised WI; Schema Matching; New Schema + Template; Data. Decision node: Template change? (Yes / No)]<br />
  7. 7. Ongoing work (1/2)<br />Extract data from web pages by using the pattern tree and previous web pages.<br />Compare the terminal paths in the DOM tree against our schema.<br />Steps:<br />Find the same paths in the DOM tree.<br />Filter out the paths without a schema type (basic).<br />The result may be one or more paths with a schema type (basic).<br />
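The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: terminal paths are represented as plain strings, the schema is a hypothetical dictionary mapping each known path to a schema type, and "basic" is taken to be the type label that marks a path as holding a data value. None of these names come from the slides.

```python
def match_basic_paths(schema_paths, page_paths, wanted_type="basic"):
    """Keep the page's terminal paths that also appear in the schema
    and whose schema type equals `wanted_type`.

    schema_paths: dict mapping terminal path -> schema type (assumed format)
    page_paths:   list of terminal paths found in the current page
    """
    # Step 1: find the paths shared by the schema and the current page.
    shared = [path for path in page_paths if path in schema_paths]
    # Step 2: filter out shared paths that do not carry the wanted type.
    return [path for path in shared if schema_paths[path] == wanted_type]


# Hypothetical example data.
SCHEMA = {
    "/html/body/div/h1": "basic",
    "/html/body/div/ul": "set",
    "/html/body/div/p": "basic",
}
PAGE_PATHS = ["/html/body/div/h1", "/html/body/div/ul", "/html/body/span"]

matched = match_basic_paths(SCHEMA, PAGE_PATHS)
```

As the last step on the slide notes, the result can hold one path or several, so a caller would still need some way to choose among multiple candidates.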
  8. 8. Extract data from web pages by using the pattern tree<br />Input: P: a web page, T: pattern tree<br />Output: L: the terminal paths in P with their assigned IDs<br />Algorithm:<br />Transform P into XML format<br />FOR EACH TP: terminal path in P<br /> ID := empty<br /> CheckExist(TP, T, ID)<br /> IF ID not equal to empty THEN<br />  Add (TP, Value, ID) to L<br /> END IF<br />END FOR<br />
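A minimal Python sketch of this algorithm is given below. It rests on several assumptions that the slides do not confirm: the page is already well-formed XML (the HTML-to-XML step is skipped), a terminal path is the sequence of tags from the root to a leaf element, and the pattern tree is a nested dictionary whose nodes may carry an `id`. The names `terminal_paths`, `check_exist`, and `label_page` are illustrative only.

```python
import xml.etree.ElementTree as ET


def terminal_paths(root):
    """Yield (path, text) for every leaf element, in document order.
    `path` is a tuple of tag names from the root down to the leaf."""
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            yield path, (node.text or "").strip()
        # Push in reverse so children are popped in document order.
        for child in reversed(children):
            stack.append((child, path + (child.tag,)))


def check_exist(path, pattern_tree):
    """Return the ID stored at `path` in the pattern tree, or None."""
    node = {"children": pattern_tree}
    for tag in path:
        node = node.get("children", {}).get(tag)
        if node is None:
            return None
    return node.get("id")


def label_page(xml_text, pattern_tree):
    """Build L: a list of (terminal path, value, ID) triples."""
    labelled = []
    for path, value in terminal_paths(ET.fromstring(xml_text)):
        node_id = check_exist(path, pattern_tree)
        if node_id is not None:
            labelled.append(("/".join(path), value, node_id))
    return labelled


# Hypothetical pattern tree and page.
PATTERN_TREE = {"html": {"children": {"body": {"children": {
    "h1": {"id": "title"},
    "p": {"id": "price"},
}}}}}
PAGE = ("<html><body><h1>Widget</h1><p>9.99</p>"
        "<div><span>ad</span></div></body></html>")

labels = label_page(PAGE, PATTERN_TREE)
```

Paths that are absent from the pattern tree, such as the `span` inside `div` here, are simply skipped, which mirrors the `IF ID not equal to empty` test in the pseudocode.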
  9. 9. Ongoing work (2/2)<br />Using XSD to check if the template of web sources changes<br />Using XSD (XML Schema Definition) to validate the XML<br />Validating the tag-based structure of the XML is successful.<br />The method cannot validate the content of the XML.<br />
  10. 10. Using XSD to check if the template of web sources changes<br />Input: Pold: old web page, Pnew: new web page<br />Output: true or false<br />Algorithm:<br />XMLold=HtmlToXML(Pold)<br />XMLnew=HtmlToXML(Pnew)<br />Xsd = XMLToXSD(XMLold)<br /> IF(Validate(XMLnew,Xsd))<br /> Success<br /> ELSE<br /> Miss<br /> END IF <br />
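The check above can be approximated with the sketch below. It is not real XSD generation or validation: Python's standard library has no XSD validator, so a crude structural signature (which child tags may appear under which parent tag) stands in for `XMLToXSD` and `Validate`. With a real schema file, a library such as lxml (`lxml.etree.XMLSchema`) would be the usual choice. The function names are hypothetical, and both pages are assumed to be well-formed XML already.

```python
import xml.etree.ElementTree as ET


def infer_structure(xml_text):
    """Stand-in for XMLToXSD: record the root tag and, for each tag,
    the set of child tags observed under it."""
    root = ET.fromstring(xml_text)
    structure = {}
    for node in root.iter():
        structure.setdefault(node.tag, set()).update(c.tag for c in node)
    return root.tag, structure


def validate(xml_text, schema):
    """Stand-in for Validate: accept the document only if every
    parent/child tag relation was seen in the old page. Missing
    elements are tolerated, and text content is ignored entirely."""
    root_tag, structure = schema
    root = ET.fromstring(xml_text)
    if root.tag != root_tag:
        return False
    for node in root.iter():
        allowed = structure.get(node.tag)
        if allowed is None:
            return False
        if any(child.tag not in allowed for child in node):
            return False
    return True


def template_unchanged(old_xml, new_xml):
    """True corresponds to 'Success', False to 'Miss'."""
    return validate(new_xml, infer_structure(old_xml))


# Hypothetical pages.
OLD = "<html><body><h1>A</h1><p>1</p></body></html>"
SAME_TEMPLATE = "<html><body><h1>B</h1><p>2</p></body></html>"
NEW_TEMPLATE = "<html><body><div><h1>B</h1></div></body></html>"
```

Because only tag structure is compared, a page whose values change but whose layout stays the same still passes, which matches the limitation noted on the previous slide.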
  11. 11. Future work<br />Paper:<br />On the verification of web wrappers<br />WEWRA: An algorithm for Wrapper Verification, 2009 March, ML<br />Program:<br />
  12. 12. Reference<br />Roshni Mohapatra, Kanagasabai Rajaraman, and Sung Sam Yuan. Efficient Wrapper Reinduction from Dynamic Web Sources. WI’04<br />Alberto Pan, Juan Raposo, Manuel Álvarez, Víctor Carneiro, Fernando Bellas. Automatically maintaining navigation sequences for querying semi-structured web sources. Data & Knowledge Engineering, Volume 63, Issue 3, December 2007, Pages 795-810<br />
  13. 13. Thanks for your time<br />