Introduction (1/3) Identifying useful information on the World Wide Web is important in Web mining and information agents. Wrappers are software modules that capture semi-structured data on the Web in a structured format. A wrapper can be coded manually or learned from examples using a technique called wrapper induction.
Introduction (2/3) Wrappers for semi-structured Web sources. Wrappers need to perform two kinds of tasks: (1) executing automated navigation sequences through Web sites to access the pages containing the required data, and (2) generating data extraction programs that obtain structured records from the retrieved HTML pages. The vast majority of work on automatic and semi-automatic wrapper generation has focused on the second task.
Introduction (3/3) Wrapper maintenance. The main problem with wrappers is that they can become invalid when the Web sources change. Wrapper maintenance can be divided into three main tasks: (1) detecting the changes on the source that invalidate the current wrapper; (2) regenerating the automated navigation sequences required to access the pages containing the required data; (3) regenerating the data extraction programs needed to extract the structured results from the HTML pages. The first task is called wrapper verification.
Runtime Gadget Execution [Figure: flowchart with the nodes Gadget's profile, Grab web pages, Web Pages, Template + Schema, Template change? (Yes/No), Extractor, Extracted Data, Unsupervised WI, Desired Data, Schema Matching, New Schema + Template, Data]
Ongoing work (1/2) Extract data from web pages by using the pattern tree and previous web pages. Compare the terminal paths in the DOM tree against our schema. Steps: (1) find the same paths in the DOM tree; (2) filter out the paths without a schematype (basic); (3) finally, one or more paths with a schematype (basic) may remain.
Extract data from web pages by using the pattern tree
Input: P: a web page, T: pattern tree
Output: L: list assigning an ID to the terminal paths in P
Algorithm:
  Transform P into XML format
  FOR EACH TP: terminal path in P
    ID := empty
    CheckExist(TP, T, ID)
    IF ID is not empty THEN
      Add (TP, Value, ID) to L
    END IF
  END FOR
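The extraction loop above can be sketched in Python. This is only an illustration under stated assumptions: the page is assumed to be already converted to well-formed XML, the pattern tree is modeled as a nested dict whose leaves are field IDs, and the names terminal_paths, check_exist, and extract are hypothetical, not taken from the slides.

```python
import xml.etree.ElementTree as ET


def terminal_paths(root):
    """Yield (path, text) for every leaf element; path is a tuple of tag names."""
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            yield path, (node.text or "").strip()
        # Push children in reverse so they are popped in document order.
        for child in reversed(children):
            stack.append((child, path + (child.tag,)))


def check_exist(path, pattern_tree):
    """Follow path through the nested-dict pattern tree (assumed shape).

    Returns the field ID stored at the leaf, or None if the path is absent.
    """
    node = pattern_tree
    for tag in path:
        if not isinstance(node, dict) or tag not in node:
            return None
        node = node[tag]
    return node if isinstance(node, str) else None


def extract(xml_text, pattern_tree):
    """Return a list of (terminal path, value, ID) for paths found in the pattern tree."""
    root = ET.fromstring(xml_text)
    results = []
    for path, value in terminal_paths(root):
        field_id = check_exist(path, pattern_tree)
        if field_id is not None:
            results.append(("/".join(path), value, field_id))
    return results


# Hypothetical page and pattern tree, for illustration only.
page = "<html><body><div><h1>Widget</h1><span>9.99</span></div><p>footer</p></body></html>"
pattern = {"html": {"body": {"div": {"h1": "title", "span": "price"}}}}
rows = extract(page, pattern)
```

In this sketch the footer paragraph is skipped because its terminal path has no entry in the pattern tree, which corresponds to the ID staying empty in the pseudocode.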
Ongoing work (2/2) Using XSD to check whether the template of a web source has changed. XSD (XML Schema Definition) is used to validate the XML. Validating the tag-based structure of the XML succeeds, but this method cannot validate the content of the XML.
Using XSD to check whether the template of a web source has changed
Input: Pold: old web page, Pnew: new web page
Output: true or false
Algorithm:
  XMLold = HtmlToXML(Pold)
  XMLnew = HtmlToXML(Pnew)
  Xsd = XMLToXSD(XMLold)
  IF Validate(XMLnew, Xsd) THEN
    Success
  ELSE
    Miss
  END IF
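The check above can be approximated with the Python standard library, which has no XSD validator. The sketch below therefore swaps in a much simpler stand-in: it infers, for each tag in the old page, the set of child tags seen under it, and accepts the new page only if it uses the same root and no unseen parent/child combinations. It assumes both pages are already well-formed XML (the HtmlToXML step is skipped), and infer_schema, validate, and template_unchanged are hypothetical names.

```python
import xml.etree.ElementTree as ET


def infer_schema(root):
    """Return (root tag, {tag: set of child tags}) as a toy stand-in for an XSD."""
    children = {}
    for elem in root.iter():
        children.setdefault(elem.tag, set()).update(child.tag for child in elem)
    return root.tag, children


def validate(root, schema):
    """Check that every element and parent/child pair in root was seen in the schema."""
    root_tag, children = schema
    if root.tag != root_tag:
        return False
    for elem in root.iter():
        allowed = children.get(elem.tag)
        if allowed is None:
            return False
        if any(child.tag not in allowed for child in elem):
            return False
    return True


def template_unchanged(xml_old, xml_new):
    """True if the new page fits the tag structure inferred from the old page."""
    schema = infer_schema(ET.fromstring(xml_old))
    return validate(ET.fromstring(xml_new), schema)


# Hypothetical pages, for illustration only.
old_page = "<html><body><div><span>9.99</span></div></body></html>"
new_content = "<html><body><div><span>12.50</span></div></body></html>"
new_layout = "<html><body><table><tr><td>9.99</td></tr></table></body></html>"
same = template_unchanged(old_page, new_content)
changed = template_unchanged(old_page, new_layout)
```

Like the method on the slide, this sketch looks only at tag structure: a page whose values change but whose tags stay the same still passes.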
Future work Paper: On the verification of web wrappers WEWRA: An algorithm for Wrapper Verification, 2009 March, ML Program:
Reference
Roshni Mohapatra, Kanagasabai Rajaraman, and Sung Sam Yuan. Efficient Wrapper Reinduction from Dynamic Web Sources. WI'04.
Alberto Pan, Juan Raposo, Manuel Álvarez, Víctor Carneiro, and Fernando Bellas. Automatically maintaining navigation sequences for querying semi-structured web sources. Data & Knowledge Engineering, Volume 63, Issue 3, December 2007, Pages 795-810.