1
 2009/10/09
 2009/11/24




               2
 Introduction
 Ongoing work
 Future work




                 3
 Identifying useful information from the
  World Wide Web is important in Web
  mining and Information Agents.
 Wrappers are software modules that
  help capture the semi-structured data
  on the web into a structured format.
 Wrappers can be either coded manually
  or learned from examples using a
  technique called wrapper induction.

                                     4
   Wrappers for semi-structured Web
    sources
    › Wrappers need to perform two kinds of tasks:
       Executing automated navigation sequences
        through Web sites to access the pages
        containing the required data.
       Generating data extraction programs for
        obtaining the structured records from the
        retrieved HTML pages.
    › The vast majority of work on
     automatic and semi-automatic wrapper
     generation has focused on the second
     task.
                                             5
   Wrapper maintenance
    › The main problem with wrappers is that they can
      become invalid when the Web sources change.
   Wrapper maintenance can be divided into three main tasks:
    › Detecting the changes on the source that
      invalidate the current wrapper.
    › Regenerating the automated navigation
      sequences required to access the pages
      containing the required data.
    › Regenerating the data extraction programs
      needed to extract the structured results from the
      HTML pages.
   The first task is called wrapper verification.
                                                  6
Runtime Gadget Execution
[Flowchart. Node labels: Gadget's profile; Template + Schema; Grab web pages; Web Pages; Extractor; Template change (No / Yes); Extracted Data; Desired Data; Unsupervised WI; New Schema + Template; Data; Schema Matching.]
                                                  7
   Extract data from web pages by using
    the pattern tree and previous web
    pages.
    › Compare the terminal paths in the DOM
      tree against our schema.
    › Steps:
       Find the matching paths in the DOM tree.
       Filter out the paths without a schema type (basic).
       Finally, one or more paths with a
        schema type (basic) may remain.


                                                 8
   Input: P: a web page, T: pattern tree
   Output: L: the IDs assigned to the terminal paths in P
   Algorithm:
    Convert P into XML format
    FOR EACH TP: terminal path in P
        ID := empty
        CheckExist(TP, T, ID)
        IF ID is not empty THEN
            Add (TP, Value, ID) to L
        END IF
    END FOR
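
A minimal Python sketch of the loop above, under stated assumptions: the page is already well-formed XML, the pattern tree is reduced to a plain dictionary from terminal path to ID, and `terminal_paths`, `check_exist`, and `assign_ids` are hypothetical names chosen for illustration rather than the project's actual code.

```python
import xml.etree.ElementTree as ET


def terminal_paths(root):
    """Yield (path, text) for every leaf element that carries text."""
    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                yield path, text
        for child in children:
            yield from walk(child, path)
    yield from walk(root, "")


def check_exist(path, pattern_tree):
    """Stand-in for CheckExist: return the ID stored for this path, or None."""
    return pattern_tree.get(path)


def assign_ids(page_xml, pattern_tree):
    """Build L: one (path, value, ID) triple per terminal path found in T."""
    labelled = []
    for path, value in terminal_paths(ET.fromstring(page_xml)):
        path_id = check_exist(path, pattern_tree)
        if path_id is not None:
            labelled.append((path, value, path_id))
    return labelled


# Assumed toy data: only the <b> path is known to the pattern tree.
page = ("<html><body><table><tr><td>Title</td><td><b>9.99</b></td></tr>"
        "</table></body></html>")
pattern_tree = {"/html/body/table/tr/td/b": "price"}
print(assign_ids(page, pattern_tree))
# [('/html/body/table/tr/td/b', '9.99', 'price')]
```

Terminal paths that are missing from the pattern tree are simply skipped, which mirrors the "ID is not empty" test in the pseudocode.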

                                                      9
   Using XSD to check whether the template of
    a web source has changed
    › Using XSD (XML Schema Definition) to
      validate the XML
       Validation of the tag-based structure of the XML
        works.
       The method cannot validate the content of
        the XML.




                                                 10
   Input: Pold: old web page, Pnew: new web page
   Output: true or false
   Algorithm:
            XMLold = HtmlToXML(Pold)
            XMLnew = HtmlToXML(Pnew)
            Xsd = XMLToXSD(XMLold)
            IF Validate(XMLnew, Xsd) THEN
                 Success (output true)
            ELSE
                 Miss (output false)
            END IF
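
The check can be illustrated with a rough Python sketch. It is not real XSD validation: as a deliberately simplified stand-in, it learns which child tags were seen under each tag in the old page and rejects the new page if it contains any other parent/child pair. Element order and cardinality, which an XSD would also constrain, are ignored. The names `structure` and `validate` and the sample pages are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def structure(xml_text):
    """Map each tag to the set of child tags observed under it."""
    allowed = {}
    for node in ET.fromstring(xml_text).iter():
        allowed.setdefault(node.tag, set()).update(child.tag for child in node)
    return allowed


def validate(xml_new, allowed):
    """True if every parent/child tag pair in xml_new was seen in the old page."""
    for node in ET.fromstring(xml_new).iter():
        if node.tag not in allowed:
            return False
        if any(child.tag not in allowed[node.tag] for child in node):
            return False
    return True


old = "<table><tr><td>A</td></tr><tr><td><strong>B</strong></td></tr></table>"
# Same tags, but the emphasis moved from B to A: a semantic change only.
swapped = "<table><tr><td><strong>A</strong></td></tr><tr><td>B</td></tr></table>"
# A tag that never appeared in the old page: a structural change.
changed = "<table><tr><td><div>A</div></td></tr></table>"

schema = structure(old)
print(validate(swapped, schema))  # True
print(validate(changed, schema))  # False
```

The `swapped` page passes even though the meaning of the cells has changed, which is the limitation of purely tag-based verification.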

                                              11
   Paper:
    › On the verification of web wrappers
    › WEWRA: An algorithm for Wrapper
     Verification, 2009 March, ML


   Program:




                                            12
 Roshni Mohapatra, Kanagasabai
  Rajaraman, and Sung Sam Yuan.
  Efficient Wrapper Reinduction from
  Dynamic Web Sources. WI’04
 Alberto Pan, Juan Raposo, Manuel
  Álvarez, Víctor Carneiro, Fernando
  Bellas. Automatically maintaining
  navigation sequences for querying
  semi-structured web sources. Data &
  Knowledge Engineering, Volume 63, Issue
  3, December 2007, Pages 795-810

                                     13
   Ongoing Work
    › XML → XSD
    › Terminal value → Basic ID
   Future Work




                                  14
   Completed
    › Convert the XML file into a schema file (XSD
     file)
       Changes in the XML are verified using the XSD
    › Assign a SetID to each terminal value
       Features:
         LetterDensity, DigitDensity, PunDensity,
          UpperLetterDensity, MeanWordLength,
          MeanNumberToken
       Cosine similarity
       Result: zero or one SetID
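
A hedged Python sketch of the SetID assignment follows. The slides only name the features, so every formula below is an assumption, as are the function names, the use of one example value per SetID, and the 0.99 threshold.

```python
import math
import string


def features(value):
    """Feature vector for one terminal value (all definitions are assumed)."""
    n = max(len(value), 1)
    tokens = value.split()
    t = max(len(tokens), 1)
    return [
        sum(c.isalpha() for c in value) / n,              # LetterDensity
        sum(c.isdigit() for c in value) / n,              # DigitDensity
        sum(c in string.punctuation for c in value) / n,  # PunDensity
        sum(c.isupper() for c in value) / n,              # UpperLetterDensity
        sum(len(tok) for tok in tokens) / t,              # MeanWordLength
        # MeanNumberToken, assumed here: share of tokens starting with a digit
        sum(tok[0].isdigit() for tok in tokens) / t,
    ]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_set_id(value, prototypes, threshold=0.99):
    """Return the best-matching SetID, or None if nothing reaches the threshold."""
    vec = features(value)
    best_id, best_score = None, threshold
    for set_id, example in prototypes.items():
        score = cosine(vec, features(example))
        if score >= best_score:
            best_id, best_score = set_id, score
    return best_id


prototypes = {"price": "24.50"}  # assumed: one example value per SetID
print(assign_set_id("19.99", prototypes))
print(assign_set_id("Hello World", prototypes))
```

With these assumed definitions, MeanWordLength is on a larger scale than the density features and dominates the cosine score, so the threshold has to sit close to 1; rescaling the features would be one way to make the comparison less sensitive to word length.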

                                                     15
   Issues:
    › Verification:
       XSD can detect changes in the tag-based structure.
       XSD cannot detect semantic changes. See the
        Before/After figure.
    › Assigning basic ID values
       If the relation between a path from the web page
        and a path from the pattern tree is one-to-one,
         the result is either reject or accept.
       If the relation is one-to-many, it becomes a
        classification problem.
       In the first extracted data, some values belong to one
        field,
         but these values may have been split across several basic IDs.
         This makes assigning a basic ID to a terminal value
          a problem.
                                                             16
   Add the number sequence of each terminal
    node's path to the feature set
   Collect more web pages
    › For each web site: 10 queries, N result pages.
   XML partial paths
    › To bridge the gap between the pattern tree and the web
      pages.
   Survey other papers
    › Automatically maintaining wrappers for
      semi-structured web sources. (Focus on generating a new
      training set.)
       Juan Raposo, Alberto Pan, Manuel Álvarez, Justo
        Hidalgo
    › Wrapper Maintenance: A Machine Learning
      Approach
       Kristina Lerman, Steven N. Minton, Craig A. Knoblock
                                                          17
Before:                         After:
<html>                          <html>
<body>                          <body>
  <table>                         <table>
    <tr>                            <tr>
        <td>A</td>                      <td>
    </tr>                                  <strong>A</strong>
    <tr>                                </td>
        <td>                        </tr>
           <strong>B</strong>       <tr>
        </td>                           <td>B</td>
    </tr>                           </tr>
  </table>                        </table>
</body>                         </body>
</html>                         </html>

                                                             18

Progress Report
