DIADEM: domain-centric intelligent automated data extraction methodology

OXPath
Scalable, Memory-efficient Data Extraction from Web Applications

Andrew Sellers
September 1st, 2011 @ Department of Computer Science, Oxford University
joint work with Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart
1

A Lingua Franca
      for
Web Extraction
                  2
OXPath » Lingua Franca for Web Extraction
1

A Call for Action in Web Extraction!
     Past: Form Filling + HTML Patterns


     Now: Interaction + DOM Patterns
         getting to the data requires interaction not just form filling
         identifying relevant data from rendered DOMs
            including computed style and geometric information
            access to all CSS properties, but less rich spatial relations than in SXPath




                                                                                   8
The nesting in the result mirrors the structure of the OXPath expression:
extraction markers in a predicate (title, source) represent attributes
to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For example,
the following expression queries Google for "Oxford", traverses all
accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
                 /following::field()[1]/{click /}
       /( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*

  To limit the range of the Kleene star, one can specify upper and
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
/descendant::input[@type='checkbox'][2]/{uncheck}/following::field()[1]/{unc
//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
    [//div.property-info//a/{click/}//div.home-description:<info=(.)>
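The bounded star (...)*{3,8} can be read as "apply the iterated step at least 3 and at most 8 times". The Python sketch below models that reading over a hypothetical in-memory chain of result pages. The names PAGES, next_page and kleene are illustrative only and not part of OXPath, and the sketch assumes each page has at most one successor, whereas an OXPath step may match several nodes.

```python
from typing import Callable, Iterator, Optional

# Hypothetical paginated site: each page maps to the page its "Next" link opens.
PAGES = {"p1": "p2", "p2": "p3", "p3": "p4", "p4": None}

def next_page(page: str) -> Optional[str]:
    """Stand-in for the iterated step, e.g. clicking the "Next" link."""
    return PAGES.get(page)

def kleene(start: str, step: Callable[[str], Optional[str]],
           lower: int = 0, upper: Optional[int] = None) -> Iterator[str]:
    """Yield every page reached after r applications of `step`,
    for lower <= r <= upper (no upper bound if `upper` is None)."""
    page, r = start, 0
    while page is not None and (upper is None or r <= upper):
        if r >= lower:
            yield page
        page, r = step(page), r + 1

print(list(kleene("p1", next_page)))        # unbounded star: p1 to p4
print(list(kleene("p1", next_page, 1, 2)))  # bounds {1,2}: p2 and p3
```

With no bounds the start page itself is included (zero applications), which is why the Google expression above also extracts the links on the first result page.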

Wrapper Babel
     Wrapper induction & data extraction systems
         each invent their own wrapper language
         often separate navigation and matching
     Main classes:
     (1) pattern matching + imperative navigation
            XPath
            Finite & Tree Automata
            Token Prefix/Suffix
     (2) Datalog
            E-Log (Lixto)
                                                    11

Why OXPath?

     OXPath: an XPath for data extraction
         scalability
         familiarity
         simplicity
         web applications
                                                               12
2



    OXPath
             13
Start at kayak.co.uk:
doc("kayak.co.uk")
To select an airport, type a few letters and select from completion list
   //field().destination/{"Sea" /}
            //div#smartbox//li[1]/{click /}
Submit the form




                                                                      14
Refine the results by unchecking the "2+ stops":
   //*#stops2/{uncheck }
On all result pages
   /(//a[.='Next']/{click /})*
and for each flight
   //body.resultrow:<flight>
Extract the attributes
Mouseover the ! to extract flight quality warnings
   //span.qualityWarningIcon/{mouseover /}
Click on the details to extract layovers




                                                  15
2   OXPath » The Language


 Interactive Wrapper for Kayak
doc("kayak.co.uk")//input#origin/{"Lon" /}
     //div#smartbox//li[1]/{click /}
     //input#destination/{"Sea" /}
     //div#smartbox//li[1]/{click /}
     /descendant::field()[last()]/{click /}

      //*#stops1/{uncheck }/following::*#stops2/{click /}

      /(//a[.='Next']/{click /})*

      //body.resultrow:<flight>
         [.//a.results_price:<price=string(.)>]
         [.//span.qualityWarningIcon/{mouseover /}
           //div.airqualitylist//tr:<warning=string(.)>]
         [.//a.resultdetaillink/{click /}
           //*.layover//td.airportCode:<layover=string(.)>]
[Figure: overlapping Amazon.co.uk screenshots (steps numbered 1 to 5) of a book search for "Hyderabad": search form, result list refined by department, and a product details page. {click /} actions are marked on the navigation steps, and the extracted fields title, price and publisher are marked on the results.]


➊ Actions: Browser Interaction
     Actions correspond to DOM events, e.g.,

     Document           doc("rightmove.co.uk")

     Click              {click}

     Fill               {"Sea"}

     Mouseover          {mouseover}

     Executed once on each context node
     Return    context nodes             (contextual actions, {action}) or
               root nodes for new DOM    (absolute actions, {action /})
                                                                   18
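The difference between the two kinds of action can be sketched in Python with two hypothetical DOM snapshots, one before and one after a click. Everything here (the snapshots, the function names, and the choice to re-locate the context node by re-running the path that selected it) is an assumption made for illustration, not the OXPath implementation.

```python
import xml.etree.ElementTree as ET

# Two hypothetical DOM snapshots: before and after clicking a "details" link.
BEFORE = ET.fromstring(
    "<html><body><div id='row'><a id='details'>more</a></div></body></html>")
AFTER = ET.fromstring(
    "<html><body><div id='row'><a id='details'>less</a>"
    "<p>2 stops</p></div></body></html>")

def absolute_action(context_path: str) -> list:
    """Absolute action: fire the event, continue from the root of the new DOM."""
    assert BEFORE.find(context_path) is not None  # the action needs a target
    return [AFTER]

def contextual_action(context_path: str) -> list:
    """Contextual action: fire the event, then re-locate the context
    nodes in the new DOM by re-running the path that selected them."""
    (root,) = absolute_action(context_path)
    return root.findall(context_path)

path = ".//a[@id='details']"
print([n.tag for n in absolute_action(path)])     # root element of the new DOM
print([n.text for n in contextual_action(path)])  # the same link, in the new DOM
```

A contextual action is convenient when the rest of the expression should continue relative to the node that was acted on, for example reading a tooltip next to an icon after a mouseover.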


➋ Extraction: Compact Tree Construction
     Extraction markers select nodes for extraction
        record markers:      :<flight>

        attribute markers:   :<price=string(.)>

     Extracted data has tree shape
        nesting of extraction markers in OXPath expression defines
        nesting of records and attribute-record associations in the output




                                                                             19
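How nested markers turn into a tree can be sketched in Python. The sketch assumes that every extraction match is reported as a tuple (id, parent id, marker, value), where the parent is the enclosing record marker; this tuple layout, the sample data and the function name build_tree are hypothetical.

```python
from collections import defaultdict

# Hypothetical extraction matches: (id, parent id, marker, value).
# Record markers (flight) carry no value; attribute markers (price, layover) do.
MATCHES = [
    (1, None, "flight", None),
    (2, 1, "price", "GBP 420"),
    (3, 1, "layover", "ORD"),
    (4, None, "flight", None),
    (5, 4, "price", "GBP 515"),
]

def build_tree(matches):
    """Nest each match under the match of its parent marker."""
    children = defaultdict(list)
    for match_id, parent, marker, value in matches:
        children[parent].append((match_id, marker, value))

    def node(match_id, marker, value):
        if value is not None:              # attribute marker: a leaf
            return {marker: value}
        record = {}
        for child in children[match_id]:   # record marker: collect its children
            for key, val in node(*child).items():
                record.setdefault(key, []).append(val)
        return {marker: record}

    return [node(*top) for top in children[None]]

for flight in build_tree(MATCHES):
    print(flight)
```

Attribute values are collected in lists because one record can carry several matches of the same marker, such as multiple layovers for one flight.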


➌ Iteration: Kleene Star
     Most web sites use pagination for results
        traversing paginated results requires iteration
        ⇢ extraction from any unbounded component of a link graph


     Kleene Star from Regular XPath [Marx TODS '05]
        extended to OXPath, i.e., with action in the iterated expression
        /(//a[.='Next']/{click /})*

        OXPath’s Page-at-a-time algorithm
           buffers in practice only a constant number of pages
           even for very large components
                                                                           20


➍ Style: Querying Visual Attributes
     Access to all CSS properties via style axis

     Visibility      style::display or style::visibility

     Font size       style::font-size

     Geometry        style::top, style::left, ...

     Color           style::color or style::background-color

     Joins on style properties possible
        but: no rich spatial relations as in SXPath

                                                               21
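What a selection or join on style properties amounts to can be sketched in Python over a hypothetical list of rendered nodes with a few computed CSS values. The node data and the helper names visible and same_style are made up for illustration; OXPath evaluates such conditions inside the browser through the style axis.

```python
# Hypothetical rendered nodes with a few computed CSS properties.
NODES = [
    {"id": "title", "display": "block", "font-size": "18px", "color": "rgb(0, 0, 0)"},
    {"id": "hidden-ad", "display": "none", "font-size": "18px", "color": "rgb(0, 0, 0)"},
    {"id": "price", "display": "inline", "font-size": "12px", "color": "rgb(153, 0, 0)"},
]

def visible(nodes):
    """Selection on a style property: keep nodes that are actually rendered."""
    return [n for n in nodes if n["display"] != "none"]

def same_style(nodes, reference, prop):
    """Join on a style property: nodes sharing `prop` with `reference`."""
    return [n for n in nodes if n is not reference and n[prop] == reference[prop]]

print([n["id"] for n in visible(NODES)])
print([n["id"] for n in same_style(NODES, NODES[0], "font-size")])
```

Both helpers compare property values only, which matches the slide's caveat: equality joins are possible, spatial relations such as "left of" are not.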


OXPath = XPath + 4

      action
                            extraction
              style
                              iteration

                                          22
3



    Analysis
               23
OXPath » Formal Properties
3

Semantics – Overview
     Semantics defined over relational structure
         as in XPath Leashed [Benedikt 08]
         but: extended to multiple documents and with action relations

            action relations form tree ⇢ no two actions lead to same DOM
         but: instead of single node set, set of relations for extracted tree


     Context extended by last match for parent extraction marker
         necessary to construct the extraction tree
         last match for sibling extraction used for
            compact specification of records with mixed attributes and nested records
                                                                                       24
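\[
\begin{array}{lll}
\text{N1} & \llbracket \mathit{estep}/\mathit{path} \rrbracket_N(c) & = \{ c'' \mid c' \in \llbracket \mathit{estep} \rrbracket_N(c) \wedge c'' \in \llbracket \mathit{path} \rrbracket_N(c') \} \\
\text{N2} & \llbracket \mathit{axis}::\mathit{nodes} \rrbracket_N(c) & = \{ \langle n', c.p, c.l \rangle \mid R_{\mathit{axis}}(c.n, n') \wedge n' \in R_{\mathit{nodes}} \} \\
\text{N3} & \llbracket \mathit{step}[q] \rrbracket_N(c) & = \{ c' \in \llbracket \mathit{step} \rrbracket_N(c) \mid \llbracket q \rrbracket_B(\langle c'.n, c'.l, c'.l \rangle) \} \\
\text{N4} & \llbracket \mathit{step}^{\pm}[\mathit{qp}] \rrbracket_N(c) & = \{ c' \in \llbracket \mathit{step}^{\pm} \rrbracket_N(c) \mid C = \llbracket \mathit{step}^{\pm} \rrbracket_N(c) \wedge {} \\
 & & \phantom{= \{} \llbracket \mathrm{REWRITE}^{\pm}(\mathit{qp}, C, c') \rrbracket_B(\langle c'.n, c'.l, c'.l \rangle) \} \\
\text{N5} & \llbracket \{\mathit{action}\ /\} \rrbracket_N(c) & = \{ \langle n', c.p, c.l \rangle \} \text{ with } n' \text{ such that } R_{\mathit{action}}(c.n, n') \\
\text{N6} & \llbracket \{\mathit{action}\} \rrbracket_N(c) & = \llbracket \mathrm{AFP}(\mathit{action}, c.n) \rrbracket_N(\llbracket \{\mathit{action}\ /\} \rrbracket_N(c)) \\
\text{N7} & \llbracket \mathit{step} : M[= v] \rrbracket_N(c) & = \{ \langle c'.n, c'.p, \mathrm{OUT}(c'.n, M) \rangle \mid c' \in \llbracket \mathit{step} \rrbracket_N(c) \} \\
\text{N8} & \llbracket (\mathit{path})^{*} \rrbracket_N(c) & = \{ c_r \mid r \ge 0,\ \forall\, 0 \le s \le r : \\
 & & \phantom{= \{} c_{s+1} \in \llbracket \mathit{path} \rrbracket_N(c_s) \wedge c_0 = c \} \\
\text{N9} & \llbracket (\mathit{path})^{*}\{v, w\} \rrbracket_N(c) & = \{ c_r \mid v \le r \le w,\ \forall\, 0 \le s \le r : \\
 & & \phantom{= \{} c_{s+1} \in \llbracket \mathit{path} \rrbracket_N(c_s) \wedge c_0 = c \} \\
\text{N10} & \llbracket (\mathit{expr})[\mathit{qp}] \rrbracket_N(c) & = \{ c' \mid c' \in C \wedge C = \llbracket \mathit{expr} \rrbracket_N(c) \wedge {} \\
 & & \phantom{= \{} \llbracket \mathrm{REWRITE}^{+}(\mathit{qp}, C, c') \rrbracket_B(\langle c'.n, c'.l, c'.l \rangle) \} \\
\text{N11} & \llbracket \mathrm{doc}(\mathit{uri}) \rrbracket_N(c) & = F_{\mathrm{doc}}(\mathit{uri})
\end{array}
\]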
                                                                                         26

Page-At-A-Time (PAAT)
     Keeps memory use constant in the number of pages traversed
     Experimentally confirmed small overhead over XPath
         overhead due to nested record extraction
         actions do not affect complexity
         Kleene star: small overhead over regular XPath
     Extracted records are never buffered
         but streamed as soon as they are matched
     Offers optimal page management by minimising buffer size



                                                                27
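The streaming idea behind these points can be sketched in Python. This is not the PAAT algorithm itself, only a minimal model of its two stated properties under simple assumptions: pages form a linear chain, a page is dropped from the buffer before the next one is loaded, and records are yielded as soon as they are matched. The names load_page, Stats and page_at_a_time are hypothetical.

```python
from collections import deque
from typing import Iterator, List

def load_page(page_no: int) -> List[str]:
    """Hypothetical page loader: returns the records found on one page."""
    return [f"record-{page_no}-{i}" for i in range(2)]

class Stats:
    """Records the largest number of pages held in the buffer at once."""
    peak_buffered_pages = 0

def page_at_a_time(num_pages: int, stats: Stats) -> Iterator[str]:
    buffer = deque()
    for page_no in range(1, num_pages + 1):
        buffer.append(load_page(page_no))
        stats.peak_buffered_pages = max(stats.peak_buffered_pages, len(buffer))
        for record in buffer[0]:
            yield record          # streamed to the consumer, never collected
        buffer.popleft()          # page is released before the next is loaded

stats = Stats()
total = sum(1 for _ in page_at_a_time(50, stats))
print(total, stats.peak_buffered_pages)
```

The peak buffer size stays the same whether 50 or 50,000 pages are traversed, which is the sense in which memory is constant in the number of pages.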
[Figure: overlapping screenshots of Google Scholar result pages for the query "world wide web", annotated with numbered nodes]
                                                                      28
OXPath » Formal Properties
3

 Summary of Complexity

Combined: PTIME-hard PTIME-hard
Data: NLOGSPACE LOGSPACE

Fragment                        Time                         Space
OXPath w/o Actions & Kleene     O(n^6⋅q^2)  O(n^4⋅q^2)       O(n^5⋅q^2)  O(n^3⋅q^2)
OXPath w/o Kleene               O((p⋅n)^6⋅q^3)               O(n^5⋅q^3)
OXPath w/o unbounded Kleene     O((p⋅n)^6⋅q^3)               O(n^5⋅q^3)
OXPath (full)                   O((p⋅n)^6⋅q^3)               O(n^5⋅(q+d)^3)

Callouts:
     Extraction marker = n-ary, nested queries
     Actions = multiple pages
     Contextual actions (action free prefix)
     Buffer bounded by page depth
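
To make the shape of the bounds for full OXPath concrete, the sketch below evaluates the expressions inside the O(·) terms for sample values. The slide does not define the symbols, so the readings used here are assumptions: p as the number of pages, n as the page size, q as the expression size, and d as the depth that bounds the buffer. Constant factors are ignored, and `time_bound` and `space_bound` are hypothetical helper names.

```python
def time_bound(p: int, n: int, q: int) -> int:
    """Expression inside the time bound O((p*n)^6 * q^3)."""
    return (p * n) ** 6 * q ** 3


def space_bound(n: int, q: int, d: int) -> int:
    """Expression inside the space bound O(n^5 * (q+d)^3); p does not occur."""
    return n ** 5 * (q + d) ** 3


if __name__ == "__main__":
    n, q, d = 1_000, 10, 3  # assumed sample values
    for p in (1, 10, 100):
        # The time expression grows with p; the space expression does not.
        print(p, time_bound(p, n, q), space_bound(n, q, d))
```

The point of the comparison is that p appears only in the time expression, which matches the constant-memory claim over the number of pages traversed.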

                                                                                                       29
OXPath » Formal Properties
3

Page-At-A-Time (PAAT)

node-set semantics
                                 extraction semantics

              complexity close to XPath
                                       parallelizable


                                                        30
4



Evaluation
             31
100,000+ pages, millions of results




[Plot (b) Millions of results: memory [MB] (left axis, 0 to 200) with extracted matches and visited pages (right axis, #pages [1000] / #results [100,000], 0 to 160) over time [h] (0 to 12)]

Constant Memory
                                                                                              32
[Pie chart: segments labelled PAAT, browser initialization, and page rendering; values 2%, 13%, and 85%]

browser bound
                                                  33
[Plot: time w/o page loading [sec] (0 to 700) against #pages (0 to 140) for OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, and Chickenfoot]

faster
                                                                                   34
[Plot: time w/o page loading [sec] (0 to 1600) against number of pages (0 to 800) for OXPath, Lixto, Web Harvest, and Chickenfoot]

even faster
                                                                              35
[Plot: memory [MB] (0 to 350) against #pages (0 to 800) for OXPath, Lixto, Web Harvest, and Chickenfoot]

memory
only hundreds of pages as other tools fail for more pages
                                                             36
OXPath » System & Evaluation
3

Minimal

    constant memory
                                   minimal page buffer

            browser bound
                                            very low overhead



                                                           37
4


    OXPath
     Suite
             38
4   OXPath » OXPath Suite


Visual OXPath: Semi-supervised Generation

                   Interaction recording


                 Generation of OXPath expressions
               ranked by robustness & specificity

[Screenshots: "Figure 3.6: User Interface – Before Expression Refinement"; "Figure 3.4: User Interface – Action Prope…"]
                                                    39
4   OXPath » OXPath Suite


OXPath Tracer: Debugging



              Track selected DOM nodes



              Trace page management

                 Fine grained execution control
                    (filter, step-by-step, breaks)




                                                    40
4   OXPath » OXPath Suite


ONTOX: Ontology Population with OXPath




                                         41
DIADEM   domain-centric intelligent automated
         data extraction methodology




Questions
    diadem-project.info

Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 

Interactive OXPath Wrapper for Data Extraction from Kayak.co.uk

  • 1. DIADEM domain-centric intelligent automated data extraction methodology OXPath Scalable, Memory-efficient Data Extraction from Web Applications Andrew Sellers September 1st, 2011 @ Department of Computer Science, Oxford University joint work with Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart
  • 2. 1 A Lingua Franca for Web Extraction 2
  • 3. 3
  • 5. 5
  • 6. 6
  • 7. 7
  • 8. OXPath » Lingua Franca for Web Extraction: A Call for Action in Web Extraction! Past: form filling + HTML patterns. Now: interaction + DOM patterns. Getting to the data requires interaction, not just form filling. Relevant data is identified from rendered DOMs, including computed style and geometric information: access to all CSS properties, but less rich relations than in SXPath.
  • 10. The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a predicate (title, source) represent attributes to the last marker outside the predicate (story).
    Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links.
      doc("google.com")/descendant::field()[1]/{"Oxford"}
        /following::field()[1]/{click /}
        /( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
    To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}.
      doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
      /descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc
      //div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
      //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
      /(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
      [//div.property-info//a/{click/}//div.home-description:<info=(.)>
  • 11. OXPath » Lingua Franca for Web Extraction: Wrapper Babel. Wrapper induction & data extraction systems each invent their own wrapper language and often separate navigation and matching. Main classes: (1) pattern matching + imperative navigation (XPath, finite & tree automata, token prefix/suffix); (2) Datalog (E-Log in Lixto).
  • 12. OXPath » Lingua Franca for Web Extraction 1 Why OXPath? scalability familiarity an XPath for simplicity data extraction web applications 12
  • 13. 2 OXPath 13
  • 14. Start at kayak.co.uk: doc("kayak.co.uk"). To select an airport, type a few letters and select from the completion list: //field().destination/{"Sea" /} //div#smartbox//li[1]/{click /}. Submit the form.
  • 15. Refine the results by unchecking the “2+ stops”: //*#stops2/{uncheck }. On all result pages: /(//a[.='Next']/{click /})*. Extract the attributes for each flight: //body.resultrow:<flight>. Mouseover the ! to extract flight quality warnings: //span.qualityWarningIcon/{mouseover /}. Click on the details to extract layovers.
  • 16. OXPath » The Language: Interactive Wrapper for Kayak
      doc("kayak.co.uk")//input#origin/{"Lon" /}
        //div#smartbox//li[1]/{click /}
        //input#destination/{"Sea" /}
        //div#smartbox//li[1]/{click /}
        /descendant::field()[last()]/{click /}
        //*#stops1/{uncheck }/following::*#stops2/{click /}
        /(//a[.='Next'])*
        //body.resultrow:<flight>
          [.//a.results_price:<price=string(.)>]
          [.//span.qualityWarningIcon/{mouseover /}
             //div.airqualitylist//tr:<warning=string(.)>]
          [.//a.resultdetaillink/{click /}
             //*.layover//td.airportCode:<layover=string(.)>]
  • 17. [Screenshots: Amazon.co.uk book search for "Hyderabad", from the search form through the result list to a product details page, annotated with numbered {click /} actions and the extracted fields title, price, and publisher]
  • 18. OXPath » The Language ➊ Actions: Browser Interaction. Actions correspond to DOM events, e.g.: document, doc("rightmove.co.uk"); click, {click}; fill, {"Sea"}; mouseover, {mouseover}. Each action is executed once on each context node and returns either the context nodes (contextual actions) or the root nodes of the new DOM (absolute actions).
  • 19. OXPath » The Language ➋ Extraction: Compact Tree Construction. Extraction markers select nodes for extraction: record markers such as :<flight> and attribute markers such as :<price=string(.)>. Extracted data has tree shape: the nesting of extraction markers in the OXPath expression defines the nesting of records and the attribute-record associations in the output.
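
To make the nesting rule concrete, here is a minimal Python sketch. It is not OXPath's implementation: it assumes a hypothetical flat list of marker matches, each a tuple (match_id, parent_id, marker, value), where parent_id names the enclosing record match and value is None for record markers. The flight data is invented for illustration.

```python
from collections import defaultdict


def build_tree(matches):
    """Assemble flat marker matches into nested records.

    Each match is a hypothetical tuple (match_id, parent_id, marker, value):
    value is None for record markers such as :<flight> and a string for
    attribute markers such as :<price=string(.)>.
    """
    children = defaultdict(list)
    for match_id, parent_id, marker, value in matches:
        children[parent_id].append((match_id, marker, value))

    def expand(match_id, marker, value):
        if value is not None:
            return {marker: value}      # attribute marker: a leaf
        record = {}                     # record marker: gather nested matches
        for child in children[match_id]:
            for key, val in expand(*child).items():
                record.setdefault(key, []).append(val)
        return {marker: record}

    return [expand(*root) for root in children[None]]


# Invented example data: two flights, the first with two layovers.
matches = [
    (1, None, "flight", None),
    (2, 1, "price", "420"),
    (3, 1, "layover", "ORD"),
    (4, 1, "layover", "JFK"),
    (5, None, "flight", None),
    (6, 5, "price", "515"),
]
tree = build_tree(matches)
```

Each attribute ends up under the record whose match encloses it, which is the association the slide describes for markers placed inside a predicate.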
  • 20. OXPath » The Language ➌ Iteration: Kleene Star. Most web sites use pagination for results; traversing paginated results requires iteration ⇢ extraction from any unbounded component of a link graph. Kleene star from Regular XPath [Marx TODS ‘05], extended to OXPath, i.e., with actions in the iterated expression: /(//a[.='Next'])*. OXPath’s page-at-a-time algorithm buffers in practice only a constant number of pages, even for very large components.
  • 21. OXPath » The Language ➍ Style: Querying Visual Attributes. Access to all CSS properties via the style axis: visibility, style::display or style::visibility; font size, style::font-size; geometry, style::top, style::left, ...; color, style::color or style::background-color. Joins on style properties are possible, but there are no rich spatial relations as in SXPath.
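
As a rough illustration of filtering and joining on style properties, the following Python sketch uses plain dictionaries as stand-ins for computed styles. In OXPath these values come from the rendered page; the node ids, the helper names `is_visible` and `style_join`, and the sample data are all assumptions made for this example.

```python
def is_visible(node):
    """Treat a node as visible unless its (mock) computed style hides it."""
    style = node.get("style", {})
    return style.get("display") != "none" and style.get("visibility") != "hidden"


def style_join(nodes, prop):
    """Group node ids by the value of one style property (a simple join key)."""
    groups = {}
    for node in nodes:
        groups.setdefault(node.get("style", {}).get(prop), []).append(node["id"])
    return groups


# Invented nodes with mock computed styles.
nodes = [
    {"id": "a", "style": {"display": "block", "color": "red", "font-size": "12px"}},
    {"id": "b", "style": {"display": "none", "color": "red", "font-size": "12px"}},
    {"id": "c", "style": {"display": "block", "visibility": "hidden", "color": "blue"}},
    {"id": "d", "style": {"display": "inline", "color": "blue", "font-size": "16px"}},
]
visible = [n["id"] for n in nodes if is_visible(n)]
same_color = style_join([n for n in nodes if is_visible(n)], "color")
```

The grouping step corresponds to a join on a single style property; relations such as "to the left of" would need geometry comparisons that this sketch does not attempt.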
  • 22. OXPath » The Language: OXPath = XPath + 4: action, extraction, style, iteration.
  • 23. 3 Analysis 23
  • 24. OXPath » Formal Properties: Semantics – Overview. Semantics are defined over a relational structure as in XPath Leashed [Benedikt 08], but: extended to multiple documents and with action relations (action relations form a tree ⇢ no two actions lead to the same DOM); and: instead of a single node set, a set of relations for the extracted tree. The context is extended by the last match for the parent extraction marker (necessary to construct the extraction tree) and by the last match for a sibling extraction marker (used for compact specification of records with mixed attributes and nested records).
  • 25. results … … 25
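  • 26.
    N1: $[\![\mathit{estep}/\mathit{path}]\!]_N(c) = \{c'' \mid c' \in [\![\mathit{estep}]\!]_N(c) \land c'' \in [\![\mathit{path}]\!]_N(c')\}$
    N2: $[\![\mathit{axis}{::}\mathit{nodes}]\!]_N(c) = \{\langle n', c.p, c.l\rangle \mid R_{\mathit{axis}}(c.n, n') \land n' \in R_{\mathit{nodes}}\}$
    N3: $[\![\mathit{step}[q]]\!]_N(c) = \{c' \in [\![\mathit{step}]\!]_N(c) \mid [\![q]\!]_B(\langle c'.n, c'.l, c'.l\rangle)\}$
    N4: $[\![\mathit{step}^{\pm}[qp]]\!]_N(c) = \{c' \in [\![\mathit{step}^{\pm}]\!]_N(c) \mid C = [\![\mathit{step}^{\pm}]\!]_N(c) \land [\![\mathrm{REWRITE}^{\pm}(qp, C, c')]\!]_B(\langle c'.n, c'.l, c'.l\rangle)\}$
    N5: $[\![\{\mathit{action}\ /\}]\!]_N(c.n) = \{\langle n', c.p, c.l\rangle\}$ with $n'$ such that $R_{\mathit{action}}(c.n, n')$
    N6: $[\![\{\mathit{action}\}]\!]_N(c) = [\![\mathrm{AFP}(\mathit{action}, c.n)]\!]_N([\![\{\mathit{action}\ /\}]\!]_N(c))$
    N7: $[\![\mathit{step} : M[{=}v]]\!]_N(c) = \{\langle c'.n, c'.p, \mathrm{OUT}(c'.n, M)\rangle \mid c' \in [\![\mathit{step}]\!]_N(c)\}$
    N8: $[\![(\mathit{path})^{*}]\!]_N(c) = \{c_r \mid r \ge 0 \;\forall\, 0 \le s \le r : c_{s+1} \in [\![\mathit{path}]\!]_N(c_s) \land c_0 = c\}$
    N9: $[\![(\mathit{path})^{*}\{v, w\}]\!]_N(c) = \{c_r \mid v \le r \le w \;\forall\, 0 \le s \le r : c_{s+1} \in [\![\mathit{path}]\!]_N(c_s) \land c_0 = c\}$
    N10: $[\![(\mathit{expr})[qp]]\!]_N(c) = \{c' \mid c' \in C \land C = [\![\mathit{expr}]\!]_N(c) \land [\![\mathrm{REWRITE}^{+}(qp, C, c')]\!]_B(\langle c'.n, c'.l, c'.l\rangle)\}$
    N11: $[\![\mathrm{doc}(\mathit{uri})]\!]_N(c) = F_{\mathrm{doc}}(\mathit{uri})$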
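
Rules N8 and N9 can be read operationally: a context c_r belongs to the result if it is reached from c by r successive applications of path, with r unconstrained for the plain star and v ≤ r ≤ w for the bounded form. The Python sketch below follows that reading under simplifying assumptions: contexts are arbitrary hashable values rather than ⟨n, p, l⟩ triples, `path` is a hypothetical function from one context to a set of contexts, and the link graph is invented.

```python
def kleene(path, c, low=0, high=None):
    """Contexts reachable from c by applying `path` r times, low <= r <= high.

    high=None means no upper bound, as in the unbounded star of rule N8.
    """
    def step(contexts):
        out = set()
        for ctx in contexts:
            out |= path(ctx)
        return out

    frontier = {c}
    for _ in range(low):                 # apply path exactly `low` times
        frontier = step(frontier)
    result = set(frontier)
    r = low
    while frontier and (high is None or r < high):
        # Only contexts not collected yet can contribute anything new,
        # so cycles in the link graph cannot loop forever.
        frontier = step(frontier) - result
        result |= frontier
        r += 1
    return result


# Invented link graph: page 1 -> 2 -> 3 -> 4, plus a two-page cycle a <-> b.
NEXT = {1: {2}, 2: {3}, 3: {4}, 4: set(), "a": {"b"}, "b": {"a"}}


def follow(ctx):
    return NEXT.get(ctx, set())


all_pages = kleene(follow, 1)            # plain star
bounded = kleene(follow, 1, 1, 2)        # bounded form with v=1, w=2
cyclic = kleene(follow, "a", 2)          # at least two applications on a cycle
```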
  • 27. OXPath » Formal Properties: Page-At-A-Time (PAAT). Assures constant memory over the number of pages traversed. Experimentally confirmed. Small overhead over XPath: overhead due to nested record extraction; actions do not affect complexity. Kleene star: small overhead over Regular XPath. Extracted records are never buffered but streamed as soon as they are matched. Offers optimal page management by minimising buffer size.
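
The streaming idea can be sketched with a Python generator. This is a deliberately simplified stand-in for PAAT, covering only a linear chain of result pages: `load_page` is a hypothetical loader returning a dict with "records" and "next", and the three-page site is invented. Nested navigation, where the slides bound the buffer by page depth, is not modelled.

```python
def paginate(load_page, first_url):
    """Yield records page by page, keeping a single page in memory.

    `load_page(url)` is a hypothetical loader returning a dict with
    "records" (a list) and "next" (the next URL, or None on the last page).
    """
    url = first_url
    while url is not None:
        page = load_page(url)            # the only page held at this point
        for record in page["records"]:
            yield record                 # streamed out, never accumulated
        url = page["next"]               # the previous page can now be freed


# Invented three-page site.
SITE = {
    "p1": {"records": ["r1", "r2"], "next": "p2"},
    "p2": {"records": ["r3"], "next": "p3"},
    "p3": {"records": ["r4", "r5"], "next": None},
}
loaded = []


def load_page(url):
    loaded.append(url)                   # record which pages were fetched
    return SITE[url]


stream = paginate(load_page, "p1")
first = next(stream)                     # triggers loading of p1 only
loaded_after_first = list(loaded)
rest = list(stream)                      # drains the remaining pages
```

Because records leave the generator as soon as they are matched, memory use depends on the size of one page rather than on the number of pages visited.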
  • 28. [Screenshots: overlapping Google Scholar result pages for the query "world wide web", annotated with page numbers]
  • 29. OXPath » Formal Properties: Summary of Complexity
    Combined: PTIME-hard, PTIME-hard. Data: NLOGSPACE, LOGSPACE.
    Extraction marker = n-ary, nested queries. Actions = multiple pages. Contextual actions (action free prefix). Buffer bounded by page depth.
    Time / Space:
      OXPath w/o Actions & Kleene: O(n^6⋅q^2), O(n^4⋅q^2), O(n^5⋅q^2), O(n^3⋅q^2)
      OXPath w/o Kleene: O((p⋅n)^6⋅q^3) / O(n^5⋅q^3)
      OXPath w/o unbounded Kleene: O((p⋅n)^6⋅q^3) / O( n5⋅q∑3 )
      OXPath (full): O((p⋅n)^6⋅q^3) / O(n^5⋅(q+d)^3)
  • 30. OXPath » Formal Properties: Page-At-A-Time (PAAT). Node-set semantics, extraction semantics, complexity close to XPath, parallelizable.
  • 32. 100,000+ pages, millions of results: constant memory. [Plot (b) Millions of results: memory [MB], extracted matches (#results [100,000]), and visited pages (#pages [1000]) over time [h]]
  • 33. Browser bound. [Pie chart: PAAT 2%, browser initialization 13%, page rendering 85%]
  • 34. Faster. [Plot: time w/o page loading [sec] over #pages for OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, and Chickenfoot]
  • 35. Even faster. [Plot: time w/o page loading [sec] over number of pages for OXPath, Lixto, Web Harvest, and Chickenfoot]
  • 36. Memory: only hundreds of pages, as other tools fail for more pages. [Plot: memory [MB] over #pages for OXPath, Lixto, Web Harvest, and Chickenfoot]
  • 37. OXPath » System & Evaluation: minimal constant memory, minimal page buffer, browser bound, very low overhead.
  • 38. 4 OXPath Suite 38
  • 39. OXPath » OXPath Suite: Visual OXPath: Semi-supervised Generation. Interaction recording. Generation of OXPath expressions ranked by robustness & specificity. [Screenshots: Figure 3.4: User Interface – Action Prope; Figure 3.6: User Interface – Before Expression Refinement]
  • 40. OXPath » OXPath Suite: OXPath Tracer: Debugging. Track selected DOM nodes. Trace page management. Fine-grained execution control (filter, step-by-step, breaks).
  • 41. 4 OXPath » OXPath Suite ONTOX: Ontology Population with OXPath 41
  • 42. DIADEM domain-centric intelligent automated data extraction methodology
  • 43. DIADEM domain-centric intelligent automated data extraction methodology Questions diadem-project.info