Effective Web Scraping with OXPath
 

OXPath presentation at WWW 2013 Rio de Janeiro

    Effective Web Scraping with OXPath: Presentation Transcript

    • DIADEM: domain-centric intelligent automated data extraction methodology
      • Effective Web Scraping with OXPath (http://oxpath.org)
      • Giovanni Grasso, Oxford University
      • May 15th, 2013 @ WWW developer track
      • Joint work with Tim Furche, Christian Schallhart, ...
      • Wednesday, 15 May 13
    • 1. OXPath » Lingua Franca for Web Extraction: A Call for Action in Web Extraction!
      • Past: form filling + HTML patterns
      • Now: interaction + DOM patterns
        • getting to the data requires interaction, not just form filling
        • identifying relevant data from rendered DOMs across several pages
        • access to all CSS properties (computed style)
    • The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a predicate (title, source) represent attributes to the last marker outside the predicate (story).
      • Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links:
          doc("google.com")/descendant::field()[1]/{"Oxford"}/following::field()[1]/{click /}/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
      • To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}.
      • Zillow example (partially clipped on the slide):
          doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
          /descendant::input[@type='checkbox'][2]/{uncheck}/following::field()[1]/{unc
          //div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
          //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}/
          (//span.arrowNext/a/{click/})*//ul#search-results/li:<property>[
          //div.property-info//a/{click/}//div.home-description:<info=(.)>
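The bounded Kleene star on the slide above can be pictured as a loop that applies a "click Next" step between a lower and an upper number of times. The Python sketch below only illustrates that idea on a simulated site; it is not how OXPath is implemented. The `PAGES` data and the functions `follow_next`, `iterate_bounded`, and `extract_links` are hypothetical names invented for this example.

```python
# Simulated result pages: each has some links and an optional "Next" target.
# All page ids and URLs are made up for illustration.
PAGES = {
    "p1": {"links": ["http://a.example", "http://b.example"], "next": "p2"},
    "p2": {"links": ["http://c.example"], "next": "p3"},
    "p3": {"links": ["http://d.example"], "next": None},
}


def follow_next(page_id):
    """Return the page reached by 'clicking Next', or None on the last page."""
    return PAGES[page_id]["next"]


def iterate_bounded(start, lower=0, upper=None):
    """Mimic (step)*{lower,upper}: yield each page reached after at least
    `lower` and at most `upper` applications of the Next step.
    With upper=None the iteration is unbounded, like a plain Kleene star."""
    page, steps = start, 0
    while page is not None and (upper is None or steps <= upper):
        if steps >= lower:
            yield page
        page = follow_next(page)
        steps += 1


def extract_links(start, lower=0, upper=None):
    """Collect the links on every page matched by the bounded iteration."""
    return [
        link
        for page in iterate_bounded(start, lower, upper)
        for link in PAGES[page]["links"]
    ]


print(extract_links("p1"))                    # all pages
print(extract_links("p1", lower=1, upper=1))  # exactly one Next click
```

Without bounds the loop visits every reachable result page; with `lower=1, upper=1` it keeps only the page reached after a single "Next" click.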
    • 1. OXPath » Lingua Franca for Web Extraction: Wrapper Babel
      • Wrapper induction & data extraction systems
        • each invents its own wrapper language
        • or uses its own ad-hoc tool or proprietary language
      • Mainly pattern matching + imperative navigation
        • mix of XPath & external flow control
      • Limited interaction with complex interfaces
        • (simple) form filling & submit
        • focus on automation via visual interfaces
      • Limited extraction language
        • no multiway navigation
    • 1. OXPath » Lingua Franca for Web Extraction: Why OXPath?
      • an XPath for data extraction
      • simplicity, learnable, familiarity, scalability
    • 2. OXPath » The Language: OXPath = XPath + 4 (action, iteration, extraction, style)
    • Start at kayak.co.uk:
          doc("kayak.co.uk")
      • To select an airport, type a few letters and select from the completion list:
          //field().destination/{"Sea" /}//div#smartbox//li[1]/{click /}
      • This will submit the form.
    • Refine the results by unchecking the “2+ stops”:
          //*#stops2/{uncheck }
      • On all result pages:
          /(//a[.='Next']/{click /})*
      • and for each flight:
          //body.resultrow:<flight>
    • Extract the attributes
      • Mouseover the ! to extract flight quality warnings:
          //span.qualityWarningIcon/{mouseover /}
      • Click on the details to extract layovers
    • 2. OXPath » The Language: ➊ Actions: Browser Interaction
      • Actions correspond to DOM events, e.g.:
        • Document: doc("google.com")
        • Click: {click}
        • Fill: {"Rio"}
        • Mouseover: {mouseover}
      • Executed once on each context node
      • Return the context nodes for contextual actions, or the root nodes of the new DOM for absolute actions such as {click/}
    • 2. OXPath » The Language: ➋ Extraction: Compact Tree Construction
      • Extraction markers select nodes for extraction
        • record markers: :<flight>
        • attribute markers: :<price=string(.)>
      • Extracted data has tree shape
        • nesting of extraction markers in the OXPath expression defines the nesting of records and the attribute-record associations in the output
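To make the tree-shaped output concrete, the Python sketch below models a record marker as a node that holds its attribute markers and any nested records, then serialises it as indented XML. This is an illustration only: the `Record` class, the `layover` and `airport` names, the sample values, and the exact XML layout are assumptions made for the example, not the format produced by the OXPath engine.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class Record:
    """One output node, standing in for a record marker such as :<flight>."""
    name: str
    attributes: dict = field(default_factory=dict)  # from attribute markers
    children: list = field(default_factory=list)    # from nested record markers

    def to_xml(self, indent=0):
        """Serialise the record and everything nested inside it."""
        pad = "  " * indent
        lines = [f"{pad}<{self.name}>"]
        for key, value in self.attributes.items():
            lines.append(f"{pad}  <{key}>{escape(str(value))}</{key}>")
        for child in self.children:
            lines.append(child.to_xml(indent + 1))
        lines.append(f"{pad}</{self.name}>")
        return "\n".join(lines)


# A flight record with one attribute and one nested record (sample data).
flight = Record("flight", {"price": "420"})
flight.children.append(Record("layover", {"airport": "ORD"}))
print(flight.to_xml())
```

The nesting of `Record` objects plays the role that the nesting of extraction markers plays in an expression: an attribute attaches to the record it sits inside.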
    • 2. OXPath » The Language: ➌ Iteration: Kleene Star
      • Most web sites use pagination techniques for results
        • traversing paginated results requires iteration
        • ⇢ extraction from any unbounded component of a link graph
      • Kleene star with an action in the iterated expression:
          /(//a[.='Next']/{click /})*
          /(//body/{scroll /})*    (infinite scroll)
      • OXPath’s evaluation algorithm buffers, in practice, only a constant number of pages
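The constant page buffer can be illustrated with a generator that emits matches as soon as they are found and keeps a reference only to the page currently being processed. The Python sketch below is a simplified analogy under that assumption, not OXPath's evaluation algorithm; `load_page` is a hypothetical stand-in for rendering a page in a browser, and the result strings are made up.

```python
def load_page(number, last=5):
    """Hypothetical stand-in for rendering result page `number` in a browser."""
    return {
        "number": number,
        "rows": [f"result-{number}-{i}" for i in range(3)],
        "has_next": number < last,
    }


def stream_results(first=1, last=5):
    """Yield results page by page while holding only the current page."""
    page = load_page(first, last)
    while True:
        # Emit matches immediately instead of collecting whole pages.
        yield from page["rows"]
        if not page["has_next"]:
            break
        # Rebinding `page` drops the only reference to the previous page,
        # so memory use does not grow with the number of pages visited.
        page = load_page(page["number"] + 1, last)


for row in stream_results(last=2):
    print(row)
```

Because results are streamed, the caller decides whether to store them; the traversal itself never needs more than one page at a time in this sketch.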
    • 2. OXPath » The Language: ➍ Style: Querying Visual Attributes
      • Access to all computed-style CSS properties via the style axis
        • Visibility: style::display or style::visibility
        • Font size: style::font-size
        • Geometry: style::top, style::left, ...
        • Color: style::color or style::background-color
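Selecting nodes by computed style amounts to filtering on the property values the browser has resolved for each node. The Python sketch below shows that kind of filter over hand-written sample data; the `NODES` list and the helpers `is_rendered` and `select_by_style` are hypothetical, and the visibility check is deliberately simplified (it ignores inheritance from ancestors, for instance).

```python
# Sample nodes with already-computed style values (made up for illustration).
NODES = [
    {"text": "420 GBP", "style": {"display": "block", "visibility": "visible",
                                  "color": "rgb(255, 0, 0)"}},
    {"text": "hidden promo", "style": {"display": "none", "visibility": "visible",
                                       "color": "rgb(0, 0, 0)"}},
    {"text": "515 GBP", "style": {"display": "block", "visibility": "hidden",
                                  "color": "rgb(0, 0, 0)"}},
]


def is_rendered(node):
    """Simplified visibility test based on the display and visibility properties."""
    style = node["style"]
    return style.get("display") != "none" and style.get("visibility") != "hidden"


def select_by_style(nodes, prop, value):
    """Return the text of every node whose computed `prop` equals `value`."""
    return [node["text"] for node in nodes if node["style"].get(prop) == value]


print([node["text"] for node in NODES if is_rendered(node)])
print(select_by_style(NODES, "color", "rgb(255, 0, 0)"))
```

Filters like these are what make it possible to skip content that is present in the DOM but not shown to the user.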
    • 3. Evaluation
    • Constant memory: 100,000+ pages, millions of results. [Plot (b) “Millions of results”: memory [MB] and #pages [1000] / #results [100,000] against time [h]; series: memory, extracted matches, visited pages.]
    • It’s the browser. [Chart: shares of 2%, 13%, and 85% across the labels page rendering, browser initialization, and OXPath.]
    • Faster. [Plot (b) “Norm. evaluation time, <150 p.”: time w/o page loading [sec] against #pages; series: OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, Chickenfoot.]
    • Even faster. [Plot (c) “Norm. evaluation time, <850 p.”: time w/o page loading [sec] against number of pages; series: OXPath, Lixto, Web Harvest, Chickenfoot.]
    • Memory: only hundreds of pages, as other tools fail for more pages. [Plot: memory [MB] against #pages; series: OXPath, Lixto, Web Harvest, Chickenfoot.]
    • 3. OXPath » System & Evaluation: Evaluation
      • constant memory
      • very low overhead on XPath
      • minimal page buffer
      • browser bound
      • fast
    • 4. OXPath: User stories
    • 4. DIADEM: Unsupervised Domain-specific Web Objects Extraction, presented @ World Wide Web 2012 (WWW’12)
    • DIADEM := domain-centric intelligent automated data extraction methodology
      • 2. Object identification & alignment: single entity (details) pages, tables
      • 3. Context-driven block analysis: flat text, Energy Performance Chart, maps, floor plans
      • 4. OXPath wrapper: cloud extraction, data integration
    • 4. Wrapper induction in DIADEM: Induced Wrapper (partial)
          doc('wwagency.co.uk')//select#sale_type_id/{0/}//button.formbtn/{click /}
          (//div.pagenumlinks[last()]//a[last()]/{click /})*
          //div.proplist_wrap:<RECORD>
          [.//span.prop_price:<PRICE=string(.)>]
          [.//ul.prop_keypoints/li[2]/strong:<BEDROOM_ROOMS=string(.)>]
          [.//div.prop_statuses//text():<PROPERTY_STATUS=string(.)>]
          [.//strong.orange:<POSTCODE=string(.)>]
          //div.prop_img/a/{click /}//body
          [.//div#propertypage_copy/p[last()-1]:<DESCRIPTION=string(.)>]
          [.//div#print_contact/address/text()[2]:<ADDRESS=string(.)>]
          [.//a.~Map view)]/@href:<MAP=string(.)>]
          [.//div#propertypage_copy/p[2]:<RECEPTION_ROOMS=string(.)>]
    • 4. DEQA: Question Answering on the Deep Web, presented @ International Semantic Web Conference 2012 (ISWC’12)
    • DEQA overview. [Diagram. Question: “House near a Kindergarden under 2,000,000 £?” Components: OXPath wrappers, TBSL, Linking-Metric, Domain Specific Triple Store. Labels in the store: White_Road, rdf:type gr:Offering, dd:hasPrice 1,499,950 £, dbp:near Kindergarden_A, Kindergarden_B. Answer: White_Road, dd:bedrooms 15, dd:hasPrice 1,499,950 £, dbp:near Kindergarden_A.]
    • 4. OXPath » DEQA: Question Answering on the Deep Web: RDF Wrapper (partial)
          doc('wwagency.co.uk')........
          //div.proplist_wrap:<gr:Offering>
          [.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>].....
          [.//strong.orange:<vcard:postal-code=string(.)>]....
          [.//div.prop_img/a/@href:<foaf:page=string(.)>]
          //div.prop_img/a/{click /}//body
          [.//div#propertypage_copy/p[last()-1]:<gr:description=string(.)>]
          [.//a.~Map view)]/@href:<wgs84:lat=extractLat(.)>]
          [.//a.~Map view)]/@href:<wgs84:long=extractLong(.)>]
    • 4. OXPath » DEQA: Question Answering on the Deep Web: Question translation to SPARQL
      • Example question: Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire
      • Paper excerpt (clipped on the slide):
          FILTER(regex(?y1,'Abingdon','i')) .}
        In that case, TBSL first performs a restriction by class (“House”), then it fi[...] the town name “Abingdon” from the street address and it performs a filter on [...] number of rooms. Note that most QA systems would not be sufficiently pow[...] to include such filters.
        Another example is “Edwardian houses close to supermarket for less [...] 1,000,000 in Oxfordshire”, which was translated to the following query:
          SELECT ?x0 WHERE {
            ?x0 <http://dbpedia.org/property/near> ?y2 .
            ?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
            ?v <http://purl.org/goodrelations/v1#includes> ?x0 .
            ?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 .
            ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 .
            ?y2 a <http://linkedgeodata.org/ontology/Supermarket> .
            ?x0 <http://purl.org/goodrelations/v1#description> ?y .
            FILTER(regex(?y0,'Oxfordshire','i')) .
            FILTER(regex(?y,'Edwardian ','i')) .
            FILTER(?y1 < 1000000) .
          }
        In that case, the links to LinkedGeoData were used by selecting the “near” p[...]
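To show what the three FILTER clauses and the supermarket constraint in the query above select, the Python sketch below applies equivalent checks to a few in-memory records. It is an illustration only, not part of DEQA: the `OFFERS` data, the field names, and the `matches` function are invented for the example, and `re.IGNORECASE` stands in for the SPARQL `'i'` flag.

```python
import re

# Made-up property offers; field names are assumptions for this sketch.
OFFERS = [
    {"id": "offer-1", "address": "1 Example Road, Abingdon, Oxfordshire",
     "description": "Edwardian house with garden", "price": 850000,
     "near": ["Supermarket"]},
    {"id": "offer-2", "address": "2 Example Road, Oxfordshire",
     "description": "Edwardian house", "price": 1200000,
     "near": ["Supermarket"]},
    {"id": "offer-3", "address": "3 Example Road, Reading, Berkshire",
     "description": "Edwardian terrace near shops", "price": 600000,
     "near": ["Supermarket"]},
    {"id": "offer-4", "address": "4 Example Road, Oxfordshire",
     "description": "Modern flat", "price": 400000, "near": []},
]


def matches(offer, max_price=1000000):
    """Apply the same four conditions as the query, one per line."""
    return (
        "Supermarket" in offer["near"]                                  # near a supermarket
        and re.search("Oxfordshire", offer["address"], re.IGNORECASE) is not None
        and re.search("Edwardian ", offer["description"], re.IGNORECASE) is not None
        and offer["price"] < max_price                                  # ?y1 < 1000000
    )


print([offer["id"] for offer in OFFERS if matches(offer)])
```

Each sample offer other than the first fails exactly one condition, which makes it easy to see which clause of the query excludes it.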
    • 5. Hands-on
    • 5. OXPath Engine
      • Version 1.1 available on http://oxpath.org (via code.google)
      • Java
        • Maven archetype and command line interface with examples
        • Output in XML, RDF and relational DB; custom output handler
      • Based on HTMLUnit
        • some limitations (e.g., no style axis)
      • Ongoing work
        • WebDriver-based implementation, JavaScript in the near future
        • Visual interface (record-and-play) as a Firefox extension
      • Any feedback is welcome! Get in touch with me
    • Live Demo
    • Questions? oxpath.org