Effective Web Scraping with OXPath

OXPath presentation at WWW 2013 Rio de Janeiro

  1. DIADEM: domain-centric intelligent automated data extraction methodology
     Effective Web Scraping with OXPath (http://oxpath.org)
     Giovanni Grasso - Oxford University
     May 15th, 2013 @ WWW developer track
     Joint work with Tim Furche, Christian Schallhart, …
     Wednesday, 15 May 13
  2. Part 1: OXPath » Lingua Franca for Web Extraction
     A Call for Action in Web Extraction!
     Past: form filling + HTML patterns
     Now: interaction + DOM patterns
       - getting to the data requires interaction, not just form filling
       - identifying relevant data from rendered DOMs across several pages
       - access to all CSS properties (computed style)
  6. The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a predicate (title, source) represent attributes of the last marker outside the predicate (story).
     Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for "Oxford", traverses all accessible result pages and extracts all links.
       doc("google.com")/descendant::field()[1]/{"Oxford"}
         /following::field()[1]/{click /}
         /( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
     To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}.
     Zillow example (line ends clipped on the slide):
       doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
       /descendant::input[@type='checkbox'][2]/{uncheck}/following::field()[1]/{unc
       //div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
       //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
       /(//span.arrowNext/a/{click/})*
       //ul#search-results/li:<property>
       [//div.property-info//a/{click/}//div.home-description:<info=(.)>
  9. Part 1: OXPath » Lingua Franca for Web Extraction
     Wrapper Babel
     Wrapper induction & data extraction systems
       - each invent their own wrapper language
       - or use their own ad-hoc tool or proprietary language
     Mainly pattern matching + imperative navigation
       - mix of XPath & external flow control
     Limited interaction with complex interfaces
       - (simple) form filling & submit
       - focus on automation via visual interfaces
     Limited extraction language
       - no multiway navigation
  10. Part 1: OXPath » Lingua Franca for Web Extraction
      Why OXPath? An XPath for data extraction: simplicity, learnable, familiarity, scalability
  11. Part 2: OXPath » The Language
      OXPath = XPath + 4: action, iteration, extraction, style
  13. Start at kayak.co.uk:
        doc("kayak.co.uk")
      To select an airport, type a few letters and select from the completion list:
        //field().destination/{"Sea" /}
        //div#smartbox//li[1]/{click /}
      This will submit the form.
  17. Refine the results by unchecking "2+ stops":
        //*#stops2/{uncheck }
      On all result pages:
        /(//a[.='Next']/{click /})*
      and for each flight:
        //body.resultrow:<flight>
  20. Extract the attributes.
      Mouseover the "!" to extract flight quality warnings:
        //span.qualityWarningIcon/{mouseover /}
      Click on the details to extract layovers.
  21. Part 2: OXPath » The Language
      ➊ Actions: Browser Interaction
        - Actions correspond to DOM events, e.g.:
            Document:  doc("google.com")
            Click:     {click}
            Fill:      {"Rio"}
            Mouseover: {mouseover}
        - Executed once on each context node
        - Return the context nodes for contextual actions, or the root nodes of the new DOM for absolute actions, e.g. {click/}
  22. Part 2: OXPath » The Language
      ➋ Extraction: Compact Tree Construction
        - Extraction markers select nodes for extraction
            record markers:    :<flight>
            attribute markers: :<price=string(.)>
        - Extracted data has tree shape: the nesting of extraction markers in an OXPath expression defines the nesting of records and the attribute-record associations in the output
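
The tree-shaped output described on this slide can be pictured with a small Python sketch. It is only an illustration of the nesting idea, not OXPath's output API: the flat `(depth, name, value)` input format, the function name `build_tree`, and the sample flight values are all assumptions made for this example.

```python
from typing import Optional

def build_tree(matches: list[tuple[int, str, Optional[str]]]) -> list[dict]:
    """Nest marker matches by depth.

    A match at depth d becomes a child of the most recent match at depth d-1,
    mimicking how attribute markers attach to the enclosing record marker.
    """
    roots: list[dict] = []
    stack: list[dict] = []  # stack[i] is the currently open node at depth i
    for depth, name, value in matches:
        node = {"name": name, "value": value, "children": []}
        del stack[depth:]  # close every node at this depth or deeper
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots

# Hypothetical matches: two <flight> records with their attributes.
matches = [
    (0, "flight", None),
    (1, "price", "420"),
    (1, "airline", "ExampleAir"),
    (0, "flight", None),
    (1, "price", "515"),
]
tree = build_tree(matches)
```

Each record ends up holding its own attributes, so the output mirrors the marker nesting rather than a flat table.
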
  24. Part 2: OXPath » The Language
      ➌ Iteration: Kleene Star
        - Most web sites use pagination techniques for results
        - Traversing paginated results requires iteration
          ⇢ extraction from any unbounded component of a link graph
        - Kleene star with an action in the iterated expression:
            /(//a[.='Next']/{click /})*
            /(//body/{scroll /})*   (infinite scroll)
        - OXPath's evaluation algorithm buffers, in practice, only a constant number of pages
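
As a rough analogy for the iteration on this slide, the Python sketch below follows "next" links over a simulated site and yields results page by page, so only the current page is held at any time. It is not OXPath's evaluation algorithm: the in-memory `PAGES` dictionary, the function name `crawl`, and the `max_pages` safety cap are assumptions introduced for the example.

```python
from typing import Iterator, Optional

# Hypothetical in-memory "site": each page has results and an optional next link.
PAGES = {
    "p1": {"results": ["a", "b"], "next": "p2"},
    "p2": {"results": ["c"], "next": "p3"},
    "p3": {"results": ["d", "e"], "next": None},
}

def crawl(start: str, max_pages: Optional[int] = None) -> Iterator[str]:
    """Yield results from each page, then follow 'next' zero or more times.

    Loosely mirrors (//a[.='Next']/{click /})*: the first page is processed
    even if there is no next link, and the loop stops when none is left.
    """
    url: Optional[str] = start
    visited = 0
    while url is not None and (max_pages is None or visited < max_pages):
        page = PAGES[url]              # only the current page is referenced
        yield from page["results"]     # hand results over before moving on
        url = page["next"]
        visited += 1

all_results = list(crawl("p1"))
first_two_pages = list(crawl("p1", max_pages=2))
```

Because results are yielded as soon as a page is processed, memory use in this sketch does not grow with the number of pages visited.
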
  25. Part 2: OXPath » The Language
      ➍ Style: Querying Visual Attributes
        - Access to all computed-style CSS properties via the style axis
            Visibility: style::display or style::visibility
            Font size:  style::font-size
            Geometry:   style::top, style::left, ...
            Color:      style::color or style::background-color
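
To make the use of visual attributes concrete, here is a minimal Python sketch that filters nodes by computed style. It does not use OXPath or a real browser: the node dictionaries, their sample text, and the helper names `is_visible` and `font_size_px` are assumptions for illustration only.

```python
# Hypothetical nodes: each carries a dict of computed CSS properties.
nodes = [
    {"text": "Price: 420", "style": {"display": "block", "visibility": "visible", "font-size": "14px"}},
    {"text": "Hidden tracking label", "style": {"display": "none"}},
    {"text": "Tooltip", "style": {"display": "block", "visibility": "hidden"}},
    {"text": "Headline", "style": {"display": "block", "font-size": "24px"}},
]

def is_visible(node: dict) -> bool:
    """Treat a node as visible unless display is 'none' or visibility is not 'visible'."""
    style = node.get("style", {})
    return style.get("display") != "none" and style.get("visibility", "visible") == "visible"

def font_size_px(node: dict) -> float:
    """Parse a pixel font size such as '14px'; nodes without one count as 0."""
    return float(node["style"].get("font-size", "0px").removesuffix("px"))

# Keep only rendered text, then only rendered text set in a large font.
visible_texts = [n["text"] for n in nodes if is_visible(n)]
large_texts = [n["text"] for n in nodes if is_visible(n) and font_size_px(n) >= 20]
```

Selecting on rendered appearance like this is what a plain DOM query cannot do, since hidden and visible elements look the same in the markup.
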
  26. Part 3: Evaluation
  27. Constant Memory: 100,000+ pages, millions of results
      [Figure (b), "Millions of results": memory [MB], #pages [1000] and #results [100,000] against time [h]; series: memory, extracted matches, visited pages]
  28. It's the browser
      [Chart: shares of 2%, 13% and 85%; labels: page rendering, browser initialization, OXPath]
  29. Faster
      [Figure (b), "Norm. evaluation time, <150 p.": time w/o page loading [sec] against #pages; series: OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, Chickenfoot]
  30. Even faster
      [Figure (c), "Norm. evaluation time, <850 p.": time w/o page loading [sec] against number of pages; series: OXPath, Lixto, Web Harvest, Chickenfoot]
  32. Memory
      [Figure: memory [MB] against #pages; series: OXPath, Lixto, Web Harvest, Chickenfoot]
      Only hundreds of pages, as other tools fail for more pages
  33. Part 3: OXPath » System & Evaluation
      Evaluation: constant memory; very low overhead on XPath; minimal page buffer; browser bound; fast
  34. Part 4: OXPath User stories
  35. Part 4: DIADEM
      Unsupervised Domain-specific Web Objects Extraction
      presented @ World Wide Web 2012 (WWW'12)
  41. DIADEM := domain-centric intelligent automated data extraction methodology
      [Diagram, partly clipped: (2) object identification & alignment: single entity (details) pages, tables; (3) context-driven block analysis: flat text, energy performance chart, maps, floor plans; (4) OXPath wrapper: cloud extraction, data integration]
  42. Part 4: Wrapper induction in DIADEM
      Induced Wrapper (partial):
        doc('wwagency.co.uk')//select#sale_type_id/{0/}//button.formbtn/{click /}
          (//div.pagenumlinks[last()]//a[last()]/{click /})*
          //div.proplist_wrap:<RECORD>
            [.//span.prop_price:<PRICE=string(.)>]
            [.//ul.prop_keypoints/li[2]/strong:<BEDROOM_ROOMS=string(.)>]
            [.//div.prop_statuses//text():<PROPERTY_STATUS=string(.)>]
            [.//strong.orange:<POSTCODE=string(.)>]
          //div.prop_img/a/{click /}//body
            [.//div#propertypage_copy/p[last()-1]:<DESCRIPTION=string(.)>]
            [.//div#print_contact/address/text()[2]:<ADDRESS=string(.)>]
            [.//a.~Map view)]/@href:<MAP=string(.)>]
            [.//div#propertypage_copy/p[2]:<RECEPTION_ROOMS=string(.)>]
  43. Part 4: DEQA
      Question Answering on the Deep Web
      presented @ International Semantic Web Conference 2012 (ISWC'12)
  44. Question: House near a Kindergarden under 2,000,000 £?
      Answer: White_Road, with dd:bedrooms 15, dd:hasPrice 1,499,950 £, dbp:near Kindergarden_A
      [Diagram labels: OXPath, TBSL, Linking-Metric, Domain Specific Triple Store; triples shown: White_Road rdf:type gr:Offering, dd:hasPrice 1,499,950 £, dbp:near Kindergarden_A; also Kindergarden_B]
  45. Part 4: OXPath » DEQA: Question Answering on the Deep Web
      RDF Wrapper (partial):
        doc('wwagency.co.uk')........
          //div.proplist_wrap:<gr:Offering>
            [.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>].....
            [.//strong.orange:<vcard:postal-code=string(.)>]....
            [.//div.prop_img/a/@href:<foaf:page=string(.)>]
          //div.prop_img/a/{click /}//body
            [.//div#propertypage_copy/p[last()-1]:<gr:description=string(.)>]
            [.//a.~Map view)]/@href:<wgs84:lat=extractLat(.)>]
            [.//a.~Map view)]/@href:<wgs84:long=extractLong(.)>]
  46. Part 4: OXPath » DEQA: Question Answering on the Deep Web
      Question translation to SPARQL
      Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire
      Excerpt shown on the slide (clipped at the edges, marked […]):
        […] FILTER(regex(?y1,'Abingdon','i')) . }
        In that case, TBSL first performs a restriction by class ("House"), then it fi[…] the town name "Abingdon" from the street address and it performs a filter on […] number of rooms. Note that most QA systems would not be sufficiently powerful to include such filters.
        Another example is "Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire", which was translated to the following query:
          SELECT ?x0 WHERE {
            ?x0 <http://dbpedia.org/property/near> ?y2 .
            ?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
            ?v <http://purl.org/goodrelations/v1#includes> ?x0 .
            ?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 .
            ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 .
            ?y2 a <http://linkedgeodata.org/ontology/Supermarket> .
            ?x0 <http://purl.org/goodrelations/v1#description> ?y .
            FILTER(regex(?y0,'Oxfordshire','i')) .
            FILTER(regex(?y,'Edwardian ','i')) .
            FILTER(?y1 < 1000000) .
          }
        In that case, the links to LinkedGeoData were used by selecting the "near" p[…]
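
To show what the three FILTER clauses in the query on this slide do, the Python sketch below applies the same conditions to a few in-memory records. It is a simplified stand-in, not the DEQA pipeline: the sample offers and the name `matches_query` are invented for illustration, and the "near a supermarket" join is left out.

```python
import re

# Hypothetical offers; addresses, descriptions and prices are made up.
offers = [
    {"address": "1 Example Road, Oxfordshire", "description": "Edwardian house with garden", "price": 850000.0},
    {"address": "2 Sample Lane, Oxfordshire", "description": "Modern flat", "price": 400000.0},
    {"address": "3 Test Street, Oxfordshire", "description": "Edwardian villa", "price": 1200000.0},
    {"address": "4 Demo Close, Berkshire", "description": "Edwardian cottage", "price": 600000.0},
]

def matches_query(offer: dict) -> bool:
    """Apply the three FILTERs: two case-insensitive regex tests and a price bound."""
    in_region = re.search("Oxfordshire", offer["address"], re.IGNORECASE) is not None
    # The slide's pattern has a trailing space ('Edwardian '), kept here as-is.
    is_edwardian = re.search("Edwardian ", offer["description"], re.IGNORECASE) is not None
    return in_region and is_edwardian and offer["price"] < 1000000

hits = [o["address"] for o in offers if matches_query(o)]
```

In the real setting these conditions run inside the triple store over extracted RDF data; the sketch only illustrates the filter semantics.
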
  47. Part 5: Hands-on
  48. Part 5: OXPath Engine
      Version 1.1 available on http://oxpath.org (via code.google)
        - Java
        - Maven archetype and Command Line Interface with examples
        - Output in XML, RDF and relational DB; custom output handler
      Based on HtmlUnit
        - some limitations (e.g., no style axis)
      Ongoing work
        - WebDriver-based implementation; JavaScript in the near future
        - Visual interface (record-and-play) as Firefox extension
      Any feedback is welcome! Get in touch with me.
  49. Live Demo
  50. Questions? oxpath.org
