Effective Web Scraping with OXPath
http://oxpath.org
DIADEM: domain-centric intelligent automated data extraction methodology
Giovanni Grasso - Oxford University
May 15th, 2013 @ WWW developer track
joint work with Tim Furche, Christian Schallhart,
Wednesday, 15 May 13
1 OXPath » Lingua Franca for Web Extraction
A Call for Action in Web Extraction!
Past: Form Filling + HTML Patterns
Now: Interaction + DOM Patterns
getting to the data requires interaction, not just form filling
identifying relevant data from rendered DOMs across several pages
access to all CSS properties (computed style)
The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a predicate (title, source) represent attributes to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
To limit the range of the Kleene star, one can specify upper and
lower bounds on the multiplicity, e.g., (...)*{3,8}.
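The bounded Kleene star can be read as a loop that applies the iterated step between a lower and an upper number of times. The following Python sketch illustrates that reading only; it is not OXPath's evaluation algorithm. The names `bounded_star` and `NEXT_PAGE` and the page identifiers are hypothetical, and the paginated site is simulated as a dictionary.

```python
from typing import Callable, Iterator, Optional


def bounded_star(start: str,
                 step: Callable[[str], Optional[str]],
                 lower: int = 0,
                 upper: Optional[int] = None) -> Iterator[str]:
    """Yield every context reached after k applications of `step`,
    for lower <= k <= upper (upper=None means no upper bound)."""
    context: Optional[str] = start
    k = 0
    while context is not None and (upper is None or k <= upper):
        if k >= lower:
            yield context
        context = step(context)  # e.g. "click the Next link"
        k += 1


# Hypothetical site: each page names its "Next" page, or None on the last page.
NEXT_PAGE = {"p0": "p1", "p1": "p2", "p2": "p3", "p3": "p4", "p4": None}

all_pages = list(bounded_star("p0", NEXT_PAGE.get))                    # (...)*
some_pages = list(bounded_star("p0", NEXT_PAGE.get, lower=1, upper=3))  # (...)*{1,3}
```

Without an upper bound the loop ends only when no further step is possible, so a cyclic link structure would need either a bound or a record of visited pages.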
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
/descendant::input[@type='checkbox'][2]/{uncheck}/following::field()[1]/{unc
//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
[//div.property-info//a/{click/}//div.home-description:<info=(.)>
Wrapper Babel
Wrapper induction & data extraction systems
each invents its own wrapper language
or uses its own ad-hoc tool or proprietary language
Mainly pattern matching + imperative navigation
mix of XPath & external flow control
limited interaction with complex interfaces
(simple) form filling & submit
focus on automation via visual interfaces
limited extraction language
no multiway navigation
Why OXPath?
an XPath for data extraction
simplicity
learnable
familiarity
scalability
2 OXPath » The Language
OXPath = XPath + 4
action, extraction, iteration, style
Start at kayak.co.uk:
doc("kayak.co.uk")
To select an airport, type a few letters and select from the completion list
//field().destination/{"Sea" /}
//div#smartbox//li[1]/{click /}
This will submit the form
Refine the results by unchecking the “2+ stops”:
//*#stops2/{uncheck }
On all result pages
/(//a[.='Next']/{click /})*
and for each flight
//body.resultrow:<flight>
Extract the attributes
Mouseover the ! to extract flight quality warnings
//span.qualityWarningIcon/{mouseover /}
Click on the details to extract layovers
➊ Actions: Browser Interaction
Actions correspond to DOM events, e.g.,
Document: doc("google.com")
Click: {click}
Fill: {"Rio"}
Mouseover: {mouseover}
Executed once on each context node
Return the context nodes for contextual actions, or the root nodes of the new DOM for absolute actions such as {click/}
➋ Extraction: Compact Tree Construction
Extraction markers select nodes for extraction
record markers: :<flight>
attribute markers: :<price=string(.)>
Extracted data has tree shape
nesting of extraction markers in the OXPath expression defines the nesting of records and the attribute-record associations in the output
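The tree-shaped output can be pictured as attribute matches being attached to the record match that encloses them. The sketch below assumes a simplified, hypothetical representation: each match is a tuple of marker name, index of the enclosing record match (or `None` at top level), and extracted value. The function `build_tree` and the sample values are illustrative and are not taken from the OXPath engine.

```python
from typing import Any, Dict, List, Optional, Tuple

# (marker name, index of the enclosing record match or None, extracted value or None)
Match = Tuple[str, Optional[int], Optional[str]]


def build_tree(matches: List[Match]) -> List[Dict[str, Any]]:
    """Nest each match under its enclosing record; return the top-level records."""
    nodes = [{"name": name, "value": value, "children": []}
             for name, _, value in matches]
    roots: List[Dict[str, Any]] = []
    for node, (_, parent, _) in zip(nodes, matches):
        if parent is None:
            roots.append(node)                        # record marker at top level
        else:
            nodes[parent]["children"].append(node)    # attribute of that record
    return roots


# Hypothetical matches for :<flight> records with :<price=string(.)> attributes.
matches: List[Match] = [
    ("flight", None, None),
    ("price", 0, "420 GBP"),
    ("flight", None, None),
    ("price", 2, "515 GBP"),
]
tree = build_tree(matches)
```

Each `flight` record ends up with its own `price` child, mirroring how a marker inside a predicate becomes an attribute of the last marker outside it.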
➌ Iteration: Kleene Star
Most web sites use pagination techniques for results
traversing paginated results requires iteration
⇢ extraction from any unbounded component of a link graph
Kleene star with an action in the iterated expression
/(//a[.='Next']/{click /})*
/(//body/{scroll /})* (infinite scroll)
OXPath's evaluation algorithm buffers in practice only a constant number of pages
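One way to see why a small page buffer can suffice is to stream results page by page and drop each page before loading the next. The Python generator below sketches that idea under simplifying assumptions; it is not OXPath's actual algorithm. `extract_paginated`, `SITE`, and the callback parameters are hypothetical names, and the site is simulated in memory.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Page = Dict[str, Any]


def extract_paginated(first_page: str,
                      load: Callable[[str], Page],
                      extract: Callable[[Page], Iterable[str]],
                      next_page: Callable[[Page], Optional[str]]) -> Iterator[str]:
    """Yield results page by page; only the current page is referenced at any time."""
    page_id: Optional[str] = first_page
    while page_id is not None:
        page = load(page_id)          # replaces the reference to the previous page
        yield from extract(page)      # results are handed on immediately
        page_id = next_page(page)     # follow the "Next" link, if any


# Hypothetical two-page result listing.
SITE: Dict[str, Page] = {
    "p1": {"rows": ["flight A", "flight B"], "next": "p2"},
    "p2": {"rows": ["flight C"], "next": None},
}

results = list(extract_paginated("p1",
                                 SITE.__getitem__,
                                 lambda page: page["rows"],
                                 lambda page: page["next"]))
```

Because the generator never accumulates pages, memory use depends on the size of one page and not on how many pages are traversed.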
➍ Style: Querying Visual Attributes
Access to all computed style CSS properties via style axis
Visibility: style::display or style::visibility
Font size: style::font-size
Geometry: style::top, style::left, ...
Color: style::color or style::background-color
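Querying computed style is useful mainly for filtering on what a user would actually see. The sketch below mimics such a filter in Python over hand-written style dictionaries. The function `is_rendered` and the sample nodes are hypothetical, and a real browser would also hide descendants of a `display: none` ancestor, which this simplification ignores.

```python
from typing import Dict


def is_rendered(style: Dict[str, str]) -> bool:
    """True if the node's own computed style does not hide it."""
    return (style.get("display", "inline") != "none"
            and style.get("visibility", "visible") == "visible")


# Hypothetical nodes with hand-written computed styles.
nodes = [
    {"text": "Price: 420",
     "style": {"display": "block", "visibility": "visible", "font-size": "18px"}},
    {"text": "Old price: 500",
     "style": {"display": "none", "visibility": "visible", "font-size": "12px"}},
    {"text": "Tracking pixel",
     "style": {"display": "inline", "visibility": "hidden", "font-size": "12px"}},
]

visible_texts = [node["text"] for node in nodes if is_rendered(node["style"])]
```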
3 Evaluation
Constant Memory
[Figure (b), Millions of results: memory [MB], visited pages [1000] and extracted matches [100,000] plotted against time [h]]
100,000+ pages, millions of results
[Pie chart with slices of 2%, 13%, and 85%; legend: page rendering, browser initialization, OXPath]
it’s the browser
[Figure (b), Norm. evaluation time, <150 pages: time without page loading [sec] against #pages for OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, and Chickenfoot]
faster
[Figure (c), Norm. evaluation time, <850 pages: time without page loading [sec] against number of pages for OXPath, Lixto, Web Harvest, and Chickenfoot]
even faster
[Figure, memory: memory [MB] against #pages for OXPath, Lixto, Web Harvest, and Chickenfoot]
only hundreds of pages, as other tools fail for more pages
OXPath » System & Evaluation
Evaluation
constant memory
very low overhead on XPath
minimal page buffer
browser bound
fast
4 OXPath User stories
DIADEM
Unsupervised Domain-specific Web Objects Extraction
presented @ World Wide Web 2012 (WWW’12)
DIADEM := domain-centric intelligent automated data extraction methodology
[Pipeline diagram, partially recoverable labels: single entity (details) pages; tables; 2: object identification & alignment; 3: context-driven block analysis (flat text, energy performance chart, maps, floor plans); 4: OXPath wrapper, cloud extraction, data integration]
Wrapper induction in DIADEM
Induced Wrapper (partial)
doc('wwagency.co.uk')//select#sale_type_id/{0/}
//button.formbtn/{click /}
(//div.pagenumlinks[last()]//a[last()]/{click /})*
//div.proplist_wrap:<RECORD>
[.//span.prop_price:<PRICE=string(.)>]
[.//ul.prop_keypoints/li[2]/strong:<BEDROOM_ROOMS=string(.)>]
[.//div.prop_statuses//text():<PROPERTY_STATUS=string(.)>]
[.//strong.orange:<POSTCODE=string(.)>]
//div.prop_img/a/{click /}//body
[.//div#propertypage_copy/p[last()-1]:<DESCRIPTION=string(.)>]
[.//div#print_contact/address/text()[2]:<ADDRESS=string(.)>]
[.//a.~'Map view')]/@href:<MAP=string(.)>]
[.//div#propertypage_copy/p[2]:<RECEPTION_ROOMS=string(.)>]
DEQA
Question Answering on the Deep Web
presented @ International Semantic Web Conference 2012 (ISWC’12)
[Diagram, recoverable labels: OXPath wrappers feed a domain-specific triple store; the question “House near a Kindergarden under 2,000,000 £?” is processed by TBSL; the answer is White_Road (rdf:type gr:Offering, dd:hasPrice 1,499,950 £, dd:bedrooms 15, dbp:near Kindergarden_A); further labels: Kindergarden_B, Linking-Metric]
OXPath » DEQA: Question Answering on the Deep Web
RDF Wrapper (partial)
doc('wwagency.co.uk')....
....
//div.proplist_wrap:<gr:Offering>
[.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>]
.....
[.//strong.orange:<vcard:postal-code=string(.)>]
....
[.//div.prop_img/a/@href:<foaf:page=string(.)>]
//div.prop_img/a/{click /}//body
[.//div#propertypage_copy/p[last()-1]:<gr:description=string(.)>]
[.//a.~'Map view')]/@href:<wgs84:lat=extractLat(.)>]
[.//a.~'Map view')]/@href:<wgs84:long=extractLong(.)>]
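To make the effect of RDF-typed extraction markers more concrete, the sketch below serializes one extracted record as N-Triples lines in plain Python. It is an illustration under assumptions, not the OXPath RDF output handler: the function names, the subject IRI `http://example.org/offer/1`, and the sample values are hypothetical, and the mapping of the `dd` prefix to the DIADEM real-estate ontology is assumed from the SPARQL query shown in this deck.

```python
from typing import Dict, List, Optional, Tuple

PREFIXES = {
    "gr": "http://purl.org/goodrelations/v1#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "dd": "http://diadem.cs.ox.ac.uk/ontologies/real-estate#",  # assumed mapping
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def expand(curie: str) -> str:
    """Turn a prefixed name such as gr:Offering into a bracketed IRI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


def to_ntriples(subject: str, rdf_class: str,
                attributes: Dict[str, Tuple[str, Optional[str]]]) -> List[str]:
    """One rdf:type triple for the record, then one literal triple per attribute."""
    lines = [f"<{subject}> {RDF_TYPE} {expand(rdf_class)} ."]
    for predicate, (value, datatype) in attributes.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        literal = f'"{escaped}"'
        if datatype is not None:
            literal += f"^^{expand(datatype)}"
        lines.append(f"<{subject}> {expand(predicate)} {literal} .")
    return lines


# Hypothetical record, as a :<gr:Offering> marker with two attributes might produce.
triples = to_ntriples(
    "http://example.org/offer/1",
    "gr:Offering",
    {
        "dd:hasPrice": ("1499950", "xsd:double"),
        "vcard:postal-code": ("OX1 1AA", None),
    },
)
```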
Question translation to SPARQL
Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire
FILTER(regex(?y1,'Abingdon','i')) .
}
In that case, TBSL first performs a restriction by class (“House”), then it fi[…] the town name “Abingdon” from the street address and it performs a filter on […] number of rooms. Note that most QA systems would not be sufficiently pow[…] to include such filters.
Another example is “Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire”, which was translated to the following query:
SELECT ?x0 WHERE {
  ?x0 <http://dbpedia.org/property/near> ?y2 .
  ?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
  ?v <http://purl.org/goodrelations/v1#includes> ?x0 .
  ?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 .
  ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 .
  ?y2 a <http://linkedgeodata.org/ontology/Supermarket> .
  ?x0 <http://purl.org/goodrelations/v1#description> ?y .
  FILTER(regex(?y0,'Oxfordshire','i')) .
  FILTER(regex(?y,'Edwardian ','i')) .
  FILTER(?y1 < 1000000) .
}
In that case, the links to LinkedGeoData were used by selecting the “near” p[…]
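The three FILTER clauses of the query can be mirrored in plain Python to show what they select. This sketch covers only the filters (case-insensitive regular expressions on address and description, plus the price bound) and leaves out the graph patterns, such as the “near a supermarket” join. The record layout, field names, and sample offers are hypothetical.

```python
import re
from typing import Any, Dict, List


def matches_question(offer: Dict[str, Any]) -> bool:
    """Apply the query's three FILTER conditions to one offer record."""
    in_oxfordshire = re.search("Oxfordshire", offer["street_address"], re.IGNORECASE)
    is_edwardian = re.search("Edwardian ", offer["description"], re.IGNORECASE)
    return (in_oxfordshire is not None
            and is_edwardian is not None
            and offer["price"] < 1_000_000)


# Hypothetical offers, as they might sit in the triple store after extraction.
offers: List[Dict[str, Any]] = [
    {"street_address": "1 High Street, Abingdon, Oxfordshire",
     "description": "Charming edwardian house", "price": 950_000},
    {"street_address": "2 Mill Lane, Reading, Berkshire",
     "description": "Edwardian terrace", "price": 700_000},
    {"street_address": "3 Park Road, Witney, Oxfordshire",
     "description": "Edwardian villa", "price": 1_250_000},
]

answers = [offer for offer in offers if matches_question(offer)]
```

The `'i'` flag of the SPARQL `regex` function corresponds to `re.IGNORECASE` here, which is why the lowercase “edwardian” in the first description still matches.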
5 Hands-on
OXPath Engine
Version 1.1 available on http://oxpath.org (via code.google)
Java
Maven archetype and command-line interface with examples
Output in XML, RDF and relational DB; custom output handlers
Based on HtmlUnit
some limitations (e.g., no style axis)
Ongoing work
WebDriver-based implementation, JavaScript in the near future
Visual interface (record-and-play) as Firefox extension
Any feedback is welcome! Get in touch with me
Live Demo
Questions?
oxpath.org

More Related Content

Similar to Effective Web Scraping with OXPath

Linked Data in Learning Analytics Tools
Linked Data in Learning Analytics ToolsLinked Data in Learning Analytics Tools
Linked Data in Learning Analytics ToolsMathieu d'Aquin
 
Dax Declarative Api For Xml
Dax   Declarative Api For XmlDax   Declarative Api For Xml
Dax Declarative Api For XmlLars Trieloff
 
Semantic Result Formats: Automatically Transforming Structured Data into usef...
Semantic Result Formats: Automatically Transforming Structured Data into usef...Semantic Result Formats: Automatically Transforming Structured Data into usef...
Semantic Result Formats: Automatically Transforming Structured Data into usef...Hans-Joerg Happel
 
TI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesTI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesEelco Visser
 
Introduction to Prototype JS Framework
Introduction to Prototype JS FrameworkIntroduction to Prototype JS Framework
Introduction to Prototype JS FrameworkMohd Imran
 
Andriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tipsAndriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tipsOWASP Kyiv
 
XPages Extension Library - Create an app in 1 hour (almost)
XPages Extension Library - Create an app in 1 hour (almost)XPages Extension Library - Create an app in 1 hour (almost)
XPages Extension Library - Create an app in 1 hour (almost)Per Henrik Lausten
 
InduSoft SCADA Best Practices
InduSoft SCADA Best PracticesInduSoft SCADA Best Practices
InduSoft SCADA Best PracticesAVEVA
 
XML-Free Programming : Java Server and Client Development without &lt;>
XML-Free Programming : Java Server and Client Development without &lt;>XML-Free Programming : Java Server and Client Development without &lt;>
XML-Free Programming : Java Server and Client Development without &lt;>Arun Gupta
 
Elasticsearch's aggregations &amp; esctl in action or how i built a cli tool...
Elasticsearch's aggregations &amp; esctl in action  or how i built a cli tool...Elasticsearch's aggregations &amp; esctl in action  or how i built a cli tool...
Elasticsearch's aggregations &amp; esctl in action or how i built a cli tool...FaithWestdorp
 
Making Pretty Charts in Splunk
Making Pretty Charts in SplunkMaking Pretty Charts in Splunk
Making Pretty Charts in SplunkSplunk
 
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of TonguesChoose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of TonguesCHOOSE
 
Build Your Own Tools
Build Your Own ToolsBuild Your Own Tools
Build Your Own ToolsShugo Maeda
 
The GO Language : From Beginners to Gophers
The GO Language : From Beginners to GophersThe GO Language : From Beginners to Gophers
The GO Language : From Beginners to GophersAlessandro Sanino
 

Similar to Effective Web Scraping with OXPath (20)

Linked Data in Learning Analytics Tools
Linked Data in Learning Analytics ToolsLinked Data in Learning Analytics Tools
Linked Data in Learning Analytics Tools
 
Dax Declarative Api For Xml
Dax   Declarative Api For XmlDax   Declarative Api For Xml
Dax Declarative Api For Xml
 
treeview
treeviewtreeview
treeview
 
treeview
treeviewtreeview
treeview
 
Semantic Result Formats: Automatically Transforming Structured Data into usef...
Semantic Result Formats: Automatically Transforming Structured Data into usef...Semantic Result Formats: Automatically Transforming Structured Data into usef...
Semantic Result Formats: Automatically Transforming Structured Data into usef...
 
TI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesTI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific Languages
 
Introduction to Prototype JS Framework
Introduction to Prototype JS FrameworkIntroduction to Prototype JS Framework
Introduction to Prototype JS Framework
 
Andriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tipsAndriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tips
 
XPages Extension Library - Create an app in 1 hour (almost)
XPages Extension Library - Create an app in 1 hour (almost)XPages Extension Library - Create an app in 1 hour (almost)
XPages Extension Library - Create an app in 1 hour (almost)
 
InduSoft SCADA Best Practices
InduSoft SCADA Best PracticesInduSoft SCADA Best Practices
InduSoft SCADA Best Practices
 
Osmit2009 Mapnik
Osmit2009 MapnikOsmit2009 Mapnik
Osmit2009 Mapnik
 
XML-Free Programming : Java Server and Client Development without &lt;>
XML-Free Programming : Java Server and Client Development without &lt;>XML-Free Programming : Java Server and Client Development without &lt;>
XML-Free Programming : Java Server and Client Development without &lt;>
 
"If I knew then what I know now"
"If I knew then what I know now""If I knew then what I know now"
"If I knew then what I know now"
 
Html 5 and css 3
Html 5 and css 3Html 5 and css 3
Html 5 and css 3
 
Elasticsearch's aggregations &amp; esctl in action or how i built a cli tool...
Elasticsearch's aggregations &amp; esctl in action  or how i built a cli tool...Elasticsearch's aggregations &amp; esctl in action  or how i built a cli tool...
Elasticsearch's aggregations &amp; esctl in action or how i built a cli tool...
 
Making Pretty Charts in Splunk
Making Pretty Charts in SplunkMaking Pretty Charts in Splunk
Making Pretty Charts in Splunk
 
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of TonguesChoose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
 
Build Your Own Tools
Build Your Own ToolsBuild Your Own Tools
Build Your Own Tools
 
The GO Language : From Beginners to Gophers
The GO Language : From Beginners to GophersThe GO Language : From Beginners to Gophers
The GO Language : From Beginners to Gophers
 
Dartprogramming
DartprogrammingDartprogramming
Dartprogramming
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Effective Web Scraping with OXPath

  • 1. DIADEM domain-centric intelligent automated data extraction methodology Effective Web Scraping with http://oxpath.org Giovanni Grasso - Oxford University May 15th, 2013 @ WWW developer track joint work with Tim Furche, Christian Schallhart, Wednesday, 15 May 13
  • 2. OXPath » Lingua Franca for Web Extraction 1 A Call for Action in Web Extraction! Past: Form Filling + HTML Patterns Now: Interaction + DOM Patterns getting to the data requires interaction not just form filling identifying relevant data from rendered DOMs across several pages access to all CSS properties (computed style) 2 Wednesday, 15 May 13
  • 3. 3 The nesting in the result mirrors the structure of the OX pression: extraction markers in a predicate (title, source sent attributes to the last marker outside the predicate (sto Kleene Star. Finally, we add the Kleene star, as in [1 example, the following expression queries Google for “O traverses all accessible result pages and extracts all links. doc("google.com")/descendant::field()[1]/{"Oxford"} /following::field()[1]/{click /} /( /descendant::a:<Link=(@href)>[.#="Next"]/{cli To limit the range of the Kleene star, one can specify up lower bounds on the multiplicity, e.g., (...)*{3,8}. doc("zillow.com")/descendant::field()[1]/{"Seattle"}/fo /descendant::input[@type=’checkbox’][2]/{uncheck}/follo Wednesday, 15 May 13
  • 4. 3 Seattle The nesting in the result mirrors the structure of the OX pression: extraction markers in a predicate (title, source sent attributes to the last marker outside the predicate (sto Kleene Star. Finally, we add the Kleene star, as in [1 example, the following expression queries Google for “O traverses all accessible result pages and extracts all links. doc("google.com")/descendant::field()[1]/{"Oxford"} /following::field()[1]/{click /} /( /descendant::a:<Link=(@href)>[.#="Next"]/{cli To limit the range of the Kleene star, one can specify up lower bounds on the multiplicity, e.g., (...)*{3,8}. doc("zillow.com")/descendant::field()[1]/{"Seattle"}/fo /descendant::input[@type=’checkbox’][2]/{uncheck}/follo Wednesday, 15 May 13
  • 6. 4 The nesting in the result mirrors the structure of the OXPath ex- pression: extraction markers in a predicate (title, source) repre- sent attributes to the last marker outside the predicate (story). Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links. doc("google.com")/descendant::field()[1]/{"Oxford"} /following::field()[1]/{click /} /( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )* To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}. doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{ /descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc //div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters" //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/} /(//span.arrowNext/a/{click/})*//ul#search-results/li:<property> [//div.property-info//a/{click/}//div.home-description:<info=(.)> z Wednesday, 15 May 13
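The bounded Kleene star (...)*{3,8} can be pictured as "apply the iterated step at least 3 and at most 8 times". The Python sketch below illustrates that reading on a toy chain of result pages. The `PAGES` data, the function names, and the exact interpretation of the bounds (only pages reached after between `lo` and `hi` applications contribute results) are assumptions made for illustration, not OXPath's actual implementation.

```python
from typing import Dict, Iterator, List, Optional

# Toy "site" (hypothetical data): six result pages chained by a Next link.
PAGES: Dict[str, dict] = {
    f"p{i}": {
        "links": [f"link{i}a", f"link{i}b"],
        "next": f"p{i + 1}" if i < 5 else None,
    }
    for i in range(6)
}


def bounded_star(start: str, lo: int = 0, hi: Optional[int] = None) -> Iterator[str]:
    """Yield pages reached after k applications of 'click Next', lo <= k <= hi."""
    page: Optional[str] = start
    k = 0  # number of times the iterated step has been applied so far
    while page is not None and (hi is None or k <= hi):
        if k >= lo:
            yield page
        page = PAGES[page]["next"]  # the iterated step: follow the Next link
        k += 1


def extract_links(start: str, lo: int = 0, hi: Optional[int] = None) -> List[str]:
    """Collect all links from the pages selected by the bounded star."""
    return [link for p in bounded_star(start, lo, hi) for link in PAGES[p]["links"]]
```

With no bounds the star visits every reachable page, including the start page (zero iterations); with bounds it simply narrows the window of iterations that contribute results.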
  • 9. OXPath » Lingua Franca for Web Extraction: Wrapper Babel. Wrapper induction and data extraction systems each invent their own wrapper language or use their own ad-hoc tool or proprietary language. Mainly pattern matching + imperative navigation: a mix of XPath and external flow control. Limited interaction with complex interfaces: (simple) form filling and submit; focus on automation via visual interfaces. Limited extraction language; no multiway navigation.
  • 10. OXPath » Lingua Franca for Web Extraction: Why OXPath? An XPath for data extraction: simplicity, learnable, familiarity, scalability.
  • 11. OXPath » The Language: OXPath = XPath + 4: action, iteration, extraction, style.
  • 13. Start at kayak.co.uk:
      doc("kayak.co.uk")
    To select an airport, type a few letters and select from the completion list:
      //field().destination/{"Sea" /}
      //div#smartbox//li[1]/{click /}
    This will submit the form.
  • 17. Refine the results by unchecking the “2+ stops”:
      //*#stops2/{uncheck }
    On all result pages
      /(//a[.='Next']/{click /})*
    and for each flight
      //body.resultrow:<flight>
  • 20. Extract the attributes. Mouseover the ! to extract flight quality warnings:
      //span.qualityWarningIcon/{mouseover /}
    Click on the details to extract layovers.
  • 21. OXPath » The Language: ➊ Actions: Browser Interaction. Actions correspond to DOM events, e.g., Document: doc("google.com"); Click: {click}; Fill: {"Rio"}; Mouseover: {mouseover}. Executed once on each context node. Return context nodes for contextual actions, or root nodes of the new DOM for absolute actions such as {click/}.
  • 22. OXPath » The Language: ➋ Extraction: Compact Tree Construction. Extraction markers select nodes for extraction: record markers, e.g. :<flight>, and attribute markers, e.g. :<price=string(.)>. Extracted data has tree shape: the nesting of extraction markers in the OXPath expression defines the nesting of records and the attribute-record associations in the output.
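To make the tree-shaped output concrete, here is a small Python sketch that turns a flat list of marker matches into nested records and prints them as XML. The match tuples, the `build_tree` and `to_xml` helpers, and the element layout are hypothetical; they only illustrate how marker nesting could map to record nesting, and are not the format the OXPath engine actually emits.

```python
from typing import List, Optional, Tuple

# (marker name, extracted value or None for a record marker, index of parent match)
Match = Tuple[str, Optional[str], Optional[int]]


def build_tree(matches: List[Match]) -> List[dict]:
    """Attach every match to its parent match; matches without parent are roots."""
    nodes = [{"name": n, "value": v, "children": []} for n, v, _ in matches]
    roots: List[dict] = []
    for node, (_, _, parent) in zip(nodes, matches):
        target = roots if parent is None else nodes[parent]["children"]
        target.append(node)
    return roots


def to_xml(node: dict, indent: int = 0) -> str:
    """Serialize a record (no value) or an attribute (with value) as XML text."""
    pad = "  " * indent
    if node["value"] is not None:  # attribute marker such as :<price=string(.)>
        return f"{pad}<{node['name']}>{node['value']}</{node['name']}>"
    inner = "\n".join(to_xml(child, indent + 1) for child in node["children"])
    return f"{pad}<{node['name']}>\n{inner}\n{pad}</{node['name']}>"


# Two :<flight> records; the attribute markers nested in their predicates
# become children of the corresponding record.
MATCHES: List[Match] = [
    ("flight", None, None),
    ("price", "420", 0),
    ("airline", "BA", 0),
    ("flight", None, None),
    ("price", "380", 3),
]
ROOTS = build_tree(MATCHES)
```

Calling `to_xml` on each root yields one `<flight>` element per record, with its attributes nested inside.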
  • 24. OXPath » The Language: ➌ Iteration: Kleene Star. Most web sites use pagination techniques for results; traversing paginated results requires iteration ⇢ extraction from any unbounded component of a link graph. Kleene star with an action in the iterated expression:
      /(//a[.='Next']/{click /})*
      /(//body/{scroll /})*   (infinite scroll)
    OXPath’s evaluation algorithm buffers in practice only a constant number of pages.
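The idea of traversing paginated results while holding only a constant number of pages can be sketched with a Python generator. `FakeBrowser` is a hypothetical stand-in for a real browser, and the one-page buffer is a simplification chosen for this sketch; the actual OXPath evaluation algorithm is not reproduced here.

```python
from typing import Iterator, Optional


class FakeBrowser:
    """Hypothetical stand-in for a browser serving `total` numbered result pages."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.open_pages = 0  # pages currently held in memory
        self.peak = 0        # maximum number of pages held at the same time

    def load(self, number: int) -> dict:
        self.open_pages += 1
        self.peak = max(self.peak, self.open_pages)
        return {
            "number": number,
            "results": [f"result-{number}-{i}" for i in range(3)],
            "has_next": number < self.total,
        }

    def release(self, page: dict) -> None:
        self.open_pages -= 1


def iterate_results(browser: FakeBrowser) -> Iterator[str]:
    """Mimic /(//a[.='Next']/{click /})*: start on page 1, keep clicking Next."""
    page: Optional[dict] = browser.load(1)
    while page is not None:
        yield from page["results"]  # extract before moving on
        has_next, number = page["has_next"], page["number"]
        browser.release(page)       # drop the page once it is fully processed
        page = browser.load(number + 1) if has_next else None
```

Because results are streamed out and each page is released before the next one is loaded, memory use in this sketch does not grow with the number of pages visited.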
  • 25. OXPath » The Language: ➍ Style: Querying Visual Attributes. Access to all computed-style CSS properties via the style axis. Visibility: style::display or style::visibility. Font size: style::font-size. Geometry: style::top, style::left, ... Color: style::color or style::background-color.
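As a rough illustration of what querying computed style enables, the Python sketch below filters toy nodes by visibility and font size. The node dictionaries, property values, and helper names are all hypothetical; in OXPath the same checks would be written with the style axis directly in the expression.

```python
from typing import Dict, List

# Hypothetical rendered nodes: text plus a dictionary of computed CSS properties.
NODES: List[Dict] = [
    {"text": "420 GBP",
     "style": {"display": "block", "visibility": "visible", "font-size": "18px"}},
    {"text": "hidden tracking value",
     "style": {"display": "none", "visibility": "visible", "font-size": "18px"}},
    {"text": "small print",
     "style": {"display": "block", "visibility": "visible", "font-size": "10px"}},
]


def style(node: Dict, prop: str) -> str:
    """Look up one computed property, analogous to style::prop on a node."""
    return node["style"].get(prop, "")


def is_visible(node: Dict) -> bool:
    """A node counts as visible unless display or visibility hides it."""
    return style(node, "display") != "none" and style(node, "visibility") != "hidden"


def font_px(node: Dict) -> float:
    """Parse a pixel font size such as '18px'; other units are treated as 0."""
    value = style(node, "font-size")
    return float(value[:-2]) if value.endswith("px") else 0.0


def prominent_texts(nodes: List[Dict], min_px: float = 16.0) -> List[str]:
    """Keep the text of nodes that are visible and rendered at least min_px large."""
    return [n["text"] for n in nodes if is_visible(n) and font_px(n) >= min_px]
```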
  • 27. Constant Memory. [Plot (b), Millions of results: memory [MB] and #pages [1000] / #results [100,000] against time [h]; series: memory, extracted matches, visited pages.] 100,000+ pages, millions of results.
  • 28. [Pie chart: 2%, 13%, 85%; labels: page rendering, browser initialization, OXPath.] It’s the browser.
  • 29. Faster. [Plot (b), Norm. evaluation time, <150 pages: time w/o page loading [sec] against #pages; series: OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, Chickenfoot.]
  • 30. Even faster. [Plot (c), Norm. evaluation time, <850 pages: time w/o page loading [sec] against number of pages; series: OXPath, Lixto, Web Harvest, Chickenfoot.]
  • 32. Memory. [Plot: memory [MB] against #pages; series: OXPath, Lixto, Web Harvest, Chickenfoot.] Only hundreds of pages, as other tools fail for more pages.
  • 33. OXPath » System & Evaluation: Evaluation. Constant memory; very low overhead on XPath; minimal page buffer; browser bound; fast.
  • 35. DIADEM: Unsupervised Domain-specific Web Objects Extraction, presented @ World Wide Web 2012 (WWW’12).
  • 36. DIADEM := domain-centric intelligent automated data extraction methodology
  • 41. DIADEM := domain-centric intelligent automated data extraction methodology. [Diagram, partially cropped. Numbered labels: 2 Object identification & alignment; 3 Context-driven block analysis; 4 OXPath wrapper, cloud extraction, data integration. Other labels: single entity (details) pages, tables, flat text, energy performance chart, maps, floor plans.]
  • 42. Wrapper induction in DIADEM: Induced Wrapper (partial)
      doc('wwagency.co.uk')//select#sale_type_id/{0/}
      //button.formbtn/{click /}
      (//div.pagenumlinks[last()]//a[last()]/{click /})*
      //div.proplist_wrap:<RECORD>
        [.//span.prop_price:<PRICE=string(.)>]
        [.//ul.prop_keypoints/li[2]/strong:<BEDROOM_ROOMS=string(.)>]
        [.//div.prop_statuses//text():<PROPERTY_STATUS=string(.)>]
        [.//strong.orange:<POSTCODE=string(.)>]
      //div.prop_img/a/{click /}//body
        [.//div#propertypage_copy/p[last()-1]:<DESCRIPTION=string(.)>]
        [.//div#print_contact/address/text()[2]:<ADDRESS=string(.)>]
        [.//a.~'Map view')]/@href:<MAP=string(.)>]
        [.//div#propertypage_copy/p[2]:<RECEPTION_ROOMS=string(.)>]
  • 43. DEQA: Question Answering on the Deep Web, presented @ International Semantic Web Conference 2012 (ISWC’12).
  • 44. [Diagram. Components: OXPath, Linking-Metric, Domain Specific Triple Store, TBSL. Graph labels: White_Road, Kindergarden_A, Kindergarden_B, gr:Offering, rdf:type, dd:hasPrice (1,499,950 £), dd:bedrooms (15), dbp:near.] Question: House near a Kindergarden under 2,000,000 £? Answer: White_Road (dd:bedrooms 15, dd:hasPrice 1,499,950 £, dbp:near Kindergarden_A).
  • 45. OXPath » DEQA: Question Answering on the Deep Web: RDF Wrapper (partial)
      doc('wwagency.co.uk').... ....
      //div.proplist_wrap:<gr:Offering>
        [.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>]
        .....
        [.//strong.orange:<vcard:postal-code=string(.)>]
        ....
        [.//div.prop_img/a/@href:<foaf:page=string(.)>]
      //div.prop_img/a/{click /}//body
        [.//div#propertypage_copy/p[last()-1]:<gr:description=string(.)>]
        [.//a.~'Map view')]/@href:<wgs84:lat=extractLat(.)>]
        [.//a.~'Map view')]/@href:<wgs84:long=extractLong(.)>]
  • 46. OXPath » DEQA: Question Answering on the Deep Web: Question translation to SPARQL. Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire.
      FILTER(regex(?y1,'Abingdon','i')) . }
    In that case, TBSL first performs a restriction by class (“House”), then it fi… the town name “Abingdon” from the street address and it performs a filter on the number of rooms. Note that most QA systems would not be sufficiently powerful to include such filters. Another example is “Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire”, which was translated to the following query:
      SELECT ?x0 WHERE {
        ?x0 <http://dbpedia.org/property/near> ?y2 .
        ?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
        ?v <http://purl.org/goodrelations/v1#includes> ?x0 .
        ?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 .
        ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 .
        ?y2 a <http://linkedgeodata.org/ontology/Supermarket> .
        ?x0 <http://purl.org/goodrelations/v1#description> ?y .
        FILTER(regex(?y0,'Oxfordshire','i')) .
        FILTER(regex(?y,'Edwardian ','i')) .
        FILTER(?y1 < 1000000) .
      }
    In that case, the links to LinkedGeoData were used by selecting the “near” p…
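To show what the constraints in the translated query express, the following Python sketch applies the same four conditions (near a supermarket, address matching "Oxfordshire", description matching "Edwardian ", price below 1,000,000) to a handful of made-up listings. The `HOUSES` records and the `answer` function are hypothetical and stand in for the triple store and SPARQL engine; this is not how DEQA or TBSL evaluates queries.

```python
import re
from typing import Dict, List

# Hypothetical listings standing in for the contents of the triple store.
HOUSES: List[Dict] = [
    {"id": "house1", "address": "1 High St, Abingdon, Oxfordshire",
     "price": 850_000, "description": "Edwardian semi-detached house",
     "near": ["supermarket"]},
    {"id": "house2", "address": "2 Mill Lane, Witney, Oxfordshire",
     "price": 1_200_000, "description": "Edwardian villa",
     "near": ["supermarket"]},
    {"id": "house3", "address": "3 Kings Rd, Reading, Berkshire",
     "price": 600_000, "description": "Edwardian terrace",
     "near": ["supermarket"]},
    {"id": "house4", "address": "4 Park End, Bicester, Oxfordshire",
     "price": 700_000, "description": "Modern flat",
     "near": ["school"]},
]


def matches(house: Dict) -> bool:
    """Apply the query's conditions: near-pattern plus the three FILTER clauses."""
    return (
        "supermarket" in house["near"]                                   # ?x0 near ?y2, ?y2 a Supermarket
        and re.search("Oxfordshire", house["address"], re.I) is not None  # regex(?y0, 'Oxfordshire', 'i')
        and re.search("Edwardian ", house["description"], re.I) is not None  # regex(?y, 'Edwardian ', 'i')
        and house["price"] < 1_000_000                                   # ?y1 < 1000000
    )


def answer(houses: List[Dict]) -> List[str]:
    """Return the ids of all houses satisfying every condition (the ?x0 bindings)."""
    return [h["id"] for h in houses if matches(h)]
```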
  • 48. OXPath Engine. Version 1.1 available on http://oxpath.org (via code.google). Java Maven archetype and Command Line Interface with examples. Output in XML, RDF and relational DB; custom output handler. Based on HTMLUnit, with some limitations (e.g., no style axis). Ongoing work: WebDriver-based implementation, JavaScript in the near future; Visual Interface (record-and-play) as Firefox Extension. Any feedback is welcome! Get in touch with me.