SlideShare a Scribd company logo
1 of 50
Download to read offline
DIADEM domain-centric intelligent automated
data extraction methodology
Effective Web Scraping with
http://oxpath.org
Giovanni Grasso - Oxford University
May 15th, 2013 @ WWW developer track
joint work with Tim Furche, Christian Schallhart,
Wednesday, 15 May 13
OXPath » Lingua Franca for Web Extraction
1
A Call for Action in Web Extraction!
Past: Form Filling + HTML Patterns
Now: Interaction + DOM Patterns
getting to the data requires interaction not just form filling
identifying relevant data from rendered DOMs
across several pages
access to all CSS properties (computed style)
2
Wednesday, 15 May 13
3
The nesting in the result mirrors the structure of the OX
pression: extraction markers in a predicate (title, source
sent attributes to the last marker outside the predicate (sto
Kleene Star. Finally, we add the Kleene star, as in [1
example, the following expression queries Google for “O
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{cli
To limit the range of the Kleene star, one can specify up
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/fo
/descendant::input[@type=’checkbox’][2]/{uncheck}/follo
Wednesday, 15 May 13
3
Seattle
The nesting in the result mirrors the structure of the OX
pression: extraction markers in a predicate (title, source
sent attributes to the last marker outside the predicate (sto
Kleene Star. Finally, we add the Kleene star, as in [1
example, the following expression queries Google for “O
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{cli
To limit the range of the Kleene star, one can specify up
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/fo
/descendant::input[@type=’checkbox’][2]/{uncheck}/follo
Wednesday, 15 May 13
4
Wednesday, 15 May 13
4
The nesting in the result mirrors the structure of the OXPath ex-
pression: extraction markers in a predicate (title, source) repre-
sent attributes to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For
example, the following expression queries Google for “Oxford”,
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
To limit the range of the Kleene star, one can specify upper and
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
/descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc
//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
[//div.property-info//a/{click/}//div.home-description:<info=(.)>
z
Wednesday, 15 May 13
4
The nesting in the result mirrors the structure of the OXPath ex-
pression: extraction markers in a predicate (title, source) repre-
sent attributes to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For
example, the following expression queries Google for “Oxford”,
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
To limit the range of the Kleene star, one can specify upper and
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
/descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc
//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
[//div.property-info//a/{click/}//div.home-description:<info=(.)>
z
Wednesday, 15 May 13
4
The nesting in the result mirrors the structure of the OXPath ex-
pression: extraction markers in a predicate (title, source) repre-
sent attributes to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For
example, the following expression queries Google for “Oxford”,
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
To limit the range of the Kleene star, one can specify upper and
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
/descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc
//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
[//div.property-info//a/{click/}//div.home-description:<info=(.)>
z
Wednesday, 15 May 13
OXPath » Lingua Franca for Web Extraction
1
Wrapper Babel
Wrapper induction & data extraction systems
each invent their own wrapper language
or use its own ad-hoc tool or proprietary language
Mainly pattern matching + imperative navigation
mix of XPath & external flow control
limited interaction with complex interfaces
(simple) form filling & submit
focus on automation via visual interfaces
limited extraction language
no multiway navigation
5
Wednesday, 15 May 13
1 OXPath » Lingua Franca for Web Extraction
Why OXPath?
6
an XPath for
data extraction
simplicity
learnable
familiarity
scalability
Wednesday, 15 May 13
OXPath » The Language
2
OXPath = XPath + 4
7
action iteration
extractionstyle
OXPath
Wednesday, 15 May 13
8
Wednesday, 15 May 13
8
Start at kayak.co.uk:
doc("kayak.co.uk")
To select an airport, type a few letters and select from completion list
//field().destination/{"Sea" /}
//div#smartbox//li[1]/{click /}
This will submit the form
Wednesday, 15 May 13
9
Wednesday, 15 May 13
9
Refine the results by unchecking the “2+ stops”:
//*#stops2/{uncheck }
Wednesday, 15 May 13
9
Refine the results by unchecking the “2+ stops”:
//*#stops2/{uncheck }
On all result pages
/(//a[.=‘Next’]/{click /})*
Wednesday, 15 May 13
9
Refine the results by unchecking the “2+ stops”:
//*#stops2/{uncheck }
On all result pages
/(//a[.=‘Next’]/{click /})*
and for each flight
//body.resultrow:<flight>
Wednesday, 15 May 13
9
Extract the attributes
Wednesday, 15 May 13
9
Extract the attributes
Mouseover the ! to extract flight quality warnings
//span.qualityWarningIcon/{mouseover /}
Wednesday, 15 May 13
9
Extract the attributes
Mouseover the ! to extract flight quality warnings
//span.qualityWarningIcon/{mouseover /}
Click on the details to extract layovers
Wednesday, 15 May 13
OXPath » The Language
2
Actions correspond to DOM events, e.g.,
Executed once on each context node
Return context nodes for contextual actions or
root nodes for new DOM absolute actions {click/}
➊ Actions: Browser Interaction
10
Document
Click
Fill
Mouseover
doc("google.com")
{click}
{“Rio”}
{mouseover}
Wednesday, 15 May 13
OXPath » The Language
2
Extraction marker select nodes for extraction
record markers: :<flight>
attribute markers: :<price=string(.)>
Extracted data has tree shape
nesting of extraction markers in OXPath expression defines
nesting of records and attribute-record associations in the output
➋ Extraction: Compact Tree Construction
11
Wednesday, 15 May 13
Wednesday, 15 May 13
OXPath » The Language
2
Most web sites use pagination techniques for results
traversing paginated results require iteration
⇢ extraction from any unbounded component of a link graph
Kleene Star with action in the iterated expression
OXPath’s evaluation algorithm
buffers in practice only a constant number of pages
➌ Iteration: Kleene Star
13
/(//a[.=’Next’]/{click /})*
/(//body/{scroll /})* ( infinite scroll )
Wednesday, 15 May 13
OXPath » The Language
2
Access to all computed style CSS properties via style axis
➍ Style: Querying Visual Attributes
14
Visibility
Font size
Geometry
Color
style::display or style::visibility
style::font-size
style::top, style::left, ...
style::color or style::background-color
Wednesday, 15 May 13
3
Evaluation
15
Wednesday, 15 May 13
0
50
100
150
200
0 2 4 6 8 10 12
0
20
40
60
80
100
120
140
160
memory[MB]
#pages[1000]/#results[100,000]
time [h]
memory
extracted matches
visited pages
(b) Millions of resultsConstant Memory16
100,000+ pages, millions of results
Wednesday, 15 May 13
17
2%
13%
85%
page rendering browser initialization OXPath
it’s the browser
Wednesday, 15 May 13
0
100
200
300
400
500
600
700
0 20 40 60 80 100 120 140
timew/opageloading[sec]
#pages
OXPath
Web Content Extractor
Lixto
Visual Web Ripper
Web Harvest
Chickenfoot
(b) Norm. evaluation time, <150 p.
faster 18
Wednesday, 15 May 13
0
200
400
600
800
1000
1200
1400
1600
0 100 200 300 400 500 600 700 800
timew/opageloading[sec]
Number of pages
OXPath
Lixto
Web Harvest
Chickenfoot
(c) Norm. evaluation time, <850 p.even faster 19
Wednesday, 15 May 13
0
50
100
150
200
250
300
350
0 100 200 300 400 500 600 700 800
memory[MB]
#pages
OXPath
Lixto
Web Harvest
Chickenfoot
memory 20
Wednesday, 15 May 13
0
50
100
150
200
250
300
350
0 100 200 300 400 500 600 700 800
memory[MB]
#pages
OXPath
Lixto
Web Harvest
Chickenfoot
memory 20
only hundreds of pages as
other tools fail for more pages
Wednesday, 15 May 13
OXPath » System & Evaluation
3
Evaluation
21
constant memory
very low overhead on
XPath
minimal page buffer
browser bound
fast
Wednesday, 15 May 13
4
OXPath
User stories
22
Wednesday, 15 May 13
4
DIADEM
Unsupervised Domain-
specific Web Objects
Extraction
presented @ World Wide Web 2012 (WWW’12)
23
Wednesday, 15 May 13
24
DIADEM data extraction methodology
domain-centric intelligent automated
Wednesday, 15 May 13
25
DIADEM data extraction methodology
domain-centric intelligent automated
:=
Wednesday, 15 May 13
26
DIADEM data extraction methodology
domain-centric intelligent automated
:=
Wednesday, 15 May 13
27
DIADEM data extraction methodology
domain-centric intelligent automated
:=
Wednesday, 15 May 13
28
DIADEM data extraction methodology
domain-centric intelligent automated
:=
ng
FlatText
Context-driven
block analysis
3
Energy Performance Chart
Maps
Floor plans
Wednesday, 15 May 13
OXPath
Wrapper
Cloud extraction
Data integration
4
29
DIADEM data extraction methodology
domain-centric intelligent automated
:=
Single entity (details) pages
Tables
2
Object
dentification & alignment
ges
FlatText
Context-driven
block analysis
3
Energy Performance Chart
Maps
Floor plans
Wednesday, 15 May 13
Wrapper induction in DIADEM
4
30
Induced Wrapper (partial)
doc(‘wwagency.co.uk’)//select#sale_type_id/{0/}
//button.formbtn/{click /}
(//div.pagenumlinks[last()]//a[last()]/{click /})*
//div.proplist_wrap:<RECORD>
[.//span.prop_price:<PRICE=string(.)>]
[.//ul.prop_keypoints/li[2]/strong:<BEDROOM_ROOMS=string(.)>]
[.//div.prop_statuses//text():<PROPERTY_STATUS=string(.)>]
[.//strong.orange:<POSTCODE=string(.)>]
//div.prop_img/a/{click /}//body
[.//div#propertypage_copy/p[last()-1]:<DESCRIPTION=string(.)>]
[.//div#print_contact/address/text()[2]:<ADDRESS=string(.)>]
[.//a.~'Map view')]/@href:<MAP=string(.)>]
[.//div#propertypage_copy/p[2]:<RECEPTION_ROOMS=string(.)>]
Wednesday, 15 May 13
4
DEQA
Question Answering
on the Deep Web
presented @International Semantic Web Conference 2012 (ISWC’12)
31
Wednesday, 15 May 13
32
Kindergarden_B
White_Road
1,499,950 £
gr:Offering
rdf:type
dd:hasPrice
Kindergarden_Adbp:near
Domain Specific
Triple Store
Question:
House near a Kindergarden under 2,000,000 £?
OXPath
OXPath
TBSL
White_Road
Answer:
15
dd:bedrooms
1,499,950 £
dd:hasPrice
dbp:near Kindergarden_A
Linking-Metric
OXPath
Wednesday, 15 May 13
OXPath » DEQA: Question Answering on the Deep Web
4
33
RDF Wrapper (partial)
doc(‘wwagency.co.uk’)....
....
//div.proplist_wrap:<gr:Offering>
[.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>]
.....
[.//strong.orange:<vcard:postal-code=string(.)>]
....
[.//div.prop_img/a/@href:<foaf:page=string(.)>]
//div.prop_img/a/{click /}//body
[.//div#propertypage_copy/
p[last()-1]:<gr:description=string(.)>]
[.//a.~'Map view')]/@href:<wgs84:lat=extractLat(.)>]
[.//a.~'Map view')]/@href:<wgs84:long=extractLong(.)>]
Wednesday, 15 May 13
OXPath » DEQA: Question Answering on the Deep Web
4
34
Question translation to SPARQL
Edwardian houses close to supermarket for less than
1,000,000 in Oxfordshire
6 FILTER(regex(?y1,’Abingdon’,’i’)) .
}
In that case, TBSL first performs a restriction by class (“House”), then it fi
the town name “Abingdon” from the street address and it performs a filter on
number of rooms. Note that most QA systems would not be sufficiently pow
to include such filters.
Another example is “Edwardian houses close to supermarket for less
1,000,000 in Oxfordshire”, which was translated to the following query:
SELECT ?x0 WHERE {
2 ?x0 <http://dbpedia.org/property/near> ?y2 .
?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
4 ?v <http://purl.org/goodrelations/v1#includes> ?x0 .
?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 .
6 ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 .
?y2 a <http://linkedgeodata.org/ontology/Supermarket> .
8 ?x0 <http://purl.org/goodrelations/v1#description> ?y .
FILTER(regex(?y0,’Oxfordshire’,’i’)) .
10 FILTER(regex(?y,’Edwardian ’,’i’)) .
FILTER(?y1 < 1000000) .
12 }
In that case, the links to LinkedGeoData were used by selecting the “near” pWednesday, 15 May 13
5
Hands-on
35
Wednesday, 15 May 13
5
Version 1.1 available on http://oxpath.org (via code.google)
JAVA
Maven archetype and Command Line Interface with examples
Output in XML, RDF and Relational DB, custom output handler
Based on HTMLUnit
some limitations (e.g., no style axis)
Ongoing work
WebDriver-based implementation, Javascript in the next future
Visual Interface (record-and-play) as Firefox Extension
Any feedback is welcome! Get in touch with me
OXPath Engine
36
Wednesday, 15 May 13
Live Demo
37
Wednesday, 15 May 13
Questions?
oxpath.org
38
Wednesday, 15 May 13

More Related Content

Similar to Effective Web Scraping with OXPath

Semantic Result Formats: Automatically Transforming Structured Data into usef...
Semantic Result Formats: Automatically Transforming Structured Data into usef...Semantic Result Formats: Automatically Transforming Structured Data into usef...
Semantic Result Formats: Automatically Transforming Structured Data into usef...
Hans-Joerg Happel
 
TI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesTI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific Languages
Eelco Visser
 

Similar to Effective Web Scraping with OXPath (20)

Linked Data in Learning Analytics Tools
Linked Data in Learning Analytics ToolsLinked Data in Learning Analytics Tools
Linked Data in Learning Analytics Tools
 
Dax Declarative Api For Xml
Dax   Declarative Api For XmlDax   Declarative Api For Xml
Dax Declarative Api For Xml
 
treeview
treeviewtreeview
treeview
 
treeview
treeviewtreeview
treeview
 
Semantic Result Formats: Automatically Transforming Structured Data into usef...
Semantic Result Formats: Automatically Transforming Structured Data into usef...Semantic Result Formats: Automatically Transforming Structured Data into usef...
Semantic Result Formats: Automatically Transforming Structured Data into usef...
 
TI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesTI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific Languages
 
Introduction to Prototype JS Framework
Introduction to Prototype JS FrameworkIntroduction to Prototype JS Framework
Introduction to Prototype JS Framework
 
Andriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tipsAndriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tips
 
XPages Extension Library - Create an app in 1 hour (almost)
XPages Extension Library - Create an app in 1 hour (almost)XPages Extension Library - Create an app in 1 hour (almost)
XPages Extension Library - Create an app in 1 hour (almost)
 
InduSoft SCADA Best Practices
InduSoft SCADA Best PracticesInduSoft SCADA Best Practices
InduSoft SCADA Best Practices
 
Osmit2009 Mapnik
Osmit2009 MapnikOsmit2009 Mapnik
Osmit2009 Mapnik
 
XML-Free Programming : Java Server and Client Development without &lt;>
XML-Free Programming : Java Server and Client Development without &lt;>XML-Free Programming : Java Server and Client Development without &lt;>
XML-Free Programming : Java Server and Client Development without &lt;>
 
"If I knew then what I know now"
"If I knew then what I know now""If I knew then what I know now"
"If I knew then what I know now"
 
Html 5 and css 3
Html 5 and css 3Html 5 and css 3
Html 5 and css 3
 
Elasticsearch's aggregations &amp; esctl in action or how i built a cli tool...
Elasticsearch's aggregations &amp; esctl in action  or how i built a cli tool...Elasticsearch's aggregations &amp; esctl in action  or how i built a cli tool...
Elasticsearch's aggregations &amp; esctl in action or how i built a cli tool...
 
Making Pretty Charts in Splunk
Making Pretty Charts in SplunkMaking Pretty Charts in Splunk
Making Pretty Charts in Splunk
 
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of TonguesChoose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
 
Build Your Own Tools
Build Your Own ToolsBuild Your Own Tools
Build Your Own Tools
 
The GO Language : From Beginners to Gophers
The GO Language : From Beginners to GophersThe GO Language : From Beginners to Gophers
The GO Language : From Beginners to Gophers
 
Dartprogramming
DartprogrammingDartprogramming
Dartprogramming
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

Effective Web Scraping with OXPath

  • 1. DIADEM domain-centric intelligent automated data extraction methodology Effective Web Scraping with http://oxpath.org Giovanni Grasso - Oxford University May 15th, 2013 @ WWW developer track joint work with Tim Furche, Christian Schallhart, Wednesday, 15 May 13
  • 2. OXPath » Lingua Franca for Web Extraction 1 A Call for Action in Web Extraction! Past: Form Filling + HTML Patterns Now: Interaction + DOM Patterns getting to the data requires interaction not just form filling identifying relevant data from rendered DOMs across several pages access to all CSS properties (computed style) 2 Wednesday, 15 May 13
  • 3. 3 The nesting in the result mirrors the structure of the OX pression: extraction markers in a predicate (title, source sent attributes to the last marker outside the predicate (sto Kleene Star. Finally, we add the Kleene star, as in [1 example, the following expression queries Google for “O traverses all accessible result pages and extracts all links. doc("google.com")/descendant::field()[1]/{"Oxford"} /following::field()[1]/{click /} /( /descendant::a:<Link=(@href)>[.#="Next"]/{cli To limit the range of the Kleene star, one can specify up lower bounds on the multiplicity, e.g., (...)*{3,8}. doc("zillow.com")/descendant::field()[1]/{"Seattle"}/fo /descendant::input[@type=’checkbox’][2]/{uncheck}/follo Wednesday, 15 May 13
  • 4. 3 Seattle The nesting in the result mirrors the structure of the OX pression: extraction markers in a predicate (title, source sent attributes to the last marker outside the predicate (sto Kleene Star. Finally, we add the Kleene star, as in [1 example, the following expression queries Google for “O traverses all accessible result pages and extracts all links. doc("google.com")/descendant::field()[1]/{"Oxford"} /following::field()[1]/{click /} /( /descendant::a:<Link=(@href)>[.#="Next"]/{cli To limit the range of the Kleene star, one can specify up lower bounds on the multiplicity, e.g., (...)*{3,8}. doc("zillow.com")/descendant::field()[1]/{"Seattle"}/fo /descendant::input[@type=’checkbox’][2]/{uncheck}/follo Wednesday, 15 May 13
  • 6. 4 The nesting in the result mirrors the structure of the OXPath ex- pression: extraction markers in a predicate (title, source) repre- sent attributes to the last marker outside the predicate (story). Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links. doc("google.com")/descendant::field()[1]/{"Oxford"} /following::field()[1]/{click /} /( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )* To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}. doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{ /descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc //div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters" //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/} /(//span.arrowNext/a/{click/})*//ul#search-results/li:<property> [//div.property-info//a/{click/}//div.home-description:<info=(.)> z Wednesday, 15 May 13
  • 7. 4 The nesting in the result mirrors the structure of the OXPath ex- pression: extraction markers in a predicate (title, source) repre- sent attributes to the last marker outside the predicate (story). Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links. doc("google.com")/descendant::field()[1]/{"Oxford"} /following::field()[1]/{click /} /( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )* To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}. doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{ /descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc //div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters" //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/} /(//span.arrowNext/a/{click/})*//ul#search-results/li:<property> [//div.property-info//a/{click/}//div.home-description:<info=(.)> z Wednesday, 15 May 13
  • 8. 4 The nesting in the result mirrors the structure of the OXPath ex- pression: extraction markers in a predicate (title, source) repre- sent attributes to the last marker outside the predicate (story). Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links. doc("google.com")/descendant::field()[1]/{"Oxford"} /following::field()[1]/{click /} /( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )* To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}. doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{ /descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc //div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters" //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/} /(//span.arrowNext/a/{click/})*//ul#search-results/li:<property> [//div.property-info//a/{click/}//div.home-description:<info=(.)> z Wednesday, 15 May 13
  • 9. OXPath » Lingua Franca for Web Extraction 1 Wrapper Babel Wrapper induction & data extraction systems each invent their own wrapper language or use its own ad-hoc tool or proprietary language Mainly pattern matching + imperative navigation mix of XPath & external flow control limited interaction with complex interfaces (simple) form filling & submit focus on automation via visual interfaces limited extraction language no multiway navigation 5 Wednesday, 15 May 13
  • 10. 1 OXPath » Lingua Franca for Web Extraction Why OXPath? 6 an XPath for data extraction simplicity learnable familiarity scalability Wednesday, 15 May 13
  • 11. OXPath » The Language 2 OXPath = XPath + 4 7 action iteration extractionstyle OXPath Wednesday, 15 May 13
  • 13. 8 Start at kayak.co.uk: doc("kayak.co.uk") To select an airport, type a few letters and select from completion list //field().destination/{"Sea" /} //div#smartbox//li[1]/{click /} This will submit the form Wednesday, 15 May 13
  • 15. 9 Refine the results by unchecking the “2+ stops”: //*#stops2/{uncheck } Wednesday, 15 May 13
  • 16. 9 Refine the results by unchecking the “2+ stops”: //*#stops2/{uncheck } On all result pages /(//a[.=‘Next’]/{click /})* Wednesday, 15 May 13
  • 17. 9 Refine the results by unchecking the “2+ stops”: //*#stops2/{uncheck } On all result pages /(//a[.=‘Next’]/{click /})* and for each flight //body.resultrow:<flight> Wednesday, 15 May 13
  • 19. 9 Extract the attributes Mouseover the ! to extract flight quality warnings //span.qualityWarningIcon/{mouseover /} Wednesday, 15 May 13
  • 20. 9 Extract the attributes Mouseover the ! to extract flight quality warnings //span.qualityWarningIcon/{mouseover /} Click on the details to extract layovers Wednesday, 15 May 13
  • 21. OXPath » The Language 2 Actions correspond to DOM events, e.g., Executed once on each context node Return context nodes for contextual actions or root nodes for new DOM absolute actions {click/} ➊ Actions: Browser Interaction 10 Document Click Fill Mouseover doc("google.com") {click} {“Rio”} {mouseover} Wednesday, 15 May 13
  • 22. OXPath » The Language 2 Extraction marker select nodes for extraction record markers: :<flight> attribute markers: :<price=string(.)> Extracted data has tree shape nesting of extraction markers in OXPath expression defines nesting of records and attribute-record associations in the output ➋ Extraction: Compact Tree Construction 11 Wednesday, 15 May 13
  • 24. OXPath » The Language 2 Most web sites use pagination techniques for results traversing paginated results require iteration ⇢ extraction from any unbounded component of a link graph Kleene Star with action in the iterated expression OXPath’s evaluation algorithm buffers in practice only a constant number of pages ➌ Iteration: Kleene Star 13 /(//a[.=’Next’]/{click /})* /(//body/{scroll /})* ( infinite scroll ) Wednesday, 15 May 13
  • 25. OXPath » The Language 2 Access to all computed style CSS properties via style axis ➍ Style: Querying Visual Attributes 14 Visibility Font size Geometry Color style::display or style::visibility style::font-size style::top, style::left, ... style::color or style::background-color Wednesday, 15 May 13
  • 27. 0 50 100 150 200 0 2 4 6 8 10 12 0 20 40 60 80 100 120 140 160 memory[MB] #pages[1000]/#results[100,000] time [h] memory extracted matches visited pages (b) Millions of resultsConstant Memory16 100,000+ pages, millions of results Wednesday, 15 May 13
  • 28. 17 2% 13% 85% page rendering browser initialization OXPath it’s the browser Wednesday, 15 May 13
  • 29. 0 100 200 300 400 500 600 700 0 20 40 60 80 100 120 140 timew/opageloading[sec] #pages OXPath Web Content Extractor Lixto Visual Web Ripper Web Harvest Chickenfoot (b) Norm. evaluation time, <150 p. faster 18 Wednesday, 15 May 13
  • 30. 0 200 400 600 800 1000 1200 1400 1600 0 100 200 300 400 500 600 700 800 timew/opageloading[sec] Number of pages OXPath Lixto Web Harvest Chickenfoot (c) Norm. evaluation time, <850 p.even faster 19 Wednesday, 15 May 13
  • 31. 0 50 100 150 200 250 300 350 0 100 200 300 400 500 600 700 800 memory[MB] #pages OXPath Lixto Web Harvest Chickenfoot memory 20 Wednesday, 15 May 13
  • 32. 0 50 100 150 200 250 300 350 0 100 200 300 400 500 600 700 800 memory[MB] #pages OXPath Lixto Web Harvest Chickenfoot memory 20 only hundreds of pages as other tools fail for more pages Wednesday, 15 May 13
  • 33. OXPath » System & Evaluation 3 Evaluation 21 constant memory very low overhead on XPath minimal page buffer browser bound fast Wednesday, 15 May 13
  • 35. 4 DIADEM Unsupervised Domain- specific Web Objects Extraction presented @ World Wide Web 2012 (WWW’12) 23 Wednesday, 15 May 13
  • 36. 24 DIADEM data extraction methodology domain-centric intelligent automated Wednesday, 15 May 13
  • 37. 25 DIADEM data extraction methodology domain-centric intelligent automated := Wednesday, 15 May 13
  • 38. 26 DIADEM data extraction methodology domain-centric intelligent automated := Wednesday, 15 May 13
  • 39. 27 DIADEM data extraction methodology domain-centric intelligent automated := Wednesday, 15 May 13
  • 40. 28 DIADEM data extraction methodology domain-centric intelligent automated := ng FlatText Context-driven block analysis 3 Energy Performance Chart Maps Floor plans Wednesday, 15 May 13
  • 41. OXPath Wrapper Cloud extraction Data integration 4 29 DIADEM data extraction methodology domain-centric intelligent automated := Single entity (details) pages Tables 2 Object dentification & alignment ges FlatText Context-driven block analysis 3 Energy Performance Chart Maps Floor plans Wednesday, 15 May 13
  • 42. Wrapper induction in DIADEM 4 30 Induced Wrapper (partial) doc(‘wwagency.co.uk’)//select#sale_type_id/{0/} //button.formbtn/{click /} (//div.pagenumlinks[last()]//a[last()]/{click /})* //div.proplist_wrap:<RECORD> [.//span.prop_price:<PRICE=string(.)>] [.//ul.prop_keypoints/li[2]/strong:<BEDROOM_ROOMS=string(.)>] [.//div.prop_statuses//text():<PROPERTY_STATUS=string(.)>] [.//strong.orange:<POSTCODE=string(.)>] //div.prop_img/a/{click /}//body [.//div#propertypage_copy/p[last()-1]:<DESCRIPTION=string(.)>] [.//div#print_contact/address/text()[2]:<ADDRESS=string(.)>] [.//a.~'Map view')]/@href:<MAP=string(.)>] [.//div#propertypage_copy/p[2]:<RECEPTION_ROOMS=string(.)>] Wednesday, 15 May 13
  • 43. 4 DEQA Question Answering on the Deep Web presented @International Semantic Web Conference 2012 (ISWC’12) 31 Wednesday, 15 May 13
  • 44. 32 Kindergarden_B White_Road 1,499,950 £ gr:Offering rdf:type dd:hasPrice Kindergarden_Adbp:near Domain Specific Triple Store Question: House near a Kindergarden under 2,000,000 £? OXPath OXPath TBSL White_Road Answer: 15 dd:bedrooms 1,499,950 £ dd:hasPrice dbp:near Kindergarden_A Linking-Metric OXPath Wednesday, 15 May 13
  • 45. OXPath » DEQA: Question Answering on the Deep Web 4 33 RDF Wrapper (partial) doc(‘wwagency.co.uk’).... .... //div.proplist_wrap:<gr:Offering> [.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>] ..... [.//strong.orange:<vcard:postal-code=string(.)>] .... [.//div.prop_img/a/@href:<foaf:page=string(.)>] //div.prop_img/a/{click /}//body [.//div#propertypage_copy/ p[last()-1]:<gr:description=string(.)>] [.//a.~'Map view')]/@href:<wgs84:lat=extractLat(.)>] [.//a.~'Map view')]/@href:<wgs84:long=extractLong(.)>] Wednesday, 15 May 13
  • 46. OXPath » DEQA: Question Answering on the Deep Web 4 34 Question translation to SPARQL Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire 6 FILTER(regex(?y1,’Abingdon’,’i’)) . } In that case, TBSL first performs a restriction by class (“House”), then it fi the town name “Abingdon” from the street address and it performs a filter on number of rooms. Note that most QA systems would not be sufficiently pow to include such filters. Another example is “Edwardian houses close to supermarket for less 1,000,000 in Oxfordshire”, which was translated to the following query: SELECT ?x0 WHERE { 2 ?x0 <http://dbpedia.org/property/near> ?y2 . ?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> . 4 ?v <http://purl.org/goodrelations/v1#includes> ?x0 . ?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 . 6 ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 . ?y2 a <http://linkedgeodata.org/ontology/Supermarket> . 8 ?x0 <http://purl.org/goodrelations/v1#description> ?y . FILTER(regex(?y0,’Oxfordshire’,’i’)) . 10 FILTER(regex(?y,’Edwardian ’,’i’)) . FILTER(?y1 < 1000000) . 12 } In that case, the links to LinkedGeoData were used by selecting the “near” pWednesday, 15 May 13
  • 48. 5 Version 1.1 available on http://oxpath.org (via code.google) JAVA Maven archetype and Command Line Interface with examples Output in XML, RDF and Relational DB, custom output handler Based on HTMLUnit some limitations (e.g., no style axis) Ongoing work WebDriver-based implementation, Javascript in the next future Visual Interface (record-and-play) as Firefox Extension Any feedback is welcome! Get in touch with me OXPath Engine 36 Wednesday, 15 May 13