Scanning the Internet for External Cloud Exposures via SSL Certs
Effective Web Scraping with OXPath
1. DIADEM domain-centric intelligent automated
data extraction methodology
Effective Web Scraping with
http://oxpath.org
Giovanni Grasso - Oxford University
May 15th, 2013 @ WWW developer track
joint work with Tim Furche, Christian Schallhart,
Wednesday, 15 May 13
2. OXPath » Lingua Franca for Web Extraction
1
A Call for Action in Web Extraction!
Past: Form Filling + HTML Patterns
Now: Interaction + DOM Patterns
getting to the data requires interaction not just form filling
identifying relevant data from rendered DOMs
across several pages
access to all CSS properties (computed style)
2
Wednesday, 15 May 13
3. 3
The nesting in the result mirrors the structure of the OX
pression: extraction markers in a predicate (title, source
sent attributes to the last marker outside the predicate (sto
Kleene Star. Finally, we add the Kleene star, as in [1
example, the following expression queries Google for “O
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{cli
To limit the range of the Kleene star, one can specify up
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/fo
/descendant::input[@type=’checkbox’][2]/{uncheck}/follo
Wednesday, 15 May 13
4. 3
Seattle
The nesting in the result mirrors the structure of the OX
pression: extraction markers in a predicate (title, source
sent attributes to the last marker outside the predicate (sto
Kleene Star. Finally, we add the Kleene star, as in [1
example, the following expression queries Google for “O
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{cli
To limit the range of the Kleene star, one can specify up
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/fo
/descendant::input[@type=’checkbox’][2]/{uncheck}/follo
Wednesday, 15 May 13
6. 4
The nesting in the result mirrors the structure of the OXPath ex-
pression: extraction markers in a predicate (title, source) repre-
sent attributes to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For
example, the following expression queries Google for “Oxford”,
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
To limit the range of the Kleene star, one can specify upper and
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
/descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc
//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
[//div.property-info//a/{click/}//div.home-description:<info=(.)>
z
Wednesday, 15 May 13
7. 4
The nesting in the result mirrors the structure of the OXPath ex-
pression: extraction markers in a predicate (title, source) repre-
sent attributes to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For
example, the following expression queries Google for “Oxford”,
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
To limit the range of the Kleene star, one can specify upper and
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
/descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc
//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
[//div.property-info//a/{click/}//div.home-description:<info=(.)>
z
Wednesday, 15 May 13
8. 4
The nesting in the result mirrors the structure of the OXPath ex-
pression: extraction markers in a predicate (title, source) repre-
sent attributes to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For
example, the following expression queries Google for “Oxford”,
traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
To limit the range of the Kleene star, one can specify upper and
lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
/descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc
//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
[//div.property-info//a/{click/}//div.home-description:<info=(.)>
z
Wednesday, 15 May 13
9. OXPath » Lingua Franca for Web Extraction
1
Wrapper Babel
Wrapper induction & data extraction systems
each invent their own wrapper language
or use its own ad-hoc tool or proprietary language
Mainly pattern matching + imperative navigation
mix of XPath & external flow control
limited interaction with complex interfaces
(simple) form filling & submit
focus on automation via visual interfaces
limited extraction language
no multiway navigation
5
Wednesday, 15 May 13
10. 1 OXPath » Lingua Franca for Web Extraction
Why OXPath?
6
an XPath for
data extraction
simplicity
learnable
familiarity
scalability
Wednesday, 15 May 13
11. OXPath » The Language
2
OXPath = XPath + 4
7
action iteration
extractionstyle
OXPath
Wednesday, 15 May 13
13. 8
Start at kayak.co.uk:
doc("kayak.co.uk")
To select an airport, type a few letters and select from completion list
//field().destination/{"Sea" /}
//div#smartbox//li[1]/{click /}
This will submit the form
Wednesday, 15 May 13
15. 9
Refine the results by unchecking the “2+ stops”:
//*#stops2/{uncheck }
Wednesday, 15 May 13
16. 9
Refine the results by unchecking the “2+ stops”:
//*#stops2/{uncheck }
On all result pages
/(//a[.=‘Next’]/{click /})*
Wednesday, 15 May 13
17. 9
Refine the results by unchecking the “2+ stops”:
//*#stops2/{uncheck }
On all result pages
/(//a[.=‘Next’]/{click /})*
and for each flight
//body.resultrow:<flight>
Wednesday, 15 May 13
20. 9
Extract the attributes
Mouseover the ! to extract flight quality warnings
//span.qualityWarningIcon/{mouseover /}
Click on the details to extract layovers
Wednesday, 15 May 13
21. OXPath » The Language
2
Actions correspond to DOM events, e.g.,
Executed once on each context node
Return context nodes for contextual actions or
root nodes for new DOM absolute actions {click/}
➊ Actions: Browser Interaction
10
Document
Click
Fill
Mouseover
doc("google.com")
{click}
{“Rio”}
{mouseover}
Wednesday, 15 May 13
22. OXPath » The Language
2
Extraction marker select nodes for extraction
record markers: :<flight>
attribute markers: :<price=string(.)>
Extracted data has tree shape
nesting of extraction markers in OXPath expression defines
nesting of records and attribute-record associations in the output
➋ Extraction: Compact Tree Construction
11
Wednesday, 15 May 13
24. OXPath » The Language
2
Most web sites use pagination techniques for results
traversing paginated results require iteration
⇢ extraction from any unbounded component of a link graph
Kleene Star with action in the iterated expression
OXPath’s evaluation algorithm
buffers in practice only a constant number of pages
➌ Iteration: Kleene Star
13
/(//a[.=’Next’]/{click /})*
/(//body/{scroll /})* ( infinite scroll )
Wednesday, 15 May 13
25. OXPath » The Language
2
Access to all computed style CSS properties via style axis
➍ Style: Querying Visual Attributes
14
Visibility
Font size
Geometry
Color
style::display or style::visibility
style::font-size
style::top, style::left, ...
style::color or style::background-color
Wednesday, 15 May 13
45. OXPath » DEQA: Question Answering on the Deep Web
4
33
RDF Wrapper (partial)
doc(‘wwagency.co.uk’)....
....
//div.proplist_wrap:<gr:Offering>
[.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>]
.....
[.//strong.orange:<vcard:postal-code=string(.)>]
....
[.//div.prop_img/a/@href:<foaf:page=string(.)>]
//div.prop_img/a/{click /}//body
[.//div#propertypage_copy/
p[last()-1]:<gr:description=string(.)>]
[.//a.~'Map view')]/@href:<wgs84:lat=extractLat(.)>]
[.//a.~'Map view')]/@href:<wgs84:long=extractLong(.)>]
Wednesday, 15 May 13
46. OXPath » DEQA: Question Answering on the Deep Web
4
34
Question translation to SPARQL
Edwardian houses close to supermarket for less than
1,000,000 in Oxfordshire
6 FILTER(regex(?y1,’Abingdon’,’i’)) .
}
In that case, TBSL first performs a restriction by class (“House”), then it fi
the town name “Abingdon” from the street address and it performs a filter on
number of rooms. Note that most QA systems would not be sufficiently pow
to include such filters.
Another example is “Edwardian houses close to supermarket for less
1,000,000 in Oxfordshire”, which was translated to the following query:
SELECT ?x0 WHERE {
2 ?x0 <http://dbpedia.org/property/near> ?y2 .
?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
4 ?v <http://purl.org/goodrelations/v1#includes> ?x0 .
?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 .
6 ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 .
?y2 a <http://linkedgeodata.org/ontology/Supermarket> .
8 ?x0 <http://purl.org/goodrelations/v1#description> ?y .
FILTER(regex(?y0,’Oxfordshire’,’i’)) .
10 FILTER(regex(?y,’Edwardian ’,’i’)) .
FILTER(?y1 < 1000000) .
12 }
In that case, the links to LinkedGeoData were used by selecting the “near” pWednesday, 15 May 13
48. 5
Version 1.1 available on http://oxpath.org (via code.google)
JAVA
Maven archetype and Command Line Interface with examples
Output in XML, RDF and Relational DB, custom output handler
Based on HTMLUnit
some limitations (e.g., no style axis)
Ongoing work
WebDriver-based implementation, Javascript in the next future
Visual Interface (record-and-play) as Firefox Extension
Any feedback is welcome! Get in touch with me
OXPath Engine
36
Wednesday, 15 May 13