1. The document summarizes OXPath, a domain-specific language for extracting structured data from web applications through interactive queries. It describes OXPath's advantages over previous wrapper languages, including its use of XPath for navigation and extraction as well as support for interaction through imperative commands.
2. The document provides an example of an interactive OXPath query to extract flight information from the Kayak website. It demonstrates how OXPath allows specifying navigation steps and extraction markers to scrape relevant data across multiple pages.
3. Finally, the document argues that OXPath can serve as a "lingua franca" or common language for web extraction, addressing limitations of previous systems that each defined their own proprietary
Interactive OXPath Wrapper for Data Extraction from Kayak.co.uk
1. DIADEM: domain-centric intelligent automated data extraction methodology
OXPath: Scalable, Memory-efficient Data Extraction from Web Applications
Andrew Sellers
September 1st, 2011 @ Department of Computer Science, Oxford University
joint work with Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart
8. 1 OXPath » Lingua Franca for Web Extraction
A Call for Action in Web Extraction!
Past: Form Filling + HTML Patterns
Now: Interaction + DOM Patterns
getting to the data requires interaction not just form filling
identifying relevant data from rendered DOMs
including computed style and geometric information
access to all CSS properties, but less rich relations than in SXPath
10. The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a predicate (title, source) represent attributes to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}
/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
To limit the range of the Kleene star, one can specify upper and
lower bounds on the multiplicity, e.g., (...)*{3,8}.
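The bounded Kleene star can be pictured with a small Python sketch. It is not OXPath's implementation: `kleene_star` and `next_page` are hypothetical names, pages are simulated as integers, and the reading of `{3,8}` as "contexts reached after 3 to 8 applications of the step" is an assumption based on the regular-expression analogy.

```python
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def kleene_star(
    start: T,
    step: Callable[[T], Optional[T]],
    lower: int = 0,
    upper: Optional[int] = None,
) -> Iterator[T]:
    """Yield each context reached by applying `step` between `lower` and `upper` times.

    `step` returns None when it can no longer be applied,
    for example when a page has no "Next" link.
    """
    current: Optional[T] = start
    count = 0
    while current is not None and (upper is None or count <= upper):
        if count >= lower:
            yield current
        current = step(current)
        count += 1


# Hypothetical stand-in for result pages: page i links to page i + 1, up to page 9.
def next_page(page: int) -> Optional[int]:
    return page + 1 if page < 9 else None


all_pages = list(kleene_star(0, next_page))                        # unbounded: (...)*
bounded_pages = list(kleene_star(0, next_page, lower=3, upper=8))  # bounded: (...)*{3,8}
```

With the unbounded form every reachable page is visited; with the bounds only the pages reached after three to eight "Next" steps are returned.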
doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{
/descendant::input[@type=’checkbox’][2]/{uncheck}/following::field()[1]/{unc
//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"
//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>
[//div.property-info//a/{click/}//div.home-description:<info=(.)>
11. 1 OXPath » Lingua Franca for Web Extraction
Wrapper Babel
Wrapper induction & data extraction systems
each invent their own wrapper language
often separate navigation and matching
Main classes:
(1) pattern matching + imperative navigation
XPath
Finite & Tree Automata
Token Prefix/Suffix
(2) Datalog
E-Log (Lixto)
12. 1 OXPath » Lingua Franca for Web Extraction
Why OXPath?
an XPath for data extraction
web applications
scalability
familiarity
simplicity
14. Start at kayak.co.uk:
doc("kayak.co.uk")
To select an airport, type a few letters and select from completion list
//field().destination/{"Sea" /}
//div#smartbox//li[1]/{click /}
Submit the form
15. Refine the results by unchecking the “2+ stops”:
//*#stops2/{uncheck }
On all result pages
/(//a[.='Next']/{click /})*
and for each flight
//body.resultrow:<flight>
Extract the attributes
Mouseover the ! to extract flight quality warnings
//span.qualityWarningIcon/{mouseover /}
Click on the details to extract layovers
17. [Figure: overlapping screenshots of an Amazon.co.uk book search for "Hyderabad", annotated with {click /} actions and with the extracted title, price, and publisher attributes]
18. 2 OXPath » The Language
➊ Actions: Browser Interaction
Actions correspond to DOM events, e.g.,
Document doc("rightmove.co.uk")
Click {click}
Fill {"Sea"}
Mouseover {mouseover}
Executed once on each context node
Return context nodes (contextual actions) or
root nodes for new DOM (absolute actions)
19. 2 OXPath » The Language
➋ Extraction: Compact Tree Construction
Extraction markers select nodes for extraction
record markers: :<flight>
attribute markers: :<price=string(.)>
Extracted data has tree shape
nesting of extraction markers in OXPath expression defines
nesting of records and attribute-record associations in the output
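How nested markers translate into a tree can be sketched in Python. This is an illustrative model only, not OXPath's algorithm: `Record`, `build_tree`, the `(depth, name, value)` match format, and the sample flight data are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Record:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Record"] = field(default_factory=list)


# A match is (depth, name, value): depth is the marker's nesting level in the
# expression, value is None for a record marker such as :<flight> and a string
# for an attribute marker such as :<price=string(.)>.
Match = Tuple[int, str, Optional[str]]


def build_tree(matches: List[Match]) -> Record:
    """Attach each match to the last record matched by its parent marker."""
    root = Record("results")
    open_records = [root]  # index d + 1 holds the last record matched at depth d
    for depth, name, value in matches:
        del open_records[depth + 1:]  # close records nested at least as deep
        parent = open_records[-1]
        if value is None:
            record = Record(name)
            parent.children.append(record)
            open_records.append(record)
        else:
            parent.attributes[name] = value
    return root


# Hypothetical matches in document order.
tree = build_tree([
    (0, "flight", None),
    (1, "price", "GBP 420"),
    (1, "layover", None),
    (2, "airport", "JFK"),
    (0, "flight", None),
    (1, "price", "GBP 385"),
])
```

The first flight ends up with a price attribute and a nested layover record, the second with a price only, mirroring the nesting of the markers.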
20. 2 OXPath » The Language
➌ Iteration: Kleene Star
Most web sites use pagination for results
traversing paginated results requires iteration
⇢ extraction from any unbounded component of a link graph
Kleene Star from Regular XPath [Marx, TODS '05]
extended to OXPath, i.e., with actions in the iterated expression
/(//a[.='Next'])*
OXPath’s Page-at-a-time algorithm
buffers in practice only a constant number of pages
even for very large components
21. 2 OXPath » The Language
➍ Style: Querying Visual Attributes
Access to all CSS properties via style axis
Visibility style::display or style::visibility
Font size style::font-size
Geometry style::top, style::left, ...
Color style::color or style::background-color
Joins on style properties possible
but: no rich spatial relations as in SXPath
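Selecting by computed style and joining on a style property can be illustrated with a short Python sketch. The `Node` class, the sample nodes, and the helper names are hypothetical; a real engine would read these values from the browser's rendered DOM.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    text: str
    style: Dict[str, str]  # computed CSS properties, as a style axis would expose them


# Hypothetical rendered nodes.
nodes = [
    Node("Flights to Seattle", {"display": "block", "visibility": "visible", "font-size": "18px"}),
    Node("tracking pixel", {"display": "none", "visibility": "visible", "font-size": "12px"}),
    Node("Hotels in Seattle", {"display": "block", "visibility": "visible", "font-size": "18px"}),
    Node("hidden hint", {"display": "block", "visibility": "hidden", "font-size": "12px"}),
]


def is_visible(node: Node) -> bool:
    """Visibility check based on style::display and style::visibility."""
    return node.style.get("display") != "none" and node.style.get("visibility") != "hidden"


def join_on_style(candidates: List[Node], reference: Node, prop: str) -> List[Node]:
    """Return the other nodes whose value for `prop` equals that of `reference`."""
    return [
        n for n in candidates
        if n is not reference and n.style.get(prop) == reference.style.get(prop)
    ]


visible_texts = [n.text for n in nodes if is_visible(n)]
same_size_texts = [n.text for n in join_on_style(nodes, nodes[0], "font-size")]
```

The join compares property values only; it says nothing about where the nodes sit relative to each other on the page.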
22. 2 OXPath » The Language
OXPath = XPath + 4
action
extraction
style
iteration
24. 3 OXPath » Formal Properties
Semantics – Overview
Semantics defined over relational structure
as in XPath Leashed [Benedikt 08]
but: extended to multiple documents and with action relations
action relations form tree ⇢ no two actions lead to same DOM
but: instead of single node set, set of relations for extracted tree
Context extended by last match for parent extraction marker
necessary to construct the extraction tree
last match for sibling extraction used for
compact specification of records with mixed attributes and nested records
27. 3 OXPath » Formal Properties
Page-At-A-Time (PAAT)
Assures constant memory over number of pages traversed
Experimentally confirmed small overhead over XPath
overhead due to nested record extraction
actions do not affect complexity
Kleene star: small overhead over regular XPath
Extracted records are never buffered
but streamed as soon as they are matched
Offers optimal page management by minimising buffer size
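The idea of streaming records while holding a bounded number of pages can be sketched with a Python generator. This is a deliberate simplification, not the PAAT algorithm itself: it keeps exactly one page open and ignores nested navigation, and `SITE`, `Stats`, and `stream_records` are hypothetical names.

```python
from typing import Dict, Iterator, Optional

# Hypothetical paginated site: each page has some records and an optional "Next" link.
SITE: Dict[str, dict] = {
    "page1": {"records": ["a", "b"], "next": "page2"},
    "page2": {"records": ["c"], "next": "page3"},
    "page3": {"records": ["d", "e"], "next": None},
}


class Stats:
    """Counts how many pages were loaded and how many were open at once."""

    def __init__(self) -> None:
        self.loaded = 0
        self.open_pages = 0
        self.peak_open_pages = 0


def stream_records(first: str, stats: Stats) -> Iterator[str]:
    """Yield records page by page, releasing each page before loading the next."""
    url: Optional[str] = first
    while url is not None:
        page = SITE[url]  # stands in for loading and rendering the page
        stats.loaded += 1
        stats.open_pages += 1
        stats.peak_open_pages = max(stats.peak_open_pages, stats.open_pages)
        for record in page["records"]:
            yield record  # streamed as soon as it is matched, never buffered
        url = page["next"]
        stats.open_pages -= 1  # the page is released before the next one is loaded


stats = Stats()
stream = stream_records("page1", stats)
first_record = next(stream)        # available after loading only the first page
loaded_after_first = stats.loaded
remaining = list(stream)
```

Because the generator yields each record immediately, a consumer can start processing results before later pages have been visited.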
28. [Figure: overlapping Google Scholar result pages for the query "world wide web", annotated with numbered markers]
39. 4 OXPath » OXPath Suite
Visual OXPath: Semi-supervised Generation
Interaction recording
Generation of OXPath expressions
ranked by robustness & specificity
[Figure: screenshots of the Visual OXPath user interface, including "Figure 3.6: User Interface – Before Expression Refinement"]
40. 4 OXPath » OXPath Suite
OXPath Tracer: Debugging
Track selected DOM nodes
Trace page management
Fine grained execution control
(filter, step-by-step, breaks)
41. 4 OXPath » OXPath Suite
ONTOX: Ontology Population with OXPath
43. DIADEM: domain-centric intelligent automated data extraction methodology
Questions
diadem-project.info