SlideShare a Scribd company logo
1 of 36
Download to read offline
XPath (1.0) for web scraping
Paul Tremberth, 17 October 2015, PyCon FR
1
Who am I?
I’m currently Head of Support at
Scrapinghub.
I got introduced to Python through web
scraping.
You can find me on StackOverflow: “xpath”,
“scrapy”, “lxml” tags.
I have a few repos on Github.
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 2
Talk outline
● What is XPath?
● Location paths
● HTML data extraction examples
● Advanced use-cases
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 3
What is XPath?
4
XPath is a language
"XPath is a language for addressing parts of an
XML document" — XML Path Language 1.0
XPath data model is a tree of nodes:
● element nodes (<p>...</p>)
● attribute nodes (href="page.html" )
● text nodes ("Some Title")
● comment nodes (<!-- a comment --> )
(and 3 other types that we won’t cover here.)
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 5
Why learn XPath?
● navigate everywhere inside a DOM tree
● a must-have skill for accurate web data
extraction
● more powerful than CSS selectors
○ fine-grained look at the text content
○ complex conditioning with axes
● extensible with custom functions
(we won’t cover that in this talk though)
Also, it’s kind of fun :-)
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 6
<html>
<head>
<title>Ceci est un titre</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
<div>
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
</div>
</body>
</html>
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
XPath Data Model: sample HTML
7
XPath Data Model (cont.)
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
root
meta5
html1
head2
title3
body6
div7
p8
p10
br15 Apparem
ment.16
Ceci...
9 Est-ce..11
Ceci...4
div17
Rien...18
a19
.21
comment22
element node
http-equiv
a12 ?14
content
href
attribute node
un lien13
comment node
href autre lien20
text node
nodes have an order,
the document order,
the order they appear in
the XML/HTML source
8
XPath return types
XPath expressions can return different
things:
● node-sets (most common case, and
most often element nodes)
● strings
● numbers (floating point)
● booleans
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 9
Example XPath expressions
<html>
<head>
<title>Ceci est un titre</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
<div>
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
</div>
</body>
</html>
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
/html/head/title
//meta/@content
//div/p
//div/a/@href
//div/a/text()
//div/div[@class="second"]
10
Location Paths:
how to move inside the document tree
11
Location Paths
Location path is the most common XPath
expression.
Used to move in any direction from a starting
point (the context node) to any node(s) in the
tree.
● a string, with a series of “steps”:
○ "step1 / step2 / step3 ..."
● represents selection & filtering of nodes,
processed step by step, from left to right
● each step is:
○ AXIS :: NODETEST [PREDICATE]*
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 12
whitespace does NOT
matter,
except for “//”,
“/ /” is a syntax error.
So don’t be afraid of
indenting your
XPath expressions
Relative vs. absolute paths
● "step1/step2/step3" is relative
● "/step1/step2/step3" is absolute
● i.e. an absolute path is a relative path
starting with "/" (slash)
● in fact, absolute paths are relative to the
root node
● use relative paths whenever possible
○ prevents unexpected selection of
same nodes in loop iterations...
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 13
Location Paths: abbreviations
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
Abbreviated syntax Full syntax (again, whitespace doesn’t matter)
/html/head/title /child:: html /child:: head /child:: title
//meta/@content /descendant-or-self::node()/child::meta/attribute::content
//div/div[@class="second"] /descendant-or-self::node()
/child::div
/child::div [ attribute::class = "second" ]
//div/a/text() /descendant-or-self::node()
/child::div/child::a/child::text()
What we’ve seen earlier is in fact “abbreviated syntax”.
Full syntax is quite verbose:
14
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
*
Axes give the direction to go next.
● self (where you are)
● parent, child (direct hop)
● ancestor, ancestor-or-self, descendant,
descendant-or-self (multi-hop)
● following, following-sibling, preceding,
preceding-sibling (document order)
● attribute, namespace (non-element)
15
Axes: moving around
AXIS :: nodetest [predicate]*
Axes: move up or down the tree
<body>
<div>
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
</div>
</body>
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
● self: context node
● child: children of context
node in the tree
● descendant: children of
context node, children of
children, …
● ancestor: parent of
context node (here,
<body>)
16
Axes: move on same tree level
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
● preceding-sibling: same
level in tree, but BEFORE
in document order
● self: context node
● following-sibling: same
level in tree, but AFTER
in document order
17
Axes: document partitioning
<html>
<head>
<title>Ceci est un titre</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
<div>
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
</div>
</body>
</html>
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
ancestor&
preceding
descendants
&following
● ancestor: parent, parent
of parent, ...
● preceding: before in
document order,
excluding ancestors
● self: context node
● descendant: children,
children of children, ...
● following: after in
document order,
excluding descendants
self U (ancestor U preceding)
U (descendant U following)
== all nodes
18
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
Select nodes types along the axes:
● either a name test:
○ element names: “html”, “p”, ...
○ attribute names: “content-type”, “href”, …
● or a node type test:
○ node(): ALL nodes types
○ text(): text nodes, i.e. character data
○ comment(): comment nodes
○ or * (i.e. the axis’ principal node type)
19
Node tests
text() is not
a function call
axis :: NODETEST [predicate]*
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
*
Additional node properties to filter on:
● simplest are positional (start at 1, not 0):
//div[3] (i.e. 3rd child div element)
● can be nested, can use location paths:
//div[p[a/@href="sample.html"]]
● ordered, from left to right:
//div[2][@class="content"]
vs.
//div[@class="content"][2]
20
Predicates
axis :: nodetest [PREDICATE]*
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 21
Abbreviations: to remember
Abbreviated step Meaning
*
(asterisk)
all element nodes (excluding text nodes, attribute nodes, etc.) ;
.//* != .//node()
Note that there is no element() test node.
@* attribute::*, all attribute nodes
//
/descendant-or-self::node()/
(exactly this, so .//* != ./descendant-or-self::*)
.
(a single dot)
self::node(), the context node
..
(2 dots)
parent::node()
Use cases:
“Show me some Python code!”
22
Text extraction
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
>>> import lxml.html
>>> root = lxml.html.fromstring(htmlsource)
>>> root.xpath('//div[@class="second"]/text()') # you get back a text nodes node-set
[u'n Rien xe0 ajouter.n Sauf cet ', '. n ', 'n ']
>>> root.xpath('//div[@class="second"]//text()')
[u'n Rien xe0 ajouter.n Sauf cet ', 'autre lien', '. n ', 'n ']
>>> root.xpath('string(//div[@class="second"])') # you get back a single string
u'n Rien xe0 ajouter.n Sauf cet autre lien. n n '
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
23
Attributes extraction
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
>>> import lxml.html
>>> root = lxml.html.fromstring(htmlsource)
>>> root.xpath('/html/head/meta/@content')
['text/html; charset=utf-8']
>>> root.xpath('/html/head/meta/@*')
['text/html; charset=utf-8', 'content-type']
<html><head>
<title>Ceci est un titre</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>...</html>
24
Attribute names extraction
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
>>> for element in root.xpath('/html/head/meta'):
... attributes = []
...
... # loop over all attribute nodes of the element
... for index, attribute in enumerate(element.xpath('@*'), start=1):
...
... # use XPath's name() string function on each attribute,
... # using their position
... attribute_name = element.xpath('name(@*[%d])' % index)
... attributes.append((attribute_name, attribute))
...
>>> attributes
[('content', 'text/html; charset=utf-8'), ('http-equiv', 'content-type')]
>>> dict(attributes)
{'content': 'text/html; charset=utf-8', 'http-equiv': 'content-type'}
25
CSS Selectors
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
>>> selector.css('html > body > ul > li:first-child').extract()
[u'<li class="a b">apple</li>']
>>> selector.css('ul li + li').extract()
[u'<li class="b c">banana</li>', u'<li class="c a lastone">carrot</li>']
<html>
<body>
<ul>
<li class="a b">apple</li>
<li class="b c">banana</li>
<li class="c a lastone">carrot</li>
</ul>
<body>
</html>
26
lxml, scrapy and parsel
use cssselect under
the hood
CSS Selectors (cont.)
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
>>> selector.css('li.a').extract()
[u'<li class="a b">apple</li>', u'<li class="c a lastone">carrot</li>']
>>> selector.css('li.a.c').extract()
[u'<li class="c a lastone">carrot</li>']
>>> selector.css('li[class$="one"]').extract()
[u'<li class="c a lastone">carrot</li>']
<html>
<body>
<ul>
<li class="a b">apple</li>
<li class="b c">banana</li>
<li class="c a lastone">carrot</li>
</ul>
<body>
</html>
27
Loop on elements (rows, lists, ...)
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
>>> import parsel
>>> selector = parsel.Selector(text=htmlsource)
>>> dict((li.xpath('string(./strong)').extract_first(),
li.xpath('normalize-space( 
./strong/following-sibling::text())').extract_first())
for li in selector.css('div.detail-product > ul > li'))
{u'Doublure': u'Textile',
u'Ref': u'22369',
u'Type': u'Baskets'}
<div class='detail-product'>
<ul>
<li><strong>Type</strong> Baskets</li>
<li><strong>Ref</strong> 22369</li>
<li><strong>Doublure</strong> Textile</li>
</ul>
</div> <!-- borrowed from spartoo.com... -->
28
XPath buckets
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
>>> for cnt, h in enumerate(selector.css('h2'), start=1):
... print h.xpath('string()').extract_first()
... print [p.xpath('string()').extract_first()
... for p in h.xpath('''./following-sibling::p[
... count(preceding-sibling::h2) = %d]''' % cnt)]
...
My BBQ Invitees
[u'Joe', u'Jeff', u'Suzy']
My Dinner Invitees
[u'Dylan', u'Hobbes']
<h2>My BBQ Invitees</h2>
<p>Joe</p>
<p>Jeff</p>
<p>Suzy</p>
<h2>My Dinner Invitees</h2>
<p>Dylan</p>
<p>Hobbes</p>
29
All elements are at the same
level, all siblings.
The idea here is to select <p>
and filter them by how many
<h2> siblings came before
XPath buckets: generalization
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
>>> all_elements = selector.css('h2, p')
>>> h2_elements = selector.css('h2')
>>> order = lambda e: int(float(e.xpath('count(preceding::*) 
... + count(ancestor::*)').extract_first()))
>>> boundaries = [order(h) for h in h2_elements]
>>> buckets = []
>>> for pos, e in sorted((order(e), e) for e in all_elements):
... if pos in boundaries:
... bucket = []
... buckets.append(bucket)
... bucket.append((e.xpath('name()').extract_first(),
... e.xpath('string()').extract_first()))
>>> buckets
[[(u'h2', u'My BBQ Invitees'),
(u'p', u'Joe'), (u'p', u'Jeff'), (u'p', u'Suzy')],
[(u'h2', u'My Dinner Invitees'), (u'p', u'Dylan'), (u'p', u'Hobbes')]]
30
counting ancestors and preceding
elements gives you the node
position in document order
EXSLT extensions
<div itemscope itemtype ="http://schema.org/Movie">
<h1 itemprop="name">Avatar</h1>
<div itemprop="director" itemscope itemtype="http://schema.org/Person">
Director: <span itemprop="name">James Cameron</span> (born <span itemprop="
birthDate">August 16, 1954</span>)
</div>
<span itemprop="genre">Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
_xp_item = lxml.etree.XPath('descendant-or-self::*[@itemscope]')
_xp_prop = lxml.etree.XPath("""set:difference(.//*[@itemprop],
.//*[@itemscope]//*[@itemprop])""",
namespaces = {"set": "http://exslt.org/sets"})
31
EXSLT extensions (cont.)
{'items': [{'properties': {'director': {'properties': {'birthDate': u'August 16,
1954',
'name': u'James Cameron'},
'type': ['http://schema.org/Person']},
'genre': u'Science fiction',
'name': u'Avatar',
'trailer': 'http://www.example.com/../movies/avatar-
theatrical-trailer.html'},
'type': ['http://schema.org/Movie']}]}
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 32
Scraping Javascript code
<div id="map"></div>
<script>
function initMap() {
var myLatLng = {lat: -25.363, lng: 131.044};
}
</script>
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015
>>> import js2xml
>>> import parsel
>>>
>>> selector = parsel.Selector(text=htmlsource)
>>> jssnippet = selector.css('div#map + script::text').extract_first()
>>> jstree = js2xml.parse(jssnippet)
>>> js2xml.jsonlike.make_dict(jstree.xpath('//object[1]')[0])
{'lat': -25.363, 'lng': 131.044}
33
XPath tips & tricks
● Use relative XPath expressions
whenever possible
● Know your axes
● Remember XPath has string() and
normalize-space()
● text() is a node test, not a function call
● CSS selectors are very handy, easier to
maintain, but also less powerful
● js2xml is easier than regex+json.loads()
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 34
XPath resources
● Read http://www.w3.org/TR/xpath/
(it’s worth it!)
● XPath 1.0 tutorial by Zvon.org
● Concise XPath by Aristotle Pagaltzis
● XPath in lxml
● cssselect by Simon Sapin
● EXSLT extensions
XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 35
Thank you!
Paul Tremberth, 17 October 2015
paul@scrapinghub.com
https://github.com/redapple
http://linkedin.com/in/paultremberth
Oh, and we’re hiring!
http://scrapinghub.com/jobs
36

More Related Content

What's hot

What's hot (20)

Css class-02
Css class-02Css class-02
Css class-02
 
Constructing SQL queries for AtoM
Constructing SQL queries for AtoMConstructing SQL queries for AtoM
Constructing SQL queries for AtoM
 
Introduction To Catalyst - Part 1
Introduction To Catalyst - Part 1Introduction To Catalyst - Part 1
Introduction To Catalyst - Part 1
 
Filtering, Searching, and Sorting ActiveRecord Lists Using Filterrific
Filtering, Searching, and Sorting ActiveRecord Lists Using FilterrificFiltering, Searching, and Sorting ActiveRecord Lists Using Filterrific
Filtering, Searching, and Sorting ActiveRecord Lists Using Filterrific
 
Let's read code: the python-requests library
Let's read code: the python-requests libraryLet's read code: the python-requests library
Let's read code: the python-requests library
 
Routing & Navigating Pages in Angular 2
Routing & Navigating Pages in Angular 2Routing & Navigating Pages in Angular 2
Routing & Navigating Pages in Angular 2
 
9. ES6 | Let And Const | TypeScript | JavaScript
9. ES6 | Let And Const | TypeScript | JavaScript9. ES6 | Let And Const | TypeScript | JavaScript
9. ES6 | Let And Const | TypeScript | JavaScript
 
Introduction to PHP - Basics of PHP
Introduction to PHP - Basics of PHPIntroduction to PHP - Basics of PHP
Introduction to PHP - Basics of PHP
 
Javascript event handler
Javascript event handlerJavascript event handler
Javascript event handler
 
OSMC 2021 | inspectIT Ocelot: Dynamic OpenTelemetry Instrumentation at Runtime
OSMC 2021 | inspectIT Ocelot: Dynamic OpenTelemetry Instrumentation at RuntimeOSMC 2021 | inspectIT Ocelot: Dynamic OpenTelemetry Instrumentation at Runtime
OSMC 2021 | inspectIT Ocelot: Dynamic OpenTelemetry Instrumentation at Runtime
 
Using Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasetsUsing Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasets
 
Python - Dicionários
Python - DicionáriosPython - Dicionários
Python - Dicionários
 
presentation in html,css,javascript
presentation in html,css,javascriptpresentation in html,css,javascript
presentation in html,css,javascript
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
Css pseudo-classes
Css pseudo-classesCss pseudo-classes
Css pseudo-classes
 
Pilhas e Filas
Pilhas e FilasPilhas e Filas
Pilhas e Filas
 
React render props
React render propsReact render props
React render props
 
Elastic serach
Elastic serachElastic serach
Elastic serach
 
Orientação a objetos em Python (compacto)
Orientação a objetos em Python (compacto)Orientação a objetos em Python (compacto)
Orientação a objetos em Python (compacto)
 
JQuery selectors
JQuery selectors JQuery selectors
JQuery selectors
 

Similar to XPath for web scraping

Система рендеринга в Magento
Система рендеринга в MagentoСистема рендеринга в Magento
Система рендеринга в Magento
Magecom Ukraine
 
course slides -- powerpoint
course slides -- powerpointcourse slides -- powerpoint
course slides -- powerpoint
webhostingguy
 

Similar to XPath for web scraping (20)

Lumberjack XPath 101
Lumberjack XPath 101Lumberjack XPath 101
Lumberjack XPath 101
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
 
Introduccion a HTML5
Introduccion a HTML5Introduccion a HTML5
Introduccion a HTML5
 
html5
html5html5
html5
 
Python: an introduction for PHP webdevelopers
Python: an introduction for PHP webdevelopersPython: an introduction for PHP webdevelopers
Python: an introduction for PHP webdevelopers
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Система рендеринга в Magento
Система рендеринга в MagentoСистема рендеринга в Magento
Система рендеринга в Magento
 
JavaScripts & jQuery
JavaScripts & jQueryJavaScripts & jQuery
JavaScripts & jQuery
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
 
Architecture | Busy Java Developers Guide to NoSQL | Ted Neward
Architecture | Busy Java Developers Guide to NoSQL | Ted NewardArchitecture | Busy Java Developers Guide to NoSQL | Ted Neward
Architecture | Busy Java Developers Guide to NoSQL | Ted Neward
 
Inroduction to XSLT with PHP4
Inroduction to XSLT with PHP4Inroduction to XSLT with PHP4
Inroduction to XSLT with PHP4
 
The Beauty And The Beast Php N W09
The Beauty And The Beast Php N W09The Beauty And The Beast Php N W09
The Beauty And The Beast Php N W09
 
Drupal7 Theming session on the occassion of Drupal7 release party in Delhi NCR
Drupal7 Theming session on the occassion of  Drupal7 release party in Delhi NCRDrupal7 Theming session on the occassion of  Drupal7 release party in Delhi NCR
Drupal7 Theming session on the occassion of Drupal7 release party in Delhi NCR
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?
 
DOM Quick Overview
DOM Quick OverviewDOM Quick Overview
DOM Quick Overview
 
Kill mysql-performance
Kill mysql-performanceKill mysql-performance
Kill mysql-performance
 
The beautyandthebeast phpbat2010
The beautyandthebeast phpbat2010The beautyandthebeast phpbat2010
The beautyandthebeast phpbat2010
 
course slides -- powerpoint
course slides -- powerpointcourse slides -- powerpoint
course slides -- powerpoint
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

XPath for web scraping

  • 1. XPath (1.0) for web scraping Paul Tremberth, 17 October 2015, PyCon FR 1
  • 2. Who am I? I’m currently Head of Support at Scrapinghub. I got introduced to Python through web scraping. You can find me on StackOverflow: “xpath”, “scrapy”, “lxml” tags. I have a few repos on Github. XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 2
  • 3. Talk outline ● What is XPath? ● Location paths ● HTML data extraction examples ● Advanced use-cases XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 3
  • 5. XPath is a language "XPath is a language for addressing parts of an XML document" — XML Path Language 1.0 XPath data model is a tree of nodes: ● element nodes (<p>...</p>) ● attribute nodes (href="page.html" ) ● text nodes ("Some Title") ● comment nodes (<!-- a comment --> ) (and 3 other types that we won’t cover here.) XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 5
  • 6. Why learn XPath? ● navigate everywhere inside a DOM tree ● a must-have skill for accurate web data extraction ● more powerful than CSS selectors ○ fine-grained look at the text content ○ complex conditioning with axes ● extensible with custom functions (we won’t cover that in this talk though) Also, it’s kind of fun :-) XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 6
  • 7. <html> <head> <title>Ceci est un titre</title> <meta content="text/html; charset=utf-8" http-equiv="content-type"> </head> <body> <div> <div> <p>Ceci est un paragraphe.</p> <p>Est-ce <a href="page2.html">un lien</a>?</p> <br> Apparemment. </div> <div class="second"> Rien &agrave; ajouter. Sauf cet <a href="page3.html">autre lien</a>. <!-- Et ce commentaire --> </div> </div> </body> </html> XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 XPath Data Model: sample HTML 7
  • 8. XPath Data Model (cont.) XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 root meta5 html1 head2 title3 body6 div7 p8 p10 br15 Apparem ment.16 Ceci... 9 Est-ce..11 Ceci...4 div17 Rien...18 a19 .21 comment22 element node http-equiv a12 ?14 content href attribute node un lien13 comment node href autre lien20 text node nodes have an order, the document order, the order they appear in the XML/HTML source 8
  • 9. XPath return types XPath expressions can return different things: ● node-sets (most common case, and most often element nodes) ● strings ● numbers (floating point) ● booleans XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 9
  • 10. Example XPath expressions <html> <head> <title>Ceci est un titre</title> <meta content="text/html; charset=utf-8" http-equiv="content-type"> </head> <body> <div> <div> <p>Ceci est un paragraphe.</p> <p>Est-ce <a href="page2.html">un lien</a>?</p> <br> Apparemment. </div> <div class="second"> Rien &agrave; ajouter. Sauf cet <a href="page3.html">autre lien</a>. <!-- Et ce commentaire --> </div> </div> </body> </html> XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 /html/head/title //meta/@content //div/p //div/a/@href //div/a/text() //div/div[@class="second"] 10
  • 11. Location Paths: how to move inside the document tree 11
  • 12. Location Paths Location path is the most common XPath expression. Used to move in any direction from a starting point (the context node) to any node(s) in the tree. ● a string, with a series of “steps”: ○ "step1 / step2 / step3 ..." ● represents selection & filtering of nodes, processed step by step, from left to right ● each step is: ○ AXIS :: NODETEST [PREDICATE]* XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 12 whitespace does NOT matter, except for “//”, “/ /” is a syntax error. So don’t be afraid of indenting your XPath expressions
  • 13. Relative vs. absolute paths ● "step1/step2/step3" is relative ● "/step1/step2/step3" is absolute ● i.e. an absolute path is a relative path starting with "/" (slash) ● in fact, absolute paths are relative to the root node ● use relative paths whenever possible ○ prevents unexpected selection of same nodes in loop iterations... XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 13
  • 14. Location Paths: abbreviations XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 Abbreviated syntax Full syntax (again, whitespace doesn’t matter) /html/head/title /child:: html /child:: head /child:: title //meta/@content /descendant-or-self::node()/child::meta/attribute::content //div/div[@class="second"] /descendant-or-self::node() /child::div /child::div [ attribute::class = "second" ] //div/a/text() /descendant-or-self::node() /child::div/child::a/child::text() What we’ve seen earlier is in fact “abbreviated syntax”. Full syntax is quite verbose: 14
  • 15. XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 * Axes give the direction to go next. ● self (where you are) ● parent, child (direct hop) ● ancestor, ancestor-or-self, descendant, descendant-or-self (multi-hop) ● following, following-sibling, preceding, preceding-sibling (document order) ● attribute, namespace (non-element) 15 Axes: moving around AXIS :: nodetest [predicate]*
  • 16. Axes: move up or down the tree <body> <div> <div> <p>Ceci est un paragraphe.</p> <p>Est-ce <a href="page2.html">un lien</a>?</p> <br> Apparemment. </div> <div class="second"> Rien &agrave; ajouter. Sauf cet <a href="page3.html">autre lien</a>. <!-- Et ce commentaire --> </div> </div> </body> XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 ● self: context node ● child: children of context node in the tree ● descendant: children of context node, children of children, … ● ancestor: parent of context node (here, <body>) 16
  • 17. Axes: move on same tree level XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 <div> <p>Ceci est un paragraphe.</p> <p>Est-ce <a href="page2.html">un lien</a>?</p> <br> Apparemment. </div> ● preceding-sibling: same level in tree, but BEFORE in document order ● self: context node ● following-sibling: same level in tree, but AFTER in document order 17
  • 18. Axes: document partitioning <html> <head> <title>Ceci est un titre</title> <meta content="text/html; charset=utf-8" http-equiv="content-type"> </head> <body> <div> <div> <p>Ceci est un paragraphe.</p> <p>Est-ce <a href="page2.html">un lien</a>?</p> <br> Apparemment. </div> <div class="second"> Rien &agrave; ajouter. Sauf cet <a href="page3.html">autre lien</a>. <!-- Et ce commentaire --> </div> </div> </body> </html> XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 ancestor& preceding descendants &following ● ancestor: parent, parent of parent, ... ● preceding: before in document order, excluding ancestors ● self: context node ● descendant: children, children of children, ... ● following: after in document order, excluding descendants self U (ancestor U preceding) U (descendant U following) == all nodes 18
  • 19. XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 Select nodes types along the axes: ● either a name test: ○ element names: “html”, “p”, ... ○ attribute names: “content-type”, “href”, … ● or a node type test: ○ node(): ALL nodes types ○ text(): text nodes, i.e. character data ○ comment(): comment nodes ○ or * (i.e. the axis’ principal node type) 19 Node tests text() is not a function call axis :: NODETEST [predicate]*
  • 20. XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 * Additional node properties to filter on: ● simplest are positional (start at 1, not 0): //div[3] (i.e. 3rd child div element) ● can be nested, can use location paths: //div[p[a/@href="sample.html"]] ● ordered, from left to right: //div[2][@class="content"] vs. //div[@class="content"][2] 20 Predicates axis :: nodetest [PREDICATE]*
  • 21. XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 21 Abbreviations: to remember Abbreviated step Meaning * (asterisk) all element nodes (excluding text nodes, attribute nodes, etc.) ; .//* != .//node() Note that there is no element() test node. @* attribute::*, all attribute nodes // /descendant-or-self::node()/ (exactly this, so .//* != ./descendant-or-self::*) . (a single dot) self::node(), the context node .. (2 dots) parent::node()
  • 22. Use cases: “Show me some Python code!” 22
  • 23. Text extraction XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 >>> import lxml.html >>> root = lxml.html.fromstring(htmlsource) >>> root.xpath('//div[@class="second"]/text()') # you get back a text nodes node-set [u'n Rien xe0 ajouter.n Sauf cet ', '. n ', 'n '] >>> root.xpath('//div[@class="second"]//text()') [u'n Rien xe0 ajouter.n Sauf cet ', 'autre lien', '. n ', 'n '] >>> root.xpath('string(//div[@class="second"])') # you get back a single string u'n Rien xe0 ajouter.n Sauf cet autre lien. n n ' <div class="second"> Rien &agrave; ajouter. Sauf cet <a href="page3.html">autre lien</a>. <!-- Et ce commentaire --> </div> 23
  • 24. Attributes extraction XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 >>> import lxml.html >>> root = lxml.html.fromstring(htmlsource) >>> root.xpath('/html/head/meta/@content') ['text/html; charset=utf-8'] >>> root.xpath('/html/head/meta/@*') ['text/html; charset=utf-8', 'content-type'] <html><head> <title>Ceci est un titre</title> <meta content="text/html; charset=utf-8" http-equiv="content-type"> </head>...</html> 24
  • 25. Attribute names extraction XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 >>> for element in root.xpath('/html/head/meta'): ... attributes = [] ... ... # loop over all attribute nodes of the element ... for index, attribute in enumerate(element.xpath('@*'), start=1): ... ... # use XPath's name() string function on each attribute, ... # using their position ... attribute_name = element.xpath('name(@*[%d])' % index) ... attributes.append((attribute_name, attribute)) ... >>> attributes [('content', 'text/html; charset=utf-8'), ('http-equiv', 'content-type')] >>> dict(attributes) {'content': 'text/html; charset=utf-8', 'http-equiv': 'content-type'} 25
  • 26. CSS Selectors XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 >>> selector.css('html > body > ul > li:first-child').extract() [u'<li class="a b">apple</li>'] >>> selector.css('ul li + li').extract() [u'<li class="b c">banana</li>', u'<li class="c a lastone">carrot</li>'] <html> <body> <ul> <li class="a b">apple</li> <li class="b c">banana</li> <li class="c a lastone">carrot</li> </ul> <body> </html> 26 lxml, scrapy and parsel use cssselect under the hood
  • 27. CSS Selectors (cont.) XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 >>> selector.css('li.a').extract() [u'<li class="a b">apple</li>', u'<li class="c a lastone">carrot</li>'] >>> selector.css('li.a.c').extract() [u'<li class="c a lastone">carrot</li>'] >>> selector.css('li[class$="one"]').extract() [u'<li class="c a lastone">carrot</li>'] <html> <body> <ul> <li class="a b">apple</li> <li class="b c">banana</li> <li class="c a lastone">carrot</li> </ul> <body> </html> 27
  • 28. Loop on elements (rows, lists, ...) XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 >>> import parsel >>> selector = parsel.Selector(text=htmlsource) >>> dict((li.xpath('string(./strong)').extract_first(), li.xpath('normalize-space( ./strong/following-sibling::text())').extract_first()) for li in selector.css('div.detail-product > ul > li')) {u'Doublure': u'Textile', u'Ref': u'22369', u'Type': u'Baskets'} <div class='detail-product'> <ul> <li><strong>Type</strong> Baskets</li> <li><strong>Ref</strong> 22369</li> <li><strong>Doublure</strong> Textile</li> </ul> </div> <!-- borrowed from spartoo.com... --> 28
  • 29. XPath buckets XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 >>> for cnt, h in enumerate(selector.css('h2'), start=1): ... print h.xpath('string()').extract_first() ... print [p.xpath('string()').extract_first() ... for p in h.xpath('''./following-sibling::p[ ... count(preceding-sibling::h2) = %d]''' % cnt)] ... My BBQ Invitees [u'Joe', u'Jeff', u'Suzy'] My Dinner Invitees [u'Dylan', u'Hobbes'] <h2>My BBQ Invitees</h2> <p>Joe</p> <p>Jeff</p> <p>Suzy</p> <h2>My Dinner Invitees</h2> <p>Dylan</p> <p>Hobbes</p> 29 All elements are at the same level, all siblings. The idea here is to select <p> and filter them by how many <h2> siblings came before
  • 30. XPath buckets: generalization XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 >>> all_elements = selector.css('h2, p') >>> h2_elements = selector.css('h2') >>> order = lambda e: int(float(e.xpath('count(preceding::*) ... + count(ancestor::*)').extract_first())) >>> boundaries = [order(h) for h in h2_elements] >>> buckets = [] >>> for pos, e in sorted((order(e), e) for e in all_elements): ... if pos in boundaries: ... bucket = [] ... buckets.append(bucket) ... bucket.append((e.xpath('name()').extract_first(), ... e.xpath('string()').extract_first())) >>> buckets [[(u'h2', u'My BBQ Invitees'), (u'p', u'Joe'), (u'p', u'Jeff'), (u'p', u'Suzy')], [(u'h2', u'My Dinner Invitees'), (u'p', u'Dylan'), (u'p', u'Hobbes')]] 30 counting ancestors and preceding elements gives you the node position in document order
  • 31. EXSLT extensions <div itemscope itemtype ="http://schema.org/Movie"> <h1 itemprop="name">Avatar</h1> <div itemprop="director" itemscope itemtype="http://schema.org/Person"> Director: <span itemprop="name">James Cameron</span> (born <span itemprop=" birthDate">August 16, 1954</span>) </div> <span itemprop="genre">Science fiction</span> <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a> </div> XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 _xp_item = lxml.etree.XPath('descendant-or-self::*[@itemscope]') _xp_prop = lxml.etree.XPath("""set:difference(.//*[@itemprop], .//*[@itemscope]//*[@itemprop])""", namespaces = {"set": "http://exslt.org/sets"}) 31
  • 32. EXSLT extensions (cont.) {'items': [{'properties': {'director': {'properties': {'birthDate': u'August 16, 1954', 'name': u'James Cameron'}, 'type': ['http://schema.org/Person']}, 'genre': u'Science fiction', 'name': u'Avatar', 'trailer': 'http://www.example.com/../movies/avatar- theatrical-trailer.html'}, 'type': ['http://schema.org/Movie']}]} XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 32
  • 33. Scraping Javascript code <div id="map"></div> <script> function initMap() { var myLatLng = {lat: -25.363, lng: 131.044}; } </script> XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 >>> import js2xml >>> import parsel >>> >>> selector = parsel.Selector(text=htmlsource) >>> jssnippet = selector.css('div#map + script::text').extract_first() >>> jstree = js2xml.parse(jssnippet) >>> js2xml.jsonlike.make_dict(jstree.xpath('//object[1]')[0]) {'lat': -25.363, 'lng': 131.044} 33
  • 34. XPath tips & tricks ● Use relative XPath expressions whenever possible ● Know your axes ● Remember XPath has string() and normalize-space() ● text() is a node test, not a function call ● CSS selectors are very handy, easier to maintain, but also less powerful ● js2xml is easier than regex+json.loads() XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 34
  • 35. XPath resources ● Read http://www.w3.org/TR/xpath/ (it’s worth it!) ● XPath 1.0 tutorial by Zvon.org ● Concise XPath by Aristotle Pagaltzis ● XPath in lxml ● cssselect by Simon Sapin ● EXSLT extensions XPath for web scraping - Paul Tremberth, 17 October 2015 - Scrapinghub ⓒ 2015 35
  • 36. Thank you! Paul Tremberth, 17 October 2015 paul@scrapinghub.com https://github.com/redapple http://linkedin.com/in/paultremberth Oh, and we’re hiring! http://scrapinghub.com/jobs 36