Overview of Python web scraping tools

A talk I gave at the Barcelona Python Meetup May 2012.

Transcript

  • 1. Overview of Python web scraping tools
    Maik Röder
    Barcelona Python Meetup Group
    17.05.2012
  • 2. Data Scraping
    • Automated process
    • Explore and download pages
    • Grab content
    • Store in a database or in a text file
  • 3. urlparse
    • Manipulate URL strings
      urlparse.urlparse()
      urlparse.urljoin()
      urlparse.urlunparse()
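    A minimal sketch of these three calls (Python 2's urlparse module; in Python 3 the same functions live in urllib.parse, and the example URL here is illustrative):
    >>> import urlparse
    >>> parts = urlparse.urlparse("http://www.wunderground.com/history/?q=BCN")
    >>> parts.netloc
    'www.wunderground.com'
    >>> urlparse.urljoin("http://www.wunderground.com/history/", "airport/BCN")
    'http://www.wunderground.com/history/airport/BCN'
    >>> urlparse.urlunparse(parts)
    'http://www.wunderground.com/history/?q=BCN'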
  • 4. urllib
    • Download data through different protocols
    • HTTP, FTP, ...
      urllib.urlencode()
      urllib.urlopen()
      urllib.urlretrieve()
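    For illustration, a hedged sketch of these calls (Python 2; the query parameters and file name are illustrative, not from the talk):
    >>> import urllib
    >>> params = urllib.urlencode({"query": "Barcelona"})
    >>> f = urllib.urlopen("http://www.wunderground.com/?" + params)
    >>> data = f.read()
    >>> f.close()
    >>> urllib.urlretrieve("http://www.wunderground.com/favicon.ico", "favicon.ico")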
  • 5. Scrape a web site
    • Example: http://www.wunderground.com/
  • 6. Preparation
    >>> from StringIO import StringIO
    >>> from urllib2 import urlopen
    >>> f = urlopen("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    >>> p = f.read()
    >>> d = StringIO(p)
    >>> f.close()
  • 7. BeautifulSoup
    • HTML/XML parser
    • Designed for quick turnaround projects like screen-scraping
    • http://www.crummy.com/software/BeautifulSoup
  • 8. BeautifulSoup
    from BeautifulSoup import BeautifulSoup
    a = BeautifulSoup(d).findAll("a")
    [x["href"] for x in a]
  • 9. Faster BeautifulSoup
    from BeautifulSoup import BeautifulSoup, SoupStrainer
    p = SoupStrainer("a")
    a = BeautifulSoup(d, parseOnlyThese=p)
    [x["href"] for x in a]
  • 10. Inspect the Element
    • Inspect the Maximum temperature
  • 11. Find the node
    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup(d)
    >>> attrs = {"class": "nobr"}
    >>> nobrs = soup.findAll(attrs=attrs)
    >>> temperature = nobrs[3].span.string
    >>> print temperature
    23
  • 12. htmllib.HTMLParser
    • Interesting only for historical reasons
    • Based on sgmllib
  • 13. html5lib
    • Uses the custom simpletree format
    • A built-in DOM-ish tree type (pythonic idioms)
    from html5lib import parse
    from html5lib import treebuilders
    e = treebuilders.simpletree.Element
    i = parse(d)
    a = [x for x in i if isinstance(x, e) and x.name == "a"]
    [x.attributes["href"] for x in a]
  • 14. lxml
    • Library for processing XML and HTML
    • Based on C libraries
      sudo aptitude install libxml2-dev
      sudo aptitude install libxslt-dev
    • Extends the ElementTree API
    • e.g. with XPath
  • 15. lxml
    from lxml import etree
    t = etree.parse("t.xml")
    for node in t.xpath("//a"):
        node.tag
        node.get("href")
        node.items()
        node.text
        node.getparent()
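    As a sketch, the temperature lookup from the BeautifulSoup slides can also be done with lxml's HTML parser and XPath (assuming p still holds the page source from the Preparation slide; the class name and index follow the earlier example):
    >>> from lxml import html
    >>> doc = html.fromstring(p)
    >>> nobrs = doc.xpath('//*[@class="nobr"]')
    >>> print nobrs[3].findtext("span")
    23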
  • 16. twill
    • Simple
    • No JavaScript
    • http://twill.idyll.org
    • Some more interesting concepts
      • Pages, Scenarios
      • State Machines
  • 17. twill
    • Commonly used methods:
      go()
      code()
      show()
      showforms()
      formvalue() (or fv())
      submit()
  • 18. twill
    >>> from twill import commands as twill
    >>> from twill import get_browser
    >>> twill.go("http://www.google.com")
    >>> twill.showforms()
    >>> twill.formvalue(1, "q", "Python")
    >>> twill.showforms()
    >>> twill.submit()
    >>> get_browser().get_html()
  • 19. twill - acknowledge_equiv_refresh
    >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    ...
    twill.errors.TwillException: infinite refresh loop discovered; aborting.
    Try turning off acknowledge_equiv_refresh...
  • 20. twill
    >>> twill.config("acknowledge_equiv_refresh", "false")
    >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    ==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
    http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
  • 21. mechanize
    • Stateful programmatic web browsing
      • Navigation history
      • HTML form state
      • Cookies
      • ftp:, http: and file: URL schemes
      • Redirections
      • Proxies
      • Basic and Digest HTTP authentication
  • 22. mechanize - robots.txt
    >>> import mechanize
    >>> browser = mechanize.Browser()
    >>> browser.open("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
  • 23. mechanize - robots.txt
    • Do not handle robots.txt
      browser.set_handle_robots(False)
    • Do not handle equiv
      browser.set_handle_equiv(False)
      browser.open("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
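    mechanize drives forms much like twill; a minimal sketch against the Google search form from the earlier twill slide (form index and field name taken from that slide):
    >>> browser = mechanize.Browser()
    >>> browser.set_handle_robots(False)
    >>> browser.open("http://www.google.com")
    >>> browser.select_form(nr=0)
    >>> browser["q"] = "Python"
    >>> response = browser.submit()
    >>> response.read()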
  • 24. Selenium
    • http://seleniumhq.org
    • Support for JavaScript
  • 25. Selenium
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.keys import Keys
    import time
  • 26. Selenium
    >>> browser = webdriver.Firefox()
    >>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    >>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
    >>> browser.close()
    >>> print a
    23
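    The imports on slide 25 (NoSuchElementException, time) suggest polling for elements that JavaScript inserts after the initial page load; a minimal sketch of such a helper (the function name and timeout are my own, not from the talk):
    import time
    from selenium.common.exceptions import NoSuchElementException

    def wait_for_element(browser, xpath, timeout=10):
        # Poll until the element shows up, or give up after `timeout` seconds.
        deadline = time.time() + timeout
        while True:
            try:
                return browser.find_element_by_xpath(xpath)
            except NoSuchElementException:
                if time.time() > deadline:
                    raise
                time.sleep(0.5)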
  • 27. PhantomJS
    • http://www.phantomjs.org/
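    PhantomJS is a headless WebKit browser scriptable in JavaScript; later Selenium releases also ship a PhantomJS driver (GhostDriver), so the Selenium code above can run without opening a browser window. A hedged sketch, assuming a Selenium version with that driver installed:
    >>> from selenium import webdriver
    >>> browser = webdriver.PhantomJS()
    >>> browser.get("http://www.wunderground.com/")
    >>> print browser.title
    >>> browser.quit()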