Overview of Python web scraping tools
A talk I gave at the Barcelona Python Meetup May 2012.

    Presentation Transcript

    • Overview of Python web scraping tools
      Maik Röder
      Barcelona Python Meetup Group
      17.05.2012
    • Data Scraping
      • Automated process
      • Explore and download pages
      • Grab content
      • Store in a database or in a text file
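    To make the process concrete, a minimal end-to-end sketch in Python 2 (the URL and filename are placeholders, not from the talk); the tools on the following slides cover each step in more depth:

      >>> import urllib2
      >>> html = urllib2.urlopen('http://example.com/').read()  # explore and download
      >>> open('page.txt', 'w').write(html)                     # store in a text file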
    • urlparse
      • Manipulate URL strings
      urlparse.urlparse()
      urlparse.urljoin()
      urlparse.urlunparse()
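    A minimal sketch of the three calls, assuming Python 2, where urlparse is a top-level module (the example URLs are illustrative):

      >>> import urlparse
      >>> urlparse.urlparse('http://example.com/path?q=1').netloc
      'example.com'
      >>> urlparse.urljoin('http://example.com/a/b', 'c')   # resolve a relative link
      'http://example.com/a/c'
      >>> urlparse.urlunparse(urlparse.urlparse('http://example.com/path?q=1'))
      'http://example.com/path?q=1'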
    • urllib
      • Download data through different protocols
      • HTTP, FTP, ...
      urllib.urlencode()
      urllib.urlopen()
      urllib.urlretrieve()
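    A minimal sketch of these helpers in Python 2 (URL and filename are illustrative):

      >>> import urllib
      >>> urllib.urlencode({'q': 'Python'})        # build a query string
      'q=Python'
      >>> data = urllib.urlopen('http://example.com/').read()
      >>> urllib.urlretrieve('http://example.com/', 'page.html')  # download to a file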
    • Scrape a web site
      • Example: http://www.wunderground.com/
    • Preparation
      >>> from StringIO import StringIO
      >>> from urllib2 import urlopen
      >>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
      >>> p = f.read()
      >>> d = StringIO(p)
      >>> f.close()
    • BeautifulSoup
      • HTML/XML parser
      • Designed for quick turnaround projects like screen-scraping
      • http://www.crummy.com/software/BeautifulSoup
    • BeautifulSoup
      from BeautifulSoup import *
      a = BeautifulSoup(d).findAll('a')
      [x['href'] for x in a]
    • Faster BeautifulSoup
      from BeautifulSoup import *
      p = SoupStrainer('a')
      a = BeautifulSoup(d, parseOnlyThese=p)
      [x['href'] for x in a]
    • Inspect the Element
      • Inspect the Maximum temperature
    • Find the node
      >>> from BeautifulSoup import BeautifulSoup
      >>> soup = BeautifulSoup(d)
      >>> attrs = {'class': 'nobr'}
      >>> nobrs = soup.findAll(attrs=attrs)
      >>> temperature = nobrs[3].span.string
      >>> print temperature
      23
    • htmllib.HTMLParser
      • Interesting only for historical reasons
      • Based on sgmllib
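    For the record, a minimal link-extraction sketch with the historical parser, assuming Python 2 (htmllib collects anchor targets in parser.anchorlist; p is the page source from the Preparation slide):

      >>> import htmllib, formatter
      >>> parser = htmllib.HTMLParser(formatter.NullFormatter())
      >>> parser.feed(p)
      >>> parser.close()
      >>> parser.anchorlist    # href values of all <a> tags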
    • html5lib
      • Using the custom simpletree format
      • A built-in DOM-ish tree type (pythonic idioms)
      from html5lib import parse
      from html5lib import treebuilders
      e = treebuilders.simpletree.Element
      i = parse(d)
      a = [x for x in i if isinstance(x, e) and x.name == 'a']
      [x.attributes['href'] for x in a]
    • lxml
      • Library for processing XML and HTML
      • Based on C libraries:
        sudo aptitude install libxml2-dev
        sudo aptitude install libxslt-dev
      • Extends the ElementTree API
      • e.g. with XPath
    • lxml
      from lxml import etree
      t = etree.parse('t.xml')
      for node in t.xpath('//a'):
          node.tag
          node.get('href')
          node.items()
          node.text
          node.getparent()
    • twill
      • Simple
      • No JavaScript
      • http://twill.idyll.org
      • Some more interesting concepts:
        • Pages, Scenarios
        • State Machines
    • twill
      • Commonly used methods:
      go()
      code()
      show()
      showforms()
      formvalue() (or fv())
      submit()
    • Twill
      >>> from twill import commands as twill
      >>> from twill import get_browser
      >>> twill.go('http://www.google.com')
      >>> twill.showforms()
      >>> twill.formvalue(1, 'q', 'Python')
      >>> twill.showforms()
      >>> twill.submit()
      >>> get_browser().get_html()
    • Twill - acknowledge_equiv_refresh
      >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
      ...
      twill.errors.TwillException: infinite refresh loop discovered; aborting.
      Try turning off acknowledge_equiv_refresh...
    • Twill
      >>> twill.config("acknowledge_equiv_refresh", "false")
      >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
      ==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
    • mechanize
      • Stateful programmatic web browsing
      • Navigation history
      • HTML form state
      • Cookies
      • ftp:, http: and file: URL schemes
      • Redirections
      • Proxies
      • Basic and Digest HTTP authentication
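    A minimal sketch of stateful form handling with mechanize, reusing the Google search form from the twill example (the form index and field name are assumptions about that page, not from the talk):

      >>> import mechanize
      >>> browser = mechanize.Browser()
      >>> browser.open('http://www.google.com')
      >>> browser.select_form(nr=0)     # first form on the page
      >>> browser['q'] = 'Python'       # fill the search field
      >>> response = browser.submit()
      >>> browser.back()                # navigation history is kept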
    • mechanize - robots.txt
      >>> import mechanize
      >>> browser = mechanize.Browser()
      >>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
      mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
    • mechanize - robots.txt
      • Do not handle robots.txt
        browser.set_handle_robots(False)
      • Do not handle equiv
        browser.set_handle_equiv(False)
      browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
    • Selenium
      • http://seleniumhq.org
      • Support for JavaScript
    • Selenium
      from selenium import webdriver
      from selenium.common.exceptions import NoSuchElementException
      from selenium.webdriver.common.keys import Keys
      import time
    • Selenium
      >>> browser = webdriver.Firefox()
      >>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
      >>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
      >>> browser.close()
      >>> print a
      23
    • PhantomJS
      • http://www.phantomjs.org/
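    PhantomJS is a headless WebKit browser with JavaScript support. One way to drive it from Python is through the Selenium WebDriver interface; a minimal sketch, assuming a phantomjs binary on the PATH and a Selenium release with PhantomJS support (which landed after this talk, via GhostDriver):

      from selenium import webdriver

      browser = webdriver.PhantomJS()     # headless, JavaScript-capable
      browser.get('http://example.com/')  # illustrative URL
      print browser.title
      browser.quit()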