Overview of Python web scraping tools

A talk I gave at the Barcelona Python Meetup May 2012.

Transcript

  • 1. Overview of Python web scraping tools
    Maik Röder
    Barcelona Python Meetup Group
    17.05.2012
  • 2. Data Scraping
    • Automated process
    • Explore and download pages
    • Grab content
    • Store in a database or in a text file
  • 3. urlparse
    • Manipulate URL strings
      urlparse.urlparse()
      urlparse.urljoin()
      urlparse.urlunparse()
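    A minimal sketch of these three calls (Python 2's urlparse module; in Python 3 the same functions live in urllib.parse, and the example URL here is illustrative):
    >>> import urlparse
    >>> parts = urlparse.urlparse("http://www.wunderground.com/history/?q=BCN")
    >>> parts.netloc
    'www.wunderground.com'
    >>> urlparse.urljoin("http://www.wunderground.com/history/", "airport/BCN")
    'http://www.wunderground.com/history/airport/BCN'
    >>> urlparse.urlunparse(parts)
    'http://www.wunderground.com/history/?q=BCN'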
  • 4. urllib
    • Download data through different protocols
    • HTTP, FTP, ...
      urllib.urlencode()
      urllib.urlopen()
      urllib.urlretrieve()
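    For illustration, a hedged sketch of these calls (Python 2; the query parameters and file name are illustrative, not from the talk):
    >>> import urllib
    >>> params = urllib.urlencode({"query": "Barcelona"})
    >>> f = urllib.urlopen("http://www.wunderground.com/?" + params)
    >>> data = f.read()
    >>> f.close()
    >>> urllib.urlretrieve("http://www.wunderground.com/favicon.ico", "favicon.ico")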
  • 5. Scrape a web site
    • Example: http://www.wunderground.com/
  • 6. Preparation
    >>> from StringIO import StringIO
    >>> from urllib2 import urlopen
    >>> f = urlopen("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    >>> p = f.read()
    >>> d = StringIO(p)
    >>> f.close()
  • 7. BeautifulSoup
    • HTML/XML parser
    • Designed for quick turnaround projects like screen-scraping
    • http://www.crummy.com/software/BeautifulSoup
  • 8. BeautifulSoup
    from BeautifulSoup import BeautifulSoup
    a = BeautifulSoup(d).findAll("a")
    [x["href"] for x in a]
  • 9. Faster BeautifulSoup
    from BeautifulSoup import BeautifulSoup, SoupStrainer
    p = SoupStrainer("a")
    a = BeautifulSoup(d, parseOnlyThese=p)
    [x["href"] for x in a]
  • 10. Inspect the Element
    • Inspect the Maximum temperature
  • 11. Find the node
    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup(d)
    >>> attrs = {"class": "nobr"}
    >>> nobrs = soup.findAll(attrs=attrs)
    >>> temperature = nobrs[3].span.string
    >>> print temperature
    23
  • 12. htmllib.HTMLParser
    • Interesting only for historical reasons
    • Based on sgmllib
  • 13. html5lib
    • Uses the custom simpletree format
    • A built-in DOM-ish tree type (pythonic idioms)
    from html5lib import parse
    from html5lib import treebuilders
    e = treebuilders.simpletree.Element
    i = parse(d)
    a = [x for x in i if isinstance(x, e) and x.name == "a"]
    [x.attributes["href"] for x in a]
  • 14. lxml
    • Library for processing XML and HTML
    • Based on C libraries
      sudo aptitude install libxml2-dev
      sudo aptitude install libxslt-dev
    • Extends the ElementTree API
    • e.g. with XPath
  • 15. lxml
    from lxml import etree
    t = etree.parse("t.xml")
    for node in t.xpath("//a"):
        node.tag
        node.get("href")
        node.items()
        node.text
        node.getparent()
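    As a sketch, the temperature lookup from the BeautifulSoup slides can also be done with lxml's HTML parser and XPath (assuming p still holds the page source from the Preparation slide; the class name and index follow the earlier example):
    >>> from lxml import html
    >>> doc = html.fromstring(p)
    >>> nobrs = doc.xpath('//*[@class="nobr"]')
    >>> print nobrs[3].findtext("span")
    23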
  • 16. twill
    • Simple
    • No JavaScript
    • http://twill.idyll.org
    • Some more interesting concepts
      • Pages, Scenarios
      • State Machines
  • 17. twill
    • Commonly used methods:
      go()
      code()
      show()
      showforms()
      formvalue() (or fv())
      submit()
  • 18. twill
    >>> from twill import commands as twill
    >>> from twill import get_browser
    >>> twill.go("http://www.google.com")
    >>> twill.showforms()
    >>> twill.formvalue(1, "q", "Python")
    >>> twill.showforms()
    >>> twill.submit()
    >>> get_browser().get_html()
  • 19. twill - acknowledge_equiv_refresh
    >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    ...
    twill.errors.TwillException: infinite refresh loop discovered; aborting.
    Try turning off acknowledge_equiv_refresh...
  • 20. twill
    >>> twill.config("acknowledge_equiv_refresh", "false")
    >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    ==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
    http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
  • 21. mechanize
    • Stateful programmatic web browsing
      • Navigation history
      • HTML form state
      • Cookies
      • ftp:, http: and file: URL schemes
      • Redirections
      • Proxies
      • Basic and Digest HTTP authentication
  • 22. mechanize - robots.txt
    >>> import mechanize
    >>> browser = mechanize.Browser()
    >>> browser.open("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
  • 23. mechanize - robots.txt
    • Do not handle robots.txt
      browser.set_handle_robots(False)
    • Do not handle equiv
      browser.set_handle_equiv(False)
      browser.open("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
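    mechanize drives forms much like twill; a minimal sketch against the Google search form from the earlier twill slide (form index and field name taken from that slide):
    >>> browser = mechanize.Browser()
    >>> browser.set_handle_robots(False)
    >>> browser.open("http://www.google.com")
    >>> browser.select_form(nr=0)
    >>> browser["q"] = "Python"
    >>> response = browser.submit()
    >>> response.read()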
  • 24. Selenium
    • http://seleniumhq.org
    • Support for JavaScript
  • 25. Selenium
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.keys import Keys
    import time
  • 26. Selenium
    >>> browser = webdriver.Firefox()
    >>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
    >>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
    >>> browser.close()
    >>> print a
    23
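    The imports on slide 25 (NoSuchElementException, time) suggest polling for elements that JavaScript inserts after the initial page load; a minimal sketch of such a helper (the function name and timeout are my own, not from the talk):
    import time
    from selenium.common.exceptions import NoSuchElementException

    def wait_for_element(browser, xpath, timeout=10):
        # Poll until the element shows up, or give up after `timeout` seconds.
        deadline = time.time() + timeout
        while True:
            try:
                return browser.find_element_by_xpath(xpath)
            except NoSuchElementException:
                if time.time() > deadline:
                    raise
                time.sleep(0.5)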
  • 27. PhantomJS
    • http://www.phantomjs.org/
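    PhantomJS is a headless WebKit browser scriptable in JavaScript; later Selenium releases also ship a PhantomJS driver (GhostDriver), so the Selenium code above can run without opening a browser window. A hedged sketch, assuming a Selenium version with that driver installed:
    >>> from selenium import webdriver
    >>> browser = webdriver.PhantomJS()
    >>> browser.get("http://www.wunderground.com/")
    >>> print browser.title
    >>> browser.quit()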