Overview of Python web scraping tools
A talk I gave at the Barcelona Python Meetup May 2012.

Transcript

  • 1. Overview of Python web scraping tools. Maik Röder, Barcelona Python Meetup Group, 17.05.2012
  • 2. Data Scraping • Automated process • Explore and download pages • Grab content • Store in a database or in a text file
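The four-step process on this slide can be sketched end to end in a few lines. This sketch is not from the talk: `fetch_page` is a stand-in for a real downloader (e.g. urllib), and the URL list and table schema are invented for illustration.

```python
import sqlite3

def fetch_page(url):
    # Stand-in for a real downloader (e.g. urllib.urlopen); returns canned HTML here.
    return "<html><body>content of %s</body></html>" % url

# Store the grabbed content in a small database, as the slide suggests
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT)")

for url in ["http://example.com/1", "http://example.com/2"]:
    body = fetch_page(url)  # explore and download pages
    conn.execute("INSERT INTO pages VALUES (?, ?)", (url, body))  # store content

conn.commit()
print(conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0])  # 2
```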
  • 3. urlparse • Manipulate URL strings: urlparse.urlparse(), urlparse.urljoin(), urlparse.urlunparse()
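The talk targets Python 2, where these helpers live in the `urlparse` module; in Python 3 the same functions moved to `urllib.parse`. A minimal sketch of the three calls listed on the slide:

```python
from urllib.parse import urlparse, urljoin, urlunparse  # the `urlparse` module in Python 2

parts = urlparse("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
print(parts.netloc)  # www.wunderground.com
print(parts.path)    # /history/airport/BCN/2007/5/17/DailyHistory.html

# Resolve a relative link against a base URL
print(urljoin("http://example.com/a/b.html", "../c.html"))  # http://example.com/c.html

# Rebuild the URL from its six parsed components
print(urlunparse(parts))
```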
  • 4. urllib • Download data through different protocols • HTTP, FTP, ... urllib.urlopen(), urllib.urlretrieve()
  • 5. Scrape a web site • Example: http://www.wunderground.com/
  • 6. Preparation >>> from StringIO import StringIO >>> from urllib2 import urlopen >>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html') >>> p = f.read() >>> d = StringIO(p) >>> f.close()
  • 7. BeautifulSoup • HTML/XML parser • Designed for quick turnaround projects like screen-scraping • http://www.crummy.com/software/BeautifulSoup
  • 8. BeautifulSoup from BeautifulSoup import * a = BeautifulSoup(d).findAll('a') [x['href'] for x in a]
  • 9. Faster BeautifulSoup from BeautifulSoup import * p = SoupStrainer('a') a = BeautifulSoup(d, parseOnlyThese=p) [x['href'] for x in a]
  • 10. Inspect the Element • Inspect the Maximum temperature
  • 11. Find the node >>> from BeautifulSoup import BeautifulSoup >>> soup = BeautifulSoup(d) >>> attrs = {'class': 'nobr'} >>> nobrs = soup.findAll(attrs=attrs) >>> temperature = nobrs[3].span.string >>> print temperature 23
  • 12. htmllib.HTMLParser • Interesting only for historical reasons • Based on sgmllib
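htmllib was removed in Python 3, but its sgmllib-style event interface survives in the standard library as `html.parser.HTMLParser`. A small sketch (not from the talk) of link extraction with it:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> start tags."""
    def __init__(self):
        super().__init__()  # HTMLParser.__init__(self) in Python 2
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p><a href="/one">one</a> <a href="/two">two</a></p>')
print(parser.links)  # ['/one', '/two']
```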
  • 13. html5lib • Using the custom simpletree format • A built-in DOM-ish tree type (pythonic idioms) from html5lib import parse from html5lib import treebuilders e = treebuilders.simpletree.Element i = parse(d) a = [x for x in i if isinstance(x, e) and x.name == 'a'] [x.attributes['href'] for x in a]
  • 14. lxml • Library for processing XML and HTML • Based on C libraries: sudo aptitude install libxml2-dev sudo aptitude install libxslt-dev • Extends the ElementTree API • e.g. with XPath
  • 15. lxml from lxml import etree t = etree.parse('t.xml') for node in t.xpath('//a'): node.tag node.get('href') node.items() node.text node.getparent()
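Because lxml extends the ElementTree API, the stdlib `xml.etree.ElementTree` can stand in for a quick try-out; note it supports only a limited XPath subset in `findall()` (e.g. `.//a`), not full expressions like lxml's `xpath()`. A sketch on an inline document (the markup here is invented for illustration):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<div><a href="/a">A</a><p><a href="/b">B</a></p></div>')

# findall() accepts a restricted XPath subset; './/a' matches all <a> descendants
for node in doc.findall(".//a"):
    print(node.tag, node.get("href"), node.text)
```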
  • 16. twill • Simple • No JavaScript • http://twill.idyll.org • Some more interesting concepts • Pages, Scenarios • State machines
  • 17. twill • Commonly used methods: go() code() show() showforms() formvalue() (or fv()) submit()
  • 18. Twill >>> from twill import commands as twill >>> from twill import get_browser >>> twill.go('http://www.google.com') >>> twill.showforms() >>> twill.formvalue(1, 'q', 'Python') >>> twill.showforms() >>> twill.submit() >>> get_browser().get_html()
  • 19. Twill - acknowledge_equiv_refresh >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html") ... twill.errors.TwillException: infinite refresh loop discovered; aborting. Try turning off acknowledge_equiv_refresh...
  • 20. Twill >>> twill.config("acknowledge_equiv_refresh", "false") >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html") ==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
  • 21. mechanize • Stateful programmatic web browsing • Navigation history • HTML form state • Cookies • ftp:, http: and file: URL schemes • Redirections • Proxies • Basic and Digest HTTP authentication
  • 22. mechanize - robots.txt >>> import mechanize >>> browser = mechanize.Browser() >>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html') mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
  • 23. mechanize - robots.txt • Do not handle robots.txt: browser.set_handle_robots(False) • Do not handle equiv: browser.set_handle_equiv(False) browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
  • 24. Selenium • http://seleniumhq.org • Support for JavaScript
  • 25. Selenium from selenium import webdriver from selenium.common.exceptions import NoSuchElementException from selenium.webdriver.common.keys import Keys import time
  • 26. Selenium >>> browser = webdriver.Firefox() >>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html") >>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text >>> browser.close() >>> print a 23
  • 27. PhantomJS • http://www.phantomjs.org/
