Overview of Python web scraping tools

A talk I gave at the Barcelona Python Meetup May 2012.

Published in: Technology, Design
0 Comments
16 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
16,973
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
225
Comments
0
Likes
16
Embeds 0
No embeds

No notes for slide

Overview of python web scraping tools


  1. Overview of Python web scraping tools
     Maik Röder, Barcelona Python Meetup Group, 17.05.2012
  2. Data Scraping
     • Automated process
     • Explore and download pages
     • Grab content
     • Store in a database or in a text file
  3. urlparse
     • Manipulate URL strings
       urlparse.urlparse()
       urlparse.urljoin()
       urlparse.urlunparse()
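     A minimal sketch of these three functions in a Python 2 session, using the example URL from the later slides:

       >>> import urlparse
       >>> u = urlparse.urlparse("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
       >>> u.netloc
       'www.wunderground.com'
       >>> urlparse.urlunparse(u)  # rebuild the string from its six parts
       'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'
       >>> urlparse.urljoin("http://www.wunderground.com/history/", "airport/BCN/2007/5/17/DailyHistory.html")
       'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'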
  4. urllib
     • Download data through different protocols
     • HTTP, FTP, ...
       urllib.quote()
       urllib.urlopen()
       urllib.urlretrieve()
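     A minimal Python 2 sketch of these helpers; the local filename is a hypothetical example:

       >>> import urllib
       >>> urllib.quote("Barcelona airport")  # percent-encode for use inside a URL
       'Barcelona%20airport'
       >>> f = urllib.urlopen("http://www.wunderground.com/")
       >>> html = f.read()
       >>> f.close()
       >>> # urlretrieve saves the resource straight to a local file
       >>> urllib.urlretrieve("http://www.wunderground.com/", "wunderground.html")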
  5. Scrape a web site
     • Example: http://www.wunderground.com/
  6. Preparation
     >>> from StringIO import StringIO
     >>> from urllib2 import urlopen
     >>> f = urlopen("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
     >>> p = f.read()
     >>> d = StringIO(p)
     >>> f.close()
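     The page is downloaded once into the string p; wrapping it in StringIO gives a file-like object d that the parsers on the following slides can read without hitting the network again.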
  7. BeautifulSoup
     • HTML/XML parser
     • Designed for quick turnaround projects like screen-scraping
     • http://www.crummy.com/software/BeautifulSoup
  8. BeautifulSoup
     from BeautifulSoup import *
     a = BeautifulSoup(d).findAll("a")   # all <a> elements
     [x["href"] for x in a]              # their href attributes
  9. Faster BeautifulSoup
     from BeautifulSoup import *
     p = SoupStrainer("a")               # parse only <a> tags
     a = BeautifulSoup(d, parseOnlyThese=p)
     [x["href"] for x in a]
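     The SoupStrainer tells BeautifulSoup to build objects only for the tags that match, skipping the rest of the document during parsing; on large pages this saves both time and memory.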
  10. Inspect the Element
      • Inspect the Maximum temperature
  11. Find the node
      >>> from BeautifulSoup import BeautifulSoup
      >>> soup = BeautifulSoup(d)
      >>> attrs = {"class": "nobr"}
      >>> nobrs = soup.findAll(attrs=attrs)
      >>> temperature = nobrs[3].span.string
      >>> print temperature
      23
  12. htmllib.HTMLParser
      • Interesting only for historical reasons
      • Based on sgmllib
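      For completeness, a minimal sketch of link extraction with the historical htmllib; it reuses the page string p from the Preparation slide, and the LinkParser class is a hypothetical name:

        import htmllib
        import formatter

        class LinkParser(htmllib.HTMLParser):
            def __init__(self):
                # NullFormatter: we only care about the tags, not about rendering
                htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
                self.links = []

            def start_a(self, attrs):
                # called for every <a>; attrs is a list of (name, value) pairs
                self.links.extend(v for k, v in attrs if k == "href")

        parser = LinkParser()
        parser.feed(p)
        parser.close()
        print parser.links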
  13. html5lib
      • Using the custom simpletree format
      • A built-in DOM-ish tree type (pythonic idioms)
      from html5lib import parse
      from html5lib import treebuilders
      e = treebuilders.simpletree.Element
      i = parse(d)
      a = [x for x in i if isinstance(x, e) and x.name == "a"]
      [x.attributes["href"] for x in a]
  14. lxml
      • Library for processing XML and HTML
      • Based on C libraries:
        sudo aptitude install libxml2-dev
        sudo aptitude install libxslt-dev
      • Extends the ElementTree API
      • e.g. with XPath
  15. lxml
      from lxml import etree
      t = etree.parse("t.xml")
      for node in t.xpath("//a"):
          node.tag
          node.get("href")
          node.items()
          node.text
          node.getparent()
  16. twill
      • Simple
      • No JavaScript
      • http://twill.idyll.org
      • Some more interesting concepts
        • Pages, Scenarios
        • State Machines
  17. twill
      • Commonly used methods:
        go()
        code()
        show()
        showforms()
        formvalue() (or fv())
        submit()
  18. Twill
      >>> from twill import commands as twill
      >>> from twill import get_browser
      >>> twill.go("http://www.google.com")
      >>> twill.showforms()
      >>> twill.formvalue(1, "q", "Python")
      >>> twill.showforms()
      >>> twill.submit()
      >>> get_browser().get_html()
  19. Twill - acknowledge_equiv_refresh
      >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
      ...
      twill.errors.TwillException: infinite refresh loop discovered; aborting.
      Try turning off acknowledge_equiv_refresh...
  20. Twill
      >>> twill.config("acknowledge_equiv_refresh", "false")
      >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
      ==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
  21. mechanize
      • Stateful programmatic web browsing
      • Navigation history
      • HTML form state (see the form-filling sketch after slide 23)
      • Cookies
      • ftp:, http: and file: URL schemes
      • Redirections
      • Proxies
      • Basic and Digest HTTP authentication
  22. mechanize - robots.txt
      >>> import mechanize
      >>> browser = mechanize.Browser()
      >>> browser.open("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
      mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
  23. mechanize - robots.txt
      • Do not handle robots.txt
        browser.set_handle_robots(False)
      • Do not handle equiv
        browser.set_handle_equiv(False)
        browser.open("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
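      A minimal sketch of the stateful form handling from slide 21, assuming a hypothetical page whose first form has a text field named "query"; cookies, redirects and history are carried along automatically:

        import mechanize

        browser = mechanize.Browser()
        browser.set_handle_robots(False)   # as above
        browser.set_handle_equiv(False)
        browser.open("http://www.wunderground.com/")
        browser.select_form(nr=0)          # pick the first form on the page
        browser["query"] = "Barcelona"     # hypothetical field name
        response = browser.submit()
        print response.geturl()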
  24. Selenium
      • http://seleniumhq.org
      • Support for JavaScript
  25. Selenium
      from selenium import webdriver
      from selenium.common.exceptions import NoSuchElementException
      from selenium.webdriver.common.keys import Keys
      import time
  26. Selenium
      >>> browser = webdriver.Firefox()
      >>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
      >>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
      >>> browser.close()
      >>> print a
      23
  27. PhantomJS
      • http://www.phantomjs.org/
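      PhantomJS is a headless WebKit browser scripted in JavaScript; later Selenium releases can also drive it from Python. A hedged sketch, assuming a Selenium version with PhantomJS support and the phantomjs binary on the PATH:

        from selenium import webdriver

        # same WebDriver API as the Firefox example, but without a visible window
        browser = webdriver.PhantomJS()
        browser.get("http://www.wunderground.com/")
        print browser.title
        browser.quit()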