Overview of Python web scraping tools

                                   Maik Röder
                         Barcelona Python Meetup Group
                                   17.05.2012




Friday, May 18, 2012
Data Scraping

                       • Automated Process
                        • Explore and download pages
                        • Grab content
                        • Store in a database or in a text file
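
The steps listed above can be sketched end to end with the standard library alone. This is only an illustration under stated assumptions: it targets Python 3 (the slides use Python 2), the `pages` dictionary is a stand-in for downloaded pages so nothing touches the network, and the names `TitleGrabber`, `grab` and `store` are hypothetical helpers, not part of any tool covered in this talk.

```python
import sqlite3
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def grab(html):
    """Grab content: here, just the page title."""
    parser = TitleGrabber()
    parser.feed(html)
    return parser.title.strip()


def store(conn, url, title):
    """Store one scraped record in a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")
    conn.execute("INSERT INTO pages VALUES (?, ?)", (url, title))
    conn.commit()


# Stand-in for downloaded pages (hypothetical URLs and markup).
pages = {
    "http://example.com/a": "<html><head><title> Page A </title></head></html>",
    "http://example.com/b": "<html><head><title>Page B</title></head></html>",
}

conn = sqlite3.connect(":memory:")
for url, html in pages.items():
    store(conn, url, grab(html))

rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
print(rows)
```

Swapping the in-memory database for a file path, or for a plain text file, changes only the `store` step.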

urlparse

                       • Manipulate URL strings
                         urlparse.urlparse()
                         urlparse.urljoin()
                         urlparse.urlunparse()
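
A short sketch of the three calls, assuming Python 3, where the `urlparse` module was renamed to `urllib.parse` (in Python 2 the import would be `from urlparse import ...`). The relative link `../18/DailyHistory.html` is a made-up example, not taken from the site.

```python
from urllib.parse import urljoin, urlparse, urlunparse

url = "http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html"

# Split the URL into its six components.
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Resolve a relative link against the page it was found on.
next_day = urljoin(url, "../18/DailyHistory.html")

# Rebuild a URL from its components after changing one of them.
rebuilt = urlunparse(
    parts._replace(path="/history/airport/BCN/2007/5/18/DailyHistory.html")
)
print(next_day)
print(rebuilt)
```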




urllib

                   • Download data through different protocols
                   • HTTP, FTP, ...
                       urllib.parse()
                       urllib.urlopen()
                       urllib.urlretrieve()
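
A minimal sketch of `urlopen()` and `urlretrieve()`, assuming Python 3, where both live in `urllib.request`. To keep it runnable without network access it uses a `file:` URL pointing at a temporary file; the file names are hypothetical, and an `http:` URL would be used the same way.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen, urlretrieve

with tempfile.TemporaryDirectory() as tmp:
    # Create a local page and address it through a file: URL.
    source = Path(tmp) / "page.html"
    source.write_text("<html><body>hello</body></html>", encoding="utf-8")
    url = source.as_uri()

    # urlopen() returns a file-like response object.
    with urlopen(url) as response:
        body = response.read().decode("utf-8")

    # urlretrieve() copies the resource to a local file.
    target = Path(tmp) / "copy.html"
    local_name, headers = urlretrieve(url, str(target))
    copied = target.read_text(encoding="utf-8")

print(body)
```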



Scrape a web site
                       • Example: http://www.wunderground.com/




Preparation
          >>> from StringIO import StringIO
          >>> from urllib2 import urlopen
          >>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
          >>> p = f.read()
          >>> d = StringIO(p)
          >>> f.close()
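
One detail worth keeping in mind when the same buffer `d` is handed to several parsers in turn: a file-like object is consumed by reading, so it has to be rewound first. A small sketch, assuming Python 3, where the `StringIO` module became `io` and downloaded data arrives as bytes; the page content here is a stand-in.

```python
from io import BytesIO

# Stand-in for the bytes returned by f.read() on a downloaded page.
p = b"<html><body><a href='/a'>A</a></body></html>"
d = BytesIO(p)

first = d.read()   # consumes the buffer
empty = d.read()   # nothing left: the position is at the end
d.seek(0)          # rewind before passing d to another parser
second = d.read()

print(len(first), len(empty), len(second))
```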


BeautifulSoup

                       • HTML/XML parser
                       • designed for quick turnaround projects like
                         screen-scraping
                       • http://www.crummy.com/software/BeautifulSoup



BeautifulSoup

          from BeautifulSoup import *
          a = BeautifulSoup(d).findAll('a')
          [x['href'] for x in a]




Faster BeautifulSoup
         from BeautifulSoup import *
         p = SoupStrainer('a')
         a = BeautifulSoup(d, parseOnlyThese=p)
         [x['href'] for x in a]




Inspect the Element
                       • Inspect the Maximum temperature




Find the node
   >>> from BeautifulSoup import BeautifulSoup
   >>> soup = BeautifulSoup(d)
   >>> attrs = {'class':'nobr'}
   >>> nobrs = soup.findAll(attrs=attrs)
   >>> temperature = nobrs[3].span.string
   >>> print temperature
   23


htmllib.HTMLParser


                       • Interesting only for historical reasons
                       • based on sgmllib
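
For readers on Python 3, where `htmllib` and `sgmllib` no longer exist, the event-driven parser that remains in the standard library is `html.parser.HTMLParser`. The sketch below is a swapped-in illustration of that module, not the `htmllib` API; the `LinkCollector` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


html = (
    '<p><a href="/history/">History</a> '
    '<a name="top">Top</a> '
    '<a href="/maps/">Maps</a></p>'
)
collector = LinkCollector()
collector.feed(html)
print(collector.hrefs)
```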


html5lib
  • Using the custom simpletree format
    • a built-in DOM-ish tree type (pythonic idioms)
          from html5lib import parse
          from html5lib import treebuilders
          e = treebuilders.simpletree.Element
          i = parse(d)
          # iterate over the parsed tree i, not the raw buffer d
          a = [x for x in i if isinstance(x, e)
               and x.name == 'a']
          [x.attributes['href'] for x in a]


lxml
            • Library for processing XML and HTML
            • Based on C libraries
              sudo aptitude install libxml2-dev
              sudo aptitude install libxslt-dev

            • Extends the ElementTree API
             • e.g. with XPath


lxml

                       from lxml import etree
                       t = etree.parse('t.xml')
                       for node in t.xpath('//a'):
                           node.tag
                           node.get('href')
                           node.items()
                           node.text
                           node.getparent()
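
Since lxml extends the ElementTree API, most of these calls can be tried with the standard library's `xml.etree.ElementTree`. The sketch below uses that module rather than lxml, with an inline document instead of `t.xml` (both are assumptions for illustration). ElementTree supports only a subset of XPath and has no parent accessor, so the parent lookup is emulated with a hypothetical `parent_of` map.

```python
import xml.etree.ElementTree as ET

xml = """<root>
  <p><a href="/one" class="x">One</a></p>
  <p><a href="/two">Two</a></p>
</root>"""

root = ET.fromstring(xml)

# ElementTree elements do not know their parent; build a child-to-parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

links = []
for node in root.findall(".//a"):   # limited XPath: all <a> descendants
    links.append((node.tag, node.get("href"), node.text, parent_of[node].tag))

# items() returns the attributes as (name, value) pairs.
first_items = root.find(".//a").items()

print(links)
print(first_items)
```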



twill
                       • Simple
                       • No JavaScript
                       • http://twill.idyll.org
                       • Some more interesting concepts
                        • Pages, Scenarios
                        • State Machines

twill

                       • Commonly used methods:
                         go()
                         code()
                         show()
                         showforms()
                         formvalue() (or fv())
                         submit()



Twill

        >>> from twill import commands as twill
        >>> from twill import get_browser
        >>> twill.go('http://www.google.com')
        >>> twill.showforms()
        >>> twill.formvalue(1, 'q', 'Python')
        >>> twill.showforms()
        >>> twill.submit()
        >>> get_browser().get_html()

Twill - acknowledge_equiv_refresh
                >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
                ...
                twill.errors.TwillException: infinite refresh loop discovered; aborting.
                Try turning off acknowledge_equiv_refresh...


Twill
       >>> twill.config("acknowledge_equiv_refresh", "false")
       >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
       ==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
       'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'


mechanize
                       • Stateful programmatic web browsing
                       • navigation history
                       • HTML form state
                       • cookies
                       • ftp:, http: and file: URL schemes
                       • redirections
                       • proxies
                       • Basic and Digest HTTP authentication
mechanize - robots.txt
            >>> import mechanize
            >>> browser = mechanize.Browser()
            >>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
            mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
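
To see what such a robots.txt refusal is based on, the rules can be evaluated by hand with the standard library. This sketch assumes Python 3 (`urllib.robotparser`; the Python 2 module was `robotparser`) and uses hypothetical rules and an example.com host. It does not reproduce the real robots.txt of any site, and it is not how mechanize implements the check internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, not the real file of any site.
rules = """\
User-agent: *
Disallow: /history/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A path under /history/ is disallowed for every user agent.
blocked = parser.can_fetch(
    "*", "http://www.example.com/history/airport/BCN/2007/5/17/DailyHistory.html"
)
# The site root matches no Disallow rule, so it may be fetched.
allowed = parser.can_fetch("*", "http://www.example.com/")

print(blocked, allowed)
```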


mechanize - robots.txt
      • Do not handle robots.txt
        browser.set_handle_robots(False)

      • Do not handle equiv
        browser.set_handle_equiv(False)
        browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')


Selenium


                       • http://seleniumhq.org
                       • Support for JavaScript


Selenium

          from selenium import webdriver
          from selenium.common.exceptions import NoSuchElementException
          from selenium.webdriver.common.keys import Keys
          import time



Selenium
          >>> browser = webdriver.Firefox()
          >>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
          >>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
          >>> browser.close()
          >>> print a
          23
Phantom JS


                       • http://www.phantomjs.org/


