Web Scraping with Python




               Miguel Miranda de Mattos
                      :@mmmattos - mmmattos.net
                            Porto Alegre, Brazil.
                                          2012
Web Scraping with Python

     ● Tools:

       ○ BeautifulSoup

       ○ Mechanize
BeautifulSoup
 An HTML/XML parser for Python that can turn even invalid
markup into a parse tree. It provides simple, idiomatic ways
of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.

● In Summary:
   ○ Navigate the "soup" of HTML/XML tags,
     programmatically

   ○ Access a tag's properties and values

   ○ Search for tags and their attributes.
BeautifulSoup
  ○ Example:

      from BeautifulSoup import BeautifulSoup
      doc = "<html><h1>Heading</h1><p>Text"
      soup = BeautifulSoup(doc)
      print soup.prettify()

      # <html>
      # <h1>
      # Heading
      # </h1>
      # <p>
      # Text
      # </p>
      # </html>
BeautifulSoup

  ○ Searching / Looking for things
    ■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
           'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
           'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
           'findPreviousSiblings'

      ■ findAll
        ● findAll(self, name=None, attrs={}, recursive=True,
                text=None, limit=None, **kwargs)

           ●       Extracts a list of Tag objects that match the given
                   criteria. You can specify the name of the Tag and any
                   attributes you want the Tag to have.
BeautifulSoup
● Example:
  >>> from BeautifulSoup import BeautifulSoup
  >>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
  >>> docSoup = BeautifulSoup(doc)

  >>> print docSoup.findAll('tr')
  [<tr><td>one</td><td>two</td></tr>]

  >>> print docSoup.findAll('td')
  [<td>one</td>, <td>two</td>]
BeautifulSoup
● findAll (cont'd.):
   >>> for t in docSoup.findAll('td'):
   ...     print t

   <td>one</td>
   <td>two</td>

   >>> for t in docSoup.findAll('td'):
   ...     print t.getText()

   one
   two
BeautifulSoup
●   findAll using attributes to qualify:
    >>> soup.findAll('div', attrs={'class': 'Menus'})
    [<div class="Menus">musicMenu</div>, <div class="Menus">videoMenu</div>]

●   For more options:
    ○   dir (BeautifulSoup)
    ○   help (yourSoup.<command>)

●   Use BeautifulSoup rather than regexp patterns. Replace:
        patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
        re.findall(patFinderTitle, html)
    ○   with:
        soup = BeautifulSoup(html)
        # title=True keeps only the anchors that have a title attribute
        for tag in soup.findAll('a', title=True):
            print tag['title']
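The advice above can be illustrated with a small sketch. It assumes Python 3 and uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so that it runs without extra packages. The sample markup and the names `TitleCollector`, `titles_with_regex`, and `titles_with_parser` are hypothetical. The point is that the regular expression depends on how the attribute is quoted, while a parser does not.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup: the second anchor uses single quotes
# around its title value, which the regex below does not expect.
HTML = (
    '<a href="/music" title="Music">music</a>'
    "<a title='Video' href=\"/video\">video</a>"
)

# Regex in the style of the slide: assumes double quotes around the value.
TITLE_PATTERN = re.compile(r'<a[^>]*\stitle="(.*?)"')


class TitleCollector(HTMLParser):
    """Collects the title attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)


def titles_with_regex(html):
    """Return the titles the regular expression manages to find."""
    return TITLE_PATTERN.findall(html)


def titles_with_parser(html):
    """Return the titles found by walking the parsed start tags."""
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    return collector.titles


print(titles_with_regex(HTML))   # ['Music']
print(titles_with_parser(HTML))  # ['Music', 'Video']
```

The regex misses the second link only because of its quoting style; the parser-based version sees both, which is the same benefit BeautifulSoup's findAll gives.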
Mechanize
● Stateful programmatic web browsing in Python, after
   Andy Lester’s Perl module.
    ●   mechanize.Browser and mechanize.UserAgentBase implement the
        interface of urllib2.OpenerDirector, so:
          ○ any URL can be opened, not just http:
          ○ mechanize.UserAgentBase offers easy dynamic configuration of
            user-agent features like protocol, cookie, redirection and
            robots.txt handling, without having to make a new
            OpenerDirector each time, e.g. by calling build_opener().
    ●   Easy HTML form filling.
    ●   Convenient link parsing and following.
    ●   Browser history (.back() and .reload() methods).
    ●   The Referer HTTP header is added properly (optional).
    ●   Automatic observance of robots.txt.
    ●   Automatic handling of HTTP-Equiv and Refresh.
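To make "observance of robots.txt" concrete, here is a minimal sketch of the kind of check involved. It assumes Python 3 and uses the standard library's `urllib.robotparser`; it is not Mechanize's own implementation. The robots.txt content, the `example-bot` user agent, and the helper names `build_rules` and `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the target site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(rules, url, user_agent="example-bot"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return rules.can_fetch(user_agent, url)


rules = build_rules(ROBOTS_TXT)
print(is_allowed(rules, "http://www.example.com/index.html"))           # True
print(is_allowed(rules, "http://www.example.com/private/report.html"))  # False
```

A scraper that performs this check before every request behaves the way the bullet above describes, without the caller having to remember to do it.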
Mechanize
● Navigation commands:
  ○ open(url)
  ○ follow_link(link)
  ○ back()
  ○ submit()
  ○ reload()
● Examples
   import mechanize

   br = mechanize.Browser()
   br.open("http://www.python.org/")
   gothtml = br.response().read()
   for link in br.links(url_regex="python.org"):
       print link
       br.follow_link(link) # takes EITHER Link instance OR keyword args
       br.back()
Mechanize
● Example:
     import re
     import mechanize

     br = mechanize.Browser()
     br.open("http://www.example.com/")

     # follow second link with element text matching
     # regular expression
     response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)

     assert br.viewing_html()
     print br.title()
     print response1.geturl()
     print response1.info() # headers
     print response1.read() # body
Mechanize
● Example: Combining Mechanize and BeautifulSoup
     import mechanize
     from BeautifulSoup import BeautifulSoup

     url = "http://www.hp.com"
     br = mechanize.Browser()
     br.open(url)
     assert br.viewing_html()
     html = br.response().read()
     result_soup = BeautifulSoup(html)

     # collect every <div> in the page and print each one
     found_divs = result_soup.findAll('div')
     print "Found " + str(len(found_divs))
     for d in found_divs:
         print d
Mechanize
● Example: Combining Mechanize and BeautifulSoup (cont'd.)
     import mechanize
     from BeautifulSoup import BeautifulSoup

     url = "http://www.hp.com"
     br = mechanize.Browser()
     br.open(url)
     assert br.viewing_html()
     html = br.response().read()
     result_soup = BeautifulSoup(html)

     # print the class attribute of every <div> that has one
     found_divs = result_soup.findAll('div')
     print "Found " + str(len(found_divs))
     for d in found_divs:
         if d.has_key('class'):
             print d['class']
