Web Scraping with Python

An introduction to crawling sites and extracting content from unstructured data on the web, using the Python programming language and some existing Python modules.

Published in: Technology

  1. Web Scraping with Python
     Miguel Miranda de Mattos (@mmmattos - mmmattos.net)
     Porto Alegre, Brazil. 2012
  2. Web Scraping with Python
     ● Tools:
        ○ BeautifulSoup
        ○ Mechanize
  3. BeautifulSoup
     An HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
     ● In summary:
        ○ Navigate the "soup" of HTML/XML tags programmatically.
        ○ Access a tag's properties and values.
        ○ Search for tags and their attributes.
  4. BeautifulSoup
     ○ Example:

         from BeautifulSoup import BeautifulSoup

         doc = "<html><h1>Heading</h1><p>Text"
         soup = BeautifulSoup(doc)
         print soup.prettify()
         # <html>
         #  <h1>
         #   Heading
         #  </h1>
         #  <p>
         #   Text
         #  </p>
         # </html>
  5. BeautifulSoup
     ○ Searching / looking for things
        ■ find, findAll, findAllNext, findAllPrevious, findChild, findChildren, findNext, findNextSibling, findNextSiblings, findParent, findParents, findPrevious, findPreviousSibling, findPreviousSiblings
        ■ findAll
           ● findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
           ● Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.
  6. BeautifulSoup
     ● Example:

         >>> from BeautifulSoup import BeautifulSoup
         >>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
         >>> docSoup = BeautifulSoup(doc)
         >>> print docSoup.findAll('tr')
         [<tr><td>one</td><td>two</td></tr>]
         >>> print docSoup.findAll('td')
         [<td>one</td>, <td>two</td>]
  7. BeautifulSoup
     ● findAll (cont'd.):

         >>> for t in docSoup.findAll('td'):
         ...     print t
         <td>one</td>
         <td>two</td>
         >>> for t in docSoup.findAll('td'):
         ...     print t.getText()
         one
         two
  8. BeautifulSoup
     ● findAll using attributes to qualify:

         >>> soup.findAll('div', attrs={'class': 'Menus'})
         [<div class="Menus">musicMenu</div>, <div class="Menus">videoMenu</div>]

     ● For more options:
        ○ dir(BeautifulSoup)
        ○ help(yourSoup.<command>)
     ● Use BeautifulSoup rather than regexp patterns. Replace:

         # matches only double-quoted title attributes inside <a> tags
         patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
         re.findall(patFinderTitle, html)

        ○ with:

         soup = BeautifulSoup(html)
         for tag in soup.findAll('a'):
             print tag['title']
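The advice on slide 8 to prefer a parser over regexp patterns can be illustrated with a minimal sketch. It assumes Python 3 and swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without third-party packages; the `TitleCollector` class and the sample `HTML` string are hypothetical and are not taken from the slides.

```python
import re
from html.parser import HTMLParser

# Two anchors a browser treats alike, written with different quoting and order.
HTML = (
    '<a href="/a" title="First">one</a>'
    "<a title='Second' href=\"/b\">two</a>"
)

# A fixed pattern that only recognises double-quoted title attributes.
pat_finder_title = re.compile(r'<a[^>]*\stitle="(.*?)"')
regex_titles = pat_finder_title.findall(HTML)


class TitleCollector(HTMLParser):
    """Hypothetical helper: records the title attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, quoting already removed.
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)


collector = TitleCollector()
collector.feed(HTML)
collector.close()

print(regex_titles)      # ['First']
print(collector.titles)  # ['First', 'Second']
```

The regexp silently misses the single-quoted attribute, while the parser reports both titles. A tree-building parser such as BeautifulSoup goes further than this sketch, since it also lets you navigate from each tag to its children and text.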
  9. Mechanize
     ● Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
     ● mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
        ○ any URL can be opened, not just http:
        ○ mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
     ● Easy HTML form filling.
     ● Convenient link parsing and following.
     ● Browser history (.back() and .reload() methods).
     ● The Referer HTTP header is added properly (optional).
     ● Automatic observance of robots.txt.
     ● Automatic handling of HTTP-Equiv and Refresh.
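To show what "observance of robots.txt" on slide 9 amounts to, here is a minimal sketch that assumes Python 3 and uses the standard library's `urllib.robotparser` instead of mechanize. The rules in `ROBOTS_TXT`, the `may_fetch` helper, and the `example-bot` user agent are hypothetical; the rules are parsed from a string so nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="example-bot"):
    """Return True if the parsed rules allow user_agent to request url."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("http://www.example.com/index.html"))         # True
print(may_fetch("http://www.example.com/private/data.html"))  # False
```

A scraper that checks each URL this way before requesting it does by hand what the slide describes mechanize doing automatically.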
  10. Mechanize
     ● Navigation commands:
        ○ open(url)
        ○ follow_link(link)
        ○ back()
        ○ submit()
        ○ reload()
     ● Examples:

         import mechanize

         br = mechanize.Browser()
         br.open("http://www.python.org/")  # the URL needs an explicit scheme
         gothtml = br.response().read()
         for link in br.links(url_regex="python.org"):
             print link
             br.follow_link(link)  # takes EITHER Link instance OR keyword args
             br.back()
  11. Mechanize
     ● Example:

         import re
         import mechanize

         br = mechanize.Browser()
         br.open("http://www.example.com/")
         # follow second link with element text matching
         # regular expression
         response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
         assert br.viewing_html()
         print br.title()
         print response1.geturl()
         print response1.info()  # headers
         print response1.read()  # body
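The link selection on slide 11 depends on how the regular expression behaves, which can be checked without a browser. The sketch below assumes Python 3 and that the pattern is applied to each link's text as a search rather than a full match; the `link_texts` list is hypothetical.

```python
import re

# Hypothetical link texts, in the order they might appear on a page.
link_texts = ["Home", "cheese shop", "cheeseshop", "Cheese Shop", "cheddar shop"]

# \s* allows any amount of whitespace, including none, between the two words.
pattern = re.compile(r"cheese\s*shop")

matches = [text for text in link_texts if pattern.search(text)]
print(matches)  # ['cheese shop', 'cheeseshop']

# Counting from zero, index 1 is the second matching link.
second_match = matches[1]
print(second_match)  # cheeseshop
```

The pattern is case-sensitive, so "Cheese Shop" is skipped; compiling with `re.IGNORECASE` would include it.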
  12. Mechanize
     ● Example: Combining Mechanize and BeautifulSoup

         import mechanize
         from BeautifulSoup import BeautifulSoup

         url = "http://www.hp.com"
         br = mechanize.Browser()
         br.open(url)
         assert br.viewing_html()
         html = br.response().read()
         # parse the fetched page and collect every <div>
         result_soup = BeautifulSoup(html)
         found_divs = result_soup.findAll('div')
         print "Found " + str(len(found_divs))
         for d in found_divs:
             print d
  13. Mechanize
     ● Example: Combining Mechanize and BeautifulSoup

         import mechanize
         from BeautifulSoup import BeautifulSoup

         url = "http://www.hp.com"
         br = mechanize.Browser()
         br.open(url)
         assert br.viewing_html()
         html = br.response().read()
         result_soup = BeautifulSoup(html)
         found_divs = result_soup.findAll('div')
         print "Found " + str(len(found_divs))
         for d in found_divs:
             # print the class attribute only for divs that have one
             if d.has_key('class'):
                 print d['class']