Parse the web
    using Python + Beautiful Soup




                     at ncucc
                 cwebb(dot)tw(at)gmail(d...
Agenda

•
• Python
• Beautiful Soup
Parse the web?
            but how?
Solutions

• C++
• Java
• Perl
• Python
• Others?
Solutions (Cont.)

•
• Regular expression
•        Parser
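A minimal Python 3 sketch of the "Regular expression" option listed above. It assumes a small inline `page` string (a made-up sample, not a downloaded page). The pattern is only illustrative; it breaks on nested or malformed markup, which is the usual argument for using a parser instead.

```python
import re

# Hypothetical sample page; a real one would be downloaded first.
page = "<html><head><title>page title</title></head><body><p>hi</p></body></html>"

# Non-greedy match of whatever sits between <title> and </title>.
match = re.search(r"<title>(.*?)</title>", page, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else None
print(title)
```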
So I decide...
Python + Beautiful Soup
Python

• high-level programming language
• scripting language
•         Google
•
•               {}
• list tuple dictionary
list
• a=['asdf',123,12.01,'abcd']
• a[3] (a[-1])
 • 'abcd'
• a[0:2] (a[:2])
 • ['asdf',123]
• b=['asdf',123,['qwer',...
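The indexing and slicing rules on this slide can be checked with a short Python 3 sketch. The variable names other than `a` and `b` are illustrative, and the nested list value follows the full slide text.

```python
a = ['asdf', 123, 12.01, 'abcd']

last = a[3]          # index 3 is the fourth item
also_last = a[-1]    # negative indexes count from the end
first_two = a[0:2]   # slice end is exclusive: items 0 and 1
same_slice = a[:2]   # a missing start defaults to 0

# Lists can hold other lists.
b = ['asdf', 123, ['qwer', 12.34]]
inner = b[2][0]

print(last, also_last, first_two, same_slice, inner)
```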
list (Cont.)
• a=['abc',12]
• len(a)
• #2
• a.append(1)
• #['abc',12,1]
• a.insert(1,'def')
• #['abc','def',12,1]
list (Cont.)
• a= [321,456,12,1]
• a.pop()
• #[321,456,12]
• a.index(12)
• #2
• a.sort()
• #[12,321,456]
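The list methods from the two "list (Cont.)" slides, run in sequence as a Python 3 sketch. The name `nums` is introduced here only to keep the two examples apart.

```python
a = ['abc', 12]
length = len(a)        # number of items
a.append(1)            # add to the end
a.insert(1, 'def')     # insert before index 1

nums = [321, 456, 12, 1]
popped = nums.pop()        # removes and returns the last item
position = nums.index(12)  # index of the first matching value
nums.sort()                # sorts in place, returns None

print(length, a, popped, position, nums)
```

Note that each call changes the list in place, so the order of the calls matters.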
tuple

• a=('asdf',123,12.01) or a= 'asdf',123,12.01
• a=(('abc',1),123.1)
• a,b=1,2
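A short Python 3 sketch of tuple packing and unpacking as listed above. The names `same`, `nested`, `x`, `y`, `name`, `count` and `score` are illustrative.

```python
a = ('asdf', 123, 12.01)
same = 'asdf', 123, 12.01      # parentheses are optional

nested = (('abc', 1), 123.1)   # tuples can be nested

x, y = 1, 2                    # unpack into two names
x, y = y, x                    # swap without a temporary variable

(name, count), score = nested  # unpacking follows the nesting

print(a == same, x, y, name, count, score)
```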
Dictionary

• a={123:'abc','cde':456}
• a[123]
• #'abc'
• a['cde']
• #456
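The dictionary lookups above, plus two common follow-ups, as a Python 3 sketch. The `'nope'` and `'new'` keys are made up for illustration.

```python
a = {123: 'abc', 'cde': 456}

first = a[123]        # keys can be numbers
second = a['cde']     # or strings

# get() returns a fallback instead of raising KeyError.
missing = a.get('nope', 'default')

a['new'] = 789        # assignment adds or replaces a key
keys = list(a)        # keys in insertion order (Python 3.7+)

print(first, second, missing, keys)
```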
if else
if a>10:
   print 'a>10'
elif a<5:
   print 'a<5'
else:
   print '5<=a<=10'
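The same branching, wrapped in a function so it can be tried on several values. This is a Python 3 sketch, so `print` is a function rather than a statement; the name `describe` is hypothetical.

```python
def describe(a):
    """Return a label for the range that a falls in."""
    if a > 10:
        return 'a>10'
    elif a < 5:
        return 'a<5'
    else:
        # Reached only when 5 <= a <= 10.
        return '5<=a<=10'

for value in (12, 3, 7):
    print(value, describe(value))
```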
while loop
while a>2 or a<3:
 pass
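A condition such as `a>2 or a<3` holds for every number, so a loop that uses it never ends unless the body changes `a` in a way that breaks out. A minimal Python 3 sketch of a loop that does terminate; the starting value and the `steps` counter are made up for illustration.

```python
a = 5
steps = 0

# The body moves a toward the exit condition on every pass.
while a > 2:
    a -= 1
    steps += 1

print(a, steps)
```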
for loop
a=['abc',123,'def']        abc
for x in a:                123
  print x                  def

                   ...
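A Python 3 sketch of the loops on this slide, including the `range` calls from the full slide text. The names `first` and `second` are illustrative.

```python
a = ['abc', 123, 'def']

# Iterate directly over the items of a list.
for x in a:
    print(x)

# range(stop) counts from 0 up to, but not including, stop.
first = list(range(3))

# range(start, stop, step) counts from start in steps of step.
second = list(range(4, 34, 10))

print(first, second)
```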
function
def fib(n):
 if n==0 or n==1:
    return n
 else:
    return fib(n-1)+fib(n-2)
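A runnable Python 3 sketch of the Fibonacci function, with the two recursive calls added together. The iterative `fib_iter` is not on the slide; it is included as an assumption about what one would use for larger `n`, since the recursive form repeats a lot of work.

```python
def fib(n):
    """Recursive Fibonacci: fib(0) == 0, fib(1) == 1."""
    if n == 0 or n == 1:
        return n
    return fib(n - 1) + fib(n - 2)


def fib_iter(n):
    """Iterative version (hypothetical helper): one pass, no recursion."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print([fib(i) for i in range(8)])
print(fib_iter(30))
```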
....
What is Beautiful Soup
                    not Beautiful Soap


• python module
• html/xml parser
• html/xml
•
Beautiful Soup
<html>
 <head>
  <title>
    page title
  </title>
 </head>
 <body>
  <p id="firstpara" align="c...
check urllib/urllib2 to see
                                           how to open a url in python

from BeautifulSoup imp...
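The slide uses the Beautiful Soup 3 import (`from BeautifulSoup import BeautifulSoup`) on Python 2. The sketch below assumes the newer `bs4` package on Python 3 instead, and parses an inline copy of the sample page (written without whitespace between tags) rather than opening a URL. In Python 3 the download step would go through `urllib.request.urlopen`.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample page, so no network access is needed.
page = (
    '<html><head><title>page title</title></head>'
    '<body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(page, 'html.parser')

print(soup.html.head)   # walk down from the root
print(soup.head)        # or jump to the first <head> directly
print(soup.body.p)      # the first <p> inside <body>
```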
(Cont.)
• parent         (go to parent node)

    soup.title.parent == soup.head

• next             (go to next node)

  ...
(Cont.)
• contents         (all content nodes)

     soup.html.contents ==
     [soup.html.head , soup.html.body]

• nextS...
(Cont.)
• tag
    soup.html.body.name == 'body'

•
    soup.html.head.title.string
    == str(soup.html.head.title.string)
    ==...
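The navigation attributes from the last three slides, as a Python 3 sketch. It assumes the `bs4` package, where the Beautiful Soup 3 names `next`, `previous`, `nextSibling`, `previousSibling` and `attrMap` are spelled `next_element`, `previous_element`, `next_sibling`, `previous_sibling` and `attrs`. The sample page is written without whitespace between tags, since whitespace would show up as extra text nodes.

```python
from bs4 import BeautifulSoup

page = (
    '<html><head><title>page title</title></head>'
    '<body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, 'html.parser')
title = soup.title
first_p = soup.body.p

# Up, forward and back in document order.
print(title.parent is soup.head)
print(title.next_element)                        # the text inside <title>
print(title.next_element.next_element is soup.body)
print(title.previous_element is soup.head)

# Children and siblings.
print(soup.html.contents == [soup.head, soup.body])
print(first_p.next_sibling is soup.body.contents[1])
print(soup.body.previous_sibling is soup.head)

# Tag name, text and attributes.
print(soup.body.name, title.string, first_p.attrs, first_p['id'])
```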
• find(name, attrs, recursive, text)
find(name, attrs, recursive, text)



• soup.find('p')
   #<p id="firstpara" align="center">
   first parag...
find(name, attrs, recursive, text)


soup.find('p') == soup.html.body.p

soup.find('p',id='secondpara')
  #<p id="secondp...
findAll(name, attrs, recursive, text,limit)

soup.findAll('p') == [soup.html.body.p
                     ,soup.p.nextSibling...
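A Python 3 sketch of the searches on the `find` and `findAll` slides. It assumes the `bs4` package, where `findAll` is spelled `find_all` and the `text` argument is spelled `string`. The sample page is the one from the earlier slide, written without whitespace between tags.

```python
from bs4 import BeautifulSoup

page = (
    '<html><head><title>page title</title></head>'
    '<body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, 'html.parser')

# find() returns the first match, or None.
first_p = soup.find('p')
second_p = soup.find('p', id='secondpara')
top_level_p = soup.find('p', recursive=False)  # only direct children of soup
one_text = soup.find(string='one')             # matches a text node, not a tag

# find_all() returns a list of every match.
all_p = soup.find_all('p')
first_four = [tag.name for tag in soup.find_all(True, limit=4)]

print(first_p is soup.body.p, second_p['id'], top_level_p)
print(one_text.parent.name, len(all_p), first_four)
```

Passing `True` as the name matches every tag, so `limit=4` returns the first four tags in document order.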
Other solutions
• lxml
• html5lib
• HTMLParser
• htmlfill
• Genshi
  http://blog.ianbicking.org/2008/03/30/python-html-pars...
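For comparison, a minimal sketch of the `HTMLParser` option from the list above, using the standard-library `html.parser` module in Python 3 (the module was named `HTMLParser` in Python 2). The `ParagraphIds` class and the sample `page` string are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphIds(HTMLParser):
    """Collect the id attribute of every <p> start tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == 'p':
            self.ids.append(dict(attrs).get('id'))


page = (
    '<html><body>'
    '<p id="firstpara">first paragraph</p>'
    '<p id="secondpara">second paragraph</p>'
    '</body></html>'
)

parser = ParagraphIds()
parser.feed(page)
print(parser.ids)
```

This event-driven style needs no third-party package, but the caller has to track state by hand instead of navigating a tree.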
Reference
• Python Official Website
  http://www.python.com/ (>///<               )
  http://www.python.org/


• Beautiful ...
Parse The Web Using Python+Beautiful Soup

  1. 1. Parse the web using Python + Beautiful Soup at ncucc cwebb(dot)tw(at)gmail(dot)com
  2. 2. Agenda • • Python • Beautiful Soup
  3. 3. Parse the web? but how?
  4. 4. Solutions • C++ • Java • Perl • Python • Others?
  5. 5. Solutions (Cont.) • • Regular expression • Parser
  6. 6. So I decide...
  7. 7. Python + Beautiful Soup
  8. 8. Python + Beautiful Soup
  9. 9. Python • high-level programming language • scripting language • Google
  10. 10. • • {} • list tuple dictionary
  11. 11. list • a=['asdf',123,12.01,'abcd'] • a[3] (a[-1]) • 'abcd' • a[0:2] (a[:2]) • ['asdf',123] • b=['asdf',123,['qwer',12.34]]
  12. 12. list (Cont.) • a=['abc',12] • len(a) • #2 • a.append(1) • #['abc',12,1] • a.insert(1,'def') • #['abc','def',12,1]
  13. 13. list (Cont.) • a= [321,456,12,1] • a.pop() • #[321,456,12] • a.index(12) • #2 • a.sort() • #[12,321,456]
  14. 14. tuple • a=('asdf',123,12.01) or a= 'asdf',123,12.01 • a=(('abc',1),123.1) • a,b=1,2
  15. 15. Dictionary • a={123:'abc','cde':456} • a[123] • #'abc' • a['cde'] • #456
  16. 16. if else if a>10: print 'a>10' elif a<5: print 'a<5' else: print '5<=a<=10'
  17. 17. while loop while a>2 or a<3: pass
  18. 18. for loop • a=['abc',123,'def'] • for x in a: print x • #abc 123 def • for x in range(3): print x • #0 1 2 • for x in range(4,34,10): print x • #4 14 24
  19. 19. function def fib(n): if n==0 or n==1: return n else: return fib(n-1)+fib(n-2)
  20. 20. ....
  21. 21. What is Beautiful Soup not Beautiful Soap • python module • html/xml parser • html/xml •
  22. 22. Beautiful Soup <html> <head> <title> page title </title> </head> <body> <p id="firstpara" align="center"> first paragraph <b> one </b> </p> <p id="secondpara" align="blah"> second paragraph <b> two </b> </p> </body> </html>
  23. 23. check urllib/urllib2 to see how to open a url in python from BeautifulSoup import BeautifulSoup soup=BeautifulSoup(page) soup.html.head #<head><title>page title</title></head> soup.head #<head><title>page title</title></head> soup.body.p #<p id="firstpara" align="center">first paragraph<b>one</b></p>
  24. 24. (Cont.) • parent (go to parent node) soup.title.parent == soup.head • next (go to next node) soup.title.next == 'page title' soup.title.next.next == soup.body • previous (go to previous node) soup.title.previous == soup.head soup.body.p.b.previous == 'first paragraph'
  25. 25. (Cont.) • contents (all content nodes) soup.html.contents == [soup.html.head , soup.html.body] • nextSibling (go to next sibling) soup.html.body.p.nextSibling == soup.html.body.contents[1] • previousSibling (previous sibling) soup.html.body.previousSibling == soup.html.head
  26. 26. (Cont.) • tag soup.html.body.name == 'body' • soup.html.head.title.string == str(soup.html.head.title.string) == soup.html.head.title.contents[0] == 'page title' • Tag soup.html.body.p.attrMap == {'align' : 'center', 'id' : 'firstpara'} soup.html.body.p['id'] == 'firstpara'
  27. 27. • find(name, attrs, recursive, text)
  28. 28. • find(name, attrs, recursive, text) tag
  29. 29. tag • find(name, attrs, recursive, text) tag
  30. 30. tag • find(name, attrs, recursive, text) tag
  31. 31. tag tag • find(name, attrs, recursive, text) tag
  32. 32. find(name, attrs, recursive, text) • soup.find('p') #<p id="firstpara" align="center">first paragraph<b>one</b></p>
  33. 33. find(name, attrs, recursive, text) soup.find('p') == soup.html.body.p soup.find('p',id='secondpara') #<p id="secondpara" align="blah">second paragraph<b>two</b></p> soup.find('p',recursive=False)==None soup.find(text='one')==soup.b.next
  34. 34. findAll(name, attrs, recursive, text,limit) soup.findAll('p') == [soup.html.body.p ,soup.p.nextSibling] soup.findAll('p',id='secondpara') #[<p id="secondpara" align="blah">second paragraph<b>two</b></p>] soup.findAll('p',recursive=False)==[] soup.findAll(text='one')==[soup.b.next] soup.findAll(limit=4) ==[soup.html , soup.html.head ,soup.html.head.title , soup.html.body]
  35. 35. Other solutions • lxml • html5lib • HTMLParser • htmlfill • Genshi http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
  36. 36. Reference • Python Official Website http://www.python.com/ (>///< ) http://www.python.org/ • Beautiful Soup documentation http://www.crummy.com/software/BeautifulSoup/ • personal blog http://blog.ez2learn.com/2008/10/05/python-is-the-best-choice-to-grab-web/ • Python html parser performance http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/