Parse The Web Using Python+Beautiful Soup


Presentation Transcript

  • Parse the web using Python + Beautiful Soup at ncucc cwebb(dot)tw(at)gmail(dot)com
  • Agenda • Python • Beautiful Soup
  • Parse the web? but how?
  • Solutions • C++ • Java • Perl • Python • Others?
  • Solutions (Cont.) • Regular expression • Parser
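The regular-expression route from this slide can be sketched roughly as follows. This is only an illustration: the sample page and the names `TITLE_RE` and `HREF_RE` are made up for the example, and patterns like these tend to break on real-world markup, which is the usual argument for using a parser instead.

```python
import re

# Hypothetical sample page, not taken from the slides.
html = (
    '<html><head><title>page title</title></head>'
    '<body><a href="http://www.python.org/">Python</a> '
    '<a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>'
    '</body></html>'
)

# Non-greedy match of whatever sits between <title> and </title>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Only handles double-quoted href values; other quoting styles are missed.
HREF_RE = re.compile(r'href="([^"]*)"', re.IGNORECASE)

match = TITLE_RE.search(html)
title = match.group(1).strip() if match else None
links = HREF_RE.findall(html)

print(title)
print(links)
```

The comment on `HREF_RE` hints at the weakness: every variation in quoting, nesting, or broken markup needs another special case.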
  • So I decided...
  • Python + Beautiful Soup
  • Python • high-level programming language • scripting language • Google
  • {} • list, tuple, dictionary
  • list • a = ['asdf', 123, 12.01, 'abcd'] • a[3] (same as a[-1]) • # 'abcd' • a[0:2] (same as a[:2]) • # ['asdf', 123] • b = ['asdf', 123, ['qwer', 12.34]]
  • list (Cont.) • a = ['abc', 12] • len(a) • # 2 • a.append(1) • # ['abc', 12, 1] • a.insert(1, 'def') • # ['abc', 'def', 12, 1]
  • list (Cont.) • a = [321, 456, 12, 1] • a.pop() • # returns 1; a is now [321, 456, 12] • a.index(12) • # 2 • a.sort() • # a is now [12, 321, 456]
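The list operations from the last three slides can be tried in one short script. This is a minimal sketch that reuses the slides' values; the variable names `last`, `first_two`, `popped`, and `position` are introduced only for illustration.

```python
a = ['asdf', 123, 12.01, 'abcd']
last = a[-1]          # same element as a[3]
first_two = a[:2]     # same slice as a[0:2]; the end index is excluded

b = ['abc', 12]
b.append(1)           # add to the end
b.insert(1, 'def')    # insert before index 1

c = [321, 456, 12, 1]
popped = c.pop()      # removes and returns the last element
position = c.index(12)
c.sort()              # sorts in place and returns None

print(last, first_two)
print(b)
print(popped, position, c)
```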
  • tuple • a = ('asdf', 123, 12.01) or a = 'asdf', 123, 12.01 • a = (('abc', 1), 123.1) • a, b = 1, 2
  • Dictionary • a = {123: 'abc', 'cde': 456} • a[123] • # 'abc' • a['cde'] • # 456
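Tuples and dictionaries can be sketched the same way. The values come from the slides; the swap, the added key, and the `get` call with a default are extra illustrations, not part of the original deck.

```python
record = ('asdf', 123, 12.01)      # parentheses are optional: 'asdf', 123, 12.01
nested = (('abc', 1), 123.1)       # tuples can contain tuples

x, y = 1, 2                        # tuple unpacking
x, y = y, x                        # swap without a temporary variable

lookup = {123: 'abc', 'cde': 456}  # keys may be of different hashable types
lookup['new'] = 789                # add a key
missing = lookup.get('nope', 'default')  # no KeyError for absent keys

print(record[0], nested[0][1])
print(x, y)
print(lookup[123], lookup['cde'], missing)
```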
  • if / elif / else
      if a > 10:
          print('a > 10')
      elif a < 5:
          print('a < 5')
      else:
          print('5 <= a <= 10')
  • while loop
      while a > 2 or a < 3:   # true for every number, so this loop never exits
          pass
  • for loop
      a = ['abc', 123, 'def']
      for x in a:
          print(x)                  # abc, 123, def
      for x in range(3):
          print(x)                  # 0, 1, 2
      for x in range(4, 34, 10):
          print(x)                  # 4, 14, 24
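The three control-flow slides can be combined into one runnable sketch. The function name `describe`, the `countdown` variable, and the terminating loop condition are assumptions made for the example.

```python
def describe(a):
    """Classify a number using the slide's if / elif / else chain."""
    if a > 10:
        return 'a > 10'
    elif a < 5:
        return 'a < 5'
    else:
        return '5 <= a <= 10'

# A while loop needs a condition that eventually becomes false.
countdown = 5
while countdown > 2:
    countdown -= 1

# range(start, stop, step) excludes the stop value.
steps = []
for x in range(4, 34, 10):
    steps.append(x)

print(describe(11), describe(3), describe(7))
print(countdown, steps)
```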
  • function
      def fib(n):
          if n == 0 or n == 1:
              return n
          else:
              return fib(n-1) + fib(n-2)
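A runnable version of the Fibonacci function might look like this. The recursive form follows the slide, with the two recursive calls added together; `fib_iter` is an extra iterative variant included only for comparison and is not part of the deck.

```python
def fib(n):
    """Recursive Fibonacci: fib(0) == 0, fib(1) == 1."""
    if n == 0 or n == 1:
        return n
    return fib(n - 1) + fib(n - 2)

def fib_iter(n):
    """Iterative variant; avoids the exponential cost of the recursion."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

first_ten = [fib(n) for n in range(10)]
print(first_ten)
print(fib_iter(30))
```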
  • ....
  • What is Beautiful Soup? (Not Beautiful Soap.) • Python module • HTML/XML parser
  • Beautiful Soup
      <html>
      <head> <title> page title </title> </head>
      <body>
      <p id="firstpara" align="center"> first paragraph <b> one </b> </p>
      <p id="secondpara" align="blah"> second paragraph <b> two </b> </p>
      </body>
      </html>
  • Check urllib/urllib2 to see how to open a URL in Python.
      from BeautifulSoup import BeautifulSoup
      soup = BeautifulSoup(page)   # page: the HTML text of the sample document
      soup.html.head   # <head><title>page title</title></head>
      soup.head        # <head><title>page title</title></head>
      soup.body.p      # <p id="firstpara" align="center">first paragraph<b>one</b></p>
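The slide leaves the URL-opening step to urllib/urllib2. In Python 3 those modules were merged into `urllib.request`, so a minimal sketch could look like the following. The helper name `fetch`, the timeout value, and the UTF-8 fallback are assumptions; a `data:` URL stands in for a real site so the example runs without network access.

```python
from urllib.parse import quote
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Return the body of `url` decoded to text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# Offline stand-in for a real address such as http://www.python.org/
demo_url = "data:text/html;charset=utf-8," + quote("<title>page title</title>")
page = fetch(demo_url)
print(page)
```

The resulting `page` string is what would be handed to the parser.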
  • (Cont.)
      • parent (go to parent node)
        soup.title.parent == soup.head
      • next (go to next node)
        soup.title.next == 'page title'
        soup.title.next.next == soup.body
      • previous (go to previous node)
        soup.title.previous == soup.head
        soup.body.b.previous == 'first paragraph'
  • (Cont.)
      • contents (all content nodes)
        soup.html.contents == [soup.html.head, soup.html.body]
      • nextSibling (go to next sibling)
        soup.html.body.p.nextSibling == soup.html.body.contents[1]
      • previousSibling (go to previous sibling)
        soup.html.body.previousSibling == soup.html.head
  • (Cont.)
      • tag name
        soup.html.body.name == 'body'
      • string inside a tag
        soup.html.head.title.string == soup.html.head.title.contents[0] == 'page title'
        str(soup.html.head.title) == '<title>page title</title>'
      • Tag attributes
        soup.html.body.p.attrMap == {'align': 'center', 'id': 'firstpara'}
        soup.html.body.p['id'] == 'firstpara'
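The navigation attributes above use the Beautiful Soup 3 names from the slides. As a rough sketch, the same walk in the current `bs4` package (an assumption: the deck predates it) uses `next_element`, `previous_element`, `next_sibling`, and `attrs` in place of `next`, `previous`, `nextSibling`, and `attrMap`. The helper name `navigate` is hypothetical, and the sample page is written without whitespace between tags so that no whitespace-only text nodes get in the way.

```python
try:
    from bs4 import BeautifulSoup   # third-party: pip install beautifulsoup4
except ImportError:                 # keep the sketch importable without bs4
    BeautifulSoup = None

PAGE = (
    '<html><head><title>page title</title></head><body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)

def navigate(page):
    """Collect the results of the navigation steps shown on the slides."""
    soup = BeautifulSoup(page, "html.parser")
    first_p = soup.body.p
    return {
        "parent": soup.title.parent.name,
        "next": str(soup.title.next_element),
        "previous": str(soup.b.previous_element),
        "children": [child.name for child in soup.html.contents],
        "sibling_id": first_p.next_sibling["id"],
        "name": soup.body.name,
        "string": str(soup.title.string),
        "attrs": dict(first_p.attrs),
    }

if BeautifulSoup is not None:
    for key, value in navigate(PAGE).items():
        print(key, "=>", value)
```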
  • find(name, attrs, recursive, text)
      soup.find('p') == soup.html.body.p
      # <p id="firstpara" align="center">first paragraph<b>one</b></p>
      soup.find('p', id='secondpara')
      # <p id="secondpara" align="blah">second paragraph<b>two</b></p>
      soup.find('p', recursive=False) == None
      soup.find(text='one') == soup.b.next
  • findAll(name, attrs, recursive, text, limit)
      soup.findAll('p') == [soup.html.body.p, soup.p.nextSibling]
      soup.findAll('p', id='secondpara')
      # [<p id="secondpara" align="blah">second paragraph<b>two</b></p>]
      soup.findAll('p', recursive=False) == []
      soup.findAll(text='one') == [soup.b.next]
      soup.findAll(limit=4) == [soup.html, soup.html.head, soup.html.head.title, soup.html.body]
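The searches on the `find` and `findAll` slides can be sketched with the current `bs4` package, where `findAll` is spelled `find_all` and the `text` argument is spelled `string`. Using `bs4` rather than the Beautiful Soup 3 module from the slides is an assumption, and the helper name `search` is hypothetical. `find_all(True, limit=4)` is used here to request the first four tags of any name.

```python
try:
    from bs4 import BeautifulSoup   # third-party: pip install beautifulsoup4
except ImportError:                 # keep the sketch importable without bs4
    BeautifulSoup = None

PAGE = (
    '<html><head><title>page title</title></head><body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)

def search(page):
    """Run the slide's find / findAll queries and summarize the results."""
    soup = BeautifulSoup(page, "html.parser")
    return {
        "first_p_id": soup.find("p")["id"],
        "second_p_bold": str(soup.find("p", id="secondpara").b.string),
        # Only <html> is a direct child of the soup object, so nothing matches.
        "top_level_p": soup.find("p", recursive=False),
        "p_ids": [p["id"] for p in soup.find_all("p")],
        "strings": [str(s) for s in soup.find_all(string="one")],
        "first_four": [tag.name for tag in soup.find_all(True, limit=4)],
    }

if BeautifulSoup is not None:
    for key, value in search(PAGE).items():
        print(key, "=>", value)
```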
  • Other solutions • lxml • html5lib • HTMLParser • htmlfill • Genshi http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
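Of the alternatives listed, HTMLParser ships with Python itself (as `html.parser` in Python 3), so it can be sketched without installing anything. The class name `ParagraphIdCollector` and the sample page are assumptions for the example; the parser is event-driven, so the code reacts to start tags instead of navigating a tree.

```python
from html.parser import HTMLParser

PAGE = (
    '<html><head><title>page title</title></head><body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)

class ParagraphIdCollector(HTMLParser):
    """Collect the id attribute of every <p> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "p":
            self.ids.append(dict(attrs).get("id"))

collector = ParagraphIdCollector()
collector.feed(PAGE)
print(collector.ids)
```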
  • Reference
      • Python Official Website: http://www.python.org/ (not http://www.python.com/)
      • Beautiful Soup documentation: http://www.crummy.com/software/BeautifulSoup/
      • Personal blog: http://blog.ez2learn.com/2008/10/05/python-is-the-best-choice-to-grab-web/
      • Python HTML parser performance: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/