# Parse The Web Using Python+Beautiful Soup

### Parse The Web Using Python+Beautiful Soup

1. Parse the web using Python + Beautiful Soup, at ncucc. cwebb(dot)tw(at)gmail(dot)com
2. Agenda • Python • Beautiful Soup
3. Parse the web? But how?
4. Solutions • C++ • Java • Perl • Python • Others?
5. Solutions (Cont.) • Regular expressions • Parsers
6. So I decided...
7. Python + Beautiful Soup
8. Python + Beautiful Soup
9. Python • high-level programming language • scripting language • used by Google
10. 10. • • {} • list tuple dictionary
11. list • a = ['asdf', 123, 12.01, 'abcd'] • a[3] (or a[-1]) • 'abcd' • a[0:2] (or a[:2]) • ['asdf', 123] • lists can nest: b = ['asdf', 123, ['qwer', 12.34]]
12. list (Cont.) • a = ['abc', 12] • len(a) # 2 • a.append(1) # ['abc', 12, 1] • a.insert(1, 'def') # ['abc', 'def', 12, 1]
13. list (Cont.) • a = [321, 456, 12, 1] • a.pop() # [321, 456, 12] • a.index(12) # 2 • a.sort() # [1, 12, 321, 456]
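The list operations on slides 11–13 can be collected into one runnable snippet (values follow the slides; the commented results are what Python actually returns):

```python
# List basics from slides 11-13 in one example.
a = ['asdf', 123, 12.01, 'abcd']
assert a[-1] == 'abcd'            # negative indices count from the end
assert a[:2] == ['asdf', 123]     # slices are half-open: indices 0 and 1

b = ['abc', 12]
b.append(1)                       # add to the end
b.insert(1, 'def')                # insert before index 1
assert b == ['abc', 'def', 12, 1]

c = [321, 456, 12, 1]
c.pop()                           # removes and returns the last element (1)
assert c.index(12) == 2           # position of the first matching element
c.sort()                          # sorts in place
assert c == [12, 321, 456]
```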
14. tuple • a = ('asdf', 123, 12.01) or a = 'asdf', 123, 12.01 • tuples nest: a = (('abc', 1), 123.1) • unpacking: a, b = 1, 2
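Slide 14's tuple examples, collected into a runnable snippet:

```python
# Tuples from slide 14: immutable sequences, with or without parentheses.
a = ('asdf', 123, 12.01)
also_a = 'asdf', 123, 12.01    # the parentheses are optional
assert a == also_a

nested = (('abc', 1), 123.1)   # tuples nest
assert nested[0][1] == 1

x, y = 1, 2                    # tuple unpacking assigns both names at once
assert (x, y) == (1, 2)
```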
15. Dictionary • a = {123: 'abc', 'cde': 456} • a[123] # 'abc' • a['cde'] # 456
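Slide 15's dictionary, runnable (the added `'new'` key is an illustration, not on the slide):

```python
# Dictionary from slide 15: keys can mix types (an int and a string here).
a = {123: 'abc', 'cde': 456}
assert a[123] == 'abc'
assert a['cde'] == 456

a['new'] = 789                 # assigning to a fresh key adds it
assert 'new' in a
```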
16. if / elif / else: if a > 10: print 'a > 10' elif a < 5: print 'a < 5' else: print '5 <= a <= 10'
17. while loop: while a > 2 or a < 3: pass
18. for loop • a = ['abc', 123, 'def']; for x in a: print x # abc 123 def • for x in range(3): print x # 0 1 2 • for x in range(4, 34, 10): print x # 4 14 24
19. function • def fib(n): if n == 0 or n == 1: return n else: return fib(n-1) + fib(n-2)
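Slide 19's function runs as follows once the return expression is an addition rather than a tuple (Python 3 print syntax; the slides use Python 2):

```python
def fib(n):
    # Base cases: fib(0) == 0, fib(1) == 1.
    if n == 0 or n == 1:
        return n
    # Recursive case: the sum of the two preceding Fibonacci numbers.
    return fib(n - 1) + fib(n - 2)

print([fib(n) for n in range(8)])  # [0, 1, 1, 2, 3, 5, 8, 13]
```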
20. ....
21. What is Beautiful Soup (not Beautiful Soap) • a Python module • an HTML/XML parser • tolerant of malformed HTML/XML
22. Beautiful Soup <html> <head> <title> page title </title> </head> <body> <p id="firstpara" align="center"> first paragraph <b> one </b> </p> <p id="secondpara" align="blah"> second paragraph <b> two </b> </p> </body> </html>
24. (Cont.) • parent (go to parent node): soup.title.parent == soup.head • next (go to next node): soup.title.next == 'page title'; soup.title.next.next == soup.body • previous (go to previous node): soup.title.previous == soup.head; soup.body.p.b.previous == 'first paragraph'
25. (Cont.) • contents (all child nodes): soup.html.contents == [soup.html.head, soup.html.body] • nextSibling (go to next sibling): soup.html.body.p.nextSibling == soup.html.body.contents[1] • previousSibling (go to previous sibling): soup.html.body.previousSibling == soup.html.head
26. (Cont.) • tag name: soup.html.body.name == 'body' • text content: soup.html.head.title.string == soup.html.head.title.contents[0] == 'page title' • tag attributes: soup.html.body.p.attrMap == {'align': 'center', 'id': 'firstpara'}; soup.html.body.p['id'] == 'firstpara'
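These navigation attributes can be tried directly against the sample page from slide 22. Below is a minimal sketch using the current bs4 package (assumed installed; the slides use the older Beautiful Soup 3 API, where bs4's next_sibling / previous_sibling are spelled nextSibling / previousSibling and .attrs is attrMap):

```python
from bs4 import BeautifulSoup  # the modern successor to Beautiful Soup 3

html = ('<html><head><title>page title</title></head><body>'
        '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
        '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

assert soup.title.parent.name == 'head'       # parent: the enclosing tag
assert soup.title.string == 'page title'      # string: the tag's text content
assert soup.body.name == 'body'               # name: the tag's own name
assert soup.p.attrs == {'id': 'firstpara', 'align': 'center'}
assert soup.p['id'] == 'firstpara'            # attribute access by subscript
assert soup.p.next_sibling['id'] == 'secondpara'   # sibling navigation
assert soup.body.previous_sibling.name == 'head'
```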
27. find(name, attrs, recursive, text)
32. find(name, attrs, recursive, text) • soup.find('p') # <p id="firstpara" align="center">This is paragraph<b>one</b></p>
33. find(name, attrs, recursive, text) • soup.find('p') == soup.html.body.p • soup.find('p', id='secondpara') # <p id="secondpara" align="blah">This is paragraph<b>two</b></p> • soup.find('p', recursive=False) == None • soup.find(text='one') == soup.b.next
34. findAll(name, attrs, recursive, text, limit) • soup.findAll('p') == [soup.html.body.p, soup.p.nextSibling] • soup.findAll('p', id='secondpara') # [<p id="secondpara" align="blah">This is paragraph<b>two</b></p>] • soup.findAll('p', recursive=False) == [] • soup.findAll(text='one') == [soup.b.next] • soup.findAll(limit=4) == [soup.html, soup.html.head, soup.html.head.title, soup.html.body]
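Slides 32–34 can likewise be exercised against the slide-22 page. A sketch with bs4 (assumed installed; findAll is spelled find_all in bs4, and the text argument is now string):

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>page title</title></head><body>'
        '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
        '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

# find returns the first match; keyword arguments filter on attributes.
assert soup.find('p')['id'] == 'firstpara'
assert soup.find('p', id='secondpara')['align'] == 'blah'

# recursive=False searches direct children only; soup's only child is <html>.
assert soup.find('p', recursive=False) is None

# Searching by text returns the matching string itself.
assert soup.find(string='one') == 'one'

# find_all returns a list of every match; limit caps how many.
assert [p['id'] for p in soup.find_all('p')] == ['firstpara', 'secondpara']
assert len(soup.find_all(True, limit=4)) == 4   # html, head, title, body
```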
35. Other solutions • lxml • html5lib • HTMLParser • htmlfill • Genshi http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
36. Reference • Python Official Website http://www.python.com/ (>///< ) http://www.python.org/ • Beautiful Soup documentation http://www.crummy.com/software/BeautifulSoup/ • personal blog http://blog.ez2learn.com/2008/10/05/python-is-the-best-choice-to-grab-web/ • Python html parser performance http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/