Parse The Web Using Python+Beautiful Soup

    Parse The Web Using Python+Beautiful Soup - Presentation Transcript

    1. Parse the web using Python + Beautiful Soup at ncucc cwebb(dot)tw(at)gmail(dot)com
    2. Agenda • Python • Beautiful Soup
    3. Parse the web? but how?
    4. Solutions • C++ • Java • Perl • Python • Others?
    5. Solutions (Cont.) • Regular expression • Parser
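The two approaches named on this slide can be compared on a tiny page. The sketch below is only an illustration: the `PAGE` string and the `TitleParser` class are made up for the example, and the parser shown is the standard library's `html.parser`, not Beautiful Soup.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, used only for this illustration.
PAGE = "<html><head><title>page title</title></head><body><p>hi</p></body></html>"

# Approach 1: a regular expression. Short, but it only sees text patterns,
# so nested or malformed markup can easily break it.
match = re.search(r"<title>(.*?)</title>", PAGE, re.IGNORECASE | re.DOTALL)
title_by_regex = match.group(1) if match else None


# Approach 2: a parser. It tracks tags as they open and close.
class TitleParser(HTMLParser):
    """Collects the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(PAGE)
title_by_parser = parser.title

print(title_by_regex, "|", title_by_parser)
```

On a page this simple both approaches agree; the parser tends to hold up better once the markup gets messy.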
    6. So I decide...
    7. Python + Beautiful Soup
    8. Python + Beautiful Soup
    9. Python • high-level programming language • scripting language • Google
    10. • {} • list, tuple, dictionary
    11. list • a=['asdf',123,12.01,'abcd'] • a[3] (a[-1]) • #'abcd' • a[0:2] (a[:2]) • #['asdf',123] • b=['asdf',123,['qwer',12.34]]
    12. list (Cont.) • a=['abc',12] • len(a) • #2 • a.append(1) • #['abc',12,1] • a.insert(1,'def') • #['abc','def',12,1]
    13. list (Cont.) • a=[321,456,12,1] • a.pop() • #[321,456,12] • a.index(12) • #2 • a.sort() • #[12,321,456]
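The list operations from the three slides above can be run as one short script. This is a minimal sketch; the variable names (`last`, `first_two`, `popped`, `position`) are chosen here for illustration and do not appear on the slides.

```python
# Indexing and slicing.
a = ['asdf', 123, 12.01, 'abcd']
last = a[3]             # the same element as a[-1]
first_two = a[0:2]      # the same slice as a[:2]; the end index is excluded

# Growing a list in place.
b = ['abc', 12]
b.append(1)             # add at the end
b.insert(1, 'def')      # add before index 1

# Removing, searching, and sorting.
c = [321, 456, 12, 1]
popped = c.pop()        # removes and returns the last item
position = c.index(12)  # index of the first occurrence of 12
c.sort()                # sorts in place and returns None

print(last, first_two, b, popped, position, c)
```

Note that `pop()` changes the list, so the later `sort()` works on the three remaining items.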
    14. tuple • a=('asdf',123,12.01) or a='asdf',123,12.01 • a=(('abc',1),123.1) • a,b=1,2
    15. Dictionary • a={123:'abc','cde':456} • a[123] • #'abc' • a['cde'] • #456
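A short sketch of the tuple and dictionary slides above. The names `record`, `nested`, `table`, and `fallback` are invented for this example; the swap and the `get` call go slightly beyond the slides but use only built-in behaviour.

```python
# A tuple can be written with or without parentheses.
record = 'asdf', 123, 12.01
nested = (('abc', 1), 123.1)

# Unpacking mirrors the shape of the tuple.
(name, count), ratio = nested

# Multiple assignment is tuple packing plus unpacking,
# which also gives a one-line swap.
a, b = 1, 2
a, b = b, a

# Dictionary keys can be of mixed (hashable) types.
table = {123: 'abc', 'cde': 456}
table['new'] = 789                     # add a key
fallback = table.get('missing', 'n/a')  # no KeyError for absent keys

print(record, name, count, ratio, a, b, table, fallback)
```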
    16. if else
        if a>10:
            print('a>10')
        elif a<5:
            print('a<5')
        else:
            print('5<=a<=10')
    17. while loop
        while a>2 or a<3:
            pass
    18. for loop
        a=['abc',123,'def']
        for x in a:
            print(x)        #abc 123 def
        for x in range(3):
            print(x)        #0 1 2
        for x in range(4,34,10):
            print(x)        #4 14 24
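The loop slides above can be tried with the sketch below. It collects values into lists instead of printing each one, and it uses a `while` condition that eventually becomes false; the names `seen`, `counting`, `stepping`, and `trace` are illustrative only.

```python
# Iterate directly over the items of a list.
items = ['abc', 123, 'def']
seen = []
for x in items:
    seen.append(x)

# range(stop) counts from 0; range(start, stop, step) stops before `stop`.
counting = list(range(3))
stepping = list(range(4, 34, 10))

# A while loop runs until its condition becomes false,
# so the body has to change the variable being tested.
a = 5
trace = []
while a > 2:
    trace.append(a)
    a -= 1

print(seen, counting, stepping, trace)
```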
    19. function
        def fib(n):
            if n==0 or n==1:
                return n
            else:
                return fib(n-1)+fib(n-2)
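For reference, here is a runnable version of the recursive Fibonacci function from the slide, next to an iterative variant. The iterative `fib_iter` is not on the slide; it is added here as an assumption about what a reader might try next, since the recursive form recomputes the same values many times.

```python
def fib(n):
    """Recursive definition: fib(0) = 0, fib(1) = 1."""
    if n == 0 or n == 1:
        return n
    return fib(n - 1) + fib(n - 2)


def fib_iter(n):
    """Iterative variant (not from the slides); linear time."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


first_eight = [fib(i) for i in range(8)]
print(first_eight)
print(fib_iter(30))
```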
    20. ....
    21. What is Beautiful Soup (not Beautiful Soap) • Python module • HTML/XML parser
    22. Beautiful Soup
        <html>
          <head> <title> page title </title> </head>
          <body>
            <p id="firstpara" align="center"> first paragraph <b> one </b> </p>
            <p id="secondpara" align="blah"> second paragraph <b> two </b> </p>
          </body>
        </html>
    23. check urllib/urllib2 to see how to open a url in python
        from BeautifulSoup import BeautifulSoup
        soup=BeautifulSoup(page)
        soup.html.head    #<head><title>page title</title></head>
        soup.head         #<head><title>page title</title></head>
        soup.body.p       #<p id="firstpara" align="center">first paragraph<b>one</b></p>
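The slide points to urllib/urllib2 for opening a URL. In Python 3 those modules were merged into `urllib.request`, so a minimal sketch of getting the `page` string might look like the following. The `fetch` helper and `SAMPLE_HTML` are hypothetical names, and the demo uses a `data:` URL only so that it runs without network access; a real call would pass an `http://` or `https://` address.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the body at `url` as text (hypothetical helper, not from the slides)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Stand-in for a real web page so the sketch works offline.
SAMPLE_HTML = "<html><head><title>page title</title></head><body></body></html>"
demo_url = "data:text/html," + quote(SAMPLE_HTML)

page = fetch(demo_url)
print(page)
```

The resulting `page` string is what would be handed to the parser on the slide.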
    24. (Cont.)
        • parent (go to parent node)
          soup.title.parent == soup.head
        • next (go to next node)
          soup.title.next == 'page title'
          soup.title.next.next == soup.body
        • previous (go to previous node)
          soup.title.previous == soup.head
          soup.body.p.b.previous == 'first paragraph'
    25. (Cont.)
        • contents (all content nodes)
          soup.html.contents == [soup.html.head, soup.html.body]
        • nextSibling (go to next sibling)
          soup.html.body.p.nextSibling == soup.html.body.contents[1]
        • previousSibling (previous sibling)
          soup.html.body.previousSibling == soup.html.head
    26. (Cont.)
        • tag name
          soup.html.body.name == 'body'
        • tag string
          soup.html.head.title.string == soup.html.head.title.contents[0] == 'page title'
          str(soup.html.head.title) == '<title>page title</title>'
        • Tag attributes
          soup.html.body.p.attrMap == {'align' : 'center', 'id' : 'firstpara'}
          soup.html.body.p['id'] == 'firstpara'
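The navigation slides above use the Beautiful Soup 3 module. As a rough sketch, the same walk can be done with the newer `bs4` package (Beautiful Soup 4), assuming it is installed; there the attributes are spelled `next_element`, `previous_element`, `next_sibling`, `previous_sibling`, and `attrs`. The sample markup is the page from slide 22 written without whitespace between tags, so that no whitespace-only text nodes appear among the siblings.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 module on the slides

PAGE = (
    '<html><head><title>page title</title></head><body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(PAGE, "html.parser")

title = soup.title
first_p = soup.body.p

# parent / next / previous, in document order
parent_name = title.parent.name
after_title = title.next_element             # the text node inside <title>
after_text = after_title.next_element        # the next tag in the document
before_b = first_p.b.previous_element        # the text just before <b>

# children and siblings
child_names = [child.name for child in soup.html.contents]
second_p = first_p.next_sibling

# name, string, attributes
print(parent_name, after_title, after_text.name, before_b)
print(child_names, second_p["id"])
print(first_p.name, title.string, first_p.attrs)
```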
    27-31. find(name, attrs, recursive, text) • [diagram: tree of tag nodes]
    32. find(name, attrs, recursive, text)
        soup.find('p')    #<p id="firstpara" align="center">first paragraph<b>one</b></p>
    33. find(name, attrs, recursive, text)
        soup.find('p') == soup.html.body.p
        soup.find('p',id='secondpara')    #<p id="secondpara" align="blah">second paragraph<b>two</b></p>
        soup.find('p',recursive=False) == None
        soup.find(text='one') == soup.b.next
    34. findAll(name, attrs, recursive, text, limit)
        soup.findAll('p') == [soup.html.body.p, soup.p.nextSibling]
        soup.findAll('p',id='secondpara')    #[<p id="secondpara" align="blah">second paragraph<b>two</b></p>]
        soup.findAll('p',recursive=False) == []
        soup.findAll(text='one') == [soup.b.next]
        soup.findAll(limit=4) == [soup.html, soup.html.head, soup.html.head.title, soup.html.body]
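The searches on the slides above translate to Beautiful Soup 4 roughly as follows. This is a sketch under the assumption that the `bs4` package is installed: `findAll` is spelled `find_all` there, and text searches use the `string` argument. Passing `True` as the name matches every tag, which is used here for the `limit` example.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 module on the slides

PAGE = (
    '<html><head><title>page title</title></head><body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(PAGE, "html.parser")

# find returns the first match, or None when nothing matches.
first_p = soup.find("p")
second_p = soup.find("p", id="secondpara")
top_level_p = soup.find("p", recursive=False)   # only direct children are searched
text_one = soup.find(string="one")

# find_all returns a list of every match (possibly empty).
all_p = soup.find_all("p")
all_text_one = soup.find_all(string="one")
first_four_tags = [tag.name for tag in soup.find_all(True, limit=4)]

print(first_p["id"], second_p["id"], top_level_p, text_one)
print(len(all_p), all_text_one, first_four_tags)
```

With `recursive=False` the search starts at the document root, whose only direct child is `<html>`, which is why no `<p>` is found.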
    35. Other solutions • lxml • html5lib • HTMLParser • htmlfill • Genshi http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
    36. Reference • Python Official Website http://www.python.com/ (>///< ) http://www.python.org/ • Beautiful Soup documentation http://www.crummy.com/software/BeautifulSoup/ • personal blog http://blog.ez2learn.com/2008/10/05/python-is-the-best-choice-to-grab-web/ • Python html parser performance http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

    Jim Chang, 6 months ago

    © All Rights Reserved
