/me wants it. Scraping sites to get data.

Building scrapers for grabbing data from websites. Tools, techniques, and tips.

A presentation at KiwiPycon 2009


  • We’ve ended up with this datasets-as-websites problem.
  • I might want to create an alternative presentation, use it for something different that the creator would never have conceived of, or just compare or combine it with other data.
    http://www.flickr.com/photos/bowbrick/2365377635
  • So, I need the raw data. Not some pretty webpages.

    http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
  • At 3am on a Sunday morning, of course, when my interest is up. No use having some mail-in, take-21-working-days option.
    http://www.flickr.com/photos/davidmaddison/102584440
  • Usually it’s easier to ask forgiveness than permission.
  • Transcript

    • 1. /me wants it Scraping Sites to get Data Rob Coup robert@coup.net.nz
    • 2. Who am I? • Koordinates • Open data open.org.nz • Geek • Pythonista
    • 3. Datasets as websites
    • 4. But I want to mix it up! http://www.flickr.com/photos/bowbrick/2365377635
    • 5. DATA http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
    • 6. And when do I want it? http://www.flickr.com/photos/davidmaddison/102584440
    • 7. Just Scrape It
    • 8. First Example • Wanganui District Council Food Gradings • http://j.mp/i4yNZ
    • 9. Review • POST to URLs for each Grade • Parse HTML response for: • Business Name • Address • Grading • Output as CSV
    • 10. What to POST? • Tools: Firebug, Charles
      http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp
      txtGrading=A [ B, C, D, E, “Exempt”, “Currently Not Graded” ]
      Submit=Go
    • 11. POSTing in Python

      import urllib
      import urllib2

      url = 'http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp'
      post_data = {
          'txtGrading': 'A',
          'Submit': 'Go',
      }
      post_encoded = urllib.urlencode(post_data)
      html = urllib2.urlopen(url, post_encoded).read()
      print html
    • 12. Results

      …
      <TD class="bodytext">
      <h2>Search results...</h2>
      <B>39 South</B><br />
      159 Victoria Ave<br />
      Wanganui<br />
      Grading: <B>A</b>
      <hr />
      <B>Alma Junction Dairy</B><br />
      1 Alma Rd<br />
      Wanganui<br />
      Grading: <B>A</b>
      <hr />
      …
    • 13. Getting Data Out • Tools: BeautifulSoup • Parses HTML-ish documents • Easy navigation & searching of tree
    • 14. Our Parser

      from BeautifulSoup import BeautifulSoup

      soup = BeautifulSoup(html)
      container = soup.find('td', {'class': 'bodytext'})
      for hr_el in container.findAll('hr'):
          # <b>NAME</b><br/>ADDRESS_0<br/>ADDRESS_1<br/>Grading:<b>GRADE</b><hr/>
          text_parts = hr_el.findPreviousSiblings(text=True, limit=3)
          # ['Grading:', 'ADDRESS_1', 'ADDRESS_0']
          address = (text_parts[2], text_parts[1])
          el_parts = hr_el.findPreviousSiblings('b', limit=2)
          # [<b>GRADE</b>, <b>NAME</b>]
          grade = el_parts[0].string
          name = el_parts[1].string
          print name, address, grade
    • 15. Putting it all together • loop over the grading values • write CSV output
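The assembly step on slide 15 can be sketched as below, in modern Python 3 spelling (urllib.parse/urllib.request replace the talk's urllib/urllib2). fetch_grade and write_csv are illustrative names, and the real rows would come from the BeautifulSoup parser on slide 14; here the CSV step is demonstrated offline with one hand-parsed sample row.

```python
import csv
import io
import urllib.parse
import urllib.request

URL = 'http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp'
GRADES = ['A', 'B', 'C', 'D', 'E', 'Exempt', 'Currently Not Graded']

def fetch_grade(grade):
    """POST one grading value and return the raw HTML (needs network access)."""
    body = urllib.parse.urlencode({'txtGrading': grade, 'Submit': 'Go'}).encode()
    return urllib.request.urlopen(URL, body).read().decode('latin-1')

def write_csv(rows, out):
    """Write (name, address, grade) tuples to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(['name', 'address', 'grade'])
    writer.writerows(rows)

# Offline demonstration of the CSV step with a sample parsed row:
buf = io.StringIO()
write_csv([('39 South', '159 Victoria Ave, Wanganui', 'A')], buf)
print(buf.getvalue())
```

The full run would loop `for grade in GRADES`, feed each `fetch_grade(grade)` result through the parser, and append the rows before writing one combined CSV.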
    • 16. Advanced Crawlers • Form filling • Authentication & cookies
    • 17. Mechanize • http://wwwsearch.sourceforge.net/mechanize/ • programmable browser in Python • fills forms, navigates links & pages, eats cookies
    • 18. Data Parsing • JSON: SimpleJSON (pre-Py2.6) • XML: ElementTree • HTML: BeautifulSoup • Nasties: Adobe PDF, Microsoft Excel “PDF files are where data goes to die”
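Slide 18's stdlib-friendly formats take only a few lines each; since Python 2.6 SimpleJSON ships in the standard library as json, and ElementTree as xml.etree.ElementTree (Python 3 spelling below, with made-up sample documents):

```python
import json
import xml.etree.ElementTree as ET

# JSON: simplejson became the stdlib `json` module in Python 2.6
record = json.loads('{"name": "39 South", "grade": "A"}')
print(record['grade'])  # -> A

# XML: ElementTree gives a tree you can walk with find()/findall()
doc = ET.fromstring('<businesses><b name="39 South" grade="A"/></businesses>')
for b in doc.findall('b'):
    print(b.get('name'), b.get('grade'))  # -> 39 South A
```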
    • 19. Reading nasties in Python • Adobe PDF: PDFMiner, pdftable • MS Excel: xlrd
    • 20. Example Two • Palmerston North City Food Gradings • http://j.mp/31YuRH
    • 21. Review • Get HTML page • Find current PDF link • Download PDF • Parse table • Name • Grading
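The "find the current PDF link" step from slide 21 can be sketched with the standard library's HTML parser (BeautifulSoup, covered earlier, would do the same job in fewer lines); the page snippet here is invented for illustration.

```python
from html.parser import HTMLParser

class PdfLinkFinder(HTMLParser):
    """Collect href values ending in .pdf from <a> tags."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.lower().endswith('.pdf'):
                self.pdf_links.append(href)

# Invented stand-in for the council page's HTML:
page = '<p><a href="/docs/gradings-2009.pdf">Current gradings</a> <a href="/about.html">About</a></p>'
finder = PdfLinkFinder()
finder.feed(page)
print(finder.pdf_links)  # -> ['/docs/gradings-2009.pdf']
```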
    • 22. Parsing PDF

      import urllib2
      from cStringIO import StringIO
      from pdfminer.converter import TextConverter
      from pdfminer.pdfinterp import PDFResourceManager, process_pdf
      from pdfminer.layout import LAParams

      pdf_file = StringIO(urllib2.urlopen(pdf_url).read())
      text = StringIO()
      rsrc = PDFResourceManager()
      device = TextConverter(rsrc, text, laparams=LAParams())
      process_pdf(rsrc, device, pdf_file)
      device.close()
      print text.getvalue()
    • 23. Summary • Python has some great tools for: • querying websites • parsing HTML & other formats • Open data as data, not websites