/me wants it. Scraping sites to get data.

  • rcoup (Robert Coup), 3 weeks ago:
    There are notes on most of the slides giving a bit of an outline of what I was talking about. There’s also a paper available to read which has the full code examples and more detailed descriptions.

Notes on slide 1

We’ve ended up with this datasets-as-websites problem.

I might want to create an alternative presentation. Use it for something different, that the creator would never have conceived of. Or maybe just compare or combine it with other data.
http://www.flickr.com/photos/bowbrick/2365377635

So, I need the raw data. Not some pretty webpages.

http://fl1p51d3.deviantart.com/art/The-Matrix-4594403

At 3am on a Sunday morning of course. When my interest is up. No use having some mail-in-take-21-working-days option.
http://www.flickr.com/photos/davidmaddison/102584440

Usually it’s easier to ask forgiveness than permission.

/me wants it. Scraping sites to get data. - Presentation Transcript

  1. /me wants it: Scraping Sites to get Data • Rob Coup • robert@coup.net.nz
  2. Who am I? • Koordinates • Open data open.org.nz • Geek • Pythonista
  3. Datasets as websites
  4. But I want to mix it up! http://www.flickr.com/photos/bowbrick/2365377635
  5. DATA http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
  6. And when do I want it? http://www.flickr.com/photos/davidmaddison/102584440
  7. Just Scrape It
  8. First Example • Wanganui District Council Food Gradings • http://j.mp/i4yNZ
  9. Review • POST to URLs for each Grade • Parse HTML response for: • Business Name • Address • Grading • Output as CSV
  10. What to POST? • Tools: Firebug, Charles • URL: http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp • txtGrading=A [ B, C, D, E, “Exempt”, “Currently Not Graded” ] • Submit=Go
  11. POSTing in Python

        import urllib
        import urllib2

        url = 'http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp'
        post_data = {
            'txtGrading': 'A',
            'Submit': 'Go',
        }
        post_encoded = urllib.urlencode(post_data)
        html = urllib2.urlopen(url, post_encoded).read()
        print html

  12. Results

        …
        <TD class="bodytext">
        <h2>Search results...</h2>
        <B>39 South</B><br />
        159 Victoria Ave<br />
        Wanganui<br />
        Grading: <B>A</b>
        <hr />
        <B>Alma Junction Dairy</B><br />
        1 Alma Rd<br />
        Wanganui<br />
        Grading: <B>A</b>
        <hr />
        …

  13. Getting Data Out • Tools: BeautifulSoup • Parses HTML-ish documents • Easy navigation & searching of tree
  14. Our Parser

        from BeautifulSoup import BeautifulSoup

        soup = BeautifulSoup(html)
        container = soup.find('td', {'class':'bodytext'})
        for hr_el in container.findAll('hr'):
            # <b>NAME</b><br/>ADDRESS_0<br/>ADDRESS_1<br/>Grading:<b>GRADE</b><hr/>
            text_parts = hr_el.findPreviousSiblings(text=True, limit=3)
            # ['Grading:', 'ADDRESS_1', 'ADDRESS_0']
            address = (text_parts[2], text_parts[1])
            el_parts = hr_el.findPreviousSiblings('b', limit=2)
            # [<b>GRADE</b>, <b>NAME</b>]
            grade = el_parts[0].string
            name = el_parts[1].string
            print name, address, grade

  15. Putting it all together • loop over the grading values • write CSV output
  16. Advanced Crawlers • Form filling • Authentication & cookies
  17. Mechanize • http://wwwsearch.sourceforge.net/mechanize/ • programmable browser in Python • fills forms, navigates links & pages, eats cookies
  18. Data Parsing • JSON: SimpleJSON (pre-Py2.6) • XML: ElementTree • HTML: BeautifulSoup • Nasties: Adobe PDF, Microsoft Excel • “PDF files are where data goes to die”
  19. Reading nasties in Python • Adobe PDF: PDFMiner, pdftable • MS Excel: xlrd
  20. Example Two • Palmerston North City Food Gradings • http://j.mp/31YuRH
  21. Review • Get HTML page • Find current PDF link • Download PDF • Parse table • Name • Grading
  22. Parsing PDF

        import urllib2
        from cStringIO import StringIO

        from pdfminer.converter import TextConverter
        from pdfminer.pdfinterp import PDFResourceManager, process_pdf
        from pdfminer.layout import LAParams

        # pdf_url is the current PDF link found on the HTML page (see slide 21)
        pdf_file = StringIO(urllib2.urlopen(pdf_url).read())
        text = StringIO()
        rsrc = PDFResourceManager()
        device = TextConverter(rsrc, text, laparams=LAParams())
        process_pdf(rsrc, device, pdf_file)
        device.close()
        print text.getvalue()

  23. Summary • Python has some great tools for: • querying websites • parsing HTML & other formats • Open data as data, not websites
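
Slide 15 describes the last step of the first example (loop over the grading values, write CSV output) without showing code. What follows is a minimal sketch of that step, not the code from the talk. It is written for Python 3 using only the standard library, so a regular expression stands in for the BeautifulSoup parser and assumes the markup looks like the sample on slide 12. The function names, the CSV column names, and the latin-1 decoding are assumptions made for illustration.

```python
import csv
import re
from html import unescape
from urllib.parse import urlencode
from urllib.request import urlopen

# URL and form fields as shown on slide 10.
URL = "http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp"
GRADES = ["A", "B", "C", "D", "E", "Exempt", "Currently Not Graded"]

# One record: <b>NAME</b><br/>ADDRESS lines<br/>Grading: <b>GRADE</b>
# (assumed layout, based on the slide 12 sample; real pages may differ)
RECORD_RE = re.compile(
    r"<b>(?P<name>.*?)</b>\s*<br\s*/?>"
    r"(?P<address>.*?)"
    r"Grading:\s*<b>(?P<grade>.*?)</b>",
    re.IGNORECASE | re.DOTALL,
)
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)


def fetch_grade(grade):
    """POST the search form for one grade and return the response HTML."""
    data = urlencode({"txtGrading": grade, "Submit": "Go"}).encode("ascii")
    with urlopen(URL, data) as response:
        # The page encoding is an assumption; adjust if the site differs.
        return response.read().decode("latin-1")


def parse_listings(html):
    """Return a list of (name, address, grade) tuples found in the HTML."""
    rows = []
    for match in RECORD_RE.finditer(html):
        parts = [unescape(p).strip() for p in BR_RE.split(match.group("address"))]
        address = ", ".join(p for p in parts if p)
        name = unescape(match.group("name")).strip()
        grade = unescape(match.group("grade")).strip()
        rows.append((name, address, grade))
    return rows


def scrape_to_csv(out, fetch=fetch_grade, grades=GRADES):
    """Fetch every grade, write all listings to `out` as CSV, return the count."""
    writer = csv.writer(out)
    writer.writerow(["name", "address", "grade"])
    count = 0
    for grade in grades:
        for row in parse_listings(fetch(grade)):
            writer.writerow(row)
            count += 1
    return count
```

With the defaults, `scrape_to_csv(sys.stdout)` would POST to the live site once per grade. Passing a stub function as `fetch` lets the parsing and CSV steps be tried offline against saved HTML.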

Building scrapers for grabbing data from websites.

CC Attribution-ShareAlike License
