/me wants it. Scraping sites to get data.



Building scrapers for grabbing data from websites. Tools, techniques, and tips.


A presentation at KiwiPycon 2009




Usage Rights: CC Attribution-ShareAlike License

  • There are notes on most of the slides giving a bit of an outline of what I was talking about. There's also a paper available to read which has the full code examples and more detailed descriptions.
  • We’ve ended up with this datasets-as-websites problem.
  • I might want to create an alternative presentation. Use it for something different, that the creator would never have conceived of. Or maybe just compare or combine it with other data. http://www.flickr.com/photos/bowbrick/2365377635
  • So, I need the raw data. Not some pretty webpages. http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
  • At 3am on a Sunday morning of course. When my interest is up. No use having some mail-in-take-21-working-days option. http://www.flickr.com/photos/davidmaddison/102584440
  • Usually it’s easier to ask forgiveness than permission.

Presentation Transcript

  • /me wants it: Scraping Sites to get Data. Rob Coup, robert@coup.net.nz
  • Who am I?
      • Koordinates
      • Open data open.org.nz
      • Geek
      • Pythonista
  • Datasets as websites
  • But I want to mix it up! http://www.flickr.com/photos/bowbrick/2365377635
  • DATA http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
  • And when do I want it? http://www.flickr.com/photos/davidmaddison/102584440
  • Just Scrape It
  • First Example
      • Wanganui District Council Food Gradings
      • http://j.mp/i4yNZ
  • Review
      • POST to URLs for each Grade
      • Parse HTML response for:
          • Business Name
          • Address
          • Grading
      • Output as CSV
  • What to POST?
      • Tools: Firebug, Charles

        http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp
        txtGrading=A [ B, C, D, E, “Exempt”, “Currently Not Graded” ]
        Submit=Go
  • POSTing in Python

        import urllib
        import urllib2

        url = 'http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp'
        # Form fields captured from the search page
        post_data = {
            'txtGrading': 'A',
            'Submit': 'Go',
        }
        # Passing a data argument to urlopen() makes the request a POST
        post_encoded = urllib.urlencode(post_data)
        html = urllib2.urlopen(url, post_encoded).read()
        print html
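The slide's code targets Python 2. On Python 3 the same request could be sketched with `urllib.parse` and `urllib.request`, which replaced `urllib` and `urllib2`. The helper name `build_grading_request` is hypothetical, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp"


def build_grading_request(grade):
    """Build (but do not send) the form POST for one grading value.

    Hypothetical helper, not part of the original talk.
    """
    post_data = {"txtGrading": grade, "Submit": "Go"}
    # In Python 3 the POST body must be bytes, so encode the form string.
    body = urlencode(post_data).encode("ascii")
    # Supplying data makes urllib treat the request as a POST.
    return Request(URL, data=body)


request = build_grading_request("A")
print(request.get_method(), request.data)
```

Passing the request to `urllib.request.urlopen(request).read()` would perform the POST and return the response body as bytes.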
  • Results

        …
        <TD class="bodytext">
        <h2>Search results...</h2>
        <B>39 South</B><br />
        159 Victoria Ave<br />
        Wanganui<br />
        Grading: <B>A</b>
        <hr />
        <B>Alma Junction Dairy</B><br />
        1 Alma Rd<br />
        Wanganui<br />
        Grading: <B>A</b>
        <hr />
        …
  • Getting Data Out
      • Tools: BeautifulSoup
      • Parses HTML-ish documents
      • Easy navigation & searching of tree
  • Our Parser

        from BeautifulSoup import BeautifulSoup

        soup = BeautifulSoup(html)
        container = soup.find('td', {'class':'bodytext'})
        for hr_el in container.findAll('hr'):
            # <b>NAME</b><br/>ADDRESS_0<br/>ADDRESS_1<br/>Grading:<b>GRADE</b><hr/>
            text_parts = hr_el.findPreviousSiblings(text=True, limit=3)
            # ['Grading:', 'ADDRESS_1', 'ADDRESS_0']
            address = (text_parts[2], text_parts[1])
            el_parts = hr_el.findPreviousSiblings('b', limit=2)
            # [<b>GRADE</b>, <b>NAME</b>]
            grade = el_parts[0].string
            name = el_parts[1].string
            print name, address, grade
  • Putting it all together
      • loop over the grading values
      • write CSV output
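The slide gives only the outline of this step. A minimal Python 3 sketch of the loop and the CSV output follows, assuming each record looks like the markup on the Results slide. It swaps BeautifulSoup for standard-library regular expressions so that it runs without extra packages. The names `parse_results`, `scrape_to_csv`, and `fake_fetch` are hypothetical, and `fake_fetch` stands in for the real POST.

```python
import csv
import io
import re

# Grading values listed on the "What to POST?" slide.
GRADES = ["A", "B", "C", "D", "E", "Exempt", "Currently Not Graded"]


def parse_results(html):
    """Yield (name, address, grade) tuples from one results page.

    Assumes records of the form:
    <B>NAME</B><br />ADDR<br />ADDR<br />Grading: <B>GRADE</b><hr />
    """
    body = html.split("</h2>", 1)[-1]  # skip the "Search results..." heading
    for chunk in re.split(r"<hr\s*/?>", body, flags=re.I):
        # Turn <br> into newlines, drop other tags, keep non-empty lines.
        text = re.sub(r"<br\s*/?>", "\n", chunk, flags=re.I)
        text = re.sub(r"<[^>]+>", "", text)
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if len(lines) < 3 or not lines[-1].startswith("Grading:"):
            continue  # not a complete record
        grade = lines[-1][len("Grading:"):].strip()
        yield lines[0], ", ".join(lines[1:-1]), grade


def scrape_to_csv(fetch, out):
    """Loop over every grading value and write one CSV row per business."""
    writer = csv.writer(out)
    writer.writerow(["name", "address", "grade"])
    for grade in GRADES:
        for row in parse_results(fetch(grade)):
            writer.writerow(row)


SAMPLE = ('<TD class="bodytext"><h2>Search results...</h2>'
          '<B>39 South</B><br />159 Victoria Ave<br />Wanganui<br />'
          'Grading: <B>A</b><hr />')


def fake_fetch(grade):
    # Stand-in for the real POST: only grade "A" returns a record.
    return SAMPLE if grade == "A" else ""


buffer = io.StringIO()
scrape_to_csv(fake_fetch, buffer)
print(buffer.getvalue())
```

Taking `fetch` as a parameter keeps the loop testable offline; a real run would pass a function that POSTs the grade and returns the decoded HTML.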
  • Advanced Crawlers
      • Form filling
      • Authentication & cookies
  • Mechanize
      • http://wwwsearch.sourceforge.net/mechanize/
      • programmable browser in Python
      • fills forms, navigates links & pages, eats cookies
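Mechanize wraps form filling and cookie handling in a browser-like API. To show just the cookie part without an extra dependency, here is a rough Python 3 sketch using the standard library's `http.cookiejar` instead of Mechanize. The helper names, the login URL, and the form field names are all hypothetical, and nothing is sent over the network.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(url, username, password):
    """Build a login form POST; the field names are made up for illustration."""
    form = {"username": username, "password": password}
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(url, data=body)


opener, jar = make_session()
request = login_request("http://example.com/login", "me", "secret")
# opener.open(request) would submit the form. Cookies set by the response
# are kept in `jar` and sent back on later opener.open() calls.
print(request.get_method(), len(jar))
```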
  • Data Parsing
      • JSON: SimpleJSON (pre-Py2.6)
      • XML: ElementTree
      • HTML: BeautifulSoup
      • Nasties: Adobe PDF, Microsoft Excel
      • “PDF files are where data goes to die”
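For the JSON and XML cases, a small sketch using the standard library (`json` has shipped with Python since 2.6, ElementTree since 2.5). The sample documents are invented for illustration and do not come from either council site.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample data, shaped like the food-grading records in the talk.
json_text = '{"name": "39 South", "grade": "A"}'
record = json.loads(json_text)

xml_text = '<businesses><business grade="A">39 South</business></businesses>'
root = ET.fromstring(xml_text)
# Collect (name, grade) pairs from each <business> element.
pairs = [(el.text, el.get("grade")) for el in root.findall("business")]

print(record["name"], record["grade"])
print(pairs)
```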
  • Reading nasties in Python
      • Adobe PDF: PDFMiner, pdftable
      • MS Excel: xlrd
  • Example Two
      • Palmerston North City Food Gradings
      • http://j.mp/31YuRH
  • Review
      • Get HTML page
      • Find current PDF link
      • Download PDF
      • Parse table
          • Name
          • Grading
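The "find current PDF link" step is not shown on the slides. One way it might look in Python 3, using only the standard library's `html.parser`: collect every link whose target ends in `.pdf` and resolve it against the page URL. The class and function names, the sample markup, and the base URL are hypothetical. How to pick the current PDF among several links depends on the real page and is left open.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PDFLinkFinder(HTMLParser):
    """Collect absolute URLs of <a href="...pdf"> links (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(".pdf"):
            # Resolve relative links against the page they came from.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    finder = PDFLinkFinder(base_url)
    finder.feed(html)
    return finder.links


page = ('<p><a href="/about">About</a> '
        '<a href="docs/Gradings.PDF">Food gradings</a></p>')
print(find_pdf_links(page, "http://example.com/food/"))
```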
  • Parsing PDF

        import urllib2
        from cStringIO import StringIO

        from pdfminer.converter import TextConverter
        from pdfminer.pdfinterp import PDFResourceManager, process_pdf
        from pdfminer.layout import LAParams

        pdf_file = StringIO(urllib2.urlopen(pdf_url).read())
        text = StringIO()
        rsrc = PDFResourceManager()
        device = TextConverter(rsrc, text, laparams=LAParams())
        process_pdf(rsrc, device, pdf_file)
        device.close()
        print text.getvalue()
  • Summary
      • Python has some great tools for:
          • querying websites
          • parsing HTML & other formats
      • Open data as data, not websites