/me wants it. Scraping sites to get data.

Building scrapers for grabbing data from websites. Tools, techniques, and tips.

A presentation at KiwiPycon 2009

Published in: Technology

  1. /me wants it
     Scraping Sites to get Data
     Rob Coup
     robert@coup.net.nz
  2. Who am I?
     • Koordinates
     • Open data
       open.org.nz
     • Geek
     • Pythonista
  3. Datasets as websites
  4. But I want to mix it up!
     http://www.flickr.com/photos/bowbrick/2365377635
  5. DATA
     http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
  6. And when do I want it?
     http://www.flickr.com/photos/davidmaddison/102584440
  7. Just Scrape It
  8. First Example
     • Wanganui District Council Food Gradings
     • http://j.mp/i4yNZ
  9. Review
     • POST to URLs for each Grade
     • Parse HTML response for:
       • Business Name
       • Address
       • Grading
     • Output as CSV
  10. What to POST?
      • Tools: Firebug, Charles
        http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp
        txtGrading=A [ B, C, D, E, “Exempt”, “Currently Not Graded” ]
        Submit=Go
  11. POSTing in Python
        import urllib
        import urllib2

        url = 'http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp'
        post_data = {
            'txtGrading': 'A',
            'Submit': 'Go',
        }
        post_encoded = urllib.urlencode(post_data)
        html = urllib2.urlopen(url, post_encoded).read()
        print html
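
The slide above uses Python 2 (`urllib`, `urllib2`, the `print` statement). As a rough sketch only, the same POST could be written for Python 3 with the standard library as below. The function names `build_request` and `fetch_html` are illustrative and not from the talk, the response encoding is an assumption, and the council site may no longer answer at this URL.

```python
import urllib.parse
import urllib.request

URL = "http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp"


def build_request(grade):
    """Build (but do not send) the POST request for one grading value."""
    post_data = {"txtGrading": grade, "Submit": "Go"}
    post_encoded = urllib.parse.urlencode(post_data).encode("ascii")
    # Supplying a data payload makes urllib issue a POST instead of a GET.
    return urllib.request.Request(URL, data=post_encoded)


def fetch_html(grade):
    """Send the request and return the page as text.

    Assumption: the response encoding is unknown, so undecodable bytes
    are replaced rather than raising an error.
    """
    with urllib.request.urlopen(build_request(grade)) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_html("A")` would perform the network request; `build_request` can be inspected offline to check what would be sent.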
  12. Results
        …
        <TD class="bodytext">
          <h2>Search results...</h2>
          <B>39 South</B><br />
          159 Victoria Ave<br />
          Wanganui<br />
          Grading: <B>A</b>
          <hr />
          <B>Alma Junction Dairy</B><br />
          1 Alma Rd<br />
          Wanganui<br />
          Grading: <B>A</b>
          <hr />
        …
  13. Getting Data Out
      • Tools: BeautifulSoup
      • Parses HTML-ish documents
      • Easy navigation & searching of tree
  14. Our Parser
        from BeautifulSoup import BeautifulSoup

        soup = BeautifulSoup(html)
        container = soup.find('td', {'class':'bodytext'})
        for hr_el in container.findAll('hr'):
            # <b>NAME</b><br/>ADDRESS_0<br/>ADDRESS_1<br/>Grading:<b>GRADE</b><hr/>
            text_parts = hr_el.findPreviousSiblings(text=True, limit=3)
            # ['Grading:', 'ADDRESS_1', 'ADDRESS_0']
            address = (text_parts[2], text_parts[1])
            el_parts = hr_el.findPreviousSiblings('b', limit=2)
            # [<b>GRADE</b>, <b>NAME</b>]
            grade = el_parts[0].string
            name = el_parts[1].string
            print name, address, grade
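
The parser slide relies on BeautifulSoup under Python 2. For readers without it, here is a minimal sketch of the same idea using only Python 3's standard `html.parser`. This is a swapped-in technique, not the talk's method: instead of walking back from each `<hr>`, it collects the text chunks seen since the previous `<hr>`. The class name `GradingParser` is made up for this example, and it assumes every record follows the NAME, address lines, "Grading:", GRADE pattern shown on the Results slide.

```python
from html.parser import HTMLParser


class GradingParser(HTMLParser):
    """Collect (name, address, grade) tuples from the results table cell."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._chunks = []      # text seen since the previous <hr>
        self._in_cell = False  # inside <td class="bodytext">
        self._in_h2 = False    # inside the "Search results" heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> and <B> arrive as td and b.
        if tag == "td" and ("class", "bodytext") in attrs:
            self._in_cell = True
        elif tag == "h2":
            self._in_h2 = True
        elif tag == "hr" and self._in_cell:
            self._finish_record()

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        text = data.strip()
        if text and self._in_cell and not self._in_h2:
            self._chunks.append(text)

    def _finish_record(self):
        chunks, self._chunks = self._chunks, []
        # Expected order: NAME, address lines..., "Grading:", GRADE
        if len(chunks) >= 4:
            name, grade = chunks[0], chunks[-1]
            address = ", ".join(chunks[1:-2])
            self.records.append((name, address, grade))


def parse_gradings(html):
    """Return a list of (name, address, grade) tuples found in the page."""
    parser = GradingParser()
    parser.feed(html)
    parser.close()
    return parser.records
```

Joining the address lines with a comma is a design choice for CSV output; the slide keeps them as a tuple instead.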
  15. Putting it all together
      • loop over the grading values
      • write CSV output
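
The slides do not show code for this step, so the following is only one possible sketch in Python 3. The grading values come from the "What to POST?" slide. The `fetch` and `parse_page` parameters are hypothetical callables supplied by the caller (one that returns the HTML for a grade, one that returns `(name, address, grade)` tuples), and the CSV column names are an assumption.

```python
import csv

# Grading values listed on the "What to POST?" slide.
GRADES = ["A", "B", "C", "D", "E", "Exempt", "Currently Not Graded"]


def scrape_to_csv(out_file, fetch, parse_page, grades=GRADES):
    """Fetch and parse one page per grade, writing one CSV row per business.

    fetch(grade) must return the page HTML as a string, and
    parse_page(html) must return an iterable of (name, address, grade)
    tuples. Both are hypothetical hooks, not functions from the talk.
    Returns the number of data rows written.
    """
    writer = csv.writer(out_file)
    writer.writerow(["name", "address", "grade"])  # assumed header names
    count = 0
    for grade in grades:
        for name, address, found_grade in parse_page(fetch(grade)):
            writer.writerow([name, address, found_grade])
            count += 1
    return count
```

Passing the fetcher and parser in as arguments keeps the loop testable without touching the network. When writing to a real file in Python 3, open it with `newline=""` as the `csv` module documentation recommends.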
  16. Advanced Crawlers
      • Form filling
      • Authentication & cookies
  17. Mechanize
      • http://wwwsearch.sourceforge.net/mechanize/
      • programmable browser in Python
      • fills forms, navigates links & pages, eats cookies
  18. Data Parsing
      • JSON: SimpleJSON (pre-Py2.6)
      • XML: ElementTree
      • HTML: BeautifulSoup
      • Nasties: Adobe PDF, Microsoft Excel
      “PDF files are where data goes to die”
  19. Reading nasties in Python
      • Adobe PDF: PDFMiner, pdftable
      • MS Excel: xlrd
  20. Example Two
      • Palmerston North City Food Gradings
      • http://j.mp/31YuRH
  21. Review
      • Get HTML page
      • Find current PDF link
      • Download PDF
      • Parse table
        • Name
        • Grading
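
The slides do not show how the PDF link is located, so this is a guess at one way to do the "Find current PDF link" step in Python 3 with the standard library. It simply collects every `<a href>` whose path ends in `.pdf` and resolves it against the page URL. The names `PdfLinkFinder` and `find_pdf_links` are invented for this sketch, and the real council page may need a more specific rule to pick the current file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collect absolute URLs of links that point at PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return all PDF link URLs in the page, in document order."""
    finder = PdfLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.pdf_urls
```

The first (or only) entry of the returned list could then be used as the `pdf_url` for the download step.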
  22. Parsing PDF
        import urllib2
        from cStringIO import StringIO
        from pdfminer.converter import TextConverter
        from pdfminer.pdfinterp import PDFResourceManager, process_pdf
        from pdfminer.layout import LAParams

        pdf_file = StringIO(urllib2.urlopen(pdf_url).read())
        text = StringIO()
        rsrc = PDFResourceManager()
        device = TextConverter(rsrc, text, laparams=LAParams())
        process_pdf(rsrc, device, pdf_file)
        device.close()
        print text.getvalue()
  23. Summary
      • Python has some great tools for:
        • querying websites
        • parsing HTML & other formats
      • Open data as data, not websites
