
Web Scraping with Python

NICAR 2015 • Atlanta, Georgia • March 6-7, 2015


  1. web scraping using python
  2. Paul Schreiber
     paul.schreiber@fivethirtyeight.com
     paulschreiber@gmail.com
     @paulschreiber
  4. </>
  5. Fetching pages ➜ urllib ➜ urllib2 ➜ urllib (Python 3) ➜ requests
  6. Fetch one page
     import requests
     page = requests.get('http://www.ire.org/')
  7. Fetch a set of results
     import requests
     base_url = 'http://www.fishing.ocean/p/%s'
     for i in range(0, 10):
       url = base_url % i
       page = requests.get(url)
  8. Download a file
     import requests
     page = requests.get('http://www.ire.org/')
     with open("index.html", "wb") as html:
       html.write(page.content)
  9. Parsing data ➜ Regular Expressions ➜ CSS Selectors ➜ XPath ➜ Object Hierarchy ➜ Object Searching
  10. <html>
        <head><title>Green Eggs and Ham</title></head>
        <body>
          <ol>
            <li>Green Eggs</li>
            <li>Ham</li>
          </ol>
        </body>
      </html>
  11. Regular Expressions (DON'T DO THIS!)
      import re
      item_re = re.compile("<li[^>]*>([^<]+?)</li>")
      item_re.findall(html)
  12. CSS Selectors
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html)
      [s.text for s in soup.select("li")]
  13. XPath
      from lxml import etree
      from StringIO import *
      html = StringIO(html)
      tree = etree.parse(html)
      [s.text for s in tree.xpath('//ol/li')]
  14. Object Hierarchy
      import requests
      from bs4 import BeautifulSoup
      page = requests.get('http://www.ire.org/')
      soup = BeautifulSoup(page.content)
      print "The title is " + soup.title.text
  15. Object Searching
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html)
      [s.text for s in soup.find_all("li")]
  16. Write CSV
      import csv
      with open('shoes.csv', 'wb') as csvfile:
          shoe_writer = csv.writer(csvfile)
          for line in shoe_list:
              shoe_writer.writerow(line)
  17. Write TSV
      output = open("shoes.txt", "w")
      for row in data:
          output.write("\t".join(row) + "\n")
      output.close()
  18. Write JSON
      import json
      with open('shoes.json', 'wb') as outfile:
          json.dump(my_json, outfile)
  19. workon web_scraping (see the setup sketch after the slide list)
  20. EXAMPLE 1: WTA Rankings
  21. WTA Rankings ➜ fetch page ➜ parse cells ➜ write to file
  22. WTA Rankings
      import csv
      import requests
      from bs4 import BeautifulSoup
      url = 'http://www.wtatennis.com/singles-rankings'
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
  23. WTA Rankings
      soup.select("#myTable td")
  24. WTA Rankings
      [s for s in soup.select("#myTable td")]
  25. WTA Rankings
      [s.get_text() for s in soup.select("#myTable td")]
  26. WTA Rankings
      [s.get_text().strip() for s in soup.select("#myTable td")]
  27. WTA Rankings
      cells = [s.get_text().strip() for s in soup.select("#myTable td")]
  28. WTA Rankings
      for i in range(0, 3):
        print cells[i*7:i*7+7]
  29. WTA Rankings
      with open('wta.csv', 'wb') as csvfile:
        wtawriter = csv.writer(csvfile)
        for i in range(0, 3):
          wtawriter.writerow(cells[i*7:i*7+7])
  30. EXAMPLE 2: NY Election Boards
  31. NY Election Boards ➜ list counties ➜ loop over counties ➜ fetch county pages ➜ parse county data ➜ write to file
  32. NY Election Boards
      import requests
      from bs4 import BeautifulSoup
      url = 'http://www.elections.ny.gov/CountyBoards.html'
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
  33. NY Election Boards
      soup.select("area")
  34. NY Election Boards
      counties = soup.select("area")
      county_urls = [u.get('href') for u in counties]
  35. NY Election Boards
      counties = soup.select("area")
      county_urls = [u.get('href') for u in counties]
      county_urls = county_urls[1:]
      county_urls = list(set(county_urls))
  36. NY Election Boards
      data = []
      for url in county_urls[0:3]:
          print "Fetching %s" % url
          page = requests.get(url)
          soup = BeautifulSoup(page.content)
          lines = [s for s in soup.select("th")[0].strings]
          data.append(lines)
  37. NY Election Boards
      output = open("boards.txt", "w")
      for row in data:
          output.write("\t".join(row) + "\n")
      output.close()
  38. EXAMPLE 3: ACEC Members
  39. ACEC Members ➜ loop over pages ➜ fetch result table ➜ parse name, id, location ➜ write to file
  40. ACEC Members
      import requests
      import json
      from bs4 import BeautifulSoup
      base_url = 'http://www.acec.ca/about_acec/search_member_firms/business_sector_search.html/search/business/page/%s'
  41. ACEC Members
      url = base_url % 1
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
      soup.find(id='resulttable')
  42. ACEC Members
      url = base_url % 1
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
      table = soup.find(id='resulttable')
      rows = table.find_all('tr')
  43. ACEC Members
      url = base_url % 1
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
      table = soup.find(id='resulttable')
      rows = table.find_all('tr')
      columns = rows[0].find_all('td')
  44. ACEC Members
      columns = rows[0].find_all('td')
      company_data = {
          'name': columns[1].a.text,
          'id': columns[1].a['href'].split('/')[-1],
          'location': columns[2].text
      }
  45. ACEC Members
      start_page = 1
      end_page = 2
      result = []
  46. ACEC Members
      for i in range(start_page, end_page + 1):
          url = base_url % i
          print "Fetching %s" % url
          page = requests.get(url)
          soup = BeautifulSoup(page.content)
          table = soup.find(id='resulttable')
          rows = table.find_all('tr')
  47. ACEC Members (inside the page loop from slide 46)
          for r in rows:
              columns = r.find_all('td')
              company_data = {
                  'name': columns[1].a.text,
                  'id': columns[1].a['href'].split('/')[-1],
                  'location': columns[2].text
              }
              result.append(company_data)
  48. ACEC Members
      with open('acec.json', 'w') as outfile:
          json.dump(result, outfile)
  49. </>
  50. Python Tools ➜ lxml ➜ scrapy ➜ MechanicalSoup ➜ RoboBrowser ➜ pyQuery (a scrapy sketch follows the slide list)
  51. Ruby Tools ➜ nokogiri ➜ Mechanize
  52. Not coding? Scrape with: ➜ import.io ➜ Kimono ➜ copy & paste ➜ PDFTables ➜ Tabula
  54. Basic Authentication
      page = requests.get(url, auth=('drseuss', 'hamsandwich'))
  55. Self-signed certificates
      page = requests.get(url, verify=False)
  56. Specify Certificate Bundle
      page = requests.get(url, verify='/etc/ssl/certs.pem')
  57. Server Name Indication (SNI)
      requests.exceptions.SSLError: hostname 'shrub.ca' doesn't match either of 'www.arthurlaw.ca', 'arthurlaw.ca'
      $ pip install pyopenssl
      $ pip install ndg-httpsclient
      $ pip install pyasn1
  58. Unicode
      UnicodeEncodeError: 'ascii', u'Cornet, Aliz\xe9', 12, 13, 'ordinal not in range(128)'
      Fix: myvar.encode("utf-8")
  59. Server Errors
      page = requests.get(url)
      if (page.status_code >= 400):
        ...
      else:
        ...
  60. Exceptions
      import sys
      try:
          r = requests.get(url)
      except requests.exceptions.RequestException as e:
          print e
          sys.exit(1)
  61. Browser Disallowed
      headers = {
          'User-Agent': 'Mozilla/3000'
      }
      response = requests.get(url, headers=headers)
  62. Rate Limiting/Slow Servers
      import time
      for i in range(0, 10):
        url = base_url % i
        page = requests.get(url)
        time.sleep(1)
  63. Query String
      requests.get("http://greeneggs.ham/", params={'name': 'sam', 'verb': 'are'})
  64. POST a form
      requests.post("http://greeneggs.ham/", data={'name': 'sam', 'verb': 'are'})
  65. participate
  66. Mismatched tags (see the parser-repair sketch after the slide list):
      <strong>
        <em>foo</strong>
      </em>
  68. github.com/paulschreiber/nicar15
  69. Many graphics from The Noun Project: Binoculars by Stephen West. Broken file by Maxi Koichi. Broom by Anna Weiss. Chess by Matt Brooks. Cube by Luis Rodrigues. Firewall by Yazmin Alanis. Frown by Simple Icons. Lock by Edward Boatman. Wrench by Tony Gines.
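
Setup sketch (slide 19): workon comes from the virtualenvwrapper package. A minimal one-time setup, assuming pip is available and virtualenvwrapper's shell hooks are loaded; the package list is only a guess at what the deck's examples need:

    $ pip install virtualenvwrapper
    $ mkvirtualenv web_scraping
    $ pip install requests beautifulsoup4 lxml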
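Scrapy sketch (slide 50): the tools slide names scrapy without showing it. Here is a minimal spider under assumed names (ItemSpider is hypothetical, and the URL is the made-up fishing.ocean address from slide 7), not the deck's own code:

    import scrapy

    class ItemSpider(scrapy.Spider):
        name = "items"
        start_urls = ["http://www.fishing.ocean/p/0"]  # hypothetical URL reused from slide 7

        def parse(self, response):
            # response.css() plays the role soup.select() does elsewhere in the deck
            for li in response.css("li"):
                yield {"item": li.css("::text").extract_first()}

Run it with scrapy runspider spider.py -o items.json; scrapy handles the fetching, looping, and JSON writing that examples 1-3 do by hand.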
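Parser-repair sketch (slide 66): mismatched tags like those on slide 66 are common in real pages, and BeautifulSoup repairs them rather than erroring; the exact fix-up can vary by parser, so this is just an illustrative sketch:

    from bs4 import BeautifulSoup

    # html.parser is the stdlib backend; lxml or html5lib may nest the tags differently
    soup = BeautifulSoup("<strong><em>foo</strong></em>", "html.parser")
    print soup.prettify()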
