
Web Scraping with Python
NICAR 2015 • Atlanta, Georgia • March 6-7, 2015


  1. Web Scraping Using Python
  2. Paul Schreiber
     paul.schreiber@fivethirtyeight.com
     paulschreiber@gmail.com
     @paulschreiber
  3.
  4. </>
  5. Fetching pages
     ➜ urllib
     ➜ urllib2
     ➜ urllib (Python 3)
     ➜ requests
  6. Fetch one page

     import requests
     page = requests.get('http://www.ire.org/')

  7. Fetch a set of results

     import requests
     base_url = 'http://www.fishing.ocean/p/%s'
     for i in range(0, 10):
       url = base_url % i
       page = requests.get(url)

  8. Download a file

     import requests
     page = requests.get('http://www.ire.org/')
     with open("index.html", "wb") as html:
       html.write(page.content)

  9. Parsing data
     ➜ Regular Expressions
     ➜ CSS Selectors
     ➜ XPath
     ➜ Object Hierarchy
     ➜ Object Searching
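All of the options above answer the same question: how do you pull the `<li>` text out of markup like the Green Eggs list used in the following slides? As a dependency-free sketch (Python 3 standard library only, not one of the tools the deck actually uses), the same extraction with the built-in html.parser:

```python
from html.parser import HTMLParser

class LiTextParser(HTMLParser):
    """Collect the text inside every <li> element."""
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items.append(data)

html = "<ol><li>Green Eggs</li><li>Ham</li></ol>"
parser = LiTextParser()
parser.feed(html)
print(parser.items)  # ['Green Eggs', 'Ham']
```

The libraries in the following slides do this bookkeeping for you; the sketch just shows there is no magic underneath.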
  10. <html>
        <head><title>Green Eggs and Ham</title></head>
        <body>
          <ol>
            <li>Green Eggs</li>
            <li>Ham</li>
          </ol>
        </body>
      </html>

  11. Regular Expressions (DON'T DO THIS!)

      import re
      item_re = re.compile("<li[^>]*>([^<]+?)</li>")
      item_re.findall(html)
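The warning is earned: the pattern above works only while the markup stays simple. A quick sketch (made-up fragments, Python 3) of how it silently drops items as soon as an inline tag appears inside the `<li>`:

```python
import re

item_re = re.compile(r"<li[^>]*>([^<]+?)</li>")

# Works on the simple case...
simple = "<li>Green Eggs</li><li>Ham</li>"
print(item_re.findall(simple))  # ['Green Eggs', 'Ham']

# ...but silently misses items once the markup gets messier,
# e.g. a nested <em> inside the first <li>:
messy = "<li><em>Green</em> Eggs</li><li>Ham</li>"
print(item_re.findall(messy))  # ['Ham'] -- the first item is lost
```

A real parser handles both cases identically, which is why the next slides switch to BeautifulSoup and lxml.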
  12. CSS Selectors

      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html)
      [s.text for s in soup.select("li")]

  13. XPath

      from lxml import etree
      from StringIO import *
      html = StringIO(html)
      tree = etree.parse(html)
      [s.text for s in tree.xpath('//ol/li')]
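For a taste of XPath-style addressing without installing lxml, the standard library's xml.etree supports a subset of the syntax, provided the input is well-formed XML (a sketch, not the deck's code):

```python
import xml.etree.ElementTree as ET

# The Green Eggs list as a well-formed fragment
html = "<ol><li>Green Eggs</li><li>Ham</li></ol>"
tree = ET.fromstring(html)
texts = [li.text for li in tree.findall(".//li")]
print(texts)  # ['Green Eggs', 'Ham']
```

Real scraped HTML is rarely well-formed XML, which is why lxml's forgiving parser is the better tool in practice.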
  14. Object Hierarchy

      import requests
      from bs4 import BeautifulSoup
      page = requests.get('http://www.ire.org/')
      soup = BeautifulSoup(page.content)
      print "The title is " + soup.title.text

  15. Object Searching

      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html)
      [s.text for s in soup.find_all("li")]

  16. Write CSV

      import csv
      with open('shoes.csv', 'wb') as csvfile:
          shoe_writer = csv.writer(csvfile)
          for line in shoe_list:
              shoe_writer.writerow(line)

  17. Write TSV

      output = open("shoes.txt", "w")
      for row in data:
          output.write("\t".join(row) + "\n")
      output.close()

  18. Write JSON

      import json
      with open('shoes.json', 'wb') as outfile:
          json.dump(my_json, outfile)
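All three writers above share the same open-then-dump shape. One caveat worth flagging: the 'wb' mode is Python 2 style; in Python 3, csv files are opened in text mode with newline=''. A self-contained Python 3 sketch with made-up shoe data:

```python
import csv
import json
import os
import tempfile

# Made-up data standing in for the deck's shoe_list
shoe_list = [["brand", "size"], ["Keds", "9"], ["Vans", "10"]]

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "shoes.csv")
json_path = os.path.join(tmpdir, "shoes.json")

# CSV: Python 3 wants text mode with newline='', not 'wb'
with open(csv_path, "w", newline="") as csvfile:
    shoe_writer = csv.writer(csvfile)
    for line in shoe_list:
        shoe_writer.writerow(line)

# JSON: one call dumps the whole structure
with open(json_path, "w") as outfile:
    json.dump(shoe_list, outfile)

print(open(csv_path).read().splitlines())
# ['brand,size', 'Keds,9', 'Vans,10']
```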
  19. workon web_scraping
  20. EXAMPLE 1: WTA Rankings
  21. WTA Rankings
      ➜ fetch page
      ➜ parse cells
      ➜ write to file
  22. import csv
      import requests
      from bs4 import BeautifulSoup
      url = 'http://www.wtatennis.com/singles-rankings'
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
  23. soup.select("#myTable td")
  24. [s for s in soup.select("#myTable td")]
  25. [s.get_text() for s in soup.select("#myTable td")]
  26. [s.get_text().strip() for s in soup.select("#myTable td")]
  27. cells = [s.get_text().strip() for s in soup.select("#myTable td")]
  28. for i in range(0, 3):
        print cells[i*7:i*7+7]
  29. with open('wta.csv', 'wb') as csvfile:
        wtawriter = csv.writer(csvfile)
        for i in range(0, 3):
          wtawriter.writerow(cells[i*7:i*7+7])
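The cells[i*7:i*7+7] slices regroup the flat list of cell texts into seven-column rows. The same idea as a small helper, shown with dummy values instead of live WTA data:

```python
def chunk(cells, width):
    """Regroup a flat list of table-cell strings into rows of `width` columns."""
    return [cells[i:i + width] for i in range(0, len(cells), width)]

# 14 dummy cells standing in for two seven-column ranking rows
cells = [str(n) for n in range(14)]
rows = chunk(cells, 7)
print(rows[0])  # ['0', '1', '2', '3', '4', '5', '6']
```

Iterating over `chunk(cells, 7)` also removes the need for the hard-coded `range(0, 3)` row count in the slides.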
  30. EXAMPLE 2: NY Election Boards
  31. NY Election Boards
      ➜ list counties
      ➜ loop over counties
      ➜ fetch county pages
      ➜ parse county data
      ➜ write to file
  32. import requests
      from bs4 import BeautifulSoup
      url = 'http://www.elections.ny.gov/CountyBoards.html'
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
  33. soup.select("area")
  34. counties = soup.select("area")
      county_urls = [u.get('href') for u in counties]
  35. counties = soup.select("area")
      county_urls = [u.get('href') for u in counties]
      county_urls = county_urls[1:]
      county_urls = list(set(county_urls))
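list(set(...)) is the deck's dedupe, but it scrambles order. If row order matters, dict.fromkeys deduplicates while preserving first-seen order (Python 3.7+; the hrefs below are made up):

```python
# Hypothetical county hrefs, with the duplicates an image map typically has
county_urls = ["/albany.html", "/bronx.html", "/albany.html", "/kings.html"]

# The deck's approach: duplicates gone, order arbitrary
deduped = list(set(county_urls))

# Order-preserving alternative (dicts keep insertion order in Python 3.7+)
in_order = list(dict.fromkeys(county_urls))
print(in_order)  # ['/albany.html', '/bronx.html', '/kings.html']
```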
  36. data = []
      for url in county_urls[0:3]:
          print "Fetching %s" % url
          page = requests.get(url)
          soup = BeautifulSoup(page.content)
          lines = [s for s in soup.select("th")[0].strings]
          data.append(lines)
  37. output = open("boards.txt", "w")
      for row in data:
          output.write("\t".join(row) + "\n")
      output.close()
  38. EXAMPLE 3: ACEC Members
  39. ACEC Members
      ➜ loop over pages
      ➜ fetch result table
      ➜ parse name, id, location
      ➜ write to file
  40. import requests
      import json
      from bs4 import BeautifulSoup
      base_url = 'http://www.acec.ca/about_acec/search_member_firms/business_sector_search.html/search/business/page/%s'
  41. url = base_url % 1
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
      soup.find(id='resulttable')
  42. url = base_url % 1
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
      table = soup.find(id='resulttable')
      rows = table.find_all('tr')
  43. url = base_url % 1
      page = requests.get(url)
      soup = BeautifulSoup(page.content)
      table = soup.find(id='resulttable')
      rows = table.find_all('tr')
      columns = rows[0].find_all('td')
  44. columns = rows[0].find_all('td')
      company_data = {
        'name': columns[1].a.text,
        'id': columns[1].a['href'].split('/')[-1],
        'location': columns[2].text
      }
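The 'id' field comes from the last path segment of the profile link, via split('/')[-1]. A sketch with hypothetical values standing in for the parsed cells (the firm name, URL, and location below are invented):

```python
# Hypothetical parsed values standing in for the <td> cells on the page
name = "Acme Engineering"
href = "http://www.acec.ca/member_firms/firm/1234"
location = "Toronto, ON"

company_data = {
    "name": name,
    # the last path segment of the profile URL doubles as a record id
    "id": href.split("/")[-1],
    "location": location,
}
print(company_data["id"])  # 1234
```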
  45. start_page = 1
      end_page = 2
      result = []
  46. for i in range(start_page, end_page + 1):
          url = base_url % i
          print "Fetching %s" % url
          page = requests.get(url)
          soup = BeautifulSoup(page.content)
          table = soup.find(id='resulttable')
          rows = table.find_all('tr')
  47.     for r in rows:
              columns = r.find_all('td')
              company_data = {
                'name': columns[1].a.text,
                'id': columns[1].a['href'].split('/')[-1],
                'location': columns[2].text
              }
              result.append(company_data)
  48. with open('acec.json', 'w') as outfile:
          json.dump(result, outfile)
  49. </>
  50. Python Tools
      ➜ lxml
      ➜ scrapy
      ➜ MechanicalSoup
      ➜ RoboBrowser
      ➜ pyQuery
  51. Ruby Tools
      ➜ nokogiri
      ➜ Mechanize
  52. Not coding? Scrape with:
      ➜ import.io
      ➜ Kimono
      ➜ copy & paste
      ➜ PDFTables
      ➜ Tabula
  53.
  54. Basic Authentication

      page = requests.get(url, auth=('drseuss', 'hamsandwich'))

  55. Self-signed certificates

      page = requests.get(url, verify=False)

  56. Specify Certificate Bundle

      page = requests.get(url, verify='/etc/ssl/certs.pem')

  57. Server Name Indication (SNI)

      requests.exceptions.SSLError: hostname 'shrub.ca' doesn't match
      either of 'www.arthurlaw.ca', 'arthurlaw.ca'

      $ pip install pyopenssl
      $ pip install ndg-httpsclient
      $ pip install pyasn1

  58. Unicode

      UnicodeEncodeError: 'ascii', u'Cornet, Aliz\xe9', 12, 13,
      'ordinal not in range(128)'

      Fix: myvar.encode("utf-8")
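The slide's error comes from Python 2 implicitly encoding a unicode string (Alizé Cornet's name from the WTA table) to ASCII, e.g. while writing it to a file. The suggested .encode("utf-8") also works in Python 3, where it yields UTF-8 bytes:

```python
name = u"Cornet, Aliz\xe9"  # Alizé, the accented name from the traceback

encoded = name.encode("utf-8")
print(encoded)  # b'Cornet, Aliz\xc3\xa9'

# The encoding round-trips cleanly
print(encoded.decode("utf-8") == name)  # True
```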
  59. Server Errors

      page = requests.get(url)
      if (page.status_code >= 400):
        ...
      else:
        ...

  60. Exceptions

      import sys
      try:
          r = requests.get(url)
      except requests.exceptions.RequestException as e:
          print e
          sys.exit(1)

  61. Browser Disallowed

      headers = {
        'User-Agent': 'Mozilla/3000'
      }
      response = requests.get(url, headers=headers)

  62. Rate Limiting/Slow Servers

      import time
      for i in range(0, 10):
        url = base_url % i
        page = requests.get(url)
        time.sleep(1)
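time.sleep(1) is the simplest politeness throttle. A common refinement (my addition, not in the deck) is exponential backoff after failures, doubling the wait on each retry; the delay schedule alone looks like:

```python
def backoff_delays(base=1, retries=5):
    """Seconds to wait before each retry: doubles every attempt."""
    return [base * 2 ** i for i in range(retries)]

print(backoff_delays())  # [1, 2, 4, 8, 16]
```

In a real scraper you would time.sleep() each of these in turn after a failed request, then give up once the list is exhausted.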
  63. Query String

      requests.get("http://greeneggs.ham/",
                   params={'name': 'sam', 'verb': 'are'})

  64. POST a form

      requests.post("http://greeneggs.ham/",
                    data={'name': 'sam', 'verb': 'are'})
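Both params= and data= boil down to URL-encoded key/value pairs; the standard library's urlencode shows the string involved, which requests appends to the URL for a GET and sends as the body for a form POST:

```python
from urllib.parse import urlencode

fields = {"name": "sam", "verb": "are"}

# The encoded pairs behind both the ?query=string and the form body
print(urlencode(fields))  # name=sam&verb=are
```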
  65. participate
  66. <strong>
        <em>foo</strong>
      </em>
  67. !
  68. github.com/paulschreiber/nicar15
  69. Many graphics from The Noun Project. Binoculars by Stephen West. Broken file by Maxi Koichi. Broom by Anna Weiss. Chess by Matt Brooks. Cube by Luis Rodrigues. Firewall by Yazmin Alanis. Frown by Simple Icons. Lock by Edward Boatman. Wrench by Tony Gines.
