
Scraping in 20 mins


  1. Scraping in 20 mins. Paul Bradshaw. Leanpub.com/scrapingforjournalists. Friday, 13 July 2012
  3. Function (Parameters)
  4. Function (Parameters)
      =SUM(A2:A50)
      =AVERAGE(B2:B300)
      =COUNTIF(A10:A3000,"Smith")
  5. ("string", index)
  6. Tip: search for documentation
  7. Tip: search for structure around data
  9. //div[starts-with(@class, 'jobWrap')]
  10. bit.ly/nrwscraper2
  11. excelnotes.posterous.com/tag/importxml and excelnotes.posterous.com/tag/importhtml
  13. https://scraperwiki.com/scrapers/basic_twitter_scraper/
  14. https://scraperwiki.com/docs/python/tutorials/ - Screen Scraper 2
  15. Things to know
      • Libraries
      • Functions
      • Variables
      • Lists or arrays: ['Bob', 'Jane']
      • Index
      • String, integer, float
      • If/Else
      • For loops
      • Operators
  16. Following the data
      • From string (URL) ->
      • Variable (html) ->
      • Variable (root) ->
      • Variable containing a list (tds) ->
      • Variable (td)
  17. Looping through a list
      • tds = ['Duarte', 'Sihl', 'Franzi', 'Paul']
      • for td in tds
      • The first time, td = Duarte
      • The second time, td = Sihl
      • Then td = Franzi
      • Then td = Paul
      • Then it has finished the loop!
  19. Leanpub.com/scrapingforjournalists, @paulbradshaw, onlinejournalismblog.com, helpmeinvestigate.com, slideshare.net/onlinejournalist, linkedin.com/in/onlinejournalist
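
The "Following the data" and "Looping through a list" slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the talk: a small inline HTML string stands in for the page a scraper would download from a URL, and the standard library's ElementTree parser stands in for whatever parsing library the linked tutorials use. The variable names html, root, tds and td come from the slides; the names list is added here for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: this inline snippet stands in for the page text a scraper
# would fetch from a URL string. No network request is made.
html = """
<html><body><table>
  <tr><td>Duarte</td><td>Sihl</td></tr>
  <tr><td>Franzi</td><td>Paul</td></tr>
</table></body></html>
""".strip()

# Variable (html) -> Variable (root): parse the text into a tree.
# ElementTree needs well-formed markup; real pages often need a more
# forgiving HTML parser.
root = ET.fromstring(html)

# Variable (root) -> Variable containing a list (tds): every <td> element.
tds = root.findall(".//td")

# For loop: on each pass, td holds the next item from the list.
names = []  # hypothetical helper list, not on the slides
for td in tds:
    names.append(td.text)
    print(td.text)
```

Each pass through the loop prints one cell, in page order, and the loop ends once the list runs out of items.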
