Scraping for journalists - ideas, concepts and tips (CIJ Summer School 2019)

Presentation at the Centre for Investigative Journalism Summer School, July 2019

  1. Scraping for stories. Paul Bradshaw, Leanpub.com/scrapingforjournalists
  2. Why and how: why scraping; how to spot opportunities for scraping; tools and traits (what can be scraped, and how)
  3. What is scraping? Automating the repetitive gathering of data, e.g. data from the same page every day (e.g. social media); data from multiple pages; multiple documents (spreadsheets, PDFs)
  4. Why is a government website carrying fake jobs? (Aron Pilhofer, News Rewired)
  5. https://www.youtube.com/watch?v=Efr-VEkwWoM
  6. http://www.mirror.co.uk/news/uk-news/singer-best-vocal-range-uk-4323076
  7. http://doctors.ajc.com/
  8. https://www.youtube.com/watch?v=DWRGqmywNYs&t=2m10s https://nos.nl/op3/artikel/2248432-zo-dun-moet-je-echt-zijn-als-je-model-wil-worden.html https://datajournalismawards.org/projects/this-is-how-thin-you-have-to-be-to-walk-the-catwalk/
  9. You want to investigate allegations of match fixing, what data do you scrape?
  10. One scraper, many stories
  11. Patterns, hunches
  12. Combine with FOI: anticipate ‘published data’ exemption; ‘second source’ for FOI data
  13. Scheduled scrapes: gives you a long-term insight into the issue; allows you to spot things being removed and added
  14. https://www.bbc.co.uk/news/health-48624554
  15. New entries - or disappearing ones
  16. Stories that suit scraping: data outside scope of FOI; repetitive work, e.g. 1000s of documents; information that changes regularly; sources that follow a pattern
  17. Stories that don’t: data that can be obtained through asking, FOI, API etc.; multiple sites with different CMS; small amounts of data/repetition
  18. What makes a website suitable for scraping?
  19. *
  20. http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
  21. URL parameters: https://stores.sainsburys.co.uk/
  22. *
  23. https://tfl.gov.uk/corporate/safety-and-security/road-safety/safety-enforcement-cameras
  24. https://tfl.gov.uk/corporate/safety-and-security/road-safety/safety-enforcement-cameras
  25. Document challenges: HTML <table>; HTML tag(s); JSON file; XML; TXT; CSV or XLS; PDF; PDF which needs OCR
  26. Hosting challenges: 1 page, changes; >1 pages, ‘next’ links; pages linked from 1 index; >1 pages, URL pattern; search results, URL pattern; search results, uses cookie; search results, login needed
  27. Patterns: look for structure in a webpage (how are elements distinguished?); think code and text
  28. Chrome: right-click > Inspect
  29. Inspector: right-click > Copy…
  30. A tip about URLs
  31. This bit isn’t needed.
  32. This bit is just for SEO.
  33. You can put anything there. Literally.
  34. Do it now: identify an online source of information you might scrape. Think beyond tables: what about series of pages? Documents?
  35. Thank you. Paul Bradshaw, Leanpub.com/scrapingforjournalists
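
The ideas on slides 13, 15 and 25 (scheduled scrapes, new or disappearing entries, data held in an HTML `<table>`) can be sketched in a few lines of Python. This is a minimal illustration and not the method used in the talk: the two HTML snippets, the camera names and the helper names `scrape_rows` and `compare_scrapes` are all invented for the example. It uses only the standard library and assumes the table is simple, with no nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a tuple of cell strings
        self._row = None   # cells of the row being read (None outside <tr>)
        self._cell = None  # text pieces of the cell being read (None outside a cell)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(tuple(self._row))
            self._row = None
            self._cell = None


def scrape_rows(html):
    """Return the table rows found in an HTML string as a list of tuples."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def compare_scrapes(old_rows, new_rows):
    """Return (added, removed) rows between two scrapes of the same page."""
    old, new = set(old_rows), set(new_rows)
    return sorted(new - old), sorted(old - new)


# Hypothetical snapshots of the same page saved on two different days.
YESTERDAY = """
<table>
  <tr><th>Camera</th><th>Road</th></tr>
  <tr><td>A1</td><td>High Street</td></tr>
  <tr><td>B2</td><td>Station Road</td></tr>
</table>
"""

TODAY = """
<table>
  <tr><th>Camera</th><th>Road</th></tr>
  <tr><td>A1</td><td>High Street</td></tr>
  <tr><td>C3</td><td>Mill Lane</td></tr>
</table>
"""

added, removed = compare_scrapes(scrape_rows(YESTERDAY), scrape_rows(TODAY))
print("Added:", added)
print("Removed:", removed)
```

In a real scheduled scrape, the two strings would come from copies of the page saved on different days. Comparing rows as sets ignores their order, so a row that merely moves is not reported, while an edited row shows up as one removal plus one addition.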
