Scraping for Stories

Presentation at CIJ Summer School 2016

Published in: Education

  1. Paul Bradshaw Leanpub.com/scrapingforjournalists* Scraping for stories
  2. Why and how: Why scraping; How to spot opportunities for scraping; Tools and traits: what can be scraped, and how
  3. What is scraping? Automating the repetitive gathering of data, e.g. multiple tables in one page, webpage tables, multiple spreadsheets, multiple PDFs
  4. Why is a government website carrying fake jobs? Aron Pilhofer, News Rewired
  5. https://www.youtube.com/watch?v=Efr-VEkwWoM
  6. http://www.mirror.co.uk/news/uk-news/singer-best-vocal-range-uk-4323076
  7. http://www.private-eye.co.uk/registry
  8. Focus.
  9. New entries - or disappearing ones
  10. http://helpmeinvestigate.com/olympics/olympic-torch-relay-youth-target-missed-by-over-1000-places/
  11. https://moveplanner.zoopla.co.uk/terms-and-conditions
  12. http://blogs.ft.com/ftdata/2014/06/11/interactive-explore-the-statistical-identity-of-every-team-at-the-world-cup/?
  13. http://www.nature.com/news/scientific-publishing-the-inside-track-1.15424
  14. What makes a website suitable for scraping?
  15. *
  16. * http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
  17. * URL parameters https://stores.sainsburys.co.uk/
  18. Document challenges: HTML <table>; HTML tag(s); JSON file; XML; TXT; CSV or XLS; PDF; PDF which needs OCR
  19. Hosting challenges: 1 page, changes; >1 pages, ‘next’ links; pages linked from 1 index; >1 pages, URL pattern; Search results URL pattern; Search results, uses cookie; Search results, login needed
  20. Patterns: Look for structure in a webpage - how are elements distinguished? Think code and text
  21. * Chrome: right-click > Inspect
  22. * Inspector: right-click > Copy…
  23. * A tip about URLs
  24. This bit isn’t needed.
  25. This bit is just for SEO.
  26. You can put anything there. Literally.
  27. Do it now: Identify an online source of information you might scrape. Think beyond tables: what about series of pages? Documents?
  28. https://onlinejournalismblog.com/2013/09/18/ethics-in-data-journalism-mass-data-gathering-scraping-foi-and-deception/
  29. Robots.txt http://www.tcij.org/robots.txt
  30. Treat like any source: build in TGTBT (too good to be true) checks; seek second sources; seek right of reply/confirmation. Data is just a lead
  31. http://www.storybench.org/to-scrape-or-not-to-scrape-the-technical-and-ethical-challenges-of-collecting-data-off-the-web/
  32. Does it have an API? https://www.mediawiki.org/wiki/API:Main_page
  33. Scheduled scrapes: Gives you a long term insight into the issue; allows you to spot things being removed and added
  34. Paul Bradshaw Leanpub.com/scrapingforjournalists* Thank you.
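
Slides 18 and 20 name the HTML <table> as one of the document types a scraper may meet, and suggest looking at how elements are distinguished in the page code. As a rough illustration only, and not the method used in the talk, the sketch below pulls the rows out of a table using Python's standard library. The class name, function name and sample HTML are all made up for the example; a real page would first need to be downloaded and will often be messier.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return every table row in the HTML as a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample, standing in for a downloaded listings page.
SAMPLE = """
<table>
  <tr><th>Job</th><th>Employer</th></tr>
  <tr><td>Researcher</td><td>Example MP</td></tr>
  <tr><td>Intern</td><td>Example Charity</td></tr>
</table>
"""

rows = table_rows(SAMPLE)
for row in rows:
    print(row)
```

The first row holds the header cells and the rest hold the data, so the result can be written straight to a CSV file. This simple version treats all tables on a page as one run of rows, which is an assumption that will not suit every site.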

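Slide 29 points to a site's robots.txt file. One way to check it from code, sketched here under assumptions, is Python's standard `urllib.robotparser` module. The rules, the `story-scraper` user agent name and the example.org URLs below are invented for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real scraper would read the site's own file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_text):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)

# Hypothetical user agent name and URLs.
allowed = rules.can_fetch("story-scraper", "https://example.org/jobs/page1")
blocked = rules.can_fetch("story-scraper", "https://example.org/private/list")
delay = rules.crawl_delay("story-scraper")

print(allowed, blocked, delay)
```

To check a live site instead, the same class offers `set_url()` followed by `read()`, for example with the address on slide 29. The file only states what the site owner asks of automated visitors; the ethical questions raised on slides 28 and 31 still need a human answer.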
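
Slides 9 and 33 make the point that scheduled scrapes let you spot entries being added or removed. A minimal sketch of that comparison, assuming each scrape has already been reduced to a list of entry names, might look like the following. The function name and the sample names are hypothetical.

```python
def compare_scrapes(previous, current):
    """Return the entries that appeared or disappeared between two scrapes."""
    prev, curr = set(previous), set(current)
    return {
        "added": sorted(curr - prev),    # in the new scrape only
        "removed": sorted(prev - curr),  # in the old scrape only
    }


# Made-up results from two runs of the same scraper.
monday = ["Alice Smith", "Bob Jones", "Carol White"]
tuesday = ["Bob Jones", "Carol White", "Dan Brown"]

changes = compare_scrapes(monday, tuesday)
print(changes)
```

As slide 30 says, the data is just a lead: an entry that vanishes between two runs may reflect a change to the page layout rather than a real removal, so it is worth confirming by hand before reporting.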