
Finding data BBC 15



Talk at the BBC Data Day, Birmingham, January 2015

Published in: Education


  1. 1. Data Mining: Search, scraping, FOI and feeds. @PaulBradshaw, Online Journalism Blog; Birmingham City University and City University London. BBC, January 2015. Image by Evan Long
  2. 2. 1. Search tips and tools 2. Sources and feeds 3. Data requests 4. Scraping
  3. 3. 1. Search tips and tools 2. Sources and feeds 3. Data requests 4. Scraping
  4. 4. Search operators: Don’t ask for what you want; describe what you expect to find
  5. 5. Imagine the data: text. What text will it contain? Where will that text be? What text will it not contain?
  6. 6. Specific references, not general. Specify: a constituency; a school; an institution code; an invoice number; a piece of jargon
  7. 7. ""
  8. 8. -
  9. 9. *
  10. 10. ..
  11. 11. "disclosure log" 
 "between * and 2014" 
 "hate crime" -religion -"publication scheme" 
 Number ranges: 2000..2014
  12. 12. ‘life expectancy Birmingham’
  13. 13. "life expectancy" 
 "perry barr"
  14. 14. inurl:
  15. 15. inurl:foi inurl:ccg inurl:intranet inurl:search.asp inurl:search.php
  16. 16. intitle: allintitle:
  17. 17. intitle:foi 
 allintitle:disclosure log 
 intitle:"bank fines"
  18. 18. intext: allintext:
  19. 19. intext:"miserable failure" 
 allintext:miserable failure
  20. 20. "life expectancy" 
 "perry barr"
  21. 21. "life expectancy" 
 "perry barr" 
 filetype:xls
  22. 22. "life expectancy" 
 "perry barr" 
 filetype:xls 
 site:ons.gov.uk
  23. 23. "life expectancy" 
 "perry barr" 
 filetype:xls 
 site:ons.gov.uk 
 2009..2014
  24. 24. "life expectancy" 
 "perry barr" 
 filetype:xls 
 site:ons.gov.uk 
 2009..2014 
 -winter
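The step-by-step query built up in slides 20 to 24 can be sketched in code. This is a minimal illustration, assuming a hypothetical helper called `build_query`; the operators are the ones from the slides, and the search URL pattern in the last line is an assumption for pasting into a browser, not an official API.

```python
from urllib.parse import urlencode


def build_query(phrases, filetype=None, site=None, year_range=None, exclude=()):
    """Assemble a search string from the operators shown in the slides.

    build_query is a hypothetical helper written for this example.
    """
    parts = [f'"{phrase}"' for phrase in phrases]   # exact phrases in quotes
    if filetype:
        parts.append(f"filetype:{filetype}")        # restrict by file format
    if site:
        parts.append(f"site:{site}")                # restrict by domain
    if year_range:
        start, end = year_range
        parts.append(f"{start}..{end}")             # number range
    parts.extend(f"-{word}" for word in exclude)    # terms to exclude
    return " ".join(parts)


query = build_query(
    ["life expectancy", "perry barr"],
    filetype="xls",
    site="ons.gov.uk",
    year_range=(2009, 2014),
    exclude=["winter"],
)
print(query)
# Assumed URL pattern, useful only for opening the search in a browser.
print("https://www.google.com/search?" + urlencode({"q": query}))
```

Adding one operator at a time, as the slides do, makes it easy to see which restriction removed the results you were hoping for.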
  25. 25. Imagine the data: metadata. Where is it likely to be? What format? When was it not published?
  26. 26. site:
  27. 27. site:gov.uk site:nhs.uk site:police.uk site:ac.uk site:org.uk site:org site:birmingham.gov.uk site:met.police.uk/foi/ disclosure
  28. 28. filetype:
  29. 29. filetype:xls filetype:xlsx filetype:pdf filetype:csv filetype:ppt filetype:doc filetype:docx filetype:xml
  30. 30. search tools
  31. 31. Combine operators: 
 "disclosure log" site:gov.uk 
 allintitle:hate crime report filetype:pdf site:police.uk 
 art inurl:search.asp -library
  32. 32. research.google.com
  33. 33. zanran.com
  34. 34. Do it now: (1) Search for a disclosure log for a CCG. (2) Search for spreadsheets mentioning Andrew Mitchell MP.
  35. 35. 1. Search tips and tools 2. Sources and feeds 3. Data requests 4. Scraping
  36. 36. Audits and transparency data; parliamentary questions; reports, research, sources; FOI requests, disclosure logs; press offices; public data and databases (scraping)
  37. 37. Open data initiatives & activism (TWFY); hackdays, e.g. Rewired State; public data and databases (scraping); crowdsourcing or surveys; social networks
  38. 38. Key sources: 
 NOMIS, ONS, Data.gov.uk 
 HES, NHSIC indicator portal 
 Data.Police.uk 
 HEFCE, HESA, Ofsted, UCAS 
 fullfact.org/finder
  39. 39. Do it now: (1) Set up Change Detection for the CCG disclosure log. (2) Set up email alerts for publications on Data.gov.uk.
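The general idea behind a change-detection alert can be sketched in a few lines: store a fingerprint of the page, fetch it again later, and compare. This is only an illustration of the principle, not how any particular service works; the function names are hypothetical, and a real monitor would also need to ignore parts of the page that change on every load (dates, adverts).

```python
import hashlib
from urllib.request import urlopen


def fingerprint(content: bytes) -> str:
    """Return a short, stable fingerprint of a page's raw content."""
    return hashlib.sha256(content).hexdigest()


def has_changed(previous_fingerprint: str, content: bytes) -> bool:
    """True if the content no longer matches the stored fingerprint."""
    return fingerprint(content) != previous_fingerprint


def check(url: str, previous_fingerprint: str):
    """Fetch a page and report (changed?, new fingerprint).

    Hypothetical helper: call it on a schedule and keep the returned
    fingerprint for the next comparison.
    """
    with urlopen(url) as response:
        content = response.read()
    return has_changed(previous_fingerprint, content), fingerprint(content)
```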
  40. 40. 1. Search tips and tools 2. Sources and feeds 3. Data requests 4. Scraping
  41. 41. http://www.panopticonblog.com/2014/08/01/section-11-foia-and-the-form-of-a-request/ 
 http://www.bailii.org/ew/cases/EWCA/Civ/2014/1086.html 
 As per the judgement in Innes v Information Commissioner [2014] EWCA Civ 1086, I would like to request the data in spreadsheet format…
  42. 42. Do it now: (1) Draft an FOI request for a local body’s data dictionary. (2) Use WhatDoTheyKnow (so others googling codes can find you).
  43. 43. 1. Search tips and tools 2. Sources and feeds 3. Data requests 4. Scraping
  44. 44. What is scraping? Automating the repetitive gathering of data, e.g. 
 Multiple tables in one page 
 Webpage tables 
 Multiple spreadsheets 
 Multiple PDFs
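As a rough sketch of the "multiple tables in one page" case, the following uses only Python's standard library to collect every table in a piece of HTML as rows of cell text. The sample markup and the `TableParser` class are invented for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every table on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # one list of rows per <table>
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


# Invented sample page with two small tables.
SAMPLE = """
<table><tr><th>Area</th><th>Value</th></tr><tr><td>A</td><td>1</td></tr></table>
<table><tr><th>Area</th><th>Value</th></tr><tr><td>B</td><td>2</td></tr></table>
"""

parser = TableParser()
parser.feed(SAMPLE)
for table in parser.tables:
    print(table)
```

The repetitive part is what makes this worth automating: the same few lines work whether the page holds two tables or two hundred.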
  45. 45. https://www.youtube.com/watch?v=Efr-VEkwWoM
  46. 46. http://blogs.ft.com/ftdata/2014/06/11/interactive-explore-the-statistical-identity-of-every-team-at-the-world-cup/?
  47. 47. http://www.mirror.co.uk/news/uk-news/singer-best-vocal-range-uk-4323076
  48. 48. * http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs Tip: empty search
  49. 49. How to scrape? Basic tables: WYSIWYG tools, Google Sheets functions. Programming: Scraperwiki.
  50. 50. Paul Bradshaw Leanpub.com/scrapingforjournalists* <plug>
  51. 51. * Function (Arguments) (aka parameters)
  52. 52. * Query (XPath)
  53. 53. * Tip: search for structure around data http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
  54. 54. * http:// www.w4mpjobs.org/ SearchJobs.aspx? http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
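The URL broken into pieces on this slide can be taken apart the same way in code. A small sketch using Python's standard `urllib.parse`; the helper `with_query` and the search value "intern" are hypothetical, used only to show how a scraper might generate variations on a URL once its structure is understood.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs"

parts = urlsplit(url)
print(parts.scheme)            # protocol: http
print(parts.netloc)            # site: www.w4mpjobs.org
print(parts.path)              # page: /SearchJobs.aspx
print(parse_qs(parts.query))   # query: {'search': ['alljobs']}


def with_query(base_url, **params):
    """Return base_url with its query string replaced by params.

    Hypothetical helper for this example.
    """
    pieces = urlsplit(base_url)
    return urlunsplit(pieces._replace(query=urlencode(params)))


# "intern" is an invented search value, not taken from the slides.
print(with_query(url, search="intern"))
```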
  55. 55. *
  56. 56. * "//div[@class='leftcolumn']"
  57. 57. * //div[starts-with(@class, 'jobWrap')]
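The two queries above can be tried out in Python as well as in a spreadsheet function. A minimal sketch, assuming an invented, well-formed sample page (the real site's markup may differ): the standard library's `ElementTree` understands only a subset of XPath, so the exact-match query works directly, while `starts-with()` is emulated here with a plain string test. A third-party library with full XPath support would accept the second query as written.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; only the class names come from the slides.
SAMPLE = """
<html><body>
  <div class="leftcolumn">
    <div class="jobWrap odd"><a href="/job/1">Researcher</a></div>
    <div class="jobWrap even"><a href="/job/2">Caseworker</a></div>
    <div class="footer">Contact</div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Equivalent of //div[@class='leftcolumn'] (supported by ElementTree).
left_columns = root.findall(".//div[@class='leftcolumn']")

# Equivalent of //div[starts-with(@class, 'jobWrap')], emulated in Python
# because ElementTree does not implement starts-with().
jobs = [
    div for div in root.iter("div")
    if div.get("class", "").startswith("jobWrap")
]

titles = [div.findtext("a") for div in jobs]
print(len(left_columns), titles)
```

The second query is the more useful pattern for scraping: it matches every repeated block whose class begins the same way, even when a suffix such as "odd" or "even" varies.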
  58. 58. * A crib sheet:
  59. 59. Paul Bradshaw Leanpub.com/scrapingforjournalists* Scraping tools
  60. 60. * Chrome extension:
  61. 61. *
  62. 62. *
  63. 63. OutWit Hub
  64. 64. Do it now: Identify a website which has multiple pages or documents containing data you could combine Where’s the structure? Table? URL? Links?
  65. 65. 1. Search: describe the data 2. Feeds: get regular updates 3. FOI: request detail, in CSV format 4. Scraping: look for structure and repetition
  66. 66. Thank you. Image by Evan Long @PaulBradshaw, Online Journalism Blog, HelpMeInvestigate Birmingham City University and City University London BBC Future Day, September 2014
