@PaulBradshaw, Online Journalism Blog
Birmingham City University and City University London
BBC, January 2015
Data Mining
Search, scraping, FOI and feeds
Image by Evan Long
1. Search tips and tools
2. Sources and feeds
3. Data requests
4. Scraping
1. Search tips and tools
Don’t ask for what you want: describe what you expect to find
Search operators
Imagine the data: text
What text will it contain?
Where will that text be?
What text will it not contain?
Specific references, not general:
Specify a constituency…
…a school
…an institution code
…an invoice number
…a piece of jargon
""
-
*
..
"disclosure log"
"between * and 2014"
"hate crime" -religion
-"publication scheme"
Number ranges: 2000..2014
‘life expectancy Birmingham’
"life expectancy" "perry barr"
inurl:
inurl:foi
inurl:ccg
inurl:intranet
inurl:search.asp
inurl:search.php
intitle:
allintitle:
intitle:foi
allintitle:disclosure log
intitle:"bank fines"
intext:
allintext:
intext:"miserable failure"
allintext:miserable failure
"life expectancy" "perry barr"
"life expectancy" "perry barr" filetype:xls
"life expectancy" "perry barr" filetype:xls site:ons.gov.uk
"life expectancy" "perry barr" filetype:xls site:ons.gov.uk 2009..2014
"life expectancy" "perry barr" filetype:xls site:ons.gov.uk 2009..2014 -winter
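The step-by-step build-up above can also be sketched in code. This is a minimal illustration, not part of the original slides: the function name `build_query` and its parameters are hypothetical, and the sketch assumes the usual `https://www.google.com/search?q=` address pattern for a search URL.

```python
from urllib.parse import quote_plus


def build_query(phrases=(), filetype=None, site=None, year_range=None, exclude=()):
    """Assemble a search string from the operators covered in the slides.

    build_query and its parameter names are hypothetical, for illustration only.
    """
    parts = [f'"{phrase}"' for phrase in phrases]  # exact phrases in quotes
    if filetype:
        parts.append(f"filetype:{filetype}")       # restrict the file format
    if site:
        parts.append(f"site:{site}")               # restrict the domain
    if year_range:
        parts.append(f"{year_range[0]}..{year_range[1]}")  # number range
    parts.extend(f"-{word}" for word in exclude)   # words to exclude
    return " ".join(parts)


query = build_query(
    phrases=["life expectancy", "perry barr"],
    filetype="xls",
    site="ons.gov.uk",
    year_range=(2009, 2014),
    exclude=["winter"],
)
# Assumed URL pattern: the query is URL-encoded and passed as the q parameter.
search_url = "https://www.google.com/search?q=" + quote_plus(query)
print(query)
print(search_url)
```

Each argument maps to one operator, so a query can be narrowed one step at a time in the same way as the slides do.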
Imagine the data: metadata
Where is it likely to be?
What format?
When was it not published?
site:
site:gov.uk
site:nhs.uk
site:police.uk
site:ac.uk
site:org.uk
site:org
site:birmingham.gov.uk
site:met.police.uk/foi/
disclosure
filetype:
filetype:xls
filetype:xlsx
filetype:pdf
filetype:csv
filetype:ppt
filetype:doc
filetype:docx
filetype:xml
search tools
Combine operators:
"disclosure log" site:gov.uk
allintitle:hate crime report filetype:pdf site:police.uk
art inurl:search.asp -library
research.google.com
zanran.com
Do it now:
Search for a disclosure log for a CCG
Search for spreadsheets mentioning Andrew Mitchell MP
2. Sources and feeds
Audits and transparency data
Parliamentary questions
Reports, research, sources
FOI requests, disclosure logs
Press offices
Public data and databases - scraping
Open data initiatives & activism (TWFY)
Hackdays e.g. Rewired State
Crowdsourcing or surveys
Social networks
Key sources
NOMIS, ONS, Data.gov.uk
HES, NHSIC indicator portal
Data.Police.uk
HEFCE, HESA, Ofsted, UCAS
fullfact.org/finder
Do it now:
Set up Change Detection for the CCG disclosure log
Set up email alerts for publications on Data.gov.uk
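The idea behind change detection can be sketched in a few lines. This is not how any particular alert service works; it only illustrates the principle of storing a fingerprint of a page and comparing it on the next visit. The function names and the sample page text are made up for the example.

```python
import hashlib


def fingerprint(page_text: str) -> str:
    """Return a short, stable fingerprint of a page's text."""
    # Collapse whitespace so purely cosmetic spacing changes are ignored.
    normalised = " ".join(page_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, page_text: str) -> bool:
    """Compare the stored fingerprint with the page as it looks now."""
    return fingerprint(page_text) != previous_fingerprint


# Made-up page contents standing in for two visits to a disclosure log.
old_page = "Disclosure log\nFOI 001: Agency spending"
new_page = "Disclosure log\nFOI 001: Agency spending\nFOI 002: Waiting times"

saved = fingerprint(old_page)
print(has_changed(saved, old_page))  # False: nothing new
print(has_changed(saved, new_page))  # True: a new entry was added
```

In practice the page text would be fetched on a schedule and an alert sent whenever the comparison returns True.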
3. Data requests
http://www.panopticonblog.com/2014/08/01/section-11-foia-and-the-form-of-a-request/
http://www.bailii.org/ew/cases/EWCA/Civ/2014/1086.html
As per the judgment in Innes v Information Commissioner [2014] EWCA Civ 1086, I would like to request the data in spreadsheet format…
Do it now:
Draft an FOI request for a local body’s data dictionary
Use WhatDoTheyKnow (so others googling codes can find you)
4. Scraping
What is scraping?
Automating the repetitive gathering of data, e.g.
Multiple tables in one page
Webpage tables
Multiple spreadsheets
Multiple PDFs
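The "multiple tables in one page" case can be sketched with Python's standard library alone. This is one possible approach under simple assumptions (no nested tables, tidy markup), not a recommended tool; the class name `TableScraper` and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect every table on a page as a list of rows (lists of cell text).

    A simplified sketch: nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Invented sample page with two tables.
sample_html = """
<table><tr><th>Area</th><th>Value</th></tr><tr><td>A</td><td>1</td></tr></table>
<p>Some text between the tables</p>
<table><tr><th>Area</th><th>Value</th></tr><tr><td>B</td><td>2</td></tr></table>
"""

scraper = TableScraper()
scraper.feed(sample_html)
for number, table in enumerate(scraper.tables, start=1):
    print(number, table)
```

The same loop that prints the tables could instead write each one to a spreadsheet, which is the repetitive step scraping automates.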
https://www.youtube.com/watch?v=Efr-VEkwWoM
http://blogs.ft.com/ftdata/2014/06/11/interactive-explore-the-statistical-identity-of-every-team-at-the-world-cup/?
http://www.mirror.co.uk/news/uk-news/singer-best-vocal-range-uk-4323076
http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
Tip: empty search
How to scrape?
Basic tables: WYSIWYG tools
Google Sheets functions
Programming: Scraperwiki
Paul Bradshaw
Leanpub.com/scrapingforjournalists*
*<plug>
Function (Arguments)
(aka parameters)
Query (XPath)
Tip: search for structure around data
http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
http://
www.w4mpjobs.org/
SearchJobs.aspx?
http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
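The structure in a URL like this one can be pulled apart in code. The sketch below uses Python's standard `urllib.parse` module on the address from the slide; the helper name `with_query` is hypothetical, and no request is sent to the site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs"

parts = urlsplit(url)
print(parts.scheme)           # the protocol
print(parts.netloc)           # the site
print(parts.path)             # the page that runs the search
print(parse_qs(parts.query))  # the search settings after the question mark


def with_query(base_url, **params):
    """Rebuild a URL with a new set of query parameters (hypothetical helper)."""
    p = urlsplit(base_url)
    return urlunsplit((p.scheme, p.netloc, p.path, urlencode(params), ""))


# Rebuilding with the same parameter gives back the original address.
print(with_query(url, search="alljobs"))
```

Once the parts are separated, the part that changes from page to page is the one a scraper needs to loop over.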
"//div[@class='leftcolumn']"
//div[starts-with(@class, 'jobWrap')]
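The two XPath queries can be tried out in Python. This sketch makes several assumptions: the sample markup is invented (the real page's structure is not reproduced here), it is well-formed enough to parse as XML, and because the standard library's ElementTree supports only a subset of XPath without `starts-with()`, the second query is imitated with a plain Python filter. A fuller XPath engine such as lxml should accept the expression directly.

```python
import xml.etree.ElementTree as ET

# Invented markup that mimics the class names used in the queries.
sample = """
<html><body>
  <div class="leftcolumn">
    <div class="jobWrap odd"><a href="/job/1">Researcher</a></div>
    <div class="jobWrap even"><a href="/job/2">Caseworker</a></div>
  </div>
  <div class="rightcolumn"><div class="advert">Ad</div></div>
</body></html>
""".strip()

root = ET.fromstring(sample)

# Equivalent of //div[@class='leftcolumn']
left = root.findall(".//div[@class='leftcolumn']")

# Stand-in for //div[starts-with(@class, 'jobWrap')]
jobs = [div for div in root.iter("div")
        if div.get("class", "").startswith("jobWrap")]

titles = [div.findtext("a") for div in jobs]
print(len(left), titles)
```

The first query matches a class name exactly; the second matches any class that begins with the same text, which helps when a site adds varying suffixes to each item.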
A crib sheet:
Paul Bradshaw
Leanpub.com/scrapingforjournalists*
Scraping tools
Chrome extension:
OutWit Hub
Do it now:
Identify a website which has multiple pages or documents containing data you could combine
Where’s the structure? Table? URL? Links?
1. Search: describe the data
2. Feeds: get regular updates
3. FOI: request detail, in CSV format
4. Scraping: look for structure and repetition
Thank you.
@PaulBradshaw, Online Journalism Blog, HelpMeInvestigate
Birmingham City University and City University London
BBC Future Day, September 2014
