A presentation from Martin Ralphs (Office for National Statistics, head of the Good Practice team across the Government Statistical Service) as part of the Young Policy Professionals event, ‘Public policy in the “big data” age’, held on 9 March 2016 at the National Audit Office, London.
2. Government Data Science Partnership
Raise awareness of data science potential
Embed new approaches and new skills and improve existing capability
Engage with departments to understand opportunities and issues
Build and support a cross-government data science community to share expertise
Break down technical barriers and understand ethical issues
Learning by doing
Government Innovation Group
Government Digital Service
3. What is data science?
Data science
Volume, Velocity, Variety
New approach: a ‘data first’ mindset, exploring the data to find insights and potential improvements using new and innovative techniques
New technology: new, low-priced cloud storage, with scalable technology capable of running software that delivers rapid insights
4. How can data science improve government policy and operations?
Data visualisation
New data sets and collection methods
Machine learning
Social media
Web scraping
Prediction
Clustering
Unstructured data
Real-time data
Interactive web apps
Real-time feeds
Personalisation
5. Data sources for official statistics
Surveys – e.g. of businesses and households
Census – every 10 years
Administrative data – by-product of government processes
Big Data?
“Data that is difficult to collect, store or process within the
conventional systems of statistical organizations. Either, their
volume, velocity, structure or variety requires the adoption of
new statistical software processing techniques and/or IT
infrastructure to enable cost-effective insights to be made.”
(UNECE, 2013)
6. Big data sources
Social media: posts, pictures and
videos
Purchase transaction records
Mobile phone GPS and cell
tower signals
High volume administrative
& transactional records
Sensors gathering information: e.g.
climate, traffic, internet of things etc.
Digital satellite images
7. Web scraping supermarket prices
● Price collection is currently manual
● Web scraping offers more detailed, more frequent data at lower cost
● Web-scraped data provides an opportunity to gain experience in processing high-volume price data
8. Prototype web scrapers
● 3 supermarkets
● 35 CPI/RPI item categories
● Written using Python (scrapy)
● Daily collection (around 6500 price quotes)
● Item counts monitored daily
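A minimal sketch of the kind of extraction step such a scraper performs. The ONS prototypes were written with Scrapy; to keep this self-contained it uses only the standard library, and the page structure and class names (`product-name`, `product-price`) are invented for illustration:

```python
import re
from html.parser import HTMLParser

# Illustrative sketch only (not the ONS production scraper): pull
# (product, price) pairs out of a hypothetical supermarket listing page.
# The CSS class names below are assumptions for the example.

class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._field = None
        self._name = None
        self.quotes = []          # collected (name, price) tuples

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        if "product-name" in cls:
            self._field = "name"
        elif "product-price" in cls:
            self._field = "price"

    def handle_data(self, data):
        if self._field == "name":
            self._name = data.strip()
        elif self._field == "price":
            m = re.search(r"£(\d+\.\d{2})", data)
            if m and self._name:
                self.quotes.append((self._name, float(m.group(1))))
        self._field = None

page = """
<div><span class="product-name">Pure Apple Juice 2L</span>
<span class="product-price">£1.50</span></div>
<div><span class="product-name">Mango Juice Drink 1L</span>
<span class="product-price">£0.90</span></div>
"""

parser = PriceParser()
parser.feed(page)
print(parser.quotes)  # [('Pure Apple Juice 2L', 1.5), ('Mango Juice Drink 1L', 0.9)]
```

Run daily against each retailer's product listings, a parser like this yields the ~6,500 price quotes per day mentioned above.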
9. Classification challenge
“This is a dessert apple”
“This is fruit juice (not orange)”
“This is fruit juice (not orange)” and not a dessert apple!
Tesco Mango Juice Drink 1ltr
Tesco Pure Apple Juice 2 Litre
Training set → supervised machine learning
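A toy illustration of the supervised approach, not the ONS classifier: train a naive Bayes model on a handful of labelled product descriptions (the labels echo the slide; the training texts are invented), then assign a category to an unseen price quote.

```python
import math
from collections import Counter, defaultdict

# Illustrative sketch only: naive Bayes over bag-of-words, with
# Laplace smoothing. Training examples are invented for the demo.
training = [
    ("tesco pure apple juice 2 litre", "fruit juice (not orange)"),
    ("tesco mango juice drink 1ltr",   "fruit juice (not orange)"),
    ("braeburn dessert apples 6 pack", "dessert apple"),
    ("gala apples loose",              "dessert apple"),
]

word_counts = defaultdict(Counter)   # per-category word frequencies
cat_counts = Counter()               # per-category document counts
for text, cat in training:
    word_counts[cat].update(text.split())
    cat_counts[cat] += 1

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the category with the highest log posterior."""
    best, best_score = None, -math.inf
    for cat in cat_counts:
        score = math.log(cat_counts[cat] / len(training))
        total = sum(word_counts[cat].values())
        for w in text.split():
            score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = cat, score
    return best

print(classify("asda apple juice carton"))  # fruit juice (not orange)
```

The word "apple" alone would mislead a keyword match, but the surrounding words push the classifier to the juice category, which is exactly the challenge the slide describes.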
12. “The real finding of the initial research was not that inflation is too high, but the method of collecting prices matters rather a lot” (Paul Johnson, IFS)
13. Smart meters
Rationale: use smart-type electricity meter data to model occupancy or household composition from energy-use profiles
Support more efficient field operations (in 2011, £6.6m was spent trying to enumerate vacant properties)
Data from smart meter trials in Great Britain and the Republic of Ireland
A range of potential methods identified
Significant issues around privacy and ethics
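One simple way such modelling could work, sketched here with invented numbers and thresholds (not the ONS method): a property whose consumption profile is both low and flat suggests nobody is home, which could help target field visits to likely-vacant addresses.

```python
from statistics import mean, pstdev

# Illustrative sketch only: flag likely-vacant properties from
# half-hourly smart-meter readings (kWh). The thresholds below are
# invented for the example.

def looks_vacant(readings_kwh, base=0.05, spread=0.02):
    """A low, flat consumption profile suggests an empty property."""
    return mean(readings_kwh) < base and pstdev(readings_kwh) < spread

occupied = [0.2, 0.4, 1.1, 0.9, 0.3, 0.6]        # evening usage peaks
vacant   = [0.01, 0.02, 0.01, 0.02, 0.01, 0.01]  # baseline load only

print(looks_vacant(occupied))  # False
print(looks_vacant(vacant))    # True
```

A real model would need to handle fridges and other always-on loads, seasonal variation, and holidays, which is part of why a range of methods was explored.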
15. Twitter
Rationale: using geo-located tweets to explore mobility and migration
7 months of geo-located tweets within Great Britain (about 100 million data points)
Can infer place of usual residence
Significant issues around privacy and ethics
Map: geolocated tweet penetration rate by local authority
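One plausible inference rule, sketched with invented example data (the ONS method is not detailed on the slide): take a user's most frequent tweet location during night-time hours, when most people are at home, as their likely place of usual residence.

```python
from collections import Counter
from datetime import datetime

# Illustrative sketch only: (timestamp, local authority) pairs for one
# user. The data below is invented for the example.
tweets = [
    (datetime(2016, 3, 1, 23, 40), "Camden"),
    (datetime(2016, 3, 2, 2, 15),  "Camden"),
    (datetime(2016, 3, 2, 13, 5),  "Westminster"),  # daytime, at work
    (datetime(2016, 3, 3, 0, 30),  "Camden"),
]

def usual_residence(tweets, night=(22, 6)):
    """Modal tweet location between 22:00 and 06:00."""
    start, end = night
    at_night = [la for ts, la in tweets if ts.hour >= start or ts.hour < end]
    return Counter(at_night).most_common(1)[0][0] if at_night else None

print(usual_residence(tweets))  # Camden
```

Even this toy version makes the privacy concern concrete: a home location can be inferred from public posts without the user ever stating it.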
21. Government Data Programme
Policies and Governance
Modern Data Infrastructure
Data Science
Open Data
Data Leaders Network
Data Steering Group
Inter-Ministerial Group for Digital Transformation
National Information Infrastructure
Common Technology Services
Platforms and Standards
Registers
Digital Services
Departmental Transformation
Government as a Platform
Editor's Notes
Hundreds of technology prices and specifications need to be collected, currently taking 3 members of staff one week per collection.
New point and click web-scraping technology (import.io) enables us to carry out each collection in less than 5 minutes, providing more data than could ever be collected manually
The FT used the raw data to produce their own analysis
Some interesting and useful conclusions were reached, providing validation that the research being undertaken is significant and important
Key issues include: are scrapers really comparing like with like? Supermarkets regularly reclassify items on their websites; small differences in products month on month can cause difficulty; and coverage is limited to supermarkets, not market stalls etc.
Some products show higher inflation than published CPI (tea bags, biscuits), some lower (wine, bananas).
Significant correlations were found between spikes in the number of norovirus lab reports and spikes in Twitter conversations using words and hashtags such as winterbug, norovirus, sickness bug, winter virus, vomiting, barf and flu; these strongly correlated with future lab-confirmed cases.
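The kind of correlation behind that finding can be sketched in a few lines. The weekly counts below are invented for illustration; the real analysis used actual lab reports and tweet volumes.

```python
from math import sqrt

# Illustrative sketch only: Pearson correlation between weekly counts
# of norovirus-related tweets and lab-confirmed cases (invented data).

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

tweet_mentions = [12, 30, 45, 80, 60, 20]  # e.g. "winterbug", "norovirus"
lab_reports    = [3, 8, 14, 25, 18, 6]
print(pearson(tweet_mentions, lab_reports))
```

Lagging the tweet series by a week or two before correlating is what turns this into a claim about predicting future lab cases.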