Anonymising quantitative data
Dr Sharon Bolton
UK Data Service
UK Data Archive, University of Essex
Anonymising Research Data workshop
Dublin, 22 June 2016

The UK Data Service
• Single point of access to wide range of social science data:
ukdataservice.ac.uk
• Funded by the ESRC to serve the academic community: training
and guidance; UK Data Archive established 1967
• Used by academic researchers and students; government analysts;
charities; business; research centres; think tanks
• Survey microdata; cohort studies; international macrodata; census
data; qualitative/mixed methods data
• Support and guide data creators, including disclosure review
(anonymisation) and preparation for archiving

Protecting confidentiality: the ‘5 Safes’
Five guiding principles:
• Safe people - educate researchers to use data safely
• Safe projects - research projects for ‘public good’
• Safe settings - SecureLab system for sensitive data
• Safe outputs - SecureLab project outputs are screened
• Safe data - treat the data to protect respondent
confidentiality
• For this session, we will concentrate (mostly) on Safe
data

Data collection: planning
• Explain to respondents what archiving entails and gain
agreement for data sharing – informed consent
• Think about disclosure risks before starting – what kind
of information do you need to collect?
• Direct identifiers include: names; addresses; telephone
numbers; email addresses; photos; (perhaps) IP
addresses; do you really need them?
• Unless explicit consent obtained for sharing, direct
identifiers should always be removed from data
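
A minimal sketch of removing direct identifiers before sharing, assuming the records are held as Python dictionaries. The field names used here (name, address, telephone, email, photo, ip_address) are hypothetical and would need to match the variables in your own dataset.

```python
# Hypothetical field names for direct identifiers; adjust to your own data.
DIRECT_IDENTIFIERS = {"name", "address", "telephone", "email", "photo", "ip_address"}


def remove_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records with direct-identifier fields dropped."""
    return [
        {key: value for key, value in record.items() if key not in identifiers}
        for record in records
    ]


# Illustrative records only, not real respondents.
respondents = [
    {"id": 1, "name": "A. Example", "email": "a@example.org", "age": 34},
    {"id": 2, "name": "B. Example", "email": "b@example.org", "age": 51},
]
cleaned = remove_direct_identifiers(respondents)
print(cleaned)
```

The original records are left untouched, so the identified version can be kept separately (and securely) if consent and the project plan require it.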

Anonymising data: indirect identifiers
Indirect identifiers include:
• Sensitive information: health information/medical
conditions; crime victimisation/offending; drug/alcohol
use etc.
• ‘Less sensitive’ information: age/birth date; educational
characteristics; employment details; religious affiliation;
household size; geographic area
• Look at demographics in combination (e.g.
demographics + geographies)
• Text/string variables – too detailed?

Anonymising indirect identifiers
• Aggregate categories to reduce precision
• Band ages, incomes, expenditure, etc. to disguise outliers
• Use standard coding frames – e.g. SOC2010
• Generalise meaning of detailed text
• Document the changes you make
• Talk to other researchers, archives, data services
Published guides:
• UCD Research Data Management Guide
http://libguides.ucd.ie/data/ethics
• ONS Disclosure control guidance for microdata produced from social
surveys
http://www.ons.gov.uk/methodology/methodologytopicsandstatisticalconcepts/disclosurecontrol/policyforsocialsurveymicrodata
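
Banding a variable and documenting the change can be done in one step. The sketch below is an illustration only: the income band edges, the labels, the variable names and the change-log format are all assumptions, not taken from any of the guides listed above.

```python
import bisect

# Hypothetical band edges for annual income; choose edges that suit your data.
INCOME_EDGES = [10000, 20000, 30000, 50000]
INCOME_LABELS = [
    "under 10,000",
    "10,000-19,999",
    "20,000-29,999",
    "30,000-49,999",
    "50,000 or more",  # open-ended top band disguises high outliers
]

change_log = []  # one entry per anonymisation step, for the data documentation


def band_income(value):
    """Map an exact income to its band label."""
    return INCOME_LABELS[bisect.bisect_right(INCOME_EDGES, value)]


def band_variable(records, source, target, band_fn, note):
    """Replace `source` with a banded `target` in place and log the change."""
    for record in records:
        record[target] = band_fn(record.pop(source))
    change_log.append({"variable": source, "replaced_by": target, "change": note})


# Illustrative records only.
records = [{"income": 9500}, {"income": 27000}, {"income": 250000}]
band_variable(records, "income", "income_band", band_income,
              "exact income banded into 5 categories")
print(records)
print(change_log)
```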

Anonymising data: new developments and tools
Statistical Disclosure Control (SDC) software is available:
• mu-Argus
  • standalone software package recommended by Eurostat for
    government statisticians
  • software and manual: http://neon.vb.cbs.nl/casc/mu.htm
• R tool - sdcMicro (GUI)
  • software and manual:
    http://www.inside-r.org/packages/cran/sdcMicro/docs/sdcMicro
  • new documentation being developed by UK Data Service, working with
    R developers

Quiz 1: disclosive text in job title
Job title Frequency Valid Percent
nurse 73 73.0
carer for elderly man 1 1.0
hospital ward cleaner 1 1.0
social science researcher 1 1.0
head of dental practice 2 2.0
cleaner in electronics factory 1 1.0
Financial Director, Sunnyview Care Home, Colchester 1 1.0
general manager 1 1.0
GP 1 1.0
Manager, Cotterill Village Stores 1 1.0
works in electronics factory 1 1.0
on benefits, not working 1 1.0
police officer 2 2.0
consultant, geriatric psychiatry 1 1.0
Reetired 1 1.0
retired 1 1.0
Retired 1 1.0
retirement 1 1.0
geography teacher 2 2.0
Teacher, music 2 2.0
Seondary school teeacher 1 1.0
unemployed 1 1.0
web designer 2 2.0
Total 100 100.0

Quiz 1: jobs coded with SOC2010
Job title: SOC2010 Frequency Valid Percent
1131: Director, financial 1 1.0
1171: Manager, general 1 1.0
1190: Manager, retail 1 1.0
2231: Nurse 73 73.0
2426: Researcher 1 1.0
2215: Dentist 2 2.0
2211: Doctor, medical 2 2.0
3312: Officer, police 2 2.0
2314: Teacher, secondary 5 5.0
2137: Designer, web 2 2.0
6145: Carer 1 1.0
9139: Worker, factory 1 1.0
9233: Cleaner 2 2.0
Retired 4 4.0
Unemployed 2 2.0
Total 100 100.0
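
A rough sketch of how free-text job titles might be mapped to the SOC2010 categories shown above. The keyword rules are hypothetical and deliberately incomplete; real coding would normally rely on the official SOC2010 coding index or a dedicated coding tool. Titles that match no rule are returned as None so that they can be coded by hand.

```python
# Hypothetical keyword rules: (keyword, category label used on the slide).
RULES = [
    ("financial director", "1131: Director, financial"),
    ("general manager", "1171: Manager, general"),
    ("nurse", "2231: Nurse"),
    ("teacher", "2314: Teacher, secondary"),
    ("police", "3312: Officer, police"),
    ("web designer", "2137: Designer, web"),
    ("cleaner", "9233: Cleaner"),
    ("retire", "Retired"),
    ("unemployed", "Unemployed"),
]


def code_job(title):
    """Return the category for a free-text job title, or None if no rule matches."""
    text = title.strip().lower()
    for keyword, category in RULES:
        if keyword in text:
            return category
    return None  # leave for manual coding


titles = [
    "Financial Director, Sunnyview Care Home, Colchester",
    "Teacher, music",
    "retirement",
    "Reetired",  # misspelling: no rule matches, so it needs manual review
]
for title in titles:
    print(repr(title), "->", code_job(title))
```

Coding to a standard frame removes the employer and place names from the text and also collapses spelling variants, but only once misspelt entries have been reviewed.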

Quiz 2: detailed religion categories
Religious affiliation
Frequency Valid Percent
1 Protestant 41 41.4
2 Anglican 4 4.0
3 Catholic 26 26.3
4 Muslim 8 8.1
5 Sikh 5 5.1
6 Jehovah's Witness 6 6.1
7 Methodist 1 1.0
8 Mormon 1 1.0
9 Baptist 1 1.0
10 Buddhist 3 3.0
11 None 1 1.0
12 No religion 1 1.0
13 Moravian 1 1.0
Total 99 100.0

Quiz 2: religion categories aggregated
Religious affiliation
Frequency Valid Percent
1 Protestant 49 49.0
3 Catholic 26 26.0
4 Muslim 8 8.0
5 Sikh 5 5.0
6 Other religion 10 10.0
7 No religion 2 2.0
Total 100 100.0
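
The aggregation can be sketched as a simple lookup from detailed to broad categories. The mapping below is inferred from the two tables (for example, Jehovah's Witness, Mormon and Buddhist sum to the 10 cases in "Other religion") and is an assumption, not a documented coding frame. Note that the detailed table totals 99, so the counts produced here also total 99 rather than 100.

```python
from collections import Counter

# Frequencies from the detailed religion table.
DETAILED_COUNTS = {
    "Protestant": 41, "Anglican": 4, "Catholic": 26, "Muslim": 8, "Sikh": 5,
    "Jehovah's Witness": 6, "Methodist": 1, "Mormon": 1, "Baptist": 1,
    "Buddhist": 3, "None": 1, "No religion": 1, "Moravian": 1,
}

# Assumed mapping of small categories to broader groups.
AGGREGATE = {
    "Anglican": "Protestant",
    "Methodist": "Protestant",
    "Baptist": "Protestant",
    "Moravian": "Protestant",
    "Jehovah's Witness": "Other religion",
    "Mormon": "Other religion",
    "Buddhist": "Other religion",
    "None": "No religion",
}


def aggregate_religion(category):
    """Return the broad group; categories not in the mapping are kept as they are."""
    return AGGREGATE.get(category, category)


aggregated = Counter()
for category, n in DETAILED_COUNTS.items():
    aggregated[aggregate_religion(category)] += n

for category, n in aggregated.items():
    print(category, n)
```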

Quiz 3: age in years
Age in years
16 3 3.0
17 3 3.0
18 9 9.0
19 9 9.0
20 16 16.0
21 4 4.0
22 2 2.0
23 2 2.0
24 2 2.0
25 2 2.0
26 2 2.0
27 2 2.0
28 2 2.0
29 2 2.0
30 2 2.0
31 1 1.0
32 1 1.0
40 11 11.0
41 1 1.0
42 1 1.0
43 3 3.0
49 1 1.0
50 13 13.0
51 1 1.0
60 1 1.0
61 1 1.0
62 1 1.0
63 1 1.0
64 1 1.0
Total 100 100.0

Quiz 3: banded age
Age (banded)
1 16-20 40 40.0
2 21-30 22 22.0
3 31-40 13 13.0
4 41-50 19 19.0
5 51-64 6 6.0
Total 100 100.0
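
A short sketch of banding the single-year ages from the Quiz 3 table. The band edges (16-20, 21-30, 31-40, 41-50, 51-64) are an assumption, chosen because they reproduce the banded counts of 40, 22, 13, 19 and 6.

```python
from collections import Counter

# Frequencies from the "Age in years" table.
AGE_COUNTS = {
    16: 3, 17: 3, 18: 9, 19: 9, 20: 16, 21: 4, 22: 2, 23: 2, 24: 2, 25: 2,
    26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 1, 32: 1, 40: 11, 41: 1, 42: 1,
    43: 3, 49: 1, 50: 13, 51: 1, 60: 1, 61: 1, 62: 1, 63: 1, 64: 1,
}

# Assumed band edges (inclusive at both ends).
BANDS = [(16, 20), (21, 30), (31, 40), (41, 50), (51, 64)]


def band_age(age):
    """Return the label of the band containing `age`."""
    for low, high in BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} falls outside all bands")


banded = Counter()
for age, n in AGE_COUNTS.items():
    banded[band_age(age)] += n

for low, high in BANDS:
    label = f"{low}-{high}"
    print(label, banded[label])
```

Merging the oldest respondents into a single 51-64 band avoids the cells of one that appear for ages 51 and 60 to 64 in the single-year table.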

Access control
• Don’t over-anonymise - find a balance between protecting
respondents’ confidentiality and maintaining research
usability of data
• Can’t fully anonymise data without removing all the
useful detail? Go back to the 5 Safes – think about
access control: Safe people, Safe settings, Safe outputs

Access control
• At UK Data Service, data available under 3 access levels:
  • OPEN – open public access
  • SAFEGUARDED – downloadable, but use is traceable
    • Registered users only (agree not to try to identify any
      individual respondents)
    • Special agreements/licence: permission-only access;
      approved projects – usage agreed in advance
  • CONTROLLED – accredited users take a further training course
    • Access via on-site safe setting or virtual secure environment
      (SecureLab)
    • Outputs disclosure-checked before publication

Anonymising quantitative data: summary
• Informed consent
• Think about level of detail needed before data collection
• Remove direct identifiers
• Check and treat indirect identifiers to reduce disclosure
risk
• Document your changes
• Balance anonymisation with access control to preserve
data usability

Questions?
Guidance on anonymisation:
• UCD: http://libguides.ucd.ie/data/ethics
• UKDS: www.data-archive.ac.uk/create-manage/consent-ethics/anonymisation
• Managing and Sharing Research Data book
https://uk.sagepub.com/en-gb/eur/managing-and-sharing-research-data/book240297
