2016-10-17 | UC Berkeley | Alasdair Cohen | Lecture for Public Health 250B
Data De-identification,
Anonymization, & Storage
Data De-identification
& Anonymization:
Why important?
Source: Simson Garfinkel, NIST, 2016-06 BITSS presentation “Data de-identification: Overview and framing of current issues”
Data De-identification
& Anonymization:
Why important?
Source: Simson Garfinkel, NIST, 2016-06 BITSS presentation “Data de-identification: Overview and framing of current issues”
https://med.stanford.edu/news/all-news/2011/05/dangerous-side-effect-of-common-drug-combination-discovered-by-data-mining.html
Data De-identification
& Anonymization:
Big-picture
Data De-identification
& Anonymization
Data De-identification
& Anonymization:
Definitions
Data De-identification
& Anonymization:
Definitions
• “Direct Identifiers — Main function is to identify people.
  • Name
  • SSN
  • Identifiers must be suppressed
• Quasi-Identifiers — Useful for analysis, but can also identify.
  • Date of Birth
  • Physical characteristics — height, weight, hair color, etc.
  • History, capabilities, etc.”
Source: Simson Garfinkel, NIST, 2016-06 BITSS presentation “Data de-identification: Overview and framing of current issues”
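The distinction between direct identifiers and quasi-identifiers can be sketched in code. This is only an illustration: the field names, the classification sets, and the example record are hypothetical and do not come from the Garfinkel presentation.

```python
# Hypothetical column classification; the field names are illustrative only.
DIRECT_IDENTIFIERS = {"name", "ssn"}
QUASI_IDENTIFIERS = {"date_of_birth", "height_cm", "hair_color"}


def suppress_direct_identifiers(record):
    """Return a copy of the record without any direct-identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def quasi_identifier_view(record):
    """Return only the quasi-identifier fields, which may still identify."""
    return {k: v for k, v in record.items() if k in QUASI_IDENTIFIERS}


# Fabricated example record, not real data.
record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "date_of_birth": "1980-01-01",
    "height_cm": 170,
    "hair_color": "brown",
    "diagnosis": "asthma",
}

released = suppress_direct_identifiers(record)
print(sorted(released))                 # fields that survive suppression
print(quasi_identifier_view(released))  # fields that still need attention
```

Suppressing the direct identifiers is the easy step; the quasi-identifiers remain in the released record because they are useful for analysis, which is why they need separate treatment.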
Data De-identification
& Anonymization:
Issues
See also: Sweeney L., Simple Demographics Often Identify People Uniquely, Carnegie Mellon University, Data Privacy Working Paper 3, Pittsburgh, 2000. http://dataprivacylab.org/projects/identifiability/paper1.pdf
Data De-identification
& Anonymization:
Issues - example
• Governor Weld fainted in 1996 at a college graduation and was admitted to a hospital
• State of MA made “de-identified” hospital records of state employees available for research on health care
• MA removed name; left Date of Birth, Sex & ZIP
Source (all slide content from): Simson Garfinkel, NIST, 2016-06 BITSS presentation “Data de-identification: Overview and framing of current issues”
Sweeney purchased voter registration records for Cambridge containing:
• Date of Birth
• Sex
• ZIP
[Diagram: Hospital Admission Records (Diagnosis, Treatment, Test Results: sensitive values) and Voter Registration Records (Name: “direct” identifier) overlap on the “quasi-identifiers” Date of Birth, Sex, and ZIP.]
See also: Sweeney L., Simple Demographics Often Identify People Uniquely, Carnegie Mellon University, Data Privacy Working Paper 3, Pittsburgh, 2000. http://dataprivacylab.org/projects/identifiability/paper1.pdf
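The linkage described on this slide can be sketched as a join on the shared quasi-identifiers. All records, names, and ZIP values below are fabricated for illustration, and the function `link` is a hypothetical helper, not Sweeney's actual procedure. The sketch also assumes the voter list covers everyone who could appear in the hospital data.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("date_of_birth", "sex", "zip")

# Fabricated "de-identified" hospital records: name removed, quasi-identifiers left.
hospital_records = [
    {"date_of_birth": "1950-02-03", "sex": "M", "zip": "00001", "diagnosis": "A"},
    {"date_of_birth": "1962-07-15", "sex": "F", "zip": "00002", "diagnosis": "B"},
    {"date_of_birth": "1962-07-15", "sex": "F", "zip": "00002", "diagnosis": "C"},
]

# Fabricated voter registration records: name plus the same quasi-identifiers.
voter_records = [
    {"name": "Voter One", "date_of_birth": "1950-02-03", "sex": "M", "zip": "00001"},
    {"name": "Voter Two", "date_of_birth": "1962-07-15", "sex": "F", "zip": "00002"},
    {"name": "Voter Three", "date_of_birth": "1962-07-15", "sex": "F", "zip": "00002"},
]


def qi_key(record):
    """Build the tuple of quasi-identifier values used for matching."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def link(hospital, voters):
    """Return (name, diagnosis) pairs where exactly one voter shares the key."""
    voters_by_key = defaultdict(list)
    for v in voters:
        voters_by_key[qi_key(v)].append(v["name"])
    matches = []
    for h in hospital:
        names = voters_by_key.get(qi_key(h), [])
        if len(names) == 1:  # a unique match re-identifies the hospital record
            matches.append((names[0], h["diagnosis"]))
    return matches


print(link(hospital_records, voter_records))  # [('Voter One', 'A')]
```

Only the record whose combination of date of birth, sex, and ZIP is unique in the voter list is re-identified; the two records that share a combination remain ambiguous.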
Data De-identification
& Anonymization:
Reducing de-id. risk
• Garfinkel/NIST proposes using an “identifiability spectrum”
Source: Simson Garfinkel, NIST, 2016-06 BITSS presentation “Data de-identification: Overview and framing of current issues”
Data De-identification
& Anonymization:
Reducing de-id. risk
• “Four main techniques for modifying data [quasi-identifiers] to limit data disclosure:
  • Suppression                  January 1, 1980 → XXXXXXXX, 1980
  • Generalization               January 1, 1980 → 1980-1985
  • Swapping (between people)    January 1, 1980 → February 29, 1984
  • Noise Addition               January 1, 1980 → December 21, 1979”
• What are some problems/issues associated with each approach?
Source: Simson Garfinkel, NIST, 2016-06 BITSS presentation “Data de-identification: Overview and framing of current issues”
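The four techniques on this slide can be sketched for a date-of-birth field as follows. This is a minimal illustration, not the method from the presentation: the function names, the 5-year bin width, the bin origin of 1980, and the 30-day noise limit are all assumptions chosen to mirror the slide's examples.

```python
import random
from datetime import date, timedelta


def suppress(d):
    """Suppression: hide the month and day, keep the year."""
    return f"XXXXXXXX, {d.year}"


def generalize(d, width=5, origin=1980):
    """Generalization: replace the date with a year range that contains it."""
    start = origin + ((d.year - origin) // width) * width
    return f"{start}-{start + width}"


def swap(dates, rng):
    """Swapping: shuffle values between people; the set of values is unchanged."""
    shuffled = list(dates)
    rng.shuffle(shuffled)
    return shuffled


def add_noise(d, rng, max_days=30):
    """Noise addition: shift the date by a random number of days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
dob = date(1980, 1, 1)

print(suppress(dob))    # XXXXXXXX, 1980
print(generalize(dob))  # 1980-1985
print(swap([dob, date(1984, 2, 29), date(1979, 12, 21)], rng))
print(add_noise(dob, rng))
```

Each function hints at a trade-off relevant to the discussion question: suppression and generalization discard detail, swapping preserves overall distributions but breaks the link between a person and their value, and noise addition leaves values that are close to, but not exactly, the truth.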
Data De-identification
& Anonymization:
Estimating risk…
• “Calculating re-identification risk:
  • Must be calculated for every record.
  • Key issues:
    • Definition of ‘matching’
    • Definition of ‘population’”
	
\[
\text{Risk of record re-identification} = \frac{1}{\#\ \text{possible matching records in population}}
\]
[Figure: excerpt from the paper cited below, partly cut off on the slide. The legible text states that quasi-identifiers are variables representing background information already known to an intruder (examples given: age, sex, postal code, ethnicity, race, and main language spoken), and that “journalist re-identification risk” applies when an intruder with background information on many patients tries to re-identify any one of them rather than one specific person, using an external identification database that can be built from public registries. The excerpt ends with “Figure 1. Overview of the risk assessment methodology used in this study.”]
Figure: Evaluating the Risk of Re-identification of Patients from Hospital Prescription Records, El Emam et al, Can J Hosp Pharm 2009;62(4):307–319
Source: Simson Garfinkel, NIST, 2016-06 BITSS presentation “Data de-identification: Overview and framing of current issues”
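The per-record risk formula on this slide (1 divided by the number of possible matching records in the population) can be sketched as below. The two "key issues" from the slide appear here as explicit assumptions: "matching" is taken to mean exact equality on a hypothetical set of quasi-identifiers, and the "population" is whatever list is passed in. All data are fabricated.

```python
from collections import Counter

# Assumed definition of "matching": exact equality on these fields.
QUASI_IDENTIFIERS = ("birth_year", "sex", "zip")


def reidentification_risks(records, population):
    """Return, for each record, 1 / (number of matching population records)."""
    counts = Counter(tuple(p[q] for q in QUASI_IDENTIFIERS) for p in population)
    risks = []
    for r in records:
        matches = counts[tuple(r[q] for q in QUASI_IDENTIFIERS)]
        # With no match in the chosen population, the risk is left undefined.
        risks.append(1 / matches if matches else None)
    return risks


# Fabricated population: four people share one combination, one person is unique.
population = [
    {"birth_year": 1980, "sex": "F", "zip": "00001"},
    {"birth_year": 1980, "sex": "F", "zip": "00001"},
    {"birth_year": 1980, "sex": "F", "zip": "00001"},
    {"birth_year": 1980, "sex": "F", "zip": "00001"},
    {"birth_year": 1975, "sex": "M", "zip": "00002"},
]
released = [population[0], population[4]]

risks = reidentification_risks(released, population)
print(risks)       # [0.25, 1.0]
print(max(risks))  # 1.0
```

Because the risk is computed for every record, a dataset can look safe on average while still containing individual records with a risk of 1; changing the definition of the population or of a match changes every number.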
Data De-identification
& Anonymization:
Reducing de-id. risk
Source: Simson Garfinkel, NIST, 2016-06 BITSS presentation “Data de-identification: Overview and framing of current issues”
Data De-identification
& Anonymization:
Much much more…
Data Repositories
(& Journal Policies)
https://dataverse.harvard.edu

Transparency5