Burton - Security, Privacy and Trust

Security, privacy and trust: why and
how might we control access to
research data?
Paul Burton, Rebecca Wilson
University of Bristol, D2K Research Program
NISO Symposium, Denver
11th September, 2016

• Perhaps the most important message is in the
title:
• This is a complex challenge involving science,
technology, governance and other fundamental
social issues
• No single solution will be adequate
• True transdisciplinary programs of work are essential
• Even the most complex and sophisticated of
solutions will never offer fully effective exploitation
of available data with zero risk of mistakes in
managing data or of malign interference
Security, privacy and trust

From Preface:
“Our view has always been that
anonymisation is a heavily context-
dependent process and only by
considering the data and its
environment as a total system
(which we call the data situation),
can one come to a well informed
decision about whether and what
anonymisation is needed.”

Controlling access to
research data (security):
why and when?

• Who might share data?
• Distinct generator and user
• Share data across a consortium
• How is ‘sharing’ achieved?
• Physically transfer data to a user
• Provide access to analyse
• Analysis on-site
• Remote analysis
• Federated analysis
• All interpretations valid and important
What does “sharing” research microdata mean?

• Management of intellectual property invested
and held in data – most areas of research
• Legal, ethical and other governance stipulations
to protect the welfare of research participants –
particularly in health/social/biomedical research
• Disclosure of identity
• Disclosure of associated information
• Balance between these and the societal benefits of
streamlined comprehensive data access – which is
evolving rapidly with time and social context
Why control research data at all?

The Research
Data Pipeline
When are
data “at risk”?

The Research
Data Pipeline
When are
data “at risk”?
Dissemination
and evidence
based action

• Risks associated with: storage; transmission; use
• Accidental v deliberate violations
• Direct v inferential disclosure
• Risks and remedies lie in the nature of the data
themselves and the contextual environment in
which the data are to be used – and potentially
misused
• Mark Elliot, Elaine Mackey, Keith Spicer, Caroline Tudor,
the Anonymisation Decision-Making Framework, 2016
Issues to consider

• Consider the user(s)
• Are he/she/they bona fide researchers?
• Consider the application – for example:
• Does it violate (or potentially violate) any of the
ethical permissions granted to the study or any
of the consent forms signed by the
participants or their guardians?
• Is there a risk it may produce information that
may allow individual cohort members to be
identified?
How to implement control in practice? UK
MetaDAC and ALSPAC as illustrative examples

• Administrative and research data held
separately
• Hard copies of data held in locked storage
• Electronic data held on password protected
systems with access restricted to those who
really need it
• All electronic data held in encrypted form
• Extensive QC (security of quality)
• Multiple back-ups (security of existence)
Managing the data and the data environment

• All data pseudonomised before release
• If pseudonomisation scientifically impossible,
data can be analysed ‘on site’
• All data transfers encrypted using standard
protocols
• All linked data released on study-specific IDs
• Explicit acknowledgment that no system can
guarantee a zero risk of disclosure or misuse of
data
Managing the data and the data environment

• A strong underpinning governance structure
is essential.
• EAGDA (Expert Advisory Group on Data
Access) report 2015 considered, amongst
many other key issues that:
EAGDA report, 2015
• Governance must be proportionate and
context appropriate
• Must be transparent, auditable and
appealable
• Need mutual trust and respect amongst
stakeholders

• Applicants for data through MetaDAC or from
ALSPAC sign up agreeing to governance
documents which include statements such as:
• Applicants are reminded that the Terms and Conditions for the
cohort explicitly forbid any attempt to identify individuals or to
compromise or otherwise infringe the confidentiality of
information on data subjects and their right to privacy.
• Do you understand that you must not pass on any data or
samples awarded, or any derived variables or genotypes
generated by this application to a third party (i.e. to anybody
that is not included in this list of applicants on this project, nor is
a direct employee of one of these applicants)?
How to implement control in practice? UK
MetaDAC and ALSPAC as illustrative examples

Role of encryption and other
technology-based forms of
privacy protection in “open
science”

• When research data are very sensitive or are seen as
having a particular intellectual property value can we
develop technology-based solutions that facilitate
access to microdata by enhancing privacy protection
so that all intellectual property and governance
constraints are met in full while lowering the
governance bar? This can promote open science by
easing and/or speeding up access requests
• Should be seen as an additional component to be
applied on top of a data access and governance
system that is already well founded
• EAGDA report emphasises sustainability
Privacy protection in “open science”

Data
Computer
DC
Analysis
Computer
AC
Single site DataSHIELD
2009: The DataSHIELD
challenge
Given that microdata are scientifically
critical and yet potentially sensitive,
can we ensure that the information
driving analysis of the data at each
centre only ever emerges from the
firewall in non-disclosive form? (i)
encryption (trivial and non-trivial);
(ii) low dimensional (ideally
sufficient) statistics
Multi-site DataSHIELD
horizontally partitioned data

• One step analyses: e.g.
ds.table2D - request
non-disclosive output
from all sources
• Multi-step analyses: e.g.
ds.lexis – set up and then
request
• Iterative analyses: e.g.
ds.glm - parallel
processes linked
together by non-
identifying summary
statistics – e.g. for glm
= score vectors and
information matrices
• Can be used as
equivalent to
full individual
level analysis or
to study level
meta-analysis
The DataSHIELD
solution

DataSHIELD
b.vector<-c(0,0,0,0)
glm(cc~1+BMI+BMI.456+SNP,
family=binomial,
start=b.vector, maxit=1)
Analysis commands (1)

Information Matrix Study 5
Score vector Study 5
Summary Statistics (1)
[36, 487.2951, 487.2951, 149]
DataSHIELD

Σ Information Matrix Study 5
[36, 487.2951, 487.2951, 149]
DataSHIELD

b.vector<-
c(-0.322, 0.0223, 0.0391, 0.535)
Analysis commands (2)
glm(cc~1+BMI+BMI.456+SNP,
family=binomial,
start=b.vector, maxit=1)
DataSHIELD

Updated parameters (4)
Σ
Coefficient Estimate Std Error
Intercept -0.3296 0.02838
BMI 0.02300 0.00621
BMI.456 0.04126 0.01140
SNP 0.5517 0.03295
Final parameter estimates
DataSHIELD

DataSHIELD analysis
Direct conventional analysis
Parameter Coefficient Standard Error
bintercept
-0.3296 0.02838
bBMI
0.02300 0.00621
bBMI.456
0.04126 0.01140
bSNP
0.5517 0.03295
Coefficients:
Estimate Std. Error
(Intercept) -0.32956 0.02838
BMI 0.02300 0.00621
BMI.456 0.04126 0.01140
SNP 0.55173 0.03295
Does it
work?

Server-side functions
Client-side functions
Individual level data never
transmitted or seen by the
statistician in charge, or by
anybody outside the
original centre in which
they are stored.
R
R
R R
Web services
Web servicesWeb services
Data server
Opal
Finrisk
Opal
Prevend
Opal
1958BC
Data server
Data server
BioSHaRE
web site
Web services
Analysis
client
DataSHIELD: current implementation for
horizontally partitioned data

IM5:
Analysis
Computer
R
R
R R
Web services
Data computer Opal
NHS
Opal
ALSPAC
Opal
Education
Data computer Data computer
Regression coefficients = XTY/ XTX
XTX: Need to calculate
XAXA XAXB XAXC
XAXB XBXB XBXC
XAXC XBXC XCXC
XA
XB
XAXB
XA1 * XB1
+
XA2 * XB2
+
XA3 * XB3
+
……
DataSHIELD: current implementation for
vertically partitioned (linked) data

IM5:
Analysis
Computer
R
R
R R
Web services
Data computer Opal
NHS
Opal
ALSPAC
Opal
Education
XAXA XAXB XAXC
XAXB XBXB XBXC
XAXC XBXC XCXC
MA
MB
MC
XA
XB
XAXB
XA1 * XB1
+
XA2 * XB2
+
XA3 * XB3
+
……
DataSHIELD:current
implementationfor vertically
partitioned(linked) data
plain.text.vector.A plain.text.vector.N
0 1 1 1 0 0 1 1 1 0 1 0 0 1
encryption.matrix
[,1] [,2] [,3]
[1,] -1.444769 2.495677 -5.322736
[2,] -1.355529 -9.369041 2.687347
[3,] 4.603762 -3.622044 -2.817478
occluded.matrix.A
[,1] [,2] [,3]
[1,] -1.4546711 0 4.0722205
[2,] 6.4809785 1 -4.5814726
[3,] 4.4954801 1 -8.7036260
[4,] 0.1995684 1 -8.6872205
[5,] -6.4060220 0 -6.6471777
[6,] -0.5164345 0 -0.2564673
[7,] -5.8981933 1 -8.5032852

IM5:
Analysis
Computer
R
R
R R
Web services
Opal
NHS
Opal
ALSPAC
Opal
Education
XAXA XAXB XAXC
XAXB XBXB XBXC
XAXC XBXC XCXC
MA XT
A
MAXT
AXBMB
MA
XA
XB
MB
(MA)-1 MAXT
AXBMB (MB)-1 = XAXB
DataSHIELD:current
implementationfor vertically
partitioned(linked) data
plain.text.vector.A plain.text.vector.N
0 1 1 1 0 0 1 1 1 0 1 0 0 1
encryption.matrix
[,1] [,2] [,3]
[1,] -1.444769 2.495677 -5.322736
[2,] -1.355529 -9.369041 2.687347
[3,] 4.603762 -3.622044 -2.817478
occluded.matrix.A
[,1] [,2] [,3]
[1,] -1.4546711 0 4.0722205
[2,] 6.4809785 1 -4.5814726
[3,] 4.4954801 1 -8.7036260
[4,] 0.1995684 1 -8.6872205
[5,] -6.4060220 0 -6.6471777
[6,] -0.5164345 0 -0.2564673
[7,] -5.8981933 1 -8.5032852

The core DataSHIELD
Development Team

Becca Wilson
If people want to know technical details about DataSHIELD or methods for secure data
sharing/analysis: see 12th September 11:30 - 13:00 Secure Multiparty Computation for Statistical
Analysis of Private Data
Demetris Avraam
Poster #12 RDA poster session Wednesday, Thursday, Friday – DataSHIELD: a method for privacy
protected analysis of individual level data
RDA Working Group for Data Security and Trust
RDA 8th Plenary in the same venue as the NISO symposium. Session on 17th September 11:00 -
12:30. We are running a survey to gather information on current data security practices in our
community. The survey is available at www.bit.ly/dash-ing (see below)
Data to Knowledge (D2K) Research Group. Contact details:
@Data2Knowledge there is also a "contact us" page on www.datashield.ac.uk
Setting up a professional community for stakeholders in health data sharing
DASH-ING: DAta Sharing for Health - INnovation Group
The website will initially (and temporarily) be at: www.bit.ly/dash-ing This webpage contains
questions for RDA survey and link to join the professional community
Additional opportunities for interaction

> plain.text.vector.L
[1] 0 1 1 1 0 0 1
> plain.text.vector.N
[1] 1 1 0 1 0 0 1
> sum(plain.text.vector.L*plain.text.matrix.N)
[1] 3
>t(matrix(plain.text.vector.L))%*%matrix(plain.text.vector.N)
[,1]
[1,] 3
How does matrix-based encryption
work?

> occluded.matrix.L > plain.text.vector.L
[,1] [,2] [,3] [1] 0 1 1 1 0 0 1
[1,] -1.4546711 0 4.0722205
[2,] 6.4809785 1 -4.5814726
[3,] 4.4954801 1 -8.7036260
[4,] 0.1995684 1 -8.6872205
[5,] -6.4060220 0 -6.6471777
[6,] -0.5164345 0 -0.2564673
[7,] -5.8981933 1 -8.5032852
> e.mat.L
[,1] [,2] [,3]
[1,] -1.444769 2.495677 -5.322736
[2,] -1.355529 -9.369041 2.687347
[3,] 4.603762 -3.622044 -2.817478
> e.mat.L%*%occluded.matrix.L
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] -19.57369 17.51813 42.32785 48.44713 44.636397 2.11123627 56.277949
[2,] 12.91532 -30.46620 -38.85246 -32.98514 -9.179719 0.01082581 -24.225142
[3,] -18.17035 39.12303 41.59635 21.77277 -10.763524 -1.65495077 -6.818104
work?

> e.mat.L%*%occluded.matrix.L
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] -19.57369 17.51813 42.32785 48.44713 44.636397 2.11123627 56.277949
[2,] 12.91532 -30.46620 -38.85246 -32.98514 -9.179719 0.01082581 -24.225142
[3,] -18.17035 39.12303 41.59635 21.77277 -10.763524 -1.65495077 -6.818104
> plain.text.vector.N
[1] 1 1 0 1 0 0 1
> e.mat.L%*%occluded.matrix.L%*% plain.text.matrix.N
[,1]
[1,] 102.66952
[2,] -74.76116
[3,] 35.90735
> inv.e.mat.L%*%e.mat.L%*%occluded.matrix.L%*% plain.text.matrix.N
[,1]
[1,] -0.6723174
[2,] 3.0000000
[3,] -17.6997578
> sum(plain.text.vector.L*plain.text.matrix.N)
[1] 3
work?

> plain.text.vector.L
[1] 0 1 1 1 0 0 1
> e.mat.1
[,1]
[1,] 7.13763
> e.mat.1%*%t(matrix(plain.text.vector.L))
[,1] [,2] [,3] [,4] [,5] [,6][,7]
[1,] 0 7.13763 7.13763 7.13763 0 0 7.13763
>e.mat.1%*%t(matrix(plain.text.vector.L))%*%plain.text.matrix.N
[,1]
[1,] 21.41289
>(1/e.mat.1)*e.mat.1%*%t(matrix(plain.text.vector.L))%*%plain.tex
t.matrix.N
[,1]
[1,] 3
Why do we need to occlude the
original plain text vector?

• Is there a significant risk of upsetting or
alienating cohort members or of reducing their
willingness to remain as active participants?
• Does it address topics that fall within the
acknowledged scientific remit of the cohort?
• Is access requested to an infinite resource (data
or cell line DNA) or a depletable resource. If
non-depletable NO assessment of the science
underpinning the application
UK MetaDAC and ALSPAC as illustrative examples

• Wish to control access
• Undesirable loss of intellectual property
• Violation of legal, ethical and other governance
stipulations – particularly disclosure of identity
and/or associated information
• Wish to ensure data used widely for scientific
research and that access procedures are
streamlined
• Data and contextual “data environment” both
crucial and control systems are typically
complex and multi-faceted
How should access be controlled?

Burton - Security, Privacy and Trust

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (10)

Similar to Burton - Security, Privacy and Trust

Similar to Burton - Security, Privacy and Trust (20)

More from National Information Standards Organization (NISO)

More from National Information Standards Organization (NISO) (20)

Recently uploaded

Recently uploaded (20)

Burton - Security, Privacy and Trust