A privacy-preserving environment for genomic data analysis is feasible. Such an environment will help promote data sharing, not hinder it, whereas a severe leak may reverse the public opinion trend. Studying the determinants of human genetic individuality is essential for building a privacy-preserving environment.
IMM Computational Biology and Bioinformatics Seminars (CBBS), October 13, 2016
Towards a privacy-preserving environment for genomic data analysis
1. Towards a privacy-preserving
environment for genomic data analysis
Francisco M. Couto
LaSIGE, Faculdade de Ciências da Universidade de Lisboa
6. Does it affect others?
• DNA is transmitted from parents to children
• One leak may compromise a large group of
people (relatives)
7. How to detect the privacy-sensitive parts?
• the detection should identify small sensitive
elements
– using databases of known patterns
• Challenge:
– constructing a comprehensive knowledge
database
8. What is the impact on sharing data?
• sharing certain portions of data is more
attractive than sharing nothing
• privacy-sensitive portions may still be shared
in a controlled way
– e.g., using cryptographic methods
9. Why sharing?
• The highest value of genomes is achieved only
when sharing them with others
• sharing each individual genome may have
little impact
– but sharing many of them may have a huge
impact
10. But why share non-privacy-sensitive sequences?
• Is there any scientific value there?
– Researchers want the privacy-sensitive ones
• But are they really non-privacy-sensitive sequences?
– Maybe; currently we do not know
– Some of them will become privacy-sensitive in the future
• According to new discoveries
– Sharing them may speed up this process
12. Privacy attacks
Erlich, Yaniv, and Arvind Narayanan. "Routes for breaching and protecting genetic
privacy." Nature Reviews Genetics 15.6 (2014): 409-421.
13. Identity Tracing attacks
• Goal:
– uniquely identify the data donor
– despite data de-identifying techniques
• absence of explicit identifiers such as the name and exact
address
• Method
– accumulate quasi-identifiers
• additional metadata, such as basic demographic details,
inclusion/exclusion criteria, pedigree structure, and health
conditions
– gradually narrow down the possible individuals
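The narrowing step above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Person` record, the toy population, and the choice of quasi-identifiers (sex, birth year, state, surname) are hypothetical and not taken from any real dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Hypothetical record holding a few quasi-identifiers."""
    name: str
    sex: str
    birth_year: int
    state: str
    surname: str


def narrow_down(population, quasi_identifiers):
    """Apply quasi-identifiers one at a time.

    Returns the remaining candidates and the pool size after each step,
    so the gradual shrinking of the candidate pool is visible.
    """
    candidates = list(population)
    sizes = [len(candidates)]
    for attribute, value in quasi_identifiers:
        candidates = [p for p in candidates if getattr(p, attribute) == value]
        sizes.append(len(candidates))
    return candidates, sizes


# Toy population (invented for this example).
POPULATION = [
    Person("A", "M", 1960, "UT", "Smith"),
    Person("B", "M", 1960, "UT", "Jones"),
    Person("C", "F", 1960, "UT", "Smith"),
    Person("D", "M", 1975, "UT", "Smith"),
    Person("E", "M", 1960, "CA", "Smith"),
]

candidates, sizes = narrow_down(
    POPULATION,
    [("sex", "M"), ("birth_year", 1960), ("state", "UT"), ("surname", "Smith")],
)
print(sizes)                          # pool size after each quasi-identifier
print([p.name for p in candidates])   # remaining candidates
```

No single attribute identifies the donor here; it is the accumulation of attributes that leaves one candidate.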
14. A possible route for identity tracing
[Figure: the pool of candidate individuals narrows from 228 - 268M individuals to 23 - 8 individuals]
15. 1000 Genomes project attack
• Queried the Y-STR profiles in
– YSearch and SMGF
– correct surname in 12% of cases
• with 82% confidence
• Triangulating identities
– combined the obtained surnames with age and state
– U.S. census
• 131 out of the 1,092 participants
– will never recover their privacy
Gymrek, Melissa, et al. "Identifying personal genomes by surname inference." Science 339.6117
(2013): 321-324.
17. Main players
• Sample donors
– donate biological material
• Sample managers
– receive, manipulate, sequence, store, and provide
biological material and the results
• Researchers
– Consumers of data
• Auditors
– verify who accessed specific datasets
18. Donors
• State their preferences on data sharing
– free to customize
• Blanket consent
– participate in projects related to specific topics
• Opt-in or opt-out
– specific projects they sympathize with (or not)
• May delegate to Sample Managers
• If the donor dies
– relatives gain the ability to explicitly customize these preferences
19. Researchers
• register themselves in the system
• propose a project
• If approved by the Sample Manager
– Can use authorized privacy-sensitive portions
• The value of sequenced data is kept intact for
authorized researchers
22. Successful attacks using public data
Genomic variations (attack: reference)
• Re-identification (a few hundred SNPs are enough): Lin [23]
• Acquire knowledge about targets from GWAS results: Wang [34]
• Acquire knowledge about targets from microarray results: Homer [18]
• Infer masked genes (e.g., the APOE gene from Dr. Watson [35]): Nyholt [28]
Short tandem repeats (attack: reference)
• Use STR profiles to identify donors of 1000 Genomes Project: Gymrek [16]
• Forensic identification: Butler [6]
Disease-related genes (attack: reference)
• Direct-to-consumer genomic testing: Goldsmith [13]
• Masking the APOE gene (related to Alzheimer) from Dr. Watson’s genome: Wheeler [35]
Cogo, Vinicius V., et al. "A high-throughput method to detect privacy-sensitive human genomic
data." Proceedings of the 14th ACM Workshop on Privacy in the Electronic Society. ACM, 2015.
23. Short tandem repeats (STR)
• Small strings repeated several times
– DYS392 = [TAT]n
– cgac TAT TAT TAT TAT cgca (n=4)
• Individual profile:
STR      n
DYS392   4
DYS396   23
…        …
DYS 618  17
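The repeat count n in the DYS392 example can be computed with a short sketch. The function name `count_tandem_repeats` is hypothetical, and the code assumes that n is the longest run of back-to-back copies of the motif in the read; real STR callers handle sequencing errors and flanking regions, which this sketch ignores.

```python
import re


def count_tandem_repeats(sequence, motif):
    """Return the longest run of consecutive copies of `motif` in `sequence`.

    A lookahead is used so that runs starting at every position are
    considered, including runs that overlap an earlier, shorter match.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    pattern = re.compile(f"(?=((?:{re.escape(motif)})+))")
    runs = [len(m.group(1)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# The read from the slide: cgac TAT TAT TAT TAT cgca
read = "cgacTATTATTATTATcgca"
n = count_tandem_repeats(read, "TAT")
print(f"DYS392 = [TAT]n with n={n}")
```

Collecting n for each marker (DYS392, DYS396, and so on) yields the individual profile shown in the table.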
26. Retrieving the sequences
                   STRs    Genes      Variations            Total
Databases          TRDB    GeneCards  1000 Genomes Project  -
Number of entries  240k    20k        38M                   38.3M
DB sequences       22M     8.7M       1147M                 1178M
DB size            660MB   87MB       34.4GB                35.1GB
Note: Any other database can be used with our solution
27. Efficient query system
• Bloom filter
– Efficient data structure (space and performance)
– Tests whether an element is a member of a set
• Does a specific value belong to the set?
– No means No (no false negatives)
– Yes means Maybe (configurable false positives)
– False positives affect efficiency only (not efficacy)
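A minimal Bloom filter sketch shows the "No means No, Yes means Maybe" behaviour. It is an illustration under stated assumptions, not the implementation used in the cited work: the bit-array size, the number of hash functions, the use of seeded SHA-256 hashes, and the helper `contains_known_kmer` are all choices made here. How a "Maybe" answer is handled downstream is not specified on the slide and is left out.

```python
import hashlib


class BloomFilter:
    """Set-membership filter with no false negatives and tunable false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False: definitely not in the set. True: possibly in the set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def contains_known_kmer(read, bloom, k):
    """Return True if any length-k substring of `read` may be in the filter."""
    return any(read[i:i + k] in bloom for i in range(len(read) - k + 1))


# Hypothetical database of known sensitive sequences (here a single 12-mer).
known = BloomFilter()
known.add("TATTATTATTAT")

print(contains_known_kmer("CGACTATTATTATTATCGCA", known, k=12))  # True (maybe)
print("ACGT" in BloomFilter())  # False: an empty filter contains nothing
```

Larger `num_bits` lowers the false-positive rate at the cost of memory, which is the configurable trade-off mentioned above.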
40. Completeness of the method
• Novel STRs:
– In 11 years (2003-2014) TRDB registered 1k novel STRs (0.42% growth)
– Useless for attackers until present in STR databases
• Novel genes:
– Do not, on their own, determine whether a disease is contracted
– May have no relation with any disease
– Novel discoveries correlate diseases with known genes (limited number)
• Novel genomic variations:
– No single variation determines identity or the contraction of a disease
– Covered by increasing population samples in allele frequency studies
42. Different levels of privacy
• Rare STRs and genomic modifications
– the rarer they are, the higher the likelihood of re-identification
• How to build a discrete filter for sensitive
reads
– With multiple severity levels
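One possible way to make the filter discrete is to map the population frequency of a sequence to a small number of severity levels. The sketch below only illustrates that idea: the thresholds, the level names, and the functions `severity` and `filter_reads` are hypothetical and not part of the presented method.

```python
from bisect import bisect_right

# Hypothetical cut-offs on population frequency (rarer means more severe).
THRESHOLDS = [0.001, 0.01, 0.05]
LEVELS = ["critical", "high", "medium", "low"]


def severity(frequency):
    """Map a population frequency in [0, 1] to a discrete severity level."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    return LEVELS[bisect_right(THRESHOLDS, frequency)]


def filter_reads(reads, blocked_levels):
    """Keep only reads whose severity level is not blocked.

    `reads` is a list of (read_id, frequency) pairs.
    """
    return [read_id for read_id, freq in reads if severity(freq) not in blocked_levels]


reads = [("r1", 0.0005), ("r2", 0.02), ("r3", 0.5)]
print([severity(freq) for _, freq in reads])
print(filter_reads(reads, blocked_levels={"critical", "high"}))
```

Choosing which levels to block would then depend on the donor's preferences and the researcher's authorization.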
43. Human genetic individuality
• How to define an individuality measure as a
function that,
– given a genome and a population,
– returns a numerical value reflecting
– their diversity in terms of human genetics
• Found an identical genome in the population
– individuality = 0
• No privacy-sensitive sequence of the population
found in the given genome
– individuality = 1
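As a minimal sketch, one candidate function that meets the two boundary conditions above is one minus the highest Jaccard similarity between the genome and any member of the population. This is an assumption made for illustration, not the measure proposed by the project: each genome is modelled as a set of privacy-sensitive sequences, and the name `individuality` and the set representation are hypothetical.

```python
def individuality(genome, population):
    """Return a value in [0, 1]: 0 if an identical genome exists in the
    population, 1 if the genome shares no sensitive sequence with it.

    `genome` is an iterable of privacy-sensitive sequences;
    `population` is an iterable of such iterables.
    """
    genome = frozenset(genome)
    highest_similarity = 0.0
    for member in population:
        member = frozenset(member)
        union = genome | member
        # Two empty sets are indistinguishable, so treat them as identical.
        similarity = len(genome & member) / len(union) if union else 1.0
        highest_similarity = max(highest_similarity, similarity)
    return 1.0 - highest_similarity


print(individuality({"s1", "s2"}, [{"s1", "s2"}, {"s3"}]))  # identical genome present
print(individuality({"s1"}, [{"s2"}, {"s3"}]))              # nothing shared
print(individuality({"s1", "s2"}, [{"s1", "s3"}]))          # partial overlap
```

Using the closest match, rather than an average over the population, reflects that a single near-identical genome is enough to reduce individuality.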
44. What are the determinants of human
genetic individuality?
• The complexity of human behaviour is enormous
• So the determinants of human genetic
individuality
– may be hard to predict from single genomic properties
• Follow a systems biology approach to reach for a
deeper understanding of the complexity of the
human genome by analysing what defines us as
individuals
45. PhD scholarship available
• PhD scholarship for the project:
– "What are the determinants of human genetic
individuality?"
• Under the supervision of
– Prof. Francisco Couto (LASIGE)
– and Prof. Margarida Gama-Carvalho (BioISI)
• Applications until October 21st (12PM, CET)
– How to apply:
http://biosys.campus.ciencias.ulisboa.pt/node/6
46. Sharing data?
• “Adherence to data-sharing policies is as
inconsistent as the policies themselves”
“351 papers covered by some data-sharing policy,
only 143 fully adhered to that policy” (~40%)
Corbyn, Zoë. "Researchers Failing to Make Raw Data Public."
http://www.nature.com/news/2011/110914/full/news2011.536.html (2011).
• “More often than scientists would like to admit,
they cannot even recover the data associated
with their own published works”
Goodman, Alyssa, et al. "Ten simple rules for the care and feeding of
scientific data." PLoS Comput Biol 10.4 (2014): e1003542.
47. Reproducibility
• One of the main principles of the scientific
method
• But without access to data
– it is impossible (or very hard) to replicate results
48. Incentivize rather than Enforce
• “to encourage data sharing, systematic
reward and recognition mechanisms are
necessary”.
– Principles of data management and sharing at
European Research Infrastructures
Couto, Francisco M. "Rating, recognizing and rewarding metadata integration and
sharing on the semantic web." Proceedings of the 10th International Conference
on Uncertainty Reasoning for the Semantic Web-Volume 1259. CEUR-WS.org,
2014.
49. Final Remarks
• A privacy-preserving environment for genomic
data analysis is feasible
• A privacy-preserving environment will help
promote data sharing
– Not the opposite
– a severe leak may reverse the public opinion trend
• Determinants of human genetic individuality
– essential study for a privacy-preserving
environment