Technische Universität München
Fabian Prasser
Lehrstuhl für Medizinische Informatik
Institut für Medizinische Statistik und Epidemiologie
Klinikum rechts der Isar der TU München
An overview of methods for data anonymization
Three types of de-identification / anonymization
• Masking identifiers in unstructured data
• Subject: clinical notes, …
• Methods: machine learning, regular expressions, …
• Implementations: MIST, MITdeid, NLM Scrubber
• Privacy preserving data analysis (interactive scenario)
• Subject: query results, …
• Methods: interactive differential privacy, query-set-size control, …
• Implementations: AirCloak, Airavat, Fuzz, PINQ, HIDE
• Transforming structured data (non-interactive scenario)
• Subject: tabular data, …
• Methods: generalization, suppression, randomization, ...
• Implementations: AnonTool, ARX, sdcMicro, μArgus, PARAT
19.03.2015
Multiple aspects have to be balanced
• Main goal: Achieve a balance between data utility and privacy
• Complex task
• Many different types of methods need to be applied in an
integrated manner
• Methods may need to be parameterized
• Different aspects are interrelated
• Just the most important aspects and relationships
[Figure: interrelated aspects: use cases, types of data, privacy models, transformation models, utility measures, algorithms, risk models]
Important aspects of use cases
• Who or what will process the data in which way?
• Humans, e.g., epidemiologists
• Different types of analyses
• Interactive vs. non-interactive
• Machines, i.e., data mining
• Classification vs. clustering
• How will the data be released?
• Access control
• Open access vs. restricted access
• Continuous data publishing
• Multiple views vs. re-release (incremental vs. new attributes)
• Is the data distributed?
• Collaborative environments
• Vertical vs. horizontal vs. hybrid distribution
[FWF11]
[FWF11]
[FWF11]
Important properties of data
• Relational data
• Tabular data
• One row per individual
• Transactional data
• Data consisting of set-valued attributes
• Example: Follow-up collection of diagnosis codes
• Data with relational and transactional characteristics
• Dimensionality of data
• Mitigating re-identification is practically infeasible for
high-dimensional data
• Data with clusters
• Example: household structures
• Other types of data: Trajectory data, social network data
[Agg05]
Privacy models: some background
• Definition of (perfect) privacy
• “Anything that can be learned about a respondent from a statistical
database should be learnable without access to the database”
• Syntactic models
• Syntactic conditions on the released datasets
• No (direct) semantic implications regarding the above definition
• Instead: Assumptions about attack vectors and definition of (likely)
background knowledge and goals by classifying attributes
• Direct and indirect identifiers (or quasi-identifiers, or keys)
• Sensitive and insensitive attributes
• Semantic models
• Privacy models that relax a formalization of the above definition
• Far fewer assumptions need to be made about attackers
formulated by [Dwork08]
[Swe02]
[Dal77]
Risk and threat models
• Disclosure models
• Identity disclosure (re-identification, tuple linkage)
• Attribute disclosure (sensitive information disclosure)
• Membership disclosure (table linkage)
• Models for quantifying re-identification risks
• Super-population models: Population is modeled with probability
distributions parameterized with sample characteristics
• Decision rule by Dankar et al.: Combination of three models,
which has been evaluated for biomedical datasets
• Attacker models: May be used to derive/compile global risks
• Prosecutor scenario: Targets one specific individual
• Marketer scenario: Targets as many individuals as possible
• Journalist scenario: Targets any individual
[Emam13]
[DEN+12]
[LLZ+12]
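The attacker models above can be illustrated with sample-based risk estimates. This is a minimal sketch, assuming the released sample is treated as the whole population (which over-estimates risk); the function names and the toy rows are hypothetical. The highest risk corresponds to the worst case for a targeted individual, and the average risk is commonly used for the marketer scenario.

```python
from collections import Counter

def equivalence_class_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)

def highest_risk(sizes):
    """Worst-case risk: 1 / size of the smallest equivalence class."""
    return 1.0 / min(sizes.values())

def average_risk(sizes):
    """Average risk over all records: number of classes / number of records."""
    return len(sizes) / sum(sizes.values())

# Toy data (hypothetical): two records share their quasi-identifiers, one is unique.
rows = [
    {"age": "30-39", "zip": "816**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "816**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "803**", "diagnosis": "flu"},
]
sizes = equivalence_class_sizes(rows, ["age", "zip"])
print("highest risk:", highest_risk(sizes))
print("average risk:", average_risk(sizes))
```

Super-population models would replace the sample counts used here with estimated population counts.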
Syntactic models against re-identification
• Goal: Prevent linkage attacks on quasi-identifiers
• Some models for relational data
• k-Anonymity: Requires groups (cells or equivalence classes) of
size ≥ k, which defines an upper bound of 1/k on the re-identification
risk (over-) estimated with sample frequencies
• LKC-Privacy: Relaxed variant of k-anonymity + (ℓ-diversity)
• Risk-based approaches: Enforce thresholds on re-identification
risks, which may be quantified with super-population models
• HIPAA Safe Harbor: Heuristic with many predefined identifiers and
a few quasi-identifiers (regions and all kinds of dates). Contains
wildcards (“any other unique identifying number, characteristic, or
code”). Provides sound legal protection for custodians in the US
• Some models for transactional data
• (k^m)-Anonymity: k-Anonymity regarding ≤ m values from a set
[Swe02]
[MFH+09]
[HIP]
[TMK08]
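The k-anonymity condition above can be checked directly on a table. This is a minimal sketch, assuming rows are dictionaries and the quasi-identifiers have already been chosen; the function name and the toy data are hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    sizes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(size >= k for size in sizes.values())

# Toy data (hypothetical): age and ZIP code are already generalized.
table = [
    {"age": "30-39", "zip": "816**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "816**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "803**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "803**", "diagnosis": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], 2))
print(is_k_anonymous(table, ["age", "zip"], 3))
```

With k = 2 every equivalence class contains at least two records, so the sample-based re-identification risk of any record is at most 1/2.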
Syntactic models against attribute disclosure
• Observation: Preventing linkage attacks is not enough
• Goal: Prevent knowledge gain from sensitive information associated
with an equivalence class
• Some models for relational data
• ℓ-Diversity: Sensitive values must be “well-represented”. Multiple
variants exist with different privacy/utility trade-offs
• t-Closeness: Distribution of sensitive values must not be “too
different” from the overall dataset. Multiple variants exist
• p-Sensitive k-anonymity: Focus on identity & attribute disclosure
• LKC-Privacy: ℓ-Diversity & relaxed k-anonymity
• Some models for transactional data
• (h, k, p)-coherence: k^m-anonymity + protection against inference
• ρ-Uncertainty: protection against inference with fewer
assumptions
[MFH+09]
[MGK+06]
[LLV07]
[TV06]
[CKR+10]
[XWF+08]
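As an illustration of why preventing linkage is not enough, the following sketch checks distinct ℓ-diversity, the simplest of the ℓ-diversity variants: every equivalence class must contain at least ℓ different sensitive values. The function name and the toy data are hypothetical; other variants (for example entropy-based ones) use stricter conditions.

```python
from collections import defaultdict

def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if each equivalence class holds at least l distinct sensitive values."""
    values_per_class = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values_per_class[key].add(row[sensitive])
    return all(len(values) >= l for values in values_per_class.values())

# Toy data (hypothetical): the table is 2-anonymous, but the second class
# reveals the diagnosis of everyone in it.
table = [
    {"age": "30-39", "zip": "816**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "816**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "803**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "803**", "diagnosis": "flu"},
]
print(is_distinct_l_diverse(table, ["age", "zip"], "diagnosis", 2))
```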
Further syntactic models
• Models against membership disclosure for relational data
• Goal: Bounds on the certainty with which the presence of data
about an individual in a database can be inferred via linkage
• Upside: With strict thresholds, they provide semantic privacy
• Downside: Basically impossible to achieve
• δ-Presence: Relates sample counts to population counts
• c-Confident δ-presence: Relaxation of δ-presence in which
population characteristics are estimated
• Models for data which is relational and transactional
• (k, k^m)-Anonymity: Mixture of k-anonymity and k^m-anonymity
• Models for continuous publishing of relational data
• Approach by Byun et al.: Only supports insertions
• m-Invariance: Supports insertions, deletions, updates
[NAC07]
[NC10]
[TMK08]
[XT07]
A semantic model: Differential Privacy
• Observation
• The formal notion of privacy is impossible to achieve
• Even for individuals that are not part of the statistical database
• Idea
• Do not compare an attacker's information about an individual
before and after accessing a statistical database, but
• Compare the risks for an individual when joining (or leaving) a
statistical database
• (Slightly) more formal
• ϵ-Differential Privacy: A (randomized) function fulfills ϵ-DP if the
probability of every possible output value changes by a factor of at
most exp(ϵ) when data about an individual is or is not contained in
a database.
• Relaxations: (ϵ, δ)-DP, approximate DP
[Dwork06, Dwork08]
[Dwork06]
[Dwork06, Dwork08]
[PK08, LM12]
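One standard way to obtain a randomized function of this kind is the Laplace mechanism, which the slide does not name and which is used here only as an illustration. The sketch assumes a counting query, whose result changes by at most 1 when one individual is added or removed, so Laplace noise with scale 1/ϵ suffices. The function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon (sensitivity 1 assumed)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)  # seeded only to make the example reproducible
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, noisy_count(100, epsilon, rng))
```

Smaller values of ϵ give stronger protection but noisier answers; the released value is no longer the true count.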
A semantic model: Differential Privacy (cont'd)
• DP in interactive scenarios
• Sequential composition rule
• Privacy budget
• DP in non-interactive scenarios
• Release of contingency tables or marginals
• Relationships to syntactical models exist, e.g.,
• (k, β)-SDGS: Random sampling + k-anonymity fulfills (ϵ, δ)-DP
• t-Closeness (with a specific distance function): Implies
ϵ-DP regarding the sensitive attributes
• Has been criticized in the context of biomedical research
• DP is often not a truthful mechanism: Functions are
randomized, often data is perturbed, e.g., by adding noise
• DP is not intuitive: What is a good value for ϵ? What does it
mean?
[FDE13]
[DS15]
[LQS11]
[BCD+07]
[FDE13]
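The sequential composition rule and the privacy budget can be sketched together: if queries with parameters ϵ1, ϵ2, … are answered on the same data, the overall guarantee is the sum of the ϵ values, so an interactive system stops answering once a fixed total is used up. The class below is a hypothetical illustration, not the interface of any of the tools named earlier.

```python
class PrivacyBudget:
    """Tracks the epsilon spent so far under sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Reserve epsilon for one query, or refuse if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # small tolerance for float sums
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

    @property
    def remaining(self):
        return max(self.total - self.spent, 0.0)

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)  # first query
budget.spend(0.6)  # second query uses up the rest
print(budget.remaining)
```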
Measuring data utility
• Often used interchangeably with “loss of information”
• Exemplary utility measures for syntactic models
• Used for evaluating transformed datasets
• Discernibility: Based on sizes of equivalence classes
• Average equivalence class size: Analogous to discernibility
• (Non-uniform) entropy: Information theoretic measure
• Loss: Measures the coverage of the domain of attributes
• Utility constraints: Use cases are modeled as queries
• Exemplary utility measures for Differential Privacy
• Used for evaluating a method that fulfills DP
• Error: Absolute, relative, variance
• (α, δ)-Usefulness: P[distance ≤ α] ≥ δ
[BA05]
[LDD+05]
[GT09]
[LGM10]
[VI02]
[FDE13]
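The two class-size-based measures above can be sketched as follows. This is a simplified version, assuming no suppressed records: discernibility is taken as the sum of squared equivalence class sizes, and the average class size as records per class. Published definitions add a penalty for suppressed records and a normalization by k, which are omitted here. Function names and data are hypothetical.

```python
from collections import Counter

def class_sizes(rows, quasi_identifiers):
    """Sizes of the equivalence classes induced by the quasi-identifiers."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)

def discernibility(rows, quasi_identifiers):
    """Sum of squared class sizes: each record is penalized by its class size."""
    return sum(size * size for size in class_sizes(rows, quasi_identifiers).values())

def average_class_size(rows, quasi_identifiers):
    """Number of records divided by the number of equivalence classes."""
    return len(rows) / len(class_sizes(rows, quasi_identifiers))

# Toy data (hypothetical): one class of size 2 and one of size 1.
rows = [
    {"age": "30-39", "zip": "816**"},
    {"age": "30-39", "zip": "816**"},
    {"age": "40-49", "zip": "803**"},
]
print(discernibility(rows, ["age", "zip"]))
print(average_class_size(rows, ["age", "zip"]))
```

Lower values indicate smaller groups and therefore less information loss under both measures.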
Transformation methods
• Coding models
• Global recoding: The same value is always transformed in the same way
• Local recoding: Different transformations may be applied to the same value
• Truthful transformations
• Generalization: Based on domain generalization hierarchies
• Full-domain generalization: All values of an attribute are
generalized to the same level
• Subtree generalization: Different levels of generalization may
be applied
• Suppression: Removal of values of cells or complete tuples
• Top & bottom coding: Replacing values that exceed given
bounds
[FWF11]
[FWF11]
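The truthful transformations above can be sketched with small helper functions. The hierarchies (masking trailing ZIP digits, fixed-width age intervals) and all names are assumptions chosen for illustration; real generalization hierarchies are defined per attribute by the data custodian.

```python
def generalize_zip(zip_code, level):
    """Generalize a ZIP code by masking its last `level` characters with '*'."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    return zip_code[:len(zip_code) - level] + "*" * level

def generalize_age(age, width):
    """Generalize an age to an interval of the given width, e.g. 34 -> '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

def top_bottom_code(value, lower, upper):
    """Replace values outside [lower, upper] by an open-ended category."""
    if value < lower:
        return f"<{lower}"
    if value > upper:
        return f">{upper}"
    return str(value)

# Full-domain generalization: every value of the attribute uses the same level.
zips = ["81675", "81677", "80333"]
print([generalize_zip(z, 2) for z in zips])
print(generalize_age(34, 10))
print(top_bottom_code(97, 18, 90))
```

Subtree generalization would instead allow different levels for different branches of the hierarchy, and suppression corresponds to the highest level (all characters masked) or to dropping the tuple.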
Transformation methods (cont'd)
• Non-truthful transformations (Perturbation)
• Post-randomization: Randomly change categories of a
categorical variable according to predefined probabilities
• Value distortion: Multiplicative or additive noise
• Numerical rank swapping: Randomly swap values with other
values with a rank that does not differ by more than a predefined
threshold
• Microaggregation: Aggregate values in one group
• Replacing values: Distribution sample or distribution itself
• Methods on a structural level
• Random sampling: Randomly select a set of tuples
• Slicing: Partition the data horizontally and vertically and create
links between partitions
[FWF11]
[LLZ+12]
[LQS11]
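Microaggregation can be illustrated for a single numerical attribute. This is a minimal sketch under simple assumptions: values are sorted, consecutive groups of k values are formed, a too-small final group is merged into the previous one, and each value is replaced by its group mean. The function name is hypothetical, and multivariate methods are more involved.

```python
def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k neighbouring values."""
    if k < 1 or len(values) < k:
        raise ValueError("need at least k values")
    # Indices ordered by value, so results can be written back in the original order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)  # merge the remainder into the previous group
    result = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result

print(microaggregate([10, 1, 12, 2, 11, 3, 13], k=3))
```

The output is not truthful at the level of individual records, but group means and the overall mean are preserved.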
Algorithms
• Transform data to meet privacy models
• Given transformation methods, data properties etc.
• Randomized algorithms
• Randomized functions for Differential Privacy
• Genetic search
• Search algorithms
• Optimal algorithms: Flash, Incognito, OLA
• Heuristic algorithms: Top-Down-Specialization
• Clustering algorithms: Iteratively merge groups
• Data (focus on tuples): Method by Tassa et al.
• Space (focus on taxonomies): For transactional data
• Partitioning algorithms: Iteratively split groups
• Data: Mondrian
[Dwork08a]
[EDI+09, LDR05, KPE+12]
[FWY05]
[GT10]
[Iye02]
[GLS14]
[LG13]
[LDR06]
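The idea of a partitioning algorithm can be sketched in the spirit of Mondrian [LDR06]. This is a simplified illustration, not the reference implementation: it assumes numerical quasi-identifiers only, splits at the median of the attribute with the widest range, and stops when no split leaves at least k records on both sides. Function names and data are hypothetical.

```python
def mondrian(records, k):
    """Recursively split records (tuples of numbers) into groups of size >= k."""
    dims = range(len(records[0]))

    def width(d):
        return max(r[d] for r in records) - min(r[d] for r in records)

    # Try the attribute with the widest range first.
    for d in sorted(dims, key=width, reverse=True):
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [records]  # no allowable split: this group becomes an equivalence class

def generalize(partition):
    """Describe a partition by the (min, max) range of each attribute."""
    dims = range(len(partition[0]))
    return [(min(r[d] for r in partition), max(r[d] for r in partition)) for d in dims]

# Toy data (hypothetical): (age, length of stay in days)
records = [(25, 10), (27, 12), (30, 11), (52, 40), (55, 42), (60, 41)]
for part in mondrian(records, k=2):
    print(generalize(part), len(part))
```

Each resulting group contains at least k records, so replacing the values by the group ranges yields a k-anonymous table for these attributes.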
Thank you for your attention!
Further Readings
• Fung BCM, Wang K, Fu A, Yu P. Introduction to Privacy-Preserving
Data Publishing: Concepts and Techniques. Chapman & Hall/CRC,
ISBN: 1420091484, 2011.
• Gkoulalas-Divanis A, Loukides G, Sun J. Publishing data from
electronic health records while preserving privacy: A survey of
algorithms. J Biomed Inform, Vol. 50, p. 4-19, 2014
• Dankar FK, El Emam K. Practicing Differential Privacy in Health Care:
A Review. Trans. Data Privacy, Vol. 6:1, 2013.
• Gkoulalas-Divanis A, Loukides G. Anonymization of Electronic Medical
Records to Support Clinical Analysis. Springer. ISBN:
978-1-4614-5668-1, 2013
• El Emam K. Guide to the De-Identification of Personal Health
Information. Auerbach/CRC, ISBN 978-1-4665-7906-4, 2013
References
[Agg05] Aggarwal CC. On k-anonymity and the curse of dimensionality. PVLDB, p. 901–909, 2005.
[BA05] Bayardo RJ, Agrawal R. Data Privacy through Optimal k-Anonymization. ICDE. pp. 217–228, 2005.
[BCD+07] Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F et al. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. PODS, p. 273–282, 2007.
[CKR+10] Cao J, Karras P, Raïssi C, Tan K. rho-uncertainty: inference-proof transaction anonymization. PVLDB;3(1):1033–44. 2010.
[Dal77] Dalenius T. Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 5, p. 429–444, 1977.
[DEN+12] Dankar F, El Emam K, Neisa A, Roffey T. Estimating the re-identification risk of clinical data sets. BMC Medical Informatics and Decision Making, 12:66, 2012.
[DS15] Domingo-Ferrer J, Soria-Comas J. From t-closeness to differential privacy and vice versa in data anonymization. Knowl.-Based Syst., 74:151–158, 2015.
[Dwork06] Dwork C. Differential Privacy. ICALP; 4052, pp. 1–12. Springer, 2006.
[Dwork08] Dwork C. An Ad Omnia Approach to Defining and Achieving Private Data Analysis. PinKDD, p. 1-13, 2008.
[Dwork08a] Dwork C. Differential privacy: A survey of results. TAMC. pp. 1–19, 2008.
[EDI+09] El Emam K, Dankar F, Issa R, Jonker E et al. A globally optimal k-anonymity method for the de-identification of health data. JAMIA, 16(5), pp. 670–682, 2009.
[Emam13] El Emam K. Guide to the De-Identification of Personal Health Information. Auerbach/CRC, ISBN 978-1-4665-7906-4, 2013.
[FDE13] Dankar FK, El Emam K. Practicing differential privacy in health care: A review. Trans. Data Privacy, 6(1):35–67, 2013.
[FWF11] Fung BCM, Wang K, Fu A, Yu P. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. Chapman & Hall/CRC, ISBN: 1420091484, 2011.
[FWY05] Fung BCM, Wang K, Yu PS. Top-Down Specialization for Information and Privacy Preservation. ICDE, 2005, pp. 205–216.
[GKK07] Ghinita G, Karras P, Kalnis P, Mamoulis N. Fast data anonymization with low information loss. VLDB, 2007, pp. 758–769.
[GT09] Gionis A, Tassa T. k-Anonymization with Minimal Loss of Information. IEEE Trans Knowl Data Eng, pp. 206–219, 2009.
[GT10] Goldberger J, Tassa T. Efficient anonymizations with enhanced utility. Transactions on data privacy, vol. 3, pp. 149–175, 2010.
[GLS14] Gkoulalas-Divanis A, Loukides G, Sun J. Publishing data from electronic health records while preserving privacy: A survey of algorithms. JBI, Vol. 50, p. 4-19, 2014
[HIP] U.S. Department of Health and Human Services Office for Civil Rights. HIPAA administrative simplification regulation text; 2006.
[Iye02] Iyengar VS. Transforming data to satisfy privacy constraints. SIGKDD, 2002.
[KPE+12] Kohlmayer F, Prasser F, Eckert C, Kemper A, Kuhn KA. Flash: Efficient, Stable and Optimal K-Anonymity. PASSAT., pp. 708–717, Sep. 2012.
[LDD+05] LeFevre K, DeWitt DJ, Ramakrishnan R. Multidimensional K-Anonymity (TR-1521). Tech. rep. University of Wisconsin, 2005.
[LDR05] LeFevre K, DeWitt D, Ramakrishnan R. Incognito: Efficient full-domain k-anonymity. SIGMOD, pp. 49–60, 2005.
[LDR06] LeFevre K, DeWitt DJ, Ramakrishnan R. Mondrian Multidimensional K-Anonymity. ICDE, 2006.
[LG13] Loukides G, Gkoulalas-Divanis A. Utility-aware anonymization of diagnosis codes. J Biomed Health Inform. 2013;17(1):60–70.
[LGM10] Loukides G, Gkoulalas-Divanis A, Malin B, COAT: COnstraint-based anonymization of transactions. Knowl Inf Syst, 28:2, p. 251–282, 2010.
[LLV07] Li N, Li T, Venkatasubramanian S. t-Closeness: privacy beyond k-anonymity and l-diversity. ICDE; p. 106–15. 2007.
[LLZ+12] Li T, Li N, Zhang J, Molloy I. Slicing: A new approach for privacy preserving data publishing. IEEE Trans Knowl Data Eng, 24(3):561–574, 2012
[LM12] Li C, Miklau G. An adaptive mechanism for accurate query answering under differential privacy. PVLDB, 5(6):514–525, 2012.
[LQS11] Li N, Qardaji WH, Su D. Provably private data anonymization: Or, k-anonymity meets differential privacy. CoRR, abs/1101.2604, 2011.
[MFH+09] Mohammed N, Fung BCM, Hung PCK, and Lee CK. Anonymizing healthcare data: A case study on the blood transfusion service. SIGKDD, p. 1285–1294, 2009.
[MGK+06] Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. l-Diversity: privacy beyond k-anonymity. ICDE; pp. 24, 2006.
[NAC07] Nergiz ME, Atzori M, Clifton C. Hiding the presence of individuals from shared databases. SIGMOD; p. 665–676, 2007.
[NC10] Nergiz ME, Clifton C. d-presence without complete world knowledge. IEEE Trans Knowl Data Eng; 22(6):868–83, 2010.
[PK08] Kasiviswanathan SP, Smith A. A note on differential privacy: Defining resistance to arbitrary side information. CoRR abs/0803.3946, 2008.
[Swe02] Sweeney L. k-anonymity: a model for protecting privacy. IJUFKS;10:557–70, 2002.
[TMK08] Terrovitis M, Mamoulis N, Kalnis P. Privacy-preserving anonymization of set valued data. PVLDB;1(1):115–25, 2008.
[TV06] Truta TM, Vinay B. Privacy protection: p-sensitive k-anonymity property. ICDE workshops; p. 94, 2006.
[VI02] Iyengar VS. Transforming data to satisfy privacy constraints. SIGKDD, pp. 279–288, 2002.
[XT07] Xiao X, Tao Y. m-invariance: Towards privacy preserving republication of dynamic datasets. SIGMOD, 2007.
[XWF+08] Xu Y, Wang K, Fu AW-C, Yu PS. Anonymizing transaction databases for publication. KDD; p. 767–75, 2008.
1919.03.2015
An overview of methods for data anonymization

More Related Content

What's hot

Big data analytics
Big data analyticsBig data analytics
Big data analytics
Dr.Bhuvaneswari Velumani
 
Putting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAMPutting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAM
4Science
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
Amr Abd El Latief
 
security Issues of cloud computing
security Issues of cloud computingsecurity Issues of cloud computing
security Issues of cloud computing
prachupanchal
 
Neo4j Training Series - Spring Data Neo4j
Neo4j Training Series - Spring Data Neo4jNeo4j Training Series - Spring Data Neo4j
Neo4j Training Series - Spring Data Neo4j
Neo4j
 
Introduction to Cloud Computing and Cloud Infrastructure
Introduction to Cloud Computing and Cloud InfrastructureIntroduction to Cloud Computing and Cloud Infrastructure
Introduction to Cloud Computing and Cloud Infrastructure
SANTHOSHKUMARKL1
 
Introduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsIntroduction to Information Retrieval & Models
Introduction to Information Retrieval & Models
Mounia Lalmas-Roelleke
 
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Edureka!
 
Datascience and python
Datascience and pythonDatascience and python
Datascience and python
UmmeSalmaM1
 
Data mining
Data miningData mining
Data mining
Kinza Razzaq
 
FAIR data overview
FAIR data overviewFAIR data overview
Data science in health care
Data science in health careData science in health care
Data science in health care
Chetan Khanzode
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
Spotle.ai
 
Privacy and Data Security
Privacy and Data SecurityPrivacy and Data Security
Privacy and Data Security
WilmerHale
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Srishti44
 
Healthcare and Cyber security
Healthcare and Cyber securityHealthcare and Cyber security
Healthcare and Cyber security
Brian Matteson, CISSP CISA
 
Text MIning
Text MIningText MIning
Text MIning
Prakhyath Rai
 
Data analytics
Data analyticsData analytics
Data analytics
Dr.Bhuvaneswari Velumani
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
Datamining Tools
 
Privacy-preserving Data Mining in Industry (WWW 2019 Tutorial)
Privacy-preserving Data Mining in Industry (WWW 2019 Tutorial)Privacy-preserving Data Mining in Industry (WWW 2019 Tutorial)
Privacy-preserving Data Mining in Industry (WWW 2019 Tutorial)
Krishnaram Kenthapadi
 

What's hot (20)

Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Putting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAMPutting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAM
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
security Issues of cloud computing
security Issues of cloud computingsecurity Issues of cloud computing
security Issues of cloud computing
 
Neo4j Training Series - Spring Data Neo4j
Neo4j Training Series - Spring Data Neo4jNeo4j Training Series - Spring Data Neo4j
Neo4j Training Series - Spring Data Neo4j
 
Introduction to Cloud Computing and Cloud Infrastructure
Introduction to Cloud Computing and Cloud InfrastructureIntroduction to Cloud Computing and Cloud Infrastructure
Introduction to Cloud Computing and Cloud Infrastructure
 
Introduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsIntroduction to Information Retrieval & Models
Introduction to Information Retrieval & Models
 
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
 
Datascience and python
Datascience and pythonDatascience and python
Datascience and python
 
Data mining
Data miningData mining
Data mining
 
FAIR data overview
FAIR data overviewFAIR data overview
FAIR data overview
 
Data science in health care
Data science in health careData science in health care
Data science in health care
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
 
Privacy and Data Security
Privacy and Data SecurityPrivacy and Data Security
Privacy and Data Security
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Healthcare and Cyber security
Healthcare and Cyber securityHealthcare and Cyber security
Healthcare and Cyber security
 
Text MIning
Text MIningText MIning
Text MIning
 
Data analytics
Data analyticsData analytics
Data analytics
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Privacy-preserving Data Mining in Industry (WWW 2019 Tutorial)
Privacy-preserving Data Mining in Industry (WWW 2019 Tutorial)Privacy-preserving Data Mining in Industry (WWW 2019 Tutorial)
Privacy-preserving Data Mining in Industry (WWW 2019 Tutorial)
 

Similar to An overview of methods for data anonymization

Presentation of PhD thesis on Location Data Fusion
Presentation of PhD thesis on Location Data Fusion Presentation of PhD thesis on Location Data Fusion
Presentation of PhD thesis on Location Data Fusion
Alket Cecaj
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
Thanveen
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
Vaticle
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
Khalid Salama
 
A Lifecycle Approach to Information Privacy
A Lifecycle Approach to Information PrivacyA Lifecycle Approach to Information Privacy
A Lifecycle Approach to Information Privacy
Micah Altman
 
Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.ppt
bommaiah
 
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...Kato Mivule
 
Creating Effective Data Visualizations for Online Learning
Creating Effective Data Visualizations for Online Learning Creating Effective Data Visualizations for Online Learning
Creating Effective Data Visualizations for Online Learning
Shalin Hai-Jew
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization tool
arx-deidentifier
 
Data mining
Data miningData mining
Data mining
pradeepa n
 
Main principles of Data Science and Machine Learning
Main principles of Data Science and Machine LearningMain principles of Data Science and Machine Learning
Main principles of Data Science and Machine Learning
Nikolay Karelin
 
Graph Models for Deep Learning
Graph Models for Deep LearningGraph Models for Deep Learning
Graph Models for Deep Learning
Experfy
 
Project 0th Review
Project 0th ReviewProject 0th Review
Project 0th Review
Divakar Raj M
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
Sri Ambati
 
Characterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science ResearchCharacterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science Research
Micah Altman
 
Data Mining 101
Data Mining 101Data Mining 101
Data Mining 101
Ali Septiandri
 
The Genopolis Microarray database
The Genopolis Microarray databaseThe Genopolis Microarray database
The Genopolis Microarray database
Novartis Institutes for BioMedical Research
 
DATAIA & TransAlgo
DATAIA & TransAlgoDATAIA & TransAlgo
DATAIA & TransAlgo
Nozha Boujemaa
 
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
IT Network marcus evans
 

Similar to An overview of methods for data anonymization (20)

Presentation of PhD thesis on Location Data Fusion
Presentation of PhD thesis on Location Data Fusion Presentation of PhD thesis on Location Data Fusion
Presentation of PhD thesis on Location Data Fusion
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
 
A Lifecycle Approach to Information Privacy
A Lifecycle Approach to Information PrivacyA Lifecycle Approach to Information Privacy
A Lifecycle Approach to Information Privacy
 
Introduction
IntroductionIntroduction
Introduction
 
Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.ppt
 
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
 
Creating Effective Data Visualizations for Online Learning
Creating Effective Data Visualizations for Online Learning Creating Effective Data Visualizations for Online Learning
Creating Effective Data Visualizations for Online Learning
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization tool
 
Data mining
Data miningData mining
Data mining
 
Main principles of Data Science and Machine Learning
Main principles of Data Science and Machine LearningMain principles of Data Science and Machine Learning
Main principles of Data Science and Machine Learning
 
Graph Models for Deep Learning
Graph Models for Deep LearningGraph Models for Deep Learning
Graph Models for Deep Learning
 
Project 0th Review
Project 0th ReviewProject 0th Review
Project 0th Review
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Characterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science ResearchCharacterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science Research
 
Data Mining 101
Data Mining 101Data Mining 101
Data Mining 101
 
The Genopolis Microarray database
The Genopolis Microarray databaseThe Genopolis Microarray database
The Genopolis Microarray database
 
DATAIA & TransAlgo
DATAIA & TransAlgoDATAIA & TransAlgo
DATAIA & TransAlgo
 
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
 

An overview of methods for data anonymization

  • 1. Technische Universität München Fabian Prasser Lehrstuhl für Medizinische Informatik Institut für Medizinische Statistik und Epidemiologie Klinikum rechts der Isar der TU München An overview of methods for data anonymization
  • 2. Three types of de-identification / anonymization • Masking identifiers in unstructured data • Subject: clinical notes, … • Methods: machine learning, regular expressions, … • Implementations: MIST, MITdeid, NLM Scrubber • Privacy-preserving data analysis (interactive scenario) • Subject: query results, … • Methods: interactive differential privacy, query-set-size control, … • Implementations: AirCloak, Airavat, Fuzz, PINQ, HIDE • Transforming structured data (non-interactive scenario) • Subject: tabular data, … • Methods: generalization, suppression, randomization, … • Implementations: AnonTool, ARX, sdcMicro, μArgus, PARAT (19.03.2015)
  • 3. Multiple aspects have to be balanced • Main goal: Achieve a balance between data utility and privacy • Complex task • Many different types of methods need to be applied in an integrated manner • Methods may need to be parameterized • Different aspects are interrelated • [Diagram showing just the most important aspects and relationships: Use cases, Types of data, Privacy models, Transformation models, Utility measures, Algorithms, Risk models]
  • 4. Important aspects of use cases • Who or what will process the data in which way? • Humans, e.g., epidemiologists • Different types of analyses • Interactive vs. non-interactive • Machines, i.e., data mining • Classification vs. clustering • How will the data be released? • Access control • Open access vs. restricted access • Continuous data publishing • Multiple views vs. re-release (incremental vs. new attributes) • Is the data distributed? • Collaborative environments • Vertical vs. horizontal vs. hybrid distribution [FWF11]
  • 5. Important properties of data • Relational data • Tabular data • One row per individual • Transactional data • Data consisting of set-valued attributes • Example: Follow-up collection of diagnosis codes • Data with relational and transactional characteristics • Dimensionality of data • Mitigating re-identification is practically infeasible for high-dimensional data [Agg05] • Data with clusters • Example: household structures • Other types of data: Trajectory data, social network data
  • 6. Privacy models: some background • Definition of (perfect) privacy • “Anything that can be learned about a respondent from a statistical database should be learnable without access to the database” ([Dal77], as formulated by [Dwork08]) • Syntactic models • Syntactic conditions on the released datasets • No (direct) semantic implications regarding the above definition • Instead: Assumptions about attack vectors and definition of (likely) background knowledge and goals by classifying attributes [Swe02] • Direct and indirect identifiers (or quasi-identifiers, or keys) • Sensitive and insensitive attributes • Semantic models • Privacy models that relax a formalization of the above definition • Far fewer assumptions need to be made about attackers
  • 7. Risk and threat models • Disclosure models • Identity disclosure (re-identification, tuple linkage) • Attribute disclosure (sensitive information disclosure) • Membership disclosure (table linkage) • Models for quantifying re-identification risks • Super-population models: Population is modeled with probability distributions parameterized with sample characteristics • Decision rule by Dankar et al.: Combination of three models, which has been evaluated for biomedical datasets [DEN+12] • Attacker models: May be used to derive/compile global risks • Prosecutor scenario: Targets one specific individual • Marketer scenario: Targets as many individuals as possible • Journalist scenario: Targets any individual [Emam13] [LLZ+12]
  • 8. Syntactic models against re-identification • Goal: Prevent linkage attacks on quasi-identifiers • Some models for relational data • k-Anonymity: Requires groups (cells or equivalence classes) of size ≥ k, which defines an upper bound on the re-identification risk (over-)estimated with sample frequencies [Swe02] • LKC-Privacy: Relaxed variant of k-anonymity + (ℓ-diversity) [MFH+09] • Risk-based approaches: Enforce thresholds on re-identification risks, which may be quantified with super-population models • HIPAA Safe Harbor: Heuristic with many predefined identifiers and a few quasi-identifiers (regions and all kinds of dates). Contains wildcards (“any other unique identifying number, characteristic, or code”). Provides sound legal protection for custodians in the US [HIP] • Some models for transactional data • k^m-Anonymity: k-Anonymity regarding ≤ m values from a set [TMK08]
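The k-anonymity condition on slide 8 can be checked mechanically. The following is a minimal Python sketch, not taken from the slides or from any of the listed tools: the function names, the toy records, and the choice of `age` and `zip` as quasi-identifiers are all illustrative assumptions. It groups records by their quasi-identifier values, tests that every group has at least k members, and reports the risk bound of 1 divided by the smallest group size.

```python
from collections import Counter


def equivalence_class_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every equivalence class contains at least k rows."""
    sizes = equivalence_class_sizes(rows, quasi_identifiers)
    return all(size >= k for size in sizes.values())


def max_sample_risk(rows, quasi_identifiers):
    """Upper bound on re-identification risk from sample frequencies.

    Assumes `rows` is non-empty; the bound is 1 / (smallest class size).
    """
    sizes = equivalence_class_sizes(rows, quasi_identifiers)
    return 1.0 / min(sizes.values())


# Hypothetical, already generalized toy data (not from the slides).
rows = [
    {"age": "30-39", "zip": "816**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "816**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "803**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "803**", "diagnosis": "diabetes"},
]
QI = ["age", "zip"]

two_anonymous = is_k_anonymous(rows, QI, 2)
three_anonymous = is_k_anonymous(rows, QI, 3)
risk = max_sample_risk(rows, QI)
```

With this toy table both groups have two members, so the data is 2-anonymous but not 3-anonymous, and the sample-based risk bound is 1/2.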
  • 9. Syntactic models against attribute disclosure • Observation: Preventing linkage attacks is not enough • Goal: Prevent knowledge gain from sensitive information associated with an equivalence class • Some models for relational data • ℓ-Diversity: Sensitive values must be “well-represented”. Multiple variants exist with different privacy/utility trade-offs [MGK+06] • t-Closeness: Distribution of sensitive values must not be “too different” from the overall dataset. Multiple variants exist [LLV07] • p-Sensitive k-anonymity: Focus on identity & attribute disclosure [TV06] • LKC-Privacy: ℓ-Diversity & relaxed k-anonymity [MFH+09] • Some models for transactional data • (h, k, p)-coherence: k^m-anonymity + protection against inference [XWF+08] • ρ-Uncertainty: protection against inference with fewer assumptions [CKR+10]
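To illustrate why preventing linkage alone is not enough, here is a small Python sketch of the simplest ℓ-diversity variant, distinct ℓ-diversity, which requires at least ℓ different sensitive values per equivalence class. The slides only state that multiple variants exist; picking the distinct variant, as well as the function name and the toy records, is an assumption made for illustration.

```python
from collections import defaultdict


def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Return True if each equivalence class holds at least l distinct sensitive values."""
    values_per_class = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values_per_class[key].add(row[sensitive])
    return all(len(values) >= l for values in values_per_class.values())


# Hypothetical toy data: 2-anonymous, but the first class reveals "flu".
rows = [
    {"age": "30-39", "diagnosis": "flu"},
    {"age": "30-39", "diagnosis": "flu"},
    {"age": "40-49", "diagnosis": "asthma"},
    {"age": "40-49", "diagnosis": "diabetes"},
]

diverse_1 = is_distinct_l_diverse(rows, ["age"], "diagnosis", 1)
diverse_2 = is_distinct_l_diverse(rows, ["age"], "diagnosis", 2)
```

Every class here has two members, yet anyone known to be in the 30-39 group is known to have the flu, so the table fails distinct 2-diversity.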
  • 10. Further syntactic models • Models against membership disclosure for relational data • Goal: Bounds on the certainty with which the presence of data about an individual in a database can be inferred via linkage • Upside: With strict thresholds, they provide semantic privacy • Downside: Basically impossible to achieve • δ-Presence: Relates sample counts to population counts [NAC07] • c-Confident δ-presence: Relaxation of δ-presence in which population characteristics are estimated [NC10] • Models for data that is relational and transactional • (k, k^m)-Anonymity: Mixture of k-anonymity and k^m-anonymity [TMK08] • Models for continuous publishing of relational data • Approach by Byun et al.: Only supports insertions • m-Invariance: Supports insertions, deletions, updates [XT07]
  • 11. A semantic model: Differential Privacy • Observation • The formal notion of privacy is impossible to achieve • Even for individuals that are not part of the statistical database • Idea • Do not compare an attacker's information about an individual before and after accessing a statistical database, but • Compare the risks for an individual when joining (or leaving) a statistical database • (Slightly) more formal • ϵ-Differential Privacy: A (randomized) function fulfills ϵ-DP if the probability of every possible output value changes by a factor of at most exp(ϵ) when data about an individual is or is not contained in a database. • Relaxations: (ϵ, δ)-DP, approximate DP [Dwork06, Dwork08] [PK08, LM12]
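Written out, the informal statement on slide 11 corresponds to the usual formulation of ϵ-differential privacy. The notation below is an assumption, since the slide gives the definition only in words: f is the randomized function, D and D' are any two databases that differ in the data of one individual, and S is any set of possible outputs.

```latex
% epsilon-differential privacy for a randomized function f
\[
  \Pr[f(D) \in S] \;\le\; e^{\epsilon} \cdot \Pr[f(D') \in S]
  \qquad \text{for all } S \subseteq \mathrm{Range}(f).
\]
% The (epsilon, delta) relaxation additionally tolerates a small additive slack delta
\[
  \Pr[f(D) \in S] \;\le\; e^{\epsilon} \cdot \Pr[f(D') \in S] + \delta .
\]
```

Because the condition must hold in both directions (swapping D and D'), the output probabilities can differ by at most a factor of exp(ϵ) whether or not the individual's data is included.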
  • 12. A semantic model: Differential Privacy (cont'd) • DP in interactive scenarios • Sequential composition rule • Privacy budget • DP in non-interactive scenarios • Release of contingency tables or marginals [BCD+07] • Relationships to syntactic models exist, e.g., • (k, β)-SDGS: Random sampling + k-anonymity fulfills (ϵ, δ)-DP [LQS11] • t-Closeness (with a specific distance function): Implies ϵ-DP regarding the sensitive attributes [DS15] • Has been criticized in the context of biomedical research [FDE13] • DP is often not a truthful mechanism: Functions are randomized, often data is perturbed, e.g., by adding noise • DP is not intuitive: What is a good value for ϵ? What does it mean?
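The interplay of noise addition, sequential composition, and a privacy budget in the interactive scenario can be sketched in a few lines of Python. The slides do not name a concrete mechanism; the sketch assumes the standard Laplace mechanism for counting queries (sensitivity 1, noise scale 1/ϵ), and the class `PrivacyBudget` and the function names are hypothetical.

```python
import random


class PrivacyBudget:
    """Tracks the total epsilon spent; under sequential composition the epsilons add up."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, budget, rng):
    """Answer a counting query (sensitivity 1) with Laplace noise of scale 1/epsilon."""
    budget.spend(epsilon)
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)
budget = PrivacyBudget(total_epsilon=1.0)
first = noisy_count(120, 0.5, budget, rng)
second = noisy_count(75, 0.5, budget, rng)

try:
    noisy_count(30, 0.5, budget, rng)
    third_query_allowed = True
except RuntimeError:
    third_query_allowed = False
```

Two queries at ϵ = 0.5 each use up the total budget of 1.0, so the third query is refused. The returned counts are not truthful in the sense of the slide: they are real numbers that deviate randomly from the true counts.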
  • 13. Measuring data utility • Often used interchangeably with “loss of information” • Exemplary utility measures for syntactic models • Used for evaluating transformed datasets • Discernibility: Based on sizes of equivalence classes [BA05] • Average equivalence class size: Analogously to discernibility [LDD+05] • (Non-uniform) entropy: Information theoretic measure [GT09] • Loss: Measures the coverage of the domain of attributes [VI02] • Utility constraints: Use cases are modeled as queries [LGM10] • Exemplary utility measures for Differential Privacy [FDE13] • Used for evaluating a method that fulfills DP • Error: Absolute, relative, variance • (α, δ)-Usefulness: P[distance ≤ α] ≥ δ
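The two class-size-based measures on slide 13 are easy to compute. The Python sketch below makes simplifying assumptions: no records are suppressed, discernibility is taken as the sum of squared equivalence class sizes, and the average class size is left unnormalized. The function names and toy data are illustrative.

```python
from collections import Counter


def class_sizes(rows, quasi_identifiers):
    """Sizes of the equivalence classes induced by the quasi-identifiers."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return list(counts.values())


def discernibility(sizes):
    """Sum of squared class sizes; lower values mean records stay more distinguishable."""
    return sum(size * size for size in sizes)


def average_class_size(sizes):
    """Number of records divided by the number of equivalence classes."""
    return sum(sizes) / len(sizes)


# Two hypothetical transformations of the same four records.
even = [{"age": "30-39"}] * 2 + [{"age": "40-49"}] * 2
skewed = [{"age": "30-49"}] * 3 + [{"age": "50-59"}]

even_sizes = class_sizes(even, ["age"])
skewed_sizes = class_sizes(skewed, ["age"])
```

Both outputs have an average class size of 2.0, but discernibility is 8 for the even partition and 10 for the skewed one, so the two measures can rank transformations differently.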
  • 14. Transformation methods • Coding models • Global recoding: Similar transformation for similar values • Local recoding: Different transformations may be applied • Truthful transformations • Generalization: Based on domain generalization hierarchies • Full-domain generalization: All values of an attribute are generalized to the same level • Subtree generalization: Different levels of generalization may be applied • Suppression: Removal of values of cells or complete tuples • Top & bottom coding: Replacing values that exceed given bounds [FWF11]
  • 15. Transformation methods (cont'd) • Non-truthful transformations (Perturbation) • Post-randomization: Randomly change categories of a categorical variable according to predefined probabilities • Value distortion: Multiplicative or additive noise • Numerical rank swapping: Randomly swap values with other values with a rank that does not differ by more than a predefined threshold • Microaggregation: Aggregate values in one group • Replacing values: Distribution sample or distribution itself • Methods on a structural level • Random sampling: Randomly select a set of tuples [LQS11] • Slicing: Partition the data horizontally and vertically and create links between partitions [LLZ+12] [FWF11]
  • 16. Algorithms • Transform data to meet privacy models • Given transformation methods, data properties etc. • Randomized algorithms • Randomized functions for Differential Privacy [Dwork08a] • Genetic search [Iye02] • Search algorithms • Optimal algorithms: Flash, Incognito, OLA [EDI+09, LDR05, KPE+12] • Heuristic algorithms: Top-Down-Specialization [FWY05] • Clustering algorithms: Iteratively merge groups • Data (focus on tuples): Method by Tassa et al. [GT10] • Space (focus on taxonomies): For transactional data [GLS14] [LG13] • Partitioning algorithms: Iteratively split groups • Data: Mondrian [LDR06]
  • 17. Thank you for your attention!
  • 18. Further Readings • Fung BCM, Wang K, Fu A, Yu P. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. Chapman & Hall/CRC, ISBN: 1420091484, 2011. • Gkoulalas-Divanis A, Loukides G, Sun J. Publishing data from electronic health records while preserving privacy: A survey of algorithms. J Biomed Inform, Vol. 50, pp. 4–19, 2014. • Dankar FK, El Emam K. Practicing Differential Privacy in Health Care: A Review. Trans. Data Privacy, Vol. 6:1, 2013. • Gkoulalas-Divanis A, Loukides G. Anonymization of Electronic Medical Records to Support Clinical Analysis. Springer, ISBN: 978-1-4614-5668-1, 2013. • El Emam K. Guide to the De-Identification of Personal Health Information. Auerbach/CRC, ISBN 978-1-4665-7906-4, 2013.
  • 19. References
    [Agg05] Aggarwal CC. On k-anonymity and the curse of dimensionality. PVLDB, pp. 901–909, 2005.
    [BA05] Bayardo RJ, Agrawal R. Data Privacy through Optimal k-Anonymization. ICDE, pp. 217–228, 2005.
    [BCD+07] Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F et al. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. PODS, pp. 273–282, 2007.
    [CKR+10] Cao J, Karras P, Raïssi C, Tan K. ρ-uncertainty: inference-proof transaction anonymization. PVLDB, 3(1):1033–1044, 2010.
    [Dal77] Dalenius T. Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 5, pp. 429–444, 1977.
    [DEN+12] Dankar F, El Emam K, Neisa A, Roffey T. Estimating the re-identification risk of clinical data sets. BMC Medical Informatics and Decision Making, 12:66, 2012.
    [DS15] Domingo-Ferrer J, Soria-Comas J. From t-closeness to differential privacy and vice versa in data anonymization. Knowl.-Based Syst., 74:151–158, 2015.
    [Dwork06] Dwork C. Differential Privacy. ICALP, LNCS 4052, pp. 1–12, Springer, 2006.
    [Dwork08] Dwork C. An Ad Omnia Approach to Defining and Achieving Private Data Analysis. PinKDD, pp. 1–13, 2008.
    [Dwork08a] Dwork C. Differential privacy: A survey of results. TAMC, pp. 1–19, 2008.
    [EDI+09] El Emam K, Dankar F, Issa R, Jonker E et al. A globally optimal k-anonymity method for the de-identification of health data. JAMIA, 16(5), pp. 670–682, 2009.
    [Emam13] El Emam K. Guide to the De-Identification of Personal Health Information. Auerbach/CRC, ISBN 978-1-4665-7906-4, 2013.
    [FDE13] Dankar FK, El Emam K. Practicing differential privacy in health care: A review. Trans. Data Privacy, 6(1):35–67, 2013.
    [FWF11] Fung BCM, Wang K, Fu A, Yu P. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. Chapman & Hall/CRC, ISBN: 1420091484, 2011.
    [FWY05] Fung BCM, Wang K, Yu PS. Top-Down Specialization for Information and Privacy Preservation. ICDE, pp. 205–216, 2005.
    [GKK07] Ghinita G, Karras P, Kalnis P, Mamoulis N. Fast data anonymization with low information loss. VLDB, pp. 758–769, 2007.
    [GT09] Gionis A, Tassa T. k-Anonymization with Minimal Loss of Information. IEEE Trans Knowl Data Eng, pp. 206–219, 2009.
    [GT10] Goldberger J, Tassa T. Efficient anonymizations with enhanced utility. Trans. Data Privacy, vol. 3, pp. 149–175, 2010.
    [GLS14] Gkoulalas-Divanis A, Loukides G, Sun J. Publishing data from electronic health records while preserving privacy: A survey of algorithms. JBI, Vol. 50, pp. 4–19, 2014.
    [HIP] U.S. Department of Health and Human Services Office for Civil Rights. HIPAA administrative simplification regulation text, 2006.
    [Iye02] Iyengar VS. Transforming data to satisfy privacy constraints. SIGKDD, pp. 279–288, 2002.
    [KPE+12] Kohlmayer F, Prasser F, Eckert C, Kemper A, Kuhn KA. Flash: Efficient, Stable and Optimal K-Anonymity. PASSAT, pp. 708–717, 2012.
    [LDD+05] LeFevre K, DeWitt DJ, Ramakrishnan R. Multidimensional K-Anonymity (TR-1521). Tech. rep., University of Wisconsin, 2005.
    [LDR05] LeFevre K, DeWitt D, Ramakrishnan R. Incognito: Efficient full-domain k-anonymity. SIGMOD, pp. 49–60, 2005.
    [LDR06] LeFevre K, DeWitt DJ, Ramakrishnan R. Mondrian Multidimensional K-Anonymity. ICDE, 2006.
    [LG13] Loukides G, Gkoulalas-Divanis A. Utility-aware anonymization of diagnosis codes. J Biomed Health Inform, 17(1):60–70, 2013.
    [LGM10] Loukides G, Gkoulalas-Divanis A, Malin B. COAT: COnstraint-based anonymization of transactions. Knowl Inf Syst, 28:2, pp. 251–282, 2010.
    [LLV07] Li N, Li T, Venkatasubramanian S. t-Closeness: privacy beyond k-anonymity and l-diversity. ICDE, pp. 106–115, 2007.
    [LLZ+12] Li T, Li N, Zhang J, Molloy I. Slicing: A new approach for privacy preserving data publishing. IEEE Trans Knowl Data Eng, 24(3):561–574, 2012.
    [LM12] Li C, Miklau G. An adaptive mechanism for accurate query answering under differential privacy. PVLDB, 5(6):514–525, 2012.
    [LQS11] Li N, Qardaji WH, Su D. Provably private data anonymization: Or, k-anonymity meets differential privacy. CoRR, abs/1101.2604, 2011.
    [MFH+09] Mohammed N, Fung BCM, Hung PCK, Lee CK. Anonymizing healthcare data: A case study on the blood transfusion service. SIGKDD, pp. 1285–1294, 2009.
    [MGK+06] Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. l-Diversity: privacy beyond k-anonymity. ICDE, p. 24, 2006.
    [NAC07] Nergiz ME, Atzori M, Clifton C. Hiding the presence of individuals from shared databases. SIGMOD, pp. 665–676, 2007.
    [NC10] Nergiz ME, Clifton C. δ-presence without complete world knowledge. IEEE Trans Knowl Data Eng, 22(6):868–883, 2010.
    [PK08] Kasiviswanathan SP, Smith A. A note on differential privacy: Defining resistance to arbitrary side information. CoRR, abs/0803.3946, 2008.
    [Swe02] Sweeney L. k-anonymity: a model for protecting privacy. IJUFKS, 10:557–570, 2002.
    [TMK08] Terrovitis M, Mamoulis N, Kalnis P. Privacy-preserving anonymization of set-valued data. PVLDB, 1(1):115–125, 2008.
    [TV06] Truta TM, Vinay B. Privacy protection: p-sensitive k-anonymity property. ICDE workshops, p. 94, 2006.
    [VI02] Iyengar VS. Transforming data to satisfy privacy constraints. SIGKDD, pp. 279–288, 2002.
    [XT07] Xiao X, Tao Y. m-invariance: Towards privacy preserving republication of dynamic datasets. SIGMOD, 2007.
    [XWF+08] Xu Y, Wang K, Fu AW-C, Yu PS. Anonymizing transaction databases for publication. KDD, pp. 767–775, 2008.