Website with further information: http://arx.deidentifier.org
Description of this talk:
While a plethora of methods have been proposed for dealing with many aspects of de-identifying clinical data, only a few (prototypical) implementations are available. The complexity of implementing privacy technologies is an often overlooked challenge.
In this talk we will present the open source data de-identification tool ARX, which has been carefully engineered to support multiple privacy technologies for relational datasets. Our tool bridges the gap between different scientific disciplines by integrating methods developed and used by the statistics community with data anonymization techniques developed by computer scientists.
ARX has been designed from the ground up to ensure scalability and it is able to process very large datasets on commodity hardware. The software implements a large set of privacy models: (1) syntactic privacy models, such as k-anonymity, l-diversity, t-closeness and δ-presence, (2) statistical models for re-identification risks, and (3) differential privacy. In the talk, we will focus on measures to reduce the uniqueness of records. ARX also supports more than ten different methods for evaluating data utility, including loss, precision, non-uniform entropy and KL divergence.
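To make the idea of a syntactic privacy model concrete, here is a minimal sketch of a check for distinct l-diversity, one simple variant of the l-diversity model named above. It does not use the ARX API. The function name, the record layout and the example values are assumptions made for illustration only.

```python
from collections import defaultdict

def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Return the largest l for which the records are distinct-l-diverse.

    Records sharing the same quasi-identifier values form one equivalence
    class. The result is the smallest number of distinct sensitive values
    found in any class. Raises ValueError if `records` is empty.
    """
    classes = defaultdict(set)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].add(record[sensitive])
    return min(len(values) for values in classes.values())

# Hypothetical, already generalized example data.
records = [
    {"age": "30-39", "zip": "816**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "816**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "817**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "817**", "diagnosis": "flu"},
]
l_value = distinct_l_diversity(records, ["age", "zip"], "diagnosis")
print(l_value)  # 1: the second class contains only "flu"
```

Both classes contain two records, yet the second class reveals the diagnosis of everyone in it. This is the kind of attribute disclosure that l-diversity is meant to rule out.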
In ARX, de-identification of data can be performed automatically, semi-automatically and manually using a complex method that integrates global recoding, local recoding, categorization, generalization, suppression, microaggregation and top/bottom-coding. All methods are accessible via a comprehensive cross-platform graphical user interface.
Engineering data privacy - The ARX data anonymization tool
1. Technische Universität München
Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn
Chair for Biomedical Informatics
Institute for Medical Statistics and Epidemiology
University of Technology Munich (TUM)
2. Technische Universität München
What is ARX?
● ARX = methods from statistics + methods from computer science
● A tool for analyzing and reducing the uniqueness of records in a (relational) dataset
● Variety of methods
● Highly scalable
● Up to 50 dimensions (i.e. attributes)
● Millions of records
● (Semi-)automatically and/or manually
● Comprehensive graphical user interface
ARX | Dagstuhl Genomic Privacy Workshop 2015 | 22.10.15
Images: https://commons.wikimedia.org/ users: Ysangkok, Scarce2
4. Technische Universität München
Overview of methods implemented by ARX
Sample-based methods
• Fraction of sample uniques
• Average sample uniqueness
• k-anonymity
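The sample-based measures listed above can all be derived from the sizes of the equivalence classes in the dataset. The following is a minimal sketch, not ARX code. The function and key names are hypothetical. The "average_risk" value (the mean of 1 / class size over all records) is only one plausible reading of "average sample uniqueness" and is labeled as an assumption.

```python
from collections import Counter

def sample_risk_measures(records, quasi_identifiers):
    """Compute simple sample-based uniqueness measures.

    Records with identical quasi-identifier values form an equivalence class.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    n = len(keys)
    uniques = sum(1 for size in sizes.values() if size == 1)
    return {
        # Share of records that are unique within the sample.
        "fraction_of_sample_uniques": uniques / n,
        # Assumption: mean of 1 / class size over all records,
        # which equals (number of classes) / (number of records).
        "average_risk": len(sizes) / n,
        # The dataset is k-anonymous for this k (smallest class size).
        "k": min(sizes.values()),
    }

# Hypothetical example data with two quasi-identifiers.
records = [
    {"age": 34, "sex": "f"}, {"age": 34, "sex": "f"},
    {"age": 51, "sex": "m"},
    {"age": 67, "sex": "f"}, {"age": 67, "sex": "f"}, {"age": 67, "sex": "f"},
]
measures = sample_risk_measures(records, ["age", "sex"])
print(measures["k"])  # 1, because one record is unique
```

In this example one of six records is unique, so the fraction of sample uniques is 1/6 and the dataset is only 1-anonymous.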
Population-based methods
• Model by Zayatz [1]
• Model by Hoshino [2]
• Model by Chen et al. [3] / Rinott [4]
• Model by Dankar et al. [5]
[1] Zayatz, L.V.: Estimation of the percent of unique population elements on a microdata file using the sample. Statistical Research Division Report Number: Census/SRD/RR-91/08 (1991)
[2] Hoshino, N.: Applying Pitman's sampling formula to microdata disclosure risk assessment. J Off Stat 17(4), 499-520 (2001)
[3] Chen, G., Keller-McNulty, S.: Estimation of identification disclosure risk in microdata. J Off Stat 14, 79-95 (1998)
[4] Rinott, Y.: On models for statistical disclosure risk estimation. In: Proc ECE/Eurostat Work Session Stat Data Confid, p. 275-285 (2003)
[5] Dankar, F., Emam, K.E., Neisa, A., Roffey, T.: Estimating the re-identification risk of clinical data sets. BMC Med Inform Decis Mak 12(1), 66 (2012)
Global and local recoding
• Can be weighted
Methods
• Categorization
• Generalization
• Cell suppression
• Record suppression
• Micro-aggregation
• Top/bottom coding
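Two of the coding methods listed above are easy to sketch for a single numeric attribute. The code below is an illustration under simplifying assumptions and does not reflect how ARX implements them. The function names and parameters are hypothetical. The microaggregation shown is a basic fixed-size grouping over sorted values in which the last group may be smaller than the requested size.

```python
from statistics import mean

def top_bottom_code(values, lower, upper):
    """Clamp every value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def microaggregate(values, group_size):
    """Replace each value by the mean of its group.

    Values are sorted, split into consecutive groups of `group_size`,
    and the group means are written back in the original record order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        group_mean = mean(values[i] for i in group)
        for i in group:
            result[i] = group_mean
    return result

ages = [23, 91, 35, 37, 2, 64]
coded = top_bottom_code(ages, lower=18, upper=80)
aggregated = microaggregate(ages, group_size=2)
print(coded)       # [23, 80, 35, 37, 18, 64]
print(aggregated)  # [12.5, 77.5, 36, 36, 12.5, 77.5]
```

Top/bottom coding only changes the extreme values, while microaggregation changes every value but keeps the group means intact.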
Weighted and parameterized
• Ability to control the application
of different coding models
Methods
• AECS, Discernibility, Precision
• (Normalized) Mean squared error
• (Normalized) Non-uniform entropy
• KL divergence
• Loss
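Three of the utility measures listed above can be sketched in a few lines for a single generalized attribute. These are common textbook formulations written for illustration. The versions implemented in ARX may be normalized or defined differently. The function names are hypothetical, and suppressed records are not treated specially here.

```python
from collections import Counter
from math import log2

def aecs(generalized):
    """Average equivalence class size: records / number of classes."""
    return len(generalized) / len(set(generalized))

def discernibility(generalized):
    """Sum of squared class sizes (no extra penalty for suppression)."""
    return sum(size ** 2 for size in Counter(generalized).values())

def non_uniform_entropy(original, generalized):
    """Information loss in bits: for each record, -log2 of the share of
    its original value among all records with the same generalized value."""
    pair_counts = Counter(zip(original, generalized))
    gen_counts = Counter(generalized)
    return -sum(
        log2(pair_counts[(o, g)] / gen_counts[g])
        for o, g in zip(original, generalized)
    )

# Hypothetical attribute before and after generalization.
original = ["81675", "81677", "81675", "80331"]
generalized = ["8167*", "8167*", "8167*", "8033*"]

print(aecs(generalized))            # 2.0
print(discernibility(generalized))  # 10
print(round(non_uniform_entropy(original, generalized), 3))  # 2.755
```

All three values grow as classes become larger or more heterogeneous, so lower numbers indicate that more of the original detail has been preserved.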
Measures for uniqueness | Coding models | Measures for utility
Transform
Visualize
Analyze
Adapt