Microdata anonymization considerations



Timing, data access types and degree of anonymization
in microdata dissemination



Usage Rights

CC Attribution-ShareAlike License

  • We often use the terms "confidentiality" and "privacy" interchangeably in everyday life, but they mean distinctly different things. Confidentiality relates to information or data about an individual, whereas privacy relates to the person and is a right rooted in common law. Privacy protects access to the person; confidentiality protects access to the data. In the context of statistics, ‘confidentiality’ is the researcher’s agreement with the participant about how the participant’s identifiable private information will be handled, managed, and disseminated. Hence, confidentiality is an ethical duty. [Situations vary. In some cases the duty is easy and in some cases it is not.] How this duty is performed by controlling (1) the timing of data release, (2) data access types and (3) the degree of anonymization is the topic of my presentation.
  • I’ll keep two parallel tracks during my presentation: a generic track and a Rwanda-specific track. While covering the generic material, I’ll often switch to Rwanda-specific examples to illustrate my points.
  • Let’s dig deeper into the subject.
  • In most statistical practice, the caveat is that microdata dissemination must maintain the confidentiality of individual units: people, households or enterprises. This is driven by Principle 6 of the UN Fundamental Principles of Official Statistics. However, while in some cases the principle facilitates the caveat, in others strict confidentiality is invoked as a reason not to share any microdata.
  • In Rwanda, there is a strong legal basis facilitating the caveat. The law also provides for penalties in case of a breach of statistical confidentiality.
  • Regarding Principle 6 of the UN Fundamental Principles of Official Statistics: if access becomes the casualty, it is a loss. The broadly accepted rationale is therefore that, although confidentiality should be upheld, access to data should not be jeopardised. See some benefits:
  • The access rationale is broadly accepted.
  • In Rwanda, statistical law provides for the ‘assurance of access’.
  • Seemingly conflicting ideas may pose challenges if applied simultaneously. It is, therefore, a balancing act.
  • There is a constant struggle to minimize both disclosure risks and information loss.
  • What has added to the misery?
  • What recourse do we have? Is it possible to have harmony?
  • Though not easy, it is possible and desirable for openness and privacy to co-exist.
  • What are the decision factors?
  • What helps?
  • This removes the pressure for microdata to appeal to ‘all’ / ‘normal’ users.
  • At NISR, only one dataset has Licensed data files: the General Census of Population and Housing 2002. This is because the entire dataset is made available (though anonymized). For the current Census, only 5% of the data will be released (after anonymization), as Public Use Files.
  • The challenge is quite big here (read in the context of Big Data). We are learning. Although simple means are currently in use, we intend to move towards more complex arrangements where the ‘balancing act’ is better optimized.

Microdata anonymization considerations: Presentation Transcript

  • Timing, data access types and degree of anonymization in microdata dissemination: reflections on data confidentiality, privacy, and curation. Rajiv Ranjan, NISR/UNDP-Rwanda. Regional Workshop on Microdata Dissemination Policy, Kigali, Rwanda: 27 – 29 August 2014
  • Scheme of the presentation: Confidentiality concerns; Access issues; Legal basis; Assurance; Challenges; Harmony; Governance; Practices
  • Confidentiality
  • Caveat: Microdata dissemination must maintain confidentiality of individual units: people, households or enterprises. “Individual data collected by statistical agencies for statistical compilation, whether they refer to natural or legal persons, are to be strictly confidential and used exclusively for statistical purposes.” Principle 6, United Nations Fundamental Principles of Official Statistics: http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx
  • Legal basis in Rwanda: Data collected by the institutions of the national statistical system through surveys or any other method of collection are protected by statistical confidentiality. Statistical confidentiality implies that the dissemination of such data as well as statistical information which can be calculated from them, shall be conducted in a way that those who provided it are not identified whether directly or indirectly. Source: Law on the organisation of statistical activities in Rwanda. Chapter VI: Statistical Confidentiality, Article 17: Prohibited dissemination of information (N° 45/2013 of 16/06/2013)
  • Access
  • Access benefits: • Fosters diversity of research • Increases transparency and accountability • Mitigates duplication of data collection work • Increases the quality of data. Source: https://unstats.un.org/unsd/accsub-public/microdata.pdf
  • Access assurance in Rwanda: The anonymous basic databases on individuals and other institutions shall be accessible to researchers who, however, shall be committed to: 1° make a written note, that they shall not communicate to any person the contents of such databases without the written authorization of the National Institute of Statistics of Rwanda; 2° give to the National Institute of Statistics of Rwanda, the findings of their research. Source: Law on the organisation of statistical activities in Rwanda. Chapter VI: Statistical Confidentiality, Article 19: Accessibility to anonymous basic database not to be published (N° 45/2013 of 16/06/2013)
  • Challenges
  • Balancing act: disclosure risks vs. information loss. • In practice, the more the disclosure risks are reduced, the lower the expected utility of the microdata sets will be. • The objective remains to manage the trade-off between disclosure risks and information loss. Source: Chris Skinner: Statistical Disclosure Control for Survey Data: http://personal.lse.ac.uk/skinnecj/SDC%20for%20survey%20data%20S3RI.pdf
  • Challenges: [Emerging mash-ups] Datasets are being reused and combined with other datasets in ways never before thought possible, including for uses that go beyond the original intent. [Growing motives] While there are promising research efforts underway to protect privacy, far more advanced efforts are presently in use to re-identify seemingly “anonymous” data. [Improved access] Easier access to datasets has improved their discoverability, and data could be used to re-identify previously de-identified datasets. Source: http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_5.1.14_final_print.pdf
  • Complicating the challenges: disclosure risks vs. information loss. Labels on the slide: ‘Machine readability, Open standards and Free for reuse’ and ‘Post 2015’. Images: (1) from the cover of ‘Open Data Now’, a book by Joel Gurin exploring how open data within public records will create new jobs, applications and other technology innovations: http://www.opendatanow.com; (2) a project at PARIS21 on data revolution for post 2015 SDGs: http://www.paris21.org/node/1654
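The trade-off on the "Balancing act" slide can be made concrete with a small sketch. It is only an illustration under stated assumptions: the records are synthetic, disclosure risk is approximated as the share of records that are unique on (age band, sex), and information loss as the mean distance between a true age and the midpoint of its band. Neither measure is taken from the presentation or from Skinner's paper, and the function names are hypothetical.

```python
from collections import Counter

# Synthetic (age, sex) records, invented for illustration only.
people = [
    (23, "F"), (24, "F"), (27, "M"), (31, "F"), (35, "M"),
    (36, "M"), (42, "F"), (48, "F"), (51, "M"), (59, "M"),
]


def band(age, width):
    """Lower bound of the age band of the given width that contains `age`."""
    return (age // width) * width


def disclosure_risk(records, width):
    """Share of records that are unique on (age band, sex). Assumed proxy."""
    keys = [(band(age, width), sex) for age, sex in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


def information_loss(records, width):
    """Mean absolute gap between a true age and its band midpoint. Assumed proxy."""
    gaps = [abs(age - (band(age, width) + (width - 1) / 2)) for age, _ in records]
    return sum(gaps) / len(gaps)


# Coarser age bands lower the risk proxy but raise the loss proxy.
for width in (1, 5, 10, 20):
    risk = disclosure_risk(people, width)
    loss = information_loss(people, width)
    print(f"band width {width:>2}: risk={risk:.1f}  loss={loss:.1f}")
```

On this toy data the risk proxy falls from 1.0 to 0.0 as the band widens from 1 to 20 years, while the loss proxy rises from 0 to 4.8 years, which is the balancing act in miniature.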
  • Harmony
  • Coexistence: “There is nothing inherently contradictory about hiding one piece of information while revealing another, so long as the information we want to hide is different from the information we want to disclose.” Felix T. Wu in Defining Privacy and Utility in Data Sets: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2031808. Though not easy, it is possible and desirable for openness and privacy to co-exist.
  • Decision factors: disclosure risks; information loss; sensitivity of the dataset; usage intent
  • Enabling dimensions. Governance: • Legal basis • Policy backing • Institutionalization. Practices: • Asserting users types • Controlling release timing • Categorizing access methods • Varying the degree of anonymization. Anonymization Tools & Methods [1]: • sdcMicro • sdcMicroGUI • Deterministic • Probabilistic. [1]: http://cran.r-project.org/web/packages/sdcMicro/vignettes/sdc_guidelines.pdf
  • Governance
  • Law: Law on the organisation of statistical activities in Rwanda (Feb 14, 2006)
  • Policy: Microdata Release Policy @ National Institute of Statistics of Rwanda
  • Institutionalization: Microdata Release Committee & Data curation team @ NISR
  • Practices
  • Users types served: Govt. (policy makers and researchers); international development agencies; research and academic institutions; students and professors; others (scientific researchers)
  • Release timing: within 6 – 24 months after the 1st release of aggregated data from a survey/census. Examples charted against the 24-month limit: DHS 2010, EICV(3) 2010-2011 (Integrated Household Living Conditions Survey), Census 2012 and Seasonal Agri Survey 2013; values shown on the chart: 7, 7, ?, ?
  • Access methods: Web-based distribution
  • Types of files/access (total no. of studies = 24; chart values: 16, 1, 3, 4). Categories: open access (no restriction); direct access or Public Use Files (some restrictions on use, but no screening of users); Research Use Files (or Scientific Use Files, or Licensed Files); availability only in an enclave; no access authorized; data not available; data available from external repo
  • Degree of anonymization: removing or modifying the identifying variables contained in the microdata. The usual practice at NISR is to release microdata as Public Use Files. For example, in EICV3 (Integrated Household Living Conditions Survey, done in 2010-2011), the methods applied for anonymizing data were:
    • Suppressing/deleting the records of direct identifiers (e.g. name of the head of HH) and a few indirect identifiers (e.g. sub-national admin boundaries)
    • Generalizing/replacing (recoding) some indirect identifiers with less specific but semantically consistent groupings of observation values (e.g. place of birth, occupation)
    • Perturbing/distorting some indirect identifiers by randomizing the values (e.g. clusters)
    Variations in the degree of anonymization (and resulting access files/types) may be considered depending on the sensitivity of the dataset and the use.
  • e.g.: Recoding (Occupation)
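The three methods on the "Degree of anonymization" slide (suppressing, generalizing by recoding occupation, perturbing cluster codes) can be sketched in a few lines of Python. This is not NISR's procedure and not sdcMicro: the field names, the sample households and the choice of ISCO-08 major groups as the recoding target are all assumptions made for illustration, and the "perturbation" shown is a simple random relabelling of cluster codes.

```python
import random

# ISCO-08 major groups, keyed by the first digit of a 4-digit occupation code.
# Using ISCO-08 here is an assumption; the slides do not name a classification.
ISCO08_MAJOR_GROUPS = {
    "1": "Managers",
    "2": "Professionals",
    "3": "Technicians and associate professionals",
    "4": "Clerical support workers",
    "5": "Service and sales workers",
    "6": "Skilled agricultural, forestry and fishery workers",
    "7": "Craft and related trades workers",
    "8": "Plant and machine operators, and assemblers",
    "9": "Elementary occupations",
    "0": "Armed forces occupations",
}

# Hypothetical field names for a household record.
DIRECT_IDENTIFIERS = ("head_name",)
SUPPRESSED_INDIRECT = ("sector",)  # stands in for a sub-national boundary


def suppress(record, fields):
    """Return a copy of the record without the listed fields."""
    return {k: v for k, v in record.items() if k not in fields}


def recode_occupation(code):
    """Generalize a 4-digit occupation code (a string) to its major group."""
    code = str(code).strip()
    if len(code) != 4 or not code.isdigit():
        return "Unknown"
    return ISCO08_MAJOR_GROUPS[code[0]]


def relabel_clusters(records, rng):
    """Replace each cluster code with a random new label.

    Households from the same cluster keep a shared label, so cluster-level
    analysis still works, but the original code is not released.
    """
    codes = sorted({r["cluster"] for r in records})
    new_labels = list(range(1, len(codes) + 1))
    rng.shuffle(new_labels)
    mapping = dict(zip(codes, new_labels))
    return [{**r, "cluster": mapping[r["cluster"]]} for r in records]


def anonymize(records, seed=None):
    """Apply suppression, recoding and cluster relabelling in that order."""
    rng = random.Random(seed)
    out = [suppress(r, DIRECT_IDENTIFIERS + SUPPRESSED_INDIRECT) for r in records]
    out = [{**r, "occupation": recode_occupation(r["occupation"])} for r in out]
    return relabel_clusters(out, rng)


# Made-up households; names, sectors and cluster codes are placeholders.
households = [
    {"head_name": "Person A", "sector": "S1", "cluster": 4101, "occupation": "6111"},
    {"head_name": "Person B", "sector": "S1", "cluster": 4101, "occupation": "2341"},
    {"head_name": "Person C", "sector": "S2", "cluster": 4207, "occupation": "5223"},
]

public_use = anonymize(households, seed=2014)
for row in public_use:
    print(row)
```

Occupation codes are handled as strings so that a leading zero survives. In a real release the cluster mapping (and any seed used to produce it) would have to stay confidential, otherwise the relabelling could be reversed.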
  • Thank you! @rajiv_r_in. “87% of the U.S. population can be uniquely identified by date of birth + gender + zip” (Latanya Sweeney, CMU, latanyasweeney.org)
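The closing quotation is about how a combination of individually harmless variables can single people out. A minimal sketch of that check follows, assuming a list of dictionaries as the data layout; the records are invented and far too few to say anything about real populations, and `share_unique` is a hypothetical helper name.

```python
from collections import Counter


def share_unique(records, keys):
    """Share of records whose combination of `keys` values occurs exactly once."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(records)


# Invented records; dates and zip codes are placeholders.
people = [
    {"dob": "1980-01-01", "gender": "F", "zip": "00001"},
    {"dob": "1980-01-01", "gender": "F", "zip": "00002"},
    {"dob": "1980-01-01", "gender": "M", "zip": "00001"},
    {"dob": "1975-06-15", "gender": "M", "zip": "00001"},
    {"dob": "1975-06-15", "gender": "M", "zip": "00001"},
    {"dob": "1990-03-20", "gender": "F", "zip": "00002"},
]

# Each variable alone identifies few records; the combination identifies most.
for keys in (["gender"], ["zip"], ["dob"], ["dob", "gender", "zip"]):
    print(keys, round(share_unique(people, keys), 2))
```

The same function can be pointed at any set of candidate indirect identifiers before a release to see which combinations leave many records unique and may need suppression, recoding or perturbation.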