This document summarises a procedure for anonymising microdata from official statistics so as to protect privacy while maintaining data utility. It describes the trade-off between disclosure risk and data utility and outlines a multi-step procedure: describing the dataset and its intended users, applying statistical disclosure control (SDC) methods, and measuring disclosure risk and data utility. Recent research shows that a small number of spatio-temporal data points can uniquely identify individuals, and that a socio-demographic fingerprint of as few as three attributes (gender, date of birth, and municipality) can uniquely identify many records. Parameters for SDC methods include the age of the data, the subsample size, and the required levels of anonymity for identifying variables.
Towards a socio-demographic fingerprint (IASSIST 2013)
1. Towards a procedure to anonymise microdata
Anonymising data from official statistics for public use
IASSIST, Köln - 30.05.2013
Katelijne Gysen
katelijne.gysen@fors.unil.ch
2.
Outline
1. Promotion of official statistics
2. Anonymisation of data
2.1 Trade-off: disclosure risk versus data utility
2.2 Procedure
2.3 Parameter setting for Statistical Disclosure Control (SDC)
3. Uniqueness and k-anonymity
3.1 Concepts
3.2 Recent research on mobility data
3.3 The real fingerprint
3.4 Socio-demographic fingerprint
3.
1. Promotion of official statistics
Data from National Statistical Institute (NSI)
Labour Force Survey
Survey on Structure of Earnings
SILC (Survey on Income and Living Conditions)
PISA (Education)
Swiss Health Survey
Population Census and Business Census, …
Micro data for research and teaching purposes
Collaboration with our NSI:
4.
2. Anonymisation of data
2.1 Trade-off dilemma: disclosure risk versus data utility
The balance is between the researcher (data utility) and the data owner (data protection).
5.
2.2 Procedure (1)
Starting from the dataset:
1. Describe the dataset characteristics
2. Define the target public
3. Describe the intrusion scenario
4. Apply SDC methods
5. Measure the disclosure risk and the data utility
6. Is the risk/utility balance acceptable? If so, release the data and describe the access conditions.
6.
2.2 Procedure (2)
The same flow, now with explicit parameter setting:
1. Describe the dataset characteristics
2. Define the target public
3. Describe the intrusion scenario
4. Set the SDC parameters
5. Apply SDC methods
6. SDC parameters met? Check the disclosure risk.
7. Measure the data utility. If the balance is acceptable, release the data and describe the access conditions.
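The iterate-until-balanced loop on this slide can be sketched in Python. This is only an illustration of the control flow, not an actual SDC implementation: `apply_sdc`, `disclosure_risk`, and `anonymise` are invented placeholder names, and the toy SDC method (truncating a postal code) stands in for real recoding or suppression.

```python
from collections import Counter

def apply_sdc(records, level):
    """Placeholder SDC step: coarsen the data more aggressively at
    higher levels by truncating the 'zip' field to fewer digits."""
    digits = max(1, 4 - level)
    return [{**r, "zip": r["zip"][:digits]} for r in records]

def disclosure_risk(records, quasi_ids):
    """A simple risk measure: the share of records that are unique
    on the combination of quasi-identifiers."""
    keys = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    uniques = sum(1 for r in records
                  if keys[tuple(r[q] for q in quasi_ids)] == 1)
    return uniques / len(records)

def anonymise(records, quasi_ids, max_risk):
    """Raise the SDC level until the measured risk is acceptable;
    if no level works, return None (do not release the data)."""
    for level in range(4):
        released = apply_sdc(records, level)
        if disclosure_risk(released, quasi_ids) <= max_risk:
            return released, level
    return None, None

# Toy microdata (invented for illustration).
data = [{"zip": "8001", "age": 34}, {"zip": "8002", "age": 34},
        {"zip": "8001", "age": 51}, {"zip": "8003", "age": 34}]
released, level = anonymise(data, ["zip", "age"], max_risk=0.5)
```

Each extra level trades utility (geographic detail) for protection, which is exactly the balance the flowchart iterates over.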
7.
2.3 Parameter setting for Statistical Disclosure Control (SDC)
1. Age of the data (min.)
2. Subsample (min.)
3. Level of geographical detail (max.)
4. Global and individual risk (max.)
5. Number of indirect identifying variables (max.)
6. Degree of anonymity for socio-demographic characteristics (min.)
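The anonymity requirement behind parameters 5 and 6 can be checked directly with the k-anonymity measure (Sweeney, 2002): count how many records share each combination of indirect identifying variables and take the minimum. A minimal Python sketch; the sample records and variable names are invented for illustration.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing the
    same values on all quasi-identifiers. If k >= K, every record is
    hidden among at least K - 1 others."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return min(groups.values())

# Toy microdata using the socio-demographic fingerprint from the talk.
records = [
    {"gender": "F", "birth_year": 1975, "municipality": "Köln"},
    {"gender": "F", "birth_year": 1975, "municipality": "Köln"},
    {"gender": "M", "birth_year": 1980, "municipality": "Bonn"},
    {"gender": "M", "birth_year": 1980, "municipality": "Bonn"},
]
print(k_anonymity(records, ["gender", "birth_year", "municipality"]))
```

Recoding date of birth to birth year, as above, is one way to raise k; with the full date of birth, many records typically become unique (k = 1).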
9.
3.2 Recent research on mobility data
"… four randomly chosen spatio-temporal points (for example, mobile device pings to antennas) are enough to uniquely identify 95% of the individuals."
The mobility pattern is apparently unique.
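The uniqueness claim can be made concrete with a small simulation: treat each individual as a set of visited (antenna, hour) points and estimate how often a few randomly drawn points from one trace match nobody else. A hedged Python sketch on synthetic traces; the trace sizes and grid are invented, not taken from the de Montjoye et al. study.

```python
import random

def fraction_unique(traces, n_points, trials=200, seed=0):
    """Estimate the share of trials in which n_points randomly chosen
    spatio-temporal points from one person's trace identify that
    person uniquely among all traces."""
    rng = random.Random(seed)
    unique = 0
    for _ in range(trials):
        person = rng.randrange(len(traces))
        # Sort for determinism; sets are not directly sampleable.
        sample = set(rng.sample(sorted(traces[person]), n_points))
        matches = [i for i, t in enumerate(traces) if sample <= t]
        if matches == [person]:
            unique += 1
    return unique / trials

# Synthetic traces: 50 people, ~30 random (antenna, hour) pings each.
rng = random.Random(42)
traces = [{(rng.randrange(20), rng.randrange(24)) for _ in range(30)}
          for _ in range(50)]
print(fraction_unique(traces, n_points=4))
```

Even on this toy grid, four points almost always single a person out, while a single point rarely does; richer real traces are what drive the 95% figure.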
10. 10
3.3 The real fingerprint
“There are as many as 150 ridge characteristics (points) in the average fingerprint.
So how many points must a fingerprint examiner match in order to safely say the
prints are indeed those of a particular suspect?”
The answer is surprising.
“There is no standard number required. …
… In fact, the decision as to whether or not there is a match is left entirely to the
individual examiner. However, individual departments and agencies may have their
own set of standards in place that requires a certain number of points be matched
before making a positive identification.”
Source: http://www.leelofland.com/wordpress/comparing-fingerprints-whats-the-point/
References
de Montjoye, Y.-A., Hidalgo, C.A., Verleysen, M., Blondel, V.D. Unique in the crowd: the privacy bounds of human mobility. Scientific Reports 3, article 1376, DOI: 10.1038/srep01376, 2013.
Franconi, L. Public Use Files: practices and methods to increase quality of released microdata. OECD, 2012.
Golle, P. Revisiting the uniqueness of simple demographics in the US population. Palo Alto Research Center, 2006.
Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Schulte Nordholt, E., Spicer, K., De Wolf, P.-P. Statistical Disclosure Control. Wiley, 2012.
Sweeney, L. Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3, Pittsburgh, 2000.
Sweeney, L. k-Anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 2002, 557-570.
Meindl, B., Kowarik, A., Templ, M. Guidelines for the anonymisation of microdata using R-package sdcMicro. Vienna, 2012.
Find out more?
About FORS: www.fors.unil.ch
About public microdata for research in CH: www.compass.unil.ch
Let's connect!
Editor's Notes
25 juin 2013. Good afternoon, everybody. When I got the invitation from our project leader to send in an abstract for this conference, I hesitated at first over whether the subject I am working on would fit. But I have to admit that, arriving here, and being a sociologist, I started to count the word "confidentiality", and now I no longer have this concern. Let's have a look.
25 juin 2013. As an introduction I will briefly talk about the data we work with, and then I will move to the topic: anonymisation of data. First I present the subject as a trade-off, a balancing exercise; then I will show you why this exercise might get complicated, and the procedure we developed to simplify it again. The last point will be about the uniqueness of people, where I would like to introduce the concept of a socio-demographic fingerprint.
25 juin 2013. I'm lucky that this morning's plenary session was about the same kind of data, so you will probably know what I am talking about. Our task is to promote official statistics for research, which in our case means data stemming from the Federal Statistical Office in Neuchâtel. I list a few names to show you what it is about: the first three are European surveys for which Eurostat provides guidelines; PISA is international; then we have a couple of national surveys on subjects that have an equivalent in other countries. It is important to mention that we only work with microdata (rectangular): records and variables. Of course, this is a project in collaboration with our NSI. ------- Why public microdata? (Wirth, H.) The data from the FSO generally offer: large samples (precision), long time series, good quality. Definitions: public data / official data / data stemming from the NSI, collected with public money; open government data; microdata (variables for a record/entity on different characteristics). Anonymising: preventing identification. One can therefore apply SDC techniques, e.g. recoding, suppression of information, or perturbation of information. As an introduction, I will briefly talk about the datasets we are promoting for secondary use at FORS and about the big question that has to be answered when dealing with anonymisation of data; the core of the presentation presents the procedure. FORS is a national centre of expertise in the social sciences. Its primary activities consist of: 1. production of survey data, including national and international surveys; 2. preservation and dissemination of data for use in secondary analysis; 3. research in empirical social sciences, with a focus on survey methodology; 4. consulting services for researchers in Switzerland and abroad. FORS collaborates with researchers and research institutes in the social sciences in Switzerland and internationally.
25 juin 2013. As these data are not just weather data, or data about your opinion on the weather these days, we have to recognise that the confidentiality of the data requires a special approach and treatment. Let's keep it simple first: like all big questions in this world, it is all about a trade-off, about finding a balance. As has been said before, it is finding a balance between data utility and disclosure risk, so there will be some data protection. To what extent? It would not be too difficult to argue where to put the slider, were it not that quite a lot of elements play a role in the decision, and most of the time different players are involved: in our case, our data service (representing researchers) and the NSI.
25 juin 2013. It can get complex, so in order to keep an overview of the whole, we put things into a scheme / procedure; that is what the next slide is about. On the left you will find the different elements indicating where to put that slider. Then you study how a potential intruder might try to disclose information. I do not have time to go into detail, but I will mention two different threats: response knowledge, and the possibility to link the data with other datasets. For the left part we are developing guidelines that will appear as a kind of Checklist for Disclosure Potential. If you see that there is no disclosure risk, because for example the data are accessible only in a safe centre or via remote access, you can make the data accessible. Otherwise, you will have to apply some methods for statistical disclosure control. Check the balance, and you can publish the data. This is nice, but the question remains: how much anonymisation is enough? How much content do I need to have an interesting dataset?
25 juin 2013. This is the reason why we had to extend the scheme with two other boxes: the necessity to agree on thresholds, which I have called SDC parameters. The literature talks about disclosure risk measurement, so we can use those measures. Let's have a look at the parameters we used.
25 juin 2013. It is possible to fix a threshold for each of the following elements. In general: the older the data, the more difficult it is to disclose information; the smaller the subsample, the more difficult to disclose; the less detailed the geography, the more difficult to disclose; the smaller the global and individual risk, the more difficult to disclose; the smaller the number and the fewer the categories of indirect identifying variables, the more difficult to disclose; and the higher the degree of anonymity for the socio-demographic characteristics, the more difficult to disclose. In the next and last part of my presentation I will concentrate on this degree of anonymity for socio-demographic characteristics.
First, some basic concepts. In a microdataset you can divide the variables into those that are identifying and those that are not. Identifying variables are variables that are either rare, observable, or searchable, and we generally distinguish between direct identifiers and indirect identifiers. Some examples: it is common sense that you cannot make data available with direct identifiers, and it is by now also common sense that you have to be careful with indirect identifiers, as they may function as quasi-identifiers. Simply put, if you know that the data you are looking at come from a female, living in Ecublens next to Lausanne, who was born on 23.12.yyyy, that could be me. The next question is: how many statistical look-alikes does she have? Those indirect identifiers can thus be used as a key to disclose information. That is why it is important to describe the degree of anonymity.
25 juin 2013. Just a small jump into the real world. I will just cite this work because it is interesting; the references are at the end of the presentation. Two points identify 50% of the individuals. That is what they call a virtual fingerprint.
25 juin 2013. The real world, and a real fingerprint.
25 juin 2013. Now I come back to one of our parameters: the degree of anonymity for socio-demographic characteristics. As our experience shows that the biggest risk comes from linking our datasets with datasets containing socio-demographic characteristics, we concentrate on obtaining knowledge about the uniqueness of people on those characteristics. We started by looking at gender, age, and location, and then extended this with civil status and nationality.
25 juin 2013. Some figures. Here you find the anonymity of the Swiss population, given some simple demographics.
25 juin 2013. FORS. I will now present the work of our COMPASS team. Let us start at the beginning: there is the Federal Statistical Office in Neuchâtel, and there are universities, universities of applied sciences, and colleges spread across Switzerland. The Federal Statistical Office collects and processes data; when the results are published, they are mostly tables. But the office also holds the datasets themselves, and that is what this is about: datasets are data treasures. The universities have researchers and students; they do research and teach. One could almost assume a priori that there is an interest in these datasets (secondary analysis). One would think: fine, we simply call and ask for a dataset. But where should I call, whom should I ask, and which data should I request? Which are best suited for my research or teaching purposes? There is no department or contact point at the FSO whose task it is to provide an overview of everything. That is the gap we want to fill. Of course, I have allowed myself to present the whole situation in simplified form. I will now go into somewhat more detail.