SlideShare a Scribd company logo
1 of 15
Amnesia
Data anonymization made easy
Manolis Terrovitis
Athena Research Center
mter@imis.athena-innovation.gr
8th OpenAIRE Workshop
RDA 9th Plenary – 4/4/2017 Barcelona
anonymization
Original data anonymous data
Data anonymization?
• Data anonymization facilitates the publication of micro data(vs. aggregated
macrodata) , e.g., data used in scientific research
• Micro data often reveal important private information, e.g., the medical
condition of a person
• Laws about privacy become increasingly stricter
• Individuals are afraid to provide their data
• Companies are afraid to share data with experts
• The aim of anonymization methods is to allow sharing such data, without
compromising the privacy of the users.
Data anonymization and Amnesia
• Data anonymization
• Removal of direct identifiers, e.g., Names, SSN etc
• Removal of infrequent combinations of quasi-identifiers, e.g., unique
combinations of birth dates and zipcodes
• Infequent combinations are removed through generalization, e.g., birth date
14/01/1977 becomes **/**/1977
• Amnesia is a scalable anonymization tool
• It offers several versions of k-anonymity
• It allows the user to select and customize possible solutions
• It offers graphical tools that allow the user to analyze the anonymized dataset
• It is scalable and uses all available CPU cores in the anonymization process
3
Link attacks
k-anonymity
• Each entry becomes indistinguishable from other k-1
entries
• k-anonymity is achieved through suppression and
generalization
id Zipcode Age National. Disease
1 13053 28 Russian Heart Disease
2 13068 29 American Heart Disease
3 13068 21 Japanese Viral Infection
4 13053 23 American Viral Infection
5 14853 50 Indian Cancer
6 14853 55 Russian Heart Disease
7 14850 47 American Viral Infection
8 14850 49 American Viral Infection
9 13053 31 American Cancer
10 13053 37 Indian Cancer
11 13068 36 Japanese Cancer
12 13068 35 American Cancer
id Zipcode Age National. Disease
1 130** <30 ∗ Heart Disease
2 130** <30 ∗ Heart Disease
3 130** <30 ∗ Viral Infection
4 130** <30 ∗ Viral Infection
5 1485* ≥40 ∗ Cancer
6 1485* ≥40 ∗ Heart Disease
7 1485* ≥40 ∗ Viral Infection
8 1485* ≥40 ∗ Viral Infection
9 130** 3∗ ∗ Cancer
10 130** 3∗ ∗ Cancer
11 130** 3∗ ∗ Cancer
12 130** 3∗ ∗ Cancer
Structural information
• We need to anonymize all relevant information about a person,
not just a fixed size tuple
• Information tends to gather over time
• Information is linked through semantic properties, it’s schema is
irrelevant
• Personal data tend to accumulate over time
• Research focuses on simple data and complicated guaranties but
real world has complex data and requires simple guaranties
Limits of k-anonymity
• 2-anonymous
Fruits Meat Vegetables Fish
Vassilis Χ Χ
Manolis Χ Χ Χ
Eleni Χ
Maria Χ Χ
Kostas Χ Χ
Food
Vassilis Χ
Manolis Χ
Eleni Χ
Maria Χ
Kostas Χ
Limits of k-anonymity
• 22-anonymous
• Any combination
of m items will
not appear less
than k times
Fruits Meat Vegetables Fish
Vassilis Χ Χ
Manolis Χ Χ Χ
Eleni Χ
Maria Χ Χ
Kostas Χ Χ
Fruits Meat Other food
Vassilis Χ Χ
Manolis X Χ X
Eleni X
Maria Χ X
Kostas Χ X
Anonymization flexibility
9
Privacy Strength
Data quality
Techniques for Privacy Preserving Data
Publishing - Amensia
OpenAIRE General Assembly 2017 – Oslo, 15 February 10
Techniques for Privacy Preserving Data
Publishing - Amensia
11
Evaluation and tuning
12
• Anonymization techniques have not been tested in practice
extensively
• Mapping the social notion of privacy to technical notions is not
easy
• Data utility has not been studied extensively in research
• Few artificial information loss measures
• Data utility is difficult to estimate in practice
• Different applications have different needs
• No easy to quantify the loss of information
Evaluation and tuning
13
• Solution
• A wide range of methods
• Customizable algorithms
• Several utility metrics
• Try and error approach
• On line visual analytics
• User guided decisions
• Multiple solutions
Amensia Challenges
When to anonymize?
• We need to determine the nature and
size of the available data
• Issues like updates and overlaps between different
datasets are important
How to anonymize?
• Data anonymity reduces the quality of
the data
• The reduction must be guided to minimize the
effect on applications
• Different applications and users require
different anonymization methods
• We need to talk to experts
• To understand what they need
• To show them what we can do
14
NEED
We
real world trials!
15

More Related Content

Similar to Amnesia: Data anonymization made easy (8th OpenAIRE workshop)

Current trends in data security nursing research ppt
Current trends in data security nursing research pptCurrent trends in data security nursing research ppt
Current trends in data security nursing research ppt
Nursing Path
 
Standards for a blue ocean
Standards for a blue oceanStandards for a blue ocean
Standards for a blue ocean
MrsAlways RigHt
 

Similar to Amnesia: Data anonymization made easy (8th OpenAIRE workshop) (20)

Constraintsand challenges
Constraintsand challengesConstraintsand challenges
Constraintsand challenges
 
June2014 brownbag privacy
June2014 brownbag privacyJune2014 brownbag privacy
June2014 brownbag privacy
 
Sharing Qualitative Data - Challenges and Opportunities
Sharing Qualitative Data - Challenges and OpportunitiesSharing Qualitative Data - Challenges and Opportunities
Sharing Qualitative Data - Challenges and Opportunities
 
Privacy & Data Ethics
Privacy & Data EthicsPrivacy & Data Ethics
Privacy & Data Ethics
 
G0953643
G0953643G0953643
G0953643
 
Ethical Considerations in Data Analysis_ Balancing Power, Privacy, and Respon...
Ethical Considerations in Data Analysis_ Balancing Power, Privacy, and Respon...Ethical Considerations in Data Analysis_ Balancing Power, Privacy, and Respon...
Ethical Considerations in Data Analysis_ Balancing Power, Privacy, and Respon...
 
From personal health data to a personalized advice
From personal health data to a personalized adviceFrom personal health data to a personalized advice
From personal health data to a personalized advice
 
Watson – Beyond Jeopardy
Watson – Beyond Jeopardy Watson – Beyond Jeopardy
Watson – Beyond Jeopardy
 
Privacy & innovation digital enterprise
Privacy & innovation digital enterprisePrivacy & innovation digital enterprise
Privacy & innovation digital enterprise
 
An itinerary for FAIR and privacy respecting data-driven innovation and research
An itinerary for FAIR and privacy respecting data-driven innovation and researchAn itinerary for FAIR and privacy respecting data-driven innovation and research
An itinerary for FAIR and privacy respecting data-driven innovation and research
 
Sdal air health and social development (jan. 27, 2014) final
Sdal air health and social development (jan. 27, 2014) finalSdal air health and social development (jan. 27, 2014) final
Sdal air health and social development (jan. 27, 2014) final
 
Legal and ethical considerations in nursing informatics
Legal and ethical considerations in nursing informaticsLegal and ethical considerations in nursing informatics
Legal and ethical considerations in nursing informatics
 
HSCIC's Professor Martin Severs previewing the HSCIC's forthcoming 'New Code ...
HSCIC's Professor Martin Severs previewing the HSCIC's forthcoming 'New Code ...HSCIC's Professor Martin Severs previewing the HSCIC's forthcoming 'New Code ...
HSCIC's Professor Martin Severs previewing the HSCIC's forthcoming 'New Code ...
 
Patient privacy and security
Patient privacy and securityPatient privacy and security
Patient privacy and security
 
Current trends in data security nursing research ppt
Current trends in data security nursing research pptCurrent trends in data security nursing research ppt
Current trends in data security nursing research ppt
 
Data explosion
Data explosionData explosion
Data explosion
 
Standards for a blue ocean
Standards for a blue oceanStandards for a blue ocean
Standards for a blue ocean
 
Machine learning applied in health
Machine learning applied in healthMachine learning applied in health
Machine learning applied in health
 
Use of data in safe havens: ethics and reproducibility issues
Use of data in safe havens: ethics and reproducibility issuesUse of data in safe havens: ethics and reproducibility issues
Use of data in safe havens: ethics and reproducibility issues
 
FSCI Sharing sensitive data
FSCI Sharing sensitive dataFSCI Sharing sensitive data
FSCI Sharing sensitive data
 

More from OpenAIRE

More from OpenAIRE (20)

10th OpenAIRE Content Providers Community Call
10th OpenAIRE Content Providers Community Call10th OpenAIRE Content Providers Community Call
10th OpenAIRE Content Providers Community Call
 
9th Content Providers Community Call\
9th Content Providers Community Call\9th Content Providers Community Call\
9th Content Providers Community Call\
 
OpenAIRE in the European Open Science Cloud (EOSC)
OpenAIRE in the European Open Science Cloud (EOSC)OpenAIRE in the European Open Science Cloud (EOSC)
OpenAIRE in the European Open Science Cloud (EOSC)
 
8th Content Providers Community Call
8th Content Providers Community Call8th Content Providers Community Call
8th Content Providers Community Call
 
7th Content Providers Community Call
7th Content Providers Community Call7th Content Providers Community Call
7th Content Providers Community Call
 
OpenAIRE PROVIDE Dashboard for Turkish repository managers
OpenAIRE PROVIDE Dashboard for Turkish repository managersOpenAIRE PROVIDE Dashboard for Turkish repository managers
OpenAIRE PROVIDE Dashboard for Turkish repository managers
 
What will it cost to manage and share my data?
What will it cost to manage and share my data?What will it cost to manage and share my data?
What will it cost to manage and share my data?
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)
 
6th Content Providers Community Call
6th Content Providers Community Call6th Content Providers Community Call
6th Content Providers Community Call
 
20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
 
20200504_Research Data & the GDPR: How Open is Open?
20200504_Research Data & the GDPR: How Open is Open?20200504_Research Data & the GDPR: How Open is Open?
20200504_Research Data & the GDPR: How Open is Open?
 
20200504_Data, Data Ownership and Open Science
20200504_Data, Data Ownership and Open Science20200504_Data, Data Ownership and Open Science
20200504_Data, Data Ownership and Open Science
 
20200429_Research Data & the GDPR: How Open is Open? (updated version)
20200429_Research Data & the GDPR: How Open is Open? (updated version)20200429_Research Data & the GDPR: How Open is Open? (updated version)
20200429_Research Data & the GDPR: How Open is Open? (updated version)
 
20200429_Data, Data Ownership and Open Science
20200429_Data, Data Ownership and Open Science20200429_Data, Data Ownership and Open Science
20200429_Data, Data Ownership and Open Science
 
20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
 
COVID-19: Activities, tools, best practice and contact points in Greece
 COVID-19: Activities, tools, best practice and contact points in Greece COVID-19: Activities, tools, best practice and contact points in Greece
COVID-19: Activities, tools, best practice and contact points in Greece
 
5th Content Providers Community Call
5th Content Providers Community Call5th Content Providers Community Call
5th Content Providers Community Call
 
4th Content Providers Community Call
4th Content Providers Community Call4th Content Providers Community Call
4th Content Providers Community Call
 

Recently uploaded

PODOCARPUS...........................pptx
PODOCARPUS...........................pptxPODOCARPUS...........................pptx
PODOCARPUS...........................pptx
Cherry
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
Cherry
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
Cherry
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
Cherry
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
Cherry
 

Recently uploaded (20)

Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptx
 
PODOCARPUS...........................pptx
PODOCARPUS...........................pptxPODOCARPUS...........................pptx
PODOCARPUS...........................pptx
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Kanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot Girls
Kanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot GirlsKanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot Girls
Kanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot Girls
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Genome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptxGenome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptx
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Plasmid: types, structure and functions.
Plasmid: types, structure and functions.
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Concept of gene and Complementation test.pdf
Concept of gene and Complementation test.pdfConcept of gene and Complementation test.pdf
Concept of gene and Complementation test.pdf
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 

Amnesia: Data anonymization made easy (8th OpenAIRE workshop)

  • 1. Amnesia Data anonymization made easy Manolis Terrovitis Athena Research Center mter@imis.athena-innovation.gr 8th OpenAIRE Workshop RDA 9th Plenary – 4/4/2017 Barcelona anonymization Original data anonymous data
  • 2. Data anonymization? • Data anonymization facilitates the publication of micro data(vs. aggregated macrodata) , e.g., data used in scientific research • Micro data often reveal important private information, e.g., the medical condition of a person • Laws about privacy become increasingly stricter • Individuals are afraid to provide their data • Companies are afraid to share data with experts • The aim of anonymization methods is to allow sharing such data, without compromising the privacy of the users.
  • 3. Data anonymization and Amnesia • Data anonymization • Removal of direct identifiers, e.g., Names, SSN etc • Removal of infrequent combinations of quasi-identifiers, e.g., unique combinations of birth dates and zipcodes • Infequent combinations are removed through generalization, e.g., birth date 14/01/1977 becomes **/**/1977 • Amnesia is a scalable anonymization tool • It offers several versions of k-anonymity • It allows the user to select and customize possible solutions • It offers graphical tools that allow the user to analyze the anonymized dataset • It is scalable and uses all available CPU cores in the anonymization process 3
  • 5. k-anonymity • Each entry becomes indistinguishable from other k-1 entries • k-anonymity is achieved through suppression and generalization id Zipcode Age National. Disease 1 13053 28 Russian Heart Disease 2 13068 29 American Heart Disease 3 13068 21 Japanese Viral Infection 4 13053 23 American Viral Infection 5 14853 50 Indian Cancer 6 14853 55 Russian Heart Disease 7 14850 47 American Viral Infection 8 14850 49 American Viral Infection 9 13053 31 American Cancer 10 13053 37 Indian Cancer 11 13068 36 Japanese Cancer 12 13068 35 American Cancer id Zipcode Age National. Disease 1 130** <30 ∗ Heart Disease 2 130** <30 ∗ Heart Disease 3 130** <30 ∗ Viral Infection 4 130** <30 ∗ Viral Infection 5 1485* ≥40 ∗ Cancer 6 1485* ≥40 ∗ Heart Disease 7 1485* ≥40 ∗ Viral Infection 8 1485* ≥40 ∗ Viral Infection 9 130** 3∗ ∗ Cancer 10 130** 3∗ ∗ Cancer 11 130** 3∗ ∗ Cancer 12 130** 3∗ ∗ Cancer
  • 6. Structural information • We need to anonymize all relevant information about a person, not just a fixed size tuple • Information tends to gather over time • Information is linked through semantic properties, it’s schema is irrelevant • Personal data tend to accumulate over time • Research focuses on simple data and complicated guaranties but real world has complex data and requires simple guaranties
  • 7. Limits of k-anonymity • 2-anonymous Fruits Meat Vegetables Fish Vassilis Χ Χ Manolis Χ Χ Χ Eleni Χ Maria Χ Χ Kostas Χ Χ Food Vassilis Χ Manolis Χ Eleni Χ Maria Χ Kostas Χ
  • 8. Limits of k-anonymity • 22-anonymous • Any combination of m items will not appear less than k times Fruits Meat Vegetables Fish Vassilis Χ Χ Manolis Χ Χ Χ Eleni Χ Maria Χ Χ Kostas Χ Χ Fruits Meat Other food Vassilis Χ Χ Manolis X Χ X Eleni X Maria Χ X Kostas Χ X
  • 10. Techniques for Privacy Preserving Data Publishing - Amensia OpenAIRE General Assembly 2017 – Oslo, 15 February 10
  • 11. Techniques for Privacy Preserving Data Publishing - Amensia 11
  • 12. Evaluation and tuning 12 • Anonymization techniques have not been tested in practice extensively • Mapping the social notion of privacy to technical notions is not easy • Data utility has not been studied extensively in research • Few artificial information loss measures • Data utility is difficult to estimate in practice • Different applications have different needs • No easy to quantify the loss of information
  • 13. Evaluation and tuning 13 • Solution • A wide range of methods • Customizable algorithms • Several utility metrics • Try and error approach • On line visual analytics • User guided decisions • Multiple solutions
  • 14. Amensia Challenges When to anonymize? • We need to determine the nature and size of the available data • Issues like updates and overlaps between different datasets are important How to anonymize? • Data anonymity reduces the quality of the data • The reduction must be guided to minimize the effect on applications • Different applications and users require different anonymization methods • We need to talk to experts • To understand what they need • To show them what we can do 14