
idalab seminar #16 - Lisa Martin


Shedding light on the byzantine world of privacy-enhancing technology

At the heart of privacy-preserving data analysis lies a fundamental paradox: privacy preservation aims to hide, while data analysis aims to reveal. The two concepts may seem completely irreconcilable at first, but – using the right approach – they need not be. Our Data Strategist Lisa Martin spent two months researching this topic extensively, conducting interviews with industry experts and startups alike. In this talk, Lisa will share her insights, and we invite you to join us in discussing one of the most pressing issues of the 21st century: data privacy.

Speaker: Lisa Martin graduated from the University of Oxford with an MSc in Economics for Development. Prior to this, Lisa studied economics with a strong focus on econometrics in Tuebingen, Boston and Nuremberg, completing a BSc in International Economics. She has worked as an intern in the Economic & Market Intelligence department at Daimler in Stuttgart, as well as in management consulting at Oliver Wyman in Munich. Lisa started working at idalab as a summer intern in August 2018 and subsequently joined our team full-time as a Data Strategist.



  1. Lisa Martin – How to unlock valuable personal data for analysis. idalab seminar | 17 January 2019. idalab – Agency for Data Science: machine learning & AI, mathematical modelling, data strategy. © 2018 idalab GmbH | Potsdamer Straße 68 | 10785 Berlin | idalab.de
  2. Why does privacy matter for businesses?
     » Legal obligations: businesses are legally obliged to protect the privacy of individuals whose data they have been entrusted with.
     » Reputational risks: businesses may have an interest in protecting their customers’ privacy beyond what’s legally required. Privacy breaches can cause scandals and lasting damage to a firm’s reputation.
     » Intellectual property: in addition to protecting personal information of customers, businesses may want to protect their own sensitive information and business secrets.
  3. Privacy meets technology: privacy-enhancing technologies (PETs) can be used to preserve privacy in all digital domains.
     » [Figure: the PET universe and the scope of this presentation – online browsing, networks, messaging, cloud systems, identity management, and data science (analytics, machine learning). Sources: ENISA PETs control matrix, HP Laboratories.]
  4. Privacy challenges & solutions
  5. Privacy challenges in data science can be met by employing both technological and organizational solutions.
     » Data storage. Privacy challenge: separating the information value of the data from identifying elements. Technological solutions: anonymization, encryption, pseudonymization, technological access controls. Organizational solutions: order data processing contract, escrow service.
     » Data sharing. Privacy challenge: protecting identifying and sensitive information when sharing data for further use. Technological solutions: anonymization (synthetic data), pseudonymization, encryption. Organizational solutions: physical access controls, legal agreements.
     » Data set linking. Privacy challenge: combining inputs into a joint data set while keeping the inputs private. Technological solution: harmonized pseudonymization. Organizational solution: trusted third party.
     » Queries. Privacy challenge: running queries on data without exposing individual database entries. Technological solution: differentially private anonymization. Organizational solution: NA. (A sketch of a differentially private query follows after this list.)
     » Information retrieval. Privacy challenge: privately retrieving information without revealing the query. Technological solution: homomorphic encryption. Organizational solution: NA.
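To make the "Queries" row concrete, here is a minimal sketch of a differentially private count query. The deck names differential privacy only at this level of abstraction; the Laplace mechanism, the epsilon value and the toy records below are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def dp_count(records, predicate, epsilon=1.0):
    """Answer a counting query with the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example query: how many people in the data are taller than 180 cm?
people = [{"name": "Walter", "height": 179},
          {"name": "Chandler", "height": 183},
          {"name": "Claire", "height": 175}]
print(dp_count(people, lambda p: p["height"] > 180, epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy; the right value is a policy decision, not something the slides prescribe.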
  6. Privacy challenges in data science can be met by employing both technological and organizational solutions (cont’d).
     » Exploratory analysis. Privacy challenge: analyzing data with respect to its quality and properties in a privacy-preserving way. Technological solution: maybe pseudonymization, but with limited utility. Organizational solution: trusted third party.
     » Machine learning. Privacy challenge: training, testing and using ML models in a privacy-preserving way. Technological solutions: pseudonymization, synthetic data, homomorphic encryption hybrid. Organizational solution: order data processing contract.
     » Functional testing. Privacy challenge: using private data to test the functionalities of data products. Technological solutions: pseudonymization, synthetic data. Organizational solution: order data processing contract.
     » Multiparty computation. Privacy challenge: jointly calculating a function without disclosing any private inputs. Technological solutions: homomorphic encryption, secret sharing (see the sketch after this list). Organizational solution: order data processing contract.
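To make the multiparty-computation row concrete, here is a minimal sketch of additive secret sharing (one of the two technological solutions named above) applied to a joint sum. The modulus and the three private inputs are illustrative assumptions.

```python
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split `value` into n additive shares that sum to `value` mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def joint_sum(private_inputs):
    """Each party shares its input; parties only ever see random-looking
    shares, yet adding up everything reconstructs the sum of all inputs."""
    n = len(private_inputs)
    all_shares = [share(v, n) for v in private_inputs]
    # party i receives the i-th share of every input and adds them up
    partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(n)]
    return sum(partial_sums) % PRIME

# Example: three parties jointly compute total warehousing costs
print(joint_sum([120_000, 340_000, 95_000]))  # 555000
```

Each party only learns the final sum, never the other parties' individual inputs; real MPC protocols extend the same idea to products and comparisons.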
  7. Technological solutions explained
  8. Data can be anonymized, pseudonymized or encrypted to protect private and sensitive information.
     » De-identification: modification of attributes which facilitate identification of individuals.
       – Anonymization (irreversible modification of personal identifiers): generalization, perturbation, suppression, top & bottom coding, synthetic data.
       – Pseudonymization (reversible modification of personal identifiers): masking, scrambling, encryption of PII, hash functions, tokenization.
     » Encryption: encoding entire datasets to ensure that identifiers and other attributes are only accessible to authorized parties – partially homomorphic encryption, fully homomorphic encryption, order-preserving encryption, deterministic encryption, probabilistic encryption.
  9. Anonymization techniques usually score high in terms of privacy, but they may considerably decrease data utility. Original example records: Walter White, 07.09.1958, 179 cm; Chandler Muriel Bing, 19.08.1969, 183 cm; Claire Hale Underwood, 24.03.1962, 175 cm.
     » Generalization: values of an attribute are generalized either full-domain (same value for all) or subtree (same value per subgroup), e.g. names reduced to First/Last, birth dates to decades (1950-1959), heights to bands (170-179 cm).
     » Perturbation: values are perturbed by introducing noise (multiplicative or additive), replacing values, or (micro)aggregating, e.g. 07.09.1958 becomes 05.10.1957 and 179 cm becomes 181 cm.
     » Suppression: individual values, cells, or even entire columns are (selectively) removed, e.g. only the heights 179 cm, 183 cm, 175 cm remain.
     » Top & bottom coding: values that exceed given bounds are modified, e.g. 183 cm is recoded to 180 cm.
     » The slide rates each technique on maturity and feasibility.
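A minimal pandas sketch of the four techniques, applied to the slide's example records. The band widths, noise scale and coding bounds are illustrative assumptions, not values from the deck.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":   ["Walter White", "Chandler Muriel Bing", "Claire Hale Underwood"],
    "birth":  pd.to_datetime(["1958-09-07", "1969-08-19", "1962-03-24"]),
    "height": [179, 183, 175],
})

anonymized = pd.DataFrame({
    # generalization: replace exact values with coarse bands
    "birth_decade": (df["birth"].dt.year // 10 * 10).astype(str) + "s",
    "height_band":  (df["height"] // 10 * 10).astype(str) + "-"
                    + (df["height"] // 10 * 10 + 9).astype(str) + " cm",
    # perturbation: add random noise to the numeric attribute
    "height_noisy": df["height"] + np.random.normal(0, 2, len(df)).round(),
    # top & bottom coding: clip extreme values to fixed bounds
    "height_coded": df["height"].clip(lower=160, upper=180),
})
# suppression: the direct identifier `name` is simply not carried over
print(anonymized)
```

The utility cost the slide warns about is visible immediately: banded and clipped values can no longer support exact per-person analyses.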
  10. Anonymization deep dive: synthetic data (www.statice.ai). A synthetic dataset is a newly created dataset which retains all the key statistical properties of the original dataset, but has no overlap with it.
     » Synthetic data is non-personal and non-sensitive, so it can be freely stored and shared with multiple parties.
     » While statistical properties (conditional probabilities etc.) are identical, the synthetic dataset does not contain any real observations.
     » High-quality synthetic data may require considerable customization effort.
     » Promising solution for e.g. basic analytics and functional testing of data products, but less useful if linkability to real-life individuals is required.
     » [Table: original data vs. synthetic data, first five observations each – the two datasets do not have a single observation in common, yet their statistical properties are identical.]
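The deck does not describe how statice.ai generates synthetic data, so the sketch below is deliberately naive: it fits independent per-column distributions and samples from them. Real generators also preserve the joint structure (the conditional probabilities the slide emphasizes), e.g. with copulas or generative models; the toy dataset and function name here are assumptions.

```python
import numpy as np
import pandas as pd

def naive_synthesize(df, n_rows, seed=0):
    """Very naive synthetic data: sample each column independently from a
    distribution fitted to the original column. Real generators also keep
    the correlations between columns, which this sketch ignores."""
    rng = np.random.default_rng(seed)
    synthetic = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            mu, sigma = df[col].mean(), df[col].std(ddof=0)
            synthetic[col] = rng.normal(mu, sigma, n_rows).round(1)
        else:
            values, counts = np.unique(df[col], return_counts=True)
            synthetic[col] = rng.choice(values, size=n_rows, p=counts / counts.sum())
    return pd.DataFrame(synthetic)

original = pd.DataFrame({"age": [34, 45, 29, 51, 40],
                         "city": ["Berlin", "Berlin", "Munich", "Berlin", "Munich"]})
print(naive_synthesize(original, n_rows=5))
```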
  11. Data can be anonymized, pseudonymized or encrypted to protect private and sensitive information (overview, as on slide 8).
  12. Pseudonymized data retains higher utility, but it’s prone to re-identification through linkage attacks. Original example records: Walter White, 07.09.1958, 179 cm; Chandler Muriel Bing, 19.08.1969, 183 cm; Claire Hale Underwood, 24.03.1962, 175 cm.
     » Hash functions: data entries of arbitrary length are mapped to fixed-length ”hashes” using one-way functions, e.g. Walter White becomes DFCD 3454 BBEA.
     » Encryption of PII: personally identifiable information is encrypted using an invertible encryption scheme, e.g. Walter White becomes dhsaly dopal.
     » Masking: individual values, cells, or even entire columns are partially suppressed or modified, e.g. Walter White becomes Wxxxxx Wxxxx.
     » Scrambling: data is modified by randomly changing the order of letters and digits, e.g. Walter White becomes ltraeW hWtie.
     » The slide rates each technique on maturity and feasibility.
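A minimal Python sketch of three of these techniques; the salt, hash truncation length and permutation seed are illustrative assumptions. It also hints at the linkage-attack risk from the slide title: without a secret salt, hashed names could be re-identified simply by hashing candidate names and comparing.

```python
import hashlib
import random

SECRET_SALT = b"keep-this-out-of-the-shared-dataset"  # hypothetical secret

def hash_identifier(name):
    """One-way mapping of an identifier to a fixed-length hash."""
    return hashlib.sha256(SECRET_SALT + name.encode()).hexdigest()[:12]

def mask(name):
    """Keep the first letter of each word, replace the rest with 'x'."""
    return " ".join(w[0] + "x" * (len(w) - 1) for w in name.split())

def scramble(name, seed=42):
    """Randomly permute the letters of each word."""
    rng = random.Random(seed)
    return " ".join("".join(rng.sample(w, len(w))) for w in name.split())

for person in ["Walter White", "Chandler Muriel Bing"]:
    print(hash_identifier(person), "|", mask(person), "|", scramble(person))
```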
  13. Pseudonymization deep dive: tokenization (www.protegrity.com). Personal identifiers are replaced by ‘tokens’, so that data entries are not directly linkable to individuals.
     » Simple and intuitive procedure.
     » In combination with the corresponding static lookup tables, the data is linkable to data subjects. The lookup table therefore has to be kept secure to protect data privacy.
     » It is difficult to assess which data needs to be pseudonymized, i.e. to determine what constitutes direct and indirect identifiers.
     » Trade-off between data utility and data privacy: if too few identifiers are tokenized, the data may be prone to ‘linkage attacks’; if too many are tokenized, the utility of the data declines.
     » Unlike most anonymization techniques, pseudonymization techniques like tokenization facilitate correlating observations (e.g. entries 1 and 7 were generated by the same data subject).
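A minimal tokenization sketch, assuming random tokens and an in-memory lookup table (the part that must be kept secure); the class name, token format and example rows are illustrative.

```python
import secrets

class Tokenizer:
    """Replace direct identifiers with random tokens. The lookup table is
    the only way back to the original values, so it must be stored
    separately and securely."""

    def __init__(self):
        self._token_for_value = {}   # value -> token
        self._value_for_token = {}   # token -> value (the sensitive lookup table)

    def tokenize(self, value):
        if value not in self._token_for_value:
            token = "TOK-" + secrets.token_hex(4)
            self._token_for_value[value] = token
            self._value_for_token[token] = value
        return self._token_for_value[value]

    def detokenize(self, token):
        return self._value_for_token[token]

t = Tokenizer()
rows = [("Walter White", 179), ("Chandler Muriel Bing", 183), ("Walter White", 181)]
pseudonymized = [(t.tokenize(name), height) for name, height in rows]
print(pseudonymized)  # the same person keeps the same token, so entries remain correlatable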
  14. Data can be anonymized, pseudonymized or encrypted to protect private and sensitive information (overview, as on slide 8).
  15. Encryption can provide privacy and security at once, but it’s computationally complex and has limited usability. (Reference: Popa, Redfield, Zeldovich, Balakrishnan (2011), CryptDB: Protecting Confidentiality with Encrypted Query Processing.)
     » Deterministic encryption: two equal values are mapped to identical ciphertexts. Example: the DET scheme in CryptDB, Enc(x) = Enc(x + 0).
     » Probabilistic encryption: two equal values are mapped to different ciphertexts. Example: AES with a random initialization vector, Enc(x) ≠ Enc(x + 0).
     » Order-preserving encryption: order relations between ciphertexts are identical to the order relations between the underlying plaintexts. Example: the OPE scheme in CryptDB, x > y → Enc(x) > Enc(y).
     » Homomorphic encryption: computations on ciphertext are possible and yield the same results as computations on plaintext. Example: the Paillier scheme, Dec(Enc(x) × Enc(y)) = x + y. (A sketch follows below.)
     » The slide rates each technique on maturity and feasibility.
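The homomorphic-encryption row cites the Paillier scheme with Dec(Enc(x) × Enc(y)) = x + y. Below is a from-scratch toy Paillier with deliberately tiny, insecure parameters, purely to make that identity concrete; the primes and message values are illustrative assumptions, and production use requires vetted libraries and large keys.

```python
import math
import random

# Toy Paillier scheme (insecure parameters) to illustrate the additive
# homomorphism: Dec(Enc(a) * Enc(b) mod n^2) = a + b.
p, q = 61, 53
n = p * q                      # public modulus
n_sq = n * n
g = n + 1                      # standard choice of generator
lam = math.lcm(p - 1, q - 1)   # private key
mu = pow(lam, -1, n)           # precomputed inverse used in decryption

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    x = pow(c, lam, n_sq)
    return ((x - 1) // n) * mu % n

a, b = 111, 222
c_sum = (encrypt(a) * encrypt(b)) % n_sq   # multiply ciphertexts ...
print(decrypt(c_sum))                      # ... to add plaintexts: 333
```

Paillier is only partially homomorphic (addition); the next slide covers fully homomorphic schemes, which also support multiplication on ciphertexts.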
  16. Encryption deep dive: fully homomorphic encryption (FHE). [Speaker reference: Prof. Dr. Delaram Kahrobaei, idalab seminar #12.]
     » Partially homomorphic encryption schemes support only a limited number of operations, usually either addition or multiplication.
     » Fully homomorphic encryption schemes facilitate arbitrary computation on ciphertext. FHE schemes are homomorphic with respect to both addition and multiplication: dec(enc(a) + enc(b)) = a + b and dec(enc(a) × enc(b)) = a × b.
     » [Diagram: a is encrypted to enc(a) in cipherspace; eval applies the encrypted function f to enc(a), yielding enc(f(a)); decryption returns f(a) in plainspace.]
     » Noise in FHE schemes grows linearly with addition and quadratically with multiplication. The expensive ‘bootstrapping’ procedure is used to reduce noise when it grows too large. ‘Leveled’ FHE schemes don’t require bootstrapping but have limited usability.
     » Parametrization governs the complexity and security of FHE implementations.
     » The security proofs rely on the computational hardness of complex mathematical problems, which are believed to make FHE schemes quantum secure.
     » [Toy illustration from americanscientist.org: encrypt(x) = 2x, decrypt(c) = c/2. Addition survives encryption: decrypt(6 + 10) = 8 = 3 + 5. Multiplication does not: decrypt(6 × 10) = 30 ≠ 3 × 5 = 15, so this toy scheme is only additively homomorphic.]
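The slide's toy illustration runs in a few lines; this sketch only restates the americanscientist.org example and shows why the scheme is merely additively homomorphic, which is exactly the gap FHE closes.

```python
# Toy (insecure) scheme from the slide: encrypt(x) = 2x, decrypt(c) = c / 2.
def encrypt(x):
    return 2 * x

def decrypt(c):
    return c // 2

a, b = 3, 5
print(decrypt(encrypt(a) + encrypt(b)))   # 8  == a + b -> addition survives
print(decrypt(encrypt(a) * encrypt(b)))   # 30 != a * b -> multiplication does not
```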
  17. Encryption deep dive (cont’d): fully homomorphic encryption (FHE).
     » FHE simultaneously ensures data privacy and security.
     » High computational complexity makes FHE very costly. Computation using FHE is often one million times slower than computation on plaintext, so it’s usually not practical for machine learning.
     » There are numerous potential use cases for efficient FHE schemes in e.g. global optimization, benchmarking and data sharing across all industries. It may be particularly useful in health care and life sciences, where data utility and privacy requirements are equally high.
     » Secure FHE schemes are very complex and difficult to visualize, but very simple “toy” HE schemes which are partially homomorphic and cryptographically insecure can help to illustrate the idea: FHE can perform arbitrary computations on ciphertext, and decryption then yields the same results as performing the computations on plaintext.
     » ”Toy” HE scheme: Caesar cipher (shift 13) with concatenation. Encrypt HELLO → URYYB and WORLD → JBEYQ; concatenating the ciphertexts gives URYYBJBEYQ, which decrypts to HELLOWORLD – the same result as concatenating the plaintexts. (A runnable sketch follows below.)
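The Caesar-cipher toy scheme can be sketched directly; a shift of 13 (ROT13) matches the slide's ciphertexts, and string concatenation is the homomorphically preserved operation.

```python
import codecs

# "Toy" HE scheme from the slide: ROT13 as the cipher, concatenation as the
# operation. Concatenating ciphertexts and decrypting gives the same result
# as concatenating the plaintexts.
def encrypt(text):
    return codecs.encode(text, "rot13")

def decrypt(ciphertext):
    return codecs.decode(ciphertext, "rot13")

c1, c2 = encrypt("HELLO"), encrypt("WORLD")   # 'URYYB', 'JBEYQ'
print(decrypt(c1 + c2))                       # 'HELLOWORLD'
```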
  18. Potential use cases
  19. Information retrieval: M&A research can be conducted privately using FHE.
     » Data science task: a buyer wants to retrieve information about a potential acquisition target from a data vendor’s database.
     » Privacy challenge: the buyer does not want the data vendor to know which firm they are interested in.
     » Technological solution: the query can be encrypted using FHE, so that it can be run without decrypting it. The data vendor sends over the encrypted information, which the buyer can decrypt and analyze. (A sketch follows below.)
     » [Schematic illustration: encrypted query from the potential buyer to the data vendor; encrypted information returned.]
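A hedged sketch of this retrieval pattern. The slide proposes FHE; for the simple "fetch one value without revealing which" case, an additively homomorphic scheme already suffices, so the sketch below uses the open-source python-paillier package (phe). The database values, key length and target index are illustrative assumptions.

```python
from phe import paillier

database = [2_300, 4_100, 780, 1_950]   # vendor's data, e.g. revenues in MEUR
target_index = 2                        # the firm the buyer is interested in

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Buyer: encrypt a one-hot selection vector marking the target firm.
selection = [public_key.encrypt(1 if i == target_index else 0)
             for i in range(len(database))]

# Vendor: multiply each encrypted selector by its plaintext value and add the
# results. Everything stays encrypted, so the vendor never learns the choice.
encrypted_answer = selection[0] * database[0]
for sel, value in zip(selection[1:], database[1:]):
    encrypted_answer = encrypted_answer + sel * value

# Buyer: decrypt the answer locally.
print(private_key.decrypt(encrypted_answer))   # 780
```

In this sketch the vendor still does work proportional to the whole database, which is the price of hiding the query; practical private-information-retrieval schemes optimize exactly this cost.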
  20. Multiparty computation: FHE facilitates privacy-preserving supply chain optimization.
     » Data science task: a supplier and a manufacturer want to optimize their supply chains. If they do so jointly, rather than individually, the overall result will be better.¹
     » Privacy challenge: they don’t want to share their sensitive data on warehousing costs and production processes with each other.
     » Technological solution: they use FHE to encrypt their data so that they can jointly optimize without giving each other insight into private data.
     » [Schematic illustration: supplier and manufacturer exchanging only encrypted data.]
     ¹ The literature on Joint Economic Lot Size Models shows this analytically.
  21. Q&A
