Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy

This paper describes how technologies such as data pseudonymisation and differential privacy enable access to sensitive data and unlock data opportunities and value while ensuring compliance with data privacy legislation and regulations.

Alan McSweeney
January 2022
alan@alanmcsweeney.com
Contents

Introduction
Personal Information
Third-Party Data Sharing And Data Access Framework
Data Privacy Technologies
Context Of Data Privatisation – Anonymisation, Pseudonymisation And Differential Privacy
    Data Sharing Use Cases
Pseudonymisation
    Why Pseudonymise Rather Than Anonymise?
    GDPR Origin Of Pseudonymisation
    Growing Importance Of Pseudonymisation
    Approaches To Pseudonymisation
        Pseudonymisation By Replacing ID Fields With Linking Identifier (Token)
        Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields
        ID Field Hashing Pseudonymisation
        Hashing And Identifier Codes
        Hashing And Reversibility
        ID Field Hashing Pseudonymisation With Data Salting And Peppering
        Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering
        Content Hashing Pseudonymisation
    Pseudonymisation And Data Lakes/Data Warehouses
    Pseudonymisation Implementation
Data Breaches and Attacks
    Pseudonymisation and Data Breaches
    Differencing Attack
    Differencing Attack, Reconstruction Attack And Mosaic Effect
Differential Privacy
Data Privatisation and Differential Privacy Solution Architecture Overview
    Differential Privacy Platform Solution Service Management Processes
    Differential Privacy Platform Deployment Options
        On-Premises Deployment
        Cloud Deployment
    Differential Privacy and Data Attacks
Data Privatisation and Differential Privacy Solution Planning
Data Privatisation and Differential Privacy Solution Operation and Use
Data Privatisation and Differential Privacy Next Steps
    Early Business Engagement and Differential Privacy Opportunity Validation
    Differential Privacy Detailed Design
    Differential Privacy Readiness Assessment
    Differential Privacy Architecture Sprint
List of Figures

Figure 1 – Data Privacy Subject Areas
Figure 2 – Data Privacy and Data Utility Balancing Act
Figure 3 – Data Sharing and Data Access Framework
Figure 4 – Data Sharing and Access Topologies
Figure 5 – Data Privatisation Spectrum
Figure 6 – Data Privacy Technologies
Figure 7 – Context of Data Privatisation
Figure 8 – Overview of Pseudonymisation
Figure 9 – Pseudonymisation for Data Sharing with External Business Partners
Figure 10 – Overview of Approaches to Pseudonymisation
Figure 11 – Pseudonymisation By Replacing ID Fields With Linking Identifier
Figure 12 – Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields
Figure 13 – ID Field Hashing Pseudonymisation
Figure 14 – ID Field Hashing Pseudonymisation With Data Salting And Peppering
Figure 15 – Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering
Figure 16 – Content Hashing Pseudonymisation
Figure 17 – Pseudonymisation and Data Lakes/Data Warehouses
Figure 18 – Pseudonymisation and Data Breaches
Figure 19 – Differential Privacy and Differencing Attacks
Figure 20 – Differencing Attack, Reconstruction Attack And Mosaic Effect
Figure 21 – Differential Privacy Operation
Figure 22 – Data Privatisation and Differential Privacy Balancing Act
Figure 23 – Operational Data Privatisation and Differential Privacy Solution Architecture
Figure 24 – Sample High-Level On-Premises Deployment
Figure 25 – Sample High-Level Cloud Deployment
Figure 26 – Data Privatisation and Differential Privacy Solution Journey
Figure 27 – Approaches to Data Privatisation and Differential Privacy Solution Scoping and Definition
Figure 28 – Early Business Engagement and Differential Privacy Opportunity Validation Process
Figure 29 – Differential Privacy Detailed Design Views
Figure 30 – Areas Covered in Differential Privacy Readiness Assessment
Introduction

This paper examines the related concepts of data privatisation, data anonymisation, data pseudonymisation and differential privacy.

Data has value. To realise this value, it may need to be made more widely available, both within and outside your organisation, for various types of access, such as sharing data with outsourcing and service partners or making data available to research partners. This data sharing must be performed in the context of maintaining personal data privacy. This paper examines the technology options for providing different types of access to data while preserving privacy and ensuring compliance with the many (and growing) data privacy regulatory and legislative requirements.

You need to take a risk management approach to data sharing and third-party data access. Appropriate technology, appropriately implemented and operated, is a means of managing and reducing the risks of re-identification by making the time, skills, resources and money necessary to achieve it unrealistic. A demonstrable technology-based approach to data privacy, supported by a data sharing business framework, reduces an organisation's liability in the event of data breaches. For example, under the EU GDPR (General Data Protection Regulation)1, where a data breach occurs, the controller is exempted from its notification obligations where it can show that the breach is 'unlikely to result in a risk to the rights and freedoms of natural persons'2, such as when pseudonymised data leaks and the re-identification risk is remote.

Organisations need a well-defined and implemented process that enables them to make their data available as widely as possible without exposing them to the risks associated with non-compliance with the wide range of differing data privacy regulations.

Managing data privacy in the context of data access and sharing arrangements encompasses the areas of:

• Data Governance
• Privacy Management
• Security Management
• Risk Management

1 See http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%3A32016R0679
2 See GDPR recitals 80 and 85 and articles 27 and 33.
Figure 1 – Data Privacy Subject Areas

Managing data privacy in the context of data access and sharing arrangements is a balancing act between data privacy and data utility. Perfect data privacy can be achieved by not sharing or making accessible any data, irrespective of whether it contains personal identifiable information; the result is that the data is unused. Perfect data utility can be achieved by sharing and making accessible all data; the result is that there is no data privacy. One aspect of data privacy management is taking a risk-based approach to this balancing act.

Figure 2 – Data Privacy and Data Utility Balancing Act

This paper describes some practical, realistic and achievable approaches to implementing data privatisation using pseudonymisation and differential privacy as a means of addressing your data sharing and access requirements and opportunities.
This paper covers the following topics:

• Personal Information – what is meant by personal information.
• Third-Party Data Sharing And Data Access Framework – data sharing is enabled through technologies, but it is primarily a business concern and any arrangements should be grounded in a business framework.
• Data Privacy Technologies and Context Of Data Privatisation – the data privatisation approaches of anonymisation, pseudonymisation and differential privacy. This covers the GDPR origin of pseudonymisation, the growing importance of pseudonymisation, various approaches to pseudonymisation and hashing, and pseudonymisation and data lakes/data warehouses.
• Data Breaches and Attacks – background information on data breaches and attacks and how data privatisation approaches provide protection against them.
• Why Data Privatisation and Differential Privacy – the context for the need for a robust, secure, operational data privatisation and differential privacy technology framework.
• Data Privatisation and Differential Privacy Solution Architecture Overview – how a differential privacy solution sits within your existing information technology solution and data landscape, what its components are and what the solution deployment options are.
• Data Privatisation and Differential Privacy Solution Planning – what an exercise to plan for the implementation and operation of a successful data privatisation and differential privacy solution consists of.
• Data Privatisation and Differential Privacy Solution Operation and Use – how the data privatisation and differential privacy solution is operated and used.
• Differential Privacy Next Steps – a set of possible next steps and types of engagement to allow you to move along the data privatisation and differential privacy journey successfully.

Personal Information

Personal information is any information relating to an identified or identifiable natural person. It can be direct – information that directly identifies a single individual – or indirect (quasi-identifiers) – information that can be used to identify an individual by being linked with other information. Quasi-identifiers include information such as date of birth, date of death and post code. These do not specifically link to an individual, but such links can be determined.

Personal information can be structured or unstructured, such as free-form text, or it can take other forms, such as images (photographs, medical images) or other data types such as genomic data. Personal information can be stored in multiple different ways, from database tables and columns to data formats such as documents and spreadsheets to image files. Personal information may also exist in the form of metadata attached to data files. The technologies underpinning data privatisation will need to handle all these data types and formats.
When considering data privatisation in the context of data access and sharing, the full set of personal information and the range of data formats should be considered. The approach to handling quasi-identifiers may be different from that taken for direct identifiers. Rather than completely removing them, they could be made more general, such as using month and year for date of birth, or a date range could be specified.

Third-Party Data Sharing And Data Access Framework

Managing data privacy in the context of data access and data sharing is not just a technology concern. The selection, implementation and operation of the technologies needed to ensure data privacy exist within a wider data sharing and access framework. Organisations that intend to share and provide access to data should define such a framework. This will provide an explicit approach rather than leaving such arrangements implicit and poorly defined. It will reduce the time and effort required to implement data access and sharing. It will ensure a consistent and coherent approach. The following diagram describes a possible structure for such a framework.

Data access and sharing covers both internal access, such as business units other than the originating business unit accessing data, and external access – third parties being given access to data for business and research purposes.

Figure 3 – Data Sharing and Data Access Framework

This framework has the following dimensions:

1. Business and Strategy Dimension – this relates to the overall organisation posture on internal and external data access and sharing and needs to cover topics such as:

• Overall Objectives, Purposes and Goals – this sets the context and overall direction of, and the principles that will underpin, data sharing and data access arrangements. The objectives, purposes and goals of these arrangements will be defined.
• Data Sharing Strategy – this will define the organisation's strategy for internal and external data sharing and access: why it is being done, who will be allowed access to data, the types of data to which access will be granted, the types of access allowed and the technology approaches that will be used.
• Risk Management, Governance and Decision Making – this will cover how data sharing and access arrangements will be governed and managed, how decisions will be made on these arrangements and how data sharing and access risks will be managed.
• Charges and Payments – this will define the charges and payments structure, if applicable, that will apply to data access and sharing arrangements.
• Monitoring and Reporting – this will document how the operation and use of data access and sharing arrangements will be monitored, audited and reported on.

2. Legal Dimension – this encompasses the legal aspects of data sharing and needs to cover topics such as:

• Data Privacy Legislation and Regulation Compliance – this will cover the activities of researching and monitoring the data privacy legislative and regulatory landscape and any changes and developments that may impact data access and sharing.
• Contract Development and Compliance – this will encompass the development, negotiation and implementation of contractual arrangements governing specific data access and sharing arrangements.

3. Technology Dimension – this covers technology and security standards and needs to cover topics such as:

• Data Sharing and Data Access Technology Selection – this covers the arrangements and responsibilities for selecting the tools and technologies that will be used to implement data access and sharing.
• Technology Standards Monitoring and Compliance – this will define the responsibilities for and scope of monitoring technology standards and developments, the organisation's adoption of and compliance with those standards and the management of change as the standards change.
• Security Standards Monitoring and Compliance – this will describe how data access and sharing security standards should be monitored, how security is implemented for data sharing and access arrangements and how change is managed as the standards change.

4. Development and Implementation Dimension – this relates to the implementation of data sharing technology tools and platforms and of specific data access and sharing arrangements and needs to cover topics such as:

• Technology Platform and Toolset Selection and Implementation – this includes the selection and implementation of specific data access and sharing technologies covering security and access control, the range of data types and the data access facilities being offered.
• Functionality Model Development and Implementation – this relates to defining and implementing the data access and sharing functionality and features being offered and the tools and technologies that will support them.
• Data Sharing and Access Implementations – this encompasses the specification and implementation of specific data access and sharing arrangements.
• Data Sharing and Access Maintenance and Support – this covers the maintenance and support arrangements both for the overall data access and sharing tools, platforms and technologies and for the specific arrangements.
5. Service Management Dimension – this defines the operational processes that should be defined and implemented in order to operate data sharing and needs to cover topics such as:

• Service Management Processes – this defines the operational and service management processes that need to be implemented and operated.
• Operational and Service Level Agreement Management – this covers defining and then managing and monitoring compliance with operational and service level agreements for data access and sharing arrangements.
• Maintain Inventory of Data Sharing Arrangements – this covers the maintenance of a list of current and previous data sharing and access arrangements.
• Service Monitoring and Reporting – this defines how the data sharing arrangements will be monitored and reported on.
• Issue Handling and Escalation – this covers how any issues relating to the operation and use of data sharing will be recorded, handled and escalated.

There are different data sharing and access arrangements.

Figure 4 – Data Sharing and Access Topologies

Data can be made available more widely within the organisation for purposes for which it was not originally collected. Data can be made publicly available; once this has been done, it will not be possible to control who uses it or the uses to which it is put, or to recall it.
Data can be shared subject to some form of legal or contractual arrangement. Data can be shared through some form of controlled and secure facility. In the last two arrangements, some form of trust exists between the sharing entity and the data recipient. This sharing may be supported by penalties (after disclosure) or by technology (disclosure prevention) or both. Data can be pushed to the target or made available to the target through a pull or download facility. The data sharing and access framework should cover all these possibilities.

Within the context of data access and sharing, data privatisation can be viewed as a spectrum from completely identifiable data to data that is not linked to individuals.

Figure 5 – Data Privatisation Spectrum

The data privacy risk is reduced as you move further to the right. Data utility may also be reduced as you move to the right. The data sharing and access framework should combine both the data sharing and access topology and the data privatisation spectrum to get a more complete view of data access arrangements.

Data Privacy Technologies

Data privatisation is the removal of personal identifiable information (PII) from data. At a very high level, data privatisation can be achieved in one or both of two ways (a minimal sketch follows the list):

1. Data Summarisation – sets of individual data records are compressed into summary statistics with all personal information removed.

2. Data Tokenisation – the personal data within a dataset that allows an individual to be identified is replaced by a token (possibly generated from the personal data, such as by hashing), either permanently (anonymisation) or reversibly (pseudonymisation).
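The following is a minimal Python sketch of these two ways; the record layout and field names are hypothetical examples used for illustration only.

    import secrets

    records = [
        {"name": "Alice Murphy", "age": 34, "amount": 120.0},
        {"name": "Brian Kelly", "age": 41, "amount": 80.0},
    ]

    # 1. Data summarisation: collapse individual records into aggregate
    #    statistics with all personal fields removed.
    summary = {"count": len(records),
               "total_amount": sum(r["amount"] for r in records)}

    # 2. Data tokenisation: replace the identifying field with a token.
    #    Keeping the token-to-name mapping makes this reversible
    #    (pseudonymisation); discarding it makes it irreversible
    #    (anonymisation).
    key_store, tokenised = {}, []
    for r in records:
        token = secrets.token_hex(16)
        key_store[token] = r["name"]
        tokenised.append({"id": token, "age": r["age"], "amount": r["amount"]})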
Figure 6 – Data Privacy Technologies

There are different routes to making data accessible and shareable within and outside the organisation without compromising compliance with data protection legislation and regulations and without the risk associated with allowing access to personal data:

• Differential Privacy – source data is summarised and individual personal references are removed. The one-to-one correspondence between original and transformed data is removed.
• Anonymisation – identifying data is destroyed and cannot be recovered, so individuals cannot be identified. There is still a one-to-one correspondence between original and transformed records.
• Pseudonymisation – identifying data is encrypted and the recovery data/token is stored securely elsewhere. There is still a one-to-one correspondence between original and transformed records.

These technologies and approaches are not mutually exclusive – each is appropriate to different data sharing and data access use cases.

Context Of Data Privatisation – Anonymisation, Pseudonymisation And Differential Privacy

The wider context of data privatisation and the specific approaches for enabling it, such as anonymisation, pseudonymisation and differential privacy, can be represented by four interrelated areas:

• Value in Data Volumes and Data Assets – you have expended substantial resources in gathering, processing and generating data. This data has value that you want to realise by making it more widely available. The need to comply with the increasing body of data protection and privacy laws inhibits your ability to achieve this.
• Data Privacy Laws and Regulations – you need to ensure that making your data available to a wider range of individuals and organisations does not breach the ever-increasing set of data protection and privacy legislation and regulations. All too frequently, the cost of and concerns around ensuring this compliance prevent this wider data access.
• Technologies – the various data privatisation and privacy technologies are mature, well-proven, industrialised and independently certified. They can be used to provide controlled, secure access to your data while guaranteeing compliance with data protection and privacy legislation. Using these technologies will embed such compliance by design into your data sharing and access facilities. This will allow you to realise value from your data successfully.
• Data Processes and Business Data Trends – the volumes of data available to organisations are increasing, as is the range of analysis tools and technologies. Data storage is moving to cloud platforms that can handle these data volumes and provide analysis tools more easily than costly and complex on-premises solutions that are available only to larger organisations. Organisations are outsourcing more business processes to third parties, and these outsourcing arrangements require the sharing of data.

Figure 7 – Context of Data Privatisation

To achieve the value inherent in your data you need to be able to make it appropriately available to others. You need a process that enables you to make your data available as widely as possible without exposing you to the risks associated with non-compliance with the wide range of differing data privacy regulations. You need one data access framework and associated set of technologies that work for all data access and sharing while guaranteeing legislative and regulatory compliance.
Data Privatisation Topology – Data Privacy Laws and Regulations

The landscape of data protection and privacy legislation and regulations is extensive, complex and growing – any view of it is partial and incomplete. Organisations that share data externally need to be able to guarantee compliance with all relevant and applicable legislation.

Data Privatisation Topology – Value in Data Volumes and Data Assets

Organisations have more and more data of increasing complexity that they want and need to share in order to generate value.
Data Privatisation Topology – Technologies

There is a range of well-proven technologies available for ensuring data privacy.

Data Privatisation Topology – Data Processes and Business Data Trends

Organisations want to outsource their business processes and share their data with partners to gain access to specialist analytics and research skills and tools.

Data Sharing Use Cases

There are many data sharing use cases and scenarios that involve the sharing of potential personal identifiable information, such as:

• Share data with other business functions within your organisation
• Use third-party data processing and storage platforms and facilities
• Use third-party data access and sharing as-a-service platforms and facilities
• Use third-party data analytics platforms and facilities
• Engage third-party data research organisations to provide specialist services
• Share data with external researchers
• Outsource business processes and enable data sharing with third parties
• Share data with industry business partners to gain industry insights
• Share data to detect and avoid fraud
• Share customer data with service providers at the request of the customer
• Enable customer switching
• Participate in Open Data initiatives

Pseudonymisation

Pseudonymisation is an approach to deidentification where personally identifiable information (PII) values are replaced by tokens, artificial identifiers or pseudonyms. Pseudonymisation is one technique to assist compliance with the EU General Data Protection Regulation (GDPR) requirements for secure storage of personal information. Pseudonymisation is intended to be reversible: the pseudonymised data can be restored to its original state (a minimal sketch of this round trip follows Figure 8). Personal data fields can be individually pseudonymised, so there is a one-to-one correspondence between original source data fields and transformed data fields, or the personal data fields can be removed and replaced with a single token.

Figure 8 – Overview of Pseudonymisation
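The following minimal Python sketch illustrates this reversibility with a hypothetical record layout: identifying fields are replaced by a random token, the token-to-identity mapping is held separately as the pseudonymisation key, and the data is later re-identified by joining on the token.

    import secrets

    original = [{"name": "Alice Murphy", "balance": 1200}]

    # Pseudonymise: replace the identifying field with a random token and
    # record the mapping in a separately held pseudonymisation key.
    key_store, outbound = {}, []
    for row in original:
        token = secrets.token_hex(16)
        key_store[token] = row["name"]
        outbound.append({"id": token, "balance": row["balance"]})

    # A third party works on the pseudonymised data and returns it enriched.
    returned = [dict(row, risk_score=0.7) for row in outbound]

    # Re-identify: merge the enriched rows back to the original identities
    # using the separately held key.
    merged = [dict(row, name=key_store[row["id"]]) for row in returned]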
Why Pseudonymise Rather Than Anonymise?

Personal identifiable data is pseudonymised when there is a need to re-identify the data, for example after it has been worked on by a third party either within or outside the organisation and the results of the processing need to be matched to the original data. The following diagram illustrates such a scenario.

Figure 9 – Pseudonymisation for Data Sharing with External Business Partners

The numbered steps are:

1. Original Data – this is the original collected or processed data containing personal identifiable information.
2. Pseudonymised Data – the personal identifiable information within the data is pseudonymised.
3. Pseudonymisation Key – a separate pseudonymisation key allows pseudonymised data to be re-identified when needed. This needs to be kept separate from the pseudonymised data.
4. Pseudonymised Data Transmitted to Data Processor – the pseudonymised data is sent to the external data processor for their use.
5. Processed Data with Additional Processed Data – the data is enriched with the results of additional processing.
6. Pseudonymised Data with Additional Processed Data Returned – the enriched data is returned to the organisation.
7. Original Data Merged with Additional Processed Data – the enriched data is re-identified using the previously created pseudonymisation key.

Pseudonymisation can also be used as part of the archiving process for data containing personal identifiable information after its main processing has been completed and the data is being retained for historical purposes.

GDPR Origin Of Pseudonymisation

The use of pseudonymisation as a form of encryption of personal identifiable information gained importance and legitimacy from the GDPR, where pseudonymisation is referred to many times. The term is defined in Article 4(5) of the GDPR:

'pseudonymisation' means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such
additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;

Pseudonymisation is also referred to in Recitals 26 and 28 of the GDPR:

Recital 26

The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.

Recital 28

The application of pseudonymisation to personal data can reduce the risks to the data subjects concerned and help controllers and processors to meet their data-protection obligations. The explicit introduction of 'pseudonymisation' in this Regulation is not intended to preclude any other measures of data protection.

Article 32(1)(a), dealing with security, refers to the pseudonymisation and encryption of personal data, using pseudonymisation to mean changing personal data so that the resulting data cannot be attributed to a specific person without the use of additional information.

Article 89, covering safeguards and derogations relating to processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, refers to pseudonymisation as follows:

1. Processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, shall be subject to appropriate safeguards, in accordance with this Regulation, for the rights and freedoms of the data subject. Those safeguards shall ensure that technical and organisational measures are in place in particular in order to ensure respect for the principle of data minimisation. Those measures may include pseudonymisation provided that those purposes can be fulfilled in that manner. Where those purposes can be fulfilled by further processing which does not permit or no longer permits the identification of data subjects, those purposes shall be fulfilled in that manner.
Article 6(4), covering lawfulness of processing, refers to pseudonymisation as a means of possibly contributing to the compatibility of further use of data:

Where the processing for a purpose other than that for which the personal data have been collected is not based on the data subject's consent or on a Union or Member State law which constitutes a necessary and proportionate measure in a democratic society to safeguard the objectives referred to in Article 23(1), the controller shall, in order to ascertain whether processing for another purpose is compatible with the purpose for which the personal data are initially collected, take into account, inter alia:

(a) any link between the purposes for which the personal data have been collected and the purposes of the intended further processing;
(b) the context in which the personal data have been collected, in particular regarding the relationship between data subjects and the controller;

(c) the nature of the personal data, in particular whether special categories of personal data are processed, pursuant to Article 9, or whether personal data related to criminal convictions and offences are processed, pursuant to Article 10;

(d) the possible consequences of the intended further processing for data subjects;

(e) the existence of appropriate safeguards, which may include encryption or pseudonymisation.

Article 25 refers to pseudonymisation as a means of contributing to data protection by design and by default in data applications:

1. Taking into account the state of the art, the cost of implementation and the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity for rights and freedoms of natural persons posed by the processing, the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organisational measures, such as pseudonymisation, which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects.

Encryption is a form of pseudonymisation: the original data cannot be read and the process cannot be reversed without the correct decryption key. The GDPR requires that this additional information be kept separate from the pseudonymised data. Pseudonymisation reduces the risks associated with data loss or unauthorised data access. Pseudonymised data is still regarded as personal data and so remains covered by the GDPR. Pseudonymisation is viewed as part of the Data Protection By Design and By Default principle. It is not mandatory. Implementing pseudonymisation with old legacy IT systems and processes may be complex and expensive and, to that extent, pseudonymisation might be considered an example of unnecessary complexity within the GDPR.

In relation to processing that does not require identification, it is appropriate to refer to Article 11. Article 11(1) provides that if the purposes for which a controller processes personal data do not, or no longer, require the identification of a data subject by the controller, the controller shall not be obliged to maintain, acquire or process additional information in order to identify the data subject for the sole purpose of complying with the GDPR. Where, in such cases, the controller is able to demonstrate that it is not in a position to identify the data subject, the controller shall inform the data subject accordingly, if possible. In such cases, Articles 15 to 20 shall not apply except where the data subject, for the purpose of exercising his or her rights under those articles, provides additional information enabling his or her identification.

The GDPR has effectively made pseudonymisation the recommended approach to protecting personal identifiable information.
Growing Importance Of Pseudonymisation

The Schrems II judgement3 has further increased the importance and relevance of data pseudonymisation, particularly in relation to data transfers outside the EU. The judgement found that the US FISA (Foreign Intelligence Surveillance Act) does not respect the minimum safeguards resulting from the principle of proportionality and cannot be regarded as limited to what is strictly necessary. While the changes apply to transfers outside the EU, especially to the US, they can be adopted pervasively for all data transfers to ensure consistency.

The European Data Protection Board (EDPB) adopted version 2 of its recommendations on supplementary measures4 to enhance data transfer arrangements and ensure compliance with EU personal data protection requirements. In this context, data pseudonymisation must ensure that:

• Data is protected at the record and data set level as well as the field level, so that the protection travels with the data wherever it is sent
• Direct, indirect and quasi-identifiers of personal information are protected
• The approach attempts to protect against mosaic effect re-identification attacks by adding high levels of uncertainty to pseudonymisation techniques

3 https://curia.europa.eu/juris/document/document.jsf?text=&docid=228677&pageIndex=0&doclang=en
4 https://edpb.europa.eu/system/files/2021-06/edpb_recommendations_202001vo.2.0_supplementarymeasurestransferstools_en.pdf

Approaches To Pseudonymisation

There are several potential approaches to pseudonymisation that can be implemented, as shown in the following diagram:

Figure 10 – Overview of Approaches to Pseudonymisation

These approaches include:

• Replace IDAT Fields With Linking Identifier
• Hash IDAT Fields
• Hash IDAT Fields With Additional Salting/Peppering
• Generate Hash From All Contents

These approaches are explained in more detail in the next sections. In the following, IDAT means identifying data and refers to personal identifiable information, and ADAT means analytic information.

Pseudonymisation By Replacing ID Fields With Linking Identifier (Token)

This approach involves replacing the identifying data fields with a random value. These random values are stored in a separate, secure, non-accessible dataset that links each random value to the original record.

Figure 11 – Pseudonymisation By Replacing ID Fields With Linking Identifier
Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields

Where there are multiple identifying data fields, each can be replaced with a random value, or the multiple identifying data fields can be removed and replaced with a single identifier.

Figure 12 – Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields

Replacing multiple source fields with a single token field reduces the granularity with which the original source data can be retrieved: the entire set of source fields must be retrieved from the depseudonymisation key before the individual field required can be extracted.

ID Field Hashing Pseudonymisation

The hashing approach to pseudonymisation involves replacing identifying data with a hash code of the data. For example, the SHA3-512 hash of IDAT1 in hexadecimal is:

576c23e0ec773508ae7a03d1b286d75f3a7cfe524625b658a1961d3fa7b0ebb4cc01b3b530c634c9525631614ad3ebcb3afb69d33e5d8608a1587c2f43c16535

The SHA3-512 algorithm returns a 512-bit value. The hexadecimal value above corresponds to the following binary string:

01010111011011000010001111100000111011000111011100110101000010001010111001111
01000000011110100011011001010000110110101110101111100111010011111001111111001
01001001000110001001011011011001010111101000011001011000011101001111111010011
11011000011101011101101001100110000000001101100111011010100110000110001100011
01001100100101010010010101100011000101100001010010101101001111101011110010110
01110101111101101101001110100110011111001011101100001100000100010100001010110
00011111000010111101000011110000010110010100110101

Storing a SHA3-512 hash code requires 64 bytes. In the case of some identifying data fields, this may be longer than the field itself, so pseudonymisation will increase storage requirements both by replacing shorter fields with longer ones and by requiring the storage of separate depseudonymisation keys – see the Pseudonymisation Implementation section below.

The input identifying data cannot be recalculated from the hash directly. However, hash values can be calculated easily and quickly (a "brute force" attack) and compared to pseudonymised values to recover the original identifying data.

Figure 13 – ID Field Hashing Pseudonymisation

Hashing And Identifier Codes

If any of the IDAT fields contains a recognisable identifier code, then brute force hash attacks are very feasible, even with modest computing resources. In general, identifying data tends to be more structured than other data – names, addresses, codes and so on. For example, consider an identifier code with a format such as:
AAA-NNN-NNN-C

where:

A is an upper-case alphabetic character
N is a number from 0-9
C is a check character

There are 17,576,000,000 possible combinations of this sample identifier code (26^3 letter combinations × 10^6 digit combinations). This may appear to be a large number, but a single high-specification PC could calculate the SHA3-512 hash values for all of these combinations in a few hours. So, unless the input to the hash generation is augmented with additional, more random information, brute force attacks are feasible.

The following illustrates how a small (single character) change – in this case, changing a character from lower to upper case – in the sample input value generates very different hash codes:

Input: ... no man has the right to fix the boundary of a nation. No man has the right to say to his country, "Thus far shalt thou go and no further", and we have never attempted to fix the "ne plus ultra" to the progress of ...
SHA3-512 Hash: e0ef7bd38b6b4bc6a27e7260d2162b2ea58cf5afa5098072d0f735f9d73b67f9b9f699b8b098ec41d44e117135e88b3cfb670876a2f34efd5734e7ce80b64450

Input: ... no man has the right to fix the boundary of a nation. No man has the right to say to his country, "Thus far shalt thou go and no further", and we have never attempted to fix the "Ne plus ultra" to the progress of ...
SHA3-512 Hash: e0ab9f0efb8f4cc2b89b73439f7b1365e687b17b7e0bdc0ede00751a5a883ad8ee0877b9b6a3032ad23521a7bc25a0b199e5c57cdb2cb5d7500c997e133c41a1

Input: ... no man has the right to fix the boundary of a nation. No man has the right to say to his country, "Thus far shalt thou go and no further", and we have never attempted to fix the "ne Plus ultra" to the progress of ...
SHA3-512 Hash: 61361212da56a824559b81409cf02ba5f8c3bf41d4c8038faa885a183e1bdac1705eefad72594af1fc3901aa55295c3166eb6635ca866f1e5cdf56c7ff0fb56a

Input: ... no man has the right to fix the boundary of a nation. No man has the right to say to his country, "Thus far shalt thou go and no further", and we have never attempted to fix the "ne plus Ultra" to the progress of ...
SHA3-512 Hash: 833d8b7cc47843cf74fd42cbbf782e87543c677ecbdc1f7fe4d7ad9166557fac4c17d467fa81302a195e60a0a6f3f89c34e03a5c94eefcb3f19cabcfd87a37ad

Hashing And Reversibility

The hash of a value is always the same – there is no randomness in hashing. However, as shown above, hashes of very similar input values are very different: a very small input change leads to a very large difference in the generated hash. For SHA3-512, a 0.5% change in input value leads to an 85%-95% difference in hash output. So, given two hash values, it cannot easily be determined how similar the input values are or what the structure of the input values might be. This non-correlation property means the hash function is characterised by erratic behaviour in its output generation.

Hashing as a form of pseudonymisation is potentially vulnerable to brute force attacks because large numbers of hashes can be generated very easily and quickly. If you have some knowledge of the input value, you can generate large numbers of permutations and their hashes and compare them with the known hash to identify the original value. But ultimately you have to have the exact input value to generate the same hash: being very close is of no benefit. Therefore, combining the original data with even a small amount of randomised data renders brute force attacks on hash values much more complex.
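To make the brute force risk concrete, the following is a minimal Python sketch (using the standard hashlib module) that hashes an identifier of the AAA-NNN-NNN-C format above and enumerates candidate identifiers to reverse a leaked hash. The check-character function is a hypothetical stand-in, since its derivation is not specified here.

    import hashlib
    import itertools
    import string

    def hash_id(value: str) -> str:
        # SHA3-512 hash of an identifying field, as a hexadecimal string.
        return hashlib.sha3_512(value.encode("utf-8")).hexdigest()

    def brute_force(target_hash, check_char):
        # Enumerate every AAA-NNN-NNN candidate (26^3 x 10^6 = 17,576,000,000
        # combinations), derive the check character, hash and compare.
        for letters in itertools.product(string.ascii_uppercase, repeat=3):
            prefix = "".join(letters)
            for n in range(1_000_000):
                body = f"{prefix}-{n // 1000:03d}-{n % 1000:03d}"
                candidate = f"{body}-{check_char(body)}"
                if hash_id(candidate) == target_hash:
                    return candidate
        return None

    # Hypothetical, trivial check-character rule for illustration.
    check = lambda s: "X"
    leaked = hash_id("QRT-204-117-X")
    # brute_force(leaked, check) would, after at most a few hours on a
    # high-specification PC, return "QRT-204-117-X".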
ID Field Hashing Pseudonymisation With Data Salting And Peppering

Salt is an additional, different data item added to each identifying data field before hashing. Pepper is a fixed data item added to record-level or field-level data before hashing. With this approach the hashed identifying data is:

HASH(CONCATENATE(IDATi + SALTi + PEPPER))

For example:

SHA3-512(CONCATENATE(IDAT1 + SALT1 + PEPPER)) = 3fa075114200b2327092f18067059ba81a5b191b33d5a10a2042673adcb119fac4dc5d3f63c60d44e132f4db5996d416fd70216d4e055f1e5ccc0258ff15e1e1

This approach eliminates almost all the risk from brute force hash generation attacks unless the approach to generating the Salt and Pepper can be determined.

Figure 14 – ID Field Hashing Pseudonymisation With Data Salting And Peppering

While the Pepper value seems to add little to the randomisation of the hash, it makes determining the pseudo random number generator harder and thus makes the hash more secure.
One possible approach to generating the Salt is to use a cryptographically secure pseudo random number generator5 (PRNG); less secure PRNGs are vulnerable to attacks. This ensures that the random salt values are very difficult to determine, which in turn makes brute force attacks virtually impossible. The following shows some examples of random numbers added to identifying data to generate hash codes:

HASH(CONCATENATE(IDAT1+1144360296176+2356573852518))
HASH(CONCATENATE(IDAT2+4700182946372+2356573852518))
HASH(CONCATENATE(IDAT3+1112492458021+2356573852518))
HASH(CONCATENATE(IDAT4+2755842713752+2356573852518))
HASH(CONCATENATE(IDAT5+6908485085952+2356573852518))

Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering

With this approach to augmenting the identifying data hash, in order to find the identifying data and the additional random data used to generate a hash code, an attacker needs to know three pieces of information:

1. The structure of the identifying data, in order to generate all possible permutations
2. The pseudo random number generator used to generate the Salt values
3. The specific Pepper code used, if one has been added

Figure 15 – Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering

5 Examples of cryptographically secure pseudo random number generators are Fortuna (https://www.schneier.com/academic/fortuna/) and PCG (https://www.pcg-random.org/).
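A minimal Python sketch of salted and peppered hashing, using the standard secrets module as a cryptographically secure source of salt values; the pepper value and field names are illustrative. Note that the per-field salt must itself be retained in the secured key store so the hash can later be reproduced for re-identification.

    import hashlib
    import secrets

    PEPPER = "2356573852518"   # fixed, secret pepper held outside the dataset

    def pseudonymise_field(idat: str):
        # Per-field salt from a cryptographically secure generator; retain it
        # (securely, with the depseudonymisation key) so the hash can be
        # reproduced.
        salt = secrets.token_hex(8)
        digest = hashlib.sha3_512(
            (idat + salt + PEPPER).encode("utf-8")).hexdigest()
        return digest, salt

    hashed, salt = pseudonymise_field("IDAT1")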
Content Hashing Pseudonymisation

Content hashing involves generating the hash token from the entire record contents rather than just the individual identifying fields. For example, the hash is generated from:

SHA3-512(IDAT1, ADAT1, SALT1, PEPPER) = df767164078cb0779d06c1de02de74c62192461e82bbb0d01d60c3c3664c9c69111d5d2f07415333e85cc04acfc1f7a204eadd8deead25a63c5a5ad343a5b3f2

This results in a very high degree of variability in the source data for the hashes and increases the difficulty of identifying the source data that generated the hash code.

Figure 16 – Content Hashing Pseudonymisation
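A minimal Python sketch of content hashing under the same assumptions (illustrative field names, salt and pepper as above), hashing the whole record rather than a single identifying field:

    import hashlib
    import secrets

    def content_hash(record: dict, salt: str, pepper: str) -> str:
        # Concatenate every field of the record (in a fixed order) with the
        # salt and pepper, then hash the whole string.
        material = ",".join(str(record[k]) for k in sorted(record))
        return hashlib.sha3_512(
            (material + salt + pepper).encode("utf-8")).hexdigest()

    token = content_hash({"idat1": "Alice Murphy", "adat1": "120.0"},
                         salt=secrets.token_hex(8), pepper="2356573852518")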
Pseudonymisation And Data Lakes/Data Warehouses

Data should be pseudonymised before the data lake and/or data warehouse is populated, as part of a Data Privacy By Design And By Default approach. At a high level, the stages involved are:

1. As part of the standard ETL/ELT process, the source data is pseudonymised and the depseudonymisation key is created.
2. The pseudonymised data is passed to the data lake. The data may remain in the data lake or it may be used to populate the data warehouse.
3. The pseudonymised data created by the ETL/ELT process may be used to update the data warehouse directly, bypassing the data lake stage.
4. The pseudonymised data in the data lake is used to update the data warehouse.

Figure 17 – Pseudonymisation and Data Lakes/Data Warehouses

The data in the data warehouse can then be made available for more general use within the organisation without any concerns about personal data being made available. This ensures compliance with GDPR Article 6 (see the GDPR Origin Of Pseudonymisation section above). In this case, pseudonymisation is used as part of the archiving process for data containing personal identifiable information after its main processing has been completed, when the data is being retained for historical and analytical purposes.
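A minimal Python sketch of stage 1, the pseudonymising ETL/ELT transform, with hypothetical field names: the cleaned rows flow to the data lake or warehouse while the key rows flow to a separate, secured key store.

    import secrets

    def pseudonymise_stage(rows, id_fields):
        # Replace identifying fields with a token before the data lake is
        # populated; emit the depseudonymisation key rows separately.
        clean_rows, key_rows = [], []
        for row in rows:
            token = secrets.token_hex(16)
            key_rows.append({"token": token,
                             **{f: row[f] for f in id_fields}})
            clean_rows.append(
                {"token": token,
                 **{k: v for k, v in row.items() if k not in id_fields}})
        return clean_rows, key_rows

    lake_rows, key_store_rows = pseudonymise_stage(
        [{"name": "Alice Murphy", "balance": 1200}], id_fields=["name"])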
Pseudonymisation Implementation

As noted in the ID Field Hashing Pseudonymisation section, storing a SHA3-512 hash code requires 64 bytes. In the case of some identifying data fields, this may be longer than the field itself, so pseudonymisation will increase storage requirements by replacing shorter fields with longer ones and by requiring the storage of separate depseudonymisation keys.

For example, a table in an Oracle database with 10 million records, five IDAT fields each with an average length of 20 bytes, five ADAT fields each with an average length of 8 bytes and one index column of 8 bytes will require about 1.22 GB of storage. With pseudonymisation of individual IDAT fields, these will each be replaced with 64 bytes, and the table size will increase to about 2.48 GB. There will also be a depseudonymisation key table holding both the original five IDAT fields, each with an average length of 20 bytes, and the five pseudonymisation fields of 64 bytes each, as well as one index column of 8 bytes. This will occupy 2.95 GB of storage. So, in this example, pseudonymisation increases storage requirements from 1.22 GB to 5.43 GB, an increase of 4.21 GB.

As noted earlier, replacing multiple source IDAT fields with a single pseudonymisation hash reduces the granularity with which the original source data can be retrieved: the entire set of source fields must be retrieved from the depseudonymisation key before the individual field required can be extracted. This reduces the storage overhead.

A separate depseudonymisation key table is not strictly required: the original source data, with its personal identifiable information, can itself serve as the depseudonymisation key, with the pseudonymised data storing a link to the row in the original source data. Alternatively, the hash code contained in the pseudonymised data could be compared with a hash code generated from the source data; however, in this case, if the hash generation process was augmented with salting and peppering, the correct salt would have to be regenerated.

Data Breaches and Attacks

The objectives of data privatisation technologies are:

• To prevent data breaches and attacks
• To minimise or eliminate the impact of a data breach or attack

Data privatisation technologies are just one of a number of layers of data protection an organisation should apply to its systems and data. Data access and data sharing arrangements introduce an additional level of data privatisation complexity in that the person or organisation being given access to the data may be the attacker, or the data protection arrangements implemented and operated by the recipient may not be at the same level as those of the source organisation. The source organisation should therefore assume that data sharing and access arrangements are implicitly compromised and act accordingly.

There are many security frameworks that can be used to define this wider organisation security framework, such as:

• Center for Internet Security (CIS) Critical Security Controls – https://www.cisecurity.org/controls/
29. 29. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 29
Data Breaches and Attacks
The objectives of data privatisation technologies are:
• To prevent data breaches and attacks
• To minimise or eliminate the impact of a data breach or attack
Data privatisation technologies are just one of a number of layers of data protection an organisation should apply to its systems and data. Data access and data sharing arrangements introduce an additional level of complexity in that the person or organisation being given access to the data may be the attacker, or their data protection arrangements may not provide the same level of protection as those of the source organisation. The source organisation should therefore assume that data sharing and access arrangements are implicitly compromised and act accordingly. There are many security frameworks that can be used to define this wider organisational security framework, such as:
• Center for Internet Security (CIS) Critical Security Controls – https://www.cisecurity.org/controls/
• Control Objectives for Information Technologies (COBIT) – https://www.isaca.org/resources/cobit
• NIST Cybersecurity Framework, 800-53 and 800-171 – https://csrc.nist.gov/Projects/risk-management/sp800-53-controls/downloads
• US FedRAMP (Federal Risk and Authorization Management Program – https://tailored.fedramp.gov/) Security Controls Baseline – https://tailored.fedramp.gov/static/APPENDIX%20A%20-%20FedRAMP%20Tailored%20Security%20Controls%20Baseline.xlsx
• Cybersecurity Maturity Model Certification (CMMC) – https://www.acq.osd.mil/cmmc/documentation.html
• Cloud Security Alliance (CSA) Cloud Controls Matrix (CCM) – https://cloudsecurityalliance.org/research/cloud-controls-matrix/
The analysis of these security standards and frameworks is outside the scope of this paper.
Pseudonymisation and Data Breaches
Pseudonymisation protects against data breaches by making data unusable should it be exposed.
Figure 18 – Pseudonymisation and Data Breaches
The ways in which pseudonymised data can be exposed and the impact of these breaches include:
1. The data may be exposed, accidentally or deliberately, by the entity with which the data is shared. If the data is correctly pseudonymised and the pseudonymisation algorithm is protected, the impact of such a breach would be low.
30. 30. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 30
2. The sharing organisation may cause the pseudonymised data to be exposed, for example because the mechanism used to share or provide access to the data is compromised. The impact of such a breach would be low.
3. The depseudonymisation key may be compromised. The risk of personal data re-identification is high if this happens.
4. The pseudonymisation algorithm may be compromised. The risk of personal data re-identification is high if this happens.
Differencing Attack
Differencing attacks work by running multiple partially overlapping queries against summarised data until the results can be combined to identify an individual. Differencing attacks apply especially to differential privacy data access platforms. For example, the following set of queries could be run against the data:
• How many people in the group are aged greater than N?
• How many people in the group aged greater than N have attribute A?
• How many people in the group aged greater than N have attribute B?
• How many people with ages in the range N-9 to N-5 are male?
• How many people with ages in the range N-4 to N are male?
After a number of queries, you may be able to establish that individuals, or small numbers of individuals, of a given sex and in a given age range have a defined attribute. Apparently anonymous summary results can be combined to reveal potentially sensitive insights and compromise confidentiality.
Differential privacy can be designed to reduce or eliminate the threat of differencing attacks by attaching a cost to each query. A budget is assigned to the dataset and the amount spent by queries against the dataset is tracked. When the budget is expended, no more queries can be run until the budget is increased. A differential privacy platform should also be able to track the queries performed by the consumers given access, to detect potential patterns of abuse.
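The following is a minimal sketch of this budget accounting; the class name and epsilon values are illustrative.

```python
class PrivacyBudget:
    """Track the cumulative privacy spend (epsilon) for one dataset and
    refuse further queries once the assigned budget is exhausted."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, query_epsilon: float) -> None:
        # Reject the query rather than partially spend the remaining budget.
        if self.spent + query_epsilon > self.total:
            raise PermissionError("Privacy budget exhausted: no further queries")
        self.spent += query_epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.3)   # first query runs
budget.charge(0.3)   # second query runs
# budget.charge(0.5) would now raise PermissionError (0.6 + 0.5 > 1.0)
```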
31. 31. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 31
Figure 19 – Differential Privacy and Differencing Attacks
Differencing Attack, Reconstruction Attack And Mosaic Effect
In addition to the differencing attack, there are other types of attack that can be performed on the data as it is made available, without needing any further access to the source data:
• A reconstruction attack uses the information from a differencing attack to identify how the original dataset was processed to create the summary.
• A mosaic effect attack involves combining the data with other (public) data sources to identify individuals. For example, apparently anonymised medical data containing dates of death can be combined with public death notice records to identify individuals (see the sketch below).
This results in a data attack topology that should be monitored to ensure data privatisation is maintained.
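The following sketch illustrates the mosaic effect using the death-notice example above; the data and column names are invented for illustration.

```python
import pandas as pd

# "Anonymised" medical extract: no names, but exact dates of death and
# counties survive as quasi-identifiers.
medical = pd.DataFrame({
    "date_of_death": ["2021-03-02", "2021-03-02", "2021-04-17"],
    "county": ["Cork", "Galway", "Cork"],
    "diagnosis": ["X", "Y", "Z"],
})

# Public death notices carry names alongside the same quasi-identifiers.
notices = pd.DataFrame({
    "name": ["A. Byrne", "B. Kelly", "C. Doyle"],
    "date_of_death": ["2021-03-02", "2021-03-02", "2021-04-17"],
    "county": ["Cork", "Galway", "Cork"],
})

# A simple join on the shared quasi-identifiers re-attaches identities
# to the supposedly anonymised diagnoses.
reidentified = medical.merge(notices, on=["date_of_death", "county"])
print(reidentified[["name", "diagnosis"]])
```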
32. 32. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 32
Figure 20 – Differencing Attack, Reconstruction Attack And Mosaic Effect
Differential Privacy
Differential privacy allows for the (public) sharing of information about a group or aggregate by describing the patterns of sub-groups within the group or aggregate while suppressing information about the individuals in it. Source data is aggregated and summarised and individual personal references are removed: the one-to-one correspondence between original and transformed data is gone. A viewer of the information cannot (or should not be able to) tell whether a specific individual's information was or was not used in the group or aggregate. This involves a differential privacy middleware tool inserting noise into the results returned from a query of the data. The greater the noise introduced, the less usable the data will be, but the lower the re-identification risk. It is a well-proven, widely used and robust technique6. It aims to eliminate the possibility of re-identification of individuals from the dataset being analysed. Individual-specific information is always hidden.
Differential privacy technologies are more complex than anonymisation and pseudonymisation as an approach to data privatisation. They require more technical skills and possibly the selection and implementation of a software platform. The remainder of this paper covers the topic of differential privacy in more detail.
An effective data privatisation and differential privacy operational solution consists at its core of a computational layer that introduces deliberate randomisation into the summarised results returned from a data query.
6 See The Algorithmic Foundations of Differential Privacy https://www.cis.upenn.edu/~aaroth/privacybook.html.
33. 33. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 33
This means that the action of running multiple queries across the dataset cannot be used to reconstruct the underlying individual records. It thus enables Privacy Preserving Data Mining (PPDM). The objective is to prevent access to or identification of specific, individual personal records or sensitive information while preserving the aggregated or structural properties of the data.
Figure 21 – Differential Privacy Operation
Differential privacy assigns a privacy budget to each dataset. The differential privacy engine introduces a fuzziness into the results of queries. Each query has a privacy cost and the total privacy expenditure across all queries by all users is tracked. When the budget has been spent, no further data queries can be performed until more privacy budget is allocated.
Effective and usable data privatisation and differential privacy means finding the right balance between data privacy and data utility. At one extreme, the solution would be to completely delete or prevent any access to data. While this preserves absolute data privacy, it also eliminates the utility and usefulness of the data.
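As a minimal sketch of the noise insertion described above, the Laplace mechanism (covered in the reference cited in the previous section) adds noise scaled to sensitivity/epsilon, so a smaller epsilon gives stronger privacy and lower utility; the values here are illustrative.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Return a differentially private count: Laplace noise with scale
    sensitivity/epsilon is added to the true result."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# A counting query has sensitivity 1: one individual changes the count by 1.
print(laplace_count(1250, epsilon=0.5))   # e.g. 1247.3 - close, but never exact
```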
34. 34. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 34
Figure 22 – Data Privatisation and Differential Privacy Balancing Act
This results in a balancing act between three factors:
1. Level of Detail Contained in Results Presented
2. Amount and Complexity of Data Processing Allowed
3. Level of Data Privacy
Relaxing or constraining one factor affects the other two. In order to determine the right equilibrium across these factors for your organisation and your data, you need to explicitly formalise your approach to data privacy and data utility in a policy. The policy should be accessible to, and understandable by, those responsible for managing data. It should also be formally defined so that its applicability and its subsequent implementation, operation and use can be verified; a sketch of one machine-readable form such a policy could take is shown below. Differential privacy technology can then be used to operationalise this policy, including monitoring its operation and use. Technology is a key enabler of data privatisation and differential privacy. It ensures and embeds Privacy By Design in your data access solution rather than data privacy concerns being addressed as an afterthought.
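As suggested above, a formally defined policy can be captured in machine-readable form so that its application can be verified; the following sketch is one hypothetical shape for such a definition, with illustrative field names and values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetPrivacyPolicy:
    """A machine-readable privacy policy for one dataset, so that the
    agreed settings can be reviewed, applied and audited consistently."""
    dataset: str
    epsilon_per_query: float      # noise level: lower means stronger privacy
    total_privacy_budget: float   # cap across all queries against the dataset
    min_group_size: int           # suppress results covering fewer individuals

policy = DatasetPrivacyPolicy(
    dataset="claims_2021",
    epsilon_per_query=0.25,
    total_privacy_budget=5.0,
    min_group_size=10,
)
```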
35. 35. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 35
Data Privatisation and Differential Privacy Solution Architecture Overview
This section describes the idealised architecture and design of an operational data privatisation and differential privacy solution. It illustrates a reference architecture that you can use to determine what solution components are needed and what must be installed, implemented and configured to create a usable and secure solution within your organisation. It can be used as a structured framework to define business and technical requirements. It can also be used to evaluate suitable products.
Figure 23 – Operational Data Privatisation and Differential Privacy Solution Architecture
The numbered components of this are:
1. Core Data Privatisation/Differential Privacy Operational Platform – this is the core differential privacy platform. It can be installed on-premises or on a cloud platform such as AWS, Google Cloud or Azure. It takes and summarises data from designated data sources and provides different levels and types of computational access to authorised users via a data API. It also provides a range of management and administration functions.
2. Data Sources – these represent data held in a variety of databases such as Oracle and SQL Server, other data storage systems such as HDFS, Cassandra, PostgreSQL and Teradata, and external data stores such as AWS S3 and Azure. The differential privacy platform needs read-only access to these data sources.
3. Data Access Connector – these are connectors that enable read-only access to data held in the data sources.
4. Data Ingestion and Summarisation – this takes data from the data sources, processes it and outputs it in a format suitable for access. It includes features to manage data ingestion workflows, scheduling and error identification and handling.
5. Data Analysis Data Store – the core differential privacy platform creates pre-summarised versions of the raw data from the data sources. The platform never provides access to individual source data records. The data is encrypted while at rest in the data store.
6. Metadata Store – the platform creates and stores metadata about each data source. This is used to optimise the data privacy of the result sets generated in response to data queries.
36. 36. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 36
7. Batch Task Manager – in addition to running online data queries, asynchronous batch tasks can be run for longer data tasks.
8. Access and Usage Log – this logs data accesses.
9. User Access API – the platform provides an API for common data analytics tools such as Python and R to generate and retrieve privatised randomised sets of data summaries as well as providing data querying and analytics capabilities. Data results returned from queries are encrypted while in transit. A hypothetical client sketch follows this list.
10. Data Visualisation Interface – this provides a data access and visualisation interface.
11. User Directory – the platform will use your existing user directories, such as Active Directory or Azure Active Directory, for user authentication and authorisation.
12. Authorised Internal Users – authorised internal users can access different datasets and perform different query types depending on their assigned access rights.
13. Authorised External Users – authorised external users can access different datasets and perform different query types depending on their assigned access rights.
14. Analytics and Reporting – this will allow you to analyse and report on user accesses to data managed by the platform.
15. Monitoring, Logging and Auditing – this will log both system events and user activities. This information can be used for platform management and planning as well as for identifying potential patterns of data use and possible abuse.
16. Data Access Creation, Validation and Deployment – this will allow new data sources to be onboarded and existing data sources to be managed and updated.
17. Management and Administration – this will provide facilities to manage the overall platform, such as adding and removing users and user groups and applying data privacy settings to different datasets.
18. Security and Access Control – this allows the management of different types of user access to different datasets.
19. Billing System Interface – you may want to charge for data access, either at a flat rate or by access or a mix of both. This represents an optional link to a financial management system to enable this.
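To illustrate component 9, the following is a hypothetical Python client sketch; the endpoint, payload shape and authentication scheme are assumptions for illustration, as each platform defines its own API.

```python
import requests

# Hypothetical base URL and bearer token; real platforms will differ.
API = "https://dp-platform.example.com/api/v1"
session = requests.Session()
session.headers["Authorization"] = "Bearer <token issued via your user directory>"

# Request a privatised summary; only noisy aggregates are ever returned.
response = session.post(f"{API}/query", json={
    "dataset": "claims_2021",
    "measure": "count",
    "group_by": ["county"],
})
print(response.json())   # e.g. {"Cork": 1247.3, "Galway": 892.8}
```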
37. 37. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 37
Differential Privacy Platform Solution Service Management Processes
Just like any other information technology solution, service management processes should be implemented for an operational differential privacy solution. Because a differential privacy solution exposes personal data, albeit in a summarised, randomised and anonymised manner, these service management processes are important. They should be part of any implementation project. This will maximise confidence in differential privacy technology in your organisation and reduce project risk. In turn, this will maximise the success of the platform and ensure that return on investment is optimised. The following table lists the most important service management processes in the context of a differential privacy solution. Your organisation will already have invested in information technology service management processes. These should be extended to the differential privacy platform.
Service Management Process – Overview and Scope
Access Management – operationalising security management policies relating to enabling authorised users to access the differential privacy platform and managing their access lifecycle.
Availability Management – ensuring the differential privacy platform meets its agreed availability targets and obligations by planning, defining, measuring, analysing and improving availability.
Capacity Management – planning, defining, measuring, analysing and delivering the facilities required to ensure that the differential privacy platform has sufficient capacity to meet its service level commitments in the short, medium and long term.
Compliance Management – ensuring that the design, operation and use of the differential privacy platform complies with legal and regulatory requirements and obligations.
Knowledge Management – ensuring that knowledge about the implementation, operation and use of the differential privacy platform is collated, stored and shared, maximising reuse and eliminating the need for knowledge rediscovery.
Operations Management – implementing and operating the housekeeping activities and tasks relating to the differential privacy solution, including monitoring and controlling the platform and backup and recovery.
Risk Management – the identification, evaluation and management of risks, including threats to and vulnerabilities of the differential privacy solution.
Security Management – ensuring the confidentiality of the data assets contained in the differential privacy solution. Your organisation will already have invested in security management; this needs to be extended to the differential privacy solution.
Service Continuity Management – ensuring that the continuity of operation of and access to the differential privacy solution is maintained in the event of problems.
Service Level Management – the definition and subsequent monitoring of service level targets and service level agreements relating to the access to and use of the differential privacy solution.
Differential Privacy Platform Deployment Options
This section outlines two solution deployment options: on-premises and in the cloud.
  38. 38. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 38 On-Premises Deployment The following diagram illustrates the key components of an on-premises implementation of a differential privacy solution. Figure 24 – Sample High-Level On-Premises Deployment If users outside the organisation are to be given access to the data platform then either an existing external access facility will be used to provide secure access or a new facility will have to be implemented.
  39. 39. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 39 Cloud Deployment The following diagram illustrates the key components of a cloud implementation of a differential privacy solution. Figure 25 – Sample High-Level Cloud Deployment For a cloud deployment, the key differences relate to how on-premises data is processed and transferred to the cloud platform and how data access users outside the organisation authenticate using an approach such as Azure Active Directory.
40. 40. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 40
Differential Privacy and Data Attacks
Data Privatisation and Differential Privacy Solution Planning
There are many different paths along the journey to the implementation of an operational data privatisation and differential privacy solution. The section Data Privatisation and Differential Privacy Next Steps on page 43 lists some of the possible stages along this journey. This section lists a possible set of activities and tasks that you can use to create a workplan for implementing a workable solution. The goal is to create an operational, supportable, maintainable and usable solution that provides access to your data without compromising data privacy and security. The implementation of a data privatisation and differential privacy solution is not very different from that of any other information technology solution your organisation wants to implement. The following high-level set of steps can be iterated several times as you move from an initial pilot implementation to a complete production solution over time.
• Create a prioritised inventory of potential data sources to which you would like to provide secure privatised computational access
• Profile the data: understand the structure and contents of the data, evaluate data quality and conformance with standards, identify the terms and metadata used to describe the data, and identify data relationships and dependencies, data sensitivity, the Privacy Exposure Limit (PEL) and the privacy requirements of each dataset
• Define the data extract processes
• Identify the target set of users for access to one or more of the datasets and define the type of access
• Define and agree user access processes and security requirements
41. 41. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 41
• Define the subsets of data to be made available for querying
• Perform capacity planning and analysis in terms of raw data volumes, expected number and type of data access transactions, data refresh frequency, caching of results for performance, creation of materialised views and other factors that give rise to resource requirements
• Define and agree platform audit logging and reporting, user activity monitoring, and event, exception and alert handling processes
• Define data access charging and billing
• Define the platform operational administration, maintenance and support processes
• Create a cost model for the solution including license costs, infrastructure, support and maintenance and any proposed revenue streams
• Decide on the deployment approach
• Define the organisational structures and service management processes needed to support the new solution
• Decide on the data integration approach, especially if the solution is to be deployed on a cloud platform
• Define the different types of training needed: administrator, support, data administrator, data query user
• Create, review, validate and approve a differential privacy solution architecture design that incorporates the information gathered in the previous steps
• Conduct a security review of the differential privacy solution
• Acquire trial versions of platform licenses
• Acquire deployment infrastructure, either on-premises or cloud
• Configure the differential privacy platform and its data sources
• Validate the platform
• Allow user access to the platform in a phased and controlled manner
Data Privatisation and Differential Privacy Solution Operation and Use
The following table lists some key differential privacy platform use cases and what they entail. These can be embedded into the operational service management processes listed in the section Differential Privacy Platform Solution Service Management Processes on page 36.
42. 42. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 42
Data Privatisation and Differential Privacy Use Case – Description
User Enrolment – The user must be defined in the organisation’s user directory. The process for enrolling users outside the organisation depends on the platform deployment model – on-premises or cloud. If the user is outside the organisation, then you may choose to use a cloud-based directory such as Azure Active Directory as a SAML identity provider. The user can be assigned to one or more groups, if needed. The user (or the groups to which the user belongs) will have different access rights to different datasets. The access rights include details on the subsets of data sources that can be queried and the number and type of data queries the user can run before being prevented from running additional requests.
Platform Usage Reporting and Analysis – The usage of the platform can be analysed in several ways:
1. The overall platform performance, rate of usage, number of users, and number and type of data query transactions, both online and batch, can be analysed and reported on. This will ensure that the platform is able to handle the current and expected future volume of data and its use.
2. The amount of data privacy exposed by user queries can be analysed to ensure that the privacy of the data being made available is maintained.
3. Any charges for access to your data can be determined and bills generated.
Addition of Data Source – The data source should be profiled to understand its structure and content. A link must be defined between the data source and the differential privacy platform summarised data subset. The data refresh frequency must be defined. The Privacy Exposure Limit (PEL) of the dataset must be defined. This is the maximum amount of privacy exposed by all data queries run on the dataset. As queries are run, this is incremented. Once the limit has been reached, no further access is possible.
Platform Security Auditing – Platform auditing can be performed at three levels:
1. The overall differential privacy platform can be audited to ensure that it guarantees that no personal information can be disclosed.
2. The privacy settings of individual datasets can be audited to ensure that they are appropriate for the sensitivity of their information.
3. The use of the platform can be audited through the analysis of the audit records collected, to determine unusual patterns of queries by users.
43. 43. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 43
Data Privatisation and Differential Privacy Next Steps
The previous section Data Privatisation and Differential Privacy Solution Planning on page 40 contains a generic set of steps involved in planning for differential privacy technology. The journey to creating an industrialised and productionised differential privacy solution can involve a number of points at which a decision to proceed to the next stage can be made.
Figure 26 – Data Privatisation and Differential Privacy Solution Journey
To allow your organisation to move along this journey, we have identified a number of practical engagement exercises that are designed to answer specific questions you might have and to provide you with specific deliverables. These engagements are:
1. Early Business Engagement and Differential Privacy Opportunity Validation
2. Differential Privacy Design Process
3. Differential Privacy Readiness Assessment
4. Differential Privacy Architecture Sprint
Implementing differential privacy technology is a means to an end rather than an end in itself. It is a way of resolving or addressing a data access problem or opportunity. These engagements are designed with this in mind. While the engagement types are described individually here, they can be combined to create a custom exercise to suit your specific needs. The following diagram illustrates at a high level the scope of each of these engagements in terms of their duration and where they fit into your journey to the successful implementation of differential privacy in your organisation.
44. 44. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 44
Figure 27 – Approaches to Data Privatisation and Differential Privacy Solution Scoping and Definition
The following table summarises the characteristics of each of these engagements.
What Question You Want Answered: I want a consulting exercise to define new business structures and associated solutions to address the potential data access provision opportunity
Engagement Type: Early Business Engagement and Differential Privacy Opportunity Validation
Level of Detail Included in Deliverable: Medium to High
Likely Engagement Duration: Medium
What You Get: A validated differential privacy opportunity across the areas of strategic fit, options evaluation and identification, procurement and implementation, expected whole-life revenue and costs, and a realistic and staged plan for achievement
What Question You Want Answered: I want a full detailed design created from an initial, not necessarily well-defined, idea that I can pass to solution delivery
Engagement Type: Differential Privacy Detailed Design
Level of Detail Included in Deliverable: High
Likely Engagement Duration: Medium
What You Get: A detailed end-to-end design for a differential privacy solution encompassing all solution components
45. 45. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 45
What Question You Want Answered: I want generalised solution options identified for the potential data access provision opportunity
Engagement Type: Differential Privacy Readiness Assessment
Level of Detail Included in Deliverable: Low to Medium
Likely Engagement Duration: Medium
What You Get: An understanding of the scope, requirements, objectives, approach and options for a differential privacy platform and a high-level understanding of the likely resources, timescale and cost required before starting the solution implementation
What Question You Want Answered: I have a good idea of the potential data access solution I want and I am looking for a quick view of the solution options and their indicative costs, resources and timescales to implement
Engagement Type: Differential Privacy Architecture Sprint
Level of Detail Included in Deliverable: Low to Medium
Likely Engagement Duration: Short
What You Get: A high-level design for an end-to-end differential privacy solution, focusing on technology aspects, that identifies whether the solution is feasible, worthwhile and justifiable
The following sections contain more detail on each of these engagement types.
Early Business Engagement and Differential Privacy Opportunity Validation
This engagement is concerned with analysing and defining the structure and operations of a business function within your organisation that will operate a differential privacy platform to provide controlled access to your data. It describes a target business model that includes identifying the differential privacy platform and its constituent components. The objective is to create a realistic, achievable, implementable and operable target differential privacy platform business justification to achieve the desired business targets. This is not an exact engagement with an easily defined and understood extent and duration. It has an inherently investigative and exploratory aspect, which means it must be given a degree of latitude. This is not an excuse for excessive analysis without reaching a conclusion. The goal is to produce results and answers within a reasonable time to allow decisions to be made based on evidence.
46. 46. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 46
Figure 28 – Early Business Engagement and Differential Privacy Opportunity Validation Process
The deliverables from this exercise will contain information in five key areas: strategic fit, options evaluation and identification, procurement and implementation, expected whole-life revenue and costs, and a realistic and staged plan for achievement.
Strategic Fit: business need and its contribution to the organisation’s data strategy; key benefits to be realised; critical success factors and how they will be measured.
Options Evaluation and Identification: cost/benefit analysis of realistic options for meeting the business need; statement of possible soft benefits that cannot be quantified in financial terms; identification of the preferred option and any trade-offs.
Procurement and Implementation: proposed sourcing option with reasons; key features of the proposed commercial arrangements; procurement approach/strategy with supporting details.
Whole-Life Revenue and Costs: statement of available funding and details of projected whole-life revenue from and cost of the project (acquisition and operation), including all relevant costs; expected financial benefits.
Realistic and Staged Plan for Achievement: plan for achieving the desired outcome with key milestones and dependencies; contingency plans; risks identified and mitigation plan; external supplier plans; resources, skills and experience required.
Differential Privacy Detailed Design
This is a very comprehensive engagement that produces a detailed end-to-end design for a differential privacy solution for your organisation. The approach to solution design is based on using six views as a structure to gather information and create the design. These six views are divided into two groups:
• Core Solution Architecture Views – concerned with the kernel of the solution:
− Business
− Functional
− Data
47. 47. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 47
• Extended Solution Architecture Views – concerned with solution implementation and operation:
− Technical
− Implementation
− Management and Operation
Figure 29 – Differential Privacy Detailed Design Views
The core dimensions/views define what the differential privacy solution must do, how it must operate and the results it will generate. The extended dimensions/views define how the solution must or should be implemented, managed and operated. They describe factors that affect, drive and support decisions made during the solution design process. Many of these factors will have been defined as requirements of the solution and so their delivery will be included in the solution design. Together these core and extended views describe the end-to-end solution design comprehensively.
Differential Privacy Readiness Assessment
The Differential Privacy Readiness Assessment is intended to allow the exploration of an as yet undefined solution that addresses a data access opportunity using differential privacy technology. The work is done from business, information technology and data perspectives. The objective is to understand the scope, requirements, objectives, approach and options for a differential privacy platform and to get a high-level understanding of the likely resources, timescale and cost required before starting the solution implementation. It looks to identify the changes needed within the organisation in order to successfully adopt differential privacy technology and use it to make your data more widely available.
48. 48. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 48
Figure 30 – Areas Covered in Differential Privacy Readiness Assessment
These domains of change can be categorised as follows:
• Business-Oriented Change Areas
− Facilities – existing and new facilities of the organisation, their types and functions
− Business Processes – current and future business process definitions, requirements, characteristics and performance
− Organisation and Structure – organisation resources and arrangement; business unit, function and team structures and composition; relationships, reporting and management; roles and skills
• Technology-Oriented Change Areas
− Technology and Infrastructure – current and future technical infrastructure including security, constraints, standards, technology trends, characteristics and performance requirements
− Applications and Systems – current and future applications and systems, including the core differential privacy platform and any extended components, their characteristics, constraints, assumptions, requirements, design principles, interface standards and connectivity to business processes
− Information and Data – the data to which privatised access is to be provided, data and information architecture, data integration, data access and management, and data security and privacy
The analysis also includes an extended change domain that covers the organisation’s operating environment and business landscape and the organisation’s data access and data availability strategy. This categorisation provides a structure for the engagement. It aims to define the changes across these domains that are needed to use differential privacy technology to enable data access.
49. 49. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 49
Differential Privacy Architecture Sprint
This engagement is designed to produce a high-level design for an end-to-end differential privacy technology solution. The focus is on the breadth of the technology solution rather than on depth and detail. The engagement recognises that the journey from initial business concept to operational solution is rarely simple. Not all business concepts progress to solution delivery projects and not all solution delivery projects advance to a completed operational solution. There is always an inevitable and necessary attrition during the process, and many reasons why this could and should happen: business and organisation needs and the operational environment both change, and budgets and resources are prioritised elsewhere. In this light, there is a need for a differential privacy solution design sprint that generates results quickly: one that identifies the feasible, worthwhile and justifiable concepts that merit proceeding to implementation and eliminates those that are not cost-effective.
The areas analysed in the differential privacy solution design sprint are:
• Systems/Applications – existing systems and applications that will participate in the operation of the differential privacy solution and which may need to be changed, and new systems and applications that will have to be delivered as part of the solution
• System Interfaces – links between systems for the transfer and exchange of data
• Actors – individuals, groups or business functions who will be involved in the operation and use of the differential privacy solution
• Actor-System Interactions – interactions between Actors and Systems/Applications
• Actor-Actor Interactions – interactions between Actors
• Functions – activities that are performed by actors using facilities and functionality provided by systems
• Processes – business processes required to operate the differential privacy solution and the business processes enabled by the solution, including new business processes and changes to existing business processes
• Journey – the standard journey through processes/functions and exceptions/deviations from this “happy path”
• Logical Data View – the data elements required
• Data Exchanges – the movement of data between Systems/Applications
This set of information combines to provide a comprehensive view of the potential differential privacy solution at an early stage.
  50. 50. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 50 For more information, please contact: Alan McSweeney alan@alanmcsweeney.com
