
Supporting GDPR Compliance through effectively governing Data Lineage and Data Provenance


The General Data Protection Regulation (GDPR) is an EU regulation governing how organizations handle personal data. It replaced the earlier EU Data Protection Directive (implemented in the UK by the Data Protection Act 1998, DPA) and has been enforceable since 25 May 2018. Under GDPR, organizations must process personal data lawfully and securely, keep it accurate, and retain it for no longer than necessary.

They should be able to report on the purposes of processing and the categories of personal data they control, and to demonstrate compliance with GDPR policies. The challenge is that organizations must record every point where personal data is processed and show accountability for that activity, which makes the data lineage and data provenance aspects of data governance even more critical.

Governing data lineage makes an organization’s data flows understandable and makes it possible to identify and document the legal justification for each type of activity. In addition, GDPR requires records of personal data processing as evidence, which implies the need to record and govern data provenance effectively.

In this talk we show how effectively governing data lineage and data provenance lets us verify that the processing of personal data within an organization complies with GDPR regulatory requirements.

Published in: Technology

  1. 1. Paraskevi Zerva, Cognition & Knowledge Representation Lead: Supporting GDPR Compliance through effectively governing Data Lineage & Data Provenance
  2. 2. Context
     ❖ Introductions
     ❖ Definitions
     ❖ EDG Metadata Governance Platform & Use Cases
     ❖ EDG Showcase Data Lineage
     ❖ GDPR in a Nutshell
     ❖ How effectively governing Data Lineage supports GDPR Compliance
     ❖ GDPR Use Case for Time Limits of Personal Data Erasure – Data Retention
     ❖ GDPR Policies and Compliance
     ❖ GDPR Compliance Use Case
  3. 3. Introductions
     ❖ WhoAmI?
       ▪ Paraskevi Zerva
       ▪ Cognition & Knowledge Representation Lead (Entellect, Elsevier)
       ▪ Previously worked as an Information Architect for Enterprise Data Governance at JPMorgan Chase.
       ▪ PhD in “Provenance of Data for Compositions of Services”.
     ❖ What’s my focus?
       ▪ Work on the data governance strategy for Elsevier Entellect to support effective data governance across Entellect’s software development life-cycle.
       ▪ Build a common representation for analysis & validation of Elsevier Entellect’s data.
       ▪ Consolidate data lineage & provenance information with other data assets to provide a unified data governance ecosystem.
     ❖ What am I going to talk about?
       ▪ How effectively governing data lineage/provenance supports GDPR compliance within the Enterprise Data Governance Platform.
  4. 4. Definitions
     ❖ Data governance:
       ▪ is a set of processes that ensures data assets are managed efficiently, giving you control over your data and a better understanding of it,
       ▪ ensures that data can be trusted and that organizations can show accountability for their data assets with regard to data quality, retention, data lineage etc.,
       ▪ describes an evolutionary process in which a company sets up the processes to handle information so that it can be used by the entire organization,
       ▪ encompasses data/metadata collection, analysis and validation of rules involving data (e.g., business (domain) rules, standards, data quality, entitlements, SOR, etc.)
     ❖ Data lineage refers to capturing the sequence of data flows involving a data element. It can be represented visually to trace the movement of a data artefact from its source to its destination and to understand where it originates.
     ❖ Data provenance refers to recording the processing activities applied to data (e.g., through provenance loggers).
     ❖ GDPR is the General Data Protection Regulation.
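
The data lineage definition on slide 4 (following a data element back from its destination to its sources) can be illustrated with a small graph walk. This is only a sketch under stated assumptions: the element names and the `FLOWS` list are hypothetical and are not taken from the EDG platform.

```python
from collections import deque

# Hypothetical data flows: each (source, target) pair says data moves
# from source to target. None of these names come from the talk.
FLOWS = [
    ("hr_form", "hr_db.employee"),
    ("hr_db.employee", "payroll_db.salary"),
    ("payroll_db.salary", "finance_report"),
    ("timesheet_app", "payroll_db.salary"),
]


def upstream_lineage(element, flows):
    """Return every element that `element` originates from, nearest first."""
    parents = {}
    for source, target in flows:
        parents.setdefault(target, []).append(source)

    seen = {element}
    order = []
    queue = deque([element])
    while queue:  # breadth-first walk against the direction of the flows
        current = queue.popleft()
        for source in parents.get(current, []):
            if source not in seen:
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


print(upstream_lineage("finance_report", FLOWS))
```

Walking the flows backwards answers the "where does this originate from" question; walking them forwards would answer the impact question of where a data element ends up.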
  5. 5. Enterprise Data Governance
     ❖ Unified platform for Corporate Technology (CT) to support efficient data governance and metadata management.
     ❖ Team’s mission:
       ✓ Integrate CT metadata from various sources in one place in a common way (RDF), regardless of the input format.
       ✓ Consolidate lineage/provenance information together with other metadata.
     ❖ EDG ingests different formats like XML, JSON, CSV (Collect)
     ❖ EDG translates the data/metadata into a common language format (RDF) (Standardize)
       ✓ Schemas are expressed as OWL ontologies.
       ✓ SHACL (Shapes Constraint Language) is used for building interfaces and different user representations over the same underlying core schema.
       ✓ SPIN is used for transformation.
     ❖ We form a connected graph data structure queryable across all internal and external reference datasets (Connect)
     Pipeline stages shown on the slide: Collect, Standardize, Connect, Refine
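
The Collect and Standardize steps on slide 5 can be sketched as follows: records arriving in different formats are parsed and then rewritten as subject/predicate/object triples. This is a minimal illustration, not the EDG implementation. The `urn:example:` namespace, the sample inputs, and the function names are assumptions, and plain tuples stand in for RDF.

```python
import csv
import io
import json

# Hypothetical namespace and sample inputs, for illustration only.
NS = "urn:example:"
CSV_INPUT = "id,name\nAPP1,Payroll Application\n"
JSON_INPUT = '[{"id": "APP2", "name": "Travel Booking"}]'


def collect(csv_text, json_text):
    """Collect: parse each input format into a list of dictionaries."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return rows + json.loads(json_text)


def records_to_triples(records):
    """Standardize: turn flat records into (subject, predicate, object) triples."""
    triples = []
    for record in records:
        subject = NS + record["id"]
        for key, value in record.items():
            if key != "id":
                triples.append((subject, NS + key, value))
    return triples


graph = records_to_triples(collect(CSV_INPUT, JSON_INPUT))
for triple in graph:
    print(triple)
```

Once every source is expressed in the same triple form, the Connect step amounts to querying one graph instead of several formats.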
  6. 6. Enterprise Data Governance Ecosystem ✓ Enterprise Metadata ✓ LDMs/PDMs ✓ Req Reports ✓ Glossaries ✓ Taxonomies ✓ Codelists ✓ External Standards Ingestion ✓ Data Models ✓ Provenance Logging ✓ Movement ✓ Feedback Unified Data/Metadata Hub (EDG) Sources People, Processes, Tools, Services, Conformed Data ✓ RACI (roles) ✓ APIs ✓ Discovery ✓ Reporting Uses
  7. 7. Data Governance Business Cases
     ❖ Capture/manage governance requirements for the complete portfolio of CT applications.
     ❖ Support the software development lifecycle and compliance/regulatory requirements (GDPR).
     ❖ Demonstrate data lineage* (where the different data sources originated from) to show accountability, control and understanding of the data for regulatory purposes.
     ❖ Exhibit data provenance** (how the data is processed/transformed across the platform).
     ❖ Track data movement/data transfers between applications (traceability***).
     ❖ Provide contextual alignment with firm-wide standards, taxonomies and glossaries.
     ❖ Provide validation capabilities for data quality and data accuracy.
     ❖ Exhibit accountability with regard to entitlements (by effectively governing data provenance).
     ❖ Ontologies are extended to provide crosswalks between models and ecosystems so we can answer questions such as:
       ✓ Which applications contain (S)PI data affected by GDPR? (Regulatory Reporting)
       ✓ Saudi Arabia has changed its retention policy. Which applications are impacted? (Reporting)
       ✓ Who are the owners of particular data requirements documentation? (RACI)
     * Data lineage refers to capturing the sequence of data flows involving a data element. It can be represented visually to discover the data flow/movement from its source to its destination.
     ** Data provenance refers to the recording activity (through provenance loggers).
     *** Traceability indicates the ability to track a data construct back to the construct it was derived from, e.g., the original system where it was created.
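
The retention-policy question on slide 7 (a country changes its policy, which applications are impacted?) is a join across connected metadata. The sketch below shows the idea over plain tuples. All identifiers and property names in `TRIPLES` are hypothetical stand-ins, not the ontology used in EDG.

```python
# Hypothetical triples; identifiers and property names are illustrative.
TRIPLES = [
    ("app:35632", "recordRetentionClass", "rcc:PAY1060"),
    ("app:38537", "recordRetentionClass", "rcc:PAY1080"),
    ("rcc:PAY1060", "dataRetentionPolicy", "policy:PAY1060_SA"),
    ("rcc:PAY1060", "dataRetentionPolicy", "policy:PAY1060_GB"),
    ("rcc:PAY1080", "dataRetentionPolicy", "policy:PAY1080_GB"),
    ("policy:PAY1060_SA", "country", "SA"),
    ("policy:PAY1060_GB", "country", "GB"),
    ("policy:PAY1080_GB", "country", "GB"),
]


def objects(subject, predicate, triples):
    """Return all objects of the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def impacted_applications(country, triples):
    """Applications whose record class has a retention policy for `country`."""
    apps = set()
    for app, predicate, record_class in triples:
        if predicate != "recordRetentionClass":
            continue
        for policy in objects(record_class, "dataRetentionPolicy", triples):
            if country in objects(policy, "country", triples):
                apps.add(app)
    return sorted(apps)


print(impacted_applications("SA", TRIPLES))
```

The path application, record class, policy, country is the crosswalk; the same traversal with a different starting point answers the ownership and regulatory reporting questions.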
  8. 8. EDG – Metadata Governance Model
     ❖ The diagram depicts how metadata from various sources is conceptually connected in EDG.
     ❖ The Data Lineage flow connects the following artefacts:
       ➢ Business Terms (Data Dictionary Metadata)
       ➢ Data Requirements
       ➢ Logical Data Model Artefacts
       ➢ Physical Data Model Artefacts
       ➢ Application/Deployment (Technical) Metadata
     ❖ Data Traceability is the ability to track the links from one artefact to another.
     ❖ Data Provenance allows tracking of:
       ➢ Ownership/Entitlements/Access Control Metadata
       ➢ Business Capability/Process Metadata
     Diagram labels: Data Lineage, Data Traceability, Data Provenance
  9. 9. EDG – Showcase Data Lineage
  10. 10. Data Assets
  11. 11. Logical Data Model
  12. 12. Physical Data Model
  13. 13. Physical Database Realization
  14. 14. Link to Technical Metadata
  15. 15. Logical Data Model
  16. 16. Logical Entity
  17. 17. Logical Attribute
  18. 18. Link to Data Requirement
  19. 19. Link to Standard Glossary
  20. 20. Physical Data Model
  21. 21. Physical Table
  22. 22. Physical Column
  23. 23. Data Lineage Diagram: Logical/Physical Data Elements (diagram labels: Mapping to Technical Asset; LDM Artefacts; PDM Artefacts; Mapping to Data Requirement; Mapping to Standard Glossary; PDM to LDM mapping)
  24. 24. Data Lineage: Logical Data Elements (diagram labels: LDM Artefacts; Mapping to Data Requirement; Mapping to Standard Glossary; PDM to LDM mapping)
  25. 25. Data Lineage: Physical Data Elements (diagram labels: Mapping to Technical Asset; PDM Artefacts)
  26. 26. GDPR in a Nutshell
     ❖ GDPR is an EU regulation governing how organizations handle personal data. It replaced the earlier EU Data Protection Directive (implemented in the UK by the Data Protection Act 1998, DPA) and has been enforceable since 25 May 2018.
     ❖ According to GDPR, personal data should be:
       ➢ Processed fairly and lawfully
       ➢ Kept accurate and up to date
       ➢ Kept no longer than is necessary (retention period)
       ➢ Processed in a secure way
     ❖ GDPR uses the terms controller and processor to describe the parties involved in processing personal data (PI).
     ❖ Controller: the party that decides what data is extracted, the purpose it is used for, and who is involved in the processing.
       ✓ should be able to demonstrate compliance (accountability metrics).
       ✓ should be able to report on the purposes of processing and the categories of PI it controls.
     ❖ Processor: the party responsible for processing the data on behalf of the controller.
       ✓ should maintain records of the categories of processing activities of PI and the means by which it is processed.
       ✓ should be able to report on transfers of personal data to a third country or an international organization, and can be held responsible for a data breach (requirement for breach notification).
  27. 27. How governing Data Lineage Supports GDPR Compliance
     ❖ GDPR Challenge:
       ➢ Records of personal data processing are required as evidence of compliance.
       ➢ Organizations are required to record every point where personal data is processed and to show accountability for it.
     ❖ Solution:
       ➢ GDPR makes the lineage aspect of data governance even more critical.
       ➢ Governing data lineage makes your data-flow activities understandable and lets you identify and document the legal justification for each type of activity.
       ➢ When data lineage is represented visually, it shows how data moves from its source to its destination and how it is transformed along the way.
       ➢ On top of that, GDPR requires records of personal data processing as evidence, which implies the need for data provenance.
       ➢ Data provenance refers to recording how the data was derived/generated and processed. It makes it possible to verify that the process and steps used to obtain a result comply with a given set of requirements.
       ➢ In our business case the given requirements are GDPR regulatory requirements, so data lineage and provenance become the tools for showing accountability with regard to GDPR compliance.
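
The role slide 27 gives to data provenance (recording each processing step so that compliance can later be verified against a set of requirements) can be sketched with a tiny provenance logger. The class names, field names, and the single rule checked here (every recorded activity needs a documented legal basis) are assumptions made for illustration, not the provenance loggers used in EDG.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProcessingRecord:
    activity: str
    data_category: str
    legal_basis: str  # empty string when no justification was documented
    recorded_at: datetime


class ProvenanceLog:
    """Append-only record of processing activities (illustrative only)."""

    def __init__(self):
        self.records = []

    def record(self, activity, data_category, legal_basis=""):
        entry = ProcessingRecord(
            activity, data_category, legal_basis, datetime.now(timezone.utc)
        )
        self.records.append(entry)
        return entry

    def undocumented(self):
        """Activities recorded without a legal basis, in recording order."""
        return [r.activity for r in self.records if not r.legal_basis]


log = ProvenanceLog()
log.record("load_payroll", "salary", legal_basis="contract")
log.record("export_to_vendor", "salary")
print(log.undocumented())
```

The verification step is a query over what was recorded, so the evidence and the compliance check come from the same log.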
  28. 28. GDPR Governance Use Case
     ❖ GDPR Article 30 Data Requirement
       ➢ Provide time limits for erasure of the different categories of data, as required by the record retention policy.
     ❖ The regulatory requirement is translated into a report and accountability metrics that:
       ➢ Return the applications in scope of GDPR for Corporate Technology.
       ➢ Return the record class codes of the in-scope applications, based on the record retention policies per country.
       ➢ Notify application owners when record retention policies change, and verify that the changes comply with GDPR regulatory requirements.
     * Record class codes are used to determine how long to keep each record for each jurisdiction.
     ** A record class code (RCC) is a category used to group similar types of records in JPMC’s master record retention schedule.
     *** Record retention requirements are categorized by record class code, by country, and in some cases by the business function of the record.
  29. 29. Retention Conceptual Model (diagram labels): ia.jpmc: SEAL_103249 a edg:BusinessApplication a ia.jpmc:RecordRetentionClass ia.jpmc:application RecordRetentionClass a ia.jpmc:dataRetentionPolicy ia.jpmc:dataRetentionPolicy SOR: SEAL SOR: Retention Manager Record Retention Policy Ontology Technical Standard Ontology
  30. 30. GDPR In Scope Business Application (screenshot labels): GDPR compliance information; Provenance information; GDPR in Scope – contains pi; OBK1060 | Payroll Services (GRM); PAY100 | Employee Compensation Contribution (GRM); record retention class; Record Retention Class
  31. 31. Record Retention Code (labels: Data Retention Record Class; data retention policy)
  32. 32. Record Retention Policy (fields: country, retention period, disposition, retention event, retention period unit)
  33. 33. EDG – RCC Code Diagram (label: data retention policy)
  34. 34. EDG – RCC Policy Diagram (fields: country, disposition, retention period unit, retention period, retention event)
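
The record retention policy fields shown on slides 32 and 34 (country, retention period, retention period unit, retention event, disposition) are enough to compute a time limit for erasure. The sketch below is a hedged illustration: the class, the unit values `"YEARS"` and `"DAYS"`, and the example policy are assumptions, not values from the actual retention schedule.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    country: str
    retention_period: int
    retention_period_unit: str  # "YEARS" or "DAYS" in this sketch
    retention_event: str        # the event that starts the retention clock
    disposition: str            # what happens when the period ends


def erasure_due(policy, event_date):
    """Date on which the retention period ends, counted from the event date."""
    if policy.retention_period_unit == "DAYS":
        return event_date + timedelta(days=policy.retention_period)
    if policy.retention_period_unit == "YEARS":
        target_year = event_date.year + policy.retention_period
        try:
            return event_date.replace(year=target_year)
        except ValueError:  # 29 February in a non-leap target year
            return date(target_year, 2, 28)
    raise ValueError(f"unsupported unit: {policy.retention_period_unit}")


def is_overdue(policy, event_date, today):
    """True when the record is still held after its retention period ended."""
    return today > erasure_due(policy, event_date)


# Hypothetical policy values, for illustration only.
policy = RetentionPolicy("GB", 6, "YEARS", "employment ended", "delete")
print(erasure_due(policy, date(2018, 5, 25)))
```

Keeping the retention event separate from the period matters because the clock starts at the event (for example the end of employment), not at the moment the record was created.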
  35. 35. Query RCC Data for GDPR in scope Apps

      prefix rdfs: <>
      prefix edg: <>
      prefix ia.jpmc: <>

      SELECT DISTINCT ?appId ?appName ?gdprScope ?lob ?rccClassCode ?rccLabel ?rccPolicy
      FROM <urn:x-evn-master:seal>
      FROM <urn:x-evn-master:grm>
      WHERE {
        {
          ?app a edg:BusinessApplication .
          ?app edg:name ?appName .
          ?app edg:identifier ?appId .
          ?app ia.jpmc:inScopeForGDPR ?gdprScope .
          ?app ia.jpmc:lineOfBusiness ?lob .
          # Keep only CT applications that are in scope for GDPR.
          FILTER regex(?gdprScope, "YES")
          FILTER regex(?lob, "CT")
        }
      }

      Result:

      appId | appName                            | gdprScope | lob | rccClassCode     | rccPolicy
      35632 | Payroll Application                | YES       | CT  | GRM_AUD_PAY_1060 | Payroll Services
      38537 | KPMG Link – Global Business Travel | YES       | CT  | GRM_AUD_PAY_1080 | Payroll Accounting
  36. 36. SPIN Rules (1) - Inferencing

      # STEP 401: Create Record Classes
      CONSTRUCT {
        ?recordClassCodeU a ia.jpmc:DataRetentionRecordClass .
        ?recordClassCodeU rdfs:label ?rccLabel .
        ?recordClassCodeU edg:identifier ?rccClassCode .
        ?recordClassCodeU edg:name ?rccName .
        ?recordClassCodeU edg:description ?rccDescription .
        ?recordClassCodeU ia.jpmc.go:dataRetentionPolicy ?rccPolicyU .
      }
      WHERE {
        ?this a RetentionExport:RetentionExport .
        # Read the exported columns of the current row.
        BIND (spl:object(?this, RetentionExport:recordClassCode) AS ?rccClassCode) .
        BIND (spl:object(?this, RetentionExport:country) AS ?country) .
        BIND (spl:object(?this, RetentionExport:countryCode) AS ?countryCode) .
        BIND (spl:object(?this, RetentionExport:recordClassName) AS ?rccName) .
        BIND (spl:object(?this, RetentionExport:recordClassDescription) AS ?rccDescription) .
        BIND (ia.jpmc:BuildDataRetentionPolicyClassURI(?rccClassCode) AS ?recordClassCodeU) .
        # Match the exported country code against the reference country data.
        ?countryU country:countryId ?RDICountryCode .
        BIND (str(?RDICountryCode) AS ?cntryCodeLabel) .
        FILTER (?countryCode = ?cntryCodeLabel) .
        BIND (fn:concat(?rccClassCode, "|", ?rccName, "(GRM)") AS ?rccLabel) .
        BIND (ia.jpmc:BuildDataRetentionPolicyRecordURI(?rccClassCode, ?RDICountryCode) AS ?rccPolicyU) .
      }
  37. 37. SPIN Rules (2) - Inferencing

      # STEP 402: Create Record Retention Policy
      CONSTRUCT {
        ?rccPolicyU a edg:DataRetentionPolicy .
        ?rccPolicyU rdfs:label ?policyLabel .
        ?rccPolicyU edg:identifier ?policyIdentifier .
        ?rccPolicyU edg:name ?policyName .
        ?rccPolicyU ia.jpmc.go:retentionDisposition ?disposition .
        ?rccPolicyU ia.jpmc.go:retentionEvent ?retentionEvent .
        ?rccPolicyU ia.jpmc.go:retentionPeriod ?retentionPeriod .
        ?rccPolicyU ia.jpmc.go:retentionPeriodUnit ?retentionPeriodUnit .
        ?rccPolicyU edg:country ?country .
      }
      WHERE {
        ?this a RetentionExport:RetentionExport .
        # Read the exported columns of the current row.
        BIND (spl:object(?this, RetentionExport:policyIdentifier) AS ?policyIdentifier) .
        BIND (spl:object(?this, RetentionExport:retentionDisposition) AS ?disposition) .
        BIND (spl:object(?this, RetentionExport:retentionEvent) AS ?retentionEvent) .
        BIND (spl:object(?this, RetentionExport:retentionPeriod) AS ?retentionPeriod) .
        BIND (spl:object(?this, RetentionExport:retentionPeriodUnit) AS ?retentionPeriodUnit) .
        BIND (spl:object(?this, RetentionExport:country) AS ?country) .
        BIND (ia.jpmc:BuildDataRetentionPolicyClassURI (?recordClassCode) AS
        BIND (fn:concat(?policyIdentifier, "|", ?policyName, "(GRM)") AS ?policyLabel) .
        BIND (ia.jpmc:BuildDataRetentionPolicyURI(?policyIdentifier, ?country, ?policyName) AS ?rccPolicyU) .
      }
  38. 38. SHACL Property Constraint
  40. 40. GDPR POLICIES & COMPLIANCE
     ❖ Policies define guidelines for handling and implementing specific security or regulatory issues.
     ❖ Focusing on the policy requirements for data protection, we have built a policy/compliance model that:
       ➢ Aims to validate GDPR compliance for the compliance objects targeted by a policy.
       ➢ Shows accountability with regard to GDPR policy requirements.
     Objectives
     ❖ Bridge the gap between GDPR legal obligations and operational-level technology controls by using semantic modelling to model the critical policy and compliance aspects.
     ❖ Use inferencing to preserve accountability for processing activities that handle PI data subject to regulatory compliance.
  41. 41. GDPR RETENTION COMPLIANCE
     Diagram nodes: edg:Policy; edg:DataPolicy; edg:ComplianceAspect; edg:PolicyRequirement; ia.jpmc:DataRetentionRecordClass; ia.jpmc:Country; ia.jpmc:BusinessFunction; edg:DataRetentionPolicy; edg:RequirementAsset; edg:GDPRRegulatoryRequirement
     Diagram edges: rdfs:subclass; rdf:type; ia.jpmc:categorizedByCountry; ia.jpmc:categorizedByBusinessFunction; edg:compliesWith; ia.jpmc:dataRetentionPolicy; edg:hasRequirement
  43. 43. Thanks for attending ☺ Q & A. Name: Paraskevi Zerva
  44. 44. BACKUP SLIDES
  45. 45. Extending Data Lineage to Data Provenance
     Diagram nodes: prov:Activity; Non-Linear Activity; Linear Activity; Workflow; Pipeline; Software Program; Software Program Execution; Software Program Computation; prov:Agent; prov:Server; prov:Entity; Processable
     Diagram edges: rdfs:subclass; composedOf/contains; computationOf; executionOf; runsOn; input (subproperty of prov:used); output (subproperty of prov:wasGeneratedBy); prov:wasDerivedFrom
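
The input/output relations on slide 45 suggest how lineage can be read off recorded provenance: if an activity used one entity and generated another, the second can be treated as derived from the first. The sketch below makes that step explicit. It is a simplifying assumption rather than an inference PROV performs automatically, and the activity and entity names are hypothetical.

```python
# Hypothetical provenance statements, for illustration only.
# (activity, entity): the activity used the entity as input.
USED = [
    ("normalize", "raw_hr_export"),
    ("aggregate", "clean_hr_records"),
]
# (entity, activity): the entity was generated by the activity.
GENERATED_BY = [
    ("clean_hr_records", "normalize"),
    ("payroll_summary", "aggregate"),
]


def derive(used, generated_by):
    """Pair each output with every input of the activity that generated it."""
    inputs = {}
    for activity, entity in used:
        inputs.setdefault(activity, []).append(entity)
    return sorted(
        (output, source)
        for output, activity in generated_by
        for source in inputs.get(activity, [])
    )


derived_from = derive(USED, GENERATED_BY)
print(derived_from)
```

Each resulting pair reads as "output was derived from source", which is the lineage edge that the earlier slides visualise.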