The General Data Protection Regulation (GDPR) is a set of EU rules governing how organizations handle personal data. It replaces the earlier Data Protection Act (DPA) and has been enforced since May 2018. Under GDPR, organizations must process personal data lawfully and securely, keep it accurate, and retain it for no longer than necessary.
They should be able to report on the purposes of processing and the categories of personal data they control, and to demonstrate compliance with GDPR policies. The challenge organizations face is to record every point where personal data is processed and to show accountability for that processing. This makes the data lineage and data provenance aspects of data governance even more critical.
Governing data lineage makes it possible to understand the organization’s data flow activities and to identify and document the legal justification for each type of activity. In addition, GDPR requires evidence in the form of records of personal data processing, which implies the need to record and govern data provenance effectively.
In this talk we show how effectively governing data lineage and data provenance lets us verify that the processing of personal data within an organization complies with GDPR regulatory requirements.
Supporting GDPR Compliance through effectively governing Data Lineage and Data Provenance
1. Paraskevi Zerva
Cognition & Knowledge Representation Lead
p.zerva@elsevier.com
2. Context
❖ Introductions
❖ Definitions
❖ EDG Metadata Governance Platform & Use Cases
❖ EDG Showcase Data Lineage
❖ GDPR in a Nutshell
❖ How effectively governing Data Lineage supports GDPR Compliance
❖ GDPR Use Case for Time Limits of Personal Data Erasure – Data Retention
❖ GDPR Policies and Compliance
❖ GDPR Compliance Use Case
3. Introductions
❖ WhoAmI?
▪ Paraskevi Zerva
▪ Cognition & Knowledge Representation Lead (Entellect, Elsevier)
▪ Previously worked as an Information Architect for Enterprise Data Governance at
JPMorgan Chase.
▪ PhD in “Provenance of Data for Compositions of Services”.
❖ What’s my focus?
▪ Work on the data governance strategy for Elsevier Entellect to support effective data
governance across Entellect’s software development life-cycle.
▪ Build a common representation for analysis & validation of Elsevier Entellect’s data.
▪ Consolidate data lineage & provenance information with other data assets to provide
a unified data governance ecosystem.
❖ What am I going to talk about?
▪ How effectively governing data lineage/provenance supports GDPR compliance within
the Enterprise Data Governance Platform.
4. Definitions
❖ Data governance:
▪ is a set of processes that ensures data assets are efficiently managed and
helps you gain control over, and a better understanding of, your data,
▪ ensures that data can be trusted and organizations can show accountability about
their data assets with regards to data quality, retention, data lineage etc.,
▪ describes an evolutionary process for a company setting up the processes to handle
information so that it may be utilized by the entire organization,
▪ encompasses data/metadata collection, analysis and validation of rules involving
data (e.g., business (domain) rules, standards, data quality, entitlements, SOR, etc.)
❖ Data lineage refers to capturing the sequence of data flows involving a data element. It
can be represented visually to show the movement of a data artefact from its source to
its destination and so to understand where it originates.
❖ Data provenance refers to recording the processing activities performed on data (e.g.,
through provenance loggers).
❖ GDPR is the General Data Protection Regulation.
5. Enterprise Data Governance
❖ Unified platform for Corporate Technology (CT) to support efficient data governance and
metadata management.
❖ Team’s mission:
✓ Integrate CT metadata from various sources in one place in a common way (RDF), regardless
of the input format
✓ Consolidate lineage/provenance information together with other metadata.
❖ EDG ingests different formats like XML, JSON, CSV (Collect)
❖ EDG translates the data/metadata into a common language format (RDF) (Standardize)
✓ Schemas are expressed as OWL ontologies.
✓ SHACL (Shapes Constraint Language) is used for interface building
and for different user representations over the same underlying core schema.
✓ SPIN is used for transformation.
❖ We form a connected graph data structure that is queryable
across all internal and external reference datasets (Connect)
Diagram stages: Collect, Standardize, Connect, Refine
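The Collect, Standardize and Connect steps above can be illustrated with a minimal sketch. It uses plain Python tuples as stand-ins for RDF triples and does not use OWL, SHACL or SPIN; the `ex:` prefix, the field names and the sample records are all hypothetical and are not taken from EDG.

```python
import csv
import io
import json

# Hypothetical sample inputs in two different formats (Collect).
JSON_INPUT = '[{"id": "app1", "name": "Payroll", "owner": "alice"}]'
CSV_INPUT = "id,name,owner\napp2,Billing,bob\n"


def collect(json_text, csv_text):
    """Parse each source format into a list of plain dict records."""
    records = json.loads(json_text)
    records.extend(csv.DictReader(io.StringIO(csv_text)))
    return records


def standardize(records):
    """Turn each record into (subject, predicate, object) triples."""
    triples = set()
    for record in records:
        subject = "ex:" + record["id"]
        triples.add((subject, "rdf:type", "ex:Application"))
        for field, value in record.items():
            if field != "id":
                triples.add((subject, "ex:" + field, value))
    return triples


def connect(*triple_sets):
    """Merge triple sets into one graph that can be queried as a whole."""
    graph = set()
    for triples in triple_sets:
        graph |= triples
    return graph


def objects(graph, subject, predicate):
    """Return all objects for a given subject and predicate, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


graph = connect(standardize(collect(JSON_INPUT, CSV_INPUT)))
print(objects(graph, "ex:app2", "ex:owner"))
```

Once both inputs share one triple shape, the same query function works regardless of the original format, which is the point of the Standardize step.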
7. Data Governance Business Cases
❖ Capture/manage governance requirements for the complete portfolio of CT applications.
❖ Support the software development lifecycle, compliance/regulatory requirements (GDPR).
❖ Demonstrate Data Lineage* (where the different data sources originate) to showcase
accountability on control & understanding of the data for regulatory purposes.
❖ Exhibit Data provenance** of how the data is processed/transformed across the platform.
❖ Track data movement/data transfers between applications (Traceability***).
❖ Provide contextual alignment with firm-wide standards, taxonomies and glossaries.
❖ Provide validation capabilities for data quality and data accuracy.
❖ Exhibit accountability with regards to entitlements (by effectively governing data provenance).
❖ Ontologies are extended to provide crosswalks between models and ecosystems so we can
answer questions such as :
✓ Which applications contain (S)PI data affected by GDPR? (Regulatory Reporting)
✓ Saudi Arabia has changed its retention policy. Which applications are impacted? (Reporting)
✓ Who are the owners of particular data requirements documentation? (RACI)
* Data lineage refers to capturing the sequence of data flows involving a data element. It can be represented visually to
discover the data flow/movement from its source to its destination.
** Data provenance refers to the recording activity (through provenance loggers).
*** Traceability indicates the ability to track a data construct back to the construct it was derived from, e.g., the original
system where it was created.
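The three example questions above can be read as simple lookups over a connected metadata graph. The following is a minimal sketch under that assumption, using Python tuples in place of RDF triples; the application names, predicates (`ex:containsPI`, `ex:deployedIn`, `ex:owner`) and values are hypothetical.

```python
# Hypothetical metadata graph as (subject, predicate, object) triples.
GRAPH = {
    ("app:payroll", "ex:containsPI", "true"),
    ("app:payroll", "ex:deployedIn", "country:DE"),
    ("app:payroll", "ex:owner", "alice"),
    ("app:billing", "ex:containsPI", "false"),
    ("app:billing", "ex:deployedIn", "country:SA"),
    ("app:billing", "ex:owner", "bob"),
    ("app:hr", "ex:containsPI", "true"),
    ("app:hr", "ex:deployedIn", "country:SA"),
    ("app:hr", "ex:owner", "carol"),
}


def subjects(graph, predicate, obj):
    """Return all subjects that have the given predicate and object, sorted."""
    return sorted(s for s, p, o in graph if p == predicate and o == obj)


def apps_with_pi(graph):
    """Regulatory reporting: which applications contain PI?"""
    return subjects(graph, "ex:containsPI", "true")


def apps_impacted_by(graph, country):
    """Reporting: which applications are deployed in the given country?"""
    return subjects(graph, "ex:deployedIn", country)


def owners_of(graph, apps):
    """RACI: map each listed application to its recorded owner."""
    return {s: o for s, p, o in graph if p == "ex:owner" and s in apps}


print(apps_with_pi(GRAPH))
print(owners_of(GRAPH, apps_impacted_by(GRAPH, "country:SA")))
```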
8. EDG – Metadata Governance Model
❖ The diagram depicts how metadata from various sources is conceptually connected in EDG.
❖ Data Lineage flow connects the following artefacts:
➢ Business Terms (Data Dictionary Metadata)
➢ Data Requirements
➢ Logical Data Model Artefacts
➢ Physical Data Model Artefacts
➢ Application/Deployment (Technical) Metadata
❖ Data Traceability is the ability to
track the links from one artefact to another.
❖ Data Provenance allows tracking of:
➢ Ownership/Entitlements/Access Control Metadata
➢ Business Capability/Process Metadata
Diagram labels: Data Lineage, Data Traceability, Data Provenance
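One way to picture the lineage flow listed above is as a chain of mappings that can be walked in either direction. The sketch below assumes each artefact maps to the artefact it was derived from; the identifiers and the edge direction are hypothetical and are not taken from the EDG model.

```python
# Hypothetical lineage edges: each artefact maps to the artefact it derives from.
MAPS_TO = {
    "app:payroll_db": ["pdm:EMP.SALARY_AMT"],
    "pdm:EMP.SALARY_AMT": ["ldm:Employee.salary"],
    "ldm:Employee.salary": ["req:DR-12"],
    "req:DR-12": ["term:Salary"],
}


def trace(start, edges):
    """Depth-first walk returning every artefact reachable from start, in visit order."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        # Reverse so that the first listed neighbour is visited first.
        stack.extend(reversed(edges.get(node, [])))
    return seen


def invert(edges):
    """Flip every edge so the graph can be walked in the opposite direction."""
    inverted = {}
    for source, targets in edges.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


# From a deployed application up to the business term it implements.
print(trace("app:payroll_db", MAPS_TO))
# From a business term down to every artefact that depends on it.
print(trace("term:Salary", invert(MAPS_TO)))
```

Walking the inverted graph answers the impact question (what depends on this term), while walking the original graph answers the origin question (where does this column come from).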
23. Data Lineage Diagram Logical/Physical Data Elements
Diagram labels: LDM Artefacts; PDM Artefacts; PDM to LDM mapping; Mapping to Technical Asset; Mapping to Data Requirement; Mapping to Standard Glossary
24. Data Lineage Logical Data Elements
Diagram labels: LDM Artefacts; PDM to LDM mapping; Mapping to Data Requirement; Mapping to Standard Glossary
26. GDPR in a Nutshell
❖ GDPR is a set of EU rules governing how organizations handle personal data. It
replaces the earlier Data Protection Act (DPA) and has been enforced since 25 May 2018.
❖ According to GDPR personal data should be processed:
➢ Fairly/lawfully
➢ Accurately, and kept up to date
➢ Kept no longer than is necessary (retention period)
➢ Processed in a secure way
❖ The terms controller and processor are used in GDPR to describe the parties involved in
processing personal data (personal information, PI).
❖ Controller: the party that decides what data is extracted, the purpose it is used for, and who is
involved in the processing.
✓ should be able to demonstrate compliance (accountability metrics).
✓ should be able to report on the purposes of processing and the categories of PI it controls.
❖ Processor: the party responsible for processing the data on behalf of the controller.
✓ should maintain records of the categories of processing activities of PI & the means
by which it is processed.
✓ should be able to report on the data transfers of personal data to a third country or an
international organization and can be held responsible for a data breach (requirement
for breach notification).
27. How governing Data Lineage Supports GDPR Compliance
❖ GDPR Challenge:
➢ Records of personal data processing are required for evidencing/demonstrating compliance.
➢ Organizations are required to record every point where processing activities of personal data
take place and to showcase accountability.
❖ Solution:
➢ GDPR makes data governance even more critical on the lineage aspect.
➢ Governance of data lineage enables you to understand your data-flow activities & to
identify and document the legal justification for each type of activity.
➢ When data lineage is represented visually, it allows discovery of how data moves
from its source to its destination and how it is transformed along the way.
➢ On top of that, GDPR requires evidence in the form of records of personal data processing, which
implies the need for Data Provenance.
➢ Data Provenance refers to recording how the data were derived/generated
and processed. It allows verifying that the process and steps used to obtain a result comply
with a set of given requirements.
➢ In our business case the given requirements are GDPR regulatory requirements; therefore
data lineage and provenance become the tools to showcase accountability with regards to
GDPR compliance.
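The idea of recording processing activities and then checking them against a requirement can be sketched as follows. This is an illustration only: the `ProvenanceLogger` class, its fields and the sample activities are hypothetical, and the single rule checked (every recorded activity names one of the six lawful bases of GDPR Article 6(1)) stands in for whatever requirements an organization actually verifies.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# The six lawful bases for processing listed in GDPR Article 6(1).
LAWFUL_BASES = {
    "consent",
    "contract",
    "legal_obligation",
    "vital_interests",
    "public_task",
    "legitimate_interests",
}


@dataclass(frozen=True)
class ProvenanceRecord:
    activity: str
    data_category: str
    actor: str
    lawful_basis: str
    recorded_at: datetime


class ProvenanceLogger:
    """Hypothetical in-memory logger that records each processing activity."""

    def __init__(self):
        self.records = []

    def log(self, activity, data_category, actor, lawful_basis):
        record = ProvenanceRecord(
            activity, data_category, actor, lawful_basis, datetime.now(timezone.utc)
        )
        self.records.append(record)
        return record


def violations(records):
    """Return records whose lawful basis is missing or not recognised."""
    return [r for r in records if r.lawful_basis not in LAWFUL_BASES]


logger = ProvenanceLogger()
logger.log("extract", "payroll", "etl-job", "contract")
logger.log("export", "payroll", "report-job", "")  # no lawful basis recorded

for record in violations(logger.records):
    print("non-compliant activity:", record.activity, "by", record.actor)
```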
28. GDPR Governance Use Case
❖ GDPR Article 30 Data Requirement
➢ Provide time limits for erasure of the different categories of data required per record
retention policy.
❖ Regulatory requirement translated to the creation of a report and accountability
metrics that:
➢ Returns applications in scope of GDPR for Corporate Technology.
➢ Returns the record class codes of in-scope applications based on the record retention
policies per country.
➢ Notifies application owners when record retention policies change and
verifies that the changes comply with GDPR regulatory requirements.
* Record class codes are used to determine how long to keep each record for each jurisdiction.
** A record class code (RCC) is a category used to group similar types of records in JPMC’s master record retention schedule.
*** Record retention requirements are categorized by record class code, by country and in some cases by the business function
of the record.
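The three report behaviours described in this use case can be sketched in a few functions. Everything here is a hypothetical illustration: the record class codes (`RCC-A`, `RCC-B`), retention periods, applications and owners are invented, and the real report runs against the governance platform rather than in-memory dictionaries.

```python
# Hypothetical retention schedule: (record class code, country) -> years to retain.
RETENTION_YEARS = {
    ("RCC-A", "DE"): 10,
    ("RCC-A", "SA"): 7,
    ("RCC-B", "SA"): 3,
}

# Hypothetical application inventory.
APPLICATIONS = [
    {"name": "payroll", "owner": "alice", "contains_pi": True,
     "country": "DE", "record_class": "RCC-A"},
    {"name": "hr", "owner": "carol", "contains_pi": True,
     "country": "SA", "record_class": "RCC-A"},
    {"name": "wiki", "owner": "bob", "contains_pi": False,
     "country": "SA", "record_class": "RCC-B"},
]


def in_scope(applications):
    """Applications that hold personal information are treated as in scope."""
    return [a for a in applications if a["contains_pi"]]


def retention_report(applications, schedule):
    """One row per in-scope application: name, record class, country, years."""
    rows = []
    for app in in_scope(applications):
        years = schedule.get((app["record_class"], app["country"]))
        rows.append((app["name"], app["record_class"], app["country"], years))
    return rows


def owners_to_notify(applications, old_schedule, new_schedule):
    """Owners of in-scope applications whose retention period has changed."""
    keys = set(old_schedule) | set(new_schedule)
    changed = {k for k in keys if old_schedule.get(k) != new_schedule.get(k)}
    return sorted(
        {a["owner"] for a in in_scope(applications)
         if (a["record_class"], a["country"]) in changed}
    )


updated = dict(RETENTION_YEARS)
updated[("RCC-A", "SA")] = 5  # a country changes its retention policy

print(retention_report(APPLICATIONS, RETENTION_YEARS))
print(owners_to_notify(APPLICATIONS, RETENTION_YEARS, updated))
```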
29. Retention Conceptual Model
Diagram nodes:
ia.jpmc:SEAL_103249 a edg:BusinessApplication
ia.jpmc.gov:GRM_FUN1030 a ia.jpmc:RecordRetentionClass
ia.jpmc.gov:GRM_FUN10300-AE a ia.jpmc:dataRetentionPolicy
Diagram edge labels: ia.jpmc:applicationRecordRetentionClass; ia.jpmc:dataRetentionPolicy
Systems of record: SOR: SEAL; SOR: Retention Manager
Ontologies: Record Retention Policy Ontology; Technical Standard Ontology
30. GDPR In Scope Business Application
Screenshot labels: GDPR compliance information; Provenance information; GDPR in Scope – contains PI
Record Retention Class values shown: OBK1060 | Payroll Services (GRM); PAY100 | Employee Compensation Contribution (GRM)
40. GDPR POLICIES & COMPLIANCE
❖ Policies define guidelines for handling and implementing specific security or
regulatory issues.
❖ With focus on the policy requirements for data protection we have built a
policy/compliance model:
➢ Aiming to validate GDPR compliance for the compliance objects under policy target.
➢ Showcasing accountability with regards to GDPR policy requirements.
Objectives
❖ Bridge the gap between GDPR legislation obligations and operational level
technology controls using semantic modelling to model the critical policy and
compliance aspects.
❖ Use inferencing to preserve accountability of processing activities that
handle PI data subject to regulatory compliance.
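The inferencing objective can be illustrated with a small forward-chaining sketch. It assumes two hypothetical rules (PI propagates along data feeds, and anything holding PI is subject to the GDPR policy) applied over Python tuples until no new facts appear. The predicates and applications are invented, and this is not the semantic model or rule language used in the talk.

```python
# Hypothetical asserted facts as (subject, predicate, object) triples.
FACTS = {
    ("app:hr", "ex:containsPI", "true"),
    ("app:hr", "ex:feeds", "app:payroll"),
    ("app:payroll", "ex:feeds", "app:reporting"),
    ("app:wiki", "ex:feeds", "app:search"),
}


def infer(facts):
    """Apply two rules repeatedly until no new facts appear (a fixed point)."""
    inferred = set(facts)
    while True:
        has_pi = {s for s, p, o in inferred if p == "ex:containsPI" and o == "true"}
        new = set()
        for s, p, o in inferred:
            # Rule 1: PI propagates along data feeds.
            if p == "ex:feeds" and s in has_pi:
                new.add((o, "ex:containsPI", "true"))
        for app in has_pi:
            # Rule 2: anything holding PI is subject to the GDPR policy.
            new.add((app, "ex:subjectTo", "policy:GDPR"))
        if new <= inferred:
            return inferred
        inferred |= new


def gdpr_scope(facts):
    """Applications inferred to be subject to the GDPR policy, sorted."""
    return sorted(
        s for s, p, o in infer(facts)
        if p == "ex:subjectTo" and o == "policy:GDPR"
    )


print(gdpr_scope(FACTS))
```

Only `app:hr` is asserted to contain PI; the downstream applications are brought into scope by inference, so accountability follows the data rather than relying on each application being tagged by hand.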