SlideShare a Scribd company logo
1 of 57
Download to read offline
Ā©2023 Databricks Inc. ā€” All rights reserved
Data Privacy
Patterns
Ā©2023 Databricks Inc. ā€” All rights reserved
Agenda
Data Privacy Patterns
Lesson Name Lesson Name
Lecture: Store Data Securely Lecture: Deleting Data in Databricks
ADE 3.1 - Follow Along Demo - PII Lookup Table
ADE 3.4 - Follow Along Demo - Processing Records
from CDF
ADE 3.2 - Follow Along Demo - Pseudonymized ETL ADE 3.5L - Propagating Changes with CDF Lab
ADE 3.3 - Follow Along Demo - Deidentiļ¬ed PII
Access
ADE 3.6 - Follow Along Demo: Propagating Deletes
with CDF
Lecture: Streaming Data and CDF
Ā©2023 Databricks Inc. ā€” All rights reserved
Store Data Securely
Ā©2023 Databricks Inc. ā€” All rights reserved
Learning Objectives
By the end of this lesson, you should be able to:
Identify common use cases for data privacy and describe optimal
approaches to meet compliance regulations
1
2
3
Secure and handle PII/sensitive data in data pipelines
Create Dynamic views to perform data masking and control access
to rows and columns
Ā©2023 Databricks Inc. ā€” All rights reserved
Regulatory Compliance
ā€¢ EU = GDPR (General Data Protection Regulation)
ā€¢ US = CCPA (California Consumer Privacy Act)
ā€¢ Simpliļ¬ed Compliance Requirements
ā€¢ Inform customers what personal information is collected
ā€¢ Delete, update, or export personal information as requested
ā€¢ Process request in a timely fashion (30 days)
Ā©2023 Databricks Inc. ā€” All rights reserved
Regulatory Compliance
ā€¢ Reduce copies of your PII
ā€¢ Find personal information quickly
ā€¢ Reliably change, delete, or export data
ā€¢ Built-in data skipping optimizations (Z-order) and housekeeping of
obsolete/deleted data (VACUUM)
ā€¢ Use transaction logs for auditing
How Databricks Simpliļ¬es Compliance
Ā©2023 Databricks Inc. ā€” All rights reserved
Manage Access to PII
ā€¢ Control access to storage locations with cloud permissions
ā€¢ Limit human access to raw data
ā€¢ Pseudonymize records on ingestion
ā€¢ Use table ACLs to manage user permissions
ā€¢ Conļ¬gure dynamic views for data redaction
ā€¢ Remove identifying details from demographic views
Ā©2023 Databricks Inc. ā€” All rights reserved
PII Data Security
ā€¢ Protects data at record level
ā€¢ Re-identiļ¬cation is possible
ā€¢ Pseudonymised data is still
considered PII
Two main data modeling approaches to meet compliance requirements
ā€¢ Protects entire dataset
ā€¢ Irreversibly altered
ā€¢ Non-linkable to original data
ā€¢ Multiple anonymization methods
might be used
Pseudonymization Anonymization
Name John Doe
B_Date 14/04/1987
Name User-321
B_Date 14/04/1987
Name John Doe
B_Date 14/04/1987
Name **********
Age 20-30
Ā©2023 Databricks Inc. ā€” All rights reserved
Pseudonymization
ā€¢ Switches original data point with pseudonym for later
re-identiļ¬cation
ā€¢ Only authorized users will have access to keys/hash/table for
re-identiļ¬cation
ā€¢ Protects datasets on record level for machine learning
ā€¢ A pseudonym is still considered to be personal data according to the
GDPR
ā€¢ Two main pseudonymization methods: hashing and tokenization
Overview of the approach
Ā©2023 Databricks Inc. ā€” All rights reserved
Pseudonymization
ā€¢ Apply SHA or other hash to all PII
ā€¢ Add random string "salt" to values
before hashing
ā€¢ Databricks secrets can be leveraged
for obfuscating salt value
ā€¢ Leads to some increase in data size
ā€¢ Some operations will be less efļ¬cient
Method: Hashing
ID SSN Salary_R
1 000-11-1111 53K
2 000-22-2222 68K
3 000-33-3333 90K
4 000-44-4444 72K
ID SSN Salary_R
1 1ffa0bf4002a968e7d8 53K
2 1d55ec7079cb0a6at0 68K
3 be85b326855e0e748 90K
4 da20058e59fe8d311f 72K
Ā©2023 Databricks Inc. ā€” All rights reserved
Pseudonymization
ā€¢ Converts all PII to keys
ā€¢ Values are stored in a secure
lookup table
ā€¢ Slow to write, but fast to read
ā€¢ De-identiļ¬ed data stored in
fewer bytes
Method: Tokenization ID SSN Salary_R
1 000-11-1111 53K
2 000-22-2222 68K
3 000-33-3333 90K
4 000-44-4444 72K
ID SSN Salary_R
1 1ffa0bf4002a968e7d8 53K
2 1d55ec7079cb0a6at0 68K
3 be85b326855e0e748 90K
4 da20058e59fe8d311f 72K
SSN SSN_Token
000-11-1111 1ffa0bf4002a968e7d8
000-22-2222 1d55ec7079cb0a6at0
000-33-3333 be85b326855e0e748
000-44-4444 da20058e59fe8d311f
Token Vault
Ā©2023 Databricks Inc. ā€” All rights reserved
Anonymization
ā€¢ Protects entire dataset (tables, databases or entire data catalogues)
mostly for Business Intelligence
ā€¢ Personal data is irreversibly altered in such a way that a data subject
can no longer be identiļ¬ed directly or indirectly
ā€¢ Usually a combination of more than one technique used in real-world
scenarios
ā€¢ Two main anonymization methods: data suppression and
generalization
Overview of the approach
Ā©2023 Databricks Inc. ā€” All rights reserved
Anonymization Methods
ā€¢ Exclude columns with PII from
views
ā€¢ Remove rows where
demographic groups are too
small
ā€¢ Use dynamic access controls to
provide conditional access to
full data
Method: Data Suppression
Source Table
ID SSN Salary_R
1 000-11-1111 53K
2 000-22-2222 68K
3 000-33-3333 90K
View with no PII
ID Salary_R
1 53K
2 68K
3 90K
Ā©2023 Databricks Inc. ā€” All rights reserved
ā€¢ Categorical generalization
ā€¢ Binning
ā€¢ Truncating IP addresses
ā€¢ Rounding
Anonymization Methods
Method: Generalization
Ā©2023 Databricks Inc. ā€” All rights reserved
ā€¢ Removes precision from data
ā€¢ Move from speciļ¬c categories to
more general
ā€¢ Retain level of speciļ¬city that
still provides insight without
revealing identity
Anonymization Methods
Method: Generalization ā†’ Categorical Generalization
Ā©2023 Databricks Inc. ā€” All rights reserved
Anonymization Methods
ā€¢ Identify meaningful divisions in
data and group on boundaries
ā€¢ Allows access to demographic
groups without being able to
identify individual PII
ā€¢ Can use domain expertise to
identify groups of interest
Method: Generalization ā†’ Binning
ID Department BirthDate
1 IT 28/09/1997
2 Sales 13/02/1976
3 Marketing 02/04/1985
4 Engineering 19/12/2002
ID Department Age_Range
1 IT 20-30
2 Sales 40-50
3 Marketing 30-40
4 Engineering 20-30
Ā©2023 Databricks Inc. ā€” All rights reserved
Anonymization Methods
IP addresses need special
anonymization rules;
ā€¢ Rounding IP address to /24 CIDR
ā€¢ Replace last byte with 0
ā€¢ Generalizes IP geolocation to
city or neighbourhood level
Method: Generalization ā†’ Truncating IP addresses
ID IP IP_Truncated
1 10.130.176.215 10.130.176.0/24
2 10.5.56.45 10.5.56.0/24
3 10.208.126.183 10.208.126.0/24
4 10.106.62.87 10.106.62.0/24
Ā©2023 Databricks Inc. ā€” All rights reserved
Anonymization Methods
ā€¢ Apply generalized rounding rules
to all number data, based on
required precision for analytics
ā€¢ Example:
ā€¢ Integers are rounded to multiples
of 5
ā€¢ Values less than 2.5 are rounded
to 0 or omitted from reports
ā€¢ Consider suppressing outliers
Method: Generalization ā†’ Rounding
ID Department Age_Range Salary
1 IT 20-30 1245.4
2 Sales 40-50 1300
3 Marketing 30-40 1134
ID Department Age_Range Salary_R
1 IT 20-30 1200
2 Sales 40-50 1300
3 Marketing 30-40 1100
Ā©2023 Databricks Inc. ā€” All rights reserved
Knowledge Check
Ā©2023 Databricks Inc. ā€” All rights reserved
Which of the following terms refers to irreversibly
altering personal data in such a way that a data
subject can no longer be identiļ¬ed directly or
indirectly?
Select one response
A. Tokenization
B. Pseudonymization
C. Anonymization
D. Binning
Ā©2023 Databricks Inc. ā€” All rights reserved
Which of the following is a regulatory compliance program
specifically for Europe?
Select one response
A. HIPAA
B. PCI-DSS
C. GDPR
D. CCPA
Ā©2023 Databricks Inc. ā€” All rights reserved
Which of the following are examples of generalization?
Select two responses
A. Hashing
B. Truncating IP addresses
C. Data suppression
D. Binning
Ā©2023 Databricks Inc. ā€” All rights reserved
Which of the following can be used to obscure personal
information by outputting a string of randomized
characters?
Select one response
A. Tokenization
B. Categorical generalization
C. Binning
D. Hashing
Ā©2023 Databricks Inc. ā€” All rights reserved
Demo: PII Lookup Table
Ā©2023 Databricks Inc. ā€” All rights reserved
Demo: Pseudonymized
ETL
Ā©2023 Databricks Inc. ā€” All rights reserved
Ā©2023 Databricks Inc. ā€” All rights reserved
Demo: De-identiļ¬ed PII
Access
Ā©2023 Databricks Inc. ā€” All rights reserved
Streaming Data and
CDF
Ā©2023 Databricks Inc. ā€” All rights reserved
Learning Objectives
By the end of this lesson, you should be able to:
Explain how CDF is enabled and how it works
1
2
3
Discuss why ingesting from CDF can be beneļ¬cial for streaming data
Articulate multiple strategies for using Streaming data to create CDF
4
Discuss how CDF addresses past difļ¬culties propagating updates
and deletes
Ā©2023 Databricks Inc. ā€” All rights reserved
Streaming Data and Data Changes
Updates and Deletes in streaming data
ā€¢ In Structured Streaming, a data stream is treated as a table that is being
continuously appended. Structured Streaming expects to work with
data sources that are append only.
ā€¢ Changes in existing data (updates and deletes) breaks this expectation!
ā€¢ We need a deduplication logic to identify updated and deleted records.
ā€¢ Note: Delta transaction logs tracks ļ¬les than rows. Updating a single row
will point to a new version of the ļ¬le.
Ā©2023 Databricks Inc. ā€” All rights reserved
Solution 1: Ignore Changes
Prevent re-processing by ignoring deletes, updates and overwrites
ā€¢ Ignore transactions that delete
data at partition boundaries
ā€¢ No new data ļ¬les are written with
full partition removal
ā€¢ spark.readStream.format("delta")
.option("ignoreDeletes", "true")
ā€¢ Allows stream to be executed
against Delta table with upstream
changes
ā€¢ Must implement logic to avoid
processing duplicate records
ā€¢ Subsumes ignoreDeletes
ā€¢ spark.readStream.format("delta")
.option("ignoreChanges", "true")
Ignore Deletes Ignore Changes
Ā©2023 Databricks Inc. ā€” All rights reserved
Solution 2: Change Data Feeds (CDF)
Propage incremental changes to downstream tables
Raw Ingestion and
History
BRONZE
Filtered, Cleaned,
Augmented
SILVER
Business-level
Aggregates
GOLD
Change Data Feed
(CDF)
External feeds,
Other CDC output,
Extracts
AI & Reporting
Streaming
Analytics
Ā©2023 Databricks Inc. ā€” All rights reserved
What Delta Change Data Feed Does for You
Beneļ¬ts and use cases of CDF
Meet regulatory
needs
Full history available of
changes made to the
data, including deleted
information
BI on your
data lake
Incrementally update
the data supporting
your BI tool of choice
Unify batch and
streaming
Common change format
for batch and streaming
updates, appends, and
deletes
Improve ETL
Pipelines
Process less data during
ETL to increase
efļ¬ciency of your
pipelines by processing
only row-level changes
Ā©2023 Databricks Inc. ā€” All rights reserved
How Does Delta CDF Work?
Sample CDF data schema
PK B
A1 B1
A2 B2
A3 B3
Original Table (v1)
PK B
A1 B1
A2 Z2
A3 B3
A4 B4
Change data
(Merged as v2)
PK B Change Type Time Version
A2 B2 Preimage 12:00:00 2
A2 Z2 Postimage 12:00:00 2
A3 B3 Delete 12:00:00 2
A4 B4 Insert 12:00:00 2
Change Data Feed Output
A1 record did not receive an update or delete.
So it will not be output by CDF.
Ā©2023 Databricks Inc. ā€” All rights reserved
Consuming the Delta CDF
Stream-based Consumption
ā— Delta Change Feed is processed
as each source commit
completes
ā— Rapid source commits can result
in multiple Delta versions being
included in a single micro-batch
Batch Consumption
ā— Batches are constructed based in
time-bound windows which may
contain multiple Delta versions
A2 B2 Preimage 12:00:00 2
A2 Z2 Postimage 12:00:00 2
A3 B3 Delete 12:00:00 2
A4 B4 Insert 12:00:00 2
A5 B5 Insert
12:08:0
0
3 A6 B6 Insert 12:09:00 4
A6 B6 Preimage 12:10:05 5
A6 Z6 Postimage 12:10:05 5
A5 B5 Insert 12:08:00 3
A6 B6 Insert 12:09:00 4
12:00
A6 B6 Preimage 12:10:05 5
A6 Z6
Postimag
e
12:10:05 5
12:10:0
0
12:20
12:08 12:10:10
Stream - micro batches
Batch - every 10 mins
12:00
Ā©2023 Databricks Inc. ā€” All rights reserved
CDF Conļ¬guration
Important notes for CDF conļ¬guration
ā€¢ CDF is not enabled by default. It can be enabled;
ā€¢ At table level: ALTER TABLE myDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
ā€¢ For all new tables: set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;
ā€¢ Change feed can be read by;
ā€¢ Version
ā€¢ Timestamp
Ā©2023 Databricks Inc. ā€” All rights reserved
Deleting Data in
Databricks
Ā©2023 Databricks Inc. ā€” All rights reserved
Learning Objectives
By the end of this lesson, you should be able to:
Explain the use case of CDF for removing data in downstream tables
1
2
3
Describe various methods of recording important data changes
Discuss how CDF can be leveraged to ensure deletes are committed
fully
Ā©2023 Databricks Inc. ā€” All rights reserved
Data Deletion in Databricks
Data deletion needs special attention!
ā€¢ Companies need to handle data deletion requests carefully to main
compliance with privacy regulations such as GDPR and CCPA.
ā€¢ PII for users need to be effectively and efļ¬ciently handled in Databricks,
including deletion.
ā€¢ These operations usually handled in pipelines separate from ETL
pipelines.
ā€¢ CDF data can be used to propagate deletion action to downstream table.
Ā©2023 Databricks Inc. ā€” All rights reserved
Recording Important Data Changes
Using commit messages
ā€¢ Delta Lake supports arbitrary commit messages that will be recorded
to the Delta transaction log and viewable in the table history. This can
help with later auditing.
ā€¢ Commit messages can be;
ā€¢ Set at global level
ā€¢ Can be speciļ¬ed as part of write operation. For example, data insertion can be
labeled based on processing type; manual, automated.
Ā©2023 Databricks Inc. ā€” All rights reserved
Propagating Data Deletion with CDF
How can CDF be used for propagating deletes?
ā€¢ Data deletion requests can be streamlined with automated triggers using
Structured Streaming.
ā€¢ CDF can be separately leveraged to identify records need to be deleted
or modiļ¬ed in downstream tables.
Note: When Delta Lake's history and CDF features are implemented, deleted
values are still present in older versions of the data.
ā€¢ We can solve this by deleting at a partition boundary.
Ā©2023 Databricks Inc. ā€” All rights reserved
CDF Retention Policy
Important notes for CDF conļ¬guration
ā€¢ File deletion will not actually occur until we VACUUM our table!
ā€¢ CDF records follow the same retention policy of the table. VACUUM
command deletes CDF data.
ā€¢ By default, the Delta engine will prevent VACUUM operations with less
than 7 days of retention. To manually run VACUUM for these ļ¬les;
ā€¢ Disable Sparkā€™s retention duration check (retentionDurationCheck.enabled)
ā€¢ Run VACUUM with DRY RUN to preview ļ¬les before permanently removing them
ā€¢ Run VACUUM with RETAIN 0 HOURS!
Ā©2023 Databricks Inc. ā€” All rights reserved
Can I perform
DML on a Live Table?
(i.e. GDPR)
43
Ā©2023 Databricks Inc. ā€” All rights reserved
Example GDPR use case
Using streaming live tables for ingestion and live tables after
44
Staging Bronze Silver Gold
CREATE STREAMING
LIVE TABLE
CREATE LIVE TABLE CREATE LIVE TABLE
{JSON files}
Limited Retention Configurable Retention
Correction / GDPR
LIVE Tables automatically reflect changes made to inputs
Ā©2023 Databricks Inc. ā€” All rights reserved
How can you ļ¬x it?
Updates, deletes, inserts and merges on streaming live tables.
Ensure compliance for retention periods on a table.
DELETE FROM my_live_tables.records
WHERE date < current_time() ā€“ INTERVAL 3 years
Scrub PII from data in the lake.
UPDATE my_live_tables.records
SET email = hash(email, salt)
45
Ā©2023 Databricks Inc. ā€” All rights reserved
DML works on streaming live tables only
DML on live tables is undone by the next update
UPDATE users
SET email = obfuscate(email)
Live Table Streaming Live Table
46
id
1
Users
email
****@gmail.com
CREATE OR REPLACE
users AS ā€¦.
user@gmail.com
UPDATE users
SET email = obfuscate(email)
id
1
Users
email
****@gmail.com
2 ****@hotmail...
INSERT INTO users ā€¦
Ā©2023 Databricks Inc. ā€” All rights reserved
Demo: Processing
Records from CDF
Ā©2023 Databricks Inc. ā€” All rights reserved
Lab: Propagating
Changes with CDF
Ā©2023 Databricks Inc. ā€” All rights reserved
Learning Objectives
By the end of this lesson, you should be able to:
ā— Learn how to activate Change Data Feed (CDF) for a speciļ¬c table.
ā— Understanding of how to set up a streaming read operation to access
and work with the change data generated by CDF.
ā— Demonstrate how to use SQL's DELETE statement to remove speciļ¬c
records from a table based on criteria and then verify the success of
those deletions.
ā— Ensure data consistency by propagating delete operations from one
table to related tables when records are deleted in one table.
Ā©2023 Databricks Inc. ā€” All rights reserved
Demo
CDF, Data Processing, Deletions, Consistency
ā— Enable CDF for a speciļ¬c table using SQL's ALTER TABLE command with the
delta.enableChangeDataFeed property set to true, allowing us to track changes
made to the table.
ā— Conļ¬gure a streaming read operation on the table with CDF enabled, using
options like 'format("delta")' and enabling continuous monitoring and processing
of change data.
ā— Use SQL's DELETE statement to selectively remove records from the table by
specifying the column name and value, aiding data management and cleanup.
ā— Establish data integrity by creating a temporary view with marked-for-deletion
record and execute delete operations on related tables using MERGE INTO
statements.
Ā©2023 Databricks Inc. ā€” All rights reserved
Demo: Propagating
Deletes with CDF
Ā©2023 Databricks Inc. ā€” All rights reserved
Knowledge Check
Ā©2023 Databricks Inc. ā€” All rights reserved
Which of the following terms refers to irreversibly
altering personal data in such a way that a data
subject can no longer be identiļ¬ed directly or
indirectly?
Select one response
A. Tokenization
B. Pseudonymization
C. Anonymization
D. Binning
Ā©2023 Databricks Inc. ā€” All rights reserved
Which of the following is a regulatory compliance program
specifically for Europe?
Select one response
A. HIPAA
B. PCI-DSS
C. GDPR
D. CCPA
Ā©2023 Databricks Inc. ā€” All rights reserved
Which of the following are examples of generalization?
Select two responses
A. Hashing
B. Truncating IP addresses
C. Data suppression
D. Binning
Ā©2023 Databricks Inc. ā€” All rights reserved
Which of the following can be used to obscure personal
information by outputting a string of randomized
characters?
Select one response
A. Tokenization
B. Categorical generalization
C. Binning
D. Hashing
Ā©2023 Databricks Inc. ā€” All rights reserved
Change feed can be read by which of the following?
Select two responses
A. Version
B. Date modiļ¬ed
C. Timestamp
D. Size

More Related Content

Similar to Data Privacy Patterns in databricks for data engineering professional certification

SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008Denny Lee
Ā 
GDPR Compliance Made Easy with Data Virtualization
GDPR Compliance Made Easy with Data VirtualizationGDPR Compliance Made Easy with Data Virtualization
GDPR Compliance Made Easy with Data VirtualizationDenodo
Ā 
ppt-security-dbsat-222-overview-nodemo.pdf
ppt-security-dbsat-222-overview-nodemo.pdfppt-security-dbsat-222-overview-nodemo.pdf
ppt-security-dbsat-222-overview-nodemo.pdfcamyla81
Ā 
Best Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQLBest Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQLEDB
Ā 
Best Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQLBest Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQLEDB
Ā 
#DEVWEEK2020 Data Privacy in the Tech Stack & CI/CD Process
#DEVWEEK2020 Data Privacy in the Tech Stack & CI/CD Process#DEVWEEK2020 Data Privacy in the Tech Stack & CI/CD Process
#DEVWEEK2020 Data Privacy in the Tech Stack & CI/CD ProcessCillian Kieran
Ā 
Kangaroot EDB Webinar Best Practices in Security with PostgreSQL
Kangaroot EDB Webinar Best Practices in Security with PostgreSQLKangaroot EDB Webinar Best Practices in Security with PostgreSQL
Kangaroot EDB Webinar Best Practices in Security with PostgreSQLKangaroot
Ā 
Secure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersSecure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersCloudera, Inc.
Ā 
Ethyca CodeDriven - Data Privacy Compliance for Engineers & Data Teams
Ethyca CodeDriven - Data Privacy Compliance for Engineers & Data TeamsEthyca CodeDriven - Data Privacy Compliance for Engineers & Data Teams
Ethyca CodeDriven - Data Privacy Compliance for Engineers & Data TeamsCillian Kieran
Ā 
Data Discoverability and Persistent Identifiers - EUDAT Summer School (Chris...
Data Discoverability and Persistent Identifiers - EUDAT Summer School  (Chris...Data Discoverability and Persistent Identifiers - EUDAT Summer School  (Chris...
Data Discoverability and Persistent Identifiers - EUDAT Summer School (Chris...EUDAT
Ā 
MobileDBSecurity.pptx
MobileDBSecurity.pptxMobileDBSecurity.pptx
MobileDBSecurity.pptxmissionsk81
Ā 
Protect your Database with Data Masking & Enforced Version Control
Protect your Database with Data Masking & Enforced Version Control	Protect your Database with Data Masking & Enforced Version Control
Protect your Database with Data Masking & Enforced Version Control DBmaestro - Database DevOps
Ā 
The Hacking Games - Security vs Productivity and Operational Efficiency 20230119
The Hacking Games - Security vs Productivity and Operational Efficiency 20230119The Hacking Games - Security vs Productivity and Operational Efficiency 20230119
The Hacking Games - Security vs Productivity and Operational Efficiency 20230119lior mazor
Ā 
Oracle Database 11g Security and Compliance Solutions - By Tom Kyte
Oracle Database 11g Security and Compliance Solutions - By Tom KyteOracle Database 11g Security and Compliance Solutions - By Tom Kyte
Oracle Database 11g Security and Compliance Solutions - By Tom KyteEdgar Alejandro Villegas
Ā 
Secure Your Data with Virtual Data Fabric (ASEAN)
Secure Your Data with Virtual Data Fabric (ASEAN)Secure Your Data with Virtual Data Fabric (ASEAN)
Secure Your Data with Virtual Data Fabric (ASEAN)Denodo
Ā 
Data Privacy at Scale
Data Privacy at ScaleData Privacy at Scale
Data Privacy at ScaleDataWorks Summit
Ā 
Data protection and privacy in the world of database DevOps
Data protection and privacy in the world of database DevOpsData protection and privacy in the world of database DevOps
Data protection and privacy in the world of database DevOpsRed Gate Software
Ā 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDataWorks Summit
Ā 
Snowflake Data Governance
Snowflake Data GovernanceSnowflake Data Governance
Snowflake Data Governancessuser538b022
Ā 
Data base system.pptx
Data base system.pptxData base system.pptx
Data base system.pptxMrwafaAbbas
Ā 

Similar to Data Privacy Patterns in databricks for data engineering professional certification (20)

SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
Ā 
GDPR Compliance Made Easy with Data Virtualization
GDPR Compliance Made Easy with Data VirtualizationGDPR Compliance Made Easy with Data Virtualization
GDPR Compliance Made Easy with Data Virtualization
Ā 
ppt-security-dbsat-222-overview-nodemo.pdf
ppt-security-dbsat-222-overview-nodemo.pdfppt-security-dbsat-222-overview-nodemo.pdf
ppt-security-dbsat-222-overview-nodemo.pdf
Ā 
Best Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQLBest Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQL
Ā 
Best Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQLBest Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQL
Ā 
#DEVWEEK2020 Data Privacy in the Tech Stack & CI/CD Process
#DEVWEEK2020 Data Privacy in the Tech Stack & CI/CD Process#DEVWEEK2020 Data Privacy in the Tech Stack & CI/CD Process
#DEVWEEK2020 Data Privacy in the Tech Stack & CI/CD Process
Ā 
Kangaroot EDB Webinar Best Practices in Security with PostgreSQL
Kangaroot EDB Webinar Best Practices in Security with PostgreSQLKangaroot EDB Webinar Best Practices in Security with PostgreSQL
Kangaroot EDB Webinar Best Practices in Security with PostgreSQL
Ā 
Secure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersSecure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game Changers
Ā 
Ethyca CodeDriven - Data Privacy Compliance for Engineers & Data Teams
Ethyca CodeDriven - Data Privacy Compliance for Engineers & Data TeamsEthyca CodeDriven - Data Privacy Compliance for Engineers & Data Teams
Ethyca CodeDriven - Data Privacy Compliance for Engineers & Data Teams
Ā 
Data Discoverability and Persistent Identifiers - EUDAT Summer School (Chris...
Data Discoverability and Persistent Identifiers - EUDAT Summer School  (Chris...Data Discoverability and Persistent Identifiers - EUDAT Summer School  (Chris...
Data Discoverability and Persistent Identifiers - EUDAT Summer School (Chris...
Ā 
MobileDBSecurity.pptx
MobileDBSecurity.pptxMobileDBSecurity.pptx
MobileDBSecurity.pptx
Ā 
Protect your Database with Data Masking & Enforced Version Control
Protect your Database with Data Masking & Enforced Version Control	Protect your Database with Data Masking & Enforced Version Control
Protect your Database with Data Masking & Enforced Version Control
Ā 
The Hacking Games - Security vs Productivity and Operational Efficiency 20230119
The Hacking Games - Security vs Productivity and Operational Efficiency 20230119The Hacking Games - Security vs Productivity and Operational Efficiency 20230119
The Hacking Games - Security vs Productivity and Operational Efficiency 20230119
Ā 
Oracle Database 11g Security and Compliance Solutions - By Tom Kyte
Oracle Database 11g Security and Compliance Solutions - By Tom KyteOracle Database 11g Security and Compliance Solutions - By Tom Kyte
Oracle Database 11g Security and Compliance Solutions - By Tom Kyte
Ā 
Secure Your Data with Virtual Data Fabric (ASEAN)
Secure Your Data with Virtual Data Fabric (ASEAN)Secure Your Data with Virtual Data Fabric (ASEAN)
Secure Your Data with Virtual Data Fabric (ASEAN)
Ā 
Data Privacy at Scale
Data Privacy at ScaleData Privacy at Scale
Data Privacy at Scale
Ā 
Data protection and privacy in the world of database DevOps
Data protection and privacy in the world of database DevOpsData protection and privacy in the world of database DevOps
Data protection and privacy in the world of database DevOps
Ā 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
Ā 
Snowflake Data Governance
Snowflake Data GovernanceSnowflake Data Governance
Snowflake Data Governance
Ā 
Data base system.pptx
Data base system.pptxData base system.pptx
Data base system.pptx
Ā 

Recently uploaded

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
Ā 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
Ā 
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...Delhi Call girls
Ā 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
Ā 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
Ā 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
Ā 
å®šåˆ¶č‹±å›½ē™½é‡‘ę±‰å¤§å­¦ęƕäøščƁļ¼ˆUCBęƕäøščƁ书ļ¼‰ ꈐē»©å•åŽŸē‰ˆäø€ęƔäø€
å®šåˆ¶č‹±å›½ē™½é‡‘ę±‰å¤§å­¦ęƕäøščƁļ¼ˆUCBęƕäøščƁ书ļ¼‰																			ꈐē»©å•åŽŸē‰ˆäø€ęƔäø€å®šåˆ¶č‹±å›½ē™½é‡‘ę±‰å¤§å­¦ęƕäøščƁļ¼ˆUCBęƕäøščƁ书ļ¼‰																			ꈐē»©å•åŽŸē‰ˆäø€ęƔäø€
å®šåˆ¶č‹±å›½ē™½é‡‘ę±‰å¤§å­¦ęƕäøščƁļ¼ˆUCBęƕäøščƁ书ļ¼‰ ꈐē»©å•åŽŸē‰ˆäø€ęƔäø€ffjhghh
Ā 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
Ā 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
Ā 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
Ā 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
Ā 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
Ā 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
Ā 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
Ā 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
Ā 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
Ā 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
Ā 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
Ā 

Recently uploaded (20)

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
Ā 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Ā 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
Ā 
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Ā 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
Ā 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
Ā 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Ā 
å®šåˆ¶č‹±å›½ē™½é‡‘ę±‰å¤§å­¦ęƕäøščƁļ¼ˆUCBęƕäøščƁ书ļ¼‰ ꈐē»©å•åŽŸē‰ˆäø€ęƔäø€
å®šåˆ¶č‹±å›½ē™½é‡‘ę±‰å¤§å­¦ęƕäøščƁļ¼ˆUCBęƕäøščƁ书ļ¼‰																			ꈐē»©å•åŽŸē‰ˆäø€ęƔäø€å®šåˆ¶č‹±å›½ē™½é‡‘ę±‰å¤§å­¦ęƕäøščƁļ¼ˆUCBęƕäøščƁ书ļ¼‰																			ꈐē»©å•åŽŸē‰ˆäø€ęƔäø€
å®šåˆ¶č‹±å›½ē™½é‡‘ę±‰å¤§å­¦ęƕäøščƁļ¼ˆUCBęƕäøščƁ书ļ¼‰ ꈐē»©å•åŽŸē‰ˆäø€ęƔäø€
Ā 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
Ā 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
Ā 
ź§ā¤ Aerocity Call Girls Service Aerocity Delhi ā¤ź§‚ 9999965857 ā˜Žļø Hard And Sexy ...
ź§ā¤ Aerocity Call Girls Service Aerocity Delhi ā¤ź§‚ 9999965857 ā˜Žļø Hard And Sexy ...ź§ā¤ Aerocity Call Girls Service Aerocity Delhi ā¤ź§‚ 9999965857 ā˜Žļø Hard And Sexy ...
ź§ā¤ Aerocity Call Girls Service Aerocity Delhi ā¤ź§‚ 9999965857 ā˜Žļø Hard And Sexy ...
Ā 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Ā 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
Ā 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
Ā 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Ā 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
Ā 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
Ā 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
Ā 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Ā 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
Ā 

Data Privacy Patterns in databricks for data engineering professional certification

  • 1. Ā©2023 Databricks Inc. ā€” All rights reserved Data Privacy Patterns
  • 2. Ā©2023 Databricks Inc. ā€” All rights reserved Agenda Data Privacy Patterns Lesson Name Lesson Name Lecture: Store Data Securely Lecture: Deleting Data in Databricks ADE 3.1 - Follow Along Demo - PII Lookup Table ADE 3.4 - Follow Along Demo - Processing Records from CDF ADE 3.2 - Follow Along Demo - Pseudonymized ETL ADE 3.5L - Propagating Changes with CDF Lab ADE 3.3 - Follow Along Demo - Deidentiļ¬ed PII Access ADE 3.6 - Follow Along Demo: Propagating Deletes with CDF Lecture: Streaming Data and CDF
  • 3. Ā©2023 Databricks Inc. ā€” All rights reserved Store Data Securely
  • 4. Ā©2023 Databricks Inc. ā€” All rights reserved Learning Objectives By the end of this lesson, you should be able to: Identify common use cases for data privacy and describe optimal approaches to meet compliance regulations 1 2 3 Secure and handle PII/sensitive data in data pipelines Create Dynamic views to perform data masking and control access to rows and columns
  • 5. Ā©2023 Databricks Inc. ā€” All rights reserved Regulatory Compliance ā€¢ EU = GDPR (General Data Protection Regulation) ā€¢ US = CCPA (California Consumer Privacy Act) ā€¢ Simpliļ¬ed Compliance Requirements ā€¢ Inform customers what personal information is collected ā€¢ Delete, update, or export personal information as requested ā€¢ Process request in a timely fashion (30 days)
  • 6. Ā©2023 Databricks Inc. ā€” All rights reserved Regulatory Compliance ā€¢ Reduce copies of your PII ā€¢ Find personal information quickly ā€¢ Reliably change, delete, or export data ā€¢ Built-in data skipping optimizations (Z-order) and housekeeping of obsolete/deleted data (VACUUM) ā€¢ Use transaction logs for auditing How Databricks Simpliļ¬es Compliance
  • 7. Ā©2023 Databricks Inc. ā€” All rights reserved Manage Access to PII ā€¢ Control access to storage locations with cloud permissions ā€¢ Limit human access to raw data ā€¢ Pseudonymize records on ingestion ā€¢ Use table ACLs to manage user permissions ā€¢ Conļ¬gure dynamic views for data redaction ā€¢ Remove identifying details from demographic views
  • 8. Ā©2023 Databricks Inc. ā€” All rights reserved PII Data Security ā€¢ Protects data at record level ā€¢ Re-identiļ¬cation is possible ā€¢ Pseudonymised data is still considered PII Two main data modeling approaches to meet compliance requirements ā€¢ Protects entire dataset ā€¢ Irreversibly altered ā€¢ Non-linkable to original data ā€¢ Multiple anonymization methods might be used Pseudonymization Anonymization Name John Doe B_Date 14/04/1987 Name User-321 B_Date 14/04/1987 Name John Doe B_Date 14/04/1987 Name ********** Age 20-30
  • 9. Ā©2023 Databricks Inc. ā€” All rights reserved Pseudonymization ā€¢ Switches original data point with pseudonym for later re-identiļ¬cation ā€¢ Only authorized users will have access to keys/hash/table for re-identiļ¬cation ā€¢ Protects datasets on record level for machine learning ā€¢ A pseudonym is still considered to be personal data according to the GDPR ā€¢ Two main pseudonymization methods: hashing and tokenization Overview of the approach
  • 10. Ā©2023 Databricks Inc. ā€” All rights reserved Pseudonymization ā€¢ Apply SHA or other hash to all PII ā€¢ Add random string "salt" to values before hashing ā€¢ Databricks secrets can be leveraged for obfuscating salt value ā€¢ Leads to some increase in data size ā€¢ Some operations will be less efļ¬cient Method: Hashing ID SSN Salary_R 1 000-11-1111 53K 2 000-22-2222 68K 3 000-33-3333 90K 4 000-44-4444 72K ID SSN Salary_R 1 1ffa0bf4002a968e7d8 53K 2 1d55ec7079cb0a6at0 68K 3 be85b326855e0e748 90K 4 da20058e59fe8d311f 72K
  • 11. Ā©2023 Databricks Inc. ā€” All rights reserved Pseudonymization ā€¢ Converts all PII to keys ā€¢ Values are stored in a secure lookup table ā€¢ Slow to write, but fast to read ā€¢ De-identiļ¬ed data stored in fewer bytes Method: Tokenization ID SSN Salary_R 1 000-11-1111 53K 2 000-22-2222 68K 3 000-33-3333 90K 4 000-44-4444 72K ID SSN Salary_R 1 1ffa0bf4002a968e7d8 53K 2 1d55ec7079cb0a6at0 68K 3 be85b326855e0e748 90K 4 da20058e59fe8d311f 72K SSN SSN_Token 000-11-1111 1ffa0bf4002a968e7d8 000-22-2222 1d55ec7079cb0a6at0 000-33-3333 be85b326855e0e748 000-44-4444 da20058e59fe8d311f Token Vault
  • 12. Ā©2023 Databricks Inc. ā€” All rights reserved Anonymization ā€¢ Protects entire dataset (tables, databases or entire data catalogues) mostly for Business Intelligence ā€¢ Personal data is irreversibly altered in such a way that a data subject can no longer be identiļ¬ed directly or indirectly ā€¢ Usually a combination of more than one technique used in real-world scenarios ā€¢ Two main anonymization methods: data suppression and generalization Overview of the approach
  • 13. Ā©2023 Databricks Inc. ā€” All rights reserved Anonymization Methods ā€¢ Exclude columns with PII from views ā€¢ Remove rows where demographic groups are too small ā€¢ Use dynamic access controls to provide conditional access to full data Method: Data Suppression Source Table ID SSN Salary_R 1 000-11-1111 53K 2 000-22-2222 68K 3 000-33-3333 90K View with no PII ID Salary_R 1 53K 2 68K 3 90K
  • 14. Ā©2023 Databricks Inc. ā€” All rights reserved ā€¢ Categorical generalization ā€¢ Binning ā€¢ Truncating IP addresses ā€¢ Rounding Anonymization Methods Method: Generalization
  • 15. Ā©2023 Databricks Inc. ā€” All rights reserved ā€¢ Removes precision from data ā€¢ Move from speciļ¬c categories to more general ā€¢ Retain level of speciļ¬city that still provides insight without revealing identity Anonymization Methods Method: Generalization ā†’ Categorical Generalization
  • 16. Ā©2023 Databricks Inc. ā€” All rights reserved Anonymization Methods ā€¢ Identify meaningful divisions in data and group on boundaries ā€¢ Allows access to demographic groups without being able to identify individual PII ā€¢ Can use domain expertise to identify groups of interest Method: Generalization ā†’ Binning ID Department BirthDate 1 IT 28/09/1997 2 Sales 13/02/1976 3 Marketing 02/04/1985 4 Engineering 19/12/2002 ID Department Age_Range 1 IT 20-30 2 Sales 40-50 3 Marketing 30-40 4 Engineering 20-30
  • 17. Ā©2023 Databricks Inc. ā€” All rights reserved Anonymization Methods IP addresses need special anonymization rules; ā€¢ Rounding IP address to /24 CIDR ā€¢ Replace last byte with 0 ā€¢ Generalizes IP geolocation to city or neighbourhood level Method: Generalization ā†’ Truncating IP addresses ID IP IP_Truncated 1 10.130.176.215 10.130.176.0/24 2 10.5.56.45 10.5.56.0/24 3 10.208.126.183 10.208.126.0/24 4 10.106.62.87 10.106.62.0/24
  • 18. Ā©2023 Databricks Inc. ā€” All rights reserved Anonymization Methods ā€¢ Apply generalized rounding rules to all number data, based on required precision for analytics ā€¢ Example: ā€¢ Integers are rounded to multiples of 5 ā€¢ Values less than 2.5 are rounded to 0 or omitted from reports ā€¢ Consider suppressing outliers Method: Generalization ā†’ Rounding ID Department Age_Range Salary 1 IT 20-30 1245.4 2 Sales 40-50 1300 3 Marketing 30-40 1134 ID Department Age_Range Salary_R 1 IT 20-30 1200 2 Sales 40-50 1300 3 Marketing 30-40 1100
  • 19. Ā©2023 Databricks Inc. ā€” All rights reserved Knowledge Check
  • 20. Ā©2023 Databricks Inc. ā€” All rights reserved Which of the following terms refers to irreversibly altering personal data in such a way that a data subject can no longer be identiļ¬ed directly or indirectly? Select one response A. Tokenization B. Pseudonymization C. Anonymization D. Binning
  • 21. Ā©2023 Databricks Inc. ā€” All rights reserved Which of the following is a regulatory compliance program specifically for Europe? Select one response A. HIPAA B. PCI-DSS C. GDPR D. CCPA
  • 22. Ā©2023 Databricks Inc. ā€” All rights reserved Which of the following are examples of generalization? Select two responses A. Hashing B. Truncating IP addresses C. Data suppression D. Binning
  • 23. Ā©2023 Databricks Inc. ā€” All rights reserved Which of the following can be used to obscure personal information by outputting a string of randomized characters? Select one response A. Tokenization B. Categorical generalization C. Binning D. Hashing
  • 24. Ā©2023 Databricks Inc. ā€” All rights reserved Demo: PII Lookup Table
  • 25. Ā©2023 Databricks Inc. ā€” All rights reserved Demo: Pseudonymized ETL
  • 26. Ā©2023 Databricks Inc. ā€” All rights reserved
  • 27. Ā©2023 Databricks Inc. ā€” All rights reserved Demo: De-identiļ¬ed PII Access
  • 28. Ā©2023 Databricks Inc. ā€” All rights reserved Streaming Data and CDF
  • 29. Ā©2023 Databricks Inc. ā€” All rights reserved Learning Objectives By the end of this lesson, you should be able to: Explain how CDF is enabled and how it works 1 2 3 Discuss why ingesting from CDF can be beneļ¬cial for streaming data Articulate multiple strategies for using Streaming data to create CDF 4 Discuss how CDF addresses past difļ¬culties propagating updates and deletes
  • 30. Ā©2023 Databricks Inc. ā€” All rights reserved Streaming Data and Data Changes Updates and Deletes in streaming data ā€¢ In Structured Streaming, a data stream is treated as a table that is being continuously appended. Structured Streaming expects to work with data sources that are append only. ā€¢ Changes in existing data (updates and deletes) breaks this expectation! ā€¢ We need a deduplication logic to identify updated and deleted records. ā€¢ Note: Delta transaction logs tracks ļ¬les than rows. Updating a single row will point to a new version of the ļ¬le.
  • 31. Ā©2023 Databricks Inc. ā€” All rights reserved Solution 1: Ignore Changes Prevent re-processing by ignoring deletes, updates and overwrites ā€¢ Ignore transactions that delete data at partition boundaries ā€¢ No new data ļ¬les are written with full partition removal ā€¢ spark.readStream.format("delta") .option("ignoreDeletes", "true") ā€¢ Allows stream to be executed against Delta table with upstream changes ā€¢ Must implement logic to avoid processing duplicate records ā€¢ Subsumes ignoreDeletes ā€¢ spark.readStream.format("delta") .option("ignoreChanges", "true") Ignore Deletes Ignore Changes
  • 32. Ā©2023 Databricks Inc. ā€” All rights reserved Solution 2: Change Data Feeds (CDF) Propage incremental changes to downstream tables Raw Ingestion and History BRONZE Filtered, Cleaned, Augmented SILVER Business-level Aggregates GOLD Change Data Feed (CDF) External feeds, Other CDC output, Extracts AI & Reporting Streaming Analytics
  • 33. Ā©2023 Databricks Inc. ā€” All rights reserved What Delta Change Data Feed Does for You Beneļ¬ts and use cases of CDF Meet regulatory needs Full history available of changes made to the data, including deleted information BI on your data lake Incrementally update the data supporting your BI tool of choice Unify batch and streaming Common change format for batch and streaming updates, appends, and deletes Improve ETL Pipelines Process less data during ETL to increase efļ¬ciency of your pipelines by processing only row-level changes
  • 34. Ā©2023 Databricks Inc. ā€” All rights reserved How Does Delta CDF Work? Sample CDF data schema PK B A1 B1 A2 B2 A3 B3 Original Table (v1) PK B A1 B1 A2 Z2 A3 B3 A4 B4 Change data (Merged as v2) PK B Change Type Time Version A2 B2 Preimage 12:00:00 2 A2 Z2 Postimage 12:00:00 2 A3 B3 Delete 12:00:00 2 A4 B4 Insert 12:00:00 2 Change Data Feed Output A1 record did not receive an update or delete. So it will not be output by CDF.
  • 35. Ā©2023 Databricks Inc. ā€” All rights reserved Consuming the Delta CDF Stream-based Consumption ā— Delta Change Feed is processed as each source commit completes ā— Rapid source commits can result in multiple Delta versions being included in a single micro-batch Batch Consumption ā— Batches are constructed based in time-bound windows which may contain multiple Delta versions A2 B2 Preimage 12:00:00 2 A2 Z2 Postimage 12:00:00 2 A3 B3 Delete 12:00:00 2 A4 B4 Insert 12:00:00 2 A5 B5 Insert 12:08:0 0 3 A6 B6 Insert 12:09:00 4 A6 B6 Preimage 12:10:05 5 A6 Z6 Postimage 12:10:05 5 A5 B5 Insert 12:08:00 3 A6 B6 Insert 12:09:00 4 12:00 A6 B6 Preimage 12:10:05 5 A6 Z6 Postimag e 12:10:05 5 12:10:0 0 12:20 12:08 12:10:10 Stream - micro batches Batch - every 10 mins 12:00
  • 36. Ā©2023 Databricks Inc. ā€” All rights reserved CDF Conļ¬guration Important notes for CDF conļ¬guration ā€¢ CDF is not enabled by default. It can be enabled; ā€¢ At table level: ALTER TABLE myDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true) ā€¢ For all new tables: set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true; ā€¢ Change feed can be read by; ā€¢ Version ā€¢ Timestamp
  • 37. Ā©2023 Databricks Inc. ā€” All rights reserved Deleting Data in Databricks
  • 38. Ā©2023 Databricks Inc. ā€” All rights reserved Learning Objectives By the end of this lesson, you should be able to: Explain the use case of CDF for removing data in downstream tables 1 2 3 Describe various methods of recording important data changes Discuss how CDF can be leveraged to ensure deletes are committed fully
  • 39. Ā©2023 Databricks Inc. ā€” All rights reserved Data Deletion in Databricks Data deletion needs special attention! ā€¢ Companies need to handle data deletion requests carefully to main compliance with privacy regulations such as GDPR and CCPA. ā€¢ PII for users need to be effectively and efļ¬ciently handled in Databricks, including deletion. ā€¢ These operations usually handled in pipelines separate from ETL pipelines. ā€¢ CDF data can be used to propagate deletion action to downstream table.
  • 40. Ā©2023 Databricks Inc. ā€” All rights reserved Recording Important Data Changes Using commit messages ā€¢ Delta Lake supports arbitrary commit messages that will be recorded to the Delta transaction log and viewable in the table history. This can help with later auditing. ā€¢ Commit messages can be; ā€¢ Set at global level ā€¢ Can be speciļ¬ed as part of write operation. For example, data insertion can be labeled based on processing type; manual, automated.
  • 41. Ā©2023 Databricks Inc. ā€” All rights reserved Propagating Data Deletion with CDF How can CDF be used for propagating deletes? ā€¢ Data deletion requests can be streamlined with automated triggers using Structured Streaming. ā€¢ CDF can be separately leveraged to identify records need to be deleted or modiļ¬ed in downstream tables. Note: When Delta Lake's history and CDF features are implemented, deleted values are still present in older versions of the data. ā€¢ We can solve this by deleting at a partition boundary.
  • 42. Ā©2023 Databricks Inc. ā€” All rights reserved CDF Retention Policy Important notes for CDF conļ¬guration ā€¢ File deletion will not actually occur until we VACUUM our table! ā€¢ CDF records follow the same retention policy of the table. VACUUM command deletes CDF data. ā€¢ By default, the Delta engine will prevent VACUUM operations with less than 7 days of retention. To manually run VACUUM for these ļ¬les; ā€¢ Disable Sparkā€™s retention duration check (retentionDurationCheck.enabled) ā€¢ Run VACUUM with DRY RUN to preview ļ¬les before permanently removing them ā€¢ Run VACUUM with RETAIN 0 HOURS!
  • 43. Ā©2023 Databricks Inc. ā€” All rights reserved Can I perform DML on a Live Table? (i.e. GDPR) 43
  • 44. Ā©2023 Databricks Inc. ā€” All rights reserved Example GDPR use case Using streaming live tables for ingestion and live tables after 44 Staging Bronze Silver Gold CREATE STREAMING LIVE TABLE CREATE LIVE TABLE CREATE LIVE TABLE {JSON files} Limited Retention Configurable Retention Correction / GDPR LIVE Tables automatically reflect changes made to inputs
  • 45. Ā©2023 Databricks Inc. ā€” All rights reserved How can you ļ¬x it? Updates, deletes, inserts and merges on streaming live tables. Ensure compliance for retention periods on a table. DELETE FROM my_live_tables.records WHERE date < current_time() ā€“ INTERVAL 3 years Scrub PII from data in the lake. UPDATE my_live_tables.records SET email = hash(email, salt) 45
  • 46. Ā©2023 Databricks Inc. ā€” All rights reserved DML works on streaming live tables only DML on live tables is undone by the next update UPDATE users SET email = obfuscate(email) Live Table Streaming Live Table 46 id 1 Users email ****@gmail.com CREATE OR REPLACE users AS ā€¦. user@gmail.com UPDATE users SET email = obfuscate(email) id 1 Users email ****@gmail.com 2 ****@hotmail... INSERT INTO users ā€¦
  • 47. Ā©2023 Databricks Inc. ā€” All rights reserved Demo: Processing Records from CDF
  • 48. Ā©2023 Databricks Inc. ā€” All rights reserved Lab: Propagating Changes with CDF
  • 49. Ā©2023 Databricks Inc. ā€” All rights reserved Learning Objectives By the end of this lesson, you should be able to: ā— Learn how to activate Change Data Feed (CDF) for a speciļ¬c table. ā— Understanding of how to set up a streaming read operation to access and work with the change data generated by CDF. ā— Demonstrate how to use SQL's DELETE statement to remove speciļ¬c records from a table based on criteria and then verify the success of those deletions. ā— Ensure data consistency by propagating delete operations from one table to related tables when records are deleted in one table.
  • 50. Ā©2023 Databricks Inc. ā€” All rights reserved Demo CDF, Data Processing, Deletions, Consistency ā— Enable CDF for a speciļ¬c table using SQL's ALTER TABLE command with the delta.enableChangeDataFeed property set to true, allowing us to track changes made to the table. ā— Conļ¬gure a streaming read operation on the table with CDF enabled, using options like 'format("delta")' and enabling continuous monitoring and processing of change data. ā— Use SQL's DELETE statement to selectively remove records from the table by specifying the column name and value, aiding data management and cleanup. ā— Establish data integrity by creating a temporary view with marked-for-deletion record and execute delete operations on related tables using MERGE INTO statements.
  • 51. Ā©2023 Databricks Inc. ā€” All rights reserved Demo: Propagating Deletes with CDF
  • 52. Ā©2023 Databricks Inc. ā€” All rights reserved Knowledge Check
  • 53. Ā©2023 Databricks Inc. ā€” All rights reserved Which of the following terms refers to irreversibly altering personal data in such a way that a data subject can no longer be identiļ¬ed directly or indirectly? Select one response A. Tokenization B. Pseudonymization C. Anonymization D. Binning
  • 54. Ā©2023 Databricks Inc. ā€” All rights reserved Which of the following is a regulatory compliance program specifically for Europe? Select one response A. HIPAA B. PCI-DSS C. GDPR D. CCPA
  • 55. Ā©2023 Databricks Inc. ā€” All rights reserved Which of the following are examples of generalization? Select two responses A. Hashing B. Truncating IP addresses C. Data suppression D. Binning
  • 56. Ā©2023 Databricks Inc. ā€” All rights reserved Which of the following can be used to obscure personal information by outputting a string of randomized characters? Select one response A. Tokenization B. Categorical generalization C. Binning D. Hashing
  • 57. Ā©2023 Databricks Inc. ā€” All rights reserved Change feed can be read by which of the following? Select two responses A. Version B. Date modiļ¬ed C. Timestamp D. Size