Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale
Jason Hale and Daniel Harrington
Agenda
Background
Authors, Mars and the Petcare Data Platform.
CCPA
Enhanced privacy rights and consumer protection
Gecko
Deep dive into our bespoke, automated CCPA compliance tool.
Background
Authors
• Master’s in Physics – University of Exeter.
• Data Engineer for Mars Petcare – 16 months.
• Worked on ELT framework development and Gecko.
• Master’s in Integrated Mechanical and Electrical Engineering – University of Bath.
• Data Engineer for Mars Petcare – 10 months.
• Worked solely on Gecko.
Jason Hale
Daniel Harrington
We’re part of a broad and diverse family company that’s constantly evolving.
Copyright © 2020 Mars, Incorporated — Confidential 6
We’ve grown from our beginnings in pet food in 1935 to a family of nutrition, health, and services businesses today.
The Petcare Data Platform
Introduction
The Petcare Data Platform
• The Data & Analytics team manages a platform of anonymized data from across Mars Petcare’s brands and businesses.
• Our ingestion pipeline, Kyte, has ingested data from across these business units to form the basis of the Petcare Data Platform (PDP).
• We have built engines and designed processes that have enhanced the business value and integrity of the PDP.
• One of these is Gecko: our CCPA compliance ecosystem designed for Spark and Delta Lake.
The PDP (diagram components): Ingestion; Web Services; Activation/Marketing; Engines; Transformations
California Consumer Privacy Act (CCPA)
CCPA
• California Consumer Privacy Act – Effective January 2020
• Protects the personal information a business collects about consumers and governs how it is used and shared
• Three key rights:
▪ Right to Opt Out.
▪ Right to Request Disclosure.
▪ Right to Request Deletion (Right To Be Forgotten).
Our Mission
1. Handle CCPA Right to Forget requests more efficiently, safely and effectively.
2. Increase the overall security of PII data in the Petcare Data Platform (PDP) Vault.
3. Maintain Non-PII data structure, in order to continue to provide analytical value and overall data integrity.
The Gecko Ecosystem
Gecko Ecosystem
• The concept behind Gecko is to use row (client) level encryption for PII data, and to store encryption keys in a single Delta Lake table in our lake.
• Gecko is made up of two core functions: Gecko Crawl and Gecko Delete
• Gecko Crawl:
▪ Handles encryption/decryption of PII data
▪ Generates a “Master Table” containing all PII within the PDP
• Gecko Delete:
▪ Handles CCPA compliance through redaction of encryption keys
Core Concepts
Gecko Crawl
Architecture
1. Key Generation
*Illustrative Example
• Loop through each source + table
• Salt = 16-byte binary string
• 1 salt per Source_Id
• Client Ids prioritized over primary keys
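The salt generation described above can be sketched in plain Python. This is a minimal illustration, not the Gecko implementation: the config keys (`source`, `table`, `client_id_column`, `primary_key`), the function names and the sample rows are all hypothetical, and Source_Id is assumed to mean a source system's identifier for one client.

```python
import os

def choose_id_column(table_config):
    """Prefer the configured client Id column; fall back to the primary key."""
    return table_config.get("client_id_column") or table_config["primary_key"]

def build_id_salt(table_configs, rows_by_table, existing=None):
    """Return {(source, id): salt}, creating one 16-byte salt per new Source_Id."""
    id_salt = dict(existing or {})
    for config in table_configs:                     # loop through each source + table
        id_column = choose_id_column(config)
        for row in rows_by_table[(config["source"], config["table"])]:
            source_id = (config["source"], row[id_column])
            if source_id not in id_salt:             # 1 salt per Source_Id
                id_salt[source_id] = os.urandom(16)  # 16-byte binary salt
    return id_salt

# Hypothetical configs and rows, for illustration only.
configs = [
    {"source": "vet", "table": "clients", "client_id_column": "client_id", "primary_key": "row_id"},
    {"source": "vet", "table": "visits", "client_id_column": "client_id", "primary_key": "visit_id"},
    {"source": "shop", "table": "orders", "client_id_column": None, "primary_key": "order_id"},
]
rows = {
    ("vet", "clients"): [{"row_id": 1, "client_id": "C1"}, {"row_id": 2, "client_id": "C2"}],
    ("vet", "visits"): [{"visit_id": 9, "client_id": "C1"}],
    ("shop", "orders"): [{"order_id": 10}],
}
id_salt = build_id_salt(configs, rows)
```

Client C1 appears in two tables but receives a single salt, so every record for that client can later be tied to one key.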
1. Key Generation
password + salt = encryption key
*Illustrative Example
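One way to combine a password with a per-client salt into a key is a standard key derivation function. The slides do not say which function Gecko uses, so PBKDF2-HMAC-SHA256, the iteration count and the password below are assumptions. The result is encoded as 32 url-safe base64 bytes, which is the key format Fernet expects.

```python
import base64
import hashlib
import os

def derive_key(password: bytes, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a 32-byte key from password + salt and base64-encode it."""
    raw = hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=32)
    return base64.urlsafe_b64encode(raw)

password = b"illustrative-password"  # hypothetical; a real password would come from a secret store
salt = os.urandom(16)                # 16-byte binary salt, one per client
key = derive_key(password, salt)
```

The derivation is deterministic, so the key never needs to be stored: it can be rebuilt whenever the salt is still available, and cannot be rebuilt once the salt is gone.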
2. Data Encryption
• Multithreaded notebooks at the table level encrypt data in parallel.
• 3 locations must be encrypted because the ELT process writes to 3 locations.
• Tables are joined to the ID_SALT table on the configured Id_Column / Source_Id in order to derive the Salt key.
• A Fernet encryption UDF is applied across all PII columns.
• Encrypted data is validated and then overwrites the existing ingested data.
• PII can be decrypted to obtain the original values when required and permitted.
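The join-then-encrypt flow can be sketched without Spark as a loop over rows. Everything here is hypothetical: the function names, columns, password and PBKDF2 derivation are assumptions. The cipher is passed in as a function, and the demo uses a stand-in that only base64-encodes the value so that the sketch runs on the standard library. The stand-in is not encryption and is not Fernet; a real pipeline would pass a Fernet-based function instead.

```python
import base64
import hashlib

def derive_key(password, salt):
    """Assumed derivation: PBKDF2-HMAC-SHA256, encoded as a 32-byte url-safe key."""
    raw = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000, dklen=32)
    return base64.urlsafe_b64encode(raw)

def encrypt_rows(rows, id_salt, id_column, pii_columns, password, encrypt):
    """Look up each row's salt, derive its key, and transform only the PII columns."""
    out = []
    for row in rows:
        salt = id_salt[row[id_column]]        # the join to ID_SALT
        key = derive_key(password, salt)
        new_row = dict(row)                   # non-PII columns are copied unchanged
        for col in pii_columns:
            if new_row[col] is not None:
                new_row[col] = encrypt(key, new_row[col])
        out.append(new_row)
    return out

def stand_in_cipher(key, value):
    """NOT encryption: tags the value with a key fingerprint and base64-encodes it."""
    fingerprint = hashlib.sha256(key).hexdigest()[:8]
    return f"enc[{fingerprint}]:" + base64.urlsafe_b64encode(value.encode()).decode()

rows = [
    {"client_id": "C1", "name": "Ada", "email": "ada@example.com", "visits": 3},
    {"client_id": "C2", "name": "Bob", "email": None, "visits": 1},
]
id_salt = {"C1": b"\x01" * 16, "C2": b"\x02" * 16}
encrypted = encrypt_rows(rows, id_salt, "client_id", ["name", "email"],
                         b"hypothetical-password", stand_in_cipher)
```

Each row is transformed under its own client's key, and the non-PII `visits` column passes through untouched.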
2. Data Encryption
*Illustrative Example
2. Data Encryption
Parquet (x2 ELT write locs)
▪ Individual files for each date ingested
▪ Required to encrypt each file path one by one (600 in some cases)
▪ Encryption process not easily optimised
Delta (x1 ELT write loc)
▪ Single delta table for all dates
▪ Single path to encrypt
▪ Encryption process easily optimised by Spark
Optimizing Parquet Encryption
Initial Shortfalls – Loop file by file
• Data wasn't partitioned across the cluster, leading to extremely low utilization
• Made worse by skewed data sets (typically those with free text fields)
• Runs took an extremely long time
Optimizing Parquet Encryption
Solution: Parallelism with threading + Spark
• Increasing the number of partitions after a shuffle removes skew effects & ensures Spark parallelism
• Python's concurrent.futures allows us to execute encryption logic for multiple parquet files across multiple workers in parallel
• Massively increased cluster utilization & reduced run time from days to hours
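The threading half of this approach can be sketched with `concurrent.futures`. The paths, worker count and `encrypt_path` body are hypothetical placeholders: in a Spark notebook, `encrypt_path` is where one parquet path would be read, repartitioned (for example with `DataFrame.repartition`), encrypted and written back.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def encrypt_path(path):
    """Placeholder for the per-file job: read, repartition, encrypt, overwrite."""
    return path, "encrypted"

def encrypt_all(paths, max_workers=8):
    """Run the per-file job for many paths at once and collect a status per path."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(encrypt_path, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()[1]
            except Exception as exc:          # one failed file should not stop the rest
                results[path] = f"failed: {exc}"
    return results

# Hypothetical file paths, one per ingestion date.
paths = [f"/hypothetical/lake/source_a/date={d:02d}" for d in range(1, 6)]
statuses = encrypt_all(paths)
```

Threads fit this case because each thread only submits a Spark job from the driver and waits for it; the heavy work runs on the cluster, so several files keep the executors busy at the same time.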
3. Master Table Generation
• Collect all PII in the PDP into a single Master Table
• Fields for each PII attribute: Name, Phone Number, Email, Address, Note
• Each field contains an array of encrypted PII for each Source_Id
PII Labeling and Collection for Future use
3. Master Table Generation
*Illustrative Example
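The shape of the Master Table can be sketched as one record per Source_Id with an array of encrypted values for each PII attribute. The attribute names come from the slide; the input format, function name and token values below are hypothetical.

```python
from collections import defaultdict

PII_ATTRIBUTES = ("Name", "Phone Number", "Email", "Address", "Note")

def build_master_table(labelled_values):
    """Group (source_id, attribute, encrypted_value) triples into arrays per Source_Id."""
    master = defaultdict(lambda: {attr: [] for attr in PII_ATTRIBUTES})
    for source_id, attribute, encrypted_value in labelled_values:
        master[source_id][attribute].append(encrypted_value)
    return dict(master)

# Opaque placeholder tokens stand in for encrypted PII collected from several tables.
collected = [
    ("A1", "Name", "<token-1>"),
    ("A1", "Email", "<token-2>"),
    ("A1", "Email", "<token-3>"),
    ("B2", "Phone Number", "<token-4>"),
]
master_table = build_master_table(collected)
```

Attributes with no values stay as empty arrays, so every record has the same set of fields.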
Gecko Delete
Gecko Delete
• The Gecko Delete process offers superior simplicity, consistency and tractability.
• The process is as follows:
1. Request is ingested into our configs.
2. The delete pipeline is triggered.
3. The request ID is used to filter our ID_SALT config.
4. The relevant salt is redacted.
• The Client’s PII is now irretrievable and the Non-PII data structure is maintained.
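Steps 3 and 4 can be sketched as a filter and an in-place redaction on the ID_SALT records. The record layout, function name and audit fields are assumptions, and the salts are dummy values; only the `[GECKO REDACTED]` marker is taken from the slides.

```python
from datetime import datetime, timezone

REDACTED = "[GECKO REDACTED]"

def gecko_delete(id_salt, request_id, audit_log):
    """Redact the salt for one Source_Id and record metadata about the deletion."""
    matches = [row for row in id_salt if row["source_id"] == request_id]  # step 3: filter
    for row in matches:
        row["salt"] = REDACTED                                            # step 4: redact
    audit_log.append({
        "request_id": request_id,
        "rows_redacted": len(matches),
        "redacted_at": datetime.now(timezone.utc).isoformat(),
    })
    return len(matches)

# Dummy ID_SALT records, for illustration only.
id_salt = [
    {"source_id": "7586241", "salt": b"\x00" * 16},
    {"source_id": "1234567", "salt": b"\x01" * 16},
]
audit_log = []
redacted = gecko_delete(id_salt, "7586241", audit_log)
```

Only the salt changes. The encrypted PII and all non-PII rows stay where they are, but the key for that client can no longer be derived.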
Gecko Delete
Mars Petcare Business Unit (Veterinary)
Right To Forget Request: Banfield ID 7586241
Vault: Petcare Data Platform
• Right to forget request comes from the Business Unit via the OneTrust System
• Client is identified within the Vault
• Client’s Salt is redacted from the ID table: 7586241 [GECKO REDACTED]
• Without the Salt ID, the Encryption Key cannot be generated
• We only have to remove a single record from a single table but we achieve both of the following:
▪ From this point onwards, the Client PII can never be retrieved from the Vault
▪ All of the non-PII data stays exactly as it was in the lake, safely maintaining its overall integrity and value
*Illustrative Example
How have our Processes been Improved?
Benefits
Benefits
1. Security – Every instance of an individual’s information is encrypted
2. Speed – Single filter and redact
3. Auditability – Config contains metadata about deletions performed
4. Automation – Easily monitored as part of a BAU process
5. Data Integrity – By not deleting rows, data structure and integrity are maintained
Next steps for Gecko
Future Work
Future Work
• Potentially build a custom NLP model for automatic PII detection on ingestion.
• Add an API layer over Key access to enhance speed & security.
• Integrate the Gecko module into the ingestion process itself.
Additions and Improvements
Thank you
@marsglobal
linkedin.com/company/mars/
facebook.com/mars
mars.com
