Privacy has become one of the most critical topics in data today. It is about more than how we ingest and consume data; it is about how you protect your customers' rights while balancing the business need. In this session, we bring Privacera CTO Don Bosco Durai together with Northwestern Mutual to detail an important use case in privacy and then show how to scale privacy with a focus on the business needs, making scaling effortless.
Migrate and Modernize Hadoop-Based Security Policies for Databricks (Databricks)
Data teams face a variety of tasks when migrating Hadoop-based platforms to Databricks. A common pitfall arises during migration, when often-overlooked access control policies block adoption. This session will focus on best practices for migrating and modernizing Hadoop-based policies that govern data access (such as those in Apache Ranger or Apache Sentry). Data architects must consider new, fine-grained access control requirements when migrating from Hadoop architectures to Databricks in order to deliver secure access to as many data sets and data consumers as possible. This session will provide guidance across open source, AWS, Azure and partner tools, such as Immuta, on how to scale existing Hadoop-based policies to dynamically support more classes of users, implement fine-grained access control and leverage automation to protect sensitive data while maximizing utility — without manual effort.
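To make the modernization step concrete, here is a minimal sketch, assuming hypothetical schema, table, and group names, of expressing a Ranger-style column-masking rule as a dynamic view in Spark SQL on Databricks (is_member() is a built-in Databricks SQL function; partner tools like Immuta or Privacera would manage such policies centrally rather than hand-writing one view per table):

# Minimal sketch: a Ranger-style column-masking policy as a dynamic view.
# Schema, table, and group names are hypothetical.
spark.sql("""
  CREATE OR REPLACE VIEW secure.customers AS
  SELECT
    id,
    name,
    -- Members of the 'hr' group see the raw SSN; everyone else sees a mask.
    CASE WHEN is_member('hr') THEN ssn
         ELSE concat('XXX-XX-', substr(ssn, -4))
    END AS ssn
  FROM raw.customers
""")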
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ... (Data Con LA)
Learn how to benefit from IoT (internet of things) to reduce costs and spur transformation for your company and clients. Attendees will learn about building blocks to create an IoT solution, and walk through real life architectural decisions in building a solution.
Organizations grapple with manually classifying and inventorying distributed, heterogeneous data assets in order to deliver value. However, Azure's new enterprise service, Azure Synapse Analytics, is poised to help organizations fill the gap between data warehouses and data lakes.
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign... (Michael Rys)
Presentation by James Baker and me on running cost-effective big data workloads with Azure Synapse and Azure Data Lake Storage (ADLS) at Microsoft Ignite 2020. Covers the modern data warehouse architecture supported by Azure Synapse, integration benefits with ADLS, and features that reduce cost such as Query Acceleration, integration of Spark and SQL processing with integrated metadata, and .NET for Apache Spark support.
Databricks: A Tool That Empowers You To Do More With Data (Databricks)
In this talk we will present how Databricks has enabled the author to achieve more with data, enabling one person to build a coherent data project with data engineering, analysis and science components, with better collaboration, better productionization methods, larger datasets, and faster results.
The talk will include a demo illustrating how the multiple functionalities of Databricks help to build a coherent data project: Databricks Jobs, Delta Lake and Auto Loader for data engineering, SQL Analytics for data analysis, Spark ML and MLflow for data science, and Projects for collaboration.
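For flavor, here is a minimal Auto Loader sketch with placeholder paths (the cloudFiles source and Delta sink are Databricks features; this is an illustration, not the speaker's actual pipeline):

# Minimal Auto Loader sketch; all paths are placeholders.
df = (spark.readStream
      .format("cloudFiles")                               # Databricks Auto Loader source
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/meta/schemas/events")
      .load("/mnt/raw/events"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/meta/checkpoints/events")
   .trigger(once=True)                                    # run as a scheduled Databricks job
   .start("/mnt/bronze/events"))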
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Cloud and Analytics - From Platforms to an Ecosystem (Databricks)
Zurich North America is one of the largest providers of insurance solutions and services in the world with customers representing a wide range of industries from agriculture to construction and more than 90 percent of the Fortune 500.
Azure Synapse is Microsoft's new cloud analytics service offering that combines enterprise data warehouse and Big Data analytics capabilities. It offers a powerful and streamlined platform to facilitate the process of consolidating, storing, curating and analysing your data to generate reliable and actionable business insights.
Modernizing to a Cloud Data Architecture (Databricks)
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their successful migration of data and workloads to the cloud.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De... (Databricks)
Columbia is a data-driven enterprise, integrating data from all line-of-business systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts.
Analytics-Enabled Experiences: The New Secret Weapon (Databricks)
Tracking and analyzing how our individual products come together has always been an elusive problem for Steelcase. Our problem can be thought of in the following way: “we know how many Lego pieces we sell, yet we don’t know what Lego set our customers buy.” The Data Science team took over this initiative, which resulted in an evolution of our analytics journey. It is a story of innovation, resilience, agility and grit.
The effects of the COVID-19 pandemic on corporate America shone a spotlight on office furniture manufacturers to find ways in which the office can be made safe again. The team would have never imagined how relevant our work on product application analytics would become. Product application analytics became an industry priority overnight.
The proposal presented this year is the story of how data science is helping corporations bring people back to the office and set the path to lead the reinvention of the office space.
After groundbreaking milestones to overcome technical challenges, the most important question is: What do we do with this? How do we scale this? How do we turn this opportunity into a true competitive advantage? The response: stop thinking about this work as a data science project and start to think about this as an analytics-enabled experience.
During our session we will cover the technical challenges that we overcame as a team to set up a pipeline that ingests semi-structured and unstructured data at scale, performs analytics, and produces digital experiences for multiple users.
This presentation will be particularly insightful for Data Scientists, Data Engineers and analytics leaders who are seeking to better understand how to augment the value of data for their organization.
Using Redash for SQL Analytics on Databricks (Databricks)
This talk gives a brief overview, with a demo, of performing SQL analytics with Redash on Databricks. We will introduce some of the new features coming as part of our integration with Databricks following the acquisition earlier this year, along with a demo of the other Redash features that enable a productive SQL experience on top of Delta Lake.
From Events to Networks: Time Series Analysis on Scale (Dr. Mirko Kämpf)
Event processing, time series aggregation and analysis, and finally analysis of structural patterns between those data snippets can all be done on Hadoop clusters over huge data volumes.
In order to find hidden relations and invisible structures, one has to combine three disciplines using a variety of tools. Luckily, the Hadoop ecosystem offers many such tools. In this session you can see practical examples and a demonstration of the "Hadoop-Oscilloscope". Generic analysis patterns and recommendations on selecting appropriate algorithms will also provide additional background.
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ... (Michael Rys)
SQLBits 2020 presentation on how you can build solutions based on the modern data warehouse pattern with Azure Synapse Spark and SQL including demos of Azure Synapse.
Big data requires a service that can orchestrate and operationalize processes to refine the enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
How to Build Continuous Ingestion for the Internet of Things (Cloudera, Inc.)
The Internet of Things is moving into the mainstream and this new world of data-driven products is transforming a vast number of industry sectors and technologies.
However, IoT creates a new challenge: how to build and operationalize continual data ingestion from such a wide and ever-changing array of endpoints so that the data arrives consumption-ready and can drive analysis and action within the business.
In this webinar, Sean Anderson from Cloudera and Kirit Busu, Director of Product Management at StreamSets, will discuss Hadoop's ecosystem and IoT capabilities and provide advice about common patterns and best practices. Using specific examples, they will demonstrate how to build and run end-to-end IoT data flows using StreamSets and Cloudera infrastructure.
Part 3 - Modern Data Warehouse with Azure Synapse (Nilesh Gule)
Slide deck of the third part of building a Modern Data Warehouse using Azure. This session covered Azure Synapse, formerly SQL Data Warehouse. We look at the Azure Synapse architecture, external files, and integration with Azure Data Factory.
The recording of the session is available on YouTube
https://www.youtube.com/watch?v=LZlu6_rFzm8&WT.mc_id=DP-MVP-5003170
This presentation starts off by discussing powerful examples of The Power of Data and the benefits of data-driven architectures. A Data Governance program is important for the success of data-driven architectures. We then discuss the challenges of implementing a Data Governance framework on a Big Data data lake with open source software including DataPlane, Apache Atlas and Apache Ranger. And finally, we discuss the importance of the democratization of data and switching to a speed-of-thought framework with Hive LLAP.
Privacera and Northwestern Mutual - Scaling Privacy in a Spark Ecosystem (Privacera)
Privacera and Customer Northwestern Mutual Present "How to Scale Privacy in a Spark Ecosystem" at Data + AI Summit 2021
Privacera customer Aaron Colcord, Sr. Director of Data Engineering at Northwestern Mutual, and Don Bosco Durai, CTO and co-founder of Privacera, detail an important use case in privacy and demonstrate how the financial security leader scales privacy with a focus on the business needs. Privacy has become one of the most critical topics in data today: it is about more than how to ingest and consume data; it is about how to protect customers' rights while balancing business needs.
Comprehensive Security for the Enterprise IV: Visibility Through a Single End... (Cloudera, Inc.)
To provide visibility and transparency into your data and usage, Cloudera Enterprise has Navigator, the only native end-to-end governance solution for Apache Hadoop. In this webinar we discuss why Navigator is a key part of comprehensive security and discuss its key features including: auditing, access control, data discovery and exploration, lineage, and lifecycle management. Live demo also included.
Microsoft Cloud GDPR Compliance Options (SUGUK) (Andy Talbot)
Recently, Microsoft introduced Microsoft 365, which brings together Office 365, Windows 10, and Enterprise Mobility + Security. We'll explore what this combination of products means for an organisation looking to ensure GDPR compliance, and the additional Office 365 products you can layer on to help meet your obligations.
GDPR - Why it matters and how to make it Easy (Paul McQuillan)
Looking at the rationale for the new #GDPR Data Regulations, the principles behind the regulation, how this impacts #CRM, and how to make compliance easier.
Data Governance, Compliance and Security in Hadoop with Cloudera (Caserta)
In our recent Big Data Warehousing Meetup, we discussed Data Governance, Compliance and Security in Hadoop.
As the Big Data paradigm becomes more commonplace, we must apply enterprise-grade governance capabilities to critical data that is highly regulated and must adhere to stringent compliance requirements. Caserta and Cloudera shared techniques and tools that enable data governance, compliance and security on Big Data.
For more information, visit www.casertaconcepts.com
CRMCS GDPR - Why it matters and how to make it Easy (Paul McQuillan)
CRM has focused on User Adoption and Business Alignment; however, technology is rewriting the rules.
This brings new opportunities but also new responsibilities for conduct in the Data Economy – notably the introduction of GDPR.
Paul will illustrate why the ethos behind GDPR will sit at the heart of the new relationship we will have with the customer, and how to realise the opportunity in having a customer-centric approach to our business.
September 14, 2016 - Austin SharePoint User Group
What does governance mean in SharePoint? How do you get to good governance? Do you really need governance? What happens if you don’t have governance, or do it poorly?
Jim brings his experience building SharePoint governance in multiple organizations. The session covers governance basics to help get you going in the right direction.
(Unlike the "Group Therapy" session, this is a straight-up presentation, though the Q&A at the end can be used by the audience to ask their specific questions)
This slide deck is from the presentation on September 14, 2016 at the Austin O365 & SharePoint User Group.
My presentation to the Oklahoma City SharePoint User Group, September 7, 2016
The basics of SharePoint Governance - what you need to consider when implementing governance, how to create a plan, and how to make governance work in the long term.
Webinar - Compliance with the Microsoft Cloud - 2017-04-19 (TechSoup)
Everyone throws around the word compliance but how do you actually achieve that? In this free, 60-minute webinar Sam Chenkin from Tech Impact discusses achievable goals for the nonprofit community to keep their data safe with the Microsoft Cloud. We explore account security like two-factor authentication, data security like encryption, and how to make sure only compliant devices can access your data.
Cybersecurity and Data Protection Executive Briefing (RFA)
On 23rd February at the Italian Chamber of Commerce and Industry, RFA and 3 Lines of Defence Consulting presented an executive briefing on cybersecurity and data protection. Topics covered included cyber threats, principles of data processing, and GDPR.
Joe Caserta's 2016 Data Summit Workshop "Introduction to Data Science with Hadoop" on May 9 expanded on his Intro to Data Science Workshop held at last year's Summit. Again, Joe presented to a standing-room-only audience with a focus on the data lake, governance and the role of the data scientist.
For more information on Caserta Concepts, visit our website: http://casertaconcepts.com/
IDERA Live | Understanding SQL Server Compliance both in the Cloud and On Pre... (IDERA Software)
You can watch the replay for this IDERA Live webcast, Understanding SQL Server Compliance both in the Cloud and On Premises, on the IDERA Resource Center, http://ow.ly/tJ3V50A4rPD.
Every industry has its own regulatory compliance guidelines. On top of that, if you want to collect credit card information you must be PCI compliant. If you are trading on a Stock Exchange you must be SOX compliant. If you gather information on EU Members, you must be GDPR compliant. The list of regulations can be lengthy for an organization and some of those regulations may conflict with each other. With more companies moving to the cloud, it is even more important to review your compliance processes. With this session, we will explore the complex world of regulations and how that applies to how you collect and maintain your data.
Speaker: Kim Brushaber is the Senior Product Manager for SQL Compliance Manager at IDERA. Kim has over 20 years of experience as a Business Analyst, Software Developer, Product Manager and IT Executive. Kim enjoys working as the translator between the business and the technical teams in an organization.
The General Data Protection Regulation (GDPR) went into effect on May 25, 2018, and this has immediate implications for handling data in your big data, machine learning, and analytics environments. Traditional architectural approaches will need to be adjusted to be compliant with several of the provisions. The good news is that Cloudera can help you!
Webinar presented on Oct 21st (US) and Oct 23rd (EMEA), 2014 by Christian Buckley, Managing Director at GTconsult and Steve Marsh, Director of Product Marketing at Metalogix.
Similar to Scaling Privacy in a Spark Ecosystem
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark (see the sketch after this list)
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
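Zillow's platform itself is internal, but a minimal sketch of the kind of single-pass Spark validation it describes could look like this (the DataFrame `df`, its column names, and the checks are illustrative assumptions):

from pyspark.sql import functions as F

# Illustrative expectations a producer might register for a dataset 'df'.
expectations = {
    "null_user_ids": F.sum(F.col("user_id").isNull().cast("int")),
    "negative_prices": F.sum((F.col("price") < 0).cast("int")),
}

# Compute every check in a single pass over the data.
row = df.agg(*[expr.alias(name) for name, expr in expectations.items()]).first()
failures = {name: row[name] for name in expectations if row[name] > 0}

if failures:
    # Flag bad data at the earliest stage, before downstream consumers use it.
    raise ValueError(f"Data quality checks failed: {failures}")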
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
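Stitch Fix's API is internal and not public; purely as a hypothetical sketch of the "write a simple function, call an intuitive API" pattern the talk describes:

# Hypothetical sketch only: the registry and decorator below are invented
# for illustration and are not Stitch Fix's real interface.
_REGISTRY = {}

def model(name, version):
    """Register a plain Python function as a deployable model; the platform
    would handle online deployment, batch execution on Spark, and metrics
    tracking behind this interface."""
    def wrap(fn):
        _REGISTRY[(name, version)] = fn
        return fn
    return wrap

@model(name="size_recommender", version="1.2.0")
def predict(features: dict) -> float:
    return 0.42 * features["height_cm"] + 0.1 * features["weight_kg"]

print(_REGISTRY[("size_recommender", "1.2.0")]({"height_cm": 170.0, "weight_kg": 65.0}))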
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
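A minimal sketch of the PySpark stage-level scheduling API (Spark 3.1+). It applies to RDD operations and assumes a cluster with dynamic allocation and GPU discovery already configured; `raw_records`, `prepare`, and `train_partition` are placeholders:

from pyspark.resource import (ExecutorResourceRequests,
                              TaskResourceRequests,
                              ResourceProfileBuilder)

# The ETL stage runs with default resources; later stages request GPUs.
ereq = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
treq = TaskResourceRequests().cpus(1).resource("gpu", 1)
profile = ResourceProfileBuilder().require(ereq).require(treq).build

etl_rdd = spark.sparkContext.parallelize(raw_records).map(prepare)
# Stages created from here on are scheduled with the GPU profile.
result = etl_rdd.withResources(profile).mapPartitions(train_partition).collect()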
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduce data scientists' productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
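A minimal sketch of the converter API described above, assuming a SparkSession `spark`, a preprocessed DataFrame `df` with 'features' and 'label' columns, and a compiled Keras `model`:

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Where Petastorm materializes its intermediate cache (placeholder URL).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

converter = make_spark_converter(df)        # caches the DataFrame once

with converter.make_tf_dataset(batch_size=64) as dataset:
    # Batches arrive as namedtuples; map them to (features, label) tensors.
    dataset = dataset.map(lambda batch: (batch.features, batch.label))
    model.fit(dataset, steps_per_epoch=100, epochs=1)  # illustrative step count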
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
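A minimal sketch of pointing a PySpark application at a Kubernetes cluster; the API server, namespace, and image are placeholders, and in practice these settings are usually passed via spark-submit:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("k8s://https://kubernetes.default.svc:443")  # placeholder API server
         .appName("pipeline-on-k8s")
         .config("spark.kubernetes.container.image",
                 "gcr.io/my-project/spark-py:3.1.1")           # placeholder image
         .config("spark.kubernetes.namespace", "data-pipelines")
         .config("spark.executor.instances", "4")
         # Autoscaling: dynamic allocation with shuffle tracking, since
         # Kubernetes has no external shuffle service.
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())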
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
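A minimal sketch, not the talk's actual abstraction, of scaling the shared fit/transform contract over data partitions with Ray tasks:

import numpy as np
import ray
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

ray.init()

@ray.remote
def run_pipeline(partition):
    # The same fit/transform contract shared by Scikit-Learn and Spark ML.
    pipe = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])
    return pipe.fit_transform(partition)

# Scale out: one Ray task per data partition. (Fitting per partition is a
# simplification; the talk's abstraction keeps fit/transform semantics
# consistent across partitions.)
partitions = [np.random.rand(10_000, 20) for _ in range(8)]
results = ray.get([run_pipeline.remote(p) for p in partitions])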
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not "abelian groups" and that operate over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark (a sketch of both niches follows the outline below).
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
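Hedged sketches of the two niches above, using redis-py against a placeholder host; these are illustrations, not Adobe's production code:

import json
import redis

r = redis.Redis(host="redis.internal", port=6379)  # placeholder host

# Niche 1: the driver of a long-running batch job polls a Redis list for
# new queries against data that was loaded once.
def serve_queries(spark, timeout=60):
    while True:
        item = r.blpop("spark:query-queue", timeout=timeout)
        if item is None:          # queue drained; let the job exit
            return
        _, payload = item
        job = json.loads(payload)
        spark.sql(job["sql"]).write.mode("overwrite").parquet(job["output"])

# Niche 2: executors update a Redis hash as a distributed counter.
def count_partition(rows):
    conn = redis.Redis(host="redis.internal", port=6379)  # one connection per partition
    n = sum(1 for _ in rows)
    # Caveat from the outline: task retries and speculative execution can
    # double-count; production code needs idempotency (e.g. keyed by task id).
    conn.hincrby("spark:counters", "rows_processed", n)

# Usage: df.rdd.foreachPartition(count_partition)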
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
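For flavor, a minimal whylogs sketch on a small pandas frame; the Spark integration applies the same profiling idea at scale:

import pandas as pd
import whylogs as why

df = pd.DataFrame({"age": [34, 45, None, 23],
                   "country": ["US", "DE", "US", "IN"]})

results = why.log(df)               # a lightweight statistical profile, not raw data
print(results.view().to_pandas())   # per-column metrics (counts, types, ...)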
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
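Raven itself is a research system, but the operator transformation in (ii) can be illustrated with a small standalone sketch that compiles a trained scikit-learn decision tree into a SQL CASE expression:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def tree_to_sql(clf, feature_names):
    """Compile a fitted decision tree into a nested SQL CASE expression."""
    t = clf.tree_

    def recurse(node):
        if t.children_left[node] == -1:        # leaf: emit the predicted class
            return str(int(t.value[node].argmax()))
        name = feature_names[t.feature[node]]
        threshold = t.threshold[node]
        return (f"CASE WHEN {name} <= {threshold:.6f} "
                f"THEN {recurse(t.children_left[node])} "
                f"ELSE {recurse(t.children_right[node])} END")

    return recurse(0)

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree_to_sql(clf, ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))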
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences (a sketch of the staging-table pattern follows the outline below).
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
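As a hedged sketch of the staging-table idea above, a Delta Lake MERGE from a staged batch into the serving table (paths and the key column are placeholders):

from delta.tables import DeltaTable

# Ingest lands in a staging location first, isolating writer worries
# (schema drift, bad batches) from the serving table.
staged = spark.read.format("delta").load("/delta/staging/profiles")

target = DeltaTable.forPath(spark, "/delta/profiles")
(target.alias("t")
 .merge(staged.alias("s"), "t.profile_id = s.profile_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())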
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation that is...
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
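To make one of these techniques concrete, here is a minimal sketch of skipping already-converged vertices, assuming a dead-end-free graph given as in-neighbor lists and out-degrees:

def pagerank_skip_converged(in_nbrs, out_deg, d=0.85, eps=1e-10, max_iter=100):
    """Power iteration that stops updating vertices once their rank settles."""
    n = len(out_deg)
    rank = [1.0 / n] * n
    active = set(range(n))                # vertices still being updated
    for _ in range(max_iter):
        new_rank = list(rank)
        for v in list(active):
            r = (1 - d) / n + d * sum(rank[u] / out_deg[u] for u in in_nbrs[v])
            if abs(r - rank[v]) < eps:
                active.discard(v)         # converged: skip in later iterations
            new_rank[v] = r
        rank = new_rank
        if not active:
            break
    return rank

# Tiny 3-vertex cycle: each vertex links to the next, so all ranks are equal.
print(pagerank_skip_converged(in_nbrs={0: [2], 1: [0], 2: [1]}, out_deg=[1, 1, 1]))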
Scaling Privacy in a Spark Ecosystem
1. Scaling Privacy with Apache Spark
Aaron Colcord, Sr. Director Engineering, Northwestern Mutual
Don Bosco Durai, CTO and Co-Founder, Privacera
2. Agenda
▪ Our background
▪ Why privacy, security, compliance?
▪ Approaches
▪ The ideal solution
▪ Real life meets ideal life
3. Backgrounds
Northwestern Mutual:
▪ Building an Enterprise Scale Unified Framework
▪ Very Long, Respected History ~ 160 Years
▪ Compliance is extremely important to us
▪ Agile Data vs Compliant Data
Privacera:
▪ Founded in 2016 by the creators of Apache Ranger & Apache Atlas
▪ Extends Ranger's capabilities beyond traditional Big Data environments to the cloud (Databricks, AWS, Azure, GCP, and more)
▪ Specializes in democratizing data for analytics, while ensuring compliance with privacy regulations (GDPR, CCPA, LGPD, HIPAA, & more)
4. Why do we suddenly care about privacy?
• You care if you are regulated in any form
  • Simply, you need to show you can pass an audit
• You care if you store any information about your users
  • Simply because governments have woken up with GDPR and CCPA
• You care if you want to democratize your data
  • Simply because the use of your data can be scrutinized
We always did, but technology got ahead of privacy. Privacy is often an assumed competency, and technology really showed how important it was.
5. Have you ever...
Gone to a website and read their privacy policy, clicked accept cookies, accepted terms of service, or a EULA? Clicked 'accept all' on a website, used a digital assistant...?
• Collecting information about your customers can
  • Improve the experience
  • Allow the company to understand their business better
• At the core, privacy is a policy and legal obligation
  • You have the data; it used to be your business to just secure it.
• Do you want your information monetized? Sold? Traded?
  • Most companies don't do this. But the privacy policy is there for you.
6. And it's only going to pick up speed.
• More regulations are arriving around privacy
• Increasing your ability to execute against data means respecting your users' rights
• A part of maturity is being able to manage governance
7. More importantly, why do we care so much?
• Technology like Apache Spark opens the capability to democratize your data.
• Most every company wants the marketplace to enrich and share their data.
• Who inside that company can view it? Do we have the controls to protect your information? Can we verify that the information is used for the right purposes?
8. What is the difference between these?
Security:
▪ Preventing unauthorized usage of systems
▪ Ensuring users don't see the incorrect information
▪ Creating boundaries to enforce right action of the system
Compliance:
• The process of making sure your company and employees follow all laws, regulations, standards, and ethical practices that apply to your organization
Privacy:
• "Data privacy may be defined as the authorized, fair, and legitimate processing of personal information"
• Consent rights
• Do not share
• Slippery space
9. Examine strategies to scale agile data with privacy
• Build a metadata layer that defines PII in its schema
  • Users and developers can and will change where PII is stored
  • You can literally chase people to do the 'right thing' forever
• You could build views with permissions for certain users
  • Not very scalable
  • Plus you need to always show who accessed and why
• Are these security scenarios?
10. Challenges to that strategy
• Is the metadata layer flexible enough or should we think in policies?
• Privacy is inherently your organization’s position which may evolve based on regulation
• Can your development keep up with views?
• When you discover the extra 10,000 fields, can you keep up?
• Implement a framework that scales
• Security is not Privacy.
• Security has a different domain and set of principles.
• Remember we are protecting the usage of your data.
12. Ideal scalable system
User Rights:
▪ Revocation of Consent
▪ Portability
▪ Erasure
▪ Rectification
▪ How is data used?
▪ Rights follow Data Reuse
▪ Flexible to change
Classification:
▪ Should align with a Data Governance program
▪ Should adapt to changing data
▪ Proactive
▪ Reclassification
Audit/Governance:
▪ How was it used?
▪ How was it accessed?
▪ How was it protected?
▪ Did it cross borders?
Access:
▪ Authorization of User may change
▪ Supports Agile Access
▪ Business Use is preserved
▪ Automated Systems obey Privacy
13. User Rights at Scale
▪ Revocation of Consent / Right To Be Forgotten
▪ Portability
▪ Erasure
▪ Rectification
▪ How is data used?
▪ Rights follow Data Reuse
▪ Flexible to change
14. Privacy Challenges in Open Data Ecosystem
(Architecture diagram: the open data stack and its users; recoverable labels:)
Storage: S3, ADLS, Redshift, Snowflake, Synapse
SQL Engines: Athena, Databricks, HDInsight, EMR
Data Virtualization: Dremio, Trino, PrestoDB
BI Tools: PowerBI, Tableau
Users: Marketing, Data Analyst, Data Scientist/Architect
17. Automated Data Discovery
● Automatically detect and catalog sensitive data
● Detailed classification, e.g. EMAIL, SSN, GENDER, CC, PHONE_NUMBER, etc.
● Eliminate manual processes
● Catalog data as it is ingested
● Track data movement and propagate tags
● Catalog data across multiple cloud services
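A hedged sketch of the detection idea on slide 17 (not Privacera's implementation): regex-based tagging of sampled column values as data is scanned:

import re

# Illustrative patterns; real discovery combines many more signals than regexes.
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "PHONE_NUMBER": re.compile(r"^\+?\d[\d\- ]{7,14}$"),
}

def classify_column(sample_values, threshold=0.8):
    """Tag a column when most sampled values match a sensitive-data pattern."""
    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(bool(pattern.match(str(v))) for v in sample_values)
        if sample_values and hits / len(sample_values) >= threshold:
            tags.append(tag)
    return tags

print(classify_column(["a@b.com", "c@d.org", "e@f.io"]))  # -> ['EMAIL']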
18. Centralized Access Control
● Global Tag/Classification-based policies
● Purpose- and Persona-based policies
● Dynamic row filters vs. views
● Dynamic masking or decryption
● Approval workflows with time and purpose constraints
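To contrast the "dynamic row filters vs. views" point on slide 18, a hedged Spark SQL sketch on Databricks where a single dynamic view filters rows by group membership at query time (schema, table, and group names are hypothetical):

# One dynamic view replaces a proliferation of per-region static views.
spark.sql("""
  CREATE OR REPLACE VIEW secure.claims AS
  SELECT * FROM raw.claims
  WHERE is_member('compliance_admins')             -- admins see every row
     OR is_member(concat('region_', region))       -- others see only their region
""")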
19. Centralized Auditing and Reporting
● Centralize auditing
● Monitoring data access by classification
● Track usage by Purpose
● Generate attestation reports