ViewShift: Hassle-Free Dynamic
Policy Enforcement for Every
Data Lake
Walaa Eldin Moustafa
Senior Staff Software
Engineer, LinkedIn
May 2024
Khai Tran
Senior Staff Software
Engineer, LinkedIn
Data Protection Scene
Can you relate?
Too many policies Too much data
GDPR
DMA Consent
PII
Right to be forgotten
CCPA
Privacy by design
Anonymization
The Rise of Privacy and Compliance
• Privacy Dashboards
• Data Export
• Ad preferences
• Security checkups
• Data Deletion
Solution is Easy!
Only 2 machines
Solution is Easy!
Only 2 machines
Policies
Data Lake Metadata
SQL
Views
Data & Applications
Compliance 🎉
Why Views
• Expressive
• Express multiple policies with
projections, filters, joins, UDFs.
• Portable
• Executable on multiple engines.
• Modular
• Can be drop-in replacement to
underlying data
• Agile
• Roll-out new views, rollback to
previous views
CREATE VIEW T1_UC1 AS
SELECT
CASE WHEN consent = ‘ALLOW’
THEN a ELSE obf(a)
FROM T1, Settings
WHERE Settings.ID = T1.ID
Why Views
Column level filtering
n a b c d e
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d2 e3
4 a4 b4 c4 d4 e4
5 a5 b5 c5 d5 e5
6 a6 b6 c6 d6 e6
7 a7 b7 c7 d7 e7
8 a8 b8 c8 d8 e8
9 a9 b9 c9 d9 e9
n b c d e
1 b1 c1 d1 e1
2 b2 c2 d2 e2
3 b3 c3 d2 e3
4 b4 c4 d4 e4
5 b5 c5 d5 e5
6 b6 c6 d6 e6
7 b7 c7 d7 e7
8 b8 c8 d8 e8
9 b9 c9 d9 e9
Why Views
Column level masking
n a b c d e
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d2 e3
4 a4 b4 c4 d4 e4
5 a5 b5 c5 d5 e5
6 a6 b6 c6 d6 e6
7 a7 b7 c7 d7 e7
8 a8 b8 c8 d8 e8
9 a9 b9 c9 d9 e9
n a b c d e
1 # b1 c1 d1 e1
2 # b2 c2 d2 e2
3 # b3 c3 d2 e3
4 # b4 c4 d4 e4
5 # b5 c5 d5 e5
6 # b6 c6 d6 e6
7 # b7 c7 d7 e7
8 # b8 c8 d8 e8
9 # b9 c9 d9 e9
Why Views
Row level filters
n a b c d e
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d2 e3
4 a4 b4 c4 d4 e4
5 a5 b5 c5 d5 e5
6 a6 b6 c6 d6 e6
7 a7 b7 c7 d7 e7
8 a8 b8 c8 d8 e8
9 a9 b9 c9 d9 e9
n a b c d e
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
4 a4 b4 c4 d4 e4
5 a5 b5 c5 d5 e5
6 a6 b6 c6 d6 e6
8 a8 b8 c8 d8 e8
Why Views
Cell level masking
n a b c d e
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d2 e3
4 a4 b4 c4 d4 e4
5 a5 b5 c5 d5 e5
6 a6 b6 c6 d6 e6
7 a7 b7 c7 d7 e7
8 a8 b8 c8 d8 e8
9 a9 b9 c9 d9 e9
n a b c d e
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 # b3 c3 # e3
4 a4 b4 c4 d4 e4
5 a5 # c5 d5 #
6 a6 b6 c6 d6 e6
7 a7 b7 c7 d7 e7
8 # b8 c8 d8 e8
9 a9 b9 # # #
n a b c d
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d2
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
7 a7 b7 c7 d7
8 a8 b8 c8 d8
9 a9 b9 c9 d9
n a b c d e
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d2 e3
4 a4 b4 c4 d4 e4
5 a5 b5 c5 d5 e5
6 a6 b6 c6 d6 e6
7 a7 b7 c7 d7 e7
8 # b8 c8 # e8
9 a9 b9 c9 d9 e9
Why Views
Multiple types of masking
n a b c d e
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d2 e3
4 a4 b4 c4 d4 e4
5 a5 b5 c5 d5 e5
6 a6 b6 c6 d6 e6
7 a7 b7 c7 d7 e7
8 a8 b8 c8 d8 e8
9 a9 b9 c9 d9 e9
n a b c d e
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 # b3 c3 # e3
4 a4 b4 c4 d4 e4
5 a5 # c5 d5 #
6 a6 b6 c6 d6 e6
7 a7 b7 c7 d7 e7
8 # b8 c8 d8 e8
9 a9 b9 # # #
Why Views
Multiple Query Engines, Multiple Metadata Sources
https://github.com/linkedin/coral
Metastore
Any
SQL
Dialect
All
SQL
Dialects
Compliance Views
Tables
T1
T2
T3
T4
T5
Metadata store
Compliance Views
Tables
T1
T2
T3
T4
T5
Views
T1_UC1 T1_UC2 T1_UC3
T2_UC1 T2_UC2 T2_UC3
T3_UC1 T3_UC2 T3_UC3
T4_UC1 T4_UC2 T4_UC3
T5_UC1 T5_UC2 T5_UC3
Metadata store
Compliance Views
Tables
T1
T2
T3
T4
T5
Views
T1_UC1 T1_UC2 T1_UC3
T2_UC1 T2_UC2 T2_UC3
T3_UC1 T3_UC2 T3_UC3
T4_UC1 T4_UC2 T4_UC3
T5_UC1 T5_UC2 T5_UC3
CREATE VIEW T1_UC1 AS
SELECT
CASE WHEN Settings.consent = ‘ALLOW’
THEN a ELSE obf(a)
FROM T1, Settings
WHERE Settings.ID = T1.ID
How to rollout views?
How to roll out views?
Not user facing migration!
Large scale migration?
● Expensive & Slow
● Exposes context-specific
view names
● Hard to evolve, include
new policies
● Does not work for views
ViewShift
Table & View catalog
T1 T1_UC1
ViewShift
Table & View catalog
T1 T1_UC1
ViewShift: Benefits
Dynamically route tables to
views at runtime!
● Transparent
● Familiar names
● Works for next regulation
● Easy version management Table & View API
T1 T1_UC1 T1 T1_UC1
Table & View catalog
Table & View API
SELECT * FROM T1 SELECT * FROM T1_UC1
ViewShift: Architecture
Plugin within a plugin
Query Engine
Tables and Views Plugin
Table identifier
View identifier
Table object
View object
ViewShift: Architecture
Plugin within a plugin
Query Engine
Table loadTable(Identifier identifier);
View loadView(Identifier identifier);
Table identifier
View identifier
Table object
View object
ViewShift: Architecture
Plugin within a plugin
Query Engine
Tables and Views Plugin
ViewShift Plugin
Table identifier View object
Table identifier
Context map
View identifier
Context session conf
ViewShift: Architecture
Plugin within a plugin
Query Engine
Table identifier View object
Table identifier
Context map
View identifier
Context session conf
Table loadTable(Identifier identifier);
View loadView(Identifier identifier);
Identifier getViewShiftView(Identifier
identifier, Map<String, String> contextMap);
The policy-based enforcement/masking system
Data Policies
Data Labels Lakehouse
Tables
SQL code
Business
Applications
Privacy Views
Compile
Access
Policy Engine Query Engine
The policy-based enforcement/masking system
Data Policies
Data Labels Lakehouse
Tables
SQL code
Business
Applications
Privacy Views
Compile
Access
Policy Engine Query Engine
Privacy View: SQL representation of applicable policies on a table access
for a given business purpose
Example
Demographic
memberId yearBorn gender
1 1991 F
2 1992 M
memberId adsAllowAge
1 False
2 True
Preferences
Example
Learning
Application
SELECT memberId, yearBorn
FROM Demographic
spark.sql.viewshift.enabled=true
Demographic
memberId yearBorn gender
1 1991 F
2 1992 M
memberId adsAllowAge
1 False
2 True
Preferences
memberId yearBorn
1 1991
2 1992
Example
Learning
Application
Ads
Application
SELECT memberId, yearBorn
FROM Demographic
Demographic
spark.sql.viewshift.enabled=true
memberId yearBorn gender
1 1991 F
2 1992 M
memberId adsAllowAge
1 False
2 True
Preferences
memberId yearBorn
1 1991
2 1992
memberId yearBorn
1 null
2 1992
Sounds
familiar?
OOP
Polymorphism
Sounds
familiar?
OOP
Polymorphism
Sounds
familiar?
OOP
Encapsulation
Behind the scene
Learning
Application
Ads
Application
SELECT memberId, yearBorn
FROM Demographic
spark.sql.viewshift.enabled=true
memberId yearBorn gender
1 1991 F
2 1992 M
memberId yearBorn
1 1991
2 1992
memberId yearBorn
1 null
2 1992
Behind the scene
Learning
Application
Ads
Application
SELECT memberId, yearBorn
FROM Demographic
SELECT memberId, yearBorn
FROM Learning.Demographic
WHERE memberId = 1
SELECT memberId, yearBorn
FROM Ads.Demographic
WHERE memberId = 1
ViewShiftPlugin
spark.sql.viewshift.enabled=true
memberId yearBorn gender
1 1991 F
2 1992 M
memberId yearBorn
1 1991
2 1992
memberId yearBorn
1 null
2 1992
Behind the scene
Learning
Application
Ads
Application
SELECT memberId, yearBorn
FROM Demographic
SELECT memberId, yearBorn
FROM Learning.Demographic
WHERE memberId = 1
SELECT memberId, yearBorn
FROM Ads.Demographic
WHERE memberId = 1
ViewShiftPlugin
Privacy View Privacy View
spark.sql.viewshift.enabled=true
memberId yearBorn gender
1 1991 F
2 1992 M
memberId yearBorn
1 1991
2 1992
memberId yearBorn
1 null
2 1992
How ViewShiftPlugin got implemented
TableName Purpose ViewName
Demographic Ads Ads.Demographic
Demographic Learning Learning.Demographic
... … ...
View Mapping Table
How ViewShiftPlugin got implemented
Ads
Application
Ads usage
purpose
Runs with Maps to
Ads Application Identity
TableName Purpose ViewName
Demographic Ads Ads.Demographic
Demographic Learning Learning.Demographic
... … ...
View Mapping Table
Policy Engine
Data Policies
Data Labels
Policy Matching Matching
Table
SQL Compilation View SQL
Label: AGE
Rule:
if adsAllowAge:
KEEP
else:
ERASE
Label: AGE
Rule:
if adsAllowAge:
KEEP
else:
ERASE
Policy Engine Example – Policy Matching
Purpose: Ads
Label: AGE
Rule:
if adsAllowAge:
KEEP
else:
ERASE
TableName Field Label
Demographic memberId KEY
Demographic yearBorn AGE
Demographic gender GENDER
Data Labels
TableName Purpose ApplicablePolicies
Demographic Ads [{"Field":"yearBorn",
"Policy":"AdsPolicyForAge"}]
Demographic Learning []
AdsPolicyForAge
Matching Table
Policy Engine Example – SQL compilation
TableName Purpose ApplicablePolicies
Demographic Ads [{"Field":"yearBorn",
"Policy":"AdsPolicyForAge"}]
Demographic Learning []
Matching Table
SELECT
memberId,
CASE
WHEN HAS_CONSENT(memberId, "adsAllowAge") THEN yearBorn
ELSE NULL
END as yearBorn,
gender
FROM Demographic
Ads.Demographic
SELECT *
FROM Demographic
Learning.Demographic
Purpose: Ads
Label: AGE
Rule:
if adsAllowAge:
KEEP
else:
ERASE
Policy Engine Example – SQL compilation
TableName Purpose ApplicablePolicies
Demographic Ads [{"Field":"yearBorn",
"Policy":"AdsPolicyForAge"}]
Demographic Learning []
Matching Table
SELECT
memberId,
CASE
WHEN HAS_CONSENT(memberId, "adsAllowAge") THEN yearBorn
ELSE NULL
END as yearBorn,
gender
FROM Demographic
Ads.Demographic
SELECT *
FROM Demographic
Learning.Demographic
Purpose: Ads
Label: AGE
Rule:
if adsAllowAge:
KEEP
else:
ERASE
HAS_CONSENT(memberId: BIGIN,
consentName: VARCHAR):
Returns true iff memberId has consent
on consentName
Privacy views in operations
View delivery
• A pipeline to create/update views
every hour
• Maintaining tens of thousands of
views in production
• Views are versioned
View consumptions
• Seamless migration with no code
change for existing applications:
o Views are schema preserving
o ViewShift for transparent routing
o Minimum computation overhead
• A system to audit view usages
Thank you!
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake

ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake

  • 1.
    ViewShift: Hassle-Free Dynamic PolicyEnforcement for Every Data Lake Walaa Eldin Moustafa Senior Staff Software Engineer, LinkedIn May 2024 Khai Tran Senior Staff Software Engineer, LinkedIn
  • 2.
    Data Protection Scene Canyou relate? Too many policies Too much data GDPR DMA Consent PII Right to be forgotten CCPA Privacy by design Anonymization
  • 3.
    The Rise ofPrivacy and Compliance • Privacy Dashboards • Data Export • Ad preferences • Security checkups • Data Deletion
  • 4.
  • 5.
    Solution is Easy! Only2 machines Policies Data Lake Metadata SQL Views Data & Applications Compliance 🎉
  • 6.
    Why Views • Expressive •Express multiple policies with projections, filters, joins, UDFs. • Portable • Executable on multiple engines. • Modular • Can be drop-in replacement to underlying data • Agile • Roll-out new views, rollback to previous views CREATE VIEW T1_UC1 AS SELECT CASE WHEN consent = ‘ALLOW’ THEN a ELSE obf(a) FROM T1, Settings WHERE Settings.ID = T1.ID
  • 7.
    Why Views Column levelfiltering n a b c d e 1 a1 b1 c1 d1 e1 2 a2 b2 c2 d2 e2 3 a3 b3 c3 d2 e3 4 a4 b4 c4 d4 e4 5 a5 b5 c5 d5 e5 6 a6 b6 c6 d6 e6 7 a7 b7 c7 d7 e7 8 a8 b8 c8 d8 e8 9 a9 b9 c9 d9 e9 n b c d e 1 b1 c1 d1 e1 2 b2 c2 d2 e2 3 b3 c3 d2 e3 4 b4 c4 d4 e4 5 b5 c5 d5 e5 6 b6 c6 d6 e6 7 b7 c7 d7 e7 8 b8 c8 d8 e8 9 b9 c9 d9 e9
  • 8.
    Why Views Column levelmasking n a b c d e 1 a1 b1 c1 d1 e1 2 a2 b2 c2 d2 e2 3 a3 b3 c3 d2 e3 4 a4 b4 c4 d4 e4 5 a5 b5 c5 d5 e5 6 a6 b6 c6 d6 e6 7 a7 b7 c7 d7 e7 8 a8 b8 c8 d8 e8 9 a9 b9 c9 d9 e9 n a b c d e 1 # b1 c1 d1 e1 2 # b2 c2 d2 e2 3 # b3 c3 d2 e3 4 # b4 c4 d4 e4 5 # b5 c5 d5 e5 6 # b6 c6 d6 e6 7 # b7 c7 d7 e7 8 # b8 c8 d8 e8 9 # b9 c9 d9 e9
  • 9.
    Why Views Row levelfilters n a b c d e 1 a1 b1 c1 d1 e1 2 a2 b2 c2 d2 e2 3 a3 b3 c3 d2 e3 4 a4 b4 c4 d4 e4 5 a5 b5 c5 d5 e5 6 a6 b6 c6 d6 e6 7 a7 b7 c7 d7 e7 8 a8 b8 c8 d8 e8 9 a9 b9 c9 d9 e9 n a b c d e 1 a1 b1 c1 d1 e1 2 a2 b2 c2 d2 e2 4 a4 b4 c4 d4 e4 5 a5 b5 c5 d5 e5 6 a6 b6 c6 d6 e6 8 a8 b8 c8 d8 e8
  • 10.
    Why Views Cell levelmasking n a b c d e 1 a1 b1 c1 d1 e1 2 a2 b2 c2 d2 e2 3 a3 b3 c3 d2 e3 4 a4 b4 c4 d4 e4 5 a5 b5 c5 d5 e5 6 a6 b6 c6 d6 e6 7 a7 b7 c7 d7 e7 8 a8 b8 c8 d8 e8 9 a9 b9 c9 d9 e9 n a b c d e 1 a1 b1 c1 d1 e1 2 a2 b2 c2 d2 e2 3 # b3 c3 # e3 4 a4 b4 c4 d4 e4 5 a5 # c5 d5 # 6 a6 b6 c6 d6 e6 7 a7 b7 c7 d7 e7 8 # b8 c8 d8 e8 9 a9 b9 # # #
  • 11.
    n a bc d 1 a1 b1 c1 d1 2 a2 b2 c2 d2 3 a3 b3 c3 d2 4 a4 b4 c4 d4 5 a5 b5 c5 d5 6 a6 b6 c6 d6 7 a7 b7 c7 d7 8 a8 b8 c8 d8 9 a9 b9 c9 d9 n a b c d e 1 a1 b1 c1 d1 e1 2 a2 b2 c2 d2 e2 3 a3 b3 c3 d2 e3 4 a4 b4 c4 d4 e4 5 a5 b5 c5 d5 e5 6 a6 b6 c6 d6 e6 7 a7 b7 c7 d7 e7 8 # b8 c8 # e8 9 a9 b9 c9 d9 e9 Why Views Multiple types of masking n a b c d e 1 a1 b1 c1 d1 e1 2 a2 b2 c2 d2 e2 3 a3 b3 c3 d2 e3 4 a4 b4 c4 d4 e4 5 a5 b5 c5 d5 e5 6 a6 b6 c6 d6 e6 7 a7 b7 c7 d7 e7 8 a8 b8 c8 d8 e8 9 a9 b9 c9 d9 e9 n a b c d e 1 a1 b1 c1 d1 e1 2 a2 b2 c2 d2 e2 3 # b3 c3 # e3 4 a4 b4 c4 d4 e4 5 a5 # c5 d5 # 6 a6 b6 c6 d6 e6 7 a7 b7 c7 d7 e7 8 # b8 c8 d8 e8 9 a9 b9 # # #
  • 12.
    Why Views Multiple QueryEngines, Multiple Metadata Sources https://github.com/linkedin/coral Metastore Any SQL Dialect All SQL Dialects
  • 13.
  • 14.
    Compliance Views Tables T1 T2 T3 T4 T5 Views T1_UC1 T1_UC2T1_UC3 T2_UC1 T2_UC2 T2_UC3 T3_UC1 T3_UC2 T3_UC3 T4_UC1 T4_UC2 T4_UC3 T5_UC1 T5_UC2 T5_UC3 Metadata store
  • 15.
    Compliance Views Tables T1 T2 T3 T4 T5 Views T1_UC1 T1_UC2T1_UC3 T2_UC1 T2_UC2 T2_UC3 T3_UC1 T3_UC2 T3_UC3 T4_UC1 T4_UC2 T4_UC3 T5_UC1 T5_UC2 T5_UC3 CREATE VIEW T1_UC1 AS SELECT CASE WHEN Settings.consent = ‘ALLOW’ THEN a ELSE obf(a) FROM T1, Settings WHERE Settings.ID = T1.ID
  • 16.
  • 17.
    How to rollout views? Not user facing migration! Large scale migration? ● Expensive & Slow ● Exposes context-specific view names ● Hard to evolve, include new policies ● Does not work for views
  • 18.
    ViewShift Table & Viewcatalog T1 T1_UC1
  • 19.
    ViewShift Table & Viewcatalog T1 T1_UC1
  • 20.
    ViewShift: Benefits Dynamically routetables to views at runtime! ● Transparent ● Familiar names ● Works for next regulation ● Easy version management Table & View API T1 T1_UC1 T1 T1_UC1 Table & View catalog Table & View API SELECT * FROM T1 SELECT * FROM T1_UC1
  • 21.
    ViewShift: Architecture Plugin withina plugin Query Engine Tables and Views Plugin Table identifier View identifier Table object View object
  • 22.
    ViewShift: Architecture Plugin withina plugin Query Engine Table loadTable(Identifier identifier); View loadView(Identifier identifier); Table identifier View identifier Table object View object
  • 23.
    ViewShift: Architecture Plugin withina plugin Query Engine Tables and Views Plugin ViewShift Plugin Table identifier View object Table identifier Context map View identifier Context session conf
  • 24.
    ViewShift: Architecture Plugin withina plugin Query Engine Table identifier View object Table identifier Context map View identifier Context session conf Table loadTable(Identifier identifier); View loadView(Identifier identifier); Identifier getViewShiftView(Identifier identifier, Map<String, String> contextMap);
  • 25.
    The policy-based enforcement/maskingsystem Data Policies Data Labels Lakehouse Tables SQL code Business Applications Privacy Views Compile Access Policy Engine Query Engine
  • 26.
    The policy-based enforcement/maskingsystem Data Policies Data Labels Lakehouse Tables SQL code Business Applications Privacy Views Compile Access Policy Engine Query Engine Privacy View: SQL representation of applicable policies on a table access for a given business purpose
  • 27.
    Example Demographic memberId yearBorn gender 11991 F 2 1992 M memberId adsAllowAge 1 False 2 True Preferences
  • 28.
    Example Learning Application SELECT memberId, yearBorn FROMDemographic spark.sql.viewshift.enabled=true Demographic memberId yearBorn gender 1 1991 F 2 1992 M memberId adsAllowAge 1 False 2 True Preferences memberId yearBorn 1 1991 2 1992
  • 29.
    Example Learning Application Ads Application SELECT memberId, yearBorn FROMDemographic Demographic spark.sql.viewshift.enabled=true memberId yearBorn gender 1 1991 F 2 1992 M memberId adsAllowAge 1 False 2 True Preferences memberId yearBorn 1 1991 2 1992 memberId yearBorn 1 null 2 1992
  • 30.
  • 31.
  • 32.
  • 33.
    Behind the scene Learning Application Ads Application SELECTmemberId, yearBorn FROM Demographic spark.sql.viewshift.enabled=true memberId yearBorn gender 1 1991 F 2 1992 M memberId yearBorn 1 1991 2 1992 memberId yearBorn 1 null 2 1992
  • 34.
    Behind the scene Learning Application Ads Application SELECTmemberId, yearBorn FROM Demographic SELECT memberId, yearBorn FROM Learning.Demographic WHERE memberId = 1 SELECT memberId, yearBorn FROM Ads.Demographic WHERE memberId = 1 ViewShiftPlugin spark.sql.viewshift.enabled=true memberId yearBorn gender 1 1991 F 2 1992 M memberId yearBorn 1 1991 2 1992 memberId yearBorn 1 null 2 1992
  • 35.
    Behind the scene Learning Application Ads Application SELECTmemberId, yearBorn FROM Demographic SELECT memberId, yearBorn FROM Learning.Demographic WHERE memberId = 1 SELECT memberId, yearBorn FROM Ads.Demographic WHERE memberId = 1 ViewShiftPlugin Privacy View Privacy View spark.sql.viewshift.enabled=true memberId yearBorn gender 1 1991 F 2 1992 M memberId yearBorn 1 1991 2 1992 memberId yearBorn 1 null 2 1992
  • 36.
    How ViewShiftPlugin gotimplemented TableName Purpose ViewName Demographic Ads Ads.Demographic Demographic Learning Learning.Demographic ... … ... View Mapping Table
  • 37.
    How ViewShiftPlugin gotimplemented Ads Application Ads usage purpose Runs with Maps to Ads Application Identity TableName Purpose ViewName Demographic Ads Ads.Demographic Demographic Learning Learning.Demographic ... … ... View Mapping Table
  • 38.
    Policy Engine Data Policies DataLabels Policy Matching Matching Table SQL Compilation View SQL
  • 39.
    Label: AGE Rule: if adsAllowAge: KEEP else: ERASE Label:AGE Rule: if adsAllowAge: KEEP else: ERASE Policy Engine Example – Policy Matching Purpose: Ads Label: AGE Rule: if adsAllowAge: KEEP else: ERASE TableName Field Label Demographic memberId KEY Demographic yearBorn AGE Demographic gender GENDER Data Labels TableName Purpose ApplicablePolicies Demographic Ads [{"Field":"yearBorn", "Policy":"AdsPolicyForAge"}] Demographic Learning [] AdsPolicyForAge Matching Table
  • 40.
    Policy Engine Example– SQL compilation TableName Purpose ApplicablePolicies Demographic Ads [{"Field":"yearBorn", "Policy":"AdsPolicyForAge"}] Demographic Learning [] Matching Table SELECT memberId, CASE WHEN HAS_CONSENT(memberId, "adsAllowAge") THEN yearBorn ELSE NULL END as yearBorn, gender FROM Demographic Ads.Demographic SELECT * FROM Demographic Learning.Demographic Purpose: Ads Label: AGE Rule: if adsAllowAge: KEEP else: ERASE
  • 41.
    Policy Engine Example– SQL compilation TableName Purpose ApplicablePolicies Demographic Ads [{"Field":"yearBorn", "Policy":"AdsPolicyForAge"}] Demographic Learning [] Matching Table SELECT memberId, CASE WHEN HAS_CONSENT(memberId, "adsAllowAge") THEN yearBorn ELSE NULL END as yearBorn, gender FROM Demographic Ads.Demographic SELECT * FROM Demographic Learning.Demographic Purpose: Ads Label: AGE Rule: if adsAllowAge: KEEP else: ERASE HAS_CONSENT(memberId: BIGIN, consentName: VARCHAR): Returns true iff memberId has consent on consentName
  • 42.
    Privacy views inoperations View delivery • A pipeline to create/update views every hour • Maintaining tens of thousands of views in production • Views are versioned View consumptions • Seamless migration with no code change for existing applications: o Views are schema preserving o ViewShift for transparent routing o Minimum computation overhead • A system to audit view usages
  • 43.

Editor's Notes

  • #2 Alright. How many of you have dealt with compliance before? Either through enforcing it Or through adhering to compliance rules Okay, looks like almost all of you.
  • #3 So, more likely than not, you have had days where you have felt like our friend. the staff engineer here. who's very overwhelmed with terms that he keeps hearing, such as GDPR, CCPA, DMA, PII, right to be forgotten, privacy by design, and so many of those buzzwords. On the other side, he's responsible for managing a very large data lake with so many data applications on top of it and wants to make everything work. This sounds like a very challenging problem.
  • #4 At the same time, privacy by default is on the rise. Companies managing user data are implementing various controls to empower users to manage their privacy and security effectively. They provide tools such as privacy dashboards and options to export user data , which might be used in other services or retained for records. There are also tools to adjust ad preferences to control what type of data the platform can use to display ads to users, alongside frequent security checkups and options to delete personal data from the site.
  • #5 Although managing compliance might sound overwhelming, the solution is surprisingly simple. To handle compliance, you only need two key components: one is the policy engine, and the other is the query engine.
  • #6 Here’s how it works: Initially, policies inferred from regulations or internal guidelines are represented in a structured format and are kept in a policy store. This is fed into the policy engine along with data lake metadata, which includes table schemas along with column policy annotations. The policy engine then produces a set of SQL views encoding the necessary transformations according to data usage. These views are subsequently fed into a query engine along with the data and user applications, and are used to implement compliant data applications. Therefore, this workflow simplifies the journey from a complex maze of policies and tables in the data lake to a clear path towards compliance.
  • #7 But why views? Views offer a range of beneficial properties that make them flexible and effective for compliance. They are expressive, allowing the representation of multiple policies through tools like projections, filters, UDFs, and joins. They are portable, thanks to SQL’s nature, allowing execution across various engines with minimal adjustments. Views are also modular, serving as drop-in replacements for underlying data; by simply substituting table names with view names while preserving schemas, the same code can operate with additional logic encapsulated within the view. Moreover, views are agile, enabling the deployment of new views or reversion to previous ones with minimal impact. This agility allows for quick bug fixes or policy updates.
  • #8 At the expressivity level, let us demonstrate some key ways in which views can be used to apply policy-specific transformations. For instance, views enable column-level filtering—by excluding certain columns from a table, we can tailor the data presented to the consumer.
  • #9 Views can also be used for column-level masking, where instead of removing a column entirely, we mask or redact the data within it
  • #10 Furthermore, views can implement row-level filters to exclude unqualified rows from results
  • #11  or even perform cell-level masking, where specific data points are obscured based on the individual user’s consent and the data domain.
  • #12 These diverse masking capabilities are not limited to single-view applications; a single table can support multiple views, each representing a different method of data masking and applicable in distinct contexts. We will explore more of this versatility throughout the presentation.
  • #13 Views are typically stored as metadata in metadata stores like the Hive metastore, and more recently, Iceberg has begun supporting views in its metadata structure. Engines operating on these metadata stores can execute views within their environments, which underscores the importance of portability. At LinkedIn, we leverage tools like Coral for dialect translation, allowing us to define policy views in one SQL dialect and have them executed across any engine, in any other dialect.
  • #14 Let's bring everything together. Imagine we have a data lake containing numerous tables, all stored in a metadata store.
  • #15 The next step involves the policy engine generating a set of views corresponding to those tables. For instance, for Table 1 (T1), we might create three distinct views. Each view represents a separate transformation tailored to a specific use case, hence the view names are labeled as T1UC1 for Use Case 1, and so forth
  • #16 Consequently, each view embodies its unique logic and transformation, operating within a designated context.
  • #17 Now, let’s consider how we can roll out views effectively, given their potent capabilities for ensuring privacy and data protection. One approach is to forcibly change user or application logic by replacing every table name with a new view name in every script. However, this brute force method is not the best approach.
  • #18 It requires a large-scale migration, which can be expensive and slow, and it risks exposing context-specific information in the view names. For example, a table with a business-critical meaning might end up with multiple suffixes, diluting its core significance. If we manually migrate to views, then when a new version of a view is needed, or a new regulation or policy is introduced, we might find the approach unmanageable and not user-friendly.
  • #19 Can we do better? Yes, we can. Here’s where the architecture we refer to as ViewShift comes in. In this system, the user script remains intact, still referring to the table, but during execution, the tables are automatically replaced by the compliance view. For example, let's look at this depiction where the Spark engine attempts to resolve an identifier for a table, such as T1, which originates from the user script. This engine interacts with a metadata store and a catalog connection layer for this purpose. The catalog implementation can then transparently return the corresponding privacy view or an obfuscated view related to the table. Even though the script is written as shown on the left-hand side of the slide, the actual execution simulates as if the user had originally scripted using views.
  • #20 What's particularly advantageous about this approach is its flexibility across different programming languages and platforms. It can be implemented in SQL, Scala, on Trino, or Spark, because the architecture is adaptable to each engine using the same underlying principles.
  • #21 To summarize the overarching benefits of the ViewShift rollout technique—and I will delve into more detailed architecture in the upcoming slides— it's transparent. Users do not need to modify their code to adopt new view names; they continue to use familiar table identifiers. It also supports upcoming policies because new views can be introduced and enforced seamlessly, facilitating easy management of versions, whether for upgrades or rollbacks.
  • #22 Before we explore the ViewShift architecture, let's examine the conventional query engine architecture. Typically, this includes a query engine layer with an internal connector layer, often referred to as the tables and views plugin. Its primary role is to resolve identifiers of tables and views into corresponding objects. When the engine parses and analyzes a query, it submits the identifier to this plugin, which returns the appropriate object for further processing. Normally, a table identifier prompts the return of a table object, and a view identifier prompts a view object.
  • #23 Before the introduction of ViewShift, query engines were typically configured with a Tables and Views Plugin—a fundamental component that interprets table and view identifiers and corresponds them to actual table and view objects within the database. This foundational architecture, which can be likened to a plugin within a larger system, sets the stage for the capabilities of ViewShift. The diagram illustrates a straightforward but critical relationship: when a query is executed, the engine uses this plugin to resolve the names of tables and views to their respective objects, forming the basis for query execution.
  • #24 However, with ViewShift, we've adjusted how table identifiers are resolved. We introduce an additional plugin within the tables and views plugin, tasked with mapping table identifiers to their corresponding view identifiers based on the applicable policy and context. This is especially crucial when a single table identifier may correspond to multiple views, and the appropriate view needs to be selected based on the current context. The context map, part of the View Plugin API, facilitates this by ensuring that alongside the table identifier, a specific view identifier is returned, thus substituting a view object in place of a table object.
  • #25 Compared to the conventional implementation where a table returns a table object and a view returns a view object, our new architecture embeds a transformative plugin that allows table requests to return view objects, thereby seamlessly integrating privacy by default through ViewShift.
  • #26 Now, I will hand over the presentation to Khai, who will discuss an end-to-end use case that leverages ViewShift for a recent compliance initiative at LinkedIn. Over to you, Khai.