Mark Grover | Lyft | @mark_grover
Alyssa Ransbury | Square | @alyssaran
Slides: cutt.ly/a6n
Amundsen: From Discovering to Securing Data
Who we are
• Product Manager at Lyft
• Committer/PMC on various
Apache projects
• Previously developer on
Spark at Cloudera
2
• Security Engineer at Square
• Leads and supports on privacy
engineering efforts across
dozens of product teams
Problem
3
>35% consumer time spent on discovering & validating
trustworthy data
4
Analyst/DS workflow and time spent on each step
Other side effects
1. Interrupt heavy culture
● What data should I use for
X?
● Is that trustworthy?
2. Increased DB load
SELECT * … LIMIT 100
SELECT COUNT(*) over x days
Requirements &
solutions
5
1. What kind of information? (aka ABC of metadata)
6
Application Context
Metadata needed by humans or applications to operate
● Where is the data?
● What are the semantics of the data?
Behavior
How is data created and used over time?
● Who’s using the data?
● Who created the data?
Change
Change in data over time
● How is the data evolving over time?
● Evolution of code that generates the data
Terminology borrowed from Ground paper
Data Lineage TODAY
Short answer: Any data within your organization
Long answer:
2. About what data?
7
Data stores
Schema registry
Events /
Schemas
StreamsPeople
Employees
TODAY
NotebooksDashboard /
Reports
Processes
Develop breadth of applications
8
Metadata
Data
Discovery
& Trust
Compliance
(GDPR /
CCPA /
Financial)
Security /
ACL
Data
Monitoring
/ Ops
Data
Quality
Cost
Mgmt
Data
Mainten-
ance
Search based Lineage based Network based Programmatic
Where is the
table/dashboard for X?
What does it contain?
I am changing a data model,
who are the owner and most
common users?
I want to follow a
power user in my team.
Access metadata
programmatically
Does this analysis
already exist?
This table’s delivery was
delayed today, I want to
notify everyone downstream.
I want to bookmark
tables of interest and
get a feed of data
delay, schema change,
incidents.
Put (pull / push)
metadata
programmatically
Other requirements
● Leverage as much data automatically as possible
● Preferably, open source and healthy community
● Easy to set up
Goal: Reduce time to find trusted data w/ versatile graph
Meet Amundsen
10
First person to discover the South Pole -
Norwegian explorer, Roald Amundsen
Search for data, or browse
Search for datasets
See details of the data set
See detailed descriptions and profile of the column
Preview existing data, if perms allow
Search for existing dashboards (aka reports)
16
Dashboard detail page
17
Search for co-workers!
Search for data owned and used by your peers!
Impact
20
21
“This is God’s
work” - Head of
Analytics, Lyft
“I was on call and
I’m confident 50%
of the questions
could have been
answered by a
simple search in
Amundsen” - Data
Scientist,Lyft
A6n @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages
Roles of Amundsen users at Lyft
22
Penetration rate:
DS (aka analyst): 81%
RS (aka DS): 71%
PM: 22%
SWE: 17%
Cust Serv: 7% (12/390)
Sp. Ops: 67% (10/15)
Sp. Op Leads: 53% (9/17)
Economist: 100% (7/7)
Cust. Quality: 78% (7/9)
Growth Mktg: 25% (6/24)
Roadmap
23
Roadmap (subject to change, not ordered)
• Tighter Lineage integration / visualization
• Better view integration
• ACL integration, allow only specific roles to edit descriptions
• Show search context for what matched
• Index more resources (notebooks, Kafka topics, etc.)
24
Amundsen Open Source
25
700
Community
members
150+
Companies in
the community
20+
Companies using
in production
Amundsen Open Source Community
26
ProminentusersActivecommunity
Develop breadth of applications
27
Metadata
Data
Discovery
& Trust
Compliance
(GDPR /
CCPA /
Financial)
Security /
ACL
Data
Monitoring
/ Ops
Data
Quality
Cost
Mgmt
Data
Mainten-
ance
Square:
Scaling Data Security /
Privacy
28
29
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
30
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
31
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
32
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
33
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
34
The graph
Benefits of a graph
The Problem
36
● Relied mostly on manual work to understand our data
● We have A LOT of data, and growing
● NOT scalable
● Lots of goals related to data privacy on top of requirements from laws like
GDPR / CCPA
37
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Store more
specific
metadata
2. Automate
data
inventorying
3. Support
more granular
access
controls
1. Store more specific
metadata
38
39
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Store more
specific
metadata
New Metadata Types
40
PII
Semantic
Type
Data
Storage
Security
Data
Subject
Type
PII
Semantic
Type
for data in
Column
PII Semantic Type
41
A set value describing the general contents of data in a column, with a focus on
discovering data that fits into one of three buckets
● Sensitive by itself
● Could be sensitive when taken together with other data
● Links to sensitive data, like an internal identification token
Some
Column
Data
Data Storage Security
42
A set value describing the form of data in a column
PII
Semantic
Type
for data in
Column
Some
Column
Data
Data
Storage
Security
for data in
Column
Data Subject Type
43
A set value describing users whose information exists in a column
● Example: square_product_user
PII
Semantic
Type
for data in
Column
Some
Column
Data
Data
Storage
Security
for data in
Column
Data
Subject
Type
for data in
Column
Example
44
PII
Semantic
Type
Column 1
Data
Storage
Security
encrypted
Data
Subject
Type
app_customer
Data
Storage
Security
plaintext
Column 2
UI Changes
45
2. Automate data
inventorying*
46
* where possible
47
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
2. Automate
data
inventorying
Why?
● We have a lot of data!
● Metadata should be kept current at scale
● New data sources get created all the time
● Mistakes happen
48
Flagging Possibly Sensitive Data Locations
49
Some
Column
Data
Data
Storage
Security
encrypted
Flagging Possibly Sensitive Data Locations
50
Some
Column
Data
Data
Storage
Security
encrypted
Flag
Flags
Contain information about:
● The metadata type it relates to (i.e. PII Semantic Type, Data Storage Security)
● The possible value (i.e. encrypted for Data Storage Security)
● Human readable description
51
Some
Column
Data
Flag
Reason
Result
Reviewer
Checking for Data Sensitivity
● Column name
○ Keyword partial or full match
○ Literal PII match
● Column type
● Table name
● Sample values + Google DLP API
52
User Flow
53
Some
Column
Data
Flag
Data Storage
Security
plaintext
Flag
PII Semantic
Type
phone number
User Flow
54
Some
Column
Data
Flag
Data Storage
Security
plaintext
Result: True
Reviewer: person 1
Data
Storage
Security
plaintext
Flag
PII Semantic
Type
phone number
Result: False
Reviewer: person 2
UI Changes
55
UI Changes
56
3. Support more granular
access controls
57
58
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
3. Support
more granular
access
controls
Capability-governed access control
● Purpose - An enumerated, legal-approved reason to access specific PII
Semantic Types
● Capability - A combination of Purpose and Data Subject Type
59
Capability-governed access control
60
Capability
Purpose
Subject
Type
Semantic
Type
Semantic
Type
Capability-governed access control
61
Capability
001
USER SUPPORT
CASH
APP
CUSTOMER
PERSON
NAME
EMAIL
ADDRESS
Capability-governed access control
62
Amundsen
Gating
Service
4. Future Work
63
Future and Work in Progress
● Gremlin Proxy
● UI components
○ show users exactly the data they have access to based on their capabilities
○ show the data that is tied to a specific capability
○ empower data owners to revoke broad access to some data
● Supporting additional data types
● And beyond! (we are hiring…)
64
Summary
65
Develop breadth of applications
66
Metadata
Data
Discovery
& Trust
Compliance
(GDPR /
CCPA /
Financial)
Security /
ACL
Data
Monitoring
/ Ops
Data
Quality
Cost
Mgmt
Data
Mainten-
ance
Get started!
Check out Github:
github.com/lyft/amundsen
Join Slack (~700 users):
Link to join on github
67
Check out other blogs and videos:
Amundsen for Privacy @ Square
Introducing Amundsen
Amundsen's architecture
YouTube channel
Thanks!
68
Mark Grover | Lyft | @mark_grover
Alyssa Ransbury | Square | @alyssaran
Icons under Creative Commons License from https://thenounproject.com/

Amundsen: From discovering to security data

  • 1.
    Mark Grover |Lyft | @mark_grover Alyssa Ransbury | Square | @alyssaran Slides: cutt.ly/a6n Amundsen: From Discovering to Securing Data
  • 2.
    Who we are •Product Manager at Lyft • Committer/PMC on various Apache projects • Previously developer on Spark at Cloudera 2 • Security Engineer at Square • Leads and supports on privacy engineering efforts across dozens of product teams
  • 3.
  • 4.
    >35% consumer timespent on discovering & validating trustworthy data 4 Analyst/DS workflow and time spent on each step Other side effects 1. Interrupt heavy culture ● What data should I use for X? ● Is that trustworthy? 2. Increased DB load SELECT * … LIMIT 100 SELECT COUNT(*) over x days
  • 5.
  • 6.
    1. What kindof information? (aka ABC of metadata) 6 Application Context Metadata needed by humans or applications to operate ● Where is the data? ● What are the semantics of the data? Behavior How is data created and used over time? ● Who’s using the data? ● Who created the data? Change Change in data over time ● How is the data evolving over time? ● Evolution of code that generates the data Terminology borrowed from Ground paper Data Lineage TODAY
  • 7.
    Short answer: Anydata within your organization Long answer: 2. About what data? 7 Data stores Schema registry Events / Schemas StreamsPeople Employees TODAY NotebooksDashboard / Reports Processes
  • 8.
    Develop breadth ofapplications 8 Metadata Data Discovery & Trust Compliance (GDPR / CCPA / Financial) Security / ACL Data Monitoring / Ops Data Quality Cost Mgmt Data Mainten- ance
  • 9.
    Search based Lineagebased Network based Programmatic Where is the table/dashboard for X? What does it contain? I am changing a data model, who are the owner and most common users? I want to follow a power user in my team. Access metadata programmatically Does this analysis already exist? This table’s delivery was delayed today, I want to notify everyone downstream. I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. Put (pull / push) metadata programmatically Other requirements ● Leverage as much data automatically as possible ● Preferably, open source and healthy community ● Easy to set up Goal: Reduce time to find trusted data w/ versatile graph
  • 10.
    Meet Amundsen 10 First personto discover the South Pole - Norwegian explorer, Roald Amundsen
  • 11.
  • 12.
  • 13.
    See details ofthe data set
  • 14.
    See detailed descriptionsand profile of the column
  • 15.
    Preview existing data,if perms allow
  • 16.
    Search for existingdashboards (aka reports) 16
  • 17.
  • 18.
  • 19.
    Search for dataowned and used by your peers!
  • 20.
  • 21.
    21 “This is God’s work”- Head of Analytics, Lyft “I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Data Scientist,Lyft A6n @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages
  • 22.
    Roles of Amundsenusers at Lyft 22 Penetration rate: DS (aka analyst): 81% RS (aka DS): 71% PM: 22% SWE: 17% Cust Serv: 7% (12/390) Sp. Ops: 67% (10/15) Sp. Op Leads: 53% (9/17) Economist: 100% (7/7) Cust. Quality: 78% (7/9) Growth Mktg: 25% (6/24)
  • 23.
  • 24.
    Roadmap (subject tochange, not ordered) • Tighter Lineage integration / visualization • Better view integration • ACL integration, allow only specific roles to edit descriptions • Show search context for what matched • Index more resources (notebooks, Kafka topics, etc.) 24
  • 25.
    Amundsen Open Source 25 700 Community members 150+ Companiesin the community 20+ Companies using in production
  • 26.
    Amundsen Open SourceCommunity 26 ProminentusersActivecommunity
  • 27.
    Develop breadth ofapplications 27 Metadata Data Discovery & Trust Compliance (GDPR / CCPA / Financial) Security / ACL Data Monitoring / Ops Data Quality Cost Mgmt Data Mainten- ance
  • 28.
  • 29.
    29 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 30.
    30 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 31.
    31 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 32.
    32 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 33.
    33 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 34.
  • 35.
  • 36.
    The Problem 36 ● Reliedmostly on manual work to understand our data ● We have A LOT of data, and growing ● NOT scalable ● Lots of goals related to data privacy on top of requirements from laws like GDPR / CCPA
  • 37.
    37 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources 1. Store more specific metadata 2. Automate data inventorying 3. Support more granular access controls
  • 38.
    1. Store morespecific metadata 38
  • 39.
    39 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources 1. Store more specific metadata
  • 40.
  • 41.
    PII Semantic Type for data in Column PIISemantic Type 41 A set value describing the general contents of data in a column, with a focus on discovering data that fits into one of three buckets ● Sensitive by itself ● Could be sensitive when taken together with other data ● Links to sensitive data, like an internal identification token Some Column Data
  • 42.
    Data Storage Security 42 Aset value describing the form of data in a column PII Semantic Type for data in Column Some Column Data Data Storage Security for data in Column
  • 43.
    Data Subject Type 43 Aset value describing users whose information exists in a column ● Example: square_product_user PII Semantic Type for data in Column Some Column Data Data Storage Security for data in Column Data Subject Type for data in Column
  • 44.
  • 45.
  • 46.
  • 47.
    47 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources 2. Automate data inventorying
  • 48.
    Why? ● We havea lot of data! ● Metadata should be kept current at scale ● New data sources get created all the time ● Mistakes happen 48
  • 49.
    Flagging Possibly SensitiveData Locations 49 Some Column Data Data Storage Security encrypted
  • 50.
    Flagging Possibly SensitiveData Locations 50 Some Column Data Data Storage Security encrypted Flag
  • 51.
    Flags Contain information about: ●The metadata type it relates to (i.e. PII Semantic Type, Data Storage Security) ● The possible value (i.e. encrypted for Data Storage Security) ● Human readable description 51 Some Column Data Flag Reason Result Reviewer
  • 52.
    Checking for DataSensitivity ● Column name ○ Keyword partial or full match ○ Literal PII match ● Column type ● Table name ● Sample values + Google DLP API 52
  • 53.
  • 54.
    User Flow 54 Some Column Data Flag Data Storage Security plaintext Result:True Reviewer: person 1 Data Storage Security plaintext Flag PII Semantic Type phone number Result: False Reviewer: person 2
  • 55.
  • 56.
  • 57.
    3. Support moregranular access controls 57
  • 58.
    58 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources 3. Support more granular access controls
  • 59.
    Capability-governed access control ●Purpose - An enumerated, legal-approved reason to access specific PII Semantic Types ● Capability - A combination of Purpose and Data Subject Type 59
  • 60.
  • 61.
    Capability-governed access control 61 Capability 001 USERSUPPORT CASH APP CUSTOMER PERSON NAME EMAIL ADDRESS
  • 62.
  • 63.
  • 64.
    Future and Workin Progress ● Gremlin Proxy ● UI components ○ show users exactly the data they have access to based on their capabilities ○ show the data that is tied to a specific capability ○ empower data owners to revoke broad access to some data ● Supporting additional data types ● And beyond! (we are hiring…) 64
  • 65.
  • 66.
    Develop breadth ofapplications 66 Metadata Data Discovery & Trust Compliance (GDPR / CCPA / Financial) Security / ACL Data Monitoring / Ops Data Quality Cost Mgmt Data Mainten- ance
  • 67.
    Get started! Check outGithub: github.com/lyft/amundsen Join Slack (~700 users): Link to join on github 67 Check out other blogs and videos: Amundsen for Privacy @ Square Introducing Amundsen Amundsen's architecture YouTube channel
  • 68.
    Thanks! 68 Mark Grover |Lyft | @mark_grover Alyssa Ransbury | Square | @alyssaran Icons under Creative Commons License from https://thenounproject.com/