Disrupting Data Discovery with Amundsen - How Lyft built an open source metadata platform

Tuesday, October 1st 2019
Phil Mizrahi | Product @Lyft

Agenda
• Challenges with Data Discovery
• Evaluating Solutions
• Amundsen
• Amundsen’s Architecture - How do we use Neo4j
• Impact
• What’s Next?

Challenges with Data Discovery

Data is used to make informed decisions
Who uses data: Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or make a decision
Make data the heart of every decision

Hi! I’m a new Analyst in the Fraud Department!
• Goal: What new data-driven policies can we enact to reduce driver insurance fraud?
• Idea: Let’s take a deeper look into insurance claims from drivers who have given fewer than 𝑥 rides.
• Next Step: I’ll first get all drivers who have given fewer than 𝑥 rides... but where do I look?

Step 1: Search & find data
• Ask a friend/manager/coworker
• Ask in a wider Slack channel
• Search in the GitHub repos
We end up finding tables: driver_rides & rides_driver_total

Step 2: Understand the data
• What is the difference between driver_rides and rides_driver_total?
• What do the different fields mean?
‒ Is driver_rides.completed different from rides_driver_total.lifetime_completed?
‒ What period of time does the data in each table cover?
• Dig deeper: explore using SQL queries
SELECT * FROM schema.driver_rides
WHERE ds = '2019-05-15'
LIMIT 100;
SELECT * FROM schema.rides_driver_total
WHERE ds = '2019-05-15'
LIMIT 100;

Lack of productivity had many side effects
Lots of unknowns
- Does data exist?
- Prior work?
- Source of truth?
- Who owns it?
- Who uses it?
Increased database load
Lots of queries like:
SELECT
*
FROM
default.my_table
WHERE ds = '2018-01-01'
LIMIT 100;
Interrupt-heavy data culture
- No way to know & understand trusted data
- Created channels & oncalls for data questions

Lots of wasted time for tech & business users
Figure: Analyst/DS workflow and time spent on each step

Holy grail of solving for productivity
metadata
noun /ˈmedəˌdādə, ˈmedəˌdadə/
: a set of data that describes and gives information about other data.
1. What kind of information?
2. About what data?

1. What kind of information? (aka ABC of metadata)
Application Context
Metadata needed by humans or applications to operate
● Where is the data?
● What are the semantics of the data?
Behavior
How is data created and used over time?
● Who’s using the data?
● Who created the data?
Change
Change in data over time
● How is the data evolving over time?
● Evolution of code that generates the data
Terminology borrowed from the Ground paper

2. About what data?
Short answer: Any data within your organization
Long answer:
• Data stores
• Dashboards / Reports
• Events / Schemas (schema registry)
• Streams
• People (employees)

3 complementary ways to do Data Discovery
Search based
I am looking for a table with data on “cancel rates”
- Where is the table?
- What does it contain?
- Has the analysis I want to perform already been done?
Lineage based
If this event is down, what datasets are going to be impacted?
- Upstream/downstream lineage
- Incidents, SLA misses, Data quality
Network based
I want to check what tables my manager uses
- Ownership information
- Bookmarking
- Usage through query logs
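
The lineage-based question above ("if this event is down, what datasets are going to be impacted?") amounts to a downstream traversal of a dependency graph. The following is a minimal sketch under stated assumptions: the edge map and every dataset name in it are made up for illustration and are not taken from Lyft's data or from Amundsen's code.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed as its value.
DOWNSTREAM = {
    "event.ride_requested": ["raw.ride_requests"],
    "raw.ride_requests": ["core.rides", "core.cancel_rates"],
    "core.rides": ["dash.weekly_rides"],
    "core.cancel_rates": [],
    "dash.weekly_rides": [],
}


def impacted_by(source, edges):
    """Return every dataset reachable downstream of `source`, in breadth-first order."""
    impacted = []
    visited = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in visited:  # guard against revisiting shared descendants
                visited.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


print(impacted_by("event.ride_requested", DOWNSTREAM))
# ['raw.ride_requests', 'core.rides', 'core.cancel_rates', 'dash.weekly_rides']
```

Upstream lineage would be the same traversal over the reversed edge map.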

Data discovery for ALL users
Power User
- Has been at Lyft for a long time
- Knows the data environment well: where to find data, what it means, how to use it
Pain points:
- Needs to spend a fair amount of their time sharing their knowledge with new users
- Could become a “New user” if they switch teams
New User
- Recently joined Lyft or switched to a new team
- Needs to ramp up on a lot of things, wants to start having impact soon
Pain points:
- Doesn’t know where to start. Spends their time asking questions and using cmd+F on GitHub
- Makes mistakes by misusing some datasets
Other requirements
- Leverage as much data automatically as possible
- Preferably, open source and healthy community
- API availability
- Easy to set up

Solution space
• Vendors - Alation, Collibra
• Existing open source projects (e.g. Apache Atlas)
• LinkedIn’s data portal - WhereHows & DataHub (blog, code)
• Twitter’s data discovery (blog)
• Netflix’s Metacat (code, blog)
• Airbnb’s data portal (blog, video)
• BigQuery SQL Web UI & catalog (blog)
• Goods: Organizing Google’s Datasets (paper)
• Data Warehousing and Analytics Infrastructure at Facebook (paper)

Compared various existing solutions/open source projects
Products: Alation, WhereHows, Airbnb Data Portal, Cloudera Navigator, Apache Atlas
Criteria:
- Search based
- Lineage based
- Network based
- Hive/Presto support
- Redshift support
- Open source (pref.)

Amundsen
Product named after Roald Amundsen
● First expedition to reach the South Pole
● First to explore both North & South Poles

Landing Page - Optimized for search

Search Results - Ranked on relevance & popularity

Relevance - search for “apple” on Google
Low relevance High relevance

Popularity - search for “apple” on Google
Low popularity High popularity

Search Results - Striking the balance
Relevance
● Names, Descriptions, Tags, [owners, frequent users]
● Different weights for different metadata, e.g. resource name
Popularity
● Querying activity
● Dashboarding
● Lower weight for automated querying
● Higher weight for adhoc querying
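
One way to picture the balance between relevance and popularity is a score that multiplies a weighted field match by a dampened usage count. This is only a sketch: the field weights, the usage weights, the log dampening and the sample tables are all assumptions chosen for illustration, not the formula Amundsen uses.

```python
import math

# Hypothetical per-field weights: a match on the resource name counts most.
FIELD_WEIGHTS = {"name": 3.0, "tags": 2.0, "description": 1.0}
# Hypothetical usage weights: ad hoc queries count more than automated ones.
ADHOC_WEIGHT, AUTOMATED_WEIGHT = 1.0, 0.1


def relevance(term, table):
    """Sum the weights of every metadata field whose text contains the term."""
    term = term.lower()
    return sum(w for f, w in FIELD_WEIGHTS.items() if term in table[f].lower())


def popularity(table):
    """Weighted query count, with automated querying discounted."""
    return ADHOC_WEIGHT * table["adhoc_queries"] + AUTOMATED_WEIGHT * table["automated_queries"]


def rank(term, tables):
    """Keep tables that match at all, ordered by relevance x log-dampened popularity."""
    matches = [t for t in tables if relevance(term, t) > 0]
    return sorted(
        matches,
        key=lambda t: relevance(term, t) * math.log1p(popularity(t)),
        reverse=True,
    )


TABLES = [
    {"name": "tmp_export", "tags": "", "description": "scratch copy of driver_rides",
     "adhoc_queries": 1, "automated_queries": 0},
    {"name": "rides_driver_total", "tags": "core", "description": "lifetime ride totals per driver",
     "adhoc_queries": 5, "automated_queries": 1000},
    {"name": "driver_rides", "tags": "core", "description": "rides per driver per day",
     "adhoc_queries": 200, "automated_queries": 50},
]

print([t["name"] for t in rank("driver", TABLES)])
# ['driver_rides', 'rides_driver_total', 'tmp_export']
```

In this toy data rides_driver_total receives far more queries overall, but most of them are automated, so the table that people query ad hoc ranks first.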

Computed Column Metadata Statistics
Disclaimer: these stats are arbitrary.

Architecture diagram
Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File
Databuilder Crawler
Neo4j, Elasticsearch
Metadata Service, Search Service
Frontend Service
Other Microservices: ML Feature Service, Security Service

Metadata Service
• A thin proxy layer to interact with the graph database
‒ Neo4j is currently the default graph backend
‒ Working with the community to support Apache Atlas
• Supports a REST API so other services can push / pull metadata directly
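
The "thin proxy layer" idea can be sketched as an interface that request handlers code against, with the graph backend plugged in behind it. Everything below is hypothetical: the class names, method names, table key and the tiny dispatcher are illustrative stand-ins, not the Metadata Service's actual API, and the in-memory backend merely takes the place of a Neo4j- or Atlas-backed one.

```python
from abc import ABC, abstractmethod


class MetadataProxy(ABC):
    """Hypothetical interface; each graph backend would provide its own subclass."""

    @abstractmethod
    def get_table(self, key):
        ...

    @abstractmethod
    def put_description(self, key, description):
        ...


class InMemoryProxy(MetadataProxy):
    """Stand-in backend for illustration only."""

    def __init__(self):
        self._tables = {}

    def get_table(self, key):
        return self._tables.get(key)

    def put_description(self, key, description):
        self._tables.setdefault(key, {"key": key})["description"] = description


def handle_request(proxy, method, key, body=None):
    """Tiny dispatcher mimicking REST verbs: GET pulls metadata, PUT pushes it."""
    if method == "GET":
        table = proxy.get_table(key)
        return (200, table) if table else (404, None)
    if method == "PUT":
        proxy.put_description(key, body["description"])
        return (200, proxy.get_table(key))
    return (405, None)


proxy = InMemoryProxy()
print(handle_request(proxy, "GET", "schema.driver_rides"))
print(handle_request(proxy, "PUT", "schema.driver_rides", {"description": "Rides per driver"}))
```

Because the handlers only see the interface, swapping the backend would not change the calling services.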

Why choose a graph database?

Metadata - Challenges
• No Standardization: No single data model fits all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each data set’s metadata is stored and fetched differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS (Postgres etc.): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
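
A common way to cope with both challenges is to give each source its own extractor and have all of them emit one shared record type. The sketch below assumes this approach; the `Resource` record, the extractor functions and the input shapes are invented for illustration and do not reflect real Hive metastore rows, real Mode API responses, or Databuilder's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical common record that every extractor emits."""
    kind: str    # e.g. "table", "dag", "dashboard"
    name: str
    source: str
    attributes: dict = field(default_factory=dict)


def extract_hive(metastore_rows):
    """Turn (made-up) metastore rows into table resources."""
    for row in metastore_rows:
        yield Resource("table", f"{row['db']}.{row['table']}", "hive",
                       {"columns": row["columns"]})


def extract_mode(api_payload):
    """Turn a (made-up) dashboard API payload into dashboard resources."""
    for report in api_payload["reports"]:
        yield Resource("dashboard", report["title"], "mode", {"url": report["url"]})


def crawl(*extractions):
    """Flatten the output of any number of extractors into one list."""
    return [resource for extraction in extractions for resource in extraction]


hive_rows = [{"db": "schema", "table": "driver_rides",
              "columns": ["driver_id", "completed", "ds"]}]
mode_payload = {"reports": [{"title": "Weekly rides",
                             "url": "https://example.com/reports/1"}]}

resources = crawl(extract_hive(hive_rows), extract_mode(mode_payload))
for r in resources:
    print(r.kind, r.name, r.source)
```

Source-specific details stay in `attributes`, so adding a new kind of resource means adding an extractor rather than changing the shared model.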

Search Service
• A thin proxy layer to interact with the search backend
‒ Currently supports Elasticsearch as the search backend
• Supports different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
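
The three search patterns can be illustrated with a small in-memory matcher. This is a sketch, not the Search Service: the records are made up, plain substring matching stands in for relevancy scoring, and the `category: term` query syntax is assumed from the deck's own "column: is_line_ride" example.

```python
from fnmatch import fnmatchcase

# Made-up index records; a real deployment would query the search backend instead.
RECORDS = [
    {"type": "table", "name": "event_ride_requested"},
    {"type": "table", "name": "driver_rides"},
    {"type": "column", "name": "is_line_ride"},
    {"type": "column", "name": "event_ts"},
]


def search(query, records):
    """Dispatch between category, wildcard and normal search."""
    category = None
    if ":" in query:
        # Category search: filter on the record type first, then match the term.
        prefix, term = query.split(":", 1)
        category, query = prefix.strip(), term.strip()
    candidates = [r for r in records if category is None or r["type"] == category]
    if "*" in query:
        # Wildcard search: shell-style pattern matching on the name.
        return [r["name"] for r in candidates if fnmatchcase(r["name"], query)]
    # Normal search: substring match as a stand-in for relevancy scoring.
    return [r["name"] for r in candidates if query in r["name"]]


print(search("rides", RECORDS))                 # ['driver_rides']
print(search("event_*", RECORDS))               # ['event_ride_requested', 'event_ts']
print(search("column: is_line_ride", RECORDS))  # ['is_line_ride']
```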

How to make the search results more relevant?
• Collect metrics
‒ Instrumentation for search behavior
‒ Measure click-through-rate (CTR) over top 5 results
• Experiment with different weights, e.g. boost the exact table ranking
• Advanced search:
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
‒ Future: Filtering, Autosuggest
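
The CTR metric over the top 5 results could be computed from instrumented search logs roughly as follows. The log shape (one entry per search, with the rank of the clicked result or None when nothing was clicked) is an assumption, as is the interpretation of the metric as "share of searches with a click in the top 5".

```python
def ctr_at_k(searches, k=5):
    """Fraction of searches where the user clicked a result ranked in the top k."""
    if not searches:
        return 0.0
    hits = sum(
        1 for s in searches
        if s["clicked_rank"] is not None and s["clicked_rank"] <= k
    )
    return hits / len(searches)


# Hypothetical log entries; ranks are 1-based.
SEARCH_LOG = [
    {"query": "driver rides", "clicked_rank": 1},
    {"query": "cancel rates", "clicked_rank": 3},
    {"query": "insurance claims", "clicked_rank": None},  # no click at all
    {"query": "event_*", "clicked_rank": 7},              # clicked below the top 5
]

print(ctr_at_k(SEARCH_LOG))  # 0.5
```

Tracking this number before and after a weight change gives a simple way to compare ranking experiments.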

Web Technologies
Develop Build Test

A6n @ Lyft
“This is God’s work” - George X, ex-head of Analytics, Lyft
“I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft

Roles of Amundsen users at Lyft
Penetration rate:
DS (aka analyst): 81%
RS (aka DS): 71%
PM: 22%
SWE: 17%
Cust Serv: 7%
Sp. Ops: 67%
Sp. Op Leads: 53%
Economist: 100%
Cust. Quality: 78%
Growth Mktg: 25%

Community Users
Prominent users
Active community

Community overview
Contributors

Recent Contributions from the community
• BigQuery integration (Coolblue)
• PostgreSQL and Redshift integration (Everfi)
• Security improvements and Apache Atlas integration (ING)
• Snowflake integration (LMC)
• Toolbar on landing page (In progress, Workday)
• Integrating with Delta analytics platform (In progress, Databricks)
• Talks by ING & Coolblue at conferences in Barcelona, Vilnius & Moscow

1. Develop breadth of applications
Metadata
• Data Discovery
• Compliance (GDPR/CCPA)
• Downstream impact analysis
• Data Quality
• ...

Roadmap (subject to change, not ordered)
• Index Dashboards (Product spec)
• Link business terms and process to technical metadata
• Standardize Information Governance metadata
• Include tags in search
• ACL integration, allow only specific roles to edit descriptions
• Show search context for what matched
• “Request for descriptions” aka notifications
• Data Lineage

Phil Mizrahi | @philippemizrahi | in/philippe-mizrahi
Project Code @ github.com/lyft/amundsen
Blog Post @ go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
