Meetup SF - Amundsen

Making Big Data Easy
Data Discovery Lineage Data Like Code

May 15th 2019
Phil Mizrahi | @philippemizrahi | Associate Product Manager, Lyft
Jin Hyuk Chang | @jinhyukchang | Software Engineer, Lyft
go.lyft.com/datadiscoveryslides
Disrupting Data Discovery

Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Amundsen’s Architecture
• What’s next?
3

5
Core Infra high level architecture
Custom apps

Challenges with Data
Discovery
6

Data is used to make informed decisions
7
Analysts Data Scientists General
Managers
Engineers ExperimentersProduct
Managers
Data-based decision making process:
1. Search & ﬁnd data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or take a decision
Make data the heart of every decision

• My ﬁrst project is to predict the Data Meetup Attendance
• Goal: help the oﬃce team make a decision on # of chairs to provide
Hi! I am a n00b Data Scientist!
8

• Ask a friend/boss/coworker
• Ask in a wider slack channel
• Search in the Github repos
Step 1: Search & ﬁnd data
9
We end up ﬁnding a table: hosted_events
that seems to be the right one

• What does this ﬁeld mean?
‒ Does hosted_events.attendance data include employees?
‒ What does hosted_events.costs include?
• Dig deeper: explore using SQL queries
Step 2: Understand the data
10
SELECT *
FROM default.hosted_events
WHERE ds=’2019-05-15’
LIMIT 100;

Data Scientists spend upto 1/3rd time in Data Discovery
11
Data Discovery
• Data discovery is a problem
because of the lack of understanding
of what data exists, where, who owns
it, how to use it,...
• It is not what our data scientist
should focus on: they should focus
on Analysis work
Data-based decision making process:
1. Search & ﬁnd data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or take a decision

Audience for data
discovery
12

User Personas - (1/2)
13
Analysts Data Scientists General
Managers
ExperimentersEngineersProduct
Managers
• Frequent use of data
• Deep to very deep analysis
• Exposure to new datasets
• Creating insights & developing
models

User Personas - (2/2)
14
Power user
- Has been at Lyft for a long
time
- Knows the data environment
well: where to ﬁnd data, what
it means, how to use it
Pain points:
- Needs to spend a fair amount
of his time sharing his
knowledge with the noob user
- Could become noob user if he
switches teams
Noob user
- Recently joined Lyft
- Needs to ramp up on a lot of
things, wants to start having
impact soon
Pain points:
- Doesn’t know where to start.
Spends his time asking
questions and cmd+F on
github
- Makes mistakes by mis-using
some datasets

3 complementary ways to do Data Discovery
15
Search based
I am looking for a table with data on “cancel rates”
- Where is the table?
- What does it contain?
- Has the analysis I want to perform already been done?
Lineage based
If this event is down, what datasets are going to be impacted?
- Upstream/downstream lineage
- Incidents, SLA misses, Data quality
Network based
I want to check what tables my manager uses
- Ownership information
- Bookmarking
- Usage through query logs

Data discovery at Lyft
16
First person to discover the South Pole -
Norwegian explorer, Roald Amundsen

Landing page optimized for search

Search results ranked on relevance and query activity

Relevance - search for “apple” on Google
20
Low relevance High relevance

Popularity - search for “apple” on Google
21
Low popularity High popularity

Striking the balance
22
Relevance Popularity
● Names, Descriptions, Tags, [owners, frequent
users]
● Querying activity
● Dashboarding
● Diﬀerent weights for automated vs adhoc
querying

25
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources

27
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources

Challenge #1
Why choose Graph
database?
29

32
2. Metadata Service
• A thin proxy layer to interact with graph database
‒ Currently Neo4j is the default option for graph backend engine
‒ Work with the community to support Apache Atlas
• Support Rest API for other services pushing / pulling metadata directly

Challenge #2
Why not propagate the
metadata back to source?
33

Why not propagate the metadata back to source
34

36
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Frontend ServiceML
Feature
Service
Other
Services
Other Microservices
Metadata Sources

Challenge #3
Various forms of metadata
37

Metadata - Challenges
• No Standardization: No single data model that fits for all data
resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each data set metadata is stored and fetched
differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
39

How is the databuilder orchestrated?
42
Amundsen uses Apache Airflow to orchestrate Databuilder jobs

44
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources

3. Search Service
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend.
• Support diﬀerent search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records ﬁrst based on data type, then
relevancy
‒ Wildcard Search
45

Challenge #4
How to make the search
result more relevant?
46

How to make the search result more relevant?
47
• Deﬁne a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• Couple of improvements:
‒ Boost the exact table ranking
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)

49
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources

Amundsen’s impact
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
‒ 90% penetration among Data Scientists
‒ +30% productivity for the Data science org.
53

Amundsen is Open Source!
• Many organizations have similar problems
‒ github.com/lyft/amundsenfrontendlibrary
• Collaboration with outside companies has started
‒ Used in Production by ING
‒ Working with WeWork, Square, Bang & Olufsen & more
54

Roadmap
PeopleDashboards
Data sets
Phase 1
(Complete)
Phase 2
(In development)
Phase 3
(In Scoping)
Streams Schemas Workflows
More
Metadata
Deeper integration with other
tools (e.g. Superset)
Privacy Governance

Summary
• A metadata-powered discovery tool - like Amundsen - is key to the next
wave of big data applications
• Blog post with more details: go.lyft.com/datadiscoveryblog
• Join the community: github.com/lyft/amundsenfrontendlibrary
57

Phil Mizrahi | @philippemizrahi
Jin Hyuk Chang | @jinhyukchang
Repo at github.com/lyft/amundsenfrontendlibrary
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/ 58

60
Lyft Data Team
Lyft Data Team
Core Data Infra Streaming Infra Visualization Experimentation BI and Logging ML Infra

Serving more metadata about existing resources
Application Context
Existence, description, semantics, etc.
Behavior
How data is created and used over time
Change
How data is changing over time
Metadata: a set of data that describes and gives information about other data
Ground, Joe Hellerstein, Vikram Sreekanti et al.
RISE Lab, UC Berkeley

Compared various existing solutions/open source projects
Criteria / Products Alation Where
Hows
Airbnb
Data
Portal
Cloudera
Navigator
Apache
Atlas
Search based
Lineage based
Network based
Hive/Presto support
Redshift support
Open source (pref.)

Detailed description and metadata about data resources

Computed stats about column metadata
Disclaimer: these stats are arbitrary.

Challenge #2
Pull model vs Push model
71

Pull model and Push model
72
Pull Model Push Model
● Amundsen goes to the source and
periodically update the index by pulling from
the source (e.g. database).
● Very efficient on popular platform. Hard to
be extended.
● The external source (e.g. database) pushes
metadata to a message bus which
Amundsen subscribes to.
● Flexible and extensible. Lots of work from
external source.
Crawler
Database Data graph
Scheduler
Database Message
queue
Data graph

We’re Hiring! Apply at www.lyft.com/careers
or email data-recruiting@lyft.com
Data Engineering
Engineering Manager
San Francisco
Software Engineer
San Francisco, Seattle, &
New York City
Data Infrastructure
Engineering Manager
San Francisco
Software Engineer
San Francisco & Seattle
Experimentation
Software Engineer
San Francisco
Streaming
Software Engineer
San Francisco
Observability
Software Engineer
San Francisco

Strata SF 2019
Rate this session
session page on conference website
O’Reilly Events App

Meetup SF - Amundsen

More Related Content

What's hot

Similar to Meetup SF - Amundsen

Recently uploaded

Meetup SF - Amundsen