Making Big Data Easy
Data Discovery Lineage Data Like Code
May 15th 2019
Phil Mizrahi | @philippemizrahi | Associate Product Manager, Lyft
Jin Hyuk Chang | @jinhyukchang | Software Engineer, Lyft
go.lyft.com/datadiscoveryslides
Disrupting Data Discovery
Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Amundsen’s Architecture
• What’s next?
3
Data at Lyft
4
5
Core Infra high level architecture
Custom apps
Challenges with Data
Discovery
6
Data is used to make informed decisions
7
Analysts Data Scientists General
Managers
Engineers ExperimentersProduct
Managers
Data-based decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or take a decision
Make data the heart of every decision
• My first project is to predict the Data Meetup Attendance
• Goal: help the office team make a decision on # of chairs to provide
Hi! I am a n00b Data Scientist!
8
• Ask a friend/boss/coworker
• Ask in a wider slack channel
• Search in the Github repos
Step 1: Search & find data
9
We end up finding a table: hosted_events
that seems to be the right one
• What does this field mean?
‒ Does hosted_events.attendance data include employees?
‒ What does hosted_events.costs include?
• Dig deeper: explore using SQL queries
Step 2: Understand the data
10
SELECT *
FROM default.hosted_events
WHERE ds=’2019-05-15’
LIMIT 100;
Data Scientists spend upto 1/3rd time in Data Discovery
11
Data Discovery
• Data discovery is a problem
because of the lack of understanding
of what data exists, where, who owns
it, how to use it,...
• It is not what our data scientist
should focus on: they should focus
on Analysis work
Data-based decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or take a decision
Audience for data
discovery
12
User Personas - (1/2)
13
Analysts Data Scientists General
Managers
ExperimentersEngineersProduct
Managers
• Frequent use of data
• Deep to very deep analysis
• Exposure to new datasets
• Creating insights & developing
models
User Personas - (2/2)
14
Power user
- Has been at Lyft for a long
time
- Knows the data environment
well: where to find data, what
it means, how to use it
Pain points:
- Needs to spend a fair amount
of his time sharing his
knowledge with the noob user
- Could become noob user if he
switches teams
Noob user
- Recently joined Lyft
- Needs to ramp up on a lot of
things, wants to start having
impact soon
Pain points:
- Doesn’t know where to start.
Spends his time asking
questions and cmd+F on
github
- Makes mistakes by mis-using
some datasets
3 complementary ways to do Data Discovery
15
Search based
I am looking for a table with data on “cancel rates”
- Where is the table?
- What does it contain?
- Has the analysis I want to perform already been done?
Lineage based
If this event is down, what datasets are going to be impacted?
- Upstream/downstream lineage
- Incidents, SLA misses, Data quality
Network based
I want to check what tables my manager uses
- Ownership information
- Bookmarking
- Usage through query logs
Data discovery at Lyft
16
First person to discover the South Pole -
Norwegian explorer, Roald Amundsen
Landing page optimized for search
Search results ranked on relevance and query activity
How does search work?
19
Relevance - search for “apple” on Google
20
Low relevance High relevance
Popularity - search for “apple” on Google
21
Low popularity High popularity
Striking the balance
22
Relevance Popularity
● Names, Descriptions, Tags, [owners, frequent
users]
● Querying activity
● Dashboarding
● Different weights for automated vs adhoc
querying
Demo
23
Amundsen’s architecture
24
25
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Metadata Service
26
27
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
Amundsen table detail page
Challenge #1
Why choose Graph
database?
29
30
Why Graph database? (1/2)
31
Why Graph database? (2/2)
32
2. Metadata Service
• A thin proxy layer to interact with graph database
‒ Currently Neo4j is the default option for graph backend engine
‒ Work with the community to support Apache Atlas
• Support Rest API for other services pushing / pulling metadata directly
Challenge #2
Why not propagate the
metadata back to source?
33
Why not propagate the metadata back to source
34
2. Databuilder
35
36
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Other
Services
Other Microservices
Metadata Sources
Challenge #3
Various forms of metadata
37
38
Metadata Sources @ Lyft
Metadata - Challenges
• No Standardization: No single data model that fits for all data
resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each data set metadata is stored and fetched
differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
39
Databuilder
40
Databuilder in action
41
How is the databuilder orchestrated?
42
Amundsen uses Apache Airflow to orchestrate Databuilder jobs
3. Search service
43
44
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
3. Search Service
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend.
• Support different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then
relevancy
‒ Wildcard Search
45
Challenge #4
How to make the search
result more relevant?
46
How to make the search result more relevant?
47
• Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• Couple of improvements:
‒ Boost the exact table ranking
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
4. Frontend service
48
49
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
Search results ranked on relevance and query activity
Amundsen table detail page
What’s next?
52
Amundsen’s impact
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
‒ 90% penetration among Data Scientists
‒ +30% productivity for the Data science org.
53
Amundsen is Open Source!
• Many organizations have similar problems
‒ github.com/lyft/amundsenfrontendlibrary
• Collaboration with outside companies has started
‒ Used in Production by ING
‒ Working with WeWork, Square, Bang & Olufsen & more
54
Roadmap
PeopleDashboards
Data sets
Phase 1
(Complete)
Phase 2
(In development)
Phase 3
(In Scoping)
Streams Schemas Workflows
More
Metadata
Deeper integration with other
tools (e.g. Superset)
Privacy Governance
Summary
56
Summary
• A metadata-powered discovery tool - like Amundsen - is key to the next
wave of big data applications
• Blog post with more details: go.lyft.com/datadiscoveryblog
• Join the community: github.com/lyft/amundsenfrontendlibrary
57
Phil Mizrahi | @philippemizrahi
Jin Hyuk Chang | @jinhyukchang
Repo at github.com/lyft/amundsenfrontendlibrary
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/ 58
Backup
59
60
Lyft Data Team
Lyft Data Team
Core Data Infra Streaming Infra Visualization Experimentation BI and Logging ML Infra
All about metadata
61
Serving more metadata about existing resources
Application Context
Existence, description, semantics, etc.
Behavior
How data is created and used over time
Change
How data is changing over time
Metadata: a set of data that describes and gives information about other data
Ground, Joe Hellerstein, Vikram Sreekanti et al.
RISE Lab, UC Berkeley
Buy vs. Build vs. Adopt
63
Compared various existing solutions/open source projects
Criteria / Products Alation Where
Hows
Airbnb
Data
Portal
Cloudera
Navigator
Apache
Atlas
Search based
Lineage based
Network based
Hive/Presto support
Redshift support
Open source (pref.)
Back to mocks...
65
Search results ranked on relevance and query activity
Detailed description and metadata about data resources
Data Preview within the tool
Computed stats about column metadata
Disclaimer: these stats are arbitrary.
Built-in user feedback
Challenge #2
Pull model vs Push model
71
Pull model and Push model
72
Pull Model Push Model
● Amundsen goes to the source and
periodically update the index by pulling from
the source (e.g. database).
● Very efficient on popular platform. Hard to
be extended.
● The external source (e.g. database) pushes
metadata to a message bus which
Amundsen subscribes to.
● Flexible and extensible. Lots of work from
external source.
Crawler
Database Data graph
Scheduler
Database Message
queue
Data graph
We’re Hiring! Apply at www.lyft.com/careers
or email data-recruiting@lyft.com
Data Engineering
Engineering Manager
San Francisco
Software Engineer
San Francisco, Seattle, &
New York City
Data Infrastructure
Engineering Manager
San Francisco
Software Engineer
San Francisco & Seattle
Experimentation
Software Engineer
San Francisco
Streaming
Software Engineer
San Francisco
Observability
Software Engineer
San Francisco
Strata SF 2019
Rate this session
session page on conference website
O’Reilly Events App

Meetup SF - Amundsen

  • 1.
    Making Big DataEasy Data Discovery Lineage Data Like Code
  • 2.
    May 15th 2019 PhilMizrahi | @philippemizrahi | Associate Product Manager, Lyft Jin Hyuk Chang | @jinhyukchang | Software Engineer, Lyft go.lyft.com/datadiscoveryslides Disrupting Data Discovery
  • 3.
    Agenda • Data atLyft • Challenges with Data Discovery • Data Discovery at Lyft • Amundsen’s Architecture • What’s next? 3
  • 4.
  • 5.
    5 Core Infra highlevel architecture Custom apps
  • 6.
  • 7.
    Data is usedto make informed decisions 7 Analysts Data Scientists General Managers Engineers ExperimentersProduct Managers Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or take a decision Make data the heart of every decision
  • 8.
    • My firstproject is to predict the Data Meetup Attendance • Goal: help the office team make a decision on # of chairs to provide Hi! I am a n00b Data Scientist! 8
  • 9.
    • Ask afriend/boss/coworker • Ask in a wider slack channel • Search in the Github repos Step 1: Search & find data 9 We end up finding a table: hosted_events that seems to be the right one
  • 10.
    • What doesthis field mean? ‒ Does hosted_events.attendance data include employees? ‒ What does hosted_events.costs include? • Dig deeper: explore using SQL queries Step 2: Understand the data 10 SELECT * FROM default.hosted_events WHERE ds=’2019-05-15’ LIMIT 100;
  • 11.
    Data Scientists spendupto 1/3rd time in Data Discovery 11 Data Discovery • Data discovery is a problem because of the lack of understanding of what data exists, where, who owns it, how to use it,... • It is not what our data scientist should focus on: they should focus on Analysis work Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or take a decision
  • 12.
  • 13.
    User Personas -(1/2) 13 Analysts Data Scientists General Managers ExperimentersEngineersProduct Managers • Frequent use of data • Deep to very deep analysis • Exposure to new datasets • Creating insights & developing models
  • 14.
    User Personas -(2/2) 14 Power user - Has been at Lyft for a long time - Knows the data environment well: where to find data, what it means, how to use it Pain points: - Needs to spend a fair amount of his time sharing his knowledge with the noob user - Could become noob user if he switches teams Noob user - Recently joined Lyft - Needs to ramp up on a lot of things, wants to start having impact soon Pain points: - Doesn’t know where to start. Spends his time asking questions and cmd+F on github - Makes mistakes by mis-using some datasets
  • 15.
    3 complementary waysto do Data Discovery 15 Search based I am looking for a table with data on “cancel rates” - Where is the table? - What does it contain? - Has the analysis I want to perform already been done? Lineage based If this event is down, what datasets are going to be impacted? - Upstream/downstream lineage - Incidents, SLA misses, Data quality Network based I want to check what tables my manager uses - Ownership information - Bookmarking - Usage through query logs
  • 16.
    Data discovery atLyft 16 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  • 17.
  • 18.
    Search results rankedon relevance and query activity
  • 19.
  • 20.
    Relevance - searchfor “apple” on Google 20 Low relevance High relevance
  • 21.
    Popularity - searchfor “apple” on Google 21 Low popularity High popularity
  • 22.
    Striking the balance 22 RelevancePopularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  • 23.
  • 24.
  • 25.
    25 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 26.
  • 27.
    27 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 28.
  • 29.
    Challenge #1 Why chooseGraph database? 29
  • 30.
  • 31.
  • 32.
    32 2. Metadata Service •A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  • 33.
    Challenge #2 Why notpropagate the metadata back to source? 33
  • 34.
    Why not propagatethe metadata back to source 34
  • 35.
  • 36.
    36 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  • 37.
  • 38.
  • 39.
    Metadata - Challenges •No Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: Each data set metadata is stored and fetched differently ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 39
  • 40.
  • 41.
  • 42.
    How is thedatabuilder orchestrated? 42 Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  • 43.
  • 44.
    44 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 45.
    3. Search Service •A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 45
  • 46.
    Challenge #4 How tomake the search result more relevant? 46
  • 47.
    How to makethe search result more relevant? 47 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search (e.g. event_*) ‒ Support category search (e.g. column: is_line_ride)
  • 48.
  • 49.
    49 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 50.
    Search results rankedon relevance and query activity
  • 51.
  • 52.
  • 53.
    Amundsen’s impact • Tremendoussuccess at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! ‒ 90% penetration among Data Scientists ‒ +30% productivity for the Data science org. 53
  • 54.
    Amundsen is OpenSource! • Many organizations have similar problems ‒ github.com/lyft/amundsenfrontendlibrary • Collaboration with outside companies has started ‒ Used in Production by ING ‒ Working with WeWork, Square, Bang & Olufsen & more 54
  • 55.
    Roadmap PeopleDashboards Data sets Phase 1 (Complete) Phase2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows More Metadata Deeper integration with other tools (e.g. Superset) Privacy Governance
  • 56.
  • 57.
    Summary • A metadata-powereddiscovery tool - like Amundsen - is key to the next wave of big data applications • Blog post with more details: go.lyft.com/datadiscoveryblog • Join the community: github.com/lyft/amundsenfrontendlibrary 57
  • 58.
    Phil Mizrahi |@philippemizrahi Jin Hyuk Chang | @jinhyukchang Repo at github.com/lyft/amundsenfrontendlibrary Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 58
  • 59.
  • 60.
    60 Lyft Data Team LyftData Team Core Data Infra Streaming Infra Visualization Experimentation BI and Logging ML Infra
  • 61.
  • 62.
    Serving more metadataabout existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time Metadata: a set of data that describes and gives information about other data Ground, Joe Hellerstein, Vikram Sreekanti et al. RISE Lab, UC Berkeley
  • 63.
    Buy vs. Buildvs. Adopt 63
  • 64.
    Compared various existingsolutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  • 65.
  • 66.
    Search results rankedon relevance and query activity
  • 67.
    Detailed description andmetadata about data resources
  • 68.
  • 69.
    Computed stats aboutcolumn metadata Disclaimer: these stats are arbitrary.
  • 70.
  • 71.
    Challenge #2 Pull modelvs Push model 71
  • 72.
    Pull model andPush model 72 Pull Model Push Model ● Amundsen goes to the source and periodically update the index by pulling from the source (e.g. database). ● Very efficient on popular platform. Hard to be extended. ● The external source (e.g. database) pushes metadata to a message bus which Amundsen subscribes to. ● Flexible and extensible. Lots of work from external source. Crawler Database Data graph Scheduler Database Message queue Data graph
  • 73.
    We’re Hiring! Applyat www.lyft.com/careers or email data-recruiting@lyft.com Data Engineering Engineering Manager San Francisco Software Engineer San Francisco, Seattle, & New York City Data Infrastructure Engineering Manager San Francisco Software Engineer San Francisco & Seattle Experimentation Software Engineer San Francisco Streaming Software Engineer San Francisco Observability Software Engineer San Francisco
  • 74.
    Strata SF 2019 Ratethis session session page on conference website O’Reilly Events App