March 2019
Mark Grover | @mark_grover | Product Management, Lyft
Tao Feng | @feng-tao | Software Engineer, Lyft
Disrupting Data Discovery
Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Architecture
• Summary
2
Data platform users
3
Data Modelers Analysts Data Scientists General
Managers
Data Platform
Engineers ExperimentersProduct
Managers
5
Core Infra high level architecture
Custom apps
Data Discovery
6
• My first project is to analyze and predict Strata Attendance
• Where is the data?
• What does it mean?
Hi! I am a n00b Data Scientist!
7
• Option 1: Phone a friend!
• Option 2: Github search
Status quo
8
• What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
• Let me dig in and understand
Understand the context
9
Explore
SELECT
*
FROM
default.my_table
WHERE ds=’2018-01-01’
LIMIT 100;
Exploring with SELECT * is EVIL
1. Lack of productivity for data scientists
2. Increased load on the databases
11
Data Scientists spend upto 1/3rd time in Data Discovery...
12
• Data discovery
‒ Lack of
understanding of
what data exists,
where, who owns it,
who uses it, and how
to request access.
Audience for data
discovery
13
Data Discovery - User personas
14
Data Modelers Analysts Data Scientists General
Managers
Data Platform
Engineers ExperimentersProduct
Managers
3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot
due to questions
● Lost
● Ask “power users” a
lot of questions
● Dependencies
landing on time
● Communicating with
stakeholders
Noob user Manager
Search based Lineage based Network based
Where is the
table/dashboard for X?
What does it contain?
I am changing a data
model, who are the owner
and most common users?
I want to follow a power
user in my team.
Does this analysis already
exist?
This table’s delivery was
delayed today, I want to
notify everyone
downstream.
I want to bookmark tables of
interest and get a feed of
data delay, schema change,
incidents.
Data Discovery answers 3 kinds of questions
Buy vs. Build vs. Adopt
17
Compared various existing solutions/open source projects
Criteria / Products Alation Where
Hows
Airbnb
Data
Portal
Cloudera
Navigator
Apache
Atlas
Search based
Lineage based
Network based
Hive/Presto support
Redshift support
Open source (pref.)
Meet Amundsen
19
First person to discover the South Pole -
Norwegian explorer, Roald Amundsen
Landing page optimized for search
Search results ranked on relevance and query activity
How does search work?
22
Relevance - search for “apple” on Google
23
Low relevance High relevance
Popularity - search for “apple” on Google
24
Low popularity High popularity
Striking the balance
25
Relevance Popularity
● Names, Descriptions, Tags, [owners, frequent
users]
● Querying activity
● Dashboarding
● Different weights for automated vs adhoc
querying
Back to mocks...
26
Search results ranked on relevance and query activity
Detailed description and metadata about data resources
Data Preview within the tool
Computed stats about column metadata
Disclaimer: these stats are arbitrary.
Built-in user feedback
Amundsen’s architecture
32
33
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Frontend Service
34
35
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
Detailed description and metadata about data resources
2. Metadata Service
37
38
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
39
2. Metadata Service
• A thin proxy layer to interact with graph database
‒ Currently Neo4j is the default option for graph backend engine.
‒ Work with the community to support Apache Atlas
• Support Rest API for other services pushing / pulling metadata directly
Trade Off #1
Why choose Graph
database
40
Why Graph database?
Why Graph database?
Trade Off #2
Why not propagate the
metadata back to source
43
Why not propagate the metadata back to source
44
Why not propagate the metadata back to source
45
?
?
3. Search Service
46
47
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
3. Search Service
• Support REST API for building indexes
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as backend.
• Support different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
48
Challenge #1
How to make the search
result more relevant?
49
How to make the search result more relevant?
50
• Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• Couple of improvements:
‒ Boost the exact table ranking
‒ Support wildcard search
‒ Support category search (e.g. “column: is_line_ride”)
4. Data Builder
51
52
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Other
Services
Other Microservices
Metadata Sources
Challenge #1
Various forms of metadata
53
54
Metadata Sources @ Lyft
Metadata - Challenges
• Standardization: No single data model that fits for all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Extraction: Each data set metadata is stored and fetched differently,
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
55
Challenge #2
Pull model vs Push model
56
Pull model vs. Push model
57
Pull Model Push Model
● Periodically update the index by pulling from
the system (e.g. database) via crawlers.
● The system (e.g. database) pushes
metadata to a message bus which
downstream subscribes to.
Crawler
Database Data graph
Scheduler
Database Message
queue
Data graph
4. Databuilder
Databuilder in action
How are we building data? Databuilder
How is databuilder orchestrated?
Amundsen uses Apache Airflow to orchestrate
Databuilder jobs
What’s next?
64
Amundsen seems to be more useful than what we thought
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
• Many organizations have similar problems
‒ Collaborating with ING, WeWork and more
‒ We plan to announce open source soon
65
Impact - Amundsen at Lyft
66
Beta release
(internal)
Generally Available
(GA) release
Alpha release
Adding more kinds of data resources
PeopleDashboardsData sets
Phase 1
(Complete)
Phase 2
(In development)
Phase 3
(In Scoping)
Streams Schemas Workflows
Serving more metadata about existing resources
Application Context
Existence, description, semantics, etc.
Behavior
How data is created and used over time
Change
How data is changing over time
Summary
69
Summary
• Data Discovery making data scientists unproductive
• 3 types of data discovery - search, lineage and network based
• Amundsen: Data graph for all data
• Blog post with more details: go.lyft.com/datadiscoveryblog
70
Mark Grover | @mark_grover
Tao Feng | @feng-tao
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
71
Backup
72

Strata sf - Amundsen presentation

  • 1.
    March 2019 Mark Grover| @mark_grover | Product Management, Lyft Tao Feng | @feng-tao | Software Engineer, Lyft Disrupting Data Discovery
  • 2.
    Agenda • Data atLyft • Challenges with Data Discovery • Data Discovery at Lyft • Architecture • Summary 2
  • 3.
    Data platform users 3 DataModelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  • 4.
    5 Core Infra highlevel architecture Custom apps
  • 5.
  • 6.
    • My firstproject is to analyze and predict Strata Attendance • Where is the data? • What does it mean? Hi! I am a n00b Data Scientist! 7
  • 7.
    • Option 1:Phone a friend! • Option 2: Github search Status quo 8
  • 8.
    • What doesthis field mean? ‒ Does attendance data include employees? ‒ Does it include revenue? • Let me dig in and understand Understand the context 9
  • 9.
  • 10.
    Exploring with SELECT* is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases 11
  • 11.
    Data Scientists spendupto 1/3rd time in Data Discovery... 12 • Data discovery ‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
  • 12.
  • 13.
    Data Discovery -User personas 14 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  • 14.
    3 Data Scientistpersonas Power user ● All info in their head ● Get interrupted a lot due to questions ● Lost ● Ask “power users” a lot of questions ● Dependencies landing on time ● Communicating with stakeholders Noob user Manager
  • 15.
    Search based Lineagebased Network based Where is the table/dashboard for X? What does it contain? I am changing a data model, who are the owner and most common users? I want to follow a power user in my team. Does this analysis already exist? This table’s delivery was delayed today, I want to notify everyone downstream. I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. Data Discovery answers 3 kinds of questions
  • 16.
    Buy vs. Buildvs. Adopt 17
  • 17.
    Compared various existingsolutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  • 18.
    Meet Amundsen 19 First personto discover the South Pole - Norwegian explorer, Roald Amundsen
  • 19.
  • 20.
    Search results rankedon relevance and query activity
  • 21.
  • 22.
    Relevance - searchfor “apple” on Google 23 Low relevance High relevance
  • 23.
    Popularity - searchfor “apple” on Google 24 Low popularity High popularity
  • 24.
    Striking the balance 25 RelevancePopularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  • 25.
  • 26.
    Search results rankedon relevance and query activity
  • 27.
    Detailed description andmetadata about data resources
  • 28.
  • 29.
    Computed stats aboutcolumn metadata Disclaimer: these stats are arbitrary.
  • 30.
  • 31.
  • 32.
    33 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 33.
  • 34.
    35 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 35.
    Detailed description andmetadata about data resources
  • 36.
  • 37.
    38 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 38.
    39 2. Metadata Service •A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine. ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  • 39.
    Trade Off #1 Whychoose Graph database 40
  • 40.
  • 41.
  • 42.
    Trade Off #2 Whynot propagate the metadata back to source 43
  • 43.
    Why not propagatethe metadata back to source 44
  • 44.
    Why not propagatethe metadata back to source 45 ? ?
  • 45.
  • 46.
    47 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 47.
    3. Search Service •Support REST API for building indexes • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 48
  • 48.
    Challenge #1 How tomake the search result more relevant? 49
  • 49.
    How to makethe search result more relevant? 50 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search ‒ Support category search (e.g. “column: is_line_ride”)
  • 50.
  • 51.
    52 Postgres Hive Redshift... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  • 52.
  • 53.
  • 54.
    Metadata - Challenges •Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Extraction: Each data set metadata is stored and fetched differently, ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 55
  • 55.
    Challenge #2 Pull modelvs Push model 56
  • 56.
    Pull model vs.Push model 57 Pull Model Push Model ● Periodically update the index by pulling from the system (e.g. database) via crawlers. ● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Crawler Database Data graph Scheduler Database Message queue Data graph
  • 57.
  • 58.
  • 59.
    How are webuilding data? Databuilder
  • 60.
    How is databuilderorchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  • 61.
  • 62.
    Amundsen seems tobe more useful than what we thought • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! • Many organizations have similar problems ‒ Collaborating with ING, WeWork and more ‒ We plan to announce open source soon 65
  • 63.
    Impact - Amundsenat Lyft 66 Beta release (internal) Generally Available (GA) release Alpha release
  • 64.
    Adding more kindsof data resources PeopleDashboardsData sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows
  • 65.
    Serving more metadataabout existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time
  • 66.
  • 67.
    Summary • Data Discoverymaking data scientists unproductive • 3 types of data discovery - search, lineage and network based • Amundsen: Data graph for all data • Blog post with more details: go.lyft.com/datadiscoveryblog 70
  • 68.
    Mark Grover |@mark_grover Tao Feng | @feng-tao Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 71
  • 69.

Editor's Notes

  • #3 Today’s agenda: Why empowering with data is important… What are we doing in the data team at Lyft (context)... What challenges we are facing and have seen other companies face… How are we solving the problem... At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  • #4 Who is our audience: everyone who works at Lyft… Power users: Data Scientists, Research Scientists, Product Managers… Next: Engineers, GMs, Ops, etc.
  • #5 What do we do: Wide scope of work… de-mystify and democratize data at Lyft… Not mentioned here: Operational / transactional databases...
  • #6 What does the architecture for our core infra look like? Mobile application primarily… Raw events can come either from the client… or from the back end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive from long running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto...
  • #8 Mark
  • #9 Mark
  • #10 Mark
  • #12 Mark
  • #13 Data Discovery: How much of a challenge is it? Significant challenge… Data Scientists spend up to 1/3rd of their time in Data Discovery while doing exploratory analysis… We surveyed users at Lyft and a few other companies: You’d want to spend most of the time on analysis… But we have ~10PBs of data, thousands of tables… so it is hard to find what is there and what is the source of truth… We can significantly increase productivity and impact if we can reduce this time...
  • #15 Mark
  • #24 Mark
  • #26 Popularity is not click through rate but through query access patterns.
  • #34 Amundsen architecture at Lyft: 3 micro-services(FE, metadata, search) and one generic data ingestion framework Will discuss each of the component in details High level walkthrough….how CCPA compilance works
  • #38 What options we have (graph, rdbms,
  • #40 ML features (one sentence on what is feature service) Add a logo of neo4j ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • #42 Why graph db is the best option? Why not rdbms, nosql etc Why choose neo4j vs other graph db? (most popular graph db) Amundsen needs to provide metadata which includes table, column, column statistics, usage information etc. Along with that, Amundsen also needs to provide lineage information where it need to be able to provide producer, consumer relationships within the life cycle of data. Lineage could be simplified as a graph of entities and edges. E.g in the graph blahblah There are other options: NoSQL(no join support), RDBMS(performance of join is not good)
  • #43 Above graph data model shows our use case to show table and column metadata with usage to the column level. Querying a table detail from this Neo4j graph would be like asking Neo4j to search for a table node as a starting node and traverse it. In other words, there’s only one search operation needed to find anchor node which makes Neo4j performant -- no join operation at all. We model the graph as bi-direction relationships.
  • #45 ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • #46 ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • #48 ? challenge How to improve search relevance?
  • #49 Think of search algorithm… Event_ride_table -> event ride table
  • #51 Talk about how we measure it: instrument empty results from user, % of click through rate What we did? Boost the rank of exact table name match
  • #58 Walk through pros and cons
  • #61 Pull approach is basically extract data from the source periodically. Amundsen databuilder will be responsible for extraction, transformation, and load. This naturally gives us three abstracted construct Extractor, Transformer, Loader and, optionally Publisher. The design principle follows Apache Gobblin Extractor extracts record from source one record at a time. For example in Hive, we would need a column metadata extractor for a table where each record represents a column of a table. Transformer transforms a record. Any use case that we may have to transform (e.g: remove special character) or decorate the record (e.g: make a service call to enrich data). This is a place for that. Loader writes data into either sink (destination) or into staging area. Publisher assumes that loader loaded into staging area and publishes it to destination. Atomicity is a desired behavior but it’s up to the limitation of sink itself’s support on Atomicity.
  • #64 We currently index 2 a day, dependency Why we index twice a day?
  • #66 Today’s agenda: Why empowering with data is important… What are we doing in the data team at Lyft (context)... What challenges we are facing and have seen other companies face… How are we solving the problem... At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  • #67 A slide on Amundsen @ Lyft? ? how long it has been in prod How many datasets Users WAU Usage
  • #68 Kafka topic Schema registry ML workflow and Airflow DAGs
  • #69 Three long term type of metadata(A.B.C) We want to use the index A,B,C for the datasources we mentioned in last slide
  • #71 Mark