SlideShare a Scribd company logo
1 of 69
March 2019
Mark Grover | @mark_grover | Product Management, Lyft
Tao Feng | @feng-tao | Software Engineer, Lyft
Disrupting Data Discovery
Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Architecture
• Summary
2
Data platform users
3
Data Modelers Analysts Data Scientists General
Managers
Data Platform
Engineers ExperimentersProduct
Managers
5
Core Infra high level architecture
Custom apps
Data Discovery
6
• My first project is to analyze and predict Strata Attendance
• Where is the data?
• What does it mean?
Hi! I am a n00b Data Scientist!
7
• Option 1: Phone a friend!
• Option 2: Github search
Status quo
8
• What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
• Let me dig in and understand
Understand the context
9
Explore
SELECT
*
FROM
default.my_table
WHERE ds=’2018-01-01’
LIMIT 100;
Exploring with SELECT * is EVIL
1. Lack of productivity for data scientists
2. Increased load on the databases
11
Data Scientists spend upto 1/3rd time in Data Discovery...
12
• Data discovery
‒ Lack of
understanding of
what data exists,
where, who owns it,
who uses it, and how
to request access.
Audience for data
discovery
13
Data Discovery - User personas
14
Data Modelers Analysts Data Scientists General
Managers
Data Platform
Engineers ExperimentersProduct
Managers
3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot
due to questions
● Lost
● Ask “power users” a
lot of questions
● Dependencies
landing on time
● Communicating with
stakeholders
Noob user Manager
Search based Lineage based Network based
Where is the
table/dashboard for X?
What does it contain?
I am changing a data
model, who are the owner
and most common users?
I want to follow a power
user in my team.
Does this analysis already
exist?
This table’s delivery was
delayed today, I want to
notify everyone
downstream.
I want to bookmark tables of
interest and get a feed of
data delay, schema change,
incidents.
Data Discovery answers 3 kinds of questions
Buy vs. Build vs. Adopt
17
Compared various existing solutions/open source projects
Criteria / Products Alation Where
Hows
Airbnb
Data
Portal
Cloudera
Navigator
Apache
Atlas
Search based
Lineage based
Network based
Hive/Presto support
Redshift support
Open source (pref.)
Meet Amundsen
19
First person to discover the South Pole -
Norwegian explorer, Roald Amundsen
Landing page optimized for search
Search results ranked on relevance and query activity
How does search work?
22
Relevance - search for “apple” on Google
23
Low relevance High relevance
Popularity - search for “apple” on Google
24
Low popularity High popularity
Striking the balance
25
Relevance Popularity
● Names, Descriptions, Tags, [owners, frequent
users]
● Querying activity
● Dashboarding
● Different weights for automated vs adhoc
querying
Back to mocks...
26
Search results ranked on relevance and query activity
Detailed description and metadata about data resources
Data Preview within the tool
Computed stats about column metadata
Disclaimer: these stats are arbitrary.
Built-in user feedback
Amundsen’s architecture
32
33
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Frontend Service
34
35
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
Detailed description and metadata about data resources
2. Metadata Service
37
38
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
39
2. Metadata Service
• A thin proxy layer to interact with graph database
‒ Currently Neo4j is the default option for graph backend engine.
‒ Work with the community to support Apache Atlas
• Support Rest API for other services pushing / pulling metadata directly
Trade Off #1
Why choose Graph
database
40
Why Graph database?
Why Graph database?
Trade Off #2
Why not propagate the
metadata back to source
43
Why not propagate the metadata back to source
44
Why not propagate the metadata back to source
45
?
?
3. Search Service
46
47
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
3. Search Service
• Support REST API for building indexes
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as backend.
• Support different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
48
Challenge #1
How to make the search
result more relevant?
49
How to make the search result more relevant?
50
• Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• Couple of improvements:
‒ Boost the exact table ranking
‒ Support wildcard search
‒ Support category search (e.g. “column: is_line_ride”)
4. Data Builder
51
52
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Other
Services
Other Microservices
Metadata Sources
Challenge #1
Various forms of metadata
53
54
Metadata Sources @ Lyft
Metadata - Challenges
• Standardization: No single data model that fits for all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Extraction: Each data set metadata is stored and fetched differently,
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
55
Challenge #2
Pull model vs Push model
56
Pull model vs. Push model
57
Pull Model Push Model
● Periodically update the index by pulling from
the system (e.g. database) via crawlers.
● The system (e.g. database) pushes
metadata to a message bus which
downstream subscribes to.
Crawler
Database Data graph
Scheduler
Database Message
queue
Data graph
4. Databuilder
Databuilder in action
How are we building data? Databuilder
How is databuilder orchestrated?
Amundsen uses Apache Airflow to orchestrate
Databuilder jobs
What’s next?
64
Amundsen seems to be more useful than what we thought
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
• Many organizations have similar problems
‒ Collaborating with ING, WeWork and more
‒ We plan to announce open source soon
65
Impact - Amundsen at Lyft
66
Beta release
(internal)
Generally Available
(GA) release
Alpha release
Adding more kinds of data resources
PeopleDashboardsData sets
Phase 1
(Complete)
Phase 2
(In development)
Phase 3
(In Scoping)
Streams Schemas Workflows
Serving more metadata about existing resources
Application Context
Existence, description, semantics, etc.
Behavior
How data is created and used over time
Change
How data is changing over time
Summary
69
Summary
• Data Discovery making data scientists unproductive
• 3 types of data discovery - search, lineage and network based
• Amundsen: Data graph for all data
• Blog post with more details: go.lyft.com/datadiscoveryblog
70
Mark Grover | @mark_grover
Tao Feng | @feng-tao
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
71
Backup
72

More Related Content

What's hot

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Mayank Shrivastava
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Databricks
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and DeltaDatabricks
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changesconfluent
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkFlink Forward
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 

What's hot (20)

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changes
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 

Similar to Strata sf - Amundsen presentation

Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudPeter Haase
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about SparkGiivee The
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Neo4j GraphDay Seattle- Sept19- in the enterprise
Neo4j GraphDay Seattle- Sept19-  in the enterpriseNeo4j GraphDay Seattle- Sept19-  in the enterprise
Neo4j GraphDay Seattle- Sept19- in the enterpriseNeo4j
 

Similar to Strata sf - Amundsen presentation (20)

Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Neo4j GraphDay Seattle- Sept19- in the enterprise
Neo4j GraphDay Seattle- Sept19-  in the enterpriseNeo4j GraphDay Seattle- Sept19-  in the enterprise
Neo4j GraphDay Seattle- Sept19- in the enterprise
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 

More from Tao Feng

Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyftTao Feng
 
Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)Tao Feng
 
Effective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza FrameworkEffective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza FrameworkTao Feng
 
A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...Tao Feng
 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopTao Feng
 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeTao Feng
 

More from Tao Feng (6)

Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
 
Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)
 
Effective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza FrameworkEffective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza Framework
 
A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...
 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per node
 

Recently uploaded

TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxNiranjanYadav41
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectssuserb6619e
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadaditya806802
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 

Recently uploaded (20)

TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptx
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasad
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 

Strata sf - Amundsen presentation

  • 1. March 2019 Mark Grover | @mark_grover | Product Management, Lyft Tao Feng | @feng-tao | Software Engineer, Lyft Disrupting Data Discovery
  • 2. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Architecture • Summary 2
  • 3. Data platform users 3 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  • 4. 5 Core Infra high level architecture Custom apps
  • 6. • My first project is to analyze and predict Strata Attendance • Where is the data? • What does it mean? Hi! I am a n00b Data Scientist! 7
  • 7. • Option 1: Phone a friend! • Option 2: Github search Status quo 8
  • 8. • What does this field mean? ‒ Does attendance data include employees? ‒ Does it include revenue? • Let me dig in and understand Understand the context 9
  • 10. Exploring with SELECT * is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases 11
  • 11. Data Scientists spend upto 1/3rd time in Data Discovery... 12 • Data discovery ‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
  • 13. Data Discovery - User personas 14 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  • 14. 3 Data Scientist personas Power user ● All info in their head ● Get interrupted a lot due to questions ● Lost ● Ask “power users” a lot of questions ● Dependencies landing on time ● Communicating with stakeholders Noob user Manager
  • 15. Search based Lineage based Network based Where is the table/dashboard for X? What does it contain? I am changing a data model, who are the owner and most common users? I want to follow a power user in my team. Does this analysis already exist? This table’s delivery was delayed today, I want to notify everyone downstream. I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. Data Discovery answers 3 kinds of questions
  • 16. Buy vs. Build vs. Adopt 17
  • 17. Compared various existing solutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  • 18. Meet Amundsen 19 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  • 20. Search results ranked on relevance and query activity
  • 21. How does search work? 22
  • 22. Relevance - search for “apple” on Google 23 Low relevance High relevance
  • 23. Popularity - search for “apple” on Google 24 Low popularity High popularity
  • 24. Striking the balance 25 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  • 26. Search results ranked on relevance and query activity
  • 27. Detailed description and metadata about data resources
  • 29. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  • 32. 33 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 34. 35 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 35. Detailed description and metadata about data resources
  • 37. 38 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 38. 39 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine. ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  • 39. Trade Off #1 Why choose Graph database 40
  • 42. Trade Off #2 Why not propagate the metadata back to source 43
  • 43. Why not propagate the metadata back to source 44
  • 44. Why not propagate the metadata back to source 45 ? ?
  • 46. 47 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 47. 3. Search Service • Support REST API for building indexes • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 48
  • 48. Challenge #1 How to make the search result more relevant? 49
  • 49. How to make the search result more relevant? 50 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search ‒ Support category search (e.g. “column: is_line_ride”)
  • 51. 52 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  • 52. Challenge #1 Various forms of metadata 53
  • 54. Metadata - Challenges • Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Extraction: Each data set metadata is stored and fetched differently, ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 55
  • 55. Challenge #2 Pull model vs Push model 56
  • 56. Pull model vs. Push model 57 Pull Model Push Model ● Periodically update the index by pulling from the system (e.g. database) via crawlers. ● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Crawler Database Data graph Scheduler Database Message queue Data graph
  • 59. How are we building data? Databuilder
  • 60. How is databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  • 62. Amundsen seems to be more useful than what we thought • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! • Many organizations have similar problems ‒ Collaborating with ING, WeWork and more ‒ We plan to announce open source soon 65
  • 63. Impact - Amundsen at Lyft 66 Beta release (internal) Generally Available (GA) release Alpha release
  • 64. Adding more kinds of data resources PeopleDashboardsData sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows
  • 65. Serving more metadata about existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time
  • 67. Summary • Data Discovery making data scientists unproductive • 3 types of data discovery - search, lineage and network based • Amundsen: Data graph for all data • Blog post with more details: go.lyft.com/datadiscoveryblog 70
  • 68. Mark Grover | @mark_grover Tao Feng | @feng-tao Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 71

Editor's Notes

  1. Today’s agenda: Why empowering with data is important… What are we doing in the data team at Lyft (context)... What challenges we are facing and have seen other companies face… How are we solving the problem... At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  2. Who is our audience: everyone who works at Lyft… Power users: Data Scientists, Research Scientists, Product Managers… Next: Engineers, GMs, Ops, etc.
  3. What do we do: Wide scope of work… de-mystify and democratize data at Lyft… Not mentioned here: Operational / transactional databases...
  4. What does the architecture for our core infra look like? Mobile application primarily… Raw events can come either from the client… or from the back end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive from long running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto...
  5. Mark
  6. Mark
  7. Mark
  8. Mark
  9. Data Discovery: How much of a challenge is it? Significant challenge… Data Scientists spend up to 1/3rd of their time in Data Discovery while doing exploratory analysis… We surveyed users at Lyft and a few other companies: You’d want to spend most of the time on analysis… But we have ~10PBs of data, thousands of tables… so it is hard to find what is there and what is the source of truth… We can significantly increase productivity and impact if we can reduce this time...
  10. Mark
  11. Mark
  12. Popularity is not click through rate but through query access patterns.
  13. Amundsen architecture at Lyft: 3 micro-services(FE, metadata, search) and one generic data ingestion framework Will discuss each of the component in details High level walkthrough….how CCPA compilance works
  14. What options we have (graph, rdbms,
  15. ML features (one sentence on what is feature service) Add a logo of neo4j ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  16. Why graph db is the best option? Why not rdbms, nosql etc Why choose neo4j vs other graph db? (most popular graph db) Amundsen needs to provide metadata which includes table, column, column statistics, usage information etc. Along with that, Amundsen also needs to provide lineage information where it need to be able to provide producer, consumer relationships within the life cycle of data. Lineage could be simplified as a graph of entities and edges. E.g in the graph blahblah There are other options: NoSQL(no join support), RDBMS(performance of join is not good)
  17. Above graph data model shows our use case to show table and column metadata with usage to the column level. Querying a table detail from this Neo4j graph would be like asking Neo4j to search for a table node as a starting node and traverse it. In other words, there’s only one search operation needed to find anchor node which makes Neo4j performant -- no join operation at all. We model the graph as bi-direction relationships.
  18. ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  19. ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  20. ? challenge How to improve search relevance?
  21. Think of search algorithm… Event_ride_table -> event ride table
  22. Talk about how we measure it: instrument empty results from user, % of click through rate What we did? Boost the rank of exact table name match
  23. Walk through pros and cons
  24. Pull approach is basically extract data from the source periodically. Amundsen databuilder will be responsible for extraction, transformation, and load. This naturally gives us three abstracted construct Extractor, Transformer, Loader and, optionally Publisher. The design principle follows Apache Gobblin Extractor extracts record from source one record at a time. For example in Hive, we would need a column metadata extractor for a table where each record represents a column of a table. Transformer transforms a record. Any use case that we may have to transform (e.g: remove special character) or decorate the record (e.g: make a service call to enrich data). This is a place for that. Loader writes data into either sink (destination) or into staging area. Publisher assumes that loader loaded into staging area and publishes it to destination. Atomicity is a desired behavior but it’s up to the limitation of sink itself’s support on Atomicity.
  25. We currently index 2 a day, dependency Why we index twice a day?
  26. Today’s agenda: Why empowering with data is important… What are we doing in the data team at Lyft (context)... What challenges we are facing and have seen other companies face… How are we solving the problem... At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  27. A slide on Amundsen @ Lyft? ? how long it has been in prod How many datasets Users WAU Usage
  28. Kafka topic Schema registry ML workflow and Airflow DAGs
  29. Three long term type of metadata(A.B.C) We want to use the index A,B,C for the datasources we mentioned in last slide
  30. Mark