SlideShare a Scribd company logo

Meetup SF - Amundsen

Amundsen, the open source data discovery tool. Slides from the 5/15 presentation at Lyft

1 of 74
Download to read offline
Making Big Data Easy
Data Discovery Lineage Data Like Code
May 15th 2019
Phil Mizrahi | @philippemizrahi | Associate Product Manager, Lyft
Jin Hyuk Chang | @jinhyukchang | Software Engineer, Lyft
go.lyft.com/datadiscoveryslides
Disrupting Data Discovery
Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Amundsen’s Architecture
• What’s next?
3
Data at Lyft
4
5
Core Infra high level architecture
Custom apps
Challenges with Data
Discovery
6
Ad

Recommended

Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogDATAVERSITY
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptxWasm1953
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 

More Related Content

What's hot

Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationTamikaTannis
 
Knowledge Graphs for Supply Chain Operations.pdf
Knowledge Graphs for Supply Chain Operations.pdfKnowledge Graphs for Supply Chain Operations.pdf
Knowledge Graphs for Supply Chain Operations.pdfVaticle
 
Open Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasOpen Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasDataWorks Summit
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Emerging Trends in Data Engineering
Emerging Trends in Data EngineeringEmerging Trends in Data Engineering
Emerging Trends in Data EngineeringAnanth PackkilDurai
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of PrestoTaro L. Saito
 
Data Estate Modernization
Data Estate ModernizationData Estate Modernization
Data Estate ModernizationKarina Matos
 
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Denodo
 
GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...DataWorks Summit
 
Data governance Program PowerPoint Presentation Slides
Data governance Program PowerPoint Presentation Slides Data governance Program PowerPoint Presentation Slides
Data governance Program PowerPoint Presentation Slides SlideTeam
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleAdam Doyle
 
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...Denodo
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 

What's hot (20)

Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
 
Knowledge Graphs for Supply Chain Operations.pdf
Knowledge Graphs for Supply Chain Operations.pdfKnowledge Graphs for Supply Chain Operations.pdf
Knowledge Graphs for Supply Chain Operations.pdf
 
Open Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasOpen Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache Atlas
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Emerging Trends in Data Engineering
Emerging Trends in Data EngineeringEmerging Trends in Data Engineering
Emerging Trends in Data Engineering
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of Presto
 
Data Estate Modernization
Data Estate ModernizationData Estate Modernization
Data Estate Modernization
 
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)
 
GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...
 
Data governance Program PowerPoint Presentation Slides
Data governance Program PowerPoint Presentation Slides Data governance Program PowerPoint Presentation Slides
Data governance Program PowerPoint Presentation Slides
 
Migrating Oracle to PostgreSQL
Migrating Oracle to PostgreSQLMigrating Oracle to PostgreSQL
Migrating Oracle to PostgreSQL
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Data Mesh
Data MeshData Mesh
Data Mesh
 

Similar to Meetup SF - Amundsen

How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudPeter Haase
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j
 
The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...Sylvain Zimmer
 
Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataOrdering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataAndy Stretton
 
O'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustO'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustPeter Skomoroch
 
DMPTool for IMLS #WebWise14
DMPTool for IMLS #WebWise14DMPTool for IMLS #WebWise14
DMPTool for IMLS #WebWise14Carly Strasser
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysisikanow
 
Demystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time AnalyticsDemystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time AnalyticsDataWorks Summit
 
How Celtra Optimizes its Advertising Platform with Databricks
How Celtra Optimizes its Advertising Platformwith DatabricksHow Celtra Optimizes its Advertising Platformwith Databricks
How Celtra Optimizes its Advertising Platform with DatabricksGrega Kespret
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisOpen Analytics
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 

Similar to Meetup SF - Amundsen (20)

How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
 
The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...
 
Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataOrdering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect data
 
O'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustO'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data Exhaust
 
DMPTool for IMLS #WebWise14
DMPTool for IMLS #WebWise14DMPTool for IMLS #WebWise14
DMPTool for IMLS #WebWise14
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
Demystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time AnalyticsDemystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time Analytics
 
How Celtra Optimizes its Advertising Platform with Databricks
How Celtra Optimizes its Advertising Platformwith DatabricksHow Celtra Optimizes its Advertising Platformwith Databricks
How Celtra Optimizes its Advertising Platform with Databricks
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 

Recently uploaded

chap. 3. lipid deterioration oil and fat processign
chap. 3. lipid deterioration oil and fat processignchap. 3. lipid deterioration oil and fat processign
chap. 3. lipid deterioration oil and fat processignteddymebratie
 
self introduction sri balaji
self introduction sri balajiself introduction sri balaji
self introduction sri balajiSriBalaji891607
 
Critical Literature Review Final -MW.pdf
Critical Literature Review Final -MW.pdfCritical Literature Review Final -MW.pdf
Critical Literature Review Final -MW.pdfMollyWinterbottom
 
Center Enamel is the leading fire water tanks manufacturer in China.docx
Center Enamel is the leading fire water tanks manufacturer in China.docxCenter Enamel is the leading fire water tanks manufacturer in China.docx
Center Enamel is the leading fire water tanks manufacturer in China.docxsjzzztc
 
biofilm fouling of the membrane present in aquaculture
biofilm fouling of the membrane present in aquaculturebiofilm fouling of the membrane present in aquaculture
biofilm fouling of the membrane present in aquacultureVINETUBE2
 
UNIT I INTRODUCTION TO INTERNET OF THINGS
UNIT I INTRODUCTION TO INTERNET OF THINGSUNIT I INTRODUCTION TO INTERNET OF THINGS
UNIT I INTRODUCTION TO INTERNET OF THINGSbinuvijay1
 
Student Challange as Google Developers at NKOCET
Student Challange as Google Developers at NKOCETStudent Challange as Google Developers at NKOCET
Student Challange as Google Developers at NKOCETGDSCNKOCET
 
SR Globals Profile - Building Vision, Exceeding Expectations.
SR Globals Profile -  Building Vision, Exceeding Expectations.SR Globals Profile -  Building Vision, Exceeding Expectations.
SR Globals Profile - Building Vision, Exceeding Expectations.srglobalsenterprises
 
Nexus - Final Day 12th February 2024.pptx
Nexus - Final Day 12th February 2024.pptxNexus - Final Day 12th February 2024.pptx
Nexus - Final Day 12th February 2024.pptxRohanAgarwal340656
 
20CE501PE – INDUSTRIAL WASTE MANAGEMENT.ppt
20CE501PE – INDUSTRIAL WASTE MANAGEMENT.ppt20CE501PE – INDUSTRIAL WASTE MANAGEMENT.ppt
20CE501PE – INDUSTRIAL WASTE MANAGEMENT.pptMohanumar S
 
Deep Learning For Computer Vision- Day 3 Study Jams GDSC Unsri.pptx
Deep Learning For Computer Vision- Day 3 Study Jams GDSC Unsri.pptxDeep Learning For Computer Vision- Day 3 Study Jams GDSC Unsri.pptx
Deep Learning For Computer Vision- Day 3 Study Jams GDSC Unsri.pptxpmgdscunsri
 
CHAPTER 1_ HTML and CSS Introduction Module
CHAPTER 1_ HTML and CSS Introduction ModuleCHAPTER 1_ HTML and CSS Introduction Module
CHAPTER 1_ HTML and CSS Introduction Modulessusera4f8281
 
Microstrip Bandpass Filter Design using EDA Tolol such as keysight ADS and An...
Microstrip Bandpass Filter Design using EDA Tolol such as keysight ADS and An...Microstrip Bandpass Filter Design using EDA Tolol such as keysight ADS and An...
Microstrip Bandpass Filter Design using EDA Tolol such as keysight ADS and An...GauravBhartie
 
Introduction to Binary Tree and Conersion of General tree to Binary Tree
Introduction to Binary Tree  and Conersion of General tree to Binary TreeIntroduction to Binary Tree  and Conersion of General tree to Binary Tree
Introduction to Binary Tree and Conersion of General tree to Binary TreeSwarupaDeshpande4
 
Sample Case Study of industry 4.0 and its Outcome
Sample Case Study of industry 4.0 and its OutcomeSample Case Study of industry 4.0 and its Outcome
Sample Case Study of industry 4.0 and its OutcomeHarshith A S
 
Presentation of Helmet Detection Using Machine Learning.pptx
Presentation of Helmet Detection Using Machine Learning.pptxPresentation of Helmet Detection Using Machine Learning.pptx
Presentation of Helmet Detection Using Machine Learning.pptxasmitaTele2
 
sahana sri D AD21046 SELF INTRODUCTION.pdf
sahana sri D AD21046 SELF INTRODUCTION.pdfsahana sri D AD21046 SELF INTRODUCTION.pdf
sahana sri D AD21046 SELF INTRODUCTION.pdfsahanaaids46
 
CDE_Sustainability Performance_20240214.pdf
CDE_Sustainability Performance_20240214.pdfCDE_Sustainability Performance_20240214.pdf
CDE_Sustainability Performance_20240214.pdf8-koi
 
Center Enamel is the leading bolted steel tanks manufacturer in China.docx
Center Enamel is the leading bolted steel tanks manufacturer in China.docxCenter Enamel is the leading bolted steel tanks manufacturer in China.docx
Center Enamel is the leading bolted steel tanks manufacturer in China.docxsjzzztc
 
Deluck Technical Works Company Profile.pdf
Deluck Technical Works Company Profile.pdfDeluck Technical Works Company Profile.pdf
Deluck Technical Works Company Profile.pdfartpoa9
 

Recently uploaded (20)

chap. 3. lipid deterioration oil and fat processign
chap. 3. lipid deterioration oil and fat processignchap. 3. lipid deterioration oil and fat processign
chap. 3. lipid deterioration oil and fat processign
 
self introduction sri balaji
self introduction sri balajiself introduction sri balaji
self introduction sri balaji
 
Critical Literature Review Final -MW.pdf
Critical Literature Review Final -MW.pdfCritical Literature Review Final -MW.pdf
Critical Literature Review Final -MW.pdf
 
Center Enamel is the leading fire water tanks manufacturer in China.docx
Center Enamel is the leading fire water tanks manufacturer in China.docxCenter Enamel is the leading fire water tanks manufacturer in China.docx
Center Enamel is the leading fire water tanks manufacturer in China.docx
 
biofilm fouling of the membrane present in aquaculture
biofilm fouling of the membrane present in aquaculturebiofilm fouling of the membrane present in aquaculture
biofilm fouling of the membrane present in aquaculture
 
UNIT I INTRODUCTION TO INTERNET OF THINGS
UNIT I INTRODUCTION TO INTERNET OF THINGSUNIT I INTRODUCTION TO INTERNET OF THINGS
UNIT I INTRODUCTION TO INTERNET OF THINGS
 
Student Challange as Google Developers at NKOCET
Student Challange as Google Developers at NKOCETStudent Challange as Google Developers at NKOCET
Student Challange as Google Developers at NKOCET
 
SR Globals Profile - Building Vision, Exceeding Expectations.
SR Globals Profile -  Building Vision, Exceeding Expectations.SR Globals Profile -  Building Vision, Exceeding Expectations.
SR Globals Profile - Building Vision, Exceeding Expectations.
 
Nexus - Final Day 12th February 2024.pptx
Nexus - Final Day 12th February 2024.pptxNexus - Final Day 12th February 2024.pptx
Nexus - Final Day 12th February 2024.pptx
 
20CE501PE – INDUSTRIAL WASTE MANAGEMENT.ppt
20CE501PE – INDUSTRIAL WASTE MANAGEMENT.ppt20CE501PE – INDUSTRIAL WASTE MANAGEMENT.ppt
20CE501PE – INDUSTRIAL WASTE MANAGEMENT.ppt
 
Deep Learning For Computer Vision- Day 3 Study Jams GDSC Unsri.pptx
Deep Learning For Computer Vision- Day 3 Study Jams GDSC Unsri.pptxDeep Learning For Computer Vision- Day 3 Study Jams GDSC Unsri.pptx
Deep Learning For Computer Vision- Day 3 Study Jams GDSC Unsri.pptx
 
CHAPTER 1_ HTML and CSS Introduction Module
CHAPTER 1_ HTML and CSS Introduction ModuleCHAPTER 1_ HTML and CSS Introduction Module
CHAPTER 1_ HTML and CSS Introduction Module
 
Microstrip Bandpass Filter Design using EDA Tolol such as keysight ADS and An...
Microstrip Bandpass Filter Design using EDA Tolol such as keysight ADS and An...Microstrip Bandpass Filter Design using EDA Tolol such as keysight ADS and An...
Microstrip Bandpass Filter Design using EDA Tolol such as keysight ADS and An...
 
Introduction to Binary Tree and Conersion of General tree to Binary Tree
Introduction to Binary Tree  and Conersion of General tree to Binary TreeIntroduction to Binary Tree  and Conersion of General tree to Binary Tree
Introduction to Binary Tree and Conersion of General tree to Binary Tree
 
Sample Case Study of industry 4.0 and its Outcome
Sample Case Study of industry 4.0 and its OutcomeSample Case Study of industry 4.0 and its Outcome
Sample Case Study of industry 4.0 and its Outcome
 
Presentation of Helmet Detection Using Machine Learning.pptx
Presentation of Helmet Detection Using Machine Learning.pptxPresentation of Helmet Detection Using Machine Learning.pptx
Presentation of Helmet Detection Using Machine Learning.pptx
 
sahana sri D AD21046 SELF INTRODUCTION.pdf
sahana sri D AD21046 SELF INTRODUCTION.pdfsahana sri D AD21046 SELF INTRODUCTION.pdf
sahana sri D AD21046 SELF INTRODUCTION.pdf
 
CDE_Sustainability Performance_20240214.pdf
CDE_Sustainability Performance_20240214.pdfCDE_Sustainability Performance_20240214.pdf
CDE_Sustainability Performance_20240214.pdf
 
Center Enamel is the leading bolted steel tanks manufacturer in China.docx
Center Enamel is the leading bolted steel tanks manufacturer in China.docxCenter Enamel is the leading bolted steel tanks manufacturer in China.docx
Center Enamel is the leading bolted steel tanks manufacturer in China.docx
 
Deluck Technical Works Company Profile.pdf
Deluck Technical Works Company Profile.pdfDeluck Technical Works Company Profile.pdf
Deluck Technical Works Company Profile.pdf
 

Meetup SF - Amundsen

  • 1. Making Big Data Easy Data Discovery Lineage Data Like Code
  • 2. May 15th 2019 Phil Mizrahi | @philippemizrahi | Associate Product Manager, Lyft Jin Hyuk Chang | @jinhyukchang | Software Engineer, Lyft go.lyft.com/datadiscoveryslides Disrupting Data Discovery
  • 3. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Amundsen’s Architecture • What’s next? 3
  • 5. 5 Core Infra high level architecture Custom apps
  • 7. Data is used to make informed decisions 7 Analysts Data Scientists General Managers Engineers ExperimentersProduct Managers Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or take a decision Make data the heart of every decision
  • 8. • My first project is to predict the Data Meetup Attendance • Goal: help the office team make a decision on # of chairs to provide Hi! I am a n00b Data Scientist! 8
  • 9. • Ask a friend/boss/coworker • Ask in a wider slack channel • Search in the Github repos Step 1: Search & find data 9 We end up finding a table: hosted_events that seems to be the right one
  • 10. • What does this field mean? ‒ Does hosted_events.attendance data include employees? ‒ What does hosted_events.costs include? • Dig deeper: explore using SQL queries Step 2: Understand the data 10 SELECT * FROM default.hosted_events WHERE ds=’2019-05-15’ LIMIT 100;
  • 11. Data Scientists spend upto 1/3rd time in Data Discovery 11 Data Discovery • Data discovery is a problem because of the lack of understanding of what data exists, where, who owns it, how to use it,... • It is not what our data scientist should focus on: they should focus on Analysis work Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or take a decision
  • 13. User Personas - (1/2) 13 Analysts Data Scientists General Managers ExperimentersEngineersProduct Managers • Frequent use of data • Deep to very deep analysis • Exposure to new datasets • Creating insights & developing models
  • 14. User Personas - (2/2) 14 Power user - Has been at Lyft for a long time - Knows the data environment well: where to find data, what it means, how to use it Pain points: - Needs to spend a fair amount of his time sharing his knowledge with the noob user - Could become noob user if he switches teams Noob user - Recently joined Lyft - Needs to ramp up on a lot of things, wants to start having impact soon Pain points: - Doesn’t know where to start. Spends his time asking questions and cmd+F on github - Makes mistakes by mis-using some datasets
  • 15. 3 complementary ways to do Data Discovery 15 Search based I am looking for a table with data on “cancel rates” - Where is the table? - What does it contain? - Has the analysis I want to perform already been done? Lineage based If this event is down, what datasets are going to be impacted? - Upstream/downstream lineage - Incidents, SLA misses, Data quality Network based I want to check what tables my manager uses - Ownership information - Bookmarking - Usage through query logs
  • 16. Data discovery at Lyft 16 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  • 18. Search results ranked on relevance and query activity
  • 19. How does search work? 19
  • 20. Relevance - search for “apple” on Google 20 Low relevance High relevance
  • 21. Popularity - search for “apple” on Google 21 Low popularity High popularity
  • 22. Striking the balance 22 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  • 25. 25 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 27. 27 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 29. Challenge #1 Why choose Graph database? 29
  • 32. 32 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  • 33. Challenge #2 Why not propagate the metadata back to source? 33
  • 34. Why not propagate the metadata back to source 34
  • 36. 36 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  • 37. Challenge #3 Various forms of metadata 37
  • 39. Metadata - Challenges • No Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: Each data set metadata is stored and fetched differently ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 39
  • 42. How is the databuilder orchestrated? 42 Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  • 44. 44 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 45. 3. Search Service • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 45
  • 46. Challenge #4 How to make the search result more relevant? 46
  • 47. How to make the search result more relevant? 47 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search (e.g. event_*) ‒ Support category search (e.g. column: is_line_ride)
  • 49. 49 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 50. Search results ranked on relevance and query activity
  • 53. Amundsen’s impact • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! ‒ 90% penetration among Data Scientists ‒ +30% productivity for the Data science org. 53
  • 54. Amundsen is Open Source! • Many organizations have similar problems ‒ github.com/lyft/amundsenfrontendlibrary • Collaboration with outside companies has started ‒ Used in Production by ING ‒ Working with WeWork, Square, Bang & Olufsen & more 54
  • 55. Roadmap PeopleDashboards Data sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows More Metadata Deeper integration with other tools (e.g. Superset) Privacy Governance
  • 57. Summary • A metadata-powered discovery tool - like Amundsen - is key to the next wave of big data applications • Blog post with more details: go.lyft.com/datadiscoveryblog • Join the community: github.com/lyft/amundsenfrontendlibrary 57
  • 58. Phil Mizrahi | @philippemizrahi Jin Hyuk Chang | @jinhyukchang Repo at github.com/lyft/amundsenfrontendlibrary Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 58
  • 60. 60 Lyft Data Team Lyft Data Team Core Data Infra Streaming Infra Visualization Experimentation BI and Logging ML Infra
  • 62. Serving more metadata about existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time Metadata: a set of data that describes and gives information about other data Ground, Joe Hellerstein, Vikram Sreekanti et al. RISE Lab, UC Berkeley
  • 63. Buy vs. Build vs. Adopt 63
  • 64. Compared various existing solutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  • 66. Search results ranked on relevance and query activity
  • 67. Detailed description and metadata about data resources
  • 69. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  • 71. Challenge #2 Pull model vs Push model 71
  • 72. Pull model and Push model 72 Pull Model Push Model ● Amundsen goes to the source and periodically update the index by pulling from the source (e.g. database). ● Very efficient on popular platform. Hard to be extended. ● The external source (e.g. database) pushes metadata to a message bus which Amundsen subscribes to. ● Flexible and extensible. Lots of work from external source. Crawler Database Data graph Scheduler Database Message queue Data graph
  • 73. We’re Hiring! Apply at www.lyft.com/careers or email data-recruiting@lyft.com Data Engineering Engineering Manager San Francisco Software Engineer San Francisco, Seattle, & New York City Data Infrastructure Engineering Manager San Francisco Software Engineer San Francisco & Seattle Experimentation Software Engineer San Francisco Streaming Software Engineer San Francisco Observability Software Engineer San Francisco
  • 74. Strata SF 2019 Rate this session session page on conference website O’Reilly Events App