IDEAS Amundsen Presentation

D
Daniel WonSoftware Engineer
Saturday, October 26th 2019
Alagappan Sethuraman | Engineering Manager, Lyft
Daniel Won | Software Engineer, Lyft
Disrupting Data Discovery
Agenda
• What is Data Discovery?
• Challenges in Data Discovery
• Introducing Amundsen
• Amundsen Architecture
• Impact and Future Work
2
What is Data Discovery?
3
Data is used to make informed decisions
4
Analysts Data Scientists General
Managers
Engineers ExperimentersProduct
Managers
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualization
4. Share insights and/or make a decision
Make data the heart of every decision
What is Data Discovery?
Consider a data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/create a visualization
4. Share insights and/or make a decision
5
Data Discovery
Challenges in
Data Discovery
6
• My first project is predict the attendance for IDEAS conference
• Goal: Help the office team make a decision on number of chairs to
provide?
• Idea: Let’s take a look into attendance from previous conferences… but
where do I look?
Hi! I’m a new Analyst!
7
• Ask a friend/manager/coworker
• Ask in a wider Slack channel
• Search in the Github repos
Step 1: Search & find data
8
We end up finding tables: hosted_events
that seems to be the right one
• You find several columns that might be what you're looking for:
‒ booked, registered, and attendance
• But you still have many questions such as:
‒ Does attendance include staff?
‒ What's the difference between booked and registered?
‒ How accurate are these figures?
Step 2: Understand the data
9
Step 2: Understand the data
● Look for further documentation on these columns
○ Where does this documentation live?
● Ask an expert who knows this table
○ Who is an expert?
● Run some queries to try to figure it out at the risk of being wrong
10
SELECT * FROM schema.host_events
LIMIT 100;
Nearly 1/3 of Data Scientist time is spent in Data
Discovery
11
• Data discovery is a problem
because of the lack of
understanding of what data
exists, where, who owns it, & how
to use it.
• Data Discovery provides little to
no intrinsic value
• Impactful work happens in
Analysis
Introducing Amundsen
12
What is Amundsen?
• Built at Lyft, official launch in late 2018
• Inspired by Google Search, Airbnb Data Portal, and
Apache Gobblin
• Named after Norwegian explorer Roald Amundsen
‒ Led the first expedition to the South Pole
‒ Led the first expedition through the Northwest Passage
13
Home Page
Search
Resource Metadata
Resource Ownership
17
Data Preview
18
Computed Column Statistics
Disclaimer: these stats are arbitrary.
Requesting Descriptions
20
User Profile
21
In-Application User Feedback
Amundsen Architecture
23
Amundsen Architecture
24
Why choose a graph
database?
25
26
Why Graph database? (1/2)
View Resource Metadata
28
Why Graph database? (2/2)
Neo4j is the source of truth
for editable metadata
29
Why not propagate the editabled metadata back to
source
30
Why not propagate the editabled metadata back to
source
31
Why not propagate the editabled metadata back to
source
32
Why not propagate the editabled metadata back to
source
33
Impact at Lyft
34
Amundsen’s Impact at Lyft
• Deployed at Lyft for over 1 year
• Over 700 Weekly Active Users
• 90% penetration among Data Scientists
• Reduced mean time to discovery by 75%
• Also used by Data Eng, Software Eng, PMs, Ops, Marketing Managers,
and more
35
Future Work
36
Search
Preview
37
Advanced
Search
38
More
Metadata
39
We're Open Source
40
• github.com/lyft/amundsen
• 200+ github stars, 10+ companies contributing back
• Slack channel 250+ people from 30+ companies
• Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow, LA,
NYC by Lyft employees and community
Amundsen is Open Source!
41
Community Overview
42
ContributorsActivecommunity
Thank You
43
Alagappan Sethuraman | /in/alagappanut
Daniel Won | /in/danwon
Project Code @ github.com/lyft/amundsen
Icons under Creative Commons License from https://thenounproject.com/
44
1 of 44

Recommended

ChatGPT and the Future of Work - Clark Boyd by
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
20.3K views69 slides
Getting into the tech field. what next by
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
5K views22 slides
Google's Just Not That Into You: Understanding Core Updates & Search Intent by
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
5.8K views99 slides
How to have difficult conversations by
How to have difficult conversations How to have difficult conversations
How to have difficult conversations Rajiv Jayarajah, MAppComm, ACC
4.3K views19 slides
Introduction to Data Science by
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceChristy Abraham Joy
82.1K views51 slides
Time Management & Productivity - Best Practices by
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
169.6K views42 slides

More Related Content

Recently uploaded

RIO GRANDE SUPPLY COMPANY INC, JAYSON.docx by
RIO GRANDE SUPPLY COMPANY INC, JAYSON.docxRIO GRANDE SUPPLY COMPANY INC, JAYSON.docx
RIO GRANDE SUPPLY COMPANY INC, JAYSON.docxJaysonGarabilesEspej
6 views3 slides
Cross-network in Google Analytics 4.pdf by
Cross-network in Google Analytics 4.pdfCross-network in Google Analytics 4.pdf
Cross-network in Google Analytics 4.pdfGA4 Tutorials
6 views7 slides
ColonyOS by
ColonyOSColonyOS
ColonyOSJohanKristiansson6
9 views17 slides
Introduction to Microsoft Fabric.pdf by
Introduction to Microsoft Fabric.pdfIntroduction to Microsoft Fabric.pdf
Introduction to Microsoft Fabric.pdfishaniuudeshika
24 views16 slides
MOSORE_BRESCIA by
MOSORE_BRESCIAMOSORE_BRESCIA
MOSORE_BRESCIAFederico Karagulian
5 views8 slides
JConWorld_ Continuous SQL with Kafka and Flink by
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
100 views36 slides

Recently uploaded(20)

Cross-network in Google Analytics 4.pdf by GA4 Tutorials
Cross-network in Google Analytics 4.pdfCross-network in Google Analytics 4.pdf
Cross-network in Google Analytics 4.pdf
GA4 Tutorials6 views
Introduction to Microsoft Fabric.pdf by ishaniuudeshika
Introduction to Microsoft Fabric.pdfIntroduction to Microsoft Fabric.pdf
Introduction to Microsoft Fabric.pdf
ishaniuudeshika24 views
JConWorld_ Continuous SQL with Kafka and Flink by Timothy Spann
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann100 views
Data structure and algorithm. by Abdul salam
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm.
Abdul salam 18 views
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf by vikas12611618
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdfVikas 500 BIG DATA TECHNOLOGIES LAB.pdf
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf
vikas126116188 views
UNEP FI CRS Climate Risk Results.pptx by pekka28
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptx
pekka2811 views
Advanced_Recommendation_Systems_Presentation.pptx by neeharikasingh29
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptx
Survey on Factuality in LLM's.pptx by NeethaSherra1
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptx
NeethaSherra15 views
Launch of the Knowledge Exchange Platform - Romina Boarini - 21 November 2023 by StatsCommunications
Launch of the Knowledge Exchange Platform - Romina Boarini - 21 November 2023Launch of the Knowledge Exchange Platform - Romina Boarini - 21 November 2023
Launch of the Knowledge Exchange Platform - Romina Boarini - 21 November 2023
Understanding Hallucinations in LLMs - 2023 09 29.pptx by Greg Makowski
Understanding Hallucinations in LLMs - 2023 09 29.pptxUnderstanding Hallucinations in LLMs - 2023 09 29.pptx
Understanding Hallucinations in LLMs - 2023 09 29.pptx
Greg Makowski13 views
Organic Shopping in Google Analytics 4.pdf by GA4 Tutorials
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdf
GA4 Tutorials10 views
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20045 views
Supercharging your Data with Azure AI Search and Azure OpenAI by Peter Gallagher
Supercharging your Data with Azure AI Search and Azure OpenAISupercharging your Data with Azure AI Search and Azure OpenAI
Supercharging your Data with Azure AI Search and Azure OpenAI
Peter Gallagher37 views
Short Story Assignment by Kelly Nguyen by kellynguyen01
Short Story Assignment by Kelly NguyenShort Story Assignment by Kelly Nguyen
Short Story Assignment by Kelly Nguyen
kellynguyen0118 views

Featured

Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present... by
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
55.4K views138 slides
12 Ways to Increase Your Influence at Work by
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
401.6K views64 slides
ChatGPT webinar slides by
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slidesAlireza Esmikhani
30.3K views36 slides
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G... by
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
3.6K views12 slides
Barbie - Brand Strategy Presentation by
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
25.1K views46 slides

Featured(20)

Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present... by Applitools
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Applitools55.4K views
12 Ways to Increase Your Influence at Work by GetSmarter
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
GetSmarter401.6K views
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G... by DevGAMM Conference
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
DevGAMM Conference3.6K views
Barbie - Brand Strategy Presentation by Erica Santiago
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
Erica Santiago25.1K views
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well by Saba Software
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Saba Software25.2K views
Introduction to C Programming Language by Simplilearn
Introduction to C Programming LanguageIntroduction to C Programming Language
Introduction to C Programming Language
Simplilearn8.4K views
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr... by Palo Alto Software
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
Palo Alto Software88.3K views
9 Tips for a Work-free Vacation by Weekdone.com
9 Tips for a Work-free Vacation9 Tips for a Work-free Vacation
9 Tips for a Work-free Vacation
Weekdone.com7.2K views
How to Map Your Future by SlideShop.com
How to Map Your FutureHow to Map Your Future
How to Map Your Future
SlideShop.com275.1K views
Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -... by AccuraCast
Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -...Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -...
Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -...
AccuraCast3.4K views
Exploring ChatGPT for Effective Teaching and Learning.pptx by Stan Skrabut, Ed.D.
Exploring ChatGPT for Effective Teaching and Learning.pptxExploring ChatGPT for Effective Teaching and Learning.pptx
Exploring ChatGPT for Effective Teaching and Learning.pptx
Stan Skrabut, Ed.D.57.6K views
How to train your robot (with Deep Reinforcement Learning) by Lucas García, PhD
How to train your robot (with Deep Reinforcement Learning)How to train your robot (with Deep Reinforcement Learning)
How to train your robot (with Deep Reinforcement Learning)
Lucas García, PhD42.5K views
4 Strategies to Renew Your Career Passion by Daniel Goleman
4 Strategies to Renew Your Career Passion4 Strategies to Renew Your Career Passion
4 Strategies to Renew Your Career Passion
Daniel Goleman122K views
The Student's Guide to LinkedIn by LinkedIn
The Student's Guide to LinkedInThe Student's Guide to LinkedIn
The Student's Guide to LinkedIn
LinkedIn87.9K views
Different Roles in Machine Learning Career by Intellipaat
Different Roles in Machine Learning CareerDifferent Roles in Machine Learning Career
Different Roles in Machine Learning Career
Intellipaat12.4K views
Defining a Tech Project Vision in Eight Quick Steps pdf by TechSoup
Defining a Tech Project Vision in Eight Quick Steps pdfDefining a Tech Project Vision in Eight Quick Steps pdf
Defining a Tech Project Vision in Eight Quick Steps pdf
TechSoup 9.7K views

IDEAS Amundsen Presentation

Editor's Notes

  1. Name & Role working on an open-source data discovery tool at Lyft. It’s called “Amundsen” -- more on that name later. It leverages Neo4j, glad to share how we’ve been using Neo4j at Lyft to achieve goals of our product Amundsen.
  2. On the agenda for this talk
  3. Now onto challenges with data discovery
  4. Effective data discovery is important because data is at the heart of every decision we make. It is the only way to make informed, objective decisions. Applies to many roles Data-driven decision making process Search & find data Understand the data Perform an analysis Share insights or make a decision
  5. Now onto challenges with data discovery
  6. To highlight some data discover pain points that occur without the proper tools, let’s walk through a hypothetical example
  7. Your experience searching and finding data may involve doing all of the following 3 things.
  8. Your experience understanding the data doesn’t get any easier. Each question leads to further questions - How was this data collected?
  9. ⅓ of time on data discovery Difficult to find what exists, understand whether or not it’s what you are looking for, or trust that it is the source of truth for that information We can significantly increase productivity and impact if we can reduce this time...
  10. We’ve talked about some pain points of data discovery and why it’s important, let’s talk about our solution -- Amundsen.
  11. Disclaimer Representative data Amundsen circa March 2019 Our landing page is optimized for search Most common method of data discovery, presented with search bar & help text for some advanced search features We also want the landing page to be able to help users that don’t know what to search for. Created this concept of popular tables
  12. Users presented with ranked search results Not like page-rank but based on relevance and popularity
  13. Now onto challenges with data discovery
  14. However graph databases are not common for many web applications, and so one might ask why choose a graph database.
  15. Well if you remember the diagram of the data ecosystem at Lyft from the beginning of the talk, that can be modeled as a graph. This is a very powerful feature because the alternative to created these kinds of relationships with a RDBMS is joins A NoSQL database isn’t set up for this
  16. As you may remember from the application walkthrough, Amundsen surfaces resource metadata and that is what we are storing in Neo4j
  17. Let’s take a note of some of the features from the table detail page again and see how this is represented in Neo4j Walk through features What’s very beneficial about this is that when we have a new use case and a new piece of metadata to represent, we just have to create the new node and relationship.
  18. Another key characteristic of our system is that neo4j is the source of truth for our editable metadata
  19. This was actually not our original intent, we ran into a roadblock when we were first implementing the description editing feature. We originally had a setup like this
  20. Then we realized we forgot to account for something. Tables can get rebuilt using the source code that generated the table and descriptions will be overwritten
  21. The we thought about whether or not we could do this, update them both!
  22. The answer was no. ...And that’s how Neo4j became the source of truth for editable metadata
  23. Now onto challenges with data discovery
  24. Now onto challenges with data discovery
  25. Now onto challenges with data discovery
  26. T