IDEAS Amundsen Presentation

Slides for IDEAS 2019 Amundsen Presentation

  1. Disrupting Data Discovery • Alagappan Sethuraman | Engineering Manager, Lyft • Daniel Won | Software Engineer, Lyft • Saturday, October 26th, 2019
  2. Agenda • What is Data Discovery? • Challenges in Data Discovery • Introducing Amundsen • Amundsen Architecture • Impact and Future Work
  3. What is Data Discovery?
  4. Data is used to make informed decisions • Make data the heart of every decision • Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers • Data-driven decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualization 4. Share insights and/or make a decision
  5. What is Data Discovery? Consider a data-driven decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/create a visualization 4. Share insights and/or make a decision • Data Discovery
  6. Challenges in Data Discovery
  7. Hi! I’m a new Analyst! • My first project is to predict the attendance for the IDEAS conference • Goal: Help the office team decide how many chairs to provide • Idea: Let’s take a look at attendance from previous conferences… but where do I look?
  8. Step 1: Search & find data • Ask a friend/manager/coworker • Ask in a wider Slack channel • Search in the GitHub repos • We end up finding a table, hosted_events, that seems to be the right one
  9. Step 2: Understand the data • You find several columns that might be what you're looking for: ‒ booked, registered, and attendance • But you still have many questions, such as: ‒ Does attendance include staff? ‒ What's the difference between booked and registered? ‒ How accurate are these figures?
  10. Step 2: Understand the data ● Look for further documentation on these columns ○ Where does this documentation live? ● Ask an expert who knows this table ○ Who is an expert? ● Run some queries to try to figure it out, at the risk of being wrong: SELECT * FROM schema.hosted_events LIMIT 100;
  11. Nearly 1/3 of Data Scientist time is spent in Data Discovery • Data discovery is a problem because of the lack of understanding of what data exists, where it lives, who owns it, and how to use it • Data Discovery provides little to no intrinsic value • Impactful work happens in Analysis
  12. Introducing Amundsen
  13. What is Amundsen? • Built at Lyft, official launch in late 2018 • Inspired by Google Search, Airbnb Data Portal, and Apache Gobblin • Named after Norwegian explorer Roald Amundsen ‒ Led the first expedition to the South Pole ‒ Led the first expedition through the Northwest Passage
  14. Home Page
  15. Search
  16. Resource Metadata
  17. Resource Ownership
  18. Data Preview
  19. Computed Column Statistics • Disclaimer: these stats are arbitrary.
  20. Requesting Descriptions
  21. User Profile
  22. In-Application User Feedback
  23. Amundsen Architecture
  24. Amundsen Architecture
  25. Why choose a graph database?
  26. Why a graph database? (1/2)
  27. View Resource Metadata
  28. Why a graph database? (2/2)
  29. Neo4j is the source of truth for editable metadata
  30. Why not propagate the editable metadata back to the source?
  31. Why not propagate the editable metadata back to the source?
  32. Why not propagate the editable metadata back to the source?
  33. Why not propagate the editable metadata back to the source?
  34. Impact at Lyft
  35. Amundsen’s Impact at Lyft • Deployed at Lyft for over 1 year • Over 700 Weekly Active Users • 90% penetration among Data Scientists • Reduced mean time to discovery by 75% • Also used by Data Eng, Software Eng, PMs, Ops, Marketing Managers, and more
  36. Future Work
  37. Search Preview
  38. Advanced Search
  39. More Metadata
  40. We're Open Source
  41. Amundsen is Open Source! • github.com/lyft/amundsen • 200+ GitHub stars, 10+ companies contributing back • Slack channel with 250+ people from 30+ companies • Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow, LA, and NYC by Lyft employees and the community
  42. Community Overview • Contributors • Active community
  43. Thank You
  44. Alagappan Sethuraman | /in/alagappanut • Daniel Won | /in/danwon • Project Code @ github.com/lyft/amundsen • Icons under Creative Commons License from https://thenounproject.com/
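
The "run some queries to try to figure it out" step from slide 10 can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module and an in-memory table named after the hosted_events table from slide 8. The sample rows and the helper function names are invented for the example and do not come from Lyft's data.

```python
import sqlite3

# Hypothetical sample rows: (event, booked, registered, attendance).
ROWS = [
    ("IDEAS 2017", 120, 100, 90),
    ("IDEAS 2018", 150, 140, 145),
    ("Meetup", 40, None, 35),
]


def load_sample(conn):
    """Create the illustrative table and fill it with the sample rows."""
    conn.execute(
        "CREATE TABLE hosted_events ("
        "event TEXT, booked INTEGER, registered INTEGER, attendance INTEGER)"
    )
    conn.executemany("INSERT INTO hosted_events VALUES (?, ?, ?, ?)", ROWS)


def attendance_exceeds_registered(conn):
    """Events where attendance > registered.

    Such rows hint that attendance may count people who never registered
    (for example, staff). Rows with a NULL in either column are skipped.
    """
    cur = conn.execute(
        "SELECT event FROM hosted_events "
        "WHERE attendance > registered ORDER BY event"
    )
    return [row[0] for row in cur]


def null_counts(conn):
    """Number of missing values in each candidate column."""
    cur = conn.execute(
        "SELECT SUM(booked IS NULL), SUM(registered IS NULL), "
        "SUM(attendance IS NULL) FROM hosted_events"
    )
    booked, registered, attendance = cur.fetchone()
    return {"booked": booked, "registered": registered, "attendance": attendance}


conn = sqlite3.connect(":memory:")
load_sample(conn)
print(attendance_exceeds_registered(conn))
print(null_counts(conn))
```

Queries like these can narrow down what a column means, but as the slide warns, they only produce guesses. The answer still depends on documentation or an owner who knows the table.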

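Slides 25 to 29 argue for a graph database, but the transcript does not preserve the reasoning shown on those slides. As one way to picture why relationships such as ownership map naturally onto graph edges, here is a minimal in-memory sketch in plain Python. The MetadataGraph class, the node names, and the relationship labels (OWNS, HAS_COLUMN) are hypothetical. The sketch does not use Neo4j and should not be read as Amundsen's actual schema.

```python
from collections import defaultdict


class MetadataGraph:
    """Tiny directed, labeled graph: (source) -[relation]-> (target)."""

    def __init__(self):
        self._out = defaultdict(list)  # node -> [(relation, target)]
        self._in = defaultdict(list)   # node -> [(relation, source)]

    def relate(self, source, relation, target):
        """Record one labeled edge and index it in both directions."""
        self._out[source].append((relation, target))
        self._in[target].append((relation, source))

    def targets(self, source, relation):
        """Nodes reachable from `source` over edges labeled `relation`."""
        return sorted(t for r, t in self._out.get(source, []) if r == relation)

    def sources(self, target, relation):
        """Nodes that point at `target` over edges labeled `relation`."""
        return sorted(s for r, s in self._in.get(target, []) if r == relation)


# Hypothetical metadata: one user owning two tables, one table with two columns.
graph = MetadataGraph()
graph.relate("table:hosted_events", "HAS_COLUMN", "column:attendance")
graph.relate("table:hosted_events", "HAS_COLUMN", "column:registered")
graph.relate("user:analyst", "OWNS", "table:hosted_events")
graph.relate("user:analyst", "OWNS", "table:venues")

# The same edges answer questions in both directions.
print(graph.targets("user:analyst", "OWNS"))         # what does this user own?
print(graph.sources("table:hosted_events", "OWNS"))  # who owns this table?
```

The point of the sketch is that "who owns this table" and "what does this user own" are the same edge traversed in opposite directions. The user profile and resource ownership pages on slides 17 and 21 appear to need that kind of two-way lookup.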