Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation

Presentation on Lyft's data discovery tool -- Amundsen.

  1. Disrupting Data Discovery. Tamika Tannis | @ttannis | Software Engineer, Lyft. Wednesday, September 18th 2019. go.lyft.com/datadiscoveryslides
  2. Agenda
     • Data Ecosystem at Lyft
     • Challenges with Data Discovery
     • Data Discovery at Lyft
     • Amundsen’s Architecture
     • What’s Next?
  3. Data Ecosystem at Lyft
  4. Core Data Infrastructure (High Level). Diagram labels: Custom Applications, Architecture, Applications, Mobile App, Services, Data Streaming Frameworks (Kafka / Kinesis), Flink
  5. Challenges with Data Discovery
  6. Data is used to make informed decisions
     Make data the heart of every decision
     Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
     Data-driven decision making process:
     1. Search & find data
     2. Understand the data
     3. Perform an analysis/visualisation
     4. Share insights and/or make a decision
  7. Hi! I’m a new Analyst in the Fraud Department!
     • Goal: What new data-driven policies can we enact to reduce driver insurance fraud?
     • Idea: Let’s take a deeper look into insurance claims from drivers who have given fewer than 𝑥 rides.
     • Next Step: I’ll first get all drivers who have given fewer than 𝑥 rides... but where do I look?
  8. Step 1: Search & find data
     • Ask a friend/manager/coworker
     • Ask in a wider Slack channel
     • Search in the Github repos
     We end up finding tables: driver_rides & rides_driver_total
  9. Step 2: Understand the data
     • What is the difference: driver_rides vs. rides_driver_total
     • What do the different fields mean?
       ‒ Is driver_rides.completed different from rides_driver_total.lifetime_completed?
       ‒ What period of time does the data in each table cover?
     • Dig deeper: explore using SQL queries
       SELECT * FROM schema.driver_rides WHERE ds='2019-05-15' LIMIT 100;
       SELECT * FROM schema.rides_driver_total WHERE ds='2019-05-15' LIMIT 100;
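
The manual exploration in Step 2 can be scripted. The following is a minimal sketch, not anything shown in the deck: an in-memory SQLite database stands in for the warehouse, the `schema.` prefix is dropped because SQLite has no such schema, and the `driver_id` column and all rows are invented for illustration. Only the table names and the `ds`, `completed` and `lifetime_completed` columns come from the slide.

```python
import sqlite3

# In-memory SQLite stands in for the real warehouse (assumption).
# `driver_id` and every row below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE driver_rides (driver_id INTEGER, completed INTEGER, ds TEXT);
    CREATE TABLE rides_driver_total (driver_id INTEGER, lifetime_completed INTEGER, ds TEXT);
    INSERT INTO driver_rides VALUES (1, 3, '2019-05-15'), (2, 0, '2019-05-15');
    INSERT INTO rides_driver_total VALUES (1, 120, '2019-05-15'), (2, 7, '2019-05-15');
""")


def column_names(table):
    """Return a table's column names using SQLite's PRAGMA table_info."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def sample(table, ds, limit=100):
    """Mirror the slide's query: SELECT * FROM <table> WHERE ds = ? LIMIT ?."""
    return conn.execute(
        f"SELECT * FROM {table} WHERE ds = ? LIMIT ?", (ds, limit)
    ).fetchall()


# Columns that exist in one table but not the other are the first
# things a new analyst would have to ask about.
only_in_driver_rides = set(column_names("driver_rides")) - set(column_names("rides_driver_total"))
only_in_total = set(column_names("rides_driver_total")) - set(column_names("driver_rides"))
```

Even with the schema difference automated, the script cannot say what `completed` means or which time period each table covers. That gap is what the rest of the deck addresses.
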
  10. Data Scientists spend up to 1/3 of their time on Data Discovery
      • Data discovery is a problem because of the lack of understanding of what data exists, where it lives, who owns it, & how to use it.
      • It is not what our data scientists should focus on: they should focus on analysis work.
      Data-based decision making process:
      1. Search & find data
      2. Understand the data
      3. Perform an analysis/visualisation
      4. Share insights and/or make a decision
  11. Audience for data discovery
  12. User Personas - (1/2)
      Analysts, Data Scientists, General Managers, Experimenters, Engineers, Product Managers
      • Frequent use of data
      • Deep to very deep analysis
      • Exposure to new datasets
      • Creating insights & developing models
  13. User Personas - (2/2)
      Power User
      - Has been at Lyft for a long time
      - Knows the data environment well: where to find data, what it means, how to use it
      Pain points:
      - Needs to spend a fair amount of their time sharing their knowledge with the new user
      - Could become “New user” if they switch teams
      New User
      - Recently joined Lyft or switched to a new team
      - Needs to ramp up on a lot of things, wants to start having impact soon
      Pain points:
      - Doesn’t know where to start. Spends their time asking questions and cmd+F on github
      - Makes mistakes by misusing some datasets
  14. 3 complementary ways to do Data Discovery
      Search based: I am looking for a table with data on “cancel rates”
      - Where is the table?
      - What does it contain?
      - Has the analysis I want to perform already been done?
      Lineage based: If this event is down, what datasets are going to be impacted?
      - Upstream/downstream lineage
      - Incidents, SLA misses, Data quality
      Network based: I want to check what tables my manager uses
      - Ownership information
      - Bookmarking
      - Usage through query logs
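
The lineage-based question ("if this event is down, what datasets are going to be impacted?") is a graph traversal. A minimal sketch, assuming lineage is available as a mapping from each dataset to its direct downstream consumers; the dataset names and edges below are hypothetical and the deck does not describe how Amundsen stores lineage.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets in its value list.
DOWNSTREAM = {
    "event_ride_completed": ["driver_rides"],
    "driver_rides": ["rides_driver_total", "cancel_rates_dashboard"],
    "rides_driver_total": [],
    "cancel_rates_dashboard": [],
}


def impacted(source, edges):
    """Breadth-first walk returning every dataset downstream of `source`."""
    seen = {source}
    queue = deque([source])
    order = []
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:  # guard against diamonds and cycles
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Calling `impacted("event_ride_completed", DOWNSTREAM)` lists the nearest consumers first, which is usually the order in which owners need to be notified.
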
  15. Data Discovery at Lyft
      Product named after Roald Amundsen
      ● First expedition to reach the South Pole
      ● First to explore both North & South Poles
  16. Landing Page - Optimized for search
  17. Search Results - Ranked on relevance & popularity
  18. Relevance - search for “apple” on Google (image labels: Low relevance, High relevance)
  19. Popularity - search for “apple” on Google (image labels: Low popularity, High popularity)
  20. Search Results - Striking the balance
      Relevance
      ● Names, Descriptions, Tags, [owners, frequent users]
      ● Different weights for different metadata, e.g. resource name
      Popularity
      ● Querying activity
      ● Dashboarding
      ● Lower weight for automated querying
      ● Higher weight for adhoc querying
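
One way to picture the balance between relevance and popularity is a score that multiplies a weighted text match by a dampened usage count. The sketch below is an illustration only: the `Table` class, the weights and the combining formula are assumptions, since the slides do not give Amundsen's actual ranking function.

```python
import math
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    description: str
    adhoc_queries: int      # queries typed by people
    automated_queries: int  # queries issued by scheduled jobs


# Illustrative weights only; the real values are not in the slides.
NAME_WEIGHT, DESCRIPTION_WEIGHT = 3.0, 1.0
ADHOC_WEIGHT, AUTOMATED_WEIGHT = 1.0, 0.1


def relevance(table, term):
    """A match in the resource name counts more than one in the description."""
    term = term.lower()
    score = 0.0
    if term in table.name.lower():
        score += NAME_WEIGHT
    if term in table.description.lower():
        score += DESCRIPTION_WEIGHT
    return score


def popularity(table):
    """Adhoc querying gets a higher weight than automated querying."""
    return ADHOC_WEIGHT * table.adhoc_queries + AUTOMATED_WEIGHT * table.automated_queries


def rank(tables, term):
    """Drop irrelevant tables, then sort by relevance scaled by log-dampened popularity."""
    scored = [
        (relevance(t, term) * (1.0 + math.log1p(popularity(t))), t.name)
        for t in tables
        if relevance(t, term) > 0
    ]
    return [name for _, name in sorted(scored, key=lambda pair: -pair[0])]
```

With these assumed weights, a table queried by a handful of people can outrank one hit thousands of times by a scheduled job, which is the intent of the last two bullets.
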
  21. View Resource Metadata
  22. Data Preview
  23. View Resource Metadata
  24. Computed Column Metadata Statistics (Disclaimer: these stats are arbitrary.)
  25. View Resource Metadata
  26. In-Application User Feedback
  27. Amundsen’s Architecture
  28. Architecture diagram. Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File. Components: Databuilder Crawler, Neo4j, Elasticsearch, Metadata Service, Search Service, Frontend Service. Other Microservices: ML Feature Service, Security Service
  29. 1. Metadata Service
  30. Architecture diagram (same components as slide 28)
  31. View Resource Metadata
  32. Why choose a graph database?
  33. Why Graph database? (1/2)
  34. Why Graph database? (2/2)
  35. 1. Metadata Service
      • A thin proxy layer to interact with the graph database
        ‒ Currently Neo4j is the default option for the graph backend engine
        ‒ Working with the community to support Apache Atlas
      • Supports a REST API for other services pushing / pulling metadata directly
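
The "thin proxy layer" idea can be sketched as an abstract interface that the service calls, with one implementation per graph backend. Everything named below (`BaseProxy`, `InMemoryProxy`, `MetadataService.handle`, the table key format) is hypothetical and chosen for illustration; it is not presented as Amundsen's actual class layout or REST routes. An in-memory dictionary stands in for Neo4j or Apache Atlas so the sketch runs without a database.

```python
from abc import ABC, abstractmethod


class BaseProxy(ABC):
    """Hypothetical backend-agnostic interface the service depends on."""

    @abstractmethod
    def get_table_description(self, table_key):
        ...

    @abstractmethod
    def put_table_description(self, table_key, description):
        ...


class InMemoryProxy(BaseProxy):
    """Stand-in for a Neo4j- or Atlas-backed proxy (assumption for the demo)."""

    def __init__(self):
        self._descriptions = {}

    def get_table_description(self, table_key):
        return self._descriptions.get(table_key)

    def put_table_description(self, table_key, description):
        self._descriptions[table_key] = description


class MetadataService:
    """Tiny stand-in for REST routing: GET pulls metadata, PUT pushes it."""

    def __init__(self, proxy):
        self._proxy = proxy

    def handle(self, method, table_key, body=None):
        if method == "GET":
            description = self._proxy.get_table_description(table_key)
            if description is None:
                return 404, {}
            return 200, {"description": description}
        if method == "PUT":
            self._proxy.put_table_description(table_key, body["description"])
            return 200, {}
        return 405, {}
```

Because the service only talks to `BaseProxy`, swapping the graph backend means writing one new proxy class while callers of the REST API stay unchanged.
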
  36. Neo4j is the source of truth for editable metadata
  37. Why not propagate the editable metadata back to source?
  38. Why not propagate the editable metadata back to source?
  39. Why not propagate the editable metadata back to source?
  40. Why not propagate the editable metadata back to source?
  41. 2. Databuilder
  42. Architecture diagram. Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File. Components: Databuilder Crawler, Neo4j, Elasticsearch, Metadata Service, Search Service, Frontend Service. Other Microservices: ML Feature Service, Other Services
  43. Metadata Sources @ Lyft
  44. Metadata - Challenges
      • No Standardization: No single data model fits all data resources
        ‒ A data resource could be a table, an Airflow DAG or a dashboard
      • Different Extraction: Each data set’s metadata is stored and fetched differently
        ‒ Hive Table: Stored in Hive metastore
        ‒ RDBMS (postgres etc.): Fetched through DBAPI interface
        ‒ Github source code: Fetched through git hook
        ‒ Mode dashboard: Fetched through Mode API
        ‒ …
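
A common answer to both challenges is one small extractor per source, each emitting the same record type. The sketch below is an assumption-laden illustration, not Databuilder's real API: `ResourceMetadata`, the extractor functions and the shape of the raw input dictionaries are all hypothetical, and the raw inputs are plain Python data rather than live calls to a Hive metastore or the Mode API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceMetadata:
    """Hypothetical common record that every source is normalized into."""
    resource_type: str  # e.g. "table" or "dashboard"
    source: str
    name: str
    description: str


def extract_hive(rows):
    """Rows mimic what a metastore query might return (made-up shape)."""
    for row in rows:
        yield ResourceMetadata(
            "table", "hive", f"{row['db']}.{row['tbl']}", row.get("comment", "")
        )


def extract_mode(reports):
    """Reports mimic a dashboard API response (made-up shape)."""
    for report in reports:
        yield ResourceMetadata(
            "dashboard", "mode", report["title"], report.get("summary", "")
        )


# One extractor per source; adding a source means adding one entry here.
EXTRACTORS = {"hive": extract_hive, "mode": extract_mode}


def extract_all(raw_by_source):
    """Run each source's extractor and yield records in a single format."""
    for source, raw in raw_by_source.items():
        yield from EXTRACTORS[source](raw)
```

Downstream code (loading into the graph, indexing for search) then only has to understand `ResourceMetadata`, regardless of where a record came from.
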
  45. Databuilder
  46. Databuilder in action
  47. How is the databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  48. 3. Search Service
  49. Architecture diagram (same components as slide 28)
  50. 3. Search Service
      • A thin proxy layer to interact with the search backend
        ‒ Currently it supports Elasticsearch as the search backend.
      • Supports different search patterns
        ‒ Normal Search: match records based on relevancy
        ‒ Category Search: match records first based on data type, then relevancy
        ‒ Wildcard Search
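
The three search patterns can be told apart from the query string alone. The following sketch runs against a small in-memory list instead of Elasticsearch; the index contents, the `column` category handling and the term-count scoring are assumptions made for illustration. The example queries `event_*` and `column: is_line_ride` are borrowed from the deck's following slide.

```python
from fnmatch import fnmatchcase

# Hypothetical index; a real deployment would query the search backend.
INDEX = [
    {"name": "event_ride_requested", "columns": ["ride_id", "is_line_ride"],
     "description": "ride request events"},
    {"name": "event_ride_completed", "columns": ["ride_id"],
     "description": "completed ride events"},
    {"name": "driver_rides", "columns": ["driver_id", "completed"],
     "description": "rides per driver"},
]


def search(query, index=INDEX):
    """Dispatch to category, wildcard or normal search based on the query text."""
    query = query.strip()
    if ":" in query:
        # Category search, e.g. "column: is_line_ride".
        category, term = (part.strip() for part in query.split(":", 1))
        if category == "column":
            return [r["name"] for r in index if term in r["columns"]]
        return []
    if "*" in query:
        # Wildcard search on the resource name, e.g. "event_*".
        return [r["name"] for r in index if fnmatchcase(r["name"], query)]
    # Normal search: crude relevancy = occurrences of the term in name + description.
    term = query.lower()
    scored = [
        ((r["name"] + " " + r["description"]).lower().count(term), r["name"])
        for r in index
    ]
    return [name for score, name in sorted(scored, key=lambda s: -s[0]) if score > 0]
```

Only the `column` category is handled here; other categories return an empty list and would need their own branch.
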
  51. How to make the search result more relevant?
      • Experiment with different weights, e.g. boost the exact table ranking
      • Collect metrics
        ‒ Instrumentation for search behavior
        ‒ Measure click-through-rate (CTR) over top 5 results
      • Advanced search:
        ‒ Support wildcard search (e.g. event_*)
        ‒ Support category search (e.g. column: is_line_ride)
        ‒ Future: Filtering, Autosuggest
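
The CTR metric can be computed from instrumented search logs. This is a sketch under one possible definition (the share of searches in which the user clicked a result ranked in the top 5); the deck does not spell out the exact formula, and the log format with a 1-based `clicked_rank` field is hypothetical.

```python
def ctr_at_k(search_logs, k=5):
    """Share of searches where the clicked result was ranked in the top k.

    Each log entry is a dict with `clicked_rank`: the 1-based position of
    the clicked result, or None when the user clicked nothing.
    """
    if not search_logs:
        return 0.0
    hits = sum(
        1
        for log in search_logs
        if log["clicked_rank"] is not None and log["clicked_rank"] <= k
    )
    return hits / len(search_logs)
```

Tracking this number before and after a weight change gives a simple way to judge whether the experiment made results more relevant.
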
  52. 4. Frontend Service
  53. Architecture diagram (same components as slide 28)
  54. Web Application
  55. Web Technologies: Develop, Build, Test
  56. What’s Next?
  57. Amundsen’s Impact
      • Tremendous success at Lyft
        ‒ Used by Data Scientists, Engineers, PMs, Ops, even Customer Service!
        ‒ 90% penetration among Data Scientists
        ‒ +30% productivity for the Data Science org
  58. Amundsen is Open Source!
      • github.com/lyft/amundsen
      • Growing and active community
        ‒ c. 150 github stars, 10+ companies contributing back
        ‒ Slack w/ 30+ companies and c. 100 people
        ‒ Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow by Lyft employees and community
        ‒ Featured in blog posts and interviews
      • Net positive impact for Lyft through external community contributing
        ‒ Integration with open source backend
        ‒ Integration with new data sources (BigQuery, Redshift, Postgres), lifting them from our roadmap
  59. Community Overview: Contributors, Active community
  60. Roadmap. Diagram labels: People, Dashboards, Data sets; Phase 1 (Complete), Phase 2 (In Progress), Phase 3 (In Scoping); Streams, Schemas, Workflows; More Metadata; Deeper integration with other tools (e.g. Mode); Privacy Governance
  61. Amundsen People
  62. Amundsen People
  63. Roadmap (same diagram labels as slide 60)
  64. Roadmap
  65. Roadmap
  66. Tamika Tannis | @ttannis | /in/tamika-tannis. Project Code @ github.com/lyft/amundsen. Blog Post @ go.lyft.com/datadiscoveryblog. Icons under Creative Commons License from https://thenounproject.com/