How Lyft Drives Data Discovery

Speaker: Philippe Mizrahi - Associate Product Manager - Lyft

Abstract: Philippe Mizrahi works on Lyft’s data discovery and metadata engine, Amundsen. With the help of a Neo4j graph database, Amundsen has improved Lyft’s data discovery by reducing time to discover data by 10x.

During this session, Philippe will dive deep into Amundsen’s use cases, impact, and architecture, which effectively combines a comprehensive knowledge graph based upon Neo4j, centralized metadata and other search ranking optimizations to discover data quickly.

  1. 1. Tuesday, October 1st 2019 Phil Mizrahi | Product @Lyft Disrupting Data Discovery with Amundsen
  2. 2. Agenda • Challenges with Data Discovery • Evaluating Solutions • Amundsen • Amundsen’s Architecture: how we use Neo4j • Impact • What’s Next?
  3. 3. Challenges with Data Discovery
  4. 4. Data is used to make informed decisions. Make data the heart of every decision. Analysts • Data Scientists • General Managers • Engineers • Experimenters • Product Managers. Data-driven decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or make a decision
  5. 5. Hi! I’m a new Analyst in the Fraud Department! • Goal: What new data-driven policies can we enact to reduce driver insurance fraud? • Idea: Let’s take a deeper look into insurance claims from drivers who have given fewer than 𝑥 rides. • Next Step: I’ll first get all drivers who have given fewer than 𝑥 rides... but where do I look?
  6. 6. Step 1: Search & find data • Ask a friend/manager/coworker • Ask in a wider Slack channel • Search in the GitHub repos. We end up finding tables: driver_rides & rides_driver_total
  7. 7. Step 2: Understand the data • What is the difference between driver_rides and rides_driver_total? • What do the different fields mean? ‒ Is driver_rides.completed different from rides_driver_total.lifetime_completed? ‒ What period of time does the data in each table cover? • Dig deeper: explore using SQL queries: SELECT * FROM schema.driver_rides WHERE ds='2019-05-15' LIMIT 100; SELECT * FROM schema.rides_driver_total WHERE ds='2019-05-15' LIMIT 100;
  8. 8. Lack of productivity had many side effects • Lots of unknowns: Does data exist? Prior work? Source of truth? Who owns it? Who uses it? • Increased database load: lots of queries like SELECT * FROM default.my_table WHERE ds='2018-01-01' LIMIT 100; • Interrupt-heavy data culture: no way to know & understand trusted data; created channels & oncalls for data questions
  9. 9. Lots of wasted tech & biz users’ time. Chart: Analyst/DS workflow and time spent on each step
  10. 10. Evaluating Solutions
  11. 11. Holy grail of solving for productivity: metadata (noun, /ˈmedəˌdādə, ˈmedəˌdadə/): a set of data that describes and gives information about other data. 1. What kind of information? 2. About what data?
  12. 12. 1. What kind of information? (aka ABC of metadata) • Application Context: metadata needed by humans or applications to operate ‒ Where is the data? ‒ What are the semantics of the data? • Behavior: how is data created and used over time? ‒ Who’s using the data? ‒ Who created the data? • Change: change in data over time ‒ How is the data evolving over time? ‒ Evolution of code that generates the data. Terminology borrowed from the Ground paper.
  13. 13. 2. About what data? Short answer: any data within your organization. Long answer: Data stores • Dashboard / Reports • Schema registry • Events / Schemas • Streams • People • Employees
  14. 14. 3 complementary ways to do Data Discovery • Search based: I am looking for a table with data on “cancel rates” ‒ Where is the table? ‒ What does it contain? ‒ Has the analysis I want to perform already been done? • Lineage based: If this event is down, what datasets are going to be impacted? ‒ Upstream/downstream lineage ‒ Incidents, SLA misses, Data quality • Network based: I want to check what tables my manager uses ‒ Ownership information ‒ Bookmarking ‒ Usage through query logs
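The lineage-based question on this slide ("if this event is down, what datasets are going to be impacted?") amounts to a downstream traversal of a dependency graph. The following is a minimal Python sketch, assuming lineage is available as a mapping from each resource to its direct downstream consumers. The resource names and the `downstream_impact` helper are hypothetical and are not taken from the talk.

```python
from collections import deque

# Hypothetical lineage edges: resource -> resources that read from it directly.
LINEAGE = {
    "event.ride_requested": ["schema.rides_raw"],
    "schema.rides_raw": ["schema.driver_rides", "schema.rides_daily"],
    "schema.driver_rides": ["dashboard.driver_overview"],
    "schema.rides_daily": [],
    "dashboard.driver_overview": [],
}


def downstream_impact(lineage, source):
    """Return every resource reachable downstream of `source`, in BFS order."""
    seen = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guard against diamonds and cycles
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream_impact(LINEAGE, "event.ride_requested")
print(impacted)
```

In a graph database the same question would typically be expressed as a variable-length path query rather than an explicit loop; the breadth-first walk here only illustrates the idea.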
  15. 15. Data discovery for ALL users • Power User ‒ Has been at Lyft for a long time ‒ Knows the data environment well: where to find data, what it means, how to use it ‒ Pain points: needs to spend a fair amount of their time sharing their knowledge with the new user; could become a “New User” if they switch teams • New User ‒ Recently joined Lyft or switched to a new team ‒ Needs to ramp up on a lot of things, wants to start having impact soon ‒ Pain points: doesn’t know where to start, spends their time asking questions and cmd+F on GitHub; makes mistakes by misusing some datasets • Other requirements ‒ Leverage as much data automatically as possible ‒ Preferably open source with a healthy community ‒ API availability ‒ Easy to set up
  16. 16. Solution space • Vendors: Alation, Collibra • Existing open source projects (e.g. Apache Atlas) • LinkedIn’s data portal: WhereHows & DataHub (blog, code) • Twitter’s data discovery (blog) • Netflix’s Metacat (code, blog) • Airbnb’s data portal (blog, video) • BigQuery SQL Web UI & catalog (blog) • Goods: Organizing Google’s Datasets (paper) • Data Warehousing and Analytics Infrastructure at Facebook (paper)
  17. 17. Compared various existing solutions/open source projects. Products: Alation • WhereHows • Airbnb Data Portal • Cloudera Navigator • Apache Atlas. Criteria: Search based • Lineage based • Network based • Hive/Presto support • Redshift support • Open source (pref.)
  18. 18. Amundsen: product named after Roald Amundsen ● First expedition to reach the South Pole ● First to explore both North & South Poles
  19. 19. Landing Page - Optimized for search
  20. 20. Search Results - Ranked on relevance & popularity
  21. 21. Relevance - search for “apple” on Google. Figure: a low-relevance result next to a high-relevance result
  22. 22. Popularity - search for “apple” on Google. Figure: a low-popularity result next to a high-popularity result
  23. 23. Search Results - Striking the balance • Relevance ‒ Names, Descriptions, Tags, [owners, frequent users] ‒ Different weights for different metadata, e.g. resource name • Popularity ‒ Querying activity ‒ Dashboarding ‒ Lower weight for automated querying ‒ Higher weight for adhoc querying
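One way to picture the balance between relevance and popularity is a score that weights metadata fields differently and discounts automated queries. The Python sketch below is illustrative only: the slide gives no numbers, so the field weights, the usage weights, the log damping and the way the two signals are combined are all assumptions, and the table records are made up.

```python
import math

# Illustrative weights only; the slide says weights differ per field and per
# query type but does not give values.
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0, "tags": 1.5}
USAGE_WEIGHTS = {"adhoc": 1.0, "automated": 0.1}


def relevance(query, resource):
    """Sum the weights of every metadata field that contains the query term."""
    term = query.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if term in resource.get(field, "").lower()
    )


def popularity(resource):
    """Weighted query count, log-damped so heavy hitters do not dominate."""
    weighted = sum(
        USAGE_WEIGHTS[kind] * count
        for kind, count in resource.get("query_counts", {}).items()
    )
    return math.log1p(weighted)


def score(query, resource):
    """Relevance gates the result; popularity only reorders matching ones."""
    rel = relevance(query, resource)
    return rel * (1.0 + popularity(resource)) if rel else 0.0


# Hypothetical records standing in for indexed tables.
TABLES = [
    {"name": "driver_rides", "description": "rides per driver", "tags": "core",
     "query_counts": {"adhoc": 400, "automated": 50}},
    {"name": "rides_driver_total", "description": "lifetime totals", "tags": "",
     "query_counts": {"adhoc": 5, "automated": 5000}},
    {"name": "payments", "description": "payment events", "tags": "",
     "query_counts": {"adhoc": 900}},
]

ranked = sorted(TABLES, key=lambda t: score("rides", t), reverse=True)
print([t["name"] for t in ranked if score("rides", t) > 0])
```

Multiplying rather than adding the two signals keeps a popular but irrelevant table (here `payments`) out of the results entirely, which is one possible reading of "striking the balance".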
  24. 24. View Resource Metadata
  25. 25. Data Preview
  26. 26. View Resource Metadata
  27. 27. Computed Column Metadata Statistics. Disclaimer: these stats are arbitrary.
  28. 28. In-Application User Feedback
  29. 29. Amundsen’s Architecture
  30. 30. Architecture diagram. Metadata Sources: Postgres, Hive, Redshift, Presto, Github Source File, ... • Databuilder Crawler • Neo4j • Elasticsearch • Metadata Service • Search Service • Frontend Service • Other Microservices: ML Feature Service, Security Service
  31. 31. 1. Metadata Service
  32. 32. Architecture diagram (repeated from slide 30)
  33. 33. Metadata Service • A thin proxy layer to interact with the graph database ‒ Currently Neo4j is the default option for the graph backend engine ‒ Working with the community to support Apache Atlas • Supports a REST API for other services pushing / pulling metadata directly
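The "thin proxy layer" idea can be sketched as a service that only talks to an abstract backend interface, so that Neo4j or Apache Atlas can sit behind it. The Python below is a rough illustration under that assumption. `GraphProxy`, `InMemoryProxy`, `MetadataService` and their methods are hypothetical names invented here, not Amundsen's actual classes, and the in-memory backend stands in for a real graph database.

```python
from abc import ABC, abstractmethod


class GraphProxy(ABC):
    """Backend-agnostic interface (hypothetical, not Amundsen's actual API)."""

    @abstractmethod
    def get_table(self, table_key):
        """Return the table's metadata dict, or None if it is unknown."""

    @abstractmethod
    def put_description(self, table_key, description):
        """Create or update the description stored for a table."""


class InMemoryProxy(GraphProxy):
    """Stand-in backend; a real one would talk to Neo4j or Apache Atlas."""

    def __init__(self):
        self._tables = {}

    def get_table(self, table_key):
        return self._tables.get(table_key)

    def put_description(self, table_key, description):
        table = self._tables.setdefault(table_key, {"key": table_key})
        table["description"] = description


class MetadataService:
    """Thin layer that delegates every call to the configured proxy."""

    def __init__(self, proxy):
        self._proxy = proxy

    def describe(self, table_key):
        table = self._proxy.get_table(table_key)
        if table is None:
            raise KeyError(f"unknown table: {table_key}")
        return table

    def update_description(self, table_key, description):
        self._proxy.put_description(table_key, description)


service = MetadataService(InMemoryProxy())
service.update_description("hive://schema.driver_rides", "Rides per driver")
print(service.describe("hive://schema.driver_rides"))
```

Swapping the backend then means passing a different `GraphProxy` implementation to the service, with no change to callers.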
  34. 34. Why choose a graph database?
  35. 35. Why Graph database? (1/2)
  36. 36. Why Graph database? (2/2)
  37. 37. 2. Databuilder
  38. 38. Architecture diagram (as on slide 30, with “Other Services” in place of “Security Service”)
  39. 39. Metadata Sources @ Lyft
  40. 40. Metadata - Challenges • No Standardization: no single data model fits all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: each data set’s metadata is stored and fetched differently ‒ Hive Table: stored in Hive metastore ‒ RDBMS (Postgres etc.): fetched through DBAPI interface ‒ GitHub source code: fetched through git hook ‒ Mode dashboard: fetched through Mode API ‒ …
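One common way to cope with the two challenges above is to give each source its own extractor and have every extractor emit the same normalized record. The Python sketch below illustrates that pattern only; it is not the Databuilder implementation. `ResourceRecord`, the two extractor classes, `run_job` and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Protocol


@dataclass(frozen=True)
class ResourceRecord:
    """Common shape every source is normalized into (hypothetical model)."""
    kind: str  # e.g. "table" or "dashboard"
    key: str
    description: str


class Extractor(Protocol):
    def extract(self) -> Iterator[ResourceRecord]: ...


class HiveLikeExtractor:
    """Pretends to read metastore rows shaped as (schema, table, comment)."""

    def __init__(self, rows):
        self._rows = rows

    def extract(self):
        for schema, table, comment in self._rows:
            yield ResourceRecord("table", f"hive://{schema}.{table}", comment or "")


class DashboardLikeExtractor:
    """Pretends to read dicts returned by a dashboard tool's API."""

    def __init__(self, reports):
        self._reports = reports

    def extract(self):
        for report in self._reports:
            key = f"dashboard://{report['id']}"
            yield ResourceRecord("dashboard", key, report.get("title", ""))


def run_job(extractors: Iterable[Extractor]) -> List[ResourceRecord]:
    """Pull from every source and collect normalized records for loading."""
    records = []
    for extractor in extractors:
        records.extend(extractor.extract())
    return records


records = run_job([
    HiveLikeExtractor([("schema", "driver_rides", "Rides per driver")]),
    DashboardLikeExtractor([{"id": "abc123", "title": "Driver overview"}]),
])
for record in records:
    print(record)
```

Adding a new source then only requires a new class with an `extract` method; the loading side does not change.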
  41. 41. Databuilder
  42. 42. 3. Search Service
  43. 43. Architecture diagram (repeated from slide 30)
  44. 44. Search Service • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend • Supports different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search
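The three search patterns can be illustrated with a small in-memory matcher. This Python sketch is an assumption-laden stand-in for what a search backend would do: the query forms `event_*` and `column: is_line_ride` appear elsewhere in the deck, but the index contents, the `parse` and `search` helpers, and the use of substring matching in place of relevancy scoring are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical index: each record has a category and a name.
INDEX = [
    {"category": "table", "name": "event_ride_requested"},
    {"category": "table", "name": "event_ride_completed"},
    {"category": "table", "name": "driver_rides"},
    {"category": "column", "name": "is_line_ride"},
]


def parse(query):
    """Split an optional 'category: term' prefix from the search term."""
    if ":" in query:
        category, term = query.split(":", 1)
        return category.strip(), term.strip()
    return None, query.strip()


def search(query, index=INDEX):
    """Return matching names, applying category and wildcard rules."""
    category, term = parse(query)
    results = []
    for record in index:
        if category and record["category"] != category:
            continue  # category search: filter on type first
        if "*" in term:
            matched = fnmatchcase(record["name"], term)  # wildcard search
        else:
            matched = term in record["name"]  # normal search (substring stand-in)
        if matched:
            results.append(record["name"])
    return results


print(search("event_*"))
print(search("column: is_line_ride"))
print(search("rides"))
```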
  45. 45. How to make the search results more relevant? • Collect metrics ‒ Instrumentation for search behavior ‒ Measure click-through rate (CTR) over top 5 results • Experiment with different weights, e.g. boost the exact table ranking • Advanced search: ‒ Support wildcard search (e.g. event_*) ‒ Support category search (e.g. column: is_line_ride) ‒ Future: Filtering, Autosuggest
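The "CTR over top 5 results" metric can be computed from instrumented search events. The Python sketch below assumes a simple log format in which each search records the 1-based rank of the clicked result, or `None` when nothing was clicked. That format, the `ctr_at_k` helper and the sample log are hypothetical.

```python
def ctr_at_k(search_events, k=5):
    """Share of searches where the user clicked a result ranked in the top k.

    Each event is a dict whose 'clicked_rank' is the 1-based rank of the
    clicked result, or None if nothing was clicked (hypothetical log format).
    """
    if not search_events:
        return 0.0
    hits = sum(
        1
        for event in search_events
        if event["clicked_rank"] is not None and event["clicked_rank"] <= k
    )
    return hits / len(search_events)


LOG = [
    {"query": "driver rides", "clicked_rank": 1},
    {"query": "cancel rates", "clicked_rank": 7},
    {"query": "event_*", "clicked_rank": None},
    {"query": "is_line_ride", "clicked_rank": 3},
]

print(ctr_at_k(LOG))  # 2 of 4 searches had a top-5 click -> 0.5
```

Tracking this number before and after a change to the ranking weights gives a simple way to compare experiments.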
  46. 46. 4. Frontend Service
  47. 47. Architecture diagram (repeated from slide 30)
  48. 48. Web Application
  49. 49. Web Technologies: Develop • Build • Test
  50. 50. Impact
  51. 51. A6n @ Lyft • “This is God’s work” - George X, ex-head of Analytics, Lyft • “I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft
  52. 52. Roles of Amundsen users at Lyft. Penetration rate: DS (aka analyst): 81% • RS (aka DS): 71% • PM: 22% • SWE: 17% • Cust Serv: 7% • Sp. Ops: 67% • Sp. Op Leads: 53% • Economist: 100% • Cust. Quality: 78% • Growth Mktg: 25%
  53. 53. Community Users: Prominent users • Active community
  54. 54. Community overview Contributors
  55. 55. Recent Contributions from the community • BigQuery integration (Coolblue) • PostgreSQL and Redshift integration (Everfi) • Security improvements and Apache Atlas integration (ING) • Snowflake integration (LMC) • Toolbar on landing page (In progress, Workday) • Integrating with Delta analytics platform (In progress, Databricks) • Talks by ING & Coolblue at conferences in Barcelona, Vilnius & Moscow
  56. 56. What’s Next?
  57. 57. 1. Develop breadth of applications. Metadata: Compliance (GDPR/CCPA) • Data Discovery • Downstream impact analysis • ... • Data Quality
  58. 58. 2. Develop depth of metadata
  59. 59. Roadmap (subject to change, not ordered) • Index Dashboards (Product spec) • Link business terms and process to technical metadata • Standardize Information Governance metadata • Include tags in search • ACL integration, allow only specific roles to edit descriptions • Show search context for what matched • “Request for descriptions” aka notifications • Data Lineage
  60. 60. Phil Mizrahi | @philippemizrahi | in/philippe-mizrahi • Project Code @ github.com/lyft/amundsen • Blog Post @ go.lyft.com/datadiscoveryblog • Icons under Creative Commons License from https://thenounproject.com/