Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

4

Share

Download to read offline

Strata sf - Amundsen presentation

Download to read offline

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Related Books

Free with a 30 day trial from Scribd

See all

Strata sf - Amundsen presentation

  1. 1. March 2019 Mark Grover | @mark_grover | Product Management, Lyft Tao Feng | @feng-tao | Software Engineer, Lyft Disrupting Data Discovery
  2. 2. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Architecture • Summary 2
  3. 3. Data platform users 3 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  4. 4. 5 Core Infra high level architecture Custom apps
  5. 5. Data Discovery 6
  6. 6. • My first project is to analyze and predict Strata Attendance • Where is the data? • What does it mean? Hi! I am a n00b Data Scientist! 7
  7. 7. • Option 1: Phone a friend! • Option 2: Github search Status quo 8
  8. 8. • What does this field mean? ‒ Does attendance data include employees? ‒ Does it include revenue? • Let me dig in and understand Understand the context 9
  9. 9. Explore SELECT * FROM default.my_table WHERE ds=’2018-01-01’ LIMIT 100;
  10. 10. Exploring with SELECT * is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases 11
  11. 11. Data Scientists spend upto 1/3rd time in Data Discovery... 12 • Data discovery ‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
  12. 12. Audience for data discovery 13
  13. 13. Data Discovery - User personas 14 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  14. 14. 3 Data Scientist personas Power user ● All info in their head ● Get interrupted a lot due to questions ● Lost ● Ask “power users” a lot of questions ● Dependencies landing on time ● Communicating with stakeholders Noob user Manager
  15. 15. Search based Lineage based Network based Where is the table/dashboard for X? What does it contain? I am changing a data model, who are the owner and most common users? I want to follow a power user in my team. Does this analysis already exist? This table’s delivery was delayed today, I want to notify everyone downstream. I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. Data Discovery answers 3 kinds of questions
  16. 16. Buy vs. Build vs. Adopt 17
  17. 17. Compared various existing solutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  18. 18. Meet Amundsen 19 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  19. 19. Landing page optimized for search
  20. 20. Search results ranked on relevance and query activity
  21. 21. How does search work? 22
  22. 22. Relevance - search for “apple” on Google 23 Low relevance High relevance
  23. 23. Popularity - search for “apple” on Google 24 Low popularity High popularity
  24. 24. Striking the balance 25 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  25. 25. Back to mocks... 26
  26. 26. Search results ranked on relevance and query activity
  27. 27. Detailed description and metadata about data resources
  28. 28. Data Preview within the tool
  29. 29. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  30. 30. Built-in user feedback
  31. 31. Amundsen’s architecture 32
  32. 32. 33 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  33. 33. 1. Frontend Service 34
  34. 34. 35 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  35. 35. Detailed description and metadata about data resources
  36. 36. 2. Metadata Service 37
  37. 37. 38 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  38. 38. 39 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine. ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  39. 39. Trade Off #1 Why choose Graph database 40
  40. 40. Why Graph database?
  41. 41. Why Graph database?
  42. 42. Trade Off #2 Why not propagate the metadata back to source 43
  43. 43. Why not propagate the metadata back to source 44
  44. 44. Why not propagate the metadata back to source 45 ? ?
  45. 45. 3. Search Service 46
  46. 46. 47 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  47. 47. 3. Search Service • Support REST API for building indexes • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 48
  48. 48. Challenge #1 How to make the search result more relevant? 49
  49. 49. How to make the search result more relevant? 50 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search ‒ Support category search (e.g. “column: is_line_ride”)
  50. 50. 4. Data Builder 51
  51. 51. 52 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  52. 52. Challenge #1 Various forms of metadata 53
  53. 53. 54 Metadata Sources @ Lyft
  54. 54. Metadata - Challenges • Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Extraction: Each data set metadata is stored and fetched differently, ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 55
  55. 55. Challenge #2 Pull model vs Push model 56
  56. 56. Pull model vs. Push model 57 Pull Model Push Model ● Periodically update the index by pulling from the system (e.g. database) via crawlers. ● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Crawler Database Data graph Scheduler Database Message queue Data graph
  57. 57. 4. Databuilder
  58. 58. Databuilder in action
  59. 59. How are we building data? Databuilder
  60. 60. How is databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  61. 61. What’s next? 64
  62. 62. Amundsen seems to be more useful than what we thought • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! • Many organizations have similar problems ‒ Collaborating with ING, WeWork and more ‒ We plan to announce open source soon 65
  63. 63. Impact - Amundsen at Lyft 66 Beta release (internal) Generally Available (GA) release Alpha release
  64. 64. Adding more kinds of data resources PeopleDashboardsData sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows
  65. 65. Serving more metadata about existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time
  66. 66. Summary 69
  67. 67. Summary • Data Discovery making data scientists unproductive • 3 types of data discovery - search, lineage and network based • Amundsen: Data graph for all data • Blog post with more details: go.lyft.com/datadiscoveryblog 70
  68. 68. Mark Grover | @mark_grover Tao Feng | @feng-tao Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 71
  69. 69. Backup 72
  • YoonYung

    May. 7, 2019
  • emres

    Apr. 25, 2019
  • wyang72

    Apr. 5, 2019
  • ravishan_ece

    Apr. 3, 2019

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Views

Total views

3,150

On Slideshare

0

From embeds

0

Number of embeds

12

Actions

Downloads

121

Shares

0

Comments

0

Likes

4

×