Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

7

Share

Download to read offline

Disrupting Data Discovery

Download to read offline

Presentation on Disrupting Data Discovery at Strata in May 2019.

Related Books

Free with a 30 day trial from Scribd

See all

Disrupting Data Discovery

  1. 1. May 1st, 2019 Mark Grover | @mark_grover | Product Management, Lyft go.lyft.com/datadiscoveryslides Disrupting Data Discovery
  2. 2. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Architecture • Summary 2
  3. 3. Data platform users 3 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  4. 4. 4 Lyft Data Team Lyft Data Team Core Data Infra Streaming Infra Visualization Experimentation BI and Logging ML Infra
  5. 5. 5 Core Infra high level architecture Custom apps
  6. 6. Data Discovery 6
  7. 7. • My first project is to analyze and predict Strata Attendance • Where is the data? • What does it mean? Hi! I am a n00b Data Scientist! 7
  8. 8. • Option 1: Phone a friend! • Option 2: Github search Status quo 8
  9. 9. • What does this field mean? ‒ Does attendance data include employees? ‒ Does it include revenue? • Let me dig in and understand Understand the context 9
  10. 10. Explore SELECT * FROM default.my_table WHERE ds=’2018-01-01’ LIMIT 100;
  11. 11. Exploring with SELECT * is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases 11
  12. 12. Data Scientists spend upto 1/3rd time in Data Discovery... 12 • Data discovery ‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
  13. 13. All about metadata 13
  14. 14. Serving more metadata about existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time
  15. 15. Audience for data discovery 15
  16. 16. Data Discovery - User personas 16 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  17. 17. 3 Data Scientist personas Power user ● All info in their head ● Get interrupted a lot due to questions ● Lost ● Ask “power users” a lot of questions ● Dependencies landing on time ● Communicating with stakeholders Noob user Manager
  18. 18. Search based Lineage based Network based Where is the table/dashboard for X? What does it contain? I am changing a data model, who are the owner and most common users? I want to follow a power user in my team. Does this analysis already exist? This table’s delivery was delayed today, I want to notify everyone downstream. I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. Data Discovery answers 3 kinds of questions
  19. 19. Buy vs. Build vs. Adopt 19
  20. 20. Compared various existing solutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  21. 21. Meet Amundsen 21 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  22. 22. Landing page optimized for search
  23. 23. Search results ranked on relevance and query activity
  24. 24. How does search work? 24
  25. 25. Relevance - search for “apple” on Google 25 Low relevance High relevance
  26. 26. Popularity - search for “apple” on Google 26 Low popularity High popularity
  27. 27. Striking the balance 27 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  28. 28. Back to mocks... 28
  29. 29. Search results ranked on relevance and query activity
  30. 30. Detailed description and metadata about data resources
  31. 31. Data Preview within the tool
  32. 32. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  33. 33. Built-in user feedback
  34. 34. Amundsen’s architecture 34
  35. 35. 35 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  36. 36. 1. Frontend Service 36
  37. 37. 37 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  38. 38. Amundsen table detail page
  39. 39. 2. Metadata Service 39
  40. 40. 40 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  41. 41. 41 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  42. 42. Trade Off #1 Why choose Graph database 42
  43. 43. Why Graph database?
  44. 44. Why Graph database?
  45. 45. Trade Off #2 Why not propagate the metadata back to source 45
  46. 46. Why not propagate the metadata back to source 46
  47. 47. Why not propagate the metadata back to source 47 ? ?
  48. 48. Why not propagate the metadata back to source 48
  49. 49. 3. Search Service 49
  50. 50. 50 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  51. 51. 3. Search Service • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 51
  52. 52. Challenge #1 How to make the search result more relevant? 52
  53. 53. How to make the search result more relevant? 53 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search (e.g. event_*) ‒ Support category search (e.g. column: is_line_ride)
  54. 54. 4. Data Builder 54
  55. 55. 55 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  56. 56. Challenge #1 Various forms of metadata 56
  57. 57. 57 Metadata Sources @ Lyft
  58. 58. Metadata - Challenges • No Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: Each data set metadata is stored and fetched differently ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 58
  59. 59. Challenge #2 Pull model vs Push model 59
  60. 60. Pull model vs. Push model 60 Pull Model Push Model ● Periodically update the index by pulling from the system (e.g. database) via crawlers. ● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Crawler Database Data graph Scheduler Database Message queue Data graph
  61. 61. Pull model vs. push model 61 Pull Model Push Model ● Onus of integration lays on data graph ● No interface to prescribe, hard to maintain crawlers ● Onus of integration lies on database ● Message format serves as the interface ● Allows for near-real time indexing Crawler Database Data graph Scheduler Database Message queue Data graph
  62. 62. Pull model vs. push model 62 Pull Model Push Model ● Onus of integration lays on data graph ● No interface to prescribe, hard to maintain crawlers ● Onus of integration lies on database ● Message format serves as the interface ● Allows for near-real time indexing Crawler Database Data graph Database Message queue Data graph Preferred if ● Near-real time indexing is important ● Clean interface doesn’t exist ● Other tools like Wherehows are moving towards Push Model Preferred if ● Waiting for indexing is ok ● Working with “strapped” teams ● There’s already an interface
  63. 63. 4. Databuilder
  64. 64. Databuilder in action
  65. 65. How are we building data? Databuilder
  66. 66. How is databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  67. 67. What’s next? 67
  68. 68. Amundsen seems to be more useful than what we thought • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! • Many organizations have similar problems ‒ Collaborating with ING, WeWork and more ‒ We plan to announce open source soon 68
  69. 69. Impact - Amundsen at Lyft 69 Beta release (internal) Generally Available (GA) release Alpha release
  70. 70. Adding more kinds of data resources PeopleDashboardsData sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows
  71. 71. Summary 71
  72. 72. Summary • Data Discovery adds 30+% more productivity to Data Scientists • Metadata is key to the next wave of big data applications • Amundsen - Lyft’s metadata and data discovery platform • Blog post with more details: go.lyft.com/datadiscoveryblog 72
  73. 73. Mark Grover | @mark_grover Tao Feng | @feng-tao Slides at go.lyft.com/datadiscoveryslides Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 73
  74. 74. We’re Hiring! Apply at www.lyft.com/careers or email data-recruiting@lyft.com Data Engineering Engineering Manager San Francisco Software Engineer San Francisco, Seattle, & New York City Data Infrastructure Engineering Manager San Francisco Software Engineer San Francisco & Seattle Experimentation Software Engineer San Francisco Streaming Software Engineer San Francisco Observability Software Engineer San Francisco
  75. 75. Strata SF 2019 Rate this session session page on conference website O’Reilly Events App
  76. 76. Backup 76
  • MarkCooper113

    Oct. 16, 2019
  • vnnw

    Jun. 11, 2019
  • PengYu14

    May. 15, 2019
  • TeruyaIwakura

    May. 14, 2019
  • ArneRossmann

    May. 2, 2019
  • MartinGrayson

    May. 1, 2019
  • FrankGonzales27

    May. 1, 2019

Presentation on Disrupting Data Discovery at Strata in May 2019.

Views

Total views

1,830

On Slideshare

0

From embeds

0

Number of embeds

98

Actions

Downloads

76

Shares

0

Comments

0

Likes

7

×