Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Meetup SF - Amundsen

Amundsen, the open source data discovery tool. Slides from the 5/15 presentation at Lyft

  • Be the first to comment

  • Be the first to like this

Meetup SF - Amundsen

  1. 1. Making Big Data Easy Data Discovery Lineage Data Like Code
  2. 2. May 15th 2019 Phil Mizrahi | @philippemizrahi | Associate Product Manager, Lyft Jin Hyuk Chang | @jinhyukchang | Software Engineer, Lyft go.lyft.com/datadiscoveryslides Disrupting Data Discovery
  3. 3. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Amundsen’s Architecture • What’s next? 3
  4. 4. Data at Lyft 4
  5. 5. 5 Core Infra high level architecture Custom apps
  6. 6. Challenges with Data Discovery 6
  7. 7. Data is used to make informed decisions 7 Analysts Data Scientists General Managers Engineers ExperimentersProduct Managers Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or take a decision Make data the heart of every decision
  8. 8. • My first project is to predict the Data Meetup Attendance • Goal: help the office team make a decision on # of chairs to provide Hi! I am a n00b Data Scientist! 8
  9. 9. • Ask a friend/boss/coworker • Ask in a wider slack channel • Search in the Github repos Step 1: Search & find data 9 We end up finding a table: hosted_events that seems to be the right one
  10. 10. • What does this field mean? ‒ Does hosted_events.attendance data include employees? ‒ What does hosted_events.costs include? • Dig deeper: explore using SQL queries Step 2: Understand the data 10 SELECT * FROM default.hosted_events WHERE ds=’2019-05-15’ LIMIT 100;
  11. 11. Data Scientists spend upto 1/3rd time in Data Discovery 11 Data Discovery • Data discovery is a problem because of the lack of understanding of what data exists, where, who owns it, how to use it,... • It is not what our data scientist should focus on: they should focus on Analysis work Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or take a decision
  12. 12. Audience for data discovery 12
  13. 13. User Personas - (1/2) 13 Analysts Data Scientists General Managers ExperimentersEngineersProduct Managers • Frequent use of data • Deep to very deep analysis • Exposure to new datasets • Creating insights & developing models
  14. 14. User Personas - (2/2) 14 Power user - Has been at Lyft for a long time - Knows the data environment well: where to find data, what it means, how to use it Pain points: - Needs to spend a fair amount of his time sharing his knowledge with the noob user - Could become noob user if he switches teams Noob user - Recently joined Lyft - Needs to ramp up on a lot of things, wants to start having impact soon Pain points: - Doesn’t know where to start. Spends his time asking questions and cmd+F on github - Makes mistakes by mis-using some datasets
  15. 15. 3 complementary ways to do Data Discovery 15 Search based I am looking for a table with data on “cancel rates” - Where is the table? - What does it contain? - Has the analysis I want to perform already been done? Lineage based If this event is down, what datasets are going to be impacted? - Upstream/downstream lineage - Incidents, SLA misses, Data quality Network based I want to check what tables my manager uses - Ownership information - Bookmarking - Usage through query logs
  16. 16. Data discovery at Lyft 16 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  17. 17. Landing page optimized for search
  18. 18. Search results ranked on relevance and query activity
  19. 19. How does search work? 19
  20. 20. Relevance - search for “apple” on Google 20 Low relevance High relevance
  21. 21. Popularity - search for “apple” on Google 21 Low popularity High popularity
  22. 22. Striking the balance 22 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  23. 23. Demo 23
  24. 24. Amundsen’s architecture 24
  25. 25. 25 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  26. 26. 1. Metadata Service 26
  27. 27. 27 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  28. 28. Amundsen table detail page
  29. 29. Challenge #1 Why choose Graph database? 29
  30. 30. 30 Why Graph database? (1/2)
  31. 31. 31 Why Graph database? (2/2)
  32. 32. 32 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  33. 33. Challenge #2 Why not propagate the metadata back to source? 33
  34. 34. Why not propagate the metadata back to source 34
  35. 35. 2. Databuilder 35
  36. 36. 36 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  37. 37. Challenge #3 Various forms of metadata 37
  38. 38. 38 Metadata Sources @ Lyft
  39. 39. Metadata - Challenges • No Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: Each data set metadata is stored and fetched differently ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 39
  40. 40. Databuilder 40
  41. 41. Databuilder in action 41
  42. 42. How is the databuilder orchestrated? 42 Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  43. 43. 3. Search service 43
  44. 44. 44 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  45. 45. 3. Search Service • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 45
  46. 46. Challenge #4 How to make the search result more relevant? 46
  47. 47. How to make the search result more relevant? 47 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search (e.g. event_*) ‒ Support category search (e.g. column: is_line_ride)
  48. 48. 4. Frontend service 48
  49. 49. 49 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  50. 50. Search results ranked on relevance and query activity
  51. 51. Amundsen table detail page
  52. 52. What’s next? 52
  53. 53. Amundsen’s impact • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! ‒ 90% penetration among Data Scientists ‒ +30% productivity for the Data science org. 53
  54. 54. Amundsen is Open Source! • Many organizations have similar problems ‒ github.com/lyft/amundsenfrontendlibrary • Collaboration with outside companies has started ‒ Used in Production by ING ‒ Working with WeWork, Square, Bang & Olufsen & more 54
  55. 55. Roadmap PeopleDashboards Data sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows More Metadata Deeper integration with other tools (e.g. Superset) Privacy Governance
  56. 56. Summary 56
  57. 57. Summary • A metadata-powered discovery tool - like Amundsen - is key to the next wave of big data applications • Blog post with more details: go.lyft.com/datadiscoveryblog • Join the community: github.com/lyft/amundsenfrontendlibrary 57
  58. 58. Phil Mizrahi | @philippemizrahi Jin Hyuk Chang | @jinhyukchang Repo at github.com/lyft/amundsenfrontendlibrary Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 58
  59. 59. Backup 59
  60. 60. 60 Lyft Data Team Lyft Data Team Core Data Infra Streaming Infra Visualization Experimentation BI and Logging ML Infra
  61. 61. All about metadata 61
  62. 62. Serving more metadata about existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time Metadata: a set of data that describes and gives information about other data Ground, Joe Hellerstein, Vikram Sreekanti et al. RISE Lab, UC Berkeley
  63. 63. Buy vs. Build vs. Adopt 63
  64. 64. Compared various existing solutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  65. 65. Back to mocks... 65
  66. 66. Search results ranked on relevance and query activity
  67. 67. Detailed description and metadata about data resources
  68. 68. Data Preview within the tool
  69. 69. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  70. 70. Built-in user feedback
  71. 71. Challenge #2 Pull model vs Push model 71
  72. 72. Pull model and Push model 72 Pull Model Push Model ● Amundsen goes to the source and periodically update the index by pulling from the source (e.g. database). ● Very efficient on popular platform. Hard to be extended. ● The external source (e.g. database) pushes metadata to a message bus which Amundsen subscribes to. ● Flexible and extensible. Lots of work from external source. Crawler Database Data graph Scheduler Database Message queue Data graph
  73. 73. We’re Hiring! Apply at www.lyft.com/careers or email data-recruiting@lyft.com Data Engineering Engineering Manager San Francisco Software Engineer San Francisco, Seattle, & New York City Data Infrastructure Engineering Manager San Francisco Software Engineer San Francisco & Seattle Experimentation Software Engineer San Francisco Streaming Software Engineer San Francisco Observability Software Engineer San Francisco
  74. 74. Strata SF 2019 Rate this session session page on conference website O’Reilly Events App

    Be the first to comment

    Login to see the comments

Amundsen, the open source data discovery tool. Slides from the 5/15 presentation at Lyft

Views

Total views

668

On Slideshare

0

From embeds

0

Number of embeds

0

Actions

Downloads

35

Shares

0

Comments

0

Likes

0

×