Successfully reported this slideshow.
Your SlideShare is downloading. ×

Strata sf - Amundsen presentation

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Loading in …3
×

Check these out next

1 of 69 Ad

Strata sf - Amundsen presentation

Download to read offline

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Advertisement
Advertisement

More Related Content

Similar to Strata sf - Amundsen presentation (20)

Advertisement
Advertisement

Strata sf - Amundsen presentation

  1. 1. March 2019 Mark Grover | @mark_grover | Product Management, Lyft Tao Feng | @feng-tao | Software Engineer, Lyft Disrupting Data Discovery
  2. 2. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Architecture • Summary 2
  3. 3. Data platform users 3 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  4. 4. 5 Core Infra high level architecture Custom apps
  5. 5. Data Discovery 6
  6. 6. • My first project is to analyze and predict Strata Attendance • Where is the data? • What does it mean? Hi! I am a n00b Data Scientist! 7
  7. 7. • Option 1: Phone a friend! • Option 2: Github search Status quo 8
  8. 8. • What does this field mean? ‒ Does attendance data include employees? ‒ Does it include revenue? • Let me dig in and understand Understand the context 9
  9. 9. Explore SELECT * FROM default.my_table WHERE ds=’2018-01-01’ LIMIT 100;
  10. 10. Exploring with SELECT * is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases 11
  11. 11. Data Scientists spend upto 1/3rd time in Data Discovery... 12 • Data discovery ‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
  12. 12. Audience for data discovery 13
  13. 13. Data Discovery - User personas 14 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  14. 14. 3 Data Scientist personas Power user ● All info in their head ● Get interrupted a lot due to questions ● Lost ● Ask “power users” a lot of questions ● Dependencies landing on time ● Communicating with stakeholders Noob user Manager
  15. 15. Search based Lineage based Network based Where is the table/dashboard for X? What does it contain? I am changing a data model, who are the owner and most common users? I want to follow a power user in my team. Does this analysis already exist? This table’s delivery was delayed today, I want to notify everyone downstream. I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. Data Discovery answers 3 kinds of questions
  16. 16. Buy vs. Build vs. Adopt 17
  17. 17. Compared various existing solutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  18. 18. Meet Amundsen 19 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  19. 19. Landing page optimized for search
  20. 20. Search results ranked on relevance and query activity
  21. 21. How does search work? 22
  22. 22. Relevance - search for “apple” on Google 23 Low relevance High relevance
  23. 23. Popularity - search for “apple” on Google 24 Low popularity High popularity
  24. 24. Striking the balance 25 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  25. 25. Back to mocks... 26
  26. 26. Search results ranked on relevance and query activity
  27. 27. Detailed description and metadata about data resources
  28. 28. Data Preview within the tool
  29. 29. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  30. 30. Built-in user feedback
  31. 31. Amundsen’s architecture 32
  32. 32. 33 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  33. 33. 1. Frontend Service 34
  34. 34. 35 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  35. 35. Detailed description and metadata about data resources
  36. 36. 2. Metadata Service 37
  37. 37. 38 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  38. 38. 39 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine. ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  39. 39. Trade Off #1 Why choose Graph database 40
  40. 40. Why Graph database?
  41. 41. Why Graph database?
  42. 42. Trade Off #2 Why not propagate the metadata back to source 43
  43. 43. Why not propagate the metadata back to source 44
  44. 44. Why not propagate the metadata back to source 45 ? ?
  45. 45. 3. Search Service 46
  46. 46. 47 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  47. 47. 3. Search Service • Support REST API for building indexes • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 48
  48. 48. Challenge #1 How to make the search result more relevant? 49
  49. 49. How to make the search result more relevant? 50 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search ‒ Support category search (e.g. “column: is_line_ride”)
  50. 50. 4. Data Builder 51
  51. 51. 52 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  52. 52. Challenge #1 Various forms of metadata 53
  53. 53. 54 Metadata Sources @ Lyft
  54. 54. Metadata - Challenges • Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Extraction: Each data set metadata is stored and fetched differently, ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 55
  55. 55. Challenge #2 Pull model vs Push model 56
  56. 56. Pull model vs. Push model 57 Pull Model Push Model ● Periodically update the index by pulling from the system (e.g. database) via crawlers. ● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Crawler Database Data graph Scheduler Database Message queue Data graph
  57. 57. 4. Databuilder
  58. 58. Databuilder in action
  59. 59. How are we building data? Databuilder
  60. 60. How is databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  61. 61. What’s next? 64
  62. 62. Amundsen seems to be more useful than what we thought • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! • Many organizations have similar problems ‒ Collaborating with ING, WeWork and more ‒ We plan to announce open source soon 65
  63. 63. Impact - Amundsen at Lyft 66 Beta release (internal) Generally Available (GA) release Alpha release
  64. 64. Adding more kinds of data resources PeopleDashboardsData sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows
  65. 65. Serving more metadata about existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time
  66. 66. Summary 69
  67. 67. Summary • Data Discovery making data scientists unproductive • 3 types of data discovery - search, lineage and network based • Amundsen: Data graph for all data • Blog post with more details: go.lyft.com/datadiscoveryblog 70
  68. 68. Mark Grover | @mark_grover Tao Feng | @feng-tao Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 71
  69. 69. Backup 72

Editor's Notes

  • Today’s agenda:
    Why empowering with data is important…
    What are we doing in the data team at Lyft (context)...
    What challenges we are facing and have seen other companies face…
    How are we solving the problem...
    At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  • Who is our audience: everyone who works at Lyft…
    Power users: Data Scientists, Research Scientists, Product Managers…
    Next: Engineers, GMs, Ops, etc.
  • What do we do: Wide scope of work… de-mystify and democratize data at Lyft…
    Not mentioned here: Operational / transactional databases...
  • What does the architecture for our core infra look like?
    Mobile application primarily…
    Raw events can come either from the client… or from the back end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive from long running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto...
  • Mark
  • Mark
  • Mark
  • Mark
  • Data Discovery: How much of a challenge is it? Significant challenge…
    Data Scientists spend up to 1/3rd of their time in Data Discovery while doing exploratory analysis…
    We surveyed users at Lyft and a few other companies: You’d want to spend most of the time on analysis…
    But we have ~10PBs of data, thousands of tables… so it is hard to find what is there and what is the source of truth…

    We can significantly increase productivity and impact if we can reduce this time...
  • Mark
  • Mark
  • Popularity is not click through rate but through query access patterns.
  • Amundsen architecture at Lyft:
    3 micro-services(FE, metadata, search) and one generic data ingestion framework
    Will discuss each of the component in details

    High level walkthrough….how CCPA compilance works
  • What options we have (graph, rdbms,
  • ML features (one sentence on what is feature service)

    Add a logo of neo4j

    ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • Why graph db is the best option? Why not rdbms, nosql etc
    Why choose neo4j vs other graph db? (most popular graph db)

    Amundsen needs to provide metadata which includes table, column, column statistics, usage information etc. Along with that, Amundsen also needs to provide lineage information where it need to be able to provide producer, consumer relationships within the life cycle of data.

    Lineage could be simplified as a graph of entities and edges. E.g in the graph blahblah

    There are other options: NoSQL(no join support), RDBMS(performance of join is not good)
  • Above graph data model shows our use case to show table and column metadata with usage to the column level. Querying a table detail from this Neo4j graph would be like asking Neo4j to search for a table node as a starting node and traverse it. In other words, there’s only one search operation needed to find anchor node which makes Neo4j performant -- no join operation at all.

    We model the graph as bi-direction relationships.
  • ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • ? challenge
    How to improve search relevance?
  • Think of search algorithm…
    Event_ride_table -> event ride table
  • Talk about how we measure it: instrument empty results from user, % of click through rate
    What we did? Boost the rank of exact table name match
  • Walk through pros and cons
  • Pull approach is basically extract data from the source periodically. Amundsen databuilder will be responsible for extraction, transformation, and load. This naturally gives us three abstracted construct Extractor, Transformer, Loader and, optionally Publisher. The design principle follows Apache Gobblin

    Extractor extracts record from source one record at a time. For example in Hive, we would need a column metadata extractor for a table where each record represents a column of a table.

    Transformer transforms a record. Any use case that we may have to transform (e.g: remove special character) or decorate the record (e.g: make a service call to enrich data). This is a place for that.

    Loader writes data into either sink (destination) or into staging area.

    Publisher assumes that loader loaded into staging area and publishes it to destination. Atomicity is a desired behavior but it’s up to the limitation of sink itself’s support on Atomicity.
  • We currently index 2 a day, dependency

    Why we index twice a day?
  • Today’s agenda:
    Why empowering with data is important…
    What are we doing in the data team at Lyft (context)...
    What challenges we are facing and have seen other companies face…
    How are we solving the problem...
    At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  • A slide on Amundsen @ Lyft?

    ? how long it has been in prod
    How many datasets
    Users
    WAU
    Usage
  • Kafka topic
    Schema registry
    ML workflow and Airflow DAGs
  • Three long term type of metadata(A.B.C)
    We want to use the index A,B,C for the datasources we mentioned in last slide
  • Mark

×