Successfully reported this slideshow.
Your SlideShare is downloading. ×

Data council sf amundsen presentation

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Upcoming SlideShare
Disrupting Data Discovery
Disrupting Data Discovery
Loading in …3
×

Check these out next

1 of 68 Ad
Advertisement

More Related Content

Slideshows for you (20)

Similar to Data council sf amundsen presentation (20)

Advertisement
Advertisement

Data council sf amundsen presentation

  1. 1. April 17th 2019 Jin Hyuk Chang | @jinhyukchang | Engineer, Lyft Tao Feng | @feng-tao | Engineer, Lyft Amundsen: A Data Discovery Platform from Lyft
  2. 2. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Demo • Architecture • Summary 2
  3. 3. Data platform users 3 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  4. 4. 4 Core Infra high level architecture Custom apps
  5. 5. Data Discovery 5
  6. 6. • My first project is to analyze and predict Data council Attendance • Where is the data? • What does it mean? Hi! I am a n00b Data Scientist! 6
  7. 7. • Option 1: Phone a friend! • Option 2: Github search Status quo 7
  8. 8. • What does this field mean? ‒ Does attendance data include employees? ‒ Does it include revenue? • Let me dig in and understand Understand the context 8
  9. 9. Explore SELECT * FROM default.my_table WHERE ds=’2018-01-01’ LIMIT 100;
  10. 10. Exploring with SELECT * is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases 10
  11. 11. Data Scientists spend upto 1/3rd time in Data Discovery... 11 • Data discovery ‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
  12. 12. Audience for data discovery 12
  13. 13. Data Discovery - User personas 13 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  14. 14. 3 Data Scientist personas Power user ● All info in their head ● Get interrupted a lot due to questions ● Lost ● Ask “power users” a lot of questions ● Dependencies landing on time ● Communicating with stakeholders Noob user Manager
  15. 15. Search based Lineage based Network based Where is the table/dashboard for X? What does it contain? I am changing a data model, who are the owner and most common users? I want to follow a power user in my team. Does this analysis already exist? This table’s delivery was delayed today, I want to notify everyone downstream. I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. Data Discovery answers 3 kinds of questions
  16. 16. Meet Amundsen 16 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  17. 17. Landing page optimized for search
  18. 18. Search results ranked on relevance and query activity
  19. 19. How does search work? 19
  20. 20. Relevance - search for “apple” on Google 20 Low relevance High relevance
  21. 21. Popularity - search for “apple” on Google 21 Low popularity High popularity
  22. 22. Striking the balance 22 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  23. 23. Back to mocks... 23
  24. 24. Search results ranked on relevance and query activity
  25. 25. Detailed description and metadata about data resources
  26. 26. Data Preview within the tool
  27. 27. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  28. 28. Built-in user feedback
  29. 29. Demo 29
  30. 30. Amundsen’s architecture 30
  31. 31. 31 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  32. 32. 1. Frontend Service 32
  33. 33. 33 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  34. 34. Amundsen table detail page
  35. 35. 2. Metadata Service 35
  36. 36. 36 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  37. 37. 37 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  38. 38. Trade Off #1 Why choose Graph database 38
  39. 39. Why Graph database?
  40. 40. Why Graph database?
  41. 41. Trade Off #2 Why not propagate the metadata back to source 41
  42. 42. Why not propagate the metadata back to source 42
  43. 43. Why not propagate the metadata back to source 43 ? ?
  44. 44. Why not propagate the metadata back to source 44
  45. 45. 3. Search Service 45
  46. 46. 46 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  47. 47. 3. Search Service • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 47
  48. 48. Challenge #1 How to make the search result more relevant? 48
  49. 49. How to make the search result more relevant? 49 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search (e.g. event_*) ‒ Support category search (e.g. column: is_line_ride)
  50. 50. 4. Data Builder 50
  51. 51. 51 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  52. 52. Challenge #1 Various forms of metadata 52
  53. 53. 53 Metadata Sources @ Lyft
  54. 54. Metadata - Challenges • No Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: Each data set metadata is stored and fetched differently ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 54
  55. 55. Challenge #2 Pull model vs Push model 55
  56. 56. Pull model vs. Push model 56 Pull Model Push Model ● Periodically update the index by pulling from the system (e.g. database) via crawlers. ● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Crawler Database Data graph Scheduler Database Message queue Data graph
  57. 57. 4. Databuilder
  58. 58. Databuilder in action
  59. 59. How are we building data? Databuilder
  60. 60. How is databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  61. 61. What’s next? 63
  62. 62. Amundsen seems to be more useful than what we thought • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! • Many organizations have similar problems ‒ Collaborating with ING, WeWork and more ‒ We plan to announce open source soon 64
  63. 63. Impact - Amundsen at Lyft 65 Beta release (internal) Generally Available (GA) release Alpha release
  64. 64. Adding more kinds of data resources PeopleDashboardsData sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows
  65. 65. Summary 67
  66. 66. Summary • Data Discovery adds 30+% more productivity to Data Scientists • Metadata is key to the next wave of big data applications • Amundsen - Lyft’s metadata and data discovery platform • Blog post with more details: go.lyft.com/datadiscoveryblog 68
  67. 67. Jin Hyuk Chang | @jinhyukchang Tao Feng | @feng-tao Slides at go.lyft.com/amundsen_datacouncil_2019 Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 69
  68. 68. Backup 72

Editor's Notes

  • Today’s agenda:
    Why empowering with data is important…
    What are we doing in the data team at Lyft (context)...
    What challenges we are facing and have seen other companies face…
    How are we solving the problem...
    At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  • Who is our audience: everyone who works at Lyft…
    Power users: Data Scientists, Research Scientists, Product Managers…
    Next: Engineers, GMs, Ops, etc.
  • What does the architecture for our core infra look like?
    Mobile application primarily…
    Raw events can come either from the client… or from the back end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive from long running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto...
  • Mark
  • Mark
  • Mark
  • Mark
  • Data Discovery: How much of a challenge is it? Significant challenge…
    Data Scientists spend up to 1/3rd of their time in Data Discovery while doing exploratory analysis…
    We surveyed users at Lyft and a few other companies: You’d want to spend most of the time on analysis…
    But we have ~10PBs of data, thousands of tables… so it is hard to find what is there and what is the source of truth…

    We can significantly increase productivity and impact if we can reduce this time...
  • Mark
  • Mark
  • Popularity is not click through rate but through query access patterns.
  • Amundsen architecture at Lyft:
    3 micro-services(FE, metadata, search) and one generic data ingestion framework
    Will discuss each of the component in details

    High level walkthrough….how CCPA compilance works
  • ML features (one sentence on what is feature service)

    Add a logo of neo4j

    ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • Why graph db is the best option? Why not rdbms, nosql etc
    Why choose neo4j vs other graph db? (most popular graph db)

    Amundsen needs to provide metadata which includes table, column, column statistics, usage information etc. Along with that, Amundsen also needs to provide lineage information where it need to be able to provide producer, consumer relationships within the life cycle of data.

    Lineage could be simplified as a graph of entities and edges. E.g in the graph blahblah

    There are other options: NoSQL(no join support), RDBMS(performance of join is not good)
  • Above graph data model shows our use case to show table and column metadata with usage to the column level. Querying a table detail from this Neo4j graph would be like asking Neo4j to search for a table node as a starting node and traverse it. In other words, there’s only one search operation needed to find anchor node which makes Neo4j performant -- no join operation at all.

    We model the graph as bi-direction relationships.
  • ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  • ? challenge
    How to improve search relevance?
  • Think of search algorithm…
    Event_ride_table -> event ride table
  • Talk about how we measure it: instrument empty results from user, % of click through rate
    What we did? Boost the rank of exact table name match
  • Walk through pros and cons
  • Pull approach is basically extract data from the source periodically. Amundsen databuilder will be responsible for extraction, transformation, and load. This naturally gives us three abstracted construct Extractor, Transformer, Loader and, optionally Publisher. The design principle follows Apache Gobblin

    Extractor extracts record from source one record at a time. For example in Hive, we would need a column metadata extractor for a table where each record represents a column of a table.

    Transformer transforms a record. Any use case that we may have to transform (e.g: remove special character) or decorate the record (e.g: make a service call to enrich data). This is a place for that.

    Loader writes data into either sink (destination) or into staging area.

    Publisher assumes that loader loaded into staging area and publishes it to destination. Atomicity is a desired behavior but it’s up to the limitation of sink itself’s support on Atomicity.
  • Today’s agenda:
    Why empowering with data is important…
    What are we doing in the data team at Lyft (context)...
    What challenges we are facing and have seen other companies face…
    How are we solving the problem...
    At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  • A slide on Amundsen @ Lyft?

    ? how long it has been in prod
    How many datasets
    Users
    WAU
    Usage
  • Kafka topic
    Schema registry
    ML workflow and Airflow DAGs
  • Mark
  • ***feel free to edit based on season;

    No need to divide by location, can also divide by department

×