Data Discoverability with DataHub
Maggie Hays
Senior Product Manager -- Data Services
Data Quality Meetup -- November 19, 2020
Agenda
● Overview of Teams
● Current State of Data Discoverability
● Data Catalog Evaluation
● DataHub POC - Progress & Level of Effort
● Highlight: DataHub Functionality
SpotHero’s Data-Focused Teams
Data Engineering
3 Engineers
SpotHero IQ
2 Engineers
3 Data Scientists
Analytics
3 Business Analysts
(We’re hiring!!)
Current State of Data Discoverability
1. Data lineage is difficult to discover and navigate, regardless of role or tenure
● Impact analysis is arduous; engineers avoid breaking changes at all costs
● Debugging and troubleshooting data issues takes a long time
2. It is difficult to discover what data exists and/or what it represents
● Reliance on tribal knowledge
● Large burden on the Analytics team to answer any/all questions
3. Confidence in data accuracy is neutral, with room for improvement
● Once folks track down the data, they are relatively confident in its accuracy
Source: May 2020 internal survey of the Engineering, Product, Analytics, and Data Science teams; 47% response rate
Data Catalog Evaluation
Tools compared (columns): DataHub; Amundsen / Marquez; Apache Atlas; Alation
Criteria (rows):
● Ease of Integration
● Lineage Support
● Configurable Metadata
● Affordability
SpotHero’s Data Stack & DataHub POC
[Diagram: data stack by stage, with DataHub POC coverage marked "Complete" or "Q4 2020"]
● Sources: SH Application Data, Workflow Tools, Marketing Tools, Microservices, Clickstream Analytics
● Ingestion: Fivetran, Segment, Kafka
● Storage: Redshift, S3/Parquet
● ETL: SQL, Python, Spark
● Also shown: Looker, Airflow
DataHub POC - Level of Effort
1. Research & Tool Evaluation: 180 hrs
● Creation of a Pugh matrix to force-rank the evaluation
● Rapid side-by-side POC of DataHub and Amundsen/Marquez
2. Initial Rollout of DataHub POC: 300 hrs
● Terraform: Elasticsearch, MySQL, Neo4j, Aiven; Helm chart for API/frontend/Kafka components
● Datalake & ETL scrapers, including lineage
● Enrich with ETL ownership, links to GHE
3. Looker & Kafka Metadata Ingestion & Lineage: Est. 160 hrs
● Building a Looker/LookML scraper, with plans to contribute it back to the DataHub codebase
● Teaming up with the DataHub team to inform the design of Dashboard entities
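The slides mention a Pugh matrix but do not show it. As a rough illustration of how such a force-ranking works, here is a minimal Python sketch. The criteria names come from the evaluation slide; the candidate labels, scores, and equal weights are made-up placeholders and do not reflect SpotHero's actual ratings.

```python
# Minimal Pugh matrix sketch. Each candidate is scored against a baseline
# option per criterion: +1 (better), 0 (same), -1 (worse).
# All scores and weights here are hypothetical placeholders.

CRITERIA_WEIGHTS = {
    "Ease of Integration": 1,
    "Lineage Support": 1,
    "Configurable Metadata": 1,
    "Affordability": 1,
}

PLACEHOLDER_SCORES = {
    "Candidate A": {"Ease of Integration": 1, "Lineage Support": 1,
                    "Configurable Metadata": 0, "Affordability": 1},
    "Candidate B": {"Ease of Integration": 1, "Lineage Support": 0,
                    "Configurable Metadata": 0, "Affordability": 1},
    "Candidate C": {"Ease of Integration": -1, "Lineage Support": 1,
                    "Configurable Metadata": 1, "Affordability": -1},
}


def pugh_rank(scores, weights):
    """Return (candidate, weighted total) pairs, best first."""
    totals = []
    for candidate, per_criterion in scores.items():
        total = 0
        for criterion, weight in weights.items():
            score = per_criterion[criterion]
            if score not in (-1, 0, 1):
                raise ValueError(f"{candidate}/{criterion}: score must be -1, 0, or 1")
            total += weight * score
        totals.append((candidate, total))
    # Sort by total descending; break ties alphabetically for stable output.
    return sorted(totals, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for candidate, total in pugh_rank(PLACEHOLDER_SCORES, CRITERIA_WEIGHTS):
        print(f"{candidate}: {total:+d}")
```

Raising a weight for one criterion (for example, Lineage Support) is the usual way to express that it matters more to the team than the others.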
DataHub Functionality: Cross-Platform Search
DataHub Functionality: Dataset Metadata
● DDL & Ownership
● External Docs
DataHub Functionality: Lineage
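To make concrete why lineage helps with the impact analysis described earlier, the sketch below walks a lineage graph to list everything downstream of a changed dataset. It is plain Python, not DataHub's API; the dataset names and edges are hypothetical examples loosely modeled on the stack in the slides.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> direct downstream consumers.
LINEAGE = {
    "kafka.events": ["s3.events_parquet"],
    "s3.events_parquet": ["redshift.fact_events"],
    "redshift.fact_events": ["looker.usage_dashboard", "redshift.daily_rollup"],
    "redshift.daily_rollup": ["looker.usage_dashboard"],
}


def downstream_impact(lineage, changed):
    """Breadth-first walk returning every entity downstream of `changed`."""
    seen = {changed}  # guards against cycles and duplicate visits
    impacted = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


if __name__ == "__main__":
    for entity in downstream_impact(LINEAGE, "s3.events_parquet"):
        print(entity)
```

With lineage captured in a catalog, this kind of question becomes a lookup rather than a round of asking teammates which jobs and dashboards depend on a table.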
Yay Data Discoverability!
