Data Discoverability with DataHub

Maggie Hays
Senior Product Manager -- Data Services
Data Quality Meetup -- November 19, 2020

Agenda
● Overview of Teams
● Current State of Data Discoverability
● Data Catalog Evaluation
● DataHub POC - Progress & Level of Effort
● Highlight: DataHub Functionality

SpotHero’s Data-Focused Teams
Data Engineering
3 Engineers
SpotHero IQ
2 Engineers
3 Data Scientists
Analytics
3 Business Analysts
(We’re hiring!!)

Current State of Data Discoverability
1. Data lineage is difficult to discover and navigate, regardless of role or tenure
● Impact analysis is arduous; engineers avoid breaking changes at all costs
● Debugging and troubleshooting data issues is prolonged
2. It is difficult to discover what data exists and/or what it represents
● Reliance on tribal knowledge
● Large burden on the Analytics team to answer any/all questions
3. Confidence in data accuracy is neutral, with room for improvement
● Once folks track down the data, they are relatively confident in its accuracy
Source: May 2020 internal survey of the Engineering, Product, Analytics, and Data Science teams; 47% response rate

Data Catalog Evaluation
Tools compared: DataHub, Amundsen / Marquez, Apache Atlas, Alation
Criteria: Ease of Integration, Lineage Support, Configurable Metadata, Affordability

SpotHero’s Data Stack & DataHub POC
Sources: SH Application Data, Workflow Tools, Marketing Tools, Microservices, Clickstream
Ingestion: Fivetran, Segment, Kafka
Storage: Redshift, S3/Parquet
ETL: SQL, Python, Spark
Analytics: Looker
Also shown: Airflow
Legend: Complete; Q4 2020

DataHub POC - Level of Effort
1. Research & Tool Evaluation: 180 hrs
● Creation of a Pugh matrix to force-rank the evaluation
● Rapid side-by-side POC of DataHub and Amundsen/Marquez
2. Initial Rollout of DataHub POC: 300 hrs
● Terraform for Elasticsearch, MySQL, Neo4j, Aiven; Helm chart for API/frontend/Kafka components
● Datalake & ETL scrapers, including lineage
● Enrich with ETL ownership, links to GHE
3. Looker & Kafka Metadata Ingestion & Lineage: Est. 160 hrs
● Building Looker/LookML scraper - planning to contribute back to the DataHub codebase
● Teaming up with DataHub to inform design of Dashboard entities
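
The Pugh matrix mentioned under Research & Tool Evaluation can be sketched roughly as follows. This is a minimal illustration, not the matrix SpotHero built: the criteria come from the evaluation slide, but the weights, the tool names (tool_a, tool_b, tool_c), and every score are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

# Criteria are taken from the evaluation slide; the weights are hypothetical.
WEIGHTS: Dict[str, int] = {
    "Ease of Integration": 3,
    "Lineage Support": 3,
    "Configurable Metadata": 2,
    "Affordability": 2,
}

# Placeholder scores relative to a baseline option:
# -1 = worse than baseline, 0 = same, +1 = better.
# These are NOT the ratings from the actual evaluation.
SCORES: Dict[str, Dict[str, int]] = {
    "tool_a": {"Ease of Integration": 1, "Lineage Support": 1,
               "Configurable Metadata": 0, "Affordability": 1},
    "tool_b": {"Ease of Integration": 1, "Lineage Support": -1,
               "Configurable Metadata": 0, "Affordability": 1},
    "tool_c": {"Ease of Integration": -1, "Lineage Support": 1,
               "Configurable Metadata": 1, "Affordability": -1},
}


def pugh_rank(scores: Dict[str, Dict[str, int]],
              weights: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (tool, weighted total) pairs, highest total first."""
    totals = {
        tool: sum(weights[criterion] * tool_scores[criterion]
                  for criterion in weights)
        for tool, tool_scores in scores.items()
    }
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


ranking = pugh_rank(SCORES, WEIGHTS)
for tool, total in ranking:
    print(f"{tool}: {total}")
```

Scoring each option against a shared baseline, rather than on an absolute scale, is what forces a ranking: every criterion has to be judged as better, worse, or the same.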

DataHub Functionality: Cross-Platform Search

DataHub Functionality: Dataset Metadata
● DDL & Ownership
● External Docs

DataHub Functionality: Lineage
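
Once lineage is captured as a graph, the impact analysis described earlier as arduous reduces to a downstream traversal. The sketch below assumes lineage is available as a simple mapping from each dataset to its direct consumers. It does not use DataHub's API, and the dataset names are hypothetical.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage edges: upstream dataset -> direct downstream consumers.
LINEAGE: Dict[str, List[str]] = {
    "kafka.events": ["s3.events_parquet"],
    "s3.events_parquet": ["redshift.fact_events"],
    "redshift.fact_events": ["looker.dashboard_a", "looker.dashboard_b"],
}


def downstream(dataset: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk returning every dataset affected by a change."""
    seen = {dataset}          # guards against cycles and repeat visits
    impacted: List[str] = []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


impacted = downstream("s3.events_parquet", LINEAGE)
print(impacted)
```

Running the same walk over reversed edges would answer the debugging question instead: which upstream sources feed a dataset that looks wrong.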
