Data Discoverability
Maggie Hays
Senior Product Manager -- Data Services
DataHub Town Hall -- September 25, 2020
Agenda
● Overview of Teams & Data Stack
● Current State of Data Discoverability
● Data Catalog Evaluation
● DataHub POC - Hypotheses, Progress, and Next Steps
● Brief Demo!
SpotHero’s Data-Focused Teams
● Data Engineering: 3 Engineers
● SpotHero IQ: 2 Engineers, 1-3 Data Scientists (We’re hiring!!)
● Analytics: 5 Business Analysts
SpotHero’s Data Stack
● Sources: SH Application Data, Workflow Tools, Marketing Tools, Microservices, Clickstream
● Ingestion: Fivetran, Segment, Kafka
● Storage: Redshift, S3/Parquet
● ETL: SQL, Python, Spark, Airflow
● Analytics: Looker
Current State of Data Discoverability
1. Data Lineage is difficult to discover and navigate, regardless of role or tenure
● Impact analysis is arduous; Engineers avoid breaking changes at all costs
● Prolonged debugging/troubleshooting data issues
2. Difficult to discover what data exists and/or what it represents
● Reliance on tribal knowledge
● Large burden on the Analytics team to answer any/all questions
3. Confidence in Data Accuracy is neutral, but room for improvement
● Once folks track down the data, they are relatively confident in its accuracy
May 2020 Internal Survey - Engineering, Product, Analytics, Data Science teams; 47% response rate
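To make the impact-analysis pain point concrete: once lineage is recorded as a graph, "what breaks if I change this dataset?" becomes a simple traversal. The sketch below is illustrative only. The dataset names and edges in `LINEAGE` are hypothetical (loosely named after tools in the stack), and `downstream` is a made-up helper, not a DataHub API.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> datasets that read from it.
# Names and edges are illustrative assumptions, not SpotHero's actual lineage.
LINEAGE = {
    "kafka.events": ["s3.events_parquet"],
    "s3.events_parquet": ["redshift.fact_events"],
    "redshift.fact_events": ["looker.weekly_dashboard", "redshift.agg_daily"],
    "redshift.agg_daily": ["looker.weekly_dashboard"],
}


def downstream(dataset, lineage):
    """Return every dataset reachable from `dataset`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already recorded via another path
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen


if __name__ == "__main__":
    # Everything that could be affected by a change to the Parquet events data.
    for name in downstream("s3.events_parquet", LINEAGE):
        print(name)
```

Without recorded lineage, the same question has to be answered by asking around or reading ETL code, which is the "tribal knowledge" burden described above.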
Data Catalog Evaluation
● Tools compared: DataHub, Amundsen / Marquez, Apache Atlas, Alation
● Criteria: Ease of Integration, Lineage Support, Configurable Metadata, Affordability
DataHub POC - Hypotheses
1. Increase visibility into what data exists, what it represents, and how it’s used across the company
2. Decrease the effort required by SpotHero teammates to use and interpret data
3. Increase SpotHero teammates’ confidence in the accuracy and/or relevance of data
DataHub POC - Progress & Next Steps
● Sources: SH Application Data, Workflow Tools, Marketing Tools, Microservices, Clickstream
● Ingestion: Fivetran, Segment, Kafka
● Storage: Redshift, S3/Parquet
● ETL: SQL, Python, Spark, Airflow
● Analytics: Looker
Legend: Complete / Q4 2020
Quick Demo!

Data Discoverability at SpotHero
