09/27/2023
Sherin Thomas
10 Tips for enabling discovery
and governance
Confidential | 2
Sherin Thomas
Software Engineer at Chime
Previously Netflix, Lyft, Twitter and Google
Been doing all kinds of data stuff for the last decade
I’m a “Sunday” artist
@doodlesmt
@sherinthomasm
@thomassherin
hps://maturck.com/a-chart-of-the-big-data-ecosystem/| 3
https://xkcd.com/2582/
“When in doubt, you can’t be wrong”
- From “This isn’t what it looks like” by Pseudonymous Bosch
Conway’s Law In Action
● Product engineering is usually separate from analytics
● Analysts are consumers of data; product engineering is the producer
● Producers don’t know what consumers need
● Consumers don’t know what producers have
● Meanwhile, the data engineer is stuck in the middle
● We need to make these groups talk to each other
I. Bring everybody to the same “water cooler”
● One data catalog for all types of users - analysts, PMs, engineers
● Choose a data catalog that can support a range of data artifacts
○ Tables
○ Pipelines
○ People
○ Roles
○ Data Products
○ Domains
○ and many more….
● Lineage so you get a full picture of how data travels through your organization
● Different ways to group artifacts - tags, domains, glossary etc
● Pull context from documentation, comments etc
● Search and retrieve by relevance
● Producers can put context here
● Consumers can find what they need
DataHub Project is an open source metadata platform that enables Data Discovery, Data
Observability, and Federated Governance on top of a high-fidelity Metadata Graph.
Acryl Data is the company advancing the DataHub Project
[Diagram: Metadata Graph (People, Data, Code); Governance, Discovery, Observability]
The water cooler
II. Choose a common data language (spoiler alert - it’s schema)
● Schematize on write
● If that is not an option - schematize during ingestion
Tracking Plan (schema check in the server)
Protocol Buffer Schema
● Compare the “actual” data format against the “expected” data format to flag schema drift
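A minimal sketch of flagging schema drift by comparing an observed record against an expected schema. The field names, the `EXPECTED_SCHEMA` mapping and the `detect_schema_drift` helper are all hypothetical and only illustrate the idea; a real setup would more likely compare against a Protocol Buffer or similar schema definition.

```python
from typing import Any

# Hypothetical "expected" schema: field name -> expected Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "user_id": str,
    "amount_cents": int,
    "currency": str,
}


def detect_schema_drift(
    record: dict[str, Any], expected: dict[str, type]
) -> dict[str, list[str]]:
    """Compare one observed record with the expected schema."""
    # Fields the schema promises but the record does not carry.
    missing = sorted(name for name in expected if name not in record)
    # Fields the record carries that the schema does not know about.
    unexpected = sorted(name for name in record if name not in expected)
    # Fields present in both whose value has the wrong type.
    type_mismatch = sorted(
        name
        for name, expected_type in expected.items()
        if name in record and not isinstance(record[name], expected_type)
    )
    return {
        "missing": missing,
        "unexpected": unexpected,
        "type_mismatch": type_mismatch,
    }


# An "actual" record that has drifted from the expected format.
drift = detect_schema_drift(
    {"user_id": "u-1", "amount_cents": "1200", "region": "us"},
    EXPECTED_SCHEMA,
)
print(drift)
```

Any non-empty list in the result is a drift signal that could be surfaced to the producer.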
III. Use schema for more than schema
Documentation
Message and field level annotations
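One way to picture "schema for more than schema" is a schema whose message-level and field-level annotations double as documentation for the catalog. The sketch below uses Python dataclasses as a stand-in for a Protocol Buffer schema; the `PaymentEvent` type, its fields, and the `doc` and `classification` annotation keys are assumptions made up for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PaymentEvent:
    """A payment made by a member (message-level documentation)."""

    # Field-level annotations live in the `metadata` mapping.
    user_id: str = field(
        metadata={
            "doc": "Internal member identifier",
            "classification": "personal_info",
        }
    )
    amount_cents: int = field(metadata={"doc": "Payment amount in cents"})


def describe(schema: type) -> list[dict[str, str]]:
    """Turn field annotations into rows a catalog could ingest."""
    rows = []
    for f in fields(schema):
        rows.append(
            {
                "field": f.name,
                "doc": f.metadata.get("doc", ""),
                "classification": f.metadata.get(
                    "classification", "unclassified"
                ),
            }
        )
    return rows


catalog_rows = describe(PaymentEvent)
for row in catalog_rows:
    print(row)
```

The same annotations can later carry governance classifications, so documentation and policy inputs come from one place.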
IV. Keep the logical and the physical together
● Raw Data (events, logs etc): generated from mobile/frontend, or services
● Spark, Flink, Airflow, ETL: ETL workflows written by a data engineer
● Fact Tables: curated fact tables generated by DE
● Business Insights: funnel analysis, business insights based off of fact tables. Consumed by Marketing or business leads
● Bring engineers, PMs, analysts and BI to the same tool
● Engineers can add context, assertions, data quality checks
● Technical issues at the physical layer can be easily bubbled up to logical layers
● Visibility increases when everybody is communicating at the same place - one water cooler
Datasets grouped by domain. These are logical groupings.
On the same page we have datasets grouped by platform (Snowflake, Looker etc), organized at a physical level.
V. Keep the logical and the physical separate
● Data governance is a gnarly problem
● The same type of data can exist in many forms - ephemeral data in a stream, permanent data in a blob store….
● The most challenging aspect of governance is applying rules and policies consistently across all these forms
● What is also hard is that rules and policies may change
● “All problems in CS can be solved with one more level of indirection”
● Every data entity should have a logical data model defined as a platform-agnostic schema (Protocol Buffers et al.)
● Specify governance rules on the schema - leverage schema annotation
● Define rules and policies separately in a central glossary
● Orchestrate rules at the platform level
classification | policy       | Rule for Snowflake            | Rule for Kinesis
personal_info  | sensitive    | “Mask for roles - A, B and C” | “Mask on read for everything”
SSN            | confidential | “Do not retain”               | “Do not retain”
….             | ….           | ….                            |
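The table above can be read as two lookups: classification to policy (the glossary), then policy to a platform-specific rule. The sketch below wires those lookups together under that assumption. The rule strings come from the table; the field names in `SCHEMA_CLASSIFICATIONS` and the `rules_for` helper are hypothetical.

```python
from typing import Optional

# Central glossary: classification -> policy (from the table above).
GLOSSARY = {"personal_info": "sensitive", "SSN": "confidential"}

# Platform-level rules: policy -> platform -> rule (from the table above).
PLATFORM_RULES = {
    "sensitive": {
        "snowflake": "Mask for roles - A, B and C",
        "kinesis": "Mask on read for everything",
    },
    "confidential": {
        "snowflake": "Do not retain",
        "kinesis": "Do not retain",
    },
}

# Hypothetical schema annotations: field -> classification (None = none).
SCHEMA_CLASSIFICATIONS: dict[str, Optional[str]] = {
    "email": "personal_info",
    "ssn": "SSN",
    "amount_cents": None,
}


def rules_for(
    platform: str, schema_classifications: dict[str, Optional[str]]
) -> dict[str, str]:
    """Resolve the rule to apply to each classified field on one platform."""
    resolved = {}
    for field_name, classification in schema_classifications.items():
        if classification is None:
            continue  # unclassified fields carry no governance rule
        policy = GLOSSARY[classification]
        resolved[field_name] = PLATFORM_RULES[policy][platform]
    return resolved


print(rules_for("snowflake", SCHEMA_CLASSIFICATIONS))
print(rules_for("kinesis", SCHEMA_CLASSIFICATIONS))
```

Because the schema only names a classification, a policy change means editing the glossary or the platform rules, not every schema.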
[Diagram: Rule Orchestrator, Glossary (Policies), Schema Repository (Classification)]
VI. Crowdsource metadata ingestion
● Metadata ingestion should not be Data Platform’s responsibility alone
● Build a “data stewards” squad and assign representatives from different orgs
● Leads provide expertise to define best practices and build core metadata ingestion capabilities that work with the systems they own
● Hook up metadata ingestion directly to the sources - think log and metric collection!
VII. Everything must have an owner
● Assign owners to datasets at the outset
● No owner - no accountability
● Consider different types of ownership - technical and business
VIII. Meet data consumers where they are
IX. Shift Left
IX. Shift Left - Declare and collect metadata at the source
X. Embrace data contracts
X. Embrace data contracts - but what is it?
● Formal agreement between producers and consumers where one didn’t exist before
● Consumers add assertions
● Producers own accountability
● Basically SLAs
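A rough sketch of what such a contract could look like in code: a schema, consumer-supplied assertions, an accountable producer, and an SLA-style freshness bound. The `DataContract` type, its field names, and the example dataset are hypothetical and not taken from any specific data contract tool.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class DataContract:
    dataset: str
    owner: str  # the producer accountable for the contract
    schema: dict[str, type]
    # Consumer-supplied assertions: name -> predicate over a record.
    assertions: dict[str, Callable[[dict[str, Any]], bool]]
    max_staleness_hours: int  # SLA-style freshness bound


def check_record(contract: DataContract, record: dict[str, Any]) -> list[str]:
    """Return the list of contract violations for one record."""
    violations = [
        f"schema:{name}"
        for name, expected_type in contract.schema.items()
        if not isinstance(record.get(name), expected_type)
    ]
    if violations:
        # Skip assertions when the record does not even match the schema.
        return violations
    return [
        f"assertion:{name}"
        for name, predicate in contract.assertions.items()
        if not predicate(record)
    ]


def is_fresh(contract: DataContract, hours_since_update: float) -> bool:
    """Check the freshness bound of the contract."""
    return hours_since_update <= contract.max_staleness_hours


contract = DataContract(
    dataset="payments",
    owner="payments-team",
    schema={"user_id": str, "amount_cents": int},
    assertions={"amount_is_positive": lambda r: r["amount_cents"] > 0},
    max_staleness_hours=24,
)

print(check_record(contract, {"user_id": "u-1", "amount_cents": -5}))
print(is_fresh(contract, hours_since_update=30))
```

Keeping the owner on the contract makes it clear who is accountable when a violation shows up.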
X. Embrace data contracts - what makes a contract?
Schema, Semantics
Use lineage for impact analysis
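Impact analysis over lineage amounts to a graph traversal: starting from the artifact that is about to change, collect everything downstream of it. The sketch below assumes lineage is available as a simple adjacency mapping; the `LINEAGE` graph and its node names are made up for illustration, and a real catalog would expose lineage through its own API.

```python
from collections import deque

# Hypothetical lineage graph: artifact -> artifacts that read from it.
LINEAGE: dict[str, list[str]] = {
    "raw_events": ["etl_job"],
    "etl_job": ["fact_payments"],
    "fact_payments": ["funnel_dashboard", "revenue_report"],
}


def downstream_of(node: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every artifact downstream of `node`."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Everything that could break if the ETL job changes its output.
impacted = downstream_of("etl_job", LINEAGE)
print(impacted)
```

Combined with ownership metadata, the impacted list tells a producer who to notify before changing a contract.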
In Conclusion
● Bring everybody to one “water cooler”
● Choose a common, platform agnostic schema language
● Use schema for more than just schema
● Keep the logical and physical together
● But also keep the logical and physical separate
● Crowdsource metadata ingestion
● Everything must have an owner
● Meet data consumers where they are
● Shift Left
● Embrace data contracts
Thank you for your time!
@doodlesmt
@sherinthomasm
@thomassherin

10 tips for enabling data discovery and governance in your organization
