10 tips for enabling data discovery and governance in your organization

09/27/2023
Sherin Thomas
10 Tips for enabling discovery
and governance

Confidential | 2
Sherin Thomas
Software Engineer at Chime
Previously Netflix, Lyft, Twitter and Google
Been doing all kinds of data stuff for the last decade
I’m a “Sunday” artist
@doodlesmt
@sherinthomasm
@thomassherin

hps://maturck.com/a-chart-of-the-big-data-ecosystem/| 3

“When in doubt, you can’t be wrong”
- From “This isn’t what it looks like” by Pseudonymous Bosch

Conway’s Law In Action
● Product engineering is usually separate from analytics
● Analysts are consumers of data; product engineering is the producer
● Producers don’t know what consumers need
● Consumers don’t know what producers have
● Meanwhile, the data engineer is stuck in the middle
● Need to make these groups talk to each other

I. Bring everybody to the same “water cooler”
● One data catalog for all types of users - analysts, PMs, engineers
● Choose a data catalog that can support a range of data artifacts
○ Tables
○ Pipelines
○ People
○ Roles
○ Data Products
○ Domains
○ and many more….

● Lineage so you get a full picture of how data travels through your organization
● Different ways to group artifacts - tags, domains, glossary etc
● Pull context from documentation, comments etc
● Search and retrieve by relevance
● Producers can put context here
● Consumers can find what they need

DataHub Project is an open source metadata platform that enables Data Discovery, Data Observability, and Federated Governance on top of a high-fidelity Metadata Graph.
Acryl Data is the company advancing the DataHub Project.
[Diagram labels: Metadata Graph; People, Data, Code; Governance, Discovery, Observability]

The water cooler

II. Choose a common data language (spoiler alert - it’s schema)
● Schematize on write
● If that is not an option - schematize during ingestion
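
The two options above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SIGNUP_SCHEMA` event, its fields, and the helper names are hypothetical, and a plain dict of field types stands in for a real schema language such as Protocol Buffers.

```python
from typing import Any

# Hypothetical schema for a "signup" event: field name -> expected Python type.
SIGNUP_SCHEMA: dict[str, type] = {"user_id": str, "ts": int, "plan": str}


def validate(event: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of violations; an empty list means the event conforms."""
    errors = []
    for name, expected in schema.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            errors.append(f"wrong type for {name}: {type(event[name]).__name__}")
    for name in event:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors


def write_event(event: dict[str, Any], schema: dict[str, type], sink: list) -> None:
    """Schematize on write: the producer refuses to emit a non-conforming event."""
    errors = validate(event, schema)
    if errors:
        raise ValueError("; ".join(errors))
    sink.append(event)


def ingest(raw_events: list[dict[str, Any]], schema: dict[str, type]):
    """Schematize during ingestion: keep valid events, set the rest aside with reasons."""
    valid, dead_letter = [], []
    for event in raw_events:
        errors = validate(event, schema)
        if errors:
            dead_letter.append((event, errors))
        else:
            valid.append(event)
    return valid, dead_letter
```

Validating on write stops bad data at the producer; validating during ingestion lets unschematized sources keep flowing while the rejected events stay visible for follow-up.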

Tracking Plan (schema check in the server)
Protocol Buffer Schema
● Compare the “actual” data format against the “expected” data format to flag schema drift
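
A minimal sketch of what such a drift check could look like, assuming the expected schema is available as a mapping of field name to type name. The field names and sample records are made up for illustration.

```python
def infer_schema(records):
    """Infer the 'actual' schema (field -> set of observed type names) from records."""
    actual = {}
    for record in records:
        for name, value in record.items():
            actual.setdefault(name, set()).add(type(value).__name__)
    return actual


def schema_drift(expected, records):
    """Compare expected field types against what the data actually contains."""
    actual = infer_schema(records)
    return {
        # Expected fields that never appear in any record.
        "missing": sorted(set(expected) - set(actual)),
        # Observed fields that the expected schema does not declare.
        "unexpected": sorted(set(actual) - set(expected)),
        # Fields observed with any type other than the single expected one.
        "type_changed": sorted(
            name
            for name in set(expected) & set(actual)
            if actual[name] != {expected[name]}
        ),
    }


# Hypothetical expected schema and observed records.
expected = {"user_id": "str", "ts": "int", "plan": "str"}
observed = [
    {"user_id": "u1", "ts": 1695800000, "tier": "free"},
    {"user_id": "u2", "ts": "2023-09-27", "tier": "paid"},
]
drift = schema_drift(expected, observed)
# drift == {"missing": ["plan"], "unexpected": ["tier"], "type_changed": ["ts"]}
```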

III. Use schema for more than schema
Documentation
Message and field level annotations
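
As a rough illustration of annotations doubling as documentation, the sketch below uses a Python dataclass as a stand-in for a schema definition. In Protocol Buffers the same role would be played by comments and custom options. The `SignupEvent` message, its fields, and the `doc` and `pii` metadata keys are all hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SignupEvent:
    """Emitted when a user completes signup."""  # message-level annotation

    # Field-level annotations live in the metadata mapping of each field.
    user_id: str = field(metadata={"doc": "Stable internal user identifier", "pii": False})
    email: str = field(metadata={"doc": "Email entered at signup", "pii": True})


def render_docs(message_cls) -> str:
    """Build plain-text documentation from the annotations on a schema class."""
    lines = [f"{message_cls.__name__}: {message_cls.__doc__}"]
    for f in fields(message_cls):
        type_name = getattr(f.type, "__name__", f.type)
        pii = " [PII]" if f.metadata.get("pii") else ""
        lines.append(f"  {f.name} ({type_name}): {f.metadata['doc']}{pii}")
    return "\n".join(lines)


print(render_docs(SignupEvent))
```

Because the documentation is generated from the schema itself, it cannot fall out of step with the fields that producers actually emit.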

IV. Keep the logical and the physical together
● Raw Data - events, logs etc. Generated from mobile/frontend, or services
● Spark, Flink, Airflow, ETL - ETL workflows written by a data engineer
● Fact Tables - Curated fact tables generated by DE
● Business Insights - Funnel analysis, business insights based off of fact tables. Consumed by Marketing or business leads

● Bring engineers, PMs, analysts and BI to the same tool
● Engineers can add context, assertions, data quality checks
● Technical issues at the physical layer can be easily bubbled up to logical layers
● Visibility increases when everybody is communicating at the same place - one water cooler

Datasets grouped by domain. These are logical groupings.
On the same page we have datasets grouped by platform (Snowflake, Looker etc) - organized at a physical level.

V. Keep the logical and the physical separate
● Data governance is a gnarly problem
● The same type of data can exist in many forms - ephemeral data in a stream, permanent data in a blob store…
● The most challenging aspect of governance is applying rules and policies consistently across all these forms
● What is also hard is that rules and policies may change

● “All problems in CS can be solved with one more level of indirection”
● Every data entity should have a logical data model defined as a platform-agnostic schema (protocol buffer et al)
● Specify governance rules on the schema - leverage schema annotation
● Define rules and policies separately in a central glossary
● Orchestrate rules at the platform level

classification | policy       | Rule for Snowflake            | Rule for Kinesis
personal_info  | sensitive    | “Mask for roles - A, B and C” | “Mask on read for everything”
SSN            | confidential | “Do not retain”               | “Do not retain”
….             | ….           | ….

[Diagram labels: Rule Orchestrator; Glossary (Policies); Schema Repository (Classification)]
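
The indirection described in this section (field, then classification, then policy, then platform rule) might be sketched as follows. The classifications and rule texts come from the classification table earlier in this section. The field names `email` and `ssn`, the dict layout, and the `rules_for` function are assumptions made for illustration, not a description of any particular tool.

```python
# Schema repository: field -> classification (in practice, a schema annotation).
SCHEMA_CLASSIFICATION = {"email": "personal_info", "ssn": "SSN"}

# Glossary: classification -> policy, plus the concrete rule for each platform.
GLOSSARY = {
    "personal_info": {
        "policy": "sensitive",
        "rules": {
            "snowflake": "Mask for roles - A, B and C",
            "kinesis": "Mask on read for everything",
        },
    },
    "SSN": {
        "policy": "confidential",
        "rules": {"snowflake": "Do not retain", "kinesis": "Do not retain"},
    },
}


def rules_for(platform, schema=SCHEMA_CLASSIFICATION, glossary=GLOSSARY):
    """Rule orchestrator: resolve each field to the rule to enforce on one platform."""
    return {
        field_name: glossary[classification]["rules"][platform]
        for field_name, classification in schema.items()
    }


snowflake_rules = rules_for("snowflake")
# snowflake_rules == {"email": "Mask for roles - A, B and C", "ssn": "Do not retain"}
```

When a policy changes, only the glossary entry is edited; the schema annotations and the platforms that consume the resolved rules stay as they are.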

VI. Crowdsource metadata ingestion
● Metadata ingestion should not be Data Platform’s responsibility alone
● Build a “data stewards” squad and assign representatives from different orgs
● Leads provide expertise to define best practices and build core metadata ingestion capabilities that work with the systems they own
● Hook up metadata ingestion directly to the sources - think log and metric collection!

VII. Everything must have an owner
● Assign owners to datasets at the outset
● No owner - no accountability
● Consider different types of ownership - technical and business

VIII. Meet data consumers where they are

IX. Shift Left - Declare and collect metadata at the source

X. Embrace data contracts

X. Embrace data contracts - but what is it?
● Formal agreement between producers and consumers where one didn’t exist before
● Consumers add assertions
● Producers own accountability
● Basically SLAs
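
One way to picture the points above is as a small object: the producer is named as the accountable owner, and consumers register the assertions they depend on. The `DataContract` class, the dataset and team names, and the example checks are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataContract:
    dataset: str
    producer: str  # the team accountable for honoring the contract
    # Assertion name -> check over a batch of rows, registered by consumers.
    assertions: dict[str, Callable[[list[dict]], bool]] = field(default_factory=dict)

    def add_assertion(self, name: str, check: Callable[[list[dict]], bool]) -> None:
        self.assertions[name] = check

    def evaluate(self, rows: list[dict]) -> list[str]:
        """Return the names of the assertions that this batch violates."""
        return [name for name, check in self.assertions.items() if not check(rows)]


contract = DataContract(dataset="fact.signups", producer="growth-eng")
contract.add_assertion(
    "user_id is never null",
    lambda rows: all(row.get("user_id") is not None for row in rows),
)
contract.add_assertion("batch is not empty", lambda rows: len(rows) > 0)

violations = contract.evaluate([{"user_id": "u1"}, {"user_id": None}])
# violations == ["user_id is never null"]
```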

X. Embrace data contracts - what makes a contract?
Schema Semantics

Use lineage for impact analysis
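
Impact analysis over lineage amounts to a graph traversal: starting from the artifact about to change, walk every downstream edge. The lineage graph and artifact names below are hypothetical, and a real catalog would serve this from its own lineage store.

```python
from collections import deque

# Hypothetical lineage graph: artifact -> its direct downstream consumers.
LINEAGE = {
    "raw.signup_events": ["etl.build_signups"],
    "etl.build_signups": ["fact.signups"],
    "fact.signups": ["dash.signup_funnel", "dash.weekly_growth"],
}


def downstream_impact(node, lineage):
    """Breadth-first walk returning every artifact downstream of `node`."""
    seen, impacted = {node}, []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guard against cycles and diamond-shaped lineage
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


impacted = downstream_impact("raw.signup_events", LINEAGE)
# impacted == ["etl.build_signups", "fact.signups",
#              "dash.signup_funnel", "dash.weekly_growth"]
```

A producer planning a breaking change to `raw.signup_events` could use a list like this to find which contracts, and therefore which consumers, are affected.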

In Conclusion
● Bring everybody to one “water cooler”
● Choose a common, platform agnostic schema language
● Use schema for more than just schema
● Keep the logical and physical together
● But also keep the logical and physical separate

● Crowdsource metadata ingestion
● Everything must have an owner
● Meet data consumers where they are
● Shift Left
● Embrace data contracts

Thank you for your time!
@doodlesmt
@sherinthomasm
@thomassherin
