Ryan Collingwood discusses data contracts and how they can be implemented using code. Data contracts define how data is exchanged between parties and ensure there are no uncertainties. They include elements like schema, governance, semantics, and service level objectives. Implementing data contracts in code allows them to be version controlled, tested, and more easily maintained than text. Python is proposed as the language due to type checking and libraries that could be used. Open questions remain around tooling and who will do the work to implement data contracts.
2. Who am I and my current context
• Ryan Collingwood, Head of Data & Analytics at Oroton
• Australia’s oldest luxury fashion company
• Centralised Data Team
• Monoliths (ERP & POS) surrounded by a number of SaaS applications
• Data is mostly moved in batch
3. Why I think you might care about this
Responsibility in the modern data stack
Andrew Jones - Driving Data Quality with Data Contracts (2023)
4. Shout out to Andrew Jones
https://data-contracts.com/
5. Similar, Related, and Complementary Concepts
• APIs
• Data Dictionaries
• Data Mesh
• Event Storming
• Data Catalogs
• Domain Driven Design
I’d be curious to know what else you might add to this list
6. Advice is a form of nostalgia. Dispensing it is a way of fishing the past from the disposal, wiping it off, painting over the ugly parts and recycling it for more than it's worth
Mary Schmich
https://www.chicagotribune.com/columns/chi-schmich-sunscreen-column-column.htm
“If I could offer you only one tip for the future, sunscreen would be it.”
8. ... outlines how data can get exchanged between two parties.
It defines the structure, format, and rules of exchange in a
distributed data architecture. These formal agreements make
sure that there aren’t any uncertainties or undocumented
assumptions about data.
https://atlan.com/data-contracts/
... is an agreed interface between the generators of data and
its consumers. It sets the expectations around that data,
defines how it should be governed, and facilitates the explicit
generation of quality data that meets the business
requirements.
Andrew Jones - Driving Data Quality with Data Contracts (2023)
10. You can be a Data Producer without knowing about it
[Diagram: Team C, non-consensual API]
11. Broken pipelines, broken non-promises
[Diagram: Team A, Team B, Team C, non-consensual APIs]
12. One of the largest impediments to addressing data quality at any organization is the
lack of collaboration between data producers and data consumers.
...
A common workaround (is the) proliferation of non-consensual APIs.
Can’t get a software engineer to emit the data you need to solve some business
problem?
Connect your ELT tool to a production source and extract a batch dump on a
schedule.
Easy
(Until things start breaking…whoops).
Chad Sanderson - https://dataproducts.substack.com/p/the-production-grade-data-pipeline
13. What makes up a Data Contract
https://github.com/PacktPublishing/Driving-Data-Quality-with-Data-Contracts/blob/main/Chapter03/order_events.yaml
14. However, data contracts are more than just a
schema... we need our data contracts to capture
metadata that describes how the data can be used,
how it is governed, and the controls around the data
Driving Data Quality with Data Contracts - Andrew Jones (2023)
15. What makes up a Data Contract
• Schema
• Contract Governance
• Semantics
• Service Level Objectives
• Dataset Governance
• Mechanisms of Transmission
• People
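As a rough sketch of how these elements could sit together in code, the dataclasses below model a contract with schema, semantics, service level objectives, governance, transmission, and people. Every class, field name, and value here is hypothetical and chosen for illustration; it is not the structure used in the book's `order_events.yaml`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ContractField:
    name: str
    dtype: str      # schema: how the value is stored and exchanged
    meaning: str    # semantics: how a human should interpret it


@dataclass(frozen=True)
class ServiceLevelObjective:
    name: str
    target: str


@dataclass(frozen=True)
class DataContract:
    dataset: str
    version: str
    owner: str                  # people: who is responsible for the data
    consumers: List[str]        # people: who relies on it
    transmission: str           # mechanism of transmission, e.g. "batch"
    fields: List[ContractField]
    slos: List[ServiceLevelObjective] = field(default_factory=list)
    governance: Dict[str, str] = field(default_factory=dict)

    def field_names(self) -> List[str]:
        return [f.name for f in self.fields]


# Hypothetical example contract; none of these values come from a real system.
orders_daily = DataContract(
    dataset="orders_daily",
    version="0.1.0",
    owner="retail-systems-team",
    consumers=["data-and-analytics"],
    transmission="batch",
    fields=[
        ContractField("order_id", "string", "Unique identifier of a customer order"),
        ContractField("order_total", "decimal", "Order value in AUD, tax inclusive"),
    ],
    slos=[ServiceLevelObjective("freshness", "available by 06:00 each day")],
    governance={"retention": "7 years", "classification": "internal"},
)

print(orders_daily.field_names())
```

Because the contract is an ordinary Python object, it can live in version control and be imported by tests or document generators.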
16. Schema versus Semantics
Schema
• Systems interoperability
• Support for implicit validation by database technologies
• Ensuring we capture and retrieve the data consistently
Semantics
• Human expectations
• Tends to require explicit validation by complementary solutions
• Ensuring we interpret the data consistently
Dates / times, monetary values - are a trap if considered only as schema.
What are your “schema” but “secretly semantic” situations?
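A small illustration of the trap, using only the standard library: both timestamps below pass a schema check ("is it a datetime?"), but only one says which 9am is meant. The `schema_ok` and `semantics_ok` helpers, the `Money` class, and the rule that timestamps must carry a UTC offset are assumptions made up for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal

# Both values satisfy the schema "this column is a datetime".
naive = datetime(2023, 6, 1, 9, 0)  # 9am, but in which timezone?
aware = datetime(2023, 6, 1, 9, 0, tzinfo=timezone(timedelta(hours=10)))


def schema_ok(value: object) -> bool:
    """Schema check: the value has the expected type."""
    return isinstance(value, datetime)


def semantics_ok(value: datetime) -> bool:
    """Hypothetical semantic rule: a timestamp must carry a UTC offset."""
    return value.utcoffset() is not None


@dataclass(frozen=True)
class Money:
    """A decimal alone does not say which currency or whether tax is included."""
    amount: Decimal
    currency: str
    tax_inclusive: bool


with_tax = Money(Decimal("100.00"), "AUD", tax_inclusive=True)
without_tax = Money(Decimal("100.00"), "AUD", tax_inclusive=False)

print(schema_ok(naive), semantics_ok(naive))
print(schema_ok(aware), semantics_ok(aware))
print(with_tax == without_tax)
```

The two `Money` values hold the same number in the same column type, yet they mean different things, which is the kind of expectation a schema alone does not capture.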
17. Minimum Viable Data Contract Tooling
Andrew Jones - Driving Data Quality with Data Contracts (2023)
24. Ok so how are we going to make this all happen?
Awesome humans who understand models, abstractions, constraints
You could even do it in ✨code✨
... and you should definitely version control it
25. Why Code? Why not Text?
● Entanglement of meaning and representation
● Finding References instead of text matches
● Enforcement of structure
● Refactoring
● Testable constraints
● More options for document generation
○ Including JSON and YAML
Although... I’ve been having a blast using Logseq (a graph-like outliner) and I might be crazy enough to give that a go as an IDE for this
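To make "testable constraints" and "document generation" a little more concrete, here is a minimal sketch using only the standard library. The `Column` and `Contract` classes, the `violations` function, and the two rules it checks are hypothetical; YAML output would need a third-party library and is not shown.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Column:
    name: str
    dtype: str
    description: str
    pii: bool = False


@dataclass
class Contract:
    name: str
    version: str
    columns: List[Column]


def violations(contract: Contract) -> List[str]:
    """Return a list of human-readable problems; empty means the contract passes."""
    problems = []
    names = [c.name for c in contract.columns]
    if len(names) != len(set(names)):
        problems.append("duplicate column names")
    for column in contract.columns:
        if not column.description.strip():
            problems.append(f"{column.name}: missing description")
    return problems


# Hypothetical contract with one deliberate mistake (email has no description).
customer = Contract(
    name="customer",
    version="0.1.0",
    columns=[
        Column("customer_id", "string", "Stable identifier for a customer"),
        Column("email", "string", "", pii=True),
    ],
)

print(violations(customer))
print(json.dumps(asdict(customer), indent=2))  # same object, rendered as JSON
```

The same object drives both the check and the generated document, so the two cannot drift apart the way a hand-edited text file and its validation rules can.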
29. Creating a Meta Model
● Focused around Events
● From UI to DB
● Schema and Semantics
● People
... still figuring it out
Don’t have to do it all at once!
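One possible shape for such a meta model, sketched under heavy assumptions: an event has an owner, attributes carrying both a schema type and a meaning, and a record of the name it goes by in each layer from UI to DB. The layer names, classes, and the example event are all invented for illustration and are not the author's actual model.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical layers an event passes through on its way from UI to DB.
LAYERS = ("ui", "api", "db")


@dataclass
class Person:
    name: str
    role: str


@dataclass
class Attribute:
    name: str
    schema_type: str   # schema
    meaning: str       # semantics


@dataclass
class Event:
    name: str
    owner: Person
    attributes: List[Attribute]
    # Maps a layer to the name this event is known by in that layer.
    seen_in: Dict[str, str] = field(default_factory=dict)

    def missing_layers(self) -> List[str]:
        """Layers not yet documented, so the model can be filled in gradually."""
        return [layer for layer in LAYERS if layer not in self.seen_in]


order_placed = Event(
    name="OrderPlaced",
    owner=Person("Alex", "Product Owner"),
    attributes=[Attribute("order_id", "string", "Unique identifier of the order")],
    seen_in={"ui": "Checkout > Place order", "db": "orders"},
)

print(order_placed.missing_layers())
```

Tracking what is still missing, rather than requiring everything up front, fits the idea of not having to do it all at once.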
38. Refactoring, doing variable extraction with Rope
https://colab.research.google.com/drive/1fHLit3hF2G0dFV0Xl11jnovcdPR87s-E
40. Code Refactoring - Other Libraries
• https://pybowler.io/ - doesn't have variable extraction and hasn't seen much development activity recently
• https://github.com/hchasestevens/astpath - useful for finding parts of the AST, but I'm not sure how to proceed from there; it seems to be powering a number of meta-programming libs though
• traad - https://av.tib.eu/en/media/19947
41. Further explorations for wrangling generated code
• Abstract Syntax Tree - Options for querying
• Linting - Define my own rules as they apply to the meta schema
• Code duplication detection
• Network (Graph) Analysis
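Three of these ideas can be tried with nothing more than Python's built-in `ast` module. The sketch below queries a piece of generated code for its classes, applies one made-up lint rule (every class needs a docstring), and flags functions whose bodies are structurally identical. The sample source and the rule are hypothetical; a real setup might use a dedicated linter or AST query tool instead.

```python
import ast
from collections import defaultdict

# Hypothetical generated code to inspect.
GENERATED = '''
class OrderPlaced:
    """An order was placed."""
    order_id: str
    total: float

class OrderShipped:
    order_id: str

def to_cents(x):
    return int(round(x * 100))

def as_cents(x):
    return int(round(x * 100))
'''

tree = ast.parse(GENERATED)

# Query: which classes does the generated code define?
class_names = [n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]

# Lint rule (made up for this sketch): every class must have a docstring.
undocumented = [
    n.name
    for n in ast.walk(tree)
    if isinstance(n, ast.ClassDef) and ast.get_docstring(n) is None
]

# Duplication: group functions by a structural dump of their bodies.
# ast.dump omits line numbers by default, so identical logic produces the same key.
by_body = defaultdict(list)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        key = "".join(ast.dump(stmt) for stmt in node.body)
        by_body[key].append(node.name)
duplicates = [names for names in by_body.values() if len(names) > 1]

print(class_names)
print(undocumented)
print(duplicates)
```

This only catches exact structural duplicates with the same variable names; anything fuzzier would need a more involved comparison.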
43. My References
• Andrew Jones - Driving Data Quality with Data Contracts (2023) - ISBN 13 978-1837635009
• Data Contracts: The Key to Scaling Distributed Data Architecture and Reducing Data Chaos -
https://atlan.com/data-contracts/
• Chad Sanderson - The Production-Grade Data Pipeline -
https://dataproducts.substack.com/p/the-production-grade-data-pipeline
• Chad Sanderson and Adrian Kreuziger - An Engineer's Guide to Data Contracts -
https://mlops.community/an-engineers-guide-to-data-contracts-pt-1/
• Green Tree Snakes - the missing Python AST docs - https://greentreesnakes.readthedocs.io/en/latest/
• Rope - Refactoring Variable Extraction -
https://rope.readthedocs.io/en/latest/library.html#performing-refactorings