Covers what a schema registry is, and the importance of a shared contractual language (schemas) between consuming and producing services. Talk presented at the Big Data Analytics meetup in Sydney.
2. Snowplow Philosophy
- Open source (ALv2) or managed (paid)
- Batch or real time
- Collect everything (web, mobile, IoT, webhooks)
- Ownership of data matters
- Data modelling should be first class and flexible
- BYO toolset (Spark, Drill, Beam etc)
3. Hazards of many languages
● Imagine all employees are required to speak only in their native language.
● Either everyone has to be multilingual, or expensive translators must be added for every pair of languages spoken.
○ Even if you have a sophisticated and efficient way of getting messages from place to place, you’re still stuck with the overhead of constant translation.
4.
5. A schema
● A shared contract between a consumer and a producer
● Prior art
○ Avro, Thrift, Protobuf etc.
6. Key attributes of schema technologies
● Code generation: bindings to your schemas in a given programming language
● Data encodings
● Validation rules: for calibration and sanity
● Types: a description of the type of data
● Schema evolution
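To make the "types" and "validation rules" attributes concrete, here is a minimal sketch in plain Python. The schema layout, the field names and the `validate` helper are all hypothetical and invented for illustration; real schema technologies such as Avro, Thrift or Protobuf each define their own formats.

```python
# Hypothetical schema: field name -> (expected type, required?, optional rule).
PAGE_VIEW_SCHEMA = {
    "url": (str, True, lambda v: v.startswith("http")),
    "duration_ms": (int, True, lambda v: v >= 0),
    "referrer": (str, False, None),
}


def validate(record: dict, schema: dict) -> list:
    """Return a list of human-readable violations; an empty list means valid."""
    errors = []
    for field, (expected_type, required, rule) in schema.items():
        if field not in record:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            # Type check: the value must match the declared type.
            errors.append(f"{field}: expected {expected_type.__name__}")
        elif rule is not None and not rule(value):
            # Validation rule: a sanity check beyond the bare type.
            errors.append(f"{field}: failed validation rule")
    for field in record:
        if field not in schema:
            errors.append(f"unknown field: {field}")
    return errors
```

A producer could call `validate` before sending and a consumer could call it on receipt, so both sides apply the same definition.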
11. Schema storage
● Option 1: Send the entire definition with the record
(Diagram: four records, each with its full schema attached)
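Option 1 can be sketched as follows. This assumes a JSON envelope with hypothetical `schema` and `data` keys and a made-up `page_view` schema; it is an illustration of the idea, not any particular system's wire format.

```python
import json

# Hypothetical schema and records, for illustration only.
SCHEMA = {
    "name": "page_view",
    "fields": {"url": "string", "duration_ms": "int"},
}
records = [
    {"url": f"https://example.com/{i}", "duration_ms": i * 10} for i in range(4)
]


def with_embedded_schema(record: dict) -> bytes:
    """Option 1: every message carries the full schema next to its data."""
    return json.dumps({"schema": SCHEMA, "data": record}).encode("utf-8")


def decode_embedded(message: bytes) -> tuple:
    """A consumer needs nothing but the message itself to interpret it."""
    envelope = json.loads(message)
    return envelope["schema"], envelope["data"]


messages = [with_embedded_schema(r) for r in records]
schema_bytes = len(json.dumps(SCHEMA).encode("utf-8"))
total_bytes = sum(len(m) for m in messages)
print(f"schema repeated {len(messages)} times: "
      f"{schema_bytes * len(messages)} of {total_bytes} bytes")
```

Each message is self-contained, at the price of repeating the same definition in every record.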
12. Schema storage
● Option 2: Send a pointer to the definition
(Diagram: four records, each carrying a pointer, *Schema, to its schema)
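Option 2 can be sketched in the same way. The identifier format `com.example/page_view/1-0-0` and the `SCHEMA_STORE` lookup table are assumptions standing in for whatever shared storage the pointer actually refers to.

```python
import json

# Hypothetical identifier and lookup table standing in for shared schema storage.
SCHEMA_ID = "com.example/page_view/1-0-0"
SCHEMA_STORE = {
    SCHEMA_ID: {
        "name": "page_view",
        "fields": {"url": "string", "duration_ms": "int"},
    },
}


def with_schema_pointer(record: dict, schema_id: str) -> bytes:
    """Option 2: the message carries only a short reference to its schema."""
    return json.dumps({"schema": schema_id, "data": record}).encode("utf-8")


def resolve(message: bytes) -> tuple:
    """The consumer dereferences the pointer; unknown ids raise KeyError."""
    envelope = json.loads(message)
    schema = SCHEMA_STORE[envelope["schema"]]
    return schema, envelope["data"]
```

The message shrinks to the data plus a short identifier, but the consumer now depends on being able to look that identifier up somewhere.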
13. Schema registry
● A canonical, shared source of truth
● Within and between organisations
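As a rough illustration of "a canonical, shared source of truth", the sketch below keeps schemas in an in-memory dictionary. The `SchemaRegistry` class and its methods are hypothetical; a real registry would be a networked service with persistent storage.

```python
class SchemaRegistry:
    """In-memory stand-in for a shared registry service (hypothetical API)."""

    def __init__(self):
        self._schemas = {}  # (name, version) -> schema definition

    def register(self, name: str, version: int, schema: dict) -> None:
        key = (name, version)
        existing = self._schemas.get(key)
        if existing is not None and existing != schema:
            # A published version never changes, so every reader sees one truth.
            raise ValueError(
                f"{name} v{version} is already registered with a different definition"
            )
        self._schemas[key] = schema

    def lookup(self, name: str, version: int) -> dict:
        """Return the definition for an exact name and version."""
        return self._schemas[(name, version)]

    def latest(self, name: str) -> int:
        """Return the highest registered version number for a schema name."""
        versions = [v for (n, v) in self._schemas if n == name]
        if not versions:
            raise KeyError(name)
        return max(versions)
```

Treating a registered version as immutable is a design choice in this sketch: changes go out as new versions, so a pointer always resolves to the same definition.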
14. Why?
● Data governance
○ Safe schema evolution
○ Policy enforcement
● Data pipeline resilience
● Data discovery
● Efficiency
○ Cost
○ Storage
○ Computation
● Shares principles with software engineering CI/CD
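"Safe schema evolution" and "policy enforcement" could look like the check below, run before a new schema version is accepted, much as a CI pipeline gates a merge. The rules encoded here (no removed fields, no changed types, no new required fields) are one conservative policy chosen for illustration, and the schema layout is hypothetical.

```python
def compatibility_errors(old: dict, new: dict) -> list:
    """Compare two schema versions.

    Both map field name -> {"type": str, "required": bool}.
    Returns a list of policy violations; an empty list means the change is allowed.
    """
    errors = []
    for name, spec in old.items():
        if name not in new:
            errors.append(f"removed field: {name}")
        elif new[name]["type"] != spec["type"]:
            errors.append(
                f"changed type of {name}: {spec['type']} -> {new[name]['type']}"
            )
    for name, spec in new.items():
        if name not in old and spec["required"]:
            # Records written under the old version would lack this field.
            errors.append(f"new required field: {name}")
    return errors
```

A registry that refuses any version for which this list is non-empty enforces the policy in one place rather than in every producer and consumer.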
15. Key takeaways
Schemas are critical, and a shared repository of all schemas used by the organisation is important for making siloed knowledge shared and explicit.
By using schemas, the data definition for a particular kind of data exists in a single place.
Schemas serve as self-contained and automatically enforceable contracts between producers and consumers of data.