"The only constant in life is change! The same applies to your Kafka events flowing through your streaming applications.
The Confluent Schema Registry allows us to control how schemas can evolve over time without breaking the compatibility of our streaming applications. But when you start with Kafka and (Avro) schemas, this can be pretty overwhelming.
Join Kosta and Tim as we dive into the tricky world of backward and forward compatibility in schema design. During this deep dive talk, we are going to answer questions like:
* What compatibility level to pick?
* What changes can I make when evolving my schemas?
* What options do I have when I need to introduce a breaking change?
* Should we automatically register schemas from our applications? Or do we need a separate step in our deployment process to promote schemas to higher-level environments?
* What to promote first? My producer, consumer or schema?
* How do you generate Java classes from your Avro schemas using Maven or Gradle, and how do you integrate this into your project(s)?
* How do you build an automated test suite (unit tests) to gain more confidence and verify you are not breaking compatibility, even before deploying a new version of your schema or application?
With live demos, we'll show you how to make schema changes work seamlessly, emphasizing the crucial decisions, real-life examples, pitfalls, and best practices when promoting schemas on the consumer and producer sides.
Explore the ins and outs of Apache Avro and the Schema Registry with us at the Kafka Summit! Start evolving your schemas in a better way today, and join this talk!"
4. Kafka @ ING
Frontrunners in Kafka since 2014
Running in production:
• 9 years
• 7000+ topics
• Serving 1000+ Development teams
• Self-service topic management
5. Kafka @ ING
Traffic is growing by ~10% monthly
[Chart: messages produced per second (average), 2015–2024, growing from near zero to roughly 1,800,000]
6. What are we going to cover today?
• Why Schemas?
• What compatibility level to pick?
• What changes can I make when evolving my schemas?
• What options do I have when I need to introduce a breaking change?
• Should we automatically register schemas from our applications?
• How do you generate Java classes from your Avro schemas, and how do you build an
automated test suite (unit tests)?
12. Why schemas?
[Diagram: a Producer Application (Kafka client + Serializer) sends binary records to your-topic on the Kafka cluster; two Consumer Applications (Kafka client + Deserializer) poll the same bytes. Producer and consumers are indirectly coupled on the data format.]
13. Why schemas?
[Same diagram; the consumers wonder: "What fields and types of data can I expect?" "Documentation of the fields?"]
14. Why schemas?
[Same diagram; the producer: "Some requirements changed. We need to introduce a new field." The consumers: "Don't cause inconsistency." "Keep it compatible!" "No disruption to my service."]
17. Why schemas?
[Same diagram, with a Schema added: we need the schema the data was written with to be able to read it.]
18. Why schemas?
[Same diagram; but don't send the schema each time we send data.]
19. Why schemas?
[Same diagram, with the Confluent Schema Registry added: schemas are registered once in the registry, so the schema is not sent each time we send data.]
29. Unions
• Unions are represented using JSON arrays
• For example, ["null", "string"] declares a schema which may be either a
null or string.
• Question: Who thinks this is a valid definition?
{
  …
  "fields": [
    {
      "name": "firstName",
      "type": ["null", "string"],
      "doc": "The first name of the Customer."
    },
  …]

{
  …
  "fields": [
    {
      "name": "firstName",
      "type": ["null", "string", "int"],
      "doc": "The first name of the Customer."
    },
  …]
org.apache.kafka.common.errors.SerializationException: Error
serializing Avro message…
…
Caused by: org.apache.avro.UnresolvedUnionException: Not in union
["null","string","int"]: true
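For completeness: to make such a union field truly optional, it is usually paired with a null default (field name illustrative); note that the "null" branch comes first so the default value null is valid:

```json
{
  "name": "firstName",
  "type": ["null", "string"],
  "default": null,
  "doc": "The first name of the Customer."
}
```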
30. Aliases
Consumer (reader) schema:
{
  …
  "fields": [
    {
      "name": "customerName",
      "aliases": [ "name" ],
      "type": "string",
      "doc": "The name of the Customer.",
      "default": null
    },
  …]

Producer (writer) schema:
{
  …
  "fields": [
    {
      "name": "name",
      "type": "string",
      "doc": "The name of the Customer.",
      "default": null
    },
  …]

• Named types and fields may have aliases
• Aliases function by rewriting the writer's schema using aliases from the reader's schema.
37. FULL
Producer writes: V1
{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "version": "1",
  "fields": [
    { "name": "name", "type": ["null","string"], "default": null },
    { "name": "occupation", "type": ["null","string"], "default": null },
    { "name": "annualIncome", "type": ["null","int"], "default": null },
    { "name": "dateOfBirth", "type": ["null","string"], "default": null }
  ]
}
Consumer reads: V2
{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "version": "2",
  "fields": [
    { "name": "name", "type": ["null","string"], "default": null },
    { "name": "occupation", "type": ["null","string"], "default": null },
    { "name": "annualIncome", "type": ["null","int"], "default": null },
    { "name": "dateOfBirth", "type": ["null","string"], "default": null },
    { "name": "phoneNumber", "type": ["null","string"], "default": null }
  ]
}
NOTE:
• The default values apply only on the consumer side.
• On the producer side you need to set a value for the field.
38. Available compatibility types
From: Confluent Schema Registry documentation
• BACKWARD: the new schema can be used to read old data
• FORWARD: the old schema can be used to read new data
• FULL: both backward and forward
• NONE: no compatibility enforced
39. What compatibility to use?
• If you are the topic owner and the producer, in control of evolving the
schema, and you don't want to break existing consumers, use
FORWARD
• If you are the topic owner and a consumer, use BACKWARD, so you can
upgrade first and then ask the producer to evolve its schema with the
fields you need
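For reference, the compatibility level is set per subject via the Schema Registry REST API; a sketch of the request (host and subject name are illustrative):

```
PUT http://localhost:8081/config/customers-value
Content-Type: application/vnd.schemaregistry.v1+json

{"compatibility": "FORWARD"}
```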
41. Backward Compatibility Demo: Components
[Diagram: kafka-producer-one (schema v1) sends to customers-topic-backward on the Kafka cluster; kafka-consumer-first (schema v2), kafka-consumer-second (schema v3), and kafka-consumer-third (schema v4) each poll from it. The evolving schema versions involve: occupation (required), annualIncome (optional), age (required).]
Adding a required field is not a backward compatible change!
42. Forward Compatibility Demo: Components
[Diagram: kafka-producer-one (schema v2), kafka-producer-two (schema v3), and kafka-producer-three (schema v4) send to customers-topic-forward on the Kafka cluster; kafka-consumer-first (schema v1) polls from it. The evolving schema versions involve: annualIncome (optional), dateOfBirth (required), phoneNumber (required).]
Removing a required field is not a forward compatible change!
43. Plugins: Avro Schema to Java Class
Avro Schema (.avsc) → avro-maven-plugin → .java → maven-compiler-plugin → .class → maven-jar-plugin → .jar
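A typical avro-maven-plugin setup looks roughly like this (plugin version and directory paths are illustrative):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.11.3</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/target/generated-sources/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```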
44. Plugins: Avro Schema to Java Class
Avro Schema (.avsc) → avro-maven-plugin → .java → maven-compiler-plugin → .class → maven-jar-plugin → .jar
• Validation of Avro Syntax
• No validation on compatibility!
45. Plugins: Avro Schema to Java Class
Customer.avsc:
{
  "namespace": "com.example.avro.customer",
  "type": "record",
  "name": "Customer",
  "version": 1,
  "doc": "Avro schema for our customer.",
  "fields": [
    { "name": "name", "type": "string", "doc": "The name of the Customer." },
    { "name": "occupation", "type": "string", "doc": "The occupation of the Customer." }
  ]
}
Customer.java (generated):
package com.example.avro.customer;
/** Avro schema for our customer. */
@org.apache.avro.specific.AvroGenerated
public class Customer extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
  private static final long serialVersionUID = 1600536469030327220L;
  public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"Customer\",\"namespace\":\"com.example.avro.customer\",\"doc\":\"Avro schema for our customer.\",\"fields\":[{\"name\":\"name\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"doc\":\"The name of the Customer.\"},{\"name\":\"occupation\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"doc\":\"The occupation of the Customer.\"}],\"version\":1}");
  …
}
46. Plugins: Avro Schema to Java Class
[Diagram: the .jar containing the specific record Customer.class is shared by the Producer, the Consumer, and the Kafka Streams App, each using a Kafka client.]
47. Test Avro compatibility
Integration test style
[Diagram: your Java project contains schema versions V1, V2, and V3; V1 and V2 are already registered in the Confluent Schema Registry under subject customer-value (compatibility: BACKWARD). Validate the compatibility of V3 against the registry via the REST API using:
• curl
• Confluent CLI
• Confluent Schema Registry Maven Plugin]
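Under the hood, all three options call the same Schema Registry compatibility endpoint; a sketch of the request (host, subject, and schema payload are illustrative):

```
POST http://localhost:8081/compatibility/subjects/customer-value/versions/latest
Content-Type: application/vnd.schemaregistry.v1+json

{"schema": "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":[…]}"}
```

The response reports {"is_compatible": true} or false for the candidate schema against the latest registered version.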
48. Test Avro compatibility
Integration test style
[Same diagram; automate the compatibility check in your Maven build.]
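A sketch of the Confluent Schema Registry Maven Plugin configured for the test-compatibility goal (plugin version, registry URL, and subject-to-file mapping are illustrative):

```xml
<plugin>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-schema-registry-maven-plugin</artifactId>
  <version>7.6.0</version>
  <configuration>
    <schemaRegistryUrls>
      <param>http://localhost:8081</param>
    </schemaRegistryUrls>
    <subjects>
      <customer-value>src/main/avro/Customer.avsc</customer-value>
    </subjects>
  </configuration>
  <goals>
    <goal>test-compatibility</goal>
  </goals>
</plugin>
```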
50. Test Avro compatibility: Unit tests
Unit test style
[Diagram: your Java project contains schema versions V1, V2, and V3; validate compatibility between them locally, automated in your Maven build, using:
• curl
• Confluent CLI
• Confluent Schema Registry Maven Plugin
• Unit tests]
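The idea behind such a unit test can be illustrated with a toy sketch. This is NOT the real Avro resolution algorithm (which also checks types, unions, and aliases); it only demonstrates the core rule from the demos: a new reader schema can read old data only if every field it adds has a default. Class and field names are illustrative.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CompatCheck {

    // A field is reduced to a name plus whether it declares a default value.
    record Field(String name, boolean hasDefault) {}

    // Can a reader using newFields read data written with oldFields?
    static boolean isBackwardCompatible(List<Field> oldFields, List<Field> newFields) {
        Set<String> writtenNames = new HashSet<>();
        for (Field f : oldFields) writtenNames.add(f.name());
        for (Field f : newFields) {
            // A field the writer never wrote must have a default to fall back on.
            if (!writtenNames.contains(f.name()) && !f.hasDefault()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Field> v1 = List.of(new Field("name", false), new Field("occupation", false));
        // Adds an optional field (has a default): still backward compatible.
        List<Field> v2 = List.of(new Field("name", false), new Field("occupation", false),
                new Field("annualIncome", true));
        // Adds a required field (no default): breaks backward compatibility.
        List<Field> v3 = List.of(new Field("name", false), new Field("occupation", false),
                new Field("age", false));
        System.out.println(isBackwardCompatible(v1, v2)); // true
        System.out.println(isBackwardCompatible(v1, v3)); // false
    }
}
```

In a real project, a unit test would run the equivalent check using the Avro library's own compatibility API against the previous committed schema versions, failing the build before an incompatible schema ever reaches the registry.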
52. Should we auto register schemas?
• By default, client applications automatically register new schemas
• Auto registration is performed by the producers only
• For development environments you can use auto schema registration
• For production environments the best practice is
• to register schemas outside the client application
• to control when schemas are registered with Schema Registry and how they evolve
• You can disable auto schema registration on the producer: auto.register.schemas=false
• Schema Registry: Schema Registry security plugin
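A sketch of the relevant producer settings (broker and registry URLs are placeholders; use.latest.version is commonly paired with disabled auto registration so the producer uses the schema already promoted to the registry):

```
bootstrap.servers=localhost:9092
schema.registry.url=http://localhost:8081
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
auto.register.schemas=false
use.latest.version=true
```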
53. Auto register schema lessons learned
• Maven Avro plugin: additional information is appended to the schema embedded in the generated Java code (the SCHEMA$ string from slide 45, which carries extra "avro.java.string":"String" properties)
• Producer (KafkaAvroSerializer): auto.register.schemas=false
• When serializing, the Avro schema is derived from the Customer Java object
Avro Schema (.avsc) registered in the Schema Registry:
{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "fields": [
    { "name": "name", "type": "string", "doc": "The name of the Customer." }
  ]
}
Avro Schema (Java) in the producer: derived from the generated class at serialization time
→ Mismatch in schema comparison
54. Auto register schema lessons learned
• When using the KafkaAvroSerializer, it is recommended to set this property: avro.remove.java.properties=true
Note:
There is an open issue for the Avro Maven Plugin for this: AVRO-2838
Avro Schema (.avsc):
{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "fields": [
    { "name": "name", "type": "string", "doc": "The name of the Customer." }
  ]
}
Avro Schema (as Java String):
{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "fields": [
    { "name": "name", "type": "string", "doc": "The name of the Customer." }
  ]
}
→ No mismatch in schema comparison
55. Schema Evolution Guidelines
Rules of the Road for Modifying Schemas
If you want to make your schema evolvable, then follow these guidelines.
§ Provide a default value for fields in your schema, as this allows you to delete the field later.
§ Don’t change a field’s data type.
§ Don’t rename an existing field (use aliases instead).
56. Breaking changes. How to move forward?
What can you do?
• "Force push" schema
  • BACKWARD -> NONE -> BACKWARD
  • Allow for downtime?
  • Both producers and consumers under your control?
  • Last resort
• Produce to multiple topics
  • V1 topic
  • V2 topic
  • Migrate consumers
  • Transaction for an atomic operation
• Data Contracts for Schema Registry
  • Field-level transformations
57. Wrap up
Communication
§ Important to communicate changes between producing and consuming
teams
Gain more confidence
§ Add unit/integration tests to make sure your changes are compatible
58. Wrap up
Schema registration
§ Don’t allow applications to register schemas automatically
§ Don’t assume applications will set auto.register.schemas=false
§ Make sure to have security measures in place
Be aware of pitfalls
§ Avro Maven plugin adds: "avro.java.string"
§ Deserialization exceptions on the consumer side