Learn why schemas are critical for async services and how the Kafka Schema Registry helps maintain schema compatibility throughout the development lifecycle.
Attend the whole series!
What’s New in Apache Kafka 0.10.2 and Confluent 3.2
Date: Thursday, March 9, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Clarke Patterson, Senior Director, Product Marketing
Monitoring and Alerting Apache Kafka with Confluent Control Center
Date: Thursday, March 16, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Nick Dearden, Director, Engineering and Product
Data Pipelines Made Simple with Apache Kafka
Date: Thursday, March 23, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Ewen Cheslack-Postava, Engineer, Confluent
Using Apache Kafka to Analyze Session Windows
Date: Thursday, March 30, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Michael Noll, Product Manager, Confluent
Simplify Governance for Streaming Data in Apache Kafka
Date: Thursday, April 6, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Gwen Shapira, Product Manager, Confluent
https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/
What is Data Governance?
• Data Access
• Master Data Management
• Data Agility
• Data Reliability
• Data Quality
To Grow a Successful Data Platform, You Need Some Guarantees
Schemas Are a Key Part of This
1. Schemas: enforced contracts between your components
2. Schemas need to be agile but compatible
3. You need tools & processes that will let you:
• Move fast and not break things
The right way to do things
[Diagram: services exchanging events through Kafka topics MessageCustomer, RoomGifts, BookRoom, and LoyalCustomers]
The right way to do things
[Diagram: the same topics as above, now with a new BeachPromo service attached]
BeachPromo:
if (property.location == "beach" && customer.level == "platinum")
  sendMessage({topic: RoomGifts, customer_id: customer.id,
               room: property.roomnum, gift: "seaShell"})
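To make the pseudocode concrete, here is a hedged Java sketch of what a BeachPromo service might look like with Confluent's Avro serializers. The topic names (BookRoom, RoomGifts) and the promo rule come from the slide; the flattened booking fields, the RoomGift schema, and the connection settings are assumptions for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class BeachPromo {
  // Assumed schema for the RoomGifts topic.
  static final Schema GIFT_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RoomGift\",\"fields\":["
      + "{\"name\":\"customer_id\",\"type\":\"long\"},"
      + "{\"name\":\"room\",\"type\":\"string\"},"
      + "{\"name\":\"gift\",\"type\":\"string\"}]}");

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.put("bootstrap.servers", "localhost:9092");           // assumed host
    conf.put("schema.registry.url", "http://localhost:8081");  // assumed URL
    conf.put("group.id", "beach-promo");
    conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    conf.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
    conf.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    conf.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

    try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(conf);
         KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(conf)) {
      consumer.subscribe(Collections.singletonList("BookRoom"));
      while (true) {
        for (ConsumerRecord<String, GenericRecord> rec : consumer.poll(Duration.ofMillis(500))) {
          GenericRecord booking = rec.value();
          // The promo rule from the slide: beach property + platinum customer.
          if ("beach".equals(String.valueOf(booking.get("location")))
              && "platinum".equals(String.valueOf(booking.get("level")))) {
            GenericRecord gift = new GenericData.Record(GIFT_SCHEMA);
            gift.put("customer_id", booking.get("customer_id"));
            gift.put("room", booking.get("room"));
            gift.put("gift", "seaShell");
            producer.send(new ProducerRecord<>("RoomGifts", gift));
          }
        }
      }
    }
  }
}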
There is special value in standardizing parts of the schema
{ "type": "record",
  "name": "Event",
  "fields": [
    { "name": "headers", "type": {
        "name": "headers",
        "type": "record",
        "fields": [
          { "name": "source_app", "type": "string" },
          { "name": "host_app", "type": "string" },
          { "name": "destination_app", "type": ["null", "string"] },
          { "name": "timestamp", "type": "long" }
        ] } },
    { "name": "body", "type": "bytes" } ] }
Standard headers allow analyzing patterns of communication.
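A minimal sketch of populating those standardized headers before producing, assuming the Event schema above is saved as event.avsc; the app names and payload here are invented for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.nio.ByteBuffer;

public class EventBuilder {
  public static GenericRecord build(byte[] payload) throws Exception {
    Schema eventSchema = new Schema.Parser().parse(new File("event.avsc"));
    Schema headersSchema = eventSchema.getField("headers").schema();

    GenericRecord headers = new GenericData.Record(headersSchema);
    headers.put("source_app", "booking-service");     // assumed app name
    headers.put("host_app", "web-42");                // assumed host name
    headers.put("destination_app", null);             // the union allows null
    headers.put("timestamp", System.currentTimeMillis());

    GenericRecord event = new GenericData.Record(eventSchema);
    event.put("headers", headers);
    event.put("body", ByteBuffer.wrap(payload));      // Avro "bytes" maps to ByteBuffer
    return event;
  }
}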
In a Request-Response World – APIs between services are key
In an Async Stream Processing World – Message Schemas ARE the API
…except they stick around for a lot longer
Let's assume you see why you need schema compatibility…
How do we make it work for us?
Key Points:
• Schema Registry is efficient – due to caching
• Schema Registry is transparent to applications
• Schema Registry is highly available
• Schema Registry prevents bad data at the source
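"Transparent" in practice means the registry hides behind serializer configuration. A minimal sketch of a producer setup, where the only registry-specific pieces are the serializer class and the (assumed) schema.registry.url:

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import java.util.Properties;

public class RegistryProducer {
  public static Producer<String, GenericRecord> create() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    // The Avro serializer registers/fetches schemas and caches them locally,
    // so the registry is only contacted the first time each schema is used.
    props.put("value.serializer",
        "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("schema.registry.url", "http://localhost:8081");
    return new KafkaProducer<>(props);
  }
}

If a record doesn't match a compatible registered schema, serialization fails in the producer itself – which is how bad data gets stopped at the source.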
Big Benefits
• Schema management is part of your entire app lifecycle – from dev to prod
• Single source of truth for all apps and data stores
• No more “bad data” randomly breaking things
• Increased agility! Add, remove, and modify fields with confidence
• Teams can share and discover data
• Smaller messages on the wire and in storage
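The "smaller messages" point follows from Confluent's wire format: each Avro message carries a one-byte magic byte and a four-byte schema ID instead of the full schema text. A sketch of decoding that framing from raw message bytes (the example bytes are made up):

import java.nio.ByteBuffer;

public class WireFormat {
  public static void main(String[] args) {
    byte[] raw = {0, 0, 0, 0, 42};   // example: magic byte 0, schema ID 42
    ByteBuffer buf = ByteBuffer.wrap(raw);
    byte magic = buf.get();          // always 0 in the Confluent wire format
    int schemaId = buf.getInt();     // big-endian ID to look up in the registry
    System.out.println("magic=" + magic + ", schemaId=" + schemaId);
    // Everything remaining in buf is the compact Avro binary encoding of the record.
  }
}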
But the Best Part is…
[Diagram: an RDBMS feeds Kafka (with Schema Registry) through a JDBC Source connector, and Kafka feeds Hadoop/Hive through an HDFS Sink connector – the schema travels with the data at every hop]
Compatibility tips!
• Do you want to upgrade producer without touching consumers?
• You want *forward* compatibility.
• You can add fields.
• You can delete fields with defaults
• Do you want to update schema in a DWH but still read old messages?
• You want *backward* compatibility
• You can delete fields
• You can add fields with defaults
• Both?
• You want *full* compatibility.
• Default all things!
• Except the primary key which you never touch.
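These rules can be verified offline with plain Apache Avro, no registry required. A hedged sketch: v2 adds an email field with a default, which keeps the change fully compatible (the Customer schema is invented for illustration):

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatCheck {
  static Schema customer(String fields) {
    return new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":[" + fields + "]}");
  }

  public static void main(String[] args) {
    Schema v1 = customer("{\"name\":\"id\",\"type\":\"long\"}");
    Schema v2 = customer("{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"\"}");

    // Backward: a v2 reader decodes v1 data by filling in the default.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v2, v1).getType());   // COMPATIBLE

    // Forward: a v1 reader decodes v2 data by ignoring the extra field.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v1, v2).getType());   // COMPATIBLE
  }
}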
Summary!
1. Data quality is critical
2. Schemas are critical
3. Schema Registry is used to maintain quality from dev to prod
4. Schema Registry is in Confluent Open Source
Get Started with Apache Kafka Today!
https://www.confluent.io/downloads/
THE place to start with Apache Kafka!
• Thoroughly tested and quality assured
• More extensible developer experience
• Easy upgrade path to Confluent Enterprise
This diagram shows what Schema Registry is for. From left to right:
Schema Registry is specific to a Kafka cluster, so you can have one Schema Registry per cluster.
Applications can be third-party apps like Twitter or SFDC, or custom applications. Schema Registry is relevant to every producer that feeds messages into your cluster.
The Schema Registry exists outside your applications as a separate process.
Within an application, a component called the serializer serializes messages for delivery to the Kafka data pipeline. Confluent's Schema Registry is integrated into the serializer. Some other companies have created their own schema registries outside the serializer; the argument against that is it adds more complexity for the same functionality.
The process goes like this:
The serializer calls the Schema Registry to see whether it has a schema for the data the application wants to publish. If it does, the Schema Registry passes that schema to the application's serializer, which uses it to filter out incorrectly formatted messages. This keeps your data pipeline clean.
The format Confluent uses for its Schema Registry is Avro, an open and widely published schema format; most alternatives at the time were custom, in-house formats.
When you use the Confluent Schema Registry, it's automatically integrated into the serializer, so there's no extra effort on your part; your data pipeline simply stays clean. You just need all applications to call the Schema Registry when publishing.
The ultimate benefit of Schema Registry is that the consumer applications of your Kafka topics can successfully read the messages they receive.
Schema Registry is an industry best practice – used at Uber, Yelp, AirBnB, Ancestry, and Monsanto.
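To round out the flow described in these notes, a minimal sketch of the consumer side: the Avro deserializer reads each message's schema ID, fetches the writer's schema from the registry (then caches it), and returns a decoded record. Topic, group, and URLs are assumptions:

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class RegistryConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "example-group");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "io.confluent.kafka.serializers.KafkaAvroDeserializer");
    props.put("schema.registry.url", "http://localhost:8081");

    try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("RoomGifts"));
      while (true) {
        for (ConsumerRecord<String, GenericRecord> rec
            : consumer.poll(Duration.ofSeconds(1))) {
          // Fields come back exactly as the registered schema defines them.
          System.out.println(rec.value());
        }
      }
    }
  }
}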