Learn why schemas are critical for async services and how the Kafka Schema Registry helps maintain schema compatibility throughout the development lifecycle.
Attend the whole series!
What’s New in Apache Kafka 0.10.2 and Confluent 3.2
Date: Thursday, March 9, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Clarke Patterson, Senior Director, Product Marketing
Monitoring and Alerting Apache Kafka with Confluent Control Center
Date: Thursday, March 16, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Nick Dearden, Director, Engineering and Product
Data Pipelines Made Simple with Apache Kafka
Date: Thursday, March 23, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Ewen Cheslack-Postava, Engineer, Confluent
Using Apache Kafka to Analyze Session Windows
Date: Thursday, March 30, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Michael Noll, Product Manager, Confluent
Simplify Governance for Streaming Data in Apache Kafka
Date: Thursday, April 6, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Gwen Shapira, Product Manager, Confluent
https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/
What is Data Governance?
• Data Access
• Master Data Management
• Data Agility
• Data Reliability
• Data Quality
To Grow a Successful Data Platform, You Need Some Guarantees
Schemas Are a Key Part of This
1. Schemas: enforced contracts between your components
2. Schemas need to be agile but compatible
3. You need tools & processes that will let you:
• Move fast and not break things
The right way to do things
[Diagram: services exchanging events through Kafka topics MessageCustomer, RoomGifts, BookRoom, and LoyalCustomers]
The right way to do things
[Diagram: the same topics as above, now with a new BeachPromo service attached]
BeachPromo:
if (property.location == "beach" && customer.level == "platinum")
  sendMessage({topic: RoomGifts, customer_id: customer.id,
               room: property.roomnum, gift: "seaShell"})
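To make the pseudocode concrete, here is a hedged Java sketch of what a BeachPromo service might look like with Confluent's Avro serializers. The topic names (BookRoom, RoomGifts) and the promo rule come from the slide; the flattened booking fields, the RoomGift schema, and the connection settings are assumptions for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class BeachPromo {
  // Assumed schema for the RoomGifts topic.
  static final Schema GIFT_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RoomGift\",\"fields\":["
      + "{\"name\":\"customer_id\",\"type\":\"long\"},"
      + "{\"name\":\"room\",\"type\":\"string\"},"
      + "{\"name\":\"gift\",\"type\":\"string\"}]}");

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.put("bootstrap.servers", "localhost:9092");           // assumed host
    conf.put("schema.registry.url", "http://localhost:8081");  // assumed URL
    conf.put("group.id", "beach-promo");
    conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    conf.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
    conf.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    conf.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

    try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(conf);
         KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(conf)) {
      consumer.subscribe(Collections.singletonList("BookRoom"));
      while (true) {
        for (ConsumerRecord<String, GenericRecord> rec : consumer.poll(Duration.ofMillis(500))) {
          GenericRecord booking = rec.value();
          // The promo rule from the slide: beach property + platinum customer.
          if ("beach".equals(String.valueOf(booking.get("location")))
              && "platinum".equals(String.valueOf(booking.get("level")))) {
            GenericRecord gift = new GenericData.Record(GIFT_SCHEMA);
            gift.put("customer_id", booking.get("customer_id"));
            gift.put("room", booking.get("room"));
            gift.put("gift", "seaShell");
            producer.send(new ProducerRecord<>("RoomGifts", gift));
          }
        }
      }
    }
  }
}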
There is special value in standardizing parts of the schema
{ "type": "record",
  "name": "Event",
  "fields": [
    { "name": "headers", "type": {
        "name": "headers",
        "type": "record",
        "fields": [
          { "name": "source_app", "type": "string" },
          { "name": "host_app", "type": "string" },
          { "name": "destination_app", "type": ["null", "string"] },
          { "name": "timestamp", "type": "long" }
        ] } },
    { "name": "body", "type": "bytes" } ] }
Standard headers allow analyzing patterns of communication.
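A minimal sketch of populating those standardized headers before producing, assuming the Event schema above is saved as event.avsc; the app names and payload here are invented for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.nio.ByteBuffer;

public class EventBuilder {
  public static GenericRecord build(byte[] payload) throws Exception {
    Schema eventSchema = new Schema.Parser().parse(new File("event.avsc"));
    Schema headersSchema = eventSchema.getField("headers").schema();

    GenericRecord headers = new GenericData.Record(headersSchema);
    headers.put("source_app", "booking-service");     // assumed app name
    headers.put("host_app", "web-42");                // assumed host name
    headers.put("destination_app", null);             // the union allows null
    headers.put("timestamp", System.currentTimeMillis());

    GenericRecord event = new GenericData.Record(eventSchema);
    event.put("headers", headers);
    event.put("body", ByteBuffer.wrap(payload));      // Avro "bytes" maps to ByteBuffer
    return event;
  }
}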
In a Request-Response World – APIs between services are key
In an Async Stream Processing World – Message Schemas ARE the API
…except they stick around for a lot longer
Let's assume you see why you need schema compatibility…
How do we make it work for us?
Key Points:
• Schema Registry is efficient – due to caching
• Schema Registry is transparent to applications
• Schema Registry is highly available
• Schema Registry prevents bad data at the source
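"Transparent" in practice means the registry hides behind serializer configuration. A minimal sketch of a producer setup, where the only registry-specific pieces are the serializer class and the (assumed) schema.registry.url:

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import java.util.Properties;

public class RegistryProducer {
  public static Producer<String, GenericRecord> create() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    // The Avro serializer registers/fetches schemas and caches them locally,
    // so the registry is only contacted the first time each schema is used.
    props.put("value.serializer",
        "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("schema.registry.url", "http://localhost:8081");
    return new KafkaProducer<>(props);
  }
}

If a record doesn't match a compatible registered schema, serialization fails in the producer itself – which is how bad data gets stopped at the source.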
Big Benefits
• Schema management is part of your entire app lifecycle – from dev to prod
• Single source of truth for all apps and data stores
• No more “bad data” randomly breaking things
• Increased agility! Add, remove, and modify fields with confidence
• Teams can share and discover data
• Smaller messages on the wire and in storage
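The "smaller messages" point follows from Confluent's wire format: each Avro message carries a one-byte magic byte and a four-byte schema ID instead of the full schema text. A sketch of decoding that framing from raw message bytes (the example bytes are made up):

import java.nio.ByteBuffer;

public class WireFormat {
  public static void main(String[] args) {
    byte[] raw = {0, 0, 0, 0, 42};   // example: magic byte 0, schema ID 42
    ByteBuffer buf = ByteBuffer.wrap(raw);
    byte magic = buf.get();          // always 0 in the Confluent wire format
    int schemaId = buf.getInt();     // big-endian ID to look up in the registry
    System.out.println("magic=" + magic + ", schemaId=" + schemaId);
    // Everything remaining in buf is the compact Avro binary encoding of the record.
  }
}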
But the Best Part is…
[Diagram: an RDBMS feeds Kafka (with Schema Registry) through a JDBC Source connector, and Kafka feeds Hadoop/Hive through an HDFS Sink connector – the schema travels with the data at every hop]
Compatibility tips!
• Do you want to upgrade producer without touching consumers?
• You want *forward* compatibility.
• You can add fields.
• You can delete fields with defaults
• Do you want to update schema in a DWH but still read old messages?
• You want *backward* compatibility
• You can delete fields
• You can add fields with defaults
• Both?
• You want *full* compatibility.
• Default all things!
• Except the primary key which you never touch.
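These rules can be verified offline with plain Apache Avro, no registry required. A hedged sketch: v2 adds an email field with a default, which keeps the change fully compatible (the Customer schema is invented for illustration):

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatCheck {
  static Schema customer(String fields) {
    return new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":[" + fields + "]}");
  }

  public static void main(String[] args) {
    Schema v1 = customer("{\"name\":\"id\",\"type\":\"long\"}");
    Schema v2 = customer("{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"\"}");

    // Backward: a v2 reader decodes v1 data by filling in the default.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v2, v1).getType());   // COMPATIBLE

    // Forward: a v1 reader decodes v2 data by ignoring the extra field.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v1, v2).getType());   // COMPATIBLE
  }
}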
Summary!
1. Data quality is critical
2. Schemas are critical
3. Schema Registry is used to maintain quality from dev to prod
4. Schema Registry is in Confluent Open Source
Get Started with Apache Kafka Today!
https://www.confluent.io/downloads/
THE place to start with Apache Kafka!
• Thoroughly tested and quality assured
• More extensible developer experience
• Easy upgrade path to Confluent Enterprise
This diagram shows what Schema Registry is for. From left to right:
Schema Registry is specific to a Kafka cluster, so you can have one Schema Registry per cluster.
Applications can be third-party apps like Twitter or SFDC, or custom applications. Schema Registry is relevant to every producer that feeds messages into your cluster.
The Schema Registry exists outside your applications as a separate process.
Within an application, a component called the serializer serializes messages for delivery to the Kafka data pipeline. Confluent's Schema Registry is integrated into the serializer. Some other companies have created their own schema registries outside the serializer; the argument against that is it adds more complexity for the same functionality.
The process goes like this:
The serializer calls the Schema Registry to see whether it has a schema for the data the application wants to publish. If it does, the Schema Registry passes that schema to the application's serializer, which uses it to filter out incorrectly formatted messages. This keeps your data pipeline clean.
The format Confluent uses for its Schema Registry is Avro, an open and widely published schema format; most alternatives at the time were custom, in-house formats.
When you use the Confluent Schema Registry, it's automatically integrated into the serializer, so there's no extra effort on your part; your data pipeline simply stays clean. You just need all applications to call the Schema Registry when publishing.
The ultimate benefit of Schema Registry is that the consumer applications of your Kafka topics can successfully read the messages they receive.
Schema Registry is an industry best practice – used at Uber, Yelp, AirBnB, Ancestry, and Monsanto.
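To round out the flow described in these notes, a minimal sketch of the consumer side: the Avro deserializer reads each message's schema ID, fetches the writer's schema from the registry (then caches it), and returns a decoded record. Topic, group, and URLs are assumptions:

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class RegistryConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "example-group");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "io.confluent.kafka.serializers.KafkaAvroDeserializer");
    props.put("schema.registry.url", "http://localhost:8081");

    try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("RoomGifts"));
      while (true) {
        for (ConsumerRecord<String, GenericRecord> rec
            : consumer.poll(Duration.ofSeconds(1))) {
          // Fields come back exactly as the registered schema defines them.
          System.out.println(rec.value());
        }
      }
    }
  }
}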