Simplify Governance of Streaming Data

1
Simplify Governance of
Streaming Data
Gwen Shapira
Product Manager, Apache Kafka™ PMC

2
Attend the whole series!
Simplify Governance for Streaming Data in Apache Kafka
Date: Thursday, April 6, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Gwen Shapira, Product Manager, Confluent
Using Apache Kafka to Analyze Session Windows
Date: Thursday, March 30, 2017
Speaker: Michael Noll, Product Manager, Confluent
Monitoring and Alerting Apache Kafka with Confluent Control
Center
Speaker: Nick Dearden, Director, Engineering and Product
Data Pipelines Made Simple with Apache Kafka
Speaker: Ewen Cheslack-Postava, Engineer, Confluent
https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/
What’s New in Apache Kafka 0.10.2 and Confluent 3.2
Speaker: Clarke Patterson, Senior Director, Product Marketing

3
What is Data Governance?
Data Access
Master Data
Management
Data Agility
Data Reliability
Data Quality

4
To Grow a Successful Data Platform, You need Some Guarantees

5
If you can’t trust your data,
what can you trust?

6
Schemas Are a Key Part of This
1. Schemas: Enforced Contracts Between your Components
2. Schemas need to be agile but compatible
3. You need tools & process that will let you:
• Move fast and not break things

7
Cautionary tale: Uber
Producers
Service 1 Service 2
RedshiftS3
EMR
{created_at: “Dec 1,
2015”}
CREATE (
created_at: STRING )
parseDate(created_at, “%M %d,
%yyyy”)

8
Producers
Service 1 Service 2
RedshiftS3
EMR
{created_at:
1491368429}
CREATE (
created_at: STRING )
parseDate(created_at, “%M %d,
%yyyy”)
Cautionary tale: Uber

9
The right way to do things
MessageCustomer
RoomGifts
BookRoom
LoyalCustomers

10
The right way to do things
MessageCustomer
RoomGifts
BookRoom
LoyalCustomers
BeachPromo If (property.location == beach &&
customer.level == platinum)
sendMessage({topic: RoomGifts, customer_id: customer.id, room:
property.roomnum, gift: seaShell})

11
There is special value in standardizing parts of schema
{ "type": "record",
"name": "Event",
"fields": [{ "name": "headers", "type": {
“name” : “headers”,
“type”: ”record”
“fields”: [
{“name”: “source_app”, “type”: “string”},
{“name”: “host_app”, “type”: “string”},
{“name”: “destination_app”, “type”: [“null,“string”]},
{“name”: “timestamp”, “type”: “long”}
]},
{ "name": "body", "type": "bytes" }] }
Standard headers allowed
analyzing patterns of communication
S3

12
In Request Response World – APIs between services are key
In Async Stream Processing World – Message Schemas ARE the API
…except they stick around for a lot longer

13
Lets assume you see why you need
schema compatibility…
How do we make it work for us?

14
Schema Registry
Elastic
Cassandra
Streams
App
Example Consumers
Serializer
App 1
Serializer
App 2
!
Kafka Topic!
Schema
Registry
Define the expected fields for each Kafka topic
Automatically handle schema changes (e.g. new fields)
Prevent incompatible changes
Supports multi-datacenter environments

15
Key Points:
• Schema Registry is efficient – due to caching
• Schema Registry is transparent
• Schema Registry is Highly Available
• Schema Registry prevents bad data at the source

16
Big Benefits
• Schema Management is part of your entire app lifecycle – from dev
to prod
• Single source of truth for all apps and data stores
• No more “Bad Data” randomly breaking things
• Increased agility! add, remove and modify fields with confidence
• Teams can share and discover data
• Smaller messages on the wire and in storage

17
But the Best Part is…
RDBM
Hadoop /
Hive
Kafka w/
Schema
Registry
JDBC
Source
HDFS
Sink
Schema
Schema
Schema
Schema
Schema

19
Compatibility tips!
• Do you want to upgrade producer without touching consumers?
• You want *forward* compatibility.
• You can add fields.
• You can delete fields with defaults
• Do you want to update schema in a DWH but still read old messages?
• You want *backward* compatibility
• You can delete fields
• You can add fields with defaults
• Both?
• You want *full* compatibility.
• Default all things!
• Except the primary key which you never touch.

20
App Lifecycle
Developement
Source
Repository
Pull Request
CI Staging / Test
Production
“Nightly build”
Lets Try
Lets Do It!
Optional
Good
Idea
YES Almost
too late
Last
resort

21
Finding Schemas
1. Write Schemas -> Generate Code
2. Get Schemas from registry -> Generate code
3. Write code -> Generate Schemas

22
Summary!
1. Data Quality is critical
2. Schema are critical
3. Schema Registry is used to maintain quality
from Dev to Prod
4. Schema Registry is in Confluent Open Source

23
Attend the whole series!
Simplify Governance for Streaming Data in Apache Kafka
Date: Thursday, April 6, 2017
Speaker: Gwen Shapira, Product Manager, Confluent
Using Apache Kafka to Analyze Session Windows
Speaker: Michael Noll, Product Manager, Confluent
Monitoring and Alerting Apache Kafka with Confluent Control
Center
Speaker: Nick Dearden, Director, Engineering and Product
Data Pipelines Made Simple with Apache Kafka
Speaker: Ewen Cheslack-Postava, Engineer, Confluent
https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/
What’s New in Apache Kafka 0.10.2 and Confluent 3.2
Speaker: Clarke Patterson, Senior Director, Product Marketing

24
Get Started with Apache Kafka Today!
https://www.confluent.io/downloads/
THE place to start with Apache Kafka!
Thoroughly tested and quality
assured
More extensible developer
experience
Easy upgrade path to
Confluent Enterprise

25
30% off conference fee
code = kafpcf17
New York City | May 8, 2017
Presented by

Simplify Governance of Streaming Data

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Simplify Governance of Streaming Data

Similar to Simplify Governance of Streaming Data (20)

More from confluent

More from confluent (20)

Recently uploaded

Recently uploaded (20)

Simplify Governance of Streaming Data