Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. It covers Kafka architecture with some small examples from the command line, then expands on this with a multi-server example to demonstrate failover of brokers as well as consumers. It then goes through some simple Java client examples for a Kafka Producer and a Kafka Consumer. We have also expanded the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advanced Kafka Producers.
Kafka Tutorial - Introduction to Apache Kafka (Part 2)
™ Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting
Cloudurable provides Cassandra and Kafka Support on AWS/EC2
Kafka Tutorial (Part 2)
Java Producer and Consumer examples
Schema Registry
Tutorial on the Kafka Streaming Platform, Part 2
Lab: StockPrice Producer Java Example
StockPrice App to demo Advanced Producer
❖ StockPrice - holds a stock price: a name, dollars, and cents
❖ StockPriceKafkaProducer - configures and creates a KafkaProducer<String, StockPrice>, builds the StockSender list and a thread pool (ExecutorService), and submits each StockSender runnable to the thread pool
❖ StockAppConstants - holds the topic and broker list
❖ StockPriceSerializer - serializes a StockPrice into byte[]
❖ StockSender - generates somewhat random stock prices for a given StockPrice name; Runnable, 1 thread per StockSender
❖ Shows using KafkaProducer from many threads
StockPrice domain object
❖ has a name, dollars, and cents
❖ converts itself to JSON
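The StockPrice code itself is not reproduced in this transcript; a minimal sketch of a domain object matching the bullets above (field and method names assumed) is:

```java
// Hypothetical sketch of the StockPrice domain object described above:
// it holds a name, dollars, and cents, and can convert itself to JSON.
class StockPrice {
    private final String name;
    private final int dollars;
    private final int cents;

    StockPrice(String name, int dollars, int cents) {
        this.name = name;
        this.dollars = dollars;
        this.cents = cents;
    }

    // Converts this StockPrice into a JSON string by hand;
    // a real implementation might use a JSON library instead.
    String toJson() {
        return "{\"name\":\"" + name + "\",\"dollars\":" + dollars
                + ",\"cents\":" + cents + "}";
    }
}
```

The JSON form is what the custom serializer later turns into bytes for the Kafka record value.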
StockPriceKafkaProducer
❖ Import classes and set up the logger
❖ Create a createProducer method to create a KafkaProducer instance
❖ Create setupBootstrapAndSerializers to initialize bootstrap servers, client id, key serializer and custom serializer (StockPriceSerializer)
❖ Write the main() method - creates the producer, creates the StockSender list passing each instance the producer, creates a thread pool so every StockSender gets its own thread, and runs each StockSender in its own thread
StockPriceKafkaProducer imports, createProducer
❖ Import classes and set up the logger
❖ createProducer is used to create a KafkaProducer
❖ createProducer() calls setupBootstrapAndSerializers()
Configure Producer Bootstrap and Serializer
❖ Create setupBootstrapAndSerializers to initialize bootstrap servers, client id, key serializer and custom serializer (StockPriceSerializer)
❖ StockPriceSerializer will serialize StockPrice into bytes
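The slide's configuration code is not in this transcript; a sketch of the kind of Properties that setupBootstrapAndSerializers might build follows. Literal config keys are used here; the real code would likely use Kafka's ProducerConfig constants, and the value-serializer package name is an assumption:

```java
import java.util.Properties;

// Sketch of producer configuration: bootstrap servers, client id,
// key serializer, and a custom value serializer.
class ProducerConfigSketch {
    static Properties buildProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("client.id", "StockPriceKafkaProducer");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // package name assumed for the project's own StockPriceSerializer
        props.put("value.serializer",
                "com.cloudurable.kafka.StockPriceSerializer");
        return props;
    }
}
```

These Properties would then be passed to `new KafkaProducer<>(props)` inside createProducer.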
StockPriceKafkaProducer.main()
❖ main method - creates the producer
❖ creates the StockSender list, passing each instance the producer
❖ creates a thread pool (executorService)
❖ every StockSender runs in its own thread
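The thread-pool part of that flow can be sketched with plain JDK concurrency. A generic Runnable stands in for StockSender here (the real senders would also receive the shared KafkaProducer), since no broker is assumed:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the main() flow: submit each sender Runnable to a pool
// sized so every sender gets its own thread, as on the slide.
class MainFlowSketch {
    static int runSenders(List<Runnable> senders) {
        ExecutorService pool = Executors.newFixedThreadPool(senders.size());
        senders.forEach(pool::submit);
        pool.shutdown();                 // no new tasks; let senders finish
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return senders.size();
    }
}
```

Sharing one KafkaProducer across all sender threads is safe because KafkaProducer is thread-safe.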
StockAppConstants
❖ topic name for Producer example
❖ List of bootstrap servers
StockPriceKafkaProducer.getStockSenderList
StockPriceSerializer
❖ Converts StockPrice into byte array
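The serializer code is not shown in this transcript. The conversion step it performs can be sketched in plain Java; the real class would implement Kafka's Serializer<StockPrice> interface and call stockPrice.toJson():

```java
import java.nio.charset.StandardCharsets;

// Sketch of the value serializer: Kafka transports values as byte
// arrays, so the StockPrice's JSON form is UTF-8 encoded to bytes.
class StockPriceSerializerSketch {
    static byte[] serialize(String stockPriceJson) {
        return stockPriceJson.getBytes(StandardCharsets.UTF_8);
    }
}
```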
StockSender
❖ Generates somewhat random stock prices for a given StockPrice name
❖ StockSender is Runnable
❖ 1 thread per StockSender
❖ Shows using KafkaProducer from many threads
❖ Delays a random time between delayMin and delayMax
❖ then sends a random StockPrice between stockPriceHigh and stockPriceLow
StockSender imports, Runnable
❖ Imports KafkaProducer, ProducerRecord, RecordMetadata, StockPrice
❖ Implements Runnable, so it can be submitted to an ExecutorService
StockSender constructor
❖ takes a topic, high & low StockPrice, producer, and delay min & max
StockSender run()
❖ In a loop: creates a random record, sends the record, waits a random time
StockSender createRandomRecord
❖ createRandomRecord uses randomIntBetween
❖ creates a StockPrice and then wraps the StockPrice in a ProducerRecord
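The helper can be sketched with the JDK alone; wrapping the result in a ProducerRecord is omitted since it needs the Kafka client on the classpath:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of StockSender's helper: randomIntBetween returns a value in
// [min, max]. createRandomRecord would use it to pick dollars between
// stockPriceLow and stockPriceHigh, build a StockPrice, and wrap it
// in a ProducerRecord<String, StockPrice> keyed by the stock name.
class RandomPriceSketch {
    static int randomIntBetween(int min, int max) {
        return ThreadLocalRandom.current().nextInt(min, max + 1);
    }
}
```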
StockSender displayRecordMetaData
❖ Every 100 records, displayRecordMetaData gets called
❖ Prints out record info and RecordMetadata info: key, JSON value, topic, partition, offset, time
❖ uses the Future from the call to producer.send()
Run it
❖ Run ZooKeeper
❖ Run three Brokers
❖ run create-topic.sh
❖ Run StockPriceKafkaProducer
Run scripts
❖ run ZooKeeper from ~/kafka-training
❖ use bin/create-topic.sh to create the topic
❖ use bin/delete-topic.sh to delete the topic
❖ use bin/start-1st-server.sh to run Kafka Broker 0
❖ use bin/start-2nd-server.sh to run Kafka Broker 1
❖ use bin/start-3rd-server.sh to run Kafka Broker 2
Config is under the directory called config:
server-0.properties is for Kafka Broker 0
server-1.properties is for Kafka Broker 1
server-2.properties is for Kafka Broker 2
Run All 3 Brokers
Run the create-topic.sh script
The topic name is stock-prices, with three partitions and a replication factor of three.
Run StockPriceKafkaProducer
❖ Run StockPriceKafkaProducer from the IDE
❖ You should see log messages from StockSender(s)
with StockPrice name, JSON value, partition, offset,
and time
Lab: Adding an orderly shutdown - flush and close
Using flush and close
Shutdown the Producer nicely
❖ Handle Ctrl-C shutdown from Java
❖ Shutdown the thread pool and wait
❖ Flush the producer to send any outstanding batches, if using batches (producer.flush())
❖ Close the Producer (producer.close()) and wait
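The sequence above can be sketched with the JDK alone. A small FlushClose interface stands in for KafkaProducer's flush()/close() here, since no broker is assumed; the real code would register this via Runtime.getRuntime().addShutdownHook to catch Ctrl-C:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of an orderly shutdown: stop the sender threads first,
// then flush outstanding batches and close the producer.
class ShutdownSketch {
    interface FlushClose {       // stand-in for KafkaProducer
        void flush();
        void close();
    }

    static void shutdown(ExecutorService pool, FlushClose producer) {
        pool.shutdown();                       // stop accepting new senders
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS); // wait for senders
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        producer.flush();                      // send any outstanding batches
        producer.close();                      // release producer resources
    }
}
```

Flushing before close matters because close can otherwise discard batches that were never handed to the broker within the close timeout.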
Nice Shutdown
producer.close()
Restart the Producer then shut it down
❖ Add shutdown hook
❖ Start StockPriceKafkaProducer
❖ Now stop it (Ctrl-C or hit the stop button in the IDE)
Lab: Configuring Producer Durability
Lab: Batching Records
Lab: Adding Retries and Timeouts
Kafka Advanced Consumers
Currently not added to the sample
Working with Avro Serialization and the Schema Registry
Schema Management in Kafka
Understanding Avro and the Schema Registry for Kafka
Overview
❖ First we cover Avro
❖ Avro Schema
❖ Using a Schema to write data
❖ Using a Schema to read data
❖ Then we cover the Schema Registry
❖ Managing Schemas
❖ Kafka Avro Serialization
Avro
Introduction to Avro for Big Data and Streaming Architecture
Apache Avro Data Serialization for Hadoop and Kafka
Avro Overview
❖ Why Avro?
❖ What is Avro?
❖ Why are schemas important?
❖ Examples reading and writing with Avro
❖ Working with Code generation
❖ Working with Avro Generic Record
Why Avro?
❖ Direct mapping to JSON
❖ Compact binary format
❖ Very Fast
❖ Polyglot bindings
❖ Code generation for static languages
❖ Code generation not needed
❖ Supports Schema
❖ Supports compatibility checks
❖ Allows evolving your data over time
Apache Avro
❖ Data serialization system
❖ Data structures
❖ Binary data format
❖ Container file format to store persistent data
❖ RPC capabilities
❖ Does not require code generation to use
Avro Schemas
❖ Supports schemas for defining data structure
❖ Serializing and deserializing data uses the schema
❖ File schema
❖ Avro files store data with its schema
❖ RPC Schema
❖ The RPC protocol exchanges schemas as part of the handshake
❖ Schemas are written in JSON
Avro compared to…
❖ Similar to Thrift, Protocol Buffers, JSON, etc.
❖ Does not require code generation
❖ Avro needs less encoding as part of the data since it
stores names and types in the schema
❖ It supports evolution of schemas.
Why schemas?
❖ Multiple Producers and Consumers evolve over time
❖ Schemas keep your data clean
❖ NoSQL started out schema-less, but not so much anymore: Cassandra has schemas
❖ REST/JSON was schema-less and IDL-less, but not anymore with Swagger and RAML
Schema provides Robustness
❖ Streaming architecture support
❖ Challenging: Consumers and Producers evolve on different timelines
❖ Producers send streams of records that zero to many Consumers read
❖ All consumers or use cases are not known ahead of time
❖ Data could go to Hadoop or some store and be used years later in use cases you can't imagine
❖ Schemas help future-proof data; they make data more robust
❖ An Avro Schema with support for evolution is important
❖ A schema provides robustness: it provides metadata about the data, stored with the data
Future Compatibility
❖ Data record format compatibility is a hard problem to solve
❖ Schemas capture a point in time of what your data looked like when it was stored
❖ Data will evolve. New fields are added.
❖ Kafka streams are often recorded in data lakes - these need to be less rigid than the schema for operational data
❖ Producers can have Consumers that they never know about
❖ You can't test a Consumer that you don't know about
❖ Avro Schemas support agility
❖ 80% of data science is data cleanup; Schemas help
Avro Schema
The Avro schema is stored in src/main/avro by default.
Every field and record has a "doc" attribute which you should use to document the purpose of the field.
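The schema on this slide is not reproduced in the transcript. A minimal Employee-style Avro schema of the kind the deck uses later might look like this (the namespace matches the Schema Registry examples below; the field names are illustrative):

```json
{
  "namespace": "com.cloudurable.phonebook",
  "type": "record",
  "name": "Employee",
  "doc": "Represents an Employee at a company",
  "fields": [
    {"name": "firstName", "type": "string", "doc": "The person's given name"},
    {"name": "lastName", "type": "string", "doc": "The person's family name"},
    {"name": "age", "type": "int", "default": -1, "doc": "Age in years; -1 if unknown"}
  ]
}
```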
Code Generation
Employee Code Generation
Using Generated Avro class
Writing employees to an Avro File
Reading employees from a File
Using GenericRecord
Writing Generic Records
Reading using Generic Records
Avro Schema Validation
Avro supported types
❖ Records
❖ Arrays
❖ Enums
❖ Unions
❖ Maps
❖ String, Int, Boolean, Decimal, Timestamp, Date
Avro Types 1

avro type  json type  example    description
null       null       null       no value
boolean    boolean    true       true or false
int        integer    1          32-bit signed int
long       integer    1          64-bit signed int
float      number     1.1        single precision (32-bit) floating-point number
double     number     1.2        double precision (64-bit) floating-point number
bytes      string     "\u00FF"   sequence of 8-bit bytes
string     string     "foo"      unicode character sequence
record     object     {"a": 1}   complex type / object
enum       string     "Hi Mom"   enumerated type
array      array      [1]        array of any value
union      -          -          can be used to define nullable fields
map        object     {"a": 1}   map of string keys to values
fixed      string     "\u00ff"   fixed-length sequence of bytes
Avro Types 2

avro type         json type       example        description
date              int             2000           number of days from the unix epoch, 1 January 1970
time-millis       int             10800000       number of milliseconds after midnight
time-micros       long            1000000        microseconds after midnight
timestamp-millis  long            1494437796880  milliseconds since epoch
decimal           complex/object  22.555         specifies scale and precision like BigDecimal in Java
duration          complex/object  -              specifies duration in months, days and milliseconds
Fuller example Avro Schema
Avro Record Attributes
❖ name: name of the record (required)
❖ namespace: equates to packages or modules
❖ doc: documentation for future users of this schema
❖ aliases: array of aliases (alternate names)
❖ fields: an array of fields
Avro Field Attributes
❖ name: name of the field (required)
❖ doc: description of field (important for future usage)
❖ type: JSON object defining a schema, or a JSON string
naming a record definition (required)
❖ default: Default value for this field
❖ order: specifies sort ordering of record (optional).
❖ "ascending" (the default), "descending", "ignore"
❖ aliases: array of alternate names
Tips for Using Avro
❖ Avoid advanced features which are not supported by polyglot language mappings
❖ Think simple data objects
❖ Instead of magic strings, use enums (better validation)
❖ Document all records and fields in the schema
❖ Avoid complex union types (use Unions for nullable fields only)
❖ Avoid recursive types
❖ Use reasonable field names
❖ employee_id instead of id, and then use employee_id in all other records that have a field that refers to the employee_id
Avro
❖ Fast data serialization
❖ Supports data structures
❖ Supports Records, Maps, Arrays, and basic types
❖ You can use it directly or use Code Generation
Avro, the Confluent Schema Registry, and Kafka Avro Serialization
Managing Record Schemas in Kafka
Overview
❖ What is the Schema Registry?
❖ Why do you want to use it?
❖ Understanding Avro Schema Evolution
❖ Setting up and using Schema Registry
❖ Managing Schemas with REST interface
❖ Writing Avro Serializer based Producers
❖ Writing Avro Deserializer based Consumers
Confluent Schema Registry
❖ The Confluent Schema Registry manages Avro Schemas for Kafka clients
❖ Provides a REST interface for putting and getting Avro schemas
❖ Stores a history of schemas
❖ versioned
❖ allows you to configure compatibility settings
❖ supports evolution of schemas
❖ Provides serializers used by Kafka clients which handle schema storage and serialization of records using Avro
❖ You don't have to send the schema, just the schema id; the schema can be looked up in the Registry. Helps with speed!
Avro Producer Schema Process
❖ Producer creates a record/message, which is an Avro record
❖ Record contains schema and data
❖ Schema Registry Avro Serializer serializes the data and the schema id (just the id)
❖ Keeps a cache of registered schemas from the Schema Registry mapped to ids
❖ Consumer receives the payload and deserializes it with the Kafka Avro Deserializer, which uses the Schema Registry
❖ Deserializer looks up the full schema from the cache or the Schema Registry based on the id
Why Schema Registry?
❖ Consumer has its own schema, the one it expects the record/message to conform to
❖ A compatibility check is performed on the two schemas
❖ if they don't match but are compatible, then a payload transformation happens, aka Schema Evolution
❖ if they are not compatible, failure
❖ Kafka records have a Key and a Value, and schemas can be used for both
Schema Registry Actions
❖ Register schemas for keys and values of Kafka records
❖ List schemas (subjects); list all versions of a subject (schema)
❖ Retrieve a schema by version or id; get the latest version of a schema
❖ Check to see if a schema is compatible with a certain version
❖ Get the compatibility level setting of the Schema Registry
❖ BACKWARD, FORWARD, FULL, NONE
❖ Add compatibility settings to a subject/schema
Schema Compatibility
❖ Backward compatibility
❖ data written with older schema can be read with newer schema
❖ Forward compatibility
❖ data written with newer schema can be read with an older schema
❖ Full compatibility
❖ New version of a schema is backward and forward compatible
❖ None
❖ Schema will not be validated for compatibility at all
Schema Registry Config
❖ Compatibility can be configured globally or per schema
❖ Options are:
❖ NONE - don't check for schema compatibility
❖ FORWARD - check to make sure the last schema version is forward compatible with new schemas
❖ BACKWARD (default) - make sure the new schema is backward compatible with the latest
❖ FULL - make sure the new schema is forward and backward compatible from latest to new and from new to latest
Schema Evolution
❖ If an Avro schema is changed after data has been written to the store using an older version of that schema, then Avro might do a Schema Evolution
❖ Schema evolution happens only during deserialization at the Consumer
❖ If the Consumer's schema is different from the Producer's schema, then the value or key is automatically modified during deserialization to conform to the Consumer's reader schema
Schema Evolution 2
❖ Schema evolution is the automatic transformation of the Avro schema
❖ the transformation is between the version of the Consumer's schema and what the Producer put into the Kafka log
❖ When the Consumer schema is not identical to the Producer schema used to serialize the Kafka record, then a data transformation is performed on the Kafka record (key or value)
❖ If the schemas match, then there is no need to do a transformation
Allowed Schema Modifications
❖ Add a field with a default
❖ Remove a field that had a default value
❖ Change a field’s order attribute
❖ Change a field’s default value
❖ Remove or add a field alias
❖ Remove or add a type alias
❖ Change a type to a union that contains original type
Rules of the Road for modifying Schemas
❖ Provide a default value for fields in your schema
❖ Allows you to delete the field later
❖ Don’t change a field's data type
❖ When adding a new field to your schema, you have to
provide a default value for the field
❖ Don’t rename an existing field
❖ You can add an alias
Remember our example: Employee
Avro covered in the Avro/Kafka Tutorial
Let's say
❖ Employee did not have an age in version 1 of the schema
❖ Later we decided to add an age field with a default value of -1
❖ Now let's say we have a Producer using version 2, and a Consumer using version 1
Scenario: adding new field age w/ default value
❖ Producer uses version 2 of the Employee schema, creates a com.cloudurable.Employee record, sets the age field to 42, then sends it to Kafka topic new-employees
❖ Consumer consumes records from new-employees using version 1 of the Employee schema
❖ Since the Consumer is using version 1 of the schema, the age field is removed during deserialization
Scenario: adding new field age w/ default value 2
❖ The same Consumer modifies some fields, then writes the record to a NoSQL store
❖ When it does this, the age field is missing from the record that it writes to the NoSQL store
❖ Another client using version 2 reads the record from the NoSQL store
❖ Since the age field is missing from the record (because the Consumer wrote it with version 1), age is set to the default value of -1
Schema Registry Actions REST
❖ Register schemas for keys and values of Kafka records
❖ List schemas (subjects)
❖ List all versions of a subject (schema)
❖ Retrieve a schema by version or id
❖ get the latest version of a schema
Schema Registry Actions 2
❖ Check to see if a schema is compatible with a certain version
❖ Get the compatibility level setting of the Schema Registry
❖ BACKWARD, FORWARD, FULL, NONE
❖ Add compatibility settings to a subject/schema
Register a Schema
Register a Schema
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\": …}"}' \
  http://localhost:8081/subjects/Employee/versions
{"id":2}
List All Schemas
curl -X GET http://localhost:8081/subjects
["Employee","Employee2","FooBar"]
Working with versions
[1,2,3,4,5]
{"subject":"Employee","version":2,"id":4,"schema":"{\"type\":\"record\",\"name\":\"Employee\",\"namespace\":\"com.cloudurable.phonebook\", …
{"subject":"Employee","version":1,"id":3,"schema":"{\"type\":\"record\",\"name\":\"Employee\",\"namespace\":\"com.cloudurable.phonebook\", …
Working with Schemas
Changing Compatibility Checks
Incompatible Change
{"error_code":409,"message":"Schema being registered is incompatible with an e…
Incompatible Change
{"is_compatible":false}
Use Schema Registry
❖ Start up the Schema Registry server pointing to the ZooKeeper cluster
❖ Import the Kafka Avro Serializer and Avro jars
❖ Configure the Producer to use the Schema Registry
❖ Use KafkaAvroSerializer from the Producer
❖ Configure the Consumer to use the Schema Registry
❖ Use KafkaAvroDeserializer from the Consumer
Start up Schema Registry Server
cat ~/tools/confluent-3.2.1/etc/schema-registry/schema-registry.properties
listeners=http://0.0.0.0:8081
kafkastore.connection.url=localhost:2181
kafkastore.topic=_schemas
debug=false
Import Kafka Avro Serializer & Avro Jars
Configure Producer to use Schema Registry
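The slide's configuration code is not in this transcript; a sketch of the kind of producer Properties it would show follows. Literal config keys are used here (real code would likely use the Confluent and Kafka config constants), and the registry URL matches the listener configured earlier:

```java
import java.util.Properties;

// Sketch of producer configuration for the Schema Registry:
// the Confluent Avro serializer plus the registry URL.
class AvroProducerConfigSketch {
    static Properties buildProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // where the serializer registers and looks up schemas
        props.put("schema.registry.url", "http://localhost:8081");
        return props;
    }
}
```

The consumer side is symmetric: it sets KafkaAvroDeserializer for the key and value deserializer properties along with the same schema.registry.url.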
Use KafkaAvroSerializer from Producer
Configure Consumer to use Schema Registry
Use KafkaAvroDeserializer from Consumer
Schema Registry
❖ Confluent provides the Schema Registry to manage Avro Schemas for Kafka Consumers and Producers
❖ Avro provides Schema Migration
❖ Confluent uses Schema compatibility checks to see if the Producer's schema and Consumer's schema are compatible, and to do Schema evolution if needed
❖ Use KafkaAvroSerializer from the Producer
❖ Use KafkaAvroDeserializer from the Consumer
Editor's Notes
Kafka Tutorial - Introduction to Apache Kafka
Avro supports direct mapping to JSON as well as a compact binary format.
It is a very fast serialization format.
Avro is widely used in the Hadoop ecosystem.
Avro supports polyglot bindings to many programming languages and a code generation for static languages. For dynamically typed languages, code generation is not needed.
Another key advantage of Avro is its support of evolutionary schemas which supports compatibility checks, and allows evolving your data over time.
Apache Avro™ is a data serialization system.
Avro provides data structures, binary data format, container file format to store persistent data and RPC capabilities. Avro does not require code generation to use.
Integrates well with JavaScript, Python, Ruby and Java.
Avro data format is defined by Avro schemas. When deserializing data, the schema is used. Data is serialized based on the schema, and schema is sent with data. Avro data plus schema is fully self-describing.
Avro files store data with their schema. Avro RPC is also based on schemas; part of the RPC protocol exchanges schemas as part of the handshake.
When Avro is used in RPC, the client and server exchange schemas in the connection handshake.
Avro schemas are written in JSON.
Avro is similar to Thrift, Protocol Buffers, JSON, etc. Avro does not require code generation. Avro needs less encoding as part of the data since it stores names and types in the schema. It supports evolution of schemas.
There was a trend towards schema-less as part of the NoSQL movement, but that pendulum has swung back a bit; e.g., Cassandra has schemas. REST/JSON was schema-less and IDL-less, but not anymore with Swagger, API gateways, and RAML.
Now the trend is more towards schemas that can evolve and Avro fits well in this space.
Streaming architecture like Kafka supports decoupling by sending data in streams to an unknown number of consumers.
Streaming architecture is challenging as Consumers and Producers evolve on different timelines.
Producers send a stream of records that zero to many Consumers read. Not only are there multiple consumers but data might end up in Hadoop or some other store and used for use cases you did not even imagine. Schemas help future proof your data and make it more robust. Supporting all use cases future (Big Data), past (older Consumers) and current use cases is not easy without a schema. Avro schema with its support for evolution is essential for making the data robust for streaming architectures like Kafka, and with the metadata that schema provides, you can reason on the data. Having a schema provides robustness in providing meta-data about the data stored in Avro records which are self-documenting the data.
Data record format compatibility is a hard problem to solve with streaming architecture and Big Data. Avro schemas are not a cure-all, but essential for documenting and modeling your data.
Avro Schema definitions capture a point in time of what your data looked like when it was recorded, since the schema is saved with the data. Data will evolve. New fields are added. Since streams often get recorded in data lakes like Hadoop and those records can represent historical data, not operational data,
it makes sense that data streams and data lakes have a less rigid, more evolving schema than the schema of the operational relational database or Cassandra cluster. It makes sense to have a rigid schema for operational data, but not data that ends up in a data lake.
With a streaming platform, consumers and producers can change all of the time and evolve quite a bit. Producers can have Consumers that they never know about. You can't test a Consumer that you don't know about. For agility's sake, you don't want to update every Consumer every time a Producer adds a field to a Record. These types of updates are not feasible without support for a Schema.
There are plugins for Maven and Gradle to generate code based on Avro schemas.
This gradle-avro-plugin is a Gradle plugin that uses Avro tools to do Java code generation for Apache Avro. This plugin supports Avro schema files (avsc), and Avro RPC IDL (avdl). For Kafka you only need avsc. Notice that we did not generate setter methods. This makes the instances somewhat immutable.
The plugin generates the files and puts them under build/generated-main-avro-java.
The Employee class has a constructor and has a builder.
The above shows serializing an Employee list to disk. In Kafka, we will not be writing to disk directly.
We are just showing how so you have a way to test Avro serialization, which is helpful when debugging schema incompatibilities.
Note we create a DatumWriter, which converts Java instance into an in-memory serialized format. SpecificDatumWriter is used with generated classes like Employee.
DataFileWriter writes the serialized records to the employee.avro file.
The above deserializes employees from the employees.avro file.
Deserializing is similar to serializing but in reverse. We create a SpecificDatumReader to convert in-memory serialized items into instances of our generated Employee class.
The DatumReader reads records from the file by calling next.
Another way to read is using forEach as follows:
final DataFileReader<Employee> dataFileReader = new DataFileReader<>(file, empReader);dataFileReader.forEach(employeeList::add);
You can use an Avro GenericRecord instead of using generated code.
You can write to Avro files using GenericRecords as well.
You can read from Avro files using generic records as well.
Avro will validate the data types when it serializes and deserializes the data.
The document https://avro.apache.org/docs/current/spec.html#Protocol+Declaration describes all of the supported types.
Decimal “byte array must contain the two's-complement representation of the unscaled integer value in big-endian byte order”.
The above has examples of default values, arrays, primitive types, Records within records, enums, and more.
Notice that the nickName is an optional field and that we modeled that with a union.
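A sketch of what such a schema file might look like, with nickName modeled as a union with null; the surrounding fields and doc strings are illustrative:

```json
{
  "namespace": "com.cloudurable",
  "type": "record",
  "name": "Employee",
  "doc": "Represents an Employee at a company",
  "fields": [
    {"name": "firstName", "type": "string", "doc": "The person's given name"},
    {"name": "nickName", "type": ["null", "string"], "default": null,
     "doc": "The person's nickname; optional"},
    {"name": "age", "type": "int", "default": -1, "doc": "Age in years, -1 if unknown"}
  ]
}
```

Because the union lists "null" first and the default is null, the field can simply be omitted when building a record.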
The doc attribute is imperative for future usage as it documents what the fields and records are supposed to represent. Remember that this data can outlive systems that produced it. A self-documenting schema is critical for a robust system.
Avoid advanced Avro features which are not supported by polyglot language mappings. Think simple data transfer objects or structs. Don't use magic strings, use enums instead as they provide better validation.
Document all records and fields in the schema. Documentation is imperative for future use: it records what the fields and records represent. A self-documenting schema is critical for a robust streaming system and for Big Data. Don't use complex union types; use unions for nullable fields only, and avoid recursive types at all costs.
Use reasonable field names and use them consistently across records. For example, use `employee_id` instead of `id`, and then use `employee_id` in every other record that has a field referring to the `employee_id` from Employee.
Confluent Schema Registry stores Avro Schemas for Kafka producers and consumers. The Schema Registry provides a RESTful interface for managing Avro schemas and allows the storage of a history of versioned schemas. The Confluent Schema Registry supports checking schema compatibility for Kafka, and you can configure compatibility settings to support the evolution of schemas using Avro. The Kafka Avro serialization project provides the serializers: Kafka Producers and Consumers that use Kafka Avro serialization handle schema management and serialization of records using Avro and the Schema Registry. When using the Confluent Schema Registry, Producers don't have to send the schema, just the schema id, which is unique. The consumer uses the schema id to look up the full schema from the Confluent Schema Registry if it is not already cached. Not sending the schema with each record or batch of records saves time and speeds up serialization, as only the id of the schema is sent.
The Kafka Producer creates a record/message, which is an Avro record. The record contains a schema id and data. With the Kafka Avro Serializer, the schema is registered if needed, and then the data and the schema id are serialized. The Kafka Avro Serializer keeps a cache of schemas registered with the Schema Registry, mapped to their schema ids.
Consumers receive payloads and deserialize them with Kafka Avro Deserializers, which use the Confluent Schema Registry. The Deserializer looks up the full schema from its cache or from the Schema Registry based on the id.
The consumer has its own schema, which could be different from the producer's. The consumer schema is the schema the consumer expects the record/message to conform to. With the Schema Registry, a compatibility check is performed, and if the two schemas don't match but are compatible, then the payload transformation happens via Avro Schema Evolution. Kafka records can have a Key and a Value, and both can have a schema.
The Schema Registry can store schemas for keys and values of Kafka records. It can also list schemas by subject, list all versions of a subject (schema), retrieve a schema by version or id, and get the latest version of a schema. Importantly, the Schema Registry can check whether a schema is compatible with a certain version. There is a compatibility level (BACKWARD, FORWARD, FULL, NONE) setting for the Schema Registry as a whole and for individual subjects. You can manage schemas via a REST API with the Schema Registry.
Backward compatibility means data written with an old schema is readable with a newer schema. Forward compatibility means data written with a newer schema is readable with old schemas. Full compatibility means a new version of a schema is both backward and forward compatible. NONE disables schema validation and is not recommended: if you set the level to NONE, then the Schema Registry just stores the schema, and it will not be validated for compatibility at all.
Schema Registry Config
Schema compatibility checks can be configured globally or per subject.
The compatibility setting is one of the following:
• NONE - don't check for schema compatibility
• FORWARD - check that data written with the new schema is readable using the latest registered schema
• BACKWARD (default) - check that the new schema can read data written with the latest registered schema
• FULL - check that the new schema is both forward and backward compatible with the latest schema
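The level can be set globally or per subject through the Schema Registry's REST interface. The requests below are a sketch; the host, port, and the employee-value subject name are assumptions:

```
PUT /config HTTP/1.1
Host: localhost:8081
Content-Type: application/vnd.schemaregistry.v1+json

{"compatibility": "FULL"}

PUT /config/employee-value HTTP/1.1
Host: localhost:8081
Content-Type: application/vnd.schemaregistry.v1+json

{"compatibility": "BACKWARD"}
```

The first request sets the global compatibility level; the second overrides it for a single subject.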
Schema Evolution
If an Avro schema is changed after data has been written to a store using an older version of that schema, then Avro might do a Schema Evolution when you try to read that data.
From the Kafka perspective, schema evolution happens only during deserialization at the Consumer (read). If the Consumer's schema is different from the Producer's schema, then the value or key is automatically modified during deserialization to conform to the consumer's reader schema, if possible.
Avro schema evolution is an automatic transformation of the record between the consumer's schema version and the schema the producer used to write to the Kafka log. When the Consumer schema is not identical to the Producer schema used to serialize the Kafka Record, a data transformation is performed on the Kafka record's key or value. If the schemas match, no transformation is needed.
Allowed Modification During Schema Evolution
You can add a field with a default to a schema. You can remove a field that had a default value. You can change a field's order attribute. You can change a field's default value to another value, or add a default value to a field that did not have one. You can remove or add a field alias (keep in mind that this could break some consumers that depend on the alias). You can change a type to a union that contains the original type. If you stick to the above, then your schema can use Avro's schema evolution when reading with an old schema.
Rules of the Road for modifying Schema
If you want to make your schema evolvable, then follow these guidelines. Provide a default value for fields in your schema, as this allows you to delete the field later. Never change a field's data type. When adding a new field to your schema, you have to provide a default value for the field. Don't rename an existing field; add an alias instead.
Let’s use an example to talk about this. The following example is from our Avro tutorial.
Let’s say our Employee record did not have an age in version 1 of the schema and then later we decided to add an age field with a default value of -1. Now let’s say we have a Producer using version 2 of the schema with age, and a Consumer using version 1 with no age.
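Version 2 of the Employee schema might look like this; fields other than age are illustrative:

```json
{
  "namespace": "com.cloudurable",
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int", "default": -1}
  ]
}
```

Version 1 is the same record without the age field; because age has a default of -1, a reader using version 1 can still consume records written with version 2, and vice versa.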
The Producer uses version 2 of the Employee schema and creates a com.cloudurable.Employee record, and sets age field to 42, then sends it to Kafka topic new-employees. The Consumer consumes records from new-employees using version 1 of the Employee Schema. Since Consumer is using version 1 of the schema, the age field gets removed during deserialization.
The same consumer modifies some records and then writes the record to a NoSQL store. When the Consumer does this, the age field is missing from the record that it writes to the NoSQL store. Another client using version 2 of the schema which has the age, reads the record from the NoSQL store. The age field is missing from the record because the Consumer wrote it with version 1, thus the client reads the record and the age is set to default value of -1.
If you added the age field and it was not optional, i.e., the age field did not have a default, then the Schema Registry would reject the schema, and the Producer could never add it to the Kafka log.
Using the Schema Registry REST API
Recall that the Schema Registry allows you to manage schemas using the following operations:
• Store schemas for keys and values of Kafka records
• List schemas by subject
• List all versions of a subject (schema)
• Retrieve a schema by version
• Retrieve a schema by id
• Retrieve the latest version of a schema
• Perform compatibility checks
• Set the compatibility level globally
• Set the compatibility level per subject
Recall that all of this is available via a REST API with the Schema Registry.
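A few of the Schema Registry's documented REST endpoints, as a sketch; the employee-value subject name and the schema id are assumptions:

```
GET  /subjects                                   list all subjects
GET  /subjects/employee-value/versions           list all versions of a subject
GET  /subjects/employee-value/versions/latest    get the latest schema for a subject
GET  /schemas/ids/2                              get a schema by its global id
POST /subjects/employee-value/versions           register a new schema version
POST /compatibility/subjects/employee-value/versions/latest
                                                 test a schema for compatibility
```

Requests that carry a schema use the Content-Type application/vnd.schemaregistry.v1+json.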
If you have a good HTTP client, you can perform all of the above operations via the REST interface of the Schema Registry. I wrote a little example to do this so I could understand the Schema Registry a little better, using the OkHttp client from Square (com.squareup.okhttp3:okhttp:3.7.0+).
I suggest running the example and trying to force incompatible schemas to the Schema Registry and note the behavior for the various compatibility settings.
Now let’s cover writing consumers and producers that use Kafka Avro Serializers which in turn use the Schema Registry and Avro.
We will need to start up the Schema Registry server pointing to our ZooKeeper cluster. Then we will need to import the Kafka Avro Serializer and Avro jars into our Gradle project. You will then need to configure the Producer to use the Schema Registry and the KafkaAvroSerializer. To write the consumer, you will need to configure it to use the Schema Registry and the KafkaAvroDeserializer.
Notice that we include the Kafka Avro Serializer lib (io.confluent:kafka-avro-serializer:3.2.1) and the Avro lib (org.apache.avro:avro:1.8.1).
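In build.gradle, the dependency block looks something like this; the Confluent repository URL is an assumption based on Confluent's public Maven repository:

```groovy
repositories {
    mavenCentral()
    // Confluent artifacts such as kafka-avro-serializer are not on Maven Central
    maven { url "http://packages.confluent.io/maven/" }
}

dependencies {
    compile "org.apache.avro:avro:1.8.1"
    compile "io.confluent:kafka-avro-serializer:3.2.1"
}
```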
To learn more about the Gradle Avro plugin, please read this article on using Avro.
http://cloudurable.com/blog/avro/index.html
Notice that we configure the schema registry and the KafkaAvroSerializer as part of the Producer setup.
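Boiled down to configuration properties, the producer setup looks something like this; the broker and registry addresses are assumptions:

```properties
bootstrap.servers=localhost:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
# Where the serializer registers and looks up schemas
schema.registry.url=http://localhost:8081
```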
The producer code is like it was before in other examples.
Notice just like the producer we have to tell the consumer where to find the Schema Registry, and we have to configure the Kafka Avro Deserializer.
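The equivalent consumer configuration might look like this; the addresses and group id are assumptions:

```properties
bootstrap.servers=localhost:9092
group.id=employee-consumer-group
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
schema.registry.url=http://localhost:8081
# Deserialize into the generated Employee class instead of a GenericRecord
specific.avro.reader=true
```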
The consumer code is like it was before in other examples.
Confluent provides the Schema Registry to manage Avro schemas for Kafka Consumers and Producers. Avro provides the schema evolution necessary for streaming and big data architectures.
Confluent uses schema compatibility checks to see whether the Producer's schema and the Consumer's schema are compatible and to do schema evolution if needed. You use KafkaAvroSerializer in the Producer and point it at the Schema Registry. You use KafkaAvroDeserializer in the Consumer and point it at the Schema Registry.