2. Schedule
Tech Talks | Date/Time
TT#1 Dive into Apache Kafka® | June 4th (Thursday), 10:30am - 11:30am AEST
TT#2 Introduction to Streaming Data and Stream Processing with Apache Kafka | July 2nd (Thursday), 10:30am - 11:30am AEST
TT#3 Confluent Schema Registry | August 6th (Thursday), 10:30am - 11:30am AEST
TT#4 Kafka Connect | September 3rd (Thursday), 10:30am - 11:30am AEST
TT#5 Avoiding Pitfalls with Large-Scale Kafka Deployments | October 1st (Thursday), 10:30am - 11:30am AEST
3. Disclaimer…
• Some of you may know what Kafka is or have used it already...
• If that’s the case, sit back, enjoy a refresher on Kafka, and learn about Confluent
4. Business Digitization Trends are Revolutionizing your Data Flow
Drivers: Mobile | Cloud | Microservices | Internet of Things | Machine Learning
● Massive volumes of new data generated every day
● Distributed across apps, devices, datacenters, clouds
● Structured, unstructured, polymorphic
5. Legacy Data Infrastructure Solutions Have Architectural Flaws
[Diagram: data flowing between apps, transactional databases, analytics databases and a DWH via MOM, ETL and an ESB]
These solutions can be:
● Batch-oriented, instead of event-oriented in real time
● Complex to scale at high throughput
● Connected point-to-point, instead of publish/subscribe
● Lacking data persistence and retention
● Incapable of in-flight message processing
6. Modern Architectures are Adapting to New Data Requirements
[Diagram: the same legacy architecture, now extended with NoSQL DBs and Big Data Analytics]
But how do we revolutionize data flow in a world of exploding, distributed and ever-changing data?
7. The Solution is a Streaming Platform for Real-Time Data Processing
A Streaming Platform provides a single source of truth about your data to everyone in your organization.
[Diagram: apps, transactional and analytics databases, DWH, NoSQL DBs and Big Data Analytics all connected through a central Streaming Platform]
8. Apache Kafka®: Open Source Streaming Platform Battle-Tested at Scale
At LinkedIn, the birthplace of Apache Kafka:
● More than 1 petabyte of data in Kafka
● Over 4.5 trillion messages per day
● 60,000+ data streams
● Source of all data warehouse & Hadoop data
● Over 300 billion user-related events per day
23. Creating a Topic
$ kafka-topics --zookeeper zk:2181 \
    --create \
    --topic my-topic \
    --replication-factor 3 \
    --partitions 3
Or use the new AdminClient API!
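A minimal sketch of the same topic creation via the Java AdminClient (the broker address is an assumption; the topic name, partitions and replication factor mirror the CLI example above):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // assumed broker address

try (AdminClient admin = AdminClient.create(props)) {
    // my-topic with 3 partitions and replication factor 3, as in the CLI example
    NewTopic topic = new NewTopic("my-topic", 3, (short) 3);
    admin.createTopics(Collections.singleton(topic)).all().get(); // block until created
}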
30. The Serializer
Kafka doesn’t care about what you send to it as long as it’s been converted to a byte stream beforehand.
JSON | CSV | Avro | Protobuf | XML (if you must)
→ SERIALIZERS →
01001010 01010011 01001111 01001110 … (raw bytes on the wire)
Reference: https://kafka.apache.org/10/documentation/streams/developer-guide/datatypes.html
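Under the hood, a serializer is just a function from your type to bytes. A minimal sketch, essentially what the built-in StringSerializer does:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serializer;

// Minimal custom serializer: a String becomes its UTF-8 bytes.
public class SimpleStringSerializer implements Serializer<String> {
    @Override
    public byte[] serialize(String topic, String data) {
        return data == null ? null : data.getBytes(StandardCharsets.UTF_8);
    }
}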
31. The Serializer
private Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "broker1:9092,broker2:9092");
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
producer = new KafkaProducer<String, SpecificRecord>(kafkaProps);
Reference: https://kafka.apache.org/10/documentation/streams/developer-guide/datatypes.html
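With those properties set, sending a record is a single call. A sketch only: the topic name, key, and the Avro-generated Payment class are hypothetical:

// "payments", "customer-123" and the Avro-generated Payment class are illustrative
Payment payment = new Payment("txn-42", 100.0d);
producer.send(new ProducerRecord<String, SpecificRecord>("payments", "customer-123", payment));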
32. Record Keys and why they’re important - Ordering
Producer Record: Topic | [Partition] | [Key] | Value → partitioner
Record keys determine the partition with the default Kafka partitioner.
If a key isn’t provided, messages will be produced in a round-robin fashion.
33.–36. Record Keys and why they’re important - Ordering
[Animation: records with keys AAAA, BBBB, CCCC and DDDD are each consistently routed to the same partition]
Record keys determine the partition with the default Kafka partitioner, and therefore guarantee order for a key.
Keys are used in the default partitioning algorithm:
partition = hash(key) % numPartitions
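In the Java client, hash(key) is murmur2. A simplified sketch of the keyed path of the default partitioner (the real implementation also handles records with no key):

import org.apache.kafka.common.utils.Utils;

// Keyed records: murmur2-hash the key bytes, force non-negative,
// then take the result modulo the number of partitions.
static int partitionForKey(byte[] keyBytes, int numPartitions) {
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}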
37. Record Keys and why they’re important – Key Cardinality
Key cardinality affects the amount of work done by consumers in a group. Poor key choice can lead to uneven workloads.
Keys in Kafka don’t have to be primitives, like strings or ints. Like values, they can be anything: JSON, Avro, etc… So create a key that will evenly distribute groups of records around the partitions.
Car·di·nal·i·ty /ˌkärdəˈnalədē/ noun: the number of elements in a set or other grouping, as a property of that grouping.
39. A Basic Java Consumer
final Consumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList(topic));
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            // do some work with each record
            System.out.printf("offset=%d, key=%s, value=%s%n",
                    record.offset(), record.key(), record.value());
        }
    }
} finally {
    consumer.close();
}
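For reference, a minimal props setup that the snippet above assumes (broker addresses and group id are placeholders; group.id is required for group management):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder brokers
props.put("group.id", "my-consumer-group");                  // placeholder group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");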
40. Consuming From Kafka – Single Consumer
One consumer will consume from all partitions, maintaining partition offsets.
41. Consuming From Kafka – Grouped Consumers
Consumer groups (here C1 and C2) are separate, operating independently.
42. Consuming From Kafka – Grouped Consumers
Consumers in a consumer group share the workload.
45. Consuming From Kafka – Grouped Consumers
[Diagram: the failed consumer’s partitions are reassigned among the remaining consumers]
Another consumer in the group picks up for the failed consumer. This is a rebalance.
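To observe or react to a rebalance in code, subscribe with a ConsumerRebalanceListener. A minimal sketch, reusing the consumer and topic from slide 39:

import java.util.Arrays;
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(Arrays.asList(topic), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before partitions are taken away; a good place to commit offsets
        System.out.println("Revoked: " + partitions);
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called once the rebalance has assigned (possibly different) partitions
        System.out.println("Assigned: " + partitions);
    }
});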
46. Use a Good Kafka Client!
Clients
● Java/Scala - the default clients, shipped with Kafka
● C/C++ - https://github.com/edenhill/librdkafka
● C#/.Net - https://github.com/confluentinc/confluent-kafka-dotnet
● Python - https://github.com/confluentinc/confluent-kafka-python
● Golang - https://github.com/confluentinc/confluent-kafka-go
● Node/JavaScript - https://github.com/Blizzard/node-rdkafka (not supported by Confluent!)
New Kafka features will only be available to modern, updated clients!
47. Without Confluent and Kafka
[Diagram: siloed integrations across Line of Business 01, Line of Business 02 and public cloud]
Data architecture is rigid, complicated, and expensive, making it too hard and cost-prohibitive to get mission-critical apps to market quickly.
48. Confluent & Kafka reimagine this as the central nervous system of your business
Universal Event Pipeline: data stores, logs, 3rd-party apps and custom apps/microservices (mainframes, device logs, Hadoop, data warehouse, Splunk, …)
Contextual Event-Driven Applications: real-time inventory, real-time fraud detection, real-time customer 360, machine learning models, real-time data transformation, …
49. Apache Kafka is one of the most popular open source projects in the world
Confluent are the Kafka Experts: founded by the creators of Apache Kafka, Confluent continues to be the major contributor.
Confluent invests in Open Source: the 2020 re-architecture removes the scalability-limiting use of ZooKeeper in Apache Kafka.
50. Future-proof event streaming
Kafka re-engineered as a fully managed, cloud-native service by the original creators of, and major contributors to, Kafka
● Global: automated disaster recovery; global applications with geo-awareness
● Infinite: efficient and infinite data retention with tiered storage; unlimited horizontal scalability for clusters
● Elastic: easy multi-cloud orchestration; persistent bridge to cloud from on-prem
51. Make your applications more valuable with real-time insights enabled by next-gen architecture
Data integration: database changes, log events, IoT events, web events
Apps driven by real-time data: connected car, fraud detection, customer 360, personalized promotions, quality assurance, SIEM/SOC, inventory management, proactive patient care, sentiment analysis, capital management
Modernize your apps
52. Build a bridge to the cloud for your data
Ensure availability and connectivity regardless of where your data lives
● Private Cloud: deploy on premises with Confluent Platform
● Public/Multi-Cloud: leverage a fully managed service with Confluent Cloud
● Hybrid Cloud: build a persistent bridge from datacenter to cloud
53. Confluent Platform
Developer - Unrestricted Developer Productivity:
● Event Streaming Database: ksqlDB
● Rich Pre-built Ecosystem: Connectors | Hub | Schema Registry
● Multi-language Development: Non-Java Clients | REST Proxy
Operator - Efficient Operations at Scale:
● Dynamic Performance & Elasticity: Auto Data Balancer | Tiered Storage
● Flexible DevOps Automation: Operator | Ansible
● GUI-driven Management & Monitoring: Control Center
Architect - Production-stage Prerequisites:
● Global Resilience: Multi-region Clusters | Replicator
● Data Compatibility: Schema Registry | Schema Validation
● Enterprise-grade Security: RBAC | Secrets | Audit Logs
Foundation: Apache Kafka (Open Source | Community licensed), with Freedom of Choice and Committer-driven Expertise
Delivery: self-managed software or fully managed cloud service, backed by Training, Partners, Enterprise Support and Professional Services
54. Project Metamorphosis
Unveiling the next-gen event streaming platform
Listen to the replay and sign up for updates: cnfl.io/pm
Jay Kreps, Co-founder and CEO, Confluent
55. Download your Apache Kafka and Stream Processing O'Reilly Book Bundle
Download at: https://www.confluent.io/apache-kafka-stream-processing-book-bundle/