A Tour of Apache Kafka
Engineer, Confluent Inc.
1. Technical Overview of Apache Kafka
2. Use Cases
What is Apache Kafka?
Kafka is a streaming platform.
A distinct tool in your toolbox, like a relational database or a traditional message queue.
A streaming platform encourages architectures that emphasize events and
changes to data (data in motion, not data at rest).
Widely applicable. E.g. consider Walmart.
Who Uses Kafka Today?
● 35% of the Fortune 500 + thousands of companies worldwide use Kafka
● Across all industries
● High growth of usage within companies
Core Kafka Pt. 1
Traditional messaging: move data
Kafka: make data available
○ High performance
○ Robust horizontal scalability
● Suitable for real-time, streaming and batch operations
● Ad-hoc consumption & reprocessing
○ Easier to debug and reason about vs. ephemeral data
○ Auditable by default
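The "make data available" model and the ad-hoc reprocessing bullet above can be sketched as a minimal in-memory log. All names here are illustrative, not Kafka's API; real Kafka persists the log to disk and tracks offsets per consumer group:

```python
# Minimal sketch of Kafka's core idea: an append-only log that consumers
# read at their own pace and can re-read (reprocess) from any offset.
class Log:
    def __init__(self):
        self.records = []              # append-only storage

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # the record's offset

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0                # each consumer tracks its own position

    def poll(self):
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return batch

    def seek(self, offset):
        self.offset = offset           # rewind to reprocess old data

log = Log()
for event in ["page_view", "click", "purchase"]:
    log.append(event)

c = Consumer(log)
first = c.poll()      # consumes all three events
c.seek(0)             # ad-hoc reprocessing: rewind to the beginning
replayed = c.poll()   # same events again; reading never destroys data
```

Because consuming only advances a per-consumer offset, the same data can be audited, debugged, or reprocessed at any time, unlike ephemeral message-queue delivery.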
Core Kafka Pt. 2: Why Logs?
Core Kafka Pt. 3 - Scaling
Kafka topics are partitioned logs
● Ordering is guaranteed per partition only
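Why per-partition ordering still covers most needs: records with the same key are routed to the same partition, so they stay ordered relative to each other. A sketch (Kafka's default partitioner hashes the key with murmur2; `crc32` here is just an illustrative stand-in):

```python
# Sketch of key-based partitioning: same key -> same partition,
# so all of one key's records remain in produce order.
from zlib import crc32

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def produce(key, value):
    p = crc32(key.encode()) % NUM_PARTITIONS   # stand-in for murmur2
    partitions[p].append((key, value))
    return p

for i in range(5):
    produce("user-42", f"event-{i}")   # same key -> same partition
produce("user-7", "other-event")       # other keys may land elsewhere

target = crc32(b"user-42") % NUM_PARTITIONS
# All of user-42's events landed on one partition, in produce order:
ordered = [v for k, v in partitions[target] if k == "user-42"]
```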
Core Kafka Pt. 4 - Durability
Kafka topics are replicated partitioned logs
● All reads and writes go to the leader replica
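The replication model can be sketched as follows. This is purely illustrative, not Kafka's actual replication protocol (which involves in-sync replica sets and acknowledgement modes):

```python
# Sketch of a replicated partition: one leader replica, N-1 followers.
# Clients talk only to the leader; followers copy its log so any of
# them can take over with no data loss if the leader fails.
class Replica:
    def __init__(self):
        self.log = []

def write(leader, followers, record):
    leader.log.append(record)      # all writes go to the leader
    for f in followers:            # followers replicate the leader's log
        f.log.append(record)

leader = Replica()
followers = [Replica(), Replica()]
for r in ["a", "b", "c"]:
    write(leader, followers, r)

# On leader failure, a follower is promoted with an identical log:
new_leader = followers[0]
```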
Core Kafka Pt. 5: How Scalable is Kafka?
● No bottleneck!
○ Many brokers
○ Many producers
○ Many consumers
○ Internet giants keep driving the limits higher, so you won't need to worry
○ e.g. LinkedIn runs > 1 trillion messages / day through Kafka
○ e.g. 100 brokers handling 2 billion messages a day
○ Don't over-partition (~< 100k partitions)
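A quick back-of-envelope check shows why the cluster sizes above are unremarkable for Kafka: spread over a day and a hundred brokers, the per-broker rate is modest.

```python
# Back-of-envelope arithmetic for "100 brokers / 2 billion messages a day".
messages_per_day = 2_000_000_000
brokers = 100
seconds_per_day = 24 * 60 * 60

cluster_rate = messages_per_day / seconds_per_day   # msgs/sec, whole cluster
per_broker_rate = cluster_rate / brokers            # msgs/sec per broker
# A few hundred messages per second per broker, far below what a
# single broker can sustain.
```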
Kafka Streams
● Just a library! A library that makes it easy to do stateful operations (joins, aggregations, etc.)
● Elastically scalable
● Fault tolerant
● Un-opinionated deployment
● State backed by Kafka, used as a changelog
● Exactly once processing
● Record-at-a-time processing
● Complex topologies
○ (but keep it simple)
● JVM only (Java, Scala, etc.)
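The record-at-a-time, stateful model above can be sketched as a fold over a stream with a local state store. Kafka Streams itself is a JVM library; this Python sketch only illustrates the processing model:

```python
# Sketch of record-at-a-time stateful processing in the style of
# Kafka Streams: each input record updates a local state store and
# emits the new aggregate as a changelog update downstream.
from collections import defaultdict

state = defaultdict(int)   # stands in for a state store (backed by Kafka)
changelog = []             # downstream changelog stream of (key, new_value)

def process(key, _value):
    state[key] += 1                      # stateful aggregation: count per key
    changelog.append((key, state[key]))  # one update emitted per input record

for key in ["alice", "bob", "alice"]:
    process(key, None)
```

Because the state store is mirrored to a Kafka changelog topic, a restarted or rescheduled instance can rebuild its state by replaying that topic, which is what makes the library elastic and fault tolerant.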
Confluent: A More Complete Streaming Platform
When Should You Use Kafka?
● Quantity of Data
○ Simple Applications (or not)
Buffering Pt. 1
Kafka is a very good buffer:
● Write optimized
● Highly reliable
● Tolerate data spikes
● Tolerate downstream outages
● Used by Kafka Streams (no backpressure mechanism needed: Kafka is the buffer)
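The buffering behavior above can be sketched as a spike of writes absorbed by the log while a slower downstream consumer drains at its own pace. Names and sizes are illustrative:

```python
# Sketch of Kafka as a buffer: a producer spike lands in the log
# instantly; a slower consumer drains it later without losing anything.
buffer = []          # stands in for a Kafka topic
consumed = []
consumer_offset = 0

# Traffic spike: 1000 events arrive at once.
buffer.extend(f"event-{i}" for i in range(1000))

# Downstream processes only 100 events per "tick"; nothing is dropped.
def drain(n):
    global consumer_offset
    batch = buffer[consumer_offset:consumer_offset + n]
    consumer_offset += len(batch)
    consumed.extend(batch)

for _ in range(10):
    drain(100)
```

The same mechanism tolerates downstream outages: if the consumer stops entirely, events simply accumulate in the log until it returns.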
Buffering Pt. 2
Move data to multiple locations
Explosion of Data Sources and Processing Frameworks
Advanced ETL #2: Enriching stream data
Advanced ETL #2: Stream / Table Join in KSQL
CREATE STREAM enriched_weblog AS
  SELECT w.*,              -- select list abbreviated on the original slide
         g.location AS location
  FROM weblog w
  LEFT JOIN geo g ON w.ip = g.ip;
● Query is long running on a KSQL cluster.
● Create the weblog stream and geo table (backed by Kafka topics) first.
● Currently, KSQL can interpret AVRO, JSON and CSV.
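Semantically, the stream/table join above is a per-record lookup against a table materialized from the geo topic (latest value per key). A sketch with made-up data:

```python
# Sketch of the stream/table join: `geo` is a materialized key/value
# view (latest location per IP); each weblog record is enriched by a
# lookup. LEFT JOIN semantics: unmatched rows keep a null location.
geo_table = {"1.2.3.4": "Palo Alto", "5.6.7.8": "London"}  # made-up data

weblog_stream = [
    {"ip": "1.2.3.4", "path": "/home"},
    {"ip": "9.9.9.9", "path": "/cart"},   # no geo entry -> left join null
]

enriched_weblog = [
    {**event, "location": geo_table.get(event["ip"])}
    for event in weblog_stream
]
```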
Stream Processing App #1: Anomaly Detection / Alerting
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts   -- assumed input stream; create it first
  WINDOW TUMBLING (SIZE 10 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
● Can also be implemented directly with the Kafka Streams API
● The result is a changelog stream where the key is the card number
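The tumbling-window query above can be sketched by bucketing each event into exactly one window (`timestamp // window_size`) and counting per (card, window). Data here is made up:

```python
# Sketch of the 10-second tumbling window: each event falls into
# exactly one window; cards with more than 3 attempts in a single
# window are flagged as possible fraud.
from collections import Counter

WINDOW_SECONDS = 10
# (card_number, epoch_seconds) events; made-up data.
events = [("4242", 1), ("4242", 3), ("4242", 5), ("4242", 9),  # 4 in window 0
          ("4242", 11),                                        # window 1
          ("1111", 2)]

counts = Counter(
    (card, ts // WINDOW_SECONDS) for card, ts in events
)
possible_fraud = {key: n for key, n in counts.items() if n > 3}
```

Tumbling (as opposed to hopping) windows never overlap, which is why integer division is enough to assign each event its window.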
What are Microservices?
● Independently deployable, small units of functionality
○ (not a formal definition)
○ Primary motivation: decouple teams (scale in people terms)
○ Usually REST endpoints + commands/queries
Microservices can also be built on a backbone of events:
○ PII Filter
○ Weblog enricher
○ SMS fraud alert notifier
○ ... just the start
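The services listed above all share one shape: consume a topic, transform each event, and produce to another topic. A sketch of the PII-filter service (topic and field names are made up):

```python
# Sketch of an event-driven microservice like the PII filter above:
# it consumes raw events, strips sensitive fields, and publishes the
# cleaned events downstream. Field names are illustrative only.
PII_FIELDS = {"email", "ssn"}

raw_weblog = [  # stands in for the input Kafka topic
    {"path": "/home", "email": "a@example.com", "ssn": "123-45-6789"},
    {"path": "/cart"},
]
clean_weblog = []  # stands in for the output topic

def pii_filter(event):
    return {k: v for k, v in event.items() if k not in PII_FIELDS}

for event in raw_weblog:
    clean_weblog.append(pii_filter(event))
```

Because each service only knows the topics it reads and writes, teams can deploy, scale, and replace services independently, which is the decoupling motivation stated above.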