Speaker: Matt Howlett, Software Engineer, Confluent
This presentation provides a technical overview of Apache Kafka® and covers some of its popular use cases.
Confidential
What is Apache Kafka?
Kafka is a streaming platform.
A distinct tool in your toolbox, like a relational database or a traditional messaging system.
A streaming platform encourages architectures that emphasize events and changes to data (not data at rest).
Widely applicable. E.g. consider Walmart.
Who Uses Kafka Today?
● 35% of the Fortune 500, plus thousands of companies worldwide, use Kafka
● Across all industries
● High growth of usage within companies
Core Kafka Pt. 2: Why Logs?
● Simple:
○ High performance
○ Robust horizontal scalability
● Suitable for real-time, streaming and batch operations
● Ad-hoc consumption & reprocessing
● Immutable:
○ Easier to debug/reason about vs. ephemeral data
○ Auditable by default
Core Kafka Pt. 3: Scaling
Kafka topics are partitioned logs; each message is a key/value pair.
Notes:
● ordering is guaranteed per partition only
● re-partitioning: changing the partition count changes the key-to-partition mapping
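How keyed messages map to partitions can be sketched in a few lines of plain Python (this is an illustrative in-memory model, not Kafka's implementation; Kafka's default partitioner uses a murmur2 hash, for which Python's built-in `hash()` stands in here):

```python
# In-memory sketch of a partitioned log: messages with a key are routed
# to a partition by hash(key) % partition_count, so all messages for one
# key land in one partition and keep their relative order there.

def assign_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index."""
    return hash(key) % num_partitions

class PartitionedLog:
    """A topic modeled as a list of append-only partition logs."""
    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: bytes, value: str) -> int:
        p = assign_partition(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p

topic = PartitionedLog(num_partitions=3)
for i in range(5):
    topic.produce(b"user-42", f"event-{i}")

# Same key -> same partition -> ordering preserved (per partition only).
p = assign_partition(b"user-42", 3)
assert [v for _, v in topic.partitions[p]] == [f"event-{i}" for i in range(5)]
```

This also shows why re-partitioning matters: change `num_partitions` and the same key may hash to a different partition.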
Core Kafka Pt. 4: Durability
Kafka topics are replicated partitioned logs.
Notes:
● all reads and writes go to the leader replica
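The leader/follower arrangement can be sketched as a toy model (purely illustrative Python, not Kafka internals; the `acks_all` flag loosely mirrors the producer's `acks=all` setting):

```python
# Sketch of leader-based replication: clients read and write only via
# the leader; follower replicas copy the leader's log for durability.

class Replica:
    def __init__(self):
        self.log = []

class ReplicatedPartition:
    def __init__(self, replication_factor: int):
        self.replicas = [Replica() for _ in range(replication_factor)]
        self.leader = self.replicas[0]

    def write(self, record, acks_all: bool = True):
        self.leader.log.append(record)       # writes go to the leader
        if acks_all:                         # wait for followers to copy
            for r in self.replicas:
                if r is not self.leader:
                    r.log.append(record)

    def read(self, offset: int):
        return self.leader.log[offset]       # reads are served by the leader

part = ReplicatedPartition(replication_factor=3)
part.write("order-created")
assert part.read(0) == "order-created"
assert all(r.log == ["order-created"] for r in part.replicas)
```

If the leader fails, one of the up-to-date followers can take over without data loss.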
Core Kafka Pt. 5: How Scalable is Kafka?
● No bottleneck!
○ Many brokers
○ Many producers
○ Many consumers
● Limits?
○ Internet giants keep driving the limits higher; you are unlikely to hit them.
○ e.g. LinkedIn pushes more than 1 trillion messages/day through its Kafka clusters.
○ 100 brokers / 2 billion messages a day is "straightforward" to operate.
○ Don't over-partition: stay under roughly 100k partitions.
(diagram: producers → brokers → consumers)
Kafka Streams
● Just a library! A library that makes it easy to do stateful operations (joins, aggregations, windowing).
● Elastically scalable
○ distributed!
● Fault tolerant
● Un-opinionated deployment
● State backed by Kafka used as a changelog
● Exactly-once processing
● Record-at-a-time processing
● Complex topologies
○ (but keep it simple)
● JVM only (Java, Scala, etc.)
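The "state backed by Kafka used as a changelog" idea can be illustrated without the Streams API at all (a conceptual Python sketch, since Kafka Streams itself is JVM-only; the function names are made up for illustration):

```python
# Sketch of changelog-backed state: every update to a local state store
# is also appended to a log, so a restarted instance can rebuild its
# store by replaying that log (this is what makes the state fault
# tolerant without the library managing its own storage cluster).

def process(records, store, changelog):
    """Count occurrences per key; mirror each state change to the changelog."""
    for key in records:
        store[key] = store.get(key, 0) + 1
        changelog.append((key, store[key]))   # state change -> changelog

def restore(changelog):
    """Rebuild the store by replaying the changelog (last write wins)."""
    store = {}
    for key, value in changelog:
        store[key] = value
    return store

store, changelog = {}, []
process(["a", "b", "a"], store, changelog)
assert store == {"a": 2, "b": 1}

# Simulate a crash: the in-memory store is lost, the changelog survives.
recovered = restore(changelog)
assert recovered == store
```

In Kafka Streams the changelog is itself a compacted Kafka topic, which is why the library needs no storage system beyond Kafka.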
When Should You Use Kafka?
● Scalability
○ Quantity of data
○ Simple applications (or not)
● Complexity
○ Architectural
○ Organizational
Buffering Pt. 1
Kafka is a very good buffer:
● Write optimized
● Highly reliable
● Tolerate data spikes
● Tolerate downstream outages
● Used by KStreams (no back-pressure problems)
Advanced ETL #2: Stream / Table Join in KSQL
CREATE STREAM enriched_weblog AS
SELECT
ip,
text,
g.location AS location
FROM weblog w
LEFT JOIN geo g ON w.ip = g.ip;
● The query is long-running on a KSQL cluster.
● Create the weblog stream and geo table (backed by Kafka topics) first.
● Currently, KSQL can interpret Avro, JSON and CSV.
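What the stream/table join above computes can be mimicked in a few lines of Python (a conceptual sketch only, with made-up sample data; KSQL actually runs this continuously over Kafka topics):

```python
# Sketch of a stream/table LEFT JOIN: for each weblog event, look up the
# current value for its ip in the geo table and emit an enriched record.
# LEFT JOIN semantics: events with no matching ip are still emitted.

geo_table = {"1.2.3.4": "Sydney"}   # a table: latest value per key

def enrich(weblog_stream, geo):
    for event in weblog_stream:
        # .get returns None when there is no match, like a SQL LEFT JOIN
        yield {**event, "location": geo.get(event["ip"])}

weblog = [
    {"ip": "1.2.3.4", "text": "GET /"},
    {"ip": "5.6.7.8", "text": "GET /cart"},
]
enriched = list(enrich(weblog, geo_table))
assert enriched[0]["location"] == "Sydney"
assert enriched[1]["location"] is None     # unmatched event still emitted
```

The key difference from a batch join: the table is continuously updated from its topic, so each event joins against the table's state at the moment it arrives.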
Stream Processing App #1: Anomaly Detection / Alerting
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 10 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
● Use Kafka Streams and/or additional input streams in a more sophisticated algorithm
● possible_fraud is a changelog stream where the key is [card_number, window_start]
(diagram: authorization_attempts → possible_fraud → SMS Gateway)
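The tumbling-window count behind the query above can be sketched in plain Python (illustrative only, with invented sample data; KSQL maintains these counts incrementally rather than in batch):

```python
# Sketch of the tumbling-window fraud query: bucket authorization
# attempts by (card_number, 10-second window start) and flag any key
# whose count within a single window exceeds the threshold.

from collections import Counter

WINDOW_SIZE = 10  # seconds, matching WINDOW TUMBLING (SIZE 10 SECONDS)

def window_start(ts: float) -> int:
    """Align a timestamp to the start of its tumbling window."""
    return int(ts // WINDOW_SIZE) * WINDOW_SIZE

def possible_fraud(attempts, threshold=3):
    """attempts: iterable of (card_number, timestamp) pairs."""
    counts = Counter((card, window_start(ts)) for card, ts in attempts)
    # HAVING count(*) > threshold
    return {key: n for key, n in counts.items() if n > threshold}

attempts = [("4111", t) for t in (0, 1, 2, 3)] + [("4222", 5)]
flagged = possible_fraud(attempts)
assert flagged == {("4111", 0): 4}   # 4 attempts in one window -> flagged
```

Note the output key is exactly the [card_number, window_start] pair described above.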
Microservices
What are Microservices?
● Independently deployable, small units of functionality
○ (not a formal definition)
○ Primary motivation: decouple teams (scale in people terms)
○ Usually REST endpoints + commands/queries
Microservices can also be built on a backbone of events:
○ PII Filter
○ Weblog enricher
○ SMS fraud alert notifier
○ ... just the start
Microservices - Receiver Driven Flow Control
● Pricing Service team does not need to talk to Orders Service team
● Trade-off: no single statement of overall system behavior