ICCHA SETHI | PRINCIPAL DEVELOPER, ATLASSIAN | @ICCHASETHI
From Requirements To Resilient Event Driven
Systems With Kafka
STORY OF NOTIFICATIONS
It all begins with a feature
spec…
We need
Notifications!
ASAP!
What did we know about the
technical requirements of
the Notifications feature?
What did we know?
Asynchronous
Scale
Graceful degradation
Event-driven architecture is
a software architecture pattern
promoting the production,
detection, consumption of, and
reaction to events.
Wikipedia, 2018
“The event-driven architecture pattern
is a popular distributed asynchronous
architecture pattern used to produce
highly scalable applications”
Software Architecture Patterns, Mark Richards
Kafka
[Diagram: a Producer writes to Topic Foo, drawn as partitions of numbered offsets (Partition 0: offsets 0-9, Partition 3: offsets 0-5). Consumer Group A and Consumer Group B, each with Consumer 0 and Consumer 1, read from the topic.]
[Diagram: MESSAGE SERVICE -> NOTIFICATIONS WORKER -> Send notification. The Notifications Worker is drawn with Dependency A, Dependency B and Dependency C.]
MICRO SERVICES MEET EVENT DRIVEN ARCHITECTURE
[Diagram: the notifications pipeline split into separate workers: Direct Messages Worker, Group Messages Worker, Check User Online Worker and Send Notification Worker, with Dependency A and Dependency B, ending in Send notification. The final build adds Another Rule Worker.]
[Diagram: MESSAGE SERVICE feeding both NOTIFICATIONS WORKER and MESSAGE SEARCH INDEXER.]
Build for extensibility and
flexibility to account for
evolving requirements!
TAKEAWAY
Good job
Team! We are
done!
Umm... not quite
yet…
How do we want
notifications to
behave when
things go
wrong?
Check User Online
Worker
Presence Service
Low latency
is of utmost
importance
Ordering of
events is of
utmost
importance
RESILIENCY REQUIREMENTS
Ordering
HTTP Retries
Latency
Deadletter queues
Deadletter using Kafka
[Diagram: Consumer Group B (Consumer 0, Consumer 1) reads Topic Foo (offsets 0-9). Messages that fail are written to a second topic, Foo.deadletter (offsets 0-5).]
Deadletter with Kafka
• Use when order does not matter
• Do not deadletter unrecoverable errors
• Utilize metadata to track attempts
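The three rules above can be sketched without a broker. This is a minimal illustration, not the speaker's implementation: in-memory deques stand in for Topic Foo and Foo.deadletter, and `Message`, `UnrecoverableError`, `MAX_ATTEMPTS`, `handle_failure` and `consume` are hypothetical names introduced here.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # assumed retry budget, not a value from the talk


class UnrecoverableError(Exception):
    """An error that retrying can never fix (for example, a malformed payload)."""


@dataclass
class Message:
    value: str
    headers: dict = field(default_factory=dict)  # metadata travels with the message


def handle_failure(msg, error, deadletter, dropped):
    """Dead-letter a failed message for a later retry, or give up on it."""
    attempts = msg.headers.get("attempts", 0) + 1
    if isinstance(error, UnrecoverableError) or attempts >= MAX_ATTEMPTS:
        dropped.append(msg)  # do not dead-letter: log/alert instead
        return
    msg.headers["attempts"] = attempts  # track attempts in metadata
    deadletter.append(msg)  # stands in for producing to "Foo.deadletter"


def consume(topic, deadletter, dropped, process):
    """Drain a topic; failures never block the messages behind them."""
    while topic:
        msg = topic.popleft()
        try:
            process(msg)
        except Exception as error:
            handle_failure(msg, error, deadletter, dropped)
```

Because a failed message is re-queued behind newer ones, ordering is lost, which is why the pattern fits only when order does not matter.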
RESILIENCY REQUIREMENTS
Ordering
HTTP Retries
Latency
Deadletter queues
Timeouts
Circuit breakers
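As an illustration of the last tool on the list, here is a minimal circuit breaker sketch. It assumes a simple policy (open after a number of consecutive failures, allow one trial call after a cool-down); the class name, parameter names and default values are hypothetical and not taken from the talk.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for tests
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Failing fast protects latency: while the circuit is open, a worker such as Check User Online Worker would not wait on an unhealthy Presence Service.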
Check User Online
Worker
Presence Service
SCENARIO ONE - WHAT TOOLS WOULD YOU USE?
Order doesn’t
matter! Send
notifications
ASAP!
SCENARIO TWO - WHAT TOOLS WOULD YOU USE?
I guess up to a 1
second delay is
acceptable to
preserve order.
Else order
doesn’t matter.
Always ASK what if things go
wrong?
TAKEAWAY
HOW DO YOU SCALE RESILIENCY?
Make it low-cost to opt in for
resiliency and have sensible
default values
TAKEAWAY
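One way to make opting in cheap is a decorator whose defaults work without configuration. The sketch below is an assumption about what that could look like; `resilient`, its parameters and its default values are hypothetical, not an API from the talk.

```python
import functools
import time


def resilient(max_attempts=3, backoff_seconds=0.1,
              retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry transient failures with exponential backoff, using defaults that need no tuning."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # retry budget exhausted
                    sleep(backoff_seconds * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

A team opts in with a single line, `@resilient()`, above a function, and overrides a default such as `max_attempts` only when its requirements differ.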
Why didn’t
the customer
get the
notification?
For any product
Observability
is an unwritten
requirement
ME
Observability
Stats
Logs
Tracing
What should we
measure?
• Data for Current state
• Data to predict future state
THE ANSWER LIES IN QUEUING THEORY
Arrivals -> Queue -> Service
• Arrival Rate
• Queue Length
• Wait Time
• Service Time
• Residence Time = Wait Time + Service Time
• Utilization = Arrival Rate * Service Time
[Chart: Residence Time vs Utilization, the "hockey stick" curve]
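The hockey stick shape can be reproduced numerically. The sketch below assumes the textbook M/M/1 model (a single server with random arrivals and service times), for which residence time equals service time divided by (1 - utilization). The talk does not state which queuing model it uses, and the 10 ms service time is an arbitrary illustration.

```python
def residence_time(service_time, utilization):
    """Mean residence time for an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilization)


# Assumed service time of 10 ms per message.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    millis = residence_time(0.010, rho) * 1000
    print(f"utilization={rho:.2f}  residence_time={millis:.0f} ms")
```

Under this model residence time only doubles between idle and 50% utilization, but grows a hundredfold by 99%, which is the bend in the curve.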
Applying Queuing Theory to Kafka…
UTILIZATION = RATE OF ARRIVAL * SERVICE TIME
[Diagram: Broker and Direct Messages Worker, annotated with Consumer Lag, Residence Time and Service Time. A counter for messages produced measures the rate of arrival.]
What should we
measure?
• Data for Current state
• Data to predict future state
• When will HOT partitions
in Kafka occur?
• Will scaling my consumers
help?
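A rough way to approach both questions is to apply the utilization formula per partition. The sketch below is one possible reading, not the speaker's tooling: it treats each partition as a single-server queue, and the function names, the 0.8 threshold and the sample rates are assumptions.

```python
def utilization(arrival_rate_per_s, service_time_s):
    """Utilization = arrival rate * service time (fraction of time the consumer is busy)."""
    return arrival_rate_per_s * service_time_s


def hot_partitions(arrival_rates, service_time_s, threshold=0.8):
    """Return the partitions whose consumer is busy more than `threshold` of the time."""
    return sorted(
        partition
        for partition, rate in arrival_rates.items()
        if utilization(rate, service_time_s) > threshold
    )


# Hypothetical per-partition arrival rates (messages/second) and a 10 ms service time.
rates = {0: 20.0, 1: 25.0, 2: 90.0}
print(hot_partitions(rates, service_time_s=0.010))
```

Within a consumer group each partition is read by at most one consumer, so adding consumers beyond the partition count does not relieve a hot partition; in that case reducing service time or spreading keys across partitions are the levers left.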
Observability is not only a
measure of the internal state of
a system; it can also be used
to predict its future state
TAKEAWAY
LOGS
• All producer/consumer success and failure handlers are wrapped with consistent log fields:
  • topic
  • partition
  • offset
  • Kafka timestamp
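A minimal sketch of such a wrapper, using only the standard `logging` module. The handler signature, the logger name and the `kafka_timestamp` key are assumptions made for the example; a real consumer would take these values from the record its Kafka client returns.

```python
import logging

logger = logging.getLogger("kafka.events")  # hypothetical logger name


def log_fields(topic, partition, offset, timestamp_ms):
    """Build the same field set for every log line about a record."""
    return {
        "topic": topic,
        "partition": partition,
        "offset": offset,
        "kafka_timestamp": timestamp_ms,
    }


def logged(handler):
    """Wrap a consumer handler so success and failure log identical fields."""
    def wrapper(topic, partition, offset, timestamp_ms, value):
        fields = log_fields(topic, partition, offset, timestamp_ms)
        try:
            result = handler(value)
        except Exception:
            logger.exception("consume failed %s", fields)
            raise
        logger.info("consume succeeded %s", fields)
        return result
    return wrapper
```

With every handler wrapped this way, a search for one topic, partition and offset returns the full history of that record.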
“… one of the deficiencies of the Event
Notification pattern is the lack of visibility
into the flow of the system, which makes
it hard to debug.” - Martin Fowler
Kafka Event Metadata
• id
• producer of the event
• source
• tags
• timestamp
• version
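The metadata fields listed above could be carried in an envelope around each event. The sketch below is illustrative only: the field types, the JSON encoding, the `EventMetadata` and `envelope` names, and the sample values are assumptions, since the slides name the fields but not their format.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class EventMetadata:
    producer: str                 # service that produced the event
    source: str                   # where the event originated (semantics assumed)
    version: int = 1              # schema version of the payload
    tags: list = field(default_factory=list)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)


def envelope(metadata, payload):
    """Serialize metadata and payload together as the Kafka message value."""
    return json.dumps({"metadata": asdict(metadata), "payload": payload}).encode("utf-8")


# Hypothetical example values.
message = envelope(
    EventMetadata(producer="message-service", source="direct-message", tags=["notifications"]),
    {"text": "hello"},
)
```

Keeping the metadata in one consistent shape lets every consumer log and trace an event the same way, whatever the payload contains.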
"bitbucket_org-hipchat-schemas-platform-ZipkinTrace": {
  "properties": {
    "parent_id": {
      "type": "string"
    },
    "sampled": {
      "type": "boolean"
    },
    "span_id": {
      "type": "string"
    },
    "trace_id": {
      "type": "string"
    }
  }
},
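To show how these four fields link events into one trace, here is a small sketch of trace propagation. It assumes Zipkin-style 64-bit identifiers written as 16 hex characters and omits `parent_id` on the root span; `new_trace` and `child_span` are hypothetical helpers, not part of the schema above or of any Zipkin library.

```python
import secrets


def new_trace(sampled=True):
    """Start a trace at the first producer; the root span has no parent_id."""
    return {
        "trace_id": secrets.token_hex(8),
        "span_id": secrets.token_hex(8),
        "sampled": sampled,
    }


def child_span(parent):
    """Create the trace context a consumer attaches to the events it emits."""
    return {
        "trace_id": parent["trace_id"],   # unchanged across the whole flow
        "parent_id": parent["span_id"],   # links this hop to the previous one
        "span_id": secrets.token_hex(8),  # new id for this hop
        "sampled": parent["sampled"],     # sampling decision is inherited
    }


root = new_trace()
hop = child_span(root)
```

Each worker that consumes an event and produces another would call `child_span` on the incoming context, so the chain of `parent_id` values reconstructs the path an event took through the system.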
SCALING OBSERVABILITY
STORY OF NOTIFICATIONS
Product Requirements
• Build for evolving requirements - extensibility
and flexibility
• Always ask WHAT if things go wrong
Resiliency
• Resiliency affects product behavior
• Make it a low cost option to adopt resiliency
with sensible default values
Observability
• Observability is not only a measure of the
internal state but can also be used to predict
future state
Takeaways
Thank you!