Netflix recently changed its data pipeline architecture to use Kafka as the gateway for data collection across all applications, processing hundreds of billions of messages daily. This session will discuss the motivation for moving to Kafka, the architecture, and the improvements we have added to make Kafka work in AWS. We will also share lessons learned and future plans.
Kafka Summit SF 2017 - One Data Center is Not Enough: Scaling Apache Kafka Ac... (confluent)
1. Running a single Kafka cluster is not sufficient for high availability and disaster recovery. Having multiple clusters across different data centers is necessary to handle failures like an entire data center being demolished.
2. There are different approaches to setting up Kafka clusters across multiple data centers, including a "stretch cluster" with at least one broker and Zookeeper in each data center, or running two independent clusters and replicating data between them asynchronously or actively.
3. With multiple data center replication, there are tradeoffs around latency, throughput, infrastructure costs, and the difficulty of handling consumer offsets during failover. The optimal solution depends on an organization's specific availability, data consistency, and disaster recovery requirements.
This document discusses reliability guarantees in Apache Kafka, which achieves reliability by replicating data across multiple brokers: as long as the minimum number of in-sync replicas (ISRs) is maintained, messages will not be lost even if individual brokers fail. It describes concepts like in-sync replicas and unclean leader election, and how to configure the replication factor and minimum in-sync replicas. It also covers best practices for avoiding data loss, such as setting acks=all on producers, disabling unclean leader election on brokers, committing offsets manually and only after processing is complete, handling consumer rebalances, and monitoring for errors, lag, and reconciliation of message counts.
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka... (confluent)
In the financial industry, losing data is unacceptable. Financial firms are adopting Kafka for their critical applications. Kafka provides the low latency, high throughput, high availability, and scale that these applications require. But can it also provide complete reliability? As a system architect, when asked “Can you guarantee that we will always get every transaction,” you want to be able to say “Yes” with total confidence.
In this session, we will go over everything that happens to a message – from producer to consumer – and pinpoint all the places where data can be lost if you are not careful. You will learn how developers and operations teams can work together to build a bulletproof data pipeline with Kafka. And if you need proof that you built a reliable system, we'll show you how you can build the system to prove this too.
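To ground the reliability settings these summaries keep returning to, here is a minimal sketch of a producer configured for durability, assuming the Java Kafka clients; the bootstrap address, topic, and key are placeholders, and the broker-side settings appear only as comments because they are cluster configuration, not client code.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducer {
    public static void main(String[] args) {
        // Broker/topic side (for context, set on the cluster):
        //   unclean.leader.election.enable=false  (never elect an out-of-sync replica)
        //   replication.factor=3, min.insync.replicas=2
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");                // wait for all in-sync replicas
        props.put("retries", Integer.MAX_VALUE); // retry transient broker errors

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("transactions", "txn-42", "payload"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // A non-retriable failure means the write may be lost:
                            // log or alert rather than silently dropping it.
                            exception.printStackTrace();
                        }
                    });
        } // close() flushes any outstanding records
    }
}
```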
Apache Kafka's rise in popularity as a streaming platform has demanded a revisit of its traditional at-least-once message delivery semantics.
In this talk, we present the recent additions to Kafka to achieve exactly-once semantics (EoS) including support for idempotence and transactions in the Kafka clients. The main focus will be the specific semantics that Kafka distributed transactions enable and the underlying mechanics which allow them to scale efficiently.
Speaker: Damien Gasparina, Engineer, Confluent
Here's how to fail at Apache Kafka brilliantly!
https://www.meetup.com/Paris-Data-Engineers/events/260694777/
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka (confluent)
The document introduces Apache Kafka's new exactly-once semantics, which provide exactly-once, in-order delivery of records per partition and atomic writes across multiple partitions. It discusses the existing at-least-once delivery semantics and the duplicates they can produce. The new approach uses idempotent producers, sequence numbers, and transactions to ensure exactly-once delivery and coordination across partitions. It also provides up to 20% higher throughput for producers and 50% for consumers through more efficient data formatting and batching. The new features are available in Apache Kafka 0.11, released in June 2017.
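As a rough sketch of the client-side API the abstracts above describe (available since Apache Kafka 0.11), the snippet below uses an idempotent, transactional producer to write to two topics atomically; the topic names and transactional.id are made-up placeholders, not taken from the talks.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalWrite {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true");      // brokers dedupe via sequence numbers
        props.put("transactional.id", "orders-tx-1"); // placeholder id; enables transactions

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("orders", "order-1", "created"));
            producer.send(new ProducerRecord<>("audit-log", "order-1", "created"));
            producer.commitTransaction(); // both writes become visible atomically
        } catch (Exception e) {
            producer.abortTransaction();  // read_committed consumers never see either write
            // (for fatal errors such as producer fencing, close the producer instead)
        } finally {
            producer.close();
        }
    }
}
```

Consumers opt in by setting isolation.level=read_committed, which filters out records from aborted transactions.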
Can Kafka Handle a Lyft Ride? (Andrey Falko & Can Cecen, Lyft) Kafka Summit 2020 (HostedbyConfluent)
What does a Kafka administrator need to do when a user demands that message delivery be guaranteed, fast, and low cost? In this talk we walk through the architecture we created to deliver for such users, the alternatives we considered, and the pros and cons of what we came up with.
In this talk, we’ll be forced to dive into broker restart and failure scenarios and the things we need to do to prevent leader elections from slowing down incoming requests. We’ll need to take care of the consumers as well, to ensure that they don’t process the same request twice. We also plan to describe our architecture by showing a demo of simulated requests being produced into Kafka clusters and consumers processing them while we aggressively cause failures on the Kafka clusters.
We hope the audience walks away with a deeper understanding of what it takes to build robust Kafka clients and how to tune them to accomplish stringent delivery guarantees.
Kafka Summit NYC 2017 - Deep Dive Into Apache Kafka (confluent)
- Apache Kafka is a streaming platform that provides high availability, durability, and the ability to retain database-like data through features like log compaction.
- It ensures reliability through configurable replication, automatic failover, and an in-sync replica process.
- Log compaction allows Kafka to retain only the latest value for each message key in the log, useful for building indexes and retaining only updated records.
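To illustrate the compaction point in the last bullet, here is a small AdminClient sketch that creates a compacted topic; the topic name and sizing are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps only the latest value per message key,
            // which suits changelogs of updatable records such as database rows.
            NewTopic topic = new NewTopic("user-profiles", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```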
Kafka on Kubernetes: Keeping It Simple (Nikki Thean, Etsy) Kafka Summit SF 2019 (confluent)
Cloud migration: it's practically a rite of passage for anyone who's built infrastructure on bare metal. When we migrated our 5-year-old Kafka deployment from the datacenter to GCP, we were faced with the task of making our highly mutable server infrastructure more cloud-friendly. This led to a surprising decision: we chose to run our Kafka cluster on Kubernetes. I'll share war stories from our Kafka migration journey, explain why we chose Kubernetes over arguably simpler options like GCP VMs, and present the lessons we learned while making our way toward a stable and self-healing Kubernetes deployment. I'll also go through some improvements in the more recent Kafka releases that make upgrades crucial for any Kafka deployment on immutable and ephemeral infrastructure. You'll learn what happens when you try to run one complex distributed system on top of another, and come away with some handy tricks for automating cloud cluster management, plus some migration pitfalls to avoid. And if you're not sure whether running Kafka on Kubernetes is right for you, our experiences should provide some extra data points that you can use as you make that decision.
Introducing Exactly Once Semantics To Apache Kafka (Apurva Mehta)
Here are slides from my talk on introducing exactly once semantics to Apache Kafka. The talk was given at the Kafka Summit NYC, 8 May 2017.
The slides dive into the design of transactions in Apache Kafka.
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails? (confluent)
This document discusses Kafka deployment strategies at Goldman Sachs. It describes how Kafka is deployed across multiple datacenters for high availability. It also discusses various failure scenarios like host, network, or entire datacenter failures and how the deployment is designed to minimize impact. The document also summarizes the monitoring, alerting, and management tooling built around Kafka clusters to provide health checks, metrics collection, and topic management capabilities.
Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs ... (HostedbyConfluent)
In our payments platform at Goldman Sachs Transaction Banking, Apache Kafka plays a critical role as the messaging bus in our microservices architecture. As part of the financial services industry, we need to ensure high availability of our platform and quick response times during failures.
In this talk we will explore how we monitor and alert on the health of our Kafka clusters and clients using our heartbeat application and DataDog dashboards. We will see how we consolidate JMX metrics such as error rates, connection rates, latencies, and consumer lag from all producers and consumers using a JMX agent sidecar, providing a live view of the health of our entire infrastructure. We will also discuss our culture of game days, where we regularly test the resiliency of all the clients in our infrastructure by simulating various failure scenarios to improve overall availability.
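The talk above monitors via JMX and DataDog; as an alternative sketch of the same consumer-lag signal, the snippet below computes per-partition lag with the Kafka AdminClient (the listOffsets admin API requires a reasonably recent client, roughly Kafka 2.5+); the group id and address are placeholders.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("payments-service") // placeholder group
                         .partitionsToOffsetAndMetadata().get();
            // Current log-end offsets for the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(committed.keySet().stream()
                                 .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();
            // Lag = log-end offset minus committed offset; alert when it grows steadily.
            committed.forEach((tp, om) -> System.out.printf("%s lag=%d%n",
                    tp, ends.get(tp).offset() - om.offset()));
        }
    }
}
```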
Troubleshooting Kafka's socket server: from incident to resolution (Joel Koshy)
LinkedIn’s Kafka deployment is nearing 1300 brokers that move close to 1.3 trillion messages a day. While operating Kafka smoothly at this scale is a testament to both Kafka’s scalability and the operational expertise of LinkedIn SREs, we occasionally run into some very interesting bugs. In this talk I will dive into a recent production issue as an example of how even a subtle bug can suddenly manifest at scale and cause a near meltdown of the cluster. We will go over how we detected and responded to the situation, how we investigated it after the fact, and summarize some lessons learned and best practices from this incident.
The Good, The Bad, and The Avro (Graham Stirling, Saxo Bank and David Navalho...) (confluent)
- Saxo Bank is migrating to a data mesh architecture using Apache Kafka and Avro schemas to distribute data across domains and enable data sharing.
- They are working to automate the onboarding process for new data domains and producers/consumers to simplify development and ensure governance.
- Some challenges include limited support for .NET in Confluent platforms, compatibility issues between code generators and the schema registry, and mapping complex database schemas to Avro schemas.
This document discusses building a fault-tolerant Kafka cluster on AWS to handle 2.5 billion requests per day. It covers choosing AWS instance types and broker counts, spreading brokers across availability zones, configuring replication and partitioning, automating fault tolerance, adding metrics and alerts, and testing the cluster's resilience. Key decisions include broker placement, topic partitioning, Zookeeper ensemble sizing, and automation to dynamically reassign partitions and change configurations in response to failures or added capacity.
1) Apache Kafka is a distributed streaming platform that can be used for publish-subscribe messaging and storing and processing streams of data. However, there are many potential anti-patterns to be aware of when using Kafka.
2) Some common anti-patterns include not properly configuring data durability, ignoring error handling and exceptions, failing to use Kafka's built-in retries and idempotence features, and not embracing Kafka's at-least-once processing semantics.
3) It is also important to properly configure Kafka for production use by tuning OS settings, reading documentation on best practices, implementing monitoring, and addressing topics and partitioning design.
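To make the "embrace at-least-once" advice in point 2 concrete, here is a minimal consumer sketch that disables auto-commit and commits offsets only after processing succeeds; the topic, group id, and process() body are illustrative assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "invoice-processor");       // placeholder group
        props.put("enable.auto.commit", "false");         // avoid the auto-commit anti-pattern
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("invoices"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    process(r); // must be idempotent: at-least-once implies possible redelivery
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }

    static void process(ConsumerRecord<String, String> r) {
        System.out.printf("processing %s-%d@%d%n", r.topic(), r.partition(), r.offset());
    }
}
```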
In the last few years, Apache Kafka has been used extensively in enterprises for real-time data collecting, delivering, and processing. In this presentation, Jun Rao, Co-founder, Confluent, gives a deep dive on some of the key internals that help make Kafka popular.
- Companies like LinkedIn are now sending more than 1 trillion messages per day to Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
- Many companies (e.g., financial institutions) are now storing mission critical data in Kafka. Learn how Kafka supports high availability and durability through its built-in replication mechanism.
- One common use case of Kafka is for propagating updatable database records. Learn how a unique feature called compaction in Apache Kafka is designed to solve this kind of problem more naturally.
Everything you ever needed to know about Kafka on Kubernetes but were afraid ... (HostedbyConfluent)
Kubernetes has become the de facto standard for running cloud-native applications, and many users also turn to it to run stateful applications such as Apache Kafka. You can use different tools to deploy Kafka on Kubernetes - write your own YAML files, use Helm Charts, or go for one of the available operators. But all of these have one thing in common: you still need very good knowledge of Kubernetes to make sure your Kafka cluster works properly in all situations. This talk will cover different Kubernetes features such as resources, affinity, tolerations, pod disruption budgets, topology spread constraints and more, and explain why they are important for Apache Kafka and how to use them. If you are interested in running Kafka on Kubernetes and do not know all of these, this is a talk for you.
This document provides guidance on upgrading Kafka. It emphasizes the importance of upgrading early and often to the latest bugfix release in order to address security vulnerabilities and other bugs. It recommends using automated rolling upgrades to upgrade brokers one by one with zero downtime. It also outlines best practices like backing up configurations, reading release notes, and ensuring protocol compatibility when upgrading.
Running large scale Kafka upgrades at Yelp (Manpreet Singh, Yelp) Kafka Summit... (confluent)
Over the years at Yelp, we have relied on Kafka to build many complex applications and stream processing data pipelines that solve a multitude of use cases, including powering our product experimentation workflow, search indexing, asynchronous task processing and more. Today, Kafka is at the core of our infrastructure. These applications use different versions of Kafka clients and different programming languages. To fulfill the requirements of these diverse use cases, we run several specialized Kafka clusters for high availability, consistency, exactly-once and infinite retention. We endeavor to keep our clusters up to date with newer Kafka versions, which bring with them several critical bug fixes and exciting features like dynamic broker configuration, exactly-once semantics, Kafka offset management and improved tooling. Our journey with Kafka started with version 0.8.2.0. Upgrading Kafka while ensuring client compatibility, zero downtime, and negligible performance degradation across our ever-growing multi-regional cluster deployment exposed us to a plethora of unique challenges. This session will focus on the challenges we encountered and how we evolved our infrastructure tooling and upgrade strategy to overcome them. I will be talking about:
- How we rolled out new features such as Kafka offset storage, message timestamps, and reassignment auto-throttling.
- Core technical issues discovered during upgrades, such as failure of log cleaners due to large offsets.
- The in-house test suite we built to validate new Kafka versions against our existing tooling and client libraries, exercise the upgrade and rollback process, and benchmark performance.
- The automation we built for safe and fast rolling upgrades and broker configuration deployment.
Introduction To Streaming Data and Stream Processing with Apache Kafka (confluent)
Slack handles over 1.2 trillion message writes and 3.4 trillion message reads daily across its real-time messaging platform, generating around 1 petabyte of streaming data. With thousands of engineers and tens of thousands of producer processes, Slack relies on Apache Kafka as the commit log for its distributed database to handle this massive scale of real-time messaging.
This document provides an introduction to Apache Kafka. It describes Kafka as a distributed messaging system with features like durability, scalability, publish-subscribe capabilities, and ordering. It discusses key Kafka concepts like producers, consumers, topics, partitions and brokers. It also summarizes use cases for Kafka and how to implement producers and consumers in code. Finally, it briefly outlines related tools like Kafka Connect and Kafka Streams that build upon the Kafka platform.
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17 (Gwen (Chen) Shapira)
This document discusses disaster recovery strategies for Apache Kafka clusters running across multiple data centers. It outlines several failure scenarios, like an entire data center being demolished, and recommends solutions like running a single Kafka cluster across multiple nearby data centers. It then describes a "stretch cluster" approach using three data centers with replication between them to provide high availability. The document also discusses active-active replication between two data center clusters and the challenge that consumer offsets are not identical across data centers during a failover. It recommends approaches like tracking timestamps and failing over consumers based on time.
No data loss pipeline with Apache Kafka (Jiangjie Qin)
The document discusses how to configure Apache Kafka to prevent data loss and message reordering in a data pipeline. It recommends settings like enabling block on buffer full, using acks=all for synchronous message acknowledgment, limiting in-flight requests, and committing offsets only after messages are processed. It also suggests replicating topics across at least 3 brokers and using a minimum in-sync replica factor of 2. Mirror makers can further ensure no data loss or reordering by consuming from one cluster and producing to another in order while committing offsets. Custom consumer listeners and message handlers allow for mirroring optimizations.
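A short sketch of the producer settings that summary names, with one caveat: block.on.buffer.full was removed from newer clients, where max.block.ms plays the equivalent role. Values are illustrative, not prescriptive.

```java
import java.util.Properties;

public class NoLossOrderingConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("acks", "all");                // every in-sync replica must acknowledge
        props.put("retries", Integer.MAX_VALUE); // retry transient failures
        // One in-flight request per connection so retries cannot reorder messages.
        props.put("max.in.flight.requests.per.connection", "1");
        // Old clients used block.on.buffer.full=true; modern clients instead block
        // for up to max.block.ms when the buffer is full rather than throwing.
        props.put("max.block.ms", String.valueOf(Long.MAX_VALUE));
        // Cluster side, for reference: replication factor 3, min.insync.replicas=2.
        return props;
    }
}
```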
Streaming in Practice - Putting Apache Kafka in Production (confluent)
This presentation focuses on how to integrate Kafka and its ecosystem components into an enterprise environment, and what you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
This document provides an introduction to Akka Streams, which implements the Reactive Streams specification. It discusses the limitations of traditional concurrency models and Actor models in dealing with modern challenges like high availability and large data volumes. Reactive Streams aims to provide a minimalistic asynchronous model with back pressure to prevent resource exhaustion. Akka Streams builds on the Akka framework and Actor model to provide a streaming data flow library that uses Reactive Streams interfaces. It allows defining processing pipelines with sources, flows, and sinks and includes features like graph DSL, back pressure, and integration with other Reactive Streams implementations.
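As a toy illustration of the source/flow/sink pipeline described above, here is a hedged Akka Streams sketch using the Java DSL, assuming Akka 2.6 (where a stream can be run with the ActorSystem directly); it is not taken from the talk.

```java
import akka.actor.ActorSystem;
import akka.stream.javadsl.Flow;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class StreamsDemo {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");
        // Source -> Flow -> Sink: demand propagates upstream, so a slow sink
        // back-pressures the source instead of exhausting memory.
        Source.range(1, 100)
              .via(Flow.of(Integer.class).map(i -> i * 2))
              .runWith(Sink.foreach(System.out::println), system)
              .thenRun(system::terminate); // materialized CompletionStage<Done>
    }
}
```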
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Exactly-Once Financial Data Processing at Scale with Flink and Pinot (Flink Forward)
Flink Forward San Francisco 2022.
At Stripe, we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale by combining the exactly-once capabilities of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
By Xiang Zhang, Pratyush Sharma & Xiaoman Dong
Microservices interaction at scale using Apache Kafka (Ivan Ursul)
This document discusses using Apache Kafka to enable communication between microservices at scale. It begins by describing how monolithic applications can be broken into independent microservices. Next, it covers common communication patterns for microservices, including shared databases, separate databases per service, and asynchronous messaging. The bulk of the document then focuses on Apache Kafka, describing it as a distributed publish-subscribe messaging system that is fast, scalable and durable. It covers how Kafka works, including its use of a commit log distributed across brokers, and common usage patterns such as event sourcing, change data capture, and Kafka Connect. Overall, the document promotes using Kafka as the backbone for event-driven communication between microservices.
Speaker: Jun Rao, VP of Apache Kafka and Co-founder of Confluent
The controller is the brain of Apache Kafka®. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
In this talk, Jun will outline the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker. Jun will then describe recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Jun Rao is the co-founder of Confluent, a company that provides a streaming data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM's Almaden Research Center, where he conducted research on database and distributed systems. Jun is the PMC chair of Apache Kafka and a committer on Apache Cassandra. He writes at https://cnfl.io/blog-jun-rao.
SFBigAnalytics_20190724: Monitor Kafka like a Pro (Chester Chen)
Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice them. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it. How to detect duplicates, catch buggy clients, and triage performance issues – in short, how to keep the business’s central nervous system healthy and humming along, all like a Kafka pro.
Speakers: Gwen Shapira, Xavier Léauté (Confluent)
Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka - the Definitive Guide”, "Hadoop Application Architectures", and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.
One of the first engineers on the Confluent team, Xavier Léauté is responsible for analytics infrastructure, including real-time analytics in Kafka Streams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
Migration Effort in the Cloud - The Case of Cloud Platforms (Stefan Kolb)
Get the book "On the Portability of Applications in Platform as a Service" at https://www.amazon.de/dp/3863096312
Presentation from IEEE CLOUD 2015. Full paper at http://bit.ly/paasmigration
Kubernetes Failure Stories - KubeCon Europe Barcelona (Henning Jacobs)
Talk given on 2019-05-21 at KubeCon Barcelona: https://kccnceu19.sched.com/event/MPcM/kubernetes-failure-stories-and-how-to-crash-your-clusters-henning-jacobs-zalando-se
Bootstrapping a Kubernetes cluster is easy, rolling it out to nearly 200 engineering teams and operating it at scale is a challenge. In this talk, we are presenting our approach to Kubernetes provisioning on AWS, operations and developer experience for our growing Zalando developer base. We will walk you through our horror stories of operating 100+ clusters and share the insights we gained from incidents, failures, user reports and general observations. Our failure stories will be sourced from recent and past incidents, so the talk will be up-to-date with our latest experiences.
Most of our learnings apply to other Kubernetes infrastructures (EKS, GKE, ..) as well. This talk strives to reduce the audience's unknown unknowns about running Kubernetes in production.
Strategies and techniques to optimize Kafka brokers and producers to minimize data loss under huge traffic volumes, limited configuration options, and a less-than-ideal, constantly changing environment, balanced against cost.
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Cont... (Henning Jacobs)
Bootstrapping a Kubernetes cluster is easy, rolling it out to nearly 200 engineering teams and operating it at scale is a challenge. In this talk, we are presenting our approach to Kubernetes provisioning on AWS, operations and developer experience for our growing Zalando developer base. We will walk you through our horror stories of operating 80+ clusters and share the insights we gained from incidents, failures, user reports and general observations. Most of our learnings apply to other Kubernetes infrastructures (EKS, GKE, ..) as well. This talk strives to reduce the audience’s unknown unknowns about running Kubernetes in production.
https://2018.container.camp/uk/schedule/running-kubernetes-in-production-a-million-ways-to-crash-your-cluster/
There has been a lot of activity in V3DV, the Vulkan driver for Raspberry Pi 4, over the last year: we have significantly reworked our synchronization code, obtained Vulkan 1.1 conformance, implemented Vulkan 1.2 support, continued to work on compiler optimizations and more.
In this talk I would like to go through the main development milestones and changes we implemented in the driver, as well as discuss some limitations of the underlying hardware platform that have discouraged us from implementing features such as scalar block layout or fp16.
(c) X.Org Developer Conference (XDC) 2022
October 4-6, 2022
Minneapolis, Minnesota, USA
https://indico.freedesktop.org/event/2/
The document discusses running LLVM buildbots at Linaro. It describes that Linaro runs over 160 LLVM builders to test commits across architectures, projects, platforms, and build types. Maintaining the buildbots is difficult due to limited resources, flaky builds, and differing perspectives between committers and maintainers. The future includes running more builds in pre-commit to catch issues earlier and reduce surprises for committers.
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul (HostedbyConfluent)
Apache Kafka is the most popular open-source stream-processing software for collecting, processing, storing, and analyzing data at scale. Most known for its excellent performance, low latency, fault tolerance, and high throughput, it's capable of handling thousands of messages per second. For mission-critical applications, how do you ensure that the performance delivered is the performance required? This is especially important as Kafka is written in Java and Scala and runs on the JVM. The JVM is a fantastic platform that delivers on an internet scale. In this session, we'll explore how making changes to the JVM design can eliminate the problems of garbage collection pauses and raise the throughput of applications. For cloud-based Kafka applications, this can deliver both lower latency and reduced infrastructure costs. All without changing a line of code!
This document discusses how the Azul Platform Prime JVM can improve Kafka performance without any code changes. It summarizes that Azul's JVM replaces HotSpot's garbage collector and JIT compiler with the C4 pauseless collector and the Falcon JIT compiler, eliminating stop-the-world garbage collection pauses and improving adaptive compilation. This results in up to 20% better performance for Kafka workloads and allowed one customer to reduce their cloud hardware costs by 15% while maintaining throughput.
Would you ever play an online game if you were not able to communicate with your teammates? Isn’t it fun if you can make new friends, arrange pre-made games and celebrate your victories with people you like to play with?
Riot Games’ League of Legends handles millions of online players at any given time. Each chat server is responsible for routing over 1 billion real-time events a day. In order to support the overwhelming user base, be prepared for future growth, and pave the road for upcoming features, the chat infrastructure had to be designed and built with the utmost care, so that it would never fail the players.
In this talk I would like to present how we achieved linear scalability, improved the overall fault tolerance, created a framework for real time code upgrades and got ready for the new features we want to ship. I will also discuss in detail why we chose to use Erlang as a foundation for the system, and why we migrated our data from MySQL to Riak.
Reactive mistakes - ScalaDays Chicago 2017 (Petr Zapletal)
Reactive applications are becoming a de facto industry standard and, if employed correctly, toolkits like the Lightbend Reactive Platform make the implementation easier than ever. But the design of these systems can be challenging, as it requires a particular mindset shift to tackle problems we might not be used to. In this talk we're going to discuss the most common issues I've seen in the field that prevented applications from working as expected. I'd like to talk about typical pitfalls that might cause trouble, trade-offs that might not be fully understood, and important choices that might be overlooked, including persistent actor pitfalls, tackling network partitions, proper implementations of graceful shutdown or distributed transactions, trade-offs of microservices or actors, and more.
This talk should be interesting for anyone who is thinking about, implementing, or has already deployed a reactive application. My goal is to provide a comprehensive explanation of common problems to make sure they won't be repeated by fellow developers. The talk is a little more focused on the Lightbend platform, but understanding the concepts we are going to discuss should be beneficial for everyone interested in this field.
The document discusses Mininet, an open source network emulator used for testing SDN ideas. It provides an overview of Mininet 1.0 and its functional fidelity before describing plans for Mininet 2.0 to improve performance fidelity through techniques like resource isolation, network invariants, and reproducible experiments. The document uses the example of DCTCP traffic to demonstrate how network invariants can validate emulator results.
[KCD GT 2023] Demystifying etcd failure scenarios for Kubernetes.pdf (William Caban)
Etcd is a key-value store used by Kubernetes to store cluster data. The document discusses etcd failure scenarios and myths regarding using etcd in Kubernetes clusters. It provides best practices for configuring etcd heartbeat and election timers and hardware specifications to maintain stability. Common etcd failure modes like leader failure, follower failure, and network partitions are also covered.
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... (HostedbyConfluent)
Many organizations use Apache Kafka® to build data pipelines that span multiple geographically distributed data centers, for use cases ranging from high availability and disaster recovery, to data aggregation and regulatory compliance.
The journey from single-cluster deployments to multi-cluster deployments can be daunting, as you need to deal with networking configurations, security models and operational challenges. Geo-replication support for Kafka has come a long way, with both open-source and commercial solutions that support various replication topologies and disaster recovery strategies.
So, grab your towel, and join us on this journey as we look at tools, practices, and patterns that can help us build reliable, scalable, secure, global (if not inter-galactic) data pipelines that meet your business needs, and might even save the world from certain destruction.
This document provides an agenda and overview of Kafka on Kubernetes. It begins with an introduction to Kafka fundamentals and messaging systems. It then discusses key ideas behind Kafka's architecture like data parallelism and batching. The rest of the document explains various Kafka concepts in detail like topics, partitions, producers, consumers, and replication. It also introduces Kubernetes concepts relevant for running Kafka like StatefulSets, StorageClasses and the operator pattern. The goal is to help understand how to build event-driven systems using Kafka and deploy it on Kubernetes.
HBaseCon 2013: Scalable Network Designs for Apache HBase (Cloudera, Inc.)
This document discusses scalable network designs and how modern networks can help applications. It begins with a brief history of network software and describes how switches now run Linux. Typical network designs are presented starting small and scaling up through multiple racks and core switches. The benefits of layer 3 designs, jumbo frames, and deep buffers to prevent packet loss are covered. Finally, it discusses how the network can help applications by detecting server failures, redirecting traffic, and enabling fast failover through features only possible by the switch running Linux.
Similar to Kafka Summit SF 2017 - Running Kafka as a Service at Scale
Building API data products on top of your real-time data infrastructureconfluent
This talk and live demonstration will examine how Confluent and Gravitee.io integrate to unlock value from streaming data through API products.
You will learn how data owners and API providers can document and secure data products built on top of Confluent brokers, with capabilities including schema validation, topic routing and message filtering.
You will also see how data and API consumers can discover and subscribe to products in a developer portal, as well as how they can integrate with Confluent topics through protocols like REST, WebSockets, Server-Sent Events and Webhooks.
Whether you want to monetize your real-time data, enable new integrations with partners, or provide self-service access to topics through various protocols, this webinar is for you!
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
In our exclusive webinar, you'll learn why event-driven architecture is the key to unlocking cost efficiency, operational effectiveness, and profitability. Gain insights on how this approach differs from API-driven methods and why it's essential for your organization's success.
Santander Stream Processing with Apache Flinkconfluent
Flink is becoming the de facto standard for stream processing due to its scalability, performance, fault tolerance, and language flexibility. It supports stream processing, batch processing, and analytics through one unified system. Developers choose Flink for its robust feature set and ability to handle stream processing workloads at large scales efficiently.
Unlocking the Power of IoT: A comprehensive approach to real-time insightsconfluent
In today's data-driven world, the Internet of Things (IoT) is revolutionizing industries and unlocking new possibilities. Join Data Reply, Confluent, and Imply as we unveil a comprehensive solution for IoT that harnesses the power of real-time insights.
Workshop híbrido: Stream Processing con Flinkconfluent
Stream processing is a prerequisite of the data streaming stack, powering real-time applications and pipelines.
It enables greater data portability, more efficient use of resources, and a better customer experience by processing data streams in real time.
In our hybrid hands-on workshop, you will learn how to easily filter, join, and enrich real-time data within Confluent Cloud using our serverless Flink service.
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...confluent
Our talk will explore the transformative impact of integrating Confluent, HiveMQ, and SparkPlug in Industry 4.0, emphasizing the creation of a Unified Namespace.
In addition to the creation of a Unified Namespace, our webinar will also delve into Stream Governance and Scaling, highlighting how these aspects are crucial for managing complex data flows and ensuring robust, scalable IIoT-Platforms.
You will learn how to ensure data accuracy and reliability, expand your data processing capabilities, and optimize your data management processes.
Don't miss out on this opportunity to learn from industry experts and take your business to the next level.
Event-driven architecture (EDA) will be the heart of MAPFRE's ecosystem. To stay competitive, today's companies depend more and more on real-time data analysis, which gives them faster insights and response times. Doing business on real-time data means building situational awareness: detecting and responding to what is happening in the world right now.
Eventos y Microservicios - Santander TechTalkconfluent
In this session we will examine how the worlds of events and microservices complement and improve each other, exploring how event-driven patterns let us decompose monoliths in a scalable, resilient, and decoupled way.
Q&A with Confluent Experts: Navigating Networking in Confluent Cloudconfluent
This document discusses networking options and best practices for Confluent Cloud. It provides an overview of public endpoints, private link, and peering options. It then discusses best practices for private networking architectures on Azure using hub-and-spoke and private link designs. Finally, it addresses networking considerations and challenges for Kafka Connect managed connectors, as well as planned enhancements for DNS peering and outbound private link support.
The purpose of the session is to take a dive into Apache Kafka, data streaming, and Kafka in the cloud:
- Dive into Apache Kafka
- Data Streaming
- Kafka in the cloud
Build real-time streaming data pipelines to AWS with Confluentconfluent
Traditional data pipelines often face scalability issues and challenges related to cost, their monolithic design, and reliance on batch data processing. They also typically operate under the premise that all data needs to be stored in a single centralized data source before it's put to practical use. Confluent Cloud on Amazon Web Services (AWS) provides a fully managed cloud-native platform that helps you simplify the way you build real-time data flows using streaming data pipelines and Apache Kafka.
Q&A with Confluent Professional Services: Confluent Service Meshconfluent
No matter whether you are migrating your Kafka cluster to Confluent Cloud, running a cloud-hybrid environment, or are in a different situation where data protection and encryption of sensitive information are required, Confluent Service Mesh allows you to transparently encrypt your data without the need to make code changes to your existing applications.
Citi Tech Talk: Event Driven Kafka Microservicesconfluent
Microservices have become a dominant architectural paradigm for building systems in the enterprise, but they are not without their tradeoffs. Learn how to build event-driven microservices with Apache Kafka.
Confluent & GSI Webinars series - Session 3confluent
An in-depth look at how Confluent is being used in the financial services industry. Gain an understanding of how organisations are utilising data in motion to solve common problems and benefit from their real-time data capabilities.
It will look more deeply into some specific use cases and show how Confluent technology is used to manage costs and mitigate risks.
This session is aimed at Solutions Architects, Sales Engineers and Pre-Sales, as well as more technically minded, business-aligned people. Whilst this is not a deeply technical session, some knowledge of Kafka would be helpful.
This document discusses moving to an event-driven architecture using Confluent. It begins by outlining some of the limitations of traditional messaging middleware approaches. Confluent provides benefits like stream processing, persistence, scalability and reliability while avoiding issues like lack of structure, slow consumers, and technical debt. The document then discusses how Confluent can help modernize architectures, enable new real-time use cases, and reduce costs through migration. It provides examples of how companies like Advance Auto Parts and Nord/LB have benefitted from implementing Confluent platforms.
This session will show why the old paradigm no longer works and why a new approach to data strategy is needed. It aims to show how a Data Streaming Platform is integral to the evolution of a company's data strategy, and how Confluent is not just an integration layer but the central nervous system of an organisation.
You will also learn how to:
• Build products and features faster with a complete suite of connectors and stream-management tools, and connect your environments to data pipelines
• Protect your most critical data and workloads with built-in security, governance, and resilience guarantees
• Deploy Kafka at scale in minutes while reducing the associated costs and operational burden
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...XfilesPro
Wondering how X-Sign gained popularity in such a short time span? This eSign functionality of XfilesPro DocuPrime has many advancements to offer Salesforce users. Explore them now!
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Paul Brebner
Closing talk for the Performance Engineering track at Community Over Code EU (Bratislava, Slovakia, June 5 2024) https://eu.communityovercode.org/sessions/2024/why-apache-kafka-clusters-are-like-galaxies-and-other-cosmic-kafka-quandaries-explored/ Instaclustr (now part of NetApp) manages hundreds of Apache Kafka clusters of many different sizes, for a variety of use cases and customers. For the last 7 years I’ve been focused outwardly on exploring Kafka application development challenges, but recently I decided to look inward and see what I could discover about the performance, scalability and resource characteristics of the Kafka clusters themselves. Using a suite of Performance Engineering techniques, I will reveal some surprising discoveries about cosmic Kafka mysteries in our data centres, related to: cluster sizes and distribution (using Zipf’s Law), horizontal vs. vertical scalability, and predicting Kafka performance using metrics, modelling and regression techniques. These insights are relevant to Kafka developers and operators.
Orca: Nocode Graphical Editor for Container OrchestrationPedro J. Molina
Tool demo on CEDI/SISTEDES/JISBD2024 at A Coruña, Spain. 2024.06.18
"Orca: Nocode Graphical Editor for Container Orchestration"
by Pedro J. Molina PhD. from Metadev
Voxxed Days Trieste 2024 - Unleashing the Power of Vector Search and Semantic...Luigi Fugaro
Vector databases are redefining data handling, enabling semantic searches across text, images, and audio encoded as vectors.
Redis OM for Java simplifies this innovative approach, making it accessible even for those new to vector data.
This presentation explores the cutting-edge features of vector search and semantic caching in Java, highlighting the Redis OM library through a demonstration application.
Redis OM has evolved to embrace the transformative world of vector database technology, now supporting Redis vector search and seamless integration with OpenAI, Hugging Face, LangChain, and LlamaIndex. This talk highlights the latest advancements in Redis OM, focusing on how it simplifies the complex process of vector indexing, data modeling, and querying for AI-powered applications. We will explore the new capabilities of Redis OM, including intuitive vector search interfaces and semantic caching, which reduce the overhead of large language model (LLM) calls.
Consistent toolbox talks are critical for maintaining workplace safety, as they provide regular opportunities to address specific hazards and reinforce safe practices.
These brief, focused sessions ensure that safety is a continual conversation rather than a one-time event, which helps keep safety protocols fresh in employees' minds. Studies have shown that shorter, more frequent training sessions are more effective for retention and behavior change compared to longer, infrequent sessions.
By engaging workers regularly, toolbox talks promote a culture of safety, empower employees to voice concerns, and ultimately reduce the likelihood of accidents and injuries on site.
The traditional method of conducting safety talks with paper documents and lengthy meetings is not only time-consuming but also less effective. Manual tracking of attendance and compliance is prone to errors and inconsistencies, leading to gaps in safety communication and potential non-compliance with OSHA regulations. Switching to a digital solution like Safelyio offers significant advantages.
Safelyio automates the delivery and documentation of safety talks, ensuring consistency and accessibility. The microlearning approach breaks down complex safety protocols into manageable, bite-sized pieces, making it easier for employees to absorb and retain information.
This method minimizes disruptions to work schedules, eliminates the hassle of paperwork, and ensures that all safety communications are tracked and recorded accurately. Ultimately, using a digital platform like Safelyio enhances engagement, compliance, and overall safety performance on site. https://safelyio.com/
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...kalichargn70th171
In today's fiercely competitive mobile app market, the role of the QA team is pivotal for continuous improvement and sustained success. Effective testing strategies are essential to navigate the challenges confidently and precisely. Ensuring the perfection of mobile apps before they reach end-users requires thoughtful decisions in the testing plan.
The Comprehensive Guide to Validating Audio-Visual Performances.pdfkalichargn70th171
Ensuring the optimal performance of your audio-visual (AV) equipment is crucial for delivering exceptional experiences. AV performance validation is a critical process that verifies the quality and functionality of your AV setup. Whether you're a content creator, a business conducting webinars, or a homeowner creating a home theater, validating your AV performance is essential.
Enhanced Screen Flows UI/UX using SLDS with Tom KittPeter Caitens
Join us for an engaging session led by Flow Champion, Tom Kitt. This session will dive into a technique of enhancing the user interfaces and user experiences within Screen Flows using the Salesforce Lightning Design System (SLDS). This technique uses Native functionality, with No Apex Code, No Custom Components and No Managed Packages required.
Alluxio Webinar | 10x Faster Trino Queries on Your Data PlatformAlluxio, Inc.
Alluxio Webinar
June 18, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jianjian Xie (Staff Software Engineer, Alluxio)
As Trino users increasingly rely on cloud object storage for retrieving data, speed and cloud cost have become major challenges. The separation of compute and storage creates latency challenges when querying datasets; scanning data between storage and compute tiers becomes I/O bound. On the other hand, cloud API costs related to GET/LIST operations and cross-region data transfer add up quickly.
The newly introduced Trino file system cache by Alluxio aims to overcome the above challenges. In this session, Jianjian will dive into Trino data caching strategies, the latest test results, and discuss the multi-level caching architecture. This architecture makes Trino 10x faster for data lakes of any scale, from GB to EB.
What you will learn:
- Challenges relating to the speed and costs of running Trino in the cloud
- The new Trino file system cache feature overview, including the latest development status and test results
- A multi-level cache framework for maximized speed, including Trino file system cache and Alluxio distributed cache
- Real-world cases, including a large online payment firm and a top ridesharing company
- The future roadmap of Trino file system cache and Trino-Alluxio integration
What is Continuous Testing in DevOps - A Definitive Guide.pdfkalichargn70th171
Once an overlooked aspect, continuous testing has become indispensable for enterprises striving to accelerate application delivery and reduce business impacts. According to a Statista report, 31.3% of global enterprises have embraced continuous integration and deployment within their DevOps, signaling a pervasive trend toward hastening release cycles.
Superpower Your Apache Kafka Applications Development with Complementary Open...Paul Brebner
Kafka Summit talk (Bangalore, India, May 2, 2024, https://events.bizzabo.com/573863/agenda/session/1300469 )
Many Apache Kafka use cases take advantage of Kafka’s ability to integrate multiple heterogeneous systems for stream processing and real-time machine learning scenarios. But Kafka also exists in a rich ecosystem of related but complementary stream processing technologies and tools, particularly from the open-source community. In this talk, we’ll take you on a tour of a selection of complementary tools that can make Kafka even more powerful. We’ll focus on tools for stream processing and querying, streaming machine learning, stream visibility and observation, stream meta-data, stream visualisation, stream development including testing and the use of Generative AI and LLMs, and stream performance and scalability. By the end you will have a good idea of the types of Kafka “superhero” tools that exist, which are my favourites (and what superpowers they have), and how they combine to save your Kafka applications development universe from swamploads of data stagnation monsters!
45. Where did my topic partitions go?
[Diagram: a Kafka cluster of Brokers 1 through N, with producers writing in, consumers reading out, topic partitions replicated across brokers, and one broker acting as the Controller]
46. Responsibilities of the Controller
● Leader Election
● Replica Reassignment
● Create Topic
● Delete Topic
● Add Partitions
● Broker start and shutdown
No Controller - No Cluster
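As an illustration of the controller-driven operations listed above, here is a minimal sketch using the stock Kafka CLI tools of the ZooKeeper era this deck targets (the topic name, partition counts, and file names are hypothetical):
# Create a topic; the controller assigns leaders for the new partitions
bin/kafka-topics --zookeeper localhost:2181 --create \
  --topic demo-topic --partitions 6 --replication-factor 3
# Trigger a preferred-replica (leader) election across the cluster
bin/kafka-preferred-replica-election --zookeeper localhost:2181
# Move replicas between brokers; the controller executes the plan
bin/kafka-reassign-partitions --zookeeper localhost:2181 \
  --reassignment-json-file reassignment.json --execute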
47. Lack of Metrics
● Controller state - topic creation, topic deletion, etc.
● Time taken to perform an operation
● Rate at which an admin operation is performed
● Queue sizes within the controller
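For monitoring what the controller is doing, the broker does expose some JMX metrics; a hedged sketch using Kafka's bundled JmxTool (the JMX port is an assumption, and the EventQueueSize MBean exists only on brokers with the reworked controller):
# Which broker is the active controller? (1 on the controller, 0 elsewhere)
bin/kafka-run-class kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name kafka.controller:type=KafkaController,name=ActiveControllerCount
# How backed up is the controller's event queue?
bin/kafka-run-class kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name kafka.controller:type=ControllerEventManager,name=EventQueueSize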
49. Root Cause: Upgrades!
● Synchronous per-partition ZooKeeper writes
● Sequential per-partition controller-to-broker requests
● Complicated concurrency semantics
● No separation of control plane from data plane
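To see why per-partition ZooKeeper writes hurt at scale: every partition keeps its leader/ISR state in its own znode, so a controller operation touches one znode per partition. A small sketch with the bundled zookeeper-shell (the topic name is hypothetical):
# Each partition's leader and ISR live in a separate znode
bin/zookeeper-shell localhost:2181 \
  get /brokers/topics/demo-topic/partitions/0/state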
51. Zero controller downtime!
● Highly available cluster
● 10x faster leader elections
● More topic partitions per cluster
● Faster broker shutdown and upgrades
53. Can I get my latency?
[Diagram: the same Kafka cluster view - Brokers 1 through N with producers, consumers, and replicated topic partitions]
54. But I have a bytes quota set
● Throttles the byte rate per second on the broker
● Responses are delayed once the threshold is exceeded
● Keeps misbehaving clients from consuming all the bandwidth
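For reference, a hedged sketch of how such byte-rate quotas are set with the stock tooling (the user name and rates are illustrative):
# Cap user1 at 1 MB/s produce and 2 MB/s fetch
bin/kafka-configs --zookeeper localhost:2181 --alter \
  --add-config 'producer_byte_rate=1048576,consumer_byte_rate=2097152' \
  --entity-type users --entity-name user1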
56. Root Cause
● Too many small-sized requests
● DDoS attack from a client
● Decompression on the server takes a long time
● With more consumer instances, more requests
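One lever against a flood of tiny fetch requests is to make consumers batch; a minimal sketch with the console consumer (the topic name and values are illustrative):
# Have the broker wait for at least 64 KB, or 500 ms, before answering a fetch,
# turning many tiny requests into fewer, larger ones
bin/kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic demo-topic \
  --consumer-property fetch.min.bytes=65536 \
  --consumer-property fetch.max.wait.ms=500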
57. 57
bin/kafka-configs --zookeeper localhost:2181 --alter --add-config
'request_percentage=50' --entity-name user1 --entity-type users
Request Quota - percentage of time a client can spend on request
handler (I/O) threads and network threads within each quota window
Predictable Latency
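To confirm a quota override took effect, the same tool can describe it (user name as above):
# List the quota overrides configured for user1
bin/kafka-configs --zookeeper localhost:2181 --describe \
  --entity-type users --entity-name user1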