As Wix's Kafka usage grew to 2.5B messages per day, >20K topics and >100K leader partitions serving 2000 microservices,
we decided to migrate from a self-operated single cluster per data center to a managed cloud service (like Amazon MSK or Confluent Cloud) with a multi-cluster setup.
The classic approach would be to perform this transition when all incoming traffic is removed from the data center.
But draining an entire data center for an undetermined period of time, until all 2000 services completed the switch, was too risky for us.
This talk is about how we gradually migrated all of our Kafka consumers and producers with 0 downtime while they continued to handle regular traffic. You will learn practical steps you can take to greatly reduce the risks and speed up the migration timeline.
Migrating to Multi Cluster Managed Kafka - ApacheKafkaIL
1. @NSilnitsky
Migrating to Multi Cluster Managed Kafka
Migrating to a Multi-Cluster Managed Kafka with 0 Downtime
Natan Silnitsky, Backend Infra TL, Wix
natansil.com twitter@NSilnitsky linkedin/natansilnitsky github.com/natansil
6. @NSilnitsky
Our mission
Enable any business, community or person on earth to make their online dream come true.
7. @NSilnitsky
At Wix
±1B unique visitors use the Wix platform every month
>100M websites published by Wix users (5-7% of all internet websites)
±5000 people work at Wix
8. @NSilnitsky
Scaling out
500B daily HTTP transactions
>600 GAs per day
~1500 developers
18 data centers & PoPs
2 cloud providers [Google/AWS]
9. @NSilnitsky
Wix has over 2500 microservices in production
11. @NSilnitsky
Kafka in Wix
2019: 1 cluster per region, self hosted
5K topics
>45K partitions
~450M messages produced a day
12. @NSilnitsky
Kafka in Wix Last Year
2021: 1 cluster per region, self hosted
20K topics
>200K partitions
3B messages produced a day
* still only one
13. @NSilnitsky
So, migrate all this
2021: 1 cluster per region, self hosted, overloaded
20K topics
>200K partitions
3B messages produced a day
To a multi-cluster, managed Kafka platform
1. Better cluster performance & flexibility
2. Transparent version upgrade
3. Easy to add a new cluster
4. Tiered Storage
14. @NSilnitsky
Wix wraps Kafka with Greyhound, a Scala/Java high-level SDK.
~2500 Wix microservices use the Greyhound Producer and Greyhound Consumer, which wrap the Kafka Producer and Kafka Consumer.
15. @NSilnitsky
Example (diagram): a Checkout Service and a Payments Service communicate through the Kafka Broker, using Greyhound Producer / Kafka Producer on one side and Kafka Consumer / Greyhound Consumer on the other.
23. Agenda
1. The Multi Cluster (Kafka)
2. The Migration
3. What to Expect
24. @NSilnitsky
Our Starting Point
→ Unbalanced brokers
→ Unclear Kafka strategy
→ Too many partitions
→ Real production impact
From: Single Cluster (overloaded, self-hosted)
To: Multi Cluster (managed, optimized)
25. @NSilnitsky
Migrate on drained traffic?
Migrations we ❤
26. @NSilnitsky
Migrate on drained traffic?
Blockers:
→ Specific DC services
→ Long time
→ Not gradual - edge cases risk
Q4 2020: CANCELED
27. @NSilnitsky
Migrate With Traffic!
… HAS to be seamless & production-safe.
36. @NSilnitsky
The Migration
Best Practices
1. Create a script that checks state by itself and stops if expected state is not reached.
2. Have a rollback readily available.
3. Start with test topics and no-impact topics.
4. Create custom metrics dashboards that show current state.
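The first two practices can be illustrated with a minimal Python sketch. It is not the tooling used at Wix: the `Step` type, `run_migration`, `make_step`, and the in-memory `migrated` set are all hypothetical names invented for this illustration. Each step applies a change, checks the resulting state by itself, and on the first failed check the run stops and rolls back what it has done so far.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One migration step (hypothetical shape): action, self-check, rollback."""
    name: str
    apply: Callable[[], None]
    check: Callable[[], bool]      # True when the expected state was reached
    rollback: Callable[[], None]


def run_migration(steps: List[Step]) -> List[str]:
    """Apply steps in order. On the first failed check, roll back every step
    applied in this run (most recent first) and stop. Returns a log."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        step.apply()
        done.append(step)
        if step.check():
            log.append(f"ok: {step.name}")
            continue
        log.append(f"failed check: {step.name}")
        for prev in reversed(done):
            prev.rollback()
            log.append(f"rolled back: {prev.name}")
        break
    return log


# Toy in-memory state standing in for real cluster metadata (assumption).
migrated = set()


def make_step(topic: str, healthy: bool = True) -> Step:
    """Build a step that 'migrates' one topic; healthy=False simulates a bad state."""
    return Step(
        name=topic,
        apply=lambda: migrated.add(topic),
        check=lambda: healthy and topic in migrated,
        rollback=lambda: migrated.discard(topic),
    )


if __name__ == "__main__":
    # Start with test and no-impact topics, as the list above recommends.
    plan = [make_step("test-topic"), make_step("no-impact-topic", healthy=False),
            make_step("orders")]
    for line in run_migration(plan):
        print(line)
```

Rolling back the whole run is one possible design choice; a real script might instead roll back only the failing step and leave earlier, verified steps in place.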
37. Agenda
1. The Multi Cluster (Kafka)
2. The Migration
3. What to Expect
39. @NSilnitsky
What to Expect when Migrating to Multi Cluster
During Nov/Dec 2021
(Diagram: a Replicator service copies Topic 1, Topic 2, ... Topic N from the On-Prem Kafka Cluster to the Managed Kafka Cluster.)
“Unexpected error from SyncGroup: The server experienced an unexpected error when processing the request.”
40. @NSilnitsky
kafka-configs.sh --bootstrap-server localhost:6667 \
  --entity-type brokers --entity-default --alter \
  --add-config message.max.bytes=4194304
41. @NSilnitsky
It’s Christmas Eve 🎄 2021.
“Unexpected error from SyncGroup: The server experienced an unexpected error when processing the request.”
42. @NSilnitsky
kafka-configs.sh --bootstrap-server localhost:6667 \
  --entity-type brokers --entity-default --alter \
  --add-config message.max.bytes=8388608
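The two `message.max.bytes` values on these slides are 4 MiB and 8 MiB expressed in bytes. The small Python sketch below only shows that arithmetic and rebuilds the command text from the slide; `mib_to_bytes` and `alter_command` are hypothetical helper names, and the sketch does not contact a broker.

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return mib * 1024 * 1024


def alter_command(bootstrap: str, max_bytes: int) -> str:
    """Build the kafka-configs.sh invocation shown on the slides as a string."""
    return (
        f"kafka-configs.sh --bootstrap-server {bootstrap} "
        "--entity-type brokers --entity-default --alter "
        f"--add-config message.max.bytes={max_bytes}"
    )


if __name__ == "__main__":
    for mib in (4, 8):
        print(alter_command("localhost:6667", mib_to_bytes(mib)))
```

Generating the command from a size in MiB keeps the intended limit readable and avoids typing a long byte count by hand.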
43. @NSilnitsky
Kafka records start getting DELETED faster than expected (for compacted topics too)
44. @NSilnitsky
Luckily for us, we restored the records from another data center.
45. @NSilnitsky
Affected broker versions: 1.1.0, 2.0.1, 2.1.1, 2.2.2, 2.4.0, 2.3.1
Fix: Changed dummy value for all topic configs
48. @NSilnitsky
We used Greyhound & dedicated orchestration services for an automatic, safe, and gradual migration.
Migrations we ❤
51. @NSilnitsky
Great Confluent Podcast Episode: Migrate Your Kafka Cluster with Minimal Downtime
https://www.youtube.com/watch?v=oqRiagSnYfQ
52. @NSilnitsky
Thank You!
natansil.com twitter@NSilnitsky linkedin/natansilnitsky github.com/natansil
👉 slideshare.net/NatanSilnitsky
Any questions?