Wix handles event-driven traffic at a huge scale: more than 70 billion Kafka business events per day.
Over the past few years Wix has made a gradual transition to an event-driven architecture for its 2000 microservices.
We have made mistakes along the way, but we have improved and learned a lot about how to keep our production maintainable, performant and resilient.
In this talk you will hear about the lessons we learned, including:
1. The importance of atomic operations for databases and events
2. Avoiding data consistency issues due to out-of-order and duplicate processing
3. Having essential event debugging and quick-fix tools in production
And a few more.
3. Lessons Learned from 2000 Event-driven Microservices @NSilnitsky
Unique visitors use Wix platform every month: ~1B
Daily HTTP transactions: ~500B
Kafka messages a day: ~70B
GAs every day: > 600
Microservices in production: 2500
* scale, resilience, issues
6. Challenges of event-driven architecture that we’ve bumped into
1. Producing message failures
2. Processing out-of-order & duplicates
3. Sending large payloads
4. Troubleshooting production
* success, tools, faster
7. How Event-driven Architecture Works
8. Service-to-Service Communication
Diagram: Cart Service, User Service, Inventory Service, Catalog Service
9. Request-Reply Communication
Diagram: Cart Service, User Service, Inventory Service, Catalog Service, with HTTP RPC calls between them
* issue scale
12. Event-driven Communication
Diagram: Catalog Service (Producer) sends an Event to the Product Updated Topic on the Kafka Broker
* improve, broker, scale
13. Broker: more robust
Diagram: Catalog Service (Producer), Product Updated Topic on the Kafka Broker, Cart Service (Consumer)
* DB, decoupling, no impact
14. Broker: event processing is guaranteed
Diagram: Catalog Service (Producer), Product Updated Topic on the Kafka Broker, Cart Service (Consumer)
15. The following is based on a true story
*Dates and products were changed for clarity :)
* ecom simple linear
16. Timeline
2016: Wix starts using event-driven architecture
We can work event-driven!!
17. It all began when Ecom experienced data issues
Data does NOT reflect actual catalog
Risk: show wrong prices in cart
Diagram: Cart DB
18. After investigating
1. Update status
2. Produce “Product Updated” event
3. Update product price
4. Show updated prices in cart
Diagram: Catalog Service, Broker, Cart Service, Cart DB
20. Make DB Update & Event Producing Atomic
Diagram: Catalog Service, Broker, Cart Service
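One common way to make a DB update and event producing atomic is a transactional outbox: the event is written to an outbox table in the same DB transaction as the data change, and a separate relay produces it later. This is a generic pattern shown for illustration, not necessarily the approach taken at Wix. A minimal sketch, assuming SQLite and hypothetical names (`open_db`, `update_price`, `relay_outbox`, the `product-updated` topic):

```python
import json
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    """Create an in-memory DB with a data table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id TEXT PRIMARY KEY, price REAL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT)"
    )
    return conn


def update_price(conn: sqlite3.Connection, product_id: str, price: float) -> None:
    """Write the new price and its event in one transaction."""
    # `with conn` commits both statements together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO products (id, price) VALUES (?, ?)",
            (product_id, price),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("product-updated", json.dumps({"productId": product_id, "price": price})),
        )


def relay_outbox(conn: sqlite3.Connection, produce: Callable[[str, str], None]) -> int:
    """Produce pending outbox rows in order; delete a row only after it was produced."""
    rows = conn.execute("SELECT seq, topic, payload FROM outbox ORDER BY seq").fetchall()
    sent = 0
    for seq, topic, payload in rows:
        produce(topic, payload)  # may raise; the row then stays for the next run
        with conn:
            conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
        sent += 1
    return sent
```

If the relay crashes after producing but before deleting the row, the event is produced again on the next run, so consumers still need to tolerate duplicates.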
21. Catch Unsent Events
Diagram: Catalog Service (Resilient Producer), Broker; Produce event to S3
22. Fallback to S3 and Heal
Diagram: Catalog Service (Resilient Producer) produces event to S3; Healer Service polls and produces to Kafka (Broker)
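The fallback-and-heal flow can be sketched roughly as follows. This is not Greyhound's actual API: `ResilientProducer`, `heal`, and the plain list standing in for S3 are hypothetical names chosen for illustration, and `produce` stands for whatever function sends to Kafka.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, bytes]  # (topic, payload)
Produce = Callable[[str, bytes], None]


class ResilientProducer:
    """Tries the broker first; on failure, parks the event in a fallback store."""

    def __init__(self, produce: Produce, fallback_store: List[Event]) -> None:
        self._produce = produce
        self._fallback = fallback_store  # hypothetical stand-in for S3

    def send(self, topic: str, payload: bytes) -> bool:
        try:
            self._produce(topic, payload)
            return True
        except Exception:
            # The event is not lost: a healer can pick it up later.
            self._fallback.append((topic, payload))
            return False


def heal(fallback_store: List[Event], produce: Produce) -> int:
    """Healer step: poll the fallback store and re-produce what it holds."""
    healed = 0
    while fallback_store:
        topic, payload = fallback_store[0]
        try:
            produce(topic, payload)
        except Exception:
            break  # broker still unavailable; try again on the next poll
        fallback_store.pop(0)
        healed += 1
    return healed
```

Note that a healed event can reach the broker after newer events that were produced directly, so this design trades ordering for delivery.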
23. Wrap Kafka with Greyhound*
Diagram: Service A (Greyhound Producer wrapping Kafka Producer), Kafka Broker, Service B (Greyhound Consumer wrapping Kafka Consumer)
* Open source: https://github.com/wix/greyhound
25. Timeline
2016: Wix starts using event-driven architecture
2018: Greyhound resilient producer & consumer retries
27. Then ‘out-of-order’ happened
Diagram: Catalog Service, Healer Service, Broker, Cart Service; events: Introduce Discount, Remove Discount
29. Mitigating out-of-order with revision ID
Diagram: Catalog Service, Healer Service, Broker, Cart Service; event: Introduce Discount; revisions: #10, #9
30. Mitigating out-of-order with revision ID
Diagram: Catalog Service, Healer Service, Broker, Cart Service; events: Introduce Discount, Remove Discount; revisions: #11, #10, #9
* item itself
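A consumer can use the revision ID to drop stale events. A minimal sketch, assuming every event carries the item's revision number; the class and method names (`CartPriceView`, `apply`) are hypothetical:

```python
from typing import Dict, Tuple


class CartPriceView:
    """Keeps the latest known price per product, guarded by revision ID."""

    def __init__(self) -> None:
        # product_id -> (revision, price)
        self._items: Dict[str, Tuple[int, float]] = {}

    def apply(self, product_id: str, revision: int, price: float) -> bool:
        """Apply an update only if it is newer than what is already stored."""
        current = self._items.get(product_id)
        if current is not None and revision <= current[0]:
            return False  # stale or duplicate event: ignore it
        self._items[product_id] = (revision, price)
        return True

    def price(self, product_id: str) -> float:
        return self._items[product_id][1]
```

For example, if revision #11 (discount removed) arrives before a healed revision #10 (discount introduced), the late #10 is ignored and the cart keeps the price from #11.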
31. Mitigating out-of-order with Debezium connector
Scan the binlog. For each entry produce a ‘status update’ event
Diagram: Catalog Service, Broker, Cart Service
32. More Ecom data issues
Data does NOT reflect actual inventory
Risk: lose potential customers
Diagram: Inventory DB
33. Investigation leads to duplicate processing
Diagram: Payments Service, Broker, Inventory Service; Retry
Payment for: Item 1, Item 2
Item 1: 9 → 7
Item 2: 5 → 3
* not idempotent
34. Mitigating duplicates with Transaction ID
Diagram: Payments Service, Broker, Inventory Service; txnId - a7g45
Payment for: Item 1, Item 2
Item 1: 9 → 8
Item 2: 5 → 4
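Deduplicating by transaction ID makes the inventory update idempotent. A minimal in-memory sketch; the `Inventory` class and `on_payment` method are hypothetical names, and a real service would likely persist the seen IDs together with the stock change:

```python
from typing import Dict, Iterable, Set


class Inventory:
    """Decrements stock once per transaction ID, however often the event is retried."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = dict(stock)
        self._seen_txn_ids: Set[str] = set()

    def on_payment(self, txn_id: str, item_ids: Iterable[str]) -> bool:
        if txn_id in self._seen_txn_ids:
            return False  # duplicate delivery (e.g. a retry): skip it
        for item_id in item_ids:
            self.stock[item_id] -= 1
        self._seen_txn_ids.add(txn_id)
        return True
```

With stock of 9 and 5, processing txnId `a7g45` twice leaves 8 and 4, instead of the 7 and 3 produced by the non-idempotent handler.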
36. “Dude, I can’t produce large payloads”
Diagram: Product Catalog Service, Product Update event, Broker, Cart Service
...
"description": "An apple mobile which is nothing like apple",
...
37. Challenge #3
Failure to send large payloads
Diagram: Broker
...
"description": "An apple mobile which is nothing like apple",
...
* 1MB
38. Large Payloads
Remedy I
Compression
→ Try several compression types (lz4, snappy, etc.)
→ Compression on Kafka level is usually better than application level, as payloads can be compressed in batches
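Why batching helps can be illustrated with a small comparison. This sketch uses `zlib` from the Python standard library as a stand-in for lz4 or snappy, and made-up sample events; the function names are hypothetical. In Kafka itself, batch compression is chosen with the producer's `compression.type` setting.

```python
import json
import zlib
from typing import List


def make_events(n: int) -> List[bytes]:
    """Build n similar JSON events (sample data for illustration only)."""
    return [
        json.dumps(
            {
                "productId": f"p{i}",
                "description": "An apple mobile which is nothing like apple",
                "price": 100 + i,
            }
        ).encode("utf-8")
        for i in range(n)
    ]


def compressed_size_per_message(events: List[bytes]) -> int:
    """Application-level style: each payload is compressed on its own."""
    return sum(len(zlib.compress(event)) for event in events)


def compressed_size_as_batch(events: List[bytes]) -> int:
    """Batch style: similar payloads share one compression context."""
    return len(zlib.compress(b"\n".join(events)))
```

Because the events repeat the same field names and similar values, compressing them together removes redundancy that per-message compression cannot see.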
39. Large Payloads
Remedy II
Chunking
1. Split to chunks & produce
2. Consume & reassemble
Diagram: Product Catalog Service, Broker, Cart Service
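The two chunking steps can be sketched as below. The `Chunk` layout (message ID, index, total) and the names `split` and `Reassembler` are assumptions made for illustration, not a format described in the talk:

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Chunk:
    message_id: str  # ties chunks of the same payload together
    index: int       # position of this chunk within the payload
    total: int       # number of chunks the payload was split into
    data: bytes


def split(payload: bytes, max_chunk_bytes: int) -> List[Chunk]:
    """Step 1: split a payload into chunks small enough to produce."""
    if max_chunk_bytes <= 0:
        raise ValueError("max_chunk_bytes must be positive")
    message_id = uuid.uuid4().hex
    parts = [
        payload[i : i + max_chunk_bytes]
        for i in range(0, len(payload), max_chunk_bytes)
    ] or [b""]
    return [Chunk(message_id, i, len(parts), part) for i, part in enumerate(parts)]


class Reassembler:
    """Step 2: collect consumed chunks and return the payload once complete."""

    def __init__(self) -> None:
        self._pending: Dict[str, Dict[int, bytes]] = {}

    def add(self, chunk: Chunk) -> Optional[bytes]:
        parts = self._pending.setdefault(chunk.message_id, {})
        parts[chunk.index] = chunk.data  # a duplicate chunk just overwrites itself
        if len(parts) < chunk.total:
            return None
        del self._pending[chunk.message_id]
        return b"".join(parts[i] for i in range(chunk.total))
```

Keying the pending parts by index lets the consumer tolerate chunks that arrive out of order or more than once.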
40. Large Payloads
Remedy III
Reference to Object Store
1. Upload to S3
2. Produce with S3 URL
3. Consume & download from S3
Diagram: Product Catalog Service, Broker, Cart Service
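The three steps above can be sketched with an in-memory stand-in for the object store. `InMemoryObjectStore`, `to_event`, `from_event`, and the JSON envelope with `ref` and `inline` fields are hypothetical and are not the S3 API; the sketch also assumes payloads are UTF-8 text.

```python
import json
import uuid
from typing import Dict


class InMemoryObjectStore:
    """Hypothetical stand-in for S3: stores blobs under generated keys."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = f"payloads/{uuid.uuid4().hex}"
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


def to_event(payload: bytes, store: InMemoryObjectStore, inline_limit_bytes: int) -> bytes:
    """Producer side: upload large payloads and produce only a reference."""
    if len(payload) <= inline_limit_bytes:
        return json.dumps({"inline": payload.decode("utf-8")}).encode("utf-8")
    return json.dumps({"ref": store.put(payload)}).encode("utf-8")


def from_event(event: bytes, store: InMemoryObjectStore) -> bytes:
    """Consumer side: download the payload if the event only carries a reference."""
    body = json.loads(event)
    if "ref" in body:
        return store.get(body["ref"])
    return body["inline"].encode("utf-8")
```

The event on the broker stays small regardless of payload size, at the cost of an extra download on the consumer side.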
41. Timeline
2016: Wix starts using event-driven architecture
2018: Greyhound resilient producer & consumer retries
2019: We use IDs for out-of-order (ooo) & duplicates
2020: Added compression by default
42. Challenge #4
It’s hard for developers to debug and maintain event-driven microservices at scale in production
* bottlenecks
54. Timeline
2016: Wix starts using event-driven architecture
2018: We open source Greyhound
2019: We use IDs for out-of-order (ooo) & duplicates
2020: Added compression by default
2021-22: Tools in production
55. Wix developers have embraced event-driven architecture.
56. Meeting these challenges made our microservices more decoupled, resilient and scalable, while keeping complexity low and data consistent.
57. The Blog Post
https://medium.com/wix-engineering/event-driven-architecture-5-pitfalls-to-avoid-b3ebf885bdb1
58. The Next Step
How to migrate 2000 microservices to Multi Cluster Managed Kafka with 0 Downtime
https://www.youtube.com/watch?v=XKbG8a-9NRE
59. Greyhound
github.com/wix/greyhound