This document summarizes how Booking.com solved scalability issues with their Event Graphite Processor (EGP) system. The EGP processes large volumes of event data to generate metrics but was limited by high RAM usage. A new approach was developed that uses event streaming and parallelization, cutting processing time from 80 seconds to 40 seconds while using far less RAM. The migration was done in a one-day hackathon in which 8 people rewrote 260 monitors. The new system uses 56-core servers, processes events in parallel groups, and requires only 500MB of RAM compared to the previous 16GB.
8. Event Graphite Processor.
• The dataset is huge and it’s growing
• Every second of events takes 10–15GB of RAM
• Monitors are split into groups to run faster
• Every group runs in a fork
• Forking provokes copy-on-write (COW)
• RAM is being saturated
• No RAM = the box is being kicked out
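The fork-per-group model on this slide can be sketched roughly as follows. This is a minimal illustration for a POSIX system, not the EGP code: the event fields, the monitor names, and the helpers `split_into_groups` and `run_groups` are all hypothetical.

```python
import os

# Hypothetical data: one second of events, loaded fully into memory up front.
events = [{"type": "booking" if i % 3 == 0 else "search", "ms": i % 50}
          for i in range(10_000)]

# Hypothetical monitors: each one turns the full event list into one metric.
monitors = {
    "bookings.count": lambda evs: sum(1 for e in evs if e["type"] == "booking"),
    "searches.count": lambda evs: sum(1 for e in evs if e["type"] == "search"),
    "latency.max_ms": lambda evs: max(e["ms"] for e in evs),
}

def split_into_groups(names, n_groups):
    """Distribute monitor names round-robin over n_groups groups."""
    names = sorted(names)
    return [names[i::n_groups] for i in range(n_groups)]

def run_groups(groups):
    """Run every group in its own forked child and collect the metrics."""
    children = []
    for group in groups:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: starts out sharing the parent's pages copy-on-write.
            # Even read-only access can dirty pages in a refcounted runtime,
            # so each child may end up with its own copy of the event data.
            os.close(read_fd)
            lines = [f"{name} {monitors[name](events)}" for name in group]
            os.write(write_fd, "\n".join(lines).encode())
            os.close(write_fd)
            os._exit(0)
        # Parent: keep only the read end, then fork the next group.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = {}
    for pid, read_fd in children:
        with os.fdopen(read_fd) as pipe:
            for line in pipe.read().splitlines():
                name, value = line.split()
                results[name] = int(value)
        os.waitpid(pid, 0)
    return results

metrics = run_groups(split_into_groups(monitors, 2))
```

With N groups, the worst case is N private copies of the same event data, which is one way a box can run out of RAM.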
18. First results: promising.
• CPU: no changes
• Processing time: 60sec vs 30sec
• # of boxes: 20 vs 8
• RAM: 10GB vs 100MB
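The slides shown here do not say how the proof of concept reduced memory use; the summary attributes it to event streaming. As a hedged sketch of that general idea only, the code below keeps small running aggregates and looks at one event at a time, so memory stays roughly constant regardless of event volume. The event source `event_stream`, the event fields, and the class `StreamingMonitors` are hypothetical.

```python
from collections import Counter

def event_stream(n):
    """Hypothetical event source: yields one event at a time, never a full list."""
    for i in range(n):
        yield {"type": "booking" if i % 3 == 0 else "search", "ms": i % 50}

class StreamingMonitors:
    """Keeps only running aggregates, not the events themselves."""

    def __init__(self):
        self.counts = Counter()
        self.max_ms = 0

    def observe(self, event):
        # Update every aggregate from this single event, then let it go.
        self.counts[event["type"]] += 1
        self.max_ms = max(self.max_ms, event["ms"])

    def metrics(self):
        return {
            "bookings.count": self.counts["booking"],
            "searches.count": self.counts["search"],
            "latency.max_ms": self.max_ms,
        }

state = StreamingMonitors()
for event in event_stream(10_000):
    state.observe(event)
stream_metrics = state.metrics()
```

The trade-off is that every monitor has to be expressible as an incremental update, which may explain why a migration of all monitors was needed.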
19. EGP: Time to act!
1. RAM is an issue
2. New user monitors
3. New systems
4. More events every day
20. EGP migration TODO
1. Implement a proof of concept
2. Freeze EGP development
3. Migrate all monitors
4. Full-scale test
5. Roll out the new system
6. Profit!
21. Migration. Hackathon. Results
1. 8 people
2. All done in 1 day
3. 260 monitors
4. 317 files changed, 10336 insertions, 11288 deletions
5. Ready to run a full-scale test
28. The results. The new system
1. Processing time: 80sec vs 40sec
2. RAM: 16GB vs 500MB
3. # of boxes: 80 vs 30
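To put the figures on this slide in proportion, the short calculation below turns each pair into a reduction factor. It assumes the first number in each pair is the old system and the second the new one, and that 1GB = 1024MB; the name `reduction_factor` is only illustrative.

```python
# (old, new) pairs taken from the results slide; the RAM pair is converted to MB.
results = {
    "processing time (s)": (80, 40),
    "RAM (MB)": (16 * 1024, 500),
    "boxes": (80, 30),
}

def reduction_factor(old, new):
    """How many times smaller the new value is than the old one."""
    return old / new

factors = {name: round(reduction_factor(old, new), 1)
           for name, (old, new) in results.items()}

for name, factor in factors.items():
    print(f"{name}: {factor}x smaller")
```

Under those assumptions, processing time halves, the number of boxes drops by a factor of about 2.7, and RAM drops by a factor of about 33.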
29. Lessons learned.
1. Engineering is the king, collaboration is the queen
2. The ideas that failed individually might work together
3. Challenge everything