Solving some of the scalability
problems at Booking.com
Ivan Kruglov
YAPC::EU 2017
based on Oleg Sidorov’s slides
Event Graphite Processor
What is an event?
• message with technical and business data
• Sereal encoded
• srl([ $e, $e, $e, … ])
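A minimal sketch of producing such a blob with Sereal::Encoder; the event fields below are made up for illustration, not the real schema.

use Sereal::Encoder;

# An epoch of events: an arrayref of hashes mixing technical and
# business data (field names are illustrative).
my $events = [
    { type => 'web',  dc => 'eu-1', epoch => 1503150000, booking_value => 120 },
    { type => 'cron', dc => 'eu-2', epoch => 1503150000, job => 'cleanup' },
];

# srl([ $e, $e, $e, ... ]): one Sereal document wrapping the whole epoch.
my $srl_events = Sereal::Encoder->new->encode($events);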
[Diagram: event sources (WEB, CRON, e-mail, FAX, MySQL, VoIP) flow into the event Transport, which fans out to consumers: monitoring, hadoop/hive, A/B testing, elastic search]
• Distributed, DC-fault tolerant
• Generates graphite metrics from events
• Runs user-defined code (~260 monitors)
• Processes events second-by-second (epochs)
• up to ~500k events per second
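For context, sending a metric to Graphite is just a line of plaintext over TCP; a minimal sketch, with the host and metric name made up.

use IO::Socket::INET;

# Graphite's plaintext protocol: "<metric.path> <value> <unix_timestamp>\n",
# sent over TCP (port 2003 by default).
my ($count, $epoch) = (42, time());

my $sock = IO::Socket::INET->new(
    PeerAddr => 'graphite.example.com',   # placeholder host
    PeerPort => 2003,
    Proto    => 'tcp',
) or die "connect to graphite: $!";

printf {$sock} "egp.bookings.per_second %d %d\n", $count, $epoch;
close $sock;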
Event Graphite Processor.
epoch_of_events = get_events_for_epoch(now)
foreach monitor : monitors
    result = monitor.run(epoch_of_events)
    graphite.send(result)
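In Perl terms, a monitor under this scheme boils down to a class whose run method receives the whole decoded epoch at once; the class name, metric name, and event fields below are illustrative, not the real EGP API.

package Monitor::BookingsPerSecond;   # illustrative monitor

# Old interface: the framework hands over the entire epoch (an arrayref
# of ~500k decoded events) and expects metrics back.
sub run {
    my ($self, $epoch_of_events) = @_;
    my $count = grep { $_->{type} eq 'booking' } @$epoch_of_events;
    return { 'egp.bookings.per_second' => $count };
}

1;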
Event Graphite Processor.
• The dataset is huge and it’s growing
• Every second of events takes 10–15GB of RAM
• Monitors are split into groups to run faster
• Every group runs in a fork
• Forking provokes COW copies (pages get duplicated on write; see the sketch after this list)
• RAM gets saturated
• No free RAM = the box gets kicked out
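A minimal sketch of the fork-per-group pattern described above, assuming placeholder helpers (get_events_for_epoch, @monitor_groups). The decoded epoch is shared copy-on-write with every child, but Perl touches the data in place (reference count updates) as the monitors read it, so the "shared" pages end up duplicated in each fork.

use POSIX ();

my $epoch_of_events = get_events_for_epoch(time());    # 10-15GB once decoded

my @pids;
for my $group (@monitor_groups) {                       # monitors split into groups
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                                    # child: one group per fork
        # Merely reading the shared structure dirties its pages
        # (refcount updates), so COW memory is copied per child.
        $_->run($epoch_of_events) for @$group;
        POSIX::_exit(0);                                # leave without Perl's cleanup
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;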
Processing. Think different.
Processing. Thinking different.
epoch_of_events = sereal_decoder.parse(srl_events)
foreach (event : epoch_of_events) ...
vs
iterator = sereal_decoder.iterator(srl_events)
while event = iterator.next() ...
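The left-hand side in concrete Perl: Sereal::Decoder materialises every event in the blob before the loop even starts, which is where the 10-15GB per epoch comes from ($srl_events stands for the encoded blob).

use Sereal::Decoder;

# Decode-everything: the whole epoch becomes one big Perl structure up front.
my $epoch_of_events = Sereal::Decoder->new->decode($srl_events);

for my $event (@$epoch_of_events) {
    # every event is already decoded and held in RAM,
    # whether or not any monitor looks at it
}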
Processing. Thinking different.
epoch_of_events = get_events_for_epoch(now)
iterator = epoch_of_events.iterator
while event = iterator.next()
    foreach monitor : monitors
        monitor.process_event(event)
foreach monitor : monitors
    result = monitor.post_process()
    graphite.send(result)
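The same illustrative monitor rewritten for the streaming loop above: per-event work goes into process_event, the final metric into post_process (again a sketch of the interface, not the real EGP code).

package Monitor::BookingsPerSecond;   # same illustrative monitor, new interface

# Called once per event while the framework walks the iterator.
sub process_event {
    my ($self, $event) = @_;
    $self->{count}++ if $event->{type} eq 'booking';
}

# Called once per epoch, after the last event has been seen.
sub post_process {
    my ($self) = @_;
    my $result = { 'egp.bookings.per_second' => $self->{count} // 0 };
    $self->{count} = 0;               # reset state for the next epoch
    return $result;
}

1;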
Problem: need to rewrite user code
epoch_of_events = get_events_for_epoch(now)
foreach monitor : monitors
    result = monitor.run(epoch_of_events)
    graphite.send(result)

vs

epoch_of_events = get_events_for_epoch(now)
iterator = epoch_of_events.iterator
while event = iterator.next()
    foreach monitor : monitors
        monitor.process_event(event)
foreach monitor : monitors
    result = monitor.post_process()
    graphite.send(result)
How?
• Sereal::Path::Iterator (see the usage sketch after this list)
  a. iterate over objects (scalar/arrayref/hashref/blessed/etc.)
  b. decode from current position
• no streaming
• check limitations
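A usage sketch against Sereal::Path::Iterator; I am going from memory of the module's interface, so the method names used here (step_in, eof, decode, next) are assumptions to double-check against its docs.

use Sereal::Path::Iterator;

my $iter = Sereal::Path::Iterator->new($srl_events);   # srl([ $e, $e, $e, ... ])
$iter->step_in();                                      # descend into the top-level array

until ($iter->eof()) {
    my $event = $iter->decode();                       # decode only the current element
    # ... feed $event to the monitors ...
    $iter->next();                                     # skip to the next element
}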
Still need to rewrite user code!
A proof of concept.
FlogCron
1. Identical stack
2. Same problems
3. Easy to migrate
A proof of concept.
FlogCron
First results: promising.
• CPU: no changes
• Processing time: 60sec vs 30sec
• # of boxes: 20 vs 8
• RAM: 10GB vs 100MB
1. RAM is an issue
2. New user monitors
3. New systems
4. More events every day
EGP: Time to act!
1. Implement a proof of concept
2. Freeze EGP development
3. Migrate all monitors
4. Full-scale test
5. Roll out the new system
6. Profit!
EGP migration TODO
1. 8 people
2. All done in 1 day
3. 260 monitors
4. 317 files changed, 10336 insertions, 11288 deletions
5. Ready to run a full-scale test
Migration.
Hackathon.
Results
80s vs 120+s
Chasing the problem.
foreach (@events) {
    …
}
POSIX::exit

vs

while (my $event = $iterator->next()) {
    …
}
POSIX::exit

undef $event # ~500k times, ~15GB in total
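How I read this comparison: both workers end with a POSIX-level exit that skips Perl's cleanup, so the old foreach never paid for freeing the decoded epoch, while in the iterator loop each decoded $event is freed as it is replaced on the next iteration. A tiny illustration of that implicit free (process_event stands in for the monitor call):

while (my $event = $iterator->next()) {
    $monitor->process_event($event);
    # when $event is replaced on the next iteration, its old value's
    # refcount hits zero and Perl tears down the whole nested structure:
    # ~500k frees per epoch, ~15GB in total, work the old
    # foreach + POSIX::exit path never had to do.
}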
1. Xeon E5-2690
2. 56 logical cores
The new hardware.
1. No more RAM constraints
2. More aggressive forking
3. 3x more groups/forks
Parallelization.
Fork me!
2 times faster.
1. Processing time: 80sec vs 40sec
2. RAM: 16GB vs 500MB
3. # of boxes: 80 vs 30
The results.
The new system
1. Engineering is the king, collaboration is the queen
2. The ideas that failed individually might work together
3. Challenge everything
Lessons learned.
Ivan Kruglov
ivan.kruglov@booking.com
Thank you!
