6. •Multiple downtimes
•CPU usage hitting 80% regularly
•Commit latency trending up
•66% increase in query volume in 6 months
Things started getting worse
15. •New project supported by all the engineering teams
•More than 25 Engineers involved (4 SREs)
•Status Meeting every 3 hours
•One Goal: Keep Intercom UP
A massive effort
16. Moved queries to read replicas
Got rid of bad database usages
Vertical partitioning
Reviewed top queries
Schema optimizations
Keep Intercom UP
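As an illustration of "Moved queries to read replicas", here is a minimal sketch of statement routing. It is not the code used at Intercom; the `ReadWriteRouter` class, the `target_for` method, and the host names are hypothetical, and the SELECT check is a deliberately naive assumption.

```python
import itertools


class ReadWriteRouter:
    """Hypothetical helper: picks a database target for a SQL statement."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; with no replicas, everything goes to the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql, in_transaction=False):
        # Naive assumption: any statement starting with SELECT is a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        # Reads inside a transaction stay on the primary so they can see
        # the transaction's own uncommitted writes.
        if is_read and not in_transaction and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM users WHERE id = 1"))
print(router.target_for("UPDATE users SET name = 'a' WHERE id = 1"))
```

A real setup would also need to account for replication lag and for locking reads such as SELECT ... FOR UPDATE, which this sketch would wrongly send to a replica.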
32. Alignment
Current Situation
•The DB is under constant pressure
•CPU is routinely > 70%
•AMI rollouts push us closer to the overall connection limit
•Bad database usages: queues, analytics, large tables
Definition of done
•Majority of the read operations are against a replica
•Adding new replicas to spread the load is easy
•Master CPU is always below 50%
•Master handles fewer than 10,000 SELECTs/s
•Connection limit is not a concern
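The measurable parts of this definition of done can be expressed as a simple check. The sketch below is illustrative only: the `DbSnapshot` type, its field names, and `unmet_criteria` are hypothetical, and only the thresholds (50% master CPU, 10,000 SELECTs/s, a majority of reads on replicas) come from the slide. The two qualitative criteria (easy replica addition, connection limit not a concern) are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DbSnapshot:
    """Hypothetical point-in-time metrics for the database fleet."""
    master_cpu_percent: float
    master_selects_per_sec: float
    replica_read_fraction: float  # share of reads served by replicas, 0.0 to 1.0


def unmet_criteria(snapshot: DbSnapshot) -> List[str]:
    """Return the definition-of-done criteria that the snapshot fails."""
    unmet = []
    if snapshot.replica_read_fraction <= 0.5:
        unmet.append("majority of reads against a replica")
    if snapshot.master_cpu_percent >= 50:
        unmet.append("master CPU below 50%")
    if snapshot.master_selects_per_sec >= 10_000:
        unmet.append("master below 10,000 SELECTs/s")
    return unmet


# Example values are made up for illustration.
before = DbSnapshot(master_cpu_percent=72, master_selects_per_sec=25_000,
                    replica_read_fraction=0.2)
print(unmet_criteria(before))
```

An empty list means every measurable criterion is met for that snapshot; since the slide says "always below 50%", a real check would evaluate many snapshots over time rather than one.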
35. Split read / write load
Tuned some database settings
Removed a lot of bad database usages
MySQL training
Improved our dashboards and alarms
Definition of done
36. •Hit our definition of done
•Reduced our number of pages by 80%
•Great atmosphere and morale
Great milestone
44. •Never underestimate the work and the iterative process needed for moving to a new work mode
•Always align your organization with the current situation and the morale of the team
•Good news and good results also come with consequences and can imply radical changes
Lessons learned