3. A bit of history
•Running Open Source Cloud Foundry for 6 years
•Running v2 for 2.5 years
•1000+ apps, 2000+ containers, ~7000 QPS at peak time
•A team of 8
4. Issues we were facing
•One night in three was sleepless
•Most alerts were meaningless
•A lot of platform problems were hidden
•Known issues didn’t have good solutions
5. What we wanted
•Deliver a reliable, secure, maintainable platform
•Keep the number of alerts low
•Let the platform grow while keeping the team the same size
6. Know your issues
•Know what can go wrong before it does
•Crash tests: don’t wait for things to break, break them first
•(kill a VM, drop all data, freeze receiving on pushing)
•Keep track of known issues and workarounds
•Simulation (identical environments)
•End-to-end monitoring (cf push, cf login, etc.), only actionable alerts
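
A minimal sketch of what such an end-to-end check could look like, assuming the `cf` CLI is installed on the monitoring host. The API URL, credentials, app name, timeouts, and the "three consecutive failures" rule are all hypothetical placeholders, not values from the talk.

```python
import subprocess
import time
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    ok: bool
    seconds: float
    detail: str


def run_check(name, cmd, timeout=120):
    """Run one user-facing command and record whether it succeeded in time."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        ok = proc.returncode == 0
        # Keep only the tail of the output so the alert stays readable.
        detail = "" if ok else (proc.stderr or proc.stdout).strip()[-200:]
    except subprocess.TimeoutExpired:
        ok, detail = False, f"timed out after {timeout}s"
    except FileNotFoundError:
        ok, detail = False, f"command not found: {cmd[0]}"
    return CheckResult(name, ok, time.monotonic() - start, detail)


def should_alert(history, threshold=3):
    """Alert only after `threshold` consecutive failures, to skip one-off blips."""
    recent = history[-threshold:]
    return len(recent) == threshold and not any(r.ok for r in recent)


def main():
    # Hypothetical endpoint, user, and app name; replace with real ones.
    checks = {
        "cf login": ["cf", "login", "-a", "https://api.example.com",
                     "-u", "smoke-user", "-p", "smoke-password"],
        "cf push": ["cf", "push", "smoke-app"],
    }
    for name, cmd in checks.items():
        result = run_check(name, cmd)
        print(f"{result.name}: ok={result.ok} {result.seconds:.1f}s {result.detail}")


if __name__ == "__main__":
    main()
```

In practice a scheduler would append each `CheckResult` to a per-check history and page someone only when `should_alert(history)` is true, which is one way to keep alerts actionable.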
7. Predictions based on metrics
•load average is your friend
•packet drops
•free space and inodes
•percentage of functional nodes (e.g. routers)
•DNS response
•mutual TLS (when does your certificate expire?)
•warnings during work time, fix ASAP
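
The metrics above can be turned into early warnings with very little code. The following is a rough sketch, assuming a Unix host; the function names, the linear extrapolation, and the Mon-Fri 09:00-18:00 work hours are illustrative assumptions rather than the team's actual tooling.

```python
import os
import ssl


def fs_usage(path="/"):
    """Return (fraction of bytes used, fraction of inodes used). Unix only."""
    st = os.statvfs(path)
    bytes_used = 1 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    # Some filesystems report 0 total inodes; treat that as "not applicable".
    inodes_used = 1 - st.f_favail / st.f_files if st.f_files else 0.0
    return bytes_used, inodes_used


def load_per_cpu():
    """1-minute load average divided by CPU count. Unix only."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)


def seconds_until_full(samples):
    """Extrapolate linearly from (unix_time, free_bytes) samples.

    Uses only the first and last sample; returns None if free space
    is not shrinking.
    """
    (t0, f0), (t1, f1) = samples[0], samples[-1]
    if t1 <= t0 or f1 >= f0:
        return None
    rate = (f0 - f1) / (t1 - t0)  # bytes consumed per second
    return f1 / rate


def days_until_expiry(not_after, now):
    """Days left on a certificate, given its notAfter string and a Unix time."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def functional_fraction(node_ok):
    """Share of nodes (e.g. routers) currently passing their health check."""
    return sum(1 for ok in node_ok if ok) / len(node_ok) if node_ok else 0.0


def is_work_time(moment):
    """True Mon-Fri 09:00-18:00; the hours are an assumption."""
    return moment.weekday() < 5 and 9 <= moment.hour < 18
```

A warning would then fire when, say, `seconds_until_full` drops below a few days or `days_until_expiry` below a few weeks, and be delivered only if `is_work_time` is true, leaving pages for real outages.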
8. Keep technical debt low
•If a user has a problem, assume the problem is on your side
•Keep close to upstream
•"What if we need to redeploy it from scratch?"
10. Set your priorities
•Reliable, Secure
•Useful (outcome exceeds effort)
•Maintainable
•Easy to use and hard to misuse
•Suitable for the majority of use cases, but not all