Happy users and good sleep. How?

Stanislav German-Evtushenko
Cloud Foundry Meetup
Tokyo, 2018-11-02

About me
•DevOps Engineer
•3 years at Rakuten

A bit of history
•Running Open Source Cloud Foundry for 6 years
•Running v2 for 2.5 years
•1000+ apps, 2000+ containers, ~7000 QPS at peak time
•A team of 8

Issues we were facing
•One night in three was sleepless
•Most alerts were meaningless
•A lot of platform problems were hidden
•Known issues didn’t have good solutions

What we wanted
•Deliver a reliable, secure, maintainable platform
•Keep the number of alerts low
•Let the platform grow while keeping the team the same size

Know your issues
•Know what can go wrong before it does
•Crash tests – don’t wait for things to break, break them first
•(kill a VM, drop all data, freeze receiving on pushing)
•Keep track of known issues and workarounds
•Simulation (identical environments)
•End-to-end monitoring (cf push, cf login, etc.), with only actionable alerts
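
The end-to-end monitoring idea above could be sketched roughly as follows. This is a minimal illustration and not the team's actual tooling: the API URL, user, org, space, and app names in STEPS are placeholders, and the helper names (probe, run_probes, alerts) are hypothetical. Each step runs a real cf CLI command, and an alert is produced only when a step fails, so every alert points at something a user would also hit.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ProbeResult:
    name: str
    ok: bool
    seconds: float
    detail: str


def run_command(argv: Sequence[str], timeout: float) -> subprocess.CompletedProcess:
    """Default runner: execute the command and capture its output."""
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)


def probe(name: str, argv: Sequence[str], timeout: float = 120.0,
          runner: Callable = run_command) -> ProbeResult:
    """Run one user-facing step and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        proc = runner(argv, timeout)
        ok = proc.returncode == 0
        # Keep only the tail of the output so the alert stays readable.
        detail = "" if ok else (proc.stderr or proc.stdout).strip()[-500:]
    except subprocess.TimeoutExpired:
        ok, detail = False, f"timed out after {timeout}s"
    except OSError as exc:
        ok, detail = False, f"could not run: {exc}"
    return ProbeResult(name, ok, time.monotonic() - start, detail)


def run_probes(steps: List[Tuple[str, Sequence[str]]],
               runner: Callable = run_command) -> List[ProbeResult]:
    """Run steps in order and stop at the first failure (later steps depend on earlier ones)."""
    results = []
    for name, argv in steps:
        result = probe(name, argv, runner=runner)
        results.append(result)
        if not result.ok:
            break
    return results


def alerts(results: List[ProbeResult]) -> List[str]:
    """One message per failed step; an empty list means nothing to page about."""
    return [f"{r.name} failed after {r.seconds:.1f}s: {r.detail}"
            for r in results if not r.ok]


# Placeholder values (assumptions): replace with a dedicated probe account and app.
STEPS = [
    ("cf login", ["cf", "login", "-a", "https://api.example.com", "-u", "probe-user",
                  "-p", "probe-password", "-o", "probe-org", "-s", "probe-space"]),
    ("cf push", ["cf", "push", "probe-app", "-p", "./probe-app"]),
]
```

Calling alerts(run_probes(STEPS)) from a scheduler would give an empty list when login and push both work. The runner parameter exists so the logic can be exercised without a live platform.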

Predictions based on metrics
•Load average is your friend
•Packet drops
•Free space and inodes
•Percentage of functional nodes (e.g. routers)
•DNS response
•Mutual TLS (when does your certificate expire?)
•Warnings during working hours; fix ASAP
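
A few of the metrics above can be turned into predictions with very little code. The sketch below is an assumption about how that might look, not what the team ran: hours_until_full fits a straight line to recent usage samples (disk space or inodes), functional_percentage covers the "percentage of functional nodes" bullet, and days_until_cert_expiry answers the certificate question. All three function names are hypothetical.

```python
import ssl
import time
from typing import List, Optional, Tuple


def hours_until_full(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate hours until usage reaches 100%.

    samples are (unix_time, used_fraction) pairs, oldest first. The growth rate
    is the least-squares slope; returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_u = samples[-1]
    # Remaining headroom divided by growth per second, converted to hours.
    return max(0.0, (1.0 - last_u) / slope / 3600.0)


def functional_percentage(healthy: int, total: int) -> float:
    """Share of nodes (e.g. routers) that currently pass their health check."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * healthy / total


def days_until_cert_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires.

    not_after uses the format found in ssl getpeercert()['notAfter'],
    for example 'Jan  1 00:00:00 2019 GMT'.
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0
```

With values like these, a warning can be raised during working hours when, say, a disk is projected to fill within a few days, long before it becomes a night-time page. The thresholds are a policy choice and are left out on purpose.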

Keep technical debt low
•If a user has a problem, assume the problem is on your side
•Keep close to upstream
•"What if we need to redeploy it from scratch?"

Restorable backups
•Proper backups with monitoring
•Restoration trials
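
A restoration trial can be as simple as restoring the backup into a scratch location and checking that the data matches. The sketch below illustrates the idea on a plain directory with a tar archive; a real platform backup (databases, blobstore) would need its own restore procedure, and comparing against digests recorded at backup time would be more realistic than comparing against the live source. The function names are hypothetical.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path
from typing import Dict, List


def file_digests(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def make_backup(source: Path, archive: Path) -> None:
    """Write the source directory into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restoration_trial(source: Path, archive: Path) -> List[str]:
    """Restore the archive into a scratch directory and list any discrepancies."""
    expected = file_digests(source)
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        restored = file_digests(Path(scratch))
    problems = [f"missing: {name}" for name in expected if name not in restored]
    problems += [f"changed: {name}" for name in expected
                 if name in restored and restored[name] != expected[name]]
    return problems
```

An empty list from restoration_trial means the backup restored cleanly; anything else is worth an alert, which is the "backups with monitoring" half of the slide.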

Set your priorities
•Reliable, Secure
•Useful (outcome exceeds effort)
•Maintainable
•Easy to use and hard to misuse
•Suitable for the majority but not all use cases

Worth reading
•https://githubengineering.com/upgrading-github-from-rails-3-2-to-5-2
