Happy users and good sleep
How?
Stanislav German-Evtushenko
Cloud Foundry Meetup
Tokyo, 2018-11-02
About me
•DevOps Engineer
•3 years at Rakuten
A bit of history
•Running Open Source Cloud Foundry for 6 years
•Running v2 for 2.5 years
•1000+ apps, 2000+ containers, ~7000 QPS at peak time
•A team of 8
Issues we were facing
•One night in three was spent without sleep
•Most alerts were meaningless
•A lot of platform problems were hidden
•Known issues didn’t have good solutions
What we wanted
•Deliver a reliable, secure, maintainable platform
•Keep the number of alerts low
•Let the platform grow while keeping the team the same size
Know your issues
•Know what can go wrong before it does
•Crash tests: don’t wait for things to break, break them first
•(kill a VM, drop all data, freeze receiving on pushing)
•Keep track of known issues and workarounds
•Simulation (identical environments)
•End-to-end monitoring (cf push, cf login, etc.), only actionable alerts
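
As an illustration of the end-to-end monitoring bullet, here is a minimal sketch of a probe runner. It is not the tooling described in the talk: the names `ProbeResult`, `run_probe` and `failed_probes`, and the 120-second default timeout, are assumptions made for the example. It runs each user-facing command the way a user would, and returns only the failures, so that anything it reports is something to act on.

```python
import subprocess
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    name: str       # human-readable probe name, e.g. "cf push"
    ok: bool        # True when the command exited with status 0 in time
    seconds: float  # wall-clock duration of the probe
    detail: str     # tail of the command output on failure, else ""


def run_probe(name, argv, timeout=120.0):
    """Run one user-facing command and record whether it succeeded in time."""
    started = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        ok = proc.returncode == 0
        detail = "" if ok else (proc.stderr or proc.stdout).strip()[-500:]
    except subprocess.TimeoutExpired:
        ok, detail = False, f"timed out after {timeout}s"
    except FileNotFoundError:
        ok, detail = False, f"command not found: {argv[0]}"
    return ProbeResult(name, ok, time.monotonic() - started, detail)


def failed_probes(probes, timeout=120.0):
    """Run (name, argv) probes in order and return only the failures."""
    results = [run_probe(name, argv, timeout) for name, argv in probes]
    return [r for r in results if not r.ok]
```

A probe list could contain an entry such as `("cf push", ["cf", "push", "smoke-app", "-p", "./smoke-app"])`, where `smoke-app` and its path are placeholders for a small test application. An empty result from `failed_probes` means there is nothing to alert on.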
Predictions based on metrics
•load average is your friend
•packet drops
•free space and inodes
•percentage of functional nodes (e.g. routers)
•DNS response
•mutual TLS (when does your certificate expire?)
•warnings during working hours, fix ASAP
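
A few of the metrics above can be turned into early warnings with very little code. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, `os.statvfs` is Unix-only, and the forecast is a plain linear extrapolation between two samples rather than anything the talk prescribes.

```python
import os
from datetime import datetime, timezone


def filesystem_usage(path="/"):
    """Return (fraction of space used, fraction of inodes used) for a mount point.

    Space reserved for root counts as used, which is what an unprivileged
    process experiences. Filesystems that report zero totals yield 0.0.
    """
    st = os.statvfs(path)  # Unix only
    space_used = 1.0 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1.0 - st.f_favail / st.f_files if st.f_files else 0.0
    return space_used, inodes_used


def days_until(expires_at, now=None):
    """Days left before a timezone-aware expiry time, e.g. a certificate's notAfter."""
    now = now or datetime.now(timezone.utc)
    return (expires_at - now).total_seconds() / 86400.0


def hours_until_full(samples, capacity):
    """Extrapolate linearly from the first and last (unix_seconds, used) samples.

    Returns None when usage is flat or shrinking, since no exhaustion is predicted.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0 or u1 <= u0:
        return None
    rate_per_second = (u1 - u0) / (t1 - t0)
    return (capacity - u1) / rate_per_second / 3600.0


def functional_ratio(healthy, total):
    """Fraction of nodes (e.g. routers) that currently pass their health check."""
    return healthy / total if total else 0.0
```

Values like these can feed warnings that fire during working hours, well before the disk is full or the certificate has expired. The thresholds are left out on purpose, because they depend on the platform.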
Keep technical debt low
•If a user has a problem, assume the problem is on your side
•Keep close to upstream
•"What if we need to redeploy it from scratch?"
Restorable backups
•Proper backups with monitoring
•Restoration trials
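
A restoration trial can be automated. The sketch below uses a directory archived with `tarfile` as a stand-in for a real backup (a platform backup would more likely be a database dump). The function names are hypothetical, and the archive is assumed to be one you created yourself, so it is extracted without extra safety filtering. The point is that a backup only counts once it has been restored and compared against what was backed up.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def make_backup(source_dir, archive_path):
    """Archive source_dir and return the manifest a later restore must match."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source_dir), arcname=".")
    return checksums(source_dir)


def restoration_trial(archive_path, expected):
    """Restore into a scratch directory; return files that are missing or differ."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(str(archive_path), "r:gz") as tar:
            tar.extractall(scratch)
        restored = checksums(scratch)
    return sorted(name for name, digest in expected.items() if restored.get(name) != digest)
```

An empty list from `restoration_trial` means the restore reproduced every file in the manifest. Anything else is a backup problem found before it was needed.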
Set your priorities
•Reliable, Secure
•Useful (outcome exceeds effort)
•Maintainable
•Easy to use and hard to misuse
•Suitable for the majority of use cases, but not all
Worth reading
•https://githubengineering.com/upgrading-github-from-rails-3-2-to-5-2
Thank you
