Monitoring microservice
applications: An SRE’s
perspective
Evgeny Potapov
State of the Infrastructure
• Modern infrastructure: multiple interacting applications/services running inside a
service-orchestration system (e.g. Docker + Kubernetes, but not limited to them)
• This stack matches the paradigm shift happening in software development as an
engineering discipline. The main target is to reach business agility (we should
deliver the functionality as soon as possible). Site reliability isn’t the main
purpose.
• This obviously brings technical debt.
• It is expected that this technical debt will be paid off later, but that rarely happens.
It is expected that this technical debt will be
paid off later, but that rarely happens.
And the infrastructure encourages it!
Kubernetes is so awesome that one of our JVM containers
has been periodically running out of memory for more than
a year, and we just recently learned about it.
https://danlebrero.com/2018/11/20/how-to-do-java-jvm-heapdump-in-kubernetes/
e.g.:
• Sales report service was updated to provide updates for the regional sales director
• Each time the app was loading the whole database instead of a single month
• SREs were seeing occasional (once a month) service restarts and paid no attention to them
What can we do?
Top (10?) problems in microservice-oriented
architecture monitoring
No service restart monitoring
• Problem: rare service restarts go unnoticed and aren’t investigated
• Leads to late reactions, especially to OOM issues (sometimes months late)
• Monitor/alert on changes in the number of restarts.
• Technical view: number of restarts in Kubernetes/Docker Swarm/etc.
• Prometheus/Alertmanager, Datadog, etc.
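As a rough illustration of alerting on a change in the number of restarts, here is a minimal sketch. It assumes restart counters are already collected per container (for example from the orchestrator's API); the function name `restart_deltas` and the sample snapshots are hypothetical.

```python
from typing import Dict


def restart_deltas(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Return the containers whose restart counter grew between two snapshots."""
    deltas = {}
    for name, count in current.items():
        # A container missing from the previous snapshot is treated as starting at 0.
        grown = count - previous.get(name, 0)
        if grown > 0:
            deltas[name] = grown
    return deltas


# Hypothetical snapshots taken one collection interval apart.
before = {"sales-report": 3, "checkout": 0}
after = {"sales-report": 4, "checkout": 0}
alerts = restart_deltas(before, after)
```

Comparing snapshots rather than alerting on an absolute threshold is what catches the "once a month" restart: even a single new restart shows up as a delta.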
No app-level error monitoring
• Problem: The app might be running but might throw critical errors to
logs
• Might mean big issues that developers and the business are unaware of
• Monitor service logs for known error log patterns (ask developers for
templates)
• Technical view: Grep over Elastic, X-Pack
• Bonus for developers: Sentry (developer-friendly)
• Prometheus/AlertManager, Sentry, DataDog, etc.
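Matching logs against known error patterns could look roughly like the sketch below. The patterns and log lines are made up for illustration; in practice the patterns would come from the developers, and the lines from whatever log store is in use.

```python
import re
from collections import Counter
from typing import Iterable, List, Pattern

# Hypothetical patterns; the real list should be supplied by the developers.
ERROR_PATTERNS: List[Pattern[str]] = [
    re.compile(r"\bERROR\b"),
    re.compile(r"OutOfMemoryError"),
    re.compile(r"Traceback \(most recent call last\)"),
]


def count_error_matches(lines: Iterable[str],
                        patterns: List[Pattern[str]] = ERROR_PATTERNS) -> Counter:
    """Count how many log lines match each known error pattern."""
    hits: Counter = Counter()
    for line in lines:
        for pattern in patterns:
            if pattern.search(line):
                hits[pattern.pattern] += 1
    return hits


# Made-up log lines for illustration only.
sample_log = [
    "2018-11-20 10:00:01 INFO request served in 35ms",
    "2018-11-20 10:00:02 ERROR failed to load sales report",
    "java.lang.OutOfMemoryError: Java heap space",
]
hits = count_error_matches(sample_log)
```

An alert can then fire when any counter is non-zero, or when it grows faster than usual.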
No service health checks (or minimal health
checks)
• Problem: service health check works as “print ‘I’m OK’”
• Leads to a passing health check while the service is completely down
• Work with developers/management to add proper health checks
(this might mean scheduling extra development time, which plans
usually don’t include because people expect monitoring tools to
cover it)
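To make "proper health check" more concrete, here is a minimal sketch of a check that probes the service's dependencies instead of printing a constant. The probe functions are placeholders; a real service would query its own database, queue, and so on.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, Dict[str, str]]:
    """Run every dependency probe and report overall health plus per-probe status."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A probe that raises counts as unhealthy; keep the reason for debugging.
            report[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in report.values())
    return healthy, report


def database_reachable() -> bool:
    # Placeholder: a real probe would run a cheap query against the database.
    return True


def queue_reachable() -> bool:
    # Placeholder that simulates a broken dependency.
    raise ConnectionError("queue timeout")


healthy, report = run_health_checks(
    {"database": database_reachable, "queue": queue_reachable}
)
```

The health endpoint would return a failing status whenever `healthy` is false, so the orchestrator and the monitoring system see the same picture the users do.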
No API response-time checks (and no tracing)
• Problem: service might be working, but too slowly, which means the
whole app is slow
• In SOA/microservices/macroservices we need to check inter-service
communication as well
• Work with developers to add tracing to your app
• There’s a way to get Jaeger metrics into Prometheus; see the talk
“Distributed Tracing with Jaeger & Prometheus on Kubernetes”:
https://www.youtube.com/watch?v=fjYAU3jayVo
• https://github.com/opentracing-contrib
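A response-time check is usually based on a percentile rather than an average, so that a few very slow calls are not hidden. The sketch below uses the nearest-rank percentile; the 500 ms threshold and the latency samples are assumptions for illustration, not values from the talk.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_too_slow(samples_ms: Sequence[float], threshold_ms: float, pct: float = 95.0) -> bool:
    """True when the chosen percentile of response times exceeds the threshold."""
    return percentile(samples_ms, pct) > threshold_ms


# Made-up response times (in milliseconds) of one inter-service call.
latencies_ms = [120, 135, 110, 140, 125, 130, 115, 2400, 128, 122]
slow = is_too_slow(latencies_ms, threshold_ms=500)
```

The same check applies to calls between services and to external APIs, not only to the public endpoint.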
Service is still an application and still
consumes CPU and RAM
• Common problem: the application/cluster might be well covered by
service-level monitoring, while a single service consumes a lot of CPU
or RAM and nobody notices
• Just don’t forget to monitor resource usage per service
If someone adds new services, they should be
monitored as well
• Problem: sometimes cross-functional teams might have the ability to
add new services and this might be unknown to the SRE team
• Leads to: an unmonitored service going down unnoticed, then a lot of investigation
• Monitor cluster configuration metrics (deployments/number of
namespaces etc.). Alert on changes.
• Not a PRIO 1 investigation, but when it happens, work with the team
to introduce the right workflow
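Alerting on a difference in cluster configuration can be as simple as comparing the set of deployments the SRE team knows about with the set actually observed. A minimal sketch, with hypothetical deployment names:

```python
from typing import Dict, Set


def inventory_diff(known: Set[str], observed: Set[str]) -> Dict[str, Set[str]]:
    """Compare the known service inventory with what is running in the cluster."""
    return {
        "added": observed - known,    # running, but unknown to the SRE team
        "removed": known - observed,  # expected, but no longer running
    }


# Hypothetical "namespace/deployment" names.
known_deployments = {"shop/checkout", "shop/sales-report"}
observed_deployments = {"shop/checkout", "shop/sales-report", "labs/recommender"}
diff = inventory_diff(known_deployments, observed_deployments)
```

Anything in `added` is a service nobody is monitoring yet, which is the cue to talk to the team that created it.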
CI/CD time should be monitored
• Problem: deploy time (Docker build, tests) keeps increasing, and
developers get used to it.
• Leads to developer frustration, time spent just waiting
• Gradual changes to the Dockerfile might gradually increase build time: 3
mins -> 5 mins -> 10 mins -> 20 mins. People get used to it
• Monitor time spent on integration and delivery, investigate reasons.
CI/CD support is now part of SRE life as well.
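Because build time creeps up gradually, comparing each build with the previous one rarely triggers anything. One possible approach, sketched below, compares the median of the most recent builds with the median of the oldest ones; the window size, the 1.5x ratio, and the durations are assumptions chosen for illustration.

```python
from statistics import median
from typing import Sequence


def build_time_regressed(durations_s: Sequence[float],
                         window: int = 5,
                         max_ratio: float = 1.5) -> bool:
    """True when recent builds are much slower than the oldest recorded builds."""
    if len(durations_s) < 2 * window:
        return False  # not enough history to compare two full windows
    baseline = median(durations_s[:window])
    recent = median(durations_s[-window:])
    return recent > baseline * max_ratio


# Made-up build durations in seconds, oldest first (about 3 min growing to 20 min).
history = [180, 185, 190, 200, 210, 300, 420, 600, 900, 1200]
regressed = build_time_regressed(history)
```

Medians are used so that a single unusually slow or fast build does not decide the outcome.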
APM and profiling capabilities
• Problem: the app is slow, no one knows why
• Leads to really long investigations
• Development teams still don’t use APM/profiling in many cases; work
with developers to add APM when possible
Security monitoring
• Security is a part of SRE life as well
• Might be a good idea to monitor the number of WAF-related events
so you can investigate spikes, block attackers, and escalate to the security team.
• Monitor npm/yarn audit, docker image audits for critical CVEs, alert
when present
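Alerting on critical CVEs from an audit report might look like the sketch below. It assumes the report is JSON with a `metadata.vulnerabilities` object holding per-severity counts, which is the shape `npm audit --json` is expected to produce; verify this against the npm version in use. The sample report is fabricated for illustration.

```python
import json


def critical_count(audit_json: str) -> int:
    """Return the number of critical vulnerabilities listed in an audit report."""
    report = json.loads(audit_json)
    # Assumed layout: {"metadata": {"vulnerabilities": {"critical": N, ...}}}
    counts = report.get("metadata", {}).get("vulnerabilities", {})
    return int(counts.get("critical", 0))


# Fabricated report that follows the assumed layout.
sample_report = json.dumps(
    {"metadata": {"vulnerabilities": {"low": 2, "moderate": 1, "high": 0, "critical": 1}}}
)
should_alert = critical_count(sample_report) > 0
```

Running this as a scheduled job, rather than only at build time, also catches CVEs published after the image was built.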
Summing up
• Alerts on service restarts
• Alerts on app-level errors
• Advanced health checks
• API response time alerts (including external APIs)
• Microservice architecture change notifications
• CI/CD build/delivery time monitoring and notifications
• APM/profiling added
• WAF event monitoring
Bonus track: Page response time is not
server response time anymore
• Problem: if your “server” responds in 200ms, but the page is
rendered in a browser in 60 seconds, it’s still 60 seconds for users.
• So monitor page rendering time as well
• Pingdom, Site24x7, a lot of headless browsers available
Evgeny Potapov
CEO, DevOpsProdigy
Twitter: @eapotapov
Email: eapotapov@devopsprodigy.com
LinkedIn: @eapotapov
