Production Operations: An Architect and Developer Perspective. “Understanding the production environment and the core functions of the live service Operations and Service Delivery teams helps us to design and build solutions that can be better operated and supported during live service and beyond.”
About Me: www.design-build-run.net
About This Session: This session provides practical insights and proven principles to help architects and developers design, build and implement solutions that are resilient and operable.
The Production Service is a critical part of running and sustaining the business.
The Production Environment is the execution foundation for the live service system and needs to be resilient and operable.
Data Centers are facilities where all the production servers, networking components and infrastructure are housed. (Courtesy of Wikipedia)
Live Service Management includes Operations, Service Delivery and Application Maintenance.
The Production Operations team is responsible for monitoring and operating all the systems in live service. (Courtesy of Wikipedia)
Monitoring Software provides a view of the current state of the organization’s production applications and environments.
Monitoring Rules are used to filter and control what information is collected and what alerts get raised on the bridge.
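A minimal sketch of how such rules might work, assuming each rule names a source prefix and two severity thresholds. The names `Severity`, `Event`, `Rule` and `apply_rules` are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Event:
    source: str         # e.g. a server or application name
    severity: Severity
    message: str


@dataclass(frozen=True)
class Rule:
    source_prefix: str      # which event sources the rule applies to
    collect_from: Severity  # minimum severity that is collected
    alert_from: Severity    # minimum severity that is raised on the bridge


def apply_rules(event, rules):
    """Return (collected, alerted) using the first rule that matches the event."""
    for rule in rules:
        if event.source.startswith(rule.source_prefix):
            return (event.severity >= rule.collect_from,
                    event.severity >= rule.alert_from)
    return (False, False)  # no matching rule: the event is dropped
```

With a rule such as `Rule("web", Severity.INFORMATION, Severity.ERROR)`, a warning from `web01` would be collected for later analysis but would not raise an alert.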
Informational Alerts indicate something of interest has happened.
Warning Alerts indicate something that might need attention. WARNING: an error is about to occur (maybe).
Error Alerts indicate a critical situation has occurred. CRITICAL ERROR: a critical error has occurred.
Consider: “All servers will have processor utilization, measured over any 5 minute period, of less than 70% for 95% of the time.” Information, Warning or Alert? (CPU spike)
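The requirement above can be read in more than one way. The sketch below takes one possible reading: the average of every 5-minute window must be under 70% in at least 95% of the windows. The function name `meets_cpu_target`, its parameters and the one-sample-per-minute input are assumptions made for illustration.

```python
def meets_cpu_target(samples, window=5, limit=70.0, required_fraction=0.95):
    """Check a list of per-minute CPU utilization readings (percent)."""
    if len(samples) < window:
        raise ValueError("need at least one full window of samples")
    # Average of every consecutive `window`-minute period, sliding by one minute.
    averages = [
        sum(samples[i:i + window]) / window
        for i in range(len(samples) - window + 1)
    ]
    within_limit = sum(1 for avg in averages if avg < limit)
    return within_limit / len(averages) >= required_fraction
```

Under this reading, a single one-minute spike to 100% on a server otherwise running at 50% does not breach the target, which suggests a short CPU spike on its own may not deserve an error alert.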
Unnecessary alerts are just noise and will be ignored.
Hardware Fault Tolerance improves availability and can increase performance and scalability.
Software Fault Tolerance improves availability and provides non-stop processing.
Wait-Retry logic can help ensure your software is resilient.
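A minimal sketch of wait-retry logic, assuming the failures worth retrying are transient connection or timeout errors. The helper `call_with_retry` and its parameters are hypothetical; the `sleep` argument exists only so the wait can be replaced in tests.

```python
import time


def call_with_retry(operation, attempts=3, wait_seconds=1.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a transient failure, wait and then try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller (and monitoring) see it
            sleep(wait_seconds * attempt)  # wait a little longer each time
```

Only errors listed in `retry_on` are retried; anything else is raised immediately, since retrying a permanent failure only delays the alert.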
Data Replication and Recovery help increase availability and reduce failover time.
We need to carefully balance Resilience with Performance and Operability.
Alerts should contain information that is useful from an operational and investigative point of view.
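As an illustration of what "useful" might mean, the sketch below assumes an alert carries a severity, a stable event identifier, the server and application it came from, a plain message and some context values. The `Alert` class and every field name are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Alert:
    severity: str     # e.g. "WARNING"
    event_id: str     # stable code that operators can look up
    server: str
    application: str
    message: str      # what happened, in plain words
    context: dict = field(default_factory=dict)  # values that help investigation
    raised_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    def to_line(self):
        """Format the alert as a single log or console line."""
        details = " ".join(f"{k}={v}" for k, v in sorted(self.context.items()))
        return (f"{self.raised_at.isoformat()} {self.severity} {self.event_id} "
                f"{self.server}/{self.application}: {self.message} [{details}]")
```

The event identifier gives operators something to look up, while the context values give investigators the state at the time of the event.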
The Operations Procedures are the most important “documents” for Live Service Support.
Automate corrective actions (where possible) to reduce manual effort.
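One way this could look in code, assuming each alert has an event identifier and that some identifiers have a known, safe fix. The function `handle_alert`, the `corrective_actions` mapping and the `escalate` callback are hypothetical names used for illustration.

```python
def handle_alert(event_id, corrective_actions, escalate):
    """Run the automated fix for event_id if one is registered, else escalate."""
    action = corrective_actions.get(event_id)
    if action is None:
        escalate(event_id, "no automated action registered")
        return "escalated"
    try:
        action()
        return "corrected"
    except Exception as exc:
        # The automated fix itself failed, so a person needs to look at it.
        escalate(event_id, f"automated action failed: {exc}")
        return "escalated"
```

Anything without a registered action, and any action that fails, still reaches a person, so automation reduces manual effort without hiding problems.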
Service Delivery is responsible for supporting the production service.
Batch and housekeeping routines ensure the system is in a good operational state.
Reporting allows us to monitor and trend live service performance and can help with Incident Investigation.
Incident Investigation
Remember: live service outages and downtime cost money.
Thank You. Questions?
