2. Who are Genesys?
“Genesys® powers the world’s best customer experiences, across every channel, on-premise and in the cloud.”
11,000+ customers
100+ countries
6 AWS Regions
25 billion interactions / year
4. Why Practice SRE in Genesys?
• Platform stability as important as delivering features
• Production-first mindset
• Cloud platform
• Break your system and learn from it
THE ONLY GUARANTEE IS CHANGE AND FAILURE
5. Genesys Cloud - SRE and QA Highlights
Created SRE team in 2017 to enhance scalability and reliability of the Genesys Cloud platform (600% usage growth)
Recreation of every production incident
Load tests at 1.7x to 2x offered load run each day
~500 chaos events daily (12 types)
~2000 deployments weekly in dev
~721 deployments weekly in prod
15,904 test jobs weekly; 50-250 tests per job
29,000 Automated Tests
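As a back-of-the-envelope check on the weekly figures above, the sketch below multiplies them out. The numbers come from the slide; the variable and function names are illustrative, and the calculation assumes every job stays within the stated 50-250 tests range.

```python
# Figures taken from the slide; names are illustrative only.
TEST_JOBS_PER_WEEK = 15_904
TESTS_PER_JOB_MIN, TESTS_PER_JOB_MAX = 50, 250
CHAOS_EVENTS_PER_DAY = 500  # approximate ("~500" on the slide)


def weekly_test_executions(jobs, per_job_min, per_job_max):
    """Return (low, high) bounds on individual test executions per week."""
    return jobs * per_job_min, jobs * per_job_max


low, high = weekly_test_executions(
    TEST_JOBS_PER_WEEK, TESTS_PER_JOB_MIN, TESTS_PER_JOB_MAX
)
chaos_per_week = CHAOS_EVENTS_PER_DAY * 7

print(f"test executions per week: {low:,} to {high:,}")
print(f"chaos events per week: about {chaos_per_week:,}")
```

Under these assumptions the platform sees roughly 0.8 to 4 million test executions and about 3,500 chaos events per week.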
7. SRE Review
◦ Run by SRE Team
• before building a new service
• when production incidents repeat (“Code Red”)
◦ SRE hold the keys to Production
◦ Perform Fire Drills – single team
◦ Game Day – large scale Chaos involving whole organization
8. SRE Checks
Alerts
• Define alerts to help prevent problems instead of notifying of problems
Documentation
• Architecture diagrams
• Escalation policies
• Run playbooks
Lower Env. Deployments
• Did you test rollbacks?
• Versioning strategy?
• Disaster recovery strategy?
Downtime & SLA
• SLA expectations
• When and why would you need to schedule a downtime
Dependencies
• Enumerate all dependencies internal & external
Fire drill
• Identify chaos experiments
• Test for failure paths under load
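One way to make a checklist like the one above machine-checkable is to keep it as data and report what is still open. This is a minimal sketch, not the Genesys tooling: the categories and items paraphrase the slide, while `SRE_CHECKS`, `open_items` and `ready_for_review` are hypothetical names.

```python
# Categories and items paraphrase the slide; the structure itself is hypothetical.
SRE_CHECKS = {
    "Alerts": ["Alerts defined to help prevent problems"],
    "Documentation": [
        "Architecture diagrams",
        "Escalation policies",
        "Run playbooks",
    ],
    "Lower Env. Deployments": [
        "Rollbacks tested",
        "Versioning strategy",
        "Disaster recovery strategy",
    ],
    "Downtime & SLA": ["SLA expectations", "Scheduled downtime: when and why"],
    "Dependencies": ["Internal and external dependencies enumerated"],
    "Fire drill": ["Chaos experiments identified", "Failure paths tested under load"],
}


def open_items(completed):
    """Return {category: [items]} for every check not in `completed`."""
    missing = {}
    for category, items in SRE_CHECKS.items():
        pending = [item for item in items if item not in completed]
        if pending:
            missing[category] = pending
    return missing


def ready_for_review(completed):
    """A service passes only when no checklist item is left open."""
    return not open_items(completed)
```

For example, `open_items({"Architecture diagrams"})` lists every other item grouped by category, and `ready_for_review` returns `True` only once all items are marked complete.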
10. Monitoring & Alerting
• You don’t know what’s wrong if you’re not monitoring it
• New Relic + Sumo Logic feed into PagerDuty
• All alerts defined as code in service repo
• Each team defines what alerts wake them up
• During work hours: Non-Prod = Prod
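The idea of alerts defined as code, with each team deciding what wakes them up, can be sketched as plain data plus a paging rule. Everything here is an assumption for illustration: the `AlertRule` fields, the 09:00-17:00 Monday-to-Friday working hours, and the reading of "Non-Prod = Prod" as "non-production alerts page like production ones during working hours". No New Relic, Sumo Logic or PagerDuty API is used.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed working hours; the slide does not specify them.
WORK_START, WORK_END = time(9, 0), time(17, 0)


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition that would live in the service repo."""
    name: str
    metric: str
    threshold: float
    wakes_team: bool  # the owning team decides if this pages out of hours


def is_work_hours(now: datetime) -> bool:
    """Monday to Friday, between WORK_START (inclusive) and WORK_END (exclusive)."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def should_page(rule: AlertRule, value: float, environment: str, now: datetime) -> bool:
    """Decide whether a metric reading should page the owning team."""
    if value <= rule.threshold:
        return False  # nothing is firing
    if is_work_hours(now):
        return True  # during work hours non-prod is treated like prod
    # Out of hours only production pages, and only for alerts the team opted into.
    return environment == "prod" and rule.wakes_team
```

A rule such as `AlertRule("high-error-rate", "error_rate", 0.05, wakes_team=True)` would page for a dev environment at 10:00 on a weekday but only for prod at 02:00.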
11. Automation
• Maintain: Automation to enable monitoring and perform necessary operation tasks.
• Support: Automation & tools to enable teams to support the applications.
• Deploy: Automation to deploy and validate application stability.
• Build: Automation to build, publish and archive artifacts.
SRE Review
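The "deploy and validate" step can be illustrated with a small deploy-then-check loop that rolls back on the first failed health check. This is a sketch under stated assumptions, not the Genesys pipeline: `deploy`, `health_check` and `rollback` are hypothetical callables supplied by the caller, and the number of checks is arbitrary.

```python
def deploy_with_validation(deploy, health_check, rollback, checks=3):
    """Deploy, then run `checks` health checks; roll back on the first failure.

    Returns True if the new version was kept, False if it was rolled back.
    All three callables are hypothetical hooks provided by the caller.
    """
    deploy()
    for _ in range(checks):
        if not health_check():
            rollback()
            return False
    return True
```

Keeping the rollback inside the same routine means the rollback path is exercised by the same automation that deploys, which fits the checklist question about testing rollbacks.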
13. Erebus – Our Chaos Engine
◦ Network-related issues
◦ CPU spikes
◦ Memory issues
◦ Disk full
◦ I/O spikes
◦ DNS issues
◦ Imposter box
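To make the idea of scheduled chaos concrete, the sketch below plans a day's worth of fault injections by picking a random fault type, target and time for each event. It is not how Erebus works internally, which the slides do not describe: `FAULT_TYPES` holds only the seven types listed above, and `plan_chaos_events` and the target names are hypothetical. The function only builds a plan; it injects nothing.

```python
import random

# The seven fault types listed on the slide (identifiers are illustrative).
FAULT_TYPES = [
    "network",
    "cpu_spike",
    "memory",
    "disk_full",
    "io_spike",
    "dns",
    "imposter_box",
]

SECONDS_PER_DAY = 24 * 60 * 60


def plan_chaos_events(targets, events_per_day, seed=None):
    """Return a time-ordered list of planned chaos events for one day.

    Each event is a dict with a start offset in seconds, a fault type and
    a target. Passing a seed makes the plan reproducible.
    """
    rng = random.Random(seed)
    plan = [
        {
            "second_of_day": rng.randrange(SECONDS_PER_DAY),
            "fault": rng.choice(FAULT_TYPES),
            "target": rng.choice(targets),
        }
        for _ in range(events_per_day)
    ]
    plan.sort(key=lambda event: event["second_of_day"])
    return plan
```

Calling `plan_chaos_events(["service-a", "service-b"], 500, seed=1)` yields 500 events spread over the day, in line with the roughly 500 daily chaos events mentioned earlier in the deck.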
14. Root Cause Analysis
• Necessary for Production incidents and near misses
• Blameless process
• Results of the RCA shared with the whole organization and reviewed weekly
• Training on how to write an RCA
• RCA reviews on:
  ◦ past incidents
  ◦ Erebus incidents
15. Load Testing
◦ 2x Prod load test in test environment
◦ $$$
◦ Identify deployment issues under load (just like Prod)
◦ Identifying bottlenecks and cost reduction
◦ Capacity planning
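The load target and the resulting capacity headroom are simple ratios, sketched below. The 2x multiplier comes from the slide; measuring load in requests per second and the function names are assumptions made for the example.

```python
def load_test_target(prod_peak_rps, multiplier=2.0):
    """Load to offer in the test environment, as a multiple of production peak."""
    if prod_peak_rps < 0 or multiplier <= 0:
        raise ValueError("peak load must be non-negative and multiplier positive")
    return prod_peak_rps * multiplier


def headroom(max_sustained_rps, prod_peak_rps):
    """Ratio of the highest load sustained in test to the production peak."""
    if prod_peak_rps <= 0:
        raise ValueError("production peak must be positive")
    return max_sustained_rps / prod_peak_rps


# Example with made-up numbers: production peaks at 1,000 requests per second.
target = load_test_target(1000)           # load to drive in the test environment
observed_headroom = headroom(2500, 1000)  # the system held 2,500 rps in test
```

A headroom below the multiplier signals a bottleneck to fix before production reaches that load; a headroom far above it may point to over-provisioning and a chance to cut cost.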