DevOpsDays Galway 2019 - SRE at Genesys

A whirlwind tour of SRE practices at Genesys, presented by Colm Hally and Siddharth Raizada at DevOpsDays Galway 2019.

Published in: Engineering

  1. 1. How Chaos Monkey Went from Feared Foe to Trusted Friend – SRE @ Genesys
      Colm Hally – colm.hally@genesys.com / @colmhally
      Siddharth Raizada – siddharth.raizada@genesys.com
  2. 2. Who are Genesys?
      “Genesys® powers the world’s best customer experiences, across every channel, on-premise and in the cloud.”
      • 11,000+ customers
      • 100+ countries
      • 6 AWS Regions
      • 25 billion interactions / year
  3. 3. Who/What is SRE?
      In Genesys, SRE = Service Reliability Engineering
  4. 4. Why Practice SRE in Genesys?
      • Platform stability as important as delivering features
      • Production-first mindset
      • Cloud platform
      • Break your system and learn from it
      THE ONLY GUARANTEE IS CHANGE AND FAILURE
  5. 5. Genesys Cloud - SRE and QA Highlights
      • Created SRE team in 2017 to enhance scalability and reliability of the Genesys Cloud platform (600% usage growth)
      • Recreation of every production incident
      • Load testing at 1.7X-2X offered load runs each day
      • ~500 chaos events daily (12 types)
      • ~2000 deployments weekly in dev
      • ~721 deployments weekly in prod
      • 15,904 test jobs weekly; 50-250 tests per job
      • 29,000 automated tests
  6. 6. Service Owners
      • BUILD IT
      • RUN IT
      • SUPPORT IT
      • SECURE IT
      • OWN IT
  7. 7. SRE Review
      ◦ Run by SRE Team
        • before building a new service
        • when production incidents repeat (“Code Red”)
      ◦ SRE hold the keys to Production
      ◦ Perform Fire Drills – single team
      ◦ Game Day – large-scale Chaos involving the whole organization
  8. 8. SRE Checks
      ◦ Alerts
        • Define alerts to help prevent problems instead of notifying of problems
      ◦ Documentation
        • Architecture diagrams
        • Escalation policies
        • Run playbooks
      ◦ Lower Env. Deployments
        • Did you test rollbacks?
        • Versioning strategy?
        • Disaster recovery strategy?
      ◦ Downtime & SLA
        • SLA expectations
        • When and why would you need to schedule a downtime?
      ◦ Dependencies
        • Enumerate all dependencies, internal & external
      ◦ Fire drill
        • Identify chaos experiments
        • Test for failure paths under load
  9. 9. SRE Lifecycle
      (Cycle diagram; labels as extracted) Production Incidents; Critical Escalations; IDENTIFY TRENDS; SRE REVIEWS; Resiliency; Recovery; FIRE DRILL; Chaos; Validate assumptions; Product Priority; Education & Training; FEED BACK; Update Tooling
  10. 10. Monitoring & Alerting
      • You don’t know what’s wrong if you’re not monitoring it
      • New Relic + Sumo Logic feed into PagerDuty
      • All alerts defined as code in the service repo
      • Each team defines what alerts wake them up
      • During work hours: Non-Prod = Prod
  11. 11. Automation
      • Maintain: Automation to enable monitoring and perform necessary operational tasks.
      • Support: Automation & tools to enable teams to support the applications.
      • Deploy: Automation to deploy and validate application stability.
      • Build: Automation to build, publish and archive artifacts.
      • SRE Review
  12. 12. SRE Lifecycle
      (Cycle diagram; labels as extracted) Production Incidents; Critical Escalations; IDENTIFY TRENDS; SRE REVIEWS; Resiliency; Recovery; FIRE DRILL; Chaos; Validate assumptions; Product Priority; Education & Training; FEED BACK; Update Tooling; Automation
  13. 13. Erebus – Our Chaos Engine
      ◦ Network-related issues
      ◦ CPU spikes
      ◦ Memory issues
      ◦ Disk full
      ◦ I/O spikes
      ◦ DNS issues
      ◦ Imposter box
  14. 14. Root Cause Analysis
      • Necessary for Production incidents and near misses
      • Blameless process
      • Results of the RCA shared with the whole organization and reviewed weekly
      • Training on how to write an RCA
      • RCA reviews on:
        • past incidents
        • Erebus incidents
  15. 15. Load Testing
      ◦ 2x Prod load test in test environment
      ◦ $$$
      ◦ Identify deployment issues under load (just like Prod)
      ◦ Identifying bottlenecks and cost reduction
      ◦ Capacity planning
  16. 16. Key Takeaways
      • Production-first mindset
      • Embrace Chaos as a learning tool
      • SRE takes time, money, and buy-in!
  17. 17. Thank You
      We’re hiring: https://careers.genesys.com/galway
      Colm Hally – colm.hally@genesys.com / @colmhally
      Siddharth Raizada – siddharth.raizada@genesys.com / LinkedIn: Siddharth Raizada
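
Slide 10 says that all alerts are defined as code in the service repo, that each team decides which alerts wake them up, and that during work hours non-prod is treated like prod. The slides do not show what such a definition looks like, so the following is only a minimal Python sketch of the idea. The `AlertRule` schema, the `should_page` function, and the off-hours behavior are all assumptions made for illustration, not the Genesys format or any New Relic, Sumo Logic, or PagerDuty API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """One alert definition kept in the service repo (hypothetical schema)."""
    name: str
    metric: str
    threshold: float
    wakes_team: bool  # the owning team decides whether this alert may page off-hours


def should_page(rule: AlertRule, value: float, environment: str, work_hours: bool) -> bool:
    """Decide whether a metric reading should page the owning team."""
    if value <= rule.threshold:
        return False  # reading is within bounds, nothing to do
    if work_hours:
        return True  # assumption: during work hours non-prod is handled like prod
    # Assumption: off-hours, only prod alerts the team opted into will page.
    return environment == "prod" and rule.wakes_team


# Example rule; the name, metric, and threshold are made up for illustration.
HIGH_ERROR_RATE = AlertRule(
    name="high-error-rate", metric="error_rate", threshold=0.05, wakes_team=True
)
```

Keeping the rule as a plain data object means it can be reviewed, versioned, and tested alongside the service code it protects.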

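Slide 13 lists the failure types that Erebus injects, and slide 5 mentions roughly 500 chaos events per day. How Erebus schedules those events is not described in the deck. As a rough illustration only, a daily plan could be drawn at random from the listed failure types. In the sketch below, `plan_chaos_events`, the service names, and the seeding scheme are hypothetical and are not Erebus' actual interface.

```python
import random
from typing import List, Tuple

# Failure types as listed on slide 13 (the deck mentions 12 types; only these are named).
EVENT_TYPES = [
    "network-related issues",
    "CPU spikes",
    "memory issues",
    "disk full",
    "I/O spikes",
    "DNS issues",
    "imposter box",
]


def plan_chaos_events(services: List[str], count: int, seed: int) -> List[Tuple[str, str]]:
    """Return `count` (service, event type) pairs chosen uniformly at random.

    A fixed seed makes the plan reproducible, which helps when an
    experiment has to be replayed during a root cause analysis.
    """
    rng = random.Random(seed)
    return [(rng.choice(services), rng.choice(EVENT_TYPES)) for _ in range(count)]


# Hypothetical usage: a day's worth of events across two made-up services.
daily_plan = plan_chaos_events(["service-a", "service-b"], count=500, seed=2019)
```

Uniform random selection is the simplest possible policy; a real engine would likely weight targets and exclude services that are already degraded.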