DevOpsDays Galway 2019 - SRE at Genesys

Colm Hally

Whirlwind tour of SRE practices in Genesys presented by Colm Hally and Siddharth Raizada at DevOps Days Galway 2019.

Engineering

How Chaos Monkey Went from Feared Foe to Trusted Friend – SRE @ Genesys
Colm Hally – colm.hally@genesys.com / @colmhally
Siddharth Raizada – siddharth.raizada@genesys.com

Who are Genesys?
“Genesys® powers the world’s best customer experiences, across every channel, on-premise and in the cloud.”
11,000+ customers
100+ countries
6 AWS Regions
25 billion interactions / year

Who/What is SRE?
In Genesys, SRE = Service Reliability Engineering

Why Practice SRE in Genesys?
• Platform stability as important as delivering features
• Production-first mindset
• Cloud platform
• Break your system and learn from it
THE ONLY GUARANTEE IS CHANGE AND FAILURE

Genesys Cloud - SRE and QA Highlights
Created SRE team in 2017 to enhance scalability and reliability of the Genesys Cloud platform (600% usage growth)
Recreation of every production incident
Load tests at 1.7x-2x offered load run each day
~500 chaos events daily (12 types)
~2000 deployments weekly in dev
~721 deployments weekly in prod
15,904 test jobs weekly; 50-250 tests per job
29,000 Automated Tests

Service Owners
BUILD IT RUN IT SUPPORT IT SECURE IT OWN IT

SRE Review
◦ Run by SRE Team
• before building a new service
• when production incidents repeat (“Code Red”)
◦ SRE hold the keys to Production
◦ Perform Fire Drills – single team
◦ Game Day – large scale Chaos involving whole organization

SRE Checks
Alerts
• Define alerts to help prevent problems instead of notifying of problems
Documentation
• Architecture diagrams
• Escalation policies
• Run playbooks
Lower Env. Deployments
• Did you test rollbacks?
• Versioning strategy?
• Disaster recovery strategy?
Downtime & SLA
• SLA expectations
• When and why would you need to schedule a downtime?
Dependencies
• Enumerate all dependencies, internal & external
Fire drill
• Identify chaos experiments
• Test for failure paths under load
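
One way to make a checklist like the one above trackable is to hold it as data and report what is still outstanding before a review signs off. The sketch below is illustrative only: the category and item names paraphrase the slide, while `ReviewState`, `mark_done`, `outstanding` and `ready` are hypothetical names, not anything Genesys is known to use.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Categories and items paraphrase the slide; the structure is an assumption.
SRE_CHECKS: dict[str, list[str]] = {
    "Alerts": ["Alerts defined to help prevent problems"],
    "Documentation": ["Architecture diagrams", "Escalation policies", "Run playbooks"],
    "Lower Env. Deployments": [
        "Rollbacks tested",
        "Versioning strategy",
        "Disaster recovery strategy",
    ],
    "Downtime & SLA": ["SLA expectations", "Scheduled downtime rationale"],
    "Dependencies": ["Internal and external dependencies enumerated"],
    "Fire drill": ["Chaos experiments identified", "Failure paths tested under load"],
}


@dataclass
class ReviewState:
    """Tracks which checklist items a service team has completed (hypothetical)."""

    completed: set[str] = field(default_factory=set)

    def mark_done(self, item: str) -> None:
        self.completed.add(item)

    def outstanding(self) -> dict[str, list[str]]:
        """Return the unchecked items, grouped by category."""
        result: dict[str, list[str]] = {}
        for category, items in SRE_CHECKS.items():
            missing = [item for item in items if item not in self.completed]
            if missing:
                result[category] = missing
        return result

    def ready(self) -> bool:
        """A service is ready for review sign-off when nothing is outstanding."""
        return not self.outstanding()
```

Calling `ReviewState().outstanding()` on a fresh review returns all six categories; a team would call `mark_done` as each item is evidenced.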

SRE Lifecycle
[Diagram: SRE lifecycle loop. Labels, in extraction order: Production Incidents; Critical Escalations; Identify Trends; SRE Reviews; Resiliency; Recovery; Fire Drill; Chaos; Validate assumptions; Product Priority; Education & Training; Feedback; Update Tooling]

Monitoring & Alerting
• You don’t know what’s wrong if you’re not monitoring it
• New Relic + Sumo Logic feed into PagerDuty
• All alerts defined as code in service repo
• Each team defines what alerts wake them up
• During work hours: Non-Prod = Prod
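
As a rough illustration of alerts defined as code, with each team choosing what pages them, the sketch below models an alert rule and a paging decision. Everything in it is an assumption made for illustration: the `AlertRule` fields, the 09:00-17:00 weekday work hours, and the reading of "Non-Prod = Prod" as "non-prod alerts page like prod alerts during work hours". It does not use or represent the New Relic, Sumo Logic or PagerDuty APIs.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class AlertRule:
    """A hypothetical alert definition that could live in a service repo."""

    name: str
    environment: str      # "prod" or "non-prod"
    pages_on_call: bool   # the owning team decides whether this alert pages them


# Assumed work hours; the slides do not specify them.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def in_work_hours(now: datetime) -> bool:
    """True on Monday-Friday between WORK_START (inclusive) and WORK_END (exclusive)."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def should_page(rule: AlertRule, now: datetime) -> bool:
    """Decide whether an alert firing at `now` should page the team."""
    if not rule.pages_on_call:
        return False
    if rule.environment == "prod":
        return True
    # Assumed reading of "During work hours: Non-Prod = Prod".
    return in_work_hours(now)
```

Under these assumptions a non-prod alert firing at 03:00 stays quiet, while the same rule firing mid-morning on a weekday pages the team just as a prod alert would.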

Automation
Maintain
• Automation to enable monitoring and perform necessary operational tasks.
Support
• Automation & tools to enable teams to support the applications.
Deploy
• Automation to deploy and validate application stability.
Build
• Automation to build, publish and archive artifacts.
SRE Review

Erebus – Our Chaos Engine
◦ Network-related issues
◦ CPU spikes
◦ Memory issues
◦ Disk full
◦ I/O spikes
◦ DNS issues
◦ Imposter box
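
The slides do not describe how Erebus chooses or injects faults, so the following is only a generic sketch of how a chaos engine might plan a day's events: pick a random target and a random fault type for each event. The fault names mirror the seven listed above; `plan_chaos_events`, the target names and the seeding scheme are hypothetical, and no fault is actually injected.

```python
import random

# Fault types named on the slide; identifiers are illustrative.
FAULT_TYPES = [
    "network",
    "cpu_spike",
    "memory",
    "disk_full",
    "io_spike",
    "dns",
    "imposter_box",
]


def plan_chaos_events(targets, events_per_day, seed=None):
    """Return a list of (target, fault_type) pairs for one day.

    A seed makes the plan reproducible, which helps when an event
    needs to be replayed during a later review.
    """
    if not targets:
        raise ValueError("at least one target is required")
    rng = random.Random(seed)
    return [
        (rng.choice(targets), rng.choice(FAULT_TYPES))
        for _ in range(events_per_day)
    ]
```

For example, `plan_chaos_events(["svc-a", "svc-b"], 500, seed=1)` yields 500 planned events, and repeating the call with the same seed yields the same plan.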

Root Cause Analysis
• Necessary for Production incidents and near misses
• Blameless process
• Results of the RCA shared with the whole organization and reviewed weekly
• Training on how to write an RCA
• RCA reviews on:
◦ past incidents
◦ Erebus incidents

Load Testing
◦ 2x Prod load test in test environment
◦ $$$
◦ Identify deployment issues under load (just like Prod)
◦ Identifying bottlenecks and cost reduction
◦ Capacity planning
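
The arithmetic behind a 2x production load test can be sketched as follows: take the peak production rate, multiply it to get the test target, and compare the capacity measured in the test environment against that target. The function names and the requests-per-second unit are assumptions for illustration; the slides do not say how Genesys measures load.

```python
def target_rate(peak_prod_rate: float, multiplier: float = 2.0) -> float:
    """Offered load for the test: peak production rate times the multiplier."""
    if peak_prod_rate <= 0 or multiplier <= 0:
        raise ValueError("rate and multiplier must be positive")
    return peak_prod_rate * multiplier


def headroom(measured_capacity: float, peak_prod_rate: float) -> float:
    """How many multiples of peak production load the system sustained."""
    if peak_prod_rate <= 0:
        raise ValueError("peak_prod_rate must be positive")
    return measured_capacity / peak_prod_rate


def passes_load_test(
    measured_capacity: float, peak_prod_rate: float, multiplier: float = 2.0
) -> bool:
    """True when the measured capacity meets or exceeds the test target."""
    return measured_capacity >= target_rate(peak_prod_rate, multiplier)
```

With a hypothetical peak of 1,000 requests per second, the 2x target is 2,000 requests per second, so a system that sustained 2,400 passes while one that saturated at 1,800 points to a bottleneck worth investigating for capacity planning.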

Key Takeaways
• Production-first mindset
• Embrace Chaos as a learning tool
• SRE takes time, money, and buy-in!

Thank You
We’re hiring https://careers.genesys.com/galway
Colm Hally – colm.hally@genesys.com / @colmhally
Siddharth Raizada – siddharth.raizada@genesys.com / Siddharth Raizada - LinkedIn
