Scaling A Start-up DevOps Team To 10x
While Scaling The System 50x
Christian Beedgen – Co-Founder & CTO
Stefan Zier – Lead Architect
DevOpsDays Austin 2014
Intro
Christian Beedgen
– Co-Founder, CTO
– ArcSight, Amazon, …
– No prior experience running production systems
Stefan Zier
– Lead Architect, first engineer
– ArcSight, Amazon, …
– No prior experience running production systems
Scaling
Spreading constructive beliefs and behavior
from the few to the many.
Robert I. Sutton
Scaling up Excellence: Getting to More Without Settling for Less
The Sumo Logic Service
Petabyte scale log management platform
Big Data™, High Velocity, Human Real Time
Distributed
100% in AWS
Service Oriented Architecture
99% in Scala
Run by engineers
[Chart: Data Ingest]
[Chart: Code Commits, Services]
[Chart: Engineering Head Count (axis 0 to 60)]
The Challenge
Scaling Sumo Logic
– More confidence and uptime
– More operators
– More change
– More services
How We Scaled
DevOps Culture
Spreading Knowledge
Control surfaces
Culture
a shared, learned system of values, beliefs and attitudes that shapes and influences perception and behavior: an abstract “mental blueprint” or “mental code.”
Being On Call
One week, 24/7 responsibility for
– Operational decision making
– Alert response
– Deploying the bits
– Configuration changes
Pair of people (primary, secondary)
– Social schedules & travel
– Training
– Relief after a noisy night
Feedback Loops
Sumo on Sumo
– Perfect dogfooding use case
Post mortems
– Drive improvements from incidents
Alerting
– Code I wrote yesterday just woke me up at 4am
Change Management
Mandated for PCI compliance
– Change Management Board = Channel on Slack
– Change Request = JIRA ticket
– Audit trail = Paste Slack conversation into JIRA
Actually helpful
– Good documentation
– Starts good discussions
– Makes change mindful
Spreading Knowledge
Tactical
– Daily Standups
– Chat
– Playbooks
Strategic
– Mentoring
– “How the sausage is made” sessions
– Checklists
Playbooks
Linked to alert
– GitHub wikis
– URL in alert
Focused on MTTR
– Steps to restore service
– List of Subject Matter Experts to call
Continuously improved
– Boy Scout rule
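One way to keep "URL in alert" from being forgotten is to make the playbook link a required field of every alert definition. The following is a minimal sketch in Scala under that assumption; `PlaybookAlerts`, `Alert`, `define`, `render`, and the example URL are hypothetical names for illustration and are not taken from the talk or from Sumo Logic's code.

```scala
object PlaybookAlerts {
  // Hypothetical alert definition: the playbook link is part of the alert itself.
  final case class Alert(name: String, unhealthyWhen: String, playbookUrl: String)

  // Refuse to define an alert that has no usable playbook link.
  def define(name: String, unhealthyWhen: String, playbookUrl: String): Either[String, Alert] = {
    val url = playbookUrl.trim
    if (url.isEmpty) Left(s"alert '$name' has no playbook")
    else if (!url.startsWith("https://")) Left(s"alert '$name' playbook must be an https URL")
    else Right(Alert(name, unhealthyWhen, url))
  }

  // Text sent to the on-call person: what is wrong and where the restore steps live.
  def render(alert: Alert): String =
    s"[${alert.name}] unhealthy when: ${alert.unhealthyWhen}\nPlaybook: ${alert.playbookUrl}"
}
```

With a constructor like this, an alert without a playbook cannot be created in the first place, which pushes authors toward writing the restore steps before the alert ever fires.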
Three Pillars
Culture
Knowledge
Control surfaces
Checklists
Improve outcomes
– Ensure experts don’t miss any critical steps
– Prevent repeating mistakes
Well designed
– Coherent
– Living documents
– Concise, clear and require specific actions
– Need to be short and well-organized
– Are NOT step-by-step instructions
DevOps Friendly
Control Surfaces matter for scale
– Simplify complex operations
– Consistent view
– Built-in safety
Natural to use
– Easy to learn, discover
Natural to extend
– Every developer
dsh
– CLI
– Full stack
– Fast
– Safe
– Secure
– Proactive
– Discoverable
Model Driven
Creates consistency
Provides guard rails
Deployment
– Cluster
• Instance
– Assembly
Configured at all levels
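The hierarchy on this slide, with configuration available at every level, could be modeled roughly as below. This is a sketch under stated assumptions, not dsh's actual model: the case class fields, the `resolve` helper, and the "most specific level wins" lookup order are all hypothetical.

```scala
object DeploymentModel {
  // Names mirror the slide; the fields are assumptions made for illustration.
  final case class Instance(id: String, config: Map[String, String] = Map.empty)
  final case class Cluster(name: String, instances: List[Instance], config: Map[String, String] = Map.empty)
  final case class Assembly(name: String, clusters: Set[String])
  final case class Deployment(
      name: String,
      clusters: List[Cluster],
      assemblies: List[Assembly],
      config: Map[String, String] = Map.empty)

  // Look a setting up at the most specific level first:
  // instance, then cluster, then deployment.
  def resolve(d: Deployment, clusterName: String, instanceId: String, key: String): Option[String] =
    for {
      cluster  <- d.clusters.find(_.name == clusterName)
      instance <- cluster.instances.find(_.id == instanceId)
      value    <- instance.config.get(key)
                    .orElse(cluster.config.get(key))
                    .orElse(d.config.get(key))
    } yield value
}
```

Keeping the fallback chain in one function is one way a model can provide consistency: every command that needs a setting asks the model instead of re-implementing the lookup.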
daemon restart api:p:25,receiver:p:10
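The editor's notes say this command restarts 25% of api and 10% of receiver at a time. A sketch of how such an argument might be parsed and turned into restart batches follows. It assumes the argument is a comma-separated list of `assembly:p:percent` entries and that batch sizes are rounded down with a minimum of one instance; both are guesses, and `RestartSpec`, `Target`, `parse`, `batchSize`, and `batches` are hypothetical names rather than dsh internals.

```scala
object RestartSpec {
  // One restart target: an assembly and the percentage of its instances
  // to restart at a time (assumed meaning of "name:p:percent").
  final case class Target(assembly: String, percent: Int)

  private def parseOne(part: String): Either[String, Target] =
    part.trim.split(":") match {
      case Array(name, "p", pct)
          if name.nonEmpty && pct.nonEmpty && pct.length <= 3 && pct.forall(_.isDigit) =>
        val p = pct.toInt
        if (p >= 1 && p <= 100) Right(Target(name, p))
        else Left(s"percent out of range in '$part'")
      case _ => Left(s"cannot parse '$part'")
    }

  // Parse the whole argument; the first malformed entry fails the parse.
  def parse(spec: String): Either[String, List[Target]] = {
    val parsed = spec.split(",").toList.map(parseOne)
    parsed.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None      => Right(parsed.collect { case Right(t) => t })
    }
  }

  // Instances restarted concurrently: the percentage rounded down, but at least one.
  def batchSize(instanceCount: Int, percent: Int): Int =
    if (instanceCount <= 0) 0 else math.max(1, instanceCount * percent / 100)

  // Split the instances into consecutive batches of that size.
  def batches[A](instances: List[A], percent: Int): List[List[A]] = {
    val size = batchSize(instances.size, percent)
    if (size == 0) Nil else instances.grouped(size).toList
  }
}
```

A rolling restart would then process one batch at a time and wait for each batch to report healthy before starting the next.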
dsh
– Scala
– Model based
– Trivial to extend
– Specific to OUR needs
– Meaningful defaults
– Prevents mistakes
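The editor's notes give two examples of "prevents mistakes": no deleting EBS volumes in prod and no deploying SNAPSHOT builds to prod. A guard of that kind might look like the sketch below. The `Action` types, the `isProd` rule (matching on the deployment name), and the `-SNAPSHOT` version suffix are assumptions for illustration, not dsh's real implementation.

```scala
object ProdGuards {
  // Hypothetical operator actions that pass through a safety check before they run.
  sealed trait Action
  final case class Deploy(deployment: String, version: String) extends Action
  final case class DeleteVolume(deployment: String, volumeId: String) extends Action

  // Assumption: production deployments are identified by name.
  def isProd(deployment: String): Boolean = deployment == "prod"

  // Left carries the reason an action is refused; Right passes the action through.
  def check(action: Action): Either[String, Action] = action match {
    case Deploy(d, v) if isProd(d) && v.endsWith("-SNAPSHOT") =>
      Left(s"refusing to deploy SNAPSHOT build $v to $d")
    case DeleteVolume(d, id) if isProd(d) =>
      Left(s"refusing to delete volume $id in $d")
    case allowed => Right(allowed)
  }
}
```

Each new incident can add one more case to `check`, which is one way to read "make any mistake exactly once".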
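// Select the running instances of the "zk" cluster.
val filter = FilterBuilder.withCluster("zk")
  .withOnlyRunningInstances.build()
val instances = deployment.connect.describeInstances(filter)

// Run the restart command over SSH on every matching instance, in parallel.
instances.par.foreach { instance =>
  val ssh = instance.connectSSH
  ssh.execute("sudo service api restart")
}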
What would we do differently next time?
Upgrade the system less monolithically
Don’t ask UI developers to do operations
Clearer guidelines on managers & operations
Next Experiments
Divide up big rotation
Bring India development team into rotation
Switch from 24/7 shifts to 12/7
Deploy smaller parts of the system more often
Bring full-time operations people into the mix
Thank You!
Christian Beedgen
@raychaser
Stefan Zier
@stefanzier
We’re hiring!
go.sumologic.com/jobs

Editor's Notes

  1. Founders and initial team all back end Java devs
  2. Organically grown, possibly unique to us. May give you ideas.
  3. Culture is: 1) Learned. You become enculturated when you join Sumo. 2) Shared by the members of the on-call rotation. 3) Patterned. People in the rotation live and think in ways that form definite patterns. 4) Mutually constructed through a constant process of social interaction. 5) Internalized. Habitual. Taken for granted. Perceived as “natural.” Examples of our culture follow.
  4. We like feedback loops.
  5. Members chosen based on track record. There are no meetings. 24/7 CMB session. Quick and frictionless.
  6. How do you learn what you need to know, then stay in the picture?
  7. Tactical: What’s going on with the system NOW? Strategic: What do I need to know to run the system?
  8. Health checks embedded in the code. Require a playbook for every alert. Documentation: “unhealthy when”. Side effect: forces meaningful alerts.
  9. Example: Doctors leave clamps in patients. Checklists are used in other industries with great success (pilots, doctors). Atul Gawande, The Checklist Manifesto. Need to be well managed. Focus on the 80%. Coherent = edited by one person, with suggestions from everybody.
  10. Create sections and describe when they matter. Sometimes include reminders of when to do non-obvious things. Checklists we use regularly: GA readiness, deploy to production, getting ready for on-call rotation, on-call handover.
  11. The interfaces DevOps touch and interact with. Turns out, they matter.
  12. Good control surfaces help scaling, help learning, and help automating. So, what do backend developers like? UIs? Mice? No.
  13. They’re good with CLIs. But CLIs have to be good and easy to learn.
  14. Our internal orchestration tool is called dsh. It’s a CLI. Does the full stack. Uses a really nice readline prompt (jline) with tab completion, history, all the stuff bash has. Uses threading aggressively to make things fast. Has lots of built-in safeguards. We learned from our mistakes. Encourages good security practices. Example: integration with IronKeys. Proactive: checks proactively for things that may cause things to fail. Example: AWS instance limit.
  15. Forces users to do the right thing in a standardized way.
  16. This command performs a rolling restart of the api and receiver assemblies. Here’s what happens behind the scenes: We load the model and find out which account the deployment is in. We load the credentials for that account from the IronKey. We consult the model to find out which clusters run api and receiver. We use the AWS API to query for the list of instances running in those clusters (using a tag query). We query an external service for our own IP address. We use the AWS API to query the security group; if our IP address isn’t included, we add it. We calculate what 25% of api and 10% of receiver amounts to. We launch a thread pool with the correct number of threads. We SSH into the machines and run the script that restarts the daemon. We check Zookeeper and wait for the daemon to be back in service. If applicable, we wait for the ELB to show the instance healthy again. We gather any error messages.
  17. Started out as developers, chose Scala since it was most natural. Model of deployments, clusters, instances, other AWS resources. Adding new commands is REALLY easy. The model is deeply ingrained and omnipresent. Some of the functionality is aware of our application code. Use defaults to manage how you want ops to behave. Special safeguards for production deployments. Make any mistake exactly ONCE. Examples: don’t allow deleting EBS volumes in prod; don’t allow deploying SNAPSHOT builds to prod.
  18. Example of how our model interacts with AWS and Scala. Worth noting how you can easily interact with the model without knowing much about the guts.
  19. As we scale the team further, we will keep on experimenting.