Sumo Logic has built a strong DevOps culture. All our back-end developers are on the on-call rotation, taking responsibility for running the system 24/7 and rolling out new code multiple times a week. This talk is about how we successfully grew that culture: the lessons learned, the things that worked, the things that didn’t, and how our culture has evolved.
Scaling a Start-up DevOps team to 10x while scaling the system 50x
1. Scaling A Start-up DevOps Team To 10x
While Scaling The System 50x
Christian Beedgen – Co-Founder & CTO
Stefan Zier – Lead Architect
DevOpsDays Austin 2014
2. Christian Beedgen
– Co-Founder, CTO
– ArcSight, Amazon, …
– No prior experience running production systems
Stefan Zier
– Lead Architect, first engineer
– ArcSight, Amazon,…
– No prior experience running production systems
Intro
3.
Scaling
Spreading constructive beliefs and behavior
from the few to the many.
Robert I. Sutton
Scaling up Excellence: Getting to More Without Settling for Less
5. Petabyte scale log management platform
Big Data™, High Velocity, Human Real Time
Distributed
100% in AWS
Service Oriented Architecture
99% in Scala
Run by engineers
The Sumo Logic Service
12.
Culture
a shared, learned, system of values,
beliefs and attitudes that shapes and
influences perception and behavior — an
abstract “mental blueprint” or “mental
code.”
13. One week, 24/7 responsibility for
– Operational decision making
– Alert response
– Deploying the bits
– Configuration changes
Pair of people (primary, secondary)
– Social schedules & travel
– Training
– Relief after a noisy night
Being On Call
14. Sumo on Sumo
– Perfect dog fooding use case
Post mortems
– Drive improvements from incidents
Alerting
– Code I wrote yesterday just woke me up at 4am
Feedback Loops
15. Mandated for PCI compliance
– Change Management Board = Channel on Slack
– Change Request = JIRA ticket
– Audit trail = Paste slack conversation into JIRA
Actually helpful
– Good documentation
– Starts good discussions
– Makes change mindful
Change Management
19. Playbooks
Linked to alert
– GitHub wikis
– URL in alert
Focused on MTTR
– Steps to restore service
– List of Subject Matter Experts to call
Continuously improved
– Boy Scout rule
21. Checklists
Improve outcomes
– Ensure experts don’t miss any critical steps
– Prevent repeating mistakes
Well designed
– Coherent
– Living documents
– Concise, clear and require specific actions
– Need to be short and well-organized
– Are NOT step-by-step instructions
24. DevOps Friendly
Control Surfaces matter for scale
– Simplify complex operations
– Consistent view
– Built-in safety
Natural to use
– Easy to learn, discover
Natural to extend
– Every developer
30. dsh
dsh
– Scala
– Model based
– Trivial to extend
– Specific to OUR needs
– Meaningful defaults
– Prevents mistakes
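31.
// Select the running instances in the "zk" cluster.
val filter = FilterBuilder.withCluster("zk").
  withOnlyRunningInstances.build()
val instances = deployment.connect.describeInstances(filter)
// Run the restart command over SSH on every matching instance, in parallel.
instances.par.foreach { instance =>
  val ssh = instance.connectSSH
  ssh.execute("sudo service api restart")
}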
32. What would we do differently next time?
Make system upgrades less monolithic
Don’t ask UI developers to do operations
Clearer guidelines on managers & operations
33. Next Experiments
Divide up big rotation
Bring India development team into rotation
Switch from 24/7 shifts to 12/7
Deploy smaller parts of the system more often
Bring full-time operations people into the mix
Organically grown, possibly unique to us. May give you ideas.
1) Learned. You become encultured when you join Sumo.
2) Shared by the members of the on-call rotation.
3) Patterned. People in the rotation live and think in ways that form definite patterns.
4) Mutually constructed through a constant process of social interaction.
5) Internalized. Habitual. Taken-for-granted. Perceived as “natural.”
Examples of our culture.
We like feedback loops.
Members chosen based on track record. There are no meetings. 24/7 CMB session. Quick and frictionless.
How do you learn what you need to know, then stay in the picture?
Tactical: What’s going on with the system NOW? Strategic: What do I need to know to run the system?
Health checks embedded in the code. Require a playbook for every alert. Documentation: “unhealthy when”. Side effects: force meaningful alerts.
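As a rough illustration of the note above, here is a minimal Scala sketch of a health check that cannot be defined without an "unhealthy when" description and a playbook link. Every name in it (`HealthCheck`, `Alert`, `evaluate`) and the example URL are hypothetical and chosen for illustration; this is not Sumo Logic's actual code.

```scala
// Hypothetical sketch: the names below are illustrative, not from the talk.
object HealthCheckSketch {
  // A health check carries its own documentation and playbook link.
  final case class HealthCheck(
      name: String,
      unhealthyWhen: String,      // plain-language "unhealthy when" documentation
      playbookUrl: String,        // where the on-call person finds restore steps
      isHealthy: () => Boolean) {
    // Refuse to construct a check that lacks documentation or a playbook.
    require(unhealthyWhen.trim.nonEmpty, s"check $name needs an 'unhealthy when' description")
    require(playbookUrl.trim.nonEmpty, s"check $name needs a playbook URL")
  }

  // An alert always includes the playbook URL of the check that fired.
  final case class Alert(check: String, message: String, playbookUrl: String)

  // Run all checks and produce one alert per failing check.
  def evaluate(checks: Seq[HealthCheck]): Seq[Alert] =
    checks.filterNot(_.isHealthy()).map { c =>
      Alert(c.name, s"${c.name} is unhealthy when: ${c.unhealthyWhen}", c.playbookUrl)
    }
}
```

Making the playbook a required field is one way to get the side effect mentioned in the note: an alert that nobody can write a playbook for is probably not a meaningful alert.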
Example: Doctors leave clamps in patients. Used in other industries with great success (pilots, doctors). Atul Gawande, The Checklist Manifesto. Need to be well-managed. Focus on the 80%. Coherent = edited by 1 person, with suggestions from everybody.
Create sections and describe when they matter. Sometimes include reminders of when to do non-obvious things. Checklists we use regularly: GA readiness, deploy to production, getting ready for on-call rotation, on-call handover.
The interfaces DevOps touch and interact with. Turns out, they matter.
Good control surfaces help scaling, help learning, help automating. So… what do backend developers like? UIs? Mice? No.
They’re good with CLIs. But CLIs have to be good and easy to learn.
Our internal orchestration tool is called dsh. It’s a CLI. Does the full stack. Uses a really nice readline prompt (jline) with tab completion, history, all the stuff bash has. Uses threading aggressively to make things fast. Has lots of built-in safeguards. We learned from our mistakes. Encourages good security practices. Example: integration with IronKeys. Proactive: checks up front for things that may cause an operation to fail. Example: AWS instance limit.
Forces users to do the right thing in a standardized way.
This command performs a rolling restart of the api and receiver assemblies. Here’s what happens behind the scenes:
1) Load the model and find out which account the deployment is in.
2) Load the credentials for that account from the IronKey.
3) Consult the model to find out which clusters run api and receiver.
4) Use the AWS API to query for the list of instances running in those clusters (using a tag query).
5) Query an external service for our own IP address.
6) Use the AWS API to query the security group. If our IP address isn’t included, add it.
7) Calculate what 25% of api and 10% of receiver amounts to.
8) Launch a thread pool with the correct number of threads.
9) SSH into the machines.
10) Run the script that restarts the daemon.
11) Check Zookeeper and wait for the daemon to be back in service.
12) If applicable, wait for the ELB to show the instance healthy again.
13) Gather any error messages.
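The percentage-based batching at the heart of that rolling restart can be sketched in a few lines of Scala. This is a simplified illustration under stated assumptions, not dsh itself: `batchSize`, `batches` and `rollingRestart` are hypothetical names, the rounding rule (round down, but never below one instance) is assumed, and instances within a batch are handled sequentially here, whereas the talk describes a thread pool.

```scala
// Hypothetical sketch of percentage-based rolling restarts; not dsh code.
object RollingRestartSketch {
  // How many instances to restart at once: the given percentage of the
  // cluster, rounded down, but never fewer than one (assumed rounding rule).
  def batchSize(instanceCount: Int, percent: Int): Int =
    math.max(1, instanceCount * percent / 100)

  // Split the instances into consecutive batches of that size.
  def batches[A](instances: Seq[A], percent: Int): List[Seq[A]] =
    if (instances.isEmpty) Nil
    else instances.grouped(batchSize(instances.size, percent)).toList

  // Restart one batch at a time, waiting for every instance in the batch
  // to report healthy before moving on. The caller supplies both actions,
  // for example an SSH restart and a Zookeeper or ELB health poll.
  def rollingRestart[A](instances: Seq[A], percent: Int)(
      restart: A => Unit,
      waitUntilHealthy: A => Unit): Unit =
    batches(instances, percent).foreach { batch =>
      batch.foreach(restart)
      batch.foreach(waitUntilHealthy)
    }
}
```

With eight api instances and a 25% limit, `batchSize(8, 25)` gives two instances per batch, so at most a quarter of the cluster is out of service at any time.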
Started out as developers, chose Scala since it was most natural. Model of deployments, clusters, instances, other AWS resources. Adding new commands is REALLY easy. The model is deeply ingrained and omnipresent. Some of the functionality is aware of our application code. Use defaults to manage how you want ops to behave. Special safeguards for production deployments. Make any mistake exactly ONCE. Examples: don’t allow deleting EBS volumes in prod; don’t allow deploying SNAPSHOT builds to prod.
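The two production safeguards mentioned above could look roughly like the following Scala sketch. The names (`Environment`, `checkDeploy`, `checkDeleteVolume`) are hypothetical, and detecting a SNAPSHOT build by the `-SNAPSHOT` version suffix is an assumption based on the common Maven convention; the talk does not say how dsh recognizes such builds.

```scala
// Hypothetical sketch of production safeguards; not the actual dsh checks.
object SafeguardSketch {
  sealed trait Environment
  case object Production extends Environment
  case object Staging extends Environment

  // Left(reason) means the operation is blocked, Right(()) means allowed.
  type Verdict = Either[String, Unit]

  // Assumption: SNAPSHOT builds are recognized by a "-SNAPSHOT" version suffix.
  def checkDeploy(env: Environment, buildVersion: String): Verdict =
    if (env == Production && buildVersion.endsWith("-SNAPSHOT"))
      Left(s"Refusing to deploy SNAPSHOT build $buildVersion to production")
    else Right(())

  // Deleting EBS volumes is never allowed in production.
  def checkDeleteVolume(env: Environment, volumeId: String): Verdict =
    if (env == Production)
      Left(s"Refusing to delete EBS volume $volumeId in production")
    else Right(())
}
```

Encoding each past mistake as a check like this is one way to make sure it is only ever made once.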
Example of how our model interacts with AWS and Scala. Worth noting how you can easily interact with the model without knowing much about the guts.
As we scale the team further, we will keep on experimenting.