Scaling A Start-up DevOps Team To 10x
While Scaling The System 50x
Christian Beedgen – Co-Founder & CTO
Stefan Zier – Lead Architect
DevOpsDays Austin 2014
Intro
Christian Beedgen
– Co-Founder, CTO
– ArcSight, Amazon, …
– No prior experience running production systems
Stefan Zier
– Lead Architect, first engineer
– ArcSight, Amazon, …
– No prior experience running production systems
Scaling
Spreading constructive beliefs and behavior
from the few to the many.
Robert I. Sutton
Scaling up Excellence: Getting to More Without Settling for Less
The Sumo Logic Service
Petabyte scale log management platform
Big Data™, High Velocity, Human Real Time
Distributed
100% in AWS
Service Oriented Architecture
99% in Scala
Run by engineers
Data Ingest
Code Commits, Services
Engineering Head Count
[Chart not reproduced; y-axis runs from 0 to 60]
The Challenge
Scaling Sumo Logic
– More confidence and uptime
– More operators
– More change
– More services
How We Scaled
DevOps Culture
Spreading Knowledge
Control surfaces
Culture
a shared, learned, system of values,
beliefs and attitudes that shapes and
influences perception and behavior — an
abstract “mental blueprint” or “mental
code.”
Being On Call
One week, 24/7 responsibility for
– Operational decision making
– Alert response
– Deploying the bits
– Configuration changes
Pair of people (primary, secondary)
– Social schedules & travel
– Training
– Relief after a noisy night
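A primary/secondary pairing like this can be generated mechanically. The sketch below is one hypothetical way to do it, not a description of how Sumo Logic builds its schedule: it assumes a fixed roster and that each week's secondary becomes the next week's primary, so everyone shadows for a week before leading. `OnCallRotation`, `Shift`, and the roster are illustrative names.

```scala
object OnCallRotation {
  // One week of on-call duty: a primary and a secondary engineer.
  final case class Shift(week: Int, primary: String, secondary: String)

  // Assumption: this week's secondary becomes next week's primary.
  def shiftFor(roster: Vector[String], week: Int): Shift = {
    require(roster.size >= 2, "a pair needs at least two engineers")
    require(week >= 0, "week index must be non-negative")
    val primary = roster(week % roster.size)
    val secondary = roster((week + 1) % roster.size)
    Shift(week, primary, secondary)
  }

  // The first `weeks` shifts for a roster, in order.
  def schedule(roster: Vector[String], weeks: Int): Vector[Shift] =
    Vector.tabulate(weeks)(w => shiftFor(roster, w))
}
```

Swaps for travel or relief after a noisy night would sit on top of a base schedule like this one.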
Feedback Loops
Sumo on Sumo
– Perfect dogfooding use case
Post mortems
– Drive improvements from incidents
Alerting
– Code I wrote yesterday just woke me up at 4am
Change Management
Mandated for PCI compliance
– Change Management Board = Channel on Slack
– Change Request = JIRA ticket
– Audit trail = Paste Slack conversation into JIRA
Actually helpful
– Good documentation
– Starts good discussions
– Makes change mindful
Spreading Knowledge
Tactical
– Daily Standups
– Chat
– Playbooks
Strategic
– Mentoring
– “How the sausage is made” sessions
– Checklists
Playbooks
Linked to alert
– GitHub wikis
– URL in alert
Focused on MTTR
– Steps to restore service
– List of Subject Matter Experts to call
Continuously improved
– Boy Scout rule
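The speaker notes mention requiring a playbook for every alert and documenting "unhealthy when". One way to enforce that in code is to make an alert impossible to define without both. This is a minimal sketch under that assumption. `PlaybookAlerts`, `AlertDefinition`, and the example URL are hypothetical, not part of any Sumo Logic tooling.

```scala
object PlaybookAlerts {
  // An alert definition that cannot exist without a playbook link.
  final class AlertDefinition private (
      val name: String,
      val unhealthyWhen: String,
      val playbookUrl: String
  ) {
    // Text sent to the on-call engineer, with the playbook one click away.
    def notification: String =
      s"ALERT $name: unhealthy when $unhealthyWhen. Playbook: $playbookUrl"
  }

  object AlertDefinition {
    // The only way to build a definition; rejects undocumented alerts.
    def define(
        name: String,
        unhealthyWhen: String,
        playbookUrl: String
    ): Either[String, AlertDefinition] =
      if (unhealthyWhen.trim.isEmpty)
        Left(s"alert '$name' must document when it is unhealthy")
      else if (!playbookUrl.startsWith("https://"))
        Left(s"alert '$name' needs a playbook URL")
      else
        Right(new AlertDefinition(name, unhealthyWhen, playbookUrl))
  }
}
```

Rejecting the definition up front is what pushes authors toward meaningful alerts: if nobody can write the steps to restore service, the alert probably should not page anyone.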
Three Pillars
Culture
Knowledge
Control surfaces
Checklists
Improve outcomes
– Ensure experts don’t miss any critical steps
– Prevent repeating mistakes
Well designed
– Coherent
– Living documents
– Concise, clear and require specific actions
– Need to be short and well-organized
– Are NOT step-by-step instructions
DevOps Friendly
Control Surfaces matter for scale
– Simplify complex operations
– Consistent view
– Built-in safety
Natural to use
– Easy to learn, discover
Natural to extend
– Every developer
dsh
– CLI
– Full stack
– Fast
– Safe
– Secure
– Proactive
– Discoverable
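The speaker notes give the AWS instance limit as an example of a proactive check. The sketch below shows the shape such a preflight check might take. It is an illustration, not dsh code: `Preflight` and its members are hypothetical, and the running count and limit are passed in rather than fetched from AWS.

```scala
object Preflight {
  sealed trait CheckResult
  case object Ok extends CheckResult
  final case class Blocked(reason: String) extends CheckResult

  // Refuse up front if launching `requested` more instances would exceed
  // the account limit, instead of failing halfway through a deployment.
  def instanceLimitCheck(running: Int, requested: Int, limit: Int): CheckResult =
    if (running + requested <= limit) Ok
    else Blocked(s"need ${running + requested} instances but the limit is $limit")
}
```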
Model Driven
Creates consistency
Provides guard rails
Deployment
– Cluster
• Instance
– Assembly
Configured at all levels
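A rough sketch of what such a model could look like in Scala follows. The slide names only the levels (deployment, cluster, instance, assembly). The field names, the link from clusters to the assemblies they run, and the precedence rule for configuration are all assumptions made for illustration.

```scala
object DeploymentModel {
  type Config = Map[String, String]

  final case class Instance(id: String, config: Config = Map.empty)
  final case class Cluster(
      name: String,
      assemblies: Set[String], // names of the assemblies this cluster runs
      instances: Vector[Instance],
      config: Config = Map.empty
  )
  final case class Assembly(name: String, config: Config = Map.empty)
  final case class Deployment(
      name: String,
      clusters: Vector[Cluster],
      assemblies: Vector[Assembly],
      config: Config = Map.empty
  )

  // Assumed precedence: instance overrides cluster, cluster overrides deployment.
  def setting(key: String, d: Deployment, c: Cluster, i: Instance): Option[String] =
    i.config.get(key).orElse(c.config.get(key)).orElse(d.config.get(key))

  // Which clusters run a given assembly, e.g. "api".
  def clustersRunning(d: Deployment, assembly: String): Vector[Cluster] =
    d.clusters.filter(_.assemblies.contains(assembly))
}
```

With every command resolving targets and settings through one model like this, operators get the same view of the system regardless of which command they run.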
daemon restart api:p:25,receiver:p:10
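According to the speaker notes, this command restarts 25% of the api instances and 10% of the receiver instances at a time. The sketch below parses such a target list and turns a percentage into a batch size. Reading `p` as "percent", rounding down, and always restarting at least one instance are assumptions. `RestartSpec` is a hypothetical name, not the dsh parser.

```scala
object RestartSpec {
  final case class Target(assembly: String, percent: Int)

  // Parses "api:p:25,receiver:p:10" into targets. Treating "p" as
  // "percent of instances per batch" is an assumption from the notes.
  def parse(spec: String): Either[String, List[Target]] = {
    val parsed: List[Either[String, Target]] =
      spec.split(",").toList.map { part =>
        part.trim.split(":").toList match {
          case List(name, "p", pct)
              if name.nonEmpty && pct.nonEmpty && pct.length <= 3 && pct.forall(_.isDigit) =>
            val p = pct.toInt
            if (p >= 1 && p <= 100) Right(Target(name, p))
            else Left(s"percentage out of range: $p")
          case other =>
            Left(s"cannot parse '${other.mkString(":")}'")
        }
      }
    // Fail on the first bad entry, otherwise return all targets in order.
    parsed.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None      => Right(parsed.collect { case Right(t) => t })
    }
  }

  // Instances restarted at once: rounded down, but never zero for a
  // non-empty cluster, so a small cluster still makes progress.
  def batchSize(instanceCount: Int, percent: Int): Int =
    if (instanceCount <= 0) 0 else math.max(1, instanceCount * percent / 100)
}
```

For example, under these assumptions a 40-instance api cluster at 25% would be restarted 10 instances at a time.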
dsh
– Scala
– Model based
– Trivial to extend
– Specific to OUR needs
– Meaningful defaults
– Prevents mistakes
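The speaker notes name two production safeguards: no deleting EBS volumes in prod, and no deploying SNAPSHOT builds to prod. A guard of that kind might look like the sketch below. The `Operation` types and the `check` function are hypothetical and only illustrate the idea of refusing an operation before it runs.

```scala
object Safeguards {
  sealed trait Operation
  final case class DeployBuild(version: String) extends Operation
  final case class DeleteVolume(volumeId: String) extends Operation
  final case class RestartAssembly(name: String) extends Operation

  // Returns Left with a reason when the operation is refused,
  // Right with the unchanged operation when it may proceed.
  def check(isProduction: Boolean, op: Operation): Either[String, Operation] =
    op match {
      case DeployBuild(v) if isProduction && v.endsWith("-SNAPSHOT") =>
        Left(s"refusing to deploy SNAPSHOT build $v to production")
      case DeleteVolume(id) if isProduction =>
        Left(s"refusing to delete EBS volume $id in production")
      case allowed =>
        Right(allowed)
    }
}
```

Each new rule is one more `case`, which is one way to act on "make any mistake exactly once".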
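// Select the running instances of the "zk" cluster from the model
val filter = FilterBuilder.withCluster("zk")
  .withOnlyRunningInstances
  .build()
val instances = deployment.connect.describeInstances(filter)

// Run the api restart command on every matching instance, in parallel
instances.par.foreach { instance =>
  val ssh = instance.connectSSH
  ssh.execute("sudo service api restart")
}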
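The snippet above restarts everything at once. The rolling restart described in the speaker notes works in batches and waits for each batch to come back before moving on. Below is a self-contained sketch of that control flow using only the Scala standard library. It is not the dsh implementation: `RollingRestart.run` is a hypothetical name, and the `restart` and `isHealthy` functions stand in for the SSH, Zookeeper, and ELB steps.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object RollingRestart {
  // Restarts `instances` in batches of `batchSize`, finishing each batch
  // before starting the next. Returns (instance, error message) pairs.
  def run(
      instances: Vector[String],
      batchSize: Int,
      restart: String => Unit,
      isHealthy: String => Boolean
  )(implicit ec: ExecutionContext): Vector[(String, String)] = {
    require(batchSize > 0, "batch size must be positive")
    instances.grouped(batchSize).toVector.flatMap { batch =>
      // Restart all members of the batch concurrently.
      val attempts = batch.map { id =>
        Future {
          Try(restart(id)) match {
            case Failure(e)                   => Some(id -> s"restart failed: ${e.getMessage}")
            case Success(_) if !isHealthy(id) => Some(id -> "did not become healthy")
            case Success(_)                   => None
          }
        }
      }
      // Block until the whole batch is done before touching the next one.
      // The five-minute timeout is an arbitrary placeholder.
      Await.result(Future.sequence(attempts), 5.minutes).flatten
    }
  }
}
```

Collecting errors instead of aborting matches the last step in the notes, "gather any error messages". Whether a real tool should stop at the first failed batch is a separate design decision this sketch leaves open.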
What would we do differently next time?
Make system upgrades less monolithic
Don’t ask UI developers to do operations
Clearer guidelines on managers & operations
Next Experiments
Divide up big rotation
Bring India development team into rotation
Switch from 24/7 shifts to 12/7
Deploy smaller parts of the system more often
Bring full-time operations people into the mix
Thank You!
Christian Beedgen
@raychaser
Stefan Zier
@stefanzier
We’re hiring!
go.sumologic.com/jobs


Editor's Notes

  • #9 Founders and initial team all back end Java devs
  • #12 Organically grown, possibly unique to us. May give you ideas.
  • #13 1) Learned. You become enculturated when you join Sumo. 2) Shared by the members of the on-call rotation. 3) Patterned. People in the rotation live and think in ways that form definite patterns. 4) Mutually constructed through a constant process of social interaction. 5) Internalized. Habitual. Taken for granted. Perceived as “natural.” Examples of our culture.
  • #15 We like feedback loops.
  • #16 Members chosen based on track record. There are no meetings. The CMB session runs 24/7. Quick and frictionless.
  • #17 How do you learn what you need to know, then stay in the picture?
  • #18 Tactical: What’s going on with the system NOW? Strategic: What do I need to know to run the system?
  • #20 Health checks embedded in the code. Require a playbook for every alert. Documentation: “unhealthy when”. Side effect: forces meaningful alerts.
  • #22 Example: doctors leave clamps in patients. Used in other industries with great success (pilots, doctors). Atul Gawande, The Checklist Manifesto. Need to be well-managed. Focus on the 80%. Coherent = edited by one person, with suggestions from everybody.
  • #23 Create sections and describe when they matter. Sometimes include reminders of when to do non-obvious things. Checklists we use regularly: GA readiness, deploy to production, getting ready for on-call rotation, on-call handover.
  • #24 The interfaces DevOps touch and interact with. Turns out, they matter.
  • #25 Good control surfaces help scaling, help learning, help automating. So, what do backend developers like? UIs? Mice? No.
  • #26 They’re good with CLIs. But CLIs have to be good and easy to learn.
  • #27 Our internal orchestration tool is called dsh. It’s a CLI. Does the full stack. Uses a really nice readline prompt (jline) with tab completion, history, all the stuff bash has. Uses threading aggressively to make things fast. Has lots of built-in safeguards. We learned from our mistakes. Encourages good security practices. Example: integration with IronKeys. Proactive: checks up front for things that may cause operations to fail. Example: AWS instance limit.
  • #28 Forces users to do the right thing in a standardized way.
  • #29 This command performs a rolling restart of the api and receiver assemblies. Here’s what happens behind the scenes:
      1) Load the model and find out which account the deployment is in.
      2) Load the credentials for that account from the IronKey.
      3) Consult the model to find out which clusters run api and receiver.
      4) Use the AWS API to query for the list of instances running in those clusters (using a tag query).
      5) Query an external service for our own IP address.
      6) Use the AWS API to query the security group. If our IP address isn’t included, add it.
      7) Calculate what 25% of api and 10% of receiver amounts to.
      8) Launch a thread pool with the correct number of threads.
      9) SSH into the machines.
      10) Run the script that restarts the daemon.
      11) Check Zookeeper and wait for the daemon to be back in service.
      12) If applicable, wait for the ELB to show the instance healthy again.
      13) Gather any error messages.
  • #31 Started out as developers, chose Scala since it was most natural. Model of deployments, clusters, instances, other AWS resources. Adding new commands is REALLY easy. The model is deeply ingrained and omnipresent. Some of the functionality is aware of our application code. Use defaults to manage how you want ops to behave. Special safeguards for production deployments. Make any mistake exactly ONCE. Examples: don’t allow deleting EBS volumes in prod; don’t allow deploying SNAPSHOT builds to prod.
  • #32 Example of how our model interacts with AWS and Scala. Worth noting how you can easily interact with the model without knowing much about the guts.
  • #34 As we scale the team further, we will keep on experimenting.