Scaling a Start-up DevOps team to 10x while scaling the system 50x

Scaling A Start-up DevOps Team To 10x
While Scaling The System 50x
Christian Beedgen – Co-Founder & CTO
Stefan Zier – Lead Architect
DevOpsDays Austin 2014

Intro
Christian Beedgen
– Co-Founder, CTO
– ArcSight, Amazon, …
– No prior experience running production systems
Stefan Zier
– Lead Architect, first engineer
– ArcSight, Amazon, …
– No prior experience running production systems

Scaling
Spreading constructive beliefs and behavior
from the few to the many.
Robert I. Sutton
Scaling up Excellence: Getting to More Without Settling for Less

The Sumo Logic Service
Petabyte scale log management platform
Big Data™, High Velocity, Human Real Time
Distributed
100% in AWS
Service Oriented Architecture
99% in Scala
Run by engineers

Engineering Head Count
[Chart: engineering head count, axis ticks from 0 to 60]

The Challenge
Scaling Sumo Logic
– More confidence and uptime
– More operators
– More change
– More services

How We Scaled
DevOps Culture
Spreading Knowledge
Control surfaces

Culture
a shared, learned, system of values,
beliefs and attitudes that shapes and
influences perception and behavior — an
abstract “mental blueprint” or “mental
code.”

Being On Call
One week, 24/7 responsibility for
– Operational decision making
– Alert response
– Deploying the bits
– Configuration changes
Pair of people (primary, secondary)
– Social schedules & travel
– Training
– Relief after a noisy night

Feedback Loops
Sumo on Sumo
– Perfect dogfooding use case
Post mortems
– Drive improvements from incidents
Alerting
– Code I wrote yesterday just woke me up at 4am

Change Management
Mandated for PCI compliance
– Change Management Board = Channel on Slack
– Change Request = JIRA ticket
– Audit trail = Paste Slack conversation into JIRA
Actually helpful
– Good documentation
– Starts good discussions
– Makes change mindful

Spreading Knowledge
Tactical
– Daily Standups
– Chat
– Playbooks
Strategic
– Mentoring
– “How the sausage is made” sessions
– Checklists

Playbooks
Linked to alert
– GitHub wikis
– URL in alert
Focused on MTTR
– Steps to restore service
– List of Subject Matter Experts to call
Continuously improved
– Boy Scout rule
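
One way to picture "URL in alert" is an alert definition that carries its playbook link, so the person paged lands directly on the restore steps. The following is a minimal sketch, not how Sumo Logic implements it: the Alert type, its field names, and the example URL in the usage note are all assumptions made for illustration.

```scala
object PlaybookAlertSketch {
  // Hypothetical alert shape; the field names are illustrative only.
  final case class Alert(name: String, summary: String, playbookUrl: Option[String])

  // Render the text a pager or chat notification might carry,
  // always including the playbook link (or a loud reminder that it is missing).
  def render(alert: Alert): String = {
    val link = alert.playbookUrl.getOrElse("MISSING - add a playbook for this alert")
    s"[${alert.name}] ${alert.summary}\nPlaybook: $link"
  }

  // Names of alerts that have no playbook yet, so they can be followed up on.
  def withoutPlaybook(alerts: Seq[Alert]): Seq[String] =
    alerts.filter(_.playbookUrl.isEmpty).map(_.name)
}
```

For example, render(Alert("api-latency", "p99 latency above threshold", Some("https://example.com/wiki/api-latency"))) yields a two-line message whose second line is the playbook link. Listing alerts without a playbook fits the "continuously improved" idea: gaps become visible instead of being discovered at 4am.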

Three Pillars
Culture
Knowledge
Control surfaces

Checklists
Improve outcomes
– Ensure experts don’t miss any critical steps
– Prevent repeating mistakes
Well designed
– Coherent
– Living documents
– Concise, clear and require specific actions
– Need to be short and well-organized
– Are NOT step-by-step instructions

DevOps Friendly
Control Surfaces matter for scale
– Simplify complex operations
– Consistent view
– Built-in safety
Natural to use
– Easy to learn, discover
Natural to extend
– Every developer

dsh
dsh
– CLI
– Full stack
– Fast
– Safe
– Secure
– Proactive
– Discoverable

Model Driven
Creates consistency
Provides guard rails
Deployment
– Cluster
• Instance
– Assembly
Configured at all levels
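
The hierarchy above can be sketched as plain case classes. This is a guess at the shape only: the slides do not show the real dsh model, so every type and field name below is hypothetical, and the lookup rule (the most specific level that defines a key wins) is an assumption about what "configured at all levels" means.

```scala
object ModelSketch {
  type Config = Map[String, String]

  // Hypothetical model types; none of these names come from dsh itself.
  final case class Instance(id: String, config: Config = Map.empty)
  final case class Cluster(name: String, instances: List[Instance], config: Config = Map.empty)
  final case class Assembly(name: String, config: Config = Map.empty)
  final case class Deployment(
      name: String,
      clusters: List[Cluster],
      assemblies: List[Assembly],
      config: Config = Map.empty
  )

  // Assumed precedence: instance overrides cluster, cluster overrides deployment.
  def resolve(d: Deployment, c: Cluster, i: Instance, key: String): Option[String] =
    i.config.get(key).orElse(c.config.get(key)).orElse(d.config.get(key))
}
```

With a deployment-wide default and a per-instance override for the same key, resolve returns the instance value for that instance and falls back to the cluster or deployment value for the others. Keeping the fallback in one function is what gives every tool built on the model the same, consistent view.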

daemon restart api:p:25,receiver:p:10

dsh
dsh
– Scala
– Model based
– Trivial to extend
– Specific to OUR needs
– Meaningful defaults
– Prevents mistakes

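// Select only the running instances in the "zk" cluster.
val filter = FilterBuilder.withCluster("zk").
  withOnlyRunningInstances.build()
val instances = deployment.connect.describeInstances(filter)
// Connect to each instance in parallel and restart the service over SSH.
instances.par.foreach { instance =>
  val ssh = instance.connectSSH
  ssh.execute("sudo service api restart")
}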
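
The snippet above depends on dsh's internal model, which the slides do not show. To make the pattern runnable in isolation, here is a self-contained sketch with in-memory stand-ins. Instance, Filter, Remote, DryRun and runOn are all hypothetical names invented for this example, and the "remote" only records the command instead of opening an SSH session. It also runs sequentially, because on Scala 2.13 and later `.par` requires the separate scala-parallel-collections module.

```scala
object DshExtensionSketch {
  // In-memory stand-ins; the real dsh model types are not shown in the slides.
  final case class Instance(id: String, cluster: String, running: Boolean)

  // A filter with optional criteria; unset criteria match everything.
  final case class Filter(cluster: Option[String] = None, onlyRunning: Boolean = false) {
    def matches(i: Instance): Boolean =
      cluster.forall(_ == i.cluster) && (!onlyRunning || i.running)
  }

  // Stand-in for an SSH session.
  trait Remote { def execute(instance: Instance, command: String): String }

  // Records what would be run instead of running it.
  object DryRun extends Remote {
    def execute(instance: Instance, command: String): String =
      s"${instance.id}: $command"
  }

  // Select instances with the filter, then run one command on each.
  def runOn(all: Seq[Instance], filter: Filter, command: String, remote: Remote): Seq[String] =
    all.filter(filter.matches).map(i => remote.execute(i, command))
}
```

Passing the Remote in as a parameter is a design choice for the sketch: a dry-run implementation makes the command safe to try out, which is in the spirit of the "Prevents mistakes" bullet.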

What would we do differently next time?
Make system upgrades less monolithic
Don’t ask UI developers to do operations
Clearer guidelines on managers & operations

Next Experiments
Divide up big rotation
Bring India development team into rotation
Switch from 24/7 shifts to 12/7
Deploy smaller parts of the system more often
Bring full-time operations people into the mix

Thank You!
Christian Beedgen
@raychaser
Stefan Zier
@stefanzier
We’re hiring!
go.sumologic.com/jobs
