Why do Development teams
seem to be able to say no
more than Ops teams?
Stuff Ops deals with that Dev traditionally doesn’t
• Live prod issues
• DDoS attacks
• 100% disk utilization
• access requests at 2:00 am
• opening up firewalls
Over the last 11 years, a lot of Dev teams
have been given permission to say,
“We’ve already done the planning for
this week. Your next opportunity to get
your request considered is next week.”
Saying “No” has been institutionalized by
This doesn’t work for Ops very well…
“Sorry - that live production fix will
have to wait until next week”
What if instead of using timeframes to
say “No”, we used limits to say “No”?
40 Ops Engineers (SysAdmin, DBA, Network, Mon, Sec)
Tasked to build out/retrofit 6 data-centers
across 6 different countries.
• keep the lights on four existing data centers
• build out a new platform architecture
• support live issues (on-call)
• roll out a new configuration management tool
• deploy a crap load of new features
• deal with 3 reorgs over a 6-month period
• Conflicting priorities
• risky dependencies
• interrupt-driven context switching
resulting in missed commitments.
• Confusion around the new org structure
#1 Can we keep up with the demand?
# days it took for ticket to go from created to closed
#2 Lead time – how long does it take
to get work done?
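
One way to measure lead time as described here (days from ticket created to ticket closed) is sketched below. The ticket records, field names, and dates are hypothetical and only illustrate the calculation; they are not data from the talk.

```python
from datetime import datetime
from statistics import median

# Hypothetical ticket records; ids, field names, and dates are assumptions.
tickets = [
    {"id": "OPS-1", "created": "2014-03-03", "closed": "2014-03-10"},
    {"id": "OPS-2", "created": "2014-03-04", "closed": "2014-03-06"},
    {"id": "OPS-3", "created": "2014-03-05", "closed": None},  # still open
]

def lead_time_days(ticket):
    """Days from created to closed, or None if the ticket is still open."""
    if ticket["closed"] is None:
        return None
    created = datetime.strptime(ticket["created"], "%Y-%m-%d")
    closed = datetime.strptime(ticket["closed"], "%Y-%m-%d")
    return (closed - created).days

# Only closed tickets have a lead time; open ones are skipped.
lead_times = [d for d in map(lead_time_days, tickets) if d is not None]
print("lead times (days):", lead_times)
print("median lead time:", median(lead_times))
```

The median is used here instead of the mean so that a few very old tickets do not dominate the number; that choice is an assumption, not something the slides specify.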
• access requests for systems, non-Zabbix monitor
• hardware investigation/verification/fixes
- vlan/port changes
- data retrieval (e.g. logs, network stats, etc.)
• configuration triage - firewalls, load balancers,
• (small) capacity expansion
• verification of configs/services across shards
• database development consultation
• security compliance mitigation
Live Ops tasks
• Socialized the WIP limit idea over 6 months and
gradually lowered it from 10 to 7 – out of 18
guys, average is 5-7.
• Hired 4 more people, although 2 got stolen by
• Closed out all tickets with no activity > 90 days
• Started saying “No” to last-minute requests.
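
A minimal sketch of two of the rules above: a WIP limit that decides whether new work can start, and a filter for tickets with no activity in more than 90 days. The slides do not say whether the limit applied per person or per team, so this sketch treats it as a single count; the function names, ticket fields, and sample data are hypothetical.

```python
from datetime import date, timedelta

WIP_LIMIT = 7          # the limit was gradually lowered from 10 to 7
STALE_AFTER_DAYS = 90  # tickets idle for more than 90 days were closed out

def can_start(in_progress_count, wip_limit=WIP_LIMIT):
    """Allow new work only while the in-progress count is under the limit."""
    return in_progress_count < wip_limit

def stale_tickets(tickets, today, max_idle_days=STALE_AFTER_DAYS):
    """Return ids of tickets whose last activity is more than max_idle_days ago."""
    cutoff = today - timedelta(days=max_idle_days)
    return [t["id"] for t in tickets if t["last_activity"] < cutoff]

# Hypothetical data for illustration only.
today = date(2014, 6, 1)
backlog = [
    {"id": "OPS-10", "last_activity": date(2014, 1, 15)},
    {"id": "OPS-11", "last_activity": date(2014, 5, 20)},
]

print(can_start(6))                   # under the limit: new work can start
print(can_start(7))                   # at the limit: the answer is "No"
print(stale_tickets(backlog, today))  # candidates to close out
```

The point of the limit is that "No" comes from a number everyone agreed on in advance rather than from a calendar boundary.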
Team SRE has a very large number of
changes scheduled for today already,
and an even larger number of requests
in our backlog that this request will
displace if moved to the front of the queue.
It would not be fair to other teams if we
jumped on this immediately while
planned work is pushed off.
Monitoring should be a requirement for a service to go live, not a last-minute addition.
For us to fully support a live service, please implement monitoring before going live.
For future requests, please give us as much notice as possible, and make sure to create
a ticket (xxx.com) so we can prioritize and schedule the changes as necessary. Here's
the ticket for this work….
• Took time during standups to focus on kaizen
• Reduced validate state from 7 to 5 to 3 days.
• Found creative way to deal with walkups, and
work done via personal relationships
• 15 min daily sync up at 3pm instead of
• 5 min videos to present Ops review data to
FIFA regulations ensure they have capacity.
“pg 8 e) ensuring the presence of a sufficient number of
ground staff and security stewards to guarantee safety.”
Let’s not expect day shift workers
to also cover the night shift.
Honey-do list rules for saying no to the spouse…