Devops at scale is a hard problem challenges, insights and lessons learned

DevOps at Scale is a Hard Problem:
Challenges, Insights and Lessons Learned
Kishore Jalleda
Sr. Director, Production Engineering
Yahoo!
Oath: A Verizon company

Agenda (why I am here)
• My vision (problem I’ve been trying to solve)
• Challenges faced, Insights and Lessons learned:
– Directed Alerting
– Culture of Automation
– Continuous Delivery (CD)
– AWS/Public Cloud
• Closing thoughts
– Dev vs Ops – where are we headed?

My Vision
• Velocity (ship fast; fail fast; learn fast).
Don’t build something no one wants.
• Democratize Innovation
• Create Intrapreneurs

Let me tell you a story
Ash, an engineer at Yahoo, wanted to build a stock
recommender prototype using machine learning on
portfolio data on the grid. He started on his own
personal time and got a prototype working.
When I interviewed him and asked what is stopping him
from testing his idea (and several others he has) in a
bucket quickly, he responded, “I wish there were
fewer constraints at Yahoo”.

Barriers
- Too much Legacy stuff; operational burden from it.
- Codebase complex; monolith
- New hardware takes 6+ weeks to arrive and set up
(great incentive to hold onto hardware, and order
more than you need)
- Operational burden; toil
- Paranoid (security) approvals
- ACLs approvals (Hbase, grid)
- Monitoring setup is laborious
- Have to do it in my spare time
- No sudo
- Need product approvals for everything
- Etc.

How can you help people like Ash?
Engineers who come to work every
morning wanting to change the
world

How can you get more engineers to think like Ash?
The greatest joy is in building things
(quickly); getting feedback (quickly); iterating
on it (quickly).

That is when it struck me:
“DevOps” is really about eliminating (most)
Technical, Process and Cultural barriers
between Idea and Execution -- using
Software.

Our journey had begun
• I joined forces with a handful of people
who also wanted to do some big things.
• Aspirational stuff. I know. But, hey,
nothing wrong with that, right?

“DevOps” to us is about:
• Enable a Culture of Ownership & Excellence
• Engineer Agile & Automated Processes
• Develop (Re)Usable & Self-Serve Tools
to kick ass at…
Delivery, Prevention, Repair

Security & Privacy
Don’t be foolish about these; security
and privacy of your users is non-
negotiable.

We started executing - soon we hit
roadblocks; multiple roadblocks.

Initiative #1: Directed Alerting
Before we talk about this initiative, let me
ask you a question.

Which one would you pick?
• Option 1: Alerts → Team A → Team B → Team C → Dev
• Option 2: Alerts → Dev

Initiative #1: Directed Alerting
“You wrote it; you own it”
“You wrote it; you run it”
It’s about getting feedback from production
quickly and directly to Dev Teams.

Turns out, it is a hard problem
• After all, convincing four different
teams/functions to change the way
they operate is non-trivial

So, how do you solve this problem?
Any guesses?

Directed Alerting – First communicate your vision
• Page Dev teams directly
• Get to <2 alerts/shift (so you can root-cause
each)
• All alerts are actionable; all alerts require
human intelligence

Directed Alerting - Leverage Outages
• Conduct Postmortems.
• Ask thought provoking and
uncomfortable questions.

Directed Alerting - Find your Allies
Who believe in your vision. You will
find them; you just have to go look.

Directed Alerting – Buy in from your team(s).
I got asked:
– “we have been told they are tier 1 & 2 for the
whole company”; “is that even possible? ”
– “Can we handle the alert volume?”
– “Can we still send low priority alerts to them?”

Directed Alerting – Buy in from Dev team(s)
I got asked:
– “Are you sure your team can handle it?”
– “We will have no one else to blame”
– “This is a big change; don’t f*** things up”.

Directed Alerting - Buy in from Senior Leaders
• Leverage important meetings to talk
about your vision (Tech Council, Arch
Review, etc.)

Directed Alerting – Buy in from other teams
There will be conflict. If you can stand
firm, learn to say "no", the outcomes
can be pretty awesome!
Also, it's not enough if you --as the
leader-- say "no"; you must empower
all your teams to say "no".

“The most important skill any leader
— any person, really — can learn is
how and when to say “no.””
Source: https://hbr.org/2017/06/help-your-team-stop-overcommitting-by-empowering-them-to-say-no

This wasn’t getting us far enough; we wanted
something more dramatic, something that showed
this was possible at scale.

Directed Alerting –“Daily Fantasy” to the rescue
• New product; less baggage; high profile;
awesome, modern leader
• Great opportunity to show something as
radical as this is actually possible.

Directed Alerting – Launch, Yay!
• “Daily Fantasy” launched in a “DevOps”
model.
• Showcased the team and win to the
whole company: blogs, all hands, etc.
• Rolled out to more teams

Directed Alerting - Results
[Chart: MTTD (minutes), showing a sharp drop]

Directed Alerting – Insights and Lessons learned
• There will be stragglers (some just
don’t get it).
• You will piss some people off; people
called me a “troublemaker” 

Directed Alerting - Reduce ALERT noise (Process/Culture)
• Daily incident/alert reviews
• Weekly KPI meetings
• Public shaming 
• Peer Pressure 
• Budgets /incentives
• Ownership
• Tickets vs Alerts
• Dev On Call

Directed Alerting - Reduce ALERT noise (Tools)
• Alert aggregation and grouping
• Auto Remediation
• Logs vs Tickets vs Alerts
• Symptoms vs RC monitoring
• Avoid tight coupling with
abstractions; fix alerts closer to the
source
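To make "alert aggregation and grouping" concrete, here is a minimal sketch. It is an illustration only, not the tooling built at Yahoo: the `Alert` fields, the grouping key (service plus symptom), and the five-minute window are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str    # hypothetical field: owning service
    symptom: str    # hypothetical field, e.g. "high_latency"
    host: str
    timestamp: int  # seconds since epoch


def group_alerts(alerts, window_seconds=300):
    """Collapse alerts that share (service, symptom) within a time window into one page."""
    buckets = defaultdict(list)  # key -> list of groups, each group a list of alerts
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        groups = buckets[(alert.service, alert.symptom)]
        # Extend the latest group if it started within the window; otherwise open a new one.
        if groups and alert.timestamp - groups[-1][0].timestamp <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])

    pages = []
    for (service, symptom), groups in buckets.items():
        for group in groups:
            pages.append({
                "service": service,
                "symptom": symptom,
                "hosts": sorted({a.host for a in group}),
                "count": len(group),
            })
    return pages
```

With this shape, fifty hosts reporting the same symptom for the same service inside the window produce one page for the owning Dev team rather than fifty.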

Initiative #2: Continuous Delivery (CD)
Before we talk about this initiative, let
me ask you a question.

Push/Deploy to Production
Which option would you pick?
Option 1: “No Humans Allowed”
Option 2: “Humans Allowed”

We picked option #1 at Yahoo.
Proved that “no-touch deploys”
to production are possible at scale
(1+ Billion users).

CD – As you (may) know
• Doesn’t happen overnight; with enough
iterations, you get to CD.
• Heart of CD is the certification plan (CD gates;
PR builds; etc).
• TDD is a culture that must be embraced.
• Shipping in small batches inherently reduces
risk; improves velocity and productivity.
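A certification plan can be pictured as an ordered list of gates that a build must clear before it is pushed without human involvement. The sketch below is hypothetical: the gate names, the metric keys, and the 70% coverage floor are assumptions for illustration, not Yahoo's actual CD gates.

```python
from typing import Callable, Dict, List, Tuple

Gate = Tuple[str, Callable[[Dict[str, float]], bool]]

# Hypothetical gates; names and thresholds are illustrative only.
CERTIFICATION_PLAN: List[Gate] = [
    ("unit_tests_pass", lambda m: m["unit_test_failures"] == 0),
    ("integration_tests_pass", lambda m: m["integration_test_failures"] == 0),
    ("coverage_floor", lambda m: m["coverage_pct"] >= 70.0),
]


def certify(build_metrics, plan=CERTIFICATION_PLAN):
    """Return (ok, failed_gate_names); the build ships only if every gate passes."""
    failed = [name for name, check in plan if not check(build_metrics)]
    return (not failed, failed)
```

Returning the names of the failed gates, rather than a bare pass/fail, gives the owning team direct feedback on what blocked the push.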

CD – major push in 2013/2014
• Corp initiative/mandate; non-
negotiable. Goal was to change culture.
Tech Excellence.
• Marissa wanted Yahoo to be on CD. Top
down initiatives are a great way to get
traction.

But, CD at scale is hard
• Was scary for some teams.
• But, it’s the right thing to do; it’s how
modern software is built.

CD (at scale) – expect failures early on
• Took time for teams to take CD seriously
• Took time for teams to embrace TDD.
• Took time to do CD right.

CD – “Warp Drive” to the rescue
• A great program at Yahoo. An effective
way to bring about transformational
change at scale.
• Corp initiatives alone cannot make a
culture stick.

But, there will (always) be stragglers
• Things and conversations will get ugly.
Don’t give up; persist. Show why CD
matters.

CD - results
• Velocity increased dramatically;
developers more productive
• Hundreds of man hours saved from
manual deploys.
• Teams have automated pushes to prod
daily. Yes, it is true! CD is possible at
scale.

CD – Insights and Lessons Learned
• You will fail more often than you succeed.
• Every team may not embrace CD’s/TDD’s spirit
• Training (TDD; CD) will never be enough.
• Incentivize teams to move to CD.
• Have CD advocates
• OK to push w/o 100% test coverage.

Initiative #3: Automation Culture
Same drill. Before we talk about this
initiative, let me ask you a question 

A Server/VM/Container is in a bad state
Which option would you pick?
Option 1: Wake up a human at 3 am; have
him/her take that resource OOR manually.
Option 2: Automatically take it OOR, (spin up a
new one automatically, run some diagnostics,
create a ticket, and assign to a human).

But wake someone up when
let’s say, 15% of the cluster is in a bad
state.
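Option 2, together with the "wake someone up at 15%" rule, can be sketched as a small planning function. This is a sketch under stated assumptions: the action names are hypothetical placeholders for real tooling, and the choice to stop per-resource automation once the threshold is crossed is a design decision made for this example, not something the slides specify.

```python
def plan_remediation(cluster_health, page_threshold=0.15):
    """Return the ordered list of actions for one health sweep.

    cluster_health maps a resource name to True (healthy) or False (bad).
    Action names are hypothetical placeholders for real tooling.
    """
    if not cluster_health:
        return []
    bad = sorted(name for name, healthy in cluster_health.items() if not healthy)
    bad_fraction = len(bad) / len(cluster_health)

    # Too much of the cluster is unhealthy: stop automating and wake a human.
    if bad_fraction >= page_threshold:
        return [("page_oncall", f"{len(bad)}/{len(cluster_health)} resources unhealthy")]

    actions = []
    for name in bad:
        actions.append(("take_out_of_rotation", name))
        actions.append(("launch_replacement", name))
        actions.append(("run_diagnostics", name))
        actions.append(("open_ticket", name))
    return actions
```

A single bad resource in a ten-node cluster is handled entirely by automation and ends in a ticket; two bad resources (20%) cross the threshold and produce a page instead.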

Initiative #3: Automation Culture; eliminate “toil”
“Let the machines do the heavy lifting; I
have better things to do”

Automation Culture – Challenges
People may ask “What about our
job security?”
After all, I was proposing that we take
some responsibilities away.

Automation Culture – Challenges
Making a strategic investment
means making some trade-offs.

Automation culture – My promise to my
team
Higher-value-add work: writing software;
infra; tooling; etc.
Trust me, “you can’t possibly automate
yourselves out of your job.”

“The misuse of talent in large organizations
is rampant today”
“Without the ability to say “no” to low-
level tasks in order to say “yes” to
groundbreaking ones, people stop
innovating”.
Source: https://hbr.org/2017/06/help-your-team-stop-overcommitting-by-empowering-them-to-say-no

Automation culture – Tools built
• Auto Remediation (or auto fixes)
• Failure Discovery / Disruptive Testing
• Metrics based promotions
• Load testing Frameworks
• Product Health Visualization Dashboards
• Etc.

Automation Culture - Results
• 100s of auto remediations/hour in prod
• Hundreds of man hours saved
• Dramatic reduction in (repeat) incidents
• New bugs exposed through monitoring the
auto remediation.
• New bugs found in App through failure
discovery.

Automation Culture – Insights and Lessons
learned
• Tools that only Ops can use are not
really tools.
• Simply building a tool doesn’t mean Devs
will use it.

Initiative #4: Compute Platform
(AWS/Public Cloud)
One last question before we talk about this
initiative. Ready?
* - this is not an AWS endorsement. AWS did not pay me for this.

Where will you launch a new product (at
scale)?
Option 1: Public Cloud
Option 2: Data Center
Option 3: Hybrid

Well, obviously, there is no right or wrong
answer here. In our case, hybrid seemed to
make a lot of sense. We chose that option #3.

Why a strategic bet on AWS (or a public
cloud) for a 22-year-old company like
Yahoo?

*Billable* - Compute platform
• If you don’t bill teams for using compute,
they will misuse it; there is no incentive to
give hardware back when new hardware takes
so long to arrive.

*Self-Serve* - Compute platform
Imagine having to talk to your ops
team to provision compute.

*On-demand* - Compute platform
I will use it when I need it

*Scalable* - Compute platform
I don’t want to worry about running
out of capacity.

*Secure* - Compute platform
Don’t have to worry too much about
security at the Infra/OS level.

AWS adoption at scale – Challenges
• Who pays the initial costs?
• Killer use cases?
• My app is working just fine; why should I
move to the cloud?

AWS adoption (at scale) – All about use cases.
• Failsafe/Fallback
• Load Testing
• Non-prod / Test frameworks
• Rapid Experimentation
• Launching New, new products (not many
dependencies on existing/legacy stacks)

AWS - Results
• Rapid Experimentation (many new products
prototyped in AWS)
• New, new Y! products are launched on AWS by
default (Kabana, Livetext, View, etc.)
• Failsafe/Fallback served from AWS; if all of
Yahoo’s data centers went down, we could still
serve (stale) content to our users
• More to possibly come in the future. Stay
tuned!
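The failsafe/fallback idea in the results above (serve stale content when the origin is unreachable) can be illustrated with a small in-process sketch. It only shows the pattern; the `FailsafeCache` class and its `fetch` signature are hypothetical and say nothing about how the AWS-hosted fallback was actually built.

```python
import time


class FailsafeCache:
    """Remember the last good response per key and serve it if the origin fails."""

    def __init__(self):
        self._store = {}  # key -> (body, stored_at)

    def fetch(self, key, origin_fetch):
        try:
            body = origin_fetch(key)
        except Exception:
            # Origin is down: fall back to the last good copy, if there is one.
            if key in self._store:
                body, stored_at = self._store[key]
                return {"body": body, "stale": True,
                        "age_seconds": time.time() - stored_at}
            raise
        self._store[key] = (body, time.time())
        return {"body": body, "stale": False, "age_seconds": 0.0}
```

Marking the response as stale and reporting its age lets the caller decide whether old content is still acceptable for a given page.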

AWS – Insights and Lessons learned
Break the rules; “but, break them in broad
daylight”.
Hard to make long-term, strategic bets;
easy to deprioritize.

AWS – Insights and Lessons learned
• Get “consolidated billing” early on.
• Make it a corp initiative
• New apps should by default be built with
the cloud in mind.

“DevOps” at Scale - Summary

“DevOps” Transformation – Insights and Lessons
Learned
• Incentivize teams to automate; reward
good behaviors
• Learn to say “No” more; learn to say
“yes” less.
• Write down your thoughts; you can reach
a lot more people.

“DevOps” Transformation – Insights and Lessons
Learned
• Not everyone will know what “DevOps”
is about; they will interpret it as they see
fit.
• Reliability is overrated; no one needs five
nines availability; it’s OK to go down (not
the end of the world).
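For a sense of what "five nines" means in practice, the downtime budget is simple arithmetic. The sketch below assumes a 365-day year; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(availability_pct):
    """Minutes of allowed downtime per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
```

Five nines leaves roughly 5.3 minutes of downtime per year, while three nines leaves about 8.8 hours, which is the gap behind the question of whether a product really needs the former.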

“DevOps” Transformation – Insights and Lessons
Learned
• Pick your battles; you cannot win them
all – and it’s OK.
• Invest in Dev training and tooling; often
an underinvested area.

Dev & Ops: do you have it backwards?
• Pushing code you do not own?
• Responding to alerts for products you don’t
own?
• Testing & debugging code you don’t own?
• Writing tests for code you do not own?
• Etc.
Ask yourself, is that the right thing to do?

A better model (call it “DevOps” or whatever)
Core Dev Teams own
build, test, deploy,
monitor, on call,
debugging, incident
response, capacity,
Postmortems, etc.
Non-core Dev & Ops
Teams own
infrastructure,
automation, tooling,
network, Developer
productivity, expert
services,
observability, etc

Do yourself a favor
• Read this awesome post by an SRE at
Google, JBD, @rakyll (spoiler alert: SRE
support is optional at Google)
*https://medium.com/@rakyll/the-sre-model-6e19376ef986
• Check out “Ten Persistent SRE Antipatterns:
Pitfalls on the Road to a Successful SRE
Program Like Netflix and Google” (spoiler
alert: “NOC it off”)
*https://www.usenix.org/conference/srecon17americas/program/presentation/horowitz

Reflect; soul search; ask tough questions
• How is my team providing value?
• Why does my team exist?
• Am I adding unnecessary abstractions?

Do the right thing
- Enable a Culture of Ownership.
- Engineer Automated & Agile Processes
(Iterative & Experimental)
- Develop Self-Serve & (Re)usable Tools.
Yes, there will be exceptions, but only a handful. Cash cows, for example.

Are you ready to say “No”?
Thank you!
(Questions?)
@KishoreJalleda or on LinkedIn.
(would appreciate/love feedback on my talk)
