DevOps at Scale is a Hard Problem:
Challenges, Insights and Lessons Learned
Kishore Jalleda
Sr. Director, Production Engineering
Yahoo!
Oath: A Verizon company
Agenda (why I am here)
• My vision (problem I’ve been trying to solve)
• Challenges faced, Insights and Lessons learned:
– Directed Alerting
– Culture of Automation
– Continuous Delivery (CD)
– AWS/Public Cloud
• Closing thoughts
– Dev vs Ops – where are we headed?
My Vision
• Velocity (ship fast; fail fast; learn fast).
Don’t build something no one wants.
• Democratize Innovation
• Create Intrapreneurs
Let me tell you a story
Ash, an engineer at Yahoo, wanted to build a stock
recommender prototype using portfolio data on the grid
using machine learning. He started on this in his own
personal time and got a prototype working.
When I interviewed him and asked what was stopping him
from testing his idea (and several others he has) in a
bucket quickly, he responded, “I wish there were
fewer constraints at Yahoo”.
Barriers
- Too much Legacy stuff; operational burden from it.
- Codebase complex; monolith
- New hardware takes 6+ weeks to arrive and set up
(great incentive to hold onto hardware, and order
more than you need)
- Operational burden; toil
- Paranoid (security) approvals
- ACLs approvals (Hbase, grid)
- Monitoring setup is laborious
- Have to do it in my spare time
- No sudo
- Need product approvals for everything
- Etc.
How can you help people like Ash?
Engineers who come to work every
morning wanting to change the
world
How can you get more engineers to think like Ash?
The greatest joy is in building things
(quickly); getting feedback (quickly); iterating
on it (quickly).
That is when it struck me:
“DevOps” is really about eliminating (most)
Technical, Process and Cultural barriers
between Idea and Execution -- using
Software.
Our journey had begun
• I joined forces with a handful of people
who also wanted to do some big things.
• Aspirational stuff. I know. But, hey,
nothing wrong with that, right?
“DevOps” to us is about:
• Enable a Culture of Ownership & Excellence
• Engineer Agile & Automated Processes
• Develop (Re)Usable & Self-Serve Tools
to kick ass at…
Delivery, Prevention, Repair
Security & Privacy
Don’t be foolish about these; security
and privacy of your users is non-
negotiable.
We started executing - soon we hit
roadblocks; multiple roadblocks.
Initiative #1: Directed Alerting
Before we talk about this initiative, let me
ask you a question.
Which one would you pick?
• Option 1: Alerts -> Team A -> Team B -> Team C -> Dev
• Option 2: Alerts -> Dev
We picked option #2
Initiative #1: Directed Alerting
“You wrote it; you own it”
“You wrote it; you run it”
it’s about getting feedback from production
quickly and directly to Dev Teams.
Turns out, it is a hard problem
• After all, convincing four different
teams/functions to change the way
they operate is non-trivial
So, how do you solve this problem?
Any guesses?
Directed Alerting – First communicate your vision
• Page Dev teams directly
• Get to <2 alerts/shift (so you can root-cause
each)
• All alerts are actionable; all alerts require
human intelligence
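A minimal sketch of the three goals above, in Python. The ownership map, the service and team names, and the `Alert` fields are all hypothetical; the deck does not describe the actual alerting stack, so treat this as an illustration only.

```python
from dataclasses import dataclass

# Hypothetical ownership map: service name -> Dev team that is on call.
OWNERS = {"finance-api": "finance-dev", "fantasy-web": "fantasy-dev"}

MAX_PAGES_PER_SHIFT = 2  # the "<2 alerts/shift" goal from the slide


@dataclass
class Alert:
    service: str
    summary: str
    actionable: bool  # does a human need to act on this right now?


def route(alert: Alert) -> tuple[str, str]:
    """Return (channel, team): page the owning Dev team directly,
    or file a ticket when no immediate human action is needed."""
    team = OWNERS.get(alert.service, "unowned")
    channel = "page" if alert.actionable and team != "unowned" else "ticket"
    return channel, team


def over_budget(pages_this_shift: int) -> bool:
    """True once a shift has reached the page budget and needs a review."""
    return pages_this_shift >= MAX_PAGES_PER_SHIFT
```

Non-actionable alerts become tickets instead of pages, and any shift that reaches the budget is a candidate for the alert reviews discussed later in the deck.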
Directed Alerting - Leverage Outages
• Conduct Postmortems.
• Ask thought provoking and
uncomfortable questions.
Directed Alerting - Find your Allies
Who believe in your vision. You will
find them; you just have to go look.
Directed Alerting – Buy in from your team(s).
I got asked:
– “We have been told they are tier 1 & 2 for the
whole company”; “Is that even possible?”
– “Can we handle the alert volume?”
– “Can we still send low priority alerts to them?”
Directed Alerting – Buy in from Dev team(s)
I got asked:
– “Are you sure your team can handle it?”
– “We will have no one else to blame”
– “This is a big change; don’t f*** things up”.
Directed Alerting - Buy in from Senior Leaders
• Leverage important meetings to talk
about your vision (Tech Council, Arch
Review, etc.)
Directed Alerting – Buy in from other teams
There will be conflict. If you can stand
firm, learn to say "no", the outcomes
can be pretty awesome!
Also, it's not enough if you --as the
leader-- say "no"; you must empower
all your teams to say "no".
“The most important skill any leader
— any person, really — can learn is
how and when to say “no.””
Source: https://hbr.org/2017/06/help-your-team-stop-overcommitting-by-empowering-them-to-say-no
We weren’t getting very far; we wanted something
more dramatic; something that showed this
was possible at scale.
Directed Alerting – “Daily Fantasy” to the rescue
• New product; less baggage; high profile;
awesome, modern leader
• Great opportunity to show something as
radical as this is actually possible.
Directed Alerting – Launch, Yay!
• “Daily Fantasy” launched in a “DevOps”
model.
• Showcased the team and win to the
whole company: blogs, all hands, etc.
• Rolled out to more teams
Directed Alerting - Results
MTTD
(minutes)
Sharp drop
Directed Alerting – Insights and Lessons learned
• There will be stragglers (some just
don’t get it).
• You will piss some people off; people
called me a “troublemaker” :)
Directed Alerting - Reduce ALERT noise (Process/Culture)
• Daily incident/alert reviews
• Weekly KPI meetings
• Public shaming :)
• Peer Pressure :)
• Budgets /incentives
• Ownership
• Tickets vs Alerts
• Dev On Call
Directed Alerting - Reduce ALERT noise (Tools)
• Alert aggregation and grouping
• Auto Remediation
• Logs vs Tickets vs Alerts
• Symptoms vs RC monitoring
• Avoid tight coupling with
abstractions; fix alerts closer to the
source
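As an illustration of the first bullet (alert aggregation and grouping), here is a minimal Python sketch. The five-minute window and the `(timestamp, service, symptom)` tuple shape are assumptions made for the example, not details from the talk.

```python
WINDOW_SECONDS = 300  # assumed grouping window


def group_alerts(alerts):
    """Collapse repeats of the same (service, symptom) that arrive within
    WINDOW_SECONDS of the group's first alert into one notification."""
    groups = []      # one dict per notification that would be sent
    open_group = {}  # (service, symptom) -> index of its latest group
    for ts, service, symptom in sorted(alerts):
        key = (service, symptom)
        idx = open_group.get(key)
        if idx is not None and ts - groups[idx]["first_seen"] <= WINDOW_SECONDS:
            groups[idx]["count"] += 1  # same problem, same window: no new page
        else:
            open_group[key] = len(groups)
            groups.append({"service": service, "symptom": symptom,
                           "first_seen": ts, "count": 1})
    return groups
```

Four raw alerts such as two `web`/`5xx` alerts a minute apart, one `db`/`latency` alert, and a later `web`/`5xx` alert outside the window would produce three notifications instead of four.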
Initiative #2: Continuous Delivery (CD)
Before we talk about this initiative, let
me ask you a question.
Push/Deploy to Production
Which option would you pick?
Option 1: “No Humans Allowed”
Option 2: “Humans Allowed”
We picked option #1 at Yahoo.
We proved that “no-touch deploys”
to production are possible at scale
(1+ Billion users).
CD – As you (may) know
• Doesn’t happen overnight; with enough
iterations, you get to CD.
• Heart of CD is the certification plan (CD gates;
PR builds; etc).
• TDD is a culture that must be embraced.
• Shipping in small batches inherently reduces
risk; improves velocity and productivity.
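The certification plan can be pictured as an ordered list of gates that every build must clear before it reaches production. The sketch below is a rough Python illustration; the gate names, the build fields, and the 1% canary threshold are hypothetical and are not taken from the talk.

```python
from typing import Callable

# Each gate is a (name, check) pair; check returns True when the build passes.
Gate = tuple[str, Callable[[dict], bool]]

# Hypothetical gates and thresholds, for illustration only.
GATES: list[Gate] = [
    ("unit tests", lambda b: b["unit_failures"] == 0),
    ("integration tests", lambda b: b["integration_failures"] == 0),
    ("canary error rate", lambda b: b["canary_error_rate"] < 0.01),
]


def certify(build: dict) -> tuple[bool, str]:
    """Run the gates in order and stop at the first failure, so nothing
    downstream (including the production push) runs on a bad build."""
    for name, check in GATES:
        if not check(build):
            return False, name
    return True, "all gates passed"
```

Because no human sits between the last gate and production, the quality of these checks is what the push depends on, which is one reason small batches help: each build changes less, so each gate has less to catch.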
CD – major push in 2013/2014
• Corp initiative/mandate; non-
negotiable. Goal was to change culture.
Tech Excellence.
• Marissa wanted Yahoo to be on CD. Top
down initiatives are a great way to get
traction.
But, CD at scale is hard
• Was scary for some teams.
• But, it’s the right thing to do; it’s how
modern software is built.
CD (at scale) – expect failures early on
• Took time for teams to take CD seriously
• Took time for teams to embrace TDD.
• Took time to do CD right.
CD – “Warp Drive” to the rescue
• A great program at Yahoo. An effective
way to bring about transformational
change at scale.
• Corp initiatives alone cannot make a
culture stick.
But, there will (always) be stragglers
• Things and conversations will get ugly.
Don’t give up; persist. Show why CD
matters.
CD - results
• Velocity increased dramatically;
developers more productive
• Hundreds of man hours saved from
manual deploys.
• Teams have automated pushes to prod
daily. Yes, it is true! CD is possible at
scale.
CD – Insights and Lessons Learned
• You will fail more often than you succeed.
• Not every team will embrace the spirit of CD/TDD
• Training (TDD; CD) will never be enough.
• Incentivize teams to move to CD.
• Have CD advocates
• OK to push w/o 100% test coverage.
Initiative #3: Automation Culture
Same drill. Before we talk about this
initiative, let me ask you a question :)
A Server/VM/Container is in a bad state
Which option would you pick?
Option 1: Wake up a human at 3 am; have
him/her take that resource out of rotation (OOR) manually.
Option 2: Automatically take it OOR, (spin up a
new one automatically, run some diagnostics,
create a ticket, and assign to a human).
But wake someone up when,
let’s say, 15% of the cluster is in a bad
state.
We picked option #2.
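A minimal Python sketch of the decision in option #2. The host names and the returned action names are hypothetical; only the 15% figure comes from the slide, and the real remediation steps (spinning up a replacement, running diagnostics) are left out.

```python
PAGE_THRESHOLD = 0.15  # the "15% of the cluster" figure from the slide


def plan_actions(health: dict[str, bool]) -> dict:
    """Given host -> healthy?, decide what automation handles on its own
    and whether a human should be woken up."""
    bad = sorted(h for h, ok in health.items() if not ok)
    bad_fraction = len(bad) / len(health) if health else 0.0
    return {
        # Each bad host is taken out of rotation and gets a ticket.
        "take_oor": bad,
        "tickets": [f"diagnose {h}" for h in bad],
        # Page only when the problem is wider than a few hosts.
        "page_human": bad_fraction >= PAGE_THRESHOLD,
    }
```

With 20 hosts, one bad host is handled silently and ticketed for the morning; four bad hosts (20%) also page a human, on the assumption that a failure this wide is no longer a single-machine problem.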
Initiative #3: Automation Culture; eliminate “toil”
“Let the machines do the heavy lifting; I
have better things to do”
Automation Culture – Challenges
People may ask “What about our
job security?”
After all, I was proposing that we take
some responsibilities away.
Automation Culture – Challenges
Making a strategic investment
means making some trade-offs.
Automation culture – My promise to my
team
Higher-value-add work: writing software;
infra; tooling; etc.
Trust me, “you can’t possibly automate
yourselves out of your job.”
“The misuse of talent in large organizations
is rampant today”
“Without the ability to say “no” to low-
level tasks in order to say “yes” to
groundbreaking ones, people stop
innovating”.
Source: https://hbr.org/2017/06/help-your-team-stop-overcommitting-by-empowering-them-to-say-no
Automation culture – Tools built
• Auto Remediation (or auto fixes)
• Failure Discovery / Disruptive Testing
• Metrics based promotions
• Load testing Frameworks
• Product Health Visualization Dashboards
• Etc.
Automation Culture - Results
• 100s of auto remediations/hour in prod
• Hundreds of man hours saved
• Dramatic reduction in (repeat) incidents
• New bugs exposed through monitoring the
auto remediation.
• New bugs found in App through failure
discovery.
Automation Culture – Insights and Lessons
learned
• Tools that only Ops can use are not
really tools.
• Simply building a tool doesn’t mean Devs
will use it.
Initiative #4: Compute Platform
(AWS/Public Cloud)
One last question before we talk about this
initiative. Ready?
* - this is not an AWS endorsement. AWS did not pay me for this.
Where will you launch a new product (at
scale)?
Option 1: Public Cloud
Option 2: Data Center
Option 3: Hybrid
Well, obviously, there is no right or wrong
answer here. In our case, hybrid seemed to
make a lot of sense. We chose option #3.
Why a strategic bet on AWS (or a public
cloud) for a 22-year old company like
Yahoo?
*Billable* - Compute platform
• If you don’t bill teams for using compute,
they will misuse it; there is no incentive to give
hardware back when new hardware takes so long to arrive.
*Self-Serve* - Compute platform
Imagine having to talk to your ops
team to provision compute.
*On-demand* - Compute platform
I will use it when I need it
*Scalable* - Compute platform
I don’t want to worry about running
out of capacity.
*Secure* - Compute platform
Don’t have to worry too much about
security at the Infra/OS level.
AWS adoption at scale – Challenges
• Who pays the initial costs?
• Killer use cases?
• My app is working just fine; why should I
move to the cloud?
AWS adoption (at scale) – All about use cases.
• Failsafe/Fallback
• Load Testing
• Non-prod / Test frameworks
• Rapid Experimentation
• Launching new, new products (few
dependencies on existing/legacy stacks)
AWS - Results
• Rapid Experimentation (many new products
prototyped in AWS)
• New, new Y! products are launched on AWS by
default (Kabana, Livetext, View, etc.)
• Failsafe/Fallback served from AWS; if all of
Yahoo’s data centers went down, we could still
serve (stale) content to our users
• More to possibly come in the future. Stay
tuned!
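The failsafe/fallback result can be sketched as "try the primary, otherwise serve the last known copy". This Python sketch is an assumption about the general pattern, not a description of Yahoo's setup; `primary` and `failsafe` are hypothetical stand-ins for the data-center origin and a cloud-hosted store of previously rendered pages.

```python
def fetch_with_failsafe(path, primary, failsafe):
    """Try the primary origin first; on any failure, serve the last
    known (possibly stale) copy from the failsafe store."""
    try:
        return primary(path), "fresh"
    except Exception:
        if path in failsafe:
            return failsafe[path], "stale"
        raise  # nothing cached for this path, so surface the failure
```

The second element of the result tells the caller whether users are seeing fresh or stale content, which is worth tracking as a metric in its own right.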
AWS – Insights and Lessons learned
Break the rules; “but, break them in broad
daylight”.
Hard to make long-term, strategic bets;
easy to deprioritize.
AWS – Insights and Lessons learned
• Get “consolidated billing” early on.
• Make it a corp initiative
• New apps should by default be built with
the cloud in mind.
“DevOps” at Scale - Summary
“DevOps” Transformation – Insights and Lessons
Learned
• Incentivize teams to automate; reward
good behaviors
• Learn to say “No” more; learn to say
“yes” less.
• Write down your thoughts; you can reach
a lot more people.
“DevOps” Transformation – Insights and Lessons
Learned
• Not everyone will know what “DevOps”
is about; they will interpret it as they see
fit.
• Reliability is overrated; no one needs five
nines availability; it’s OK to go down (not
the end of the world).
“DevOps” Transformation – Insights and Lessons
Learned
• Pick your battles; you cannot win them
all – and it’s OK.
• Invest in Dev training and tooling; often
an underinvested area.
Closing thoughts
Dev & Ops: do you have it backwards?
• Pushing code you do not own?
• Responding to alerts for products you don’t
own?
• Testing & debugging code you don’t own?
• Writing tests for code you do not own?
• Etc.
Ask yourself, is that the right thing to do?
A better model (call it “DevOps” or whatever)
Core Dev Teams own
build, test, deploy,
monitor, on call,
debugging, incident
response, capacity,
Postmortems, etc.
Non-core Dev & Ops
Teams own
infrastructure,
automation, tooling,
network, Developer
productivity, expert
services,
observability, etc
Do yourself a favor
• Read this awesome post by an SRE at
Google, JBD, @rakyll (spoiler alert: SRE
support is optional at Google)
*https://medium.com/@rakyll/the-sre-model-6e19376ef986
• Check out “Ten Persistent SRE Antipatterns:
Pitfalls on the Road to a Successful SRE
Program Like Netflix and Google” (spoiler
alert: “NOC it off”)
*https://www.usenix.org/conference/srecon17americas/program/presentation/horowitz
Reflect; soul search; ask tough questions
• How is my team providing value?
• Why does my team exist?
• Am I adding unnecessary abstractions?
Do the right thing
- Enable a Culture of Ownership.
- Engineer Automated & Agile Processes
(Iterative & Experimental)
- Develop Self-Serve & (Re)usable Tools.
Yes, there will be exceptions, but only a handful. Cash cows, for example.
Are you ready to say “No”?
Thank you!
(Questions?)
@KishoreJalleda or on LinkedIn.
(would appreciate/love feedback on my talk)

More Related Content

What's hot

Thierry de Pauw - Feature Branching considered Evil - Codemotion Milan 2018
Thierry de Pauw - Feature Branching considered Evil - Codemotion Milan 2018Thierry de Pauw - Feature Branching considered Evil - Codemotion Milan 2018
Thierry de Pauw - Feature Branching considered Evil - Codemotion Milan 2018Codemotion
 
Mary Poppendieck: The Aware Organization - Lean IT Summit 2014
Mary Poppendieck: The Aware Organization - Lean IT Summit 2014Mary Poppendieck: The Aware Organization - Lean IT Summit 2014
Mary Poppendieck: The Aware Organization - Lean IT Summit 2014Institut Lean France
 
Bimodal IT: Shortcut to Innovation or Path to Dysfunction?
Bimodal IT: Shortcut to Innovation or Path to Dysfunction?Bimodal IT: Shortcut to Innovation or Path to Dysfunction?
Bimodal IT: Shortcut to Innovation or Path to Dysfunction?dev2ops
 
Can We Do Agile? Barriers to Agile Adoption
Can We Do Agile? Barriers to Agile AdoptionCan We Do Agile? Barriers to Agile Adoption
Can We Do Agile? Barriers to Agile AdoptionTechWell
 
Continuous Delivery (The newest)
Continuous Delivery (The newest)Continuous Delivery (The newest)
Continuous Delivery (The newest)Eduards Sizovs
 
Software Craftsmanship Essentials
Software Craftsmanship EssentialsSoftware Craftsmanship Essentials
Software Craftsmanship EssentialsEduards Sizovs
 
8 Things That Make Continuous Delivery Go Nuts
8 Things That Make Continuous Delivery Go Nuts8 Things That Make Continuous Delivery Go Nuts
8 Things That Make Continuous Delivery Go NutsEduards Sizovs
 
Agile 2008 Retrospective
Agile 2008 RetrospectiveAgile 2008 Retrospective
Agile 2008 RetrospectiveCraig Smith
 
Left Hackathon 4.0
Left Hackathon 4.0Left Hackathon 4.0
Left Hackathon 4.0John Lyotier
 
40 Agile Methods in 40 Minutes
40 Agile Methods in 40 Minutes40 Agile Methods in 40 Minutes
40 Agile Methods in 40 MinutesCraig Smith
 
Lean Kanban India 2015 | Kanban - Myths or Facts | Mahesh Vardhrajan
Lean Kanban India 2015 | Kanban - Myths or Facts | Mahesh VardhrajanLean Kanban India 2015 | Kanban - Myths or Facts | Mahesh Vardhrajan
Lean Kanban India 2015 | Kanban - Myths or Facts | Mahesh VardhrajanLeanKanbanIndia
 
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...Jon Stevens-Hall
 
"Creating a testing culture" by Mark Striebeck
"Creating a testing culture" by Mark Striebeck"Creating a testing culture" by Mark Striebeck
"Creating a testing culture" by Mark StriebeckOperae Partners
 
Leadership Without Management: Scaling Organizations by Scaling Engineers
Leadership Without Management: Scaling Organizations by Scaling EngineersLeadership Without Management: Scaling Organizations by Scaling Engineers
Leadership Without Management: Scaling Organizations by Scaling Engineersbcantrill
 
The promise and peril of Agile and Lean practices
The promise and peril of Agile and Lean practicesThe promise and peril of Agile and Lean practices
The promise and peril of Agile and Lean practicesmtoppa
 
Scaling Teams, Processes and Architectures
Scaling Teams, Processes and ArchitecturesScaling Teams, Processes and Architectures
Scaling Teams, Processes and ArchitecturesLorenzo Alberton
 
Philly ETE - Are Your Developers Bull$h!tt!ng You? And why that's the wrong q...
Philly ETE - Are Your Developers Bull$h!tt!ng You? And why that's the wrong q...Philly ETE - Are Your Developers Bull$h!tt!ng You? And why that's the wrong q...
Philly ETE - Are Your Developers Bull$h!tt!ng You? And why that's the wrong q...Bonnie Aumann
 
Why usability problems go unfixed - UX Bristol 2012
Why usability problems go unfixed - UX Bristol 2012Why usability problems go unfixed - UX Bristol 2012
Why usability problems go unfixed - UX Bristol 2012Francis Rowland
 
Bowman walter
Bowman walterBowman walter
Bowman walterNASAPMC
 
Real World Lessons Using Lean UX (Workshop)
Real World Lessons Using Lean UX (Workshop)Real World Lessons Using Lean UX (Workshop)
Real World Lessons Using Lean UX (Workshop)Bill Scott
 

What's hot (20)

Thierry de Pauw - Feature Branching considered Evil - Codemotion Milan 2018
Thierry de Pauw - Feature Branching considered Evil - Codemotion Milan 2018Thierry de Pauw - Feature Branching considered Evil - Codemotion Milan 2018
Thierry de Pauw - Feature Branching considered Evil - Codemotion Milan 2018
 
Mary Poppendieck: The Aware Organization - Lean IT Summit 2014
Mary Poppendieck: The Aware Organization - Lean IT Summit 2014Mary Poppendieck: The Aware Organization - Lean IT Summit 2014
Mary Poppendieck: The Aware Organization - Lean IT Summit 2014
 
Bimodal IT: Shortcut to Innovation or Path to Dysfunction?
Bimodal IT: Shortcut to Innovation or Path to Dysfunction?Bimodal IT: Shortcut to Innovation or Path to Dysfunction?
Bimodal IT: Shortcut to Innovation or Path to Dysfunction?
 
Can We Do Agile? Barriers to Agile Adoption
Can We Do Agile? Barriers to Agile AdoptionCan We Do Agile? Barriers to Agile Adoption
Can We Do Agile? Barriers to Agile Adoption
 
Continuous Delivery (The newest)
Continuous Delivery (The newest)Continuous Delivery (The newest)
Continuous Delivery (The newest)
 
Software Craftsmanship Essentials
Software Craftsmanship EssentialsSoftware Craftsmanship Essentials
Software Craftsmanship Essentials
 
8 Things That Make Continuous Delivery Go Nuts
8 Things That Make Continuous Delivery Go Nuts8 Things That Make Continuous Delivery Go Nuts
8 Things That Make Continuous Delivery Go Nuts
 
Agile 2008 Retrospective
Agile 2008 RetrospectiveAgile 2008 Retrospective
Agile 2008 Retrospective
 
Left Hackathon 4.0
Left Hackathon 4.0Left Hackathon 4.0
Left Hackathon 4.0
 
40 Agile Methods in 40 Minutes
40 Agile Methods in 40 Minutes40 Agile Methods in 40 Minutes
40 Agile Methods in 40 Minutes
 
Lean Kanban India 2015 | Kanban - Myths or Facts | Mahesh Vardhrajan
Lean Kanban India 2015 | Kanban - Myths or Facts | Mahesh VardhrajanLean Kanban India 2015 | Kanban - Myths or Facts | Mahesh Vardhrajan
Lean Kanban India 2015 | Kanban - Myths or Facts | Mahesh Vardhrajan
 
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
 
"Creating a testing culture" by Mark Striebeck
"Creating a testing culture" by Mark Striebeck"Creating a testing culture" by Mark Striebeck
"Creating a testing culture" by Mark Striebeck
 
Leadership Without Management: Scaling Organizations by Scaling Engineers
Leadership Without Management: Scaling Organizations by Scaling EngineersLeadership Without Management: Scaling Organizations by Scaling Engineers
Leadership Without Management: Scaling Organizations by Scaling Engineers
 
The promise and peril of Agile and Lean practices
The promise and peril of Agile and Lean practicesThe promise and peril of Agile and Lean practices
The promise and peril of Agile and Lean practices
 
Scaling Teams, Processes and Architectures
Scaling Teams, Processes and ArchitecturesScaling Teams, Processes and Architectures
Scaling Teams, Processes and Architectures
 
Philly ETE - Are Your Developers Bull$h!tt!ng You? And why that's the wrong q...
Philly ETE - Are Your Developers Bull$h!tt!ng You? And why that's the wrong q...Philly ETE - Are Your Developers Bull$h!tt!ng You? And why that's the wrong q...
Philly ETE - Are Your Developers Bull$h!tt!ng You? And why that's the wrong q...
 
Why usability problems go unfixed - UX Bristol 2012
Why usability problems go unfixed - UX Bristol 2012Why usability problems go unfixed - UX Bristol 2012
Why usability problems go unfixed - UX Bristol 2012
 
Bowman walter
Bowman walterBowman walter
Bowman walter
 
Real World Lessons Using Lean UX (Workshop)
Real World Lessons Using Lean UX (Workshop)Real World Lessons Using Lean UX (Workshop)
Real World Lessons Using Lean UX (Workshop)
 

Similar to Devops at scale is a hard problem challenges, insights and lessons learned

How To (Not) Open Source - Javazone, Oslo 2014
How To (Not) Open Source - Javazone, Oslo 2014How To (Not) Open Source - Javazone, Oslo 2014
How To (Not) Open Source - Javazone, Oslo 2014gdusbabek
 
Agile for Me- CodeStock 2009
Agile for Me- CodeStock 2009Agile for Me- CodeStock 2009
Agile for Me- CodeStock 2009Adrian Carr
 
What do the "Cool Kids" know about DevOps?
What do the "Cool Kids" know about DevOps?What do the "Cool Kids" know about DevOps?
What do the "Cool Kids" know about DevOps?Bill Holtshouser
 
Get Faster - While You're Getting Better
Get Faster - While You're Getting BetterGet Faster - While You're Getting Better
Get Faster - While You're Getting Betterantoineg
 
The Three Pillars of Continuous Delivery - Boston Continuous Delivery Event
The Three Pillars of Continuous Delivery - Boston Continuous Delivery EventThe Three Pillars of Continuous Delivery - Boston Continuous Delivery Event
The Three Pillars of Continuous Delivery - Boston Continuous Delivery EventXebiaLabs
 
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent CerveauTheFamily
 
10 bezcennych lekcji dla software developera stającego się szefem firmy
10 bezcennych lekcji dla software developera stającego się szefem firmy10 bezcennych lekcji dla software developera stającego się szefem firmy
10 bezcennych lekcji dla software developera stającego się szefem firmyWojciech Seliga
 
Ten lessons I painfully learnt while moving from software developer to entrep...
Ten lessons I painfully learnt while moving from software developer to entrep...Ten lessons I painfully learnt while moving from software developer to entrep...
Ten lessons I painfully learnt while moving from software developer to entrep...Wojciech Seliga
 
Continuous Deployment
Continuous DeploymentContinuous Deployment
Continuous DeploymentBrian Henerey
 
DrupalCon 2013 Making Support Fun & Profitable
DrupalCon 2013 Making Support Fun & ProfitableDrupalCon 2013 Making Support Fun & Profitable
DrupalCon 2013 Making Support Fun & ProfitablePromet Source
 
Devops & Agility - Build the Culture, Get the Tools, Win the Day - Dundee Tec...
Devops & Agility - Build the Culture, Get the Tools, Win the Day - Dundee Tec...Devops & Agility - Build the Culture, Get the Tools, Win the Day - Dundee Tec...
Devops & Agility - Build the Culture, Get the Tools, Win the Day - Dundee Tec...David Walker
 
Creating a culture for Continuous Delivery
Creating a culture for Continuous DeliveryCreating a culture for Continuous Delivery
Creating a culture for Continuous DeliveryChef Software, Inc.
 
Making Support Fun & Profitable: DrupalCon Portland
Making Support Fun & Profitable: DrupalCon Portland Making Support Fun & Profitable: DrupalCon Portland
Making Support Fun & Profitable: DrupalCon Portland Anne Stefanyk
 
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialArchitecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialWill Gallego
 
DevSecCon Asia 2017 Shannon Lietz: Security is Shifting Left
DevSecCon Asia 2017 Shannon Lietz: Security is Shifting LeftDevSecCon Asia 2017 Shannon Lietz: Security is Shifting Left
DevSecCon Asia 2017 Shannon Lietz: Security is Shifting LeftDevSecCon
 
Automation Without Exposure.pptx
Automation Without Exposure.pptxAutomation Without Exposure.pptx
Automation Without Exposure.pptxSiddhartha
 
Agile Overview
Agile OverviewAgile Overview
Agile OverviewAndy Birds
 
Ten lessons I painfully learnt while moving from software developer
to entrep...
Ten lessons I painfully learnt while moving from software developer
to entrep...Ten lessons I painfully learnt while moving from software developer
to entrep...
Ten lessons I painfully learnt while moving from software developer
to entrep...Wojciech Seliga
 
Real World DevOps - Jeff Geerling's NEDCamp 2018 Keynote
Real World DevOps - Jeff Geerling's NEDCamp 2018 KeynoteReal World DevOps - Jeff Geerling's NEDCamp 2018 Keynote
Real World DevOps - Jeff Geerling's NEDCamp 2018 KeynoteJeff Geerling
 

Similar to Devops at scale is a hard problem challenges, insights and lessons learned (20)

How To (Not) Open Source - Javazone, Oslo 2014
How To (Not) Open Source - Javazone, Oslo 2014How To (Not) Open Source - Javazone, Oslo 2014
How To (Not) Open Source - Javazone, Oslo 2014
 
Agile for Me- CodeStock 2009
Agile for Me- CodeStock 2009Agile for Me- CodeStock 2009
Agile for Me- CodeStock 2009
 
What do the "Cool Kids" know about DevOps?
What do the "Cool Kids" know about DevOps?What do the "Cool Kids" know about DevOps?
What do the "Cool Kids" know about DevOps?
 
Get Faster - While You're Getting Better
Get Faster - While You're Getting BetterGet Faster - While You're Getting Better
Get Faster - While You're Getting Better
 
The Three Pillars of Continuous Delivery - Boston Continuous Delivery Event
The Three Pillars of Continuous Delivery - Boston Continuous Delivery EventThe Three Pillars of Continuous Delivery - Boston Continuous Delivery Event
The Three Pillars of Continuous Delivery - Boston Continuous Delivery Event
 
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
 
10 bezcennych lekcji dla software developera stającego się szefem firmy
10 bezcennych lekcji dla software developera stającego się szefem firmy10 bezcennych lekcji dla software developera stającego się szefem firmy
10 bezcennych lekcji dla software developera stającego się szefem firmy
 
Ten lessons I painfully learnt while moving from software developer to entrep...
Ten lessons I painfully learnt while moving from software developer to entrep...Ten lessons I painfully learnt while moving from software developer to entrep...
Ten lessons I painfully learnt while moving from software developer to entrep...
 
Continuous Deployment
Continuous DeploymentContinuous Deployment
Continuous Deployment
 
Binary crosswords
Binary crosswordsBinary crosswords
Binary crosswords
 
DrupalCon 2013 Making Support Fun & Profitable
DrupalCon 2013 Making Support Fun & ProfitableDrupalCon 2013 Making Support Fun & Profitable
DrupalCon 2013 Making Support Fun & Profitable
 
Devops & Agility - Build the Culture, Get the Tools, Win the Day - Dundee Tec...
Devops & Agility - Build the Culture, Get the Tools, Win the Day - Dundee Tec...Devops & Agility - Build the Culture, Get the Tools, Win the Day - Dundee Tec...
Devops & Agility - Build the Culture, Get the Tools, Win the Day - Dundee Tec...
 
Creating a culture for Continuous Delivery
Creating a culture for Continuous DeliveryCreating a culture for Continuous Delivery
Creating a culture for Continuous Delivery
 
Making Support Fun & Profitable: DrupalCon Portland
Making Support Fun & Profitable: DrupalCon Portland Making Support Fun & Profitable: DrupalCon Portland
Making Support Fun & Profitable: DrupalCon Portland
 
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialArchitecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
 
DevSecCon Asia 2017 Shannon Lietz: Security is Shifting Left
DevSecCon Asia 2017 Shannon Lietz: Security is Shifting LeftDevSecCon Asia 2017 Shannon Lietz: Security is Shifting Left
DevSecCon Asia 2017 Shannon Lietz: Security is Shifting Left
 
Automation Without Exposure.pptx
Automation Without Exposure.pptxAutomation Without Exposure.pptx
Automation Without Exposure.pptx
 
Agile Overview
Agile OverviewAgile Overview
Agile Overview
 
Ten lessons I painfully learnt while moving from software developer
to entrep...
Ten lessons I painfully learnt while moving from software developer
to entrep...Ten lessons I painfully learnt while moving from software developer
to entrep...
Ten lessons I painfully learnt while moving from software developer
to entrep...
 
Real World DevOps - Jeff Geerling's NEDCamp 2018 Keynote
Real World DevOps - Jeff Geerling's NEDCamp 2018 KeynoteReal World DevOps - Jeff Geerling's NEDCamp 2018 Keynote
Real World DevOps - Jeff Geerling's NEDCamp 2018 Keynote
 

DevOps at Scale is a Hard Problem: Challenges, Insights and Lessons Learned

  • 1. DevOps at Scale is a Hard Problem: Challenges, Insights and Lessons Learned Kishore Jalleda Sr. Director, Production Engineering Yahoo! Oath: A Verizon company
  • 2. Agenda (why I am here) • My vision (problem I’ve been trying to solve) • Challenges faced, Insights and Lessons learned: – Directed Alerting – Culture of Automation – Continuous Delivery (CD) – AWS/Public Cloud • Closing thoughts – Dev vs Ops – where are we headed?
  • 3. My Vision • Velocity (ship fast; fail fast; learn fast). Don’t build something no one wants. • Democratize Innovation • Create Intrapreneurs
  • 4. Let me tell you a story Ash, an engineer at Yahoo, wanted to build a stock recommender prototype using portfolios data on grid using machine learning. He starts to do this on his own personal time and gets a prototype working. When I interviewed him and asked what is stopping him from testing his idea (and several others he has) in a bucket quickly, he responded, “I wish there were fewer constraints at Yahoo”.
  • 5. Barriers - Too much Legacy stuff; operational burden from it. - Codebase complex; monolith - New hardware takes 6+ weeks to arrive and setup (great incentive to hold onto hardware, and order more than you need) - Operational burden; toil - Paranoid (security) approvals - ACLs approvals (Hbase, grid) - Monitoring setup is laborious - Have to do it in my spare time - No sudo - Need product approvals for everything - Etc.
  • 6. How can you help people like Ash? Engineers who come to work every morning wanting to change the world
  • 7. How can you get more engineers to think like Ash? The greatest joy is in building things (quickly); getting feedback (quickly); iterating on it (quickly).
  • 8. That is when it struck me: “DevOps” is really about eliminating (most) Technical, Process and Cultural barriers between Idea and Execution -- using Software.
  • 9. Our journey had begun • I joined forces with a handful of people who also wanted to do some big things. • Aspirational stuff. I know. But, hey, nothing wrong with that, right?
  • 10. “DevOps” to us is about: Enable a Culture of Ownership & Excellence • Engineer Agile & Automated Processes • Develop (Re)Usable & Self-Serve Tools … to kick ass at Delivery, Prevention & Repair
  • 11. Security & Privacy Don’t be foolish about these; security and privacy of your users is non-negotiable.
  • 12. We started executing - soon we hit roadblocks; multiple roadblocks.
  • 13. Initiative #1: Directed Alerting Before we talk about this initiative, let me ask you a question.
  • 14. Which one would you pick? • Option 1: Alerts → Team A → Team B → Team C → Dev • Option 2: Alerts → Dev
  • 16. Initiative #1: Directed Alerting “You wrote it; you own it” “You wrote it; you run it” it’s about getting feedback from production quickly and directly to Dev Teams.
  • 17. Turns out, it is a hard problem • After all, convincing four different teams/functions to change the way they operate is non-trivial
  • 18. So, how do you solve this problem? Any guesses?
  • 19. Directed Alerting – First communicate your vision • Page Dev teams directly • Get to <2 alerts/shift (so you can root-cause each) • All alerts are actionable; all alerts require human intelligence
  • 20. Directed Alerting - Leverage Outages • Conduct Postmortems. • Ask thought provoking and uncomfortable questions.
  • 21. Directed Alerting - Find your Allies Who believe in your vision. You will find them; you just have to go look.
  • 22. Directed Alerting – Buy in from your team(s). I got asked: – “we have been told they are tier 1 & 2 for the whole company”; “is that even possible? ” – “Can we handle the alert volume?” – “Can we still send low priority alerts to them?”
  • 23. Directed Alerting – Buy in from Dev team(s) I got asked: – “Are you sure your team can handle?” – “We will have no one else to blame” – “This is a big change; don’t f*** things up”.
  • 24. Directed Alerting - Buy in from Senior Leaders • Leverage important meetings to talk about your vision (Tech Council, Arch Review, etc.)
  • 25. Directed Alerting – Buy in from other teams There will be conflict. If you can stand firm, learn to say "no", the outcomes can be pretty awesome! Also, it's not enough if you --as the leader-- say "no"; you must empower all your teams to say "no".
  • 26. “The most important skill any leader — any person, really — can learn is how and when to say “no.”” Source: https://hbr.org/2017/06/help-your-team-stop-overcommitting-by-empowering-them-to-say-no
  • 27. We weren’t getting very far; we wanted something more dramatic; something that showed this was possible at scale.
  • 28. Directed Alerting –“Daily Fantasy” to the rescue • New product; less baggage; high profile; awesome, modern leader • Great opportunity to show something as radical as this is actually possible.
  • 29. Directed Alerting – Launch, Yay! • “Daily Fantasy” launched in a “DevOps” model. • Showcased the team and win to the whole company: blogs, all hands, etc. • Rolled out to more teams
  • 30. Directed Alerting - Results MTTD (minutes) Sharp drop
  • 31. Directed Alerting – Insights and Lessons learned • There will be stragglers (some just don’t get it). • You will piss some people off; people called me a “troublemaker” :)
  • 32. Directed Alerting - Reduce ALERT noise (Process/Culture) • Daily incident/alert reviews • Weekly KPI meetings • Public shaming :) • Peer Pressure :) • Budgets/incentives • Ownership • Tickets vs Alerts • Dev On Call
  • 33. Directed Alerting - Reduce ALERT noise (Tools) • Alert aggregation and grouping • Auto Remediation • Logs vs Tickets vs Alerts • Symptoms vs RC monitoring • Avoid tight coupling with abstractions; fix alerts closer to the source
  • 34. Initiative #2: Continuous Delivery (CD) Before we talk about this initiative, let me ask you a question.
  • 35. Push/Deploy to Production Which option would you pick? Option 1: “No Humans Allowed” Option 2: “Humans Allowed”
  • 36. We picked option #2 at Yahoo. Proved that “no-touch deploys” to production is possible at scale (1+ Billion users).
  • 37. CD – As you (may) know • Doesn’t happen overnight; with enough iterations, you get to CD. • Heart of CD is the certification plan (CD gates; PR builds; etc). • TDD is a culture that must be embraced. • Shipping in small batches inherently reduces risk; improves velocity and productivity.
  • 38. CD – major push in 2013/2014 • Corp initiative/mandate; non-negotiable. Goal was to change culture. Tech Excellence. • Marissa wanted Yahoo to be on CD. Top down initiatives are a great way to get traction.
  • 39. But, CD at scale is hard • Was scary for some teams. • But, it’s the right thing to do; it’s how modern software is built.
  • 40. CD (at scale) – expect failures early on • Took time for teams to take CD seriously • Took time for teams to embrace TDD. • Took time to do CD right.
  • 41. CD – “Warp Drive” to the rescue • A great program at Yahoo. An effective way to bring about transformational change at scale. • Corp initiatives alone cannot make a culture stick.
  • 42. But, there will (always) be stragglers • Things and conversations will get ugly. Don’t give up; persist. Show why CD matters.
  • 43. CD - results • Velocity increased dramatically; developers more productive • Hundreds of man-hours saved from manual deploys. • Teams have automated pushes to prod daily. Yes, it is true! CD is possible at scale.
  • 44. CD – Insights and Lessons Learned • You will fail more often than you succeed. • Every team may not embrace CD’s/TDD’s spirit • Training (TDD; CD) will never be enough. • Incentivize teams to move to CD. • Have CD advocates • OK to push w/o 100% test coverage.
  • 45. Initiative #3: Automation Culture Same drill. Before we talk about this initiative, let me ask you a question :)
  • 46. A Server/VM/Container is in a bad state Which option would you pick? Option 1: Wake up a human at 3 am; have him/her take that resource OOR manually. Option 2: Automatically take it OOR, (spin up a new one automatically, run some diagnostics, create a ticket, and assign to a human).
  • 47. But wake someone up when let’s say, 15% of the cluster is in a bad state.
  • 49. Initiative #3: Automation Culture; eliminate “toil” “Let the machines do the heavy lifting; I have better things to do”
  • 50. Automation Culture – Challenges People may ask “What about our job security?” After all, I was proposing that we take some responsibilities away.
  • 51. Automation Culture – Challenges Making a strategic investment means making some trade-offs.
  • 52. Automation culture – My promise to my team Higher-value-add work: writing software; infra; tooling; etc. Trust me, “you can’t possibly automate yourselves out of your job.”
  • 53. “The misuse of talent in large organizations is rampant today” “Without the ability to say “no” to low-level tasks in order to say “yes” to groundbreaking ones, people stop innovating”. Source: https://hbr.org/2017/06/help-your-team-stop-overcommitting-by-empowering-them-to-say-no
  • 54. Automation culture – Tools built • Auto Remediation (or auto fixes) • Failure Discovery / Disruptive Testing • Metrics based promotions • Load testing Frameworks • Product Health Visualization Dashboards • Etc.
  • 55. Automation Culture - Results • 100s of auto remediations/hour in prod • Hundreds of man hours saved • Dramatic reduction in (repeat) incidents • New bugs exposed through monitoring the auto remediation. • New bugs found in App through failure discovery.
  • 56. Automation Culture – Insights and Lessons learned • Tools that only Ops can use are not really tools. • Simply building a tool doesn’t mean Devs will use it.
  • 57. Initiative #4: Compute Platform (AWS/Public Cloud) One last question before we talk about this initiative. Ready? * - this is not an AWS endorsement. AWS did not pay me for this.
  • 58. Where will you launch a new product (at scale)? Option 1: Public Cloud Option 2: Data Center Option 3: Hybrid
  • 59. Well, obviously, there is no right or wrong answer here. In our case, hybrid seemed to make a lot of sense. We chose option #3.
  • 60. Why a strategic bet on AWS (or a public cloud) for a 22-year old company like Yahoo?
  • 61. *Billable* - Compute platform • If you don’t bill teams for using compute, they will misuse it; there is no incentive to give hardware back when new hardware takes so long to arrive.
  • 62. *Self-Serve* - Compute platform Imagine having to talk to your ops team to provision compute.
  • 63. *On-demand* - Compute platform I will use it when I need it
  • 64. *Scalable* - Compute platform I don’t want to worry about running out of capacity.
  • 65. *secure* - compute platform Don’t have to worry too much about security at the Infra/OS level.
  • 66. AWS adoption at scale – Challenges • Who pays the initial costs? • Killer use cases? • My app is working just fine; why should I move to the cloud?
  • 67. AWS adoption (at scale) – All about use cases. • Failsafe/Fallback • Load Testing • Non-prod / Test frameworks • Rapid Experimentation • Launching New, new products (few dependencies on existing/legacy stacks)
  • 68. AWS - Results • Rapid Experimentation (many new products prototyped in AWS) • New, new Y! products are launched on AWS by default (Kabana, Livetext, View, etc.) • Failsafe/Fallback served from AWS; if all of Yahoo’s data centers went down, we could still serve (stale) content to our users • More to possibly come in the future. Stay tuned!
  • 69. AWS – Insights and Lessons learned Break the rules; “but, break them in broad daylight”. Hard to make long-term, strategic bets; easy to deprioritize.
  • 70. AWS – Insights and Lessons learned • Get “consolidated billing” early on. • Make it a corp initiative • New apps should by default be built with the cloud in mind.
  • 72. “DevOps” Transformation – Insights and Lessons Learned • Incentivize teams to automate; reward good behaviors • Learn to say “No” more; learn to say “yes” less. • Write down your thoughts; you can reach a lot more people.
  • 73. “DevOps” Transformation – Insights and Lessons Learned • Not everyone will know what “DevOps” is about; they will interpret it as they see fit. • Reliability is overrated; no one needs five nines availability; it’s OK to go down (not the end of the world).
  • 74. “DevOps” Transformation – Insights and Lessons Learned • Pick your battles; you cannot win them all – and it’s OK. • Invest in Dev training and tooling; often an underinvested area.
  • 76. Dev & Ops: do you have it backwards? • Pushing code you do not own? • Responding to alerts for products you don’t own? • Testing & debugging code you don’t own? • Writing tests for code you do not own? • Etc. Ask yourself, is that the right thing to do?
  • 77. A better model (call it “DevOps” or whatever) Core Dev Teams own build, test, deploy, monitor, on call, debugging, incident response, capacity, Postmortems, etc. Non-core Dev & Ops Teams own infrastructure, automation, tooling, network, Developer productivity, expert services, observability, etc
  • 78. Do yourself a favor • Read this awesome post by an SRE at Google, JBD, @rakyll (spoiler alert: SRE support is optional at Google) *https://medium.com/@rakyll/the-sre-model-6e19376ef986 • Check out “Ten Persistent SRE Antipatterns: Pitfalls on the Road to a Successful SRE Program Like Netflix and Google” (spoiler alert: “NOC it off”) *https://www.usenix.org/conference/srecon17americas/program/presentation/horowitz
  • 79. Reflect; soul search; ask tough questions • How is my team providing value? • Why does my team exist? • Am I adding unnecessary abstractions?
  • 80. Do the right thing - Enable a Culture of Ownership. - Engineer Automated & Agile Processes (Iterative & Experimental) - Develop Self-Serve & (Re)usable Tools. Yes, there will be exceptions, but only a handful. Cash cows, for example.
  • 81. Are you ready to say “No”? Thank you! (Questions?) @KishoreJalleda or on LinkedIn. (would appreciate/love feedback on my talk)
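
The auto-remediation policy from slides 46 and 47 (take a bad resource out of rotation automatically and file a ticket, but wake a human once roughly 15% of the cluster is unhealthy) can be sketched as below. This is only an illustration, not the tooling built at Yahoo: the `Host`, `Outcome` and `remediate` names are hypothetical, and the choice to stop automatic removal once the threshold is crossed is an assumption made here so that automation cannot drain a cluster on its own.

```python
from dataclasses import dataclass, field
from typing import List

# Slide 47 suggests paging a human at about 15% of the cluster; treat it as tunable.
PAGE_THRESHOLD = 0.15


@dataclass
class Host:
    """Hypothetical view of a server/VM/container as seen by a health checker."""
    name: str
    healthy: bool = True
    in_rotation: bool = True


@dataclass
class Outcome:
    """What one remediation pass decided to do."""
    tickets: List[str] = field(default_factory=list)
    pages: List[str] = field(default_factory=list)


def remediate(cluster: List[Host], threshold: float = PAGE_THRESHOLD) -> Outcome:
    """Handle unhealthy hosts automatically unless too much of the cluster is bad."""
    outcome = Outcome()
    bad = [host for host in cluster if not host.healthy]
    if not bad:
        return outcome

    # Assumption: above the threshold this is likely a systemic problem,
    # so leave hosts where they are and page a human instead.
    if len(bad) / len(cluster) >= threshold:
        outcome.pages.append(f"{len(bad)}/{len(cluster)} hosts unhealthy")
        return outcome

    # Below the threshold: take each bad host out of rotation and file a
    # ticket for daytime follow-up. A real system would also start a
    # replacement and collect diagnostics at this point.
    for host in bad:
        host.in_rotation = False
        outcome.tickets.append(f"{host.name}: taken out of rotation, needs diagnosis")
    return outcome


if __name__ == "__main__":
    cluster = [Host(f"web{i}") for i in range(10)]
    cluster[3].healthy = False
    result = remediate(cluster)
    print("tickets:", result.tickets)
    print("pages:", result.pages)
```

The point of the sketch is the split between tickets and pages: a single bad host becomes daytime work, while a cluster-wide problem still reaches a human.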

Editor's Notes

  1. To talk about our “DevOps” Journey. The talk is not about perfection; it’s about progress. Hopefully, there will be takeaways for all of you.
  2. I was at IMVU for 6+ years. I have seen revenue grow 10X just by making it easier ….. I have seen it work.
  3. If you have ever built software, you know this. Agree? How many have written software here.
  4. The software part is important; “software is eating the world” (just in case you have been living under a rock and just showed up; anyone who has been living under a rock ;)?).
  5. Every day when you come to work… evaluate your choices! You are paving the road to Lean/DevOps with your decisions, your efforts, your interactions.
  6. Before you go too far and do crazy shit.
  7. Show by raise of hands, “how many have alerts routed/directed towards dev teams?”
  8. You are all right. Pretend that option #2 is right. For the purposes of this meeting.
  9. I am like, Dude, you are missing the point.
  10. Also good to leverage people who make some important tech decisions at your company.
  11. If you want a drop like this; you know what to do. Right?
  12. When you see large or small numbers next to your name, it can be pretty frightening.
  13. How many do CD here? (show by raise of hands)
  14. How many do TDD? How many push every commit to prod?
  15. Let me ask you a question: “If a machine is in a bad state or unhealthy, what do you do?”
  16. Let me ask you a question: “If a machine is in a bad state or unhealthy, what do you do?”
  17. The kinds who ask this are the ones who may not evolve as the company does.
  18. If you are good at what you do, you will almost never, ever be able to automate yourself out of your job.
  19. That doesn’t exist. If you *truly* do, the world needs folks like you.
  20. We made the investments
  21. In fact, at Zynga. Anyone who worked at Zynga here?
  22. Talk about Zynga and the public cloud.
  23. This is going to be a bit philosophical
  24. Some have this backwards. Encourage innovation.
  25. Ownership is key. Some of these are not staffed properly at companies especially Infra and dev tools.
  26. How many people have asked “why does my team exist?”