15. @arupchak
“The Heist”
•What we sold to the teams
–Business Success depends on Innovation
–Innovation depends on Rate of Change
–We will increase the Rate of Change by having engineers own more of the stack
20. @arupchak
The Job Changed
• For previously Dev focused people
– They now owned the full vertical stack
– Code it. Ship it. Own it.
• For previously Ops focused people
– Had to empower others to do their previous job
– Make the right thing the easy thing
23. @arupchak
What We Built
• Infrastructure Tools
– Self Service Server Provisioning
– Self Service Metrics and Telemetry
– Self Service Deployment
– Self Service Common Infrastructure
• Documentation
– Where we could not automate easily/quickly
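The shape of a self-service provisioning tool like the ones listed above can be sketched as follows. This is a minimal, hypothetical illustration, not PagerDuty's actual tooling: the `ServiceSpec` and `provision` names are invented, and a real tool would call a cloud API instead of returning hostnames.

```python
# Hypothetical sketch of a self-service provisioning entry point.
# A product engineer fills in a spec; the tool validates it and does the rest.
from dataclasses import dataclass


@dataclass
class ServiceSpec:
    name: str
    instance_type: str
    count: int

    def validate(self):
        # Guardrails: make the right thing the easy thing.
        if not self.name:
            raise ValueError("service name is required")
        if self.count < 1:
            raise ValueError("count must be >= 1")


def provision(spec: ServiceSpec):
    """Validate the spec and return the hostnames that would be created.

    A real implementation would call out to a cloud provider here.
    """
    spec.validate()
    return [f"{spec.name}-{i:02d}" for i in range(1, spec.count + 1)]


hosts = provision(ServiceSpec(name="payments", instance_type="m5.large", count=3))
print(hosts)  # ['payments-01', 'payments-02', 'payments-03']
```

The point of wrapping provisioning this way is that the Ops-focused people encode their judgment once (in the validation and defaults) instead of gatekeeping every request.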
27. @arupchak
The Hard Part of Leadership
• Some changes are not for everyone
• Some people who thrived in the old ways will struggle in the new ones
• They are not trying to be jerks
• Expect attrition
38. @arupchak
“The Heist”
•What we sold to the teams
–Business Success depends on Innovation
–Innovation depends on Rate of Change
–We will increase the Rate of Change by having
engineers own more of the stack
My name is Arup, and I am really honored and humbled to be speaking here again.
This is my second time at DevOpsDays Tel Aviv, so I am glad to be back. Great to see some folks that I met last time.
The organizers have done a great job so far. Thank you.
Lay out some context first about myself and PagerDuty.
Then dive in to cover where we were at the beginning of 2016.
Then walk through some of the changes that we made throughout 2016, including some of the technical and org changes.
Then I will walk through some of the validation we did to make sure we were doing something right.
Also, these slides will be on SpeakerDeck, so do not freak out if you do not get everything.
There was a talk yesterday on DevOps, DICOM, and HIPAA, so glad that Gil is doing this stuff and not me.
Amazon – Used and New
Netflix
PagerDuty
I work with some very smart people and I am lucky and privileged to work with the engineers I get to work with.
While I care about this stuff, no one person can do all of what I am going to walk through.
It takes teams of highly aligned people.
Please do not copy this verbatim.
Do not go to your leaders and say “This guy from PagerDuty told me to do this!”
That will get me into trouble, and quite frankly, there are benefits that I have that you do not.
Likewise, there are weaknesses that I have that you do not.
It is on each of you to process this information and to figure out what is relevant to you. That is what good leaders do. Do not be a lazy leader.
Quick 15 second pitch
Starting with January 2016. Go back in time a bit. Fewer gray hairs. Maybe one or two fewer wrinkles.
Remember the Zika virus?
Google Code was discontinued
Google AMP was introduced
Typical Operations
The Reliability team was not really SRE, but that is what we called the team.
Books have been written on what to centralize or not, but this centralization was holding us back.
God help you if you wanted to create a new service that required a new database and some backend logic and some dynamic webpage.
We were holding the business back and we could not deliver quickly enough
Typically associated with armored caravans transporting goods in the 1800s.
Apple introduced the smaller iPhone SE because everyone was so mad about the big iPhone 6. They also introduced the iPad Pro behemoth. Facebook Live launched, and everyone was accidentally posting videos as status updates.
What caused these problems?
Anxiety with Change
How to calm down that anxiety. Or at least how we tried to.
We often talk too much about tooling, and I want to remind everyone that people are what make up your company, not tools. As engineers, our first bias is to think about what we can build to solve a problem, and we need to be aware of that bias.
The last part is what most leaders get scared about, and rightfully so. This does not happen overnight.
Introducing change is hard enough, managing through it is even harder.
Did it work?
Now for some validation. I wish I had been tracking all of these metrics throughout the change.
Median pull request time went from just over 24 hours to under 8 hours.
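A metric like the median pull request time above can be computed straight from opened/merged timestamps. This is an illustrative sketch with made-up data, not PagerDuty's pipeline:

```python
# Compute median hours a pull request stays open, from (opened, merged) pairs.
# The timestamps here are invented sample data.
from datetime import datetime
from statistics import median

prs = [
    ("2016-03-01T09:00", "2016-03-02T10:00"),  # 25 hours
    ("2016-03-01T09:00", "2016-03-01T15:00"),  # 6 hours
    ("2016-03-02T08:00", "2016-03-02T14:30"),  # 6.5 hours
]


def hours_open(opened, merged):
    fmt = "%Y-%m-%dT%H:%M"
    delta = datetime.strptime(merged, fmt) - datetime.strptime(opened, fmt)
    return delta.total_seconds() / 3600


print(median(hours_open(o, m) for o, m in prs))  # 6.5
```

Using the median rather than the mean keeps one stale week-old PR from masking an improvement in the typical case, which is why it is the headline number here.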
Story of engineer deploying changes to production and pull requests.
Each of you probably has a codebase like this, where only certain people are allowed to touch it. See what happens when you open it up: do more people touch the thing they used to be afraid of?
While the overall number did go up, when we compared it against the number of changes, it actually went down, and we had shorter incidents as well. What is the big spike? That is what happens when your automation runs amok and sets every alert off.
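The normalization step matters: an absolute incident count can rise while the rate per change falls. A toy calculation (invented numbers, not PagerDuty's) makes the arithmetic concrete:

```python
# Toy numbers showing why incidents must be normalized by change volume.
before = {"incidents": 40, "changes": 400}
after = {"incidents": 50, "changes": 1000}


def rate(period):
    return period["incidents"] / period["changes"]


print(rate(before))  # 0.1  -> one incident per 10 changes
print(rate(after))   # 0.05 -> one incident per 20 changes
```

Here incidents went from 40 to 50 in absolute terms, but each individual change became half as risky, which is the claim the slide is making.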
I have to hide the numbers or else my legal team will kill me
My biggest mistake was underestimating the amount of cross training we had to be doing. Tooling is great. Documentation is good. But you need to have dedicated training sessions as well. We are starting to get better at this, especially for our databases.
My biggest surprise was the willingness of engineers to jump in and at least try it. While some did leave, they all gave it a shot.
What made it work was conviction. There were and continue to be dark moments, but I still have conviction that we are on a better path and are working in a better way now.