When Ops Swallows Dev
Scott Prugh, Chief Architect & VP Software Development & Operations, CSG International
Erica Morrison, Director, Software Development, CSG International
CSG has been on an Agile and Lean journey to continually shorten feedback loops in its SDLC and operations processes. This began with moving from waterfall to Agile and deploying cross-functional dev teams. Today, we have taken this transformation further by deploying cross-functional product delivery teams that design, build, test, and run their products. Join us to hear what went as expected and the surprises we encountered along the way.
DevOps Enterprise Summit San Francisco 2016
01:20
-Thanks.
-3rd time at DOES; first with ops duties
-Current chapter: “When Ops Swallows Dev”. Past 7 months brought together dev and ops; felt like doing more ops than dev as we work through challenges
-Overall theme: Accelerate Feedback & Learning to enable Understanding, Accountability and Engineering
-First: Org Archetype walkthrough ending in Creation of Service Delivery Teams; Erica: What we learned; Scott: What is next
-Panel: How to do DevOps with traditional IT orgs. I’ll cover that today. You will either be excited or scared. Hope it’s the former and you have the courage to work through the latter.
00:30
-Super quick: CSG is the largest SaaS provider of CC & Billing for the cable industry
-40 dev teams & 1,000 practitioners
-Our platform (ACP) crosses 20 technologies, JS to mainframe assembly, and is delivered as an integrated suite
-Same challenges as everyone else. Current challenge: Operational Quality
-We also run the largest print operation in the US and produce 70M statements/month
01:00
-Set context first about the problem large enterprises have been battling.
-Picture in my head. Needed to get it out.
-Dev on left; ops on right connected by a precarious path to production. Leaders yelling at each other.
-Agree on 3 things: 1) CRQs/change suck. Environment blows up; 2) Path to production is paved on the herculean efforts of others (CM, RM, Prod Ops, PMO); 3) Customers want quality and speed
-The good thing: It doesn’t have to be this way. With DevOps we can make the path to production smooth and deliver much higher quality at speed.
01:20
-EH had a great presentation last year. Highly recommend it. So many great points but two I pulled out here
-1st: Courage and how you need to experiment with re-drawing team lines/boundaries (reform teams). Story: Head of QA’s first act was to get rid of the QA org. Realized it was de-optimizing the system.
-2nd the concept of feedback and how important it is to enable learning
-Lower left: standard PDCA loop. Faster loops mean faster learning
-Above in colored text. Different types of testing practices that help enable quality.
-To the right; colored loops represent feedback from these tests stretched out over time.
-The lower the latency; the faster we learn and the faster we can deliver quality software
-One loop is missing. I’ll add it now: operational quality
-We will use these loops to look at the different org archetypes and how we have optimized our org structures to reduce the latency in our feedback
01:40
-First org archetype: Functional. Recognize this as waterfall. This was CSG pre-2012. My hope is that no one is doing this now…
-You will see the concept of queues in this picture. Each org processes its work and passes it downstream via a queue.
-We know from two popular works how queues scale: Larman’s Scaling Lean & Agile Development and the wait-time formula from The Phoenix Project.
-Queues degrade exponentially and latency for feedback increases exponentially as this type of org gets busy
-Even more evil than the queues themselves is what this does to your people shape. Creates I-Shaped resources. Good at one thing only.
-Look at our feedback loops. Note how far out a loop like operational quality is. Across 5 orgs to get that feedback. Note that some of these orgs will even have queues inside them.
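The queue scaling referenced above can be sketched in a few lines. This is a minimal illustration of the standard %busy/%idle wait-time heuristic from The Phoenix Project, not a CSG-specific model:

```python
def wait_factor(utilization):
    """Phoenix Project wait-time heuristic: wait ~ %busy / %idle.

    Returns the relative time a work item spends queued at a resource.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)

# Queues degrade non-linearly as an org gets busy:
# 50% busy -> 1x wait, 90% -> 9x, 99% -> 99x.
for u in (0.50, 0.90, 0.99):
    print(f"{u:.0%} busy: wait factor {wait_factor(u):.0f}x")
```

The non-linearity is the point: a functional org running everyone "hot" pushes feedback latency out exponentially, which is exactly what the stretched loops in the picture show.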
01:15
-Next: Agile Archetype
-In 2012 we led a re-org to get rid of role specific orgs in development. Realized that these orgs were de-optimizing the flow of value and quality.
-Rolled out SAFe and also implemented many other techniques…
-This type of structure has a lot fewer queues and encourages little-t shaped resources: people who understand the entire SDLC value chain and interface more directly with ops
-Now look at our feedback loops. Lower latency because many more are within the context of a team that is executing in a two week cadence. They learn faster.
01:00
-Results of this are well known. Previously presented and also in the DevOps Handbook.
-From 2014 to current these changes improved quality almost 10x and reduced TTM by 50%
00:40
-Now let’s take a current look at overall quality end to end
-Blue is release quality that we optimized away. Orange is non-release
-Left is dev; right is ops.
-Simple facts are: 98%; 92%
-These leaders are still yelling at each other. They are probably saying something like…
-Dev guy hangs his head…
-So. Let’s look at what we needed to do next.
01:10
-Next Archetype is the Market or Service
-On the left you see a representation of our org structure prior to March 2016. This is a traditional Dev & IT Ops model. Dev builds. Ops runs. Platform provides infra.
-Orange lines represent the flow of handoffs
-Observations we had
-From this we had a few hypotheses as to what was happening
01:00
-March 2016 I was asked to take over prod operations at CSG
-My first act was to sit down for a few days with my leaders.
-Gave them a challenge. How do we bring these teams together to create teams that build and run their software.
-On the right you can see how we collapsed down several orgs into combined SD or DevOps teams. We still interface in the traditional way with platform
-Note the green line. The future is that our platform will connect via our CI system to auto-provision infrastructure
01:30
-Now the team level view.
-See how the feedback loops are now contained within one team. Learning is accelerated as feedback is gained on all concerns within a sprint
-This encourages Big T resource shapes that understand the majority of the value chain
-So why bring Dev and Ops together: Understanding; Accountability; Engineering; Other
-Now I will turn it over to Erica to go through what we have learned in the past 7 months going through this transition
Need to get here in 13min(13:00 now)
11:00 for Erica. Puts us at 24:00
Reorg
Understood ops
This View
When we started this reorganization journey, I thought I knew all about DevOps and what it was like to do operations. I had several teams already doing “you build it, you run it.” This is what my view of DevOps was – a tranquil best-practice of how we want to develop software.
http://www.forbes.com/sites/benkerschberg/2015/02/04/why-devops-integration-and-continuous-delivery-hold-the-key-to-enterprise-mobile-app-dev/#26390d3279bf
Previous DevOps Teams
Gained pure Ops
Extremely educational
Extremely challenging
Solidified DevOps
Framework and methodologies
Not overnight thing
However, when we got going…..it felt a lot more like this.
My previous DevOps teams were internal teams supporting internal products (build system, telemetry).
BigIP introduced me much more to the pure operations world and to the operational teams for all of our different products and all the things they put up with.
It has been an amazing educational experience and given me a whole new perspective on the Ops world.
While I’m much better for this journey, it’s been a tough one that has had many bumps along the way.
A lot of outages right out of the gate
Thrown into a world of chaos – very disruptive
It’s solidified the value of many DevOps principles as I’ve seen what it’s like to be in the Ops world without many of these concepts in place.
Fixing these issues requires you to roll up your sleeves and put in a ton of hard work to change culture and implement best practices.
I’d like to talk about my specific experiences with one team I’ve been deeply involved with, and expand those learnings to the larger org. Our journey is by no means done, but we have made substantial progress.
I thought I’d start out by sharing some thoughts coworkers shared with me regarding this presentation which give some good insight into my year
https://thinkingmonster.files.wordpress.com/2015/08/devops.jpeg?w=300&h=300
As I got more involved with the BigIP team, I quickly started to feel like I was in the Phoenix Project. While I had seen many parallels to the book in previous work experiences, it was nothing like this.
Invisible work/work in multiple systems
One system for stories, another for incidents, another for CRQs, another for new requests. And TONS of email. And some stuff wasn’t in any system. Lack of work visibility; my brain was exploding trying to track it all
No single pane of glass
Impossible to follow up with teams – whoever screamed the loudest
So many #1 priorities, not a good way to prioritize and work. Squeaky wheel getting the grease. Sometimes other important things dropped
WIP/Over-utilization
Tech debt
Vendor upgrades
Standards – lack of standards. When we do have them, not universally applied and rolled out to production
Taken on capabilities that probably belong within the development side that have added complexity and caused maintenance challenges, as well as prod outages.
Brent to the nth degree. Pulled in to firefight every issue, no time to make things better even though he’s the visionary for the product
Poor visibility into specific changes
Details/impacts of changes not always understood
No easy way to track what had been done
Manual configuration
Pretty much all changes done by hand
We’ve taken a look at the numerous challenges facing this team and attacked them with DevOps concepts
John Shook from Toyota talks about making culture changes by changing what we do in order to change the underlying belief system and culture. We’ve done that here
Biggest culture change happens by seeing the changes in action. Just do it and become a believer by seeing how it pays off
Resistance to change at varying degrees.
Feeling of just another fad or we’ve said this before.
Resource changes
Brought developers and architects in to supplement the network/operational engineers and help drive culture change
Automated reporting
Traffic reports, device reports
Jenkins jobs to orchestrate
ACL changes/jobs
Writing to StatHub
Deployment mechanism
Config as code
iApp FW, auto test and deploy FW
Automation of cert renewals
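As one concrete illustration of the automation above, a cert-renewal check of the kind a scheduled Jenkins job might run can be sketched as follows. The host wiring and 30-day threshold are illustrative assumptions; the `notAfter` parsing matches the string format Python's `ssl.getpeercert()` returns:

```python
import datetime
import socket
import ssl

# Date format used by ssl.getpeercert()["notAfter"],
# e.g. "Jan 01 00:00:00 2030 GMT"
NOT_AFTER_FMT = "%b %d %H:%M:%S %Y %Z"

def days_until_expiry(not_after, now=None):
    """Days remaining on a certificate, given its notAfter string."""
    now = now or datetime.datetime.utcnow()
    expires = datetime.datetime.strptime(not_after, NOT_AFTER_FMT)
    return (expires - now).days

def check_cert(host, port=443, warn_days=30):
    """Fetch a host's live TLS cert and flag it if renewal is due soon.

    Intended to run from a scheduled CI job (hypothetical wiring).
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    days = days_until_expiry(cert["notAfter"])
    return days, days <= warn_days
```

A job like this turns cert renewals from a surprise outage into a routine ticket.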
Dev best practices
Source control
CI
Automated testing
Automation of deployments
Peer reviews
VE’s locally
Work tracking
Got everything in JIRA. Automation written to pull from several different tools
Includes changes so we can easily see what went in, what’s going in
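The consolidation described here can be sketched as a normalize-then-rank step. The source names, field names, and priority scheme below are illustrative assumptions, not CSG's actual JIRA schema:

```python
# Normalize work items from several tracking systems into one ranked backlog,
# so there is a single prioritized queue instead of "whoever screams loudest".
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}

def normalize(source, key_field, item):
    """Map a raw record from one system to a common work-item shape."""
    return {
        "source": source,
        "external_key": item[key_field],
        "summary": item.get("summary", "(no summary)"),
        "priority": item.get("priority", "unranked"),
    }

def unified_backlog(feeds):
    """feeds: {source_name: (key_field, [raw items])}. Returns one ranked list."""
    items = [
        normalize(source, key_field, raw)
        for source, (key_field, raws) in feeds.items()
        for raw in raws
    ]
    # Unranked items sink to the bottom instead of jumping the queue.
    return sorted(items, key=lambda w: PRIORITY_ORDER.get(w["priority"], 99))
```

In practice the normalized items would then be created or updated in JIRA via its REST API; the point of the sketch is the single pane of glass, not the transport.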
Workload management
Dev time vs ops time
Ops rotation
Implemented change review for every single change with team and DevOps partners
On top of peer review
Through all of this work we are
Achieving culture changes
Setting ourselves up for the long term vision
Ability to safely make changes
Reduced impact to customers
Self service
Before I turn it back over to Scott to talk about where we are going next in our DevOps journey, I’d like to share some thoughts on Ops from a Development Perspective
As our Dev org has moved to a system focus, we have brought dev best practices to the Ops world. However, the bottom line is the product still needs to continue to run and be configured to meet customer needs. Outside our wheelhouse
Through our DevOps journey, some things have gone as expected, but others have been surprises.
Probably the biggest surprise is the extent to which Ops is hard. It’s one thing to read it in a book, but another entirely to experience it firsthand.
Those from Ops world are looking at me and thinking “duh”
Change process was not visible to dev world
I didn’t know any of the details of our change process
I now deal with it on a daily basis
Process is cumbersome and time consuming
Overwhelming volume of changes and how change is such a big element of what the team does
Change can be scary!
As a developer you want to get your changes into prod as soon as possible.
But in Ops, as you are the one responsible for implementing the change and dealing with the fallout, you understand each change comes with risk, especially when it’s done manually and without an automated test framework.
That causes a natural reaction to want to go slower. We know we can’t do that, but we now understand the perspective
Need for application architecture
Historically, dev developed with architecture in mind
Ops developed with a tactical focus
Ops world needs architecture just like dev does. It’s maybe different than traditional architecture but requires some of the same skills and rigor
Tactical focus vs strategic. Driven by constant workload and culture
We’ve treated Ops as a commodity. It needs to be treated as something that needs to be designed, just like anything else
We haven’t always enabled Ops to be successful
Examples: # of resources and training
All hours support
I’m not attached at the hip to my phone
Sleep with it next to me with volume on
Get woken up more than I would like
Vaca boating example
Truly a 24x7 gig where you are never off
Dedicated group of people doing this
We ask a lot of our Ops teams. They’ve been given crappy equipment and in some cases, not the right training
https://memegenerator.net/instance/22605665
01:30(24:00-25:30)
-Next things: Process: Lead Time; Bridging ITIL/SDLC; Impact Reduction; More Centers of Enablement
-Technology: More engineering; Mainframe; CD v2.0
-People: Vital to grow an engineering culture and cross skilling
-Bonus slide. Couldn’t resist.
-Very busy... Our goal for changes is a 99.5% success rate... The yellow line represents where this quality line falls.
-The buckets in the graph represent lead time windows that changes were put in.
-Very simply: changes that go in < 24h are 99.83% successful
-Greater than 24h drop to 98.72%.
-Reinertsen: Schedule buffers transform uncertain lateness into certain lateness; here they transform uncertain failure into more certain failure. At least 600% more in this case.
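The arithmetic behind the "at least 600% more" figure follows directly from the two success rates on the slide:

```python
# Change success rates from the slide, by lead time.
fast = 0.9983  # changes committed with < 24h lead time
slow = 0.9872  # changes committed with > 24h lead time

fail_fast = 1 - fast  # 0.17% failure rate
fail_slow = 1 - slow  # 1.28% failure rate

# Relative increase in failure rate for long-lead-time changes.
increase = (fail_slow - fail_fast) / fail_fast
print(f"failure rate increase: {increase:.0%}")  # roughly 650%
```

So the longer-lead-time bucket fails roughly 7.5x as often, an increase of over 600%.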
-Policy dictates 5 day lead time in order to provide for coordination and external review.
-Teams know most about the change and the state of the system is best known when the change is commissioned. The change does not get better with more review and more time.
-This is something we will be working on next.
-That’s it. Thanks