Slides used for https://www.devopsdays.org/events/2017-toronto/program/andreas-grabner/
In 2011 we delivered 2 major releases of our on-premise enterprise software. Market, technology, and customer requirements forced us to change that in order to remain competitive.
Now – in 2017 – we are delivering feature releases every 2 weeks for both our on-premise and SaaS-based offerings. We deploy 170 SaaS production changes per day and have a DevOps pipeline that allows us to deploy a code change within 1h if necessary.
To increase quality, we built and provide a DevOps pipeline that currently executes 31,000 Unit & Integration Tests per hour as well as 60h of UI Tests per build. Our application teams are responsible end-to-end for their features and use production monitoring to validate their deployments, which allows them to find 93% of production bugs before they impact our end users.
In this session I explain how this transformation worked both "top down" and "bottom up" in our organization. A key component was the 4-person DevOps Team who developed and "sell" their DevOps Pipeline to the globally distributed application teams. I will give insights into how our pipeline enables application teams to design, code, test, and run a new feature for our user base.
I will also talk about the "dark moments", as change is never without friction – both internally and with our customers, who also had to get used to more rapid changes.
14. Lesson #1: Velocity uncovers new bottlenecks!
• Going from 6 to 1 Month Cycles
• Offered to: On-Premise Customers + SaaS-Deployments
• Challenge: 1GB Monolithic Download
• Impact: Error-prone updates
• Solution: Componentize; Automate Rollout/Rollback Capability; A/B Rollout Model (see the sketch below)
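A minimal sketch of the automated rollout/rollback idea from this lesson: widen a component rollout stage by stage and roll back on the first bad error-rate signal. The function names, stages, and threshold are illustrative assumptions, not the actual Dynatrace tooling.

```python
import random

ROLLOUT_STAGES = [0.05, 0.25, 1.00]   # fraction of installations per stage
ERROR_THRESHOLD = 0.02                # max tolerated error rate per stage

def check_error_rate(component: str, version: str) -> float:
    # Placeholder: in reality this would query production monitoring.
    return random.uniform(0.0, 0.04)

def staged_rollout(component: str, version: str) -> bool:
    """Widen the rollout stage by stage; roll back on the first bad signal."""
    for fraction in ROLLOUT_STAGES:
        print(f"deploying {component} {version} to {fraction:.0%} of installations")
        rate = check_error_rate(component, version)
        if rate > ERROR_THRESHOLD:
            print(f"error rate {rate:.1%} above threshold -> rolling back")
            return False
        print(f"error rate {rate:.1%} OK, widening rollout")
    return True

staged_rollout("server-component", "v7.2")
```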
15. Lesson #2: Need to Increase Sprint Quality
• Sprint Reviews Done on "dynaSprint"
• Daily Builds get deployed on "dynaDay". Sprint Builds to "dynaSprint"
• If you can only show it "on your dev machine", it's NOT DONE!
• Deploy Sprint Builds into our internal Production Environment
• We monitor Website, Support, Licensing, Community ... With Dynatrace
• If we break our own back office software we ALL feel the pain right away
17. Lesson #3: Essential End User Feedback Loop
• Which Features to Optimize? Which Features to "Phase Out"? (see the sketch below)
• Allows Reducing Technical and Business Debt
• Allowed us to “Call Out Sales!” for requested features nobody used!
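To illustrate this feedback loop, here is a small, hypothetical Python sketch that ranks features by observed usage and flags rarely used ones as phase-out candidates. The event data and cutoff are invented for the example.

```python
from collections import Counter

# Invented usage events; in practice these come from production monitoring.
usage_events = [
    "dashboarding", "dashboarding", "alerting",
    "dashboarding", "alerting", "pdf_export",
]

def phase_out_candidates(events, cutoff):
    """Return features whose share of total usage is below the cutoff."""
    counts = Counter(events)
    total = sum(counts.values())
    return [feature for feature, n in counts.items() if n / total < cutoff]

print(phase_out_candidates(usage_events, cutoff=0.20))   # -> ['pdf_export']
```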
18. Lesson #4: Automated Error Analysis
• Birth of "ARCHIE", our "Automated Log Archive Analyzer", integrated with JIRA (sketched below)
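ARCHIE itself is internal, but the underlying pattern can be sketched: scan archived logs for recurring error signatures and file a JIRA bug for each noisy one. The JIRA REST endpoint (/rest/api/2/issue) is real; the log path, regex, project key, credentials, and noise cutoff below are placeholders.

```python
import re
from collections import Counter
import requests

ERROR_PATTERN = re.compile(r"ERROR\s+\[?([\w.]+)")   # e.g. "ERROR [com.foo.Bar]"

def error_signatures(log_lines):
    """Count occurrences of each error source in the log."""
    return Counter(m.group(1) for line in log_lines
                   if (m := ERROR_PATTERN.search(line)))

def file_jira_bug(base_url, auth, signature, count):
    payload = {"fields": {
        "project": {"key": "ARCH"},                      # placeholder project key
        "summary": f"Recurring error in {signature} ({count} hits)",
        "description": "Auto-filed by log archive analysis.",
        "issuetype": {"name": "Bug"},
    }}
    requests.post(f"{base_url}/rest/api/2/issue", json=payload,
                  auth=auth, timeout=10).raise_for_status()

with open("archive/server.log") as f:                    # placeholder log archive
    for sig, n in error_signatures(f).most_common():
        if n >= 10:                                      # arbitrary noise cutoff
            file_jira_bug("https://jira.example.com", ("bot", "token"), sig, n)
```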
19. Lesson #5: We started to understand “The Cloud”
• What Cloud Services to use for which tasks!
• It SCALES but it AIN’T CHEAP if you make a mistake!
• 4x $$$ to IaaS
20. Step #3: Incubation on New Stack
Keep Innovating on Enterprise Stack
Incubate “Start Up” on New Stack
21. Redefining the DevOps Team's Role
• Acting as engineers & production managers for Dynatrace Managed/SaaS
• Orchestration Layer
• Dynatrace Pipeline Visualization: Deployment Timeline, Log Overview (using the Dynatrace Log API), JIRA Integrations
22. Monitoring as Pipeline & Platform Feature
[Pipeline diagram spanning Dev, Perf/Test, Ops, and Biz: Unit Perf → Cont. Perf → New Deploy → New Capability, connected through CI and CD with Remove/Promote, Triage/Optimize, Update Tests, and Innovate/Design feedback loops]
• Faster Innovation with Quality Gates
• Faster Acting on Feedback
• Lower Costs ($$$), Happy Users
23. Lesson #6: Pipeline quality + 10 min Builds
https://github.com/Dynatrace/ufo
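The UFO is the physical build light we use to visualize pipeline quality. Below is a hedged sketch of driving it from a build job; the query-parameter style (top_init, top=<from>|<count>|<rrggbb>) is an assumption modeled on the firmware's HTTP interface, so check the repo above for the current API.

```python
import requests

UFO_HOST = "http://ufo.example.local"   # placeholder hostname

COLORS = {"passed": "00ff00", "failed": "ff0000", "running": "0000ff"}

def show_build_status(pipeline_ok: bool, build_running: bool) -> None:
    status = "running" if build_running else ("passed" if pipeline_ok else "failed")
    # Light the whole top LED ring in one color (assumed parameter syntax).
    requests.get(f"{UFO_HOST}/api",
                 params={"top_init": "1", "top": f"0|15|{COLORS[status]}"},
                 timeout=5)

show_build_status(pipeline_ok=True, build_running=False)   # green ring
```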
25. Dev: Shift-Left - Architectural Regression Decisions
• Capturing Application Metrics, not just Functional Passed/Failed:
+ # of Images, # of JS, Load Time …
+ # of SQL, # of Logs, # of API Calls, # of Exceptions …
• 31k Unit/Int-Tests / hour
• 60h UI-Tests / Build
26. Dev: Shift-Left - Architectural Regression Decisions
• Baseline Every Metric of every Test, Stop the Pipeline Early on Regression! (see the sketch below)
https://github.com/Dynatrace/ufo
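A minimal sketch of that quality gate, assuming the per-test metrics have already been captured: compare every metric of every test against its baseline and exit non-zero so CI stops the pipeline early. The metric names, values, and tolerance are illustrative, not a Dynatrace API.

```python
import sys

baseline = {"search.test": {"sql_count": 3, "api_calls": 2, "exceptions": 0}}
current  = {"search.test": {"sql_count": 48, "api_calls": 2, "exceptions": 0}}

TOLERANCE = 0.20   # allow 20% growth before flagging a regression

def regressions(baseline, current):
    """Yield every (test, metric, old, new) tuple that exceeds its baseline."""
    for test, metrics in current.items():
        for metric, value in metrics.items():
            allowed = baseline[test][metric] * (1 + TOLERANCE)
            if value > allowed:
                yield test, metric, baseline[test][metric], value

failed = list(regressions(baseline, current))
for test, metric, old, new in failed:
    print(f"REGRESSION {test}: {metric} went from {old} to {new}")
if failed:
    sys.exit(1)   # non-zero exit stops the CI job early
```

The classic catch here is an N+1 query pattern: the test still passes functionally, but sql_count jumping from 3 to 48 stops the pipeline before the change reaches production.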
27. Perf / Test: Continuous Performance Validation
• Comparing the "Performance Signature" for Build Nov 16 with the "Performance Signature" for Build Nov 17 (see the sketch below)
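For illustration, a toy version of such a comparison: treat the signature as per-service response-time medians under identical load and flag services that degraded beyond a limit. The numbers are invented; the real signature is derived from Dynatrace monitoring data.

```python
nov16 = {"/login": 120, "/search": 240, "/checkout": 310}   # median ms per service
nov17 = {"/login": 125, "/search": 610, "/checkout": 305}

def compare_signatures(old, new, max_slowdown=1.25):
    """Return services whose median response time degraded beyond the limit."""
    return {svc: (old[svc], new[svc])
            for svc in old
            if new[svc] > old[svc] * max_slowdown}

print(compare_signatures(nov16, nov17))   # -> {'/search': (240, 610)}
```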
37. Step #4: Bringing it back together
Merging Development Teams
Applying things "that work" on each other's side!
38. Dynatrace Transformation by the numbers
• 26 Feature Releases / Year
• 500 Deployments / Day
• More Quality: 31,000 Unit & Integration Tests / hour, 60h UI Tests / Build
• More Agile: ~120 Code commits / day, 340 Stories / sprint
• More Stability: 93% of Production bugs found by Dev, 450 Global EC2 Instances, 99.998% Global Availability
39. Raffle for DevOps Handbook + Echo
Tweet Creative Ideas for UFO Usage to @grabnerandi
http://www.dynatrace.com
40. From 6 Months Waterfall to 1h Code Deploys
“It was a long journey!”
Andreas Grabner - May 2017
@grabnerandi
THANKS
Editor's Notes
My analogy for Waterfall:
Putting many features into a single release
Ship it to some other entity who does quality control
Final product comes back very late -> hard to remember which features / photos we created. Often we realize it's not what we wanted
This is the new way of delivering software: Continuously – with small batch updates
I use the analogy of how my girlfriend takes pictures:
One at a time
Quality Control and Optimization are in her own hands thanks to software that is "part of the delivery chain" (the photo app)
She also controls what to push into production -> post it on Instagram / Facebook
She wants to make her users (friends & family) happy – she is hoping for LIKES!
If she gets dislikes she can remove an image
If she gets comments she can take another picture and deploy it within seconds -> that is Continuous User Driven Innovation
This is where we were in 2011
Our CTO had a vision and pressure from customers and the market. He set the goal to be able to do 1h Code Deploys
The biggest challenge was the cultural change – but our CTO always believed in the Mission Impossible
Here is what we have achieved since then before I go into details about how we got there
We tried to take our OnPremise Product and “Lift & Shift” it to a SaaS Model. Using our existing NOC Team
We learned that we had a lot of processes in place that made frequent updates very painful -> Change Request Meetings every week …
Biggest challenge with that is that no developer wanted to take responsibility for Operations
We ran our own startup within our Company
Our DevOps Team – initially 7 people, now only 3 – is
Responsible for "The Delivery Pipeline and the DevOps Tool Chain"
Their Customers: The different Dev Teams that want to push features through the pipeline into production
Our Own Transformation + what we hear from customers and the market tells us:
EVERYONE WANTS to CHANGE – but the biggest challenge is Org / Culture, not Technology
More Resources
DevOps Webinar with Bernd Greifeneder (CTO): https://info.dynatrace.com/apm_dtm_ops_17q3_wc_from_enterprise_tocloud_native_na_registration.html
DevOps Webinar with Anita Engleder (DevOps Manager): https://info.dynatrace.com/17q3_wc_from_agile_to_cloudy_devops_na_registration.html
Key Lessons Learned: Raise the awareness of quality and the impact of each individual developer on the bottom line -> which is quality in production
“Eat our own dogfood” aka “Drink our own Champagne” -> we install sprint builds into our internal systems
Visualize Build and Pipeline Quality via UFOs
Make Devs Look into production as well
Even if the deployment seemed good because all features work and response time is the same as before: if your resource consumption goes up like this, the deployment is NOT GOOD, as you are now paying a lot of money for that extra compute power
Dynatrace can look at key resource, performance, scalability, and architectural metrics and trend them from build to build. If Dynatrace detects a regression, it can notify the build pipeline (Jenkins, Bamboo, TFS, …) that the current code change should not be promoted to the next phase
Screenshot from Dynatrace AppMon
Continuous Performance Testing or Continuous Performance Validation is a good pipeline phase to have before deploying into a production environment. It is an environment running under continuous load. New builds of individual services or complete applications get deployed on a regular basis. The question is whether a new version of a service, application, or component shows any degradation in performance, scalability, or resource consumption. If so, it should not be promoted to the next phase before closer examination
Dynatrace automatically understands applications and, more importantly, services. Dynatrace also integrates with testing tools so that traffic on certain services can be associated with certain test scenarios you run in your continuous performance environment. Based on this information it is possible to see any regressions between builds or different loads. In the example above it is easy to spot that the build from Nov 17 shows a significant performance regression. Instead of allowing this build into production, it is better to look into the differences between Build Nov 16 and Build Nov 17
After a deployment we see an issue with network connectivity and CPU utilization – impacting our end users
Dynatrace not only detects that issue but shows us the complete problem evolution path which allows us to then see which change actually caused that issue to happen and how to remediate it!
The next slides show a scenario that happened in our organization. This dashboard is used by our marketing and business teams to see how well frequented our website is (total numbers in top chart), how user experience plays out (top chart with green/yellow/red) and how many people sign up for our free trial offering (conversion rate)
May 1st was a push of a new release and a marketing campaign started that promoted these features and tried to get people to sign up
Seems everything was working as expected
Day 2 started well, but we also saw that slower web site performance (due to the heavy load) was impacting our end user experience and our conversion rate
The Dev Team provided a hotfix to make the sign-up form faster
#1: It got deployed around noon
#2: The fix had a negative impact: it broke the whole website due to a JavaScript problem on certain browsers
#3: The problem was immediately visible to both business (drop in conversion) and dev (they looked at the reported JavaScript problems and user experience)
Due to the fast feedback from Production the Dev Team immediately fixed that regression – bringing the system back to where they wanted it to be in the first place
We learned that we need to have self-service in our pipeline: intuitive dashboards, ChatOps, and VoiceOps that allow developers to react proactively to feedback from the pipeline (see the sketch below)
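As one concrete example of such ChatOps feedback (not our actual implementation), a pipeline stage could push its result into a chat channel via a Slack incoming webhook; the webhook URL below is a placeholder.

```python
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify(stage: str, build: str, ok: bool) -> None:
    """Post a one-line pipeline status message into the team channel."""
    text = f"{'PASSED' if ok else 'FAILED'}: {stage} for build {build}"
    requests.post(WEBHOOK_URL, json={"text": text}, timeout=5).raise_for_status()

notify("Continuous Performance Validation", "nov-17", ok=False)
```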
Number of EC2 Instances: exact number on 15 May 2017: 439 instances in dev+sprint+prod (157 in prod) – prod instances increased by ~25. In parallel we reduced instances in dev and sprint for cost-saving purposes. So in sum we still have ~450 instances, as we already had at the end of 2016.
Deployments per working day: 5680 SaaS and 2410 Managed deployments within 28 days (= 20 working days): 8090/20 = 404 deployments/working day; 404/8h ≈ 50 deployments/working hour; 50/60min ≈ 0.8 deployments per minute = a new deployment roughly every 75 seconds
Prod-found bugs by dev: 1.1.2017 – 15.5.2017 = 92.52% ≈ 93%
Global Availability:
1.1.2017 – 15.5.2017 = 100%
1.1.2016 – 15.5.2017 = 99.9956%
Commits/Day: Since the move to Git, commits per day went down from 200 to about 120. The reason is that devs first collect commits on their private branch before pushing to master.