Who Am I
http://pascoal.net
DevOps at Microsoft
Data: Internal Microsoft engineering system activity, August 2018
372k
Pull Requests per
month
2m
Git commits per month
78,000
Deployments per day
4.4m
Builds per month
500m
Test executions per day
500k
Work items updated
per day
5m
Work items viewed per
day
Azure DevOps is the toolchain of choice for Microsoft engineering with over 90,000 internal users
https://aka.ms/DevOpsAtMicrosoft
3,500
The Developer Division at Microsoft
800
The Azure DevOps team… spread out across 40 feature teams
3 weeks
Team Foundation Server (TFS)
Azure DevOps (formerly VSTS)
We are delivering value to customers at an
increased velocity.
• More features in the 2016 calendar year (262 features)…
• Than the previous 4 years combined (256 features).
• 364 features in the 2017 calendar year!
https://www.visualstudio.com/en-us/articles/news/features-timeline
Features delivered per year
2012: 22
2013: 58
2014: 65
2015: 111
2016: 249
2017: 364
2018: 97
Sprint 1
Aug 2010
VSTS Preview
Sprint 29
Jun 2012
VSTS GA
Sprint 64
Apr 2014
1ES
Sprint 67
Jun 2014
GVFS
Sprint 102
Jun 2016
Sprint 136
Jun 2018
What did it look
like before?
2 years
Planning M1 M2
Planning M1 M2
Specs
We knew exactly what to build…
and we knew it was right!
Photo by Jose Antonio Gallego Vázquez on Unsp
Code Test & Stabilize
Code
Complete
We wrote all the code months before
we shipped.
Planning M1 M2
We had a perfect schedule and knew
exactly when it would be ready!
Planning
Customer feedback – we should
change the way a feature works. We
didn’t get it quite right…
… but we’re booked solid already.
M1
“Great feedback. Thanks! We’ll take a
look in planning for the next release. We
should get it to you….
in a few years.”
“Culture eats strategy for breakfast.”
Peter Drucker
Cross discipline
10-12 people
Self managing
Clear charter and goals
Intact for 12-18 months
Physical team rooms
Own features in production
Own deployment of features
Employee choice, not
manager driven
Typically <20%
change, but 100% get
to make a choice
Cross-pollinate talent
and micro-culture
Sticky Note Exercise - Self Forming Teams
“We started off trying to set up a
small anarchist community, but
people wouldn't obey the rules.”
Alan Bennett
Let’s try to give our teams three things….
Autonomy, Mastery, and Purpose.
Intrinsic
vs
extrinsic motivators
https://www.youtube.com/watch?v=u6XAPnuFjJc
Autonomy
Alignment
Too much
alignment
Too much
autonomy
Alignment
Autonomy
Autonomy
Alignment
Methodology?
How Teams work?
“A customer can have a car
painted any color he wants
as long as it’s black”
Henry Ford
Autonomy
Sprint 134, Sprint 135, Sprint 136 (each: Week 1, Week 2, Week 3)
S1 S2 S3 S4 S5 Stabilization S6
A
B
“Let’s do this Agile thing… but we should probably
reserve some time to stabilize things.”
Seemed like a good idea at the time…
(famous last words)
Code Test & Stabilize Code Test & Stabilize
Code
Complete
Planning
We all follow a simple rule we call the “Bug Cap”:
# engineers on your team x 5 = ?
Rule: If your bug count exceeds your bug cap… stop working
on new features until you’re back under the cap.
Example: 10 x 5 = 50
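The bug cap arithmetic above can be sketched in a few lines. This is a minimal illustration, not the team's actual tooling: the function names `bug_cap` and `can_work_on_features` are hypothetical, and only the multiplier of 5 and the "stop when you exceed the cap" rule come from the slide.

```python
BUGS_PER_ENGINEER = 5  # multiplier shown on the slide


def bug_cap(engineers: int, per_engineer: int = BUGS_PER_ENGINEER) -> int:
    """Bug cap for a team: number of engineers times the per-engineer allowance."""
    return engineers * per_engineer


def can_work_on_features(open_bugs: int, engineers: int) -> bool:
    """Feature work continues only while the bug count does not exceed the cap."""
    return open_bugs <= bug_cap(engineers)


if __name__ == "__main__":
    print(bug_cap(10))                   # 10 engineers x 5 = 50
    print(can_work_on_features(50, 10))  # at the cap: True
    print(can_work_on_features(51, 10))  # over the cap: False
```

A team sitting exactly at its cap has not exceeded it, so the sketch treats that case as still free to work on features.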
What we track
Live Site Health/Debt
Time to Detect, Time To Mitigate
Incident prevention items
Aging live site problems
Customer support metrics (SLA, MPI, top
drivers)
Engineering Health/Debt
Bug cap per engineer
Aging bugs in important categories
Pass rate & coverage
Velocity
Time to build
Time to self test
Time to deploy
Time to learn (Telemetry pipe)
Things we don’t watch
• Team burndown
• Team velocity
• Original estimate
• Completed hours
• Team capacity
• # of bugs found
It is more about impact than activity
3 weeks
Spring Fall Spring Fall
Sprint 134, Sprint 135, Sprint 136 (each: Week 1, Week 2, Week 3)
At the end of a sprint, all teams send a “sprint mail” … communicating what they’ve
accomplished in the sprint, and what they’re planning to accomplish in the next sprint.
Value delivered
during the sprint
Video demonstrating
the value
What the team is
planning to accomplish
in the next sprint
6 month plan
Each team comes in and reviews with leadership three things:
1. What is the plan for the next 3 sprints?
2. Is the team healthy?
3. Any risks or issues to highlight?
• Storyboard of the customer
experience
• High level execution plan – sprints,
not hours
• Feedback, feedback, feedback
“Plans are worthless, but planning is
everything.”
Dwight Eisenhower
Strategy: 12 months
Season: 6 months
Plan: 3 sprints
Sprint: 3 weeks
Teams are responsible for the detail
Leadership is responsible
for the big picture
Strategy
Features
Stories
Tasks
Alignment
The big picture in light of our
business goals
Autonomy
The detail about what we’ll deliver
to achieve our business goals
Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4
Strategy
FY18
Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4
6 month plan
FY18 H1
Strategy
Day in the
Life of an
Engineer
Photo by Goh Rhy Yan on Unsplash
Master
Week 1 Week 2 Week 3
Sprint Previous Sprint Next
175 commits/day
into Master
Release: Current Sprint x
Release: Sprint Previous x
https://aka.ms/releaseflow
Policies to keep master
branch healthy (green)
• Required reviewers
• Build must pass
• Security plugins
• (opt-in) Run functional tests in the cloud
Fast and reliable signals
All unit tests (L0/L1) run
in Pull Request
CI runs functional (L2) test
suites
Test reliability is actively
managed
Tests are trusted
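The split between pull request and CI test runs described above could be modelled roughly as follows. This is a sketch under stated assumptions: the `Level` enum, the `PIPELINE_LEVELS` mapping, `select_tests`, and the test names are all hypothetical, and only the L0/L1-in-PR and L2-in-CI assignment comes from the slides.

```python
from enum import Enum


class Level(Enum):
    L0 = "L0"  # unit tests
    L1 = "L1"  # unit tests (the speaker notes mention SQL unit tests at this level)
    L2 = "L2"  # functional tests


# Which test levels each pipeline runs, as stated on the slides.
PIPELINE_LEVELS = {
    "pull_request": {Level.L0, Level.L1},
    "ci": {Level.L2},
}


def select_tests(tests, pipeline):
    """Return the names of the tests whose level is run by the given pipeline."""
    wanted = PIPELINE_LEVELS[pipeline]
    return [name for name, level in tests if level in wanted]


# Hypothetical test inventory, for illustration only.
TESTS = [
    ("test_parse_query", Level.L0),
    ("test_sql_upgrade", Level.L1),
    ("test_create_work_item_end_to_end", Level.L2),
]

if __name__ == "__main__":
    print(select_tests(TESTS, "pull_request"))
    print(select_tests(TESTS, "ci"))
```

Keeping the level on each test, rather than in separate suites, is one way to make the lowest-level-possible principle visible in review.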
Quality ownership
Photo by Sebastian Grochowicz on Unspla
Program Management Dev Test
Program Management Engineering
Program Management is responsible for:
WHAT we’re building, and
WHY we’re building it
Engineering is responsible for
HOW we’re building it, and that
we’re building it with QUALITY
Master is
always
shippable
Over 22 hours for nightly run and 2 days for the full run
Only ~60% of P0 runs passed 100%; Each NAR suite had many
failures
Test failure analysis was too costly
Took days to sift through failures before deployment could start
Tests should be written at the lowest level
possible
Write once, run anywhere including
production system
Product is designed for testability
Test code is product code, only reliable tests
survive
Testing infrastructure is a shared Service
Test ownership follows product ownership
Shared Platform Services (SPS): North Central
TFS SU1: North Central
TFS SU7: Australia
TFS SU0: West Central
Components shown: AT (x3), JA (x3), Blob
Containerized Services
• All code is deployed, but feature flags control exposure
– Reduces integration debt
• Flags provide runtime control down to individual user
• Users can be added or removed with no redeployment
• Mechanism for progressive experimentation & refinement
• Enables dark launch
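The per-user runtime control described above can be illustrated with a small in-memory sketch. Everything named here (`FeatureFlag`, `FlagStore`, the flag name, and the user IDs) is hypothetical and is not the Azure DevOps implementation; the point is only that exposure changes by editing flag state, not by redeploying code.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    name: str
    enabled_for_all: bool = False
    users: set = field(default_factory=set)  # individually enabled user IDs

    def is_enabled(self, user_id: str) -> bool:
        return self.enabled_for_all or user_id in self.users


class FlagStore:
    """In-memory flag state; a real system would persist and distribute this."""

    def __init__(self):
        self._flags = {}

    def register(self, name: str) -> FeatureFlag:
        # New flags start off for everyone: the code is deployed but dark.
        return self._flags.setdefault(name, FeatureFlag(name))

    def add_user(self, name: str, user_id: str) -> None:
        self._flags[name].users.add(user_id)

    def remove_user(self, name: str, user_id: str) -> None:
        self._flags[name].users.discard(user_id)

    def is_enabled(self, name: str, user_id: str) -> bool:
        flag = self._flags.get(name)
        return flag is not None and flag.is_enabled(user_id)


if __name__ == "__main__":
    store = FlagStore()
    store.register("new-dashboard")
    store.add_user("new-dashboard", "alice")
    print(store.is_enabled("new-dashboard", "alice"))  # True
    print(store.is_enabled("new-dashboard", "bob"))    # False
```

Unknown flags evaluate to off in this sketch, which keeps unfinished code paths dark by default.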
Application Insights
Analytics (Project Kusto)
for
• text search and queries over
structured and semi-structured
data
• high volume ingestion
• fast queries over very large data
sets
• Double blind test
• Full disclosure at or near end
vs.
• Share tactics & lessons learned
• Continued evolution
Assume Breach: use War Games to learn the attacks and practice response
3-week sprints
Vertical teams
Team rooms
Continual Planning & Learning
PM & Engineering
Continual customer engagement
Everyone in master
8-12 person teams
Publicly shared roadmap
Zero debt
Specs in PPT
Open source
Flattened organization hierarchy
User satisfaction determines success
Features shipped every sprint
4-6 month milestones
Horizontal teams
Personal offices
Long planning cycles
PM, Dev, Test
Yearly customer engagement
Feature branches
20+ person teams
Secret roadmap
Bug debt
100 page spec documents
Private repositories
Deep organizational hierarchy
Success is a measure of install numbers
Features shipped once a year
How MS Does Devops - DevOps Days Berlin 2018

Editor's Notes

  • #4 Enterprise scale
  • #5 We use what we sell & we sell what we use!
  • #8 Moving to a 3-week sprint cadence has allowed us to deliver many more features, much more often. As you can see in the chart, we delivered more features in 2016 than we did in the preceding 4 years combined. Last year we delivered a staggering 364 features. If you’d like to see a list of these features, you can visit the features timeline using the URL shown. Open the features timeline, choose one of the sprint release notes on the left, and scroll to the bottom: you’ll see different names from sprint to sprint. That's because different people author those release notes. To be honest, they don't write all those words; the words come from the program managers on our team, who write them for their features. We nominate one person per sprint to do the wordsmithing, make sure it's in a consistent tone and style, and put their name behind it, because we believe in it and it’s real. The timeline is a great way to see that we actually do ship every 3 weeks. It goes all the way back in time, and it's a very powerful story to be able to look back and see what we were talking about and what the focus of our work was at that time.
  • #29 As previously mentioned, we consistently run 3-week sprints. In the slide, you can see an extra week tagged on the end of each sprint. This is not a fourth week of the sprint. It is when the value from that sprint is deployed, which happens in parallel with the next sprint. Our deployments take, from beginning to end, probably a week or two to get through all scale units nowadays. The key point is that this is not a period of testing; it is the rollout into production environments. We're in Sprint 136 right now, and in parallel Sprint 135 is being deployed.
  • #30 When we first started our own agile transformation 4 years ago….
  • #37 One of the activities that helps us stay connected is that every single team sends a mail at or around the sprint boundary, usually the week after the sprint. The mail covers what the team has accomplished in the previous sprint and what they're going to do in the next sprint.
  • #40 Need an example / screen
  • #54 (process tax) Take a look at our branch structure back in Dev12. High process tax. Engineers work at the leaves, interior nodes are for integration, green ones ship. Collaboration across branches is difficult. Promotes big-bang style integrations. If it is that bad, why do it? Survival. Isolation. Story: Buck’s story of a long-lived branch with 75 people and many FIs before RI: crazy. 2010, what did we do? ----------- Traditional way of working: deep branch hierarchy and significant engineering costs to code flow. We had a full-time engineer to push code through structures like this. Reading this… focus on the ALM node on the left side… We (VSTS) were relatively lucky to work in ALM; very early TFS releases were out of the deeper structure on the right, the rest of DevDiv where VS shipped. The FTeam node under ALM means that WIT and TFVC work in different branches. Builds on floor. “Others” break the product, quality gates don’t catch it, can’t work. Another problem: engineering teams have little direct control over execution. A form of throw-it-over-the-wall to get it into a release vehicle: someone else’s problem, mostly, but we’re accountable == frustration. Ex: ALM and VS crashing together in an integration branch; weeks to work through events like that, with a full-time engineer.
  • #55 (process tax reduction) (feature engineer: create a short-lived topic branch off of master) This is where we’re at today: 175 commits/day into Master… build breaks perhaps 1 / month. Short-lived release branches. Many people, large tree, flat branch structure… how? Shift-left. Controlling build breaks: frequent small check-ins; the PR workflow helped here. Controlling product breaks: shift-left quality journey. The move to Git in Spring ’14 helped in a few different ways. PR workflow: first-class support for build validation. First-class cherry-pick workflow: easier to cherry-pick and put it where it belongs than to merge code you didn’t write. Git allows for “powerful local experimentation”: the idea that local branches are empowering. ----- Did not change overnight: several versions, including push while we were still under TFVC.
  • #56 (feature engineer: ready to deliver, create a PR) Centrally managed policies applied: build and functional quality gate. Offload the functional test burden to the PR pipeline to free up the box. ---- Before: build breakages (build policy helps; freshness). Engineers couldn’t trust quality signals in the past: high friction in running tests before check-in. Tests took too long in the past; our longest-running suite was 22 hours (it is now <2 hours to run all tests!). Don’t want to hog my machine to run tests (tests run in parallel in the cloud).
  • #57 Let’s see a build running in pull request validation. It builds the full source tree and runs all unit tests (60K). All our tests run every time, which helps keep tests healthy and avoids staleness. Lightning-fast tests and a highly parallelized test system help in getting signals very fast. Munil will cover details about the design principles to achieve speed. L1 are unit tests for SQL code; we run them within containers for isolation and to keep the host machine clean for subsequent builds. Each change goes through this validation, and this helps in keeping master green.
  • #58 Optional tests are no longer optional once we commit to master... CI runs the battery of tests to assess quality. Previously it took 22 hours and we had an army of test engineers to analyze the results. Speed is critical for the continuous deployment goal; today it takes < 2 hours, and when a mail comes, I take action because I trust the quality signal. --------------------- Test code is product code and follows the culture of rich telemetry. All test environments and production are hooked up to the same telemetry repository. There is a single way of analyzing production issues and test failures. We don’t encourage engineers to connect to machines and use a debugger; rather, improve telemetry and instrumentation to diagnose test failures.
  • #68 VMs: PaaS web and worker roles, and moving to containers. App tiers: serve web UI, web service endpoints. Job agents: background processing like scheduled builds, clean up, commit processing, etc. DB: only metadata in SQL Azure, multi-tenant. Blob: file data in Azure Storage. SU1 was the first and only scale unit originally… no incremental rollout when there’s only one! Then SPS. Then SU0. Then more scale units in the US and around the world. Organized in deployment rings. A health check runs after each ring is deployed. Today we have four rings, with outer rings having multiple scale units in them. Each service has scale units organized in rings.
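
The ring-by-ring rollout in note #68 can be sketched as a loop that deploys one ring, runs a health check, and stops if the check fails. This is a minimal sketch, not the actual deployment system: `deploy_in_rings`, the callbacks, and the assignment of scale units to rings are assumptions for illustration (the notes say only that there are four rings and that outer rings hold multiple scale units).

```python
from typing import Callable, List, Sequence


def deploy_in_rings(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy ring by ring; stop before the next ring if any unit is unhealthy.

    Returns the scale units that were deployed, in order.
    """
    deployed = []
    for ring in rings:
        for unit in ring:
            deploy(unit)
            deployed.append(unit)
        # Health check runs after each ring is deployed.
        if not all(healthy(unit) for unit in ring):
            break
    return deployed


# Hypothetical ring layout: four rings, outer rings with several scale units.
RINGS = [["SU0"], ["SU1"], ["SU2", "SU3"], ["SU4", "SU5", "SU7"]]

if __name__ == "__main__":
    log = []
    result = deploy_in_rings(RINGS, deploy=log.append, healthy=lambda u: u != "SU1")
    print(result)  # rollout stops after the ring containing the unhealthy SU1
```

Stopping at the first unhealthy ring limits exposure to the inner rings, which is the usual motivation for ordering rings from smallest to largest.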