Destruction, Decapods and Doughnuts: Continuous Delivery for Audio & Video Factory

As presented at the BBC Digital Open Day, London, 2015-04-27.

In 2012 we committed to killing the system that processed all the video for BBC iPlayer, because it was unsustainable. We gave ourselves 12 months to build a complete replacement. And along the way, we had to re-learn how to develop, test, deploy and support software.

This is the story of how we adapted to Continuous Delivery, and a plea for you to adapt Continuous Delivery to suit *your* product.


  1. 1. BBC Digital Platform Media Services Rachel Evans rachel.evans@bbc.co.uk Destruction, Decapods, and Doughnuts @rvedotrc Continuous Delivery for Audio & Video Factory
  2. 2. ☹ A few years ago, the system that handled video publication for iPlayer was unreliable. Programmes were often missing, or published late. What to do? We committed to killing it. We committed so much, we deliberately declined to renew the third-party contract which iPlayer relied upon. The system’s fate was sealed: on 1st October 2013, it would stop working.
  3. 3. All we had to do was build a completely new replacement system, and we had a little over a year in which to do it. The trouble was, it was a half-million line codebase, and we were in the habit of only releasing two or three times a year, and every time we released, things broke.
  4. 4. This was the situation we had deliberately, knowingly placed ourselves in. We had just over a year not only to build a complete replacement, but to re-learn how to develop, test, release, and support software.
  5. 5. “Publish all BBC AV media produced for IP platforms” My name’s Rachel Evans, I’m a Principal Software Engineer in Media Services, part of the BBC Digital Platform. Our mission is to “Publish all BBC Audio/Video media produced for IP Platforms”. So if you’ve ever watched a BBC News clip online, or listened to a BBC podcast, or if you watched the Olympics online in 2012, or watched iPlayer, either live or on- demand, or listened to iPlayer Radio – if you’ve done any of those things, then you’ve used our products.
  6. 6. This, then, is my team’s story: the story of how we changed the way we make software, and how that enabled us to successfully launch Video Factory and Audio Factory – the media processing systems that now power iPlayer. And what it all has to do with a toy crab who lives in a silver trophy.
  7. 7. Media Services: a history Part I I joined the BBC in 2007, and I’ve been with Media Services since 2009, but this story really starts in Summer 2010.
  8. 8. Summer 2010 This was basically the low point for us as a team – when we were at our least effective.
  9. 9. Audio On Demand This session is about Testing… Our Audio-On-Demand codebase had zero unit tests. Not one. Which, if you know anything about software engineering, you'll know is a bad thing. We hadn't defined what this product was meant to do. Every time we made a change, we had no way of knowing whether or not it had worked, because we hadn't defined what "worked" meant.
  10. 10. Absolutely no automated tests Audio On Demand This session is about Testing… Our Audio-On-Demand codebase had zero unit tests. Not one. Which, if you know anything about software engineering, you'll know is a bad thing. We hadn't defined what this product was meant to do. Every time we made a change, we had no way of knowing whether or not it had worked, because we hadn't defined what "worked" meant.
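To make "defining what 'worked' means" concrete: a unit test is, in effect, an executable definition of one small piece of "worked". The deck doesn't say what language the Audio On Demand codebase was written in, so the following is only a minimal illustrative sketch in Python; the function and the retention rules in it are invented for the example, not the real product's behaviour.

```python
# Hypothetical sketch: a single unit test acts as an executable
# definition of "worked" for one behaviour of an audio-on-demand
# publisher. All names and rules here are invented for illustration.
import unittest


def publish_window_days(channel):
    """Toy stand-in for real publication logic: podcasts stay
    available for 30 days, everything else for 7."""
    return 30 if channel == "podcast" else 7


class PublishWindowTest(unittest.TestCase):
    def test_podcasts_are_available_for_thirty_days(self):
        self.assertEqual(publish_window_days("podcast"), 30)

    def test_other_channels_default_to_seven_days(self):
        self.assertEqual(publish_window_days("iplayer-radio"), 7)


if __name__ == "__main__":
    unittest.main()
```

With zero tests, every expectation like this lived only in people's heads, so there was nothing to run after a change that could say whether it still held.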
  11. 11. Video On Demand Our video-on-demand product did have unit tests, but they took 90 minutes to run, which is a very long time. Which meant that people got lazy, and didn’t always bother to run the tests. And the code coverage – that is, how much of the product are we actually testing – well, we let it run for 4 days, and it still hadn’t finished, so we killed it. So we had no idea how much of the product we were actually testing.
  12. 12. Not-really-unit tests: 90 minutes Video On Demand Our video-on-demand product did have unit tests, but they took 90 minutes to run, which is a very long time. Which meant that people got lazy, and didn’t always bother to run the tests. And the code coverage – that is, how much of the product are we actually testing – well, we let it run for 4 days, and it still hadn’t finished, so we killed it. So we had no idea how much of the product we were actually testing.
  13. 13. Not-really-unit tests: 90 minutes Code coverage: killed after 4 days Video On Demand Our video-on-demand product did have unit tests, but they took 90 minutes to run, which is a very long time. Which meant that people got lazy, and didn’t always bother to run the tests. And the code coverage – that is, how much of the product are we actually testing – well, we let it run for 4 days, and it still hadn’t finished, so we killed it. So we had no idea how much of the product we were actually testing.
  14. 14. Patch, patch, patch, … We didn’t always build and deploy things cleanly, because building and deploying were slow. So, all too often, we’d apply patches, sometimes directly to the live system. We knew this was a bad thing, a bad habit to get into. But we did it anyway. It was our team’s dirty open secret.
  15. 15. “Patch Club” So we called it “Patch Club”.
  16. 16. The first rule of Patch Club is, you don’t talk about Patch Club. Of course, the first rule of Patch Club is, you don’t talk about Patch Club. Everyone knows that!
  17. 17. Patch Club
  18. 18. Patch Club
  19. 19. Patch Club Err, OK. Better not talk about “Patch Club” in public.
  20. 20. P***h C**b So in team online chats, we started calling it “P***h C**b”. And the joke then became, “What could P…h C..b mean?” One day our colleague Mike brought in two mysterious artefacts…
  21. 21. Pooch Comb Nah. That doesn’t feel right.
  22. 22. P***h C**b
  23. 23. Plush Crab Our team’s resident decapod, and his name is Tyler. Although he’s lovely, Tyler is a symbol of our failure to properly develop and deploy working software.
  24. 24. Plush Crab "Tyler" Our team's resident decapod, and his name is Tyler. Although he's lovely, Tyler is a symbol of our failure to properly develop and deploy working software.
  25. 25. Let’s ship it! Eventually, a few times a year, we decided that we had so many undeployed commits that it was time to release them.
  26. 26. What’s in release 10.6? This is us, in July 2010, trying to work out what’s new in this release that we’re about to deploy. We don’t know for sure: we make our best guess.
  27. 27. simplify bumper, bumper_in (howet03) added drop table statements for repairing fuckup with db backup (howet03) moved view to view file (howet03) fix wmv test and make it stable when config changes via local yml file (howet03) fix transcode bumpers test (howet03) 2057-Console-transcode-task-page-showing-status-as-o (murrac21) bump (howet03) Fix for AutoQCPassed test, from R10.4B (alexb) 2076-Console-version-page-Add-Asset-button (murrac21) https://jira.dev.bbc.co.uk/browse/NEWWORLD-2061 https://jira.dev.bbc.co.uk/ browse/NEWWORLD-2076 Merging asset changes (dbennett) Removed hard-wired TERMs, for those of us not on TERM=xterm (evansd17) Added "db" ops script, and use this in the other scripts (evansd17) MySQL query optimisation for monitor_schedule_item (evansd17) Merge of test fixes and qc domain hack round potential for verified asset records lacking mtimes. (alexb) The change log.
  28. 28. removed (howet03) Changes by Tom H: - support passing plugin name explicitly on the command line - use wfe env vars when connecting to the db (weyt03) WORKFLOWENGINE-83 Delta worklist filtering on profile (evansd17) Give up after 2 days (mary) Self polling set to five minutes to avoid thrashing under heavy asset registration load (dbennett) add index db deltas for asset_file (marcus) swop constraints (howet03) merge from R10.5A.22 (marcus) merge of 10.5A changes into trunk (howet03) add indecies to various ingest_metadata and ingest_task columns (marcus) 148-Seeding-Bug - rework seed creation in scheduler (howet03) To recover from unexpected PIPs outages (mary) Merge of Andy's 209-Fix_MAD_for_3G branch. (alexb) New cut of R10.6 from trunk (including new seeding fixes, and 209 - Fix MAD for 3G) (alexb) The change log.
  29. 29. 1 commit with swearing, 4 commits with no message, 15 commits which talk about “fixing tests” (well, you should have run the tests before you committed huh? But we know that the tests took 90 minutes to run, so it’s not surprising that people got lazy).
  30. 30. swearing: 1 1 commit with swearing, 4 commits with no message, 15 commits which talk about “fixing tests” (well, you should have run the tests before you committed huh? But we know that the tests took 90 minutes to run, so it’s not surprising that people got lazy).
  31. 31. swearing: 1 no message: 4 1 commit with swearing, 4 commits with no message, 15 commits which talk about “fixing tests” (well, you should have run the tests before you committed huh? But we know that the tests took 90 minutes to run, so it’s not surprising that people got lazy).
  32. 32. swearing: 1 no message: 4 /fix.*test/:15 1 commit with swearing, 4 commits with no message, 15 commits which talk about “fixing tests” (well, you should have run the tests before you committed huh? But we know that the tests took 90 minutes to run, so it’s not surprising that people got lazy).
  33. 33. 202 But in total, 202 commits! That's a lot. And we're not even sure that that's right.
  34. 34. But in total, 202 commits! That’s a lot. And we’re not even sure that that’s right.
  35. 35. We have a problem.
  36. 36. We have no idea what we are deploying We have a problem.
  37. 37. No standard deployment procedure Someone goes into a trance-like state and experiments with substances (coffee) and tar files for a day. Different people deployed the product in different ways, giving inconsistent results.
  38. 38. cc by 2.0; And whenever we deploy, something catches fire.
  39. 39. bit.ly/1btCODY Or sometimes, everything catches fire. Or at least that’s what it felt like.
  40. 40. Bad tests Terrible code Slow development Huge, infrequent releases Deployment is slow and unreliable Followed by days repairing the damage Summary. So with all this failure as a team, what did we do?
  41. 41. [various suppliers of doughnuts are available] We rewarded ourselves with doughnuts! Releases took a long time to create, test, deploy, and extinguish, so we must have done a good job, right? Doughnuts.
  42. 42. Delivery = doughnuts! We learnt that delivery == doughnuts.
  43. 43. Autumn 2012 Summer 2012: We're creating a new workflow in the cloud to put iPlayer content onto Sky set-top boxes – the start of Video Factory. Better software, better architecture, better practices. More Jenkins, more BDD, some automated testing. But deployment is still hard, so we're still not deploying often enough.
  44. 44. Continuous Delivery We know what we need to do: smaller releases, more often. Continuous Delivery.
  45. 45. We’d just had the London 2012 Olympics. You could tell, because the branding was everywhere in our building. Just in case we forgot, like.
  46. 46. The Olympics – branding was everywhere.
  47. 47. The Olympics – branding was everywhere.
  48. 48. But then suddenly, the Olympics branding was gone, and it was replaced by this: our Top 5 Priorities, writ large on the very walls. There it was, right in front of us: Priority: Continuous Delivery. But we were scared of this. Every time we deploy, things catch fire. And you want us to deploy more often? Uh, huh.
  49. 49. Continuous Destruction So semi-jokingly in the team, we called it Continuous Destruction. We were scared of this.
  50. 50. Continuous Disaster “Continuous Disaster” was another name we used.
  51. 51. Delivery = doughnuts But then we started to rationalise it. We’ve already learned that delivery == doughnuts, so maybe…
  52. 52. Continuous Doughnuts It’s Continuous Doughnuts.
  53. 53. YES OK. Let's do this.
  54. 54. cc by-nc-sa/2.0;
  55. 55. Summer 2013 Summer 2013: Video Factory is ready. 25 microservices instead of the previous monolith. Deployment by now is easy; we've worked closely with another team in the BBC to help them develop the deployment system. I'll talk about some of the other changes we made to help achieve this in a moment.
  56. 56. Deployment weekly averages (total for 10 weeks, divided by 10) int: test: live: So, now we can deploy quite a lot - not just 2 or 3 times per year.
  57. 57. Deployment weekly averages (total for 10 weeks, divided by 10) int: 140 test: live: So, now we can deploy quite a lot – not just 2 or 3 times per year.
  58. 58. Deployment weekly averages (total for 10 weeks, divided by 10) int: 140 test: 38 live: So, now we can deploy quite a lot – not just 2 or 3 times per year.
  59. 59. Deployment weekly averages (total for 10 weeks, divided by 10) int: 140 test: 38 live: 26 So, now we can deploy quite a lot – not just 2 or 3 times per year.
  60. 60. [Chart: deployments by day of week, Monday to Sunday (total for 10 weeks, divided by 10)] We peak at just over 50 deployments per day. (The majority is on the int environment, because there, the deploy happens whenever we commit something).
  61. 61. Summer 2014 Summer 2014: we’re now up to 75 microservices – three times bigger. We’ve gone from barely being able to make any changes without things catching fire, to threefold growth in a year. Sustainable growth via better tooling.
  62. 62. Automating the build wasn’t enough “Just adding build automation wasn’t enough – deploying was still hard, which meant we didn’t deploy very often, which meant that each deployment was big, and was therefore risky.” “To reduce the latency and risk, we needed to be able to deploy more quickly, more often: Continuous Delivery. But what did that mean for us as a team?”
  63. 63. Continuous Discovery Part II This is a process of Continuous Discovery: this is simply what we’ve done, and learnt, so far. But we’re still learning, still adapting. Going to focus on just a few areas.
  64. 64. It takes the whole team The whole team is involved in delivery – that’s why the team is who it is. For us, that means product owners, project managers, architects, software engineers, testers, support. Everyone has their part to play in CD. Everyone needs to adapt, and benefits. Earlier, continuous communication within the team. Quicker feedback from each small step, so we know what step to take next.
  65. 65. Automated checking Hopefully obvious. We had it already, before Continuous Delivery, but it’s even more critical afterwards.
  66. 66. Being the masters of our own destiny We needed to be in control of our own destiny. Hence we can't have someone else telling us when we can or can't deploy, and we can't have someone else doing the deployments for us; therefore we needed to perform the deployments ourselves. There was some inertia within the organisation that we had to overcome.
  67. 67. “Do you want support with that?” We choose our own level of support. For us in Media Services, almost everything we do needs 24/7 support – downtime is never OK. But, for each service, evaluate the required level of support for that service.
  68. 68. Media Services Support In-hours: team, to team, to team. Out of hours: team, to individual, to individual.
  69. 69. Media Services Support 1. The BBC’s central 24/7 operations team In-hours: team, to team, to team. Out of hours: team, to individual, to individual.
  70. 70. Media Services Support 1. The BBC’s central 24/7 operations team 2. Our own support team In-hours: team, to team, to team. Out of hours: team, to individual, to individual.
  71. 71. Media Services Support 1. The BBC’s central 24/7 operations team 2. Our own support team 3. The development team In-hours: team, to team, to team. Out of hours: team, to individual, to individual.
  72. 72. Challenges Providing (paid) out-of-hours support is not mandatory for our software engineers. We ask them if they're willing to opt in. Some do, some don't. If not enough people opt in, support might not be viable. For us, about 5 or 6 out of 30 have opted in. That seems to be just about enough. They hardly ever get called, but it's good to know that someone's there if needs be.
  73. 73. Challenges How many people opt in? Providing (paid) out-of-hours support is not mandatory for our software engineers. We ask them if they're willing to opt in. Some do, some don't. If not enough people opt in, support might not be viable. For us, about 5 or 6 out of 30 have opted in. That seems to be just about enough. They hardly ever get called, but it's good to know that someone's there if needs be.
  74. 74. Challenges Understanding the product Our product estate is big. Audio simulcast is rather different from video on-demand, for example. Mitigating factors: - lots of common patterns, which help us make educated guesses - we only allow ourselves simple operations out-of-hours
  75. 75. Challenges What do I do if I get called?
  76. 76. What might we do, for out-of-hours support? Specifically: NOT code changes. Roll back (small roll forward, therefore small roll back). Retry (e.g. a failed message). Kill it (killing things is normal, meh. Chaos Monkey). Failover. Scale it. With those few options, turns out you can fix quite a lot.
  77. 77. Code changes What might we do, for out-of-hours support? Specifically: NOT code changes. Roll back (small roll forward, therefore small roll back). Retry (e.g. a failed message). Kill it (killing things is normal, meh. Chaos Monkey). Failover. Scale it. With those few options, turns out you can fix quite a lot.
  78. 78. Code changes Roll back What might we do, for out-of-hours support? Specifically: NOT code changes. Roll back (small roll forward, therefore small roll back). Retry (e.g. a failed message). Kill it (killing things is normal, meh. Chaos Monkey). Failover. Scale it. With those few options, turns out you can fix quite a lot.
  79. 79. Code changes Roll back Retry it What might we do, for out-of-hours support? Specifically: NOT code changes. Roll back (small roll forward, therefore small roll back). Retry (e.g. a failed message). Kill it (killing things is normal, meh. Chaos Monkey). Failover. Scale it. With those few options, turns out you can fix quite a lot.
  80. 80. Code changes Roll back Retry it Kill it What might we do, for out-of-hours support? Specifically: NOT code changes. Roll back (small roll forward, therefore small roll back). Retry (e.g. a failed message). Kill it (killing things is normal, meh. Chaos Monkey). Failover. Scale it. With those few options, turns out you can fix quite a lot.
  81. 81. Code changes Roll back Retry it Kill it Scale up or down What might we do, for out-of-hours support? Specifically: NOT code changes. Roll back (small roll forward, therefore small roll back). Retry (e.g. a failed message). Kill it (killing things is normal, meh. Chaos Monkey). Failover. Scale it. With those few options, turns out you can fix quite a lot.
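Since the deck's (unofficial) pipeline later starts with "Get an AWS account", here is a hedged sketch of what those few out-of-hours operations can look like on AWS, using Python and boto3. Every name in it (the Auto Scaling group, queue URLs, launch configuration) is hypothetical, and it is illustrative only, not the team's actual tooling.

```python
# Illustrative only: the simple out-of-hours operations from the slide
# (roll back, retry, kill, scale) expressed against AWS with boto3.
# Resource names are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")
sqs = boto3.client("sqs")

ASG = "video-factory-transcode-asg"  # hypothetical Auto Scaling group


def scale(desired_capacity: int) -> None:
    """Scale up or down: change the number of instances in the group."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG,
        DesiredCapacity=desired_capacity,
        HonorCooldown=False,
    )


def kill(instance_id: str) -> None:
    """Kill it: terminate a misbehaving instance; the ASG replaces it."""
    ec2.terminate_instances(InstanceIds=[instance_id])


def roll_back(previous_launch_configuration: str) -> None:
    """Roll back: point the group at the previous (known-good) launch
    configuration; small roll forward means small roll back."""
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG,
        LaunchConfigurationName=previous_launch_configuration,
    )


def retry_one_failed_message(dlq_url: str, work_queue_url: str) -> None:
    """Retry it: move one message from a dead-letter queue back onto
    the work queue so the service processes it again."""
    response = sqs.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=1)
    for message in response.get("Messages", []):
        sqs.send_message(QueueUrl=work_queue_url, MessageBody=message["Body"])
        sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=message["ReceiptHandle"])
```

The point of keeping the menu this small is that each operation is safe to run at 3am by someone who doesn't know the service intimately.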
  82. 82. Decisions, Decisions … Part III Including the whole team, with close communication; automated checking; being in control; adapting to support. That’s what it meant for us. But what might it mean for you? Continuous Delivery at the BBC means flexibility. Pick and choose what’s right for you.
  83. 83. I have never used the in-house hosting platform A caveat / confession. The BBC does have an in-house hosting platform, but none of my products have ever used it. But I’m going to compare the steps for getting something live before, vs after, Continuous Delivery.
  84. 84. The documentation describing the old (pre-Continuous-Delivery) process for getting software onto the in-house hosting platform. Lots of mandatory steps.
  85. 85. (goes on for 2880 words)
  86. 86. The (unofficial) new Pipeline: My unofficial take on the new, simplified mandatory steps. Actually, even step 1 is optional. But you’ve got to have somewhere to host it.
  87. 87. The (unofficial) new Pipeline: 1. Get an AWS account My unofficial take on the new, simplified mandatory steps. Actually, even step 1 is optional. But you’ve got to have somewhere to host it.
  88. 88. The (unofficial) new Pipeline: 1. Get an AWS account 2. Get Infosec approval My unofficial take on the new, simplified mandatory steps. Actually, even step 1 is optional. But you’ve got to have somewhere to host it.
  89. 89. The (unofficial) new Pipeline: 1. Get an AWS account 2. Get Infosec approval 3. Go live My unofficial take on the new, simplified mandatory steps. Actually, even step 1 is optional. But you’ve got to have somewhere to host it.
  90. 90. Optional extras However, although almost all of these things are now optional, you might want to do them anyway. It's up to you. You don't have to have a decent architecture…
  91. 91. Optional extras Decent architecture However, although almost all of these things are now optional, you might want to do them anyway. It's up to you. You don't have to have a decent architecture…
  92. 92. Following your Technical Architect’s advice will make your product more successful.
  93. 93. Ditto for: decent engineering, decent product management, etc ( )
  94. 94. Optional extras More things you don’t have to do. Your call.
  95. 95. Optional extras Using the standard build tool More things you don’t have to do. Your call.
  96. 96. Optional extras Using the standard build tool Continuous Integration More things you don’t have to do. Your call.
  97. 97. Optional extras Using the standard build tool Continuous Integration Repeatable builds More things you don’t have to do. Your call.
  98. 98. Optional extras Using the standard build tool Continuous Integration Repeatable builds Builds More things you don’t have to do. Your call.
  99. 99. An efficient, repeatable build chain makes your product more reliable.
  100. 100. Optional extras You don’t have to do these things…
  101. 101. Optional extras Run Book You don’t have to do these things…
  102. 102. Optional extras Run Book Demos You don’t have to do these things…
  103. 103. Optional extras Run Book Demos Telling anybody anything You don’t have to do these things…
  104. 104. If you help the support team, they can help you. If you tell people what your product is, what it does, how it works, and so on, then they can help you, for example when things go wrong.
  105. 105. Optional extras You don’t have to do these!
  106. 106. Optional extras Out-of-hours support You don’t have to do these!
  107. 107. Optional extras Out-of-hours support In-hours support You don’t have to do these!
  108. 108. Optional extras Out-of-hours support In-hours support Giving a damn about anything You don’t have to do these!
  109. 109. If you care about your product, other people will care too. But you might want to :-)
  110. 110. Optional extras You don’t have to do these things either…
  111. 111. Optional extras Monitoring You don’t have to do these things either…
  112. 112. Optional extras Monitoring Load testing You don’t have to do these things either…
  113. 113. Optional extras Monitoring Load testing Integration testing You don’t have to do these things either…
  114. 114. Optional extras Monitoring Load testing Integration testing Component testing You don’t have to do these things either…
  115. 115. Optional extras Monitoring Load testing Integration testing Component testing Unit testing You don’t have to do these things either…
  116. 116. Optional extras Any concept of success whatsoever
  117. 117. If you care about something, monitor / test it. Monitoring ftw.
  118. 118. Test-Driven Development In fact, let’s take this one step further. If we think that TDD is a good thing…
  119. 119. Monitoring Load testing Integration testing Component testing Unit testing And these four things are automated checks for correct behaviour that we usually apply before production, i.e. the things we often do TDD on… Then why did we miss out Monitoring? Monitoring is an automated check for correct behaviour that we usually apply after production. But why not let that drive our development process in the same way?
  120. 120. Monitoring Load testing Integration testing Component testing Unit testing } And these four things are automated checks for correct behaviour that we usually apply before production, i.e. the things we often do TDD on… Then why did we miss out Monitoring? Monitoring is an automated check for correct behaviour that we usually apply after production. But why not let that drive our development process in the same way?
  121. 121. Monitoring Load testing Integration testing Component testing Unit testing } And these four things are automated checks for correct behaviour that we usually apply before production, i.e. the things we often do TDD on… Then why did we miss out Monitoring? Monitoring is an automated check for correct behaviour that we usually apply after production. But why not let that drive our development process in the same way?
  122. 122. Monitoring-Driven Development Monitoring-Driven Development: define how you’re going to monitor this behaviour that you want, when it’s live. How do you know if it’s working? Create the alarm first, before you create the behaviour. The alarm goes red, unhappy. Good: now you know that you need to do some work to create that behaviour. Do that, release it. Then the alarm clears, is happy. So straight away, you know that this thing is monitored, from day 1, and that the monitoring works.
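As a concrete illustration of that workflow, here is what it might look like with AWS CloudWatch and boto3. This is a sketch under assumptions: CloudWatch is inferred only from the AWS hosting mentioned earlier, and the namespace, metric and alarm names are invented.

```python
# Monitoring-Driven Development, sketched with AWS CloudWatch (boto3).
# The namespace, metric and alarm names are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

# Step 1: create the alarm BEFORE the behaviour exists. Missing data is
# treated as breaching, so the alarm goes red straight away - that is
# the "failing test" telling us there is work to do.
cloudwatch.put_metric_alarm(
    AlarmName="subtitles-published-hourly",
    Namespace="VideoFactory",          # hypothetical namespace
    MetricName="SubtitlesPublished",   # hypothetical metric
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
)


# Step 2: build and release the behaviour. When it runs, it emits the
# metric; the alarm clears, proving both the feature and its monitoring.
def record_subtitles_published(count: int = 1) -> None:
    cloudwatch.put_metric_data(
        Namespace="VideoFactory",
        MetricData=[{"MetricName": "SubtitlesPublished", "Value": float(count)}],
    )
```

The design choice is the same as TDD's red-green cycle, just applied after production: the alarm starts red, the released behaviour turns it green, and the monitoring is known to work from day one.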
  123. 123. As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  124. 124. Responsibility As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  125. 125. Responsibility Power As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  126. 126. Responsibility Power Better product As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  127. 127. Responsibility Power Better product Take the credit As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  128. 128. Tyler now represents not our failure, but our innovation. He has a new, innovative use, keeping hold of our video adapter. He lives in that silver cup, the BBC Digital Platform Innovation Award.
  129. 129. You can see it engraved there with our team name. Yes, they got the year wrong.
  130. 130. You can see it engraved there with our team name. Yes, they got the year wrong.
  131. 131. You can see it engraved there with our team name. Yes, they got the year wrong.
  132. 132. You can see it engraved there with our team name. Yes, they got the year wrong.
  133. 133. Continuous Delivery sounded scary In 2010, we were making large, infrequent releases, and after every release we’d spend days or weeks putting fires out. Over the next 2 or 3 years we adopted Continuous Delivery. It sounded scary at first; we were afraid we'd just break things more often. But the feared “Continuous Destruction” never happened. In fact, it turned out to be absolutely critical to Video Factory's success.
  134. 134. Change the team. We had to change. That change didn’t happen overnight, and it wouldn't have worked without the whole team being involved. As part of Continuous Delivery, we’re now in control of our own testing, and our own deployments; and we choose what level of support our product needs.
  135. 135. Change the team. Be in control of your product. We had to change. That change didn’t happen overnight, and it wouldn't have worked without the whole team being involved. As part of Continuous Delivery, we’re now in control of our own testing, and our own deployments; and we choose what level of support our product needs.
  136. 136. Smaller, safer changes. Change the team. Be in control of your product. Deployment is now literally an everyday occurrence; we now have a steady flow of changes, each of which is smaller and safer; and we have much quicker feedback of results from each stage.
  137. 137. Smaller, safer changes. Rapid feedback. Change the team. Be in control of your product. Deployment is now literally an everyday occurrence; we now have a steady flow of changes, each of which is smaller and safer; and we have much quicker feedback of results from each stage.
  138. 138. Smaller, safer changes. Rapid feedback. Change the team. Be in control of your product. Smaller, safer changes. Having a more stable product of course gives a better experience for the audience; but additionally, adopting Continuous Delivery has helped to make working in the team more enjoyable. And with the faster feedback, we're able to deliver features and fixes – to deliver value – more quickly and more reliably. We change things more often, and more safely. But also, we can experiment; we can try stuff out; we can innovate.
  139. 139. Continuous Delivery enabled us to create Video Factory. Continuous Delivery enabled us to create Video Factory – new architecture, new code, new platform – in just 12 months.
  140. 140. What could it enable you to create? What could it enable you to create?
  141. 141. Thank you Rachel Evans rachel.evans@bbc.co.uk @rvedotrc Digital Platform Media Services
