BBC Digital Platform Media Services
Rachel Evans
rachel.evans@bbc.co.uk
Destruction, Decapods,
and Doughnuts
@rvedotrc
Continuous Delivery for Audio & Video Factory
☹
A few years ago, the system that handled video publication for iPlayer was unreliable. Programmes were often missing, or
published late.

What to do?

We committed to killing it.

We were so committed that we deliberately declined to renew the third-party contract that iPlayer relied upon. The system’s
fate was sealed: on 1st October 2013, it would stop working.
All we had to do was build a completely new replacement system, and we had a little over a year in which to do it.

The trouble was, it was a half-million-line codebase, we were in the habit of releasing only two or three times a year, and
every time we released, things broke.
This was the situation we had deliberately, knowingly placed ourselves in. We had just over a year not only to build a
complete replacement, but to re-learn how to develop, test, release, and support software.
“Publish all BBC AV media
produced for IP platforms”
My name’s Rachel Evans, I’m a Principal Software Engineer in Media Services, part of the BBC Digital Platform.

Our mission is to “Publish all BBC Audio/Video media produced for IP Platforms”. So if you’ve ever watched a BBC News
clip online, or listened to a BBC podcast, or if you watched the Olympics online in 2012, or watched iPlayer, either live or on-
demand, or listened to iPlayer Radio – if you’ve done any of those things, then you’ve used our products.
This, then, is my team’s story: the story of how we changed the way we make software, and how that enabled us to
successfully launch Video Factory and Audio Factory – the media processing systems that now power iPlayer.

And what it all has to do with a toy crab who lives in a silver trophy.
Media Services:
a history
Part I
I joined the BBC in 2007, and I’ve been with Media Services since 2009, but this story really starts in Summer 2010.
Summer 2010
This was basically the low point for us as a team – when we were at our least effective.
Audio On Demand
This session is about Testing…
Our Audio-On-Demand codebase had zero unit tests. Not one. Which, if you know anything about software engineering,
you’ll know is a bad thing. We hadn’t defined what this product was meant to do. Every time we made a change, we had no
way of knowing whether or not it worked, because we hadn’t defined what “worked” meant.
Absolutely no
automated tests
Video On Demand
Our video-on-demand product did have unit tests, but they took 90 minutes to run, which is a very long time. That meant
people got lazy, and didn’t always bother to run the tests.
As for code coverage (that is, how much of the product the tests actually exercise): we let the coverage run go for 4 days,
it still hadn’t finished, so we killed it. So we had no idea how much of the product we were actually testing.
Not-really-unit tests: 90 minutes
Code coverage: killed after 4 days
Patch, patch, patch, …
We didn’t always build and deploy things cleanly, because building and deploying were slow.
So, all too often, we’d apply patches, sometimes directly to the live system.
We knew this was a bad thing, a bad habit to get into. But we did it anyway.
It was our team’s dirty open secret.
“Patch Club”
So we called it “Patch Club”.
The first rule of
Patch Club is,
you don’t talk
about Patch Club.
Of course, the first rule of Patch Club is, you don’t talk about Patch Club. Everyone knows that!
Patch Club
Patch Club
Patch Club
Err, OK. Better not talk about “Patch Club” in public.
P***h C**b
So in team online chats, we started calling it “P***h C**b”. And the joke then became, “What could P…h C..b mean?”
One day our colleague Mike brought in two mysterious artefacts…
Pooch Comb
Nah. That doesn’t feel right.
P***h C**b
Plush Crab
This is our team’s resident decapod, and his name is Tyler.

Although he’s lovely, Tyler is a symbol of our failure to properly develop and deploy working software.
Plush Crab “Tyler”
Let’s ship it!
Eventually, a few times a year, we decided that we had so many undeployed commits that it was time to release them.
What’s in release 10.6?
This is us, in July 2010, trying to work out what’s new in this release that we’re about to deploy. We don’t know for sure: we
make our best guess.
simplify bumper, bumper_in (howet03)
added drop table statements for repairing fuckup with db backup (howet03)
moved view to view file (howet03)
fix wmv test and make it stable when config changes via local yml file (howet03)
fix transcode bumpers test (howet03)
2057-Console-transcode-task-page-showing-status-as-o (murrac21)
bump (howet03)
Fix for AutoQCPassed test, from R10.4B (alexb)
2076-Console-version-page-Add-Asset-button (murrac21)
https://jira.dev.bbc.co.uk/browse/NEWWORLD-2061 https://jira.dev.bbc.co.uk/browse/NEWWORLD-2076 Merging asset changes (dbennett)
Removed hard-wired TERMs, for those of us not on TERM=xterm (evansd17)
Added "db" ops script, and use this in the other scripts (evansd17)
MySQL query optimisation for monitor_schedule_item (evansd17)
Merge of test fixes and qc domain hack round potential for verified asset records lacking mtimes. (alexb)
The change log.
removed (howet03)
Changes by Tom H: - support passing plugin name explicitly on the command line
- use wfe env vars when connecting to the db (weyt03)
WORKFLOWENGINE-83 Delta worklist filtering on profile (evansd17)
Give up after 2 days (mary)
Self polling set to five minutes to avoid thrashing under heavy asset registration load (dbennett)
add index db deltas for asset_file (marcus)
swop constraints (howet03)
merge from R10.5A.22 (marcus)
merge of 10.5A changes into trunk (howet03)
add indecies to various ingest_metadata and ingest_task columns (marcus)
148-Seeding-Bug - rework seed creation in scheduler (howet03)
To recover from unexpected PIPs outages (mary)
Merge of Andy's 209-Fix_MAD_for_3G branch. (alexb)
New cut of R10.6 from trunk (including new seeding fixes, and 209 - Fix MAD for 3G) (alexb)
1 commit with swearing,
4 commits with no message,
15 commits that talk about “fixing tests” (well, you should have run the tests before you committed, huh? But we know
that the tests took 90 minutes to run, so it’s not surprising that people got lazy).
swearing: 1
no message: 4
/fix.*test/: 15
202
But in total, 202 commits! That’s a lot.
And we’re not even sure that that’s right.
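Counts like the ones on these slides can be produced by scanning commit messages. The Python below is a minimal sketch, assuming the messages are already available as a list of strings; the function name `summarise_changelog`, the caller-supplied word list, and matching the slide’s /fix.*test/ pattern case-insensitively are all assumptions, not the team’s actual tooling.

```python
from __future__ import annotations

import re
from collections import Counter
from collections.abc import Iterable

# The slide's pattern. Matching it case-insensitively is an assumption.
FIX_TEST = re.compile(r"fix.*test", re.IGNORECASE)


def summarise_changelog(messages: Iterable[str], swear_words: set[str]) -> Counter:
    """Tally commit messages into the categories used on the slide.

    `swear_words` is a caller-supplied, lower-case word list (hypothetical).
    """
    counts: Counter = Counter()
    for message in messages:
        counts["total"] += 1
        text = message.strip()
        if not text:
            # Empty or whitespace-only message.
            counts["no message"] += 1
            continue
        # Compare lower-cased words so capitalised words still match the list.
        words = set(re.findall(r"[a-z']+", text.lower()))
        if words & swear_words:
            counts["swearing"] += 1
        if FIX_TEST.search(text):
            counts["fix.*test"] += 1
    return counts
```

Run over five messages taken from the log above, plus an empty string standing in for a commit with no message, it reports a total of 5, with 1 message-less commit, 2 matching /fix.*test/, and 1 with swearing. The slide’s own caveat still applies: the result is only as trustworthy as the list of commits you feed in.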
We have a problem.
We have no idea
what we are
deploying
We have a problem.
No standard
deployment procedure
Someone goes into a trance-like state and experiments with substances (coffee) and tar files for a day.
Different people deployed the product in different ways, giving inconsistent results.
And whenever we deploy, something catches fire.
Or sometimes, everything catches fire.
Or at least that’s what it felt like.
Bad tests
Terrible code
Slow development
Huge, infrequent releases
Deployment is slow and unreliable
Followed by days repairing the damage
Summary.
So with all this failure as a team, what did we do?
[various suppliers of
doughnuts are
available]
We rewarded ourselves with doughnuts!
Releases took a long time to create, test, deploy, and extinguish, so we must have done a good job, right?
Doughnuts.
Delivery = doughnuts!
We learnt that delivery == doughnuts.
Autumn 2012
Summer 2012: We’re creating a new workflow in the cloud to put iPlayer content onto Sky set-top boxes, and the start of
Video Factory.

Better software, better architecture, better practices.

More Jenkins. More BDD, some automated testing.

But deployment is still hard, so we’re still not deploying often enough.
Continuous Delivery
We know what we need to do: smaller releases, more often. Continuous Delivery.
We’d just had the London 2012 Olympics. You could tell, because the branding was everywhere in our building. Just in case
we forgot, like.
The Olympics – branding was everywhere.
But then suddenly, the Olympics branding was gone, and it was replaced by this: our Top 5 Priorities, writ large on the very
walls.
There it was, right in front of us: Priority: Continuous Delivery.
But we were scared of this. Every time we deploy, things catch fire. And you want us to deploy more often? Uh, huh.
Continuous Destruction
So semi-jokingly in the team, we called it Continuous Destruction.
We were scared of this.
Continuous Disaster
“Continuous Disaster” was another name we used.
Delivery = doughnuts
But then we started to rationalise it. We’ve already learned that delivery == doughnuts, so maybe…
Continuous Doughnuts
It’s Continuous Doughnuts.
YES
OK. Let’s do this.
Summer 2013
Summer 2013: Video Factory is ready. 25 microservices instead of the previous monolith.

Deployment is easy by now: we’ve worked closely with another team in the BBC to help them develop the deployment
system.

I’ll talk about some of the other changes we made to help achieve this in a moment.
Deployment
weekly averages
(total for 10 weeks, divided by 10)
int: 140
test: 38
live: 26
So, now we can deploy quite a lot: not just 2 or 3 times per year.
[Chart: Deployments by day of week, Monday to Sunday (total for 10 weeks, divided by 10)]
We peak at just over 50 deployments per day. (The majority are on the int environment, because there, a deploy
happens whenever we commit something.)
Summer 2014
Summer 2014: we’re now up to 75 microservices – three times bigger. We’ve gone from barely being able to make any
changes without things catching fire, to threefold growth in a year.

Sustainable growth via better tooling.
Automating the build
wasn’t enough
“Just adding build automation wasn’t enough – deploying was still hard, which meant we didn’t deploy very often,
which meant that each deployment was big, and was therefore risky.”
“To reduce the latency and risk, we needed to be able to deploy more quickly, more often: Continuous Delivery. But
what did that mean for us as a team?”
Continuous Discovery
Part II
This is a process of Continuous Discovery: this is simply what we’ve done, and learnt, so far. But we’re still learning, still
adapting.

I’m going to focus on just a few areas.
It takes the whole team
The whole team is involved in delivery – that’s why the team is who it is.

For us, that means product owners, project managers, architects, software engineers, testers, support.

Everyone has their part to play in CD. Everyone needs to adapt, and everyone benefits.

Communication within the team starts earlier and is continuous, and each small step gives quicker feedback, so we know
what step to take next.
Automated checking
Hopefully obvious. We had it already, before Continuous Delivery, but it’s even more critical afterwards.
Being the masters
of our own destiny
We needed to be in control of our own destiny.

Hence we couldn’t have someone else telling us when we could or couldn’t deploy.

We couldn’t have someone else doing the deployments for us, either.

Therefore we needed to perform the deployments ourselves.

There was some inertia within the organisation that we had to overcome.
“Do you want
support with that?”
We choose our own level of support.

For us in Media Services, almost everything we do needs 24/7 support: downtime is never OK. But for each service, we
evaluate the level of support that service actually requires.
Media Services Support
1. The BBC’s central 24/7 operations team
2. Our own support team
3. The development team
In-hours: team, to team, to team.

Out of hours: team, to individual, to individual.
Challenges
How many people opt in?
Providing (paid) out-of-hours support is not mandatory for our software engineers. We ask them whether they’re willing to
opt in. Some do, some don’t. If not enough people opt in, support might not be viable.
For us, about 5 or 6 out of 30 have opted in. That seems to be just about enough. They hardly ever get called, but it’s good
to know that someone’s there if need be.
Challenges
Understanding the product
Our product estate is big. Audio simulcast is rather different from video on-demand, for example.

Mitigating factors:

- lots of common patterns, which help us make educated guesses

- we only allow ourselves simple operations out-of-hours
Challenges
What do I do if I get called?
What might we do, for out-of-hours support?

Specifically: NOT code changes.

Roll back (a small roll forward means a small roll back).

Retry (e.g. a failed message).

Kill it (killing things is normal, meh: think Chaos Monkey), or fail over.

Scale it.

With those few options, it turns out you can fix quite a lot.
Code changes
Roll back
Retry it
Kill it
Scale up or down
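One way to picture the out-of-hours rule is as a fixed menu: anything not on it, code changes included, waits for working hours. The sketch below is illustrative only; the `Action` enum, `plan_action`, and `roll_back` are hypothetical names, and a real rollback would go through the team’s deployment tooling rather than a list of version strings.

```python
from __future__ import annotations

from enum import Enum


class Action(Enum):
    """Hypothetical out-of-hours menu; "code change" is deliberately missing."""
    ROLL_BACK = "roll back"
    RETRY = "retry"  # e.g. re-send a failed message
    KILL = "kill"    # e.g. kill an unhealthy instance, or fail over
    SCALE = "scale"  # scale up or down


def plan_action(requested: str) -> Action:
    """Accept a request only if it is on the out-of-hours menu."""
    try:
        return Action(requested.strip().lower())
    except ValueError:
        # Anything else, including code changes, waits for working hours.
        raise PermissionError(f"{requested!r} is not allowed out of hours") from None


def roll_back(deployed: list[str]) -> str:
    """Return the release to redeploy: the one before the current release.

    `deployed` is ordered oldest to newest. Because each release is small,
    stepping back one release undoes only a small change.
    """
    if len(deployed) < 2:
        raise ValueError("no earlier release to roll back to")
    return deployed[-2]
```

For example, `plan_action("Roll back")` returns `Action.ROLL_BACK`, `plan_action("code change")` raises `PermissionError`, and `roll_back(["r101", "r102", "r103"])` returns `"r102"`.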
Decisions, Decisions …
Part III
Including the whole team, with close communication; automated checking; being in control; adapting to support.
That’s what it meant for us.

But what might it mean for you?

Continuous Delivery at the BBC means flexibility. Pick and choose what’s right for you.
I have never used
the in-house
hosting platform
A caveat / confession.
The BBC does have an in-house hosting platform, but none of my products have ever used it. Even so, I’m going to compare
the steps for getting something live before, vs after, Continuous Delivery.
The documentation describing the old (pre-Continuous-Delivery) process for getting software live on the in-house hosting
platform.
Lots of mandatory steps.
(goes on for 2880 words)
The (unofficial) new Pipeline:
1. Get an AWS account
2. Get Infosec approval
3. Go live
My unofficial take on the new, simplified mandatory steps.
Actually, even step 1 is optional. But you’ve got to have somewhere to host it.
Optional extras
Decent architecture
However, although almost everything is now optional, you might want to do them anyway. It’s up to you.
You don’t have to have a decent architecture…
Following your Technical
Architect’s advice will make
your product more successful.
(Ditto for: decent engineering, decent product management, etc.)
Optional extras
Using the standard build tool
Continuous Integration
Repeatable builds
Builds
More things you don’t have to do. Your call.
An efficient, repeatable build chain
makes your product more reliable.
Optional extras
Run Book
Demos
Telling anybody anything
You don’t have to do these things…
If you help the support team,
they can help you.
If you tell people what your product is, what it does, how it works, and so on, then they can help you, for example when things go wrong.
Optional extras
Out-of-hours support
In-hours support
Giving a damn about anything
You don’t have to do these!
If you care about your product,
other people will care too.
But you might want to :-)
Optional extras
Monitoring
Load testing
Integration testing
Component testing
Unit testing
You don’t have to do these things either…
Optional extras
Any concept of success whatsoever
If you care about something,
monitor / test it.
Monitoring ftw.
Test-Driven
Development
In fact, let’s take this one step further.

If we think that TDD is a good thing…
Monitoring
Load testing
Integration testing
Component testing
Unit testing
And these four things are automated checks for correct behaviour that we usually apply before production, i.e. the things
we often do TDD on…

Then why did we miss out Monitoring?

Monitoring is an automated check for correct behaviour that we usually apply after production. But why not let that drive
our development process in the same way?
Monitoring-Driven
Development
Monitoring-Driven Development: define how you’re going to monitor this behaviour that you want, when it’s live. How do you
know if it’s working?

Create the alarm first, before you create the behaviour. The alarm goes red, unhappy. Good: now you know that you need to
do some work to create that behaviour.

Do that, release it.

Then the alarm clears, is happy. So straight away, you know that this thing is monitored, from day 1, and that the monitoring
works.
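The alarm-first loop can be sketched in a few lines of Python. This is a toy model, not the team’s monitoring stack: the `Alarm` class, the programme IDs, and the `publish` function are hypothetical stand-ins, and in production the check would be a real metric and alarm rather than an in-process function.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Alarm:
    """A toy alarm: `healthy` returns True when the watched behaviour works."""
    name: str
    healthy: Callable[[], bool]

    def state(self) -> str:
        return "OK" if self.healthy() else "ALARM"


# Hypothetical behaviour we want: every scheduled programme gets published.
scheduled = ["p001", "p002"]
published: set[str] = set()

# 1. Define the alarm first. Nothing publishes yet, so it goes red.
alarm = Alarm("all-scheduled-published", lambda: set(scheduled) <= published)
print(alarm.state())  # ALARM


# 2. Build and release the behaviour (a stand-in for the real publishing code).
def publish(programme_id: str) -> None:
    published.add(programme_id)


for pid in scheduled:
    publish(pid)

# 3. The same alarm now clears: monitored from day 1, and the monitor is known to work.
print(alarm.state())  # OK
```

The key design point is that the alarm object is created once, before the feature, and never modified: the only thing that changes between the red and green states is the behaviour it watches.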
Responsibility
Power
Better product
Take the credit
As a team, you have extra responsibility to ensure things happen.
But also the extra power to make sure they do.
And it leads to a better product.
And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you. You
made the decisions, and you enacted them.
Tyler now represents not our failure, but our innovation.

He has a new, innovative use, keeping hold of our video adapter.

He lives in that silver cup, the BBC Digital Platform Innovation Award.
You can see it engraved there with our team name.

Yes, they got the year wrong.
Continuous Delivery
sounded scary
In 2010, we were making large, infrequent releases, and after every release we’d spend days or weeks putting fires out.

Over the next 2 or 3 years we adopted Continuous Delivery. It sounded scary at first; we were afraid we'd just break things
more often. But the feared “Continuous Destruction” never happened. In fact, it turned out to be absolutely critical to Video
Factory's success.
Change the team.
We had to change. That change didn’t happen overnight, and it wouldn't have worked without the whole team being
involved.

As part of Continuous Delivery, we’re now in control of our own testing, and our own deployments; and we choose what
level of support our product needs.
Be in control of your product.
Smaller, safer changes.
Deployment is now literally an everyday occurrence. We have a steady flow of changes, each of which is smaller and safer,
and we get much quicker feedback from each stage.
Rapid feedback.
Having a more stable product of course gives a better experience for the audience; but additionally, adopting Continuous
Delivery has made working in the team more enjoyable.

And with faster feedback, we’re able to deliver features and fixes, and so deliver value, more quickly and more reliably. We
change things more often, and more safely. But also, we can experiment; we can try stuff out; we can innovate.
Continuous Delivery enabled us
to create Video Factory.
Continuous Delivery enabled us to create Video Factory – new architecture, new code, new platform – in just 12 months.
What could it enable
you to create?
What could it enable you to create?
Thank you
Rachel Evans
rachel.evans@bbc.co.uk
@rvedotrc
Digital
Platform Media Services

Evaluation - Question 6Evaluation - Question 6
Evaluation - Question 6
 
Question 4 new
Question 4 newQuestion 4 new
Question 4 new
 

Recently uploaded

Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 

Recently uploaded (20)

Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 

Destruction, Decapods and Doughnuts: Continuous Delivery for Audio & Video Factory

  • 1. BBC Digital Platform Media Services Rachel Evans rachel.evans@bbc.co.uk Destruction, Decapods, and Doughnuts @rvedotrc Continuous Delivery for Audio & Video Factory
  • 2. ☹ A few years ago, the system that handled video publication for iPlayer was unreliable. Programmes were often missing, or published late. What to do? We committed to killing it. We committed so much, we deliberately declined to renew the third-party contract which iPlayer relied upon. The system’s fate was sealed: on 1st October 2013, it would stop working.
  • 3. All we had to do was build a completely new replacement system, and we had a little over a year in which to do it. The trouble was, it was a half-million-line codebase, we were in the habit of releasing only two or three times a year, and every time we released, things broke.
  • 4. This was the situation we had deliberately, knowingly placed ourselves in. We had just over a year not only to build a complete replacement, but to re-learn how to develop, test, release, and support software.
  • 5. “Publish all BBC AV media produced for IP platforms” My name’s Rachel Evans; I’m a Principal Software Engineer in Media Services, part of the BBC Digital Platform. Our mission is to “Publish all BBC Audio/Video media produced for IP Platforms”. So if you’ve ever watched a BBC News clip online, or listened to a BBC podcast, or if you watched the Olympics online in 2012, or watched iPlayer, either live or on-demand, or listened to iPlayer Radio – if you’ve done any of those things, then you’ve used our products.
  • 6. This, then, is my team’s story: the story of how we changed the way we make software, and how that enabled us to successfully launch Video Factory and Audio Factory – the media processing systems that now power iPlayer. And what it all has to do with a toy crab who lives in a silver trophy.
  • 7. Media Services: a history Part I I joined the BBC in 2007, and I’ve been with Media Services since 2009, but this story really starts in Summer 2010.
  • 8. Summer 2010 This was basically the low point for us as a team – when we were at our least effective.
  • 9. Audio On Demand This session is about Testing… Our Audio-On-Demand codebase had zero unit tests. Not one. If you know anything about software engineering, you’ll know that’s a bad thing. We hadn’t defined what this product was meant to do, so every time we made a change, we had no way of knowing whether or not it worked: we hadn’t defined what “worked” meant.
  • 10. Audio On Demand: Absolutely no automated tests
  • 11. Video On Demand Our video-on-demand product did have unit tests, but they took 90 minutes to run, which is a very long time. That meant people got lazy and didn’t always bother to run the tests. As for code coverage (that is, how much of the product we were actually testing): we let it run for 4 days, it still hadn’t finished, and we killed it. So we had no idea how much of the product we were actually testing.
  • 12. Not-really-unit tests: 90 minutes
  • 13. Not-really-unit tests: 90 minutes. Code coverage: killed after 4 days
  • 14. Patch, patch, patch, … We didn’t always build and deploy things cleanly, because building and deploying were slow. So, all too often, we’d apply patches, sometimes directly to the live system. We knew this was a bad thing, a bad habit to get into. But we did it anyway. It was our team’s dirty open secret.
  • 15. “Patch Club” So we called it “Patch Club”.
  • 16. The first rule of Patch Club is, you don’t talk about Patch Club. Of course, the first rule of Patch Club is, you don’t talk about Patch Club. Everyone knows that!
  • 19. Patch Club Err, OK. Better not talk about “Patch Club” in public.
  • 20. P***h C**b So in team online chats, we started calling it “P***h C**b”. And the joke then became, “What could P…h C..b mean?” One day our colleague Mike brought in two mysterious artefacts…
  • 21. Pooch Comb Nah. That doesn’t feel right.
  • 23. Plush Crab Our team’s resident decapod, and his name is Tyler. Although he’s lovely, Tyler is a symbol of our failure to properly develop and deploy working software.
  • 24. Plush Crab “Tyler”
  • 25. Let’s ship it! Eventually, a few times a year, we decided that we had so many undeployed commits that it was time to release them.
  • 26. What’s in release 10.6? This is us, in July 2010, trying to work out what’s new in this release that we’re about to deploy. We don’t know for sure: we make our best guess.
  • 27. simplify bumper, bumper_in (howet03) added drop table statements for repairing fuckup with db backup (howet03) moved view to view file (howet03) fix wmv test and make it stable when config changes via local yml file (howet03) fix transcode bumpers test (howet03) 2057-Console-transcode-task-page-showing-status-as-o (murrac21) bump (howet03) Fix for AutoQCPassed test, from R10.4B (alexb) 2076-Console-version-page-Add-Asset-button (murrac21) https://jira.dev.bbc.co.uk/browse/NEWWORLD-2061 https://jira.dev.bbc.co.uk/browse/NEWWORLD-2076 Merging asset changes (dbennett) Removed hard-wired TERMs, for those of us not on TERM=xterm (evansd17) Added "db" ops script, and use this in the other scripts (evansd17) MySQL query optimisation for monitor_schedule_item (evansd17) Merge of test fixes and qc domain hack round potential for verified asset records lacking mtimes. (alexb) The change log.
  • 28. removed (howet03) Changes by Tom H: - support passing plugin name explicitly on the command line - use wfe env vars when connecting to the db (weyt03) WORKFLOWENGINE-83 Delta worklist filtering on profile (evansd17) Give up after 2 days (mary) Self polling set to five minutes to avoid thrashing under heavy asset registration load (dbennett) add index db deltas for asset_file (marcus) swop constraints (howet03) merge from R10.5A.22 (marcus) merge of 10.5A changes into trunk (howet03) add indecies to various ingest_metadata and ingest_task columns (marcus) 148-Seeding-Bug - rework seed creation in scheduler (howet03) To recover from unexpected PIPs outages (mary) Merge of Andy's 209-Fix_MAD_for_3G branch. (alexb) New cut of R10.6 from trunk (including new seeding fixes, and 209 - Fix MAD for 3G) (alexb) The change log.
  • 29. 1 commit with swearing, 4 commits with no message, 15 commits which talk about “fixing tests” (well, you should have run the tests before you committed, huh? But we know that the tests took 90 minutes to run, so it’s not surprising that people got lazy).
  • 30-32. swearing: 1; no message: 4; /fix.*test/: 15
  • 33. 202 But in total, 202 commits! That’s a lot. And we’re not even sure that that’s right.
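The tally above boils down to classifying commit messages. Below is a minimal Python sketch that assumes the messages have already been pulled out of the change log as a list of strings. Two things in it are assumptions rather than anything the talk describes: the swear-word pattern and the case-insensitive matching of `/fix.*test/`. The function name `tally_commits` is hypothetical.

```python
import re
from typing import Dict, Iterable

# Hypothetical pattern: the talk doesn't say how "swearing" was detected.
SWEARING = re.compile(r"\bfuck\w*", re.IGNORECASE)
# The pattern from the slide, matched case-insensitively here (an assumption).
FIX_TEST = re.compile(r"fix.*test", re.IGNORECASE)


def tally_commits(messages: Iterable[str]) -> Dict[str, int]:
    """Count commit messages in the categories used on the release 10.6 slides."""
    counts = {"total": 0, "swearing": 0, "no message": 0, "fix.*test": 0}
    for message in messages:
        counts["total"] += 1
        text = message.strip()
        if not text:
            # Empty or whitespace-only messages count as "no message".
            counts["no message"] += 1
            continue
        if SWEARING.search(text):
            counts["swearing"] += 1
        if FIX_TEST.search(text):
            counts["fix.*test"] += 1
    return counts


log = [
    "fix wmv test and make it stable when config changes via local yml file",
    "fix transcode bumpers test",
    "",
    "bump",
]
print(tally_commits(log))  # {'total': 4, 'swearing': 0, 'no message': 1, 'fix.*test': 2}
```

A message can land in more than one category (swearing and fixing tests, say), so the category counts need not add up to the total.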
  • 35. We have a problem.
  • 36. We have no idea what we are deploying We have a problem.
  • 37. No standard deployment procedure Someone goes into a trance-like state and experiments with substances (coffee) and tar files for a day. Different people deployed the product in different ways, giving inconsistent results.
  • 38. cc by 2.0; And whenever we deploy, something catches fire.
  • 39. bit.ly/1btCODY Or sometimes, everything catches fire. Or at least that’s what it felt like.
  • 40. Bad tests; terrible code; slow development; huge, infrequent releases; deployment is slow and unreliable, followed by days repairing the damage. Summary: so with all this failure as a team, what did we do?
  • 41. [various suppliers of doughnuts are available] We rewarded ourselves with doughnuts! Releases took a long time to create, test, deploy, and extinguish, so we must have done a good job, right? Doughnuts.
  • 42. Delivery = doughnuts! We learnt that delivery == doughnuts.
  • 43. Autumn 2012 Summer 2012: we’re creating a new workflow in the cloud to put iPlayer content onto Sky set-top boxes, and starting on Video Factory. Better software, better architecture, better practices: more Jenkins, more BDD, some automated testing. But deployment is still hard, so we’re still not deploying often enough.
  • 44. Continuous Delivery We know what we need to do: smaller releases, more often. Continuous Delivery.
  • 45. We’d just had the London 2012 Olympics. You could tell, because the branding was everywhere in our building. Just in case we forgot, like.
  • 49. The Olympics – branding was everywhere.
  • 51. But then suddenly, the Olympics branding was gone, and it was replaced by this: our Top 5 Priorities, writ large on the very walls. There it was, right in front of us: Priority: Continuous Delivery. But we were scared of this. Every time we deploy, things catch fire. And you want us to deploy more often? Uh, huh.
  • 52. Continuous Destruction So semi-jokingly in the team, we called it Continuous Destruction. We were scared of this.
  • 54. Delivery = doughnuts But then we started to rationalise it. We’ve already learned that delivery == doughnuts, so maybe…
  • 58. Summer 2013 Summer 2013: Video Factory is ready, built as 25 microservices instead of the previous monolith. By now deployment is easy: we’ve worked closely with another team in the BBC to help them develop the deployment system. I’ll talk about some of the other changes we made to help achieve this in a moment.
  • 59-62. Deployment weekly averages (total for 10 weeks, divided by 10): int: 140; test: 38; live: 26. So now we can deploy quite a lot, not just 2 or 3 times per year.
  • 63. [Chart: Deployments by day of week, Monday to Sunday (total for 10 weeks, divided by 10)] We peak at just over 50 deployments per day. (The majority are on the int environment, because there a deploy happens whenever we commit something.)
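The averages on these slides are simple arithmetic over a deployment log. You count deployments per environment (or per weekday) across the 10-week window, then divide by 10. Below is a minimal Python sketch. It assumes a hypothetical log of `(date, environment)` records; the record format and the function names are illustrative, not the team's actual tooling.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

WINDOW_WEEKS = 10  # the slides average over a 10-week window
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Hypothetical record format: (day of deployment, environment name).
Deployment = Tuple[date, str]


def weekly_average_by_env(log: Iterable[Deployment], weeks: int = WINDOW_WEEKS) -> Dict[str, float]:
    """Total deployments per environment over the window, divided by the number of weeks."""
    totals = Counter(env for _, env in log)
    return {env: count / weeks for env, count in totals.items()}


def average_by_weekday(log: Iterable[Deployment], weeks: int = WINDOW_WEEKS) -> Dict[str, float]:
    """Total deployments per day of the week over the window, divided by the number of weeks.

    Every weekday is reported, including days with no deployments at all.
    """
    totals = Counter(DAY_NAMES[day.weekday()] for day, _ in log)
    return {name: totals[name] / weeks for name in DAY_NAMES}


# Synthetic log, built only to reproduce the int and live averages on the slides.
log = [(date(2024, 1, 1), "int")] * 1400 + [(date(2024, 1, 2), "live")] * 260
print(weekly_average_by_env(log))  # {'int': 140.0, 'live': 26.0}
```

The weekday function divides by the number of weeks, not by the number of days. Each bar in the chart is therefore an average for that weekday.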
  • 64. Summer 2014 Summer 2014: we’re now up to 75 microservices – three times bigger. We’ve gone from barely being able to make any changes without things catching fire to threefold growth in a year: sustainable growth via better tooling.
  • 65. Automating the build wasn’t enough “Just adding build automation wasn’t enough – deploying was still hard, which meant we didn’t deploy very often, which meant that each deployment was big, and was therefore risky.” “To reduce the latency and risk, we needed to be able to deploy more quickly, more often: Continuous Delivery. But what did that mean for us as a team?”
  • 66. Continuous Discovery Part II This is a process of Continuous Discovery: this is simply what we’ve done, and learnt, so far. But we’re still learning, still adapting. I’m going to focus on just a few areas.
  • 67. It takes the whole team The whole team is involved in delivery – that’s why the team is who it is. For us, that means product owners, project managers, architects, software engineers, testers, support. Everyone has their part to play in CD. Everyone needs to adapt, and everyone benefits: communication within the team happens earlier and continuously, and we get quicker feedback from each small step, so we know what step to take next.
  • 68. Automated checking Hopefully obvious. We had it already, before Continuous Delivery, but it’s even more critical afterwards.
  • 69. Being the masters of our own destiny We needed to be in control of our own destiny. Hence we couldn’t have someone else telling us when we could or couldn’t deploy, and we couldn’t have someone else doing the deployments for us. Therefore we needed to perform the deployments ourselves. There was some inertia within the organisation that we had to overcome.
  • 70. “Do you want support with that?” We choose our own level of support. For us in Media Services, almost everything we do needs 24/7 support – downtime is never OK. But for each service, we evaluate the level of support that service requires.
  • 71-74. Media Services Support: 1. The BBC’s central 24/7 operations team; 2. Our own support team; 3. The development team. In-hours: team, to team, to team. Out of hours: team, to individual, to individual.
  • 75. Challenges Providing (paid) out-of-hours support is not mandatory for our software engineers. We ask them if they’re willing to opt in. Some do, some don’t. If not enough people opt in, support might not be viable. For us, about 5 or 6 out of 30 have opted in. That seems to be just about enough. They hardly ever get called, but it’s good to know that someone’s there if needs be.
  • 76. Challenges: How many people opt in?
  • 77. Challenges Understanding the product Our product estate is big. Audio simulcast is rather different from video on-demand, for example. Mitigating factors: there are lots of common patterns, which help us make educated guesses; and we only allow ourselves simple operations out of hours.
  • 78. Challenges What do I do if I get called?
  • 79. What might we do, for out-of-hours support? Specifically: NOT code changes. Roll back (small roll forward, therefore small roll back). Retry (e.g. a failed message). Kill it (killing things is normal, meh. Chaos Monkey). Failover. Scale it. With those few options, it turns out you can fix quite a lot.
  • 80-84. Code changes (not out of hours); Roll back; Retry it; Kill it; Scale up or down
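“Retry it” is one of the few operations allowed out of hours. The talk doesn't say how failed messages are stored. Purely as an illustration, assume they end up on an Amazon SQS dead-letter queue; the sketch below re-drives them onto the main queue. Here `sqs` is any client exposing the boto3 SQS methods `receive_message`, `send_message` and `delete_message` (e.g. `boto3.client("sqs")`). The queue URLs and the function name `redrive` are placeholders, not part of the original system.

```python
from typing import Any


def redrive(sqs: Any, dlq_url: str, main_queue_url: str, max_messages: int = 100) -> int:
    """Move up to max_messages from a dead-letter queue back onto the main queue.

    Assumes an SQS-style client (e.g. boto3.client("sqs")); returns how many were moved.
    """
    moved = 0
    while moved < max_messages:
        response = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=min(10, max_messages - moved),  # SQS allows 1-10 per call
            WaitTimeSeconds=1,
        )
        messages = response.get("Messages", [])  # key is absent when the queue is empty
        if not messages:
            break
        for message in messages:
            # Send first, then delete: a crash in between duplicates the message
            # rather than losing it. Message attributes are not copied here.
            sqs.send_message(QueueUrl=main_queue_url, MessageBody=message["Body"])
            sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=message["ReceiptHandle"])
            moved += 1
    return moved
```

The send-then-delete order means consumers must tolerate the occasional duplicate. That fits the "small, safe operations only" rule: a retry should never lose work.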
  • 85. Decisions, Decisions … Part III Including the whole team, with close communication; automated checking; being in control; adapting to support. That’s what it meant for us. But what might it mean for you? Continuous Delivery at the BBC means flexibility. Pick and choose what’s right for you.
  • 86. I have never used the in-house hosting platform A caveat / confession. The BBC does have an in-house hosting platform, but none of my products have ever used it. But I’m going to compare the steps for getting something live before, vs after, Continuous Delivery.
  • 87. The documentation describing the old (pre-Continuous-Delivery) process for doing software on the in-house hosting platform. Lots of mandatory steps.
  • 88. (goes on for 2880 words)
  • 89-92. The (unofficial) new Pipeline: 1. Get an AWS account; 2. Get Infosec approval; 3. Go live. My unofficial take on the new, simplified mandatory steps. Actually, even step 1 is optional, but you’ve got to have somewhere to host it.
  • 93. Optional extras However, although almost everything is now optional, you might want to do them anyway. It’s up to you. You don’t have to have a decent architecture…
  • 94. Optional extras Decent architecture However, although almost everything is now optional, you might want to do them anyway. It’s up to you. You don’t have to have a decent architecture…
  • 95. Following your Technical Architect’s advice will make your product more successful.
  • 96. Ditto for: decent engineering, decent product management, etc.
  • 97. Optional extras More things you don’t have to do. Your call.
  • 98. Optional extras Using the standard build tool More things you don’t have to do. Your call.
  • 99. Optional extras Using the standard build tool Continuous Integration More things you don’t have to do. Your call.
  • 100. Optional extras Using the standard build tool Continuous Integration Repeatable builds More things you don’t have to do. Your call.
  • 101. Optional extras Using the standard build tool Continuous Integration Repeatable builds Builds More things you don’t have to do. Your call.
  • 102. An efficient, repeatable build chain makes your product more reliable.
  • 103. Optional extras You don’t have to do these things…
  • 104. Optional extras Run Book You don’t have to do these things…
  • 105. Optional extras Run Book Demos You don’t have to do these things…
  • 106. Optional extras Run Book Demos Telling anybody anything You don’t have to do these things…
  • 107. If you help the support team, they can help you. If you tell people what your product is, what it does, how it works, etc., then they can help you, for example when things go wrong.
  • 108. Optional extras You don’t have to do these!
  • 109. Optional extras Out-of-hours support You don’t have to do these!
  • 110. Optional extras Out-of-hours support In-hours support You don’t have to do these!
  • 111. Optional extras Out-of-hours support In-hours support Giving a damn about anything You don’t have to do these!
  • 112. If you care about your product, other people will care too. But you might want to :-)
  • 113. Optional extras You don’t have to do these things either…
  • 114. Optional extras Monitoring You don’t have to do these things either…
  • 115. Optional extras Monitoring Load testing You don’t have to do these things either…
  • 116. Optional extras Monitoring Load testing Integration testing You don’t have to do these things either…
  • 117. Optional extras Monitoring Load testing Integration testing Component testing You don’t have to do these things either…
  • 118. Optional extras Monitoring Load testing Integration testing Component testing Unit testing You don’t have to do these things either…
  • 119. Optional extras Any concept of success whatsoever
  • 120. If you care about something, monitor / test it. Monitoring ftw.
  • 121. Test-Driven Development In fact, let’s take this one step further. If we think that TDD is a good thing…
  • 122. Monitoring Load testing Integration testing Component testing Unit testing And these four things are automated checks for correct behaviour that we usually apply before production, i.e. the things we often do TDD on… Then why did we miss out Monitoring? Monitoring is an automated check for correct behaviour that we usually apply after production. But why not let that drive our development process in the same way?
  • 123. Monitoring Load testing Integration testing Component testing Unit testing And these four things are automated checks for correct behaviour that we usually apply before production, i.e. the things we often do TDD on… Then why did we miss out Monitoring? Monitoring is an automated check for correct behaviour that we usually apply after production. But why not let that drive our development process in the same way?
  • 125. Monitoring-Driven Development Monitoring-Driven Development: define how you’re going to monitor this behaviour that you want, when it’s live. How do you know if it’s working? Create the alarm first, before you create the behaviour. The alarm goes red, unhappy. Good: now you know that you need to do some work to create that behaviour. Do that, release it. Then the alarm clears, is happy. So straight away, you know that this thing is monitored, from day 1, and that the monitoring works.
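The Monitoring-Driven Development loop (alarm first, red, build, release, green) can be sketched in a few lines. This is a minimal illustration under stated assumptions: a plain dict stands in for a real metrics store, and the `Alarm` class, the `publish` function and the "published in the last hour" behaviour are all hypothetical, not taken from Video Factory.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical metric store: in a real system these numbers would come from
# your monitoring service; here it is just a dict.
Metrics = Dict[str, float]


@dataclass
class Alarm:
    """A minimal alarm: red ("ALARM") unless its check passes."""
    name: str
    check: Callable[[Metrics], bool]

    def state(self, metrics: Metrics) -> str:
        return "OK" if self.check(metrics) else "ALARM"


# Step 1: define the alarm for the behaviour we want, before building it.
# Hypothetical behaviour: "at least one programme published in the last hour".
published_recently = Alarm(
    name="programmes-published-last-hour",
    check=lambda m: m.get("published_last_hour", 0) >= 1,
)


def publish(metrics: Metrics) -> None:
    """Stand-in for the new behaviour; records one publication."""
    metrics["published_last_hour"] = metrics.get("published_last_hour", 0) + 1


if __name__ == "__main__":
    live_metrics: Metrics = {}
    # Step 2: the alarm is red, because the behaviour doesn't exist yet.
    print(published_recently.state(live_metrics))  # ALARM
    # Step 3: build and release the behaviour.
    publish(live_metrics)
    # Step 4: the alarm clears, showing the feature and its monitoring both work.
    print(published_recently.state(live_metrics))  # OK
```

Because the alarm exists before the feature, its first "ALARM" state is itself a test of the monitoring: an alarm that never goes red would be caught straight away.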
  • 126. As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  • 127. Responsibility As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  • 128. Responsibility Power As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  • 129. Responsibility Power Better product As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  • 130. Responsibility Power Better product Take the credit As a team you have extra responsibility to ensure things happen. But also the extra power to make sure they do. And it leads to a better product. And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You made the decisions: you enacted them.
  • 131. Tyler now represents not our failure, but our innovation. He has a new, innovative use, keeping hold of our video adapter. He lives in that silver cup, the BBC Digital Platform Innovation Award.
  • 132. You can see it engraved there with our team name. Yes, they got the year wrong.
  • 136. Continuous Delivery sounded scary In 2010, we were making large, infrequent releases, and after every release we’d spend days or weeks putting fires out. Over the next 2 or 3 years we adopted Continuous Delivery. It sounded scary at first; we were afraid we'd just break things more often. But the feared “Continuous Destruction” never happened. In fact, it turned out to be absolutely critical to Video Factory's success.
  • 137. Change the team. We had to change. That change didn’t happen overnight, and it wouldn't have worked without the whole team being involved. As part of Continuous Delivery, we’re now in control of our own testing, and our own deployments; and we choose what level of support our product needs.
  • 138. Change the team. Be in control of your product. We had to change. That change didn’t happen overnight, and it wouldn't have worked without the whole team being involved. As part of Continuous Delivery, we’re now in control of our own testing, and our own deployments; and we choose what level of support our product needs.
  • 139. Smaller, safer changes. Change the team. Be in control of your product. Deployment is now literally an everyday occurrence. We have a steady flow of changes, each of which is smaller and safer, and we get much quicker feedback on the results of each stage.
  • 140. Smaller, safer changes. Rapid feedback. Change the team. Be in control of your product. Deployment is now literally an everyday occurrence. We have a steady flow of changes, each of which is smaller and safer, and we get much quicker feedback on the results of each stage.
  • 141. Smaller, safer changes. Rapid feedback. Change the team. Be in control of your product. Smaller, safer changes. Having a more stable product of course gives a better experience for the audience; but adopting Continuous Delivery has also made working in the team more enjoyable. And with the faster feedback, we’re able to deliver features and fixes, and so deliver value, more quickly and more reliably. We change things more often, and more safely. But also, we can experiment; we can try stuff out; we can innovate.
  • 142. Continuous Delivery enabled us to create Video Factory. Continuous Delivery enabled us to create Video Factory – new architecture, new code, new platform – in just 12 months.
  • 143. What could it enable you to create? What could it enable you to create?