Destruction, Decapods and Doughnuts: Continuous Delivery for Audio & Video Factory

BBC Digital Platform Media Services
Rachel Evans
rachel.evans@bbc.co.uk
@rvedotrc

☹
A few years ago, the system that handled video publication for iPlayer was unreliable. Programmes were often missing, or
published late.

What to do?

We committed to killing it.

We committed so much, we deliberately declined to renew the third-party contract which iPlayer relied upon. The system’s
fate was sealed: on 1st October 2013, it would stop working.

All we had to do was build a completely new replacement system, and we had a little over a year in which to do it.

The trouble was, the old system was a half-million-line codebase, we were in the habit of releasing only two or three times a
year, and every time we released, things broke.

This was the situation we had deliberately, knowingly placed ourselves in. We had just over a year not only to build a
complete replacement, but to re-learn how to develop, test, release, and support software.

“Publish all BBC AV media
produced for IP platforms”
My name’s Rachel Evans, I’m a Principal Software Engineer in Media Services, part of the BBC Digital Platform.

Our mission is to “Publish all BBC Audio/Video media produced for IP Platforms”. So if you’ve ever watched a BBC News
clip online, or listened to a BBC podcast, or if you watched the Olympics online in 2012, or watched iPlayer, either live or
on-demand, or listened to iPlayer Radio – if you’ve done any of those things, then you’ve used our products.

This, then, is my team’s story: the story of how we changed the way we make software, and how that enabled us to
successfully launch Video Factory and Audio Factory – the media processing systems that now power iPlayer.

And what it all has to do with a toy crab who lives in a silver trophy.

Media Services:
a history
Part I
I joined the BBC in 2007, and I’ve been with Media Services since 2009, but this story really starts in Summer 2010.

Summer 2010
This was basically the low point for us as a team – when we were at our least effective.

Absolutely no
automated tests
Audio On Demand
This session is about Testing…
Our Audio-On-Demand codebase had zero unit tests. Not one. Which, if you know anything about software engineering,
you’ll know is a bad thing. We hadn’t defined what this product was meant to do. Every time we made a change, we had no
way of knowing whether or not it worked, because we hadn’t defined what “worked” meant.

Video On Demand
Our video-on-demand product did have unit tests, but they took 90 minutes to run, which is a very long time. Which meant
that people got lazy, and didn’t always bother to run the tests.
And the code coverage – that is, how much of the product we were actually testing – well, we let the coverage run go for
4 days, and it still hadn’t finished, so we killed it. So we had no idea how much of the product we were actually testing.

Not-really-unit tests: 90 minutes
Code coverage: killed after 4 days
Video On Demand

Patch, patch, patch, …
We didn’t always build and deploy things cleanly, because building and deploying were slow.
So, all too often, we’d apply patches, sometimes directly to the live system.
We knew this was a bad thing, a bad habit to get into. But we did it anyway.
It was our team’s dirty open secret.

“Patch Club”
So we called it “Patch Club”.

The first rule of
Patch Club is,
you don’t talk
about Patch Club.
Of course, the first rule of Patch Club is, you don’t talk about Patch Club. Everyone knows that!

Patch Club
Err, OK. Better not talk about “Patch Club” in public.

P***h C**b
So in team online chats, we started calling it “P***h C**b”. And the joke then became, “What could P…h C..b mean?”
One day our colleague Mike brought in two mysterious artefacts…

Pooch Comb
Nah. That doesn’t feel right.

Plush Crab “Tyler”
Our team’s resident decapod, and his name is Tyler.

Although he’s lovely, Tyler is a symbol of our failure to properly develop and deploy working software.

Let’s ship it!
Eventually, a few times a year, we decided that we had so many undeployed commits that it was time to release them.

What’s in release 10.6?
This is us, in July 2010, trying to work out what’s new in this release that we’re about to deploy. We don’t know for sure: we
make our best guess.

simplify bumper, bumper_in (howet03)
added drop table statements for repairing fuckup with db backup (howet03)
moved view to view file (howet03)
fix wmv test and make it stable when config changes via local yml file (howet03)
fix transcode bumpers test (howet03)
2057-Console-transcode-task-page-showing-status-as-o (murrac21)
bump (howet03)
Fix for AutoQCPassed test, from R10.4B (alexb)
2076-Console-version-page-Add-Asset-button (murrac21)
https://jira.dev.bbc.co.uk/browse/NEWWORLD-2061 https://jira.dev.bbc.co.uk/browse/NEWWORLD-2076 Merging asset changes (dbennett)
Removed hard-wired TERMs, for those of us not on TERM=xterm (evansd17)
Added "db" ops script, and use this in the other scripts (evansd17)
MySQL query optimisation for monitor_schedule_item (evansd17)
Merge of test fixes and qc domain hack round potential for verified asset records
lacking mtimes. (alexb)
The change log.

removed (howet03)
Changes by Tom H: - support passing plugin name explicitly on the command line
- use wfe env vars when connecting to the db (weyt03)
WORKFLOWENGINE-83 Delta worklist filtering on profile (evansd17)
Give up after 2 days (mary)
Self polling set to five minutes to avoid thrashing under heavy asset registration
load (dbennett)
add index db deltas for asset_file (marcus)
swop constraints (howet03)
merge from R10.5A.22 (marcus)
merge of 10.5A changes into trunk (howet03)
add indecies to various ingest_metadata and ingest_task columns (marcus)
148-Seeding-Bug - rework seed creation in scheduler (howet03)
To recover from unexpected PIPs outages (mary)
Merge of Andy's 209-Fix_MAD_for_3G branch. (alexb)
New cut of R10.6 from trunk (including new seeding fixes, and 209 - Fix MAD for
3G) (alexb)
The change log.

1 commit with swearing,
4 commits with no message,
15 commits which talk about “fixing tests” (well, you should have run the tests before you committed, huh? But we know
that the tests took 90 minutes to run, so it’s not surprising that people got lazy).

swearing: 1
no message: 4
/fix.*test/: 15

202
But in total, 202 commits! That’s a lot.
And we’re not even sure that that’s right.
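
A tally like the one above can be produced mechanically from a list of commit messages. The following is only a sketch of the idea: the function name `tally_commits`, the stand-in word list, and the sample messages in the usage note are assumptions made for illustration. Only the `fix.*test` pattern comes from the slide.

```python
import re
from collections import Counter

# Hypothetical stand-in word list; a real audit would need a fuller one.
SWEAR_WORDS = {"damn", "bloody"}

# The pattern shown on the slide, matched case-insensitively.
FIX_TEST = re.compile(r"fix.*test", re.IGNORECASE)


def tally_commits(messages):
    """Count the total, empty, sweary and test-fixing commit messages."""
    counts = Counter(total=0, swearing=0, no_message=0, fix_test=0)
    for message in messages:
        counts["total"] += 1
        text = message.strip()
        if not text:
            # Whitespace-only messages count as "no message".
            counts["no_message"] += 1
            continue
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & SWEAR_WORDS:
            counts["swearing"] += 1
        if FIX_TEST.search(text):
            counts["fix_test"] += 1
    return counts
```

For example, `tally_commits(["fix wmv test", "", "bump"])` counts three commits, one of them with no message and one that talks about fixing a test.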

We have no idea
what we are
deploying
We have a problem.

No standard
deployment procedure
Someone goes into a trance-like state and experiments with substances (coffee) and tar files for a day.
Different people deployed the product in different ways, giving inconsistent results.

And whenever we deploy, something catches fire.

bit.ly/1btCODY
Or sometimes, everything catches fire.
Or at least that’s what it felt like.

Bad tests
Terrible code
Slow development
Huge, infrequent releases
Deployment is slow and unreliable
Followed by days repairing the damage
Summary.
So with all this failure as a team, what did we do?

[various suppliers of
doughnuts are
available]
We rewarded ourselves with doughnuts!
Releases took a long time to create, test, deploy, and extinguish, so we must have done a good job, right?
Doughnuts.

Delivery = doughnuts!
We learnt that delivery == doughnuts.

Autumn 2012
Autumn 2012: We’re creating a new workflow in the cloud to put iPlayer content onto Sky set-top boxes, and this is the start
of Video Factory.

Better software, better architecture, better practices.

More Jenkins. More BDD, some automated testing.

But deployment is still hard, so we’re still not deploying often enough.

Continuous Delivery
We know what we need to do: smaller releases, more often. Continuous Delivery.

We’d just had the London 2012 Olympics. You could tell, because the branding was everywhere in our building. Just in case
we forgot, like.

The Olympics – branding was everywhere.

But then suddenly, the Olympics branding was gone, and it was replaced by this: our Top 5 Priorities, writ large on the very
walls.
There it was, right in front of us: Priority: Continuous Delivery.
But we were scared of this. Every time we deploy, things catch fire. And you want us to deploy more often? Uh, huh.

Continuous Destruction
So semi-jokingly in the team, we called it Continuous Destruction.
We were scared of this.

Continuous Disaster
“Continuous Disaster” was another name we used.

Delivery = doughnuts
But then we started to rationalise it. We’ve already learned that delivery == doughnuts, so maybe…

Continuous Doughnuts
It’s Continuous Doughnuts.

Summer 2013
Summer 2013: Video Factory is ready. 25 microservices instead of the previous monolith.

Deployment by now is easy: we’ve worked closely with another team in the BBC to help them develop the deployment
system.

I’ll talk about some of the other changes we made to help achieve this in a moment.

Deployment weekly averages
(total for 10 weeks, divided by 10)
int: 140
test: 38
live: 26
So, now we can deploy quite a lot – not just 2 or 3 times per year.

Deployments by day of week
[Bar chart: Monday to Sunday on the horizontal axis; vertical axis from 0 to 60 deployments]
We peak at just over 50 deployments per day. (The majority is on the int environment, because there, the deploy
happens whenever we commit something).
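
The two summaries above, a weekly average per environment (total over the period divided by the number of weeks) and a count per day of the week, are simple to compute from a deployment log. This is a sketch under assumptions: the record shape, an `(environment, date)` pair, and the function names are hypothetical, and the sample data in the usage note is invented.

```python
from collections import Counter
from datetime import date

# Fixed names, indexed by date.weekday() (Monday is 0).
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def weekly_averages(deployments, weeks):
    """Average deployments per week for each environment.

    deployments: iterable of (environment, date) pairs.
    weeks: length of the period covered, in weeks.
    """
    totals = Counter(environment for environment, _ in deployments)
    return {environment: total / weeks for environment, total in totals.items()}


def deployments_by_weekday(deployments):
    """Count deployments per day of the week, across all environments."""
    return Counter(WEEKDAYS[day.weekday()] for _, day in deployments)
```

For instance, three `int` deployments and one `live` deployment logged over two weeks give averages of 1.5 and 0.5 per week respectively.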

Summer 2014
Summer 2014: we’re now up to 75 microservices – three times bigger. We’ve gone from barely being able to make any
changes without things catching fire, to threefold growth in a year.

Sustainable growth via better tooling.

Automating the build
wasn’t enough
“Just adding build automation wasn’t enough – deploying was still hard, which meant we didn’t deploy very often,
which meant that each deployment was big, and was therefore risky.”
“To reduce the latency and risk, we needed to be able to deploy more quickly, more often: Continuous Delivery. But
what did that mean for us as a team?”

Continuous Discovery
Part II
This is a process of Continuous Discovery: this is simply what we’ve done, and learnt, so far. But we’re still learning, still
adapting.

Going to focus on just a few areas.

It takes the whole team
The whole team is involved in delivery – that’s why the team is who it is.

For us, that means product owners, project managers, architects, software engineers, testers, support.

Everyone has their part to play in CD. Everyone needs to adapt, and benefits.

Earlier, continuous communication within the team. Quicker feedback from each small step, so we know what step to take
next.

Automated checking
Hopefully obvious. We had it already, before Continuous Delivery, but it’s even more critical afterwards.

Being the masters
of our own destiny
We needed to be in control of our own destiny

Hence can’t have someone else telling us when we can or can’t deploy

Can’t have someone else doing the deployments for us

Therefore we needed to perform the deployments ourselves

There was some inertia within the organisation that we had to overcome.

“Do you want
support with that?”
We choose our own level of support.

For us in Media Services, almost everything we do needs 24/7 support – downtime is never OK. But, for each service,
evaluate the required level of support for that service.

Media Services Support
In-hours: team, to team, to team.

Out of hours: team, to individual, to individual.

1. The BBC’s central 24/7 operations team
2. Our own support team
3. The development team


Challenges
How many people opt in?
Providing (paid) out-of-hours support is not mandatory for our software engineers. We ask them if they’re willing to opt in.
Some do, some don’t. If not enough people opt in, support might not be viable.
For us, about 5 or 6 out of 30 have opted in. That seems to be just about enough. They hardly ever get called, but it’s good
to know that someone’s there if needs be.

Challenges
Understanding the product
Our product estate is big. Audio simulcast is rather different from video on-demand, for example.

Mitigating factors:

- lots of common patterns, which help us make educated guesses

- we only allow ourselves simple operations out-of-hours

Challenges
What do I do if I get called?

What might we do, for out-of-hours support?

Specifically: NOT code changes.

Roll back (small roll forward, therefore small roll back).

Retry (e.g. a failed message).

Kill it (killing things is normal, meh. Chaos Monkey). Failover.

Scale it.

With those few options, turns out you can fix quite a lot.
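
The rule above (out of hours, only simple operations, and never code changes) could be written down as a small guard. This is a sketch of the policy only, not a real tool: the `Action` names and the `is_permitted` function are hypothetical.

```python
from enum import Enum


class Action(Enum):
    CODE_CHANGE = "code change"
    ROLL_BACK = "roll back"
    RETRY = "retry"
    KILL = "kill"
    FAIL_OVER = "fail over"
    SCALE = "scale"


# Everything on the list above except code changes.
OUT_OF_HOURS_ALLOWED = frozenset({
    Action.ROLL_BACK,
    Action.RETRY,
    Action.KILL,
    Action.FAIL_OVER,
    Action.SCALE,
})


def is_permitted(action, out_of_hours):
    """In hours, anything goes; out of hours, only the simple operations."""
    return (not out_of_hours) or action in OUT_OF_HOURS_ALLOWED
```

So `is_permitted(Action.ROLL_BACK, out_of_hours=True)` is true, while `is_permitted(Action.CODE_CHANGE, out_of_hours=True)` is false.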

Code changes
Roll back
Retry it
Kill it
Scale up or down



Decisions, Decisions …
Part III
Including the whole team, with close communication; automated checking; being in control; adapting to support.
That’s what it meant for us.

But what might it mean for you?

Continuous Delivery at the BBC means flexibility. Pick and choose what’s right for you.

I have never used
the in-house
hosting platform
A caveat / confession.
The BBC does have an in-house hosting platform, but none of my products have ever used it. But I’m going to compare
the steps for getting something live before, vs after, Continuous Delivery.

The documentation describing the old (pre-Continuous-Delivery) process for doing software on the in-house hosting
platform.
Lots of mandatory steps.

The (unofficial) new Pipeline:
My unofficial take on the new, simplified mandatory steps.
Actually, even step 1 is optional. But you’ve got to have somewhere to host it.

1. Get an AWS account
2. Get Infosec approval
3. Go live

Optional extras
Decent architecture
However, although almost everything is now optional, you might want to do these things anyway. It’s up to you.
You don’t have to have a decent architecture…

Following your Technical
Architect’s advice will make
your product more successful.

Ditto for:
decent engineering,
decent product management,
etc

Optional extras
Using the standard build tool
Continuous Integration
Repeatable builds
Builds
More things you don’t have to do. Your call.

An efficient, repeatable build chain
makes your product more reliable.

Optional extras
Run Book
Demos
Telling anybody anything
You don’t have to do these things…

If you help the support team,
they can help you.
If you tell people what your product is, does, how it works, etc. then they can help you, for example when things go wrong.

Optional extras
Out-of-hours support
In-hours support
Giving a damn about anything
You don’t have to do these!

If you care about your product,
other people will care too.
But you might want to :-)

Optional extras
Monitoring
Load testing
Integration testing
Component testing
Unit testing
You don’t have to do these things either…

Optional extras
Any concept of success whatsoever

If you care about something,
monitor / test it.
Monitoring ftw.

Test-Driven
Development
In fact, let’s take this one step further.

If we think that TDD is a good thing…

Monitoring
Load testing
Integration testing
Component testing
Unit testing
And the last four of these (load, integration, component and unit testing) are automated checks for correct behaviour that
we usually apply before production, i.e. the things we often do TDD on…

Then why did we miss out Monitoring?

Monitoring is an automated check for correct behaviour that we usually apply in production, after release. But why not let
that drive our development process in the same way?

Monitoring-Driven
Development
Monitoring-Driven Development: define how you’re going to monitor this behaviour that you want, when it’s live. How do you
know if it’s working?

Create the alarm first, before you create the behaviour. The alarm goes red, unhappy. Good: now you know that you need to
do some work to create that behaviour.

Do that, release it.

Then the alarm clears, is happy. So straight away, you know that this thing is monitored, from day 1, and that the monitoring
works.
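
The alarm-first cycle described above can be sketched in a few lines. Everything named here is hypothetical: the `Alarm` class, the `published` set standing in for the live system's state, and the `publish` function standing in for the behaviour being built. A real setup would use whatever monitoring service the team already has.

```python
class Alarm:
    """A named check that reports OK or ALARM."""

    def __init__(self, name, check):
        self.name = name
        self.check = check  # zero-argument callable, true when healthy

    def state(self):
        # A check that raises is treated as unhealthy rather than as a crash.
        try:
            healthy = bool(self.check())
        except Exception:
            healthy = False
        return "OK" if healthy else "ALARM"


# Step 1: define the alarm before the behaviour exists.
published = set()  # hypothetical stand-in for the live system's state
alarm = Alarm("episode-published", lambda: "episode-1" in published)
state_before = alarm.state()  # the behaviour is missing, so the alarm is red


# Step 2: build and release the behaviour.
def publish(programme_id):
    published.add(programme_id)


publish("episode-1")
state_after = alarm.state()  # the alarm clears once the behaviour is live
```

The point of the ordering is that the alarm is seen in both states: red proves it can detect the missing behaviour, and green proves the release delivered it.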

As a team you have extra responsibility to ensure things happen.
But also the extra power to make sure they do.
And it leads to a better product.
And, when things go well, you as a team get to take all the credit! Nobody else did the deployments for you, etc. You
made the decisions: you enacted them.

Responsibility
Power
Better product
Take the credit

Tyler now represents not our failure, but our innovation.

He has a new, innovative use, keeping hold of our video adapter.

He lives in that silver cup, the BBC Digital Platform Innovation Award.

You can see it engraved there with our team name.

Yes, they got the year wrong.

Continuous Delivery
sounded scary
In 2010, we were making large, infrequent releases, and after every release we’d spend days or weeks putting fires out.

Over the next 2 or 3 years we adopted Continuous Delivery. It sounded scary at first; we were afraid we’d just break things
more often. But the feared “Continuous Destruction” never happened. In fact, it turned out to be absolutely critical to Video
Factory’s success.

Change the team.
Be in control of your product.
We had to change. That change didn’t happen overnight, and it wouldn't have worked without the whole team being
involved.

As part of Continuous Delivery, we’re now in control of our own testing, and our own deployments; and we choose what
level of support our product needs.

Smaller, safer changes.
Change the team.
Deployment is now literally an everyday occurrence; we now have a steady flow of changes, each of which is smaller and
safer, and we have much quicker feedback of results from each stage.

Rapid feedback.
Change the team.
Having a more stable product of course gives a better experience for the audience; but additionally, adopting Continuous
Delivery has helped to make working in the team more enjoyable.

And with the faster feedback, we’re able to deliver features and fixes, and so deliver value, more quickly and more reliably.
We change things more often, and more safely. But also, we can experiment; we can try stuff out; we can innovate.

Continuous Delivery enabled us
to create Video Factory.
Continuous Delivery enabled us to create Video Factory – new architecture, new code, new platform – in just 12 months.

What could it enable
you to create?
What could it enable you to create?

Thank you
Rachel Evans
rachel.evans@bbc.co.uk
@rvedotrc
Digital
Platform Media Services
