At some point, the code you write today will be deleted and replaced with something new. This talk will discuss the life cycle of a large code base, and how to manage it over time to accommodate rewrites, giving examples from a major rewrite of the Firefox build and release pipeline over the last two years. You'll learn how to replace components of a running distributed system while keeping it operational, the proverbial replacing the wing of an airplane in flight.
From Hello World to
Kim Moir, Staff Release Engineer, Mozilla, @kmoir
Bonjour à toutes et à tous, hello. I’m very happy to see you all this morning. Je suis
très heureuse de vous voir tous ce matin. My name is Kim Moir and I’m a staff release
engineer at Mozilla. Montreal in January is only slightly colder than Ottawa in January
where I live, so I was not scared off by the weather.
I’ve been paid to work full-time in open source since 2001. Before that I worked in
government, education, and at other tech companies. Before that I was a student just
like you. We didn’t have email on our phones, in fact, we barely had email. I’ve been
working longer than most of you have been alive. But that’s okay. If I can survive 20+
years in the tech industry, so can you.
Mozilla is best known for building the Firefox web browser, as well as for its
mission to make the internet open and accessible to all. I don’t work on the Firefox
code base itself. As a release engineer, I write tools to scale our large build and
release pipeline that transforms the Firefox code into a shippable product. This
pipeline is a large distributed system. We are constantly optimizing this system to be
more scalable and more resilient to failure, and modifying the services it provides.
Outside of work, I like baking and running long distances. I have an amazing family
too! I put these pictures up here to show you that as a developer you can have a life
outside of work. Our industry tends to glamourize long hours at the keyboard at the
expense of everything else but it doesn’t have to be that way.
● The life cycle of code
● Distributed systems
● Replacing components of a running distributed system (in the context of
Firefox pipeline rewrite)
● You can try it too!
“And as everyone knows, the best kind of laughter is
laughter born of a shared memory.”
― Mindy Kaling, Why Not Me?
Let’s create some memories and talk about distributed systems and deleting code!
Hands up, how many of you have worked with a completely new code base in a work
context? How many of you have worked with an existing code base?
I’ve mentored a number of interns over the years, and one thing that I notice is that
many school assignments are based on a completely new code base. I understand
that this is done because everyone is learning language semantics, UI or testing
frameworks for the course curriculum.
In most companies, you will be looking at an existing code base. Even if you start your
own company, you will probably use existing open source or language-specific
libraries, or call existing APIs. So a really important skill is learning how to work
with an existing code base.
Photo by Markus Spiske on Unsplash
Photo by Francesco Gallarotti on Unsplash
Often an existing code base is like a very large, well-established forest that you need
to walk around in for a few hours, days or even a few weeks, just to understand how
it all works.
Photo by Koen Eijkelenboom on Unsplash
It’s also good to talk to other people that have wandered in the code before. What do
they know? What can you learn from them? Asking lots of questions as a software
engineer is one of the most important skills you can learn.
Healthy code bases and their teams
● Documented shipping and deployment processes that work
● Ship new binaries or provide updates on a regular basis
These are things that I look for when I look at a new code base. As a release
engineer, I’m biased toward these qualities because I really care about shipping.
Is the process for shipping documented?
Can more than one person ship the product, or is this a magical set of steps that only
one person knows how to execute?
How often do you deploy or update users?
Healthy code bases and their teams (cont’d)
● Readable code
● Tested code - correctness, integration, performance
● Feedback mechanism between developers and users
Is it readable code or is there dead code and tests?
Are there tests with a reasonable level of code coverage?
Where do you report bugs? Or request new features?
Is there telemetry that reports failures in the product automatically?
Healthy code bases and their teams (cont’d)
● Code ownership and review are shared among multiple people
● Ownership = responsibility for change
● This doesn’t mean that you have to do everything yourself
● You can serve as a code reviewer and mentor new people
● People need to CARE about the code and the people who use and maintain it
When I used to work in the Eclipse community many years ago, the project I worked
on didn’t have a code review process in place until a few weeks before each
release. The problem with this approach was that only a limited number of people
understood the different components. And when they decided to leave the
community, the expertise left with them. (This process has since changed and they do
have more code review in place.)
At Mozilla, we have the concept of module ownership and a robust code review
process. This helps a larger group of people understand components of the code
base because people are required to evaluate contributions. It also reduces the
bus-factor risk when people leave.
● Photo by John Baker on Unsplash
● Examples of old code bases actively being updated
○ Voyager space probes (~40 years)
○ Airplanes (~30 year service lifetime)
○ Industrial robots (~20 years)
○ The first Firefox release was over 15 years ago. I’m not sure how much
of the original code base remains. I often think that large code bases
are like the cells in a human body: over time, much will be replaced by
new, but eventually it will die.
Social implications of old code
Updating voyager software
Nasa retiring engineer voyager
There are also social issues in maintaining old code bases. For instance, last year
NASA was looking for a new developer to maintain its code base for the Voyager
space probes because the last of the original team members was getting ready to retire.
Firefox continuous integration
Builds x N
This is a very simplified diagram of the process that occurs when a developer lands
code on our build pipeline. With her commit, a decision graph is generated that lists
all the jobs that need to run. Then we build for four platforms - Linux, Mac, Windows
and Android. These builds are then signed, and we run unit tests and performance
tests so the developers can see the results of their commit. Did the tests fail? Or are
there performance regressions they need to address?
● Constraints - it needs to be up and running all the time for developers
● ~500 commits a day
● 140K jobs a day
The build and release pipeline for Firefox is a large distributed system. These are
some metrics about it.
● Developers love to ship. In order to ship, they need feedback on their
patches. Can I ship this? Or is there a regression that needs to be
backed out? It improves happiness if they can see the results of their work more
quickly
Photo by Uroš Jovičić on Unsplash
End to end times - This is the time from when a developer lands a commit until we are
able to ship the finished product.
Why are they important?
1. Landing small incremental patches reduces risk. On a high-velocity team with a huge
number of commits, it is otherwise too difficult to figure out what went wrong.
2. Zero-days - we need to be able to get security patches to our users quickly. For
instance, last week we shipped five releases to address the recent Meltdown
and Spectre vulnerabilities.
This is a picture of the Firefox release engineering pipeline from 3 years ago that
Selena Deckelmann created. At that time it took, optimistically, around 11 hours to
ship a release, from the time a developer landed a commit to builds being available to
users. You don’t have to understand or read all the components of this diagram, only
understand that it was scary and had many single points of failure and scalability
problems.
It now takes 4-5 hours from developer commit to builds we can ship.
Why did we rewrite?
● Developer autonomy
● Fail faster
● Better local and pipeline testing
● Change technology stack (Docker, microservices, graph generation,
optimization and transformation, task parallelization)
● Learn new things!
● So we decided to rewrite our existing pipeline to be more resilient and scalable
● Any developer can make changes to build and test configuration; previously,
releng was a blocker for these changes
● With every push to a repo, a decision graph is generated automatically.
Basically, it contains a list of all the tasks that need to run for that push, along
with their dependencies. If it fails, the builds aren’t run which saves
● Developers can also test these changes locally or on the build pipeline
● Photo by ARTHUR YAO on Unsplash
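To make the decision graph idea concrete, here is a minimal sketch in Python. It is not the real Firefox pipeline code: the task names and the `schedule` helper are hypothetical, and it only assumes that a graph is a mapping from each task to the tasks it depends on. If the graph is invalid (an unknown dependency or a cycle), nothing is scheduled at all.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names. Each key depends on the tasks in its value set.
TASKS = {
    "build-linux": set(),
    "build-windows": set(),
    "sign-linux": {"build-linux"},
    "sign-windows": {"build-windows"},
    "test-linux": {"sign-linux"},
    "test-windows": {"sign-windows"},
}


def schedule(tasks):
    """Return the tasks in an order that respects their dependencies.

    Returns an empty list when the graph is invalid, so that no builds
    are started for a push whose graph could not be generated.
    """
    known = set(tasks)
    for deps in tasks.values():
        if deps - known:
            # A task depends on something that is not in the graph.
            return []
    try:
        return list(TopologicalSorter(tasks).static_order())
    except CycleError:
        # Tasks that depend on each other can never run.
        return []


order = schedule(TASKS)
```

In this sketch every build comes before its signing task, and every signing task before its tests. Tasks with no dependency between them (the Linux and Windows chains) could run in parallel.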
Reasons not to rewrite?
● Failure is highly likely
● Really expensive
● May lose people on your team who aren’t interested in working on a new
You have to defer other project work because you are heads down on a rewrite.
There is also usually a huge learning curve if you are moving to a new technology
stack, not just for developers but for operations folks as well.
“A system that spans more than one physical
location and uses the related concepts of copying
and decoupling to improve operational efficiency
(speed, resilience) and, more recently, developer
efficiency (team productivity).”
If your system spans more than one location you can make it more resilient.
For instance, our pipeline uses Amazon instances to run builds and tests, and we run
these jobs in multiple Amazon regions which correspond to different geographic
Copying data means that it is available in more than one location, which is
another way to make the system more resilient. For instance, when we release
Firefox we release it from multiple CDNs.
Decoupling means that you have services that can operate on their own without
depending on other services being available.
Decoupled services usually communicate with each other via APIs.
This allows you to change the internal implementation of a service without other
services having to change the way they interact with it.
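As a small illustration of that last point, here is a sketch in Python. All of the names (`ArtifactStore`, `InMemoryStore`, `PrefixedStore`, `publish_build`) are made up for this example and are not part of the Firefox pipeline. The caller only knows the API, so the implementation behind it can be swapped without the caller changing.

```python
from typing import Protocol


class ArtifactStore(Protocol):
    """The API that other services rely on (hypothetical)."""

    def put(self, name: str, data: bytes) -> None: ...

    def get(self, name: str) -> bytes: ...


class InMemoryStore:
    """First implementation: keeps artifacts under their plain name."""

    def __init__(self):
        self._items = {}

    def put(self, name, data):
        self._items[name] = data

    def get(self, name):
        return self._items[name]


class PrefixedStore:
    """Replacement implementation: different internal layout, same API."""

    def __init__(self):
        self._items = {}

    def put(self, name, data):
        self._items["artifacts/" + name] = data

    def get(self, name):
        return self._items["artifacts/" + name]


def publish_build(store: ArtifactStore, name: str, data: bytes) -> bytes:
    """A client that depends only on the API, not on the implementation."""
    store.put(name, data)
    return store.get(name)


first = publish_build(InMemoryStore(), "firefox.tar", b"bits")
second = publish_build(PrefixedStore(), "firefox.tar", b"bits")
```

Both calls return the same bytes even though the two stores organize their data differently inside.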
In this approach you can also stop, start and replace parts of the system. With a
monolith, this is more difficult to do.
This approach also allows team members to work on different parts of the system
without everyone contending for the same resources.
Another reason that we use distributed systems is that they allow us to scale up
capacity incrementally by instantiating copies of existing services. For instance,
with our migration we ran many more services in parallel, which allowed the end to end
time for releases to drop significantly.
They also allow us to provide a reasonable level of service to clients.
Availability means we can always provide a predictable service to clients. Even if
there are issues like network problems, the system can appear available.
Why do we use distributed systems
Resilience, Performance & Availability
How to approach migration
● Incremental portions of pool
● Monitor capacity and wait times
● Monitor state after migration
● Rollback plan
● Decommission old
● Migrate more
● This is in the context of a large migration that we did at Mozilla where we
migrated components of our build and release pipeline to a new microservices
architecture and Docker
● Communicate - open an issue.
● Let people know via mailing list, Slack/IRC of timeframes for deletion
● Update issue tracker with plan and time
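The migration steps above can be sketched roughly as follows. This is a simplified illustration in Python, not the tooling we used: `migrate_in_batches` and the `is_healthy` callback are hypothetical, with the callback standing in for whatever monitoring of capacity and wait times you have.

```python
def migrate_in_batches(old_pool, new_pool, batch_size, is_healthy):
    """Move workers from old_pool to new_pool one batch at a time.

    After each batch, is_healthy(new_pool) is consulted. If it reports a
    problem, only that batch is rolled back and the migration stops.
    Returns True once the old pool has been fully drained.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    while old_pool:
        # Migrate an incremental portion of the pool.
        batch = old_pool[:batch_size]
        del old_pool[:batch_size]
        new_pool.extend(batch)
        # Monitor state after migration.
        if not is_healthy(new_pool):
            # Rollback plan: return the last batch to the old pool.
            del new_pool[-len(batch):]
            old_pool[:0] = batch
            return False
    return True


# Hypothetical worker names; the health check pretends that the new
# platform currently copes with at most four workers.
old_workers = ["w1", "w2", "w3", "w4", "w5"]
new_workers = []
finished = migrate_in_batches(
    old_workers, new_workers, 2, lambda pool: len(pool) <= 4
)
```

Here the first two batches are accepted and the third is rolled back, so `w5` stays in the old pool until the problem is understood. Decommissioning the old pool only makes sense once the function returns True.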
Strangler Application - Martin Fowler
From Jez Humble’s Continuous delivery page
“One pattern that is particularly valuable in this context is the strangler application. In
this pattern, we iteratively replace a monolithic architecture with a more
componentized one by ensuring that new work is done following the principles of a
service-oriented architecture, while accepting that the new architecture may well
delegate to the system it is replacing. Over time, more and more functionality will be
performed in the new architecture, and the old system being replaced is “strangled”.”
In Mozilla releng, we recently migrated from an old build job scheduling system called
Buildbot to one called Taskcluster.
One of the things that really helped us during this transition was an application
called Buildbot Bridge. This allowed us to schedule jobs on Taskcluster, but continue
to run them on Buildbot. This is similar to the dispatcher function shown in the
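Here is a minimal sketch of the strangler idea in Python. It does not show how Buildbot Bridge is actually implemented: the `Bridge`, `LegacyRunner` and `NewRunner` classes and the job kinds are hypothetical. The point is only that one entry point decides, per kind of job, whether the old or the new system does the work, so job kinds can be moved over (and moved back) one at a time.

```python
class LegacyRunner:
    """Stand-in for the old system (Buildbot in the talk)."""

    def run(self, job):
        return f"legacy ran {job}"


class NewRunner:
    """Stand-in for the new system (Taskcluster in the talk)."""

    def run(self, job):
        return f"new ran {job}"


class Bridge:
    """Single entry point that routes each job kind to the system that owns it."""

    def __init__(self, legacy, new):
        self._legacy = legacy
        self._new = new
        self._migrated = set()

    def migrate(self, kind):
        """From now on, run this kind of job on the new system."""
        self._migrated.add(kind)

    def rollback(self, kind):
        """Send this kind of job back to the old system."""
        self._migrated.discard(kind)

    def submit(self, kind, job):
        # Delegate to the old system until the kind has been migrated.
        runner = self._new if kind in self._migrated else self._legacy
        return runner.run(job)


bridge = Bridge(LegacyRunner(), NewRunner())
before = bridge.submit("build", "linux64")
bridge.migrate("build")
after = bridge.submit("build", "linux64")
untouched = bridge.submit("test", "mochitest")
```

Callers keep using `submit` the whole time. Once every kind has been migrated, the legacy runner is no longer reached and can be deleted.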
What have we learned?
● Incrementalism - change one thing, evaluate, then change
● Expectations change. The faster we build, the faster other
groups expect to be able to ship
● Staging environment is important to test new automation
● Organizational changes
● Consider the operational side, not just landing code
This is an excellent talk on code rewrites as well:
So you want to rewrite that - Camille Fournier
How to delete code
● Communicate, note in issue tracker
● Delete. Don’t comment it out.
● Update or delete relevant tests
● Look at dependencies - can they also be updated or removed?
I’ve looked at a lot of code bases in the past where people are afraid to delete code,
so they comment it out. This makes the code really unreadable for future
maintainers. Or they leave tests in place that are no longer relevant.
It’s 2018 and version control is your friend. If you need to see why the code
was deleted, you can search the history or bisect to find the commit that removed it.
Hard to open up that door
When you're not sure what you're going for
But we've got to grow
We've got to try
Though it's hard so hard
We've got to say goodbye
Sometimes it’s hard to delete code. You get emotionally attached to it. You spent so
much time working on it. It’s okay, there will be something new to learn about!
From WOCintechchat stock photos License Creative Commons Attribution 2.0
Generic (CC BY 2.0)
How can you apply these principles yourself?
When you work on a new project, think about the lifecycle of the code.
What is the update strategy? Mobile or web? With desktop apps you can’t ship 1.0
until you have an update strategy for 2.0.
What is your deployment strategy?
How will you find out if your users are unhappy?
How can you distribute code ownership?
In conclusion, as you embark upon your careers in engineering, it has been my
experience that people matter more than code.
We are hiring - check out
Also, I have a couple hundred Firefox and Mozilla stickers. Please see me afterwards
if you are interested.
● Camille Fournier: So you want to rewrite that, GOTO conference, Chicago,
● Caitie McCaffrey: Resources for Getting started with distributed systems
● Anne Currie:
○ What is a Distributed system? https://container-solutions.com/what-is-a-distributed-system/
○ Why is a Single-Threaded Application like a Distributed System?
○ Why Use Distributed Systems? Resilience, Performance, and Availability
● Lin Clark: Entering the Quantum Era—How Firefox got fast again and where
it’s going to get faster