From hello world to goodbye code

At some point, the code you write today will be deleted and replaced with something new. This talk will discuss the life cycle of a large code base, and how to manage it over time to accommodate rewrites, giving examples from a major rewrite of the Firefox build and release pipeline over the last two years. You'll learn how to replace components of a running distributed system while keeping it operational, the proverbial replacing the wing of an airplane in flight.

Published in: Software

  1. 1. From Hello World to Goodbye Code Kim Moir, Staff Release Engineer, Mozilla, @kmoir Bonjour à toutes et à tous, hello. I’m very happy to see you all this morning. Je suis très heureuse de vous voir tous ce matin. My name is Kim Moir and I’m a staff release engineer at Mozilla. Montreal in January is only slightly colder than Ottawa in January where I live, so I was not scared off by the weather. I’ve been paid to work full-time in open source since 2001. Before that I worked in government, education, and at other tech companies. Before that I was a student just like you. We didn’t have email on our phones; in fact, we barely had email. I’ve been working longer than most of you have been alive. But that’s okay. If I can survive 20+ years in the tech industry, so can you. Mozilla is best known for building the Firefox web browser, as well as for its mission to make the internet open and accessible to all. I don’t work on the Firefox code base itself. As a release engineer, I write tools to scale our large build and release pipeline that transforms the Firefox code into a shippable product. This pipeline is a large distributed system. We are constantly optimizing this system to make it more scalable and more resilient to failure, and modifying the services it provides.
  2. 2. Outside of work, I like baking and running long distances. I have an amazing family too! I put these pictures up here to show you that as a developer you can have a life outside of work. Our industry tends to glamourize long hours at the keyboard at the expense of everything else but it doesn’t have to be that way. Firefox logo https://blog.mozilla.org/blog/2017/11/14/fast-for-good-launching-the-new-firefox-into-the-world/
  3. 3. Today’s agenda ● The life cycle of code ● Distributed systems ● Replacing components of a running distributed system (in the context of Firefox pipeline rewrite) ● You can try it too!
  4. 4. “And as everyone knows, the best kind of laughter is laughter born of a shared memory.” ― Mindy Kaling, Why Not Me? Let’s create some memories and talk about distributed systems and deleting code!
  5. 5. Hands up, how many of you have worked with a completely new code base in a work context? How many of you have worked with an existing code base? I’ve mentored a number of interns over the years, and one thing that I notice is that many school assignments are based on a completely new code base. I understand that this is done because everyone is learning language semantics, UI or testing frameworks for the course curriculum. In most companies, you will be looking at an existing code base. Even if you start your own company, you will probably use existing open source or language-specific libraries, or call existing APIs. So a really important skill is learning how to work with an existing code base. Photo by Markus Spiske on Unsplash
  6. 6. Photo by Francesco Gallarotti on Unsplash Often an existing code base is like a very large, well established forest that you need to walk around in for a few hours, days or even a few weeks, just to understand how it all works.
  7. 7. Photo by Koen Eijkelenboom on Unsplash It’s also good to talk to other people that have wandered in the code before. What do they know? What can you learn from them? Asking lots of questions as a software engineer is one of the most important skills you can learn.
  8. 8. Healthy code bases and their teams ● Documented shipping and deployment processes that work ● Ship new binaries or provide updates on a regular basis These are things that I look for when I look at a new code base. As a release engineer, I’m biased toward these qualities because I really care about shipping. Is the process for shipping documented? Can more than one person ship the product or is this a magical set of steps that only one person knows how to execute? How often do you deploy or update users?
  9. 9. Healthy code bases and their teams (cont’d) ● Readable code ● Tested code - correctness, integration, performance ● Feedback mechanism between developers and users Is it readable code or is there dead code and tests? Are there tests with a reasonable level of code coverage? Where do you report bugs? Or request new features? Is there telemetry that reports failures in the product automatically?
  10. 10. ● Code ownership and review is shared among multiple people ● Ownership = responsibility for change ● This doesn’t mean that you have to do everything yourself ● You can serve as a code reviewer and mentor new people ● People need to CARE about the code and the people who use and maintain it Healthy code bases and their teams (cont’d) When I used to work in the Eclipse community many years ago, the project I worked on didn’t have a code review process in place until a few weeks before each release. The problem with this approach was that only a limited number of people understood the different components. And when they decided to leave the community, the expertise left with them. (This process has since changed and they do have more code review in place.) At Mozilla, we have the concept of module ownership and a robust code review process. This helps a larger group of people understand components of the code base because people are required to evaluate contributions. It also improves the bus factor, so less knowledge is lost when people leave.
  11. 11. ● Photo by John Baker on Unsplash ● Examples of old code bases actively being updated ○ Voyager space probes (~40 years) ○ Airplanes (~30 year service lifetime) ○ Industrial robots (~20 years) ○ The first Firefox release was over 15 years ago. I’m not sure how much of the original code base remains. I often think that large code bases are like the cells in a human body: over time, much will be replaced by new code, but eventually the code base will die. Industrial robots https://www.bastiansolutions.com/blog/index.php/2015/04/30/increase-life-span-of-industrial-robot/ Voyager https://www.nasa.gov/mission_pages/voyager/index.html Social implications of old code Updating Voyager software https://www.quora.com/Was-the-opportunity-to-update-the-Voyager-spacecraft-firmware-ever-considered-If-there-are-plans-to-launch-another-Voyager-could-we-keep-updating-its-Earth-information-content
  12. 12. NASA retiring engineer, Voyager http://www.popularmechanics.com/space/a17991/voyager-1-voyager-2-retiring-engineer/
  13. 13. There are also social issues to maintaining old code bases. For instance, last year NASA was looking for a new developer to maintain its code base for the Voyager space probes because the last of the original team members were getting ready to retire.
  14. 14. Firefox continuous integration Land code Unit tests Decision graph Builds x N platforms Performance tests Sign Builds This is a very simplified diagram of the process that occurs when a developer lands code on our build pipeline. With her commit, a decision graph is generated that lists all the jobs that need to run. Then we build for four platforms - Linux, Mac, Windows and Android. These builds are then signed, and we run unit tests and performance tests so the developers can see the results of their commit. Did the tests fail? Or are there performance regressions they need to address?
  15. 15. Pipeline Metrics ● Constraints - it needs to be up and running all the time for developer productivity ● ~500 commits a day ● 140K jobs a day The build and release pipeline for Firefox is a large distributed system. Here are some metrics about it. ● Developers love to ship. In order to ship, they need feedback on their patches. Can I ship this? Or is there a regression that needs to be backed out? Developers are happier when they can see the results of their work more quickly.
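As a rough back-of-the-envelope check on these metrics, the fan-out per commit can be worked out directly. This sketch treats the approximate slide figures (~500 commits, 140K jobs) as exact daily totals, which is an assumption made only for illustration.

```python
# Approximate daily figures taken from the slide; treated as exact here.
COMMITS_PER_DAY = 500
JOBS_PER_DAY = 140_000
SECONDS_PER_DAY = 24 * 60 * 60

# Average number of jobs triggered by one commit.
jobs_per_commit = JOBS_PER_DAY / COMMITS_PER_DAY

# Average rate at which jobs are started over a full day.
jobs_per_second = JOBS_PER_DAY / SECONDS_PER_DAY

print(f"~{jobs_per_commit:.0f} jobs per commit")
print(f"~{jobs_per_second:.1f} jobs started per second on average")
```

These are averages only; real load is likely to be burstier than a flat per-second rate suggests.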
  16. 16. Photo by Uroš Jovičić on Unsplash End to end times - This is the time from when a developer lands a commit until we are able to ship the finished product. Why are they important? 1. Landing small incremental patches reduces risk. On a high velocity team with a huge number of commits, it is otherwise too difficult to figure out what went wrong. 2. Zero-days - we need to be able to get security patches to our users quickly. For instance, last week we shipped five releases to address the recent Meltdown and Spectre vulnerabilities.
  17. 17. This is a picture of the Firefox release engineering pipeline from 3 years ago that Selena Deckelmann created. At that time it took (optimistically) around 11 hours to ship a release, from the time a developer landed a commit to builds being available to users. You don’t have to understand or read all the components of this diagram, only understand that it was scary and had many single points of failure and scalability issues. It now takes 4-5 hours from developer commit to builds we can ship. http://www.chesnok.com/daily/2014/05/02/release-engineering-a-draft-of-an-architecture-diagram/
  18. 18. Why did we rewrite? ● Developer autonomy ● Fail faster ● Better local and pipeline testing ● Change technology stack (Docker, microservices, graph generation, optimization and transformation, task parallelization) ● Learn new things! ● So we decided to rewrite our existing pipeline to be more resilient and scalable ● Any developer can make changes to build and test configuration; previously, releng was a blocker for these changes ● With every push to a repo, a decision graph is generated automatically. It contains a list of the tasks that need to run for that push, along with all their dependencies. If graph generation fails, the builds aren’t run, which saves resources ● Developers can also test these changes locally or on the build pipeline ● Photo by ARTHUR YAO on Unsplash
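The decision graph idea can be illustrated with a minimal sketch, assuming a graph is just a mapping from each task to the tasks it depends on. The task names below are hypothetical and this is not the pipeline's actual implementation; it only shows how a dependency graph yields stages that can run in parallel, and how an invalid graph stops anything from being scheduled.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
TASKS = {
    "build-linux": set(),
    "build-windows": set(),
    "sign-linux": {"build-linux"},
    "sign-windows": {"build-windows"},
    "unit-tests-linux": {"sign-linux"},
    "perf-tests-linux": {"sign-linux"},
    "unit-tests-windows": {"sign-windows"},
}


def decision_graph(tasks):
    """Return batches of tasks that can run in parallel, in dependency order.

    Raises graphlib.CycleError if the dependencies contain a cycle, so
    nothing is scheduled when graph generation fails.
    """
    sorter = TopologicalSorter(tasks)
    sorter.prepare()  # validates the graph; raises CycleError on a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = decision_graph(TASKS)
for number, batch in enumerate(batches, start=1):
    print(f"stage {number}: {', '.join(batch)}")
```

Because the whole graph is validated before the first task is handed out, a broken configuration fails early instead of consuming build resources.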
  19. 19. Reasons not to rewrite? ● Failure is highly likely ● Really expensive ● May lose people on your team who aren’t interested in working on a new technology stack You have to defer other project work because you are heads-down on the rewrite. There is also usually a huge learning curve if you are moving to a new technology stack, not just for developers but for operations folks as well.
  20. 20. “A system that spans more than one physical location and uses the related concepts of copying and decoupling to improve operational efficiency (speed, resilience) and, more recently, developer efficiency (team productivity).” -Anne Currie Distributed system If your system spans more than one location you can make it more resilient. For instance, our pipeline uses Amazon instances to run builds and tests, and we run these jobs in multiple Amazon regions which correspond to different geographic areas. Copying data means that it is available in more than one location, which is another way to make the system more resilient. For instance, when we release Firefox we release it from multiple CDNs. Decoupling means that you have services that can operate on their own without depending on other services being available. Decoupled services usually communicate with each other via APIs. This allows you to change the internal implementation without the other services having to change the way they interact with the service. In this approach you can also stop, start and replace parts of the system. With a monolith, this is more difficult to do. This approach also allows team members to work on different parts of the system without everyone contending for the same resources. Another reason that we use distributed systems is that they allow us to scale up capacity incrementally by instantiating copies of existing services. For instance, with our
  21. 21. migration we ran many more services in parallel, which allowed the end-to-end time for releases to drop significantly. Distributed systems also allow us to provide a reasonable level of service to clients. Availability means we can always provide a predictable service to clients. Even if there are issues like network problems, the system can appear available. Why do we use distributed systems http://container-solutions.com/use-distributed-systems-resilience-performance-availability/ Resilience, Performance & Availability
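To make the resilience point concrete, here is a minimal sketch of running a job with failover across several copies of a service. The region names, the `run_job` stand-in, and the `unavailable` parameter are all hypothetical; a real system would call a cloud provider or scheduler API instead.

```python
# Hypothetical region names, used only for illustration.
REGIONS = ["region-a", "region-b", "region-c"]


class RegionUnavailable(Exception):
    """Raised when a region cannot accept work."""


def run_job(job, region, unavailable):
    """Pretend to run a job in a region; fail if the region is down."""
    if region in unavailable:
        raise RegionUnavailable(region)
    return f"{job} ran in {region}"


def run_with_failover(job, regions, unavailable=frozenset()):
    """Try each region in turn so one outage does not stop the pipeline."""
    for region in regions:
        try:
            return run_job(job, region, unavailable)
        except RegionUnavailable:
            continue  # this copy of the service is down; try the next one
    raise RuntimeError(f"no region available for {job}")


print(run_with_failover("build-linux", REGIONS))
print(run_with_failover("build-linux", REGIONS, unavailable={"region-a"}))
```

The caller sees the same result either way, which is what lets the system appear available even when one location has problems.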
  22. 22. How to approach migration ● Incremental portions of pool ● Communication ● Checklist ● Monitor capacity and wait times ● Monitor state after migration ● Rollback plan ● Decommission old ● Migrate more ● This is in the context of a large migration that we did at Mozilla where we migrated components of our build and release pipeline to a new microservices architecture and Docker ● Communicate - open an issue ● Let people know the timeframes for deletion via mailing list and Slack/IRC ● Update issue tracker with plan and time
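The loop of migrating a portion of the pool, monitoring, and rolling back if needed can be sketched as follows. This is an illustrative outline under stated assumptions, not the process Mozilla used: `migrate_pool`, the worker names, and the health check are all hypothetical.

```python
def migrate_pool(workers, batch_size, is_healthy):
    """Move workers to the new system in small batches.

    After each batch, is_healthy(migrated) is called (for example, to check
    capacity and wait times). If it returns False, the latest batch is
    rolled back and the migration stops.
    """
    remaining = list(workers)
    migrated = []
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        migrated.extend(batch)
        if not is_healthy(migrated):
            # Rollback plan: return only the latest batch to the old pool.
            migrated = migrated[:-len(batch)]
            remaining = batch + remaining
            break
    return migrated, remaining


workers = [f"worker-{n}" for n in range(1, 11)]
# Hypothetical health check: pretend wait times degrade past 6 migrated workers.
migrated, on_old = migrate_pool(workers, 4, lambda done: len(done) <= 6)
print(len(migrated), "migrated;", len(on_old), "still on the old system")
```

Keeping batches small limits how much has to be rolled back when the monitoring step reports a problem.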
  23. 23. Strangler Application - Martin Fowler From Jez Humble’s Continuous delivery page https://continuousdelivery.com/implementing/architecture/ “One pattern that is particularly valuable in this context is the strangler application. In this pattern, we iteratively replace a monolithic architecture with a more componentized one by ensuring that new work is done following the principles of a service-oriented architecture, while accepting that the new architecture may well delegate to the system it is replacing. Over time, more and more functionality will be performed in the new architecture, and the old system being replaced is “strangled”.” In Mozilla releng, we recently migrated from an old build job scheduling system called Buildbot to one called Taskcluster. One of the things that really helped us in this transition was an application called Buildbot Bridge. This allowed us to schedule jobs on Taskcluster, but continue to run them on Buildbot. This is similar to the dispatcher function shown in the diagram above.
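A dispatcher of the kind the strangler pattern relies on can be sketched in a few lines. This is a generic illustration with hypothetical names (`Dispatcher`, `run_on_legacy`, `run_on_new`); it is not how Buildbot Bridge was implemented, only the routing idea behind the pattern.

```python
def run_on_legacy(job):
    """Stand-in for the old system that is being replaced."""
    return f"legacy system ran {job}"


def run_on_new(job):
    """Stand-in for the new system."""
    return f"new system ran {job}"


class Dispatcher:
    """Single entry point that routes each job type to the old or new system."""

    def __init__(self):
        self.migrated = set()  # job types already handled by the new system

    def migrate(self, job_type):
        self.migrated.add(job_type)

    def rollback(self, job_type):
        self.migrated.discard(job_type)

    def run(self, job_type, job):
        handler = run_on_new if job_type in self.migrated else run_on_legacy
        return handler(job)


dispatcher = Dispatcher()
print(dispatcher.run("build", "linux-build"))      # still on the legacy system
dispatcher.migrate("build")
print(dispatcher.run("build", "linux-build"))      # now on the new system
print(dispatcher.run("test", "linux-unit-tests"))  # not migrated yet
```

Callers only ever talk to the dispatcher, so job types can be moved over one at a time, and moved back, without callers changing.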
  24. 24. What have we learned? ● Incrementalism - change one thing, evaluate, then change another ● Expectations change. The faster we build, the faster other groups expect to be able to ship ● Staging environment is important to test new automation ● Communication ● Organizational changes ● Consider the operational side, not just landing code This is an excellent talk on code rewrites as well: So you want to rewrite that - Camille Fournier https://www.youtube.com/watch?v=PhYUvtifJXk
  25. 25. How to delete code ● Communicate, note in issue tracker ● Delete. Don’t comment it out. ● Update or delete relevant tests ● Look at dependencies - can they also be updated or removed? ● Celebrate! I’ve looked at a lot of code bases in the past where people are afraid to delete code, so they comment it out. This makes the code really unreadable for future maintainers. Or they leave the tests in place that are no longer relevant. It’s 2018 and version control is your friend. If you need to look and see why the code was deleted, you can bisect the code.
  26. 26. Hard to open up that door When you're not sure what you're going for But we've got to grow We've got to try Though it's hard so hard We've got to say goodbye ―Beyoncé Sometimes it’s hard to delete code. You get emotionally attached to it. You spent so much time working on it. It’s okay, there will be something new to learn about!
  27. 27. From WOCintechchat stock photos License Creative Commons Attribution 2.0 Generic (CC BY 2.0) How can you apply these principles yourself? When you work on a new project, think about the lifecycle of the code. What is the update strategy? Mobile or web? With desktop apps you can’t ship 1.0 until you have an update strategy for 2.0. What is your deployment strategy? How will you find out if your users are unhappy? How can you distribute code ownership?
  28. 28. In conclusion, as you embark upon your careers in engineering, it has been my experience that people matter more than code.
  29. 29. We are hiring - check out https://careers.mozilla.org/ Thank you! Also I have a couple hundred Firefox and Mozilla stickers, please see me afterwards if you are interested
  30. 30. Additional Reading ● Camille Fournier: So you want to rewrite that, GOTO conference, Chicago, 2014 https://www.youtube.com/watch?v=PhYUvtifJXk ● Caitie McCaffrey: Resources for Getting started with distributed systems https://caitiem.com/2017/09/07/getting-started-with-distributed-systems/ ● Anne Currie: ○ What is a Distributed system? https://container-solutions.com/what-is-a-distributed-system/ ○ Why is a Single-Threaded Application like a Distributed System? http://container-solutions.com/single-threaded-application-like-distributed-system/ ○ Why Use Distributed Systems? Resilience, Performance, and Availability http://container-solutions.com/use-distributed-systems-resilience-performance-availability/
  31. 31. Additional Reading ● Lin Clark: Entering the Quantum Era—How Firefox got fast again and where it’s going to get faster https://hacks.mozilla.org/2017/11/entering-the-quantum-era-how-firefox-got-fast-again-and-where-its-going-to-get-faster/
