37. The structure of your team changes.
    CTO
    └─ Architect
       └─ Dev, Dev, Dev, Dev, Dev, Dev
38. The structure of your team changes.
    CTO
    ├─ Director
    │  ├─ FrontEnd: Dev, Dev
    │  ├─ QA
    │  └─ BackEnd: Dev, Dev
    ├─ Director
    │  └─ Frontend, QA, Backend
    └─ VP Ops
       ├─ DevOps: DevOps, DevOps
       └─ Syseng
Rent the Runway initially launched as a Drupal application that ran the storefront, inventory and fulfillment systems. When I joined the company in 2011, we encountered pretty much every issue you’d expect with a legacy system.
We couldn’t support the growth of the business, our customers used the product despite a slow and buggy website, and we couldn’t release new features without several failed attempts and rollbacks due to instability.
I led the team through a rewrite of our systems over the past two years, and became interested in the issues around rewriting, especially in the context of growing startups.
When you start a rewrite, you are generally in a trainwreck sort of situation.
Aha, you think! I’ll rewrite everything, that will get me out of this mess!
But, as my friend Cliff said recently in a different talk, there’s basically no such thing as a successful rewrite. Rewriting is an act borne of failure. The best you can hope for is a rewrite that looks like firefighting.
Don’t believe me? Let’s dive in.
Rewriting successfully is difficult because of where you are now, what you don’t know, and the many ways you can flounder along the way.
If you are contemplating a rewrite, you’re probably in the midst of failure, or heading quickly towards a very measurable problem.
Can’t Scale. Twitter is the classic example, but this is a common problem. In fact, this is almost the way you want your team to be forced into a rewrite, because you have so much volume you can’t keep up: increasing trade volumes in the banking world, additional regulatory requirements, more data, etc.
When I joined RTR, we had a huge scaling issue. Due to the version of Drupal we were running, we could not sustain more concurrent sessions than we had web server space for, and we couldn’t just spin up new servers when we needed to handle additional load. We were completely unable to consider doing promotional work that would drive spikes in traffic to the site. Cyber Monday was a nightmare.
You can’t meet your customers’ demand. You’ve written software that depends on the cloud, and someone wants to give you a ton of money to run it in-house.
This goes a bit hand-in-hand with scaling if you are building a product where you can’t tolerate additional users.
We couldn’t scale our ability to handle more reservations because of our data model.
Crushed under the weight of technical debt. You can barely keep things going as they are, let alone move forward.
We were definitely in this situation.
DANGER DANGER
Every developer can point to some software they’d like to rewrite
Developers are lazy about reading software and hate dealing with anything legacy
This is the most tempting failure to handwave: of COURSE we’re being crushed. But are you really? “Crushed” isn’t occasionally spending a little extra time figuring out legacy code. Crushed is spending all your time supporting the legacy systems, where every attempt to add a new feature breaks something.
Want to ship features faster, but what does that really look like?
You’re in the midst of failure, so you’re already firefighting. How much time and bandwidth do you have for the heavy planning necessary to pull off a successful rewrite?
If it’s so bad you need to rewrite it, how well do you know what it’s doing now?
Lots of accumulated fixes, production hardening, and stability gained through the virtue of it being in prod for a long time
The customer base has (presumably) grown, or grown used to the site working as well as it does, and won’t necessarily tolerate regressions
We did a couple-day hackathon to try to “rewrite” our site, and the team left it thinking “oh yeah, we can totally do this”
But they forgot about pretty much every edge case of the site
All of the administrative tools
Performance
Credits
Gift cards
And on and on and on…
The first bits seem easy because they are obvious, but it’s all the things that exist only in the head of the PM who has been there since the beginning that you have to worry about.
If you create or modify data, you need to reconcile the new data with the old data
BTW, most modern companies have analytics teams that probably rely heavily on the structure of the data as it exists right now
I’ve never had a project involving data that hasn’t been a fairly difficult and painful migration.
You alone are probably not enough
If you’re going to change your software, what standards need to change? What new expertise do you need that you don’t have? What does the existing team need to learn to be successful? Do you need more ops?
The team on the ground is responsible for the mess you’re in now
The experts may be spending more time supporting the old stuff than paying attention to the new stuff, so then what? Even more unknown unknowns
So many “quick” fixes that are not
Trying to do too much at once
Rewriting rescal
Billed as 6 weeks of 1-2 developers
Ended up being closer to 5 months with a whole team working a death march at the end
Why? Because we changed the schema at the same time that we did the migration
Reduces thinking up front, increases complexity and likelihood of failing at the end
Easy to undervalue tooling and support of a huge community
Languages and frameworks also improve over years of production hardening
Choose for the next several years, not the next several months
Choosing for vanity reasons, recruiting, etc, is a very risky proposition
We chose the original Play framework. Even at the time we chose it the community was moving to a new version, still in beta, that was being written in Scala. Now it is one of our bigger pieces of technical debt that everyone really wants to rewrite.
As the right software might bring you life, the wrong software can take it.
This is a generic problem of software engineering. Everyone will probably do this at least once, if not several times throughout their career.
As such, apply principles to the problem to make sure you’re on the right track.
Sustainable rewrite looks like firefighting, as I said earlier. So identify the worst fires first, and attack them.
Maybe you stop all progress for a while, and tackle your biggest problem areas. We actually got pretty far with just some database improvements before we started our rewrite. If nothing else, this can help keep the old system in a stable place while you have time to build out the new thing.
Don’t split up your knowledge base. More languages means fewer people that really understand what is going on.
Maybe there’s a better framework you could be using. Much easier to salvage code and knowledge when you aren’t changing languages.
We rewrote data by data and function by function, creating Java logic, hollowing out Drupal, and then finally moving the client layer into Ruby.
The key thing that allowed our rewrite to be successful
Also allowed us to interleave features in the process
If you must rewrite, this is absolutely the way to go. Generally the best way to do ANY project anyway.
I did many things wrong with my rewrite, but this is one of the things I did right, and it was probably the most important key to success.
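The piece-by-piece approach described above is often called the strangler fig pattern: a routing layer sends each request either to the legacy system or to the rewritten service, depending on which functions have migrated so far. Here is a minimal sketch of that idea; all names (backends, function names, flags) are hypothetical, not from the actual RTR codebase.

```python
# Hypothetical sketch of "hollowing out" a legacy app behind a router.
# Flipping one entry in MIGRATED cuts one slice of traffic over to the
# new system -- and can route it back if a problem shows up.

LEGACY = "legacy-drupal"
NEW = "new-java-service"

# Functions migrate one at a time; unknown functions stay on legacy.
MIGRATED = {
    "reservations": True,   # rewritten and verified
    "checkout": False,      # still served by the legacy app
}

def route(function_name: str) -> str:
    """Return which backend should serve this function."""
    return NEW if MIGRATED.get(function_name, False) else LEGACY
```

Because each flag flips independently, feature work can interleave with the migration, and any slice can fall back to the legacy system without a full rollback.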
Rewriting costs a lot of political capital, and may cost you your job if you are not careful.
Cultivate dissent, get other opinions from your most skeptical team members, friends
Hockeystick of infrastructure cost
Flatline of scalability, ability to add engineers
Big Scary Graphs that will change when you successfully complete the rewrite
Your team is going to be the ones working their asses off for possibly years to make this happen
They need to be fully on board
They need to have thought through the implications and understand clearly the goals
This can go on forever if you let it. When do you declare victory?
Tests uncover unstated assumptions and also help catch compatibility problems.
If you’re rewriting the backend logic, a smoketest of user-facing functionality can be helpful to get a sense of what you’re actually going to need to do
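A smoketest like this can be as simple as walking a list of user-facing pages and checking that each responds with some expected marker text. The sketch below is illustrative (the pages and markers are invented); the fetch function is injected so the same checks can run against the old site, the new site, or a stub in CI.

```python
# Hypothetical smoketest sketch for user-facing functionality.
# Easy-to-forget edge cases (gift cards, credits, admin tools)
# belong on this list too.

SMOKE_PAGES = [
    ("/", "Rent"),                  # homepage should mention the brand
    ("/reservations", "reserve"),
    ("/gift-cards", "gift"),
]

def smoke_test(fetch, pages=SMOKE_PAGES):
    """Return a list of (path, problem) tuples; empty means all passed."""
    failures = []
    for path, marker in pages:
        status, body = fetch(path)
        if status != 200:
            failures.append((path, f"status {status}"))
        elif marker not in body:
            failures.append((path, f"missing {marker!r}"))
    return failures
```

Run against the legacy site first: the failures it reports on day one are a rough map of what the rewrite actually has to cover.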
Those same Big Scary Graphs should change when you successfully complete the rewrite
If you aren’t measuring them up front, it is hard to ensure that the rewrite is actually improving anything. Might just be creating code that is only better for the people who rewrote it. This is VERY true for a tech debt rewrite.
When are we cutting over? Will we dual write? Even if we aren’t changing the data structure, how do you really make sure the new system is writing the right thing? If you are changing the data model, how do you map the old to the new? Are you going to backfill the new data? Will you run both in parallel?
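One common answer to those questions is to dual write during the cutover window and reconcile: every write goes to both stores, and a reconciliation pass compares them record by record. This is a sketch under assumptions, not the actual RTR migration code; the mapping function stands in for whatever old-to-new data-model translation your migration needs.

```python
# Hypothetical dual-write + reconciliation sketch. The old store stays
# the source of truth until reconciliation runs clean.

def dual_write(record_id, record, old_store, new_store, to_new_model):
    """Write to the old store (source of truth) and mirror to the new one."""
    old_store[record_id] = record
    new_store[record_id] = to_new_model(record)

def reconcile(old_store, new_store, to_new_model):
    """Return ids where the new store disagrees with the old one."""
    mismatches = []
    for record_id, record in old_store.items():
        if new_store.get(record_id) != to_new_model(record):
            mismatches.append(record_id)
    return mismatches
```

Running the same reconciliation pass over a backfill answers “is the new system writing the right thing?” with data, before you cut over.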
No one likes to hear that the system they shed blood, sweat and tears over is crap
A business person that might have been able to go to a dev and have them add something real quick now has to go through a process
More ops if you go scalable
More people to support new languages
Teams may become silos
Too much abstraction to deal with overly tight coupling
We created “swimlanes” for a scalability problem we didn’t have, which made our deployment process incredibly complex
You can’t boil the ocean, but you can cause global warming
ERIC: Goals: high volume traffic, fault isolation, fault tolerance, geodiversity. Make sure to mention not just syncing submodules and deploying, but separate Git repos, separate tags, etc, repeated over every environment (stage, production). Devs would write code in isolation for long periods of time before merging to master to avoid deploying.
There’s always a fire somewhere
We declared “Done” when the interactive core customer-facing pages were out of Drupal.
But still had and still have a ton of admin pages to be rewritten
Now my devs want to rewrite off of Play
Probably not twice as long, but longer
Scale to a load greater than twice your current load, or enable changes that will support that load.
Allow new features and products more easily. Allow faster developer productivity.
Think about how to avoid the “optimization corner” where we had started to trade off readability and flexibility of the codebase for performance and efficiency.
Think about more than just the code, the data storage and network can also cause load bottlenecks, will this address them?
Remember that the new system, while it should last longer than the last one, won’t last forever
But you probably need some additional abstractions
Loose coupling is great but needs documented standards or it becomes incredibly difficult to use
APIs and middleware are both great, but if you start talking REST to services but don’t follow RESTful standards, it is confusing for new devs. This was a big mistake that we made with earlier systems, along with not adopting standard clients or decent documentation. Makes a distributed SOA system much harder to maintain.
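The “standard client” fix for that mistake can be small: one shared wrapper that every team uses to call internal services, so URL shapes, timeouts, and error handling are uniform instead of re-invented per team. A minimal sketch, with an injected transport and invented service names:

```python
# Hypothetical standard service client. One canonical URL shape and one
# error surface, so a new dev learns the convention once.

class ServiceClient:
    def __init__(self, base_url, transport, timeout=2.0):
        self.base_url = base_url.rstrip("/")
        # transport: callable (method, url, timeout) -> (status, body)
        self.transport = transport
        self.timeout = timeout

    def get(self, resource, resource_id):
        # Canonical shape: /<resource>/<id> -- no per-team variants.
        url = f"{self.base_url}/{resource}/{resource_id}"
        status, body = self.transport("GET", url, self.timeout)
        if status != 200:
            # Uniform failure mode so callers handle errors one way.
            raise RuntimeError(f"GET {url} failed with {status}")
        return body
```

Pair a client like this with written standards and decent documentation; the client enforces the convention mechanically, which is what makes a distributed SOA system maintainable by people who didn’t build it.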
Possibly middleware
Related logic in one place
Uniform ways to do concurrency and network calls