In thinking through the approach to this blog post, I wanted to start framing the narrative around the concept of a refactor. I’m not talking about the mini changes that are required for everyday feature delivery, but major refactors to the codebase that are going to take weeks of planning and development. In a lot of ways, knowing when or why to refactor is just as hard as knowing how to do the refactor. So that’s where I want to start, with the when and why - at what point does a refactor become the better path forward? And how do you convince stakeholders and non-engineers of that? How do you ask for the resources and time necessary to complete a refactor that will not immediately change a user’s experience, but will dramatically improve a developer’s experience? We’ve all heard the now-infamous Mark Zuckerberg adage “move fast and break things”. But what happens when you move too fast and break too much?
A little over a year ago, we hit exactly this point. We were working on a couple of different features that ultimately took double the time we thought they would. Between engineering handoffs, messy code, and pressure to get features out, we were trying to work as fast as possible, but at the expense of our codebase. Eventually it felt like everything was constantly broken. To paraphrase our VP Engineering, our codebase was like a workshop where every time you used a tool, you put it down in a random location; nothing had a home. Eventually it gets very difficult to get anything done because your shop is such a mess. Now imagine you have a lot of people trying to share that same workshop. Things were getting done, but to the detriment of our workshop/codebase (and frankly, our developers’ sanity). Around the same time, Product and Design asked for a couple of features that got us thinking about a refactor. For what was being asked of us, we were going to need a couple of months and a pause on new features, which required product buy-in and stakeholder alignment.
For context, our cross-functional team has been working on a system to store and reuse data. In our case specifically, we have a collection of sales comparables (sales comps) to use within the space of commercial real estate appraisals. To complete an appraisal, there are a couple of approaches to value, and sales comps help support the Sales Approach. An appraiser looks at properties similar to the subject property by comparing property type, location, gross building area, number of units, number of floors, etc., and then uses the sale histories of these properties (date and price) to support their forecasted value of the subject property. A sales comp is made up of two parts: information about the comparable property and information about its sale. Sale information is hard to come by and harder to verify, so our goal is to make the process of finding and validating a sales comp as efficient as possible by creating a centralized database that our users (appraisers) can access through our web application. That way, if one appraiser has verified a sales comp, the next appraiser can reuse the same comp without having to redo the work.
The original system built to satisfy these requirements was called data-reuse-mart (DRM), and it was published as an NPM package to use within our larger appraisal web application (Webapp). Remember the messy workshop I mentioned earlier? That’s Webapp. DRM was more like the built-in shelves and workbench: somewhat separate from the rest of the shop, but not separate enough to escape the clutter.
DRM was set up to do exactly what was outlined - store sale and property data, maintain uniqueness of sales comps (by address and sale date), and keep a version history so that appraisers with an older version didn’t automatically see changes applied. My team adopted this system a few weeks after it went live, which added an interesting challenge. To continue with our metaphor, we were working in a disorganized shop while we were still trying to learn our way around. We hacked our way through features, putting things in whatever home we could find for them even if it didn’t really make sense. The boundary lines between our Webapp monolith and new DRM system blurred, making it even more difficult to see where to build a feature. We were constantly struggling with the trade-off between taking the time to clean up the code and moving quickly so we could release new features to our users. And then we finally reached a point where we could no longer deliver the features being asked of us in a reliable timeframe. Hacking our way through features was not only slowing us down, but also breaking a whole lot along the way.
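To make those original responsibilities concrete, here is a minimal in-memory sketch of a store that keeps sales comps unique by address and sale date and appends a new version on every save instead of overwriting. It is not DRM's actual code: the names `SalesCompStore`, `SalesCompData`, and `SalesCompVersion`, the field names, and the address normalization are all assumptions made for illustration.

```typescript
// Hypothetical shapes; field names are illustrative, not DRM's real schema.
interface SalesCompData {
  address: string;
  saleDate: string; // ISO date string, e.g. "2019-03-14"
  salePrice: number;
  grossBuildingArea?: number;
}

interface SalesCompVersion {
  version: number; // 1-based, increases with every save
  data: SalesCompData;
}

class SalesCompStore {
  // Version histories keyed by the comp's uniqueness key.
  private comps: { [key: string]: SalesCompVersion[] } = {};

  // Uniqueness key: normalized address plus sale date (normalization is an assumption).
  static keyFor(address: string, saleDate: string): string {
    return `${address.trim().toLowerCase()}|${saleDate}`;
  }

  // Saving never overwrites; it appends a new version to the comp's history.
  save(data: SalesCompData): SalesCompVersion {
    const key = SalesCompStore.keyFor(data.address, data.saleDate);
    const history = this.comps[key] || [];
    const next: SalesCompVersion = { version: history.length + 1, data: { ...data } };
    history.push(next);
    this.comps[key] = history;
    return next;
  }

  // An appraiser pinned to an older version keeps seeing exactly that version.
  getVersion(address: string, saleDate: string, version: number): SalesCompVersion | undefined {
    const history = this.comps[SalesCompStore.keyFor(address, saleDate)] || [];
    return history[version - 1];
  }

  // Number of distinct sales comps, regardless of how many versions each has.
  count(): number {
    return Object.keys(this.comps).length;
  }
}
```

Saving the same address and sale date twice leaves `count()` at one and produces versions 1 and 2, so a report that references version 1 does not pick up the later edit automatically.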
Product asked us for three new features:
- Capture additional verification data for the sales comp, which involved changing the data model drastically;
- Connect each sales comp with an appraisal to show users where else the data was being used; and
- Give users the ability to see the exact changes between versions, who had made them, and when.
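As a rough illustration of what the third ask implies for the data, the sketch below compares two versions of a comp field by field and reports what changed, who changed it, and when. The `VersionRecord` shape, the `changedBy` and `changedAt` fields, and the `diffVersions` function are hypothetical and are not taken from the actual system.

```typescript
// Hypothetical version record; the field names are assumptions for illustration.
type FieldValue = string | number | null;

interface VersionRecord {
  version: number;
  fields: { [name: string]: FieldValue };
  changedBy: string;
  changedAt: string; // ISO timestamp
}

interface FieldChange {
  field: string;
  before: FieldValue | undefined; // undefined means the field did not exist yet
  after: FieldValue | undefined; // undefined means the field was removed
}

interface VersionDiff {
  fromVersion: number;
  toVersion: number;
  changedBy: string;
  changedAt: string;
  changes: FieldChange[];
}

function diffVersions(older: VersionRecord, newer: VersionRecord): VersionDiff {
  // Union of field names from both versions, sorted for a stable order.
  const names = Object.keys(older.fields)
    .concat(Object.keys(newer.fields).filter((name) => !(name in older.fields)))
    .sort();

  const changes: FieldChange[] = [];
  for (const field of names) {
    const before = older.fields[field];
    const after = newer.fields[field];
    if (before !== after) {
      changes.push({ field, before, after });
    }
  }

  // The author and timestamp of a diff come from the newer version.
  return {
    fromVersion: older.version,
    toVersion: newer.version,
    changedBy: newer.changedBy,
    changedAt: newer.changedAt,
    changes,
  };
}
```

A diff like this only works if every version carries its own author and timestamp, which is one reason this ask reaches into the data model rather than staying a display-only change.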
The engineers quickly realized that while DRM was set up to handle the initial requirements, the backend architecture was too inflexible to add or change what we needed for these new asks without creating an enormous amount of tech debt. By this point we (the engineers) had answered the question of when to refactor. We did not want to move forward with these new features before cleaning up our codebase. Engineering knew this, but how do you convince Product, Design, users, and stakeholders that this is the best solution in the long term even though it means slower delivery in the short term? We’re in a unique position at Bowery in that our users are in-house. I would say in 99.99% of cases, this is an amazing thing - shorter feedback loops, user testing, getting to see firsthand how your product affects someone. Refactoring is the 0.01%. How do you explain to someone sitting next to you that you’re not going to give them any new features for a while because you have to rewrite the system? The question of when was clear to the engineers, but we needed to clearly define the why to convince everyone else that now was the right time.
Our end goal with this sales comp reuse is efficiency, so the users and how they’re impacted are always top of mind for us. So why slow down to refactor correctly instead of just quickly getting in a solution that works?
First and foremost, we needed to define the scope of this project. If we were going to ask for the time and resources necessary to complete it, we needed to know exactly what we wanted to accomplish. To get all this information out, we did a “Pain-Storming” exercise (template linked here) where we got to hash out all the complications in our current system. In this exercise, the engineers anonymously wrote out pain points on stickies - everything from “NPM package causes pain like dentistry. Am I running the local or remote package?!” to “what is its actual purpose...” We got every little detail written out, so that we could narrow down which areas of the application we needed to focus our efforts on. The point of going through “pain-storming” was to distill the root causes of our “pain.” There were some general themes to what was giving developers trouble, so we grouped those together and then plotted them on a Pain vs. Effort matrix. More than anything, I think this exercise gave the team specific starting points and goals for a refactor. We narrowed the problem areas down to the frontend packaging system and the backend data model.
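For teams that want to keep the output of an exercise like this somewhere more durable than stickies, a Pain vs. Effort matrix can be sketched in a few lines. Everything here is an assumption for illustration: the 1-5 scale, the threshold, the quadrant labels, and the `quadrantFor` and `prioritize` helpers are not part of the template we used.

```typescript
// A grouped theme from the exercise, scored from 1 (low) to 5 (high).
// The scale and the scores you feed in are up to the team; nothing here is real data.
interface PainPoint {
  theme: string;
  pain: number;
  effort: number;
}

// Quadrant labels are illustrative names, not an established standard.
type Quadrant = "quick win" | "major project" | "fill-in" | "deprioritize";

// Place a theme in one of the four quadrants of the matrix.
function quadrantFor(point: PainPoint, threshold: number = 3): Quadrant {
  const highPain = point.pain >= threshold;
  const highEffort = point.effort >= threshold;
  if (highPain && !highEffort) return "quick win";
  if (highPain && highEffort) return "major project";
  if (!highPain && !highEffort) return "fill-in";
  return "deprioritize";
}

// Order themes by highest pain first, breaking ties by lowest effort.
// Works on a copy so the caller's array is left untouched.
function prioritize(points: PainPoint[]): PainPoint[] {
  return points.slice().sort((a, b) => b.pain - a.pain || a.effort - b.effort);
}
```

High-pain, high-effort themes land in the "major project" quadrant, which is the kind of work that needs dedicated time rather than being squeezed in between features.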
Our frontend was packaged as an NPM library, so we were constantly publishing new versions, which led to updating our main application at least once a day. On top of that, we had to use npm link to develop locally, and with NPM versions out of sync this caused constant problems in our local environments. We couldn’t test DRM without connecting to the Webapp, but starting both systems would sometimes take 20 minutes or longer. From these pain points, we decided one of our goals for refactoring would be to decouple DRM from the monolith. We wanted clear boundaries between what our system does and what the monolith does. We wanted a local setup separate from the main application, as well as a separate deployment process that did not require updating a package version in our Webapp with every change.
As for the backend, one of the biggest takeaways was that we didn’t actually know what our system was trying to accomplish. It worked for users, but we couldn’t tell you in a sentence what the context of the system was. Because we had taken over maintenance of the system after it was built, we never got to do the upfront research into why it was necessary and what problems it was directly trying to solve. The goals were unclear to us and because of that, we weren’t coding in a clean, clear, understandable way. The data model being used no longer made sense to us; it was too rigid and overly complex for what the system required. We had outgrown the model, so we decided to start using a Domain-Driven Design approach to define a data model that would allow us to be firm with business rules and logic, but flexible enough to allow for the requests we were getting from Product and Design. This breakdown of issues was a way to start defining actions, milestones, and a proposed timeline for the work.
By framing the refactor in terms of the pain points we set out to solve, I think we could more easily explain to Product and stakeholders why it was necessary. The frontend was slowing down development time and the data model wasn’t set up to allow for fast delivery of new features. Our biggest selling point on this work was that if we could just slow down for several weeks to do this refactor, we’d be able to move much faster going forward. Without a refactor, we cautiously estimated about 4 months for all three epics they wanted and a questionable amount of time for anything else after that. With a refactor, we estimated about 6-8 weeks of refactor time, with the promise that the epics would be fast-follows taking 1-2 months max. That put the total at roughly 2.5 to 4 months, no longer than the path without a refactor. And with a refactor, we said there would be the added bonus of ensuring future epics could be done reliably. I’m not sure we ever got 100% alignment between departments on the refactor work, but we got close enough by showing the time tradeoffs for delivering new features. And so our refactor work began… (Stay tuned for Part II: The What & The How and Part III: The Result)