A weekend marathon of an infrastructure upgrade

Earlier this month, I received an email from Heroku that my “stack” was reaching its EOL (end-of-life) and I needed to upgrade to the latest stack. These emails typically catch me off guard because over the years I’ve actually lost my appetite for fancy new things—I’m much more attracted to tech that’s boring, battle-tested, and does the job. That said, this EOL was more about Heroku no longer supporting an older Ubuntu OS and no longer providing security updates for it, so this wasn’t fancy at all—it was more of a necessity.

Making a significant upgrade like this can be nerve-wracking, and it was, despite Cushion’s backend being well-covered with tests. After running the entire test suite locally, I was reminded that there are in fact ~5,500 tests for Cushion’s backend. This calms my paranoia when I need to make any backend changes, which is incredibly important for an app that’s over nine years old. Even so, with an upgrade that involves a major version bump for both the OS and the backend framework, I was still pretty nervous.

At first, I didn’t even know that Ruby would need a major upgrade as well, but I quickly learned this when my CI complained that Heroku-22 doesn’t support Ruby 2.x. I could be the old man yelling at clouds about this, but it was fine. I should keep everything up to date for security purposes, even if I don’t need the shiny new features. When I bumped the Ruby version, I started to see the real work that needed to be done: 217 failing tests. I know the exact number because I kept a log of my progress as I chipped away at it. Even with a number to track, however, I didn’t know the true weight of each failing test. A simple dependency upgrade could fix half of these tests, whereas a single failing test could end up taking days to fix. Considering this unknown along with a tight deadline of May 1st, I decided to set up camp at my wife’s artist studio for an entire weekend and get to work. Luckily for me, that weekend was the start of the NBA playoffs.

After an entire 12-hour day (or four NBA playoff games), I was able to get the number of failing tests from 217 down to 161. This did feel like progress, but not as much as I had hoped. The reality is that this progress wasn’t a straight line. Halfway through the day, I had actually gone up to 228 failing tests.

With certain fixes, like updating a dependency to support a new syntax, you can sometimes end up with more failing tests because the updated dependency included breaking changes of its own. This means that forward progress was never a given. It’s your classic one step forward, two steps back. I wasn’t deterred, though, because the test suite would always let me know exactly what broke, so I never had to guess. After this first day of troubleshooting, I still felt accomplished because in addition to fixing dozens of failing tests, I also found my footing. I was now confidently wearing my backend hat, and I had momentum carrying me forward at a steady pace.
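A progress log like this can be tiny. The Ruby sketch below is purely illustrative and not the actual log: the `Checkpoint` struct, the labels, and the helper names are hypothetical, and the only real data are the three counts mentioned above.

```ruby
# Hypothetical progress log. Only the three failing-test counts come from
# the story; the struct, labels, and helper names are made up for illustration.
Checkpoint = Struct.new(:label, :failing)

LOG = [
  Checkpoint.new("after Ruby bump", 217),
  Checkpoint.new("midday, day one", 228),
  Checkpoint.new("end of day one", 161),
].freeze

# Change in failing tests between consecutive checkpoints.
# A negative number means progress; a positive one means a step back.
def deltas(log)
  log.each_cons(2).map { |before, after| after.failing - before.failing }
end

# Net number of tests fixed between the first and last checkpoint.
def net_fixed(log)
  log.first.failing - log.last.failing
end

deltas(LOG).each_with_index do |delta, i|
  puts format("%-18s %+d", LOG[i + 1].label, delta)
end
puts "net fixed: #{net_fixed(LOG)}"
```

Looking at the deltas rather than the raw counts makes the detour visible: the count climbs by 11 before dropping by 67, for a net of 56 tests fixed on the first day.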

The next day, after another solid 12-hour day, I got the number of failing tests down to six. I couldn’t believe how much progress I had made on such an open-ended scope. I had actually given myself several weeks of lead time to finish the upgrade, just in case. This shouldn’t be a surprise to anyone who knows me, but I’m often the person who arrives at the airport so early that I can take the earlier flight (and that’s happened!).

Fixing the final failing tests actually involved removing a feature. One of my bigger regrets with Cushion early on is that I decided to build custom integrations too soon. For a solo dev, integrations become a prime example of Murphy’s Law because third-party services tend to make changes without notice. You’re heads-down working on a new feature when a user points out that a certain integration isn’t working. You take a closer look and the service launched a new API that’s alpha and specifically says that it’s bound to change, but your integration relies on the rock-solid-but-now-deprecated API. It’s such a draining context switch. I should’ve just used a service that manages these connections for me.

Back to the feature in question—Cushion’s Xero integration. This feature has been a thorn in my side for years because their API requires a specific certificate-based authentication that’s strangely complicated and unlike any service I’ve authenticated with before. In addition, Xero calculates their invoice tax differently than every other service, so even if I imported invoice line items identically to how they are in Xero, there’s a chance the tax and totals are slightly different—not great when you’re dealing with currency. With all of these scars, and the fact that Cushion doesn’t have any active Xero integrations at the moment, I decided to quietly sweep it under the rug. I honestly don’t think anyone will miss it.

The following weekend, I merged the code, deployed to staging, and did a thorough click-through of Cushion’s staging environment. Everything seemed fine, so I went ahead with the production deploy. I truly didn’t know what to expect because these are often the moments where one unexpected hiccup occurs and melts your innards as you try to quickly troubleshoot the issue. Fortunately, this particular deploy went without a hitch. I was actually unsettled by how uneventful it was, but that’s the point—no news is good news. Over the next few days, I kept an eye on production, but everything was fine. I’m relieved, but also proud. I’m not primarily a backend dev, but when put in the situation, I can wear that hat. It’s also empowering because I know (with an app of Cushion’s size) that there’s nothing I can’t handle—even after all these years.
