Technical Debt is just messiness

It's the time of year for people to share their reframings of technical debt.

Yossi Kreinin has two takes I like, from https://twitter.com/YossiKreinin/status/1431748651571896320 and https://twitter.com/YossiKreinin/status/1341741855214546949 respectively.

Much of "technical debt" isn't - it wasn't done to ship quickly at the cost of more work in the future. It was just a shitty job that you're now stuck with, that never helped ship anything quickly. If programmers were plumbers, we'd spill shit all over your room & call it "debt"

- A shoddy job creates problems down the road.

- What? I'm from biz. Explain!

- This would create technical debt.

- Ah! Now I get it. Like real debt, but not seen in the books? Awesome, let's issue lots! Tell me more! Is there also technical stock? Can we do a technical IPO?

To me, technical debt is just messiness. Sometimes it's just a small mess, but sometimes it's an exploding shit pipe, or even a whole room where the walls, floor, and ceiling are just covered in it.

This sentiment along with those two tweets seems like enough to infer a lot, but still, I want to articulate a few thoughts.

Small messes can be kept small or even eliminated through continuous bits of small "cleaning" that come as side effects of common professional practices like refactoring as you go, breaking unneeded tight coupling/dependencies, requiring code reviews and test coverage and sometimes design or API reviews, and various other things. Like maybe you have dedicated pre-release-deployment "hardening" time or something where for a few weeks no new feature work is done, just quality improvements and maybe some cleaning. Whatever, there are lots of methods to try or argue about that are part of the idea of doing "good (and clean) work", with large variances in their effectiveness. I think most software written handles small messes well enough, otherwise they would all turn into big messes and nothing would get done.

Big messes, like the aforementioned room of shit, can't be cleaned with such minor efforts. Where would you even begin? If you just clean a little bit, it'll quickly become dirty again as the remaining mess spreads back over; it might even seem like it's multiplying. If you try to build a room next to it, but still keep the shit room accessible with a door you sometimes use and move things through, the mess is likely to spread and you may soon find yourself with effectively a bigger shit room. Your only options when faced with a room of shit are to either figure out how to clean the whole thing, if you even can, or find a way to stop interacting with it, in other words ignore it to some extent.

This lack of options for dealing with the problem is a key characteristic of technical debt, which makes it quite different from regular debt. With regular debt, you can "pay it down" in small or large chunks with nice tradeoffs with respect to effort, or you might just get rid of it through bankruptcy or some forgiveness policy, or you might be keeping it from growing by at least making interest payments, or you might just be making so much money and building so much wealth that you can afford to ignore the debt -- you could even forget to make payments and still be fine if anyone ever comes complaining because you're so flush with cash. If paying down debt is akin to cleaning technical debt, it doesn't fit, because there can be technical debt so big and so messy that you can't actually clean it, whereas even a huge debt can be paid a little bit.

Debt also has multiple connotations -- sometimes it can be very good and useful to acquire debt! This is a poor idea for software, hinted at in the second tweet quoted, where people can get the idea that tech debt is so simple to acquire and pay off, so let's knowingly introduce it and go faster! Maybe. If instead you see tech debt as mess, are you going to want to intentionally introduce a mess? For what purpose..? (Devil's advocate: you might be shipping a one-off, like a game, and you don't know the best approach to do something, but you need to get it done, and just doing this ugly thing this way works well enough, so move on. Or some actions arguably might not directly produce tech-debt/lead to a mess, but merely have a risk of tech-debt/mess, and so it's a time gamble that might pay off.)

Surprise surprise, I don't like the debt metaphor, or any metaphor that stays in the finance mindset (tech wealth, margin calls/"unhedged call option"...). The act of creating and destroying and changing code is just too dissimilar to finance. The only metaphor I do like is that of a mess, where it's clear the only options are clean or ignore somehow.

But in the eternal conflict of management and development, tech debt is an established, very useful excuse for why something is expected to take, is already taking, or in retrospect took kind of a long time, when the exact reasons why are hard to articulate right now. Many organizations already have managers and devs who have come to a sort of mutual understanding around this, and that's not something useless that should be discarded just because the metaphor sucks. Having this excuse be acceptable and valid reduces the friction involved in making software. If programmers were forced to articulate the details behind their intuition of "there is tech debt involved" every time, things would go slower just from the articulation process alone (if it can even be fully articulated, people have complexity limits), and then there are the problems that would come from the political blame-games that would soon dominate company culture.

Why is tech debt hard to articulate? Well, how do you define "mess", how do you define "clean"? (I certainly won't define either here, or the verb "cleaning".) How many explanations/reframings/deconstructions of "tech debt" are there? The terms we use to talk about this stuff in general are vague already, so it makes sense for specific things in the code base perceived by intuition to be "tech debt" to be hard to reify into specific problems, especially in a short form right this moment during standup. The size of the mess doesn't really help things either. If it's a small mess, there can be endless subtlety and even disagreements. If it's a big mess, like the extreme example of the shit room, you just want to say "just look at it all!", which doesn't help a non-technical manager who lacks any intuitive concept of code cleanliness and can't really see it.

If you haven't developed a familiarity with the shit room, but you have to work in it anyway, you're psychologically probably not going to want to articulate the details very much. If you're the lucky person who needs to work in a shit room to get this thing done, then ahead of time, there's just this vague shit room you know you'll have to explore some subset of the details of to finish, you expect things to go the opposite of smoothly, but you hope you don't have to spend too much time with it; during, you're focused on fixing the problem and getting out as fast as you can; after, you just want to forget about it.

This can be countered to some extent by permitting some preliminary exploration, or "spike" work, and the output can serve as at least partial articulation for future reference. I've got two examples of tech debt from my last job near the end of the post, but I've left out literal pages and pages of articulation that was done by me/other people over time.

If you don't yet agree, then at least humor me that despite not defining cleanliness or messiness, tech debt is messiness whose actual concrete data-processing-level reasons for being messy can be hard to articulate, at least until you investigate, and that this feature, plus the lack of options about what to do about it, makes it a distinguishable concept. It's especially distinguishable from other kinds of things in software that get in the way of doing things faster.

For an example of another such thing, there are many kinds of testing, and they all take various amounts of effort. Testing is inherently an economic activity, with "how much testing" dependent on how much time and money you're willing to spend. "I will need to write automated tests in addition to the production code, so that's why I've estimated I need this extra time on the work item" is a valid excuse and something that should be articulated if it's not already captured and shared by a compressed thought like "quality work" or "trust work". Looking to manage this, you have a number of options to deal with a project you think could go faster if some testing burden were relieved. You could add more manpower just for the testing side of things, add more CPU power if automation turnaround is slow, cut some testing requirements... Ideally that last one by cutting features, but you can also adjust priorities of types of tests, and you might even be in a special-but-it-happens case where you can totally be fine with no automated testing...

And one more example, there can be a prerequisite that needs to be delivered by someone else before you can finish your piece. Even if your piece is now "done" barring that prerequisite being slotted into place, "I'm waiting on them" is still a valid excuse for why things as a whole aren't done yet. And this can be addressed with rescheduling (if your piece isn't done yet, and can't be done until the prerequisite arrives, maybe you can work on something else instead, or maybe you can go help get the prerequisite shipped, or if you are done then you can do something else instead of "waiting"), investigating a way to move forward without the prerequisite or with a dummy temporary version, playing some politics game to make sure higher ups are prioritizing the prerequisite above what the team responsible for it may want...

My point with both examples is that they are common causes of slowdown, but they have more than a few ways to resolve them and either get back some speed or parallelize and get more done in a fixed time.

Tech debt though, the only solutions that address it are to clean it very thoroughly, or find a way to ignore it. If you do neither, and just accept it, you're going to go slower, because you're working in a growing mess.

Doing neither is certainly common, especially with bigger messes, since with skill and determination you can go quite a long ways in extending, changing functionality, and fixing problems of your money-printing messy ball of mud before things become untenable.

But when an attempt to fix tech debt is made, I think unsurprisingly some form of "ignore" tends to be the preferred solution. It feels like you've re-captured a lot of options that I just said you don't have for technical debt, but it's an illusion, because they all exist in the "ignore" space and only cleaning has a chance of restoring development speed and not just preventing more slowdown. Let's sample the space a little though, and see how "ignoring" relates to "isolating", "encapsulating", "modularizing", or similar such feel-good terms.

"Separation of concerns", i.e. component A being able to ignore component B's existence, is a common practice that helps prevent small messes from joining together and becoming big messes, and you can live with and even occasionally clean a group of small messes more easily than a giant shit room. Plus, some separated rooms might be clean already! There are modularity techniques that actually can reduce the risk of a mess spreading, you don't necessarily need to connect to a shit room with a swinging door that all but guarantees it'll quickly make a mess of the new room.

Any time you're faced with the "cleaning" solution, you also have the option to consider first breaking up a large mess into some smaller messes, and cleaning those separately. Even if you don't get to that second cleaning step, the modularization step helps, and probably involved a bit of, if not cleaning, then restructuring just to establish boundaries. Cleaning a big mess seems most approachable this way; however, it's not always possible, and at the end of it you'll probably have some disagreement about whether to join all the smaller separate clean things back together into a nice and coherent big clean room, or let them be. (The "many small pieces" style is a variant of the "ravioli code" accusation some people thoughtlessly level at some Java projects.)

Another type of "ignore" is to do a deletion. Maybe your project doesn't actually need this thing that's causing you nothing but pain, so just get rid of it. You might have to do some sort of deprecation period to keep people using it somewhat happy that they have a chance to move off of it rather than it suddenly disappearing, but as you drop support for it and eventually delete it, you've progressively ignored it out of existence.

A related type is to do a rewrite. Then you can create your new clean room, or your series of new clean small rooms, whatever you prefer, but you do it Right this time and make sure it's clean. The old one has been ignored. You might actually delete it, or maybe keep it running but without maintenance. There are some truly messy critical code paths out there that mostly work and keep doing their thing untouched. You might try to offer a migration step, if you can do so without such a concept fundamentally compromising the new design and leading to a migration of the mess rather than its elimination.

But still, even that might be acceptable. A septic tank on a rural property eventually gets full, having reached "peak messiness", and you now have to do something about it. So you just seal it up, bury a new one next to it, change the connections over, and bam, old one ignored, new one functioning even though it's getting messier all the time. The new one might be a "new model"/rewrite that'll perform better, or it might even just be the same as the old one, just with its state restarted so it is clean again. In any case, you know you're just kicking the can down the road, but you'll have a good deal of time before the mess in the new tank is big enough that you need to do something about it again. And really, can you do much else? Some things, like human waste, are inherently messy. I think too there are domains in programming where code will be inherently messy. If you can keep it isolated, let the rest of the system ignore it, that's a win.

Even when things aren't inherently messy though, it can still be a useful and economical ignoring technique to seal the mess up from the rest of your application, and put out a new fresh starts-clean version in its place. Isn't this just what Kubernetes suggests? Though I'm not a fan of the "treat your services like cattle" metaphor either, and I have a feeling actual ranchers might have some disagreements with the concept, I'd support the metaphor of "treat your mysteriously failing services like full septic tanks".

If you just clean, though, you don't have to spend all this effort... But cleaning, especially a big clean, will take more time up front than most other ignore techniques, and that's even assuming you can actually clean the thing. If it's too messy to clean no matter what you might do, you're just stuck with it unless you can figure out how to pull off an ignoring technique, like first modularizing and then cleaning smaller (and actually cleanable) pieces.

A final reason to avoid prioritizing cleaning first has to do with where tech debt even comes from. How do messes form? There's a fine line between a shoddy job where those involved should have known better (or left a paper trail if they did but the concerns were ignored), and simple nature where a thing designed for A is hard to repurpose for some quite different B but those involved tried to do so anyway. Both can result in messes, though the latter in principle can be done in a clean way, and perhaps when it ends up not clean it ultimately just comes back to the former as to why, but on the third hand, "should have known better" demands that predicting the future around this change was somehow easy, which isn't always the case.

This "third hand" of messes coming from not being able to foresee the future and thus not having designs (clean or not) suitable to easily handle new requirements can make a large cleaning seem pointless. Yes, in theory, if we competently do a big cleaning on this big messy thing that currently does A, we'll be able to more quickly do maintenance on that thing, extend it, and even repurpose it to also be able to do B in the future. But frequently such B's are really different, and take a long time anyway. We end up needing to do the same sorts of modularization techniques that we didn't do when we just cleaned the thing, because the process of trying to make the thing do both A and B has resulted in a fundamentally unclean design that can't be cleaned and can only be dealt with through modularization or a total rewrite that's aware of both requirements from the start. The original cleaning does make that modularization easier than it otherwise would be, so things would have taken longer without it, but it's hard to say whether the time saved exceeds the time the cleaning itself took.

In other words, big clean things are fragile, and risk becoming unclean at the fundamental design level if they're asked to be extended too much, despite the best efforts of those involved. When you add in the possibility of shoddy work somewhere, they may become unclean at much more than the design level. How many programmers can tell a story of their pretty clean code suddenly being violated by someone else, maybe even a different team who didn't add you to the code review and has managed to politically get their changes in such that you can't just revert or force them to clean up? My favorite variant I indirectly witnessed involved a big fight like that, but the code lived in a git repository and only the owner could merge pull requests. He wouldn't budge until the submitting team cleaned up their code to his standards. More time was spent by the submitting team trying to argue and get their manager to go over the head of the repo owner and have his manager tell him to force accept the pull request, than it ultimately took them to clean up their mess of a request once they finally lost the political battle.

If the other team had won, though, at least it would have only impacted that small repo, and the change was small enough that the maintainer could probably have found the time to clean things up himself later. Having smaller separated things has a lot of benefits in the face of possible future change, even though nothing tops a clean and beautiful cathedral.

I've again been pretty vague about this, I didn't go into what made that other team's code unclean or where it came from (apart from a generic shoddy work ethic). I'll instead give two brief accounts of technical debt my team had to deal with, some of it our own fault, some of it not. They're not articulated very well here and nowhere near to completion, but hopefully there's enough detail to see the glimpse, and it's more interesting than something very concrete (and relatively easy to clean, all things considered; a small mess) like: "this area has 6 classes with original handling code in handle() that was mostly copy-pasted, except for some types, maybe the original author was uncomfortable with generics, anyway later 2 of them were slightly modified in the same way for some case but the others left alone, recently another one was modified to be slightly different from all of them, and today we need to make a change that impacts the lot of them. As part of that change I'm going to clean this up, starting with removing all the duplication, and things will be easier to understand, easier to add the new changes, and easier to see where any differences in the handlers are. (A little later...) Oh yeah, and in the process of doing this, I found that in the change they made to those two classes back then, they missed one and should have applied it also to this third class, anyway it's fixed now and we can close that item no one looked into yet about the uncaught exception that got logged and was deemed low priority for only happening once."
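To make that quoted small mess a bit more tangible, here's a rough sketch of what the cleanup could look like. Everything in it is made up for illustration (the `Handler` name, the strip-then-parse logic, the error list); the real classes weren't Python and did more than this. The point is only the shape: one shared `handle()` parameterized by the bit that actually differs, so a fix made once applies to every handler and the differences are visible in one place.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class Handler(Generic[T]):
    """One shared handle() in place of several copy-pasted ones (hypothetical)."""

    def __init__(self, parse: Callable[[str], T]) -> None:
        # The only thing that really differed between the copies: the type.
        self.parse = parse
        self.errors: List[str] = []

    def handle(self, raw: str) -> Optional[T]:
        try:
            return self.parse(raw.strip())
        except ValueError as exc:
            # Stand-in for the fix that had only been applied to two of the
            # copies; living here, no handler can miss it.
            self.errors.append(str(exc))
            return None


# What used to be separate classes becomes a line each.
int_handler = Handler(int)
float_handler = Handler(float)
```

With this shape, "recently another one was modified to be slightly different" has to show up as an explicit argument or subclass rather than as a silent edit to one of six copies.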

When you drop the metaphors and euphemisms and get into the details of things, i.e. when you articulate the technical debt that's in play, then you start to see not only what it is but also where it came from. While it's not worth it to do the articulation exercise every time, because the process itself can be costly in time, it's worth it to do it more than rarely. It's not always straightforward even if you're only communicating with other programmers, there may be code archaeology involved, you may need to dig through some shit and do preliminary work towards a specific goal just to see what happens when you do stuff, all in order to form an understanding of the mess so you can formalize and articulate. You might also see a common source of the mess (hopefully not It's All This Certain Individual, which is how the blame-games start, but sometimes there is such a person who just needs to go...) and by articulating it you can discover common lessons or newfound better practices that can be enshrined in processes if you're into that, and help prevent similar messes from arising in the future, or at least prevent the mess in this area from getting worse.

You also might find that management can understand more than you might have given them credit for, when you treat them as functioning intelligent humans who can see the messy complexity slowing you down after reviewing your document(s) articulating it. It's important for managers to not always buy a constant mumbling about technical debt, and programmers shouldn't egregiously use it as an excuse to slack off. (It can be draining, I know; there's a reason my favorite chapter title from Working Effectively With Legacy Code is "We Feel Overwhelmed. It Isn't Going To Get Any Better.")

Anyway, a large example of technical debt from my last job stemmed from an unfortunate decision, long before I started, to prefix most of the customer's custom URLs with "/s/". (Product allowed them to make their own web pages.) Being able to support the option of "removing" that is basically impossible, it's a big shit room/warehouse of tech debt, and the reasons are deep, hard to articulate, and many weren't discovered until things were tried and then testing happened to see what broke. One person somehow got the OK to spend a whole year of his time to figure out how to disentangle/remove a related underlying piece to this mudball, but he then quit and I've been gone too so who knows where things are at. When I left, new deployments on a new tech stack were able to have this removed, at the cost of the new stack not being able to do everything the old one could, and that's likely to be the most that'll ever happen. All over such a small thing, that seems like it might be fixable with a single route config change somewhere? Yup. It's all "technical debt" stemming from and building on that original sin many years prior.

A more minor example that nevertheless occurred more than once (good candidate to articulate) had to do with bits of pre-packaged data identified by some string name. Originally they were just pre-packaged 'out of the box' data-things customers could use and refer to by the names we gave them, to put on their web pages. But later on, we wanted customers to be able to make and name their own custom things. But we can't have name collisions... Ok, so just if-check against the existing set we made, customers can't take those names, done! Now later on again, we want to add more out-of-the-box things, but now we need to make sure our new names don't conflict with any customer names that they've made since! Ok, we can do a production database scan with some extra code processing and make sure our intended names have no collisions among our thousands of customers, and if there are any we can change our name before deploying. Or we could add some auto resolution logic that appends successive numbers to the name until it's unique, but then we need to update code to check for more than just the name to make sure it's "our" thing... Either way, later on we want customers to be able to share their custom data with other customers, and not collide with each other's names... There are many more possible missteps that could happen before anyone comes to the idea that maybe namespacing should have been done in the first place, or at least should have been done at any step along the way where the solutions just made things messier.
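A toy version of the first couple of steps, to show how each "done!" sets up the next problem. The names, the sets, and both functions are invented for this sketch; the real thing lived in a database and wasn't Python.

```python
from typing import Set


def customer_may_use(name: str, builtin_names: Set[str]) -> bool:
    """Step 1: the if-check. Customers can't take names we already shipped."""
    return name not in builtin_names


def add_builtin(name: str, builtin_names: Set[str], customer_names: Set[str]) -> str:
    """Step 2: auto resolution. Append successive numbers until unique."""
    candidate = name
    n = 1
    while candidate in builtin_names or candidate in customer_names:
        n += 1
        candidate = f"{name}{n}"
    builtin_names.add(candidate)
    return candidate


builtin_names = {"clock", "weather"}
customer_names = {"calendar", "calendar2"}  # made by customers since launch

# We wanted to ship "calendar", but end up shipping it under another name,
# so any code that recognizes "our" thing by checking name == "calendar"
# now matches a customer's thing instead.
shipped_as = add_builtin("calendar", builtin_names, customer_names)
```

Nothing here is wrong line by line, which is sort of the problem: each step is a reasonable local patch on a flat, shared name space.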

I think the final solution ended up being a crazy dual id setup -- the "main id" used for most lookups is some uuid, and the "secondary id" is a human readable string that has an alphanumeric namespace component and an alphanumeric id name component and is used in sharing contexts or read-by-humans contexts. Still not at all clean, there was another ID layer I didn't get into, and the pre-packaged things were still a bit special but at least encapsulated into one spot. Things would have been quite a bit cleaner if namespacing had been considered from the start, as perhaps it should have been, given company API precedent, but at least after this happened more than once, new id-things in the same space could go straight to the dual-id system instead of stepping on the same sequence of rakes, and things didn't get messier. It wouldn't surprise me if there are still occasional bugs being fixed from this ongoing technical debt, because that whole contextual-ness stuff was never fully abstracted and so has the chance to pollute any code that needs one or both ids...
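For the curious, a minimal sketch of the dual-id idea as I've described it, not the actual implementation. The class name, the field names, the `:` separator, and the use of a random UUID are all assumptions made for the example; the third ID layer and the special-casing of pre-packaged things are left out.

```python
import re
import uuid
from dataclasses import dataclass, field

_ALNUM = re.compile(r"[A-Za-z0-9]+")


@dataclass(frozen=True)
class DataThing:
    namespace: str  # e.g. a customer, or a reserved one for pre-packaged things
    name: str       # human readable, only unique within its namespace
    # "Main id" used for most lookups; a random UUID is assumed here.
    main_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        for part in (self.namespace, self.name):
            if not _ALNUM.fullmatch(part):
                raise ValueError(f"not alphanumeric: {part!r}")

    @property
    def secondary_id(self) -> str:
        # Used in sharing and read-by-humans contexts. The separator is an
        # assumption; it just can't be a character allowed in either part.
        return f"{self.namespace}:{self.name}"


ours = DataThing("builtin", "calendar")
theirs = DataThing("acme", "calendar")  # same name, no collision
```

The same name in two namespaces is no longer a conflict, which is the whole thing the earlier if-checks and number-appending were trying to paper over.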

In both these examples the tech debt "compounded" from its original form, which does help the "debt" metaphor, but it's not like that could have been avoided by making sure at least interest payments are made. For technical debt, remember, is just a mess. If you're able to leave it alone and ignore it, it won't compound. But if you try to alter it or extend it or build on top of it, and you don't clean it first (effectively paying off all the debt owed, because very rarely does a partial cleaning of a messy whole on its own do much), then the new system as a whole is going to be messy and your total debt will have increased, and whatever shape the final system is to take will be harder to implement and maintain. Because of customer dependence a full clean basically became impossible, too, which doubles as a lesson to be really damn careful about things you let others depend on and change, or think you'll let them depend on in the future.

To reiterate some things and wrap up, once again tech debt is just mess. Many small messes are relatively easy to clean directly, and frequently stem from widely recognized poor/shoddy development practices. These even become easier to articulate once you've dealt with them enough, either by direct cleaning or by some form of isolation/ignoring. My favorite book on such things is again Working Effectively With Legacy Code.

But cleanliness, the opposite of messiness, goes beyond just well known shoddy practices like copy-pasting a snippet of code all over instead of defining a function. It can get quite subjective when referring to style, and it can get quite large when you have this monolithic old piece of software that can only be charitably described as a Big Ball of Mud. Such properties make it hard to articulate, and also hard to effectively clean, with a form of isolation being a common path towards something at least a little less smelly. (Note, this isn't to be taken as a blanket endorsement of microservices, which don't necessarily achieve isolation just by not sharing a memory space.)

Tech debt is a useful metaphor to reduce friction in development, but I think "mess" works better and helps avoid some bad ideas that finance metaphors can lead to, and actual concrete articulated issues work best if you can get them.

Posted on 2021-11-23 by Jach

Tags: programming
