What I Talk About When I Talk About Tech Debt

October 11, 2024

It is hard to find an engineer who could not go on for a couple of hours about how tech debt is not handled properly and the problems it causes for their team. Research in this field, along with my personal experience as both a developer and a manager, shows that tech debt is less visible and understood on the management level, while developers are exposed to it almost daily. For those outside of engineering, tech debt is even less understood; however, chances are it has already become a curse word. Let’s try to reconcile these viewpoints and hopefully help all the groups understand the nature of tech debt (and each other) better.

To the best of my knowledge, tech debt is here to stay for a long time. It is merely a tool that can be used for both good and bad. Let’s dive deeper into the topic to understand why I think so.

Is there a good definition for tech debt?

Of course, we can start with Wikipedia - not that people don’t have a definition for themselves, but it is a first step:

“Tech debt is the implied cost of future reworking required when choosing an easy but limited solution instead of a better approach that could take more time.”

The debt metaphor was introduced by Ward Cunningham ¹ and I like it a lot:

It is a deliberate decision to speed ourselves up and achieve our short-term goals faster and also that we are kept accountable to pay that debt back over time.

Just with credit cards and compounding interest, if you don’t pay back soon, it will cost a lot later. The definition mentions a limited solution and this needs to be communicated clearly to the project stakeholders: what are the limitations and what is the cost of moving us faster now. Please keep in mind: just because we choose an easier solution now and deliver the first couple of features faster, does not mean the whole project becomes magically faster forever. Good design and getting the fundamentals right is working against debt, and paying dividends like being frugal and investing our money - to stay with the financial metaphors.

Anything missing from the definition? Software left alone tends to decay in quality which builds up debt-like characteristics as well. Software systems do not exist in isolation and over time everything changes: the people working on them, use-cases, business and pricing models, their technical dependencies, new CPUs are invented, programming languages go out of fashion or get backwards incompatible changes, cloud-based solutions emerge, old security measures are not good enough anymore, growing databases perform slower than before. The list goes on forever. These changes happen without you wanting to touch the codebase at all. Mr. Cunningham referred debt in the context of source code and its refactoring, and today we are using the term in a broader meaning (code, build, release, documentation etc.).

Notice, that in the above paragraph, the shortcomings of the system does not come from its current state, but from an ideal, imagined state, which is influenced by many things outside the scope of the company or the team working with the system. This Google study from 2023 arrived to the conclusion that predicting what will be tech debt for their developers, based on current, available metrics seems impossible - exactly because “An engineer’s judgments about technical debt concern both the present state and the possible state”. Let’s not forget the subjective element, that comes with people judgments and what they consider possible or good enough of a state. Still, your engineer’s opinion about the codebase and systems they are working with and how they perceive technical debt, greatly influences morale - thus listen to your (fellow) engineers!²

DEFINITION: Tech debt is the cost of original design decisions that became critical issues as recognized by today’s software industry, which - left unchecked - makes incremental changes to the solution even harder over time.³

Let’s reiterate the key characteristics of tech debt:

a deliberate design decision was made originally (e.g.: use a given technology or implement a limited solution)
Hindsight bias plays a role (using Python2 years ago was not bad decision until Python3 came along and got supported)
incurs an ever-growing cost as fewer people understand the old solution (either internally or maybe even industry-wide, think FORTRAN or Python 2 😉)
the issue is critical in the sense that it causes significant problems for the current engineering organization and transitively the whole company (critical as a word is vague on purpose to encourage you to understand what is critical in your context)

Based on the above, we can also define, what is NOT tech debt:

poorly written code: developers are humans!, they also make mistakes. Implementing something partially, ignoring an edge case, not thinking about scalability for some time is one way of approaching a problem. For a prototype, first release etc. can be a good decision - but the code itself should be clean and clear enough for the next developer, that this was the intention. If the code is buggy, not understandable, solving a different problem than it should: that is something else, not debt.

We considered the human element in the first chapter, and especially because Hindsight Bias, let’s address that here as well. Every time you encounter something that you consider as tech debt, keep in mind:

assumptions could have changed since the initial implementation (the company understands the market and its customers better, a given technology went out of favor or simply got deprecated)
whole ecosystems could have changed (think about a Python 2 to 3 migration and change)
you might not be right 😄 especially if this is a special edge case, or you did not gather the domain expertise yet. At first, give the benefit of the doubt to whom wrote the code.

Is tech debt good or bad?

POSTULATE 1: There are two types of companies: one is having troubles with tech debt, the other is out of business.

As an edge case, if you just started your company, there will be no tech debt. However, around one year in, there will most likely be discussions about what needs a rewrite.

CONSEQUENCE 1: Tech debt is a good - it enabled the company to create business value in time.

If it is a good thing, why on earth are we talking and complaining about it so much? Just check out this 2018 Stripe study - around 30% of our time as software engineers is spent addressing technical debt. If you don’t believe in random company studies, perhaps a research article convinces you that technical debt is crippling developer productivity as, on average, 23% of all software development working time is wasted due to tech debt.

You shouldn’t focus on dry effectivity metrics though: that 23% time waste is felt by your developers the most and just imagine that one quarter of your job feels useless and wasted. It is making your people miserable, on the long run maybe even burnt out. It is harder to measure the social or emotional effect, but it will affect how long your people will stay at the company and how the company brand is doing.

CONSEQUENCE 2: Tech debt is a bad - it slows you down today, will do so in the future even more and eroding the morale of the engineering team and then the team itself.

Like many things in life, having tech debt is a balancing act. The financial metaphor helps us here too: if you borrow too much money, it will suffocate the company. At the same time, in the case of tech debt, there is no 3rd party like a bank to collect on the debt. So you need to have the fortitude to pay it back by yourself!

If you have a hard time believing the detrimental effects of debt, consider the recent case of Sonos releasing a new version of their mobile app. They neglected their codebase for a long time. To the point that “software was based on virtually obsolete infrastructure and code languages” and they spent more time sorting out the mess than releasing new features. Not only that, but the new version was buggy, as they did not have time to test it properly. Finally, the release hurt their earning, stock price and probably hiring too.

Do not think that I am singling out Sonos to bash them. At a previous job, we had a laptop at IT with a post-it note on it: “do not wipe this machine”, as that had the only developer environment that could build an old version of desktop app. This laptop was found by chance, and it still took 1 week for 2 people to make the build working. Thankfully, we did not need to do an actual release!

And here is the story of LinkedIn’s developer infrastructure changes, that stopped all feature development for a couple of months.

One important piece of information is still missing though - how much tech debt do I have? So when do I need to pay it back? Let’s take a closer look at what type of issues are considered debt and then maybe we can figure out how to measure it.

Usual suspects of debt

If you are unsure where to look for tech debt or maybe a categorization, we can turn to the Google Study again, which iterated over the types of debts engineers are encountering, until less than 2% of engineers chose the “other” catch-all type. I will use the categories from the study and add my own comments about them.

Code quality: either the product architecture or source code within a project was not well-designed. It may have been rushed or a prototype/demo went to production. Product requirements have been changed, or many people worked on the codebase without proper synchronization or and architect helping the out.
Code degradation: The code base has not kept up with changing standards (internal or external) over time. The project may be in maintenance mode.
Documentation on project and APIs: the code or system is hard to understand, not documented well, an API hard to use, or simply it is hard to find the information about them.
Test quality: similarly to code quality, the quality of tests are not satisfactory might be missing tests, insufficient coverage, not realistic test data, flaky tests.
Migration is needed or in progress: a migration might be needed because of performance issues (or scaling), security problems, ISO/SOC2/GDPR compliance requirements, dependency issues or avoiding outdated technologies.
Migration was poorly executed or abandoned: a migration is done, when it is 100% done. Though it is easier said than done - a previous stopped migration can result in maintaining two parallel versions, or if you had multiple unfinished migrations, then even more. A hell for platform teams.
Dependencies: too many of them, then upgrading takes a significant amount of time. Dependencies can also be unstable or frequently changing (eg you are using bleeding edge technology)
Release process: the rollout to production is not automatic, too slow, buggy, its monitoring not working well. I am a big fan of one-click or automatic rollouts, though this might not fit every industry or technology.
Dead or abandoned code: service is decommissioned from production, features not supported, but it is still there, CI is building it, unit tests are running on dead code, but not removed. Infra components for the given codebase can also be around. Having a maintained checklist of what to do to fully decommission a service works well (or if automated, even better)
Team lacks necessary expertise: a favorite, as it shows tech debt can be created by organizational issues as well, like hiring gaps, an inherited project or high attrition.

Ohh, and of the course, one of the greatest signs: The Rewrite Project. A complete rewrite of a system is very likely a default on debt, when the burden is high enough, the organization deciedes that recreating the whole thing is more economical. What means a migration essentially. If you are doing a rewrite and think you don’t have tech debt, think again.

Maybe you should not do a rewrite at all - anyway, I did, you did and most probably we will do it again 🤷.

How to measure tech debt

Technical debt is not affecting directly the bottom-line of a company. Period. But as we could see, around 20-30% of the workload of an engineering organization goes toward fighting it, while sometimes the goal of a feature development project is a couple of percents of extra revenue. It is essential to measure or at least quantify the impact of tech debt or a project that aims to eliminate some of it. Also accept, that tech debt will not be totally removed. It is the best if impact can be translated to a dollar amount. As engineers have a salary, engineering time can ve converted to a dollar amount too!

Developer surveys

If you are new to the company, or the situation is not clear at all - you can ask the people working on the codebase: your developers. Constructing a survey around tech debt and understanding what the developers think about it is a good start. Doing it regularly, and your will have an understanding of the direction the company is going. The above Google study can be used as a starting point, you want to understand how much of the workload is going towards fighting issues and what are the biggest impediments.

I love surveys for another reason as well: they can be combined with other metrics and measurements, which might be more automated (like DORA), but discount the subjective factor of what the developers think and how morale tracks over time.

Maintenance load

Another approach is to introduce the measurement of maintenance load.

maintenance load how much effort your development team spends to keep the existing features running the same as before.

I think it makes sense to think about tech debt this way - it emphasizes that each and every feature has a maintenance cost, not only development cost. How much? A quick search on your favorite search engine estimates maintenance around 60% of overall costs, though in some cases 90%. Don’t believe me? How many times did you hear during feature development that we are taking on some tech debt in order to get a feature released faster? 🤔 I rest my case.

We measure maintenance load by ongoing developer effort. In the above post, an average engineering team grows maintenance load by 1 developer every 2ish years - while I think these numbers are very context specific, still in the case of a 10 years old codebase it means 4-5 developers working only on maintenance.

Don’t forget that at a bigger company, there are multiple ecosystems and platforms - if you are building a web and desktop application, the maintenance load doubles as your exposure to outside forces also doubles.

DevOps Research and Assessment (DORA) metrics

Extensive research is going into discovering what are the metrics describe great software delivery processes. DORA identified 4 keys (metrics) that work across a wide variety of technology organizations, and top performing teams do well on all of them, while low-performers tend to do poorly. 2 out of the 4 are suggested to measure the impact of tech debt and I like this approach as introducing these measurements are not only helping you fight tech debt, but establish high-level metrics for the entire engineering organization and helps the communication with leadership.

Change lead time measures the time it takes for a code commit to be successfully deployed to production, thus it reflects the efficiency of your delivery pipeline.

Change fail rate measures the percentage of deployments that cause failures in production requiring hotfixes or rollbacks, thus it indicates the reliability of the whole delivery process.

Find a way to automatically collect these and establish a baseline, track trends, and you can even set up triggers. Assume that change lead time is 1 hour, which - depending on your context - can be great or awful, but nevertheless makes a good conversation if it should be lowered. While there could be an agreement, that if it reaches 90 minutes, a project will be planned and started to lower it. Notice, that we are constructing a Service Level Agreement (SLA) for the delivery pipeline here!

Tech debt index

Tech debt index is the total technical debt divided by some measure of the whole system’s scale.

Suggestions are that tech debt could be measured by hours to fix or story points, while the system’s scale with the size of the codebase as lines of code or function points. I think it is a bit hard to maintain a full list of tech debt and making sure there is an estimation attached to every piece of it, and it is also pretty hard to know all the functionality that went into a sufficiently large system. Lines of code is too easy to game: you don’t want people to commit too much code to lower tech debt, or being afraid to remove code as that would seemingly increase tech debt ratio.

At the same time, an artificially created tech debt index can come in handy.

Example: At Prezi, we compiled a full list of services, and collected all possible information about them: what deployment and monitoring system they used, dependency versions and how old they are, if security fixes were applied properly, OS versions, what type of AWS machines they run on, etc. For every aspect, we assigned a score and if a service used an old type of deployment, it would rank higher on the tech debt score - same for any other category. This created a list where it was easy to see which services needed the most attention or which ones were not maintained properly. As the list was automatically re-generated every night, it was a great snapshot, plus with the historical data, trends could be spotted as well.

Constructing a tech debt index also assumes that you understand what is contributing towards debt, and then you can construct metrics for those areas - this is what we explore in the next section.

Target specific aspects of tech debt

It is very likely that your issues will fall into the categories previously mentioned, so let’s take a look again:

Code quality: if it is a design problem, it is likely that the problem is poorly written code, which is not tech debt by our definition, though probably it is missing some non-functional requirements: response times are slow, the service is not reliable, security issues, etc. It is likely you already have the metrics to represent these. Introduce code reviews, measure their length, comments made and survey developer’s opinion of code quality.
Code degradation: Decide on the Return on Investment for bringing the code up to today’s standards, use the metrics those standards aim to ensure.
Documentation on project and APIs: collect the non-documented APIs / projects, use surveys to measure understandability of the documentation.
Test quality: coverage and flakiness are measurable, chances are that your Continuous Integration (CI) system is mixed into this area. Ratio of successful jobs, CI job duration are good candidates.
Migration is needed or in progress: track the list of migrations needed over time, and don’t forget a migration is done when it is 100% done!
Migration was poorly executed or abandoned: see the previous point.
Dependencies: a very broad category, you can try to track the list of dependencies and their age, or the number of interrupts (like a security issue is found, and you need to update this library/OS everywhere) they are causing then the time these interrupts are taking.
Release process: the above-mentioned lead time fits here well.
Dead or abandoned code: tooling exists to discover unused pieces of code. Relentlessly hunt for those, even better if you can prove a whole service is unused or just marginally used. Remove them without mercy. Use instrumentation in production (if possible) to identify unused API endpoints and code.
Team lacks necessary expertise: in many cases, I just created a spreadsheet with a list of components in our ownership, and asked every team member to categorize their expertise on a 1-5 scale. Gaps can be identified quickly, and we organized learning opportunities for the debts-wise high-ranking components. Of course hiring, passing the ownership are also possible.

After discussing metrics, let’s look at how to reduce or avoid tech debt.

How to have less tech debt

By collecting detailed information about what is considered as debt at your company, a repository of debt is automatically created. These are great project ideas for the future, and you can channel them into the project management and planning process of your company. Do a lightweight priority exercise to see what are the most pressing issues and represent them.

It makes sense to estimate the impact, urgency and effort for the tech debt projects. At first, T-shirt sizes will be fine enough - especially if you have 100s of ideas right off the bat.

Put metrics in action, establish the baseline and it will support your impact estimations and goals for the project ideas. Always convert the impact to time or dollars saved, or company KPIs! It will make collaboration with leadership an order of magnitude easier. Saying that deployment is slow will not get you anywhere. Doing a simple back of the envelope calculation that our engineers wait 156 hours every month on deployments, which costs X dollars turns the volume up!

There are some changes, that you can do without getting into the project management process. I like to call them Boy Scout changes.

Boy Scout changes

Many teams, that are successful in handling tech debt, will not ask for permission to make the necessary changes. In many cases, these changes are not taking up too much time, some cases I met:

a source code file does not adhere to the introduced structure/format requirements (in Python’s case, just run e.g.: black on it)
there is a missing unit test case
code is not easily readable, variable naming is not consistent

Fix it right away, when you are touching that part of the code - if you are working at a place that does not empower you to make decisions about 10 minutes of your time, or have lengthy processes to get permission for such changes, then the actual problem is not technical, but organizational.

Example: with a moderately-sized Python codebase, many developers complained that PEP-8 was not followed despite some efforts and even git pre-commit hooks that ran checks. With many files in violation, it did not seem feasible to fix it by hand. Enter autopep8 - which automatically fixed 90% of the problems and I could do the remaining ~200 manually. People were more than happy to accept the pull request.

The above also pinpoints a good migration strategy: do the easy bulk by automation, which immensely boosts effectivity, and you will also pull out less of your hair because of manual repetition!

This approach also fights against Broken Windows: these small problems build up over time, everybody sees that neglect is the norm, new joiners or juniors simply assume they can use existing solutions as an example and just copy the previous behavior. On the other hand, if they see what needs to be changed, they will also copy the fixing behavior and your codebase will be better over time.

This approach also has the benefit of reducing debt gradually, and at the places the team works the most. As debt removal speeds up any future development too, we are making our own life easier for the next feature.

Tactical-level tech debt

Some technical debts can be handled within the scope and deadlines of a given project and most probably by a single team or by a collaboration among a few teams, like:

there are no integration tests for a service / API - implementing them and adding it to CI, communicating about it within the company might take a couple of days or weeks depending on your company processes
new web framework should be used within a moderately sized service
a small API should be extracted from a monolith or get rewritten

Be clear and even formal about how much time it is OK to spend on tech debt: let’s say 10% of the time, the first Monday of a 2 weeks sprint, every fifth week is a tech debt cleanup week. Depending on your context, scheduling might be easier, if you have a target like 10%, and you let your developers decide when to spend it - in some projects and teams, it works better to fix debts alongside the current projects and avoid context switching. Other teams have a lot of responsibilities, and they prefer a fixed timeframe to work on things that are not related to the current project. Also keep in mind, that as 20-30% time is already related to friction, it is not necessarily a big commitment to spend 10% of your time to keep tech debt in check.

Example: my team had a chatbot (magneto) that helped to ask questions from other team members and finding someone who was in charge of support. Having a moderate-size codebase but implemented in a programming language not supported anymore in the company and some irritating bugs, we decided within the team to rewrite it and migrate our users to it. First, the initial version was implemented as a research prototype and when we were sure it could be implemented within a short timeframe, we made the decision and communicated it to our stakeholders.

Strategic-level tech debt

A major technical decision needs to be revised and many teams, possibly the whole company are affected.

GDPR is introduced, all teams need to adhere to new rules
Flash as a technology is getting deprecated
a whole new programming language, and with that, a new ecosystem should be introduced for backend services

It is very probable, that a single individual will not have the scope of influence to push such a change through - do not start working on such a project in silence. Quite probably many people are already aware of the issue. Start to communicate the need to address the problem, conduct a smaller scale research project to lower the risk of change and drive a collaborative design process and ensure to have a plan for the whole project. This does not mean the process is totally democratic: people with the right level of experience and influence should spearhead these efforts.

Example: I led the Python3 migration and microservices dockerization projects at Prezi though I am not going into many details as solutions are highly dependent on the ways of working of any company. Communication, tracking progress, derisking decisions + dealing with ambiguity and soft skills play a big role in such projects.

It is a good idea to bundle tech debt removal into big impact projects: most probably you are already making large changes, refactoring a whole service, maybe even rewrite good chunks of the code or introducing a new piece of technology to solve a new business use-case. Remember: the goal is not to hide extra work into an already big project, but tackle issues that make sense and are simply easier to make as you are already making significant changes anyway. It gives you a chance to rethink design decisions from earlier.