
The Fastest Codebase Produces the Fewest Features

Before the pitchforks come out

I use formulas in this post to describe relationships between things like codebase complexity, defect rates, and iteration cost. These are mental models, not physics. I am not claiming to have discovered a new law of thermodynamics for software. If you’re the type who sees an equation in a blog post and reflexively comments “that’s not how math works,” I want you to know: I see you, I respect your commitment, and I ask that you channel that energy into reading what the variables actually represent before firing up Hacker News. The formulas are communication tools. They make relationships between concepts precise enough to argue about, which is the whole point.

The formula

This is the model we’re going to build and test in this post. We’ll start with the plain English, translate it to math, then run it through four scenarios to see what the formulas predict.

Burden = the accumulation of defects that outpace repairs.

In plain language: every tick of the clock, your team introduces some defects and fixes some defects. Tech debt is what’s left over, the running total of defects you haven’t gotten to yet. In math:

B(T) = \int_0^T \Big[ D(t) - R(t) \Big] \, dt

B is the burden (cumulative tech debt), D(t) is defects introduced per unit time, R(t) is repair capacity (testing, refactoring, review), and T is time. When D outpaces R, debt grows. The interesting part is what governs D.


Defects = velocity × defect rate × coupling × debt feedback.

The faster you ship, the more defects you produce. The higher the defect rate per line, the more defects you produce. The more coupled your code is, the more each defect cascades. And existing debt makes new defects more likely (working in a debt-laden codebase means navigating workarounds and known-broken paths):

D(t) = V(t) \cdot p(t) \cdot \big(1 + \lambda \ln(1 + C(t))\big) \cdot \left(1 + \gamma \frac{B(t)}{B_{\text{ref}}}\right)
  • V(t) — code velocity. How fast you ship: 1x for a human team, 3-10x with AI agents.
  • p(t) — defect rate per unit of code (bounded between 0 and 1).
  • λ — coupling sensitivity. The log dampens this: a 10x increase in coupling doesn’t produce 10x more defects.
  • C(t) — coupling complexity. Dimensionless, normalized against a reference codebase size: C(t) = (S(t)/S_{\text{ref}})^\beta.
  • γ — debt feedback. How much existing burden accelerates new defect introduction. This is what creates the spiral.
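The defect equation is straightforward to evaluate directly. Below is a minimal Python translation; the default parameter values (λ = 0.5, γ = 0.3, B_ref = 100) are illustrative assumptions, not calibrated measurements.

```python
import math

def defects_per_tick(V, p, C, B, lam=0.5, gamma=0.3, B_ref=100.0):
    """D(t) = V * p * (1 + lam * ln(1 + C)) * (1 + gamma * B / B_ref)."""
    return V * p * (1 + lam * math.log(1 + C)) * (1 + gamma * B / B_ref)
```

With no coupling and no debt (C = 0, B = 0), the last two factors collapse to 1 and D is just V × p. Once coupling and burden are nonzero, a 10x velocity bump produces more than 10x the defects, which is the point of the model.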

Defect rate rises with codebase size, but saturates.

The defect rate is not constant. It rises as the codebase grows, because there’s more context to miss. But it can’t exceed 1 (you can’t have more than 100% of commits be defective), so it follows a logistic curve:

p(t) = p_0 + (1 - p_0)\,\frac{\alpha \, S(t)}{1 + \alpha \, S(t)}
  • p_0 — base defect rate. Irreducible: you’ll have off-by-one errors even in a 10-line script.
  • α — context decay coefficient. How fast comprehension degrades with scale. This is the parameter your architecture controls.
  • S(t) — codebase surface area. Files, interfaces, state dimensions.
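In code, the saturation is easy to see. A sketch with placeholder values for p_0 and α:

```python
def defect_rate(S, p0=0.02, alpha=0.001):
    """Logistic defect rate: starts at p0, rises with surface area S,
    and asymptotically approaches (but never reaches) 1."""
    return p0 + (1 - p0) * (alpha * S) / (1 + alpha * S)
```

At S = 0 this returns p_0 exactly. As S grows without bound, the fraction approaches 1 and p approaches 1, so the rate saturates instead of exploding.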

Codebase size = everything you’ve written minus everything you’ve deleted.

S(t) grows with every commit and shrinks only when you actively remove code:

S(t) = S_0 + \int_0^t V(\tau) \, d\tau - \int_0^t \text{deletions}(\tau) \, d\tau

Here’s the feedback loop: velocity grows the codebase, which raises the defect rate, which raises defect output, which means each new line of code is more dangerous than the last. It’s not linear scaling; it compounds. The rest of this post unpacks each piece and visualizes the dynamics.
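The whole loop fits in a short discrete-time simulation. This is a sketch, not the exact simulation behind the charts below: every parameter value here (α, λ, γ, β, the repair fraction) is an invented placeholder chosen only to make the dynamics visible.

```python
import math

def simulate(V=10.0, p0=0.02, alpha=1e-4, lam=0.5, gamma=0.3,
             repair_frac=0.4, S_ref=1000.0, beta=1.2, B_ref=100.0,
             deletions=0.0, ticks=200):
    """Discrete-time version of the model. Returns (S, B) per tick."""
    S, B, hist = 0.0, 0.0, []
    for _ in range(ticks):
        p = p0 + (1 - p0) * (alpha * S) / (1 + alpha * S)   # logistic defect rate
        C = (S / S_ref) ** beta if S > 0 else 0.0           # coupling ratio
        D = V * p * (1 + lam * math.log(1 + C)) * (1 + gamma * B / B_ref)
        R = repair_frac * V                                 # repairs scale with velocity
        B = max(0.0, B + D - R)                             # burden never goes negative
        S = max(0.0, S + V - deletions)                     # surface area accumulates
        hist.append((S, B))
    return hist
```

Running this with a high α (comprehension decays quickly with scale) versus a low α (decoupled architecture) at the same 10x velocity produces the two fates described later: burden exploding versus burden staying near zero.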

What we mean by “tech debt”

The term has a specific origin and a precise meaning that most usage gets wrong. Ward Cunningham coined it in 1992 to describe one thing: the gap between your code and your current understanding of the problem. You ship code that reflects partial knowledge, then refactor as you learn. The “interest” is the friction of working in code that no longer matches what you know. Cunningham was explicit that this is not about sloppy code, hacks, or bugs. Those are just bad code.

The model in this post uses a broader definition. B(t) tracks burden: the accumulated cost of all defects, design friction, and knowledge gaps that haven’t been resolved. This includes Cunningham’s original debt (code that reflects outdated understanding), but also defects that exist because the author missed context, and coupling that nobody chose deliberately. The codebase doesn’t care which category a problem falls into. The compounding dynamics work the same on all of them.

When you write a line of code, you’re betting that your mental model of the system is accurate enough to produce correct logic. When you’re wrong in a way that surfaces immediately, it’s a bug. When you’re wrong in a way that won’t surface for months, it becomes burden. A senior engineer and a junior engineer both produce it. So does GPT-4. So does Claude. The question isn’t whether it accumulates, it’s how fast, and what governs the rate.

Why p(t) rises: the context bottleneck

Neither humans nor LLMs can hold an entire system in mind simultaneously.

A human developer has roughly 7 ± 2 items in working memory (Miller’s law). A 500-file codebase has thousands of potential interactions between modules. You will miss relevant context. It’s not a question of effort; it’s a cognitive ceiling.

An LLM agent has a context window. Even at 200k tokens, a large codebase exceeds it. The agent chooses which files to load, and every file it leaves out is a potential missed interaction. Modern agents partially compensate by using tools (grep, LSP, file search) to retrieve context on demand, which makes the bound softer than raw working memory. But retrieval only helps when the agent knows what to search for. Unknown unknowns don’t get retrieved. A function that depends on a global config three modules away, a singleton that needs initialization, a side effect buried in a “pure” helper: the agent has no reason to search for something it doesn’t know exists.

Defect probability rises as the share of relevant context you considered falls.

Context gaps aren’t the only source of defects. Algorithmic mistakes, spec misunderstandings, and genuinely hard problems (concurrency, distributed state) produce defects even with full context. But the context gap is the dominant scaling factor: the one that gets worse as codebases grow. The model captures that factor:

p(t) \propto 1 - \frac{\text{context\_considered}}{\text{context\_relevant}(t)}

As the codebase grows, context_relevant grows. But context_considered is bounded (by neurons or by tokens). The ratio shrinks. p rises.
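The proportionality is almost trivial to state in code, which is part of its appeal as a mental model. A hypothetical helper:

```python
def missed_context_rate(considered: int, relevant: int) -> float:
    """Fraction of relevant context that was never considered.
    The model treats defect probability as proportional to this."""
    return max(0.0, 1.0 - considered / relevant)
```

Hold considered fixed (a context window, a working memory) while relevant grows with the codebase, and the rate climbs toward 1.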

A human team at 1x velocity. The grid fills slowly. Most squares stay bright because the codebase grows within the spotlight’s reach. Defects are rare.

[Interactive grid visualization. Legend: in context, out of context, defect (missed context).]

The same team at 10x velocity with tightly coupled architecture (high α). The grid floods with code. Amber squares appear everywhere because the agent can’t hold enough of the system in context. This is where most AI teams end up.


10x velocity again, but with decoupled architecture (low α). Each function depends on less of the overall system, so the spotlight covers more of what matters. The grid grows just as fast, but the amber fraction stays small.

Each square is a code unit. Click through the tabs to see how the same model behaves under different conditions.

The difference between the second and third tab is α. Same velocity, same team, same model. The only variable is how much of the codebase is relevant to any given change. Architecture controls that.

The velocity trap

Joe Magerramov observed that his AI-augmented team ships at 10x velocity with 80% AI-generated code. A bug that used to appear once a year now appears weekly. But that calculation assumes constant p: the same defect rate per line regardless of codebase size.

With rising p(t), it’s worse. Defects = (10x speed) × (rising error rate) × (rising coupling) × (rising debt feedback). Every term on the right is growing:

D(t) = \underbrace{V(t)}_{\text{10x}} \cdot \underbrace{p(t)}_{\text{rising}} \cdot \underbrace{\big(1 + \lambda \ln(1{+}C)\big)}_{\text{rising}} \cdot \underbrace{\big(1 + \gamma B/B_{\text{ref}}\big)}_{\text{spiraling}}

You’re not just getting 10x bugs. You’re getting 10x bugs where each bug is progressively more likely.

The trap: early speed feels free. The codebase is small, p \approx p_0, everything works beautifully. By the time p has risen noticeably, you’ve already accumulated massive surface area. The debt is baked in.

The AI team's total output (amber + red) stays constant at 10x. But watch where it goes: the red wedge is time consumed by debt. The dashed gray line is the human team's feature rate. By tick 100, the human team is shipping more usable features despite working 10x slower.

The total height of the stacked area is constant because velocity doesn’t change. What changes is the split: early on, almost everything is features (amber). By the end, the red wedge has eaten most of the output. The dashed gray line is a human team at 1x velocity, quietly shipping at a steady rate that eventually wins.
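The widening red wedge falls out of the model with one extra assumption about how burden taxes output. The sketch below assumes the usable feature rate is V / (1 + B/B_ref), i.e. debt servicing consumes a share of velocity that grows with burden; that split rule and all parameter values are illustrative, not derived.

```python
import math

def feature_rates(V, ticks=150, p0=0.02, alpha=1e-3, lam=0.5, gamma=0.3,
                  repair_frac=0.2, S_ref=1000.0, beta=1.2, B_ref=100.0):
    """Per-tick usable feature output under the debt-servicing assumption."""
    S, B, rates = 0.0, 0.0, []
    for _ in range(ticks):
        p = p0 + (1 - p0) * (alpha * S) / (1 + alpha * S)
        C = (S / S_ref) ** beta if S > 0 else 0.0
        D = V * p * (1 + lam * math.log(1 + C)) * (1 + gamma * B / B_ref)
        B = max(0.0, B + D - repair_frac * V)
        rates.append(V / (1 + B / B_ref))   # burden taxes the feature rate
        S += V
    return rates
```

At V = 10 the rate starts at a full 10x and decays as burden accumulates; at V = 1 the codebase grows slowly enough that repairs keep up and the rate holds steady.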

The agentic paradox

An agent deciding which files to load into context is making the same bet a human developer makes when deciding what to think about. Bigger codebase = more files left out of the window = higher p. The bottleneck shifts from production to comprehension.

Multi-agent systems multiply this. Agent A modifies module X without knowing Agent B just changed module Y’s interface. Each agent operates with its own partial view: a fragmented context_considered across a shared, growing context_relevant.

What keeps p flat

You can’t meaningfully scale context_considered. Human brains are fixed. Context windows grow, but slowly compared to how fast AI agents grow codebases. The winning move is shrinking context_relevant: reducing the amount of context that matters for any given change.

This is what functional programming and the SUPER principles accomplish. Each principle targets a specific parameter in the model:

Side Effects at the Edge reduces α. When core logic is pure, the only context relevant to a function is its inputs. No global state to trace, no execution-order dependencies, no hidden I/O.

Uncoupled Logic reduces λ. When components are composed through explicit dependency injection instead of shared mutable state, coupling complexity C(t) grows linearly instead of combinatorially.

Pure & Total Functions reduce p_0. Deterministic functions that handle all inputs don’t crash on edge cases. The irreducible defect floor drops.

Explicit Data Flow reduces α. When data moves through visible pipelines, context_relevant equals what’s piped in. Nothing hidden, nothing implicit.

Replaceable by Value reduces λ. Referentially transparent expressions can be evaluated in isolation. No need to understand callers, call sites, or temporal state.

Strong service boundaries achieve a related effect: each service has its own S(t), so a developer only needs to hold one service’s context plus its API surface. Type systems reduce p_0 by making invalid states unrepresentable at compile time. And the most underrated tool, aggressive deletion, directly shrinks S(t).

Four trajectories

The phase portrait below plots cumulative debt (x-axis) against feature output rate (y-axis) over 200 ticks. Each trajectory starts at the green dot (top-left) and traces where the system ends up. Systems that stay in the green zone are sustainable. Systems that spiral right and down have collapsed.

10x velocity, minimal testing, poor architecture (high α, high λ). The trajectory dives toward the bottom-right: debt explodes while feature output collapses to near zero. This is the default outcome for teams that bolt AI onto an existing codebase without changing the architecture.

Same 10x velocity, but with 40% of effort redirected to testing. The trajectory bends less aggressively, but still drifts right. Testing raises R(t), but p(t) keeps rising because the architecture hasn’t changed. You can’t test your way out of a coupled codebase.

10x velocity with decoupled architecture (low α, low λ), moderate testing, and active deletion. The trajectory stays in the green zone. Feature output remains high because p(t) stays flat and debt never accumulates faster than repairs can handle.

Human velocity, default architecture. The trajectory barely moves from the starting point because code accumulates slowly, giving the team time to absorb context naturally. Sustainable by default, but at the cost of shipping 10x less.

Each trajectory traces 200 ticks through debt-vs-feature-rate space. The highlighted path is the active tab. All four start from the same point.

The third tab is the winning configuration. It ships at 10x velocity with the same trajectory shape as the human team. The difference: architecture investment (low α), decoupling (low λ), and enough testing plus refactoring to keep R(t) ≥ D(t).

The equilibrium condition

A sustainable codebase is one where repairs keep pace with defects, continuously, not just on average.

D(t) \leq R(t) \quad \text{sustained over time}

At high velocity, this requires three things simultaneously:

  1. Architecture that keeps p(t) flat. Functional principles, SUPER, decoupled modules. Without this, no amount of testing saves you at scale.
  2. Automated testing that scales with output. R(t) must grow proportionally with V(t). Manual testing can’t keep up with AI velocity.
  3. CI/CD fast enough to not bottleneck. A pipeline built for 10 commits/day buckles at 100. The infrastructure has to match the velocity.

When defects persistently outpace repairs, feature output approaches zero. The team spends all its time on bugs:

\lim_{t \to \infty} \text{features}(t) = 0 \quad \text{when} \quad \int_0^t [D(\tau) - R(\tau)] \, d\tau \to \infty

The build-maintain cadence

The equilibrium condition D(t) ≤ R(t) doesn’t mean you maintain at a constant rate. In practice, the teams I’ve seen succeed with AI agents alternate between two modes: build sprints and maintenance releases.

During a build sprint, velocity is high and debt accumulates. That’s fine; the math allows it as long as you don’t stay there. The debt curve bends upward, p(t) starts climbing, and you can feel it: PRs get harder to review, tests take longer to pass, agents start producing changes that break things two modules away.

That’s the signal to switch modes. A maintenance release is a focused sprint where the only goal is driving B(t) back down: delete dead code, break apart coupled modules, increase test coverage in the areas where p spiked. AI agents are as useful here as they are during build sprints, possibly more so, because refactoring is pattern-heavy work that agents handle well.

At 1x velocity, you might do a cleanup pass quarterly. At 10x velocity, you need one every week or two. Teams that try to sprint continuously without maintenance releases hit collapse, not because any single sprint was too aggressive, but because they never paid down the debt before the next sprint compounded it.


Automated guardrails: Biome and cyclomatic complexity

The build-maintain cadence only works if you can objectively measure when it’s time to switch modes. Gut feel doesn’t scale, especially when agents are producing code faster than you can read it.

Cyclomatic complexity is the single most useful proxy for α that you can measure automatically. It counts the number of independent paths through a function: every if, for, while, &&, ||, and ternary adds a path. A function with cyclomatic complexity of 15 has 15 possible execution paths. An agent modifying that function needs to hold all 15 paths in context to avoid introducing a defect.

This is context_relevant made concrete. A function with complexity 3 needs 3 paths in your head. A function with complexity 30 needs 30. The defect probability per change rises directly with this number. When your codebase’s average complexity starts climbing, α is climbing with it. That’s your signal to schedule a maintenance release.
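You can compute a rough version of this metric in a few lines. The sketch below uses Python’s ast module and counts classic cyclomatic decision points; note this is not Biome’s metric (its rule measures cognitive complexity, which additionally penalizes nesting), just an illustration of what the number tracks.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    count = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BoolOp):
            count += len(node.values) - 1       # a and b and c: two extra paths
        elif isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp,
                               ast.ExceptHandler, ast.comprehension)):
            count += 1                          # each branch point adds a path
    return count
```

A straight-line function scores 1; every if, loop, boolean operator, ternary, and except clause adds a path that a reviewer or agent has to hold in context.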

Biome is the tool I use to enforce this. It’s a single binary that handles linting, formatting, and complexity checks at native speed, fast enough to run on every save, every commit, every CI check, even at 10x velocity. The rules that matter most for the debt model:

  • complexity/noExcessiveCognitiveComplexity flags functions where context_relevant has grown too large. Set a threshold (I use 15) and treat violations as debt that must be resolved in the next maintenance cycle.
  • complexity/noForEach nudges toward functional patterns (map, filter, reduce) that are pure and composable, directly reducing λ.
  • suspicious/noExplicitAny catches type erasure that expands context_relevant by making the type system unable to constrain what values a variable can hold.
  • correctness/noUnusedVariables flags dead code: phantom S(t) that inflates surface area without contributing features.

The combination is powerful: Biome gives you a real-time readout of the parameters in the debt model. When complexity scores drift upward, you know α is rising. When unused code accumulates, you know S(t) is bloating. You don’t need to wait until the codebase feels slow. The numbers tell you before you feel it.

For AI-augmented teams specifically, Biome serves a second purpose: it constrains agent output. An agent writing a 40-line function with cyclomatic complexity 25 will have that flagged before it merges. The agent rewrites it as three smaller functions with complexity 5 each. Those three functions are individually testable, individually comprehensible, and individually replaceable, exactly the properties that keep p(t) flat.

The math doesn’t care who wrote the code

Human or AI, the defect rate is governed by three things: how much code exists (S), how much of it you can hold in mind (context_considered), and how interconnected it is (C and λ).

AI doesn’t create a new kind of problem. It accelerates an existing one past the point where human intuition calibrates. A bug every six months feels manageable. A bug every week feels like chaos. The base defect rate and context decay are the same; only V changed.

The teams that thrive with AI coding agents aren’t the ones generating the most code. They’re the ones keeping context_relevant(t) small enough that p stays low: pure functions, explicit data flow, aggressive deletion, interfaces so clear you don’t need to read the implementation.

Every new line of code that increases S(t) without proportionally increasing feature output is a liability. The winning strategy is writing code that shrinks the surface area of everything around it: pure functions that replace imperative tangles, clear interfaces that make implementation details irrelevant, and deleting everything that doesn’t earn its keep.

Where “tech debt” comes from

Ward Cunningham invented the metaphor in 1992 while building a portfolio management system at Wyatt Software. He needed to explain to his boss why the team should refactor working code. The financial analogy worked: shipping code with partial understanding is like taking on a loan. It accelerates delivery. But if you never refactor to incorporate what you’ve learned, you pay interest on the gap between the code and your understanding, forever.

His exact words, from a 2009 video clarifying the metaphor: “I’m never in favor of writing code poorly, but I am in favor of writing code to reflect your current understanding of a problem even if that understanding is partial.”

This is narrower than most people think. Cunningham’s debt is specifically about incomplete knowledge, not incomplete effort. A team that ships a naive O(n²) algorithm because they haven’t yet learned the data will grow is taking on debt. A team that ships a naive O(n²) algorithm because they’re in a rush is just writing bad code.

Martin Fowler expanded the idea into four quadrants crossing two axes: deliberate vs. inadvertent, and reckless vs. prudent. The most interesting quadrant is prudent-inadvertent: a skilled team builds the best design they can, works on it for a year, and realizes a better approach existed all along. Fowler calls this “not just common but inevitable for teams that are excellent designers.” That kind of debt has nothing to do with laziness or time pressure. It’s the cost of learning.

Steve McConnell drew a simpler line in 2007: intentional (strategic) vs. unintentional (the non-strategic result of doing a poor job). The intentional kind is a tool. The unintentional kind is a symptom.

The industry has since stretched “tech debt” to cover everything from TODO comments to entire legacy systems. That’s fine as shorthand, but it muddies the math. A TODO comment and a race condition both count as “debt” in casual conversation, but they have wildly different costs and different compounding dynamics.

This post uses burden (B(t)) as the umbrella term: the accumulated weight of everything that makes the codebase harder to change safely. That includes Cunningham’s original debt (knowledge gaps), Fowler’s quadrants (all four of them), McConnell’s unintentional sludge, and the bugs that slipped through because nobody held enough context. The formulas don’t care which category a problem came from. The codebase doesn’t either.

What to do about it

Decouple everything you can.

Low α and low λ are the only parameter settings that keep p(t) flat as the codebase grows. Pure functions, explicit data flow, dependency injection over shared state. If a change to module A can break module B, your coupling is too high.

Delete more than you write.

Every line you remove shrinks S(t). Dead code, unused exports, speculative abstractions: they all inflate the context an agent or human needs to hold. Deletion is the only operation that makes future changes cheaper.

Alternate build sprints with maintenance releases.

At 10x velocity, schedule a maintenance cycle every one to two weeks. Use it to break apart coupled modules, delete dead code, and drive B(t) back down. Continuous sprinting without consolidation is how teams collapse.

Measure complexity automatically.

Run Biome (or equivalent) on every commit. Set a cyclomatic complexity ceiling of 15. When the average starts climbing, that is your signal to switch from build mode to maintenance mode. Do not wait until you feel it.

Test at the rate you ship.

R(t) must grow with V(t). If your team ships 10x more code, your test suite and CI pipeline need to handle 10x more output. Manual testing cannot keep pace with AI velocity.
