
The Fastest Codebase Produces the Fewest Features

Before the pitchforks come out

I use formulas in this post to describe relationships between things like codebase complexity, defect rates, and iteration cost. These are mental models, not physics. I am not claiming to have discovered a new law of thermodynamics for software. If you’re the type who sees an equation in a blog post and reflexively comments “that’s not how math works,” I want you to know: I see you, I respect your commitment, and I ask that you channel that energy into reading what the variables actually represent before firing up Hacker News. The formulas are communication tools. They make relationships between concepts precise enough to argue about, which is the whole point.

The formula

This is the model we’re going to build and test in this post. We’ll start with the plain English, translate it to math, then run it through four scenarios to see what the formulas predict.

Burden = the accumulation of defects that outpace repairs.

In plain language: every tick of the clock, your team introduces some defects and fixes some defects. Tech debt is what’s left over, the running total of defects you haven’t gotten to yet. In math:

B(T) = \int_0^T \Big[ D(t) - R(t) \Big] \, dt

B is the burden (cumulative tech debt), D(t) is defects introduced per unit time, R(t) is repair capacity (testing, refactoring, review), and T is time. When D outpaces R, debt grows. The interesting part is what governs D.


Defects = velocity × defect rate × coupling × debt feedback.

The faster you ship, the more defects you produce. The higher the defect rate per line, the more defects you produce. The more coupled your code is, the more each defect cascades. And existing debt makes new defects more likely (working in a debt-laden codebase means navigating workarounds and known-broken paths):

D(t) = V(t) \cdot p(t) \cdot \big(1 + \lambda \ln(1 + C(t))\big) \cdot \left(1 + \gamma \frac{B(t)}{B_{\text{ref}}}\right)
  • V(t) — code velocity. How fast you ship: 1x for a human team, 3-10x with AI agents.
  • p(t) — defect rate per unit of code (bounded between 0 and 1).
  • λ — coupling sensitivity. The log dampens this: a 10x increase in coupling doesn’t produce 10x more defects.
  • C(t) — coupling complexity. Dimensionless, normalized against a reference codebase size: C(t) = (S(t)/S_{\text{ref}})^\beta.
  • γ — debt feedback. How much existing burden accelerates new defect introduction. This is what creates the spiral.
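The defect equation is straightforward to evaluate directly. Below is a minimal Python translation; the default parameter values (λ = 0.5, γ = 0.3, B_ref = 100) are illustrative assumptions, not calibrated measurements.

```python
import math

def defects_per_tick(V, p, C, B, lam=0.5, gamma=0.3, B_ref=100.0):
    """D(t) = V * p * (1 + lam * ln(1 + C)) * (1 + gamma * B / B_ref)."""
    return V * p * (1 + lam * math.log(1 + C)) * (1 + gamma * B / B_ref)
```

With no coupling and no debt (C = 0, B = 0), the last two factors collapse to 1 and D is just V × p. Once coupling and burden are nonzero, a 10x velocity bump produces more than 10x the defects, which is the point of the model.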

Defect rate rises with codebase size, but saturates.

The defect rate is not constant. It rises as the codebase grows, because there’s more context to miss. But it can’t exceed 1 (you can’t have more than 100% of commits be defective), so it follows a logistic curve:

p(t) = p_0 + (1 - p_0)\,\frac{\alpha \, S(t)}{1 + \alpha \, S(t)}
  • p_0 — base defect rate. Irreducible: you’ll have off-by-one errors even in a 10-line script.
  • α — context decay coefficient. How fast comprehension degrades with scale. This is the parameter your architecture controls.
  • S(t) — codebase surface area. Files, interfaces, state dimensions.
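In code, the saturation is easy to see. A sketch with placeholder values for p_0 and α:

```python
def defect_rate(S, p0=0.02, alpha=0.001):
    """Logistic defect rate: starts at p0, rises with surface area S,
    and asymptotically approaches (but never reaches) 1."""
    return p0 + (1 - p0) * (alpha * S) / (1 + alpha * S)
```

At S = 0 this returns p_0 exactly. As S grows without bound, the fraction approaches 1 and p approaches 1, so the rate saturates instead of exploding.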

Codebase size = everything you’ve written minus everything you’ve deleted.

S(t) grows with every commit and shrinks only when you actively remove code:

S(t) = S_0 + \int_0^t V(\tau) \, d\tau - \int_0^t \text{deletions}(\tau) \, d\tau

Here’s the feedback loop: velocity grows the codebase, which raises the defect rate, which raises defect output, which means each new line of code is more dangerous than the last. It’s not linear scaling; it compounds. The rest of this post unpacks each piece and visualizes the dynamics.
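The whole loop fits in a short discrete-time simulation. This is a sketch, not the exact simulation behind the charts below: every parameter value here (α, λ, γ, β, the repair fraction) is an invented placeholder chosen only to make the dynamics visible.

```python
import math

def simulate(V=10.0, p0=0.02, alpha=1e-4, lam=0.5, gamma=0.3,
             repair_frac=0.4, S_ref=1000.0, beta=1.2, B_ref=100.0,
             deletions=0.0, ticks=200):
    """Discrete-time version of the model. Returns (S, B) per tick."""
    S, B, hist = 0.0, 0.0, []
    for _ in range(ticks):
        p = p0 + (1 - p0) * (alpha * S) / (1 + alpha * S)   # logistic defect rate
        C = (S / S_ref) ** beta if S > 0 else 0.0           # coupling ratio
        D = V * p * (1 + lam * math.log(1 + C)) * (1 + gamma * B / B_ref)
        R = repair_frac * V                                 # repairs scale with velocity
        B = max(0.0, B + D - R)                             # burden never goes negative
        S = max(0.0, S + V - deletions)                     # surface area accumulates
        hist.append((S, B))
    return hist
```

Running this with a high α (comprehension decays quickly with scale) versus a low α (decoupled architecture) at the same 10x velocity produces the two fates described later: burden exploding versus burden staying near zero.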

What we mean by “tech debt”

The term has a specific origin and a precise meaning that most usage gets wrong. Ward Cunningham coined it in 1992 to describe one thing: the gap between your code and your current understanding of the problem. You ship code that reflects partial knowledge, then refactor as you learn. The “interest” is the friction of working in code that no longer matches what you know. Cunningham was explicit that this is not about sloppy code, hacks, or bugs. Those are just bad code.

The model in this post uses a broader definition. B(t) tracks burden: the accumulated cost of all defects, design friction, and knowledge gaps that haven’t been resolved. This includes Cunningham’s original debt (code that reflects outdated understanding), but also defects that exist because the author missed context, and coupling that nobody chose deliberately. The codebase doesn’t care which category a problem falls into. The compounding dynamics work the same on all of them.

When you write a line of code, you’re betting that your mental model of the system is accurate enough to produce correct logic. When you’re wrong in a way that surfaces immediately, it’s a bug. When you’re wrong in a way that won’t surface for months, it becomes burden. A senior engineer and a junior engineer both produce it. So does GPT-4. So does Claude. The question isn’t whether it accumulates, it’s how fast, and what governs the rate.

Why p(t) rises: the context bottleneck

Neither humans nor LLMs can hold an entire system in mind simultaneously.

A human developer has roughly 7 ± 2 items in working memory (Miller’s law). A 500-file codebase has thousands of potential interactions between modules. You will miss relevant context. It’s not a question of effort; it’s a cognitive ceiling.

An LLM agent has a context window. Even at 200k tokens, a large codebase exceeds it. The agent chooses which files to load, and every file it leaves out is a potential missed interaction. Modern agents partially compensate by using tools (grep, LSP, file search) to retrieve context on demand, which makes the bound softer than raw working memory. But retrieval only helps when the agent knows what to search for. Unknown unknowns don’t get retrieved. A function that depends on a global config three modules away, a singleton that needs initialization, a side effect buried in a “pure” helper: the agent has no reason to search for something it doesn’t know exists.

Defect probability rises as the share of relevant context you considered falls.

Context gaps aren’t the only source of defects. Algorithmic mistakes, spec misunderstandings, and genuinely hard problems (concurrency, distributed state) produce defects even with full context. But the context gap is the dominant scaling factor: the one that gets worse as codebases grow. The model captures that factor:

p(t) \propto 1 - \frac{\text{context\_considered}}{\text{context\_relevant}(t)}

As the codebase grows, context_relevant grows. But context_considered is bounded (by neurons or by tokens). The ratio shrinks. p rises.
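The proportionality is almost trivial to state in code, which is part of its appeal as a mental model. A hypothetical helper:

```python
def missed_context_rate(considered: int, relevant: int) -> float:
    """Fraction of relevant context that was never considered.
    The model treats defect probability as proportional to this."""
    return max(0.0, 1.0 - considered / relevant)
```

Hold considered fixed (a context window, a working memory) while relevant grows with the codebase, and the rate climbs toward 1.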

A human team at 1x velocity. The grid fills slowly. Most squares stay bright because the codebase grows within the spotlight’s reach. Defects are rare.

[Interactive grid visualization. Legend: in context, out of context, defect (missed context).]

The same team at 10x velocity with tightly coupled architecture (high α). The grid floods with code. Amber squares appear everywhere because the agent can’t hold enough of the system in context. This is where most AI teams end up.


10x velocity again, but with decoupled architecture (low α). Each function depends on less of the overall system, so the spotlight covers more of what matters. The grid grows just as fast, but the amber fraction stays small.

Each square is a code unit. Click through the tabs to see how the same model behaves under different conditions.

The difference between the second and third tab is α. Same velocity, same team, same model. The only variable is how much of the codebase is relevant to any given change. Architecture controls that.

The velocity trap

Joe Magerramov observed that his AI-augmented team ships at 10x velocity with 80% AI-generated code. A bug that used to appear once a year now appears weekly. But that calculation assumes constant p: the same defect rate per line regardless of codebase size.

With rising p(t), it’s worse. Defects = (10x speed) × (rising error rate) × (rising coupling) × (rising debt feedback). Every term on the right is growing:

D(t) = \underbrace{V(t)}_{\text{10x}} \cdot \underbrace{p(t)}_{\text{rising}} \cdot \underbrace{\big(1 + \lambda \ln(1{+}C)\big)}_{\text{rising}} \cdot \underbrace{\big(1 + \gamma B/B_{\text{ref}}\big)}_{\text{spiraling}}

You’re not just getting 10x bugs. You’re getting 10x bugs where each bug is progressively more likely.

The trap: early speed feels free. The codebase is small, p \approx p_0, everything works beautifully. By the time p has risen noticeably, you’ve already accumulated massive surface area. The debt is baked in.

The AI team's total output (amber + red) stays constant at 10x. But watch where it goes: the red wedge is time consumed by debt. The dashed gray line is the human team's feature rate. By tick 100, the human team is shipping more usable features despite working 10x slower.

The total height of the stacked area is constant because velocity doesn’t change. What changes is the split: early on, almost everything is features (amber). By the end, the red wedge has eaten most of the output. The dashed gray line is a human team at 1x velocity, quietly shipping at a steady rate that eventually wins.
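The widening red wedge falls out of the model with one extra assumption about how burden taxes output. The sketch below assumes the usable feature rate is V / (1 + B/B_ref), i.e. debt servicing consumes a share of velocity that grows with burden; that split rule and all parameter values are illustrative, not derived.

```python
import math

def feature_rates(V, ticks=150, p0=0.02, alpha=1e-3, lam=0.5, gamma=0.3,
                  repair_frac=0.2, S_ref=1000.0, beta=1.2, B_ref=100.0):
    """Per-tick usable feature output under the debt-servicing assumption."""
    S, B, rates = 0.0, 0.0, []
    for _ in range(ticks):
        p = p0 + (1 - p0) * (alpha * S) / (1 + alpha * S)
        C = (S / S_ref) ** beta if S > 0 else 0.0
        D = V * p * (1 + lam * math.log(1 + C)) * (1 + gamma * B / B_ref)
        B = max(0.0, B + D - repair_frac * V)
        rates.append(V / (1 + B / B_ref))   # burden taxes the feature rate
        S += V
    return rates
```

At V = 10 the rate starts at a full 10x and decays as burden accumulates; at V = 1 the codebase grows slowly enough that repairs keep up and the rate holds steady.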

The agentic paradox

An agent deciding which files to load into context is making the same bet a human developer makes when deciding what to think about. Bigger codebase = more files left out of the window = higher p. The bottleneck shifts from production to comprehension.

Multi-agent systems multiply this. Agent A modifies module X without knowing Agent B just changed module Y’s interface. Each agent operates with its own partial view: a fragmented context_considered across a shared, growing context_relevant.

What keeps p flat

You can’t meaningfully scale context_considered. Human brains are fixed. Context windows grow, but slowly compared to how fast AI agents grow codebases. The winning move is shrinking context_relevant: reducing the amount of context that matters for any given change.

This is what functional programming and the SUPER principles accomplish. Each principle targets a specific parameter in the model:

Side Effects at the Edge reduces α. When core logic is pure, the only context relevant to a function is its inputs. No global state to trace, no execution-order dependencies, no hidden I/O.

Uncoupled Logic reduces λ. When components are composed through explicit dependency injection instead of shared mutable state, coupling complexity C(t) grows linearly instead of combinatorially.

Pure & Total Functions reduce p_0. Deterministic functions that handle all inputs don’t crash on edge cases. The irreducible defect floor drops.

Explicit Data Flow reduces α. When data moves through visible pipelines, context_relevant equals what’s piped in. Nothing hidden, nothing implicit.

Replaceable by Value reduces λ. Referentially transparent expressions can be evaluated in isolation. No need to understand callers, call sites, or temporal state.

Strong service boundaries achieve a related effect: each service has its own S(t), so a developer only needs to hold one service’s context plus its API surface. Type systems reduce p_0 by making invalid states unrepresentable at compile time. And the most underrated tool, aggressive deletion, directly shrinks S(t).

Four trajectories

The phase portrait below plots cumulative debt (x-axis) against feature output rate (y-axis) over 200 ticks. Each trajectory starts at the green dot (top-left) and traces where the system ends up. Systems that stay in the green zone are sustainable. Systems that spiral right and down have collapsed.

10x velocity, minimal testing, poor architecture (high α, high λ). The trajectory dives toward the bottom-right: debt explodes while feature output collapses to near zero. This is the default outcome for teams that bolt AI onto an existing codebase without changing the architecture.

Same 10x velocity, but with 40% of effort redirected to testing. The trajectory bends less aggressively, but still drifts right. Testing raises R(t), but p(t) keeps rising because the architecture hasn’t changed. You can’t test your way out of a coupled codebase.

10x velocity with decoupled architecture (low α, low λ), moderate testing, and active deletion. The trajectory stays in the green zone. Feature output remains high because p(t) stays flat and debt never accumulates faster than repairs can handle.

Human velocity, default architecture. The trajectory barely moves from the starting point because code accumulates slowly, giving the team time to absorb context naturally. Sustainable by default, but at the cost of shipping 10x less.

Each trajectory traces 200 ticks through debt-vs-feature-rate space. The highlighted path is the active tab. All four start from the same point.

The third tab is the winning configuration. It ships at 10x velocity with the same trajectory shape as the human team. The difference: architecture investment (low α), decoupling (low λ), and enough testing plus refactoring to keep R(t) ≥ D(t).

The equilibrium condition

A sustainable codebase is one where repairs keep pace with defects, continuously, not just on average.

D(t) \leq R(t) \quad \text{sustained over time}

At high velocity, this requires three things simultaneously:

  1. Architecture that keeps p(t) flat. Functional principles, SUPER, decoupled modules. Without this, no amount of testing saves you at scale.
  2. Automated testing that scales with output. R(t) must grow proportionally with V(t). Manual testing can’t keep up with AI velocity.
  3. CI/CD fast enough to not bottleneck. A pipeline built for 10 commits/day buckles at 100. The infrastructure has to match the velocity.

When defects persistently outpace repairs, feature output approaches zero. The team spends all its time on bugs:

\lim_{t \to \infty} \text{features}(t) = 0 \quad \text{when} \quad \int_0^t [D(\tau) - R(\tau)] \, d\tau \to \infty

The build-maintain cadence

The equilibrium condition D(t) ≤ R(t) doesn’t mean you maintain at a constant rate. In practice, the teams I’ve seen succeed with AI agents alternate between two modes: build sprints and maintenance releases.

During a build sprint, velocity is high and debt accumulates. That’s fine; the math allows it as long as you don’t stay there. The debt curve bends upward, p(t) starts climbing, and you can feel it: PRs get harder to review, tests take longer to pass, agents start producing changes that break things two modules away.

That’s the signal to switch modes. A maintenance release is a focused sprint where the only goal is driving B(t) back down: delete dead code, break apart coupled modules, increase test coverage in the areas where p spiked. AI agents are as useful here as they are during build sprints, possibly more so, because refactoring is pattern-heavy work that agents handle well.

At 1x velocity, you might do a cleanup pass quarterly. At 10x velocity, you need one every week or two. Teams that try to sprint continuously without maintenance releases hit collapse, not because any single sprint was too aggressive, but because they never paid down the debt before the next sprint compounded it.


Automated guardrails: Biome and cyclomatic complexity

The build-maintain cadence only works if you can objectively measure when it’s time to switch modes. Gut feel doesn’t scale, especially when agents are producing code faster than you can read it.

Cyclomatic complexity is the single most useful proxy for α that you can measure automatically. It counts the number of independent paths through a function: every if, for, while, &&, ||, and ternary adds a path. A function with cyclomatic complexity of 15 has 15 possible execution paths. An agent modifying that function needs to hold all 15 paths in context to avoid introducing a defect.

This is context_relevant made concrete. A function with complexity 3 needs 3 paths in your head. A function with complexity 30 needs 30. The defect probability per change rises directly with this number. When your codebase’s average complexity starts climbing, α is climbing with it. That’s your signal to schedule a maintenance release.
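You can compute a rough version of this metric in a few lines. The sketch below uses Python’s ast module and counts classic cyclomatic decision points; note this is not Biome’s metric (its rule measures cognitive complexity, which additionally penalizes nesting), just an illustration of what the number tracks.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    count = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BoolOp):
            count += len(node.values) - 1       # a and b and c: two extra paths
        elif isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp,
                               ast.ExceptHandler, ast.comprehension)):
            count += 1                          # each branch point adds a path
    return count
```

A straight-line function scores 1; every if, loop, boolean operator, ternary, and except clause adds a path that a reviewer or agent has to hold in context.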

Biome is the tool I use to enforce this. It’s a single binary that handles linting, formatting, and complexity checks at native speed, fast enough to run on every save, every commit, every CI check, even at 10x velocity. The rules that matter most for the debt model:

  • complexity/noExcessiveCognitiveComplexity flags functions where context_relevant has grown too large. Set a threshold (I use 15) and treat violations as debt that must be resolved in the next maintenance cycle.
  • complexity/noForEach nudges toward functional patterns (map, filter, reduce) that are pure and composable, directly reducing λ.
  • suspicious/noExplicitAny catches type erasure that expands context_relevant by making the type system unable to constrain what values a variable can hold.
  • correctness/noUnusedVariables flags dead code: phantom S(t) that inflates surface area without contributing features.

The combination is powerful: Biome gives you a real-time readout of the parameters in the debt model. When complexity scores drift upward, you know α is rising. When unused code accumulates, you know S(t) is bloating. You don’t need to wait until the codebase feels slow. The numbers tell you before you feel it.

For AI-augmented teams specifically, Biome serves a second purpose: it constrains agent output. An agent writing a 40-line function with cyclomatic complexity 25 will have that flagged before it merges. The agent rewrites it as three smaller functions with complexity 5 each. Those three functions are individually testable, individually comprehensible, and individually replaceable, exactly the properties that keep p(t) flat.

The math doesn’t care who wrote the code

Human or AI, the defect rate is governed by three things: how much code exists (S), how much of it you can hold in mind (context_considered), and how interconnected it is (C and λ).

AI doesn’t create a new kind of problem. It accelerates an existing one past the point where human intuition calibrates. A bug every six months feels manageable. A bug every week feels like chaos. The base defect rate and context decay are the same; only V changed.

The teams that thrive with AI coding agents aren’t the ones generating the most code. They’re the ones keeping context_relevant(t) small enough that p stays low: pure functions, explicit data flow, aggressive deletion, interfaces so clear you don’t need to read the implementation.

Every new line of code that increases S(t) without proportionally increasing feature output is a liability. The winning strategy is writing code that shrinks the surface area of everything around it: pure functions that replace imperative tangles, clear interfaces that make implementation details irrelevant, and deleting everything that doesn’t earn its keep.

Where “tech debt” comes from

Ward Cunningham invented the metaphor in 1992 while building a portfolio management system at Wyatt Software. He needed to explain to his boss why the team should refactor working code. The financial analogy worked: shipping code with partial understanding is like taking on a loan. It accelerates delivery. But if you never refactor to incorporate what you’ve learned, you pay interest on the gap between the code and your understanding, forever.

His exact words, from a 2009 video clarifying the metaphor: “I’m never in favor of writing code poorly, but I am in favor of writing code to reflect your current understanding of a problem even if that understanding is partial.”

This is narrower than most people think. Cunningham’s debt is specifically about incomplete knowledge, not incomplete effort. A team that ships a naive O(n²) algorithm because they haven’t yet learned the data will grow is taking on debt. A team that ships a naive O(n²) algorithm because they’re in a rush is just writing bad code.

Martin Fowler expanded the idea into four quadrants crossing two axes: deliberate vs. inadvertent, and reckless vs. prudent. The most interesting quadrant is prudent-inadvertent: a skilled team builds the best design they can, works on it for a year, and realizes a better approach existed all along. Fowler calls this “not just common but inevitable for teams that are excellent designers.” That kind of debt has nothing to do with laziness or time pressure. It’s the cost of learning.

Steve McConnell drew a simpler line in 2007: intentional (strategic) vs. unintentional (the non-strategic result of doing a poor job). The intentional kind is a tool. The unintentional kind is a symptom.

The industry has since stretched “tech debt” to cover everything from TODO comments to entire legacy systems. That’s fine as shorthand, but it muddies the math. A TODO comment and a race condition both count as “debt” in casual conversation, but they have wildly different costs and different compounding dynamics.

This post uses burden (B(t)) as the umbrella term: the accumulated weight of everything that makes the codebase harder to change safely. That includes Cunningham’s original debt (knowledge gaps), Fowler’s quadrants (all four of them), McConnell’s unintentional sludge, and the bugs that slipped through because nobody held enough context. The formulas don’t care which category a problem came from. The codebase doesn’t either.

What to do about it

Decouple everything you can.

Low α and low λ are the only parameter settings that keep p(t) flat as the codebase grows. Pure functions, explicit data flow, dependency injection over shared state. If a change to module A can break module B, your coupling is too high.

Delete more than you write.

Every line you remove shrinks S(t). Dead code, unused exports, speculative abstractions: they all inflate the context an agent or human needs to hold. Deletion is the only operation that makes future changes cheaper.

Alternate build sprints with maintenance releases.

At 10x velocity, schedule a maintenance cycle every one to two weeks. Use it to break apart coupled modules, delete dead code, and drive B(t) back down. Continuous sprinting without consolidation is how teams collapse.

Measure complexity automatically.

Run Biome (or equivalent) on every commit. Set a cyclomatic complexity ceiling of 15. When the average starts climbing, that is your signal to switch from build mode to maintenance mode. Do not wait until you feel it.

Test at the rate you ship.

R(t) must grow with V(t). If your team ships 10x more code, your test suite and CI pipeline need to handle 10x more output. Manual testing cannot keep pace with AI velocity.
