
AI agents keep failing. The fix is 40 years old.


The pattern I keep seeing

An agent reads a function that takes a list and returns a list. It writes tests. They pass. The function fails in production because it depends on a global config and a database singleton the signature never declared. The agent had no way to know. This isn’t a model problem. Functional programmers solved it in the 1980s.

I’ve shipped AI products for over a decade, and the trajectory is always the same: impressive demo, promising pilot, gradual degradation, debugging nightmare, project abandoned. Most agent projects never make it to production. The ones that do often get rolled back within a year. MIT found 95% of AI pilots fail to deliver ROI. The instinct is to blame the models. “GPT-5 will fix it” or “we need better prompts.” The failures are architectural.

When an agent writes code into a mutable, tightly-coupled codebase, it’s producing non-deterministic output that depends on hidden state it can’t see. The global config object three modules away, the function that logs to disk as a side effect, the test that was mocking a database that behaves differently in production: the agent has no way to know about any of it.

The codebase is hostile to automation, and we keep blaming the agent.

Why agents need different code

A human developer builds a mental model of a codebase over months. They know where the bodies are buried: which functions mutate state, which modules share globals, which tests are flaky. They carry this context between sessions.

Agents don’t have that luxury. Every session starts from scratch. An agent reads the code that’s in front of it, follows the explicit contracts, and produces output based on what it can verify. This means anything implicit, any hidden state, any side effect buried inside a “pure” function, becomes a trap.

Here’s a function that looks fine to a human:
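A Python sketch of the kind of function the post describes; `evaluate_options` matches the name used later in the post, while `config`, `get_db`, and the weight lookup are illustrative stand-ins:

```python
import logging

logger = logging.getLogger(__name__)

# Hidden state: in the real system this is loaded from a YAML file at startup.
config = {"threshold": 0.5}

class _Database:
    """Stand-in for a singleton that needs initialization before first use."""
    def fetch_weights(self):
        return {"score": 1.0}

_db = _Database()

def get_db():
    return _db

def evaluate_options(options):
    # The signature says list -> list. Nothing declares these dependencies:
    threshold = config["threshold"]                     # hidden global config
    weights = get_db().fetch_weights()                  # hidden database singleton
    logger.info("evaluating %d options", len(options))  # hidden side effect
    return [o for o in options if o * weights["score"] > threshold]
```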

A developer on the team knows that config gets loaded from a YAML file at startup and the database accessor is a singleton that needs initialization. An agent sees a function that takes a list and returns a list. It writes tests against that contract, the tests pass in isolation, and the function fails in production because the global config wasn’t loaded.

Now multiply this across a codebase with hundreds of these hidden dependencies. Every function the agent touches has an invisible blast radius. Every change it makes can break something in a module it never read. This is why agent projects degrade: each iteration introduces subtle state corruption that compounds.

The agent sees inputs and outputs. The hidden dependencies are invisible.

The fix is forty years old

Functional programming solves these problems because it was designed to eliminate exactly the properties that make code hostile to automated reasoning. This isn’t a new insight. ML researchers have known since the 1980s that referentially transparent code is easier for machines to analyze, optimize, and transform. We just haven’t applied the lesson to the agents writing our code.

The principles are straightforward:

Pure functions return the same output for the same input, with no global state, database calls, or logging inside the function body. An agent can test a pure function by calling it with no setup or mocking required.

Explicit data flow means you can trace how inputs become outputs by reading the code linearly, without action-at-a-distance or mutations happening in a callback three layers deep. An agent can follow the data pipeline and understand what each step does.

Side effects at the boundaries means I/O, database access, and external API calls happen in a thin outer layer. The core logic is deterministic. An agent can rewrite core logic without worrying about accidentally triggering a payment or sending an email.

Composition over coupling means small functions that snap together like Lego bricks. An agent can replace one function without understanding the entire module graph.
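A toy illustration of the composition idea (the function names and pipeline are made up for this sketch):

```python
def parse(raw: str) -> list[float]:
    # One job: turn a comma-separated string into numbers.
    return [float(x) for x in raw.split(",")]

def clip(values: list[float], lo: float = 0.0, hi: float = 1.0) -> list[float]:
    # One job: clamp each value into [lo, hi].
    return [min(max(v, lo), hi) for v in values]

def total(values: list[float]) -> float:
    # One job: sum.
    return sum(values)

# Each piece is independently testable; swapping `clip` for a different
# normalizer touches exactly one function, not the whole module graph.
score = total(clip(parse("0.2,1.5,-0.3")))
```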

This isn’t about purity for its own sake. I don’t care about monads or category theory. I care that when an agent modifies a function, the scope of possible breakage is exactly one function.

SUPER: five principles for agent-friendly code

I put these into an acronym because that’s how principles survive in organizations.

SUPER is five constraints on how you write code:

  • Side Effects at the Edge: I/O happens in a thin outer layer, never inside business logic.
  • Uncoupled Logic: dependencies are passed in, never pulled from globals.
  • Pure & Total Functions: deterministic functions that handle every input.
  • Explicit Data Flow: you can trace data linearly from input to output.
  • Replaceable by Value: any expression can be swapped with its computed result.

The practical effect: an agent working on SUPER-compliant code can modify any function by reading only that function and its type signature. No hidden state to trace, no global config to discover, no side effects to accidentally trigger. Here’s what that looks like on a real function:

Before: the evaluate_options function from earlier, with its hidden dependencies.

An agent writing tests for this function will miss the config dependency, the database singleton, and the logger. The tests pass in isolation. The function fails in production.

After: the same logic, SUPER-compliant. Dependencies are parameters. I/O is the caller’s job. Every input is explicit.
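A sketch of what the refactor could look like in Python; the caller-side helpers (`load_config`, `fetch_weights`) are illustrative, not the post's exact code:

```python
import logging

logger = logging.getLogger(__name__)

def evaluate_options(options, weights, threshold):
    """Pure: same inputs, same outputs. No globals, no I/O, no logging."""
    return [o for o in options if o * weights["score"] > threshold]

# The caller owns every side effect: config loading, database access, logging.
# Passing the I/O functions in keeps even the caller easy to test.
def run(options, load_config, fetch_weights):
    cfg = load_config()        # I/O at the edge
    weights = fetch_weights()  # I/O at the edge
    results = evaluate_options(options, weights, cfg["threshold"])
    logger.info("evaluated %d options", len(options))
    return results
```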

The agent can now test evaluate_options by calling it with a list and a number. No mocking, no setup, no teardown. If the function is wrong, the agent sees it immediately. If it’s right, it stays right regardless of what the rest of the codebase does. The blast radius of any change is exactly one function.

SPIRALS: a process loop for human-agent collaboration

SUPER handles the code. But agents also need a structured process, or they drift. Anyone who’s watched Auto-GPT burn through API credits in an infinite loop knows what unstructured agent autonomy looks like.

SPIRALS is a seven-step loop that I run agents through on every task. It’s not a waterfall; it’s a tight cycle, often sub-minute, that keeps agents focused and gives humans natural checkpoints to intervene.

Sense

Gather context: read the relevant files, check git status, identify what already exists. Agents that skip this step rebuild things that already work.

Plan

Draft an approach, consider trade-offs, and define what “done” looks like. The human validates before any code gets written.

Inquire

Identify gaps in knowledge. What assumptions is the agent making? What doesn’t it know? This prevents the confident hallucination problem where an agent barrels ahead on wrong assumptions.

Refine

Simplify the plan. Apply the 80/20 rule. If a ticket is bigger than 3 story points, split it. Complexity gets killed here, before it enters the codebase.

Act

Write the code, following SUPER principles, as small bounded changes with tests alongside.

Learn

Run the tests and check the output. If something failed, the agent records what specifically went wrong for the next iteration.

Scan

The step Auto-GPT never had. The agent zooms out, looks for duplication, new risks, and things the change might have broken elsewhere. This is why Auto-GPT looped forever: it never checked whether it was actually making progress.

The seven steps split into two phases:

\underbrace{\textsf{S} \cdot \textsf{P} \cdot \textsf{I} \cdot \textsf{R}}_{\text{plan}} \;\Big|\; \underbrace{\textsf{A} \cdot \textsf{L} \cdot \textsf{S}}_{\text{execute}}

In practice, I run these as two separate commands. The planning phase (Sense, Plan, Inquire, Refine) produces design docs, tickets, and a burndown. A human reviews and approves. Only then does the execution phase (Act, Learn, Scan) start, and it runs per-ticket: write the code, verify it works, check for regressions, commit, move to the next ticket. The gate between SPIR and ALS is the only point where I require human approval. Everything else, the agent handles.

The SPIRALS loop: each iteration cycles through all seven steps until Scan confirms the goal is met.

The loop terminates when Scan confirms the goal is met. If it doesn’t converge, Scan flags it and a human decides what to do next, so you don’t wake up to an infinite loop that burned through your API budget overnight.
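The two-phase loop can be sketched in Python; the agent object and its methods here are hypothetical stand-ins for whatever harness you use, not a real API:

```python
def spirals(task, agent, approve, max_iters=10):
    # SPIR: planning phase. Pure reasoning, safe to retry.
    context = agent.sense(task)    # gather context, check what exists
    plan = agent.plan(context)     # draft approach, define "done"
    plan = agent.inquire(plan)     # surface unstated assumptions
    tickets = agent.refine(plan)   # simplify; split oversized tickets
    if not approve(tickets):       # the single human gate
        return []
    # ALS: execution phase, run per ticket.
    for ticket in tickets:
        for _ in range(max_iters):  # bounded: never an overnight infinite loop
            result = agent.act(ticket)
            notes = agent.learn(result)
            if agent.scan(ticket, result, notes):  # is the goal actually met?
                break
        else:
            # Scan never confirmed convergence: escalate to a human.
            raise RuntimeError(f"no convergence on {ticket}")
    return tickets
```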

Why they work together

SUPER without SPIRALS gives you clean code with no process. The agent writes a perfect function, then writes nine more that weren’t needed. Or it refactors something that didn’t need refactoring. Discipline in the code means nothing without discipline in the workflow.

SPIRALS without SUPER gives you a structured process applied to a messy codebase. The agent follows all seven steps, but the Act step produces code with hidden dependencies that corrupt on the next iteration. The loop degrades because the underlying code can’t support reliable automated modification.

Together:

  • Side effects at the edge means only the Act step touches the real world. Sense, Plan, Inquire, and Refine are pure reasoning, safe to retry and cheap to test.
  • Uncoupled logic means each SPIRALS step can be its own module or its own agent. You can swap in a better planner without rewiring the system.
  • Purity means Plan and Refine are deterministic. Same input state, same plan. You can reproduce bugs by replaying inputs.
  • Explicit data flow means you can trace exactly what happened at each step. When something goes wrong at minute 47 of a long run, you read the log linearly and find it.
  • Referential transparency means intermediate results are cacheable. If Sense returns the same context, skip to Plan.
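The cacheability point can be shown with plain memoization; this toy `sense` function is illustrative, not the real step:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def sense(task: str) -> str:
    # Referentially transparent: the result depends only on the argument,
    # so a repeated call is a cache hit and the loop can skip straight to Plan.
    return f"context for {task}"

sense("fix-login-bug")
sense("fix-login-bug")  # second call: served from the cache, no recomputation
```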

What this looks like in practice

I use SUPER and SPIRALS on every project now. This website, Unfudged, Intraview, all of it.

The concrete difference: agents working on SUPER-compliant code produce changes that pass tests on the first try about 3x more often than agents working on typical imperative code with global state. I don’t have a rigorous study for this; it’s what I’ve observed across projects over the past year. The debugging time drops even more because when something does fail, the failure is local to one function, not spread across a graph of shared state.

The process difference with SPIRALS: agents used to require heavy babysitting, where I’d check every output and try to catch hallucinations before they landed. With SPIRALS, the Scan step catches most regressions before I see them. I review at the Plan and Learn steps and skip the rest unless Scan flags something. My involvement per task dropped from continuous to two checkpoints.

Neither framework requires rewriting your codebase from scratch. Start with SUPER’s “S”: move side effects out of your three most-modified modules. That alone makes agent modifications safer. Add the Scan step to your agent workflows. You’ll catch the infinite loops and the confident-but-wrong outputs before they cost you.

Both frameworks are in my CLAUDE.md files, so every agent I work with follows them from the first prompt.

Where to start

You don’t need to rewrite your codebase. Pick one module and work through these five steps.

Find your three most-modified modules

Run git log --format=format: --name-only | grep -v '^$' | sort | uniq -c | sort -rn | head -20 (the grep drops the blank lines git emits between commits, which would otherwise top the count). The files your team touches most are where hidden dependencies cause the most damage. Start there.

Move the side effects out

Find every function in those modules that reads global config, hits a database, writes a log, or calls an external API. Pull that I/O into the caller. The function’s job is to compute; the caller’s job is to interact with the world.

Make dependencies explicit

Every value a function needs should be in its parameter list. If a function reaches into a singleton or ambient context, add the parameter and pass it in. The function signature should be the complete contract.
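A before/after sketch of this step (the shipping-cost example and its names are made up):

```python
config = {"rate_per_kg": 3.0}  # ambient state the "before" version reaches into

def shipping_cost_before(weight_kg):
    # The signature hides the real contract: it also depends on `config`.
    return weight_kg * config["rate_per_kg"]

def shipping_cost(weight_kg: float, rate_per_kg: float) -> float:
    # Every value the function needs is in its parameter list.
    return weight_kg * rate_per_kg
```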

Add the Scan step

After your agent completes a change, have it zoom out: check for duplication, look for things the change might have broken elsewhere, and verify the goal is actually met. This is the step that prevents infinite loops and confident-but-wrong outputs.

Measure the difference

Run your agent against the refactored code. Count how often the tests pass on the first try compared to before. If the architecture is right, you’ll see it in the numbers.


The industry is moving toward more agent autonomy, not less. If your code can’t be reasoned about by a machine, no amount of model improvement will save you.

The fix has been in your CS textbook for forty years. The agents just made it urgent.
