Models are optimizing their own tooling now
"Within thirty years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended." (Vernor Vinge, 1993)

"Nonbiological intelligence will have access to its own design and will be able to improve itself in an increasingly rapid redesign cycle." (Ray Kurzweil, 2005)

"Success would be the biggest event in human history ... and perhaps the last event in human history." (Stuart Russell, 2019)

"I think it's quite conceivable that humanity is just a passing phase in the evolution of intelligence." (Geoffrey Hinton, 2023)

"Intelligence may be very powerful, but it isn't magic fairy dust." (Dario Amodei, 2024)

"We are seeing signs in recent months of these systems having self-preservation behavior and power-seeking behavior." (Yoshua Bengio, 2024)

"If there were literal copies of me, I'm not sure how much more incremental value you'd get." (Ilya Sutskever, 2025)

"This feedback loop is gathering steam month by month, and may be only 1-2 years away from a point where the current generation of AI autonomously builds the next." (Dario Amodei, 2026)

In 1965, the statistician I.J. Good described an "ultraintelligent machine" that could improve its own design. The result, he wrote, would be an "intelligence explosion."
The first ultraintelligent machine is the last invention that man need ever make.
That sentence shaped sixty years of discourse. Bostrom warned a self-improving AI would execute a “treacherous turn”: cooperating while weak, seizing control once strong. Yudkowsky argued the explosion would take “weeks or hours.” Schmidhuber proposed the Gödel Machine, a system that rewrites its own code whenever it can mathematically prove the rewrite helps. Elegant theory. Never built.
In March 2026, self-improving AI shipped. It adjusted its own sampling temperature.
What MiniMax actually built
Shanghai-based MiniMax released M2.7: 100+ rounds of autonomous self-optimization, no humans in the loop. The model analyzed its own failures, rewrote its scaffold code, evaluated results, decided what to keep or revert.
Results: 9 gold medals on MLE Bench Lite, behind only Opus 4.6 and GPT-5.4. 56.2% on SWE-Pro. 30% performance gain from self-optimization alone.
M2.7 didn’t retrain its neural network weights. It optimized the agent layer: scaffolding code, tools, memory systems, sampling parameters, workflow logic. It built its own loop detection to avoid dead ends.
It didn’t rewrite its brain. It organized its desk.
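MiniMax hasn't published its loop-detection code, but the idea is simple enough to sketch. A minimal, hypothetical version (all names mine, not MiniMax's) hashes the agent's recent actions and flags the run when one action starts dominating the window:

```python
import hashlib
from collections import deque

class LoopDetector:
    """Flag an agent that keeps repeating the same action.

    A hypothetical sketch of the idea, not MiniMax's implementation.
    """

    def __init__(self, window: int = 6, max_repeats: int = 3):
        self.recent = deque(maxlen=window)  # rolling window of action hashes
        self.max_repeats = max_repeats

    def record(self, action: str) -> bool:
        """Record an action; return True if the agent looks stuck."""
        h = hashlib.sha256(action.encode()).hexdigest()
        self.recent.append(h)
        # Stuck if the same action hash dominates the recent window.
        return self.recent.count(h) >= self.max_repeats

detector = LoopDetector()
for step in ["read file", "edit config", "edit config", "edit config"]:
    stuck = detector.record(step)
print(stuck)  # True: "edit config" repeated 3 times in the window
```

Hashing instead of storing raw actions keeps the window cheap even when "actions" are multi-kilobyte tool calls.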
Everyone built the same loop
MiniMax isn’t alone. Multiple labs converged on this pattern independently.
Karpathy’s autoresearch shipped this month. Around 630 lines of Python. An AI agent edits a training script, runs a 5-minute GPU experiment, checks results, keeps or reverts, repeats. One run produced 700 experiments over two days and found 20 additive improvements. “Time to GPT-2” dropped from 2.02h to 1.80h. Shopify CEO Tobi Lutke reported a 19% gain from an overnight run.
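Stripped of the GPU plumbing, that keep-or-revert loop is a few lines. A hedged sketch, where `run_experiment` and `propose_edit` are stand-ins for the timed training run and the LLM's script edit, not Karpathy's actual functions:

```python
import copy
import random

def optimize(config, run_experiment, propose_edit, rounds=10):
    """Greedy keep-or-revert loop: try one edit per round and keep it
    only if the measured metric improves. Both callables are
    hypothetical stand-ins, not autoresearch's actual code."""
    best_score = run_experiment(config)                 # baseline measurement
    for _ in range(rounds):
        candidate = propose_edit(copy.deepcopy(config))  # e.g. an LLM edit
        score = run_experiment(candidate)                # timed experiment
        if score > best_score:                           # keep only real wins
            config, best_score = candidate, score
        # else: revert, i.e. simply discard the candidate
    return config, best_score

# Toy demo: the "experiment" scores a config by how close lr is to 0.3.
random.seed(0)
cfg = {"lr": 0.1}
score_fn = lambda c: -abs(c["lr"] - 0.3)
edit_fn = lambda c: {**c, "lr": c["lr"] + random.uniform(-0.05, 0.05)}
best_cfg, best = optimize(cfg, score_fn, edit_fn, rounds=50)
```

The loop is monotone by construction: the kept config's score never goes down, which is what makes an unattended overnight run safe to trust in the morning.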
DeepMind’s AlphaEvolve (May 2025) runs evolutionary search over algorithm code. It beat Strassen’s 1969 matrix multiplication result, freed 0.7% of Google’s total compute through scheduling optimization, and sped up FlashAttention kernels by 32.5%.
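AlphaEvolve's machinery is far heavier, but the evolutionary-search skeleton can be shown in miniature. In this toy sketch, `mutate` stands in for the LLM that rewrites candidate code, and "candidates" are coefficient vectors rather than programs (all of this is illustrative, not DeepMind's code):

```python
import random

def evolve(seed_candidate, mutate, fitness, population_size=8, generations=20):
    """Toy evolutionary search in the AlphaEvolve spirit: keep a
    population of candidates, mutate the fittest, select survivors
    by score. `mutate` stands in for an LLM that rewrites code."""
    population = [seed_candidate]
    for _ in range(generations):
        # Rank by fitness and keep the top half as parents.
        parents = sorted(population, key=fitness, reverse=True)[:population_size // 2]
        # Each parent spawns two mutated children.
        children = [mutate(p) for p in parents for _ in range(2)]
        population = parents + children  # elitism: parents survive unchanged
    return max(population, key=fitness)

# Demo: fitness rewards candidates close to a target vector.
random.seed(1)
target = [3.0, -1.0, 2.0]
fit = lambda c: -sum((a - b) ** 2 for a, b in zip(c, target))
mut = lambda c: [a + random.gauss(0, 0.1) for a in c]
best = evolve([0.0, 0.0, 0.0], mut, fit)
```

Elitism (parents surviving unchanged) gives the same monotone guarantee as keep-or-revert: the best candidate found so far can never be lost.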
Microsoft’s STOP (Self-Taught Optimizer) is the academic ancestor. A scaffolding program uses an LLM to improve itself recursively. The model proposed beam search, genetic algorithms, and simulated annealing, all introduced after its own training cutoff.
Meta went a different direction with Meta-Rewarding Language Models. The model acts as both actor and judge, generates its own training rewards, trains on its own judgments. Llama-3-8B-Instruct win rate went from 22.9% to 39.4% on AlpacaEval 2. Unlike the rest, this actually modifies model weights.
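Meta's pipeline is heavier than this, but the actor-and-judge core reduces to: sample several responses, have the same model score them, and keep the best and worst as a DPO-style training pair. A hedged sketch with stub functions standing in for the model calls:

```python
def build_preference_pair(prompt, generate, judge, n_samples=4):
    """Actor-and-judge sketch: the same model both writes candidate
    responses and scores them; the top- and bottom-scored responses
    become a (chosen, rejected) preference pair for training.
    `generate` and `judge` are hypothetical stand-ins for model calls,
    not Meta's actual pipeline."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    scored = sorted(candidates, key=lambda r: judge(prompt, r))
    return {"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]}

# Toy demo: a "judge" that simply prefers longer answers.
responses = iter(["ok", "a fuller answer", "short", "the most complete answer here"])
gen = lambda p: next(responses)
jdg = lambda p, r: len(r)
pair = build_preference_pair("What is self-rewarding training?", gen, jdg)
print(pair["chosen"])  # "the most complete answer here"
```

The circularity is the point, and the risk: the judge's biases (here, a crude length preference) become the training signal, which is why meta-rewarding adds a layer that judges the judge.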
Brute force won
| Approach | What changes | Who |
|---|---|---|
| Scaffold optimization | Tools, prompts, memory, workflow | MiniMax, Karpathy, Microsoft |
| Evolutionary code search | Algorithm implementations | DeepMind AlphaEvolve |
| Self-reward training | Model weights | Meta |
Schmidhuber’s Gödel Machine needed mathematical proof before every self-modification. That made it intractable. What actually worked was the brute-force version: try a change, measure the result, decide. No proofs. Just empirical outcomes and a 5-minute timeout.
Scaffold optimization is winning because it’s cheap. You don’t need a training cluster. You need inference credits and a decision loop.
Three reads
The optimistic read: models handle the grunt work of hyperparameter search, prompt engineering, and workflow design. Karpathy’s system found 20 improvements across hundreds of experiments. Most engineers would try maybe 10 in that time.
The honest read: none of these systems do anything a motivated engineer with a for-loop couldn’t. Optimization targets are narrow. Evaluation functions are pre-defined. Autoresearch tweaked training scripts. Brute-force search with good taste in what to keep.
The practical read: doesn’t matter. A model that runs experiments overnight and surfaces the ones that helped is useful whether or not it’s “really” self-improving. Does it save you a week of grid search? Yes.
Where it leads
In 2008, Yudkowsky and Hanson debated what self-improving AI would look like. Yudkowsky said it would happen fast: a single system bootstrapping itself to dominance in weeks. Hanson called that “implausibly extreme.” Intelligence needs “thousands of good modules” and “thousands of inventions,” he argued. No single system would FOOM.
What showed up looks more like Hanson’s version on a faster clock. Narrow improvements across separate systems: scaffold parameters, training scripts, algorithm variants. Each gain is modest but the aggregate adds up.
AlphaEvolve already crossed one line: it improved the training of its own underlying LLMs, even if the gain was small.
MiniMax’s roadmap points toward full autonomy across data construction, model training, and evaluation. NVIDIA released OpenShell this month, a sandboxed runtime where autonomous agents can’t override security policies, even if compromised. Containment infrastructure shipping alongside the capabilities.
Good’s “intelligence explosion” hasn’t arrived. What arrived is an intelligence ratchet: small, auditable, empirical gains that compound when you let a model run overnight. The barrier to entry is a few hundred lines of Python and a GPU.
What happens when someone closes the loop, when the scaffold optimizer can also modify its own evaluation function?
Nobody’s published that paper yet.