What if the agent could optimize its own instructions? Every time an agent messes up, you open the CLAUDE.md, add a rule, and hope it sticks. You’re the feedback loop: watching output, diagnosing failures, rewriting instructions by hand.
Mitko Vasilev, a CTO focused on enterprise R&D and a vocal advocate for owning your own AI stack, is doing exactly that, locally and offline. The setup:
- Take the agent’s skill file (the system prompt, the CLAUDE.md, whatever governs its behavior)
- Run it against real tasks on a 20-year-old codebase
- Capture pass/fail, timing, and full error logs
- Let the agent write its own post-mortem
- Let the optimizer propose a sharper skill
- Keep what moves the metric. Repeat.
The skill file is the tunable parameter. Real tasks are the evaluator. Logs are the gradient.
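To make the shape of that loop concrete, here is a minimal sketch in Python. Every name in it is an assumption of mine, not Mitko's actual setup and not any library's API; it just restates the list above as code: run, capture, diagnose, propose, keep what moves the metric.

```python
# Minimal sketch of the self-tuning loop described above. All names here are
# hypothetical stand-ins, not Mitko's setup or any library's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskResult:
    task_id: str
    passed: bool
    duration_s: float
    error_log: str  # the full trace, not just a score

def pass_rate(results: list[TaskResult]) -> float:
    return sum(r.passed for r in results) / len(results)

def tune(
    skill_text: str,
    tasks: list,
    run_tasks: Callable[[str, list], list[TaskResult]],   # run the agent under this skill file
    write_post_mortem: Callable[[TaskResult], str],        # agent explains why a task failed
    propose_skill: Callable[[str, list[str]], str],        # optimizer drafts a sharper skill
    iterations: int = 20,
) -> str:
    best = skill_text
    best_score = pass_rate(run_tasks(best, tasks))
    for _ in range(iterations):
        results = run_tasks(best, tasks)
        post_mortems = [write_post_mortem(r) for r in results if not r.passed]
        candidate = propose_skill(best, post_mortems)
        score = pass_rate(run_tasks(candidate, tasks))
        if score > best_score:  # keep what moves the metric
            best, best_score = candidate, score
    return best
```

The three callables are where the agent, the evaluator, and the optimizer plug in; the loop itself is just bookkeeping.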
His key line: “It doesn’t get more confident. It gets more correct.”
Mitko’s framing, “The AI factory is building itself,” captures something important. He’s not fine-tuning a model. He’s not writing better prompts by hand. He’s letting the agent’s operational experience feed back into its own instructions. And he’s doing it entirely on his own hardware. His broader thesis is that AI should be owned, not rented: “AI in the cloud is not aligned with you; it’s aligned with the company that owns it.” Self-tuning skills running locally is that philosophy in action.
What a skill actually learns
Think about what’s inside a typical CLAUDE.md or agent skill file. It’s a mix of instructions: “use this ORM pattern,” “always check for null returns from the legacy API,” “run tests before committing.” Most of us write these from memory and experience. We forget edge cases. We over-specify some things and under-specify others.
The self-tuning loop discovers what you missed. The agent runs against real tasks, hits a failure you didn’t anticipate, and the optimizer adds a specific instruction to handle it. After a few dozen iterations, the skill file contains hard-won knowledge that no human would have written upfront. You only learn these things by failing at them.
Mitko’s 20-year-old codebase is the perfect test. Legacy systems have undocumented quirks: a column that’s nullable but shouldn’t be, an API that returns 200 on errors, a build step that silently fails on certain file paths. A hand-written skill file might cover the obvious patterns. A self-tuned one accumulates the long tail of things that actually break.
The GEPA results back this up: Claude Haiku’s pass rate went from 79% to 98% on repository tasks. That 19-point gap isn’t about the model being smarter. It’s the same model with better instructions. The skill file learned what the model couldn’t figure out on its own.
How GEPA makes this work
GEPA (Genetic-Pareto) comes from the Stanford/Berkeley group behind DSPy; the paper was accepted as an oral at ICLR 2026. The core insight: traditional RL collapses execution traces into a scalar reward. It knows that something failed but not why. GEPA passes the full trace to an LLM that reads the error log and writes a diagnosis.
The loop:
- Select a candidate skill from the Pareto frontier
- Execute on a minibatch of real tasks, capturing full traces
- Reflect: an LLM reads the logs and diagnoses why it failed, in natural language
- Mutate: propose an improved skill informed by the diagnosis
- Accept if improved on any dimension, update the Pareto front
They call the diagnostic feedback Actionable Side Information, the text-optimization equivalent of a gradient. The agent doesn’t just know its score dropped; it knows the SQL query timed out because it forgot to add an index hint for the legacy schema. That specificity is what makes the next skill version sharper.
The Pareto trick matters for practical skills. Instead of averaging all metrics into one score, GEPA maintains a frontier. A skill variant that’s great at database tasks but mediocre at API calls still survives. The merge proposer can later combine complementary strengths, producing a skill that handles both well. This is how a 10-line ARC-AGI agent evolved into a 300-line system that went from 32% to 89% accuracy.
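Here is a rough sketch of that reflect/mutate/Pareto loop in code. The Candidate structure, the dominance check, and the evaluate/reflect/mutate callables are my own stand-ins, not the GEPA paper’s implementation or DSPy’s API, but they show why the frontier matters: a candidate survives as long as nothing else beats it on every dimension.

```python
# Rough sketch of a reflect/mutate/Pareto loop. Data structures and the
# evaluate/reflect/mutate callables are illustrative, not GEPA's real code.
import random
from dataclasses import dataclass, field

@dataclass
class Candidate:
    skill_text: str
    scores: dict[str, float] = field(default_factory=dict)  # e.g. {"db_tasks": 0.9, "api_tasks": 0.4}

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    keys = a.scores.keys() & b.scores.keys()
    return (all(a.scores[k] >= b.scores[k] for k in keys)
            and any(a.scores[k] > b.scores[k] for k in keys))

def update_front(front: list[Candidate], new: Candidate) -> list[Candidate]:
    """Keep `new` unless something on the front dominates it; drop anything it dominates."""
    if any(dominates(c, new) for c in front):
        return front
    return [c for c in front if not dominates(new, c)] + [new]

def gepa_step(front, evaluate, reflect, mutate, minibatch):
    """One iteration. `evaluate` returns (per-group scores, full execution traces);
    `reflect` and `mutate` are LLM calls supplied by the caller."""
    parent = random.choice(front)                            # 1. select from the Pareto frontier
    _, traces = evaluate(parent.skill_text, minibatch)       # 2. execute, keeping full traces
    diagnosis = reflect(traces)                              # 3. an LLM reads the logs, explains the failures
    child = Candidate(mutate(parent.skill_text, diagnosis))  # 4. propose an improved skill from the diagnosis
    child.scores, _ = evaluate(child.skill_text, minibatch)
    return update_front(front, child)                        # 5. kept if nothing on the front dominates it
```

The per-group scores are what let a database-savvy variant and an API-savvy variant coexist until a merge produces something good at both.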
What this means for context engineering
I’ve been writing about the Context Development Lifecycle, the idea that context files need the same engineering discipline as code: generate, evaluate, distribute, observe. That whole framework assumes a human in the loop crafting the context.
GEPA automates three of those four stages. The machine generates candidate context, evaluates it against real tasks, and observes the results. Only distribution (getting the right context to the right agents) still feels inherently human.
But maybe that’s fine. The CDLC doesn’t require the human to be the author. It requires discipline in the process. Automated optimization with human oversight is still a lifecycle. The human role shifts from writing the context to curating the evaluation criteria.
The question isn’t whether humans or machines should craft agent context. It’s who defines what “good” looks like.
The hybrid: hand-crafted skeleton, machine-tuned muscle
The most interesting path is probably neither pure hand-crafted context nor pure self-tuning. It’s a hybrid:
- Humans write the skeleton: architectural principles, security requirements, style preferences. The things that are values, not metrics. “We use repository pattern.” “Never expose internal IDs.” “Run integration tests, not just unit tests.”
- Machines tune the muscle: the specific phrasing that makes the agent follow those principles reliably, the edge case handling, the codebase-specific patterns. “When modifying the payments module, always check the `legacy_flag` column.” “The CI pipeline needs `--no-sandbox` for headless Chrome.”
- Real tasks close the loop: pass/fail on actual work keeps both honest
This looks a lot like how we already manage infrastructure. Humans write the policy. Machines optimize the configuration. Monitoring tells you when things drift.
In practice, you’d start with your existing CLAUDE.md, point GEPA at your test suite, and let it sharpen the instructions over a few hundred runs. The output is still a readable text file. You can diff it, review it, reject changes you don’t like. It’s not a black box. It’s your skill file, just better.
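Here is a minimal sketch of what that hybrid could look like in practice, assuming you mark off a machine-tunable section of the skill file. The marker, the file layout, and the `optimize` callable are assumptions, not a real API; the point is that the skeleton stays frozen and every machine proposal arrives as a reviewable diff.

```python
# Sketch of the hybrid: a hand-written skeleton the optimizer may not touch,
# a tunable section it can rewrite, and a human diff/review gate at the end.
# The marker, paths, and `optimize` callable are hypothetical, not a real API.
import difflib
from pathlib import Path

MARKER = "<!-- TUNABLE BELOW: machine-maintained, review before merging -->"

def split_skill(text: str) -> tuple[str, str]:
    skeleton, _, tunable = text.partition(MARKER)
    return skeleton, tunable

def retune(skill_path: Path, optimize, tasks) -> str:
    """Re-optimize only the tunable half; return a reviewable unified diff."""
    original = skill_path.read_text()
    skeleton, tunable = split_skill(original)
    new_tunable = optimize(tunable, context=skeleton, tasks=tasks)  # e.g. a GEPA-style loop
    proposed = skeleton + MARKER + new_tunable
    return "\n".join(difflib.unified_diff(
        original.splitlines(), proposed.splitlines(),
        fromfile="CLAUDE.md", tofile="CLAUDE.md (proposed)", lineterm=""))
```

The diff is the review hook: nothing lands in the skill file until a human has read it, which is one way to keep the legibility and governance problems below in check.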
Claude’s critique
I asked Claude to poke holes in this premise. Fair pushback:
- The evaluator is the real bottleneck. GEPA optimizes brilliantly against whatever you measure. But if your evaluator is “tests pass,” you’ll get code that passes tests, not necessarily code you’d want to maintain.
- Legibility decays. After 50 iterations of machine rewriting, the skill file becomes an accumulation of rules without coherent reasoning. The machine doesn’t preserve the why, just the what. Cargo cult instructions that work until the codebase changes.
- The bootstrapping problem. You need a solid test suite. Teams with solid test suites already have decent agent instructions. The teams that would benefit most can’t use this without first doing the work they’re trying to avoid.
- 79% → 98% is the easy part. Binary pass/fail is the most favorable case. “Refactor for readability” or “design an intuitive API” don’t have clean success criteria.
- You’re trading one kind of drift for another. Human-written context drifts from neglect. Machine-written context drifts from overfitting to a moment in time.
What to watch
All valid critiques. And yet, GEPA is integrated into DSPy, MLflow, and Pydantic AI. The optimize_anything API is minimal enough that you could point it at your own CLAUDE.md and a test suite today. The barrier to trying this is low.
The real question is organizational. When the agent’s instructions are machine-generated, who reviews them? Who’s accountable when the optimized skill produces code that passes tests but violates an unwritten team norm? Self-tuning context needs self-tuning governance.
We’re still early. But the trajectory is clear: context engineering is heading toward the same automation curve that code went through. First we wrote it by hand, then we tested it, then we generated it, then we optimized it. The CDLC covers stages one through three. GEPA is showing us stage four.