If context is code, can we test it in a CI pipeline? The CDLC says generate, evaluate, distribute, observe. The evaluate step is where it gets real.
I wrote about this on tessl.io: evals are the equivalent of tests for context. But they follow different rules. Seven problems came up, split between how you run evals and what you’re actually measuring.
Part 1: Running evals
Non-determinism. LLMs produce variable outputs even at temperature zero. You can’t treat evals as pass/fail gates. Instead: error budgets. Define an acceptable failure rate in advance. Run a minimum of five trials per scenario (what we do in SkillsBench). Prefer binary evals over granular scoring.
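A minimal sketch of an error-budget gate, assuming binary evals. All names here (`run_with_error_budget`, `eval_fn`) are illustrative, not from SkillsBench or any real framework:

```python
def run_with_error_budget(eval_fn, scenario, trials=5, budget=0.2):
    """Run a binary eval several times; pass if the observed failure
    rate stays within a pre-agreed error budget, rather than demanding
    every trial succeed."""
    failures = sum(0 if eval_fn(scenario) else 1 for _ in range(trials))
    return failures / trials <= budget
```

With a 20% budget over five trials, one flaky failure still passes; two don’t. The budget is the decision you make up front, not after the run.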
Ownership. Before asking who approves a non-passing eval, define what quality means. Eval quality matters as much as context quality. Shallow evals provide false confidence. Go/no-go decisions need product and standards input, not just engineering.
Layered execution. Not every eval runs on every commit. Fast feedback during iteration, comprehensive testing before merge. This mirrors unit/integration/e2e tiers. It also enables eval-driven context development: write the eval first, then write the context that passes it.
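One way to sketch tiered selection, with hypothetical eval names and tiers (nothing here is a real registry):

```python
# Each eval is tagged with the cheapest tier at which it should run.
EVALS = [
    {"name": "lint-instructions", "tier": "commit"},   # fast, every commit
    {"name": "api-usage-scenarios", "tier": "merge"},  # slower, pre-merge
    {"name": "full-regression", "tier": "nightly"},    # comprehensive
]

TIER_ORDER = ["commit", "merge", "nightly"]

def evals_for(tier):
    """Select every eval at or below the requested tier, mirroring
    unit/integration/e2e test selection."""
    cutoff = TIER_ORDER.index(tier)
    return [e["name"] for e in EVALS if TIER_ORDER.index(e["tier"]) <= cutoff]
```

Commit-time runs stay fast; the nightly tier runs everything.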
Part 2: What you’re measuring
Real usage over synthetic tests. The most valuable evals come from actual failures, not hypothetical scenarios. Vercel’s next-evals-oss converts real Next.js API gaps into targeted scenarios. Wire agents to observability (Langfuse and similar) to close the feedback loop automatically.
Staleness from external changes. A new agent version ships. A shared skill changes upstream. Your code didn’t change, your context didn’t change, but the behavior shifted. Traditional CI catches dependency changes because builds break. Context drift goes undetected. The fix: scheduled eval runs independent of commits.
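A scheduled run only helps if something compares it against a baseline. A rough sketch of that comparison, assuming you store per-scenario pass rates (the function and threshold are illustrative):

```python
def detect_drift(baseline, current, tolerance=0.05):
    """Flag scenarios whose pass rate dropped more than the tolerance
    between a stored baseline and a scheduled run, even though no
    commit touched code or context.

    baseline/current: dicts mapping scenario name -> pass rate (0.0-1.0).
    """
    return [
        name for name, rate in current.items()
        if baseline.get(name, rate) - rate > tolerance
    ]
```

Scenarios absent from the baseline are skipped rather than flagged; the first scheduled run just establishes the baseline.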
The whack-a-mole problem. Adding an instruction doesn’t just add behavior. It shifts overall behavior unpredictably. As Hamel Husain argues, the right mental model treats evals as a monitoring layer, not a coverage target. No eval suite can capture all instruction interactions.
Context-specific evals. Generic vendor benchmarks miss what matters for your codebase. The only evals worth optimizing for are the ones you defined yourself. Your codebase, your conventions, your team’s standards. Optimize for someone else’s metric and Goodhart’s Law kicks in.
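A context-specific eval can be trivially small. A made-up example, assuming a team convention that generated code uses an in-house `team_logger` instead of `print` (both names are invented for illustration):

```python
def eval_uses_team_logger(generated_code: str) -> bool:
    """Binary, codebase-specific eval: did the agent follow our
    (hypothetical) logging convention?"""
    return "team_logger" in generated_code and "print(" not in generated_code
```

No vendor benchmark would ever check this, which is exactly the point.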
The flywheel connection
This feeds directly into the context flywheel. Better signals from production generate better failure cases. Better failure cases become better evals. Better evals produce better context. Better context produces better agent output. Each cycle sharpens the next.
The pipeline shape is familiar. The rules inside it are not.
Read the full post: CI/CD for Context in Agentic Coding: Same Pipeline, Different Rules on tessl.io.
Join the LinkedIn discussion.