# Why Your Agent Builds the Wrong Thing Correctly
2026-04-02 · George Moon
Your agent passed lint. The tests are green. The PR looks reasonable. And it's completely wrong.
Not wrong in the way a junior developer is wrong — broken syntax, missing imports, off-by-one errors. Wrong in a way that's harder to catch: the code is locally valid but globally misguided.
This is the failure mode that matters now. And most teams don't have a system for preventing it.
## The shape of agent mistakes
When a human developer makes a mistake, it's usually obvious. The build breaks. The tests fail. A reviewer catches the logic error. Traditional tooling — compilers, linters, type checkers — exists to catch these failures.
Agent mistakes look different. Matt Rickard calls them "locally valid mistakes":
> Agents fail in distinct ways from humans — they produce locally valid mistakes rather than breaking builds. Traditional tooling doesn't adequately constrain agent behavior.
> — Matt Rickard, *The Spec Layer*
You've seen this if you've worked with agents for any length of time. The patterns are consistent:
- **Disabling tests instead of fixing root causes.** The agent encounters a failing test. Instead of understanding why it fails, it comments out the assertion or rewrites the test to pass against the new (wrong) behavior.
- **Reusing patterns mindlessly.** The agent finds an existing pattern in the codebase and replicates it — even when the new context requires a different approach. It copies the authentication middleware pattern for a public endpoint.
- **Preserving old behavior while adding parallel paths.** Asked to change how notifications work, the agent adds a new notification system alongside the old one. Both now run. Neither is what you wanted.
- **Making the same decision repeatedly.** Finite context windows mean the agent re-encounters the same ambiguity on every run and resolves it differently each time.
## The root cause
These aren't intelligence failures. The agent is capable of writing correct code for any of these tasks. The problem is what Rickard calls **underconstrained execution**: too much freedom at the point where the agent has to act.
Think about what an agent typically receives when tasked with a feature: a one-line description in a ticket, maybe a few sentences of context, and access to the codebase. From this, it must infer:
- What exactly should be built
- Why this approach was chosen over alternatives
- What constraints exist from prior decisions
- Which parts of the codebase are relevant precedent and which are legacy
A senior engineer fills these gaps from experience and institutional knowledge. An agent fills them from pattern matching against whatever is in the context window. When the context is thin, the agent invents plausible-sounding answers. The code compiles. The tests pass. The PR looks clean.
Six months later someone asks "why did we build it this way?" and nobody knows — because the reasoning was never recorded, and the agent that wrote the code had none to begin with.
## Better prompts won't fix this
The instinct is to write more detailed prompts. Longer task descriptions. More context in the system message. This helps at the margins but doesn't solve the structural problem.
Prompts are ephemeral. They exist for one conversation, one task, one context window. The reasoning behind a decision outlives the prompt that triggered it. Next week, a different agent — or the same agent with a fresh context — faces the same ambiguity and has nothing to work from.
The fix isn't more words in the prompt. It's structured knowledge that persists across sessions, connects decisions to evidence, and flags when upstream assumptions change.
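One way to picture that persistent layer is a decision record that stores not just the choice but the reasoning and the evidence it rests on. The sketch below is a hypothetical schema for illustration, not Lattice's actual data model; the identifiers are invented:

```python
from dataclasses import dataclass, field

# Hypothetical decision-record schema (illustrative only; not
# Lattice's real data model). The point is that the reasoning
# persists outside any single prompt or context window.
@dataclass
class DecisionRecord:
    id: str                  # stable identifier, e.g. "DEC-AUTH-001"
    decision: str            # what was chosen
    rationale: str           # why, including rejected alternatives
    evidence: list[str] = field(default_factory=list)  # ids of supporting research
    version: int = 1         # bumped whenever the reasoning changes

record = DecisionRecord(
    id="DEC-AUTH-001",
    decision="Use JWT with per-tenant signing keys",
    rationale="Stateless scaling; a shared session store was ruled out",
    evidence=["RES-AUTH-003"],
)
```

A fresh agent session that loads this record inherits the rationale and its evidence trail instead of re-deriving an answer from pattern matching.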
## Constraining intent, not just behavior
Tests constrain behavior: they verify the code does what the test author specified. Linters constrain style. Type checkers constrain structure.
None of these constrain **intent**. They can't tell you whether the agent is solving the right problem. They catch execution errors, not reasoning errors.
What's missing is an upstream layer:
- **What research informed this decision?** Not just "we chose OAuth" but the analysis comparing OAuth to session-based auth for this specific architecture.
- **What strategic position does this implement?** Not just a feature request but the thesis that token-based auth enables stateless scaling.
- **What are the concrete requirements?** Not "add auth" but "JWT tokens with per-tenant signing keys, refresh token rotation, and rate limiting per plan."
- **Has anything changed?** If the original research is outdated or the strategic thesis has shifted, which requirements need review?
This is a knowledge graph problem. Decisions form a chain from evidence to strategy to specification to code. When you capture that chain with versioned relationships, agents can traverse it — and you can detect when links in the chain go stale.
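The mechanics of staleness detection can be sketched in a few lines: each link records the version of the upstream node it was bound to, and a drift check compares that against the node's current version. The types and names below are illustrative assumptions, not Lattice's real API; the pricing-thesis scenario is the one described later in this post:

```python
from dataclasses import dataclass

# Illustrative sketch of versioned links in a decision chain;
# names and types are hypothetical, not Lattice's implementation.
@dataclass
class Node:
    id: str
    version: int

@dataclass
class Link:
    downstream: str      # e.g. a requirement id
    upstream: str        # e.g. a thesis id
    bound_version: int   # upstream version when the link was made

def stale_links(links, nodes):
    """Return links whose upstream node has moved past the bound version."""
    by_id = {n.id: n for n in nodes}
    return [lk for lk in links if by_id[lk.upstream].version > lk.bound_version]

nodes = [Node("THESIS-PRICING", version=4)]
links = [
    Link("REQ-BILLING-002", "THESIS-PRICING", bound_version=3),
    Link("REQ-BILLING-005", "THESIS-PRICING", bound_version=4),
]

# REQ-BILLING-002 was bound to v3 of the thesis; the thesis is now
# at v4, so the requirement is flagged for review.
drifted = stale_links(links, nodes)
```

Everything downstream of a changed node surfaces mechanically, without anyone remembering to go look.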
## What this looks like in practice
An agent is assigned to implement a billing feature. Instead of working from a one-line ticket, it reads a structured spec:
```shell
# Get the requirement — what exactly needs to be built
lattice get REQ-BILLING-002
# Traverse upstream — why does this requirement exist?
lattice search --related-to REQ-BILLING-002
# Check for drift — has anything changed since this was written?
lattice drift
```
The requirement links to a thesis about usage-based pricing. The thesis links to research comparing pricing models for developer tools. Every link records the version it was bound to.
If the pricing thesis changed last week — maybe early data showed usage-based pricing churns small accounts — `lattice drift` flags the three requirements derived from it. The agent knows to check before building on a potentially outdated assumption.
The agent still writes the code. But now it's writing code against a spec that traces back to evidence, not against a guess derived from pattern matching.
## The difference
Without structured specs, you're relying on the agent to infer intent from code and comments. It will infer *something* — and it will be locally valid. The code will compile. The tests will pass.
With structured specs, you've constrained intent before execution begins. The agent builds to a contract, not a hunch. And when the reasoning behind that contract changes, the system tells you which downstream work needs review.
Your agent doesn't have a skills problem. It has a context problem. Give it structured knowledge and it builds what you actually need.
## Get started
```shell
# Install Lattice
curl -fsSL https://forkzero.ai/lattice/install.sh | sh
# Initialize a knowledge graph in your project
lattice init --skill
```
- [Lattice on GitHub](https://github.com/forkzero/lattice)
- [Live dashboard](https://forkzero.ai/reader?url=https://forkzero.github.io/lattice/lattice-data.json)
- [Forkzero](https://forkzero.ai)