Why CLAUDE.md Stops Scaling

May 17, 2026

Originally published at mnemehq.com

CLAUDE.md is a good idea. A file that tells the model how to behave, what to avoid, which conventions to follow — better than nothing, clearly useful in early stages of a project. Most teams that use it report genuine improvement in baseline output quality.

It stops scaling at around the point where the project gets interesting.

This is not a criticism of Claude, of Anthropic, or of the teams using the file. It is a structural observation: CLAUDE.md is a context-injection mechanism, and context-injection mechanisms have a ceiling. The ceiling is not about file size. It is about what context injection cannot do, regardless of how well-written the file is.

What CLAUDE.md actually is

CLAUDE.md is a prompt. A very good prompt — one with a name, a file location, and a community norm around its contents — but a prompt. When the model reads it, what happens is identical to what happens when any text lands in the context window: the model attends to it, weights it against everything else in the window, and incorporates it into the response it generates.

That process is powerful and it is also the source of every scaling failure this article is going to describe.

Prompts do not enforce. They inform. The model’s behavior under a CLAUDE.md instruction is as reliable as the model’s behavior under any other instruction — which is to say: excellent under low-load conditions, degraded under high-load conditions, and structurally unable to provide the guarantees that architectural governance actually requires.

Failure mode 1: Context accretion

CLAUDE.md starts small. Teams add to it. By month six it is a dense, structured, internally cross-referencing document that covers everything from naming conventions to deployment philosophy to third-party dependency policy.

Every line added to CLAUDE.md competes with every other line for the model’s attention. The model does not read CLAUDE.md the way a human engineer reads a policy document — top-to-bottom, with careful retention, flagging conflicts for later resolution. It reads the entire context window at once, and its behavior is a function of that whole.

The practical result: long, dense CLAUDE.md files produce inconsistent behavior because different parts of the file dominate depending on what else is in the context window at query time. A team that added a new constraint in section 12 will find it reliably followed when the query is simple and largely ignored when the query is complex and the rest of the window is full.

This is not a bug. It is how transformers work.

Failure mode 2: No deterministic enforcement

The most important word in architectural governance is not. Not this library. Not this pattern. Not this approach in services that touch payment data.

CLAUDE.md can express “not” as an instruction. It cannot enforce it. Enforcement requires a system external to the model — a thing that checks output against constraints and rejects it when it fails. CLAUDE.md has no such system. The model is both the instruction-follower and the only checker, and a model that generated a violation does not reliably detect that it generated a violation.

Teams discover this when they find that the constraint they wrote in CLAUDE.md three months ago has been violated in eleven places, consistently, across multiple sessions, by the same model that read the constraint. The model did not forget. It was never in a position to enforce.

Failure mode 3: No decision provenance

Architectural decisions have histories. ADR-014 said “use Pydantic for validation” in January. ADR-031 said “do not use Pydantic for validation in the payments service” in March. The interaction between those two decisions — which one applies, in which scope, under which conditions — is load-bearing information for anyone generating code that touches the payments service.

CLAUDE.md cannot represent this. It is a flat document. It has no concept of decisions, scopes, supersession, or precedence. A team can write “do not use Pydantic in payments” in CLAUDE.md, but that instruction carries no provenance — no reason, no date, no relationship to the earlier instruction it overrides. When the context window is under pressure and something has to give, the instruction with no provenance gives first.

Failure mode 4: Poor scope resolution

A CLAUDE.md at the repository root applies globally. A CLAUDE.md in a subdirectory applies locally. This is a reasonable file-system approximation of scope, and it works until the architecture has scope rules that do not map cleanly to directory structure.

Real scope in a production codebase is not just about directory. It is about service boundaries, team ownership, compliance classifications, data sensitivity tiers. A payment processing module might live in the same directory as a logging utility and have completely different constraint sets. CLAUDE.md has no way to express this. The team either writes global rules that are too broad, or splits files in ways that fragment the governance and make it impossible to reason about from a single location.

Failure mode 5: Autonomous agent drift

In interactive coding, a human is reading every output. When the model violates a constraint, the human catches it (sometimes) and corrects it. CLAUDE.md was designed for this world — a world where there is always a human downstream of the output, closing the enforcement loop informally.

Autonomous agents change this. When the agent is writing code across multiple files in a single session, running tests, interpreting results, and writing more code based on those results, the human is out of the loop for the duration. Drift compounds. A constraint violation in the third file affects the code generated in the fifth file. By the time a human reviews the output, the violation is structural, not stylistic.

CLAUDE.md was not designed for this world. The teams discovering this are not doing anything wrong. They are running a governance mechanism designed for interactive sessions in an autonomous context, and finding that the informal enforcement loop the mechanism depended on is no longer there.

What the ceiling means in practice

None of the five failure modes above are catastrophic in a small codebase with two engineers and a simple architecture. CLAUDE.md works fine there. The ceiling becomes visible at specific thresholds:

When the number of architectural decisions in the project exceeds what a single document can express without internal contradiction
When multiple services have overlapping but non-identical constraints
When autonomous agents are generating code that nobody reviews at the file level
When the team needs to audit what constraint a given diff was generated under
When a new decision needs to override an old one in a specific scope

At each of these thresholds, the problem is not that CLAUDE.md was written poorly. It is that the mechanism cannot carry the load regardless of how it is written.

What comes after CLAUDE.md

The pattern that replaces CLAUDE.md is not a better CLAUDE.md. It is a different category of system: one that represents decisions as structured objects with identity, scope, and precedence; one that resolves conflicts between decisions deterministically before they reach the model; one that enforces at a hard boundary rather than relying on the model’s own compliance.

This is what architectural governance infrastructure means. CLAUDE.md is a useful on-ramp. It is not the destination.

The teams hitting the ceiling are not behind. They found the ceiling, which means they built enough to find it. The next step is to stop writing instructions into a context window and start building the layer that makes instructions enforceable.

Read this article in full at mnemehq.com/insights/why-claude-md-stops-scaling/

Discussion about this post

Ready for more?