The Mirage of Autonomous Code
Large language models have a seductive demo mode: you describe a feature, and within seconds an agent writes the code, passes a few tests, and declares victory. Those victories, however, often come in a vacuum. The experiments that produce them typically care about functional correctness — does the endpoint return the right JSON? — without asking whether the solution respects the architectural scaffolding that makes a real project maintainable and safe.
Production-grade backends live inside a thicket of structural constraints. Frameworks impose directory layouts, database models must follow an object-relational mapping (ORM) convention, and an API contract must have exactly one shape, not just any shape that works. When a coding agent ignores these rules, it can deliver a working prototype that quietly burdens the next team that tries to extend it. The paper “Constraint Decay: The Fragility of LLM Agents in Backend Code Generation” asks a simple, unsettling question: what happens when we grade agents not only on whether the code runs, but also on whether it respects the structure that real software demands?
When Architecture Bites Back
Backend generation is not a single-file puzzle. To build a modern web service you must simultaneously satisfy a web framework's idioms, a data layer with its own query rules, and a fixed API contract that the front-end team already expects. The researchers call this bundle of non-functional requirements structural constraints — rules about file organization, ORM usage, database schema mapping, and adherence to a pre-defined interface.
Their study isolates these pressures by fixing the API contract across 80 greenfield creation tasks and 20 feature-implementation tasks. Every task was executed inside one of eight popular web frameworks, from the minimalist Flask to the opinionated Django. For each generated solution they ran not only end-to-end behavioral tests (did the API respond correctly?) but also static verifiers that checked whether the code lived inside the required architectural fences.
“We present a systematic study evaluating how well agents handle structural constraints in multi-file backend generation.”
This dual lens — behavior plus architecture — reveals a gap that single-metric benchmarks conceal.
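To make the dual lens concrete, here is a minimal sketch of how a structural check might sit beside a behavioral verdict. It is not the paper's verifier: the rule (no raw database-driver imports), the `FORBIDDEN_MODULES` set, and the function names are all assumptions chosen for illustration, and only Python's standard `ast` module is used.

```python
import ast

# Hypothetical rule: handlers must go through the ORM, so importing a raw
# database driver counts as a structural violation. The module names below
# are illustrative assumptions, not the paper's rule set.
FORBIDDEN_MODULES = {"sqlite3", "psycopg2"}


def find_forbidden_imports(source: str) -> list:
    """Return the sorted forbidden top-level modules that `source` imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in FORBIDDEN_MODULES:
                found.add(root)
    return sorted(found)


def evaluate(source: str, behavioral_pass: bool) -> dict:
    """Report the behavioral verdict and the structural verdict side by side."""
    violations = find_forbidden_imports(source)
    return {
        "behavioral_pass": behavioral_pass,
        "structural_pass": not violations,
        "violations": violations,
    }


# A generated file that could satisfy an endpoint test while bypassing the ORM.
generated = "import sqlite3\n\ndef list_items():\n    return []\n"
report = evaluate(generated, behavioral_pass=True)
print(report)
```

In this toy case the behavioral verdict is a pass while the structural verdict is a fail, which is exactly the kind of disagreement a single-metric benchmark cannot surface.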

Constraint Decay: The Numbers Tell a Harsh Story
The headline result is a phenomenon the authors call constraint decay: as structural requirements accumulate, agent performance declines substantially.
The baseline tasks, lenient on structure, look promising. But when the full weight of a real backend specification is applied, capable agent configurations lose around 30 percentage points in assertion pass rate on average, and some weaker configurations collapse toward zero. In plain terms, when an agent must build under all the constraints of a production project, its output is far more likely to fail behavioral tests or be rejected by a strict reviewer.
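A drop measured in percentage points is an absolute difference between two pass rates, not a relative one. The short sketch below shows the arithmetic with made-up counts; the numbers are purely illustrative and are not results from the paper.

```python
def pass_rate(passed: int, total: int) -> float:
    """Assertion pass rate as a percentage."""
    return 100 * passed / total


# Made-up counts for illustration only; these are NOT figures from the paper.
baseline = pass_rate(80, 100)          # loosely specified task
fully_specified = pass_rate(50, 100)   # same task with every constraint applied

point_drop = baseline - fully_specified       # absolute drop, in percentage points
relative_drop = 100 * point_drop / baseline   # drop relative to the baseline, in percent

print(point_drop, relative_drop)
```

With these hypothetical counts, a 30-point drop corresponds to losing more than a third of the baseline pass rate, which is why the two ways of reporting a decline should not be confused.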
Framework sensitivity sharpens the picture. Agents thrive in explicit, thin frameworks like Flask, where there is little hidden convention to guess. In contrast, correctness plummets in convention-heavy environments such as FastAPI and Django, which expect the developer to follow implicit patterns for routing, serialization, and database integration. The agent, brilliant at generating code, stumbles when it must infer the invisible rules that human developers absorb from documentation and community culture.
The Anatomy of Failure
Digging into the wreckage, the researchers trace the largest share of failures to the data layer. Incorrect query composition and ORM runtime violations, where the generated data-access logic breaks the rules of the object-relational mapping, are the leading root causes. An agent might produce a clean route handler but then misconfigure the database session, or construct a filter expression that passes a narrow unit test yet breaks the ORM's lazy-loading rules at runtime. In a greenfield project, such a defect can stay invisible until performance degrades or a migration exposes a broken relationship.
“Error analysis identifies data-layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes.”
This concentration of pain is particularly worrying because the data layer sits at the very core of a backend. A functional test that returns correct JSON can be satisfied by a prototype that cheats the data model, while a static verifier that demands strict ORM conformance will tear that same prototype apart. The chasm between these two evaluation modes is where constraint decay lives.
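One way a query-composition defect can slip past a narrow test is an operator-precedence mistake in a filter. The sketch below uses raw SQL through Python's standard `sqlite3` module only to stay dependency-free; the `tasks` table, its columns, and the sample rows are hypothetical, and a similar mistake can arise when ORM filter expressions are combined without explicit grouping.

```python
import sqlite3

# Intended rule: return one owner's tasks that are either open or pending.
# In SQL, AND binds tighter than OR, so the buggy query also matches
# pending tasks that belong to anyone. Table and column names are hypothetical.
BUGGY = (
    "SELECT id FROM tasks "
    "WHERE owner = ? AND status = 'open' OR status = 'pending' ORDER BY id"
)
FIXED = (
    "SELECT id FROM tasks "
    "WHERE owner = ? AND (status = 'open' OR status = 'pending') ORDER BY id"
)


def task_ids(rows, query, owner):
    """Load `rows` into a fresh in-memory database and run `query` for `owner`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, owner TEXT, status TEXT)")
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)", rows)
    ids = [row[0] for row in conn.execute(query, (owner,))]
    conn.close()
    return ids


# A narrow fixture with a single owner cannot tell the two queries apart.
narrow = [(1, "ana", "open"), (2, "ana", "pending"), (3, "ana", "done")]
# Adding another owner's pending task exposes the leak.
broad = narrow + [(4, "bo", "pending")]

print(task_ids(narrow, BUGGY, "ana"), task_ids(narrow, FIXED, "ana"))
print(task_ids(broad, BUGGY, "ana"), task_ids(broad, FIXED, "ana"))
```

On the single-owner fixture both queries return the same rows, so a test built on it passes either way. Only the broader fixture shows the buggy query leaking another owner's task.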
Beyond the Greenfield
These findings are not a reason to abandon LLM coding agents; they are a map of where attention must go next. The results suggest that today's agents, even strong ones, treat structural rules as soft suggestions, and that performance degrades sharply once those rules become hard requirements.
For teams eager to adopt autonomous coding assistants, the practical warning is clear: ignoring architecture during generation is costly, and the cost shows up first in the data layer. The experiments suggest that minimal frameworks with fewer hidden conventions are safer playgrounds, while convention-rich ecosystems demand an agent that can internalize their patterns, not just mimic them.
“This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.”
The path ahead likely involves architectural prompting, verification-driven loops, and perhaps agents that treat a framework's structure as a first-class input rather than an afterthought. Until then, the demos will remain impressive, but the leap from a passing test to a production-ready backend will stay just out of reach.
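A verification-driven loop can be sketched in a few lines: generate, run a verifier, feed any violations back, and repeat. The version below is only an assumption about what such a loop might look like, not a method from the paper; `toy_generate` and `toy_verify` are stand-ins for a real agent and a real static verifier.

```python
from typing import Callable, List, Tuple


def verification_loop(
    generate: Callable[[List[str]], str],
    verify: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Regenerate until `verify` reports no violations or the rounds run out."""
    feedback: List[str] = []
    code = ""
    for _ in range(max_rounds):
        code = generate(feedback)   # the generator sees the previous violations
        feedback = verify(code)     # an empty list means the code is structurally clean
        if not feedback:
            break
    return code, feedback


# Hypothetical stand-ins: a "generator" that only fixes its import once it
# is told about the violation, and a "verifier" with a single rule.
def toy_generate(feedback: List[str]) -> str:
    return "from orm import session" if feedback else "import sqlite3"


def toy_verify(code: str) -> List[str]:
    return ["raw driver import: sqlite3"] if "sqlite3" in code else []


final_code, remaining = verification_loop(toy_generate, toy_verify)
print(final_code, remaining)
```

The design choice worth noting is that the verifier's output is the generator's next input, so structural rules act as hard gates on the loop rather than as hints in the prompt.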



