I’m looking for architectural feedback on a reproducible failure case with LLM‑based agents.
Problem:
I repeatedly asked LLM agents to generate a correct page imposition for a 12‑page A6 booklet printed as a saddle‑stitched brochure on A4.
This is a deterministic, rule‑based task with exactly one correct page order.
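For context, the deterministic rule itself is small. Below is a minimal sketch of the classic saddle‑stitch pairing (assuming the page count is a multiple of 4); the physical A6‑on‑A4 case additionally needs 4‑up sheet placement and rotation, which I'm omitting because it depends on the finishing workflow:

```python
# Minimal sketch of the deterministic kernel: classic saddle-stitch pairing
# for an N-page booklet (N divisible by 4). Gives the reading-order -> spread
# pairing only; 4-up placement/rotation on the A4 sheet is not reproduced here.

def saddle_stitch_spreads(n_pages: int) -> list[tuple[tuple[int, int], tuple[int, int]]]:
    """Return (front, back) page pairs for each folded sheet, outermost sheet first."""
    if n_pages % 4 != 0:
        raise ValueError("saddle-stitched booklets need a page count divisible by 4")
    spreads = []
    for i in range(n_pages // 4):
        front = (n_pages - 2 * i, 1 + 2 * i)      # e.g. (12, 1) on the outer sheet
        back = (2 + 2 * i, n_pages - 1 - 2 * i)   # e.g. (2, 11)
        spreads.append((front, back))
    return spreads

print(saddle_stitch_spreads(12))
# [((12, 1), (2, 11)), ((10, 3), (4, 9)), ((8, 5), (6, 7))]
```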
Observation:
Across ~200 independent attempts (different prompts, temperatures, and agent setups), not a single result was correct or directly printable.
The failures were not random: the same incorrect page‑order assumptions appeared repeatedly.
Increasing the number of agents, adding critique/review loops, or using majority voting did not improve correctness.
All agents converged on structurally wrong solutions.
Hypothesis:
This looks less like a prompt‑quality issue and more like a categorical mismatch between probabilistic language models and tasks that require strict symbolic correctness and verification.
Proposed architectural pattern (for discussion):
For this class of problems, my working assumption is that LLMs should not be allowed to directly “solve” the task.
Instead, they should operate inside a constrained architecture, for example:
- A deterministic core (rule engine / symbolic kernel) that defines what is valid
- A simulator or validator that can unambiguously accept or reject a candidate
- LLMs acting only as planners, explainers, or parameter selectors
- No LLM output is ever considered final without passing formal validation
In other words: separating “reasoning in natural language” from “execution in a formal system”.
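To make the separation concrete, here is a minimal sketch of the control flow I have in mind. It reuses `saddle_stitch_spreads()` from the sketch above; `llm_propose_page_order` is a hypothetical placeholder for whatever agent call produces a candidate, not a real library API:

```python
# Minimal sketch of the validator-gated pattern: the LLM may propose,
# but nothing becomes final until the deterministic kernel accepts it.

def expected_page_order(n_pages: int) -> list[int]:
    """Deterministic kernel output flattened into print order (outermost sheet first)."""
    order: list[int] = []
    for front, back in saddle_stitch_spreads(n_pages):
        order.extend(front)
        order.extend(back)
    return order

def validate(candidate: list[int], n_pages: int) -> bool:
    """Unambiguous accept/reject: the candidate must match the rule-derived order exactly."""
    return candidate == expected_page_order(n_pages)

def impose(n_pages: int, llm_propose_page_order, max_attempts: int = 3) -> list[int]:
    """The LLM acts only as a proposer; only validated output is ever returned as final."""
    for _ in range(max_attempts):
        candidate = llm_propose_page_order(n_pages)   # LLM as planner / parameter selector
        if validate(candidate, n_pages):
            return candidate
    # No validated LLM proposal: fall back to the symbolic kernel itself.
    return expected_page_order(n_pages)
```

In this toy version the kernel already knows the answer, so the LLM's role is degenerate; in a realistic pipeline the kernel would define the validity constraints and the LLM would only select among valid parameterizations (paper size, binding style, page count), never the imposition itself.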
Question:
Have you seen or designed agent architectures that explicitly enforce this separation?
Are there established patterns or references where LLMs are intentionally prevented from directly solving deterministic sub‑problems and instead operate under symbolic or formally verified constraints?
I’m explicitly interested in architectural patterns and failure boundaries, not prompt tuning.