We've spent the last 18 months wiring tool-using LLM agents into real production systems — ticketing, dispatch, freight matching, invoice reconciliation. The shape of those agents has changed drastically. The biggest change isn't the model; it's that we've stopped writing elaborate prompts.
Here's the structure that has actually held up.
A loop, not a prompt
The mental model that broke us early was thinking of an agent as "a clever prompt that returns a JSON action". That works for one-shot demos. It collapses the moment you need the agent to (a) recover from a tool error, (b) explain itself afterwards, and (c) be debugged by the operations team — not by us.
Instead, every production agent we ship now has the same five-part loop:
- Observe — read the world (DB rows, last 10 messages, current ticket).
- Reason — a single LLM call, with a tight system prompt and a structured output schema.
- Act — invoke at most one tool, with arguments validated against a Zod schema.
- Record — write the observe-reason-act triple to a trace table, including the model's verbatim output, the tool result, and any error.
- Decide whether to continue — usually a hard cap (e.g. ≤ 12 steps per task) plus a structured "done" condition.
That's it. There's no clever prompt. The system prompt is rarely more than 200 tokens.
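Here's roughly what the loop looks like in TypeScript. It's a minimal sketch, not our actual code: `observe`, `callModel`, `tools` and `recordTrace` are stand-ins for whatever your system provides, and the step schema is illustrative.

```ts
import { z } from "zod";

// What the model must return each step: one thought, at most one tool call,
// or tool: null to signal "done". (Names here are illustrative.)
const Step = z.object({
  thought: z.string(),
  tool: z.string().nullable(),
  args: z.record(z.string(), z.unknown()).default({}),
});

type Tool = (args: Record<string, unknown>) => Promise<unknown>;

interface AgentDeps {
  observe: (taskId: string) => Promise<unknown>;        // read the world
  callModel: (world: unknown) => Promise<unknown>;      // one LLM call, structured output
  tools: Record<string, Tool>;
  recordTrace: (row: Record<string, unknown>) => Promise<void>;
}

const MAX_STEPS = 12; // hard cap per task

export async function runTask(taskId: string, deps: AgentDeps): Promise<void> {
  for (let step = 1; step <= MAX_STEPS; step++) {
    const world = await deps.observe(taskId);                // Observe
    const action = Step.parse(await deps.callModel(world));  // Reason

    let result: unknown = null;
    let error: string | null = null;
    if (action.tool) {
      try {
        result = await deps.tools[action.tool](action.args); // Act: at most one tool
      } catch (e) {
        error = String(e); // tool errors go into the trace, not up the stack
      }
    }

    await deps.recordTrace({ taskId, step, world, action, result, error }); // Record

    if (!action.tool) return;                                // Decide: structured "done"
  }
  // Step cap reached without a "done": stop and let a human pick the task up.
}
```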
Why this beats prompt engineering
When something goes wrong, you can read the trace as a sentence. "At step 3, the agent called fetch_loads with region: 'unknown'. The tool returned an empty array. The agent then retried fetch_loads with region: 'GB' and succeeded." Operators understand this. Founders understand this. We don't have to be in the room.
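In data terms, that sentence is just two trace rows. The field names below are made up for illustration; the content is the example above.

```ts
const trace = [
  {
    step: 3,
    tool: "fetch_loads",
    args: { region: "unknown" },
    result: [],          // the tool returned an empty array
    error: null,
  },
  {
    step: 4,
    tool: "fetch_loads",
    args: { region: "GB" },
    result: ["…"],        // succeeded on the retry
    error: null,
  },
];
```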
The other thing that beats prompt engineering: tool design. A well-named tool with a small, validated schema does more work than a paragraph in the system prompt. If you find yourself adding "do X" to the prompt, you almost always want a tool that only does X.
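For example (names hypothetical): instead of adding "escalate angry customers to a human" to the system prompt, we'd ship a tool that only does that. The name and schema do the instructing.

```ts
import { z } from "zod";

// Hypothetical single-purpose tool: a descriptive name and a small schema
// replace a paragraph of prompt instructions.
export const escalateToHuman = {
  name: "escalate_to_human",
  description: "Hand the current ticket to a human operator.",
  schema: z.object({
    ticketId: z.string(),
    reason: z.enum(["customer_angry", "outside_policy", "repeated_tool_failure"]),
  }),
};
```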
Practical rules we apply now
- Tools always validate their inputs with Zod and surface the validation error back into the trace as a tool result. The agent learns from the error and retries (see the sketch after this list).
- The model never picks a date or a number from natural language. Tools that need dates receive ISO strings produced by a deterministic helper.
- We bias toward many small tools rather than one configurable monster. Smaller tools type-narrow better and read more like English in the trace.
- Every step is checkpointed. Restartability matters more than speed.
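A sketch of the first two rules, assuming a hypothetical fetch_loads backed by a queryLoads query: the Zod error comes back as an ordinary tool result so it lands in the trace, and dates are ISO strings produced by a deterministic helper rather than by the model.

```ts
import { z } from "zod";

const FetchLoadsArgs = z.object({
  region: z.enum(["GB", "IE", "FR", "DE"]),               // no free-text regions
  pickupDate: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),    // ISO date, see helper below
});

declare function queryLoads(args: z.infer<typeof FetchLoadsArgs>): Promise<unknown>;

// Rule 1: invalid arguments come back as a normal tool result, so the error
// is written to the trace and the model can retry with corrected arguments.
export async function fetchLoads(rawArgs: unknown): Promise<unknown> {
  const parsed = FetchLoadsArgs.safeParse(rawArgs);
  if (!parsed.success) {
    return {
      error: parsed.error.issues.map((i) => `${i.path.join(".")}: ${i.message}`),
    };
  }
  return queryLoads(parsed.data);
}

// Rule 2: the model never produces the date itself; a deterministic helper
// turns "N days from now" into an ISO string.
export function isoDateFromOffset(daysFromNow: number, now: Date = new Date()): string {
  const d = new Date(now);
  d.setUTCDate(d.getUTCDate() + daysFromNow);
  return d.toISOString().slice(0, 10);
}
```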
The result is agents that are dull, predictable and easy to hand over. We've handed several to client ops teams who don't write code; the trace is their debugger.
What we'd do differently
Earlier this year we built one agent without that strict loop — it freelanced its way between tools and chained 30+ steps to do anything. It was fast in demos, intractable in production. We rebuilt it. The new version takes 3× more steps for any single task, but the team can read what it did and trust it. That tradeoff has won every time.
If you're starting on agents now, write the loop first. The prompt will follow.