We've spent the last 18 months wiring tool-using LLM agents into real production systems — ticketing, dispatch, freight matching, invoice reconciliation. The shape of those agents has changed drastically. The biggest change isn't the model; it's that we've stopped writing elaborate prompts.
Here's the structure that has actually held up.
A loop, not a prompt
The mental model that broke us early was thinking of an agent as "a clever prompt that returns a JSON action". That works for one-shot demos. It collapses the moment you need the agent to (a) recover from a tool error, (b) explain itself afterwards, and (c) be debugged by the operations team — not by us.
Instead, every production agent we ship now has the same five-part loop:
- Observe — read the world (DB rows, last 10 messages, current ticket).
- Reason — a single LLM call, with a tight system prompt and a structured output schema.
- Act — invoke at most one tool, with arguments validated against a Zod schema.
- Record — write the observe-reason-act triple to a trace table, including the model's verbatim output, the tool result, and any error.
- Decide whether to continue — usually a hard cap (e.g. ≤ 12 steps per task) plus a structured "done" condition.
That's it. There's no clever prompt. The system prompt is rarely more than 200 tokens.
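Here's roughly what the loop looks like in TypeScript. It's a minimal sketch, not our actual code: `observe`, `callModel`, `tools` and `recordTrace` are stand-ins for whatever your system provides, and the step schema is illustrative.

```ts
import { z } from "zod";

// What the model must return each step: one thought, at most one tool call,
// or tool: null to signal "done". (Names here are illustrative.)
const Step = z.object({
  thought: z.string(),
  tool: z.string().nullable(),
  args: z.record(z.string(), z.unknown()).default({}),
});

type Tool = (args: Record<string, unknown>) => Promise<unknown>;

interface AgentDeps {
  observe: (taskId: string) => Promise<unknown>;        // read the world
  callModel: (world: unknown) => Promise<unknown>;      // one LLM call, structured output
  tools: Record<string, Tool>;
  recordTrace: (row: Record<string, unknown>) => Promise<void>;
}

const MAX_STEPS = 12; // hard cap per task

export async function runTask(taskId: string, deps: AgentDeps): Promise<void> {
  for (let step = 1; step <= MAX_STEPS; step++) {
    const world = await deps.observe(taskId);                // Observe
    const action = Step.parse(await deps.callModel(world));  // Reason

    let result: unknown = null;
    let error: string | null = null;
    if (action.tool) {
      try {
        result = await deps.tools[action.tool](action.args); // Act: at most one tool
      } catch (e) {
        error = String(e); // tool errors go into the trace, not up the stack
      }
    }

    await deps.recordTrace({ taskId, step, world, action, result, error }); // Record

    if (!action.tool) return;                                // Decide: structured "done"
  }
  // Step cap reached without a "done": stop and let a human pick the task up.
}
```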
Why this beats prompt engineering
When something goes wrong, you can read the trace as a sentence. "At step 3, the agent called fetch_loads with region: 'unknown'. The tool returned an empty array. The agent then retried fetch_loads with region: 'GB' and succeeded." Operators understand this. Founders understand this. We don't have to be in the room.
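In data terms, that sentence is just two trace rows. The field names below are made up for illustration; the content is the example above.

```ts
const trace = [
  {
    step: 3,
    tool: "fetch_loads",
    args: { region: "unknown" },
    result: [],          // the tool returned an empty array
    error: null,
  },
  {
    step: 4,
    tool: "fetch_loads",
    args: { region: "GB" },
    result: ["…"],        // succeeded on the retry
    error: null,
  },
];
```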
The other thing that beats prompt engineering: tool design. A well-named tool with a small, validated schema does more work than a paragraph in the system prompt. If you find yourself adding "do X" to the prompt, you almost always want a tool that only does X.
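For example (names hypothetical): instead of adding "escalate angry customers to a human" to the system prompt, we'd ship a tool that only does that. The name and schema do the instructing.

```ts
import { z } from "zod";

// Hypothetical single-purpose tool: a descriptive name and a small schema
// replace a paragraph of prompt instructions.
export const escalateToHuman = {
  name: "escalate_to_human",
  description: "Hand the current ticket to a human operator.",
  schema: z.object({
    ticketId: z.string(),
    reason: z.enum(["customer_angry", "outside_policy", "repeated_tool_failure"]),
  }),
};
```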
Practical rules we apply now
- Tools always validate their inputs with Zod and surface the validation error back into the trace as a tool result. The agent learns from the error and retries (see the sketch after this list).
- The model never picks a date or a number from natural language. Tools that need dates receive ISO strings produced by a deterministic helper.
- We bias toward many small tools rather than one configurable monster. Smaller tools type-narrow better and read more like English in the trace.
- Every step is checkpointed. Restartability matters more than speed.
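A sketch of the first two rules, assuming a hypothetical fetch_loads backed by a queryLoads query: the Zod error comes back as an ordinary tool result so it lands in the trace, and dates are ISO strings produced by a deterministic helper rather than by the model.

```ts
import { z } from "zod";

const FetchLoadsArgs = z.object({
  region: z.enum(["GB", "IE", "FR", "DE"]),               // no free-text regions
  pickupDate: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),    // ISO date, see helper below
});

declare function queryLoads(args: z.infer<typeof FetchLoadsArgs>): Promise<unknown>;

// Rule 1: invalid arguments come back as a normal tool result, so the error
// is written to the trace and the model can retry with corrected arguments.
export async function fetchLoads(rawArgs: unknown): Promise<unknown> {
  const parsed = FetchLoadsArgs.safeParse(rawArgs);
  if (!parsed.success) {
    return {
      error: parsed.error.issues.map((i) => `${i.path.join(".")}: ${i.message}`),
    };
  }
  return queryLoads(parsed.data);
}

// Rule 2: the model never produces the date itself; a deterministic helper
// turns "N days from now" into an ISO string.
export function isoDateFromOffset(daysFromNow: number, now: Date = new Date()): string {
  const d = new Date(now);
  d.setUTCDate(d.getUTCDate() + daysFromNow);
  return d.toISOString().slice(0, 10);
}
```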
The result is agents that are dull, predictable and easy to hand over. We've handed several to client ops teams who don't write code; the trace is their debugger.
What we'd do differently
Earlier this year we built one agent without that strict loop — it freelanced its way between tools and chained 30+ steps to do anything. It was fast in demos, intractable in production. We rebuilt it. The new version takes 3× more steps for any single task, but the team can read what it did and trust it. That tradeoff has won every time.
If you're starting on agents now, write the loop first. The prompt will follow.