Agentic systems: what OpenAI and Anthropic actually tell you to build

The word “agent” has been doing a lot of work lately. In a year it has meant a chatbot with a tool call, a RAG pipeline with a router, a looping planner that edits code, and a swarm of cooperating LLMs negotiating a business process. When a term covers that much territory, it usually means people are shipping different things under the same name and then disagreeing about whether “agents work.”

The two most-cited guides on the subject — Anthropic’s Building effective agents and OpenAI’s A practical guide to building agents — agree on more than their framings suggest. They disagree on vocabulary, overlap on patterns, and arrive at the same piece of advice from opposite directions: do not build an agent when a workflow will do, and do not build a workflow when a single prompt will do. The interesting content in both guides is the machinery underneath that advice — what the patterns are, when each one earns its complexity, and what keeps the whole thing from melting down in production.

This post is a synthesis. It follows Anthropic’s pattern taxonomy for the mechanics and OpenAI’s foundations-and-guardrails framing for the operational side, because each guide is stronger on one half.

The distinction that matters: workflows vs. agents

Anthropic’s most useful contribution is separating two things the industry had been conflating. A workflow is a system where an LLM is orchestrated through predefined code paths — the control flow is in your code, the LLM fills in the blanks. An agent is a system where the LLM itself decides what to do next, picks tools, observes results, and loops until it thinks it is done — the control flow is in the model.

The difference is not cosmetic. A workflow’s behavior is bounded by the graph you wrote. An agent’s behavior is bounded by the model’s judgment and the tools you exposed. That bound is larger, more capable, and substantially harder to reason about. Agents cost more per task (more tokens, more tool calls), incur more latency (turns are sequential), and fail in ways that are not always reproducible. A workflow that mis-routes a request is a bug you can trace. An agent that takes a weird seven-step path to the wrong answer is a behavior you have to characterize statistically.

So the first decision is not “which agent framework” but “do I actually need the model deciding control flow, or do I need something predictable that happens to use a model?” Most production “agent” systems are workflows. Most successful workflows started with a single prompt and only grew into a graph when the single prompt demonstrably stopped being enough.

Workflow pattern 1: prompt chaining

Break a task into fixed sequential steps, each prompt consuming the previous output. Use this when the task decomposes cleanly and you want to spend latency to buy accuracy.

The reason chaining works is the same reason long single prompts often fail: each step has a narrower job, so the model has fewer ways to go wrong, and the intermediate output is inspectable. If step two produces nonsense, you can see the step-one output that caused it. A three-stage pipeline — outline, draft, polish — is usually more reliable than a single “write me an essay” prompt, and the reliability is worth the extra turns.

Two things to add between steps that a single prompt cannot have: validation gates (a structural or semantic check that rejects outputs that do not meet requirements) and repair prompts (a follow-up that asks the model to fix the specific defect). Gates turn a chain from “maybe works” into “works or fails loudly,” which is the behavior you want in production.
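A chain with gates and repair prompts can be sketched in a few lines. This is illustrative, not any vendor’s API: `call_llm` is a stub standing in for a real model client, and the gate shown is a trivial structural check.

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real model API call; swap in your client here.
    return f"output for: {prompt[:30]}"

def gate_nonempty(text: str) -> bool:
    # A structural validation gate: reject empty or whitespace-only output.
    # Real gates check schemas, lengths, required fields, etc.
    return bool(text.strip())

def run_chain(task: str) -> str:
    steps = [
        "Outline the following piece: {input}",
        "Draft the piece from this outline: {input}",
        "Polish this draft for tone and concision: {input}",
    ]
    current = task
    for step in steps:
        out = call_llm(step.format(input=current))
        if not gate_nonempty(out):
            # Repair prompt: name the specific defect and retry once.
            out = call_llm("The previous output was empty. " + step.format(input=current))
            if not gate_nonempty(out):
                # Fail loudly rather than pass garbage downstream.
                raise ValueError(f"Step failed after repair: {step!r}")
        current = out
    return current
```

The important property is that a defect stops the chain at the step that produced it, with the offending intermediate output in hand.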

Workflow pattern 2: routing

Classify the input once, then dispatch to a specialized prompt or model. Use this when distinct input categories benefit from different treatment — refund queries versus technical support, simple questions that a small model can handle versus complex ones that need a frontier model.

Routing is where cost and latency savings live. A router running Haiku that correctly sends 80% of traffic to Haiku and 20% to Opus is dramatically cheaper than running Opus on everything, and often no less accurate — many queries do not need a frontier model.

The failure mode of routing is that the router itself has to be reliable, and a confused router is a silent bug: the downstream prompt does its best with the wrong input and produces a confident-sounding wrong answer. Instrument the routing decision. Keep the category set small. Add a “don’t know” route that escalates rather than guesses.
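A minimal router with those three properties — small category set, instrumented decision, explicit don’t-know route — might look like this. The keyword classifier is a stand-in; in practice `classify` is a cheap model call constrained to the category set.

```python
ROUTES = {"refund", "technical", "unknown"}

def classify(query: str) -> str:
    # Stand-in router. A real one is a small-model call that must return
    # one of ROUTES; it defaults to "unknown" rather than guessing.
    q = query.lower()
    if "refund" in q or "charge" in q:
        return "refund"
    if "error" in q or "crash" in q:
        return "technical"
    return "unknown"

def route(query: str) -> str:
    category = classify(query)
    assert category in ROUTES  # instrument and validate the routing decision
    if category == "unknown":
        return "escalate: human review"  # the don't-know route escalates
    return f"dispatch to {category} handler"
```

Logging the `category` alongside the query is what makes silent mis-routes visible later.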

Workflow pattern 3: parallelization

Two variants, same idea. Sectioning splits a task into independent subtasks that run in parallel and get stitched back together. Voting runs the same task multiple times and aggregates — majority vote, best-of-N, or a separate judge prompt.

Sectioning is the pattern behind “map N documents in parallel, then summarize,” or “review this code for security, performance, and style in parallel.” It pays off whenever subtasks are genuinely independent, because concurrency buys back latency you would otherwise spend sequentially.

Voting pays off when a single sample has nontrivial error rate and the errors are uncorrelated. Three independent vulnerability scans catch more bugs than one, because each run attends to different things. The trap is assuming errors are uncorrelated when they are not — the same model, prompted the same way, with the same input, is not three independent experts. You get independence by varying the prompt, the temperature, or the model.
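A voting sketch, with the model call injected so each variant can use a different prompt, temperature, or model. Names are illustrative.

```python
from collections import Counter

# Varying the prompt is one of the levers for decorrelating errors.
PROMPT_VARIANTS = [
    "Answer concisely: {q}",
    "Think step by step, then answer: {q}",
    "Answer as a domain expert: {q}",
]

def majority_vote(answers: list[str]) -> str:
    # Aggregate by simple majority; best-of-N or a judge prompt also work.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

def vote(question: str, sample) -> str:
    # sample(prompt) -> str is the model call, injected so callers can
    # bind different temperatures or models per variant.
    answers = [sample(p.format(q=question)) for p in PROMPT_VARIANTS]
    return majority_vote(answers)
```

Running the three variants concurrently (the sectioning half of the pattern) keeps wall-clock latency near that of a single call.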

Workflow pattern 4: orchestrator-workers

A central LLM decomposes a task into subtasks at runtime, dispatches them to worker LLMs, and synthesizes the results. Unlike sectioning, the decomposition is not fixed in advance — the orchestrator decides what the subtasks are based on the input.

This is the first pattern in the list that starts to look like an agent. The orchestrator is making a control-flow decision the code does not know in advance. The reason it is still a workflow is that the orchestration is one hop: decompose, fan out, collect, return. There is no loop. There is no “the orchestrator observes the worker outputs and decides to dispatch more workers.”

The canonical use case is multi-file code changes: the orchestrator reads the task, figures out which files need editing, dispatches per-file edit workers, and aggregates the diff. A fixed pipeline cannot do this because the file list is input-dependent. A full agent loop is overkill because once the orchestrator has the plan, execution is bounded.
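The one-hop shape — decompose, fan out, collect, return — can be made concrete. `plan` and `edit_file` are stubs standing in for the orchestrator and worker model calls; note there is no loop back to `plan`.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(task: str) -> list[str]:
    # Stand-in orchestrator: a real one asks a model which files need edits.
    return ["app.py", "utils.py"]

def edit_file(task: str, path: str) -> str:
    # Stand-in worker: a real one asks a model for a per-file edit.
    return f"diff for {path}"

def orchestrate(task: str) -> str:
    files = plan(task)                     # runtime decomposition
    with ThreadPoolExecutor() as pool:     # fan out workers concurrently
        diffs = list(pool.map(lambda p: edit_file(task, p), files))
    return "\n".join(diffs)                # synthesize once, then stop
```

The workflow property lives in `orchestrate`’s shape: the orchestrator never observes the diffs and dispatches more workers.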

Workflow pattern 5: evaluator-optimizer

Generator produces output. Evaluator (another LLM, or a deterministic check) judges it against criteria. If the judgment is “not good enough,” the generator tries again with the feedback. Loop until accepted or a budget runs out.

This pattern earns its complexity only when two conditions hold: clear evaluation criteria (otherwise the evaluator is guessing) and demonstrable improvement on revision (otherwise you are paying for a loop that does not converge). Translation with nuance, complex writing, multi-round research — these tend to improve. Factual question-answering typically does not: if the first answer was wrong, the model is unlikely to discover it was wrong through self-critique.

The common mistake is using the generator as its own evaluator on the same prompt. Self-critique in the same context window is weak; the model tends to ratify what it just produced. Use a separate prompt, ideally a separate role (“you are a strict editor”), and give the evaluator access to criteria the generator did not see.

Where agents start

Past the orchestrator-workers pattern, you start entering agent territory — systems where the LLM, not your code, is in the driver’s seat. Anthropic’s guide is careful here: autonomous agents are for open-ended problems where you cannot predict the path in advance and a human is not going to be cheap to include on every step. OpenAI’s guide is blunter. Build an agent when you have one of three conditions:

  • Complex decisions. The task requires judgment, nuance, or context handling that rule-based systems mangle.
  • Brittle rule systems. You have an existing automation that breaks on edge cases and the rules are getting too expensive to maintain.
  • Heavy reliance on unstructured data. The inputs are natural language, images, or messy documents, not neat records.

If none of these apply, you do not need an agent. A workflow, a retrieval pipeline, or a single well-prompted call will be cheaper, faster, and easier to operate. The decision to use an agent is a decision to accept higher variance in exchange for capability you cannot get otherwise.

OpenAI’s three foundations

Once you have decided an agent is warranted, OpenAI’s guide reduces the design problem to three components. They are obvious in retrospect and frequently botched in practice.

Model. Pick the smallest capable model that meets the task’s accuracy bar, not the largest one available. Start with a frontier model to establish that the task is doable at all; then try smaller models at each step to see where you can substitute. A mature agent pipeline is usually heterogeneous — a frontier model on the reasoning-heavy steps, smaller models on routing and extraction. The inverse mistake — frontier everywhere — burns money and adds latency without adding reliability.

Tools. Agents are only as capable as the tools you give them. The guide splits tools into three rough categories: data tools (read state from the world — query a database, fetch a document), action tools (change state in the world — send an email, create a ticket), and orchestration tools (other agents, exposed as tools the current agent can call). The distinction matters because the risk profile differs. A data tool that misfires wastes tokens. An action tool that misfires sends the wrong email to the wrong customer.

The subtler tool trap is the interface. LLMs use tools through natural-language descriptions of parameters and behavior. A tool with an opaque schema, ambiguous field names, or under-documented semantics will be misused — not because the model is unable to read the spec, but because the spec is not specific enough. Treat tool definitions as prompts in their own right. Name parameters for humans. Write descriptions that anticipate the mistakes you have already seen the model make.
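What “tool definitions as prompts” looks like in practice: the description pre-empts known mistakes (wrong units, wrong use case) instead of just restating the name. The schema shape follows the common JSON-Schema tool format; the tool and its fields are illustrative, not any specific vendor’s API.

```python
# Hypothetical refund tool definition, written to be read by a model.
ISSUE_REFUND = {
    "name": "issue_refund",
    "description": (
        "Issue a refund to the customer on the CURRENT order only. "
        "Do not use this for exchanges or store credit. "
        "Amounts are in cents, not dollars."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The order being refunded, e.g. 'ord_123'.",
            },
            "amount_cents": {
                "type": "integer",
                "description": "Refund amount in CENTS. $10.00 is 1000.",
            },
        },
        "required": ["order_id", "amount_cents"],
    },
}
```

The cents/dollars warning is the kind of line you add after the first time the model refunds $1000 instead of $10.00.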

Instructions. The system prompt is the contract. It should state the agent’s role, the task, the constraints, the tools available and when to use them, the format of the final answer, and what to do when things go wrong. The failure mode is either too little (a vague persona and a goal, no procedure) or too much (a wall of edge-case rules the model cannot hold in mind at once). Break long procedures into numbered steps. Externalize what can be externalized — if a rule is checkable in code, enforce it in code, not in the prompt. Derive instructions from existing SOPs where they exist; your support agents already know the edge cases.

Orchestration: one agent, then maybe more

Both guides recommend starting with a single-agent loop and only adding multi-agent structure when complexity forces it. The single-agent loop is the canonical agent: system prompt, tools, run-until-done. The model receives input, decides on a tool call, the runtime executes it, the result goes back into context, and the model decides again. Termination is either a structured “final answer” output, a max-turn budget, or an explicit stop condition.
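The loop described above fits in a dozen lines. `step` is a stub for the model call (returning either a tool call or a final answer) and the tool table is illustrative; the structure — decide, execute, append observation, repeat under a budget — is the part that carries over.

```python
def step(context: list[str]) -> dict:
    # Stand-in for the model: returns a tool call until it has seen a
    # tool result, then a structured final answer.
    if any("tool_result" in c for c in context):
        return {"type": "final", "answer": "done"}
    return {"type": "tool", "name": "lookup", "args": "query"}

TOOLS = {"lookup": lambda args: f"tool_result: {args}"}  # illustrative

def run_agent(task: str, max_turns: int = 8) -> str:
    context = [task]
    for _ in range(max_turns):
        decision = step(context)
        if decision["type"] == "final":       # structured stop condition
            return decision["answer"]
        result = TOOLS[decision["name"]](decision["args"])
        context.append(result)                # observation back into context
    return "escalate: turn budget exhausted"  # the budget is a hard stop
```

Everything else in an agent runtime — guardrails, approvals, tracing — attaches to the two seams in this loop: before the tool executes, and before the final answer leaves.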

When a single agent’s prompt gets too long and too branchy to reliably follow, two multi-agent patterns are on offer.

The manager pattern (OpenAI’s framing, similar to Anthropic’s orchestrator-workers but with a loop) keeps one agent in charge and exposes the other agents as tools. The manager calls a specialist agent the same way it would call a database. This preserves a single control loop — the manager decides when the task is done — and makes each specialist’s role small and testable. The cost is an extra hop per specialist call and some duplicated context.
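Structurally, the manager pattern is just specialists sitting in the manager’s tool table. A sketch with stand-in specialists and a hard-coded manager policy (a real manager is itself a model choosing among the tools):

```python
def research_agent(query: str) -> str:
    # Stand-in for a full sub-agent run with its own prompt and tools.
    return f"findings on {query}"

def writer_agent(notes: str) -> str:
    return f"report: {notes}"

SPECIALISTS = {
    "research": research_agent,  # specialists exposed exactly like
    "write": writer_agent,       # any other tool the manager can call
}

def manager(task: str) -> str:
    # Stand-in manager policy; the single control loop lives here,
    # and only the manager decides when the task is done.
    notes = SPECIALISTS["research"](task)
    return SPECIALISTS["write"](notes)
```

Because each specialist has a plain function signature, it can be tested in isolation before it is ever wired into the manager.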

The decentralized pattern (OpenAI’s “handoffs”) lets agents transfer control to one another. Each agent is equipped with a “handoff to X” capability. When a conversation needs a different specialist, the current agent hands off; the new agent takes over the conversation. This is closer to a phone tree than a tree of subroutines, and it works well when the task is genuinely a sequence of specialist phases (triage → diagnosis → resolution) rather than a central task that occasionally calls out.
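A handoff runtime differs from the manager in one way: an agent’s return value can be a transfer of control rather than an answer, and the new agent takes over the whole conversation. Agent names and routing logic below are illustrative stubs.

```python
def triage(conversation: list[str]):
    # Stand-in triage agent: route billing topics to the specialist.
    if "bill" in conversation[-1].lower():
        return ("handoff", "billing")
    return ("answer", "How can I help?")

def billing(conversation: list[str]):
    # Stand-in billing specialist; it now owns the conversation.
    return ("answer", "Refund processed.")

AGENTS = {"triage": triage, "billing": billing}

def run(conversation: list[str], agent: str = "triage", max_hops: int = 5) -> str:
    for _ in range(max_hops):
        kind, value = AGENTS[agent](conversation)
        if kind == "answer":
            return value
        agent = value  # control transfers; the new agent takes the thread
    return "escalate: too many handoffs"  # guard against ping-ponging
```

The `max_hops` guard matters: without it, two agents that each believe the other should handle the case will hand off forever.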

The pragmatic advice in both guides: do not start here. Single agent with well-scoped tools first. Split when the single agent’s prompt becomes unmaintainable, not before.

Guardrails: layered, not monolithic

OpenAI’s guardrails section is the one part of either guide that genuinely feels like an operations manual. The framing is that no single check catches everything, so you layer cheap checks in front of expensive ones and hard stops around the most consequential actions.

A practical guardrail stack, loosely in order of where the input is in the pipeline:

  • Relevance classifier. Is the request within the agent’s scope? An off-topic request is rejected before it consumes a turn.
  • Safety classifier. Prompt-injection attempts, jailbreaks, obvious abuse. Cheap to run, costly to skip.
  • PII filter. Does the input contain data the agent should not see or should not echo? Redact before the model sees it, or before the model sends it to a tool.
  • Moderation. The same content-policy checks you would run on any user-generated text, on both inputs and model outputs.
  • Tool risk rating. Classify each tool by blast radius — read vs. write, reversible vs. irreversible, low-value vs. high-value. High-risk tools require additional checks or human approval.
  • Output validation. Does the final answer match the schema? Does it cite allowed sources? Does it pass the business rules that are easy to state in code?
  • Rules-based hard stops. Regexes, allowlists, blocklists, numeric bounds. Not every check needs an LLM; the ones that don’t shouldn’t use one.

The layers do different jobs. The classifiers are about what the agent is willing to engage with. The tool ratings are about what the agent is allowed to do on its own. The output validators are about what leaves the system. A stack with all three is much harder to subvert than any single model-based check, because each layer would have to fail independently.

Human intervention as a first-class feature

The piece that ties the guardrails to the agent loop is the human-in-the-loop escape hatch. Both guides treat this not as a fallback for when things go wrong but as a deliberate control surface. Trigger a human handoff when:

  • The agent hits its retry or turn budget without converging.
  • A high-risk tool is about to be called — refunds over a threshold, destructive operations, anything where “the model was wrong” is an incident.
  • A guardrail flags the input or output with low confidence.
  • The conversation contains signals that warrant escalation on their own — a customer asking for a supervisor, a legal mention, a safety keyword.

Designing these handoffs well is an ops problem, not a prompt problem. Where does the escalated case go? What context does the human receive? Can the human return control to the agent, or does the case live in a queue? An agent system without a good handoff story tends to either escalate everything (eroding the efficiency that justified building the agent) or escalate nothing (producing incidents the human team only finds out about in the postmortem).

The common thread

The two guides land in the same place by different routes. Anthropic arrives there by taxonomy: here are six patterns, most are workflows, use the simplest one that fits. OpenAI arrives there by stack: here are three foundations and a guardrails layer, build them in order, do not skip to multi-agent. The shared thesis is that “agent” is not a level of intelligence you unlock by adding frameworks; it is a level of autonomy you concede, on purpose, in exchange for capability you could not get otherwise, and only after the cheaper options have been ruled out.

Three rules of thumb, drawn from both:

  1. Start with one well-prompted call. If it works, you are done. If it fails in predictable ways, add a workflow pattern that addresses the specific failure.
  2. Use the smallest system that passes your accuracy bar. This applies to models, to tool counts, to agents. Complexity is the tax you pay for capability you actually need.
  3. Make the agent-computer interface a first-class artifact. Tool names, parameter docs, system prompts, output schemas — these are the code the model is running. Write them like you would write code a colleague has to maintain.

The guides are useful precisely because they resist the framing they are nominally selling. Neither one is an argument for “build more agents.” Both are arguments for knowing exactly when you are building one, and what you are giving up to get it.