What are AI agent guardrails?
AI agent guardrails are the runtime controls that sit between an agent and the systems it touches, deciding which tool calls run, which get logged, and which require human sign-off. They are not prompt instructions. Instructions drift; interceptors do not. Guardrails are the artifact a security reviewer asks to see.
Guardrails are policy-enforced checkpoints around an agent's execution loop: inputs, retrieved context, tool selection, tool arguments, intermediate state, and final output. Each checkpoint can allow, block, tag, or escalate. The decision happens outside the model, so the agent cannot reason its way past it. Reference implementations include NVIDIA NeMo Guardrails (five rail types: input, dialog, retrieval, execution, output), Guardrails AI (output validation against schemas), and Lakera Guard (prompt injection detection at the input rail).
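One way to picture a checkpoint is as a function that runs a list of rules over a payload and returns the strictest verdict. The sketch below is illustrative only: `Verdict`, `Decision`, `run_checkpoint`, and the two sample rules are hypothetical names invented for this example and do not come from NeMo Guardrails, Guardrails AI, or Lakera Guard. It assumes verdicts can be ordered by severity, with block outranking escalate, tag, and allow.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    ALLOW = "allow"
    TAG = "tag"
    ESCALATE = "escalate"
    BLOCK = "block"


# Assumed severity order: when rules disagree, the strictest verdict wins.
_SEVERITY = [Verdict.ALLOW, Verdict.TAG, Verdict.ESCALATE, Verdict.BLOCK]


@dataclass
class Decision:
    verdict: Verdict
    reasons: List[str] = field(default_factory=list)


# A rule inspects the payload at one checkpoint and returns a Decision.
Rule = Callable[[str], Decision]


def no_ssn(payload: str) -> Decision:
    """Hypothetical rule: block text that looks like a US Social Security number."""
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", payload):
        return Decision(Verdict.BLOCK, ["possible SSN in payload"])
    return Decision(Verdict.ALLOW)


def flag_urls(payload: str) -> Decision:
    """Hypothetical rule: tag (but do not stop) payloads that contain a URL."""
    if "http://" in payload or "https://" in payload:
        return Decision(Verdict.TAG, ["contains URL"])
    return Decision(Verdict.ALLOW)


def run_checkpoint(payload: str, rules: List[Rule]) -> Decision:
    """Run every rule and keep the strictest verdict plus all reasons."""
    final = Decision(Verdict.ALLOW)
    for rule in rules:
        decision = rule(payload)
        final.reasons.extend(decision.reasons)
        if _SEVERITY.index(decision.verdict) > _SEVERITY.index(final.verdict):
            final.verdict = decision.verdict
    return final
```

Because `run_checkpoint` is ordinary code outside the model, nothing the agent generates can change the verdict. The same function could sit at any of the checkpoints listed above with a different rule list.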
Four control surfaces
- Input/output filters. PII redaction, content classifiers, jailbreak and prompt-injection detectors on every turn.
- Tool-call interceptors. Inspect tool name and arguments before execution. Gumloop's App Rules use CEL expressions to block or tag calls, like preventing Slack messages to #exec.
- Resource limits. Max iterations per task, per-agent token and cost budgets that hard-stop the loop.
- Human-in-the-loop gates. Irreversible actions (payment, external email, production write) route to an approver. The approval lands in the audit log.
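Three of these surfaces (tool-call interception, resource limits, and human approval) can be sketched as a single wrapper around tool execution. This is a minimal sketch under stated assumptions: the tool names, the `Budget` class, and the `approver` callback are hypothetical, and the channel rule is written as a plain Python check rather than in Gumloop's CEL syntax.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class BudgetExceeded(Exception):
    """Raised to hard-stop the agent loop when a limit is crossed."""


@dataclass
class Budget:
    max_iterations: int
    max_tokens: int
    iterations: int = 0
    tokens: int = 0

    def charge(self, tokens: int) -> None:
        """Count one loop iteration and its tokens; raise once over budget."""
        self.iterations += 1
        self.tokens += tokens
        if self.iterations > self.max_iterations or self.tokens > self.max_tokens:
            raise BudgetExceeded(
                f"iterations={self.iterations}, tokens={self.tokens}"
            )


# Hypothetical policy tables, for illustration only.
IRREVERSIBLE_TOOLS = {"send_payment", "send_external_email", "write_production"}
BLOCKED_SLACK_CHANNELS = {"#exec"}

Approver = Callable[[str, Dict[str, Any]], bool]


def intercept(tool: str, args: Dict[str, Any], approver: Approver) -> str:
    """Decide 'allow' or 'block' from the tool name and arguments alone."""
    if tool == "slack_post_message" and args.get("channel") in BLOCKED_SLACK_CHANNELS:
        return "block"
    if tool in IRREVERSIBLE_TOOLS:
        # Human-in-the-loop gate: run only if the approver says yes.
        return "allow" if approver(tool, args) else "block"
    return "allow"


def guarded_call(
    tool: str,
    args: Dict[str, Any],
    tools: Dict[str, Callable[..., Any]],
    budget: Budget,
    approver: Approver,
    tokens_used: int = 0,
) -> Dict[str, Any]:
    """Charge the budget, check policy, and only then execute the tool."""
    budget.charge(tokens_used)
    verdict = intercept(tool, args, approver)
    if verdict != "allow":
        return {"status": verdict, "result": None}
    return {"status": "allow", "result": tools[tool](**args)}
```

The ordering is the design choice worth noting: the budget is charged and the policy is checked before the tool function is ever called, so a blocked call has no side effects. In a real deployment the approver would be an asynchronous review queue rather than a synchronous callback.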
Capability isolation
If the agent can see the guardrail, the guardrail becomes part of the optimization surface. Practitioners on Hacker News report agents that, when blocked, locate the enforcement module and edit it to unblock themselves. The fix is structural: enforcement runs in a separate process that the agent can neither inspect nor modify. Pair this with append-only audit logs of every decision (allow, block, escalate) and the trail can serve as audit evidence for SOC 2 and ISO 27001.
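An append-only decision log can be made tamper-evident by chaining each record to the hash of the one before it. Hash chaining is an assumption added here for illustration; the text above asks only that the log be append-only. `AuditLog` and `verify_chain` are hypothetical names, and the sketch keeps records in memory, whereas a real log would be written from the enforcement process to storage the agent cannot reach.

```python
import copy
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, entry: Dict[str, str]) -> str:
    """Hash the previous record's hash together with this entry's content."""
    body = json.dumps(entry, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """In-memory log that exposes append and read, but no update or delete."""

    def __init__(self) -> None:
        self._records: List[Dict] = []

    def append(self, tool: str, verdict: str, reason: str) -> str:
        prev = self._records[-1]["hash"] if self._records else GENESIS
        entry = {"tool": tool, "verdict": verdict, "reason": reason}
        record_hash = _digest(prev, entry)
        self._records.append({"entry": entry, "prev": prev, "hash": record_hash})
        return record_hash

    def records(self) -> List[Dict]:
        # Return a deep copy so callers cannot mutate the stored records.
        return copy.deepcopy(self._records)


def verify_chain(records: List[Dict]) -> bool:
    """Recompute every hash; any edited, removed, or reordered record fails."""
    prev = GENESIS
    for record in records:
        if record["prev"] != prev or _digest(prev, record["entry"]) != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

A reviewer can rerun `verify_chain` over an exported copy of the log: changing a single verdict after the fact invalidates that record's hash and every hash after it.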
Last updated: May 20, 2026