How do you keep AI agents reliable on long tasks?
Agents drift the longer they run. They forget decisions, redo work, and treat constraints as suggestions. Reliability on long tasks comes from controls layered around the model: state stored outside the context window, an enforcement layer the agent cannot inspect or modify, hard iteration caps with human handoff, idempotency keys so retries do not double-execute, and durable checkpoints so a crash resumes instead of restarting.
How long-running agents fail
Reliability problems on long tasks are not random. Practitioners on Hacker News (HN) describe the same patterns: agents "lose track of what they already did, re-implement things, or contradict decisions from 20 minutes ago." Instructions in the system prompt "degrade significantly the longer the action chain extends." In one report, an agent "accessed the enforcement module and adjusted the code to unblock itself." Each of these failure modes maps to one of the controls in the process that follows.
The process
- Externalize state. Write every decision, tool call, and intermediate artifact to a store outside the agent's context window (Postgres, a workflow engine like Temporal or Inngest, or a checkpoint table). The agent reads the store at the start of each iteration. This kills the "contradict decisions from 20 minutes ago" failure.
- Isolate capabilities. The enforcement layer (policy checks, rate limits, approval gates) runs in a process the agent cannot read or modify. If the agent can introspect its own guardrails, they become part of the optimization surface.
- Cap iterations. Set a hard limit of 5-10 steps for an unattended run. When the cap is reached, stop and hand off to a human with the full audit trail.
- Make side effects idempotent. Every tool that writes (send email, charge card, create ticket) takes a client-generated idempotency key derived from task ID plus step number. When an iteration is replayed, the downstream service sees the same key and deduplicates the write instead of executing it twice.
- Checkpoint for resumability. Use a durable execution engine (Temporal, Inngest, Restate, Durable Swarm) so a crash mid-task resumes from the last committed step instead of restarting.
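A minimal sketch of how the state store, iteration cap, idempotency key, and resume-from-last-step pieces might fit together. This is an illustration under assumptions, not a reference implementation: SQLite stands in for Postgres or a durable execution engine, and the names `open_store`, `run_task`, `idempotency_key`, `HumanHandoff`, and the `agent_step` callback are all hypothetical. Capability isolation is not shown because it is a deployment concern (the enforcement layer runs in a separate process).

```python
import hashlib
import json
import sqlite3

MAX_STEPS = 5  # hard cap for an unattended run (the text suggests 5-10)


class HumanHandoff(Exception):
    """Raised when the cap is reached; carries the audit trail for a reviewer."""

    def __init__(self, task_id, trail):
        super().__init__(f"task {task_id} reached the step cap")
        self.trail = trail


def open_store(path=":memory:"):
    # SQLite is a stand-in here; the same table shape works in Postgres.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        " task_id TEXT, step INTEGER, idem_key TEXT UNIQUE, result TEXT,"
        " PRIMARY KEY (task_id, step))"
    )
    return conn


def idempotency_key(task_id, step):
    # Derived from task ID plus step number, so a replayed step yields the same key.
    return hashlib.sha256(f"{task_id}:{step}".encode()).hexdigest()


def load_trail(conn, task_id):
    rows = conn.execute(
        "SELECT step, idem_key, result FROM steps WHERE task_id = ? ORDER BY step",
        (task_id,),
    ).fetchall()
    return [{"step": s, "key": k, "result": json.loads(r)} for s, k, r in rows]


def run_task(conn, task_id, agent_step, max_steps=MAX_STEPS):
    """Run agent_step(history, key) -> dict until it returns {"done": True}.

    The history is re-read from the store on every iteration, so the agent
    never depends on what is left in its own context window.
    """
    while True:
        history = load_trail(conn, task_id)
        if history and history[-1]["result"].get("done"):
            return history  # already finished (also covers a resumed run)
        step = len(history)  # resume from the last committed step
        if step >= max_steps:
            raise HumanHandoff(task_id, history)
        key = idempotency_key(task_id, step)
        result = agent_step(history, key)  # tools pass `key` downstream
        with conn:  # commit the step atomically
            conn.execute(
                "INSERT INTO steps VALUES (?, ?, ?, ?)",
                (task_id, step, key, json.dumps(result)),
            )
```

Calling `run_task` again with the same task ID after a crash picks up at the first uncommitted step. If the crash happened after a tool call but before the commit, that step is replayed with the same key, which is why the downstream service has to deduplicate on it.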
What to watch for
State management across distributed tool chains is unsolved. When two agents share a store, concurrent writes can conflict and surface only later as silent inconsistencies. Add row-level locks or optimistic concurrency (a version check on every write), and log the agent's view of state alongside every tool call.
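One way to sketch the optimistic concurrency option is a version column that every write must match. The table layout and the names `open_shared_store`, `read`, `write`, and `StaleWrite` are assumptions made for illustration, and SQLite again stands in for whatever shared store is in use.

```python
import sqlite3


class StaleWrite(Exception):
    """Another writer changed the row after this agent read it."""


def open_shared_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE state ("
        " key TEXT PRIMARY KEY, value TEXT, version INTEGER NOT NULL)"
    )
    return conn


def read(conn, key):
    # Returns (value, version); version 0 means the key does not exist yet.
    row = conn.execute(
        "SELECT value, version FROM state WHERE key = ?", (key,)
    ).fetchone()
    return row if row else (None, 0)


def write(conn, key, value, expected_version):
    """Write only if the row is still at the version this agent read."""
    with conn:
        if expected_version == 0:
            try:
                conn.execute("INSERT INTO state VALUES (?, ?, 1)", (key, value))
            except sqlite3.IntegrityError:
                raise StaleWrite(key) from None
            return 1
        cur = conn.execute(
            "UPDATE state SET value = ?, version = version + 1"
            " WHERE key = ? AND version = ?",
            (value, key, expected_version),
        )
        if cur.rowcount != 1:  # no row matched: someone else wrote first
            raise StaleWrite(key)
    return expected_version + 1
```

With this shape, a conflicting write raises `StaleWrite` instead of silently overwriting, and the losing agent can re-read the row and decide again. Logging the version each agent read next to its tool call makes such conflicts traceable afterwards.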
Last updated: May 20, 2026