Design

Why fail-closed wins

“Fail-closed” means that when a system is unsure, it refuses instead of letting things through. For autonomous AI agents, that can be the difference between a thwarted attack and a successful one.

~4 min read · Design

Fail-open vs. fail-closed

Every protection system has to decide what to do when it's not sure. There are two stances:

Fail-open: let it through when in doubt. Convenient, never in the way — but every gap in judgment becomes an open door.
Fail-closed: block or ask when in doubt. Occasionally inconvenient — but an unclear case never quietly turns into damage.

For a door lock the choice is obvious: a locking system that simply springs open on a power outage is no safeguard. With AI agents, the same logic is often forgotten.

An agent acts autonomously and repeatedly. A single wrongly waved-through “memory” takes effect not once, but on every future decision.
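The two stances differ only in what happens on the uncertain path. A minimal Python sketch of that difference follows. The three-valued checker (True for safe, False for unsafe, None for unsure) and all function names are assumptions made for illustration, not part of PoisonZero.

```python
from typing import Callable, Optional

# Hypothetical checker type: True = safe, False = unsafe, None = unsure.
Check = Callable[[str], Optional[bool]]


def fail_open(check: Check, entry: str) -> bool:
    """Allow the entry unless the check positively says it is unsafe."""
    try:
        verdict = check(entry)
    except Exception:
        return True  # the checker itself failed: fail-open waves it through
    return verdict is not False  # None (unsure) counts as allowed


def fail_closed(check: Check, entry: str) -> bool:
    """Allow the entry only if the check positively says it is safe."""
    try:
        verdict = check(entry)
    except Exception:
        return False  # the checker itself failed: fail-closed refuses
    return verdict is True  # None (unsure) counts as refused


def unsure_check(entry: str) -> Optional[bool]:
    """Illustrative checker that can never make up its mind."""
    return None


def broken_check(entry: str) -> Optional[bool]:
    """Illustrative checker that is unavailable."""
    raise RuntimeError("classifier unavailable")
```

Both wrappers agree whenever the checker gives a definite answer. They diverge only for `None` and for exceptions, which is exactly where a gap in judgment becomes either an open door or a refusal.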

Why agents in particular need fail-closed

Persistence: whatever gets into the agent's memory files stays there. A fail-open mistake isn't fleeting; it settles in.
Autonomy: nobody is watching every action. The safe default has to be built in, not dependent on human vigilance.
Ambiguity: malicious entries often look like legitimate ones. A model will frequently be only moderately sure of its verdict, and that is exactly when it must not wave things through.

How PoisonZero implements fail-closed

PoisonZero scores every change with two numbers: a danger level (0–1) and a confidence (0–1, how reliable the judgment is). A clear, conservative rule follows from these two scores:

# Decision logic (simplified)
danger ≥ block threshold              → revert
danger ≤ safe threshold
   AND confidence ≥ minimum           → allow
else (everything unclear in between)  → ask_user

The decisive point is the last line: whatever isn't judged harmless with sufficient confidence does not get through. It is either rolled back or brought to you. You set the thresholds per agent yourself, each one clearly explained, with sliders instead of config files.
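The simplified rule can be written out as a small Python function. This is a sketch under stated assumptions, not PoisonZero's actual implementation: the names `Thresholds` and `decide` and the default threshold values are hypothetical, and treating out-of-range scores as unclear is an added assumption in the fail-closed spirit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical defaults for illustration only; the article says
    # these are set per agent by the user.
    block: float = 0.8           # danger at or above this is reverted
    safe: float = 0.2            # danger at or below this may be allowed
    min_confidence: float = 0.7  # required confidence for "allow"


def decide(danger: float, confidence: float,
           t: Thresholds = Thresholds()) -> str:
    """Return "revert", "allow", or "ask_user" for one scored change."""
    # Assumption: scores outside 0..1 (including NaN) are never trusted
    # and are handed to the user instead.
    if not (0.0 <= danger <= 1.0 and 0.0 <= confidence <= 1.0):
        return "ask_user"
    if danger >= t.block:
        return "revert"
    if danger <= t.safe and confidence >= t.min_confidence:
        return "allow"
    # Everything unclear in between: do not let it through silently.
    return "ask_user"
```

Note that `allow` is the only branch with two conditions: a low danger score alone is not enough, so a harmless-looking change with a shaky judgment still ends up in front of the user.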

The price, and why it's worth it

Fail-closed occasionally means a confirmation request that, in hindsight, turns out to have been unnecessary. That's the price. The alternative is a poisoned entry that quietly skews every future decision. This asymmetry of small, visible inconvenience versus large, invisible damage is the whole reason fail-closed wins.

Safe by default.

PoisonZero picks the safe side when in doubt — and lets you draw the line yourself.
