Technical whitepaper

How PoisonZero works.

This is the architecture document for the people who evaluate PoisonZero: security engineers and CISOs who want to understand the design before they buy. It covers the threat model, the full processing pipeline, the scoring and decision logic with real default thresholds, both deployment flows (cloud-assisted and fully on-device), and the integrity, isolation and privacy guarantees that back them. We name the limits honestly too.

1. Threat model: memory poisoning is not prompt injection

AI agents keep persistent memory — project notes, standing instructions, learned preferences — in plain files on disk (the “Memory Files” and dot-files of tools like coding assistants). That memory is read back into the model's context on every run. It is, in effect, a writable extension of the system prompt that survives restarts.

Prompt injection is a transient problem: a malicious instruction arrives inside one request and is gone when the session ends. Memory poisoning is persistent. A single poisoned entry — written by a compromised tool, a malicious document the agent summarised, a supply-chain payload, or another agent — keeps influencing every future run until someone notices. The attacker does not need to be present when the damage happens. They write once; the agent re-reads forever.

PoisonZero defends the integrity of those files. It treats every change to a protected file as untrusted until proven safe, and it is fail-closed: when in doubt, the change is reverted rather than waved through. We model five classes of attack and train and evaluate against each:

2. Architecture: a privileged daemon, defense in depth

PoisonZero runs as a small, privileged daemon on each protected machine (a systemd unit on Linux, a service on macOS/Windows). It is event-driven, not a scanner: it does nothing until a protected file changes. There is no indexer crawling your disk and no background telemetry.

The design is layered, so that no single component is a single point of failure and each layer is cheap to reason about:

  file change on a protected path
           │
  ┌────────▼─────────┐
  │     watcher      │  fsnotify + trailing-edge debounce
  └────────┬─────────┘  (final-state guarantee)
           │
  ┌────────▼─────────┐
  │  policy filters  │  notice/loop guard · protected-path check
  └────────┬─────────┘
           │
  ┌────────▼─────────┐
  │   meta-attack    │  deterministic, AI-free rules
  │   rule layer     │  (Unicode-normalised, multilingual)
  └────────┬─────────┘
           │ (no hard match)
  ┌────────▼─────────┐
  │    evaluator     │  Flow A: cloud · Flow B: on-device
  └────────┬─────────┘  → danger / confidence in [0,1]
           │
  ┌────────▼─────────┐
  │ decision engine  │  thresholds → allow / ask_user / revert
  └────────┬─────────┘  (fail-closed on any error)
           │
  ┌────────▼─────────┐
  │     enforcer     │  allow · quarantine · revert to clean state
  └──────────────────┘  → audit entry, every time

Two evaluator implementations sit behind the same interface — a cloud-assisted one (Pro, §5) and a fully on-device one (Enterprise, §6). Everything around the evaluator is identical across tiers: the same watcher, the same deterministic rules, the same decision logic, the same enforcement and the same audit trail.

3. The pipeline in detail

3.1 The watcher and its final-state guarantee

The watcher uses the OS-native change-notification facility (inotify on Linux, kqueue/FSEvents on macOS, ReadDirectoryChangesW on Windows) and debounces per path. Editors and agents rarely write a file once; they write it in bursts. Naïve debouncing fixes the resulting event storm but introduces a subtle bypass that an attacker who controls write timing can exploit.

Consider a simple “drop everything inside the window” debounce. An attacker who can time writes will issue a burst: a benign version first, then the malicious final version a few milliseconds later, inside the debounce window. A leading-edge-only debounce evaluates the benign version and silently discards the malicious final write — exactly the state that lands on disk and gets read back into the agent. The attack lands evaluation-free.

PoisonZero's watcher is built with a trailing-edge final-state guarantee. It emits on the leading edge immediately, and if any further writes arrive inside the window, it schedules exactly one trailing evaluation after the window elapses — re-reading the file's current state at that moment. No matter how a burst is timed, the version that actually ends up on disk is the version that gets scored. Carefully timed burst writes cannot smuggle a final state past evaluation. (This is a deliberate piece of adversarial engineering; it was hardened specifically against this bypass.)
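The leading-edge-plus-one-trailing behaviour described above can be sketched as a small, clock-injected state machine. This is a minimal illustration, not the product's watcher: the class name, method names and the 0.5 s window are assumptions, and the caller is expected to re-read the file from disk whenever it is told to evaluate.

```python
from dataclasses import dataclass, field


@dataclass
class TrailingEdgeDebouncer:
    """Per-path debounce: evaluate at once, then once more if the burst continued."""

    window: float  # debounce window in seconds (value is an assumption)
    _window_end: dict = field(default_factory=dict)  # path -> time the window closes
    _dirty: set = field(default_factory=set)         # paths rewritten inside the window

    def on_write(self, path: str, now: float) -> bool:
        """Record a write event. Returns True if the caller should evaluate now."""
        end = self._window_end.get(path)
        if end is None or now >= end:
            # Leading edge: no open window, so evaluate the current state immediately.
            self._window_end[path] = now + self.window
            self._dirty.discard(path)
            return True
        # Inside the window: skip evaluation for now, but remember that the
        # on-disk state has moved on since the leading-edge evaluation.
        self._dirty.add(path)
        return False

    def due(self, now: float) -> list:
        """Paths whose window has elapsed and that need exactly one trailing evaluation."""
        ready = [p for p in self._dirty if now >= self._window_end[p]]
        for p in ready:
            self._dirty.discard(p)
            del self._window_end[p]
        return sorted(ready)
```

With a 0.5 s window, a burst at t = 0.00, 0.01 and 0.02 yields one immediate evaluation and one trailing evaluation once `due()` is polled after t = 0.5. Because the trailing evaluation reads the file at that moment, the last write of the burst is the one that gets scored.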

3.2 Policy filters and loop protection

Before any evaluation, two cheap checks run. First, changes to PoisonZero's own agent-notice file are ignored — the daemon writes that file itself to inform the agent of an action, and re-evaluating it would create a feedback loop. Second, the path must fall under an admin-configured protected profile; anything outside is ignored. PoisonZero's own configuration and state directories are reserved and can never be registered as a watched profile (registering them would itself be an attack vector). If the new content is byte-identical to the known clean state, processing stops — that is just the daemon's own revert or apply echoing back.
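A rough sketch of these pre-evaluation checks, in the order the text gives them. All paths and function names here are hypothetical. Rejecting a profile root that merely contains a reserved directory is an extra assumption of this sketch, and a real implementation would also need to canonicalise paths (symlinks, `..`) before comparing them.

```python
from pathlib import PurePosixPath

# Hypothetical locations, for illustration only.
NOTICE_FILE = PurePosixPath("/home/dev/project/.poisonzero-notice.md")
RESERVED_DIRS = [PurePosixPath("/etc/poisonzero"), PurePosixPath("/var/lib/poisonzero")]


def _is_under(path, root):
    """True if path equals root or lies somewhere below it."""
    return path == root or root in path.parents


def can_register_profile(root):
    """Refuse profile roots inside a reserved directory, or roots that would contain one."""
    root = PurePosixPath(root)
    return not any(_is_under(root, r) or _is_under(r, root) for r in RESERVED_DIRS)


def should_evaluate(path, new_bytes, clean_bytes, profiles):
    """Cheap filters that run before any diffing or scoring."""
    path = PurePosixPath(path)
    if path == NOTICE_FILE:
        return False  # loop guard: the daemon wrote this file itself
    if not any(_is_under(path, PurePosixPath(p)) for p in profiles):
        return False  # outside every protected profile
    if new_bytes == clean_bytes:
        return False  # echo of the daemon's own revert or apply
    return True
```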

3.3 The deterministic rule layer runs first

A unified diff of old-vs-new is built and handed first to an AI-free rule layer (§4). A hard match short-circuits the whole pipeline straight to revert — fast, deterministic, and impossible to confuse with a clever score. Only if no hard rule fires does the change reach the evaluator.
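To make the ordering concrete, here is a minimal sketch that builds the unified diff with Python's standard `difflib` and only calls the evaluator when no hard rule fires. The function names, and the `hard_rule` and `evaluate` callables, are placeholders assumed for this example rather than the product's interfaces.

```python
import difflib


def build_diff(path, old_text, new_text):
    """Unified diff of the known-clean state against the new on-disk state."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


def process_change(path, old_text, new_text, hard_rule, evaluate):
    """Run the deterministic rule first; reach the evaluator only on no hard match."""
    diff = build_diff(path, old_text, new_text)
    if hard_rule(diff):
        return "revert"  # short-circuit: the evaluator is never called
    return evaluate(path, diff)
```

The point of the structure is that a hard match returns before `evaluate` is ever invoked, so the outcome for those cases does not depend on any score.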

3.4 Evaluation, decision, enforcement — audit at every step

The evaluator returns a danger and a confidence score (both in [0,1]); the decision engine maps those to an action (§4.3); the enforcer carries it out. Every outcome (allow, quarantine, or revert) produces an audit entry with timestamp, action, path, reason and scores. The audit trail is local and survives the daemon; on managed installs it is also reported to the owner's panel.

4. Deterministic rules, scoring and decision

4.1 The AI-free meta-attack layer

Some attacks should never depend on a model's judgement. An entry that tries to edit PoisonZero's own thresholds, or to disable, uninstall or kill the product, or a classic “ignore all previous instructions” override, is hostile by construction. The rule layer catches these deterministically and reverts immediately, with no AI in the loop.

The matcher is built to resist evasion. Every input is first Unicode-normalised (NFKC) and lower-cased, which folds full-width characters, ligatures and homoglyph tricks onto their base forms before any pattern is applied, so an attacker cannot slip an override through with look-alike characters or mixed case. The override and “disable the product” phrasings are matched across many languages, including non-Latin scripts, because the attacker, not the defender, picks the language. The rules are tuned for precision: Latin-script verbs are matched on word boundaries and explicitly excluded when they are merely part of a settings name (e.g. disable-notifications), so legitimate configuration text is not falsely reverted.
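The normalise-then-match idea can be illustrated with Python's `unicodedata` and `re` modules. The two patterns below are hypothetical stand-ins, English only; the product's actual rule set is not shown in this document. Note that NFKC on its own folds compatibility forms such as full-width letters and ligatures; cross-script confusables (for example a Cyrillic letter standing in for a Latin one) would need an additional mapping that this sketch does not attempt.

```python
import re
import unicodedata

# Hypothetical patterns for illustration only.
HARD_RULES = [
    re.compile(r"\bignore\s+all\s+previous\s+instructions\b"),
    # Verb on a word boundary and not followed by '-' or '_' (so a settings name
    # such as 'disable-notifications' is skipped), then the product name nearby.
    re.compile(r"\b(?:disable|uninstall|kill)\b(?![-_])[^\n]{0,40}\bpoisonzero\b"),
]


def normalise(text):
    """Fold compatibility characters (NFKC), then lower-case."""
    return unicodedata.normalize("NFKC", text).lower()


def hard_match(text):
    """True if any deterministic rule fires on the normalised text."""
    folded = normalise(text)
    return any(rule.search(folded) for rule in HARD_RULES)
```

For example, a full-width "ＩＧＮＯＲＥ all previous instructions" is folded to plain ASCII before matching and is caught, while `disable-notifications: true` is not.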

4.2 Scoring

When a change reaches the evaluator, the model returns two numbers in [0,1]: danger (how malicious the change looks) and confidence (how sure the evaluator is). Keeping confidence separate from danger is what lets the system route genuinely ambiguous changes to a human instead of guessing.

4.3 The decision engine and its default thresholds

The decision engine is deliberately tiny and total — three thresholds, three outcomes:

The default thresholds are dangerBlock = 0.7, dangerSafe = 0.3, confidenceMin = 0.6 (admin-configurable centrally). So a change scored at danger 0.97 is well past the block threshold and is reverted outright; a change at danger 0.1 with confidence 0.9 is allowed; a change at danger 0.5, or a clearly-benign-looking one the model is unsure about, lands in the grey zone and is quarantined for review.

Crucially, the engine is fail-closed: if evaluation cannot be performed at all — the model errored, timed out, or returned something unparseable — the result is not “allow by default”. It is revert. The scores are only ever passed downstream when the evaluation actually succeeded.
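The three thresholds and the fail-closed rule fit in a few lines. The sketch below uses the default values stated above; the exact comparison operators at the boundaries (inclusive or exclusive), the treatment of high-danger, low-confidence results as revert, and the function and constant names are assumptions, since the text does not spell them out.

```python
import math

DANGER_BLOCK = 0.7    # default dangerBlock
DANGER_SAFE = 0.3     # default dangerSafe
CONFIDENCE_MIN = 0.6  # default confidenceMin


def decide(result):
    """Map an evaluation result to 'allow', 'ask_user' or 'revert'.

    `result` is a (danger, confidence) pair, or None if evaluation failed.
    Anything that cannot be read as two scores in [0,1] is treated as a
    failed evaluation and reverted (fail-closed).
    """
    try:
        danger, confidence = result
        danger, confidence = float(danger), float(confidence)
    except (TypeError, ValueError):
        return "revert"
    for score in (danger, confidence):
        if math.isnan(score) or not 0.0 <= score <= 1.0:
            return "revert"
    if danger >= DANGER_BLOCK:
        return "revert"
    if danger <= DANGER_SAFE and confidence >= CONFIDENCE_MIN:
        return "allow"
    return "ask_user"  # grey zone: quarantine for review
```

Under these assumptions the worked cases above come out as described: `decide((0.97, 0.9))` reverts, `decide((0.1, 0.9))` allows, and both `decide((0.5, 0.9))` and `decide((0.1, 0.4))` land in the grey zone.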

4.4 The grey zone: quarantine and asynchronous review

An ask_user outcome does not block the agent waiting for a human. The change is quarantined and the last known-clean state is restored on disk immediately, so the agent keeps running on trusted content. The quarantined version is preserved out-of-band for the owner to review asynchronously in the panel; when they decide, the daemon applies or discards it and records the outcome. Reviews are asynchronous by design: protection never depends on an admin being online at the moment of the event.
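As a rough sketch of that flow: preserve the suspect version outside the protected path, restore the clean state straight away, and resolve later. The function names, the `.pending` suffix and the directory layout are hypothetical, and a real implementation would also have to handle the file changing again between quarantine and review.

```python
import shutil
from pathlib import Path


def quarantine(path, clean_bytes, quarantine_dir, change_id):
    """Hold the suspect version out-of-band and restore the clean state now."""
    path, quarantine_dir = Path(path), Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    held = quarantine_dir / f"{change_id}.pending"
    shutil.copyfile(path, held)    # preserve the suspect version for review
    path.write_bytes(clean_bytes)  # the agent keeps running on trusted content
    return held


def resolve(path, held, approve):
    """Apply or discard a held change once the owner has decided."""
    path, held = Path(path), Path(held)
    if approve:
        path.write_bytes(held.read_bytes())
    held.unlink()
    return "applied" if approve else "discarded"
```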

5. Flow A — Cloud-assisted (Pro)

On the Pro tier, evaluation is performed by a hardened cloud endpoint. The defining property of this flow is that it is built around a verifiable data contract, not a promise.

5.1 Local redaction before anything leaves

Before a single byte is sent for evaluation, the diff goes through a local redaction pass on the device. Using pattern matching, it strips the classes of secrets and structured PII it can recognise reliably offline: API keys (OpenAI/Anthropic-style, GitHub, Slack, AWS, Google, Stripe), JWTs, generic key/token/secret/password assignments (the value is removed, the key name kept), email addresses, IBANs, credit-card numbers, IP addresses and international phone numbers. Replacements are structure-preserving placeholders ([REDACTED:email], [REDACTED:apikey] …) so the model can still reason about the change.

We are honest about the boundary: free-text names and unstructured PII are not reliably detectable offline without ML, so the redactor strips what it can match with confidence and no more. You can see exactly what an evaluation would send with a built-in self-test, piping any file through the product's redact command before trusting it.
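A pattern-based redactor of this kind can be sketched with a short, ordered list of regular expressions. The four patterns below are a deliberately small, hypothetical subset, far less thorough than a production rule set, and the placeholder names other than `[REDACTED:email]` are assumptions of this example.

```python
import re

# Hypothetical, simplified patterns; order matters (assignments first).
_RULES = [
    # key/token/secret/password assignments: keep the key name, drop the value
    (re.compile(r"\b([\w-]*(?:key|token|secret|password)[\w-]*\s*[:=]\s*)\S+", re.IGNORECASE),
     r"\1[REDACTED:secret]"),
    # three dot-separated base64url segments starting with 'eyJ' (JWT-shaped)
    (re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"), "[REDACTED:jwt]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[REDACTED:email]"),
    # dotted-quad shape only; does not validate octet ranges
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED:ip]"),
]


def redact(diff):
    """Replace recognisable secrets and structured PII with placeholders."""
    for pattern, placeholder in _RULES:
        diff = pattern.sub(placeholder, diff)
    return diff
```

Because the placeholders keep the surrounding structure, a line such as `+api_key = sk-abc123` becomes `+api_key = [REDACTED:secret]`, which still tells the evaluator that a credential-like assignment was added.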

5.2 The data contract: minimal payload, fixed endpoints

Only a file path and the redacted unified diff ever go out for evaluation — never full file contents. There is a small, fixed set of endpoints, each carrying a documented, minimal payload: enrollment carries an app ID and a one-time code; evaluation carries the path and redacted diff; config polling carries nothing but an auth token; audit and quarantine reporting carry timestamps, actions, paths, reasons and scores (the quarantine report includes the same redacted diff, capped in size). Files outside the configured profiles, full texts, directory or process listings, and system telemetry never leave the machine. All transport is over TLS.

5.3 The egress ledger: verify, don't trust

Every outbound request is recorded locally in a human-readable egress ledger. Each line carries the timestamp, the endpoint, the payload size, the SHA-256 of the payload (never the content), and the HTTP status:

2026-06-04T12:00:01Z endpoint=/getConfig bytes=2 sha256=44136f… status=200
2026-06-04T12:03:17Z endpoint=/evaluate bytes=412 sha256=9f86d0… status=200

This is the “does it phone home?” accounting, written so your DLP and audit teams can verify rather than take our word for it. The ledger can be cross-checked against the network itself — the daemon only ever contacts the evaluation backend and the auth-token endpoint, and an external observer (firewall logs, a packet capture, an egress-filtering proxy) must see exactly the requests the ledger records, and nothing more.
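A cross-check of this kind could look roughly like the sketch below, which writes a ledger line in the field order shown above and compares it with a payload observed independently. It assumes the ledger stores the full hex digest (the sample lines above are truncated for display) and that the observer has a plaintext view of the request body, for example at an inspecting egress proxy; a raw packet capture of TLS traffic would only confirm destinations and sizes. The function names are hypothetical.

```python
import hashlib


def ledger_line(timestamp, endpoint, payload, status):
    """Format one ledger entry; only the hash of the payload is recorded."""
    digest = hashlib.sha256(payload).hexdigest()
    return f"{timestamp} endpoint={endpoint} bytes={len(payload)} sha256={digest} status={status}"


def parse_line(line):
    """Split a ledger line into its timestamp and key=value fields."""
    timestamp, *pairs = line.split()
    entry = dict(pair.split("=", 1) for pair in pairs)
    entry["timestamp"] = timestamp
    return entry


def matches_capture(line, captured_payload):
    """True if a payload seen on the network matches what the ledger recorded."""
    entry = parse_line(line)
    return (
        int(entry["bytes"]) == len(captured_payload)
        and entry["sha256"] == hashlib.sha256(captured_payload).hexdigest()
    )
```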

6. Flow B — Fully on-device (Enterprise)

On the Enterprise tier, evaluation moves entirely onto the machine. A compact, purpose-built detection model and our own inference engine run locally, and no document content leaves the building at all — there is no analysis cloud to leak to, because there is no analysis cloud.

6.1 On-demand lifecycle

The inference engine does not run continuously. It is started lazily when a memory check comes in, and shuts down again after a short idle timeout. The model file is a little over 300 MB and is memory-mapped at start; during a check the analysis component occupies that footprint for a few seconds, and at rest the product is just the daemon — a few MB of RAM. The short idle window absorbs bursts of edits without keeping the footprint resident. It is CPU-only and runs on ordinary hardware; no GPU is required.
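The lazy-start, idle-stop lifecycle can be sketched as a tiny controller with an injected clock. The class name, the `start` and `stop` callables and the 30-second timeout in the usage note are assumptions; the text only says the timeout is short.

```python
class OnDemandEngine:
    """Start the engine on first use; stop it after an idle period with no checks."""

    def __init__(self, start, stop, idle_timeout):
        self._start = start              # callable that launches the engine process
        self._stop = stop                # callable that shuts it down
        self._idle_timeout = idle_timeout
        self._last_used = None           # None means the engine is not running

    @property
    def running(self):
        return self._last_used is not None

    def acquire(self, now):
        """Call before each memory check; starts the engine if needed."""
        if not self.running:
            self._start()
        self._last_used = now

    def reap(self, now):
        """Call periodically; stops the engine once it has been idle long enough."""
        if self.running and now - self._last_used >= self._idle_timeout:
            self._stop()
            self._last_used = None
```

With `idle_timeout=30`, checks at t = 0 and t = 5 share one engine start, a `reap` at t = 20 leaves it running, and a `reap` at t = 35 shuts it down, so a burst of edits pays the start-up cost once.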

6.2 The inference engine runs in a minimal-privilege sandbox

This is a deliberate security boundary, not just hygiene. The engine, by definition, parses attacker-controlled text: the very memory diffs we are investigating as potentially poisoned. An exploitable flaw in the engine would otherwise be a direct path from “attacker writes a memory file” to “code execution on the analysis component.” The answer is isolation: run the engine sandboxed, and an engine exploit degrades to a harmless process crash instead of a compromise. The daemon, the actual security product, is untouched and reverts when in doubt.

The minimal-privilege profile, applied to each short-lived engine process:

On Linux this is enforced day-one via systemd hardening and a syscall filter (seccomp) — no new privileges, a strict read-only system view, an empty capability set, localhost-only addressing, a read-only bind of the model path — with filesystem-LSM hardening (Landlock) as a fast-follow where the kernel supports it. macOS and Windows apply the same day-one minimum — localhost-only, read-only model path, no subprocess spawn — with full sandbox profiles (a signed sandbox profile on macOS; a restricted token and Job Object on Windows) as fast-follow. We are explicit about what is day-one versus fast-follow rather than implying a uniform maximum everywhere.

6.3 Untrusted output and hash-pinning before every start

Even inside the sandbox, the daemon treats the engine's answer as untrusted input: if it crashes, hangs, or returns anything unexpected, the decision is fail-closed (revert), never a silent allow. And before every engine start, the daemon verifies the model file's SHA-256 against a pinned hash carried in its signed license manifest. This defends the distribution chain and also closes a local exploit vector — a swapped or corrupted model file (e.g. a doctored header crafted to exploit the loader) is detected before the engine is even launched; it simply does not start.
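The pre-start check can be illustrated as follows: hash the model file in chunks, compare against the pinned value, and only then launch. This is a sketch under assumptions: the function names and the `launch` callable are hypothetical, verification of the signed manifest that carries the pinned hash is out of scope here, and a real implementation would also need to ensure the file cannot be swapped between hashing and loading.

```python
import hashlib
import hmac


def sha256_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so a 300 MB model is not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def start_engine_if_pinned(model_path, pinned_sha256, launch):
    """Launch the engine only if the model matches its pinned hash; otherwise refuse."""
    try:
        actual = sha256_file(model_path)
    except OSError:
        return False  # unreadable or missing model: do not start
    if not hmac.compare_digest(actual, pinned_sha256.lower()):
        return False  # swapped or corrupted model: do not start
    launch(model_path)
    return True
```

A `False` return here would feed the same fail-closed path as any other evaluation failure: no engine, no score, so the change is reverted.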

6.4 Air-gap friendly

Between checks the product runs fully offline. The single exception is a monthly license check — credentials and version only, never content. There is no telemetry, no content egress, and nothing in the evaluation path that needs the network. The same egress ledger (§5.3) records that monthly check, so it too is verifiable rather than assumed.

7. Integrity and supply chain

The thing you install has to be the thing we built. Every release artifact is signed with Cosign keyless (Sigstore) and published with a SHA256SUMS manifest, so each binary is verifiable against a signed checksum list before it ever runs. The one-line installer verifies the checksum automatically; air-gapped operators can verify by hand.
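For operators verifying by hand, the checksum half of that process amounts to looking up the artifact in the manifest and comparing digests. The sketch below assumes the manifest uses the usual `sha256sum` line format (hex digest, whitespace, file name) and hypothetical function names. It does not verify the manifest's signature; that step has to be done separately with the Cosign tooling before the checksums can be trusted.

```python
import hashlib
from pathlib import Path


def parse_sha256sums(text):
    """Parse 'HEX  filename' lines into a {filename: digest} mapping."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        sums[name.lstrip("*")] = digest.lower()  # '*' marks binary mode in sha256sum output
    return sums


def verify_artifact(artifact, manifest_text):
    """True only if the artifact is listed in the manifest and its digest matches."""
    artifact = Path(artifact)
    expected = parse_sha256sums(manifest_text).get(artifact.name)
    if expected is None:
        return False  # not listed: treat as unverified
    return hashlib.sha256(artifact.read_bytes()).hexdigest() == expected
```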

On the Enterprise tier the chain extends to the model itself: the detection model is a signed, integrity-protected artifact whose SHA-256 is pinned in the signed license manifest and re-verified before every single engine start (§6.3). Integrity is checked not once at download time, but continuously at every use.

8. Detection performance

The detection model is not an off-the-shelf classifier. It is fine-tuned on a large, purpose-built corpus of real attack and benign examples, with a cloud labelling pipeline that keeps the on-device model sharp as new attack patterns appear — without your data ever feeding it.

The numbers we measure, stated relatively and honestly: on the same test set it catches more than 3× as many attacks as leading off-the-shelf guard models, and it reliably flags the classes those standard detectors are practically blind to — subtle/indirect injection and data exfiltration. Tuning cut false alarms by roughly 95% versus the untuned base model. Across broad, realistic internal testing it detects over 94% of attacks at under 5% false alarms.

Methodology, briefly. Evaluation is on held-out data the model never trained on, balanced across the five attack classes (§1) and across many languages. The benign set is deliberately adversarial — it includes the kind of legitimate notes that look alarming and trip up keyword filters — so the false-alarm rate reflects realistic content, not easy negatives. We report relative gains and a measured operating point rather than a single headline accuracy, because a detector is only as good as its weakest attack class and its behaviour on hard benign cases.

9. Honest limitations and residual risk

No detector stops every attack everywhere, and we do not claim a 100% catch rate — the numbers above are the ones we measure, and we keep raising them. A few honest boundaries:

The unifying principle through all of it is the same: fail-closed. When the system is unsure, cannot evaluate, or its analysis component misbehaves, the change is reverted and the last clean state stands. The default failure mode is safety, not silent acceptance.

Questions for your security team?

Send us the specifics of your environment — air-gap constraints, fleet size, DLP requirements — and we'll walk your engineers through the details.
