major labs
Interactive · Agent safety

Break the Agent

A shopping agent is acting for you. You gave it $50 and one rule: buy only from Trusted Office Supplies. Your job is to write a product page that talks it into spending your money somewhere it should not. We run your page past two agents: one defended only by its prompt, and the same agent behind a gateway that checks every purchase against your rules before it happens. Watch where each one lands.

The one idea

A prompt is a request, not a boundary. Telling an agent "only buy from this merchant, stay under $50" works right up until someone asks it more persuasively. The fix is not a better prompt. It is to move the rules out of the conversation and into runtime checks the agent cannot talk its way past: who is calling, what the owner allowed, what is left in the budget, and a record of what happened.
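To make the idea concrete, here is a minimal sketch of a gateway check that runs outside the conversation. It is an illustration under stated assumptions, not the demo's actual code: the `Mandate` and `Gateway` classes, their field names, and the merchant "Totally Legit Deals" are all hypothetical. The point it shows is that `authorize` receives only a merchant and an amount, so product page text never reaches the decision.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Mandate:
    """Rules set by the owner. Hypothetical shape, not the demo's schema."""
    approved_merchants: frozenset
    per_purchase_ceiling: Decimal
    session_budget: Decimal


@dataclass
class Gateway:
    mandate: Mandate
    spent: Decimal = Decimal("0")

    def authorize(self, merchant: str, amount: Decimal) -> tuple[bool, str]:
        """Decide on a purchase from structured fields only.

        The product page and the model's reasoning are never passed in,
        so nothing written there can change the outcome.
        """
        if merchant not in self.mandate.approved_merchants:
            return False, "merchant not on allow-list"
        if amount <= 0:
            return False, "invalid amount"
        if amount > self.mandate.per_purchase_ceiling:
            return False, "over per-purchase ceiling"
        if self.spent + amount > self.mandate.session_budget:
            return False, "over session budget"
        self.spent += amount  # spend is counted here, not reported by the model
        return True, "allowed"


mandate = Mandate(
    approved_merchants=frozenset({"Trusted Office Supplies"}),
    per_purchase_ceiling=Decimal("50.00"),
    session_budget=Decimal("50.00"),
)
gateway = Gateway(mandate)

# A persuaded agent proposes an off-list purchase; the gateway blocks it.
blocked = gateway.authorize("Totally Legit Deals", Decimal("49.99"))
# An in-policy purchase goes through and is deducted from the budget.
allowed = gateway.authorize("Trusted Office Supplies", Decimal("30.00"))
# A second in-policy purchase would push the session past its cumulative cap.
over_budget = gateway.authorize("Trusted Office Supplies", Decimal("30.00"))
```

Amounts use `Decimal` rather than floats so that comparisons against the ceiling and the budget are exact. A blocked purchase leaves `spent` unchanged, which keeps the budget tied to what was actually approved.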

The agent's mandate

Approved merchant: Trusted Office Supplies
Per-purchase ceiling: $50.00
Session budget: $50.00
Budget remaining: $50.00 ($0.00 spent, governed side)
Reasoning mode: deterministic. No API key is set, so a susceptible agent is simulated. The gateway checks are the same either way.
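A mandate like the one above can be treated as plain data and signed, so that any change made in flight is detectable. The sketch below assumes an HMAC-SHA256 signature over a canonical JSON encoding. The field names, the key, and the choice of HMAC are illustrative assumptions, not a description of how IdentityKit actually signs.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; a real deployment would load a managed key.
SIGNING_KEY = b"demo-only-key"


def canonical_bytes(mandate: dict) -> bytes:
    """Serialize with sorted keys so the same mandate always yields the same bytes."""
    return json.dumps(mandate, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_mandate(mandate: dict, key: bytes = SIGNING_KEY) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical mandate bytes."""
    return hmac.new(key, canonical_bytes(mandate), hashlib.sha256).hexdigest()


def verify_mandate(mandate: dict, signature: str, key: bytes = SIGNING_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_mandate(mandate, key)
    return hmac.compare_digest(expected, signature)


# Field names are assumptions chosen to mirror the mandate shown on this page.
mandate = {
    "approved_merchant": "Trusted Office Supplies",
    "per_purchase_ceiling": "50.00",
    "session_budget": "50.00",
}
signature = sign_mandate(mandate)

# Someone raises the ceiling in flight but cannot produce a matching signature.
tampered = dict(mandate, per_purchase_ceiling="5000.00")

original_ok = verify_mandate(mandate, signature)
tampered_ok = verify_mandate(tampered, signature)
```

Here `original_ok` is true and `tampered_ok` is false, so a gateway that verifies before acting would refuse to run under the altered mandate.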

Pick an attack, or write your own product page


What this proves

A prompt is not a security boundary. The same words that fool the model do nothing to a check that runs in code and never reads the chat. Keep the rules in a gateway and an agent can be persuaded all day without spending a cent it should not.

What it does not prove

That these primitives are production-ready. This is a v0 demo with a mock merchant, a mock budget, and a deliberately simple gateway. The signing key, the allow-list, and the witness chain are honest but minimal. Treat it as an illustration of the shape of the fix, not a finished control.

The checks, and the primitives behind them

identity (IdentityKit): who is calling
The agent acts under a signed mandate. If the mandate is altered in flight, the signature fails and nothing runs.

mandate (MandateKit): what the owner allowed
Merchant allow-list and per-purchase ceiling, checked in code. No instruction in a product page can add a merchant or raise a limit.

budget (BudgetGuard): what it may spend
A cumulative cap across the whole session. Spend is counted at the gateway, not asked of the model.

witness (WitnessKit): what it did
Every verdict, allow or block, is sealed into an append-only hash chain. Edit a past entry and the chain stops verifying.
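
The witness idea can be sketched in a few lines. This is a minimal illustration assuming SHA-256 over the previous hash plus a canonical JSON encoding of each verdict. The entry layout, the `GENESIS` value, and the function names are hypothetical and are not taken from WitnessKit.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash: str, verdict: dict) -> str:
    """Hash the previous entry's hash together with this verdict."""
    payload = json.dumps(verdict, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list, verdict: dict) -> None:
    """Seal a verdict onto the end of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"verdict": verdict, "prev": prev_hash, "hash": entry_hash(prev_hash, verdict)}
    )


def verify(chain: list) -> bool:
    """Recompute every link; any edited entry breaks the chain from that point on."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["verdict"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
# "Totally Legit Deals" is a made-up off-list merchant used for illustration.
append(chain, {"merchant": "Totally Legit Deals", "amount": "49.99", "decision": "block"})
append(chain, {"merchant": "Trusted Office Supplies", "amount": "30.00", "decision": "allow"})
intact = verify(chain)

# Rewrite history: flip the first verdict from block to allow.
chain[0]["verdict"]["decision"] = "allow"
after_edit = verify(chain)
```

Before the edit `intact` is true. Afterward `after_edit` is false, because the stored hash of the first entry no longer matches its contents. An attacker who can rewrite every later hash can still rebuild a plain chain like this one, so a fuller design would also sign or externally anchor the latest hash.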

The four primitives — IdentityKit (who), MandateKit (may), BudgetGuard (spends), WitnessKit (did) — are open source on GitHub. This page is a v0 illustration of the idea, not a packaged product.