Break the Agent
A shopping agent is acting for you. You gave it $50 and one rule: buy only from Trusted Office Supplies. Your job is to write a product page that talks it into spending your money somewhere it should not. We run your page past two agents: one defended only by its prompt, and the same agent behind a gateway that checks every purchase against your rules before it happens. Watch where each one lands.
The one idea
A prompt is a request, not a boundary. Telling an agent "only buy from this merchant, stay under $50" works right up until someone asks it more persuasively. The fix is not a better prompt. It is to move the rules out of the conversation and into runtime checks the agent cannot talk its way past: who is calling, what the owner allowed, what is left in the budget, and a record of what happened.
The agent's mandate
Pick an attack, or write your own product page
What this proves
A prompt is not a security boundary. The same words that fool the model do nothing to a check that runs in code and never reads the chat. Keep the rules in a gateway and an agent can be persuaded all day without spending a cent it should not.
What it does not prove
That these primitives are production-ready. This is a v0 demo with a mock merchant, a mock budget, and a deliberately simple gateway. The signing key, the allow-list, and the witness chain are honest but minimal. Treat it as an illustration of the shape of the fix, not a finished control.
The checks, and the primitives behind them
| identity | IdentityKit who is calling | The agent acts under a signed mandate. If the mandate is altered in flight, the signature fails and nothing runs. |
| mandate | MandateKit what the owner allowed | Merchant allow-list and per-purchase ceiling, checked in code. No instruction in a product page can add a merchant or raise a limit. |
| budget | BudgetGuard what it may spend | A cumulative cap across the whole session. Spend is counted at the gateway, not asked of the model. |
| witness | WitnessKit what it did | Every verdict, allow or block, is sealed into an append-only hash chain. Edit a past entry and the chain stops verifying. |
The four primitives — IdentityKit (who), MandateKit (may), BudgetGuard (spends), WitnessKit (did) — are open source on GitHub. This page is a v0 illustration of the idea, not a packaged product.