major labs
Essay 04 · Commerce · 10 min read

Inside the commerce layer

By Charlie Major · 2026-06-16

The commerce layer is the loudest in the agentic stack. AP2 is in FIDO. ACP is live on Etsy. Mastercard's Verifiable Intent is in production. The x402 protocol cleared 165 million transactions last month. The protocols are hardening fast.

What sits on top of those protocols is mostly hand-rolled. The verifier that checks a mandate actually authorizes a basket. The audit trail that survives a chargeback inquiry. The refund flow that does not exist yet. The budget governance that catches the agent burning seven figures in a weekend. Each one is a gap operators are filling alone today, and the field is wide open for shared tooling. Major Labs ships into three of those gaps with MandateKit and BudgetGuard in Q4 2026, and deliberately stays out of the fourth. This essay walks through each gap and names what the products actually do.


The mandate scope verification problem

A mandate is a signed payload from a user that authorizes an agent to spend on their behalf inside specific limits. The payload names the agent, the amount cap, the allowed categories, the expiry, sometimes the allowed merchants, and the user's cryptographic signature confirming all of it. AP2 ships this. Verifiable Intent ships this. Most of the agent-payment protocols converging in the FIDO working group ship this.

What none of them ship is the verifier that does the hardest part of the job. The protocol guarantees the mandate exists, the signature is valid, the amount is under the cap, the category matches a string in the allowed list, and the expiry has not passed. That is the easy part. The hard part is comparing the agent's stated intent to the actual basket and deciding whether they match.

Concretely. An agent holds a valid mandate that authorizes "Apparel up to $500." The agent reports its intent as "buy running shoes for half marathon training." The basket presented to the merchant is a $480 luxury watch from a retailer that classifies itself as Apparel.

Every protocol-level check passes. The signature is valid. The cap is unbroken. The category string matches. The expiry is in the future. The transaction would clear if you used the protocol's reference verification logic.

The transaction is also clearly a violation of the user's authorization. Nobody who authorized "running shoes for a half marathon" meant to authorize a $480 watch.
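The protocol-level checks can be sketched in a few lines of Python. This is an illustration, not protocol code: the `Mandate` and `Basket` fields and the `protocol_checks` helper are hypothetical names, signature validation is reduced to a boolean flag, and the dates are placeholders. It shows the watch basket passing every deterministic check, because none of them reads the stated intent.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Mandate:
    # Hypothetical fields; real mandate schemas differ by protocol.
    cap_usd: float
    allowed_categories: frozenset
    expires_at: datetime
    signature_valid: bool  # stand-in for real cryptographic verification


@dataclass(frozen=True)
class Basket:
    description: str
    amount_usd: float
    merchant_category: str  # self-reported by the merchant


def protocol_checks(mandate: Mandate, basket: Basket, now: datetime) -> dict:
    """Run only the deterministic checks. The agent's intent is never consulted."""
    return {
        "signature": mandate.signature_valid,
        "cap": basket.amount_usd <= mandate.cap_usd,
        "category": basket.merchant_category in mandate.allowed_categories,
        "expiry": now < mandate.expires_at,
    }


mandate = Mandate(
    cap_usd=500.0,
    allowed_categories=frozenset({"Apparel"}),
    expires_at=datetime(2026, 6, 30, tzinfo=timezone.utc),  # placeholder date
    signature_valid=True,
)
watch = Basket("luxury watch", 480.0, "Apparel")
results = protocol_checks(
    mandate, watch, now=datetime(2026, 6, 16, tzinfo=timezone.utc)
)
print(results)
```

Every value in `results` is true for the watch. The stated intent, "buy running shoes for half marathon training," is not even an input to the function.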

The semantic alignment between intent and basket is where mandate verification actually lives. The reference protocol logic does not address it. Operators rolling their own verifier either include a basket-intent check or they do not. Most do not. The ones who do all use slightly different scoring. The result is an interoperability problem that becomes a fraud surface within twelve months of AP2 going mainstream.

MandateKit ships the reference verifier that closes this gap.


The refund and dispute void

The buy flow ships. The return flow does not.

The card networks have evolved a sophisticated framework for chargeback reason codes over forty years. Visa and Mastercard each publish dozens of them. Each one defines who is liable for the dispute, what evidence the merchant has to produce, and what time windows apply. The framework exists because the human-initiated transaction, card-present or card-not-present, had decades to surface the dispute categories that needed codes.

The agent-initiated transaction has none of this. There is no chargeback reason code for "agent purchased wrong item." There is no code for "agent exceeded user-intended scope despite passing mandate scope check." There is no code for "agent paid the wrong merchant because of an MCP server compromise." The merchant who receives a chargeback from an agent-initiated transaction today is fighting it under codes designed for a human cardholder, which is the wrong frame.

The refund flow itself is worse. If an agent buys a $400 item and the user wants it returned, the merchant runs the normal return process. The refund goes back to the underlying funding source. The agent's mandate balance is not automatically credited. The agent does not know the refund happened unless the merchant notifies it. The next purchase the agent attempts can be counted as over the cap, even though the money is back with the user, because the agent's view of available balance is stale.
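A small sketch makes the stale-balance problem concrete. It takes one reading of the paragraph above: the cap is cumulative, and the agent keeps its own running total. The `MandateLedger` class and its methods are hypothetical, not part of any protocol or product.

```python
class MandateLedger:
    """Hypothetical per-mandate spend tracker kept on the agent side."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def available(self) -> float:
        return self.cap_usd - self.spent_usd

    def record_purchase(self, amount_usd: float) -> bool:
        """Book the spend and return True only if it fits under the cap."""
        if amount_usd > self.available():
            return False
        self.spent_usd += amount_usd
        return True

    def record_refund(self, amount_usd: float) -> None:
        """Credit a refund. This only runs if someone notifies the agent."""
        self.spent_usd = max(0.0, self.spent_usd - amount_usd)


ledger = MandateLedger(cap_usd=500.0)
ledger.record_purchase(400.0)

# The merchant refunds $400 to the funding source, but no event reaches
# the ledger, so the agent still believes only $100 is available.
stale_decision = ledger.record_purchase(200.0)

# Once a refund notification arrives, the same purchase fits.
ledger.record_refund(400.0)
fresh_decision = ledger.record_purchase(200.0)

print(stale_decision, fresh_decision)
```

The only difference between the two decisions is whether the refund event reached the agent. That notification is the primitive that is missing today.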

This is the deepest gap in commerce because the regulators will look here first. The CFPB in the United States, the EBA in the European Union, and the FCA in the UK have all signaled that agent-initiated transactions need clearer dispute frameworks. The category waits for the first regulatory mover.

Major Labs is not shipping a refund product in 2026. The category is too early, the regulators have not picked their angle, and shipping the wrong refund primitive locks operators into a structure they will need to migrate off. We are publishing about the gap in the State of Agent Commerce Q4 2026 report. We are not building into it yet.


Audit trails that survive a processor inquiry

When a chargeback gets disputed, the merchant has to produce an evidence pack. Visa and Mastercard publish detailed requirements for what counts as evidence. The pack typically includes the transaction record, the authorization signal, delivery proof for physical goods, customer authentication data (3DS, AVS, cardholder verification method), and the merchant's policy disclosures at the time of sale.

For agent-initiated transactions, the evidence pack also needs:

  • The agent's identity and credentials at the time of the transaction
  • The mandate that authorized the spend, including the signature and signing time
  • The user's underlying authorization for the mandate (proof the user actually created it)
  • The agent's reasoning chain or summary showing how it decided to buy
  • The basket evidence including the SKU, the merchant's product description at the time of sale, and any item-substitution events

Nobody has a standardized format for this evidence pack. Operators producing it today are exporting fields from their own logs, formatting them into a PDF, and attaching the PDF to the chargeback response. The acquirers accepting these submissions do not have a structured way to parse them. The card networks have not added agent evidence fields to their reason code matrices.

MandateKit ships the evidence pack format as a reference implementation. The format is JSON, signed by the operator, and constructable from the verification log MandateKit already generates. When a merchant disputes an agent-initiated chargeback, the operator pulls the evidence pack from MandateKit's audit log and submits it as supporting documentation. The format is open and any acquirer or network can adopt it.
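As a rough picture of what a signed JSON evidence pack could look like, here is a sketch built on the five items listed above. Everything in it is an assumption: the field names, the `build_evidence_pack` and `verify_evidence_pack` helpers, and the sample values are hypothetical, and HMAC-SHA256 stands in for whatever signature scheme the real format uses. It is not MandateKit's published schema.

```python
import hashlib
import hmac
import json


def canonical_json(obj: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so signatures are stable."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_evidence_pack(log_record: dict, operator_key: bytes) -> dict:
    """Assemble the five agent-specific items and attach an operator signature.

    HMAC-SHA256 is a stdlib stand-in. A real format would more likely use an
    asymmetric signature so acquirers can verify without a shared secret.
    """
    body = {
        "agent": log_record["agent"],
        "mandate": log_record["mandate"],
        "user_authorization": log_record["user_authorization"],
        "reasoning_summary": log_record["reasoning_summary"],
        "basket": log_record["basket"],
    }
    signature = hmac.new(operator_key, canonical_json(body), hashlib.sha256).hexdigest()
    return {"body": body, "operator_signature": signature}


def verify_evidence_pack(pack: dict, operator_key: bytes) -> bool:
    """Recompute the signature over the body and compare in constant time."""
    expected = hmac.new(
        operator_key, canonical_json(pack["body"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, pack["operator_signature"])


# Placeholder values throughout; none of these identifiers are real.
OPERATOR_KEY = b"demo-operator-key"
record = {
    "agent": {"id": "agent-123", "credential": "example-credential"},
    "mandate": {"hash": "example-mandate-hash", "signed_at": "2026-06-01T12:00:00Z"},
    "user_authorization": {"method": "example-method"},
    "reasoning_summary": "Selected running shoes matching the stated intent.",
    "basket": {"sku": "SHOE-001", "description": "Running shoes", "substitutions": []},
}
pack = build_evidence_pack(record, OPERATOR_KEY)
print(verify_evidence_pack(pack, OPERATOR_KEY))
```

Canonical serialization matters here. Without sorted keys and fixed separators, two parties can hold the same pack and compute different signatures.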

This is a small piece of the verifier product's footprint, but it is the piece that gets cited first by chargeback teams.


What MandateKit does

MandateKit is the open-source SDK and hosted service for AP2 and Verifiable Intent mandates, at the layer above the protocol.

The SDK ships in Python and TypeScript and exposes four primitives.

The compiler. Natural-language agent constraints become a JSON Schema mandate and then an EdDSA-signed payload. A user typing "Allow this agent to buy running shoes from any apparel retailer up to $500 per transaction, expires June 30" produces a mandate the protocol can accept. The compiler is open-source and runs locally; the user's private signing key never leaves their device.

The verifier. Given a mandate and a candidate transaction, the verifier returns a scope-match score from 0 to 1, an intent-basket alignment score from 0 to 1, the matched constraints, and a rationale string explaining the result. Operators can set their own decision thresholds on top of the scores. The reference implementation uses a fine-tuned model in the 3B to 7B parameter range for the intent-basket alignment step; the rest of the verifier is deterministic.
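Setting thresholds on top of the scores might look like the sketch below. The `VerificationResult` shape mirrors the four outputs described above, but its field names, the `decide` policy, the threshold values, and the example scores are all assumptions made for illustration. None of it is MandateKit's API or real model output.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VerificationResult:
    # Mirrors the four verifier outputs; field names are assumed.
    scope_match: float       # 0 to 1
    intent_alignment: float  # 0 to 1
    matched_constraints: List[str]
    rationale: str


def decide(result: VerificationResult, approve_at: float = 0.8,
           review_at: float = 0.5) -> str:
    """Operator-side policy: the weaker of the two scores drives the decision."""
    weakest = min(result.scope_match, result.intent_alignment)
    if weakest >= approve_at:
        return "approve"
    if weakest >= review_at:
        return "review"
    return "decline"


# Illustrative scores only.
watch = VerificationResult(
    scope_match=1.0,
    intent_alignment=0.1,
    matched_constraints=["cap", "category", "expiry"],
    rationale="Basket is a watch; stated intent is running shoes.",
)
shoes = VerificationResult(
    scope_match=0.95,
    intent_alignment=0.9,
    matched_constraints=["cap", "category", "expiry"],
    rationale="Basket matches the stated intent.",
)
print(decide(watch), decide(shoes))
```

Taking the minimum is one conservative choice. It means a perfect scope match cannot rescue a basket that does not fit the intent, which is exactly the luxury watch case.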

The registry. A hosted service that stores active mandates and propagates revocation events within seconds rather than the hours or days most implementations achieve today. Operators query the registry to check whether a mandate is still active before authorizing a transaction. The registry is multi-tenant and the operator's mandates are isolated by default.

The audit log. Every verification creates a structured log record including the mandate hash, the transaction details, the scores, the rationale, and the timestamp. The log is the source of truth for the evidence pack described above. Operators can export the log for their own retention or rely on MandateKit's hosted retention.

Pricing. The SDK is free and open source. The hosted registry and audit log are free for development use, $99-$499 per month for team tiers, $999-$4,999 per month for enterprise tiers with per-mandate fees on top at high volumes.


What BudgetGuard does

BudgetGuard is the spend governance gateway. It sits between an operator's code and the LLM API, watches every call, and stops the runaway patterns that have cost operators seven-figure surprise bills.

The product wraps any of the major LLM clients (Anthropic, OpenAI, Google, self-hosted) with a thin middleware layer. The layer adds five capabilities.

Per-task budgets. A task is defined by the operator (a conversation, a customer session, a workflow execution). BudgetGuard tracks token spend against the task budget and refuses calls that would exceed it. The refusal is graceful: the LLM client returns a structured error the operator's code can handle, rather than the API failing in unexpected ways.
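A minimal sketch of the per-task budget idea, assuming the caller can estimate tokens before a call and read actual usage after it. `TaskBudget`, `BudgetExceeded`, and `guarded_call` are hypothetical names, not BudgetGuard's interface, and the LLM call is faked with a lambda.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetExceeded:
    """Structured refusal the caller can branch on instead of handling a crash."""
    task_id: str
    budget_tokens: int
    used_tokens: int
    requested_tokens: int


class TaskBudget:
    def __init__(self, budget_tokens_by_task: dict) -> None:
        self._budget = dict(budget_tokens_by_task)
        self._used = {task_id: 0 for task_id in self._budget}

    def guarded_call(self, task_id: str, estimated_tokens: int, call):
        """Run `call` only if the estimate fits in the remaining task budget.

        `call` must return a (response, actual_tokens) tuple.
        """
        used = self._used[task_id]
        budget = self._budget[task_id]
        if used + estimated_tokens > budget:
            return BudgetExceeded(task_id, budget, used, estimated_tokens)
        response, actual_tokens = call()
        self._used[task_id] = used + actual_tokens
        return response


def fake_llm():
    # Stand-in for a real client call.
    return ("ok", 600)


budgets = TaskBudget({"session-42": 1000})
first = budgets.guarded_call("session-42", 600, fake_llm)
second = budgets.guarded_call("session-42", 600, fake_llm)
print(first, second)
```

The second call never reaches the model. The caller gets a `BudgetExceeded` value back and can decide whether to summarize, escalate, or stop.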

Loop detection. When an agent calls the LLM with prompts that hash-match prior calls in the same task at high frequency, BudgetGuard flags the pattern as a loop. The operator can set thresholds for loop detection sensitivity. The default catches the $1.6 million weekend pattern within the first hundred dollars rather than the first million.
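Hash-matching prompts inside a sliding window is one simple way to express this. The sketch below is an assumption about how such a detector could work, not BudgetGuard's implementation. `LoopDetector`, `max_repeats`, and `window_seconds` are hypothetical names, and exact-match hashing would miss loops where the prompt changes slightly on each pass.

```python
import hashlib
from collections import defaultdict, deque


class LoopDetector:
    """Flag a task when the same prompt hash recurs too often in a time window."""

    def __init__(self, max_repeats: int = 5, window_seconds: float = 60.0) -> None:
        self.max_repeats = max_repeats
        self.window_seconds = window_seconds
        # (task_id, prompt_hash) -> timestamps of recent matching calls
        self._seen = defaultdict(deque)

    def observe(self, task_id: str, prompt: str, now: float) -> bool:
        """Record one call; return True if it pushes the task over the threshold."""
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        times = self._seen[(task_id, digest)]
        times.append(now)
        # Drop calls that have fallen out of the sliding window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_repeats


detector = LoopDetector(max_repeats=3, window_seconds=10.0)
flags = [
    detector.observe("task-1", "retry the same tool call", now=float(t))
    for t in range(6)
]
print(flags)
```

With `max_repeats=3`, the fourth identical prompt inside the window is the first one flagged. Passing `now` explicitly keeps the detector testable without touching the clock.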

Spend velocity anomaly detection. Token spend rates that exceed the operator's baseline by 10x trigger a velocity alert. The detection is per-task, per-customer, and per-vendor. An agent suddenly burning Anthropic tokens at ten times its usual rate fires an alert before the bill arrives.
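The 10x rule reduces to a comparison against a rolling baseline. The sketch below assumes the baseline is a plain mean of earlier readings and that keys are tuples such as `("vendor", "anthropic")`. The `VelocityMonitor` class and its parameters are hypothetical, and the rates are made up.

```python
class VelocityMonitor:
    """Compare the latest spend rate with a baseline mean, tracked per key."""

    def __init__(self, multiplier: float = 10.0, min_samples: int = 3) -> None:
        self.multiplier = multiplier
        self.min_samples = min_samples
        self._history = {}

    def observe(self, key, tokens_per_minute: float) -> bool:
        """Return True when the new rate exceeds multiplier times the baseline."""
        history = self._history.setdefault(key, [])
        alert = False
        if len(history) >= self.min_samples:
            baseline = sum(history) / len(history)
            alert = tokens_per_minute > self.multiplier * baseline
        if not alert:
            # Only normal readings feed the baseline, so a runaway burst
            # cannot raise its own threshold.
            history.append(tokens_per_minute)
        return alert


monitor = VelocityMonitor()
key = ("vendor", "anthropic")
alerts = [monitor.observe(key, rate) for rate in (1000, 1200, 800, 15000)]
print(alerts)
```

The same monitor can be fed per-task and per-customer keys alongside the per-vendor one. Requiring a few samples before alerting avoids firing on the first call of a new task.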

Kill switches. Programmatic and dashboard-driven shutoffs at task, customer, and operator level. The kill switch propagates within one second from dashboard to API gateway. An operator who sees an anomaly in their dashboard can stop the spend before the next call clears.
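The three shutoff levels can be modeled as one lookup before each call. This sketch covers only the gateway-side check. It does not model propagation from the dashboard, and `KillSwitch` and its methods are hypothetical names.

```python
class KillSwitch:
    """Track shutoffs at operator, customer, and task scope."""

    def __init__(self) -> None:
        self._killed = set()

    def kill(self, scope: str, identifier: str) -> None:
        self._killed.add((scope, identifier))

    def restore(self, scope: str, identifier: str) -> None:
        self._killed.discard((scope, identifier))

    def allows(self, operator_id: str, customer_id: str, task_id: str) -> bool:
        """A call proceeds only if none of its three scopes is shut off."""
        scopes = (
            ("operator", operator_id),
            ("customer", customer_id),
            ("task", task_id),
        )
        return not any(scope in self._killed for scope in scopes)


switch = KillSwitch()
before = switch.allows("op-1", "cust-7", "task-99")

# Shutting off one customer blocks all of that customer's tasks.
switch.kill("customer", "cust-7")
blocked = switch.allows("op-1", "cust-7", "task-99")
other_customer = switch.allows("op-1", "cust-8", "task-100")
print(before, blocked, other_customer)
```

Checking the widest scope and the narrowest in the same pass means an operator-level kill needs no per-task bookkeeping.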

Per-customer cost attribution. Every LLM call carries a customer identifier (set by the operator). BudgetGuard aggregates spend per customer and exposes the data through dashboards, exports, and a billing API. Operators selling agent-powered products to their own customers can attribute costs accurately, which is the foundation of any usage-based pricing model.
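Attribution is an aggregation over tagged calls. The sketch below assumes each call record carries a customer identifier and token counts. The prices are illustrative, not real vendor pricing, and `call_cost` and `attribute` are hypothetical helpers.

```python
from collections import defaultdict
from decimal import Decimal

# Illustrative prices in USD per million tokens; not real vendor pricing.
PRICE_PER_MILLION = {
    "model-a": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Price one call from its token counts."""
    price = PRICE_PER_MILLION[model]
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / Decimal(1_000_000)


def attribute(calls: list) -> dict:
    """Sum call costs per customer identifier."""
    totals = defaultdict(Decimal)
    for call in calls:
        totals[call["customer_id"]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return dict(totals)


calls = [
    {"customer_id": "acme", "model": "model-a",
     "input_tokens": 1_000_000, "output_tokens": 200_000},
    {"customer_id": "globex", "model": "model-a",
     "input_tokens": 500_000, "output_tokens": 100_000},
    {"customer_id": "acme", "model": "model-a",
     "input_tokens": 100_000, "output_tokens": 0},
]
totals = attribute(calls)
print(totals)
```

`Decimal` is used instead of floats so that per-customer totals add up exactly when they feed an invoice.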

The $1.6 million weekend pattern is the canonical example of why this category needs to exist. An agent at a mid-market SaaS company got stuck in a loop on a Friday evening. Nobody noticed until Monday morning. The Anthropic bill for the weekend was $1.6 million. The operator had no per-task budget cap, no loop detection, no spend velocity alert, and no kill switch. BudgetGuard exists to make that combination impossible to ship by accident.

Pricing. The SDKs are free and open source. The hosted dashboard and alerting are free under $X of monthly spend, $99 per month for the team tier, and $499 per month for the agency tier with multi-org support. Enterprise pricing is layered on top of usage at scale.


What ships and when

Both products land in Q4 2026.

MandateKit ships first, in late October, to align with the expected AP2 spec finalization in Q3 2026. The SDK is open-source from day one. The hosted registry and audit log open for paid tiers two weeks later.

BudgetGuard ships in November. The Python and TypeScript SDKs ship together. The hosted dashboard opens in private beta for the first fifty Discord members. Public availability follows in early December.

Both products are built on top of work the Bench is already doing. The Verifier agent that runs inside MandateKit is the same model architecture that will eventually power Major Labs Identity. The audit log format that MandateKit ships will be the foundation of the evidence pack standard we push at the FIDO and EMVCo working groups in 2027.

The order in this layer matters. Discovery closes first because the pain is acute today and the data is collectable in weeks. Commerce closes second because the platforms have shipped the rails and the gaps above them are obvious. The buyer for MandateKit and BudgetGuard is the operator already in production with agentic commerce, who has felt at least one of the failure modes above. Those operators exist now, in number, and they are buying tools.

The next essay goes inside the observability layer. Per-customer cost attribution at scale, why Helicone's retreat to maintenance mode is a category signal not a company signal, and what the production agent trace audit standard needs to look like.

See you Tuesday.

— Charlie

Charlie Major writes Major Matters and joined Mastercard in April 2026. Major Labs is independent of Mastercard and operates separately from Major Matters. Any opinions in these essays are Charlie's own.
