Essay 05 · Observability · 10 min read

Inside the observability layer

By Charlie Major · 2026-06-19

In March 2026, Helicone went into maintenance mode. That sentence carries more weight than it looks.

Helicone was the developer-friendly, low-cost, open-source entrant in LLM observability. Free tier, $20 a month for serious usage, self-hostable, comprehensive dashboards. It was the tool the long-tail developer reached for when LangSmith was too LangChain-flavored and Arize was too enterprise-priced. Its retreat is not a story about a company. It is a story about a category.

The middle of LLM observability is structurally unstable.

This essay walks the layer, names why the middle is dying, and explains what BudgetGuard does that pure observability tools cannot.

Status · Roadmap

BudgetGuard is in build, planned for November 2026. This essay describes the product we are committing to in public and the design behind it, not a shipping one. Dates and specifics will move.

What Helicone's retreat actually signals

The category has plenty of capable tools. Langfuse, the canonical open-source option. LangSmith, LangChain's commercial complement. Arize and its open-source Phoenix. Braintrust, popular at the frontier labs for evaluation. Weights and Biases Weave. A long tail of smaller tools that wrap OpenTelemetry with vendor-specific dashboards. The category is not under-served. It is mis-served.

The mis-service comes from a structural mismatch. Observability tools, in their classic form, show you what happened. They surface the trace, the timing, the cost. They are dashboards. The dashboard is necessary. The dashboard is not sufficient.

The operator running LLM-powered features in production does not have a dashboard problem. The operator has a governance problem. They need the dashboard plus the ability to act on what the dashboard shows them, in real time, without writing custom middleware. They need budget caps that get enforced before the call goes out, not noticed after the bill arrives. They need anomaly alerts that fire on spend velocity, not just error rates. They need kill switches that propagate in seconds. The dashboard is the floor. The governance is the product.

Helicone's positioning was a great dashboard at a great price. The market signal is that great dashboards at great prices are not a category that sustains commercial focus. The buyer who used Helicone in production either graduated to an enterprise platform (because the governance gap was unbearable) or rolled their own (because the existing tools were not solving the right problem at any price). The middle eroded.

The category is consolidating around two ends. Enterprise observability that bundles evals, safety scoring, and procurement-friendly governance. And vendor-native dashboards from Anthropic, OpenAI, and Google that show you what their model did but cannot show you the cross-vendor picture. The middle is dying because pure observation is a feature, not a product.

That gap is where BudgetGuard sits, and it is why we deliberately did not call BudgetGuard an observability tool.

Per-customer cost attribution at scale

When an operator's product uses LLMs to serve their own customers, the cost of every end-user interaction needs to attribute back to the customer who triggered it. Sounds simple. It is not.

A typical operator's LLM cost goes through at least four layers of indirection. The end user sends a prompt. The operator's code routes it through a multi-step agent workflow. Each step may call a different model (Claude for reasoning, GPT for embeddings, Gemini for vision, a local Llama for classification). Each call generates input and output tokens. The output of step one becomes the input of step two, plus whatever system prompt the operator adds. By step five, the customer's original prompt has produced fifty thousand tokens across four vendors, and the operator's monthly bill has line items in four different invoices with different formats and different aggregation levels.

To attribute that cost back to the original customer, the operator needs:

A consistent identifier that flows through every call. Most operators tag this on the prompt itself, in a metadata field. The challenge is that the field is vendor-specific. Anthropic's metadata field is not OpenAI's.

A timestamp aligned across vendors. When the Anthropic call happens at 14:03:11.482 and the OpenAI call at 14:03:11.503, both belong to the same customer interaction. The operator's system has to know that.

A token accounting model that distinguishes the customer's prompt from the system prompt the operator added. A 200-token customer prompt that triggers a 2,400-token system prompt should not bill the customer for 2,600 tokens; the operator absorbs the 2,400.

A retention strategy. At even modest scale (10,000 customers, 50 interactions per day per customer, 5 LLM calls per interaction), the operator generates 2.5 million attribution records per day. Six months of those (about 180 days), at full fidelity, is roughly 450 million rows, close to half a billion. The operator either pays for the storage or makes hard choices about aggregation.

A query model that supports the business questions the finance team will ask. Cost per customer last quarter. Cost per feature this month. Customers approaching their plan limit this week. Variance versus forecast. Each question is a different aggregation over the same underlying data.
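The requirements above can be sketched as a single record shape plus a roll-up. This is a minimal illustration, not BudgetGuard's schema: the AttributionRecord fields, billable_input_tokens, and cost_by are all hypothetical names, and the costs are made-up numbers chosen only to show the aggregation.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AttributionRecord:
    """One LLM call, tagged for later roll-up. Every field name here is illustrative."""
    customer_id: str
    interaction_id: str   # ties calls to different vendors back to one interaction
    feature: str
    vendor: str
    called_at: datetime   # kept in UTC so timestamps from different vendors line up
    customer_tokens: int  # tokens from the customer's own prompt
    system_tokens: int    # tokens the operator added (system prompt, scaffolding)
    cost_usd: float


def billable_input_tokens(record):
    # The operator absorbs the system prompt; only the customer's tokens are billed.
    return record.customer_tokens


def cost_by(records, key):
    """Sum cost over any record attribute: customer_id, feature, vendor, and so on."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    return dict(totals)


# Two calls, 21 ms apart, to different vendors, for the same customer interaction.
records = [
    AttributionRecord("cust_42", "int_1", "summarize", "anthropic",
                      datetime(2026, 6, 19, 14, 3, 11, 482000, tzinfo=timezone.utc),
                      customer_tokens=200, system_tokens=2400, cost_usd=0.25),
    AttributionRecord("cust_42", "int_1", "summarize", "openai",
                      datetime(2026, 6, 19, 14, 3, 11, 503000, tzinfo=timezone.utc),
                      customer_tokens=200, system_tokens=0, cost_usd=0.5),
]

per_customer = cost_by(records, "customer_id")
per_vendor = cost_by(records, "vendor")
```

The same cost_by call answers cost per customer, cost per feature, or cost per vendor, which is the point of the query-model requirement: one underlying table, many aggregations.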

Most operators roll their own attribution pipeline today. Some do it well. Many do it badly. The bad ones look fine in month one and break around month nine when the storage bill or the query latency makes the system unmaintainable.
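The retention arithmetic in the list above is easy to check, and it shows why month nine is worse than month six. The sketch assumes a 30-day month; the constant and function names are illustrative.

```python
CUSTOMERS = 10_000
INTERACTIONS_PER_CUSTOMER_PER_DAY = 50
CALLS_PER_INTERACTION = 5
DAYS_PER_MONTH = 30  # assumption, used only for a round estimate

records_per_day = (
    CUSTOMERS * INTERACTIONS_PER_CUSTOMER_PER_DAY * CALLS_PER_INTERACTION
)


def rows_after(months):
    """Rows retained at full fidelity after the given number of months."""
    return records_per_day * DAYS_PER_MONTH * months


print(f"{records_per_day:,} records per day")
print(f"{rows_after(6):,} rows after six months")
print(f"{rows_after(9):,} rows after nine months")
```

Growth is linear in time, so an operator who does not aggregate or expire rows carries half again as many rows at month nine as at month six.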

BudgetGuard will ship per-customer attribution as a first-class primitive. The customer identifier travels through every LLM call, attached at the SDK layer and preserved across vendors. The aggregation runs in the hosted layer at scale. The query model supports the business questions out of the box. Operators do not have to build it.
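One way to picture an identifier attached at the SDK layer is a context variable that every outgoing request reads. This is a sketch under assumptions, not BudgetGuard's SDK: customer_scope and tag_request are hypothetical helpers, and the vendor names and field paths in VENDOR_TAG_FIELDS are placeholders rather than any real vendor's API.

```python
import contextvars
from contextlib import contextmanager

_customer_id = contextvars.ContextVar("customer_id", default=None)


@contextmanager
def customer_scope(customer_id):
    """Set the customer identifier for every LLM call made inside the block."""
    token = _customer_id.set(customer_id)
    try:
        yield
    finally:
        _customer_id.reset(token)


# Placeholder field paths. Real vendors name and nest this field differently,
# so each path would have to be checked against that vendor's documentation.
VENDOR_TAG_FIELDS = {
    "vendor_a": ("metadata", "customer_ref"),
    "vendor_b": ("user_tag",),
}


def tag_request(vendor, request):
    """Return a copy of the request with the current customer identifier attached."""
    customer_id = _customer_id.get()
    if customer_id is None:
        raise RuntimeError("no customer scope is active")
    tagged = dict(request)
    target = tagged
    path = VENDOR_TAG_FIELDS[vendor]
    for key in path[:-1]:
        # Copy nested dicts on the way down so the caller's request is not mutated.
        target[key] = dict(target.get(key, {}))
        target = target[key]
    target[path[-1]] = customer_id
    return tagged


with customer_scope("cust_42"):
    request_a = tag_request("vendor_a", {"model": "model-a"})
    request_b = tag_request("vendor_b", {"model": "model-b"})
```

The operator's workflow code never passes the identifier by hand. It enters the scope once per customer interaction, and the per-vendor mapping is the only place the vendor differences live.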

The production agent trace audit standard

This is the deeper gap, and the one that gets cited first when the regulators arrive.

An agent trace is the record of what the agent did during a session. The user's prompt, the model calls the agent made, the tools the agent invoked, the decisions the agent took, the final action. For human-debugging purposes, a trace is a log file. For audit purposes, a trace has to be something more.

A production agent trace audit standard needs to specify:

The fields a trace must contain. At minimum: the agent identity, the mandate context, the session identifier, the prompt history, the tool calls with arguments and results, the model calls with parameters, the final action and its outcome.

The format the trace is serialized in. Today, every tool emits a different shape. OpenTelemetry-compatible JSON is the closest thing to a common ground, but the LLM-specific fields are not standardized inside it.

The signature mechanism that makes the trace tamper-evident. Without a signature, the operator can edit a trace after the fact and the auditor cannot tell. With a signature, the operator can prove the trace is exactly what was recorded at the time of action.

The retention rules. Different regulatory regimes will impose different retention obligations. The EU AI Act draft enforcement guidance suggests two years for high-risk agentic systems. US regulators have not committed to a number. Operators need a retention model flexible enough to accommodate both.

The cross-vendor consistency requirement. The trace standard must work when the agent uses Claude for one step and GPT for the next. Vendor-specific trace formats undermine the whole point.
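To make the tamper-evidence requirement concrete, here is a minimal sketch that serializes a trace to canonical JSON and signs it with an HMAC. It is not the draft standard: the field names are illustrative stand-ins for the minimum fields listed above, and HMAC is a simplifying assumption. A shared-secret HMAC only proves integrity to someone who holds the key, so a real audit format would more plausibly use an asymmetric signature that the auditor can verify without being able to forge.

```python
import hashlib
import hmac
import json


def canonical_bytes(trace):
    # Sorted keys and fixed separators give the same bytes for the same content.
    return json.dumps(trace, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_trace(trace, key):
    """Return a hex HMAC-SHA256 over the canonical serialization of the trace."""
    return hmac.new(key, canonical_bytes(trace), hashlib.sha256).hexdigest()


def verify_trace(trace, signature, key):
    """True only if the trace is byte-for-byte what was signed."""
    return hmac.compare_digest(sign_trace(trace, key), signature)


# Illustrative trace; field names and values are hypothetical.
trace = {
    "agent_id": "agent-7",
    "mandate": {"scope": "refunds", "max_usd": 50},
    "session_id": "sess-123",
    "prompts": ["Refund order 991"],
    "tool_calls": [{"name": "lookup_order", "args": {"id": 991}, "result": "found"}],
    "model_calls": [{"vendor": "vendor_a", "model": "model-a", "temperature": 0}],
    "final_action": {"type": "refund", "outcome": "approved"},
}

key = b"demo-key-not-for-production"
signature = sign_trace(trace, key)

# Editing the trace after the fact invalidates the signature.
edited = dict(trace, final_action={"type": "refund", "outcome": "denied"})
```

Because the signature covers the whole serialized record, it is indifferent to which vendor served which step. That is one reason to sign at the trace level rather than per vendor call.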

The fact that none of this is standardized today is not because the engineering is hard. It is because nobody has decided to publish a draft. Major Labs publishes the draft as part of the State of Agent Commerce Q4 2026 report, with BudgetGuard implementing it as the reference. The format is open. Any other observability or governance tool can adopt it. The intent is not to own the standard. The intent is to make sure one exists before the regulators ship one we cannot live with.

What BudgetGuard does that pure observability does not

The product itself is covered in Essay 04, but the difference between BudgetGuard and the observability category deserves its own framing here.

Observability tools tell you what happened. BudgetGuard prevents what should not happen.

Per-task budgets enforce caps before the call goes out. The operator sets a budget per customer session, per workflow execution, or per feature. BudgetGuard tracks spend against the budget and refuses calls that would exceed it. The refusal is graceful. The LLM client returns a structured error the operator's code can handle. No surprise bill. No after-the-fact discovery. The cap is enforced upstream of the cost.
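A cap enforced upstream of the cost can be sketched in a few lines. This is an illustration of the pattern, not BudgetGuard's client: Budget, BudgetExceeded, and guarded_call are hypothetical names, and the sketch assumes the caller can estimate a call's cost before sending it.

```python
class BudgetExceeded(Exception):
    """Structured refusal that the operator's code can catch and handle."""

    def __init__(self, scope, limit_usd, spent_usd, requested_usd):
        super().__init__(f"budget exceeded for {scope}")
        self.scope = scope
        self.limit_usd = limit_usd
        self.spent_usd = spent_usd
        self.requested_usd = requested_usd

    def to_dict(self):
        return {
            "error": "budget_exceeded",
            "scope": self.scope,
            "limit_usd": self.limit_usd,
            "spent_usd": self.spent_usd,
            "requested_usd": self.requested_usd,
        }


class Budget:
    def __init__(self, scope, limit_usd):
        self.scope = scope
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def reserve(self, estimated_cost_usd):
        # Refuse before any money is spent if this call would cross the cap.
        if self.spent_usd + estimated_cost_usd > self.limit_usd:
            raise BudgetExceeded(
                self.scope, self.limit_usd, self.spent_usd, estimated_cost_usd
            )
        self.spent_usd += estimated_cost_usd


def guarded_call(budget, estimated_cost_usd, call):
    """Run the LLM call only if the budget can cover its estimated cost."""
    budget.reserve(estimated_cost_usd)
    return call()
```

A caller wraps each request, for example guarded_call(session_budget, 0.02, send_request), and handles BudgetExceeded like any other recoverable error. A fuller version would reconcile the estimate against the actual cost once the response arrives.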

Loop detection runs at call time. When an agent calls the LLM with prompts that match prior calls in the same session at high frequency, BudgetGuard flags it as a loop and can either alert or terminate. The default catches the $1.6 million weekend pattern within the first hundred dollars rather than the first million. Observability dashboards would have shown the same pattern, in a chart, on Monday morning.
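Call-time loop detection can be approximated by counting repeated prompts per session inside a sliding window. The sketch below is a simplification under stated assumptions: it matches prompts by exact hash, whereas a production detector would likely need fuzzier matching, and LoopDetector with its thresholds is a hypothetical name and default, not BudgetGuard's.

```python
import hashlib
from collections import defaultdict, deque


class LoopDetector:
    def __init__(self, max_repeats=3, window_seconds=60.0):
        self.max_repeats = max_repeats
        self.window_seconds = window_seconds
        # (session id, prompt hash) -> timestamps of recent identical calls
        self._seen = defaultdict(deque)

    def observe(self, session_id, prompt, now):
        """Record a call; return True if it looks like a loop."""
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        times = self._seen[(session_id, digest)]
        times.append(now)
        # Drop calls that have aged out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_repeats


detector = LoopDetector(max_repeats=3, window_seconds=60.0)
flags = [detector.observe("sess-1", "retry the same step", t) for t in (0, 1, 2, 3)]
```

The caller passes the clock in, which keeps the detector testable and lets the gateway decide whether a True result means alert or terminate.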

Spend velocity anomalies fire alerts in real time. An agent suddenly burning Anthropic tokens at ten times its usual rate triggers a velocity alert and the operator's on-call sees it within seconds. The same pattern in a pure observability tool would show as a spike in a chart that someone has to be looking at to notice.
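A velocity alert reduces to comparing a rolling spend rate against a baseline. This sketch assumes a fixed baseline and the ten-times multiplier from the example above; VelocityMonitor and its parameters are illustrative names, and a real system would probably learn the baseline per agent rather than hard-code it.

```python
from collections import deque


class VelocityMonitor:
    def __init__(self, baseline_usd_per_min, multiplier=10.0, window_seconds=60.0):
        self.baseline_usd_per_min = baseline_usd_per_min
        self.multiplier = multiplier
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, cost in USD)

    def record(self, cost_usd, now):
        """Record spend; return True if the rolling rate breaches the threshold."""
        self._events.append((now, cost_usd))
        # Keep only the spend that falls inside the rolling window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()
        window_spend = sum(cost for _, cost in self._events)
        rate_per_min = window_spend * 60.0 / self.window_seconds
        return rate_per_min >= self.multiplier * self.baseline_usd_per_min


monitor = VelocityMonitor(baseline_usd_per_min=1.0)
alerts = [monitor.record(cost, t) for cost, t in ((2.0, 0), (4.0, 10), (4.0, 20))]
```

The check runs on every recorded call, so the alert fires on the call that crosses the threshold rather than on the next dashboard refresh.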

Kill switches propagate in under a second. An operator who sees an anomaly hits the kill switch in the dashboard, and the next API call from the affected scope is refused at the gateway. Observability tools do not have kill switches. They have charts.
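At the gateway, a kill switch is a membership check that runs before every call. The sketch is in-process and uses hypothetical names (KillSwitch, CallRefused, gateway_call). Propagating the switch across many gateway instances quickly would need a shared store or push channel, which is assumed away here.

```python
class CallRefused(Exception):
    """Raised when a call falls inside a scope that has been switched off."""


class KillSwitch:
    def __init__(self):
        self._killed = set()

    def kill(self, scope):
        self._killed.add(scope)

    def restore(self, scope):
        self._killed.discard(scope)

    def is_killed(self, scopes):
        # A call is blocked if any scope it belongs to has been killed.
        return any(scope in self._killed for scope in scopes)


def gateway_call(switch, scopes, call):
    """Refuse the call at the gateway if any of its scopes is killed."""
    if switch.is_killed(scopes):
        raise CallRefused("refused: scope is switched off")
    return call()


switch = KillSwitch()
scopes = ("customer:cust_42", "feature:summarize")
before = gateway_call(switch, scopes, lambda: "sent")
switch.kill("feature:summarize")
```

Because every call carries all of its scopes, killing one feature stops that feature for every customer, and killing one customer stops every feature for that customer.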

The category lesson is that dashboards are necessary but not sufficient. BudgetGuard is what the long-tail operator actually needed when they were paying Helicone $20 a month. Helicone was a great dashboard. BudgetGuard is the governance layer that operates on the same data.

What ships when

BudgetGuard is due to ship in November 2026, alongside MandateKit. Both ship into the commerce-layer cycle.

The observability work this essay names is not a separate product. It is BudgetGuard's default operating mode. The per-customer attribution is a first-class primitive. The audit-ready trace format is the log BudgetGuard produces by default. The kill switches and budget caps are the governance the observability category does not provide.

State of Agent Commerce Q4 2026 publishes the draft audit trace standard and the first cohort of anonymised production data from BudgetGuard deployments. The report becomes the reference document that the next round of observability tools has to either adopt or actively avoid.

The next essay goes inside the provenance layer. The August 2 EU AI Act enforcement date as a category catalyst, what an audit-ready synthetic media disclosure receipt looks like, and why Major Labs is publishing about provenance in 2026 but not yet shipping into it.

See you Friday.

— Charlie

Charlie Major writes Major Matters and joined Mastercard in April 2026. Major Labs is independent of Mastercard and operates separately from Major Matters. Any opinions in these essays are Charlie's own.
