This week, a team called Cactus Compute published Needle, a 26-million-parameter model distilled from Gemini that does function calling. Twenty-six million parameters. For context, GPT-2 was 1.5 billion. Needle is roughly 58 times smaller, and it beats FunctionGemma (270M), Qwen (600M), and Granite (350M) on single-shot tool-calling benchmarks.
It runs at 6,000 tokens per second on a phone. It fits on a smartwatch. MIT licensed, weights public, fine-tunable for new tools in 45 minutes on 16 TPUs.
Within 24 hours it had 637 points on Hacker News. People understood what it meant.
Tool calling — the ability for a model to choose which function to invoke and what arguments to pass — is no longer a cloud capability. It’s a local capability. It runs wherever compute exists.
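Stripped to its essentials, that capability is just structured generation plus local dispatch. The sketch below shows the shape of it; the `get_weather` tool and the JSON schema are invented for illustration, not Needle's actual output format.

```python
import json

# Hypothetical local tool registry; "get_weather" is an invented example.
TOOLS = {
    "get_weather": lambda city: f"clear skies in {city}",
}

def dispatch(model_output: str):
    """Parse the model's structured tool call and invoke it locally."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]            # which function to invoke
    return fn(**call["arguments"])      # what arguments to pass

result = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

Everything in that loop runs on-device; the only thing the model has to do well is emit the right name and the right arguments.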
The Discovery Layer Is Already There
Here’s what makes Needle consequential rather than cute: the infrastructure for agents to discover and pay for services already exists and is growing exponentially.
Coinbase’s x402 Bazaar, exposed as an MCP server, now indexes over 10,000 paid endpoints. Providers include Firecrawl, OpenAI, Anthropic, Alchemy, and CoinGecko. An agent searches for a capability, finds the endpoint, pays with USDC, gets the result. The protocol handles payment transparently via HTTP 402.
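The request/pay/retry loop is simple enough to sketch in a few lines. Header and field names below are illustrative stand-ins, not the exact x402 wire format; the fake endpoint simulates a server that demands payment.

```python
# Fake paid endpoint: returns 402 with payment requirements until paid.
def make_endpoint(price_usdc):
    def request(headers):
        if "X-PAYMENT" not in headers:
            return {"status": 402,
                    "body": {"price_usdc": price_usdc, "pay_to": "0xSERVICE"}}
        return {"status": 200, "body": "scraped-page-content"}
    return request

# Agent side: try the call, read the 402 requirements, pay, retry.
def call_paid_endpoint(request, sign_payment):
    resp = request(headers={})
    if resp["status"] == 402:
        proof = sign_payment(resp["body"])   # e.g. a signed USDC transfer authorization
        resp = request(headers={"X-PAYMENT": proof})
    return resp

endpoint = make_endpoint(price_usdc=0.001)
result = call_paid_endpoint(endpoint,
                            sign_payment=lambda req: f"signed:{req['price_usdc']}")
```

The point of the sketch: nothing in the loop requires a human, a checkout page, or a prior relationship between agent and service.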
The x402 ecosystem has processed over 165 million transactions, with nearly half a million agents transacting across the protocol since launch. Two weeks ago when I wrote Agents Can Pay, those numbers were a fraction of what they are now.
AWS launched AgentCore Payments on May 7th, built directly on x402. Circle launched its Agent Stack on May 11th — agent wallets, CLI, marketplace, nanopayments. Both are production infrastructure.
The payment stack isn’t approaching completion. It’s complete. Any agent can discover a service and pay for it, in real time, for fractions of a cent.
The Combinatorial Problem
Cloud-hosted agents operate in managed environments. The operator controls which tools the agent can access, sets spending limits per session, logs every invocation through an audit trail. AWS AgentCore does this explicitly: managed wallet provisioning, policy-based spending governance, unified audit trails.
This model assumes a knowable number of agents under centralized control.
Needle breaks that assumption.
When function calling runs on a phone, a watch, or a pair of glasses, the agent doesn’t live inside AWS. There is no centralized wallet. There is no AgentCore console logging its decisions. The agent acts locally, discovers services via MCP, and pays through whatever wallet the app developer embedded.
If you have N managed cloud agents and M paid services, the trust evaluation space is N×M. Governance is tractable because N is small and known. Enterprises might run hundreds or thousands of agents.
But when N includes every phone, every wearable, every smart home device running a Needle-class model, N becomes unknowable. The trust surface doesn’t grow linearly with the number of devices — it grows combinatorially with the number of potential agent-service interactions.
Ten thousand endpoints. A billion devices. That’s the number the governance layer needs to handle.
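The arithmetic behind that sentence, spelled out:

```python
# Back-of-envelope size of the trust surface.
endpoints = 10_000            # x402 Bazaar paid endpoints
devices = 1_000_000_000       # a billion edge devices, each a potential agent
pairs = endpoints * devices   # potential agent-service interactions
# Ten trillion pairwise trust decisions, none of them pre-registered.
```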
What Breaks
The three-letter-acronym stack the industry is building for agent trust was designed for cloud agents. Each component makes an assumption that edge deployment violates.
KYA (Know Your Agent) requires registration. ERC-8004 gives each agent a minted NFT on Ethereum as its identity. This works when an enterprise provisions agents in a registry. It doesn’t work when a watch app spins up a function-calling model on boot. That agent has no persistent, registered identity; its model was compiled into the app binary.
Session-scoped policy engines assume a known execution environment. OpenBox intercepts HTTP, database, and file system operations at runtime. On a managed server, this is straightforward. On a consumer device running a 26-megabyte model, there is no OpenBox agent sitting in the middleware. The execution environment is the app itself.
Spending limits assume a managed wallet. AWS AgentCore provisions wallets automatically and enforces per-session caps. An edge agent might use a local wallet, or the user’s Coinbase wallet via an SDK, or a Privy embedded wallet. The spending governance doesn’t belong to any centralized operator.
None of these approaches are wrong. They work in their intended deployment model. The problem is that the deployment model just expanded to include every device with a CPU.
Service-Side Trust
The alternative model is already understood in financial services. Banks don’t rely on the borrower to self-report their creditworthiness. The bank evaluates the borrower based on behavioral data — transactions completed, commitments honored, constraints respected, over time.
For agent commerce to work at edge scale, the trust evaluation has to happen at the service, not the agent. When an unknown agent sends a request to a paid endpoint, the service needs a way to answer: should I serve this request? What quality of service does this agent deserve? What spending authority should I extend?
That answer can’t come from the agent’s self-reported identity (which may be fabricated or absent). It can’t come from session-scoped policy (which the service can’t see). It has to come from behavioral history: what this agent, or agents like it, have done across interactions.
This is the L4 behavioral trust layer: a cross-session, cross-operator record of agent behavior that services can query at request time. Not identity. Not policy. Behavior.
The Math on the Gap
The agent payment stack took roughly 18 months to go from concept to Linux Foundation standard with 22 institutional backers. The identity stack (ERC-8004, KYA, Visa TAP) is consolidating into standards this quarter.
The behavioral trust stack has one production player and zero standards.
Meanwhile, x402 transaction volume keeps compounding. Needle just made function calling available on devices that cost $200. a16z’s January 2026 framework argues that as agent execution costs trend toward zero, verification becomes the binding constraint for agentic commerce — a hypothesis the May launches are now stress-testing.
Every one of these developments widens the gap between what agents can do and what governance can verify they should do.
The agents are shrinking. The trust problem isn’t.
This continues the series on trust infrastructure for autonomous agents. Previous: "Agents Can Pay. That's Not the Problem." and "Agent Identity Shipped. Behavior Didn't." We're building Commit — behavioral trust data for the agent economy.