The End of On-Demand AI: Why Businesses Need a Token Capacity Strategy Now
Businesses that treat AI token access as an elastic utility — available on demand, priced simply, and reliably scalable — are operating on an assumption that is no longer accurate. On May 19, 2026, OpenAI launched a multi-year committed capacity product, joining Google, Microsoft, and AWS in signaling that guaranteed throughput now requires a contract. The capacity crunch is real, structural, and getting worse. Mid-market companies without a token capacity plan are the most exposed.
Author: Roy Gatling, RMG Associates — linkedin.com/in/roygatling
Published: 2026-05-20
Last updated: 2026-05-20
Freshness note: Based on OpenAI's Guaranteed Capacity announcement of May 19, 2026, Q1 2026 earnings disclosures from Alphabet, Microsoft, and Amazon, and current rate limit data verified April 2026.
What is the AI capacity crunch, and is it actually a problem?
The AI capacity crunch refers to the gap between enterprise demand for LLM inference and the physical infrastructure available to serve it. Google Cloud topped $20 billion in quarterly revenue for the first time in Q1 2026 — and then disclosed it could have grown even faster if it had enough capacity to meet demand. Google's cloud backlog nearly doubled in 90 days to $460 billion, a rate that analysts described as looking "literally fake."
This is not a temporary blip. Of the roughly 12 GW of US data center capacity expected to come online in 2026, only about a third is actively under construction. The rest faces multi-quarter postponements due to power infrastructure delays. The constraint is physical, not financial. Microsoft sits on over $80 billion in unfulfilled Azure orders — not because customers lack interest, but because there is not enough electricity to power the GPUs. Nvidia's newest AI chips are backordered through 2027.
For the businesses consuming these services through APIs, the crunch is not abstract. Supply constraints "manifest elsewhere through pricing changes, usage caps, throttling or performance trade-offs. The shortage doesn't disappear; it just becomes less visible inside a service layer," according to engineering leaders at cloud security firm Acre Security.
What did OpenAI just announce, and why does it matter?
OpenAI announced a new offering called Guaranteed Capacity on May 19, 2026, which allows customers to secure long-term access to compute to power their AI products, agents, and workflows. Customers can choose one-, two-, or three-year commitments, with discounts that increase with annual spend.
This is not an isolated product decision. It is a market signal. OpenAI CEO Sam Altman wrote on X: "Customers are increasingly asking us for certainty on capacity. As models get better, we expect that the world will be capacity-constrained for some time."
Guaranteed Capacity includes certainty of access to compute based on spend levels, and customers can draw down from this commitment across the portfolio of OpenAI products. It is designed for production systems, customer-facing applications, and AI agents.
Altman's phrasing is worth reading carefully. "Capacity-constrained for some time" is not a temporary disruption message. It is a supply-chain forecast. The implication for enterprise buyers: on-demand access during peak periods is no longer a safe assumption for mission-critical workloads.
How are other providers handling capacity and pricing?
OpenAI's announcement arrives in a market that has been moving toward committed-capacity models for more than a year.
Google Vertex AI updated its Provisioned Throughput offering in February 2026, standardizing the reserved-capacity experience across 200+ models, adding private preview support for Anthropic Claude models, and introducing flexible 1-week terms for high-impact short-term windows like product launches.
AWS Bedrock's Provisioned Throughput reserves dedicated model capacity for a workload, requires a 1-month or 6-month commitment, and eliminates throttling risk at a predictable cost.
Azure Provisioned Throughput Units (PTUs) flip the billing model from per-token to reserved capacity at a fixed hourly rate, similar to reserved cloud instances. PTUs can reduce per-token costs by up to 70% for sustained workloads, and pricing starts at approximately $2,448 per month.
Managed API providers (no infrastructure ownership required)
| Provider | On-Demand Risk | Committed Option | Typical Discount | Commitment Term |
|---|---|---|---|---|
| OpenAI (direct) | Rate-limited by tier; throttled at peak | Guaranteed Capacity (new) | Increases with annual spend | 1-3 years |
| Azure OpenAI | Subject to throttling during peak demand | Provisioned Throughput Units (PTUs) | Up to 70% at sustained usage | Monthly or annual |
| AWS Bedrock | On-demand matches direct API pricing | Provisioned Throughput | Reduces throttling risk | 1 or 6 months |
| Google Vertex AI | Free tier generous; PAYG removes daily caps | Provisioned Throughput | 20-45% vs on-demand | 1 week to annual |
| Anthropic (direct) | Tier-gated; lower ceiling than OpenAI | Enterprise agreements; no public PTU product yet | Negotiated | Custom |
GPU cloud providers (self-hosted open-weight models)
| Provider | Best For | H100 On-Demand (per GPU/hr) | Committed Discount | Notable Constraint |
|---|---|---|---|---|
| CoreWeave | Large-scale enterprise inference; Kubernetes-native clusters | Custom quoted (~40-50% below AWS) | Up to 60% reserved; annual commit | No self-serve signup; requires account manager |
| Lambda Labs | Production inference; developer-friendly; no egress fees | $2.49-$3.78/hr (H100) | Negotiated reserved clusters | Inventory limits during peak demand |
| RunPod | Variable workloads; serverless inference endpoints | $1.99-$2.69/hr (H100) | Spot instances 40-65% below on-demand | Less suitable for compliance-sensitive workloads |
| Vast.ai | Budget-first research and batch training | $1.87+/hr (H100; varies by host) | Interruptible instances | Variable reliability; peer-to-peer marketplace |
Sources: Provider documentation, gpu.fm, DEV Community GPU Showdown (March 2026), Spheron Blog (May 2026). GPU pricing fluctuates; verify before committing.
Should your business host its own LLM instead of using a managed API?
Self-hosting an open-weight model on a GPU cloud removes rate limits, eliminates per-token pricing, and puts your capacity ceiling entirely under your control. For the right workload profile, it is meaningfully cheaper. For the wrong one, it is significantly more expensive and operationally complex.
The core question is utilization. Managed APIs charge per token — you pay only for what you use. GPU cloud hosts charge per hour of compute, whether that GPU is serving requests or sitting idle. If your AI workload runs continuously at high volume, self-hosting can cut effective costs by 60-90% compared to managed API pricing. If it runs intermittently, you are paying for idle compute.
The case for self-hosting
GPU cloud providers like CoreWeave, Lambda Labs, and RunPod price H100 GPU hours at $2-4/hr for on-demand access and significantly less on reserved terms. A single H100 can serve Llama 4 or Mistral inference at volume with no per-token ceiling, no throttling risk, and no rate limit tier to negotiate. For high-throughput, predictable workloads — document processing, internal copilots with heavy usage, or agentic pipelines running 24/7 — the economics can favor self-hosted infrastructure within 6-12 months of sustained usage.
CoreWeave, the largest GPU-native cloud, prices approximately 40-50% below AWS for comparable instances, offers free egress within its network, and carries enterprise SLAs with SOC2 and ISO 27001 compliance. Lambda Labs offers H100 access at transparent pricing with no egress fees and a simpler developer experience suited to teams without dedicated MLOps capability.
The case against self-hosting (for most mid-market companies)
Self-hosting transfers the capacity problem from provider rate limits to your own infrastructure management. You own model deployment, scaling, monitoring, inference optimization, and version management. For most mid-market firms, this requires engineering headcount that does not yet exist — or engineering time that is currently allocated elsewhere.
Open-weight models (Llama 4, Mistral, Qwen) are capable, but they are not Claude Opus or GPT-5. For workloads where frontier model quality matters — complex reasoning, nuanced writing, multi-step agent tasks — the quality gap is a real cost that does not appear in the GPU hourly rate.
There is also an availability problem. B200 GPUs remain constrained through mid-2026 with most providers maintaining waitlists. CoreWeave requires meeting with an account manager before onboarding. Lambda Labs reports inventory limits during peak demand. The self-hosted path has its own capacity ceiling — it is just a different one.
The hybrid architecture most mid-market companies should consider
Route high-volume, repeatable, lower-complexity workloads (classification, summarization, data extraction) to a self-hosted open-weight model on GPU cloud infrastructure. Route frontier-quality requirements (customer-facing generation, complex reasoning, agentic orchestration) to managed APIs with committed capacity. This splits the cost exposure without sacrificing quality where it matters.
| Factor | Managed API (OpenAI / Anthropic) | GPU Cloud + Self-Hosted Model |
|---|---|---|
| Upfront cost | None | GPU hours + engineering setup |
| Per-token cost | $1-25/M tokens depending on model | Effectively $0 (absorbed into hourly rate) |
| Rate limit risk | Yes — tier-gated, throttled at peak | No token limits; ceiling is your GPU count |
| Model quality | Frontier models available | Open-weight models; quality gap on complex tasks |
| Operational burden | Minimal | High — MLOps, scaling, monitoring required |
| Data privacy | Data processed by provider | Data stays on your infrastructure |
| Break-even point | N/A (pure OpEx) | Typically 6-12 months at sustained high utilization |
| Compliance | Provider-certified (SOC2, HIPAA tiers) | Your responsibility; CoreWeave offers SOC2 |
The break-even calculation depends heavily on utilization rate. At 50%+ sustained GPU utilization for a workload that does not require frontier model quality, self-hosted infrastructure on a neocloud is likely the more efficient long-term structure.
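The dependence on utilization can be made concrete with a rough sketch. The $3.00 hourly rate sits inside the $2-4/hr H100 range quoted earlier; the 1,500 tokens-per-second throughput is a hypothetical placeholder, not a benchmark, and real throughput varies widely by model, batch size, and serving stack.

```python
def self_hosted_cost_per_million(gpu_hourly_usd: float,
                                 peak_tokens_per_sec: float,
                                 utilization: float) -> float:
    """Effective USD per million tokens for one GPU billed by the hour.

    utilization is the fraction of each billed hour spent serving tokens.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: $3.00/hr GPU, 1,500 tokens/sec at full load.
for utilization in (0.10, 0.25, 0.50, 0.90):
    cost = self_hosted_cost_per_million(3.00, 1500, utilization)
    print(f"{utilization:.0%} utilization: ${cost:.2f} per million tokens")
```

Halving utilization doubles the effective per-token cost, which is why intermittent workloads rarely justify hourly GPU billing. The sketch leaves out engineering and operations costs, which the comparison table lists as a high operational burden.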
Why do rate limits matter more than most teams realize?
Most mid-market AI implementations are designed and tested at low volume. The rate limit reality of production scale is rarely modeled in advance.
Anthropic Tier 4 maxes at 4,000 requests per minute and 400,000 tokens per minute. OpenAI GPT-5.4 at Tier 4 allows 10,000 requests per minute and 30 million input tokens per minute. That is a 75x difference in input throughput between the two leading providers.
Tier access is not automatic. Teams operating at scale should request rate limit increases proactively, before reaching capacity during peak periods. Being throttled in production costs more in engineering time and SLA impact than the deposit required to raise a tier.
Agentic workloads make this worse. An autonomous agent might chain 10-20 sequential API calls to complete a single task — tool lookups, retrieval-augmented generation queries, multi-step reasoning, and final completions — all in a rapid burst. If any call in that chain hits a rate limit, the entire agentic workflow fails.
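One common mitigation is to retry each step with exponential backoff rather than letting a single throttled call abort the chain. The sketch below is a minimal illustration, not any provider's SDK: `RateLimitError`, `call_with_backoff`, and `run_chain` are hypothetical names, and the delay values are placeholders.

```python
import random
import time
from typing import Callable, List, Sequence


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def call_with_backoff(step: Callable[[], str],
                      max_attempts: int = 5,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Run one step, retrying on RateLimitError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return step()
        except RateLimitError:
            attempt += 1
            if attempt >= max_attempts:
                raise  # retry budget exhausted; surface the failure
            # Delay doubles each retry; jitter spreads out concurrent agents.
            sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))


def run_chain(steps: Sequence[Callable[[], str]],
              max_attempts: int = 5,
              sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run the steps of an agent task in order, retrying throttled calls."""
    return [call_with_backoff(s, max_attempts=max_attempts, sleep=sleep)
            for s in steps]
```

Backoff keeps a chain alive through brief throttling, but it adds latency to every retried step and does not raise the underlying ceiling.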
In our work with mid-market firms deploying AI agents in production, token capacity planning almost never appears in the initial project scope. Teams focus on model selection and prompt design, then encounter throttling or cost surprises after deployment — at exactly the moment when the business is counting on the system.
What hidden pricing variables are businesses missing?
The sticker price per million tokens is the most visible cost — and often not the most important one. Several pricing mechanics are now compounding costs in ways that typical budget planning does not account for.
Output tokens cost 4x input tokens on average. In 2026 market data, the median output-to-input price ratio is around 4x. Lengthy completions can balloon costs even when inputs are small. Systems designed for comprehensive outputs — reports, code generation, analyses — are hit the hardest.
Long-context surcharges apply to the full request, not just the overflow. Anthropic and Google both apply steep surcharges when a request crosses 200,000 tokens. The surcharge applies to the entire request, not just the tokens above the threshold. A request at 199K tokens costs roughly $0.60 on Sonnet. A request at 201K tokens costs $1.21. That is a 2x jump for 2,000 extra tokens.
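The cliff is easy to reproduce. The rates in this sketch ($3.00 per million input tokens below the threshold, $6.00 per million once the request crosses it) are assumptions inferred from the $0.60 and $1.21 figures above; check the provider's current price sheet before relying on them. Only input tokens are modeled.

```python
def input_cost_usd(input_tokens: int,
                   base_rate_per_m: float = 3.00,   # assumed base input rate
                   long_rate_per_m: float = 6.00,   # assumed long-context rate
                   threshold: int = 200_000) -> float:
    """Input cost when the long-context rate applies to the WHOLE request."""
    rate = long_rate_per_m if input_tokens > threshold else base_rate_per_m
    return input_tokens / 1_000_000 * rate


for tokens in (199_000, 201_000):
    print(f"{tokens:,} tokens: ${input_cost_usd(tokens):.2f}")
```

Because the higher rate covers every token in the request, trimming a prompt to stay just under the threshold can roughly halve its input cost.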
Tokenizer changes silently inflate costs. Claude Opus 4.7 ships with a new tokenizer that can generate up to 35% more tokens for the same input text compared to Opus 4.6. Per-token rates are identical, but effective cost per request can increase by up to 35%. Teams should benchmark workloads before migrating.
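A pre-migration benchmark can be as simple as counting tokens for a representative sample of prompts under both model versions and comparing the totals. The sketch assumes you have already collected those counts, for example from usage metadata; the sample numbers are made up for illustration.

```python
from typing import Sequence


def tokenizer_inflation(old_counts: Sequence[int],
                        new_counts: Sequence[int]) -> float:
    """Fractional increase in total tokens for the same sample of inputs."""
    if len(old_counts) != len(new_counts) or not old_counts:
        raise ValueError("need matching, non-empty samples")
    return sum(new_counts) / sum(old_counts) - 1


def projected_monthly_cost(current_monthly_usd: float, inflation: float) -> float:
    """Monthly spend after migration if per-token rates stay the same."""
    return current_monthly_usd * (1 + inflation)


# Made-up token counts for the same three prompts under each tokenizer.
old = [1_000, 2_000, 3_000]
new = [1_300, 2_500, 4_000]
inflation = tokenizer_inflation(old, new)
print(f"inflation: {inflation:.0%}")
print(f"projected spend: ${projected_monthly_cost(10_000, inflation):,.0f}")
```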
Tier thresholds require capital commitment. Raising your Anthropic tier from Tier 1 to Tier 4 requires clearing monthly spend thresholds — which means you need to be running significant volume before you can access the throughput you need to run significant volume. This is a chicken-and-egg problem that requires deliberate planning.
What is the right capacity planning framework for a mid-market company?
The RMG approach to AI capacity planning treats token throughput the same way operational leaders treat manufacturing capacity or server headroom: model it, commit to what you know, build flexible reserves for what you do not.
Step 1: Audit current and projected token consumption.
Pull usage data from every AI-connected system: chatbots, copilots, agents, batch processing jobs, and API integrations. Separate input from output, and identify which workloads must run on demand and which are batch-eligible. Most teams discover they are running batch-eligible jobs, which could be served more cheaply, on the same on-demand tier as their synchronous, real-time workloads.
Step 2: Model agentic growth separately.
Agentic workflows are not a linear extension of chat usage. If your roadmap includes agents, estimate token consumption at the task level, not the request level. A task that chains 15 API calls at an average of 3,000 tokens each is a 45,000-token event, not a 3,000-token one.
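The task-level arithmetic can be sketched in a few lines. The 15 calls and 3,000 tokens per call come from the example above; the rate of 10 concurrent tasks per minute is a hypothetical load figure.

```python
def tokens_per_task(calls_per_task: int, avg_tokens_per_call: int) -> int:
    """Total tokens consumed by one agent task."""
    return calls_per_task * avg_tokens_per_call


def peak_tokens_per_minute(tasks_per_minute: int,
                           calls_per_task: int,
                           avg_tokens_per_call: int) -> int:
    """Token throughput needed to sustain a given task rate."""
    return tasks_per_minute * tokens_per_task(calls_per_task, avg_tokens_per_call)


print(tokens_per_task(15, 3_000))             # one task
print(peak_tokens_per_minute(10, 15, 3_000))  # 10 tasks per minute (assumed)
```

At the assumed 10 tasks per minute, the requirement is 450,000 tokens per minute, which already exceeds the 400,000 tokens-per-minute ceiling cited earlier for Anthropic Tier 4.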
Step 3: Map to provider tiers and identify your ceiling.
Determine which tier you currently sit in for each provider, what the rate limit ceiling is at that tier, and at what usage level the ceiling becomes a constraint. Build the calculation before it becomes an incident.
Step 4: Evaluate committed capacity against your model.
PTUs generally only make financial sense once you consistently use 60-70% or more of the provisioned capacity. For Azure, that break-even sits around 150-200 million tokens per month for flagship models. Committed capacity is not right for every workload — but for production systems with predictable demand, the risk of on-demand throttling and the certainty of pricing both favor commitment.
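A first-pass break-even check compares the fixed monthly commitment with what the same volume would cost on demand. The $2,448 monthly figure is the PTU entry price quoted earlier; the blended on-demand rate of $14 per million tokens is a hypothetical value chosen only for illustration, so substitute your own measured blend of input and output pricing.

```python
def break_even_tokens_per_month(monthly_commit_usd: float,
                                on_demand_usd_per_m: float) -> float:
    """Monthly token volume at which on-demand spend equals the commitment."""
    return monthly_commit_usd / on_demand_usd_per_m * 1_000_000


def commitment_pays_off(monthly_tokens: float,
                        monthly_commit_usd: float,
                        on_demand_usd_per_m: float) -> bool:
    """True when on-demand cost for this volume meets or exceeds the commitment."""
    on_demand_cost = monthly_tokens / 1_000_000 * on_demand_usd_per_m
    return on_demand_cost >= monthly_commit_usd


threshold = break_even_tokens_per_month(2_448, 14.0)  # 14.0 is an assumption
print(f"break-even: {threshold / 1e6:.0f}M tokens per month")
```

This comparison covers price only. It also assumes the reserved capacity can actually serve the volume, and it does not put a value on avoided throttling, which may justify committing below the pure price break-even.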
Step 5: Implement a multi-provider fallback.
Routing across multiple providers aggregates effective rate limits. OpenAI Tier 3 plus Anthropic Tier 4 plus Google pay-as-you-go yields roughly 9,000 effective requests per minute with built-in failover. Multi-provider architecture requires an abstraction layer (an AI gateway or SDK that handles routing and retry logic), but it is the most defensible position against any single provider's capacity ceiling.
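The core of such an abstraction layer is an ordered failover loop. The sketch below is a simplified illustration, not a production gateway: `RateLimitError` and `route_with_fallback` are hypothetical names, and each provider is represented by a plain callable that you would wrap around the real client.

```python
from typing import Callable, List, Tuple


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


Provider = Tuple[str, Callable[[str], str]]  # (name, function that sends a prompt)


def route_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in priority order; move on when one is throttled.

    Returns (provider_name, response). Raises RuntimeError if every
    provider in the list reports a rate limit.
    """
    throttled = []
    for name, send in providers:
        try:
            return name, send(prompt)
        except RateLimitError:
            throttled.append(name)
    raise RuntimeError("all providers throttled: " + ", ".join(throttled))
```

A real gateway would also need per-provider prompt formatting, timeouts, and a check that the fallback model is acceptable for the task, since responses from different models are not interchangeable.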
What should a CEO or COO do in the next 30 days?
This is not an infrastructure problem. It is a business continuity question with a finite planning window — OpenAI's Guaranteed Capacity program is available only while the current allocation lasts.
Assign ownership.
Token capacity planning sits at the intersection of engineering, finance, and operations. Assign one owner. Committees produce audits, not decisions.
Quantify current monthly token spend across all providers.
This number is almost always unknown at the executive level and frequently unknown at the department level. Finout, CloudZero, and similar platforms can pull this across Anthropic, OpenAI, and Azure in a single dashboard.
Identify your mission-critical AI workloads.
These are the systems where downtime or throttling has direct revenue or customer impact. These are your committed-capacity candidates. Everything else can stay on demand.
Ask your AI vendors directly about their capacity tier and upgrade path.
Any vendor who has integrated AI into a product they sell on your behalf should be able to answer this question. If they cannot, that is a vendor risk signal.
Run a 90-day projection.
Token consumption in mid-market firms deploying agents is typically growing 40-70% quarter over quarter. If your current tier has a 400K token-per-minute ceiling and you are running at 60% of it today, compound growth at that pace takes you past the ceiling within one to two quarters. At that point on-demand throttling is a production incident, not a planning consideration.
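The projection is a compound-growth calculation: usage after n quarters is current usage times (1 + growth)^n, so the time to reach the ceiling is log(1 / utilization) / log(1 + growth). A minimal sketch, with `quarters_until_ceiling` as an illustrative helper name:

```python
import math


def quarters_until_ceiling(current_utilization: float,
                           quarterly_growth: float) -> float:
    """Quarters until usage reaches 100% of the rate-limit ceiling.

    current_utilization: today's peak usage as a fraction of the ceiling.
    quarterly_growth: compound growth per quarter (0.40 means 40%).
    """
    if current_utilization <= 0 or quarterly_growth <= 0:
        raise ValueError("utilization and growth must be positive")
    if current_utilization >= 1:
        return 0.0  # already at or above the ceiling
    return math.log(1 / current_utilization) / math.log(1 + quarterly_growth)


for growth in (0.40, 0.70):
    quarters = quarters_until_ceiling(0.60, growth)
    print(f"{growth:.0%} growth: ceiling reached in {quarters:.2f} quarters")
```

Starting from 60% utilization, the ceiling arrives in about 1.5 quarters at 40% growth and in just under one quarter at 70% growth.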
Conclusion
The on-demand era of AI access is not ending — but it is being tiered. Providers are reserving their guaranteed capacity for customers who commit to it. On-demand access will remain available, but at lower priority and higher variability as provider infrastructure runs closer to ceiling.
Mid-market companies are in a decisive window. The companies that model their token consumption now, identify their mission-critical AI systems, and secure appropriate capacity commitments before demand peaks will have a structural operational advantage over those that discover this problem during an incident.
Token capacity is not an infrastructure decision. It is a business continuity decision with a planning lead time measured in quarters, not days.
Executive FAQ
Frequently asked questions about AI token capacity and committed throughput.
What is OpenAI Guaranteed Capacity?
OpenAI Guaranteed Capacity, announced May 19, 2026, is a multi-year committed-capacity program. Customers pay for reserved access to OpenAI compute across its product portfolio. Commitments run one to three years, with discounts that increase with annual spend. It is designed for production workloads, customer-facing applications, and AI agents.
How is this different from standard API pricing?
Standard API access is pay-as-you-go and subject to rate limits by tier. On-demand requests are the first to be throttled when provider capacity is strained. Committed capacity guarantees throughput and makes pricing predictable. The trade-off: you pay for the commitment regardless of utilization.
Is my company large enough to need committed AI capacity?
If you have production AI workloads — customer-facing chatbots, AI-assisted internal tools, agentic workflows — and those workloads have measurable business impact when unavailable, capacity planning is appropriate regardless of company size. The committed capacity products from most providers have meaningful minimums (Azure PTUs start around $2,448/month), so the question is whether the cost of commitment is less than the cost of unplanned throttling.
Can I avoid committed capacity by using multiple providers?
Yes, with caveats. Multi-provider routing distributes load and provides failover. It does not guarantee throughput for any single workload, and it adds architectural complexity. For workloads where latency and reliability guarantees matter, multi-provider routing is a complement to committed capacity, not a replacement.
Should my company consider self-hosting an LLM on a GPU cloud instead of using managed APIs?
It depends on utilization and workload type. GPU cloud providers like CoreWeave, Lambda Labs, and RunPod charge $2-4/hr per H100 GPU with no per-token ceiling — which can be significantly cheaper than managed APIs at sustained high volume. The break-even is typically 6-12 months at 50%+ GPU utilization. The tradeoff is operational: self-hosting requires MLOps capability your team may not have, and open-weight models carry a quality gap versus frontier models on complex tasks. A hybrid approach — routing high-volume, repeatable tasks to self-hosted infrastructure and complex/agentic work to managed APIs — is often the most cost-efficient structure for mid-market companies.
What is the cheapest path to higher rate limits?
Advance your usage tier with each provider by meeting their monthly spend thresholds. This requires sustained usage, not a one-time payment. For Anthropic and OpenAI, tiers unlock progressively higher rate limits. Proactively request limit increases through the provider console before you hit the ceiling, not after.
About the author
Roy Gatling is the founder of RMG Associates LLC, an AI strategy and implementation consultancy serving mid-to-large organizations. RMG's work spans AI strategy, implementation, and training programs. Learn more at rmgassociatesllc.com.