Reference Guide

LLM Rate Limits in 2026: How API Caps, Usage Tiers, and Cache Costs Work

Rate limits are the thing nobody reads until a production job starts throwing 429s. This guide explains how requests-per-minute, tokens-per-minute, daily caps, and cache write costs work across OpenAI, Anthropic Claude, and Google Gemini in 2026, and how to raise them. For sticker prices, see LLM pricing per million tokens.

Last updated: 2026-06-24

Key Takeaways

  • API rate limits are measured mainly in requests per minute (RPM) and tokens per minute (TPM); hitting one returns an HTTP 429 error.
  • OpenAI and Anthropic both gate limit increases behind usage tiers tied to cumulative spend and account age, not an instant upgrade button.
  • Usage limits (monthly spend caps and daily token caps) are separate from rate limits; you can hit one without the other.
  • Anthropic cache writes cost a premium over base input (commonly around 25 percent more), while cache reads run roughly 90 percent cheaper.
  • Specific numeric limits change often; treat every figure here as a 2026 snapshot and confirm on the provider's own page.

Every LLM provider throttles how much you can send. The limit is not really about the model. It is about protecting shared GPU capacity, and it scales with how much the provider trusts your account. A brand-new free account gets a tiny slice. An account that has paid invoices for months and spent thousands gets a much larger one. This page covers the units those limits come in, how each major provider structures them in 2026, and the practical moves that get you more headroom.

The Units: RPM, TPM, RPD, and Concurrency

Four numbers describe almost every API rate limit. Requests per minute (RPM) caps how many separate API calls you can make. Tokens per minute (TPM) caps the total tokens (input plus output) flowing through in a minute. Requests per day (RPD) is a coarser daily ceiling that mostly bites free and low tiers. Concurrency caps how many requests can be in flight at once, which matters for batch and parallel workloads.
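Of the four, concurrency is the one you usually enforce on the client side yourself. A minimal sketch, assuming an asyncio-based client: `run_bounded` and `max_in_flight` are illustrative names, not part of any provider SDK, and the right cap depends on your tier.

```python
import asyncio


async def run_bounded(jobs, max_in_flight):
    """Run zero-argument coroutine functions with at most max_in_flight active."""
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(job):
        # Each job waits here until one of the max_in_flight slots frees up.
        async with sem:
            return await job()

    # gather() returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_request():
    # Hypothetical stand-in for a real API call.
    await asyncio.sleep(0.01)
    return "done"


if __name__ == "__main__":
    print(asyncio.run(run_bounded([fake_request] * 10, max_in_flight=3)))
```

A semaphore only bounds requests in flight; it does nothing for RPM or TPM, so it complements a token budget rather than replacing one.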

TPM is usually the one that surprises people. A single long-context request can carry hundreds of thousands of input tokens, so a handful of big calls can blow through a TPM ceiling even though your RPM is nowhere near the cap. When you design retry and queueing logic, budget against TPM first.
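One way to budget against TPM first is to track tokens spent over a rolling 60-second window and hold a request back when it would not fit. The sketch below assumes a simple sliding window; providers may account for tokens differently (for example with a continuously refilling bucket), and `TokenBudget` and the 400,000 figure are hypothetical.

```python
import time
from collections import deque


class TokenBudget:
    """Sliding-window tracker for tokens spent in the last 60 seconds."""

    def __init__(self, tpm_limit, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.clock = clock      # injectable so the class can be tested
        self.events = deque()   # (timestamp, tokens) pairs, oldest first
        self.used = 0

    def _expire(self):
        # Drop entries that have aged out of the 60-second window.
        cutoff = self.clock() - 60.0
        while self.events and self.events[0][0] <= cutoff:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_spend(self, tokens):
        """Reserve tokens if they fit in the window; return True on success."""
        self._expire()
        if self.used + tokens > self.tpm_limit:
            return False
        self.events.append((self.clock(), tokens))
        self.used += tokens
        return True


if __name__ == "__main__":
    budget = TokenBudget(tpm_limit=400_000)   # assumed limit, not a real tier
    print(budget.try_spend(300_000))  # True: the first large call fits
    print(budget.try_spend(150_000))  # False: the second must wait or queue
```

Two calls are nowhere near any RPM cap, yet the second is already blocked, which is the failure mode described above. A request larger than the whole limit never fits and needs to be split or rejected up front.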

OpenAI Rate Limits (2026)

OpenAI organizes accounts into usage tiers, Tier 1 through Tier 5. You move up by spending cumulatively and waiting out a minimum time since your first successful payment. Each tier unlocks higher RPM and TPM per model, with reasoning models (the o-series) generally getting tighter limits than the GPT-4.1 family because they consume more compute per call. The free tier is heavily restricted, often a few requests per minute on the reasoning models, which is fine for testing and useless for production.

Tier     | Unlocked by             | Practical headroom
Free     | Sign up, no payment     | A few RPM on reasoning models. Testing only.
Tier 1   | First payment made      | Low hundreds of RPM, basic TPM. Small apps.
Tier 2   | $50+ spent, 7+ days     | Higher TPM. Early production.
Tier 3   | $100+ spent             | Comfortable for most production traffic.
Tier 4-5 | $1,000+ spent over time | High RPM/TPM. Heavy production and batch.

Tier thresholds and per-model limits shift through the year. Check your live limits in the OpenAI dashboard under Limits, and read the official rate-limits guide for current numbers.

Anthropic Claude Rate Limits and Cache Write Cost (2026)

Anthropic uses a comparable tiered model for Claude. New accounts start low; cumulative spend and account age unlock higher RPM and TPM. Claude also distinguishes input TPM from output TPM in its limit accounting, which matters because output is the slower, scarcer resource. As with OpenAI, the way to a bigger limit is to keep paying and, for large workloads, to ask the sales team directly.

The cache write cost question comes up constantly, so here is the shape of it. Anthropic prompt caching lets you store a stable prompt prefix (a long system prompt, a document, few-shot examples) and reuse it across calls. Writing that prefix into the cache costs a premium over the base input rate, commonly around 25 percent more for a standard cache lifetime, with a higher premium for longer-lived caches. Reading from the cache on later requests is the payoff: cache reads run roughly 90 percent cheaper than base input. So you pay a little extra once to write, then save heavily on every repeat. For an Opus 4.6 prompt where base input is $5 per million tokens, a 25 percent premium puts the cache write at about $6.25 per million, and cache reads land near $0.50 per million. Confirm the exact multipliers and cache durations on Anthropic's pricing page, since they have changed more than once.
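The arithmetic can be sketched in a few lines. This assumes the multipliers quoted above (about 1.25x to write, about 0.10x to read), a $5 per million base rate, and a cache that stays warm for every call; `cached_prefix_cost` is an illustrative helper, and it prices only the shared prefix, not the rest of each request.

```python
def cached_prefix_cost(prefix_tokens, calls, base_per_mtok=5.00,
                       write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached_cost, cached_cost) in dollars for the prefix only."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    mtok = prefix_tokens / 1_000_000
    # Without caching, every call pays full base input for the prefix.
    uncached = calls * mtok * base_per_mtok
    # With caching, the first call writes and the remaining calls read.
    cached = mtok * base_per_mtok * (
        write_multiplier + (calls - 1) * read_multiplier
    )
    return uncached, cached


if __name__ == "__main__":
    for n in (1, 2, 10):
        uncached, cached = cached_prefix_cost(100_000, n)
        print(f"{n} calls: uncached ${uncached:.3f}, cached ${cached:.3f}")
```

Under these assumptions a single call costs more with caching (1.25x against 1x), but the cached path is already cheaper by the second call (1.35x against 2x). If the cache expires between calls, each expiry adds another write, so the savings shrink for infrequent traffic.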

Google Gemini Rate Limits (2026)

Gemini's free tier is the most generous of the three for prototyping, offering Gemini 2.5 Flash under modest caps (on the order of low tens of requests per minute and around a million tokens per minute, with a daily request ceiling). Paid tiers raise these substantially. Google also adjusts limits by model, with Pro models capped tighter than Flash. Because Google prices and limits long-context requests differently above a context threshold, a single very large request can count against your quota more heavily than its request count suggests.

Rate Limits vs Usage Limits vs Image Limits

These three get conflated. Rate limits are short-window throughput caps (RPM, TPM). Usage limits are total-consumption caps over a billing period: a monthly spend ceiling you set yourself to avoid runaway bills, plus any daily token cap your tier imposes. Image generation limits are a consumer-app concept; ChatGPT Plus and the Gemini app cap how many images you can generate per day, which is unrelated to the API token quota a developer sees. If your question is "why did my chatbot stop at 40 images," that is an app image limit, not an API rate limit.

How to Raise Your Limits Without Waiting

First, prepay credit and keep invoices current; tier increases are largely automatic once you cross spend and time thresholds. Second, batch non-urgent work through the provider's batch endpoint, which uses a separate, more generous quota and is typically billed at about half the standard rate. Third, cache aggressively so the same workload consumes fewer fresh tokens. Fourth, implement exponential backoff with jitter on 429s so a brief spike does not cascade into a thundering-herd retry storm. Fifth, for genuinely large or spiky production traffic, email the provider's sales team with your projected TPM; custom limits exist but you have to ask. None of these is a magic instant upgrade, but together they buy real headroom.
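The fourth step is the one most often implemented incorrectly, so here is a minimal sketch of exponential backoff with full jitter. `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on a 429, and the retry count and delays are illustrative defaults rather than provider recommendations.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def call_with_backoff(send, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call send(); on RateLimitError, wait with full jitter and retry."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; surface the 429 to the caller
            # Exponential ceiling: base, 2x, 4x, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep.
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters exist only so the function can be tested without real waiting. If the 429 response carries a Retry-After header, honoring that value is usually a better choice than a computed delay.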

Frequently Asked Questions

What are LLM API rate limits?

Rate limits cap how much you can send to an API in a window of time. The two most common units are requests per minute (RPM) and tokens per minute (TPM). Some providers also enforce a daily request or token cap and a concurrent-request limit. If you exceed a limit, the API returns an HTTP 429 error and you should retry after a short backoff. Limits are tied to your account's usage tier.

How do I increase my OpenAI or Claude rate limit?

Both providers raise limits automatically as your account ages and your spend grows. OpenAI uses usage tiers from Tier 1 to Tier 5, unlocked by cumulative spend and time since first payment. Anthropic uses a similar tiered system. To move up faster, add credit, pay invoices on time, and for large workloads contact sales to request a custom limit. There is no instant button.

What is the difference between rate limits and usage limits?

Rate limits control throughput in a short window (requests or tokens per minute). Usage limits control total consumption over a billing period, often as a dollar ceiling you set to avoid surprise bills, plus any daily token cap your tier imposes. You can hit a usage limit without ever hitting a rate limit, and vice versa.

How much does cache write cost on Anthropic Claude?

Anthropic charges a one-time premium to write a prompt prefix into the cache, then a steep discount to read it later. As of 2026 the cache write is billed at a markup over base input (commonly around 25 percent more for a standard lifetime), while cache reads run roughly 90 percent cheaper than base input. The exact multipliers and cache durations change, so confirm current figures on Anthropic's pricing page.

Do rate limits apply to ChatGPT and Gemini consumer apps too?

Yes, but differently. Consumer apps cap messages per window and image generations per day rather than tokens per minute. Free tiers are the most restricted; paid plans raise the caps. These app-level limits are separate from the API rate limits a developer hits when building on the same models, so do not assume your API quota matches the chat product.
