
AI Token Pricing 2026 - Cost Per 1M Tokens for Every Major LLM

By Rome Thorndike · April 6, 2026 · 14 min read

Every LLM provider quotes prices in tokens. Most developers have a rough sense that tokens are "kind of like words." That rough understanding leads to rough cost estimates, which lead to budget surprises.

This guide explains token pricing precisely: what tokens are, how much they cost across every major provider, and how to calculate your actual spend before you commit to a model.

Tokens: The Basics

A token is a chunk of text that the model processes as a single unit. For English text:

  • 1 token is roughly 4 characters or about 0.75 words
  • 1,000 tokens is roughly 750 words
  • 1 million tokens is roughly 750,000 words (about 10 novels)
  • A typical email (200 words) is about 270 tokens
  • A 10-page report (3,000 words) is about 4,000 tokens

Code tokenizes differently than prose. Python code typically produces 1.5-2x more tokens per line than English text because of syntax characters, indentation, and variable names. JSON and XML are particularly token-hungry formats.
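For quick budgeting you don't need an exact tokenizer; the rules of thumb above translate directly into a rough estimator. (The `chars_per_token` values here are the approximations from this section, not exact tokenizer output — use your provider's tokenizer for precise counts.)

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token rule for English prose.

    Code, JSON, and XML run heavier per character; pass a lower
    chars_per_token (e.g. 2.5) for those formats.
    """
    return max(1, round(len(text) / chars_per_token))

# 4,000 characters of English prose -> roughly 1,000 tokens
print(estimate_tokens("a" * 4000))  # → 1000
```

This is only a planning tool; real token counts vary with language, formatting, and the model's tokenizer.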

Per-1M-Token Pricing: Every Major Model

All prices current as of April 2026. Input tokens are what you send; output tokens are what the model generates.

Frontier Models (Highest Capability)

Model            Input / 1M Tokens   Output / 1M Tokens   Output:Input Ratio
Claude Opus 4    $15.00              $75.00               5.0x
o3 (OpenAI)      $2.00               $8.00                4.0x
GPT-4.1          $2.00               $8.00                4.0x
GPT-4o           $2.50               $10.00               4.0x
Gemini 2.5 Pro   $1.25               $10.00               8.0x

Mid-Tier Models (Best Price-Performance)

Model               Input / 1M Tokens   Output / 1M Tokens   Output:Input Ratio
Claude Sonnet 4     $3.00               $15.00               5.0x
Mistral Large 2     $2.00               $6.00                3.0x
Cohere Command R+   $2.50               $10.00               4.0x
o4-mini (OpenAI)    $1.10               $4.40                4.0x

Budget Models (High Volume / Low Cost)

Model                       Input / 1M Tokens   Output / 1M Tokens   Output:Input Ratio
GPT-4.1 mini                $0.40               $1.60                4.0x
Claude Haiku 3.5            $0.80               $4.00                5.0x
Gemini 2.5 Flash            $0.15               $0.60                4.0x
Gemini 2.0 Flash            $0.10               $0.40                4.0x
GPT-4.1 nano                $0.10               $0.40                4.0x
Mistral Small               $0.10               $0.30                3.0x
Cohere Command R            $0.15               $0.60                4.0x
Llama 4 Maverick (hosted)   $0.20               $0.60                3.0x
Llama 4 Scout (hosted)      $0.10               $0.25                2.5x

Why Output Tokens Cost More

Every provider charges more for output tokens than input tokens. The ratio ranges from 2.5x (Llama 4 Scout) to 8x (Gemini 2.5 Pro). This isn't arbitrary pricing. Generating output requires the model to run its full forward pass for each token sequentially, while input tokens can be processed in parallel batches.

This asymmetry has a practical implication: controlling output length is the single most effective cost optimization. With identical input, a response capped at 500 output tokens incurs a quarter of the output charges of one that runs to 2,000 tokens, and output tokens carry the highest per-token rate.
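To see the asymmetry in numbers, here's a minimal sketch using GPT-4.1's rates from the table above. In this example, quadrupling the output length triples the total request cost:

```python
# GPT-4.1 pricing from the table above, USD per 1M tokens.
IN_PRICE, OUT_PRICE = 2.00, 8.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in USD."""
    return (input_tokens * IN_PRICE + output_tokens * OUT_PRICE) / 1_000_000

concise = request_cost(1_000, 500)    # same prompt, concise output → $0.006
verbose = request_cost(1_000, 2_000)  # same prompt, verbose output → $0.018
```

The exact multiplier depends on how large your input is relative to your output; the smaller the input, the closer the ratio gets to the raw 4x difference in output tokens.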

Real-World Cost Calculations

Abstract per-million pricing is hard to reason about. Here are concrete examples.

Example 1: Email Classification System

You're classifying incoming customer emails by topic and priority.

  • System prompt: 500 tokens (fixed per request)
  • Average email: 270 tokens
  • Output (classification + reasoning): 50 tokens
  • Volume: 5,000 emails/day

Model              Daily Cost   Monthly Cost
GPT-4.1            $9.70        $291
GPT-4.1 mini       $1.94        $58
GPT-4.1 nano       $0.49        $15
Gemini 2.5 Flash   $0.73        $22
Claude Haiku 3.5   $4.08        $122

For a simple classification task, the difference between the cheapest and most expensive reasonable option is 20x. If classification accuracy is similar across models (test this with your data), there's no reason to pay $291/month when $15/month works.
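Example 1's figures can be reproduced with a few lines. Prices come from the tables above; the token counts and volume are the ones listed in the example:

```python
# (input, output) prices in USD per 1M tokens, from the tables above.
PRICES = {
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def daily_cost(model: str, in_tokens: int, out_tokens: int, requests_per_day: int) -> float:
    """Daily cost in USD for a workload with fixed per-request token counts."""
    p_in, p_out = PRICES[model]
    return requests_per_day * (in_tokens * p_in + out_tokens * p_out) / 1_000_000

# 500-token system prompt + 270-token email in, 50 tokens out, 5,000 emails/day
print(daily_cost("gpt-4.1", 500 + 270, 50, 5_000))  # → 9.7 (USD/day)
```

Swap in any model's prices from the tables above to compare options for your own token counts before committing.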

Example 2: Content Generation Platform

You're generating blog post drafts from outlines.

  • System prompt: 1,000 tokens
  • User input (outline + instructions): 500 tokens
  • Output (draft article): 4,000 tokens (~3,000 words)
  • Volume: 200 articles/day

Model              Daily Cost   Monthly Cost
GPT-4.1            $7.00        $210
Claude Sonnet 4    $12.90       $387
Gemini 2.5 Pro     $8.38        $251
GPT-4.1 mini       $1.40        $42
Gemini 2.5 Flash   $0.53        $16

Content generation is output-heavy, so the output token price dominates. Claude Sonnet's $15/1M output pricing makes it expensive for this workload despite competitive input pricing.

Example 3: RAG-Based Q&A System

Knowledge base chatbot retrieving context from documents.

  • System prompt: 800 tokens
  • Retrieved context: 3,000 tokens (average 4 chunks)
  • User question: 50 tokens
  • Answer: 300 tokens
  • Volume: 2,000 queries/day

Model                       Daily Cost   Monthly Cost
GPT-4.1                     $20.20       $606
Claude Sonnet 4             $32.10       $963
GPT-4.1 mini                $4.04        $121
Gemini 2.5 Flash            $1.52        $46
Llama 4 Maverick (hosted)   $1.90        $57

RAG systems are input-heavy (large context windows), so input token pricing matters more. Prompt caching can dramatically reduce costs here since the system prompt repeats every request.
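Here's a sketch of what caching does to the input side of this workload. The $0.50/1M cached-input rate below is an assumption for illustration; check your provider's current cached-token pricing, since discount structures differ by provider and model:

```python
def daily_input_cost(in_price: float, cached_price: float,
                     cached_tokens: int, fresh_tokens: int,
                     requests: int) -> float:
    """Daily input cost in USD. Prices are USD per 1M tokens;
    token counts are per request."""
    per_request = (cached_tokens * cached_price + fresh_tokens * in_price) / 1_000_000
    return per_request * requests

# Example 3 numbers: the 800-token system prompt repeats every request and is
# cacheable; the retrieved context (3,000) + question (50) are fresh each time.
uncached = daily_input_cost(2.00, 2.00, 800, 3_050, 2_000)  # no cache discount
cached   = daily_input_cost(2.00, 0.50, 800, 3_050, 2_000)  # assumed $0.50/1M cached rate
```

In this setup only the system prompt benefits; caching the retrieved context as well (where your retrieval patterns allow it) is where the larger savings come from.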

Per-1K Tokens vs. Per-1M Tokens

Some documentation quotes prices per 1K tokens, others per 1M tokens. The industry has largely standardized on per-1M-token pricing, but you'll still encounter per-1K pricing in older docs.

Conversion: divide the per-1M price by 1,000 to get per-1K pricing. GPT-4.1 at $2.00/1M input = $0.002/1K input.
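The conversion is a one-liner if you want it in code:

```python
def per_1k(price_per_1m: float) -> float:
    """Convert a per-1M-token price to a per-1K-token price."""
    return price_per_1m / 1_000

print(per_1k(2.00))  # → 0.002
```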

Hidden Costs to Account For

Reasoning Token Overhead

OpenAI's o3 and o4-mini models generate internal reasoning tokens that you're billed for as output tokens but never see. A simple question might generate 500 visible output tokens but consume 3,000 reasoning tokens internally. Your actual cost is 7x what you'd estimate from the visible output alone.
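A quick sanity check of that overhead, using o4-mini's output rate from the table above. The 3,000 reasoning tokens are the illustrative figure from this section, not a fixed number — reasoning depth varies with the question and the model's effort setting:

```python
OUT_PRICE = 4.40  # o4-mini output, USD per 1M tokens (from the table above)

visible_tokens = 500      # what you see in the response
reasoning_tokens = 3_000  # hidden, but billed as output (illustrative figure)

naive  = visible_tokens * OUT_PRICE / 1_000_000
actual = (visible_tokens + reasoning_tokens) * OUT_PRICE / 1_000_000

print(actual / naive)  # → 7.0 — actual cost is 7x the naive estimate
```

Most provider APIs report reasoning token counts in the usage metadata of each response; log them so your cost estimates reflect what you're actually billed.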

Retry and Error Costs

Rate limits, timeouts, and malformed outputs mean you'll retry some percentage of requests. Budget for 5-15% overhead depending on your error handling logic and the provider's reliability.

Embedding Costs

If you're building RAG systems, you also pay for embedding your documents. OpenAI's text-embedding-3-small costs $0.02/1M tokens. See our Best Embedding Models 2026 guide for a full comparison.

Fine-Tuning Costs

Training tokens for fine-tuning cost 2-8x more than inference tokens. OpenAI charges $25.00/1M training tokens for GPT-4.1 mini. Factor this into your total cost of ownership if you plan to fine-tune.
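A back-of-the-envelope training-cost check using the $25.00/1M figure above. The dataset size and epoch count are assumptions to adjust for your own case:

```python
TRAIN_PRICE = 25.00          # USD per 1M training tokens (GPT-4.1 mini, from the text)

dataset_tokens = 2_000_000   # assumption: total tokens in your training set
epochs = 3                   # assumption: passes over the data

training_cost = dataset_tokens * epochs * TRAIN_PRICE / 1_000_000
print(training_cost)  # → 150.0 (USD, one-time)
```

Compare that one-time figure against the ongoing inference savings a fine-tuned smaller model would unlock; it often pays back quickly at high volume, and not at all at low volume.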

Cost Optimization Checklist

Apply these in order of impact:

  1. Right-size your model. Test cheaper models first. You might not need GPT-4.1 when GPT-4.1 mini or Gemini Flash produces acceptable results for your task.
  2. Minimize output tokens. Request JSON instead of prose. Set max_tokens. Use structured output schemas to prevent verbose generation.
  3. Enable prompt caching. OpenAI applies prompt caching automatically to repeated prompt prefixes; Anthropic requires you to mark cacheable blocks explicitly with cache_control. Either way, keep your system prompt stable and at the start of the request to benefit.
  4. Batch when possible. OpenAI's Batch API is 50% cheaper. Anthropic offers similar discounts for batched requests.
  5. Compress context. Use RAG to retrieve only relevant sections. Summarize long contexts before passing them to the model.
  6. Monitor and alert. Set spending alerts in your provider dashboard. Review token usage weekly to catch inefficiencies early.

For a comprehensive comparison of all model pricing, see our LLM Pricing Comparison (April 2026) guide.

About the Author

Rome Thorndike is the founder of the Prompt Engineer Collective, a community of over 1,300 prompt engineering professionals, and author of The AI News Digest, a weekly newsletter with 2,700+ subscribers. Rome brings hands-on AI/ML experience from Microsoft, where he worked with Dynamics and Azure AI/ML solutions, and later led sales at Datajoy (acquired by Databricks).
