
Tokenizer

Quick Answer: The component that converts raw text into the sequence of tokens a model can process, and converts tokens back into text.
A tokenizer sits between raw text and the model: it maps text to a sequence of token IDs on the way in, and maps token IDs back to text on the way out. Because different models ship different tokenizers, the same word might be a single token in one model and several subword tokens in another, depending on each tokenizer's vocabulary.

Example

The word 'unbelievable' might be tokenized as ['un', 'believ', 'able'] (3 tokens). Common words like 'the' are typically 1 token. Non-English text and code often use more tokens per character than English prose.
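The split above can be illustrated with a toy greedy longest-match tokenizer. The vocabulary below is hypothetical and chosen to reproduce the example; real tokenizers (BPE, SentencePiece) learn their vocabularies from data, so actual splits will differ by model.

```python
# Illustrative only: a toy greedy longest-match subword tokenizer with a
# hypothetical, hand-picked vocabulary. Real model tokenizers will split
# text differently.
VOCAB = {"the", "un", "believ", "able", "token", "izer"}

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Fall back to a single character for out-of-vocabulary input.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able'] — 3 tokens
print(tokenize("the"))           # ['the'] — one token for a common word
```

Note how a rare word costs three tokens while a common word costs one; this is the asymmetry the example describes.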

Why It Matters

Tokenizer differences explain why the same text costs different amounts across models. Understanding tokenization helps prompt engineers estimate costs, stay within context limits, and optimize prompt length.
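Cost estimation from token counts is simple arithmetic. The prices in this sketch are hypothetical placeholders, not any provider's actual rates:

```python
# Back-of-envelope API cost estimate from token counts.
# The per-million-token prices are hypothetical examples.
def estimate_cost(prompt_tokens: int, output_tokens: int,
                  price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Return the estimated cost in dollars for one API call."""
    return (prompt_tokens * price_in_per_1m
            + output_tokens * price_out_per_1m) / 1_000_000

# e.g. 2,000 prompt tokens and 500 output tokens at $3 / $15 per million:
print(estimate_cost(2000, 500, 3.0, 15.0))  # 0.0135
```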

How It Works

A tokenizer converts raw text into the numerical token IDs that a language model can process. Most modern tokenizers use subword algorithms like Byte-Pair Encoding (BPE) or SentencePiece that learn a vocabulary by finding frequently occurring character sequences in training data.
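The BPE learning loop can be sketched in a few lines: repeatedly find the most frequent adjacent symbol pair and merge it into a new vocabulary entry. This toy version trains on a tiny word list rather than a corpus and omits byte-level fallbacks, but the core merge step is the same idea:

```python
from collections import Counter

# Minimal sketch of BPE vocabulary learning: repeatedly merge the most
# frequent adjacent symbol pair into a new symbol. Toy version only.
def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start with each word as a sequence of single characters.
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere it occurs.
        for seq in seqs:
            i = 0
            while i < len(seq) - 1:
                if (seq[i], seq[i + 1]) == best:
                    seq[i:i + 2] = [seq[i] + seq[i + 1]]
                else:
                    i += 1
    return merges

print(learn_bpe(["low", "lower", "lowest", "log"], 2))
# [('l', 'o'), ('lo', 'w')] — frequent character sequences become tokens
```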

The tokenizer's vocabulary directly affects model behavior. Common English words are often single tokens, while rare words and non-English text get split into multiple subword tokens. This is why models tend to handle common language better than specialized jargon, and why API costs vary by language (Chinese text can use roughly twice as many tokens as equivalent English content, depending on the tokenizer).

Each model family has its own tokenizer with a different vocabulary. GPT-4's tokenizer (cl100k_base) has a vocabulary of 100,256 tokens; Claude and Llama use different tokenizers. This means the same text produces different token counts, and therefore different costs, across providers. OpenAI's tiktoken library and Hugging Face's tokenizers library let you count tokens before making API calls.

Common Mistakes

Common mistake: Assuming token counts are the same across different models

Always count tokens using the specific model's tokenizer. The same text can be 100 tokens in one model and 130 in another.

Common mistake: Ignoring tokenization when debugging unexpected model behavior

If a model struggles with specific words or patterns, check how they tokenize. Unusual tokenization (splitting a word into many pieces) often correlates with poor performance on that input.

Career Relevance

Tokenizer understanding is important for cost optimization, debugging model behavior, and multilingual AI applications. It's a practical skill that separates experienced AI practitioners from beginners.
