Core Concepts

Self-Attention

Quick Answer: The mechanism inside transformer models that allows each token in a sequence to look at and weigh the relevance of every other token when computing its representation.
Self-attention is what lets language models understand context, resolve ambiguities, and capture long-range dependencies in text.

Example

In the sentence 'The cat sat on the mat because it was tired,' self-attention lets the model determine that 'it' refers to 'cat' (not 'mat') by computing high attention weights between 'it' and 'cat' based on semantic compatibility.

Why It Matters

Self-attention is the core innovation that makes transformers (and therefore all modern LLMs) work. Understanding it helps you grasp why models excel at some tasks, why context window size matters, and why certain prompt structures are more effective than others.

How It Works

Self-attention works by computing three vectors for each token: Query (what am I looking for?), Key (what do I contain?), and Value (what information do I carry?). For each token, the model computes attention scores by comparing its Query with every other token's Key. These scores determine how much each token's Value contributes to the current token's updated representation.
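The Query/Key/Value computation above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with random projection matrices standing in for learned weights; shapes and dimensions are arbitrary toy values, not taken from any real model.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (minimal sketch).

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv project to d_k dims.
    Returns the updated representations and the attention weight matrix.
    """
    Q = X @ Wq                               # Query: what each token looks for
    K = X @ Wk                               # Key: what each token contains
    V = X @ Wv                               # Value: what each token carries
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # compare every Query to every Key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V, weights              # output: weighted mix of Values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 tokens, toy 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4): one updated representation per token
```

Each row of `weights` is that token's attention distribution over the whole sequence, which is exactly the quantity visualized in attention heatmaps.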

The 'self' in self-attention means the model attends to other positions within the same input sequence, as opposed to cross-attention where it attends to a separate input. Multi-head attention runs multiple attention computations in parallel, each focusing on different aspects of the relationships (syntax, semantics, coreference, etc.).
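The parallel-heads idea can be sketched the same way: split the work across several independent attention computations and concatenate the results. Again, the projection matrices here are random placeholders rather than trained weights, and the head count and dimensions are arbitrary.

```python
import numpy as np

def multi_head_attention(X, n_heads, rng):
    """Toy multi-head self-attention: n_heads independent attention
    computations over the same sequence, outputs concatenated."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads              # each head works in a smaller space
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)   # each head attends independently
        heads.append(w @ V)
    return np.concatenate(heads, axis=-1)    # back to (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = multi_head_attention(X, n_heads=2, rng=rng)
print(out.shape)  # (5, 8)
```

In a trained transformer each head learns its own projections, which is how different heads come to specialize in different relationships.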

The computational cost of self-attention scales quadratically with sequence length, because every token must attend to every other token. This is why extending context windows is challenging and expensive. Techniques like FlashAttention, sparse attention, and sliding window attention reduce this cost while preserving most of the capability. Understanding this trade-off explains why longer context windows cost more per token and why models process long inputs more slowly.
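The quadratic growth is easy to see with back-of-the-envelope numbers: the attention matrix holds one score per token pair. The snippet below assumes fp16 scores (2 bytes each) and counts a single attention matrix, i.e. one head in one layer; real models multiply this by heads and layers.

```python
# One attention score per token pair: memory and compute grow as n^2.
for n in (1_000, 8_000, 128_000):
    scores = n * n                   # pairwise Query-Key comparisons
    mem_gb = scores * 2 / 1e9        # fp16 = 2 bytes/score, one head, one layer
    print(f"{n:>7} tokens -> {scores:.1e} scores, ~{mem_gb:.2f} GB")
```

Going from 8,000 to 128,000 tokens is a 16x longer input but a 256x larger attention matrix, which is why context extension relies on techniques like FlashAttention (which avoids materializing the full matrix) and sparse or sliding-window attention (which skip most pairs).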

Common Mistakes

Common mistake: Assuming the model pays equal attention to all parts of the input

Attention is not distributed uniformly: models attend more to certain positions and patterns. Content at the beginning and end of prompts typically receives more attention than content in the middle, an effect often called "lost in the middle."

Common mistake: Not considering attention patterns when designing prompts

Place the most important instructions and context where the model is most likely to attend to them: at the start and end of your prompt.

Common mistake: Conflating attention weights with model reasoning

Attention patterns show what the model focuses on, but they don't fully explain why it reaches a particular conclusion. Use attention as one interpretability signal among many.

Career Relevance

Self-attention knowledge demonstrates deep understanding of how LLMs work. It's valuable for technical interviews, prompt optimization, and communicating with ML engineers about model behavior and limitations.
