Self-Attention
Why It Matters
Self-attention is the core innovation that makes transformers (and therefore all modern LLMs) work. Understanding it helps you grasp why models excel at some tasks, why context window size matters, and why certain prompt structures are more effective than others.
How It Works
Self-attention works by computing three vectors for each token: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). For each token, the model computes attention scores by comparing its Query with every other token's Key. These scores, normalized with a softmax, determine how much each token's Value contributes to the current token's updated representation.
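The steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration: the weight matrices are random stand-ins for learned parameters, and the dimensions are toy-sized.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over token embeddings X (n_tokens, d_model)."""
    Q = X @ W_q  # what each token is looking for
    K = X @ W_k  # what each token contains
    V = X @ W_v  # information each token carries
    d_k = K.shape[-1]
    # Compare every Query with every Key: an (n_tokens, n_tokens) score matrix
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of Values per token

# Toy example: 4 tokens, model dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # one updated 4-dim representation per token
```

The division by the square root of the Key dimension is the standard scaling that keeps the softmax from saturating as dimensions grow.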
The 'self' in self-attention means the model attends to other positions within the same input sequence, as opposed to cross-attention where it attends to a separate input. Multi-head attention runs multiple attention computations in parallel, each focusing on different aspects of the relationships (syntax, semantics, coreference, etc.).
The computational cost of self-attention scales quadratically with sequence length, because every token must attend to every other token. This is why extending context windows is challenging and expensive. Techniques like FlashAttention, sparse attention, and sliding window attention reduce this cost while preserving most of the capability. Understanding this trade-off explains why longer context windows cost more per token and why models process long inputs more slowly.
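The quadratic scaling is easy to see by counting score-matrix entries: one per token pair. A rough back-of-envelope (float32, one head, one layer):

```python
# The attention score matrix has n * n entries for n tokens, so doubling
# the sequence length quadruples the score memory and compute.
for n in [1_000, 2_000, 4_000]:
    entries = n * n
    mb = entries * 4 / 1e6  # float32 = 4 bytes per entry
    print(f"{n:>5} tokens -> {entries:>13,} score entries (~{mb:,.0f} MB per head/layer)")
```

Multiply by the number of heads and layers and the cost of naive long-context attention becomes clear; this is the pressure that motivates FlashAttention and sparse or sliding-window variants.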
Common Mistakes
Common mistake: Assuming the model pays equal attention to all parts of the input
Models attend more to certain positions and patterns. Content at the beginning and end of prompts typically receives more attention than content in the middle.
Common mistake: Not considering attention patterns when designing prompts
Place the most important instructions and context where the model is most likely to attend to them: at the start and end of your prompt.
Common mistake: Conflating attention weights with model reasoning
Attention patterns show what the model focuses on, but they don't fully explain why it reaches a particular conclusion. Use attention as one interpretability signal among many.
Career Relevance
Self-attention knowledge demonstrates deep understanding of how LLMs work. It's valuable for technical interviews, prompt optimization, and communicating with ML engineers about model behavior and limitations.