# Transformer Attention: Full Conceptual Breakdown
This document summarizes an in-depth discussion on attention mechanisms in Transformers, with a special focus on vocabulary embeddings, Q/K/V matrices, and multi-head attention.
## 1. Understanding the Self-Attention Image
- The image shows a single-head self-attention computation.
- Each row is a token (element) at a position, with a feature vector (embedding).
- The attention weights (left column) are used to compute a weighted sum over these vectors.
- The final output vector is shown at the bottom; this is the attention output for one token.
## 2. Element vs. Position
- Element: the actual word or token in the input sequence.
- Position: the index of the element in the sequence.
- Though tightly coupled (1:1), they are conceptually different.
- Transformers rely on positional encoding to retain order, since attention alone is orderless.
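For concreteness, here is a minimal sketch of the sinusoidal positional encoding used in the original Transformer paper; the function name and the toy usage line are illustrative, not part of the discussion above, and it assumes d_model is even.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Illustrative sinusoidal positional encoding (assumes d_model is even)."""
    position = torch.arange(seq_len).unsqueeze(1)                  # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model)
    )                                                              # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                   # odd dimensions
    return pe                                                      # (seq_len, d_model)

# The encoding is simply added to the token embeddings so the model can tell positions apart:
# X = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```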
## 3. How Attention Scores Are Computed
- Input embeddings X are projected into:
  - Queries (Q)
  - Keys (K)
  - Values (V)
- Attention score between token i and j:
  score = dot(Q[i], K[j]) / sqrt(d_k)
- Apply softmax to the scores to get attention weights.
- Multiply each Value by its weight and sum the results; this gives the final output vector (a minimal sketch follows).
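A minimal PyTorch sketch of these four steps, with random placeholder weights standing in for trained projections:

```python
import math
import torch

L, d_model, d_k = 5, 16, 16          # toy sizes; d_k == d_model for a single head

X = torch.randn(L, d_model)          # input embeddings, one row per token
W_Q = torch.randn(d_model, d_k)      # placeholder "learned" projection matrices
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # project X into queries, keys, values
scores = (Q @ K.T) / math.sqrt(d_k)          # score[i, j] = dot(Q[i], K[j]) / sqrt(d_k)
weights = torch.softmax(scores, dim=-1)      # each row sums to 1
output = weights @ V                         # weighted sum of values, shape (L, d_k)
```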
## 4. What Is X in the Diagram?
- The large matrix on the right of the image is the input embedding matrix X.
- Shape: sequence_length × embedding_dim
- It is built by looking up each token's vector from the vocabulary embedding matrix (see the sketch below).
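A small sketch of that lookup; the vocabulary size and token ids are made up for illustration.

```python
import torch

vocab_size, d_model = 10_000, 512
embedding = torch.nn.Embedding(vocab_size, d_model)   # the vocabulary embedding matrix

token_ids = torch.tensor([42, 7, 256, 7])             # hypothetical tokenized sentence
X = embedding(token_ids)                               # (sequence_length, embedding_dim) = (4, 512)
```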
## 5. What Is Multi-Head Attention?
- Single-head attention is shown in the image.
- Multi-head attention (sketched below):
  - Splits X into smaller chunks of size d_model / n_heads
  - Computes self-attention in parallel on each chunk (head)
  - Concatenates the results from all heads
  - Applies a final linear projection
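A sketch of the split / attend / concatenate / project steps; the sizes and random weight matrices are placeholders under the usual assumption that W_Q, W_K, W_V, and the output projection W_O are all (d_model, d_model).

```python
import math
import torch

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Sketch of multi-head self-attention over a single sequence X of shape (L, d_model)."""
    L, d_model = X.shape
    d_k = d_model // n_heads

    def split_heads(M):
        # Reshape (L, d_model) into n_heads chunks of size d_k each.
        return M.view(L, n_heads, d_k).transpose(0, 1)          # (n_heads, L, d_k)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)           # (n_heads, L, L)
    weights = torch.softmax(scores, dim=-1)
    heads = weights @ V                                         # one (L, d_k) output per head

    concat = heads.transpose(0, 1).reshape(L, d_model)          # concatenate heads back to (L, d_model)
    return concat @ W_O                                         # final linear projection

L, d_model, n_heads = 6, 512, 8
X = torch.randn(L, d_model)
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) for _ in range(4))
out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, n_heads)   # (L, d_model)
```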
## 6. Vocab Embedding Matrix vs. Q/K/V
- Vocabulary embedding matrix:
  - Initialized randomly
  - Trained to map each token to a vector
- Q, K, V:
  - Computed from X using the learned matrices W_Q, W_K, W_V
  - Not stored in the vocabulary matrix
  - Recomputed for every input; the projection matrices W_Q, W_K, W_V themselves are trainable and persistent
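A sketch showing both kinds of weights living side by side as trainable parameters; the module name TinyAttentionBlock and its default sizes are illustrative.

```python
import torch.nn as nn

class TinyAttentionBlock(nn.Module):
    """Illustrative module: vocab embedding and Q/K/V projections are separate trainable weights."""
    def __init__(self, vocab_size=10_000, d_model=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)     # vocabulary embedding matrix
        self.W_Q = nn.Linear(d_model, d_model, bias=False)     # learned projection matrices
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)

    def forward(self, token_ids):
        X = self.embedding(token_ids)                          # lookup: X is built from the vocab matrix
        return self.W_Q(X), self.W_K(X), self.W_V(X)           # Q, K, V are recomputed from X every call
```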
## 7. Lifetime of W_Q, W_K, W_V
- These matrices are:
  - Initialized once
  - Trained over time
  - Reused across batches
- They are not reset per input or per batch.
- Gradients update them through backpropagation.
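A sketch of that lifetime in a training loop, reusing the illustrative TinyAttentionBlock from section 6; the batch data and the loss are placeholders whose only purpose is to drive backpropagation.

```python
import torch

model = TinyAttentionBlock()                          # parameters created exactly once, at construction
optimizer = torch.optim.Adam(model.parameters())      # the optimizer holds references to those same tensors

# Toy "dataset": three batches of random token ids, purely for illustration.
batches = [torch.randint(0, 10_000, (4,)) for _ in range(3)]

for token_ids in batches:
    Q, K, V = model(token_ids)                        # W_Q / W_K / W_V are reused, never re-initialized
    loss = Q.pow(2).mean() + K.pow(2).mean() + V.pow(2).mean()   # placeholder loss
    optimizer.zero_grad()
    loss.backward()                                   # gradients flow back into the persistent matrices
    optimizer.step()                                  # in-place update; the same matrices carry over to the next batch
```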
## 8. Is the Vocabulary Matrix Also Trainable?
Yes.
- It is randomly initialized and trained alongside the rest of the model.
- Each token lookup retrieves a vector from this matrix.
- This matrix evolves to encode semantic relationships between words.
## 9. Use Cases After Training
| Goal | Uses Vocab Matrix | Uses W_Q/K/V |
|---|---|---|
| Inference on new sentence | Yes | Yes |
| Static embedding for a token | Yes | No |
| Contextual embedding in sentence | Yes | Yes |
## 10. Dimensions of X, Q, K, V, and Attention
Let:

- L = sequence length
- d_model = embedding dimension (e.g. 512)
- n_heads = number of attention heads
- d_k = d_model / n_heads
| Component | Shape |
|---|---|
| Input X | (L, d_model) |
| W_Q, W_K, W_V | (d_model, d_model) |
| Q, K, V (stacked across heads) | (n_heads, L, d_k) |
| Attention output (per head) | (L, d_k) |
| Concatenated heads | (L, d_model) |
| Final output | (L, d_model) |
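A quick sanity check of the first few shapes in this table; the sizes are arbitrary.

```python
import torch

L, d_model, n_heads = 10, 512, 8
d_k = d_model // n_heads

X = torch.randn(L, d_model)
W_Q = torch.randn(d_model, d_model)

Q = (X @ W_Q).view(L, n_heads, d_k).transpose(0, 1)   # queries stacked per head

assert X.shape == (L, d_model)            # Input X
assert W_Q.shape == (d_model, d_model)    # W_Q (same for W_K, W_V)
assert Q.shape == (n_heads, L, d_k)       # Q, K, V stacked across heads
```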
## 11. Why Isn't the Final Output a Distribution Over Vocabulary?
This is a great question that highlights a common confusion.
- The output of multi-head attention (and of the full Transformer stack) has shape (L, d_model).
- The vocabulary distribution appears only after a final linear layer:

  W_vocab ∈ ℝ^(d_model × vocab_size)
  logits = output × W_vocab → shape (L, vocab_size)

- Applying softmax to the logits then gives a probability distribution over the vocabulary for each token position.
| Stage | Output Shape |
|---|---|
| Multi-head Attention | (L, d_model) |
| Final Linear Projection | (L, vocab_size) |
| Softmax | (L, vocab_size) |
So the discrepancy is resolved once we remember that attention is only one component; the final vocabulary distribution is computed later in the model pipeline.
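A short sketch of that final step, with stand-in values for the Transformer output and the projection matrix:

```python
import torch

L, d_model, vocab_size = 10, 512, 32_000          # illustrative sizes

attn_output = torch.randn(L, d_model)             # stand-in for the Transformer stack's output
W_vocab = torch.randn(d_model, vocab_size)        # final projection ("unembedding") matrix

logits = attn_output @ W_vocab                    # (L, vocab_size)
probs = torch.softmax(logits, dim=-1)             # one distribution over the vocabulary per position

assert probs.shape == (L, vocab_size)
assert torch.allclose(probs.sum(dim=-1), torch.ones(L))   # each row sums to 1
```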
Prepared as a study summary by ChatGPT based on a thread of detailed conceptual questions.