# Transformer Attention: Full Conceptual Breakdown
This document summarizes an in-depth discussion on attention mechanisms in Transformers, with a special focus on vocabulary embeddings, Q/K/V matrices, and multi-head attention.
## 1. Understanding the Self-Attention Image
- The image shows a single-head self-attention computation.
- Each row is a token (element) at a position, with a feature vector (embedding).
- The attention weights (left column) are used to compute a weighted sum over these vectors.
- The final output vector is shown at the bottom; this is the attention output for one token.
## 2. Element vs. Position
- Element: the actual word or token in the input sequence.
- Position: the index of the element in the sequence.
- Though tightly coupled (1:1), they are conceptually different.
- Transformers rely on positional encoding to retain order, since attention alone is orderless.
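For concreteness, here is a minimal sketch of the sinusoidal positional encoding used in the original Transformer paper; the function name and the toy usage line are illustrative, not part of the discussion above, and it assumes d_model is even.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Illustrative sinusoidal positional encoding (assumes d_model is even)."""
    position = torch.arange(seq_len).unsqueeze(1)                  # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model)
    )                                                              # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                   # odd dimensions
    return pe                                                      # (seq_len, d_model)

# The encoding is simply added to the token embeddings so the model can tell positions apart:
# X = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```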
## 3. How Attention Scores Are Computed
- Input embeddings X are projected into:
  - Queries (Q)
  - Keys (K)
  - Values (V)
- Attention score between token i and j:
  score = dot(Q[i], K[j]) / sqrt(d_k)
- Apply softmax to the scores to get attention weights.
- Multiply each Value by its weight and sum the results; this gives the final output vector (a minimal sketch follows).
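A minimal PyTorch sketch of these four steps, with random placeholder weights standing in for trained projections:

```python
import math
import torch

L, d_model, d_k = 5, 16, 16          # toy sizes; d_k == d_model for a single head

X = torch.randn(L, d_model)          # input embeddings, one row per token
W_Q = torch.randn(d_model, d_k)      # placeholder "learned" projection matrices
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # project X into queries, keys, values
scores = (Q @ K.T) / math.sqrt(d_k)          # score[i, j] = dot(Q[i], K[j]) / sqrt(d_k)
weights = torch.softmax(scores, dim=-1)      # each row sums to 1
output = weights @ V                         # weighted sum of values, shape (L, d_k)
```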
## 4. What Is X in the Diagram?
- The large matrix on the right of the image is the input embedding matrix X.
- Shape: sequence_length × embedding_dim
- It is built by looking up each token's vector from the vocabulary embedding matrix (see the sketch below).
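A small sketch of that lookup; the vocabulary size and token ids are made up for illustration.

```python
import torch

vocab_size, d_model = 10_000, 512
embedding = torch.nn.Embedding(vocab_size, d_model)   # the vocabulary embedding matrix

token_ids = torch.tensor([42, 7, 256, 7])             # hypothetical tokenized sentence
X = embedding(token_ids)                               # (sequence_length, embedding_dim) = (4, 512)
```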
## 5. What Is Multi-Head Attention?
- Single-head attention is shown in the image.
- Multi-head attention (sketched below):
  - Splits X into smaller chunks of size d_model / n_heads
  - Computes self-attention in parallel on each chunk (head)
  - Concatenates the results from all heads
  - Applies a final linear projection
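A sketch of the split / attend / concatenate / project steps; the sizes and random weight matrices are placeholders under the usual assumption that W_Q, W_K, W_V, and the output projection W_O are all (d_model, d_model).

```python
import math
import torch

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Sketch of multi-head self-attention over a single sequence X of shape (L, d_model)."""
    L, d_model = X.shape
    d_k = d_model // n_heads

    def split_heads(M):
        # Reshape (L, d_model) into n_heads chunks of size d_k each.
        return M.view(L, n_heads, d_k).transpose(0, 1)          # (n_heads, L, d_k)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)           # (n_heads, L, L)
    weights = torch.softmax(scores, dim=-1)
    heads = weights @ V                                         # one (L, d_k) output per head

    concat = heads.transpose(0, 1).reshape(L, d_model)          # concatenate heads back to (L, d_model)
    return concat @ W_O                                         # final linear projection

L, d_model, n_heads = 6, 512, 8
X = torch.randn(L, d_model)
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) for _ in range(4))
out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, n_heads)   # (L, d_model)
```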
## 6. Vocab Embedding Matrix vs. Q/K/V
- Vocabulary embedding matrix:
  - Initialized randomly
  - Trained to map each token to a vector
- Q, K, V:
  - Computed from X using the learned matrices W_Q, W_K, W_V
  - Not stored in the vocabulary matrix
  - Recomputed for every input; the projection matrices W_Q, W_K, W_V themselves are trainable and persistent
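A sketch showing both kinds of weights living side by side as trainable parameters; the module name TinyAttentionBlock and its default sizes are illustrative.

```python
import torch.nn as nn

class TinyAttentionBlock(nn.Module):
    """Illustrative module: vocab embedding and Q/K/V projections are separate trainable weights."""
    def __init__(self, vocab_size=10_000, d_model=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)     # vocabulary embedding matrix
        self.W_Q = nn.Linear(d_model, d_model, bias=False)     # learned projection matrices
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)

    def forward(self, token_ids):
        X = self.embedding(token_ids)                          # lookup: X is built from the vocab matrix
        return self.W_Q(X), self.W_K(X), self.W_V(X)           # Q, K, V are recomputed from X every call
```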
## 7. Lifetime of W_Q, W_K, W_V
- These matrices are:
  - Initialized once
  - Trained over time
  - Reused across batches
- They are not reset per input or per batch.
- Gradients update them through backpropagation.
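A sketch of that lifetime in a training loop, reusing the illustrative TinyAttentionBlock from section 6; the batch data and the loss are placeholders whose only purpose is to drive backpropagation.

```python
import torch

model = TinyAttentionBlock()                          # parameters created exactly once, at construction
optimizer = torch.optim.Adam(model.parameters())      # the optimizer holds references to those same tensors

# Toy "dataset": three batches of random token ids, purely for illustration.
batches = [torch.randint(0, 10_000, (4,)) for _ in range(3)]

for token_ids in batches:
    Q, K, V = model(token_ids)                        # W_Q / W_K / W_V are reused, never re-initialized
    loss = Q.pow(2).mean() + K.pow(2).mean() + V.pow(2).mean()   # placeholder loss
    optimizer.zero_grad()
    loss.backward()                                   # gradients flow back into the persistent matrices
    optimizer.step()                                  # in-place update; the same matrices carry over to the next batch
```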
## 8. Is the Vocabulary Matrix Also Trainable?
Yes.
- It is randomly initialized and trained alongside the rest of the model.
- Each token lookup retrieves a vector from this matrix.
- This matrix evolves to encode semantic relationships between words.
## 9. Use Cases After Training
| Goal | Uses Vocab Matrix | Uses W_Q/K/V |
|---|---|---|
| Inference on new sentence | Yes | Yes |
| Static embedding for a token | Yes | No |
| Contextual embedding in sentence | Yes | Yes |
## 10. Dimensions of X, Q, K, V, and Attention
Let:

- L = sequence length
- d_model = embedding dimension (e.g. 512)
- n_heads = number of attention heads
- d_k = d_model / n_heads
| Component | Shape |
|---|---|
| Input X | (L, d_model) |
| W_Q, W_K, W_V | (d_model, d_model) |
| Q, K, V (stacked across heads) | (n_heads, L, d_k) |
| Attention output (per head) | (L, d_k) |
| Concatenated heads | (L, d_model) |
| Final output | (L, d_model) |
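A quick sanity check of the first few shapes in this table; the sizes are arbitrary.

```python
import torch

L, d_model, n_heads = 10, 512, 8
d_k = d_model // n_heads

X = torch.randn(L, d_model)
W_Q = torch.randn(d_model, d_model)

Q = (X @ W_Q).view(L, n_heads, d_k).transpose(0, 1)   # queries stacked per head

assert X.shape == (L, d_model)            # Input X
assert W_Q.shape == (d_model, d_model)    # W_Q (same for W_K, W_V)
assert Q.shape == (n_heads, L, d_k)       # Q, K, V stacked across heads
```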
## 11. Why Isn't the Final Output a Distribution Over Vocabulary?
This is a great question that highlights a common confusion.
- The output of multi-head attention (and of the full Transformer stack) has shape (L, d_model).
- The vocabulary distribution appears only after a final linear layer:

  W_vocab ∈ ℝ^(d_model × vocab_size)
  logits = output × W_vocab → shape (L, vocab_size)

- Applying softmax to the logits then gives a probability distribution over the vocabulary for each token position.
| Stage | Output Shape |
|---|---|
| Multi-head Attention | (L, d_model) |
| Final Linear Projection | (L, vocab_size) |
| Softmax | (L, vocab_size) |
So the discrepancy is resolved once we remember that attention is only one component; the final vocabulary distribution is computed later in the model pipeline.
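A short sketch of that final step, with stand-in values for the Transformer output and the projection matrix:

```python
import torch

L, d_model, vocab_size = 10, 512, 32_000          # illustrative sizes

attn_output = torch.randn(L, d_model)             # stand-in for the Transformer stack's output
W_vocab = torch.randn(d_model, vocab_size)        # final projection ("unembedding") matrix

logits = attn_output @ W_vocab                    # (L, vocab_size)
probs = torch.softmax(logits, dim=-1)             # one distribution over the vocabulary per position

assert probs.shape == (L, vocab_size)
assert torch.allclose(probs.sum(dim=-1), torch.ones(L))   # each row sums to 1
```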
Prepared as a study summary by ChatGPT based on a thread of detailed conceptual questions.