Transformer Attention: Full Conceptual Breakdown

This document summarizes an in-depth discussion on attention mechanisms in Transformers, with a special focus on vocabulary embeddings, Q/K/V matrices, and multi-head attention.


📌 1. Understanding the Self-Attention Image #

  • The image shows a single-head self-attention computation.
  • Each row is a token (element) at a position, with a feature vector (embedding).
  • The attention weights (left column) are used to compute a weighted sum over these vectors.
  • The final output vector is shown at the bottom; this is the attention output for one token.

πŸ” 2. Element vs. Position #

  • Element: the actual word or token in the input sequence.
  • Position: the index of the element in the sequence.
  • Though tightly coupled (1:1), they are conceptually different.
  • Transformers rely on positional encoding to retain order, since the attention operation itself is permutation-invariant: it has no built-in notion of token order.
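The original Transformer paper uses fixed sinusoidal positional encodings, added to the token embeddings. A minimal NumPy sketch (the function name and toy sizes are illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dim = np.arange(0, d_model, 2)[None, :]     # even dimension indices
    angles = pos / (10000 ** (dim / d_model))   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # sine on even indices
    pe[:, 1::2] = np.cos(angles)  # cosine on odd indices
    return pe

pe = sinusoidal_positions(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8)
```

Each position gets a unique pattern of sines and cosines, so the model can distinguish otherwise identical tokens at different positions.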

🤖 3. How Attention Scores Are Computed #

  1. Input embeddings X are projected into:

    • Queries (Q)
    • Keys (K)
    • Values (V)
  2. Attention score between token i and j:

    score = dot(Q[i], K[j]) / sqrt(d_k)
    
  3. Apply softmax to get weights.

  4. Multiply each Value vector by its weight and sum the results → the final output vector.
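The four steps above can be sketched end to end in NumPy (a minimal single-head sketch on random toy data; variable names and sizes are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # step 2: scaled dot products, (L, L)
    weights = softmax(scores, axis=-1)  # step 3: each row sums to 1
    return weights @ V                  # step 4: weighted sum of Values

rng = np.random.default_rng(0)
L, d_k = 4, 8
Q, K, V = rng.normal(size=(3, L, d_k))  # toy projections, as if already computed
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Row i of the output is the new representation of token i: a mixture of all Value vectors, weighted by how strongly token i's Query matches each token's Key.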


🧠 4. What Is X in the Diagram? #

  • The large matrix on the right of the image is the input embedding matrix X.
  • Shape: sequence_length × embedding_dim
  • It is built by looking up each token’s vector from the vocabulary embedding matrix.
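The lookup is just row indexing into the vocabulary matrix (a sketch with toy sizes and token ids):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16
embedding = rng.normal(size=(vocab_size, d_model))  # vocab matrix (trainable in a real model)

token_ids = np.array([5, 42, 7, 42])  # a toy tokenized sentence
X = embedding[token_ids]              # one row per token -> (L, d_model)
print(X.shape)  # (4, 16)
```

Note that the repeated token id (42) fetches an identical row both times; context only enters later, through attention.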

🔄 5. What Is Multi-Head Attention? #

  • Single-head attention is shown in the image.
  • Multi-head attention:
    • Splits the d_model-dimensional representation into n_heads chunks of size d_k = d_model / n_heads
    • Computes self-attention in parallel on each chunk (head)
    • Concatenates the results from all heads
    • Applies a final linear projection
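The split/concatenate bookkeeping can be sketched with reshapes (function names are illustrative; real implementations also give each head its own learned projection):

```python
import numpy as np

def split_heads(X, n_heads):
    """Reshape (L, d_model) -> (n_heads, L, d_k)."""
    L, d_model = X.shape
    d_k = d_model // n_heads
    return X.reshape(L, n_heads, d_k).transpose(1, 0, 2)

def merge_heads(H):
    """Reshape (n_heads, L, d_k) -> (L, d_model)."""
    n_heads, L, d_k = H.shape
    return H.transpose(1, 0, 2).reshape(L, n_heads * d_k)

X = np.arange(4 * 8, dtype=float).reshape(4, 8)  # L=4, d_model=8
heads = split_heads(X, n_heads=2)
print(heads.shape)                         # (2, 4, 4)
print(np.allclose(merge_heads(heads), X))  # True
```

Splitting and merging are exact inverses, so the concatenated heads recover the original (L, d_model) layout before the final projection.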

🔑 6. Vocab Embedding Matrix vs. Q/K/V #

  • Vocabulary embedding matrix:
    • Initialized randomly
    • Trained to map each token to a vector
  • Q, K, V:
    • Computed from X using learned matrices W_Q, W_K, W_V
    • Not stored in the vocabulary matrix
    • Recomputed for every input; it is the projection matrices W_Q, W_K, W_V that are trainable and persistent

♻️ 7. Lifetime of W_Q, W_K, W_V #

  • These matrices are:
    • Initialized once
    • Trained over time
    • Reused across batches
  • They are not reset per input or per batch.
  • Gradients update them through backpropagation.
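A toy loop illustrates this persistence: the same weight matrix is carried across batches and only nudged by updates. The "gradient" below is a placeholder stand-in for backpropagation, not a real attention loss; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_Q = rng.normal(size=(d_model, d_model)) * 0.02  # initialized once, before training

def sgd_step(X, W, lr=0.1):
    grad = X.T @ X @ W / len(X)  # placeholder gradient (stand-in for backprop)
    return W - lr * grad

initial = W_Q.copy()
for _ in range(3):                     # three "batches"
    X = rng.normal(size=(4, d_model))  # new batch each time, same weight matrix
    W_Q = sgd_step(X, W_Q)             # updated in place of being re-initialized
print(np.allclose(initial, W_Q))  # False: the matrix evolved, it was never reset
```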

📥 8. Is the Vocabulary Matrix Also Trainable? #

✅ Yes.

  • It is randomly initialized and trained alongside the rest of the model.
  • Each token lookup retrieves a vector from this matrix.
  • This matrix evolves to encode semantic relationships between words.

📦 9. Use Cases After Training #

| Goal | Uses Vocab Matrix | Uses W_Q/K/V |
|---|---|---|
| Inference on new sentence | ✅ | ✅ |
| Static embedding for a token | ✅ | ❌ |
| Contextual embedding in sentence | ✅ | ✅ |

πŸ“ 10. Dimensions of X, Q, K, V, and Attention #

Let:

  • L = sequence length
  • d_model = embedding dimension (e.g. 512)
  • n_heads = number of attention heads
  • d_k = d_model / n_heads

| Component | Shape |
|---|---|
| Input X | (L, d_model) |
| W_Q, W_K, W_V | (d_model, d_model) |
| Q, K, V (stacked) | (n_heads, L, d_k) |
| Attention output (per head) | (L, d_k) |
| Concatenated heads | (L, d_model) |
| Final output | (L, d_model) |
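These shapes can be verified directly (a sketch using the example sizes above; only the Q path is shown, K and V are analogous):

```python
import numpy as np

L, d_model, n_heads = 4, 512, 8
d_k = d_model // n_heads  # 64
rng = np.random.default_rng(0)

X = rng.normal(size=(L, d_model))
W_Q = rng.normal(size=(d_model, d_model)) * 0.02

Q = X @ W_Q                                              # (L, d_model)
Q_heads = Q.reshape(L, n_heads, d_k).transpose(1, 0, 2)  # (n_heads, L, d_k)
print(X.shape, Q.shape, Q_heads.shape)  # (4, 512) (4, 512) (8, 4, 64)
```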

❓ 11. Why Isn't the Final Output a Distribution Over Vocabulary? #

This is a common point of confusion.

  • The output of multi-head attention (and the full Transformer stack) is:

    (L, d_model)
    
  • But the vocabulary distribution comes after applying a final linear layer:

    W_vocab ∈ ℝ^(d_model × vocab_size)
    logits = output × W_vocab → (L, vocab_size)
    
  • Then softmax gives:

    probability distribution over vocabulary for each token position
    
| Stage | Output Shape |
|---|---|
| Multi-head Attention | (L, d_model) |
| Final Linear Projection | (L, vocab_size) |
| Softmax | (L, vocab_size) |
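The projection and softmax stages can be sketched as follows (toy sizes; as an aside, many real models tie W_vocab to the transpose of the vocabulary embedding matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_model, vocab_size = 4, 16, 50
output = rng.normal(size=(L, d_model))       # final hidden states from the stack
W_vocab = rng.normal(size=(d_model, vocab_size))

logits = output @ W_vocab                    # (L, vocab_size)
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)    # softmax: one distribution per position
print(logits.shape)                          # (4, 50)
print(np.allclose(probs.sum(axis=-1), 1.0))  # True
```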

So the discrepancy is resolved when we remember that attention is only one component; the final vocabulary distribution is computed later in the model pipeline.


Prepared as a study summary by ChatGPT based on a thread of detailed conceptual questions.