Positional Embeddings in Transformers
A deep dive into absolute, relative, sine/cosine, and rotational positional embeddings — with interactive playgrounds.
Transformers process tokens in parallel — they have no inherent sense of order. Positional embeddings inject information about where each token sits in a sequence, allowing the model to reason about order and distance.
1. Types of Positional Embedding
There are two broad families of positional embeddings:
Absolute Positional Embeddings
Each position i in the sequence is assigned a fixed or learned vector. The embedding is added directly to the token embedding before it enters the transformer.
- The model sees positions $0, 1, 2, \dots, n$ as independent, absolute coordinates.
- Learned absolute embeddings (used in BERT, GPT) — the position vectors are parameters trained end-to-end.
- Fixed absolute embeddings (used in the original Transformer) — the position vectors are computed using sine and cosine functions (see §3).
Limitation: They don't generalise well to sequence lengths longer than those seen during training.
Relative Positional Embeddings
Instead of encoding the absolute position of each token, these encode the relative distance between tokens — e.g., "token A is 3 positions before token B."
- Used in T5, Shaw et al. (2018), ALiBi, RoPE (see §5).
- The attention score between two tokens is modified by a bias/function of their distance $i - j$.
- Generalises better to unseen sequence lengths.
| | Absolute | Relative |
|---|---|---|
| Encodes | Position index | Distance between tokens |
| Generalisation | Poor beyond training length | Better |
| Examples | BERT, GPT-2 | T5, RoPE, ALiBi |
2. How Sine Waves Work — Frequency, Amplitude & Phase
Before diving into sine/cosine embeddings, it helps to build intuition about sine waves themselves.
A sine wave is defined as:

$$y(x) = A \sin(2\pi f x + \phi)$$
| Parameter | Symbol | Effect |
|---|---|---|
| Amplitude | $A$ | Controls the height (scale) of the wave |
| Frequency | $f$ | Controls how many cycles per unit — higher frequency = tighter wave |
| Phase | $\phi$ | Shifts the wave left or right along the axis |
| Position | $x$ | The input — in our case, the token position in the sequence |
Key Intuitions
- Low frequency sine waves vary slowly — they capture coarse, long-range position information.
- High frequency sine waves vary quickly — they capture fine-grained, local position information.
- Unique Mapping: By combining many sine waves at different frequencies, you can uniquely represent any position in a sequence.
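These intuitions are easy to check numerically. The sketch below (illustrative, not from the article) shows that a single sine wave aliases — it repeats, so distant positions collide — while a stack of sines at different frequencies keeps positions distinct:

```python
import numpy as np

# Three sine waves at geometrically spaced frequencies (high -> low).
frequencies = [1.0, 0.1, 0.01]

def features(pos):
    """Multi-frequency sine features for a scalar position."""
    return np.array([np.sin(f * pos) for f in frequencies])

# A single wave aliases: sin(0) == sin(2*pi), so these positions collide.
p1, p2 = 0.0, 2 * np.pi
print(np.isclose(np.sin(1.0 * p1), np.sin(1.0 * p2)))  # True: indistinguishable

# But the full multi-frequency feature vectors are distinct,
# because the slower waves have not completed a cycle yet.
print(np.allclose(features(p1), features(p2)))  # False: positions separated
```

The low-frequency components act like the slow hands of a clock: they disambiguate positions where the fast components have already wrapped around.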
Sine and Cosine: Phase Shifts and Orthogonality
Cosine is not a separate entity; it is mathematically a sine wave shifted by $\frac{\pi}{2}$ radians. This relationship is defined by the identity:

$$\cos(\theta) = \sin\!\left(\theta + \frac{\pi}{2}\right)$$
Why the Offset Matters
They are essentially the same wave but offset in phase. This creates a distinct progression as they move through a cycle:
| Function | $\theta = 0$ | $\pi/2$ | $\pi$ | $3\pi/2$ | $2\pi$ |
|---|---|---|---|---|---|
| $\sin(\theta)$ | 0 | 1 | 0 | -1 | 0 |
| $\cos(\theta)$ | 1 | 0 | -1 | 0 | 1 |
The Power of the Pair
The original Transformer uses both $\sin$ and $\cos$ at every frequency because together they provide two orthogonal views of the same position. This ensures that:
- Uniqueness: No two positions produce the same total embedding vector. If you used only sine, positions $\theta$ and $\pi - \theta$ would both result in $\sin(\theta)$, making them indistinguishable to the model. The cosine component breaks this symmetry.
- Relative Positioning: For any fixed offset $k$, $\sin(\theta + k)$ can be represented as a linear function of $\sin(\theta)$ and $\cos(\theta)$. This allows the model to easily attend to relative positions.
Sine captures the current state in the cycle, while Cosine captures the "momentum" or direction. Together, they uniquely pin down any point on a unit circle.
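The "linear function" claim can be verified directly. In this NumPy sketch (illustrative), a single 2×2 matrix built only from the offset $k$ maps $[\sin t, \cos t]$ to $[\sin(t+k), \cos(t+k)]$ for every $t$ — the angle-addition identities in matrix form:

```python
import numpy as np

k = 0.7  # an arbitrary fixed offset
# sin(t+k) =  cos(k)*sin(t) + sin(k)*cos(t)
# cos(t+k) = -sin(k)*sin(t) + cos(k)*cos(t)
M = np.array([[np.cos(k),  np.sin(k)],
              [-np.sin(k), np.cos(k)]])

for t in [0.0, 1.3, 2.9]:
    v = np.array([np.sin(t), np.cos(t)])
    shifted = np.array([np.sin(t + k), np.cos(t + k)])
    assert np.allclose(M @ v, shifted)  # the same M works for every t
print("a fixed offset k acts as one linear map on [sin, cos] pairs")
```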
🎛️ Interactive: Sine Wave Explorer
3. Sine & Cosine Positional Embeddings
The original "Attention Is All You Need" paper (Vaswani et al., 2017) proposed fixed positional embeddings using interleaved sine and cosine functions across embedding dimensions:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

- $pos$: The position of the token in the sequence ($0, 1, 2, \dots$).
- $i$: The dimension-pair index, ranging from $0$ to $d_{model}/2 - 1$.
- $d_{model}$: The total embedding dimension (e.g., 512).
Why This Works
- Multiscale Encoding: Each dimension uses a different wavelength. Since the wavelength increases geometrically from $2\pi$ to $10000 \cdot 2\pi$, the first dimensions encode high-frequency "fine-grained" position info, while later dimensions encode low-frequency "global" info.
- Relative Positioning: The authors chose this specific geometric progression because, for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This allows the model to easily learn to attend by relative positions.
- Uniqueness: By using both sine and cosine (orthogonal functions) for each frequency, the model ensures that every position vector in a sequence is unique and that the "direction" of the position is preserved.
Wavelength Range
The wavelength of dimension pair $i$ is derived from the period of the sine/cosine functions: $\lambda_i = 2\pi \cdot 10000^{2i/d_{model}}$.

| Dimension Index ($i$) | Wavelength ($\lambda_i$) | Scale |
|---|---|---|
| $i = 0$ (First pair) | $2\pi \approx 6.28$ | Very short (High freq) |
| $i = d_{model}/2 - 1$ (Last pair) | $\approx 10000 \cdot 2\pi$ | Very long (Low freq) |
Implementation Tip: Log-Space Computation
Calculating the denominator $10000^{2i/d_{model}}$ directly can cause numerical issues. In practice, we compute its inverse in log-space: $\exp\!\left(2i \cdot \frac{-\ln 10000}{d_{model}}\right)$.
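A minimal NumPy sketch of the full sine/cosine table, using the log-space trick (function name and shapes are illustrative):

```python
import numpy as np

def sinusoidal_embeddings(max_len, d_model):
    """Fixed sine/cosine positional embeddings, shape (max_len, d_model)."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, None]            # (max_len, 1)
    # Inverse denominator 10000^(-2i/d_model), computed in log-space.
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)         # even dims: sine
    pe[:, 1::2] = np.cos(position * div_term)         # odd dims: cosine
    return pe

pe = sinusoidal_embeddings(max_len=128, d_model=512)
print(pe.shape)    # (128, 512)
print(pe[0, :4])   # position 0: sin terms are 0, cos terms are 1 -> [0, 1, 0, 1]
```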
🎛️ Interactive: Sine/Cosine Embedding Visualiser
4. Rotation Matrices
Rotary embeddings leverage the geometric properties of 2D rotation matrices. A rotation matrix $R(\theta)$ rotates a 2D vector by an angle $\theta$ around the origin:

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
Vector Transformation
When we apply $R(\theta)$ to a 2D column vector $\begin{pmatrix} x \\ y \end{pmatrix}$, the result is:

$$R(\theta)\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x\cos\theta - y\sin\theta \\ x\sin\theta + y\cos\theta \end{pmatrix}$$
Key Properties
- Length-preserving (Isometry): $\|R(\theta)\,v\| = \|v\|$. Rotating a vector never changes its magnitude, only its direction. This ensures that positional encoding doesn't "explode" the values of the hidden states.
- Composable: $R(\alpha)\,R(\beta) = R(\alpha + \beta)$. Rotating by $\alpha$ and then by $\beta$ is mathematically equivalent to a single rotation by $\alpha + \beta$.
- Invertible: $R(\theta)^{-1} = R(-\theta)$. To "undo" a rotation, you simply rotate in the opposite direction.
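All three properties can be confirmed in a few lines of NumPy (a quick sanity check, not part of the original article):

```python
import numpy as np

def R(theta):
    """2D rotation matrix for angle theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

v = np.array([3.0, 4.0])   # a vector with norm 5
a, b = 0.5, 1.2

# Length-preserving: ||R(a) v|| == ||v||
assert np.isclose(np.linalg.norm(R(a) @ v), np.linalg.norm(v))
# Composable: R(a) R(b) == R(a + b)
assert np.allclose(R(a) @ R(b), R(a + b))
# Invertible: R(a)^(-1) == R(-a)
assert np.allclose(np.linalg.inv(R(a)), R(-a))
print("all three rotation properties hold")
```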
Why this is the 'Secret Sauce' for RoPE
The composability property is exactly what allows Transformers to capture relative positions.
If token $q$ is at position $m$ and token $k$ is at position $n$, their dot product (attention score) will only depend on the relative distance $m - n$. This is because:

$$(R(m\theta)\,q)^\top (R(n\theta)\,k) = q^\top R(m\theta)^\top R(n\theta)\,k = q^\top R((n - m)\theta)\,k$$
This makes the model naturally translation-invariant!
🎛️ Interactive: Rotation Matrix Explorer
5. Rotary Positional Embeddings (RoPE)
RoPE (Su et al., 2021) is the backbone of modern LLMs like LLaMA, Mistral, and GPT-NeoX. Instead of adding a positional vector to the token embedding, RoPE encodes position by rotating the Query and Key vectors in the attention mechanism.
Core Idea
For a token at position $m$, we rotate its query vector by an angle proportional to $m$ (and likewise the key at position $n$). When we compute the dot product between a query at $m$ and a key at $n$, the rotations "interlock" to reveal the relative distance.
The attention score (dot product) then becomes:

$$\langle R(m\theta)\,q,\; R(n\theta)\,k \rangle = q^\top R((n - m)\theta)\,k$$
The Key Insight
Because the dot product depends only on $m - n$, the attention score is a function of the relative distance between tokens. This makes the model naturally translation-invariant without needing the complex lookup tables used in older relative bias methods.
How It's Applied in Practice
The embedding dimension $d$ is split into $d/2$ pairs. Each pair of coordinates $(x_{2i}, x_{2i+1})$ is treated as a point in a 2D plane and rotated by its own frequency $\theta_i$:

$$\theta_i = 10000^{-2i/d}$$

This follows the same geometric progression as the original Transformer, but uses these values as rotation speeds rather than additive constants.
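Here is a minimal RoPE sketch in NumPy (names and shapes are assumptions, not a reference implementation). The final check confirms that the score depends only on the relative distance between the two positions:

```python
import numpy as np

def apply_rope(x, pos, d_model):
    """Rotate each (even, odd) coordinate pair of x by pos * theta_i."""
    i = np.arange(d_model // 2)
    theta = 10000.0 ** (-2.0 * i / d_model)   # per-pair rotation speed
    angle = pos * theta                       # angle grows with position
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin    # 2D rotation, per pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
d = 64
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Attention score depends only on the relative distance m - n:
s1 = apply_rope(q, 10, d) @ apply_rope(k, 7, d)
s2 = apply_rope(q, 110, d) @ apply_rope(k, 107, d)  # both shifted by +100
print(np.isclose(s1, s2))  # True
```

Shifting both positions by the same amount leaves the score unchanged, which is exactly the translation invariance discussed above.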
Comparison: Sine/Cosine vs. RoPE
| Feature | Sine/Cosine (Absolute) | RoPE (Rotary) |
|---|---|---|
| Application | Added to the input embedding | Multiplied (rotated) into $Q$ and $K$ |
| Encoding Type | Absolute Position | Relative Position (via dot product) |
| Attention | Depends on $m$ and $n$ independently | Depends only on distance $m - n$ |
| Generalization | Poor for longer sequences | Excellent (supports context scaling) |
| Common Use | Vanilla Transformer, BERT | LLaMA, Mistral, PaLM, Phi |
Step-by-Step: Applying RoPE
1. Input: Take a query vector $q$ for a token at position $m$.
2. Pairing: Split the vector into consecutive pairs: $(q_0, q_1), (q_2, q_3), \dots$
3. Rotation: For each pair $i$, calculate the rotation angle $m \cdot \theta_i$, where $\theta_i = 10000^{-2i/d}$.
4. Transformation: Apply the 2D rotation to each pair: $(q_{2i}, q_{2i+1}) \mapsto (q_{2i}\cos m\theta_i - q_{2i+1}\sin m\theta_i,\; q_{2i}\sin m\theta_i + q_{2i+1}\cos m\theta_i)$.
5. Attention: Repeat for the key vector at position $n$. The resulting dot product now inherently encodes the relative position.
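The steps above can be traced by hand on a toy 4-dimensional vector (pure-Python sketch; the input values and positions are illustrative):

```python
import math

q = [1.0, 0.0, 0.0, 1.0]   # step 1: query vector for a token at position m
m, d = 3, 4

rotated = []
for i in range(d // 2):                       # step 2: pairs (q0,q1), (q2,q3)
    a, b = q[2 * i], q[2 * i + 1]
    theta = m * 10000.0 ** (-2.0 * i / d)     # step 3: angle for pair i
    rotated += [a * math.cos(theta) - b * math.sin(theta),   # step 4: rotate
                a * math.sin(theta) + b * math.cos(theta)]

print(rotated)  # step 5 would repeat this for the key, then take the dot product
```

Note that each rotated pair keeps its original length, consistent with the isometry property from §4.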
🎛️ Interactive: RoPE Playground
Notice how moving both sliders by the same amount keeps the Attention Score exactly the same. RoPE inherently preserves relative distance!