Positional Embeddings in Transformers
A deep dive into absolute, relative, sine/cosine, and rotational positional embeddings — with interactive playgrounds.
Transformers process tokens in parallel — they have no inherent sense of order. Positional embeddings inject information about where each token sits in a sequence, allowing the model to reason about order and distance.
1. Types of Positional Embedding
There are two broad families of positional embeddings:
Absolute Positional Embeddings
Each position i in the sequence is assigned a fixed or learned vector. The embedding is added directly to the token embedding before it enters the transformer.
- The model sees positions $0, 1, 2, \dots, n$ as independent, absolute coordinates.
- Learned absolute embeddings (used in BERT, GPT) — the position vectors are parameters trained end-to-end.
- Fixed absolute embeddings (used in the original Transformer) — the position vectors are computed using sine and cosine functions (see §3).
Limitation: They don't generalise well to sequence lengths longer than those seen during training.
Relative Positional Embeddings
Instead of encoding the absolute position of each token, these encode the relative distance between tokens — e.g., "token A is 3 positions before token B."
- Used in T5, Shaw et al. (2018), ALiBi, RoPE (see §5).
- The attention score between two tokens is modified by a bias/function of their distance $i - j$.
- Generalises better to unseen sequence lengths.
| | Absolute | Relative |
|---|---|---|
| Encodes | Position index | Distance between tokens |
| Generalisation | Poor beyond training length | Better |
| Examples | BERT, GPT-2 | T5, RoPE, ALiBi |
2. How Sine Waves Work — Frequency, Amplitude & Phase
Before diving into sine/cosine embeddings, it helps to build intuition about sine waves themselves.
A sine wave is defined as:

$$y(x) = A \sin(2\pi f x + \phi)$$
| Parameter | Symbol | Effect |
|---|---|---|
| Amplitude | $A$ | Controls the height (scale) of the wave |
| Frequency | $f$ | Controls how many cycles per unit — higher frequency = tighter wave |
| Phase | $\phi$ | Shifts the wave left or right along the axis |
| Position | $x$ | The input — in our case, the token position in the sequence |
Key Intuitions
- Low frequency sine waves vary slowly — they capture coarse, long-range position information.
- High frequency sine waves vary quickly — they capture fine-grained, local position information.
- Unique Mapping: By combining many sine waves at different frequencies, you can uniquely represent any position in a sequence.
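These intuitions are easy to check numerically. The sketch below (illustrative, not from the article) shows that a single sine wave aliases — it repeats, so distant positions collide — while a stack of sines at different frequencies keeps positions distinct:

```python
import numpy as np

# Three sine waves at geometrically spaced frequencies (high -> low).
frequencies = [1.0, 0.1, 0.01]

def features(pos):
    """Multi-frequency sine features for a scalar position."""
    return np.array([np.sin(f * pos) for f in frequencies])

# A single wave aliases: sin(0) == sin(2*pi), so these positions collide.
p1, p2 = 0.0, 2 * np.pi
print(np.isclose(np.sin(1.0 * p1), np.sin(1.0 * p2)))  # True: indistinguishable

# But the full multi-frequency feature vectors are distinct,
# because the slower waves have not completed a cycle yet.
print(np.allclose(features(p1), features(p2)))  # False: positions separated
```

The low-frequency components act like the slow hands of a clock: they disambiguate positions where the fast components have already wrapped around.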
Sine and Cosine: Phase Shifts and Orthogonality
Cosine is not a separate entity; it is mathematically a sine wave shifted by $\frac{\pi}{2}$ radians. This relationship is defined by the identity:

$$\cos(\theta) = \sin\!\left(\theta + \frac{\pi}{2}\right)$$
Why the Offset Matters
They are essentially the same wave but offset in phase. This creates a distinct progression as they move through a cycle:
| Function | $\theta = 0$ | $\pi/2$ | $\pi$ | $3\pi/2$ | $2\pi$ |
|---|---|---|---|---|---|
| $\sin(\theta)$ | 0 | 1 | 0 | -1 | 0 |
| $\cos(\theta)$ | 1 | 0 | -1 | 0 | 1 |
The Power of the Pair
The original Transformer uses both $\sin$ and $\cos$ at every frequency because together they provide two orthogonal views of the same position. This ensures that:
- Uniqueness: No two positions produce the same total embedding vector. If you used only sine, positions $\theta$ and $\pi - \theta$ would both result in $\sin(\theta)$, making them indistinguishable to the model. The cosine component breaks this symmetry.
- Relative Positioning: For any fixed offset $k$, $\sin(\theta + k)$ can be represented as a linear function of $\sin(\theta)$ and $\cos(\theta)$. This allows the model to easily attend to relative positions.
Sine captures the current state in the cycle, while Cosine captures the "momentum" or direction. Together, they uniquely pin down any point on a unit circle.
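The "linear function" claim can be verified directly. In this NumPy sketch (illustrative), a single 2×2 matrix built only from the offset $k$ maps $[\sin t, \cos t]$ to $[\sin(t+k), \cos(t+k)]$ for every $t$ — the angle-addition identities in matrix form:

```python
import numpy as np

k = 0.7  # an arbitrary fixed offset
# sin(t+k) =  cos(k)*sin(t) + sin(k)*cos(t)
# cos(t+k) = -sin(k)*sin(t) + cos(k)*cos(t)
M = np.array([[np.cos(k),  np.sin(k)],
              [-np.sin(k), np.cos(k)]])

for t in [0.0, 1.3, 2.9]:
    v = np.array([np.sin(t), np.cos(t)])
    shifted = np.array([np.sin(t + k), np.cos(t + k)])
    assert np.allclose(M @ v, shifted)  # the same M works for every t
print("a fixed offset k acts as one linear map on [sin, cos] pairs")
```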
🎛️ Interactive: Sine Wave Explorer
3. Sine & Cosine Positional Embeddings
The original "Attention Is All You Need" paper (Vaswani et al., 2017) proposed fixed positional embeddings using interleaved sine and cosine functions across embedding dimensions:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

- $pos$: The position of the token in the sequence ($0, 1, 2, \dots$).
- $i$: The dimension-pair index, ranging from $0$ to $d_{model}/2 - 1$.
- $d_{model}$: The total embedding dimension (e.g., 512).
Why This Works
- Multiscale Encoding: Each dimension uses a different wavelength. Since the wavelength increases geometrically from $2\pi$ to $10000 \cdot 2\pi$, the first dimensions encode high-frequency "fine-grained" position info, while later dimensions encode low-frequency "global" info.
- Relative Positioning: The authors chose this specific geometric progression because, for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This allows the model to easily learn to attend by relative positions.
- Uniqueness: By using both sine and cosine (orthogonal functions) for each frequency, the model ensures that every position vector in a sequence is unique and that the "direction" of the position is preserved.
Wavelength Range
The wavelength of dimension pair $i$ is derived from the period of the sine/cosine functions: $\lambda_i = 2\pi \cdot 10000^{2i/d_{model}}$.

| Dimension Index ($i$) | Wavelength ($\lambda_i$) | Scale |
|---|---|---|
| $i = 0$ (First pair) | $2\pi \approx 6.28$ | Very short (High freq) |
| $i = d_{model}/2 - 1$ (Last pair) | $\approx 10000 \cdot 2\pi$ | Very long (Low freq) |
Implementation Tip: Log-Space Computation
Calculating the denominator $10000^{2i/d_{model}}$ directly can cause numerical issues. In practice, we compute its inverse in log-space: $\exp\!\left(2i \cdot \frac{-\ln 10000}{d_{model}}\right)$.
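A minimal NumPy sketch of the full sine/cosine table, using the log-space trick (function name and shapes are illustrative):

```python
import numpy as np

def sinusoidal_embeddings(max_len, d_model):
    """Fixed sine/cosine positional embeddings, shape (max_len, d_model)."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, None]            # (max_len, 1)
    # Inverse denominator 10000^(-2i/d_model), computed in log-space.
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)         # even dims: sine
    pe[:, 1::2] = np.cos(position * div_term)         # odd dims: cosine
    return pe

pe = sinusoidal_embeddings(max_len=128, d_model=512)
print(pe.shape)    # (128, 512)
print(pe[0, :4])   # position 0: sin terms are 0, cos terms are 1 -> [0, 1, 0, 1]
```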
🎛️ Interactive: Sine/Cosine Embedding Visualiser
4. Rotation Matrices
Rotary embeddings leverage the geometric properties of 2D rotation matrices. A rotation matrix $R(\theta)$ rotates a 2D vector by an angle $\theta$ around the origin:

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
Vector Transformation
When we apply $R(\theta)$ to a 2D column vector $\begin{pmatrix} x \\ y \end{pmatrix}$, the result is:

$$R(\theta)\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x\cos\theta - y\sin\theta \\ x\sin\theta + y\cos\theta \end{pmatrix}$$
Key Properties
- Length-preserving (Isometry): $\|R(\theta)\,v\| = \|v\|$. Rotating a vector never changes its magnitude, only its direction. This ensures that positional encoding doesn't "explode" the values of the hidden states.
- Composable: $R(\alpha)\,R(\beta) = R(\alpha + \beta)$. Rotating by $\alpha$ and then by $\beta$ is mathematically equivalent to a single rotation by $\alpha + \beta$.
- Invertible: $R(\theta)^{-1} = R(-\theta)$. To "undo" a rotation, you simply rotate in the opposite direction.
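All three properties can be confirmed in a few lines of NumPy (a quick sanity check, not part of the original article):

```python
import numpy as np

def R(theta):
    """2D rotation matrix for angle theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

v = np.array([3.0, 4.0])   # a vector with norm 5
a, b = 0.5, 1.2

# Length-preserving: ||R(a) v|| == ||v||
assert np.isclose(np.linalg.norm(R(a) @ v), np.linalg.norm(v))
# Composable: R(a) R(b) == R(a + b)
assert np.allclose(R(a) @ R(b), R(a + b))
# Invertible: R(a)^(-1) == R(-a)
assert np.allclose(np.linalg.inv(R(a)), R(-a))
print("all three rotation properties hold")
```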
Why this is the 'Secret Sauce' for RoPE
The composability property is exactly what allows Transformers to capture relative positions.
If token $q$ is at position $m$ and token $k$ is at position $n$, their dot product (attention score) will only depend on the relative distance $m - n$. This is because:

$$(R(m\theta)\,q)^\top (R(n\theta)\,k) = q^\top R(m\theta)^\top R(n\theta)\,k = q^\top R((n - m)\theta)\,k$$
This makes the model naturally translation-invariant!
🎛️ Interactive: Rotation Matrix Explorer
5. Rotary Positional Embeddings (RoPE)
RoPE (Su et al., 2021) is the backbone of modern LLMs like LLaMA, Mistral, and GPT-NeoX. Instead of adding a positional vector to the token embedding, RoPE encodes position by rotating the Query and Key vectors in the attention mechanism.
Core Idea
For a token at position $m$, we rotate its query vector by an angle proportional to $m$ (and likewise the key at position $n$). When we compute the dot product between a query at $m$ and a key at $n$, the rotations "interlock" to reveal the relative distance.
The attention score (dot product) then becomes:

$$\langle R(m\theta)\,q,\; R(n\theta)\,k \rangle = q^\top R((n - m)\theta)\,k$$
The Key Insight
Because the dot product depends only on $m - n$, the attention score is a function of the relative distance between tokens. This makes the model naturally translation-invariant without needing the complex lookup tables used in older relative bias methods.
How It's Applied in Practice
The embedding dimension $d$ is split into $d/2$ pairs. Each pair of coordinates $(x_{2i}, x_{2i+1})$ is treated as a point in a 2D plane and rotated by its own frequency $\theta_i$:

$$\theta_i = 10000^{-2i/d}$$

This follows the same geometric progression as the original Transformer, but uses these values as rotation speeds rather than additive constants.
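Here is a minimal RoPE sketch in NumPy (names and shapes are assumptions, not a reference implementation). The final check confirms that the score depends only on the relative distance between the two positions:

```python
import numpy as np

def apply_rope(x, pos, d_model):
    """Rotate each (even, odd) coordinate pair of x by pos * theta_i."""
    i = np.arange(d_model // 2)
    theta = 10000.0 ** (-2.0 * i / d_model)   # per-pair rotation speed
    angle = pos * theta                       # angle grows with position
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin    # 2D rotation, per pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
d = 64
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Attention score depends only on the relative distance m - n:
s1 = apply_rope(q, 10, d) @ apply_rope(k, 7, d)
s2 = apply_rope(q, 110, d) @ apply_rope(k, 107, d)  # both shifted by +100
print(np.isclose(s1, s2))  # True
```

Shifting both positions by the same amount leaves the score unchanged, which is exactly the translation invariance discussed above.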
Comparison: Sine/Cosine vs. RoPE
| Feature | Sine/Cosine (Absolute) | RoPE (Rotary) |
|---|---|---|
| Application | Added to the input embedding | Multiplied (rotated) into $Q$ and $K$ |
| Encoding Type | Absolute Position | Relative Position (via dot product) |
| Attention | Depends on $m$ and $n$ independently | Depends only on distance $m - n$ |
| Generalization | Poor for longer sequences | Excellent (supports context scaling) |
| Common Use | Vanilla Transformer, BERT | LLaMA, Mistral, PaLM, Phi |
Step-by-Step: Applying RoPE
1. Input: Take a query vector $q$ for a token at position $m$.
2. Pairing: Split the vector into consecutive pairs: $(q_0, q_1), (q_2, q_3), \dots$
3. Rotation: For each pair $i$, calculate the rotation angle $m \cdot \theta_i$, where $\theta_i = 10000^{-2i/d}$.
4. Transformation: Apply the 2D rotation to each pair: $(q_{2i}, q_{2i+1}) \mapsto (q_{2i}\cos m\theta_i - q_{2i+1}\sin m\theta_i,\; q_{2i}\sin m\theta_i + q_{2i+1}\cos m\theta_i)$.
5. Attention: Repeat for the key vector at position $n$. The resulting dot product now inherently encodes the relative position.
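The steps above can be traced by hand on a toy 4-dimensional vector (pure-Python sketch; the input values and positions are illustrative):

```python
import math

q = [1.0, 0.0, 0.0, 1.0]   # step 1: query vector for a token at position m
m, d = 3, 4

rotated = []
for i in range(d // 2):                       # step 2: pairs (q0,q1), (q2,q3)
    a, b = q[2 * i], q[2 * i + 1]
    theta = m * 10000.0 ** (-2.0 * i / d)     # step 3: angle for pair i
    rotated += [a * math.cos(theta) - b * math.sin(theta),   # step 4: rotate
                a * math.sin(theta) + b * math.cos(theta)]

print(rotated)  # step 5 would repeat this for the key, then take the dot product
```

Note that each rotated pair keeps its original length, consistent with the isometry property from §4.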
🎛️ Interactive: RoPE Playground
Notice how moving both sliders by the same amount keeps the Attention Score exactly the same. RoPE inherently preserves relative distance!