
Positional Embeddings in Transformers

A deep dive into absolute, relative, sine/cosine, and rotational positional embeddings — with interactive playgrounds.

Transformers process tokens in parallel — they have no inherent sense of order. Positional embeddings inject information about where each token sits in a sequence, allowing the model to reason about order and distance.

1. Types of Positional Embedding

There are two broad families of positional embeddings:

Absolute Positional Embeddings

Each position i in the sequence is assigned a fixed or learned vector. The embedding is added directly to the token embedding before it enters the transformer.

  • The model sees position 0, 1, 2, ... n as independent, absolute coordinates.
  • Learned absolute embeddings (used in BERT, GPT) — the position vectors are parameters trained end-to-end.
  • Fixed absolute embeddings (used in the original Transformer) — the position vectors are computed using sine and cosine functions (see §3).

Limitation: They don't generalise well to sequence lengths longer than those seen during training.

Relative Positional Embeddings

Instead of encoding the absolute position of each token, these encode the relative distance between tokens — e.g., "token A is 3 positions before token B."

  • Used in T5, Shaw et al. (2018), ALiBi, RoPE (see §5).
  • The attention score between two tokens is modified by a bias/function of their distance i - j.
  • Generalises better to unseen sequence lengths.
| | Absolute | Relative |
|---|---|---|
| Encodes | Position index | Distance between tokens |
| Generalisation | Poor beyond training length | Better |
| Examples | BERT, GPT-2 | T5, RoPE, ALiBi |
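The distance-bias idea behind the relative family can be sketched in a few lines of numpy. This is an illustrative, ALiBi-style bias only; the function name and the slope value are mine (real ALiBi uses one fixed negative slope per attention head):

```python
import numpy as np

def alibi_bias(seq_len: int, slope: float = 0.5) -> np.ndarray:
    """Distance-dependent attention bias: the further apart two
    tokens are, the larger the penalty (hypothetical slope)."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return -slope * np.abs(i - j)     # depends only on the distance i - j

logits = np.zeros((4, 4))             # stand-in attention logits
biased = logits + alibi_bias(4)       # add the bias before softmax
# row 0 gets biases 0, -0.5, -1, -1.5: attention decays with distance
```

Because the bias is a pure function of `i - j`, the same table works for any sequence length, which is why this family generalises better.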

2. How Sine Waves Work — Frequency, Amplitude & Phase

Before diving into sine/cosine embeddings, it helps to build intuition about sine waves themselves.

A sine wave is defined as:

$$y(t) = A \cdot \sin(2\pi f t + \phi)$$
| Parameter | Symbol | Effect |
|---|---|---|
| Amplitude | $A$ | Controls the height (scale) of the wave |
| Frequency | $f$ | Controls how many cycles per unit — higher frequency = tighter wave |
| Phase | $\phi$ | Shifts the wave left or right along the axis |
| Position | $t$ | The input — in our case, the token position in the sequence |

Key Intuitions

  • Low frequency sine waves vary slowly — they capture coarse, long-range position information.
  • High frequency sine waves vary quickly — they capture fine-grained, local position information.
  • Unique Mapping: By combining many sine waves at different frequencies, you can uniquely represent any position in a sequence.
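These intuitions are easy to check numerically. A quick numpy sketch (the frequencies below are arbitrary choices of mine):

```python
import numpy as np

pos = np.arange(8)

# one sine alone is ambiguous: sin(k*pi) is ~0 at every even position,
# so positions 0, 2, 4, 6 all collapse to the same value
single = np.round(np.sin(np.pi / 2 * pos), 6)
assert single[0] == single[2] == single[4] == single[6] == 0

# stacking several frequencies gives every position a distinct vector
freqs = np.array([np.pi / 2, np.pi / 7, np.pi / 31])
multi = np.round(np.sin(freqs[None, :] * pos[:, None]), 6)
assert len(np.unique(multi, axis=0)) == len(pos)  # 8 distinct vectors
```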

Sine and Cosine: Phase Shifts and Orthogonality

Cosine is not a separate entity; it is mathematically a sine wave shifted by $90^\circ$ ($\frac{\pi}{2}$ radians). This relationship is defined by the identity:

$$\cos(t) = \sin\left(t + \frac{\pi}{2}\right)$$

Why the Offset Matters

They are essentially the same wave but offset in phase. This creates a distinct progression as they move through a cycle:

| Function | $t=0$ | $t=\frac{\pi}{2}$ | $t=\pi$ | $t=\frac{3\pi}{2}$ | $t=2\pi$ |
|---|---|---|---|---|---|
| $\sin(t)$ | 0 | 1 | 0 | -1 | 0 |
| $\cos(t)$ | 1 | 0 | -1 | 0 | 1 |

The Power of the Pair

The original Transformer uses both $\sin$ and $\cos$ at every frequency because together they provide two orthogonal views of the same position. This ensures that:

  1. Uniqueness: No two positions produce the same total embedding vector. If you used only sine, positions $0$ and $\pi$ would both result in $0$, making them indistinguishable to the model. The cosine component breaks this symmetry.
  2. Relative Positioning: For any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This allows the model to easily attend to relative positions.

Sine captures the current state in the cycle, while Cosine captures the "momentum" or direction. Together, they uniquely pin down any point on a unit circle.
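This "pinning down a point on the unit circle" can be verified directly: sine alone is ambiguous, but the (sin, cos) pair recovers the angle exactly via `arctan2`. A throwaway sketch with an arbitrary angle:

```python
import numpy as np

t = 2.5                                    # some angle in [0, 2*pi)
s, c = np.sin(t), np.cos(t)

# sine alone is ambiguous: pi - t produces the same sine value
assert np.isclose(np.sin(np.pi - t), s)

# the (sin, cos) pair pins down the angle uniquely on the unit circle
recovered = np.arctan2(s, c) % (2 * np.pi)
assert np.isclose(recovered, t)
```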

🎛️ Interactive: Sine Wave Explorer


3. Sine & Cosine Positional Embeddings

The original "Attention Is All You Need" paper (Vaswani et al., 2017) proposed fixed positional embeddings using interleaved sine and cosine functions across embedding dimensions:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
  • $pos$: The position of the token in the sequence ($0, 1, 2, \dots$).
  • $i$: The dimension index, ranging from $0$ to $\frac{d_{model}}{2} - 1$.
  • $d_{model}$: The total embedding dimension (e.g., 512).

Why This Works

  • Multiscale Encoding: Each dimension $i$ uses a different wavelength. Since the wavelength increases geometrically, the first dimensions encode high-frequency "fine-grained" position info, while later dimensions encode low-frequency "global" info.
  • Relative Positioning: The authors chose this specific geometric progression because, for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This allows the model to easily learn to attend by relative positions.
  • Uniqueness: By using both sine and cosine (orthogonal functions) for each frequency, the model ensures that every position vector in a sequence is unique and that the "direction" of the position is preserved.
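The relative-positioning claim can be checked numerically for a single frequency: there is a fixed matrix, depending only on the offset, that maps the (sin, cos) pair at any position to the pair at position + offset. A minimal sketch (the values of `w` and `k` are arbitrary):

```python
import numpy as np

w, k = 0.03, 5                           # one frequency and a fixed offset

def pe(pos):
    """(sin, cos) pair for a single frequency w."""
    return np.array([np.sin(pos * w), np.cos(pos * w)])

# a fixed matrix (depends only on k and w, not on pos) that
# maps PE(pos) to PE(pos + k) — just the angle-addition formulas
M = np.array([[ np.cos(k * w), np.sin(k * w)],
              [-np.sin(k * w), np.cos(k * w)]])

for p in (0, 11, 42):
    assert np.allclose(M @ pe(p), pe(p + k))   # same M works everywhere
```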

Wavelength Range

The wavelength $\lambda_i$ is derived from the period of the sine/cosine functions: $\lambda_i = 2\pi \cdot 10000^{\frac{2i}{d_{model}}}$.

| Dimension Index ($i$) | Wavelength ($\lambda_i$) | Scale |
|---|---|---|
| $i = 0$ (First pair) | $2\pi \approx 6.28$ | Very short (High freq) |
| $i = \frac{d_{model}}{2} - 1$ (Last pair) | $\approx 2\pi \cdot 10000 \approx 62{,}832$ | Very long (Low freq) |

Implementation Tip: Log-Space Computation

Calculating the denominator $10000^{\frac{2i}{d_{model}}}$ directly can be numerically unstable, especially in low-precision arithmetic. In practice, we compute it in log-space:

$$\text{div\_term} = \exp\left(-\frac{2i \cdot \ln(10000)}{d_{model}}\right)$$
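Putting the formulas and the log-space trick together, a minimal numpy sketch of the full embedding table (the function name is mine; the vectorised layout follows the common PyTorch tutorial implementation):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional embedding table (Vaswani et al., 2017)."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, None]          # shape (max_len, 1)
    # 1 / 10000^(2i/d_model), computed in log-space
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)       # even dimensions
    pe[:, 1::2] = np.cos(position * div_term)       # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=50, d_model=64)
print(pe.shape)   # (50, 64)
```

Each row is the embedding for one position; it is simply added to that token's input embedding.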


Sine/Cosine Embedding Visualiser


4. Rotation Matrices

Rotary embeddings leverage the geometric properties of 2D rotation matrices. A rotation matrix $R(\theta)$ rotates a 2D vector by an angle $\theta$ around the origin.

$$R(\theta) = \begin{pmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{pmatrix}$$

Vector Transformation

When we apply $R(\theta)$ to a 2D column vector $\mathbf{x} = [x_1, x_2]^T$, the result is:

$$R(\theta) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1\cos\theta - x_2\sin\theta \\ x_1\sin\theta + x_2\cos\theta \end{pmatrix}$$

Key Properties

  • Length-preserving (Isometry): $\|R(\theta)\mathbf{x}\| = \|\mathbf{x}\|$. Rotating a vector never changes its magnitude, only its direction. This ensures that positional encoding doesn't "explode" the values of the hidden states.
  • Composable: $R(\theta_1) \cdot R(\theta_2) = R(\theta_1 + \theta_2)$. Rotating by $\theta_1$ and then by $\theta_2$ is mathematically equivalent to a single rotation by $\theta_1 + \theta_2$.
  • Invertible: $R(\theta)^{-1} = R(-\theta) = R(\theta)^T$. To "undo" a rotation, you simply rotate in the opposite direction.

Why this is the 'Secret Sauce' for RoPE

The composability property is exactly what allows Transformers to capture relative positions.

If token $A$ is at position $m$ and token $B$ is at position $n$, their dot product (attention score) will depend only on the relative distance $m - n$. This is because:

$$(R_m \mathbf{x})^T (R_n \mathbf{y}) = \mathbf{x}^T R_m^T R_n \mathbf{y} = \mathbf{x}^T R_{n-m} \mathbf{y}$$

This makes the model naturally translation-invariant!
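Both the composability property and the relative-score identity are easy to verify numerically. A throwaway sketch with arbitrary vectors and positions:

```python
import numpy as np

def R(theta: float) -> np.ndarray:
    """2D rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# composability: two successive rotations collapse into one
assert np.allclose(R(0.3) @ R(0.5), R(0.8))

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
m, n = 7, 3                                  # arbitrary token positions

# score between independently rotated vectors ...
score = (R(m) @ x) @ (R(n) @ y)
# ... equals the score using a single relative rotation R(n - m)
relative = x @ (R(n - m) @ y)
assert np.isclose(score, relative)
```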


🎛️ Interactive: Rotation Matrix Explorer


5. Rotary Positional Embeddings (RoPE)

RoPE (Su et al., 2021) is the backbone of modern LLMs like LLaMA, Mistral, and GPT-NeoX. Instead of adding a positional vector to the token embedding, RoPE encodes position by rotating the Query and Key vectors in the attention mechanism.

Core Idea

For a token at position $m$, we rotate its query vector $\mathbf{q}$ by an angle proportional to $m$. When we compute the dot product between a query at $m$ and a key at $n$, the rotations "interlock" to reveal the relative distance.

$$\tilde{\mathbf{q}}_m = R_m \mathbf{q}_m, \quad \tilde{\mathbf{k}}_n = R_n \mathbf{k}_n$$

The attention score (dot product) then becomes:

$$\langle \tilde{\mathbf{q}}_m, \tilde{\mathbf{k}}_n \rangle = \mathbf{q}_m^T R_m^T R_n \mathbf{k}_n = \mathbf{q}_m^T R_{n-m} \mathbf{k}_n$$

The Key Insight

Because the dot product depends only on $R_{n-m}$, the attention score is a function of the relative distance between tokens. This makes the model naturally translation-invariant without needing the complex lookup tables used in older relative-bias methods.


How It's Applied in Practice

The embedding dimension $d$ is split into $d/2$ pairs. Each pair of coordinates $[x_{2i}, x_{2i+1}]$ is treated as a point in a 2D plane and rotated by its own frequency $\theta_i$:

$$\theta_i = 10000^{-\frac{2i}{d}}$$

This follows the same geometric progression as the original Transformer, but uses these values as rotation speeds rather than additive constants.

Comparison: Sine/Cosine vs. RoPE

| Feature | Sine/Cosine (Absolute) | RoPE (Rotary) |
|---|---|---|
| Application | Added to the input embedding | Multiplied (rotated) into $Q$ and $K$ |
| Encoding Type | Absolute position | Relative position (via dot product) |
| Attention | Depends on $m$ and $n$ independently | Depends only on distance $m - n$ |
| Generalization | Poor for longer sequences | Excellent (supports context scaling) |
| Common Use | Vanilla Transformer, BERT | LLaMA, Mistral, PaLM, Phi |

Step-by-Step: Applying RoPE

  1. Input: Take a query vector $\mathbf{q} \in \mathbb{R}^d$ for a token at position $m$.

  2. Pairing: Split the vector into $d/2$ consecutive pairs: $(q_0, q_1), (q_2, q_3), \dots, (q_{d-2}, q_{d-1})$.

  3. Rotation: For each pair $i$, calculate the rotation angle $\phi = m \cdot \theta_i$.

  4. Transformation: Apply the 2D rotation:

$$\begin{aligned} \tilde{q}_{2i} &= q_{2i} \cos(\phi) - q_{2i+1} \sin(\phi) \\ \tilde{q}_{2i+1} &= q_{2i} \sin(\phi) + q_{2i+1} \cos(\phi) \end{aligned}$$

  5. Attention: Repeat for the key vector $\mathbf{k}$ at position $n$. The resulting dot product $\tilde{\mathbf{q}}_m \cdot \tilde{\mathbf{k}}_n$ now inherently encodes the relative position.
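The steps above can be sketched directly in numpy. This is an illustrative single-vector version (the function name `rope` is mine; production implementations work on batched tensors and often use a half-split layout rather than interleaved pairs):

```python
import numpy as np

def rope(vec: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply a rotary embedding to one vector at a given position."""
    d = vec.shape[0]
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)        # per-pair rotation speeds
    phi = pos * theta                     # rotation angle for each pair
    cos, sin = np.cos(phi), np.sin(phi)
    x1, x2 = vec[0::2], vec[1::2]         # consecutive coordinate pairs
    out = np.empty_like(vec)
    out[0::2] = x1 * cos - x2 * sin       # 2D rotation, applied per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# the score depends only on the relative distance m - n:
s1 = rope(q, 5) @ rope(k, 2)        # distance 3
s2 = rope(q, 105) @ rope(k, 102)    # distance 3, both shifted by 100
assert np.isclose(s1, s2)
```

Shifting both positions by the same amount leaves the score unchanged, which is exactly the relative-position property derived above.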


🎛️ Interactive: RoPE Playground


Notice how moving both sliders by the same amount keeps the Attention Score exactly the same. RoPE inherently preserves relative distance!
