Recurrent Neural Networks (RNNs)
Understanding how neural networks process sequential data and maintain memory over time.
While Convolutional Neural Networks (CNNs) are highly effective for spatial data such as images, they are not well-suited for sequential data, where the order of inputs carries essential meaning. Tasks such as language modeling, time-series forecasting, and speech recognition require models that can retain and utilize past information.
Recurrent Neural Networks (RNNs) address this limitation by introducing the concept of temporal memory. They process data sequentially, maintaining an internal state—known as the hidden state—that captures information from previous time steps.
1. Temporal Structure of RNNs
An RNN can be understood more clearly when visualized as an unrolled computational graph across time. Rather than a loop, it becomes a sequence of identical processing units (cells), each corresponding to a time step.
*Figure: Unrolled RNN architecture, visualizing hidden state propagation across time steps.*
At each time step ( t ), the RNN cell takes two inputs:
- The current input vector ( x_t )
- The hidden state from the previous step ( h_{t-1} )
This structure enables the network to propagate information forward through time, effectively forming a chain of dependencies.
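This chain of dependencies can be sketched as a plain loop over time steps. The following is a minimal pure-Python illustration (all dimensions and weight values are made up for the example, not taken from the text): each iteration applies the same cell to the current input and the previous hidden state.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Unroll the RNN over the sequence xs, returning every hidden state."""
    h = [0.0] * len(b_h)  # h_0: initial hidden state (zeros)
    states = []
    for x in xs:  # one cell per time step, same weights at every step
        pre = [a + b + c for a, b, c in zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]  # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        states.append(h)
    return states

# Example: a length-2 sequence of 3-dim inputs, with a 2-dim hidden state
states = rnn_forward(
    xs=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    W_xh=[[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]],
    W_hh=[[0.5, 0.0], [0.0, 0.5]],
    b_h=[0.0, 0.0],
)
```

Because each hidden state feeds into the next step's computation, information from early inputs can, in principle, influence every later state.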
2. Mathematical Formulation
Hidden State Update
The hidden state serves as the memory of the network and is updated at each time step using the following equation:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)

| Symbol | Description |
|---|---|
| h_t | Current hidden state |
| h_{t-1} | Previous hidden state |
| x_t | Input at time step t |
| W_hh, W_xh | Weight matrices |
| b_h | Bias vector |
The use of a non-linear activation function such as tanh allows the network to model complex, non-linear relationships in sequential data.
Output Computation
When an output is required at a given time step, it is computed from the hidden state:

y_t = W_hy · h_t + b_y
This formulation allows the RNN to produce outputs either at every time step (e.g., sequence labeling) or only at the final step (e.g., sequence classification).
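For classification tasks, the linear output is typically passed through a softmax to produce a probability distribution. A minimal sketch of this output step (the weight values and output size are illustrative assumptions, not from the text):

```python
import math

def output_step(h, W_hy, b_y):
    """Compute y_t = softmax(W_hy h_t + b_y) from the current hidden state."""
    logits = [sum(w * x for w, x in zip(row, h)) + b
              for row, b in zip(W_hy, b_y)]
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example: map a 2-dim hidden state to probabilities over 3 output classes
probs = output_step(
    h=[0.5, -0.2],
    W_hy=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    b_y=[0.0, 0.0, 0.0],
)
```

In sequence labeling this step runs at every time step; in sequence classification it runs only once, on the final hidden state.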
3. Parameter Sharing Across Time
*Figure: Inside the RNN cell, showing vector computation and hidden state propagation.*
A defining characteristic of RNNs is that their parameters are shared across all time steps. Unlike feed-forward networks, where each layer has distinct weights, RNNs reuse the same matrices throughout the sequence.
This design provides two key advantages:
- Scalability to Variable-Length Sequences: The model can process sequences of arbitrary length without modifying its architecture.
- Improved Generalization: Patterns learned at one position in a sequence can be applied universally. For example, if the model learns that the word "not" negates meaning, this rule applies regardless of its position.
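Parameter sharing is visible directly in code: the same weights drive every time step, so one model handles sequences of any length. A toy scalar-state sketch (weight values are illustrative assumptions):

```python
import math

def step(h, x, w_x, w_h, b):
    """Single-cell update for a scalar hidden state: h_t = tanh(w_x*x + w_h*h + b)."""
    return math.tanh(w_x * x + w_h * h + b)

def run(xs, w_x=0.5, w_h=0.8, b=0.0):
    """The same three parameters are reused at every time step,
    so any sequence length works without changing the model."""
    h = 0.0
    for x in xs:
        h = step(h, x, w_x, w_h, b)
    return h

# Identical weights process a short and a long sequence alike:
short = run([1.0, -1.0])
long_ = run([1.0, -1.0, 0.5, 0.2, -0.3, 0.9, 0.1])
```

A feed-forward network with per-position weights would need a fixed input length; here the architecture is length-agnostic by construction.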
4. Limitations: The Vanishing Gradient Problem
*Figure: LSTM gated architecture, showing memory cells and hidden projections.*
Despite their conceptual elegance, standard RNNs face a significant challenge during training: the vanishing gradient problem.
In Backpropagation Through Time (BPTT), gradients are propagated backward through many time steps. Due to repeated multiplication, these gradients can shrink exponentially, effectively approaching zero.
Consequences:
- The model struggles to learn long-range dependencies.
- Early inputs in a sequence have minimal influence on later predictions.
- The network exhibits a strong bias toward recent inputs (short-term memory).
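The exponential shrinkage can be demonstrated numerically. For a scalar RNN h_t = tanh(w · h_{t-1}), the gradient of the final state with respect to the initial one is a product of per-step factors w · (1 − h_t²), each with magnitude below |w|. A minimal sketch (the weight value and step count are illustrative assumptions):

```python
import math

# Scalar RNN: h_t = tanh(w * h_{t-1}); by the chain rule,
# dh_T/dh_0 = product over t of  w * (1 - h_t^2).
w = 0.9                      # recurrent weight (illustrative value, |w| < 1)
h, grad = 0.5, 1.0
grads = []
for t in range(50):
    h = math.tanh(w * h)
    grad *= w * (1 - h * h)  # one step of backpropagation through time
    grads.append(grad)
# After 50 steps the accumulated gradient is vanishingly small.
```

Each factor is at most 0.9 here, so after 50 steps the gradient falls below roughly 0.9^50 ≈ 0.005, which is why early inputs barely influence the learning signal.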
5. Advanced Architectures
To overcome these limitations, more sophisticated recurrent architectures have been developed:
- Long Short-Term Memory (LSTM): Introduces gating mechanisms to regulate information flow and preserve long-term dependencies.
- Gated Recurrent Unit (GRU): A simplified variant of LSTM with fewer parameters while retaining similar performance.
These models explicitly control what information to retain, update, or discard, significantly improving the ability to model long sequences.
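To make the gating idea concrete, here is a scalar-state sketch of one GRU step (weights are illustrative assumptions; note that sign conventions for the update gate vary between references, so this follows one common form):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(h, x, p):
    """One GRU step on a scalar state (toy weights in dict p).
    Gates decide how much of the old state to keep vs. overwrite."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])                # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    return (1 - z) * h + z * h_tilde  # interpolate between old and candidate state

# Illustrative (untrained) parameter values:
params = {"wz": 0.8, "uz": 0.1, "bz": 0.0,
          "wr": 0.5, "ur": 0.2, "br": 0.0,
          "wh": 1.0, "uh": 0.9, "bh": 0.0}
```

Because the new state is a gated interpolation rather than a full overwrite, the network can hold information nearly unchanged across many steps, which is precisely what mitigates the vanishing gradient.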
Conclusion
Recurrent Neural Networks represent a foundational approach for modeling sequential data by incorporating memory into neural computation. While standard RNNs are limited by training difficulties, their core idea has inspired more powerful architectures such as LSTMs and GRUs, which remain essential in modern deep learning systems.