Recurrent Neural Networks (RNNs)
Understanding how neural networks process sequential data and maintain memory over time.
While Convolutional Neural Networks (CNNs) are highly effective for spatial data such as images, they are not well-suited for sequential data, where the order of inputs carries essential meaning. Tasks such as language modeling, time-series forecasting, and speech recognition require models that can retain and utilize past information.
Recurrent Neural Networks (RNNs) address this limitation by introducing the concept of temporal memory. They process data sequentially, maintaining an internal state—known as the hidden state—that captures information from previous time steps.
1. Temporal Structure of RNNs
An RNN can be understood more clearly when visualized as an unrolled computational graph across time. Rather than a loop, it becomes a sequence of identical processing units (cells), each corresponding to a time step.
*Figure: Unrolled RNN architecture, visualizing hidden state propagation across time steps.*
At each time step ( t ), the RNN cell takes two inputs:
- The current input vector ( x_t )
- The hidden state from the previous step ( h_{t-1} )
This structure enables the network to propagate information forward through time, effectively forming a chain of dependencies.
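This chain of dependencies can be sketched as a plain loop over time steps. The following is a minimal pure-Python illustration (all dimensions and weight values are made up for the example, not taken from the text): each iteration applies the same cell to the current input and the previous hidden state.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Unroll the RNN over the sequence xs, returning every hidden state."""
    h = [0.0] * len(b_h)  # h_0: initial hidden state (zeros)
    states = []
    for x in xs:  # one cell per time step, same weights at every step
        pre = [a + b + c for a, b, c in zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]  # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        states.append(h)
    return states

# Example: a length-2 sequence of 3-dim inputs, with a 2-dim hidden state
states = rnn_forward(
    xs=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    W_xh=[[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]],
    W_hh=[[0.5, 0.0], [0.0, 0.5]],
    b_h=[0.0, 0.0],
)
```

Because each hidden state feeds into the next step's computation, information from early inputs can, in principle, influence every later state.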
2. Mathematical Formulation
Hidden State Update
The hidden state serves as the memory of the network and is updated at each time step using the following equation:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)

| Symbol | Description |
|---|---|
| h_t | Current hidden state |
| h_{t-1} | Previous hidden state |
| x_t | Input at time step t |
| W_hh, W_xh | Weight matrices |
| b_h | Bias vector |
The use of a non-linear activation function such as tanh allows the network to model complex, non-linear relationships in sequential data.
Output Computation
When an output is required at a given time step, it is computed from the hidden state:

y_t = W_hy · h_t + b_y
This formulation allows the RNN to produce outputs either at every time step (e.g., sequence labeling) or only at the final step (e.g., sequence classification).
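For classification tasks, the linear output is typically passed through a softmax to produce a probability distribution. A minimal sketch of this output step (the weight values and output size are illustrative assumptions, not from the text):

```python
import math

def output_step(h, W_hy, b_y):
    """Compute y_t = softmax(W_hy h_t + b_y) from the current hidden state."""
    logits = [sum(w * x for w, x in zip(row, h)) + b
              for row, b in zip(W_hy, b_y)]
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example: map a 2-dim hidden state to probabilities over 3 output classes
probs = output_step(
    h=[0.5, -0.2],
    W_hy=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    b_y=[0.0, 0.0, 0.0],
)
```

In sequence labeling this step runs at every time step; in sequence classification it runs only once, on the final hidden state.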
3. Parameter Sharing Across Time
*Figure: Inside the RNN cell, showing vector computation and hidden state propagation.*
A defining characteristic of RNNs is that their parameters are shared across all time steps. Unlike feed-forward networks, where each layer has distinct weights, RNNs reuse the same matrices throughout the sequence.
This design provides two key advantages:
- Scalability to Variable-Length Sequences: The model can process sequences of arbitrary length without modifying its architecture.
- Improved Generalization: Patterns learned at one position in a sequence can be applied universally. For example, if the model learns that the word "not" negates meaning, this rule applies regardless of its position.
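Parameter sharing is visible directly in code: the same weights drive every time step, so one model handles sequences of any length. A toy scalar-state sketch (weight values are illustrative assumptions):

```python
import math

def step(h, x, w_x, w_h, b):
    """Single-cell update for a scalar hidden state: h_t = tanh(w_x*x + w_h*h + b)."""
    return math.tanh(w_x * x + w_h * h + b)

def run(xs, w_x=0.5, w_h=0.8, b=0.0):
    """The same three parameters are reused at every time step,
    so any sequence length works without changing the model."""
    h = 0.0
    for x in xs:
        h = step(h, x, w_x, w_h, b)
    return h

# Identical weights process a short and a long sequence alike:
short = run([1.0, -1.0])
long_ = run([1.0, -1.0, 0.5, 0.2, -0.3, 0.9, 0.1])
```

A feed-forward network with per-position weights would need a fixed input length; here the architecture is length-agnostic by construction.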
4. Limitations: The Vanishing Gradient Problem
*Figure: LSTM gated architecture, showing memory cells and hidden projections.*
Despite their conceptual elegance, standard RNNs face a significant challenge during training: the vanishing gradient problem.
In Backpropagation Through Time (BPTT), gradients are propagated backward through many time steps. Due to repeated multiplication, these gradients can shrink exponentially, effectively approaching zero.
Consequences:
- The model struggles to learn long-range dependencies.
- Early inputs in a sequence have minimal influence on later predictions.
- The network exhibits a strong bias toward recent inputs (short-term memory).
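The exponential shrinkage can be demonstrated numerically. For a scalar RNN h_t = tanh(w · h_{t-1}), the gradient of the final state with respect to the initial one is a product of per-step factors w · (1 − h_t²), each with magnitude below |w|. A minimal sketch (the weight value and step count are illustrative assumptions):

```python
import math

# Scalar RNN: h_t = tanh(w * h_{t-1}); by the chain rule,
# dh_T/dh_0 = product over t of  w * (1 - h_t^2).
w = 0.9                      # recurrent weight (illustrative value, |w| < 1)
h, grad = 0.5, 1.0
grads = []
for t in range(50):
    h = math.tanh(w * h)
    grad *= w * (1 - h * h)  # one step of backpropagation through time
    grads.append(grad)
# After 50 steps the accumulated gradient is vanishingly small.
```

Each factor is at most 0.9 here, so after 50 steps the gradient falls below roughly 0.9^50 ≈ 0.005, which is why early inputs barely influence the learning signal.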
5. Advanced Architectures
To overcome these limitations, more sophisticated recurrent architectures have been developed:
- Long Short-Term Memory (LSTM): Introduces gating mechanisms to regulate information flow and preserve long-term dependencies.
- Gated Recurrent Unit (GRU): A simplified variant of LSTM with fewer parameters while retaining similar performance.
These models explicitly control what information to retain, update, or discard, significantly improving the ability to model long sequences.
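To make the gating idea concrete, here is a scalar-state sketch of one GRU step (weights are illustrative assumptions; note that sign conventions for the update gate vary between references, so this follows one common form):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(h, x, p):
    """One GRU step on a scalar state (toy weights in dict p).
    Gates decide how much of the old state to keep vs. overwrite."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])                # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    return (1 - z) * h + z * h_tilde  # interpolate between old and candidate state

# Illustrative (untrained) parameter values:
params = {"wz": 0.8, "uz": 0.1, "bz": 0.0,
          "wr": 0.5, "ur": 0.2, "br": 0.0,
          "wh": 1.0, "uh": 0.9, "bh": 0.0}
```

Because the new state is a gated interpolation rather than a full overwrite, the network can hold information nearly unchanged across many steps, which is precisely what mitigates the vanishing gradient.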
Conclusion
Recurrent Neural Networks represent a foundational approach for modeling sequential data by incorporating memory into neural computation. While standard RNNs are limited by training difficulties, their core idea has inspired more powerful architectures such as LSTMs and GRUs, which remain essential in modern deep learning systems.