Understanding Convolutional Neural Networks
A deep dive into the architecture and mathematics of CNNs — the backbone of modern computer vision.
Convolution Operation
At the core of every CNN lies the convolution operation — a surprisingly simple mechanism with profound representational power.
A small matrix called a kernel (or filter) slides across the input image. At each spatial position, it computes:
- Element-wise multiplication between the kernel and the overlapping image patch
- Summation of those products into a single scalar
The result is a feature map — a spatial representation that highlights specific patterns such as edges, corners, or textures.
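The slide-multiply-sum procedure above can be sketched in a few lines of NumPy (a minimal illustration; real frameworks use much faster vectorized implementations):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; at each position take the
    element-wise product with the overlapping patch and sum it.
    (Written without kernel flipping, as deep learning frameworks do.)"""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i+kh, j:j+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 5x5 input with a 3x3 kernel yields a 3x3 feature map,
# matching the visualization below.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.]] * 3)  # simple vertical-edge detector
print(conv2d(image, kernel).shape)  # (3, 3)
```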
Visualization of a kernel sliding over an input matrix to produce a feature map.
Mathematical Formulation
For a 2D input image $I$ and a kernel $K$, the convolution operation producing an output feature map $S$ is defined as:

$$S(i, j) = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} I(i + m,\, j + n) \cdot K(m, n)$$

(As in most deep learning frameworks, this is written without kernel flipping, so it is technically cross-correlation.)
Key Details
- $(i, j)$: spatial location in the output feature map
- $k_h, k_w$: height and width of the kernel
- The kernel is applied to a local receptive field of the input
With Bias Term
In practice, a learnable bias $b$ is added:

$$S(i, j) = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} I(i + m,\, j + n) \cdot K(m, n) + b$$
Multi-Channel Convolution — RGB Tensors
Grayscale images are 2D matrices. Real-world images are 3D tensors: height × width × channels (e.g., R, G, B).
In multi-channel convolution:
- Each channel is convolved independently with a corresponding kernel slice
- The per-channel results are summed to produce a single scalar at each spatial position
- Collecting these scalars over all positions yields one 2D output feature map
This extends naturally: a layer with K kernels produces a depth-K output tensor, where each slice encodes a distinct learned feature.
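A minimal sketch of the multi-channel case, assuming a channels-last (H, W, C) layout:

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    """image: (H, W, C); kernel: (kh, kw, C).
    Each channel is convolved with its kernel slice; the per-channel
    results are summed into a single 2D output feature map."""
    kh, kw, c = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply across all C channels at once, then sum everything.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw, :] * kernel)
    return out

rgb = np.ones((5, 5, 3))   # toy "RGB" input
k = np.ones((3, 3, 3))     # one kernel with a slice per channel
print(conv2d_multichannel(rgb, k)[0, 0])  # 27.0: a 3x3x3 block of ones summed
```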
3D tensor representation of RGB convolution across three input channels.
Mathematical Formulation
For inputs with $C$ channels, the per-channel convolutions are summed:

$$S(i, j) = \sum_{c=1}^{C} \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} I_c(i + m,\, j + n) \cdot K_c(m, n)$$

- Each channel $c$ has its own kernel slice $K_c$
- Results are summed across channels
Output Size Formula
The spatial dimensions of the output feature map are determined by:

$$O = \left\lfloor \frac{N + 2P - K}{S} \right\rfloor + 1$$

Where:
- $N$: input size, $K$: kernel size
- $P$: padding
- $S$: stride
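The formula translates directly into a small helper (the function name is illustrative):

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Spatial output size: O = floor((N + 2P - K) / S) + 1."""
    return (n + 2 * padding - k) // stride + 1

# 5x5 input, 3x3 kernel, no padding, stride 1 -> 3x3 output
print(conv_output_size(5, 3))             # 3
# "same" padding for a 3x3 kernel preserves the size: 5 -> 5
print(conv_output_size(5, 3, padding=1))  # 5
```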
Key Insight
The kernel's weights are learned, not hand-crafted. During training, the network discovers which patterns are most discriminative for the task.
Multiple Kernels — Learning a Filter Bank
A single kernel detects a single pattern. In practice, a convolutional layer applies multiple kernels in parallel, each learning a different feature detector.
A bank of learned kernels applied simultaneously, each producing its own feature map.
Common learned kernels include:
- Horizontal / vertical edge detectors — early layers
- Texture and frequency patterns — mid layers
- Semantic part detectors (eyes, wheels, text) — deep layers
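A filter bank can be sketched as a naive loop over K kernels, each producing one slice of a depth-K output (illustrative only; frameworks vectorize this heavily):

```python
import numpy as np

def conv_layer(image, kernels, biases):
    """image: (H, W, C); kernels: (K, kh, kw, C); biases: (K,).
    Applies K kernels in parallel and stacks the K feature maps
    into a depth-K output tensor."""
    K, kh, kw, c = kernels.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow, K))
    for k in range(K):
        for i in range(oh):
            for j in range(ow):
                out[i, j, k] = np.sum(image[i:i+kh, j:j+kw] * kernels[k]) + biases[k]
    return out

x = np.random.rand(5, 5, 3)
w = np.random.rand(2, 3, 3, 3)  # two filters, like Filter A and Filter B above
b = np.zeros(2)
print(conv_layer(x, w, b).shape)  # (3, 3, 2): a depth-2 feature map
```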
CNN Architecture Pipeline
A standard CNN is composed of three primary stages, stacked in sequence.
1. Convolution and Activation
- Extracts local spatial patterns
- ReLU introduces non-linearity and helps mitigate vanishing gradients
2. Pooling (Spatial Downsampling)
- Reduces spatial resolution
- Improves translation invariance
- Keeps strongest activations
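The pooling stage can be illustrated with a simple 2×2 max pool, which keeps the strongest activation in each window and halves the spatial resolution:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """2x2 max pooling: take the maximum of each window."""
    oh = (fmap.shape[0] - size) // stride + 1
    ow = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

fmap = np.array([[ 1.,  2.,  3.,  4.],
                 [ 5.,  6.,  7.,  8.],
                 [ 9., 10., 11., 12.],
                 [13., 14., 15., 16.]])
print(max_pool(fmap))  # maxima of each 2x2 window: 6, 8, 14, 16
```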
3. Fully Connected Layers
- Flatten the spatial feature maps into a vector
- Map the learned features to final outputs (e.g., class scores)
End-to-End Flow
Input → [Convolution → ReLU → Pooling] × N → Flatten → Fully Connected → Output
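To make the flow concrete, here is a shape trace through a small hypothetical stack (the 32×32×3 input and the 64-channel second convolution are illustrative assumptions, not from the article):

```python
def conv_out(n, k=3, p=1, s=1):
    """Spatial size after a conv with "same"-style 3x3 kernel, padding 1."""
    return (n + 2 * p - k) // s + 1

def pool_out(n, size=2, stride=2):
    """Spatial size after a 2x2 max pool with stride 2."""
    return (n - size) // stride + 1

n = 32            # hypothetical 32x32x3 input image
n = conv_out(n)   # conv 3x3, pad 1 -> 32 (size preserved)
n = pool_out(n)   # pool 2x2      -> 16
n = conv_out(n)   # conv 3x3, pad 1 -> 16
n = pool_out(n)   # pool 2x2      -> 8
flat = n * n * 64 # flatten, assuming 64 channels after the second conv
print(n, flat)    # 8 4096 -> the vector the fully connected layers consume
```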
Hierarchical Feature Learning
One of the most important properties of deep CNNs is hierarchical representation:
| Depth | What Is Learned |
|---|---|
| Early layers | Edges, corners, colour blobs |
| Mid layers | Textures, shapes, object parts |
| Deep layers | Semantic concepts (faces, wheels, text) |
This hierarchy emerges automatically from data — not from explicit programming. CNNs learn what to detect, not merely where to detect it.
Applications
Convolutional architectures underpin virtually every state-of-the-art system in visual perception:
- Image classification — ResNet, EfficientNet, ConvNeXt
- Object detection — YOLO, DETR, Faster R-CNN
- Semantic segmentation — DeepLab, SegFormer
- Video understanding — 3D ConvNets, SlowFast
- Medical imaging — tumour detection, radiology report generation
Playground
Interactive demo: a 5×5 input with padding 1 (displayed as a 7×7 padded grid), a 3×3 kernel, and stride 1 produce a 5×5 output, i.e. 25 kernel positions: ((5 + 2·1 − 3) / 1) + 1 = 5.
Summary
Convolutional Neural Networks achieve their power through three core principles:
- Local connectivity — each neuron sees only a small spatial region
- Weight sharing — the same kernel is applied everywhere, drastically reducing parameters
- Hierarchical composition — simple features compose into complex representations across depth
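A quick back-of-the-envelope calculation shows why weight sharing matters; the 32×32×3 input and 64 output maps below are illustrative assumptions:

```python
# One conv layer: 64 shared 3x3 kernels over 3 input channels, plus biases.
conv_params = 3 * 3 * 3 * 64 + 64
# A fully connected layer producing the same 32x32x64 output from a
# 32x32x3 input: one weight per input-output pair, plus biases.
dense_params = (32 * 32 * 3) * (32 * 32 * 64) + 32 * 32 * 64

print(conv_params)   # 1,792 parameters
print(dense_params)  # over 200 million parameters
```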
The Foundational Insight
Convolution is not merely a mathematical operation. It is a mechanism for learning spatial structure efficiently at scale — the foundational insight that made deep learning practical for vision.