
Understanding Convolutional Neural Networks

A deep dive into the architecture and mathematics of CNNs — the backbone of modern computer vision.

Convolution Operation

At the core of every CNN lies the convolution operation — a surprisingly simple mechanism with profound representational power.

A small matrix called a kernel (or filter) slides across the input image. At each spatial position, it computes:

  1. Element-wise multiplication between the kernel and the overlapping image patch
  2. Summation of those products into a single scalar

The result is a feature map — a spatial representation that highlights specific patterns such as edges, corners, or textures.

[Interactive figure: visualization of a 3×3 kernel sliding over a 5×5 input matrix, step by step, to produce a 3×3 feature map.]


Mathematical Formulation

For a 2D input image $I \in \mathbb{R}^{H \times W}$ and a kernel $K \in \mathbb{R}^{k_h \times k_w}$, the convolution operation producing an output feature map $S$ is defined as:

S(i,j) = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} I(i+m, j+n) \cdot K(m,n)

Key Details

  • $(i, j)$: spatial location in the output feature map
  • $(k_h, k_w)$: height and width of the kernel
  • The kernel is applied to a local receptive field of the input

With Bias Term

In practice, a learnable bias is added:

S(i,j) = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} I(i+m, j+n) \cdot K(m,n) + b
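As a sketch of the two steps above, the single-channel convolution with bias can be written directly in NumPy (the 5×5 input and 3×3 averaging kernel are illustrative assumptions):

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    # "Valid" convolution as used in CNNs: slide the kernel over the image
    # without padding, producing a smaller output.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise multiply the kernel with the overlapping patch, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel) + bias
    return out

img = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
box = np.ones((3, 3)) / 9.0                      # 3x3 averaging kernel
fmap = conv2d(img, box)
print(fmap.shape)  # (3, 3)
```

Note that, like most deep learning frameworks, this slides the kernel without flipping it (technically cross-correlation); since the kernel weights are learned, the distinction is immaterial.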

Multi-Channel Convolution — RGB Tensors

Grayscale images are 2D matrices. Real-world images are 3D tensors: height × width × channels (e.g., R, G, B).

In multi-channel convolution:

  • Each channel is convolved independently with a corresponding kernel slice
  • The per-channel results are summed to produce a single scalar at each spatial location
  • Repeating this at every position yields one 2D feature map per kernel

This extends naturally: a layer with K kernels produces a depth-K output tensor, where each slice encodes a distinct learned feature.

[Interactive figure: 3D tensor representation of RGB convolution, taking a per-channel dot product across three input channels.]


Mathematical Formulation

For inputs with $C$ channels:

S(i,j) = \sum_{c=0}^{C-1} \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} I_c(i+m, j+n) \cdot K_c(m,n)

  • Each channel has its own kernel slice
  • Results are summed across channels
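A minimal sketch of the channel-summed formula above, assuming a channels-first (C, H, W) layout:

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    # image: (C, H, W); kernel: (C, kh, kw) -> one 2D feature map
    C, H, W = image.shape
    _, kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each channel's patch times its kernel slice, summed over
            # channels and spatial positions in one reduction
            out[i, j] = np.sum(image[:, i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
rgb = rng.random((3, 5, 5))   # toy 3-channel input
k = rng.random((3, 3, 3))     # one kernel with a slice per channel
print(conv2d_multichannel(rgb, k).shape)  # (3, 3)
```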

Output Size Formula

The spatial dimensions of the output feature map are determined by:

H_{out} = \left\lfloor \frac{H + 2P - k_h}{S} \right\rfloor + 1, \quad W_{out} = \left\lfloor \frac{W + 2P - k_w}{S} \right\rfloor + 1

Where:

  • $P$: padding
  • $S$: stride
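The formula is easy to sanity-check in code; a minimal helper (names chosen here for illustration):

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    # floor((size + 2P - k) / S) + 1, applied per spatial dimension
    return (size + 2 * padding - kernel) // stride + 1

print(conv_output_size(5, 3))             # 3: 5x5 input, 3x3 kernel, no padding
print(conv_output_size(7, 3, padding=1))  # 7: "same" padding preserves size
print(conv_output_size(7, 3, stride=2))   # 3: stride 2 roughly halves resolution
```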

Key Insight

The kernel's weights are learned, not hand-crafted. During training, the network discovers which patterns are most discriminative for the task.


Multiple Kernels — Learning a Filter Bank

A single kernel detects a single pattern. In practice, a convolutional layer applies multiple kernels in parallel, each learning a different feature detector.

[Interactive figure: a bank of learned kernels (Filter A: edge detection; Filter B: color blobs) applied simultaneously to a 3-channel input, each producing its own slice of a depth-2 feature map.]

Common learned kernels include:

  • Horizontal / vertical edge detectors — early layers
  • Texture and frequency patterns — mid layers
  • Semantic part detectors (eyes, wheels, text) — deep layers
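Stacking K kernels into a bank gives the depth-K output described above; a sketch, again assuming channels-first layout:

```python
import numpy as np

def conv_layer(image, kernels):
    # image: (C, H, W); kernels: (K, C, kh, kw) -> output: (K, H_out, W_out)
    K, C, kh, kw = kernels.shape
    _, H, W = image.shape
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for k in range(K):  # one feature map per kernel in the bank
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = np.sum(image[:, i:i+kh, j:j+kw] * kernels[k])
    return out

x = np.ones((3, 5, 5))     # toy 3-channel input
w = np.ones((2, 3, 3, 3))  # a bank of 2 kernels, e.g. "Filter A" and "Filter B"
print(conv_layer(x, w).shape)  # (2, 3, 3)
```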

CNN Architecture Pipeline

A standard CNN is composed of three primary stages, stacked in sequence.

1. Convolution and Activation

f(x) = \max(0, x)

  • Extracts local spatial patterns
  • ReLU introduces non-linearity and mitigates vanishing gradients

2. Pooling (Spatial Downsampling)

  • Reduces spatial resolution
  • Improves translation invariance
  • Keeps strongest activations
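A minimal sketch of 2×2 max pooling, which keeps the strongest activation in each window:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    # non-overlapping max pooling over a single 2D feature map
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # keep only the maximum activation in each window
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool2d(x))  # [[6. 8.]
                      #  [3. 4.]]
```

Shifting the input by a pixel often leaves the pooled output unchanged, which is where the translation invariance comes from.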

3. Fully Connected Layers

\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
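The softmax formula maps the final layer's raw scores to class probabilities; a sketch (the max subtraction is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating to avoid overflow;
    # the shift cancels in the ratio, so probabilities are identical
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())     # 1.0 (a valid probability distribution)
print(p.argmax())  # 0 (the largest score wins)
```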

End-to-End Flow

[Diagram: input → convolution and activation → pooling → fully connected layers → softmax output.]

Hierarchical Feature Learning

One of the most important properties of deep CNNs is hierarchical representation:

Depth          What Is Learned
Early layers   Edges, corners, colour blobs
Mid layers     Textures, shapes, object parts
Deep layers    Semantic concepts (faces, wheels, text)

This hierarchy emerges automatically from data — not from explicit programming. CNNs learn what to detect, not merely where to detect it.


Applications

Convolutional architectures underpin virtually every state-of-the-art system in visual perception:

  • Image classification — ResNet, EfficientNet, ConvNeXt
  • Object detection — YOLO, DETR, Faster R-CNN
  • Semantic segmentation — DeepLab, SegFormer
  • Video understanding — 3D ConvNets, SlowFast
  • Medical imaging — tumour detection, radiology report generation

Playground

Worked example: a 5×5 input with padding $P = 1$ (displayed as a padded 7×7 grid), a 3×3 kernel, and stride $S = 1$ gives ((5 + 2·1 − 3) / 1) + 1 = 5, i.e. a 5×5 output.

[Interactive playground: step the kernel through all 25 positions of the padded 7×7 input to build the 5×5 output.]

Summary

Convolutional Neural Networks achieve their power through three core principles:

  1. Local connectivity — each neuron sees only a small spatial region
  2. Weight sharing — the same kernel is applied everywhere, drastically reducing parameters
  3. Hierarchical composition — simple features compose into complex representations across depth
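Weight sharing is what keeps the parameter count small; a back-of-the-envelope comparison (the 64-filter layer and 224×224 input are illustrative assumptions):

```python
# A conv layer: 64 kernels of shape 3x3 over 3 input channels, one bias each.
kh, kw, C, K = 3, 3, 3, 64
conv_params = K * (kh * kw * C + 1)
print(conv_params)  # 1792, independent of image size

# A fully connected layer from a 224x224 RGB image to just 64 units:
fc_params = 224 * 224 * 3 * 64
print(fc_params)  # 9633792, over 5000x more parameters
```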

The Foundational Insight

Convolution is not merely a mathematical operation. It is a mechanism for learning spatial structure efficiently at scale — the foundational insight that made deep learning practical for vision.
