
Understanding Convolutional Neural Networks

A deep dive into the architecture and mathematics of CNNs — the backbone of modern computer vision.

Convolution Operation

At the core of every CNN lies the convolution operation — a surprisingly simple mechanism with profound representational power.

A small matrix called a kernel (or filter) slides across the input image. At each spatial position, it computes:

  1. Element-wise multiplication between the kernel and the overlapping image patch
  2. Summation of those products into a single scalar

The result is a feature map — a spatial representation that highlights specific patterns such as edges, corners, or textures.

[Interactive figure: visualization of a 3×3 kernel sliding over a 5×5 input matrix, step by step, to produce a 3×3 feature map.]


Mathematical Formulation

For a 2D input image $I \in \mathbb{R}^{H \times W}$ and a kernel $K \in \mathbb{R}^{k_h \times k_w}$, the convolution operation producing an output feature map $S$ is defined as:

S(i,j) = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} I(i+m, j+n) \cdot K(m,n)

Key Details

  • $(i, j)$: spatial location in the output feature map
  • $(k_h, k_w)$: height and width of the kernel
  • The kernel is applied to a local receptive field of the input

With Bias Term

In practice, a learnable bias is added:

S(i,j) = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} I(i+m, j+n) \cdot K(m,n) + b
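As a sketch of the two steps above, the single-channel convolution with bias can be written directly in NumPy (the 5×5 input and 3×3 averaging kernel are illustrative assumptions):

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    # "Valid" convolution as used in CNNs: slide the kernel over the image
    # without padding, producing a smaller output.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise multiply the kernel with the overlapping patch, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel) + bias
    return out

img = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
box = np.ones((3, 3)) / 9.0                      # 3x3 averaging kernel
fmap = conv2d(img, box)
print(fmap.shape)  # (3, 3)
```

Note that, like most deep learning frameworks, this slides the kernel without flipping it (technically cross-correlation); since the kernel weights are learned, the distinction is immaterial.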

Multi-Channel Convolution — RGB Tensors

Grayscale images are 2D matrices. Real-world images are 3D tensors: height × width × channels (e.g., R, G, B).

In multi-channel convolution:

  • Each channel is convolved independently with a corresponding kernel slice
  • The per-channel results are summed to produce a single scalar at each spatial location
  • Repeating this at every position yields one 2D feature map per kernel

This extends naturally: a layer with K kernels produces a depth-K output tensor, where each slice encodes a distinct learned feature.

[Interactive figure: 3D tensor representation of RGB convolution, taking a per-channel dot product across three input channels.]


Mathematical Formulation

For inputs with $C$ channels:

S(i,j) = \sum_{c=0}^{C-1} \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} I_c(i+m, j+n) \cdot K_c(m,n)

  • Each channel has its own kernel slice
  • Results are summed across channels
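A minimal sketch of the channel-summed formula above, assuming a channels-first (C, H, W) layout:

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    # image: (C, H, W); kernel: (C, kh, kw) -> one 2D feature map
    C, H, W = image.shape
    _, kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each channel's patch times its kernel slice, summed over
            # channels and spatial positions in one reduction
            out[i, j] = np.sum(image[:, i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
rgb = rng.random((3, 5, 5))   # toy 3-channel input
k = rng.random((3, 3, 3))     # one kernel with a slice per channel
print(conv2d_multichannel(rgb, k).shape)  # (3, 3)
```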

Output Size Formula

The spatial dimensions of the output feature map are determined by:

H_{out} = \left\lfloor \frac{H + 2P - k_h}{S} \right\rfloor + 1, \quad W_{out} = \left\lfloor \frac{W + 2P - k_w}{S} \right\rfloor + 1

Where:

  • $P$: padding
  • $S$: stride
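The formula is easy to sanity-check in code; a minimal helper (names chosen here for illustration):

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    # floor((size + 2P - k) / S) + 1, applied per spatial dimension
    return (size + 2 * padding - kernel) // stride + 1

print(conv_output_size(5, 3))             # 3: 5x5 input, 3x3 kernel, no padding
print(conv_output_size(7, 3, padding=1))  # 7: "same" padding preserves size
print(conv_output_size(7, 3, stride=2))   # 3: stride 2 roughly halves resolution
```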

Key Insight

The kernel's weights are learned, not hand-crafted. During training, the network discovers which patterns are most discriminative for the task.


Multiple Kernels — Learning a Filter Bank

A single kernel detects a single pattern. In practice, a convolutional layer applies multiple kernels in parallel, each learning a different feature detector.

[Interactive figure: a bank of learned kernels (Filter A: edge detection; Filter B: color blobs) applied simultaneously to a 3-channel input, each producing its own slice of a depth-2 feature map.]

Common learned kernels include:

  • Horizontal / vertical edge detectors — early layers
  • Texture and frequency patterns — mid layers
  • Semantic part detectors (eyes, wheels, text) — deep layers
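Stacking K kernels into a bank gives the depth-K output described above; a sketch, again assuming channels-first layout:

```python
import numpy as np

def conv_layer(image, kernels):
    # image: (C, H, W); kernels: (K, C, kh, kw) -> output: (K, H_out, W_out)
    K, C, kh, kw = kernels.shape
    _, H, W = image.shape
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for k in range(K):  # one feature map per kernel in the bank
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = np.sum(image[:, i:i+kh, j:j+kw] * kernels[k])
    return out

x = np.ones((3, 5, 5))     # toy 3-channel input
w = np.ones((2, 3, 3, 3))  # a bank of 2 kernels, e.g. "Filter A" and "Filter B"
print(conv_layer(x, w).shape)  # (2, 3, 3)
```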

CNN Architecture Pipeline

A standard CNN is composed of three primary stages, stacked in sequence.

1. Convolution and Activation

f(x) = \max(0, x)

  • Extracts local spatial patterns
  • ReLU introduces non-linearity and mitigates vanishing gradients

2. Pooling (Spatial Downsampling)

  • Reduces spatial resolution
  • Improves translation invariance
  • Keeps strongest activations
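A minimal sketch of 2×2 max pooling, which keeps the strongest activation in each window:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    # non-overlapping max pooling over a single 2D feature map
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # keep only the maximum activation in each window
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool2d(x))  # [[6. 8.]
                      #  [3. 4.]]
```

Shifting the input by a pixel often leaves the pooled output unchanged, which is where the translation invariance comes from.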

3. Fully Connected Layers

\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
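The softmax formula maps the final layer's raw scores to class probabilities; a sketch (the max subtraction is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating to avoid overflow;
    # the shift cancels in the ratio, so probabilities are identical
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())     # 1.0 (a valid probability distribution)
print(p.argmax())  # 0 (the largest score wins)
```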

End-to-End Flow

[Diagram: input → convolution and activation → pooling → fully connected layers → softmax output.]

Hierarchical Feature Learning

One of the most important properties of deep CNNs is hierarchical representation:

Depth          What Is Learned
Early layers   Edges, corners, colour blobs
Mid layers     Textures, shapes, object parts
Deep layers    Semantic concepts (faces, wheels, text)

This hierarchy emerges automatically from data — not from explicit programming. CNNs learn what to detect, not merely where to detect it.


Applications

Convolutional architectures underpin virtually every state-of-the-art system in visual perception:

  • Image classification — ResNet, EfficientNet, ConvNeXt
  • Object detection — YOLO, DETR, Faster R-CNN
  • Semantic segmentation — DeepLab, SegFormer
  • Video understanding — 3D ConvNets, SlowFast
  • Medical imaging — tumour detection, radiology report generation

Playground

Worked example: a 5×5 input with padding $P = 1$ (displayed as a padded 7×7 grid), a 3×3 kernel, and stride $S = 1$ gives ((5 + 2·1 − 3) / 1) + 1 = 5, i.e. a 5×5 output.

[Interactive playground: step the kernel through all 25 positions of the padded 7×7 input to build the 5×5 output.]

Summary

Convolutional Neural Networks achieve their power through three core principles:

  1. Local connectivity — each neuron sees only a small spatial region
  2. Weight sharing — the same kernel is applied everywhere, drastically reducing parameters
  3. Hierarchical composition — simple features compose into complex representations across depth
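Weight sharing is what keeps the parameter count small; a back-of-the-envelope comparison (the 64-filter layer and 224×224 input are illustrative assumptions):

```python
# A conv layer: 64 kernels of shape 3x3 over 3 input channels, one bias each.
kh, kw, C, K = 3, 3, 3, 64
conv_params = K * (kh * kw * C + 1)
print(conv_params)  # 1792, independent of image size

# A fully connected layer from a 224x224 RGB image to just 64 units:
fc_params = 224 * 224 * 3 * 64
print(fc_params)  # 9633792, over 5000x more parameters
```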

The Foundational Insight

Convolution is not merely a mathematical operation. It is a mechanism for learning spatial structure efficiently at scale — the foundational insight that made deep learning practical for vision.
