How CNNs Work: In Depth 🤓

Sheryl Li
6 min read · Oct 15, 2021

Looking around, we can easily and clearly identify objects without thinking about it. Computers, though, need image processing algorithms to do what we do effortlessly.

Computer vision has been booming- enabling computers to see and process images the way we do. It’s actually already all around us- self-driving cars 🚗, medical imaging, and facial recognition. On some narrow tasks, computers working purely with numbers can even do better than us humans. 🤯

Computers have a ton of data to deal with- who wants to go through all of those one by one? 🙄

That’s why computers use machine learning algorithms to do the work.

One such algorithm is the convolutional neural network (CNN).

A convolutional neural network is a type of artificial neural network used in image recognition and processing ⚙️- it takes in an input image and assigns learnable weights + biases to different parts of it. In the end, it is able to tell one object from another.

How do CNNs work? 🤨

Basically, they classify objects by taking images and learning the patterns that make them up 🔍. Because of their architecture, they can learn these patterns from data on their own. It’s a process- CNNs have a set of layers that each perform a special function. 👀

The first step in setting an image up for classification is breaking it up into pixels and turning those pixels into numbers that serve as inputs. CNNs then match pieces of the image 🧩: even if the pieces are shifted around and manipulated, the overall image can still be matched as long as its pieces match.

source, with added modifications (the lines)

Those tiny parts of images are called features. Each feature matches some part of the overall image. By applying a feature to every part of the image, we end up creating a filter 👻


Basically, you take a small patch of the image, line a feature up with it, and compare them pixel by pixel by multiplying the two numerical values. (a 1 represents a perfect match) ✅

By just taking the average of the new values, you get an overall score for the match. (If it’s not a perfect match, the value will be less than 1)
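Here is a minimal sketch of that match score in Python with NumPy. It assumes pixels are encoded as +1 (on) and -1 (off) and that "comparing" means multiplying pixel by pixel, which is one common way to set this up. The feature, the patches, and the `match_score` name are made up for illustration.

```python
import numpy as np

# Hypothetical 3x3 feature: a diagonal line (+1 = on, -1 = off)
feature = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

def match_score(patch, feature):
    """Multiply pixel by pixel, then average the products."""
    return float(np.mean(patch * feature))

perfect_patch = feature.copy()   # identical to the feature
near_patch = feature.copy()
near_patch[0, 0] = -1            # flip one pixel so it no longer matches

perfect = match_score(perfect_patch, feature)   # every product is 1
near = match_score(near_patch, feature)         # one product is -1
```

With nine pixels, one mismatch turns one product into -1, so the average drops from 1 to 7/9.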

When you do this for every location in the image, you get a whole new image whose values indicate how strongly the feature matches at each spot in the original image.

Each feature gets applied across every possible patch in the image this way. Now, we have a map 🗺️ of where the feature matches the image: a filtered version of the original image.

This is what a convolution layer does- it takes a set of features ⚙️, applies each one across the whole image, and you end up with a stack of filtered images.
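To make the sliding part concrete, here is a rough sketch of a convolution layer in NumPy. It assumes +1/-1 pixels, no padding, and a step of one pixel at a time. The image, the two features, and the function names are all hypothetical.

```python
import numpy as np

def filter_image(image, feature):
    """Slide the feature over every patch and record the average match score."""
    fh, fw = feature.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    scores = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + fh, col:col + fw]
            scores[row, col] = np.mean(patch * feature)
    return scores

def convolution_layer(image, features):
    """One filtered image per feature, stacked together."""
    return np.stack([filter_image(image, f) for f in features])

# Hypothetical 5x5 image of a diagonal line (+1 = on, -1 = off)
image = -np.ones((5, 5))
np.fill_diagonal(image, 1)

diagonal = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])
anti_diagonal = np.fliplr(diagonal)

stack = convolution_layer(image, [diagonal, anti_diagonal])
```

Here `stack` holds two 3x3 filtered images. The first one scores a perfect 1 wherever the diagonal feature lines up with the line in the image, while the anti-diagonal feature never reaches 1.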

Now let’s shrink it. 👩🏻‍🦯 (This part is called pooling)

Pooling reduces ⬇️ the size of the output from the previous convolutional layer, meaning fewer pixels to deal with. That reduces the computational load, and can help reduce overfitting.

Max pooling keeps the strongest features- during the process you pick the most activated pixel in each window and keep only that one moving on, so you hold on to the important features while shrinking the image’s size. (This is big brain 🧠)

How does pooling even work? 👀

  1. pick a window size (usually 2 or 3) 🪟
  2. pick a stride (usually 2) 🏃🏻
  3. walk your window across your filtered images 🚶🏻‍♀️
  4. from each window, take the maximum value

With a window of 2x2 pixels and a stride of 2 pixels, walk 🚶🏻‍♀️ across the filtered image: ➡️


From this, you get a smaller version of the filtered image. So in the pooling layer, a stack of images becomes a stack of smaller images 📑
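The four pooling steps can be sketched like this in NumPy, using a 2x2 window and a stride of 2. The numbers in `filtered` are invented, and this simple version assumes the window always fits fully inside the image (it skips any leftover edge).

```python
import numpy as np

def max_pool(filtered, window=2, stride=2):
    """Walk a window across the image and keep the maximum from each stop."""
    h, w = filtered.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            pooled[i, j] = filtered[r:r + window, c:c + window].max()
    return pooled

# Hypothetical 4x4 filtered image
filtered = np.array([
    [0.1, 0.9, 0.2, 0.3],
    [0.4, 0.5, 0.8, 0.1],
    [0.7, 0.2, 0.3, 0.6],
    [0.0, 0.3, 0.4, 0.5],
])
pooled = max_pool(filtered)   # 4x4 shrinks to 2x2
```

Each 2x2 window contributes only its largest value, so the 16 pixels become 4: 0.9, 0.8, 0.7 and 0.6.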

A rectifier function adds non-linearity to the network. Any image you look at has a lot of non-linear features (transitions between pixels, borders, colors, etc), and the network needs to be able to represent them.

In order to train deep neural networks, an activation function is needed. Activation functions give neural networks the power to map nonlinear functions. Doing this helps the model capture patterns 🔍- by transforming the weighted sum at each node into that node’s output value.

The Rectified Linear Unit (ReLU) is a type of activation function that takes any negative value and changes it to zero. It’s easy to train with, and performs well 🤩

Why are ReLUs so successful?

  • It’s a non-saturating activation function- it doesn’t squeeze its input into a fixed range; it just computes max(0,x).
  • Because it automatically assigns 0 to all negative values, it also acts as a kind of feature selector (reducing ⬇️ the number of active values passed on to the next layer)
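As a quick illustration, ReLU is a one-liner in NumPy. The sample values are made up, and tanh is included only as an example of a saturating function to compare against.

```python
import numpy as np

def relu(x):
    """max(0, x): negatives become 0, positives pass through unchanged."""
    return np.maximum(0, x)

values = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
activated = relu(values)

# Non-saturating: a huge input stays huge, while tanh flattens out near 1
big_relu = relu(1000.0)
big_tanh = np.tanh(1000.0)
```

The negative entries of `values` come out as 0 and the positive ones are untouched, which is the "feature selector" effect from the second bullet.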

For CNNs, you can stack these layers (convolution, ReLU, pooling) on top of each other- the output of one layer is the input of the next! (This works because images, filtered or not, are represented as arrays of numbers 🔢). Stacking multiple layers multiple times is called deep stacking.
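Here is a small sketch of what stacking looks like, assuming one hypothetical feature per layer to keep it short (a real CNN uses many features per layer and learns them). The random 12x12 image and all function names are illustrative only.

```python
import numpy as np

def convolve(image, feature):
    """Average match score of the feature at every position (no padding)."""
    fh, fw = feature.shape
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.mean(image[r:r + fh, c:c + fw] * feature)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, window=2, stride=2):
    out_h = (x.shape[0] - window) // stride + 1
    out_w = (x.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + window, c:c + window].max()
    return out

def block(image, feature):
    """Convolution, then ReLU, then pooling: output is just another array."""
    return max_pool(relu(convolve(image, feature)))

rng = np.random.default_rng(0)
image = rng.choice([-1.0, 1.0], size=(12, 12))   # made-up +1/-1 image
feature = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])

once = block(image, feature)    # 12x12 -> conv 10x10 -> pool 5x5
twice = block(once, feature)    # 5x5 -> conv 3x3 -> pool 1x1
```

Because every block takes an array in and gives an array out, the second call can consume the first call's output directly.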

In the end, add a final output layer- a list of all the possible answers that we expect to get out of the classifier.

Let’s take a look at an example:

-propagate to the input layer

-propagate to the 1st layer

  • the neuron on the very top combines 2 input neurons, one light and one dark, so it adds +1 and -1, getting a sum of zero (which is why it’s grey)
  • bottom two- each sums one input that is negative and one that is positive, but it’s connected to one by a negative weight and to the other by a positive weight. Its weighted inputs are -1 and -1, so it shows the opposite of its receptive field

-move to 2nd layer

  • zeros stay zero: 0 + 0 = 0
  • two negative inputs connected by positive weights still give a negative sum

-3rd layer:

  • bottom- because it’s a negative value, the ReLU turns it into 0, so it’s a grey node in the final output layer
  • very bottom: it’s connected by a negative weight to the previous layer, so it becomes positive, which gives it the maximum value- meaning the only output that is nonzero is “horizontal”

TL;DR: how do neural networks work?

-start with each pixel value of the shrunken images 🏙️ ➡️ 🏠 (city → house; original image → shrunken image)

-turn the input pixels into a list of numbers (input vector) 🔢

-now, build a neuron- take all the inputs and add them up

-then attach a weight between -1 and +1 to each input. After multiplying each input pixel by its corresponding weight and adding those values up, you get a weighted sum. ✖️➕

-apply an activation function (such as ReLU)- this is going to be done a lot of times, so a squashing function that keeps the answer between -1 and 1 stops it from growing numerically large (ReLU instead just zeroes out negative values)

-many of those neurons (weighted sum + activation function) can be made with different weights, and this collection of neurons makes up a ‘layer’

-Finally, connect the neurons again to a final output layer- a list of all the possible answers that we expect to get out of the classifier 🎊
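The neuron and layer steps above can be sketched in a few lines of NumPy. This assumes tanh as the activation, since it keeps every answer between -1 and 1 (ReLU could be swapped in). The input vector and the weights are hand-picked for illustration, not learned.

```python
import numpy as np

def neuron(inputs, weights):
    """Weighted sum of the inputs, squashed to the range -1..1 by tanh."""
    weighted_sum = np.dot(inputs, weights)
    return np.tanh(weighted_sum)

def layer(inputs, weight_matrix):
    """A layer is many neurons; each row of weight_matrix is one neuron."""
    return np.tanh(weight_matrix @ inputs)

# Hypothetical 2x2 image flattened into an input vector
inputs = np.array([1.0, -1.0, 1.0, -1.0])

# Made-up weights for three neurons, all between -1 and +1
weights = np.array([
    [1.0, -1.0,  1.0, -1.0],   # lines up with the input: sum is 4
    [1.0,  1.0, -1.0, -1.0],   # cancels out: sum is 0
    [0.5,  0.5,  0.5,  0.5],   # cancels out: sum is 0
])

outputs = layer(inputs, weights)
first = neuron(inputs, weights[0])
```

The first neuron's weights match the input pattern, so it fires close to 1, while the other two land on 0.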


How do you even get the weights? 🤨 and….

Where did the filters come from? 🤔

Stay tuned for an article about optimization! 😉


There are several types of activation functions; learn more about those here