Creating a Baseline Model for MNIST Dataset

Sajalsaini
9 min read · Jan 31, 2022

This notebook is based on the fast.ai course. It builds a baseline model for a small sample of the MNIST dataset that contains only images of handwritten 7s and 3s. The aim is to give a clearer picture of how a baseline ML model can be created from scratch. The model built here uses simple maths on pixel values and nothing more.

I wrote this up as a blog post, following the advice on blogging given in the course.

Let’s begin by importing the fastbook library.

pip install fastbook
import fastbook
from fastai.vision.all import *
from fastai import *

Let’s now retrieve the MNIST data and have a look at it.

path = untar_data(URLs.MNIST_SAMPLE)
(path/"train").ls()

Output:

(#2) [Path('/root/.fastai/data/mnist_sample/train/7'),Path('/root/.fastai/data/mnist_sample/train/3')]

Here, we can see that the training folder contains two folders. One for the images of 7s and one with images of 3s.

Now we inspect the images in the training folder. We collect the image paths from each folder into two lists, threes and sevens, and call sorted() so that the paths are ordered by file name.

threes = (path/"train"/"3").ls().sorted()
sevens = (path/"train"/"7").ls().sorted()

We take the path of one image and open it with the “Image” class from the Python Imaging Library (PIL).

im3_path = threes[1]
im3 = Image.open(im3_path)
im3

Output:

Output of the above code

An image on a computer is nothing but numbers. To display that, we’ll have to convert the image into an array.

array(im3)

Output:

Output of the above code

From this we can really see how numbers make up an image.

We can display the image as a combination of numbers by using PyTorch Tensors.

tensor(im3)
Output of the above code

You may be wondering why we use a tensor and not an array. The two behave very similarly here, so let’s look at the main differences.

  • Calculations on PyTorch tensors can run on a GPU, whereas calculations on NumPy arrays run on the CPU.
  • For large amounts of data, running on a GPU can make a tensor calculation much faster than the same calculation on an array.
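As a minimal sketch of the point above, assuming PyTorch is installed: the pixel values and variable names below are made up for illustration, and the code falls back to the CPU when no GPU is visible.

```python
import torch

# A tiny made-up 2x2 "image" of pixel intensities.
pixels = torch.tensor([[0, 128], [255, 64]])

# Use a GPU if PyTorch can see one, otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
pixels_on_device = pixels.to(device)

# The same arithmetic runs on whichever device holds the tensor.
scaled = pixels_on_device.float() / 255
print(scaled.device.type, scaled.max().item())
```

The code is identical on both devices; only the `.to(device)` call decides where the work happens.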

Now we can use Pandas to color code the tensor displayed above, so as to really see how numbers make up an image.

pip install pandas
im_3t = tensor(im3)
df = pd.DataFrame(im_3t)
df.style.background_gradient("Greys")

Here we can clearly see that the numbers range from 0 to 255. With the "Greys" gradient, values close to 0 are lighter in shade, and values at the other end of the range approach black.

Each image is 28 by 28 pixels (rows and columns are indexed from 0 to 27). Thus, each image has 784 pixels.

Baseline Model

Basic Idea

This baseline model is based on pixel similarity. We’ll arrange all of the images of threes in a vertical stack and find the average value of each pixel position, then do the same for the sevens. This gives us one “average 3” and one “average 7”. When the model predicts a digit from the “validation set”, it compares that image’s pixels with both averages and predicts whichever digit’s average the image is closer to.
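The idea can be sketched in plain Python on toy data before touching the real images. Everything here is made up for illustration: the 2x2 "images", the helper names, and the choice of the average absolute pixel difference as the distance.

```python
# Toy 2x2 "images" as nested lists, values already scaled to 0..1.
toy_threes = [[[1.0, 0.0], [1.0, 0.0]],
              [[0.8, 0.2], [1.0, 0.0]]]
toy_sevens = [[[0.0, 1.0], [0.0, 1.0]],
              [[0.2, 1.0], [0.0, 0.8]]]

def mean_image(images):
    """Average each pixel position across a list of same-sized images."""
    n = len(images)
    return [[sum(img[r][c] for img in images) / n
             for c in range(len(images[0][0]))]
            for r in range(len(images[0]))]

def distance(a, b):
    """Average absolute difference between two same-sized images."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b)
             for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

avg_three, avg_seven = mean_image(toy_threes), mean_image(toy_sevens)

def predict(image):
    """Return 3 or 7, whichever average image is closer."""
    return 3 if distance(image, avg_three) < distance(image, avg_seven) else 7

new_image = [[0.9, 0.1], [0.9, 0.1]]
print(predict(new_image))
```

The rest of the notebook does the same thing with tensors, which lets one call handle thousands of images at once.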

What are Baseline Models and why are they important?

A baseline model is a simple model that you build so that you have something to compare your fancier models against. Keep the following qualities in mind while creating one:

  • It should be easy to implement.
  • It should be easy to test, so that you can quickly check whether your new models beat it.

A good way to find a very easy-to-implement model is to think about the problem itself and to read up on solutions provided by other people.

Model Creation Step

The very first step is to get the average pixel values for both groups, the sevens and the threes. To do that, we first load every training image as a tensor.

seven_tensors = [tensor(Image.open(i)) for i in sevens]
three_tensors = [tensor(Image.open(i)) for i in threes]
len(three_tensors), len(seven_tensors)

Output:

(6131, 6265)

fastai provides a function called “show_image()” that displays a tensor directly as an image.

show_image(three_tensors[1]);
Image of the tensor

Now comes the part of computing the average of the values over each pixel position.

To achieve this task, we’ll stack all of the three_tensors on top of each other to create a cuboid. This stacked tensor is called a “Rank-3 Tensor”. The image below will provide a very good idea of how a rank-3 tensor looks.

The “rank” of a tensor is its number of axes or dimensions. The “shape” is the size of each axis of the tensor.

Image of a rank-3 tensor

Some PyTorch operations, such as taking a mean, require floating-point values, so we convert the pixel values from integers to floats. A basic rule of thumb when dealing with float values in images is to scale them to lie between 0 and 1, which is why we divide by 255.

stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255
stacked_threes.shape

Output:

torch.Size([6131, 28, 28])

From this shape we can see that all the tensors of threes are stacked along a new first axis. This rank-3 tensor holds 6131 images, each 28 by 28 pixels.

As stated earlier, the rank of a tensor is its number of dimensions. We can verify the rank of the tensor with the help of "ndim".

stacked_threes.ndim

Output:

3
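To make rank and shape concrete on something small, here is a sketch assuming PyTorch is installed. The three 2x2 "images" are made up and stand in for the real 28 by 28 digits.

```python
import torch

# Three made-up 2x2 "images" with integer pixel values.
images = [torch.tensor([[0, 255], [255, 0]], dtype=torch.uint8),
          torch.tensor([[255, 255], [0, 0]], dtype=torch.uint8),
          torch.tensor([[0, 0], [255, 255]], dtype=torch.uint8)]

# torch.stack adds a new leading axis with one slot per image.
# Converting to float and dividing by 255 scales values to 0..1.
stacked = torch.stack(images).float() / 255

print(stacked.shape)   # the size of each axis
print(stacked.ndim)    # the number of axes, i.e. the rank
print(len(stacked.shape) == stacked.ndim)
```

The rank is simply the length of the shape, so a stack of 2D images always has rank 3.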

Let’s now create the ideal image.

mean3 = stacked_threes.mean(0)
show_image(mean3)
Image of the ideal 3
mean7 = stacked_sevens.mean(0)
show_image(mean7)
Image of the ideal 7

Now we have to measure how far a given 3 in the dataset is from our average 3.

One way to do this is to take the difference at each pixel and then add the differences up. But we can’t do exactly that, because some differences will be negative and some positive, and these cancel out when added.

But why combine the differences at all? We want a single number that tells us how well our average 3 represents an image in the dataset, so that we can make accurate predictions about new data. If we only looked at the difference for each pixel separately, we wouldn’t be able to tell whether the average image is a good match overall. Hence we combine the per-pixel differences into one distance.

Now, if we can’t simply add up the raw differences, then what do we do?

There are two conventional methods used for this purpose. One is the Mean Absolute Difference (MAD), a.k.a. the L1 norm, and the second is the Root Mean Squared Error (RMSE), a.k.a. the L2 norm.

In the L1 norm we take the mean of the absolute values of the differences.

In the L2 norm we first square the differences, then take their mean, and then take the square root of that mean.
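A small plain-Python sketch shows both the cancellation problem and the two fixes. The four pixel values are made up for illustration.

```python
import math

# Made-up pixel values (scaled to 0..1) for one image and the average image.
image = [0.0, 1.0, 0.5, 0.5]
average = [0.5, 0.5, 0.5, 0.5]

diffs = [x - m for x, m in zip(image, average)]   # [-0.5, 0.5, 0.0, 0.0]

# Positive and negative differences cancel, hiding the mismatch.
signed_mean = sum(diffs) / len(diffs)

# L1: mean of the absolute differences.
mad = sum(abs(d) for d in diffs) / len(diffs)

# L2: square, take the mean, then take the square root.
rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))

print(signed_mean, mad, round(rmse, 4))
```

The signed mean comes out as 0 even though two pixels are clearly off, while MAD and RMSE both report a non-zero distance. Squaring weights large differences more heavily, so RMSE is never smaller than MAD.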

In the images below, the word “error” can be read as “difference”.

Below I am adding the formulae for MAD and RMSE, since some people find these ideas easier to understand with the formulae in front of them.

Mean Absolute Difference

Formula for Mean Absolute Error

Root Mean Squared Error

Formula for Root Mean Squared Error

Image to help understand L1 and L2 Norm
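For readers who prefer text to images, the two formulae can be written out as follows. The notation is an assumption made here for illustration: x_i is the i-th pixel of the image being compared, m_i is the matching pixel of the average image, and n is the number of pixels (784 for a 28 by 28 image).

```latex
\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - m_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - m_i)^2}
```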

Calculating MAD and RMSE

a_3 = stacked_threes[1]
show_image(a_3);
dist_3_abs = (a_3 - mean3).abs().mean() #MAD
dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt() #RMSE
dist_3_abs, dist_3_sqr

Output:

(tensor(0.1114), tensor(0.2021))

For the 7s,

dist_7_abs = (a_3 - mean7).abs().mean()
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
dist_7_abs, dist_7_sqr

Output:

(tensor(0.1586), tensor(0.3021))

Here we can see that, for both MAD and RMSE, the distance to the average 3 is smaller than the distance to the average 7. A smaller distance means less error, so the model predicts that the image is a 3. This is the basic working of our model. We now just have to figure out a way to do this for all of the images in the validation set at once.

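Also, we don't have to write the formulae for MAD and RMSE by hand. PyTorch provides both as loss functions in torch.nn.functional, which is available under the name F in fastai. Note that mse_loss returns the mean squared error, so we still take the square root to get RMSE.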

F.l1_loss(a_3.float(), mean7), F.mse_loss(a_3, mean7).sqrt()

Output:

(tensor(0.1586), tensor(0.3021))
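As a quick check that the built-in functions match the hand-written versions, here is a sketch on made-up 2x2 tensors, assuming PyTorch is installed.

```python
import torch
import torch.nn.functional as F

# Made-up 2x2 "image" and "average image", values in 0..1.
image = torch.tensor([[0.0, 1.0], [0.5, 0.5]])
average = torch.tensor([[0.5, 0.5], [0.5, 0.5]])

# Hand-written versions, in the same style as the earlier cells.
mad_manual = (image - average).abs().mean()
rmse_manual = ((image - average) ** 2).mean().sqrt()

# Built-in versions. mse_loss returns the mean squared error,
# so a square root is still needed to get RMSE.
mad_builtin = F.l1_loss(image, average)
rmse_builtin = F.mse_loss(image, average).sqrt()

print(torch.isclose(mad_manual, mad_builtin).item(),
      torch.isclose(rmse_manual, rmse_builtin).item())
```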

Now we will use the validation set to measure the performance of the model. Let's first create the tensors for the validation 3s and 7s.

valid_3_tensor = torch.stack([tensor(Image.open(i)) for i in (path/"valid"/"3").ls()])
valid_3_tensor = valid_3_tensor.float()/255
show_image(valid_3_tensor[1])
2nd image in the valid set
valid_7_tensor = torch.stack([tensor(Image.open(i)) for i in (path/"valid"/"7").ls()])
valid_7_tensor = valid_7_tensor.float()/255
show_image(valid_7_tensor[0]);
1st image in the valid set
valid_3_tensor.shape, valid_7_tensor.shape

Output:

(torch.Size([1010, 28, 28]), torch.Size([1028, 28, 28]))

Next, we create a function called “mnist_dist” to calculate the distance between the ideal image and an arbitrary image from the validation set.

def mnist_dist(a,b):
    return (a-b).abs().mean((-1,-2))

The tuple (-1,-2) names the axes to average over. In Python, -1 refers to the last element and -2 refers to the second-to-last. So this tells PyTorch to take the mean over the last two axes of the tensor, which are the horizontal and vertical dimensions of an image. For a single image this leaves just one number. For a stack of images it leaves only the first axis, which indexes over the images, so applying it to the 1010 validation 3s gives a result of size 1010. In other words, for every image, we average the absolute difference over all the pixels in that image.
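The effect of averaging over the last two axes is easiest to see on a tiny stack. This is a sketch with made-up values, assuming PyTorch is installed.

```python
import torch

# A made-up stack of 3 "images", each 2x2, so the shape is (3, 2, 2).
stack = torch.tensor([[[0.0, 0.0], [0.0, 0.0]],
                      [[1.0, 1.0], [1.0, 1.0]],
                      [[0.0, 1.0], [0.0, 1.0]]])

# Averaging over the last two axes collapses each image to one number.
per_image = stack.mean((-1, -2))

# Averaging over every axis collapses the whole stack to one number.
overall = stack.mean()

print(per_image.shape, per_image.tolist(), overall.item())
```

With `(-1, -2)` the image axis survives, which is what lets one function call return a separate distance for every image.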

mnist_dist(a_3, mean3)

Output:

tensor(0.1114)

Next, we compute the distance between every image in the validation set of 3s and the ideal 3.

valid_3_dist = mnist_dist(valid_3_tensor, mean3)
valid_3_dist.shape

Output:

torch.Size([1010])

valid_3_dist

Output:

tensor([0.1117, 0.1295, 0.1168,  ..., 0.1506, 0.1380, 0.1483])

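What just happened, you might ask? We passed a rank-3 tensor of 1010 images and a single rank-2 ideal image to the same function, and PyTorch handled the mismatch with a feature called broadcasting.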

PyTorch will automatically expand the tensor with the smaller rank to have the same size as the one with the larger rank. Broadcasting is an important capability that makes tensor code much easier to write.

After broadcasting so the two argument tensors have the same rank, PyTorch applies its usual logic for two tensors of the same rank: it performs the operation on each corresponding element of the two tensors, and returns the tensor result.
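Broadcasting can be sketched on made-up tensors as follows, assuming PyTorch is installed. The explicit `expand` call is included only to show what broadcasting does implicitly.

```python
import torch

# Made-up data: a stack of 3 "images" (rank 3) and one "average image" (rank 2).
stack = torch.tensor([[[0.0, 0.0], [0.0, 0.0]],
                      [[1.0, 1.0], [1.0, 1.0]],
                      [[0.0, 1.0], [0.0, 1.0]]])
average = torch.tensor([[0.0, 1.0], [0.0, 1.0]])

# Broadcasting treats `average` as if it were repeated once per image.
diff = stack - average

# The explicit equivalent: expand `average` to the stack's shape first.
diff_explicit = stack - average.expand(3, 2, 2)

print(diff.shape, torch.equal(diff, diff_explicit))
print(diff.abs().mean((-1, -2)))   # one distance per image
```

The third toy image is identical to the average, so its distance is 0, while the other two each differ in half of their pixels.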

Creating a function to tell us if the image we’ve given the model is closer to a 3 or a 7.

def is_3(x):
    return mnist_dist(x, mean3) < mnist_dist(x, mean7)

is_3(a_3)

Output:

tensor(True)

Testing the function “is_3” on our validation set of 3s

is_3(valid_3_tensor)

Output:

tensor([ True, False,  True,  ...,  True,  True,  True])

Calculating the accuracy of our model in predicting 3s and 7s on the validation set.

accuracy_3s = is_3(valid_3_tensor).float().mean()
accuracy_7s = (1 - is_3(valid_7_tensor).float()).mean()
accuracy_3s, accuracy_7s

Output:

(tensor(0.9168), tensor(0.9854))
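The accuracy calculation relies on booleans counting as 1 and 0, so their mean is the fraction of True answers. Here is the same idea in plain Python, with made-up predictions and hypothetical variable names.

```python
# Made-up answers from an "is it a 3?" classifier. True means it answered "3".
preds_on_threes = [True, True, False, True]            # these really are 3s
preds_on_sevens = [False, False, True, False, False]   # these really are 7s

# Booleans count as 1 and 0, so the mean is the fraction of True answers.
accuracy_threes = sum(preds_on_threes) / len(preds_on_threes)

# For 7s a correct answer is False, so flip each prediction first.
accuracy_sevens = sum(not p for p in preds_on_sevens) / len(preds_on_sevens)

# Overall accuracy counts correct answers across both classes.
correct = sum(preds_on_threes) + sum(not p for p in preds_on_sevens)
overall = correct / (len(preds_on_threes) + len(preds_on_sevens))

print(accuracy_threes, accuracy_sevens, round(overall, 4))
```

The flip for the 7s plays the same role as the `1 - is_3(...)` step in the tensor version.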

The baseline model performs very well on both digits: about 92% accuracy on the 3s and over 98% on the 7s. But let’s be honest, 3s and 7s look quite different from each other, and only 2 of the 10 digits are being compared here. Still, this works pretty well for a baseline model.

I hope that you liked this baseline model for MNIST and that it has given you an understanding of what the inside of a model looks like.

You can visit my GitHub to have a look at more beginner-friendly ML, DL and Analytics projects.

✌️
