
Building Your First Neural Net From Scratch With PyTorch

PyTorch is such a great framework for deep learning that you needn’t be afraid to stray off the beaten path of pre-made networks and higher-level libraries like fastai.

The aim of this article is to give you the intuition, knowledge, and confidence to build a neural network from the ground up.

One fact about deep learning is that it sucks: it’s basically a way to search, more or less at random, for an implementation of some arbitrary function. Another fact is that it’s awesome and does some incredible things. This recent article by Gary Marcus inspired me to go about combining deep learning with atomic operations; it’s worth a read.

One huge thing I took away from taking the fastai course, first online and then in person at USF, is that deep learning is super-experimental, progresses incrementally, and doesn’t really have a sound theoretical basis. Well, it does, but the last five or ten years of improvements — everything that’s made DL sexy today — have been basically trial and error. I also learned how easy PyTorch makes it to do experimentation.

So let’s get involved! While you can probably already build a world class Imagenet classifier or Nietzsche generator, in this article I wanted to go back to basics. Let’s look first at coding up the simplest possible neural network: a Perceptron. (* For the purists: obviously that is not really a network since there is only one neuron, but we’ll get there!)


Yer Basic Perceptron

I’m going to assume that you know what a neuron is. I mean, in biology, yeah it’s some part of the brain, errr, we have a gazillion of them, something like that? In comp.sci terms: a unit that takes weighted inputs, sums them, and produces some output; the more useful ones apply a non-linear function to that sum before producing the output.
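To make that concrete, here’s a minimal sketch of a neuron in plain Python (the function name and signature are mine, for illustration, not part of any library):

```python
def neuron(inputs, weights, bias, activation=None):
    # a neuron: weighted sum of its inputs, plus a bias term
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    # the more useful neurons run the sum through a non-linearity
    return activation(total) if activation else total

# a purely linear neuron computing 3*x + 1
print(neuron([2.0], [3.0], 1.0))  # 7.0
```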

Now, I will also assume you have PyTorch installed, along with Python, and ideally Jupyter Notebook. If not, Google is your friend; PyTorch even works on Windows 10 now too, which is cool. Import the requirements:


import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

And define a network, just a single linear neuron.


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1, 1)  # one input, one output

    def forward(self, x):
        x = self.fc1(x)
        return x

It’s a fully connected layer with one input and one output. That layer (technically a neuron/weight combo) has a weight applied to its input; its output will be Ax + b ([weight x input] + bias), since it’s entirely linear, with no activation function.

Take a moment and look at just how dope that network definition is: pure Python (pretty much), and you only need to define how the input signal is processed, without worrying about backprop. That’s because of the autograd package that exports Variable, and a Variable keeps track of gradients. As you know, keeping track of gradients matters because without them gradient descent — measure how wrong our network was for a given set of inputs, compute the gradient of that error with respect to each weight, and nudge each weight in the direction that shrinks the error, scaled by its contribution — wouldn’t work.
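As a rough sketch of what autograd spares you from doing by hand, here’s one manually derived gradient-descent step for a single linear neuron (the numbers and names are made up for illustration):

```python
# One hand-derived gradient-descent step for y = w*x + b
# with squared-error loss (target - pred)**2.
w, b = 0.5, 0.0
x, target = 2.0, 6.0   # we'd like the neuron to learn y = 3x
lr = 0.1

pred = w * x + b
# chain rule on the loss:
grad_w = 2 * (pred - target) * x   # dLoss/dw
grad_b = 2 * (pred - target)       # dLoss/db
w -= lr * grad_w                   # step against the gradient
b -= lr * grad_b
print(w, b)  # 2.5 1.0
```

In PyTorch, `loss.backward()` computes those gradients for every parameter automatically, and the optimizer applies the update.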

I probably don’t need to specify this, but a Variable wraps a tensor: an array generalized to any number of dimensions (a 0-d tensor is a scalar, 1-d a vector, 2-d a matrix, and so on). Go ahead and create a network according to how you specified it, and take a look inside:


net = Net()
print(net)
Net (
  (fc1): Linear (1 -> 1)
)

It’s possible then to take a look at the parameters of the network. Parameters are optimized automatically during training; hyperparameters such as learning rate still require tuning by a human (so far, at least).


print(list(net.parameters()))
[Parameter containing:
1.00000e-02 *
 -6.6961
[torch.FloatTensor of size 1x1]
, Parameter containing:
-0.4478
[torch.FloatTensor of size 1]
]

OK, so our network was initialized with a random weight of about -0.067 and a bias of -0.4478. Let’s see how to run this network.


input = Variable(torch.randn(1,1,1), requires_grad=True)
print(input)
Variable containing:
(0 ,.,.) = 
 -0.5085
[torch.FloatTensor of size 1x1x1]

Yeah, we just created a random number with PyTorch. It’s a 1x1x1 tensor: a single value wrapped in three dimensions (alternatively, torch.FloatTensor([[[1]]]) would create a tensor of the same shape holding the value 1). Setting requires_grad tells autograd to track gradients with respect to it. Then you can chuck this number through the unlearned network:


out = net(input)
print(out)
Variable containing:
(0 ,.,.) = 
 -0.4138
[torch.FloatTensor of size 1x1x1]

Well, that makes sense if you think about it: (-6.6961e-02 * -0.5085) + -0.4478 = -0.4138. Next up, define a loss function and an optimizer using stochastic gradient descent:


import torch.optim as optim
def criterion(out, label):
    return (label - out)**2
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.5)

In this case, we’ve defined our own loss function: squared error. Squaring gives the magnitude of the error regardless of sign; the gradient itself says whether to move a parameter up or down. For each training example in a randomly shuffled set, SGD — behind the scenes with the magic of PyTorch — adjusts each available parameter based on how it affected the gradient of the error (chain rule shizz, scaled by learning rate and momentum, which are chosen somewhat arbitrarily), backpropagating gradients and updates through the network.
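The per-parameter update SGD with momentum performs can be sketched in plain Python like this (a simplified illustration of the classical momentum formulation; the names are mine, not PyTorch’s):

```python
lr, momentum = 0.01, 0.5

def sgd_step(param, grad, velocity):
    # velocity is a decaying running sum of past gradients
    velocity = momentum * velocity + grad
    # step the parameter against the accumulated direction
    return param - lr * velocity, velocity

w, v = sgd_step(3.0, 2.0, 0.0)  # first step: behaves like plain SGD
w, v = sgd_step(w, 2.0, v)      # second step: the past gradient adds a push
print(w, v)
```

With momentum=0.5, repeated gradients in the same direction build up speed, which is why it often converges faster than plain SGD.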

Then, define a training dataset. For this article, we’re just going to teach the network how to treble a number: our goal for the single perceptron of Ax + b will be that A=3 and b=0. A simple training dataset (who says neural nets need big data!?) is:


data = [(1,3), (2,6), (3,9), (4,12), (5,15), (6,18)]

Then, the training loop looks like:


for epoch in range(100):
    for i, data2 in enumerate(data):
        X, Y = data2
        X = Variable(torch.FloatTensor([X]), requires_grad=True)
        Y = Variable(torch.FloatTensor([Y]), requires_grad=False)
        optimizer.zero_grad()   # clear gradients from the last step
        outputs = net(X)
        loss = criterion(outputs, Y)
        loss.backward()         # backpropagate the error
        optimizer.step()        # update the weights
    if epoch % 10 == 0:
        print("Epoch {} - loss: {}".format(epoch, loss.data[0]))

Loss begins to converge on zero after a while:


Epoch 0 - loss: 1.0141230821609497 
Epoch 10 - loss: 0.022670941427350044 
Epoch 20 - loss: 0.007926558144390583
...

Did we get to Ax + b (3x + 0)? Almost:


print(list(net.parameters()))
[Parameter containing:
 2.9985
[torch.FloatTensor of size 1x1]
, Parameter containing:
1.00000e-03 *
  8.3186
[torch.FloatTensor of size 1]
]

How about making a prediction?


print(net(Variable(torch.Tensor([[[1]]]))))
Variable containing:
(0 ,.,.) = 
  3.0068
[torch.FloatTensor of size 1x1x1]

Close enough. To a point Gary Marcus makes in his article linked above: the network doesn’t know that it should extrapolate to integers, but I think this is a good result for something with vastly fewer brain cells than a nematode.


Stop being shallow: a Multi-Layer Perceptron

The exact same code still works for a two-layer (or deeper) network. Just change the way the network is built: the number of outputs of one layer has to match the number of inputs to the next:


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1, 10)   # 1 input -> 10 hidden units
        self.fc2 = nn.Linear(10, 1)   # 10 hidden units -> 1 output

    def forward(self, x):
        x = self.fc2(self.fc1(x))
        return x

This satisfies the definition for deep learning from Wikipedia!
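One caveat worth flagging: with no activation function between them, two stacked linear layers are mathematically equivalent to a single linear layer. A tiny one-dimensional check makes this obvious:

```python
# W2*(W1*x + b1) + b2  ==  (W2*W1)*x + (W2*b1 + b2)
w1, b1 = 2.0, 1.0
w2, b2 = 3.0, -4.0
x = 5.0

stacked = w2 * (w1 * x + b1) + b2
collapsed = (w2 * w1) * x + (w2 * b1 + b2)
print(stacked == collapsed)  # True
```

The extra depth only buys you anything once you put a non-linearity between the layers, which is exactly where we’re headed next.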

Avoiding Spectres and Meltdowns: Going GPU

It’s remarkably easy with PyTorch to shift computation to the GPU, assuming you can afford one in these times of DDR shortages and crypto mining. Just shift the network, and variables, to the GPU with cuda():


net = Net()
net.cuda()

And inside the training loop:


X = Variable(torch.FloatTensor([X]), requires_grad=True).cuda()
Y = Variable(torch.FloatTensor([Y])).cuda()

Non-Linearity and Predefined Loss Functions

Neural nets only work because each neuron has some non-linearity. What boggles my mind, as someone who grew up on sigmoid and tanh, is that the go-to non-linearity these days is ReLU, the Rectified Linear Unit: if the neuron’s sum is negative, set it to zero; otherwise pass it through unchanged.
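In code, ReLU is almost embarrassingly simple (a plain-Python sketch of one scalar; PyTorch’s F.relu does the same thing element-wise over a whole tensor):

```python
def relu(x):
    # zero out negative sums, pass positive ones through unchanged
    return max(0.0, x)

print(relu(-1.7), relu(2.3))  # 0.0 2.3
```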

You can happily rebuild the network with ReLU, and it’ll train just the same. Here are the relevant changes:


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1, 1)
        self.fc2 = nn.Linear(1, 1)

    def forward(self, x):
        # ReLU after each layer; note the final ReLU clamps the output
        # to be non-negative, which is fine here since our targets are positive
        x = F.relu(self.fc2(F.relu(self.fc1(x))))
        return x

criterion = nn.MSELoss()
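For the curious, nn.MSELoss averages the squared errors over the elements it’s given; a plain-Python equivalent might look like this (function name is mine, for illustration):

```python
def mse(preds, targets):
    # mean of squared differences, as nn.MSELoss computes by default
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

print(mse([2.5, 0.0], [3.0, 0.0]))  # 0.125
```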

That Was Easy

I hope this article gave you a taste of just how easy it is to experiment with deep learning using PyTorch, and maybe inspired you to go out and build some weird stuff.

The full code for the final example is on my GitHub.

In Part 2 of this article, I’ll craft bespoke neurons and create a new learning method that’s not boring old gradient descent with backprop.
