Thursday, July 28, 2016

Python console progress bar (using \b and \r)

One of the things that really fascinated me about console applications is seeing text being changed on the same line without new lines being added. An example would be a progress bar or some percentage being updated on the same line like this:



The idea is to use ASCII control characters which control the console's behaviour. These control characters were most useful back when computer displays were typewriter-like terminals that printed out the text a program wanted to show, but they are still useful in console applications today.

First of all, the code shown below only works in the command prompt / terminal and not in IDLE. To try it out, open the command prompt (if using Windows) or terminal (if using Linux) and enter the command "python", which will open the interactive Python interpreter. Alternatively you can write a script file and then call it through the command prompt / terminal by entering the command "python <script file>" such as "python C:\file.py".

Control characters

There are two control characters of interest here:

\b moves the cursor one character back, which will print the following character over the previous character. This is useful for overwriting the characters at the end of the line.
print('xy\bz')



\r moves the cursor to the beginning of the line, which will print the following character over the first character. This is useful for overwriting the characters at the beginning of the line or the whole line.
print('xy\rz')



So how do we use these to display a functioning progress bar? First of all, if we're using \b then we'll probably need to overwrite a number of characters at the end, not just one. To do this, just print \b as many times as needed:
print('xyyy\b\b\bzzz')



Separate prints

We also probably want to use separate prints rather than put everything in the same print statement. To do this you need to make sure that print does not start a new line at the end. In Python 3 you can do this by passing the argument "end=''" to print:
def f():
    print('xy', end='')
    print('\bz')

f()



def f():
    print('xy', end='')
    print('\rzz')

f()



Alternatively you can use the sys.stdout.write function which is equivalent to print without the extra features such as adding a new line at the end:
import sys

def f():
    sys.stdout.write('xy')
    sys.stdout.write('\bz\n')

f()



String formatting

When displaying a percentage it's important that the number always takes the same number of characters so that you know how many characters to overwrite. This can be done using the format method of strings. The format method allows you to make a string take a fixed number of characters, filling the remainder with spaces, as follows:
curr = 12
total = 21
frac = curr/total
print('[{:>7.2%}]'.format(frac))



The formatting code is "{:>7.2%}", where ">" means that the string is to be right aligned with spaces added to the front, "7" is the fixed length of the string including padding (3 whole number digits + a point + 2 decimal digits + a percentage sign), ".2" is the number of decimal places to round the percentage to, and "%" means that the fraction should be expressed as a percentage.
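Just to see the fixed width in action, here's a quick check with a few different fractions:

```python
for frac in (0.05, 0.5714, 1.0):
    s = '[{:>7.2%}]'.format(frac)
    print(s)  #always 9 characters wide, whatever the value
```

Because the width never changes, overwriting the old percentage with the new one always covers it completely.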

Replicated strings

When displaying a progress bar it is also useful to know that you can repeat a string a number of times using the '*' operator, as follows:
curr = 12
total = 21
full_progbar = 10
filled_progbar = round(curr/total*full_progbar)
print('#'*filled_progbar + '-'*(full_progbar-filled_progbar))



Full example

Here is a full progress bar example:

def progbar(curr, total, full_progbar):
    frac = curr/total
    filled_progbar = round(frac*full_progbar)
    print('\r', '#'*filled_progbar + '-'*(full_progbar-filled_progbar), '[{:>7.2%}]'.format(frac), end='')

for i in range(100000+1):
    progbar(i, 100000, 20)
print()



It might also be the case that the progress bar does not update consistently and seems to get stuck. This happens because the printed output is being buffered, so it needs to be "flushed" after every print. This is done by adding
sys.stdout.flush()
after every print (don't forget to import "sys" first).
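Putting everything together, here is the same progress bar again with the flush added after the print:

```python
import sys

def progbar(curr, total, full_progbar):
    frac = curr/total
    filled_progbar = round(frac*full_progbar)
    print('\r', '#'*filled_progbar + '-'*(full_progbar-filled_progbar), '[{:>7.2%}]'.format(frac), end='')
    sys.stdout.flush()  #force the updated bar to appear immediately

for i in range(1000+1):
    progbar(i, 1000, 20)
print()
```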

Tuesday, June 21, 2016

The Backpropagation Algorithm for Artificial Neural Networks (ANNs)

In a previous post, I talked about how to use the gradient descent algorithm to optimize the weights and biases of an artificial neural network in order to give selected outputs for selected inputs. However plain gradient descent is inefficient on large neural nets since it recomputes a lot of values for each neuron; it also has to be rewritten for every change in architecture (number of neurons and layers) and requires a lot of code. The standard solution is to use an optimized version of the gradient descent algorithm for neural nets called the backpropagation algorithm.

I will assume that you have read the previous post. Whereas the previous post was very specific, using a particular architecture and a particular activation and cost function, here I will keep things as general as possible so that they can be applied to any feed forward neural network. Here are the definitions of the symbols we shall be using, similar to last post's definitions:



In order to help explain the algorithm and equations, I shall be applying it to the last post's architecture, just to help you understand. So here is last post's neural net:



In the example, we have a 3 layer neural net with 2 neurons in each layer and 2 input neurons. It uses the sigmoid function as an activation function and the mean square error as the cost function.

The main point of interest in the backpropagation algorithm is the way you find the derivative of the cost function with respect to a weight or bias in any layer. Let's look at how to find these derivatives for each layer in the example neural net.

Weights and bias in layer L (output layer)

We begin by finding the derivatives for the output layer, which are the simplest.

d Cost / d Weight L
The derivative of the cost with respect to any weight in layer L can be found using



Here's how we can arrive at this equation based on a weight in layer 3 in the example:



d Cost / d Bias L

The derivative of the cost with respect to any bias in layer L can be found using



Here's how we can arrive at this equation based on a bias in layer 3 in the example:



Using delta

A part of the weights and biases equations is repeated. It will be convenient when finding derivatives of other layers to use an intermediate letter, called lower case delta, to represent this part of these equations.
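In symbols, this works out to the following (a sketch using this post's notation, where a' is the derivative of the activation function and the weight w^L_{ij} connects neuron i in layer L-1 to neuron j in layer L):

```latex
\delta^L_j = \frac{\partial C}{\partial n^L_j} \, a'(z^L_j)
\qquad
\frac{\partial C}{\partial w^L_{ij}} = n^{L-1}_i \, \delta^L_j
\qquad
\frac{\partial C}{\partial b^L_j} = \delta^L_j
```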



Weights and bias in layer L-1

Now we find the derivative with respect to the weights and biases in layer L-1. The derivations continue from the previous ones, so we won't repeat the first part again.

d Cost / d Weight L-1
The derivative of the cost with respect to any weight in layer L-1 can be found using



Here's how we can arrive at this equation based on a weight in layer 2 in the example:



d Cost / d Bias L-1

The derivative of the cost with respect to any bias in layer L-1 can be found using



Here's how we can arrive at this equation based on a bias in layer 2 in the example:



Using delta

These equations can be shortened by using delta again and now we can use the previous delta to shorten this one even more.



Notice how there is a recursive definition of delta. This is the basis for the backpropagation algorithm.

Weights and bias in layer L-2

Now we find the derivative with respect to the weights and biases in layer L-2. Like before, we won't be repeating the first bit of the derivations again.

d Cost / d Weight L-2
The derivative of the cost with respect to any weight in layer L-2 can be found using



Here's how we can arrive at this equation based on a weight in layer 1 in the example (notice that there is a trick we'll be using with the summations that is explained below):



At the end of the derivation we did a trick with the sigmas (summations) in order to turn the equation into a recursive one.

First, we moved the delta to the inside of the sigmas. This can be done because it's just a common term that is factored out and we're factoring it back in. For example if the equation is "𝛿(a + b + c)" then we can turn that into "𝛿a + 𝛿b + 𝛿c".

Second, we swapped the "p" summation with the "q" summation. This is valid because swapping two finite summations only changes the order in which the terms are added, which does not change the result.

Finally we replaced the "p" summation with the previous delta, allowing us to shorten the equation.

d Cost / d Bias L-2

The derivative of the cost with respect to any bias in layer L-2 can be found using



Here's how we can arrive at this equation based on a bias in layer 1 in the example:



Using delta

Again, we can shorten the equations using delta.



Notice that this is exactly the same as the previous delta. This is because beyond L-1, all the deltas will be identical and can be defined using a single recursive relation.

In general: using indices

Up to now we saw what the derivatives for layers L, L-1, and L-2 are. We can now generalize these equations for any layer. Given the recursive pattern in the deltas, we can create an equation that gives the deltas of any layer provided you have the deltas of the layer that follows it (which are computed first, since we work backwards through the network). The base case of the recursion is layer L, which is defined on its own.
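Written out (a sketch consistent with the code further down, where w^l_{ij} is the weight from neuron i in layer l-1 to neuron j in layer l), the generalized equations are:

```latex
\delta^L_j = \frac{\partial C}{\partial n^L_j} \, a'(z^L_j)
\qquad
\delta^l_j = \Big( \sum_p w^{l+1}_{jp} \, \delta^{l+1}_p \Big) \, a'(z^l_j)
\qquad
\frac{\partial C}{\partial w^l_{ij}} = n^{l-1}_i \, \delta^l_j
\qquad
\frac{\partial C}{\partial b^l_j} = \delta^l_j
```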



Now we have a way to find the derivatives for any layer, no matter how deep the neural network is. Here is how you'd use these equations on the example:



In general: using matrices

As things stand, there are a lot of index letters littering our equations (i, j, p). We can get rid of them, however, if we use matrix operations that work on all the indices at once. If we do this we can even get shorter code when programming the backpropagation algorithm by making use of a fast matrix library such as numpy.

Using matrices in neural nets

Let's look at how we can use matrices to compute the output of a neural net. A whole layer of activations can be calculated at once by treating the layer as a vector. We'll treat vectors as horizontal (row vectors) by default, transposing them whenever they need to be vertical.

Here's an example of what we want to calculate:


Of course, we're still using indices even though we're organising our activations in a vector. In order to get rid of the indices we need to treat the weights as a matrix, where the first index is the row number and the second index is the column number. That way we can have a single letter "w" which represents all of the weights of a layer. When the previous layer's activation vector is multiplied by this matrix, the matrix multiplication results in a vector of weighted sums, which can be added to a bias vector and passed through an activation function to compute the next vector of activations. Here is an example of the calculation based on the above example:
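Here's a quick numpy sketch of this calculation (the numbers are made up, just to show the shapes involved):

```python
import numpy as np

def a(z):
    '''Sigmoid activation function.'''
    return 1/(1 + np.exp(-z))

#made-up numbers for a layer with 2 inputs and 2 neurons
n_prev = np.array([0.5, 0.9])      #previous layer's activations (row vector)
w = np.array([[0.1, 0.4],          #w[i][j] is the weight from neuron i to neuron j
              [0.8, 0.6]])
b = np.array([0.3, 0.7])           #biases of the current layer

z = np.dot(n_prev, w) + b          #vector of weighted sums
n = a(z)                           #vector of activations of the current layer
print(n)
```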



Final clean generalized deltas and cost gradients

And now we can finally see what the clean matrix-based general derivatives are for any layer:
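As a sketch (row-vector activations as above, with ⊙ meaning element-wise multiplication), the matrix versions work out to:

```latex
\delta^L = \nabla_{n^L} C \odot a'(z^L)
\qquad
\delta^l = \big( \delta^{l+1} \, (w^{l+1})^\top \big) \odot a'(z^l)
\qquad
\frac{\partial C}{\partial w^l} = (n^{l-1})^\top \delta^l
\qquad
\frac{\partial C}{\partial b^l} = \delta^l
```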



The backpropagation algorithm

Finding the gradients is one step (the most complex one) of the backpropagation algorithm. The full backpropagation algorithm goes as follows:

  1. For each input-target pair in the training set,
    1. Compute the activations and the "z" of each layer when passing the input through the network. This is called a forward pass.
    2. Use the values collected from the forward pass to calculate the gradient of the cost with respect to each weight and bias, starting from the output layer and using the delta equations to compute the gradients for previous layers. This is called the backward pass.
  2. Following the backward pass, you have the gradient of the cost with respect to each weight and bias for each input in the training set. Add up all the corresponding gradients of each input in the training set (all the gradients with respect to the same weight or bias).
  3. Now use the gradient to perform gradient descent by multiplying the gradient by a step size, or learning rate as it's called in the backpropagation algorithm, and subtracting it from the corresponding weights and biases. In algebra,
    weight = weight - learningrate*dCost/dWeight

In code
And this is what the full program looks like in Python 3 using numpy to perform matrix operations. Again, we shall be training the neural net to perform the function of a half adder. A half adder adds together two binary digits and returns the sum and carry. So 0 + 1 in binary gives 1 carry 0 whilst 1 + 1 in binary gives 0 carry 1.

import math, random
import numpy as np

class ANN:
    '''General artificial neural network class.'''
    
    def __init__(self, num_in, num_hiddens, num_out, trainingset):
        '''Create a neural network with a set number of layers and neurons and with initialised weights and biases.'''
        if not all(len(x)==num_in and len(y)==num_out for (x,y) in trainingset.items()):
            raise ValueError()

        self._num_in = num_in
        self._num_hiddens = num_hiddens
        self._num_out = num_out
        self._trainingset = trainingset

        self._num_layers = len(num_hiddens) + 1
        
        self._weights = dict([
                (1, np.array([
                    [self.weight_init(1, i, j) for j in range(num_hiddens[0])]
                    for i in range(num_in)
                ]))
            ]
            + [
                (l+1, np.array([
                    [self.weight_init(l+1, i, j) for j in range(num_hiddens[l])]
                    for i in range(num_hiddens[l-1])
                ]))
                for l in range(1, len(num_hiddens))
            ]
            + [
                (self._num_layers, np.array([
                    [self.weight_init(self._num_layers, i, j) for j in range(num_out)]
                    for i in range(num_hiddens[-1])
                ]))
            ])
        
        self._biases = dict([
                (l+1, np.array([
                    self.bias_init(l+1, j) for j in range(num_hiddens[l])
                ]))
                for l in range(len(num_hiddens))
            ]
            + [
                (self._num_layers, np.array([
                    self.bias_init(self._num_layers, j) for j in range(num_out)
                ]))
            ])

    def weight_init(self, layer, i, j):
        '''How to initialise weights.'''
        return 2*random.random()-1

    def bias_init(self, layer, j):
        '''How to initialise biases.'''
        return 2*random.random()-1
        
    def a(self, z):
        '''The activation function.'''
        return 1/(1 + np.exp(-z)) #np.exp is already applied element-wise to arrays

    def da_dz(self, z):
        '''The derivative of the activation function.'''
        return self.a(z)*(1-self.a(z))

    def out(self, x):
        '''Compute the output of the neural network given an input vector.'''
        n = x
        for l in range(1, self._num_layers+1):
            n = self.a(np.dot(n, self._weights[l]) + self._biases[l])
        return n

    def cost(self):
        '''The cost function based on the training set.'''
        return sum(sum((self.out(x) - t)**2) for (x,t) in self._trainingset.items())/(2*len(self._trainingset))

    def dCxt_dnL(self, x, t):
        '''The derivative of the cost function for a particular input-target pair with respect to the output of the network.'''
        return self.out(x) - t

    def get_cost_gradients(self):
        '''Calculate and return the gradient of the cost function with respect to each weight and bias in the network.'''
        dC_dws = dict()
        dC_dbs = dict()
        for l in range(1, self._num_layers+1):
            dC_dws[l] = np.zeros_like(self._weights[l])
            dC_dbs[l] = np.zeros_like(self._biases[l])
        
        for (x,t) in self._trainingset.items():
            #forward pass
            zs = dict()
            ns = dict()
            ns_T = dict()
            ns[0] = np.array(x)
            ns_T[0] = np.array([np.array(x)]).T
            for l in range(1, self._num_layers+1):
                z = np.dot(ns[l-1], self._weights[l]) + self._biases[l]
                zs[l] = z
                n = self.a(z)
                ns[l] = n
                ns_T[l] = np.array([n]).T

            #backward pass
            d = self.dCxt_dnL(x, t)*self.da_dz(zs[self._num_layers])
            dC_dws[self._num_layers] += np.dot(ns_T[self._num_layers-1], np.array([d]))
            dC_dbs[self._num_layers] += d
            for l in range(self._num_layers-1, 0, -1):
                d = np.dot(d, self._weights[l+1].T)*self.da_dz(zs[l])
                dC_dws[l] += np.dot(ns_T[l-1], np.array([d]))
                dC_dbs[l] += d

        for l in range(1, self._num_layers+1):
            dC_dws[l] /= len(self._trainingset)
            dC_dbs[l] /= len(self._trainingset)

        return (dC_dws, dC_dbs)

    def epoch_train(self, learning_rate):
        '''Train the neural network for one epoch.'''
        (dC_dws, dC_dbs) = self.get_cost_gradients()
        for l in range(1, self._num_layers+1):
            self._weights[l] -= learning_rate*dC_dws[l]
            self._biases[l] -= learning_rate*dC_dbs[l]
    
    def check_gradient(self, epsilon=1e-5):
        '''Check if the gradients are being calculated correctly. This is done according to http://ufldl.stanford.edu/tutorial/supervised/DebuggingGradientChecking/ . This method calculates the difference between each calculated gradient (according to get_cost_gradients()) and an estimated gradient computed using a small number epsilon. If the gradients are calculated correctly then the returned numbers should all be very small (smaller than 1e-10).'''
        (predicted_dC_dws, predicted_dC_dbs) = self.get_cost_gradients()

        approx_dC_dws = dict()
        approx_dC_dbs = dict()
        for l in range(1, self._num_layers+1):
            approx_dC_dws[l] = np.zeros_like(self._weights[l])
            approx_dC_dbs[l] = np.zeros_like(self._biases[l])
            
        for l in range(1, self._num_layers+1):
            (rows, cols) = self._weights[l].shape
            for r in range(rows):
                for c in range(cols):
                    orig = self._weights[l][r][c]
                    self._weights[l][r][c] = orig + epsilon
                    cost_plus = self.cost()
                    self._weights[l][r][c] = orig - epsilon
                    cost_minus = self.cost()
                    self._weights[l][r][c] = orig
                    approx_dC_dws[l][r][c] = (cost_plus - cost_minus)/(2*epsilon)
                    
            (cols,) = self._biases[l].shape
            for c in range(cols):
                orig = self._biases[l][c]
                self._biases[l][c] = orig + epsilon
                cost_plus = self.cost()
                self._biases[l][c] = orig - epsilon
                cost_minus = self.cost()
                self._biases[l][c] = orig
                approx_dC_dbs[l][c] = (cost_plus - cost_minus)/(2*epsilon)

        errors_w = dict()
        errors_b = dict()
        for l in range(1, self._num_layers+1):
            errors_w[l] = abs(predicted_dC_dws[l] - approx_dC_dws[l])
            errors_b[l] = abs(predicted_dC_dbs[l] - approx_dC_dbs[l])
        return (errors_w, errors_b)

################################################################
nn = ANN(
    2, #number of inputs
    [ 2, 2 ], #number of hidden neurons (each number is a hidden layer)
    2, #number of outputs
    { #training set
        (0,0):(0,0),
        (0,1):(1,0),
        (1,0):(1,0),
        (1,1):(0,1)
    }
)

print('Initial cost:', nn.cost())
print('Starting training')
for epoch in range(1, 20000+1):
    nn.epoch_train(0.5)
    if epoch%1000 == 0:
        print(' Epoch:', epoch, ', Current cost:', nn.cost())
print('Training finished')

print('Neural net output:')
for (x,t) in sorted(nn._trainingset.items()):
    print(' ', x, ':', nn.out(x))

After 20,000 iterations I got this output:
Initial cost: 0.232581149941
Starting training
 Epoch: 1000 , Current cost: 0.218450291043
 Epoch: 2000 , Current cost: 0.215777328824
 Epoch: 3000 , Current cost: 0.178593050179
 Epoch: 4000 , Current cost: 0.102704613785
 Epoch: 5000 , Current cost: 0.0268721186425
 Epoch: 6000 , Current cost: 0.00819827730488
 Epoch: 7000 , Current cost: 0.00444660869981
 Epoch: 8000 , Current cost: 0.00298786209459
 Epoch: 9000 , Current cost: 0.00223137023455
 Epoch: 10000 , Current cost: 0.00177306322414
 Epoch: 11000 , Current cost: 0.00146724847349
 Epoch: 12000 , Current cost: 0.00124934223594
 Epoch: 13000 , Current cost: 0.00108652952015
 Epoch: 14000 , Current cost: 0.000960438140866
 Epoch: 15000 , Current cost: 0.000860007442299
 Epoch: 16000 , Current cost: 0.000778192177283
 Epoch: 17000 , Current cost: 0.000710297773182
 Epoch: 18000 , Current cost: 0.000653078479503
 Epoch: 19000 , Current cost: 0.000604220195232
 Epoch: 20000 , Current cost: 0.0005620295804
Training finished
Neural net output:
  (0, 0) : [ 0.02792593  0.00058427]
  (0, 1) : [ 0.97284142  0.01665468]
  (1, 0) : [ 0.97352071  0.01661805]
  (1, 1) : [ 0.03061993  0.97196112]

Sunday, May 22, 2016

Making your own deep learning workstation: From Windows to Ubuntu dual boot with CUDA and Theano

I recently managed to turn my Windows 7 gaming PC into a dual boot GPU enabled deep learning workstation which I use for deep learning experiments. Here is how I did it.

First of all, you'll want to use Theano as a deep learning framework which automatically optimises your code to use the GPU, the fast parallel processor on your graphics card, which makes deep learning faster (twice as fast on my computer). Unfortunately this seems to be difficult to do on Windows, but easy to do on Linux; so I created a partition for Ubuntu Linux (splitting a hard disk into two separate spaces which can contain two different operating systems) and made my computer dual boot (so that I can choose whether to use Windows or Ubuntu when my computer is switched on). It seems that the most compatible Ubuntu version which most tutorials focus on is Ubuntu 14.04, which is what I installed. I then installed CUDA, NVIDIA's GPU computing platform and driver, which allows you to write programs that use the GPU. After installing this, I installed Theano together with the even easier specialisation framework Keras which lets you do deep learning with only a few lines of Python code.

Be warned that I had to reinstall Ubuntu multiple times because I followed different tutorials which didn't work. Here I will link to the tutorials which worked for me.

Create a partition for Ubuntu
Tutorial
The first step is to create a partition on your C: drive in which to install Ubuntu. I made a partition of 100GB using the Windows Disk Management tool. Follow the instructions in the tutorial.

Install Ubuntu 14.04 into the partition (yes, it is recommended to use 14.04), together with making the computer dual boot
Tutorial
Follow the instructions in the tutorial to use BCDEdit, the Windows boot manager tool, to make it possible to choose whether to use Windows or Ubuntu. Stop at step 5.

To continue past step 5, download the Ubuntu 14.04 ISO file from http://releases.ubuntu.com/trusty/. I chose the 64-bit PC (AMD64) desktop image but you might need the 32-bit one instead. After downloading it, burn it onto an empty DVD, restart your computer, and pop in the DVD before it starts booting. Assuming your boot priority list has the DVD drive before the hard disk (changing this if it's not the case depends on your BIOS and motherboard but it usually requires pressing F12 when the computer is starting up), you should be asked whether to install or try Ubuntu. Choose to try Ubuntu and once the desktop is loaded start the installation application.

At this point you're on step 6 of the tutorial. Be sure to choose "something else" when asked how to install Ubuntu (whether to install it alongside your other operating system or remove whatever is already on your computer). Create a 1GB partition for swap space and another partition for Ubuntu using the remainder of the available unformatted space. You can find instructions on this at http://askubuntu.com/questions/343268/how-to-use-manual-partitioning-during-installation/521195#521195. It's very important that the device for the boot loader is set to be the same partition where Ubuntu will be installed, otherwise Windows will not show when you turn on your computer. Do not restart, but choose to continue testing the Live CD instead.

You are now in step 8 in the tutorial. This is where you make it possible to select Ubuntu when you switch on your computer. Here is a list of steps you need:
  1. First, you need to mount the Windows and Ubuntu partitions to make them accessible. In Ubuntu there is no "My Computer" like in Windows where you can see all the partitions. Instead the drives and partitions are shown on the side. Click on the ones whose sizes match those of the Windows C: drive and the new Ubuntu partition (after clicking on them they will be opened so you can check their contents). This will mount the partitions.
  2. Next you need to know the partition name of the Ubuntu partition. Click on the Ubuntu icon on the side and search for a program called GParted. Open the program and use the drop down list in the top corner to look for the partition you created for Ubuntu (selecting different choices in the drop down list will show you partitions in different physical hard disks you have). Take note of the name of the Ubuntu partition such as "/dev/sda2". Call this name [Ubuntu partition].
  3. Next you need to know the directory path to the Windows partition. In an open folder window (or go on the Files icon in the side), click on "Computer" on the side, go in "media", enter into the "ubuntu" folder and there you will find all the mounted partitions. Find the Windows partition. Call the directory path to this partition icon [Windows partition] (/media/ubuntu/abcdef-123456).
  4. Open the terminal (ctrl+alt+T) and then just copy the following line into the terminal. Where it says '[Windows partition]' just drag and drop the partition icon into the terminal and the whole directory path will be written for you.
    sudo dd if=[Ubuntu partition] of='[Windows partition]/ubuntu.bin' bs=512 count=1

Now you can restart, remove the DVD, and log into your newly installed Ubuntu operating system, update it, and do another restart.

Installing CUDA
Tutorial
This is the part that took the most tries since only one tutorial actually worked well and gave no errors. Follow the instructions in the tutorial. When downloading the CUDA driver be sure to download the .deb file.

Installing Theano with dependencies
Tutorial
Now comes the straightforward part. Unfortunately it's easier to just use Python 2 than to try to make things work with Python 3. The examples in Keras are for Python 2 anyway. Just open the terminal and paste:
sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git

However also install an extra package which will let you test that Theano was installed correctly:
sudo pip install nose-parameterized

Now you can install theano and keras.
sudo pip install Theano
sudo pip install keras

Now whenever you want to run a theano or keras python script and use the GPU all you have to do is write the following in the terminal:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python [python script]

You may also want to use Python IDLE in order to experiment with theano, in which case install it like this:
sudo apt-get install idle
and then run it to use the GPU like this:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 idle
or do this to debug an error in your program:
THEANO_FLAGS=mode=FAST_COMPILE,device=cpu,floatX=float32 idle

I also recommend this good theano tutorial to get you started.

Has everything worked? Leave a comment if you think I should add something else to this tutorial.

Sunday, April 17, 2016

The Kullback–Leibler divergence

In a text file, each character is represented by a number of bits. Usually each character consists of 8 bits. But in data compression and information theory we learn that the number of bits need not be the same for each character. The most frequent characters can be given the fewest number of bits whilst the least frequent characters can be given many bits. According to Shannon's information theory, the smallest number of bits needed to represent a character that occurs with probability "p" is
-log2(p)

This means that if a character occurs half the time in a document, then you need -log2(0.5) bits, which equals 1. On the other hand a character that occurs an eighth of the time should be represented with -log2(0.125) bits, which equals 3. The rarer the character, the more bits are assigned to it, so that fewer bits can be used for the more frequent characters. This results in compressing the whole document.
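You can verify these numbers directly:

```python
import math

print(-math.log2(0.5))    #1.0
print(-math.log2(0.125))  #3.0
```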

The entropy of a document is the expected number of bits per character, that is, the average number of bits needed per character across the document. If the number of bits per character is minimal, as described above, the expected number of bits is
-sum(p log2(p) for all character probabilities p)

Now that we know what entropy is, we can understand what the Kullback-Leibler divergence is. Assume that the number of bits for each character is calculated based on a different document than the one being compressed. Clearly this is suboptimal, but you might want to compute an average probability for each character once, based on a representative corpus, and then always use these probabilities for all documents, which saves time. By how much will the average number of bits per character increase? This is what the Kullback-Leibler divergence measures.

What we'll do is we'll calculate the difference in the average number of bits per character when using the wrong probabilities instead of the correct ones. This is done as follows:
-sum(p log2(q)) - -sum(p log2(p))

Notice how the correct probability for a particular character is "p" but in the first term we're using a different probability "q". This can now be simplified as follows:
-sum(p log2(q)) - -sum(p log2(p))
sum(p log2(p)) - sum(p log2(q))
sum(p log2(p) - p log2(q))
sum(p(log2(p) - log2(q)))
sum(p log2(p/q))

And this is the Kullback-Leibler divergence between the probabilities that "p" comes from and the probabilities that "q" comes from. It is usually used to measure the difference between two probability distributions.
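Here's a small sketch of these quantities in Python (the two probability distributions are made up for illustration):

```python
import math

def entropy(p):
    '''Average bits per character when coding with the correct probabilities p.'''
    return -sum(p[c]*math.log2(p[c]) for c in p)

def cross_entropy(p, q):
    '''Average bits per character when the bit lengths are based on the wrong probabilities q.'''
    return -sum(p[c]*math.log2(q[c]) for c in p)

def kl_divergence(p, q):
    '''Extra bits per character wasted by using q instead of p.'''
    return sum(p[c]*math.log2(p[c]/q[c]) for c in p)

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}  #true character probabilities of the document
q = {'a': 0.25, 'b': 0.25, 'c': 0.5}  #probabilities taken from a different corpus

print(kl_divergence(p, q))               #0.25
print(cross_entropy(p, q) - entropy(p))  #same thing: 0.25
```

Note that the divergence is exactly the difference between the cross entropy and the entropy, matching the derivation above.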

Monday, March 14, 2016

Displaying the digits of pi in your web browser

Happy pi day. Here's how you can display as many digits of pi as you want (subject to your computer's resources).

We're going to be using Gibbons' unbounded spigot algorithm which spits out consecutive digits of pi in decimal without iterative approximations. It's really cool when you see it in action.

In the above linked paper, the algorithm is given in Haskell. I won't explain the mathematical derivation in this post, but here is a translation into Python 3:
def gibbons():
    (q, r, t, k, n, l) = (1, 0, 1, 1, 3, 3)
    while True:
        if 4*q + r - t < n*t:
            yield n
            (q, r, n) = (
                10*q,
                10*(r - n*t),
                10*(3*q + r)//t - 10*n
            )
        else:
            (q, r, t, k, n, l) = (
                q*k,
                (2*q + r)*l,
                t*l,
                k+1,
                (q*(7*k + 2) + r*l)//(t*l),
                l + 2
            )
To use this code run this:
digit_seq = gibbons()
print(next(digit_seq))
print(next(digit_seq))
print(next(digit_seq))
print(next(digit_seq))
This will output
3
1
4
1
If you want it to print out a thousand digits, do this:
digit_seq = gibbons()
for (position, digit) in enumerate(digit_seq):
    if position == 1000:
        break
    print(digit, end='')
Now this code will let you print as many digits as you want; however, being in Python, it's not something you can just tell your non-geek friend to try out. So for pi day, here's a JavaScript implementation that anyone can paste into their browser's web console and get a nice stream of pi digits. One thing you should keep in mind is that in Python you can have integers which are as big as you like. JavaScript is not like that, so we'll need to use an external library called bignumber.js. This is the full code we'll be using to generate the digits:
var script = document.createElement('script');
script.src = "https://cdnjs.cloudflare.com/ajax/libs/bignumber.js/2.3.0/bignumber.min.js";
document.getElementsByTagName('head')[0].appendChild(script);

function gibbons() {
    var q=new BigNumber(1), r=new BigNumber(0), t=new BigNumber(1), k=new BigNumber(1), n=new BigNumber(3), l=new BigNumber(3);
    function f() {
        while (true) {
            if (q.mul(4).add(r).sub(t).lt(n.mul(t))) {
                var digit = n.toString();
                var q_ = q.mul(10);
                var r_ = r.sub(n.mul(t)).mul(10);
                var n_ = q.mul(3).add(r).mul(10).divToInt(t).sub(n.mul(10));
                q=q_, r=r_, n=n_;
                return digit;
            }
            else {
                var q_ = q.mul(k);
                var r_ = q.mul(2).add(r).mul(l);
                var t_ = t.mul(l);
                var k_ = k.add(1);
                var n_ = (q.mul(k.mul(7).add(2)).add(r.mul(l))).divToInt(t.mul(l));
                var l_ = l.add(2);
                q=q_, r=r_, t=t_, k=k_, n=n_, l=l_;
            }
        }
    }
    return f;
}

function show_digits(max_num_digits) {
    var get_digit = gibbons();
    var num_digits = 0;
    function f() {
        var digits = '';
        for (var i = 1; i <= 50; i++) {
            digits += get_digit();
            num_digits++;
            if (num_digits >= max_num_digits) {
                break;
            }
        }
        console.log(digits);
        if (num_digits < max_num_digits) {
            setTimeout(f, 100);
        }
        else {
            console.log('\n'+max_num_digits+' digits of pi displayed!');
        }
    }
    f();
}

show_digits(1000);
The first line is to import the bignumber library in a way that works in the web console. After that there are two functions. The first is for the Gibbons algorithm which lets us get the digits of pi. The way it works is by returning a function which returns the next digit each time it is called. The second function outputs 50 digits of pi every 100 milliseconds.
The code to copy-paste
But this is not something you'd send your friend. Instead, send them this minified code (of course you shouldn't trust obscure code sent by strangers as it might be malicious, but you can minify the above code yourself if you don't trust me).
var script=document.createElement("script");script.src="https://cdnjs.cloudflare.com/ajax/libs/bignumber.js/2.3.0/bignumber.min.js",document.getElementsByTagName("head")[0].appendChild(script);

function gibbons(){function u(){for(;;){if(n.mul(4).add(i).sub(m).lt(d.mul(m))){var u=d.toString(),e=n.mul(10),r=i.sub(d.mul(m)).mul(10),t=n.mul(3).add(i).mul(10).divToInt(m).sub(d.mul(10));return n=e,i=r,d=t,u}var e=n.mul(l),r=n.mul(2).add(i).mul(o),a=m.mul(o),b=l.add(1),t=n.mul(l.mul(7).add(2)).add(i.mul(o)).divToInt(m.mul(o)),g=o.add(2);n=e,i=r,m=a,l=b,d=t,o=g}}var n=new BigNumber(1),i=new BigNumber(0),m=new BigNumber(1),l=new BigNumber(1),d=new BigNumber(3),o=new BigNumber(3);return u}function show_digits(u){function n(){for(var l="",d=1;50>=d&&(l+=i(),m++,!(m>=u));d++);console.log(l),u>m?setTimeout(n,100):console.log("\n"+u+" digits of pi displayed!")}var i=gibbons(),m=0;n()}

show_digits(1000);

What you have to do is open your browser's web console on any web page by pressing F12 (even a blank tab works, although Facebook will block it), then paste the first line of code at the bottom. After pressing enter, paste the second line of code. Finally, paste the "show_digits(1000);" line and press enter. You'll see a stream of digits displayed in the console panel. I think this is a nice way to celebrate pi day. Maybe someone else can make the digits somehow interact with the web page you are on instead of just displaying them in the console. Until then, enjoy your pie today!

Sunday, February 28, 2016

Gradient Descent Algorithm on Artificial Neural Networks (ANNs)

Polynomials are great. They're a simple mathematical equation pattern that is easy to work with. You basically have a number of constants "a", "b", "c", etc. which you use in the following pattern:
f(x) = a x^0 + b x^1 + c x^2 + ...

By changing the values of the constants you change the shape of the graph of f(x). You can design a particular graph shape by choosing the values of these constants. In fact, one of the most interesting properties of polynomials is that they can be easily designed so that their graph passes exactly through a given set of points. For example, you can choose the constants of the polynomial to make f(1) = 2, f(3) = -1, f(-1.5) = 0.5, etc. This is very easy to make using Lagrange polynomials. The constants of the polynomial can be seen as parameters which can be optimized in order to get a desired function.
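As a quick illustration (a minimal sketch, not a production interpolation routine), here is how Lagrange polynomials can be built in Python 3 to pass exactly through the example points f(1) = 2, f(3) = -1 and f(-1.5) = 0.5 mentioned above:

```python
def lagrange(points):
    # Return a function f(x) whose graph passes exactly through the
    # given (x, y) points, built from Lagrange basis polynomials.
    def f(x):
        total = 0.0
        for (i, (xi, yi)) in enumerate(points):
            # Basis polynomial that is 1 at xi and 0 at every other xj.
            basis = 1.0
            for (j, (xj, _)) in enumerate(points):
                if i != j:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return f

f = lagrange([(1, 2), (3, -1), (-1.5, 0.5)])
print(f(1), f(3), f(-1.5))  # 2.0 -1.0 0.5
```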

Polynomials are not the only equations designed to be easily modifiable. One of the most well known modifiable equation patterns is the Artificial Neural Network. Although neural networks are inspired by the neurons in the brain and how they change when learning to perform new tasks, the concept behind neural networks is that of having an equation that, by design, can be easily modified to turn chosen inputs into chosen outputs. They can take any number of inputs and outputs, and they usually work on numbers between 0 and 1, such as binary values (which is great for computers). They have been used for lots of things such as recognising objects in images and classifying text into meaningful categories. Neural networks are known to generalise a set of input-output mappings into useful functions that work well for new inputs. For example, if you design a neural network to map 100 particular pictures of dogs to the output "1" and another 100 particular pictures of cats to the output "0", then the network will start giving the expected output for new pictures of dogs and cats.

Neural network definition

Whereas the building block of the polynomial is the term "k x^i", the building block of the neural network is the neuron. A neuron takes in a number of signals from other neurons, weighs the signals by their importance by making them weaker or stronger, adds them together and if the sum is over a particular threshold, then the neuron will output its own signal to some other neurons. Here is a diagram of this:


Neurons are organized into layers, where one layer of neurons supplies input signals to the next layer of neurons. In mathematical terms, a neuron is defined as follows:


Activation functions are functions that sharply change value when the input passes a certain threshold. Traditionally, neural networks use the sigmoid function, a(z) = 1/(1 + e^(-z)), which has a graph like this:


See how the value changes from 0 to 1 after passing x = 0? This is what will make the neuron output a signal if the weighted sum is over a threshold. Left as is, the threshold is 0; but we can change the threshold by changing the value of the bias in order to shift the graph to the left or right. The weighting of input signals by importance is done by multiplying each input by a weight, which can be a large number, a small fraction, or even a negative number.
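Here is a minimal Python 3 sketch of a single sigmoid neuron (the weight and bias values are illustrative) showing how the bias shifts the threshold:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weigh each input signal, add the bias, then squash with the sigmoid.
    weighted_sum = sum(x * w for (x, w) in zip(inputs, weights))
    return sigmoid(weighted_sum + bias)

# With no bias, the output crosses 0.5 when the weighted sum crosses 0.
print(neuron([0.0], [1.0], 0.0))   # 0.5
# A bias of -2 shifts the threshold: the weighted sum must now reach 2
# before the output rises to 0.5.
print(neuron([2.0], [1.0], -2.0))  # 0.5
```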

Just like you can modify polynomial functions by changing their constants, you can modify neural network functions by changing their weights and biases. Further down we'll see how to do this.

Neural network example

Given the above definitions, here is a diagram of an example of a neural network:


We shall be referring to this diagram throughout this blog post. Notice the following things about it:
  • Its 2 inputs are called "x0" and "x1" whilst its 2 outputs are called "y0" and "y1".
  • Since there are two outputs, the output of the whole network is a column vector of two numbers.
  • The first layer of neurons is drawn as squares because those neurons propagate the "x" signals to the next layer unmodified. The rest of the neurons work as explained.
  • Each circle neuron receives input signals from all neurons in the previous layer.
  • Apart from input signals, each circle neuron also accepts a bias signal that is always the same.

This diagram is just a visual representation of the equation shown below. The equation is derived layer by layer from the output. Notice how the equation is a column vector of two numbers since there are two outputs.



Notice that we are representing the neural network using ordinary algebra instead of the usual linear algebra with matrices. I believe that this makes the example easier to understand. Linear algebra is useful for generalising the concepts to any neural network shape, but for understanding one network example we'll stick to this expanded representation.

Neural network modification

So now we know how to create a neural network equation, but how do we choose its weights and biases in order to get the function we want? In the previous blog post, we talked about how to use the gradient descent algorithm in order to numerically find the minimum of an equation when solving the equation algebraically is difficult. In fact, one of the most well known uses of the gradient descent algorithm is for "training" neural networks into performing a desired function, that is, giving desired outputs from particular inputs.

Usually neural networks are trained using the backpropagation algorithm which is based on the gradient descent algorithm. However we'll see how to do the training using plain gradient descent which will help us understand the backpropagation algorithm in a later blog post.

What we'll do is create a "cost" function that quantifies how close the outputs that the network is giving are to the desired outputs. Suppose that we want the above neural network to output (0,1) when given (0,0) and (1,0) when given (1,1). The cost function will measure how far off the neural network is from actually giving these outputs. If the cost function gives 0, then the network is giving exactly these desired outputs, otherwise it will be larger than zero. Different weights and biases will make the cost function give different values and our job is to find the combination of weights and biases that will give a cost of zero. We will do this using the gradient descent algorithm.

There are many possible cost functions, but traditionally, this is how we define the cost function:



Notice that O(X,W,B) = Y.

The cost function here measures how far the actual output is from the target output by calculating the mean squared error. For convenience later on, we also divide the equation by 2.
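To see why the division by 2 is convenient, here is a small Python 3 check (with made-up numbers) comparing a numerical gradient of the halved squared error against the clean (y - t) factor that the cancellation leaves behind:

```python
def half_squared_error(y, t):
    # Squared error divided by 2, as in the cost function above.
    return (y - t)**2 / 2

# The power rule brings down a factor of 2 which cancels the division,
# leaving just (y - t) as the derivative. A central-difference check:
y, t = 0.8, 1.0
h = 1e-6
numeric_grad = (half_squared_error(y + h, t) - half_squared_error(y - h, t)) / (2 * h)
print(numeric_grad)  # approximately y - t = -0.2
```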

For the above network, the cost function applies as follows:


Now that we have a continuous equation that quantifies how far from the desired function our neural network's weights and biases are, we can use gradient descent in order to optimize the weights and biases into giving the desired function. To do that we need to find the partial derivative of the network's equation with respect to every weight and bias. Before we do so, notice that the derivative of the activation function is as follows:


The above equation says that the derivative of a neuron's output can be expressed in terms of the neuron's output itself (the output multiplied by one minus the output). This is a useful speed optimization when using sigmoid activation functions.
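We can check this identity numerically with a short Python 3 sketch (the sample points are arbitrary), comparing a(z)(1 - a(z)) against a finite-difference estimate of the derivative:

```python
import math

def a(z):
    # Sigmoid activation function, as used throughout this post.
    return 1 / (1 + math.exp(-z))

h = 1e-6
for z in [-2.0, 0.0, 1.5]:
    analytic = a(z) * (1 - a(z))               # derivative via the identity
    numeric = (a(z + h) - a(z - h)) / (2 * h)  # central finite difference
    print(round(analytic, 6), round(numeric, 6))
```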

We shall now find the partial derivative of two weights and two biases. You can then find the partial derivatives of every other weight and bias yourself. Make sure that you know how the chain rule in differentiation works. As you follow the derivations, keep in mind that all the "y"s and "n"s are actually functions of "B", "W", and "X" just like "O", but we'll leave the "(X,W,B)" out in order to save space.

Partial derivative with respect to weight in layer 3 (output layer)

Let's find the partial derivative of one of the weights in layer 3 (output layer).


Notice how the 2 in the denominator was cancelled with the 2s in the numerator. This is why we put it in the denominator of the cost function. Notice also that the summation does not change anything in the derivative, since the derivative of a summation is the summation of the summed terms' derivatives.

Now we'll find the derivative of each "y" separately.



Notice that we stopped decomposing neurons past layer 3 since we know that the weight we are deriving with respect to will not lie in deeper layers.

Finally we plug these back into the original equation and we have our partial derivative:


Notice that the derivative uses the outputs of the network's neurons. This is important, since it means that before finding the derivative we need to first compute the output of the network. Notice also that the green "n" at layer 3 is "y0", but we won't replace it with "y0" in order to preserve a pattern that is revealed later.

Partial derivative with respect to bias in layer 3 (output layer)

Let's find the partial derivative of one of the biases in layer 3 (output layer) now. We'll skip some steps that were shown in the previous derivations.


And now we find the derivatives of the "y"s separately.


And we now plug them back into the original equation.


Partial derivative with respect to weight in layer 2

Let's find the partial derivative of one of the weights in layer 2 now. We'll skip some steps that were shown in the previous derivations.


And now we find the derivatives of the "y"s separately.


And we now plug them back into the original equation.


Partial derivative with respect to bias in layer 2

Let's find the partial derivative of one of the biases in layer 2 now. We'll skip some steps that were shown in the previous derivations.


And now we find the derivatives of the "y"s separately.


And we now plug them back into the original equation.


Partial derivative with respect to weight in layer 1

Let's find the partial derivative of one of the weights in layer 1 now. We'll skip some steps that were shown in the previous derivations.


And now we find the derivatives of the "y"s separately.


And we now plug them back into the original equation.


Notice that we did not replace the black "n" at layer 0 with "x" in order to preserve a pattern that is revealed later.

Partial derivative with respect to bias in layer 1

Let's find the partial derivative of one of the biases in layer 1 now. We'll skip some steps that were shown in the previous derivations.


And now we find the derivatives of the "y"s separately.


And we now plug them back into the original equation.


Patterns
There is a pattern in the derivatives. In order to see the pattern, let's look at a part of the equation inside the summations of all the derivatives and compare them.

Here is the repeating pattern in the derivation for the weights:


Here is the repeating pattern in the derivation for the biases:


See how there is a repeating pattern that gets longer as the layer of the weight we are deriving with respect to gets deeper into the network? This is an important pattern on which the backpropagation algorithm is based: it computes the gradient of a layer's weights by reusing quantities already computed for the layer closer to the output.

Notice also that as more layers are added, the chain of multiplications gets longer. Since each sigmoid derivative is a fraction between 0 and 0.25, the multiplications produce smaller and smaller numbers as the chain gets longer. This makes the gradients of the layers at the back very small, which makes training those layers very slow. In fact this is a well-known problem in multi-layered neural networks called the vanishing gradient problem. It can be mitigated by using different activation functions such as the rectified linear unit.
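The shrinking can be seen with a tiny Python 3 sketch (an idealised best case where every weighted sum is 0, so each sigmoid derivative takes its maximum value of 0.25):

```python
import math

def a(z):
    return 1 / (1 + math.exp(-z))

# Multiply one sigmoid-derivative factor per layer; even in the best
# case (z = 0) each factor is only 0.25, so the product collapses fast.
grad = 1.0
for layer in range(1, 11):
    grad *= a(0.0) * (1 - a(0.0))
    print(layer, grad)
```

After ten layers the factor is already below one in a million, which is why the early layers barely move during training.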

In code
Now that we know how to find the partial derivative of every weight and bias, all we have to do is plug them in the gradient descent algorithm and minimize the cost function. When the cost function is minimized, the network will give us the desired outputs.

Here is a complete code example in Python 3 of a gradient descent algorithm applied to train the above neural network to act as a half adder. A half adder adds together two binary digits and returns the sum and carry. So 0 + 1 in binary gives 1 carry 0, whilst 1 + 1 in binary gives 0 carry 1. One of the functions is called "ns", which gives the outputs of each neuron in the network given an input, grouped by layer. This will be called several times by the gradient functions, so to avoid recomputing the same values, its output is automatically cached by Python using the lru_cache decorator. The initial values are set to random numbers between -1 and 1, since starting them off as all zeros, as was done in the previous blog post, would prevent the gradient descent algorithm from working in a neural network.

import math
import random
from functools import lru_cache

trainingset = { (0,0):(0,0), (0,1):(1,0), (1,0):(1,0), (1,1):(0,1) }

def grad_desc(cost, gradients, initial_values, step_size, threshold):
    old_values = initial_values
    while True:
        new_values = [ value - step_size*gradient(*old_values) for (value, gradient) in zip(old_values, gradients) ]
        if cost(*new_values) < threshold:
            return new_values
        old_values = new_values

def a(z):
    return 1/(1 + math.exp(-z))

@lru_cache(maxsize=4)
def ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    (n00, n01) = ( x0, x1 )
    (n10, n11) = ( a(n00*w100 + n01*w110 + b10), a(n00*w101 + n01*w111 + b11) )
    (n20, n21) = ( a(n10*w200 + n11*w210 + b20), a(n10*w201 + n11*w211 + b21) )
    (n30, n31) = ( a(n20*w300 + n21*w310 + b30), a(n20*w301 + n21*w311 + b31) )
    return ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) )

def out(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    return ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)[-1]

def cost(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        (y0, y1) = out(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        tmp += (y0 - t0)**2 + (y1 - t1)**2
    return tmp/(2*len(trainingset))

def dCdw300(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * n30*(1-n30) * n20
    return tmp/len(trainingset)

def dCdw310(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * n30*(1-n30) * n21
    return tmp/len(trainingset)

def dCdw301(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y1 - t1) * n31*(1-n31) * n20
    return tmp/len(trainingset)

def dCdw311(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y1 - t1) * n31*(1-n31) * n21
    return tmp/len(trainingset)

def dCdb30(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * n30*(1-n30)
    return tmp/len(trainingset)

def dCdb31(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y1 - t1) * n31*(1-n31)
    return tmp/len(trainingset)

def dCdw200(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*n10 ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*n10 ))
    return tmp/len(trainingset)

def dCdw210(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*n11 ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*n11 ))
    return tmp/len(trainingset)

def dCdw201(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w310*n21*(1-n21)*n10 ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w311*n21*(1-n21)*n10 ))
    return tmp/len(trainingset)

def dCdw211(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w310*n21*(1-n21)*n11 ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w311*n21*(1-n21)*n11 ))
    return tmp/len(trainingset)

def dCdb20(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20) ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20) ))
    return tmp/len(trainingset)

def dCdb21(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w310*n21*(1-n21) ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w311*n21*(1-n21) ))
    return tmp/len(trainingset)

def dCdw100(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w200*n10*(1-n10)*n00 + w310*n21*(1-n21)*w201*n10*(1-n10)*n00 ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w200*n10*(1-n10)*n00 + w311*n21*(1-n21)*w201*n10*(1-n10)*n00 ))
    return tmp/len(trainingset)

def dCdw110(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w200*n10*(1-n10)*n01 + w310*n21*(1-n21)*w201*n10*(1-n10)*n01 ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w200*n10*(1-n10)*n01 + w311*n21*(1-n21)*w201*n10*(1-n10)*n01 ))
    return tmp/len(trainingset)

def dCdw101(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w210*n11*(1-n11)*n00 + w310*n21*(1-n21)*w211*n11*(1-n11)*n00 ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w210*n11*(1-n11)*n00 + w311*n21*(1-n21)*w211*n11*(1-n11)*n00 ))
    return tmp/len(trainingset)

def dCdw111(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w210*n11*(1-n11)*n01 + w310*n21*(1-n21)*w211*n11*(1-n11)*n01 ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w210*n11*(1-n11)*n01 + w311*n21*(1-n21)*w211*n11*(1-n11)*n01 ))
    return tmp/len(trainingset)

def dCdb10(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w200*n10*(1-n10) + w310*n21*(1-n21)*w201*n10*(1-n10) ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w200*n10*(1-n10) + w311*n21*(1-n21)*w201*n10*(1-n10) ))
    return tmp/len(trainingset)

def dCdb11(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):
    tmp = 0
    for ((x0,x1), (t0,t1)) in trainingset.items():
        ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)
        (y0, y1) = (n30, n31)
        tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w210*n11*(1-n11) + w310*n21*(1-n21)*w211*n11*(1-n11) ))
        tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w210*n11*(1-n11) + w311*n21*(1-n21)*w211*n11*(1-n11) ))
    return tmp/len(trainingset)

new_values = grad_desc(
    cost,
    [ dCdw100, dCdw101, dCdw110, dCdw111, dCdb10, dCdb11, dCdw200, dCdw201, dCdw210, dCdw211, dCdb20, dCdb21, dCdw300, dCdw301, dCdw310, dCdw311, dCdb30, dCdb31 ],
    [ 2*random.random()-1 for _ in range(18) ],
    0.5,
    1e-3
)
print('cost:', cost(*new_values))
print('output:')
for ((x0,x1), (t0,t1)) in trainingset.items():
    print(' ', (x0,x1), out(x0,x1,*new_values))

Wait a few seconds and...
cost: 0.0009999186077385778
output:
  (0, 1) (0.9625083358870274, 0.02072955340806345)
  (1, 0) (0.9649753465406843, 0.02060601544190889)
  (0, 0) (0.03678639641748687, 0.0051592625464787585)
  (1, 1) (0.04304611235264617, 0.9642249998318806)
Conclusion
Of course this isn't a very convenient way to train neural networks, since we need to derive a different set of partial derivatives for every new network shape (number of neurons and layers). In fact, in a future blog post we'll see how to use the backpropagation algorithm, which exploits the patterns in this process in order to simplify it. On the other hand, the backpropagation algorithm makes certain assumptions about the network, such as that every neuron in a layer is connected to every neuron in the previous layer. If we were to make no assumptions about the network then we'd end up using the bare bones method presented here.