Monday, March 27, 2017

Word embeddings: How word2vec and GloVe work

Word embeddings are vectors that represent words. For example the word "dog" might be represented as [0.1, -2.1, 1.2] whilst "cat" might be represented as [0.2, 2.4, 1.1]. These vectors are important in neural networks because neural networks can only work with continuous numbers whereas words are discrete symbols.

One way to represent words as vectors is by taking a finite and fixed vocabulary of words that you want to work with, put the words in the vocabulary in a certain order such as alphabetical, and then use one hot vectors to represent the words. A one hot vector is a vector consisting of zeros everywhere and single one somewhere, such as [0, 0, 1, 0]. The position of the one indicates the position of the word in the vocabulary. The one hot vector [1, 0, 0] represents the first word in a three word vocabulary. You can then feed this vector as an input to a neural network and continue as usual.

One hot vectors are a very wasteful representation in both space and time. They are wasteful in space because you need a vector as large as your vocabulary to represent each word. They are wasteful in time because when computing activation values of the first layer in the neural network, the weight matrix multiplication is going to involve multiplying by a lot of zeros. This is what the matrix multiplication looks like when then first layer outputs two activation values:

(1 0 0) (1 2) = (1*1 + 0*3 + 0*5    1*2 + 0*4 + 0*6) = (1 2)
        (3 4)
        (5 6)

In fact, if you think about it, all you need is to pick the row in the matrix corresponding to where the one is in the one hot vector. What we usually do in fact is to have an "embedding layer" which is just a matrix with as many rows as the vocabulary size and as many columns as you want your embedded vectors to be big. We then simply pick the row according to the word we want to pass to the neural network. This embedding matrix is then optimized together with the rest of the neural network and the selected vectors passed as input to the next layer as usual.

This has an interesting result. The vectors tend to become similar for similar words, that is, the more similar two words are, the larger the cosine similarity of their corresponding vectors. There are also some other interesting properties such as analogies (asking "man is to king as woman is to...?") and odd word out exercises (which word from these five does not fit with the others?). It seems that the vectors are not just made different for different words but are made different to different degrees according to the differences between the words. If two words are used similarly in certain contexts but not others (such as being similar in gender but not in other cases), then that similarity will be encoded in the vectors. You can see more about these properties in

Of course there are systems for creating generic word embeddings that are useful for many natural language processing applications. The two most popular generic embeddings are word2vec and GloVe.

word2vec is based on one of two flavours: The continuous bag of words model (CBOW) and the skip-gram model. CBOW is a neural network that is trained to predict which word fits in a gap in a sentence. For example, given the partial sentence "the cat ___ on the", the neural network predicts that "sat" has a high probability of filling the gap. The order of the words is not important: given four words, the neural network predicts the word in the middle. The embedding layer of the neural network is used to represent words in general. On the other hand, the skip-gram model works in the other way round. Given a middle word it predicts which words surround it. Of course by "predicts" we mean that it outputs a probability for each word in the vocabulary. These tasks are meant to force the neural network to create embeddings that reflect the relationship between words, hence bestowing them with meaning.

GloVe takes a different approach. Instead of extracting the embeddings from a neural network that is designed to perform a surrogate task (predicting neighbouring words), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words will occur near each other (within 5 words for example). For example if "dog" and "cat" occur near each other 10 times in a corpus, then vec(dog) dot vec(cat) = log(10). This forces the vectors to somehow encode the frequency distribution of which words occur near them.

Which is better? It probably depends on your data. The nice thing about both of these methods is that you can train your own word vectors based on your own corpus so that the meaning that gets encoded into the vectors becomes specific to your own domain.

Download word2vec code from here.
Download GloVe code from here.

Tuesday, February 28, 2017

Neural network deep learning hard attention using REINFORCE / score function estimator / likelihood ratio estimator

One of the things I struggled the most to understand with papers on deep learning is when they mention hard attention. Let's start with a description of what attention is first.

Attention based learning is when you have a sequence input where only part of it is relevant and you want your neural network to explicitly choose what is relevant. For example, if you are translating a sentence from English to French, only some of the words in the English sentence are relevant to generating the next word in the French sentence. So part of your neural network would be dedicated to weighing the importance of each English word in the context of what the neural network intends to generate next in the French sentence. This attention module would work by taking two inputs:
  • an embedded word vector from the English sentence
  • the current state of the recurrent neural network that is encoding what has been generated up to now
Given these two inputs the attention module would then output a single number between 0 and 1 which quantifies how relevant the given word is given what the neural network intends to generate next (which is encoded in the RNN state). This weighting would then be multiplied by the word vector in order to shrink it's magnitude and then the weighted sum of the word vectors would be used to encode the English sentence. This sentence vector would represent the information in the sentence that is relevant to generating the next French word. In this way, the neural network does not need to encode the whole sentence and use the same representation for generating all the French sentence. The English sentence would be attended to in different ways for generating each French word.

This technique has not only been used for machine translation but also for image captioning (only attend to parts of the image for every word being generated) and neural turing machines (only attend to parts of the tape with every operation). Just look at what Colah has to say about it. However this is just soft attention, which is easy. It's called soft attention because you still partially attend to every part of the input; it's just some parts get very little attention. This means that it's still possible to measure the effect of a part of an input and so it's still possible to determine, by gradient descent, whether its attention should be increased or decreased.

Hard attention, on the other hand, either completely includes or excludes elements of the input. How would you do this in a neural network? You'd need to use thresholding for example, where if the value of a neuron is above a threshold then it outputs a one, zero otherwise. You can also use argmax which selects the neuron with the greatest value and sets it to one and all the others to zero. Unfortunately both of these solutions are undifferentiable and that makes direct gradient descent ineffective if they are used in a hidden layer (in an output layer you can just maximize their continuous values and then threshold them at test time). It would be the same situation neural networks had in the past when they couldn't have a hidden layer because there was no known optimization algorithm for 2 layers of thresholding neurons.

It would be nice to somehow make neurons that have discrete outputs but which can still be trained by gradient descent. One way to do this is to use stochastic neurons which output discrete values based on a probability that can be tweaked. The simplest probabilistic neuron is the Bernoulli neuron which outputs a 1 with probability "p" and a 0 otherwise. Let's assume a simple attention based neural network that uses one Bernoulli neuron as an attention module and one normal sigmoid neuron as a feature extraction neuron. The Bernoulli neuron's output is multiplied by the normal neuron's output in order to gate it. Both neurons take the same single number as input.

The grey neuron with a "B" inside is the Bernoulli neuron, "w0" and "w1" are weights and "b0" and "b1" are biases. If we had to turn this into an equation in order to differentiate it, we'd end up with this:

$$y = B_x \cdot sig(w_1 x + b_1)$$

where "B_x" is a random variable that is dependent on "x". Unfortunately random variables "hide" their inner workings and we cannot express their probability as part of the equation. This means that we cannot optimize "w0" and "b0" by gradient descent as it would be meaningless to differentiation with respect to "w0" given that it's not in the equation. In fact "B" is treated as a constant in the above equation and we can still differentiate the equation with respect to all the other parameters. Note that this situation is different from dropout, where you randomly multiply some of the neuron values by 0 in order to avoid overfitting. In the case of dropout, the above equation would be enough as we're not optimizing the value of the dropout random variable.

There is a simple solution to this problem however. Instead of finding the gradient of "y", we can find the gradient of the expected value of "y". The expected value is the mean value you get when running a stochastic function over and over again. For example, if you're tossing a fair coin, with one face representing a 1 and the other face representing a 0, the expected value is 0.5. However, if you're tossing an unfair coin where the "1" face comes up 75% of the time, then on average you will get a value of 0.75. In general, the expected value of a discrete probability distribution, such as the Bernoulli distribution, can be found using the following equation:

$$E[X] = \sum_{v \in X} p(v) \cdot v$$

that is, just multiple each value by its respective probability and take the sum. In the case of a coin with probability of the "1" face coming up being "p" (a Bernoulli distribution), the expected value is $1 \cdot p + 0 \cdot (1-p)$ which equals "p". We can take advantage of the expect value by using gradient descent to minimize the expected error rather than the error itself. This would expose the parameters that determine the probability of the Bernoulli neuron and we would be able to use gradient descent as usual.

Let's take a concrete example of an error function. We want to minimize the sum square error of the neural net such that when given a 0 it should output a 1 and when given a 1 it should output a 0 (a logical NOT). This is what the error function (called the cost function) would look like, together with the error function of a single input:

C &= \sum_{(x,t_x) \in \{ (0,1),(1,0) \}} (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\
C_x &= (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\

Let's focus on just the single input error function. The expected error would look like:

E[C_x] &= sig(w_0 x + b_0) \cdot (t_x - 1 \cdot sig(w_1 x + b_1))^2 + (1 - sig(w_0 x + b_0)) \cdot (t_x - 0 \cdot sig(w_1 x + b_1))^2 \\
&= sig(w_0 x + b_0) \cdot (t_x - sig(w_1 x + b_1))^2 + (1 - sig(w_0 x + b_0)) \cdot (t_x)^2 \\

Remember that expected value finds the sum of all the possible values of the stochastic function due to randomness multiplied by their respective probability. In the case of our neural net, the two possible values due to randomness are caused by the Bernoulli neuron "B" being either 1 with probability $sig(w_0 x + b_0)$ or 0 with probability $1 - sig(w_0 x + b_0)$. Now we can find the gradient of the expected error and minimize it, which would include optimizing the parameters determining the probability of the Bernoulli neuron.

In general it might not be tractable to expose all the possible values of a stochastic neural network, especially when multinoulli neurons are used where there are many possible discrete values instead of just 0 or 1 and especially when there are multiple such neurons whose values must be considered together, creating a combinatorial explosion. To solve this problem, we can approximate the expected error by taking samples. For example, if you want to approximate the expected value of a coin, you can toss it a hundred times, count the number of times the coin gives the "1" face and divide by 100. This can be done with a stochastic neural network, where you run the network a hundred times and calculate the mean of the error for each training pair. The problem is that we don't want the expected error, but the derivative of the expected error, which cannot be calculated on the constant you get after approximating the expected error. We need to take samples of the expected derivative of the error instead. This is what the expected derivative of the error looks like:

\frac{dE[C_x]}{dw} &= \frac{d}{dw} \sum_{v \in C_x} p(v) \cdot v \\
&= \sum_{v \in C_x} \frac{d}{dw} (p(v) \cdot v) \\

The problem with this equation is that it can't be sampled like an expected value can so you'll end up having to calculate the full summation, which we're trying to avoid. The solution to this is that we continue breaking down the algebra until eventually we get something that can be approximated by sampling. This has been derived multiple times in the literature and so it has multiple names such as REINFORCE, score function estimator, and likelihood ratio estimator. It takes advantage of the following theorem of logarithms: $\frac{d}{dx} f(x) = f(x) \cdot \frac{d}{dx} \log(f(x))$

\frac{dE[C_x]}{dw} &= \sum_{v \in C_x} \frac{d}{dw} (p(v) \cdot v) \\
&= \sum_{v \in C_x} \left( p(v) \cdot \frac{d}{dw} v + v \cdot \frac{d}{dw} p(v) \right) \\
&= \sum_{v \in C_x} \left( p(v) \cdot \frac{d}{dw} v + v \cdot p(v) \cdot \frac{d}{dw} \log(p(v)) \right) \\
&= \sum_{v \in C_x} p(v) \cdot \left( \frac{d}{dw} v + v \cdot \frac{d}{dw} \log(p(v)) \right) \\
&\approx \frac{\sum_{i = 1}^{N} \frac{d}{dw} \tilde{v} + \tilde{v} \cdot \frac{d}{dw} \log(p(\tilde{v}))}{N} \text{ where } \tilde{v} \sim{} C_x \\

The last line is approximating the derivative of the expected error using "N" samples. This is possible because probability "p" was factored out inside the summation, which gives it the same form of an expect value equation and which hence can be approximated in the same way. You might be asking how it is that you can find the probability of each possible value. As long as you have access to the values returned by the stochastic neurons, you can use a dictionary to map values to their respective probabilities and multiply them together if necessary. The following is the derivative of the above Bernoulli NOT gate:

C_x &= (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\
\frac{dE[C_x]}{dw_0} &\approx \frac{\sum_{i = 1}^{N} \left( \frac{d}{dw_0} (t_x - B_x \cdot sig(w_1 x + b_1))^2 \right) + (t_x - B_x \cdot sig(w_1 x + b_1))^2 \cdot \left(
\left\{ \begin{array}{lr}
\log(sig(w_0 x + b_0)) & : B_x = 1 \\
\log(1 - sig(w_0 x + b_0)) & : B_x = 0 \\
\end{array} \right.
\right) }{N} \\

Tuesday, January 31, 2017

Applying softmax and categorical_crossentropy to 3D tensors in Theano

Theano is a great deep learning library but there are some things in it that need to be polished a bit. For example at the moment, you can only apply the softmax function in the tensor.nnet module to matrices (2D tensors). Same goes for the categorical_crossentropy function.

This is good for when you have a list of predictions, for example, when you have a bunch of images and you want your neural network to output a set of probabilities corresponding to each possible image label for each image. In this case you want to give your softmax function a matrix of values where each row corresponds to an image and each value in each row is a feature that is used to determine how to distribute the label probabilities.

But let's say that instead of just a single label we want to generate a sentence or a list of labels. In this case we need to provide a 3D tensor where the first dimension corresponds to sentences, the second dimension corresponds to word place holders in the sentences and the third dimension corresponds to the features that will be used to determine the probabilities of the possible words that fill in each place holder. In this case the softmax will not accept the 3D tensor.

Assuming that the probabilities to output are exclusively dependent on their corresponding features and that features are not shared among different "place holders", the solution is to reshape your 3D tensor into a 2D tensor, apply your softmax and then reshape the 2D tensor back into the original 3D tensor shape. Here's an example:

Original 3D tensor:
[ [ [111,112], [121,122] ], [ [211,212], [221,222] ] ]

Reshaped 2D tensor:
[ [111,112], [121,122], [211,212], [221,222] ]

Applied softmax:
[ [111',112'], [121',122'], [211',212'], [221',222'] ]

Reshaping back to 3D:
[ [ [111',112'], [121',122'] ], [ [211',212'], [221',222'] ] ]

It would be nice if this is done automatically behind the scene by Theano. In the mean time, here is a snippet to help you:

import theano
import theano.tensor as T

X = T.tensor3()
(d1,d2,d3) = X.shape
Y = T.nnet.softmax(X.reshape((d1*d2,d3))).reshape((d1,d2,d3))

Here is what happens step by step:
print(X.reshape((d1*d2,d3)).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))

>>> [[ 1.  2.]
     [ 1.  3.]
     [ 1.  4.]
     [ 1.  5.]]

print(T.nnet.softmax(X.reshape((d1*d2,d3))).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))

>>> array([[ 0.26894142,  0.73105858],
           [ 0.11920292,  0.88079708],
           [ 0.04742587,  0.95257413],
           [ 0.01798621,  0.98201379]])

print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))

>>> [[[ 0.26894142  0.73105858]
      [ 0.11920292  0.88079708]]

    [[ 0.04742587  0.95257413]
     [ 0.01798621  0.98201379]]]

The categorical_crossentropy function can be used in the same way:

import theano
import theano.tensor as T

X = T.tensor3()
S = T.imatrix()
(d1,d2,d3) = X.shape
(e1,e2) = S.shape
Y = T.nnet.categorical_crossentropy( T.nnet.softmax(X.reshape((d1*d2,d3))), S.reshape((e1*e2,)) ).reshape((d1,d2))

And this is how it is used:
print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]], S: [[0,1],[1,0]] }))

>>> array([[ 1.31326169,  0.12692801],
           [ 0.04858735,  4.01814993]])

Where "S" is choosing which probability to apply negative log to in each softmax, for example the corresponding number in "S" for [1,2] is 0 so we choose the first probability the comes out of it, which is 0.26894142, to which we apply negative log, that is, -ln(0.26894142) = 1.31326169. Similarly, the corresponding number in "S" for [1,4] is 1 so we choose the second probability which is 0.95257413 from which we perform -ln(0.95257413) = 0.04858735.

Tuesday, December 27, 2016

Bernoulli vs Binomial vs Multinoulli vs Multinomial distributions

Bernoulli distribution
Say you are flipping a coin that has a probability of 0.4 of turning up heads and 0.6 of turning up tails. The simplest probability distribution there is is the Bernoulli distribution which just asks for the probability of the coin turning up heads after one toss, that is, 0.4. That's all. The Bernoulli distribution is only about the probability of a random event with two possible outcomes occurring after one trial. It's used for success/fail type random variables where you want to reason about the probability that a random variable will succeed (such as a coin turning up heads).

The probability mass function for the Bernoulli distribution is:
f(x;p) = p^x * (1 - p)^(1-x)
where "*" is multiplication and "^" is power, "p" is the probability of a success and "x" is the number of successes desired, which can be either 1 or 0. When "x" is 1 then you find the probability of a success whilst when it is 0 then you find the probability of a fail. Since there are only two outcomes, if "p" is the probability of a success then the probability of a non-success is "1-p". Notice that this function is quite silly really as it is just a closed form expression for choosing either "p" or "1-p" depending on what "x" is. The number you're interested in is raised to the power of 1 whilst the number you're not interested in is raised to the power of 0, turning it into 1, and then the two results are multiplied together.

Binomial distribution
One generalization we can make upon the Bernoulli distribution is to consider multiple trails instead of just one and then ask about the probability that a number of successes would occur. This is called the binomial distribution. For example, what is the probability that the coin described above results in exactly 2 heads after tossing it 3 times. There are many ways how this result can come about. You can get [head, head, tail], or [head, tail, head], or [tail, head, head]. For this reason we need to consider the number of different ways the required total can be achieved, which is found by using the combination formula.

The probability mass function for the Binomial distribution is:
f(x;n,p) = n!/(x! * (n-x)!) * p^x * (1 - p)^(n-x)
where "*" is multiplication and "^" is power, "p" is the probability of a success, "n" is the number of trials to try out, and "x" is the number of desired successes out of the "n" trials.

Multinoulli distribution
Whereas the binomial distribution generalises the Bernoulli distribution across the number of trials, the multinoulli distribution generalises it across the number of outcomes, that is, rolling a dice instead of tossing a coin. The multinoulli distribution is also called the categorical distribution.

The probability mass function for the multinoulli distribution is:
f(xs;ps) = ps_1^xs_1 * ps_2^xs_2 * ... * ps_k^xs_k
where "*" is multiplication and "^" is power, "k" is the number of outcomes (6 in the case of a dice, 2 for a coin), "ps" is the list of "k" probabilities where "ps_i" is the probability of the ith outcome resulting in a success, "xs" is a list of numbers of successes where "xs_i" is the number of successes desired for the ith outcome, which can be either 1 or 0 and where there can only be exactly a single "1" in "xs".

Multinomial distribution
Finally the most general generalisation of the Bernoulli distribution is across both the number of trials and the number of outcomes, called the multinomial distribution. In order to be more general, this distribution also allows specifying the number of successes desired for each individual outcome, rather than just of one outcome like the multinoulli distribution. This lets us calculate the probability of rolling a dice 6 times and getting 3 "2"s, 2 "3"s and a "5". Since we want it to be compatible with the binomial distribution, these outcomes can arrival in any order, that is, it doesn't matter whether you get [5,3,2,3,2,2] or [2,2,2,3,3,5]. For this reason we need to use the combination function again.

The probability mass function for the multinomial distribution is:
f(xs;n,ps) = n!/(xs_1! * xs_2! * ... * xs_k!) * ps_1^xs_1 * ps_2^xs_2 * ... * ps_k^xs_k
where "*" is multiplication and "^" is power, "k" is the number of outcomes (6 in the case of a dice, 2 for a coin), "ps" is the list of "k" probabilities where "ps_i" is the probability of the ith outcome resulting in a success, "xs" is a list of numbers of successes where "xs_i" is the number of successes desired for the ith outcome, the total of which must be "n".

Wednesday, November 30, 2016

Checking if a number is prime: when to stop checking potential factors

Let's say you want to find out whether the number 888659800715261 is a prime number. If this were a programming exercise, a naive solution would be to go through all the numbers from 2 to 888659800715261/2 (half the number) and check whether any of the numbers divide 888659800715261 evenly. If none of them do, then the number is prime. The reason why we stop at half the number is because the largest factor that a number "n" can have before "n" itself is n/2, the one before that is n/3, and so on. Sure, we don't need to check every single number between 2 and n/2 as it would be enough to only check the prime numbers between 2 and n/2, but the point is that as soon as we reach n/2, we can stop checking and declare that 888659800715261 is prime.

The problem is that half of 888659800715261 is 444329900357630 which is still too big to check each number between 2 and it, probably even if you only check the prime numbers. The question is, is there an even lower upper-bound where we can stop checking potential factors in order to be sure that the number has no factors?

Let's look at which numbers between 1 and 24 are factors of 24:
[1] [2] [3] [4] 5 [6] 7 [8] 9 10 11 [12] 13 14 15 16 17 18 19 20 21 22 23 [24]

In order for a number to be a factor of 24, there must be another factor which when the two factors are multiplied together give 24. For example, if 2 is a factor of 24, then there must be another factor of 24 which when multiplied by 2 gives 24. This other factor is 12. Let's call 2 and 12 co-factors. The co-factor of 3 is 8, the co-factor of 4 is 6. We can imagine co-factors being mirror reflections of each other:

[1] [2] [3] [4] 5 [6] 7 [8] 9 10 11 [12] 13 14 15 16 17 18 19 20 21 22 23 [24]
 (   [   {   <     >     }           ]                                     )

Notice how all the co-factors nest each other. The co-factors 4 and 6 are between the co-factors 3 and 8 which in turn are between the co-factors 2 and 12 which in turn are between 1 and 24. It seems like there is a line of reflection at 5, where all the factors smaller than 5 have co-factors larger than 5. This means that beyond 5, all factors are co-factors of smaller numbers.

Let's see another list of factors for 16:
[1] [2] 3 [4] 5 6 7 [8] 9 10 11 12 13 14 15 [16]
 (   [    { }        ]                       )

Notice how 4 is a co-factor with itself in this case, since 4 squared is 16. The line of reflection is crossed at 4 this time, which happens to be the square root of 16. In fact the square root of 24 is 4.898... which is close to 5 (the line of reflection of 24). This is always the case, because the line of reflection occurs where the two co-factors are the same. If the co-factors are not the same then one factor must be smaller than the square root whilst the other must be larger. If both are smaller or bigger then when multiplied together they will not give the number being factored. This is because if two co-factors "a" and "b" are supposed to equal "n", and both of them are larger than √n, then that would mean that a×b must also be larger than √n×√n, which means that a×b is larger than "n". The area of a rectangle with two large sides cannot be equal to the area of a rectangle with two small sides. Therefore, if the two sides are equal (a square) then in order to change the length of the sides without changing the area of the rectangle you must make one side larger whilst making the other smaller. This proves that the square root must be between any two co-factors.

So what's the point? The point is that if you reach the square root of "n" without having found any factors, then you are guaranteed that there will not be any factors beyond √n because if there are any then they must have co-factors smaller than √n, which you have already confirmed that there aren't any. So the square root of a number is the lower upper-bound to check for factors. Can there be an even lower upper-bound? No, because the number might be a square number and have no other factors apart from its square root, such as 25. Therefore the smallest co-factor a number can have is the square root of the number.

So in order to check if 888659800715261 is a prime number, we only have to check all the numbers up to 29810397 for factors, which is much better than 444329900357630. By the way, there is no number between 2 and 29810397 that evenly divides 888659800715261, which means that it is a prime number.

Sunday, October 30, 2016

Using beam search to generate the most probable sentence

In my last blog post I talked about how to generate random text using a language model that gives the probability of a particular word following a prefix of a sentence. For example, given the prefix "The dog", a language model might tell you that "barked" has a 5% chance of being the next word whilst "meowed" has a 0.3%. It's one thing generating random text in a way that is guided by the probabilities of the words but it is an entirely different thing to generate the most probable text according to the probabilities. By most probable text we mean that if you multiply the probabilities of all the words in the sentence together you get the maximum product. This is useful for conditioned language models which give you different probabilities depending on some extra input, such as an image description generator which accepts an image apart from the prefix and returns probabilities for the next word depending on what's in the image. In this case we'd like to find the most probable description for a particular image.

You might think that the solution is to always pick the most probable word and add it to the prefix. This is called a greedy search and is known to not give the optimal solution. The reason is because it might be the case that every combination of words following the best first word might not be as good as those following the second best word. We need to use a more exploratory search than greedy search. We can do this by thinking of the problem as searching a probability tree like this:

The tree shows a probability tree of which words can follow a sequence of words together with their probabilities. To find the probability of a sentence you multiply every probability in the sentence's path from <start> to <end>. For example, the sentence "the dog barked" has a probability of 75% × 73% × 25% × 100% = 13.7%. The problem we want to solve is how to find the sentence that has the highest probability.

One way to do this is to use a breadth first search. Starting from the <start> node, go through every node connected to it, then to every node connected to those nodes and so on. Each node represents a prefix of a sentence. For each prefix compute its probability, which is the product of all the probabilities on its path from the <start> node. As soon as the most probable prefix found is a complete sentence, that would be the most probable sentence. The reason why no other less probable prefixes could ever result in more probable sentences is because as a prefix grows, its probability shrinks. This is because any additional multiplications with probabilities made to any prefix probability will only make it smaller. For example, if a prefix has a probability of 20% and another word is added to it which has a probability of 99%, then the new probability will be 20% × 99% which is the smaller probability of 19.8%.

Of course a breadth first search is impractical on any language model that has a realistic vocabulary and sentence length since it would take too long to check all the prefixes in the tree. We can instead opt to take a more approximate approach where we only check a subset of the prefixes. The subset would be the top 10 most probable prefixes found at that point. We do a breadth first search as explained before but this time only the top 10 most probable prefixes are kept and we stop when the most probable prefix in these top 10 prefixes is a complete sentence.

This is practical but it's important that the way we find the top 10 prefixes is fast. We can't sort all the prefixes and choose the first 10 as there would be too many. We can instead use a heap data structure. This data structure is designed to quickly take in a bunch of numbers and quickly pop out the smallest number. With this you can insert the prefix probabilities one by one until there are 10 prefixes in it. After that start popping out the smallest probability immediately after inserting a new one in order to only keep the 10 largest ones.

Python provides a library called "heapq" (heap queue) just for this situation. Here is a class that makes use of heapq in order to create a beam of prefix probabilities:

import heapq

class Beam(object):
#For comparison of prefixes, the tuple (prefix_probability, complete_sentence) is used.
#This is so that if two prefixes have equal probabilities then a complete sentence is preferred over an incomplete one since (0.5, False) < (0.5, True)

    def __init__(self, beam_width):
        self.heap = list()
        self.beam_width = beam_width

    def add(self, prob, complete, prefix):
        heapq.heappush(self.heap, (prob, complete, prefix))
        if len(self.heap) > self.beam_width:
    def __iter__(self):
        return iter(self.heap)

The code to perform the actual beam search is this:

def beamsearch(probabilities_function, beam_width=10, clip_len=-1):
    prev_beam = Beam(beam_width)
    prev_beam.add(1.0, False, [ '<start>' ])
    while True:
        curr_beam = Beam(beam_width)
        #Add complete sentences that do not yet have the best probability to the current beam, the rest prepare to add more words to them.
        prefix_batch = list()
        prob_batch = list()
        for (prefix_prob, complete, prefix) in prev_beam:
            if complete == True:
                curr_beam.add(prefix_prob, True, prefix)
                #Get probability of each possible next word for the incomplete prefix.
                for (next_prob, next_word) in probabilities_function(prefix):
                    if next_word == '<end>': #if next word is the end token then mark prefix as complete and leave out the end token
                        curr_beam.add(prefix_prob*next_prob, True, prefix)
                    else: #if next word is a non-end token then mark prefix as incomplete
                        curr_beam.add(prefix_prob*next_prob, False, prefix+[next_word])
        (best_prob, best_complete, best_prefix) = max(curr_beam)
        if best_complete == True or len(best_prefix)-1 == clip_len: #if most probable prefix is a complete sentence or has a length that exceeds the clip length (ignoring the start token) then return it
            return (best_prefix[1:], best_prob) #return best sentence without the start token and together with its probability
        prev_beam = curr_beam

"probabilities_function" returns a list of word/probability pairs given a prefix. "beam_width" is the number of prefixes to keep (so that instead of keeping the top 10 prefixes you can keep the top 100 for example). By making the beam search bigger you can get closer to the actual most probable sentence but it would also take longer to process. "clip_len" is a maximum length to tolerate, beyond which the most probable prefix is returned as an incomplete sentence. Without a maximum length, a faulty probabilities function which does not return a highly probable end token will lead to an infinite loop or excessively long garbage sentences.

Wednesday, September 28, 2016

Text generation using a language model

A language model is something that tells you the probability of a sequence of words. For example it tells you that the sequence "a dog is barking" is more probable than "a dog is was". This probability can then be used to generate text by selecting a sequence of words that is probable.

One way to do this is to take a huge amount of text and count how often each 4 word segment occurs, then divide by the total amount of 4 word segments. For example, you find that in your huge text, "a dog is barking" to occurs 42 times and divide it by the amount of 4 word segments in the text. This will give you an approximate probability of "a dog is barking".

Usually language models give you the probability of a particular word following a prefix of a sentence. For example given the prefix "a dog is", what is the probability that the next word in the prefix is "barking"? This is expressed mathematically as P("barking" | "a dog is"). This is what neural network language models do. You give them a prefix of words and they output a number for each known word, where the number is the probability of that word following the prefix.

Once you have a probability distribution for all the words in your vocabulary, you can then pick the word with the highest probability to continue the incomplete sentence. If you do this for every word, you will generate a whole sentence but you will always generate the same sentence. In order to generate a random sentence, we need to choose a random next word, but in such a way that highly probable words are more likely to be chosen in order to avoid generating gibberish.

The softmax function
Given a set of numbers, such as frequencies, we can turn these numbers into probabilities by using the maximum likelihood estimation as described above. This is when you divide a frequency by the total. However this isn't the best way to do things. First of all this assumes that words that don't occur in the text have a probability of zero, which is unlikely to be the case as it's more likely that the word just has a very small probability than a probability of zero. Second of all, this requires the number to be frequencies, that is, to be all greater than or equal to zero, which might be too limiting. A neural network's outputs can be used to quantify how strongly it believes a particular word should follow a prefix, but this quantity might be negative in order to avoid being limited by a lower bound. Thirdly, we might want to skew the probabilities so that the most probable word has a higher probability and the least probable word has a lower probability or vice versa. This is done so that when you're selecting words based on their probability you either bias the selection towards higher probability words or you go the other way and make the probabilities more similar in order to generate more random looking text.

To address all these concerns we can use a softmax function to convert the quantities into probabilities which uses a Boltzmann distribution. Given a list of "n" quantities called "q", the probability of the kth quantity is:
e^(q_k/T) / sum(e^(q_i/T) for i in 1..n)
where "T" is called the temperature and is used to control how similar the probabilities should be (by default it is 1).

For example if your quantities are [0, 1, -1], then for temperature 1 the probabilities are:

0:  e^(0/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.245
1:  e^(1/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.665
-1: e^(-1/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.09

Notice how the probabilities and all valid (non-negative and sum to 1), how by raising "e" to the power of the quantity makes it always greater than zero and see what happens when the temperature is set to a larger number:

0:  e^(0/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.307
1:  e^(1/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.506
-1: e^(-1/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.186

See how more similar the probabilities are now with a higher temperature?

The nice thing about the temperature is that even if you have already calculated the probabilities using temperature 1 and have lost the original quantities, you can still change the temperature of the probabilities as follows:

Given probabilities "p", new probability "p'_k" based on temperature "T" are
p'_k = p_k^(1/T) / sum(p_i^(1/T) for i in 1..n)

From the above examples,
0:  0.245^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.307
1:  0.665^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.506
-1: 0.09^(1/2)  / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.186

The roulette wheel selection
Now how do we select a word using it's probability? There is an algorithm called Roulette wheel selection which does this. The idea is to put all the probabilities next to each other on a number line and then select a point on the number line at random. The word whose probability contains the point is chosen. Here's a diagram showing the probabilities 0.6, 0.3 and 0.1 next to each other on a number line and a random red point is placed on the number line in the "word 1" region meaning that word 1 was selected:

To do this, we generate a random number between 0 and 1 (the random point), and then start adding probabilities (in any order) until the total becomes larger than the random number. The word belonging to the last probability added is the one that is chosen. Here's a Python 3 implementation:

def select(probs):
    r = random.random() #random number between 0 and 1
    total = 0.0
    for i in range(len(probs)):
        total += probs[i]
        if total > r: #important to use > and not >= in order to avoid selecting words with a probability of zero
            return i
    #a words must have been selected at this point or else the probabilities are invalid (either negative or do not sum to 1
    raise ValueError('Invalid probabilities')