Before we begin, it is important to understand that there is no simple way to make a complex point and click game in PowerPoint where actions in one room affect what happens in other rooms. You can only make games where everything happens in the same room: you can have different rooms, but all the puzzles of a room must be solved within that room. This isn't so bad; there are existing games that take you through several levels, each with a small puzzle to solve. You also cannot have an inventory or pass codes without going into a lot of complexity that would make more sense in actual game programming. That said, it can still be a creative use of PowerPoint for children, so here is how you do it.

The heart of a point and click game is the discovery of items that can be used to get past obstacles: you find a key and then you open a door. In PowerPoint this can be done using transparent boxes placed in front of clickable objects. When you click on a key, a trigger event will play an exit animation on the transparent box, making whatever was behind it available for clicking. Let's begin.

Start by drawing a key and a door. I'm using a star to represent a key, and I put a star shape on the door to represent a keyhole. Also add a second slide with the next room that lies beyond the door. I just put a smiley on the next slide to show that the game has ended.

Next, make the door take you to the next slide when clicked. This is done using a hyperlink: right click on the door, choose Hyperlink, then Place in This Document, Next Slide, OK. Now when you start the presentation, clicking on the door will take you to the next slide.

Of course, clicking anywhere in the slide will also advance to the next slide, and we don't want that: you should only be able to reach the next slide by clicking on the door. Go to the Transitions tab and untick "Advance Slide: On Mouse Click".

Now put a box over the door, covering it completely, and make the box transparent (not "no fill"). The door is no longer clickable because there is a box over it. Right click the box, choose Format Shape, Fill, and set Transparency to 100%. I left the border there so that you can see where the box is, but you can remove the outline so that it looks better.

Now we want the key to disappear when it is clicked. Click on the key, go to Animations, Add Animation, and choose an exit animation such as Fade. Open the Animation Pane: you'll see the animation you just created along with the automatically assigned name of the key shape. In my case the name is "4-Point Star 5".

At the moment the key will disappear as soon as you click anywhere, not only when you click on the key. To fix this, click on the key's animation (the orange box with a "1" in it next to the key), go to Trigger, On Click Of, and choose the name of the key shape you saw in the Animation Pane. Now the key's exit animation will only start when you click on the key itself.

When the key is picked up, the door should become usable. This means that we want the transparent box over the door to disappear. Give the transparent box the same triggered exit animation that we gave the key, triggered by clicking on the key. I used Disappear rather than Fade for the box so that there is no delay and the transparent box is removed immediately.

As things stand, we have two exit animations: one for the key and one for the transparent box. However, these two animations will not happen together; each waits for its own click on the key. To make both animations happen together, just click on the transparent box's animation and choose Start With Previous instead of Start On Click.

That's it. Your one-key-one-door point and click game is ready. Start the presentation by pressing F5 and see what happens. Notice that you cannot click on anything other than the key. After clicking on the key it will disappear, and then you can click on the door, which takes you to the next room.

This basic mechanic that I just described can be used to open chests that contain other keys, kill dragons that guard lairs, and turn off booby traps. You can hide keys behind objects that have a motion animation when clicked, so that moving an object reveals the key behind it. You can even have multiple keys, each of which is accessible only after getting the previous one, with the final key being the one that opens the door. You can also have multiple rooms, where the next slide is another room with its own locked door. Can you think of other things to make an interesting point and click game in PowerPoint?

# Geeky is Awesome

To learn is to teach.

## Friday, July 14, 2017

## Sunday, June 25, 2017

### Display all outputs in Python immediately when in cmd/terminal

When running a Python program in the terminal, it sometimes happens that there is a long delay before I see some print statement's output. Sometimes I have to wait until the program finishes.


If it's just one print statement you're concerned with then you only need to import the "sys" module and call "sys.stdout.flush()" in order to force the program to show anything that has been printed but is still hidden in the buffer.

```python
import sys

print('bla bla bla')
sys.stdout.flush()
```

But most of the time you want this to happen everywhere, not just after some particular point in the program. To force Python to always show whatever is printed, just add the command line option "-u" when running python.

> python -u myscript.py

You can read more about it here:

https://docs.python.org/3.6/using/cmdline.html#cmdoption-u
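As a side note (this is not in the snippet above), since Python 3.3 the print function can also do the flushing itself via its "flush" argument, which saves importing "sys" everywhere:

```python
import sys

# Option 1: flush the stream explicitly after printing.
print('half way there')
sys.stdout.flush()

# Option 2 (Python 3.3+): let print() do the flushing for you.
print('done', flush=True)
```

Setting the environment variable PYTHONUNBUFFERED to any non-empty value has the same effect as the "-u" option.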

## Tuesday, May 30, 2017

### Neural language models and how to make them in Tensorflow 1.0

In this blog post I will explain the basics you need to know in order to create a neural language model in Tensorflow 1.0. This tutorial will cover how to create a training set from raw text, how to use LSTMs, how to work with variable length sentences, what teacher forcing is, and how to train the neural network with early stopping.

A neural language model is a neural network that, given a prefix of a sentence, will output the probability that a word will be the next word in the sentence. For example, given the prefix "the dog", the neural network will tell you that "barked" has a high probability of being the next word. This is done by using a recurrent neural network (RNN), such as a long short term memory (LSTM), to turn the prefix into a single vector (that represents the prefix) which is then passed to a softmax layer that gives a probability distribution over every word in the known vocabulary.

In order to still have a prefix before the first word (so that we can predict the first word in the sentence) we will make use of an artificial "<BEG>" word that represents the beginning of a sentence and is placed before every sentence. Likewise we will have an artificial "<END>" word that represents the end of a sentence, in order to be able to predict when the sentence ends.
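As a quick pure-Python illustration (the helper name here is mine, not part of the tutorial's code), every sentence gets wrapped in these two artificial words:

```python
def wrap_sentence(tokens):
    # Surround a tokenized sentence with the artificial edge words so that
    # the first word has a prefix and the end of the sentence can be predicted.
    return ['<BEG>'] + tokens + ['<END>']

print(wrap_sentence(['the', 'dog', 'barked', '.']))
# → ['<BEG>', 'the', 'dog', 'barked', '.', '<END>']
```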

It is important to note that the words are presented to the neural network as vectors and not as actual strings. The vectors are called word "embeddings" and are derived from a matrix where each row stands for a different word in the vocabulary. This matrix is trained just like the neural network's weights in order to provide the best vector representation for each word.

The probabilities are not given explicitly during training. Instead we just expose the neural network to complete sentences taken from a corpus (we will use the Brown corpus from NLTK) and let it learn the probabilities indirectly. Before training, each word will have an approximately equal probability given any prefix. The training procedure works by inputting a prefix of a sentence at one end of the neural network and then forcing it to increase the probability of the given next word in the sentence by a little bit. Since the probabilities of all the words in the output must add up to 1, increasing the probability of the given next word will also decrease the probability of all the other words by a little bit. By repeatedly increasing the probability of the correct word, the distribution will accumulate at the correct word.

Of course given a prefix there will be several words that can follow and not just one. If each one is increased just a bit in turn repeatedly, all the known correct next words will increase together and all the other words will decrease, forcing the probability distribution to share the peak among all the correct words. If the correct words occur at different frequencies given the prefix, the most frequent word will get the most probability (due to increasing its probability more often) whilst the rest of the correct words will get a fraction of that probability in proportion to their relative frequencies.

The most common way to train neural language models is not actually to use a training set of prefixes and next words, but to use full sentences. Upon being inputted with a sequence of vectors, an RNN will output another sequence of vectors. The nth output vector represents the prefix of the input sequence up to the nth input.

This means that we can present the neural network with a sentence, which gets turned into a sequence of vectors by the RNN, each of which gets turned into a probability distribution by the softmax. The training objective is to make the neural network predict which is the next word in the sentence after every prefix (including the end of sentence token at the end). This is done with all the sentences in the corpus so that the neural network is forced to extract some kind of pattern in the language of the sentences and be able to predict the next word in any sentence prefix.

If we always provide a correct prefix during training, then we are using a training method called "teacher forcing", where the neural network is only ever trained on correct prefixes. This is the simplest method (and the method we will be using in this blog post), but it introduces a bias, because once deployed the network will not always be given correct prefixes. Let's say that the neural network is going to be used to generate sentences by, for example, picking the most probable word given the prefix and then adding the chosen word to the end of the prefix. We can repeatedly do this until we choose the end of sentence token, at which point we should have a complete sentence. The problem with teacher forcing is that if the neural network makes one mistake during the generation process and picks a nonsense word as the most probable next word, then the rest of the sentence will probably also be garbage, since the network was never trained on prefixes containing mistakes.

One way to deal with this is to train not only on prefixes from the training sentences but also on prefixes with some of the words replaced by words chosen by the still-in-training neural network, forcing it to still give a higher probability to the correct next word. This is called scheduled sampling. Another way to deal with this is to take the training prefixes and some generated prefixes (from the still-in-training neural net) and take their vector representations from the RNN. Generative adversarial training is then used to make the RNN represent both groups of vectors similarly. This forces the RNN to be tolerant of prefixes with errors and to represent them in a way that can still lead to correct next words. This is called professor forcing.


The following sections explain the code snippet by snippet; the full listing that you can execute is given at the end of the post.

The first thing we need to do is create the training and validation sets. We will use the Brown corpus from NLTK as data. Since the purpose of this tutorial is to quickly train a neural language model on a normal computer, we will work with a subset of the corpus so that training will be manageable. We will only take sentences that are between 5 and 20 tokens long and only use a random sample of 20000 sentences from this pool. From this we will take a random 10% to be used for the validation set and the rest will be used for the training set.

Next we need to get the vocabulary from the training set. This will consist of all the words that occur frequently in the training set sentences, with the rare words being replaced by an "unknown" token. This allows the neural network to work with out-of-vocabulary words, since they will be represented as "<unk>" and the network will have seen this token in the training sentences. Each vocabulary word is represented by an index, with "0" representing the beginning and end token, "1" representing the unknown token, "2" representing the most frequent vocabulary word, "3" the second most frequent word, and so on.
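Here is a plain-Python sketch of this vocabulary construction, using a made-up three-sentence training set (the full code listing below does the same thing on the Brown corpus):

```python
import collections

# Hypothetical miniature training set, just for illustration.
train_seqs = [['the', 'dog', 'barked', '.'],
              ['the', 'cat', 'meowed', '.'],
              ['the', 'dog', 'slept', '.']]
min_token_freq = 2

token_freqs = collections.Counter(token for seq in train_seqs for token in seq)
vocab = [token for (token, freq) in token_freqs.most_common() if freq >= min_token_freq]

# Index 0 is reserved for the beginning/end token and index 1 for the
# unknown token; frequent words follow in order of decreasing frequency.
token_to_index = {token: i + 2 for (i, token) in enumerate(vocab)}
unknown_index = 1

print([token_to_index.get(token, unknown_index) for token in ['the', 'dog', 'quacked']])
# → [2, 4, 1] ('quacked' is out of vocabulary)
```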

Finally we need to turn all the sentences in the training and validation set into a matrix of indexes where each row represents a sentence. Since different sentences have different lengths, we will make the matrix as wide as the longest sentence and use 0 to pad each sentence into uniform length (pads are added to the end of the sentences). To know where a sentence ends we will also keep track of the sentence lengths. For training we will need to use an input matrix and a target matrix. The input matrix contains sentences that start with the start-of-sentence token in every row whilst the target matrix contains sentences that end with the end-of-sentence token in every row. The former is used to pass as input to the neural net during training whilst the latter is used to tell which is the correct output of each sentence prefix.
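In NumPy this padding scheme looks as follows (the token indexes are made up; 0 is the shared beginning/end token, which also serves as the pad value):

```python
import numpy as np

sentences = [[4, 5], [6, 7, 8], [9]]        # token indexes of three sentences
lens = [len(s) + 1 for s in sentences]      # +1 for the edge token
maxlen = max(lens)

seqs_in  = np.zeros((len(sentences), maxlen), dtype=np.int32)
seqs_out = np.zeros((len(sentences), maxlen), dtype=np.int32)
for (row, s) in enumerate(sentences):
    seqs_in [row, :len(s)+1] = [0] + s      # input rows start with the start token
    seqs_out[row, :len(s)+1] = s + [0]      # target rows end with the end token

print(seqs_in.tolist())   # → [[0, 4, 5, 0], [0, 6, 7, 8], [0, 9, 0, 0]]
print(seqs_out.tolist())  # → [[4, 5, 0, 0], [6, 7, 8, 0], [9, 0, 0, 0]]
```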

The Tensorflow neural network model we shall implement will accept a batch of sentences at once. This means that the input will be a matrix of integers where each row is a sentence and each integer is a word index. Since both the number of sentences and the sentence length are variable, we will use "None" as a dimension size and then get the actual size using the "tf.shape" function. The sentences on their own are not enough as an input: we also need to provide the sentence lengths as a vector, whose length equals the number of rows in the sequence matrix (one entry per sentence). These lengths are used to generate a mask matrix of 0s and 1s, where a "1" indicates the presence of a token in the sequence matrix and a "0" indicates the presence of a pad token. This is generated by the "tf.sequence_mask" function.
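The mask that "tf.sequence_mask" produces can be sketched in NumPy like this (the lengths here are made up):

```python
import numpy as np

def sequence_mask(lengths, maxlen):
    # Row i contains lengths[i] ones followed by zeros, mirroring tf.sequence_mask.
    return (np.arange(maxlen)[None, :] < np.asarray(lengths)[:, None]).astype(np.float32)

print(sequence_mask([3, 1], 4).tolist())
# → [[1.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```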

Now that the inputs are handled we need to process the prefixes of the sequences into prefix vectors using an LSTM. We first convert the token indexes into token vectors. This is done using an embedding matrix where the first row of the matrix is the word vector of the first word in the vocabulary, the second for the second word, and so on. This is then passed to an LSTM that is initialized with zero-vectors for both the cell state and the hidden state. We pass in the sequence lengths to keep the RNN from interpreting pad values.
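The embedding lookup itself is just row indexing; here is a NumPy sketch with random numbers standing in for the trained embedding matrix (the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 6, 4
embeddings = rng.normal(size=(vocab_size, embed_size))  # one row per vocabulary word

seq_in = np.array([[0, 4, 5],
                   [0, 2, 3]])       # batch of token indexes
embedded = embeddings[seq_in]        # what tf.nn.embedding_lookup does
print(embedded.shape)                # → (2, 3, 4): [seq, token, token feature]
```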

Next we need to take the prefix vectors and derive probabilities for the next word in every prefix. Since the prefix vectors are a 3D tensor and the weights matrix is a 2D tensor we have to first reshape the prefix vectors into a 2D tensor before multiplying them together. Following this we will then reshape the resulting 2D tensor back into a 3D tensor and apply softmax to it.
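The reshape-multiply-reshape step can be sketched in NumPy (all sizes here are made up for illustration):

```python
import numpy as np

batch_size, num_steps, state_size, vocab_size = 2, 3, 5, 7
rng = np.random.default_rng(0)
prefix_vectors = rng.normal(size=(batch_size, num_steps, state_size))  # 3D tensor
W = rng.normal(size=(state_size, vocab_size))
b = np.zeros(vocab_size)

# Flatten to 2D, apply the weights, then restore the batch/step structure.
logits = prefix_vectors.reshape(-1, state_size) @ W + b
logits = logits.reshape(batch_size, num_steps, vocab_size)

# Softmax over the vocabulary dimension.
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
predictions = exp / exp.sum(axis=-1, keepdims=True)
print(predictions.shape)  # → (2, 3, 7), with each distribution summing to 1
```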

Now comes the part that has to do with training the model. We need to add the loss function which is masked to ignore padding tokens. We will take the sum crossentropy and apply the Adam optimizer to it. Crossentropy measures how close to 1.0 the probability of the correct next word in the softmax is. This allows us to maximize the correct probability which is done by using Adam, an optimization technique that works using gradient descent but with the learning rate being adapted for every weight. We would also like to be able to save our model weights during training in order to reuse them later. Finally we create a Tensorflow session variable and use it to initialize all the model weights.
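A small NumPy sketch of the masked loss, with a hypothetical one-sentence batch whose third step is padding:

```python
import numpy as np

# Softmax outputs for 1 sentence, 3 steps, 4-word vocabulary (made-up numbers).
predictions = np.array([[[0.70, 0.10, 0.10, 0.10],
                         [0.20, 0.60, 0.10, 0.10],
                         [0.25, 0.25, 0.25, 0.25]]])
targets = np.array([[0, 1, 0]])        # correct next-word index at each step
mask = np.array([[1.0, 1.0, 0.0]])     # 0 marks the pad position

# Cross-entropy of the correct word at every step, with pad positions zeroed out.
correct_probs = np.take_along_axis(predictions, targets[..., None], axis=-1)[..., 0]
total_loss = -(np.log(correct_probs) * mask).sum()
print(round(float(total_loss), 3))  # → 0.868 (= -ln(0.7) - ln(0.6))
```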

The second thing we should do is measure how well the randomly initialised neural net performs in terms of the sum crossentropy. In order to avoid running out of memory, instead of putting all the validation set data as input to the neural net we will break it up into minibatches of a fixed size and find the total loss of all the minibatches together. This final loss value will be placed in a variable called "last_validation_loss" in order to be able to track how the loss is progressing as training goes on. The last line is to save the weights of the neural network in the same folder as the code and give the files (there will be several files with different extensions) the file name "model".

Next we'll do the same thing on the training set, but running the "train_step" operation instead of the "total_loss" operation in order to actually optimize the weights towards a smaller "total_loss" value. It is more beneficial to take randomly sampled minibatches than to break the training set into deterministically chosen groups, so we use an array of indexes called "trainingset_indexes" to determine which training pairs make it into the next minibatch. We randomly shuffle these indexes and then break them into fixed size groups; the indexes in the next group choose the training pairs used for the next minibatch. Following this we again calculate the new loss value on the validation set to see how we're progressing. If the new loss is worse than the previous loss then we stop the training; otherwise we save the model weights and continue training. This is called early stopping. There is also a hard limit on the number of epochs in order to avoid training for too long.
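The shuffled-index scheme can be sketched like this (the sizes are made up; leftover pairs that don't fill a whole minibatch are simply skipped for that epoch):

```python
import random

rand = random.Random(0)
trainingset_size, minibatch_size = 10, 3

# Shuffle indexes rather than the data itself, then slice off fixed-size groups.
trainingset_indexes = list(range(trainingset_size))
rand.shuffle(trainingset_indexes)
minibatches = [trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]
               for i in range(trainingset_size // minibatch_size)]
print(minibatches)  # three disjoint groups of three indexes each
```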

We can now use the trained neural network. We first restore the last saved model weights which are the ones that gave the best validation loss and then we will define two functions: one for getting the probability of a whole sequence and the other for getting the next token after a prefix. "seq_prob" works by getting every prefix's softmax output, getting the probability of each token in the sequence after each prefix and then multiplying them together. "next_tokens" works by passing a prefix to the neural network and only getting the softmax output of the last (longest) prefix. The probabilities and corresponding tokens are returned.


We can extend "next_tokens" to generate whole sentences in one of two ways: generating the most probable sentence or generating a randomly sampled sentence. For the first we use greedy search, which chooses the most probable word given a prefix and adds it to the prefix. This will not necessarily give the most probable sentence, but it should be close (use beam search for a better selection method). For the second we want to choose words at random but in proportion to their probability, so that rare words are rarely chosen. This is called sampling sentences (the probability of sampling a particular sentence is equal to the probability of the sentence), and we do it using roulette wheel selection. In both functions we leave out the unknown token during generation and impose a hard maximum of 100 words.
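The roulette wheel selection used for sampling can be sketched on its own (the distribution here is made up; in the model it would come from the softmax output):

```python
import random

rand = random.Random(0)

def roulette_select(token_probs):
    # Pick a token at random in proportion to its probability:
    # spin a number in [0, 1) and walk the cumulative distribution.
    r = rand.random()
    total = 0.0
    for (prob, token) in token_probs:
        total += prob
        if total >= r:
            return token
    return token_probs[-1][1]  # guard against floating point round-off

dist = [(0.5, 'the'), (0.3, 'a'), (0.2, 'some')]
counts = {token: 0 for (_, token) in dist}
for _ in range(10000):
    counts[roulette_select(dist)] += 1
print(counts)  # counts come out roughly proportional to 5:3:2
```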



Full code listing

This is the full code that you can execute to get a Tensorflow neural language model:

```python
from __future__ import absolute_import, division, print_function, unicode_literals
from builtins import ascii, bytes, chr, dict, filter, hex, input, int, map, next, oct, open, pow, range, round, str, super, zip
import tensorflow as tf
import numpy as np
import random
import timeit
import collections
import nltk

TRAINING_SESSION = True

rand = random.Random(0)
embed_size = 256
state_size = 256
max_epochs = 100
minibatch_size = 20
min_token_freq = 3

run_start = timeit.default_timer()

print('Loading raw data...')
all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]
rand.shuffle(all_seqs)
all_seqs = all_seqs[:20000]

trainingset_size = round(0.9*len(all_seqs))
validationset_size = len(all_seqs) - trainingset_size
train_seqs = all_seqs[:trainingset_size]
val_seqs = all_seqs[-validationset_size:]
print('Training set size:', trainingset_size)
print('Validation set size:', validationset_size)

all_tokens = (token for seq in train_seqs for token in seq)
token_freqs = collections.Counter(all_tokens)
vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))
while token_freqs[vocab[-1]] < min_token_freq:
    vocab.pop()
vocab_size = len(vocab) + 2 # + edge and unknown tokens
token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }
index_to_token = { i+2: token for (i, token) in enumerate(vocab) }
edge_index = 0
unknown_index = 1
print('Vocabulary size:', vocab_size)

def parse(seqs):
    indexes = list()
    lens = list()
    for seq in seqs:
        indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]
        indexes.append(indexes_)
        lens.append(len(indexes_)+1) #add 1 due to edge token
    maxlen = max(lens)
    in_mat = np.zeros((len(indexes), maxlen))
    out_mat = np.zeros((len(indexes), maxlen))
    for (row, indexes_) in enumerate(indexes):
        in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_
        out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]
    return (in_mat, out_mat, np.array(lens))

(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)
(val_seqs_in, val_seqs_out, val_seqs_len) = parse(val_seqs)
print('Training set max length:', train_seqs_in.shape[1]-1)
print('Validation set max length:', val_seqs_in.shape[1]-1)

################################################################
print()
print('Training...')

#Full correct sequence of token indexes with start token but without end token.
seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]
#Length of sequences in seq_in.
seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]
tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])
#Full correct sequence of token indexes without start token but with end token.
seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]
tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))

batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.
num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.

#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.
token_mask = tf.cast(tf.sequence_mask(seq_len, num_steps), tf.float32) #[seq, token]

with tf.variable_scope('prefix_encoder'):
    #Encode each sequence prefix into a vector.

    #Embedding matrix for token vocabulary.
    embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]

    #3D tensor of tokens in sequences replaced with their corresponding embedding vector.
    embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]

    #Use an LSTM to encode the generated prefix.
    init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))
    cell = tf.contrib.rnn.BasicLSTMCell(state_size)
    prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]

with tf.variable_scope('softmax'):
    #Output a probability distribution over the token vocabulary (including the end token).
    W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())
    b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())
    logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])
    predictions = tf.nn.softmax(logits) #[seq, prefix, token]

losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask
total_loss = tf.reduce_sum(losses)

train_step = tf.train.AdamOptimizer().minimize(total_loss)

saver = tf.train.Saver()

sess = tf.Session()

if TRAINING_SESSION:
    sess.run(tf.global_variables_initializer())

    print('epoch', 'val loss', 'duration', sep='\t')

    epoch_start = timeit.default_timer()

    validation_loss = 0
    for i in range(len(val_seqs)//minibatch_size):
        minibatch_validation_loss = sess.run(total_loss, feed_dict={
                seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
                seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
                seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
            })
        validation_loss += minibatch_validation_loss

    print(0, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')

    last_validation_loss = validation_loss

    saver.save(sess, './model')

    trainingset_indexes = list(range(len(train_seqs)))
    for epoch in range(1, max_epochs+1):
        epoch_start = timeit.default_timer()

        rand.shuffle(trainingset_indexes)
        for i in range(len(trainingset_indexes)//minibatch_size):
            minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]
            sess.run(train_step, feed_dict={
                    seq_in:     train_seqs_in [minibatch_indexes],
                    seq_len:    train_seqs_len[minibatch_indexes],
                    seq_target: train_seqs_out[minibatch_indexes],
                })

        validation_loss = 0
        for i in range(len(val_seqs)//minibatch_size):
            minibatch_validation_loss = sess.run(total_loss, feed_dict={
                    seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
                    seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
                    seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
                })
            validation_loss += minibatch_validation_loss

        if validation_loss > last_validation_loss:
            break
        last_validation_loss = validation_loss

        saver.save(sess, './model')

        print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')

    print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')

################################################################
print()
print('Evaluating...')

saver.restore(sess, tf.train.latest_checkpoint('.'))

def seq_prob(seq):
    seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]
    outputs = sess.run(predictions, feed_dict={
            seq_in:  [ [ edge_index ] + seq_indexes ],
            seq_len: [ 1+len(seq) ],
        })[0]
    probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]
    return np.prod(probs)

print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))
print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))
print()

def next_tokens(prefix):
    prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]
    probs = sess.run(predictions, feed_dict={
            seq_in:  [ [ edge_index ] + prefix_indexes ],
            seq_len: [ 1+len(prefix) ],
        })[0][-1]
    token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))
    return token_probs

print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])
print()

def greedy_gen():
    prefix = []
    for _ in range(100):
        probs = sorted(next_tokens(prefix), reverse=True)
        (_, next_token) = probs[0]
        if next_token == '<unk>':
            (_, next_token) = probs[1]
        elif next_token == '<end>':
            break
        else:
            prefix.append(next_token)
    return prefix

print('Greedy generation:', ' '.join(greedy_gen()))
print()

def random_gen():
    prefix = []
    for _ in range(100):
        probs = next_tokens(prefix)
        (unk_prob, _) = probs[unknown_index]
        r = rand.random() * (1.0 - unk_prob)
        total = 0.0
        for (prob, token) in probs:
            if token != '<unk>':
                total += prob
                if total >= r:
                    break
        if token == '<end>':
            break
        else:
            prefix.append(token)
    return prefix

print('Random generation:', ' '.join(random_gen()))
print()
```

Code explanation

This section explains snippets of the code.

Preprocessing

The first thing we need to do is create the training and validation sets. We will use the Brown corpus from NLTK as data. Since the purpose of this tutorial is to quickly train a neural language model on a normal computer, we will work with a subset of the corpus so that training will be manageable. We will only take sentences that are between 5 and 20 tokens long and only use a random sample of 20000 sentences from this pool. From this we will take a random 10% to be used for the validation set and the rest will be used for the training set.

```python
all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]
rand.shuffle(all_seqs)
all_seqs = all_seqs[:20000]

trainingset_size = round(0.9*len(all_seqs))
validationset_size = len(all_seqs) - trainingset_size
train_seqs = all_seqs[:trainingset_size]
val_seqs = all_seqs[-validationset_size:]
```

Next we need to derive the vocabulary from the training set. This will consist of all the words that occur frequently in the training set sentences, with the rare words being replaced by an "unknown" token. This allows the neural network to work with out-of-vocabulary words, as they will be represented as "&lt;unk&gt;" and the network will have seen this token in the training sentences. Each vocabulary word will be represented by an index, with "0" representing the beginning and end token, "1" representing the unknown token, "2" representing the most frequent vocabulary word, "3" the second most frequent word, and so on.

```python
all_tokens = (token for seq in train_seqs for token in seq)
token_freqs = collections.Counter(all_tokens)
vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))
while token_freqs[vocab[-1]] < min_token_freq:
    vocab.pop()
vocab_size = len(vocab) + 2 #+ edge and unknown tokens

token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }
index_to_token = { i+2: token for (i, token) in enumerate(vocab) }
edge_index = 0
unknown_index = 1
```

Finally we need to turn all the sentences in the training and validation set into a matrix of indexes where each row represents a sentence. Since different sentences have different lengths, we will make the matrix as wide as the longest sentence and use 0 to pad each sentence into uniform length (pads are added to the end of the sentences). To know where a sentence ends we will also keep track of the sentence lengths. For training we will need to use an input matrix and a target matrix. The input matrix contains sentences that start with the start-of-sentence token in every row whilst the target matrix contains sentences that end with the end-of-sentence token in every row. The former is used to pass as input to the neural net during training whilst the latter is used to tell which is the correct output of each sentence prefix.

```python
def parse(seqs):
    indexes = list()
    lens = list()
    for seq in seqs:
        indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]
        indexes.append(indexes_)
        lens.append(len(indexes_)+1) #add 1 due to edge token

    maxlen = max(lens)
    in_mat  = np.zeros((len(indexes), maxlen))
    out_mat = np.zeros((len(indexes), maxlen))
    for (row, indexes_) in enumerate(indexes):
        in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_
        out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]
    return (in_mat, out_mat, np.array(lens))

(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)
(val_seqs_in,   val_seqs_out,   val_seqs_len)   = parse(val_seqs)
```

Model definition

The Tensorflow neural network model we shall implement will accept a batch of sentences at once. This means that the input will be a matrix of integers where each row is a sentence and each integer is a word index. Since both the number of sentences and the sentence length are variable we will use "None" as a dimension size and obtain the actual sizes with the "tf.shape" function. The sentences on their own are not enough as an input: we also need to provide the sentence lengths as a vector, with one entry for each row in the sequence matrix. These lengths will be used to generate a mask matrix of 0s and 1s where a "1" indicates the presence of a token in the sequence matrix whilst a "0" indicates the presence of a pad token. This is generated by the "tf.sequence_mask" function.

```python
#Full correct sequence of token indexes with start token but without end token.
seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]

#Length of sequences in seq_in.
seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]
tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])

#Full correct sequence of token indexes without start token but with end token.
seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]
tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))

batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.
num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.

#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.
token_mask = tf.cast(tf.sequence_mask(seq_len, num_steps), tf.float32) #[seq, token]
```

Now that the inputs are handled we need to process the prefixes of the sequences into prefix vectors using an LSTM. We first convert the token indexes into token vectors. This is done using an embedding matrix where the first row of the matrix is the word vector of the first word in the vocabulary, the second for the second word, and so on. This is then passed to an LSTM that is initialized with zero-vectors for both the cell state and the hidden state. We pass in the sequence lengths to keep the RNN from interpreting pad values.

```python
with tf.variable_scope('prefix_encoder'):
    #Encode each sequence prefix into a vector.

    #Embedding matrix for token vocabulary.
    embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]

    #3D tensor of tokens in sequences replaced with their corresponding embedding vector.
    embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]

    #Use an LSTM to encode the generated prefix.
    init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))
    cell = tf.contrib.rnn.BasicLSTMCell(state_size)
    prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]
```

Next we need to take the prefix vectors and derive probabilities for the next word in every prefix. Since the prefix vectors are a 3D tensor and the weights matrix is a 2D tensor we have to first reshape the prefix vectors into a 2D tensor before multiplying them together. Following this we will then reshape the resulting 2D tensor back into a 3D tensor and apply softmax to it.

```python
with tf.variable_scope('softmax'):
    #Output a probability distribution over the token vocabulary (including the end token).
    W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())
    b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())
    logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])
    predictions = tf.nn.softmax(logits) #[seq, prefix, token]
```

Training

Now comes the part that has to do with training the model. We need to add the loss function which is masked to ignore padding tokens. We will take the sum crossentropy and apply the Adam optimizer to it. Crossentropy measures how close to 1.0 the probability of the correct next word in the softmax is. This allows us to maximize the correct probability which is done by using Adam, an optimization technique that works using gradient descent but with the learning rate being adapted for every weight. We would also like to be able to save our model weights during training in order to reuse them later. Finally we create a Tensorflow session variable and use it to initialize all the model weights.

```python
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask
total_loss = tf.reduce_sum(losses)
train_step = tf.train.AdamOptimizer().minimize(total_loss)

saver = tf.train.Saver()

sess = tf.Session()
sess.run(tf.global_variables_initializer())
```

Next we should measure how well the randomly initialised neural net performs in terms of the sum crossentropy. In order to avoid running out of memory, instead of feeding the whole validation set to the neural net at once we will break it up into minibatches of a fixed size and sum the losses of all the minibatches. This final loss value is stored in a variable called "last_validation_loss" so that we can track how the loss progresses as training goes on. The last line saves the weights of the neural network in the same folder as the code under the file name "model" (there will be several files with different extensions).

```python
validation_loss = 0
for i in range(len(val_seqs)//minibatch_size):
    minibatch_validation_loss = sess.run(total_loss, feed_dict={
            seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
            seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
            seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
        })
    validation_loss += minibatch_validation_loss
last_validation_loss = validation_loss

saver.save(sess, './model')
```

Next we'll do the same thing on the training set, except that we'll run the "train_step" operation instead of the "total_loss" operation in order to actually optimize the weights towards a smaller "total_loss" value. It is more beneficial to take randomly sampled minibatches than to break the training set into deterministically chosen groups, so we use an array of indexes called "trainingset_indexes" to determine which training pairs make it into the next minibatch. We randomly shuffle these indexes and then break them into fixed size groups; the indexes in each group select the training pairs used for that minibatch. Following this we again calculate the loss value on the validation set to see how we're progressing. If the new loss is worse than the previous loss then we stop the training; otherwise we save the model weights and continue training. This is called early stopping. A hard limit is set on the number of epochs in order to avoid training for too long.

```python
trainingset_indexes = list(range(len(train_seqs)))
for epoch in range(1, max_epochs+1):
    rand.shuffle(trainingset_indexes)
    for i in range(len(trainingset_indexes)//minibatch_size):
        minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]
        sess.run(train_step, feed_dict={
                seq_in:     train_seqs_in [minibatch_indexes],
                seq_len:    train_seqs_len[minibatch_indexes],
                seq_target: train_seqs_out[minibatch_indexes],
            })

    validation_loss = 0
    for i in range(len(val_seqs)//minibatch_size):
        minibatch_validation_loss = sess.run(total_loss, feed_dict={
                seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
                seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
                seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
            })
        validation_loss += minibatch_validation_loss

    if validation_loss > last_validation_loss:
        break
    last_validation_loss = validation_loss

    saver.save(sess, './model')
```

Application

We can now use the trained neural network. We first restore the last saved model weights which are the ones that gave the best validation loss and then we will define two functions: one for getting the probability of a whole sequence and the other for getting the next token after a prefix. "seq_prob" works by getting every prefix's softmax output, getting the probability of each token in the sequence after each prefix and then multiplying them together. "next_tokens" works by passing a prefix to the neural network and only getting the softmax output of the last (longest) prefix. The probabilities and corresponding tokens are returned.

```python
saver.restore(sess, tf.train.latest_checkpoint('.'))

def seq_prob(seq):
    seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]
    outputs = sess.run(predictions, feed_dict={
            seq_in:  [ [ edge_index ] + seq_indexes ],
            seq_len: [ 1+len(seq) ],
        })[0]
    probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]
    return np.prod(probs)

print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))
print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))
print()

def next_tokens(prefix):
    prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]
    probs = sess.run(predictions, feed_dict={
            seq_in:  [ [ edge_index ] + prefix_indexes ],
            seq_len: [ 1+len(prefix) ],
        })[0][-1]
    token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))
    return token_probs

print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])
print()
```

These are the outputs I got:

```
P(the dog barked.) = 1.71368e-08
P(the cat barked.) = 6.16375e-10
the dog ... [(0.097657956, 'was'), (0.089791521, '<unk>'), (0.058101207, 'is'), (0.055007596, 'had'), (0.02786131, 'could')]
```

We can extend "next_tokens" to generate whole sentences in one of two ways: generating the most probable sentence or generating a randomly sampled sentence. For the first we use greedy search, which repeatedly chooses the most probable word given the prefix and adds it to the prefix. This will not necessarily give the most probable sentence but it should be close (use beam search for a better selection method). For the second we want to choose words at random, but weighted by their probability so that rare words are rarely chosen. This is called sampling sentences (the probability of sampling a particular sentence is equal to the probability of the sentence) and we implement it using roulette wheel selection. For both functions we leave out the unknown token during generation and set a hard limit of 100 words.

```python
def greedy_gen():
    prefix = []
    for _ in range(100):
        probs = sorted(next_tokens(prefix), reverse=True)
        (_, next_token) = probs[0]
        if next_token == '<unk>': #skip the unknown token and take the next most probable word instead
            (_, next_token) = probs[1]
        if next_token == '<end>':
            break
        prefix.append(next_token)
    return prefix

print('Greedy generation:', ' '.join(greedy_gen()))
print()

def random_gen():
    prefix = []
    for _ in range(100):
        probs = next_tokens(prefix)
        (unk_prob, _) = probs[unknown_index]
        r = rand.random() * (1.0 - unk_prob) #roulette wheel over all tokens except the unknown token
        total = 0.0
        for (prob, token) in probs:
            if token != '<unk>':
                total += prob
                if total >= r:
                    break
        if token == '<end>':
            break
        else:
            prefix.append(token)
    return prefix

print('Random generation:', ' '.join(random_gen()))
print()
```

These are the outputs I got:

```
Greedy generation: `` i don't know '' .
Random generation: miss the road place , title comes to the seeds of others to many and details under the dominant home
```

## Monday, April 3, 2017

### Getting the top n most probable sentences using beam search

This is a continuation from a previous blog post on single sentence beam search.

Sometimes it is not enough to just generate the most probable sentence using a language model. Sometimes you want to generate the top 3 most probable sentences instead. In that case we need to modify our beam search a bit. We will make the function a generator that returns a sequence of sentences in order of probability instead of just returning a single most probable sentence. Here are the changes we need to make:

In the single sentence version, we were getting the most probable prefix in the current beam and checking if it is complete. If it was, we returned it and stopped there. Now we will instead not stop until the current beam is empty (or until the caller stops requesting more sentences). After returning the most probable prefix we check the second most probable prefix and keep on returning complete prefixes until we either find one which is not complete or we have returned the whole beam. If we return the whole beam then the algorithm stops there, as there is nothing left from which to generate new prefixes; this means that the beam width puts a limit on the number of sentences that can be returned. If we do not return the whole beam then we continue generating prefixes with the remainder.

In the case that some complete sentences were returned, they also need to be removed from the beam before we continue generating. Since the beam is implemented as a min-first heap queue (min-first because we want to quickly pop the least probable prefix when the beam grows bigger than the beam width), we cannot also quickly remove the most probable complete sentence. To do this, we first turn the heap into a list sorted by probability and pop items off the end for as long as they are complete sentences. Following this, we heapify the list back into a min-first heap queue and continue as normal. This sorting and re-heapifying should not impact performance too much if the beam width is relatively small.
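The heap manipulation just described can be sketched in isolation. The probabilities and prefixes below are made up for illustration; only the sorted-pop-reheapify pattern matters:

```python
import heapq

# A min-first heap of (probability, is_complete, prefix) tuples, as in the beam.
heap = [(0.2, True, ['a']), (0.5, True, ['b']), (0.4, True, ['c']), (0.1, False, ['d'])]
heapq.heapify(heap)

# Turn the heap into a list sorted by probability (ascending).
sorted_beam = sorted(heap)

# Pop complete sentences off the high-probability end until an incomplete one is found.
yielded = []
while sorted_beam and sorted_beam[-1][1]:
    yielded.append(sorted_beam.pop())

# Re-heapify whatever remains and continue as normal.
heapq.heapify(sorted_beam)

print(yielded)     # complete sentences, most probable first
print(sorted_beam) # remaining incomplete prefixes, a min-first heap again
```

Here all three complete sentences come off in descending probability order, leaving only the incomplete prefix behind.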

If the clip length is reached then the whole beam is immediately returned in order of probability. This is because as soon as one prefix is equal to the allowed maximum then that means that the entire beam consists of


- incomplete sentences that are also as long as the allowed maximum (since all the prefixes grow together)
- complete sentences that were found before but which do not have a maximum probability

Here is the modified Python 3 code:

```python
import heapq

class Beam(object):

    def __init__(self, beam_width, init_beam=None):
        if init_beam is None:
            self.heap = list()
        else:
            self.heap = init_beam
            heapq.heapify(self.heap) #make the list a heap
        self.beam_width = beam_width

    def add(self, prob, complete, prefix):
        heapq.heappush(self.heap, (prob, complete, prefix))
        if len(self.heap) > self.beam_width:
            heapq.heappop(self.heap)

    def __iter__(self):
        return iter(self.heap)

def beamsearch(probabilities_function, beam_width=10, clip_len=-1):
    prev_beam = Beam(beam_width)
    prev_beam.add(1.0, False, [ '<start>' ])

    while True:
        curr_beam = Beam(beam_width)

        #Add complete sentences that do not yet have the best probability to the current beam, the rest prepare to add more words to them.
        for (prefix_prob, complete, prefix) in prev_beam:
            if complete == True:
                curr_beam.add(prefix_prob, True, prefix)
            else:
                #Get probability of each possible next word for the incomplete prefix.
                for (next_prob, next_word) in probabilities_function(prefix):
                    if next_word == '<end>': #if next word is the end token then mark prefix as complete and leave out the end token
                        curr_beam.add(prefix_prob*next_prob, True, prefix)
                    else: #if next word is a non-end token then mark prefix as incomplete
                        curr_beam.add(prefix_prob*next_prob, False, prefix+[next_word])

        sorted_beam = sorted(curr_beam) #get all prefixes in current beam sorted by probability
        any_removals = False
        while True:
            (best_prob, best_complete, best_prefix) = sorted_beam[-1] #get highest probability prefix
            if best_complete == True or len(best_prefix)-1 == clip_len: #if most probable prefix is a complete sentence or has a length that exceeds the clip length (ignoring the start token) then yield it
                yield (best_prefix[1:], best_prob) #yield best sentence without the start token and together with its probability
                sorted_beam.pop() #remove the yielded sentence and check the next highest probability prefix
                any_removals = True
                if len(sorted_beam) == 0: #if there are no more sentences in the beam then stop checking
                    break
            else:
                break

        if any_removals == True: #if there were any changes in the current beam then...
            if len(sorted_beam) == 0: #if there are no more prefixes in the current beam (due to clip length being reached) then end the beam search
                break
            else: #otherwise set the previous beam to the modified current beam
                prev_beam = Beam(beam_width, sorted_beam)
        else: #if the current beam was left unmodified then assign it to the previous beam as is
            prev_beam = curr_beam
```

## Monday, March 27, 2017

### Word embeddings: How word2vec and GloVe work

Word embeddings are vectors that represent words. For example the word "dog" might be represented as [0.1, -2.1, 1.2] whilst "cat" might be represented as [0.2, 2.4, 1.1]. These vectors are important in neural networks because neural networks can only work with continuous numbers whereas words are discrete symbols.

One way to represent words as vectors is by taking a finite and fixed vocabulary of words that you want to work with, putting the words in the vocabulary in a certain order such as alphabetical, and then using one hot vectors to represent the words. A one hot vector is a vector consisting of zeros everywhere and a single one somewhere, such as [0, 0, 1, 0]. The position of the one indicates the position of the word in the vocabulary. The one hot vector [1, 0, 0] represents the first word in a three word vocabulary. You can then feed this vector as an input to a neural network and continue as usual.

One hot vectors are a very wasteful representation in both space and time. They are wasteful in space because you need a vector as large as your vocabulary to represent each word. They are wasteful in time because when computing activation values of the first layer in the neural network, the weight matrix multiplication is going to involve multiplying by a lot of zeros. This is what the matrix multiplication looks like when the first layer outputs two activation values:

```
(1 0 0) (1 2)
        (3 4)  =  (1*1 + 0*3 + 0*5   1*2 + 0*4 + 0*6)  =  (1 2)
        (5 6)
```

In fact, if you think about it, all you need is to pick the row in the matrix corresponding to where the one is in the one hot vector. What we usually do in fact is to have an "embedding layer" which is just a matrix with as many rows as the vocabulary size and as many columns as you want your embedded vectors to be big. We then simply pick the row according to the word we want to pass to the neural network. This embedding matrix is then optimized together with the rest of the neural network and the selected vectors passed as input to the next layer as usual.
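The equivalence between multiplying by a one hot vector and simply picking a row is easy to verify with toy numbers (the matrix below is made up; the helper names are hypothetical):

```python
# Embedding matrix: 3-word vocabulary, 2-dimensional embeddings.
M = [[1, 2],
     [3, 4],
     [5, 6]]

def onehot_matmul(i, M):
    # Multiply a one-hot row vector (one at position i) by the matrix.
    onehot = [1 if j == i else 0 for j in range(len(M))]
    return [sum(onehot[j]*M[j][col] for j in range(len(M))) for col in range(len(M[0]))]

def row_lookup(i, M):
    # An embedding layer just picks out row i directly.
    return M[i]

# Both give the same embedding vector for every word index.
for i in range(3):
    assert onehot_matmul(i, M) == row_lookup(i, M)

print(onehot_matmul(0, M))  # [1, 2]
```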

This has an interesting result. The vectors tend to become similar for similar words, that is, the more similar two words are, the larger the cosine similarity of their corresponding vectors. There are also some other interesting properties such as analogies (asking "man is to king as woman is to...?") and odd word out exercises (which word from these five does not fit with the others?). It seems that the vectors are not just made different for different words but are made different to different degrees according to the differences between the words. If two words are used similarly in certain contexts but not others (such as being similar in gender but not in other cases), then that similarity will be encoded in the vectors. You can see more about these properties in https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.

Of course there are systems for creating generic word embeddings that are useful for many natural language processing applications. The two most popular generic embeddings are word2vec and GloVe.

word2vec comes in one of two flavours: the continuous bag of words model (CBOW) and the skip-gram model. CBOW is a neural network that is trained to predict which word fits in a gap in a sentence. For example, given the partial sentence "the cat ___ on the", the neural network predicts that "sat" has a high probability of filling the gap. The order of the words is not important: given four words, the neural network predicts the word in the middle. The embedding layer of the neural network is used to represent words in general. The skip-gram model works the other way round: given a middle word it predicts which words surround it. Of course by "predicts" we mean that it outputs a probability for each word in the vocabulary. These tasks are meant to force the neural network to create embeddings that reflect the relationship between words, hence bestowing them with meaning.
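A minimal sketch of how the two flavours carve a sentence into training examples. The window size and sentence are made up, and real word2vec adds much more (subsampling, negative sampling, and so on); this only shows the shape of the training pairs:

```python
def cbow_pairs(tokens, window=2):
    # CBOW: orderless context words -> middle word.
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i-window):i] + tokens[i+1:i+1+window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    # Skip-gram: middle word -> each surrounding word.
    pairs = []
    for i, center in enumerate(tokens):
        for neighbour in tokens[max(0, i-window):i] + tokens[i+1:i+1+window]:
            pairs.append((center, neighbour))
    return pairs

sent = ['the', 'cat', 'sat', 'on', 'the']
print(cbow_pairs(sent)[2])      # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram_pairs(sent)[:2]) # [('the', 'cat'), ('the', 'sat')]
```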

GloVe takes a different approach. Instead of extracting the embeddings from a neural network that is designed to perform a surrogate task (predicting neighbouring words), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words occur near each other (within 5 words of each other, for example). For example if "dog" and "cat" occur near each other 10 times in a corpus, then vec(dog) dot vec(cat) = log(10). This forces the vectors to somehow encode the frequency distribution of which words occur near them.
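The GloVe objective can be illustrated numerically. The co-occurrence count and the vectors below are invented, and the real model also includes bias terms and a weighting function on the counts; this only shows what "dot product equals log count" means:

```python
import math

# Suppose "dog" and "cat" co-occur 10 times in the corpus.
cooccurrence = 10

# Hypothetical trained vectors whose dot product is near log(10).
vec_dog = [1.0, 0.5, 0.7]
vec_cat = [1.2, 0.9, 0.8]

dot = sum(a*b for a, b in zip(vec_dog, vec_cat))  # 1.2 + 0.45 + 0.56 = 2.21
target = math.log(cooccurrence)                   # ~2.3026

# Training pushes the squared error between the two towards zero.
error = (dot - target)**2
print(round(dot, 2), round(target, 2), round(error, 3))
```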

Which is better? It probably depends on your data. The nice thing about both of these methods is that you can train your own word vectors based on your own corpus so that the meaning that gets encoded into the vectors becomes specific to your own domain.

Download word2vec code from here.

Download GloVe code from here.

One way to represent words as vectors is by taking a finite and fixed vocabulary of words that you want to work with, put the words in the vocabulary in a certain order such as alphabetical, and then use one hot vectors to represent the words. A one hot vector is a vector consisting of zeros everywhere and single one somewhere, such as [0, 0, 1, 0]. The position of the one indicates the position of the word in the vocabulary. The one hot vector [1, 0, 0] represents the first word in a three word vocabulary. You can then feed this vector as an input to a neural network and continue as usual.

One hot vectors are a very wasteful representation in both space and time. They are wasteful in space because you need a vector as large as your vocabulary to represent each word. They are wasteful in time because when computing activation values of the first layer in the neural network, the weight matrix multiplication is going to involve multiplying by a lot of zeros. This is what the matrix multiplication looks like when then first layer outputs two activation values:

(1 0 0) (1 2) = (1*1 + 0*3 + 0*5 1*2 + 0*4 + 0*6) = (1 2) (3 4) (5 6)

In fact, if you think about it, all you need is to pick the row in the matrix corresponding to where the one is in the one hot vector. What we usually do in fact is to have an "embedding layer" which is just a matrix with as many rows as the vocabulary size and as many columns as you want your embedded vectors to be big. We then simply pick the row according to the word we want to pass to the neural network. This embedding matrix is then optimized together with the rest of the neural network and the selected vectors passed as input to the next layer as usual.

This has an interesting result. The vectors tend to become similar for similar words, that is, the more similar two words are, the larger the cosine similarity of their corresponding vectors. There are also some other interesting properties such as analogies (asking "man is to king as woman is to...?") and odd word out exercises (which word from these five does not fit with the others?). It seems that the vectors are not just made different for different words but are made different to different degrees according to the differences between the words. If two words are used similarly in certain contexts but not others (such as being similar in gender but not in other cases), then that similarity will be encoded in the vectors. You can see more about these properties in https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.

Of course there are systems for creating generic word embeddings that are useful for many natural language processing applications. The two most popular generic embeddings are word2vec and GloVe.

word2vec is based on one of two flavours: The continuous bag of words model (CBOW) and the skip-gram model. CBOW is a neural network that is trained to predict which word fits in a gap in a sentence. For example, given the partial sentence "the cat ___ on the", the neural network predicts that "sat" has a high probability of filling the gap. The order of the words is not important: given four words, the neural network predicts the word in the middle. The embedding layer of the neural network is used to represent words in general. On the other hand, the skip-gram model works in the other way round. Given a middle word it predicts which words surround it. Of course by "predicts" we mean that it outputs a probability for each word in the vocabulary. These tasks are meant to force the neural network to create embeddings that reflect the relationship between words, hence bestowing them with meaning.

GloVe takes a different approach. Instead of extracting the embeddings from a neural network that is designed to perform a surrogate task (predicting neighbouring words), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words will occur near each other (within 5 words for example). For example if "dog" and "cat" occur near each other 10 times in a corpus, then vec(dog) dot vec(cat) = log(10). This forces the vectors to somehow encode the frequency distribution of which words occur near them.
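As a toy numeric illustration of that constraint (the real GloVe objective also includes per-word bias terms and a weighting function, which are omitted here, and all numbers are made up), we can optimize two word vectors directly so that their dot product approaches the log co-occurrence count:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=4)            # vec(dog)
v = rng.normal(size=4)            # vec(cat)
target = np.log(10.0)             # "dog" and "cat" co-occur 10 times

# Gradient descent on the squared error (u.v - log 10)^2.
lr = 0.02
for _ in range(2000):
    err = u @ v - target
    u, v = u - lr * 2 * err * v, v - lr * 2 * err * u

print(u @ v)                      # close to log(10) = 2.302...
```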

Which is better? It probably depends on your data. The nice thing about both of these methods is that you can train your own word vectors based on your own corpus so that the meaning that gets encoded into the vectors becomes specific to your own domain.

Download word2vec code from here.

Download GloVe code from here.

## Tuesday, February 28, 2017

### Neural network deep learning hard attention using REINFORCE / score function estimator / likelihood ratio estimator

One of the things I struggled most to understand in deep learning papers is hard attention. Let's start with a description of what attention is in the first place.

Attention based learning is when you have a sequence input where only part of it is relevant and you want your neural network to explicitly choose what is relevant. For example, if you are translating a sentence from English to French, only some of the words in the English sentence are relevant to generating the next word in the French sentence. So part of your neural network would be dedicated to weighing the importance of each English word in the context of what the neural network intends to generate next in the French sentence. This attention module would work by taking two inputs:

- an embedded word vector from the English sentence
- the current state of the recurrent neural network that is encoding what has been generated up to now

This technique has been used not only for machine translation but also for image captioning (attend to parts of the image for every word being generated) and Neural Turing Machines (attend to parts of the tape with every operation). Just look at what Colah has to say about it. However, this is just soft attention, which is easy. It's called soft attention because you still partially attend to every part of the input; it's just that some parts get very little attention. This means that it's still possible to measure the effect of each part of the input, and so it's still possible to determine, by gradient descent, whether its attention should be increased or decreased.
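A minimal numpy sketch of soft attention (dot-product scoring is assumed here; real translation models usually learn a small scoring network instead): every input position gets a weight, the weights sum to one, and no position's weight is exactly zero, which is what keeps the whole thing differentiable.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
words = rng.normal(size=(6, 8))    # embedded vectors of 6 English words
state = rng.normal(size=8)         # current decoder (RNN) state

scores = words @ state             # one relevance score per word
weights = softmax(scores)          # soft attention: all weights > 0, sum to 1
context = weights @ words          # weighted average fed to the decoder

print(weights.sum())               # 1.0
print((weights > 0).all())         # True: every word gets *some* attention
```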

Hard attention, on the other hand, either completely includes or completely excludes elements of the input. How would you do this in a neural network? You'd need to use something like thresholding, where a neuron outputs a one if its value is above a threshold and a zero otherwise. You could also use argmax, which selects the neuron with the greatest value, sets it to one, and sets all the others to zero. Unfortunately both of these solutions are non-differentiable, which makes direct gradient descent ineffective if they are used in a hidden layer (in an output layer you can just maximize the continuous values and then threshold them at test time). This is the same situation neural networks were in back when they couldn't have a hidden layer, because there was no known optimization algorithm for two layers of thresholding neurons.

It would be nice to somehow make neurons that have discrete outputs but which can still be trained by gradient descent. One way to do this is to use stochastic neurons which output discrete values based on a probability that can be tweaked. The simplest probabilistic neuron is the Bernoulli neuron which outputs a 1 with probability "p" and a 0 otherwise. Let's assume a simple attention based neural network that uses one Bernoulli neuron as an attention module and one normal sigmoid neuron as a feature extraction neuron. The Bernoulli neuron's output is multiplied by the normal neuron's output in order to gate it. Both neurons take the same single number as input.
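A forward pass of this two-neuron network can be sketched as follows (the weights and input value are made-up numbers); the gated output is either zero or the sigmoid neuron's value, depending on the Bernoulli sample:

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w0, b0, w1, b1 = 0.5, -0.2, 1.0, 0.3   # made-up parameters
x = 1.0

p = sig(w0 * x + b0)          # probability that the Bernoulli neuron fires
B = float(rng.random() < p)   # stochastic output: 1 with probability p, else 0
feature = sig(w1 * x + b1)    # the ordinary sigmoid neuron
y = B * feature               # the Bernoulli neuron gates the feature

print(y in (0.0, feature))    # True: output is either 0 or the feature value
```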

The grey neuron with a "B" inside is the Bernoulli neuron, "w0" and "w1" are weights and "b0" and "b1" are biases. If we had to turn this into an equation in order to differentiate it, we'd end up with this:

$$y = B_x \cdot sig(w_1 x + b_1)$$

where "B_x" is a random variable that is dependent on "x". Unfortunately random variables "hide" their inner workings and we cannot express their probability as part of the equation. This means that we cannot optimize "w0" and "b0" by gradient descent, as it would be meaningless to differentiate with respect to "w0" given that it's not in the equation. In fact "B" is treated as a constant in the above equation, and we can still differentiate the equation with respect to all the other parameters. Note that this situation is different from dropout, where you randomly multiply some of the neuron values by 0 in order to avoid overfitting. In the case of dropout, the above equation would be enough, as we're not optimizing the value of the dropout random variable.

There is a simple solution to this problem however. Instead of finding the gradient of "y", we can find the gradient of the expected value of "y". The expected value is the mean value you get when running a stochastic function over and over again. For example, if you're tossing a fair coin, with one face representing a 1 and the other face representing a 0, the expected value is 0.5. However, if you're tossing an unfair coin where the "1" face comes up 75% of the time, then on average you will get a value of 0.75. In general, the expected value of a discrete probability distribution, such as the Bernoulli distribution, can be found using the following equation:

$$E[X] = \sum_{v \in X} p(v) \cdot v$$

that is, just multiply each value by its respective probability and take the sum. In the case of a coin whose "1" face comes up with probability "p" (a Bernoulli distribution), the expected value is $1 \cdot p + 0 \cdot (1-p)$, which equals "p". We can take advantage of the expected value by using gradient descent to minimize the expected error rather than the error itself. This exposes the parameters that determine the probability of the Bernoulli neuron, and we can then use gradient descent as usual.
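The unfair coin example can be checked numerically; the sample mean over many tosses approximates the expected value "p" (0.75 here):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.75                                   # probability of the "1" face
tosses = (rng.random(100000) < p).astype(float)
print(tosses.mean())                       # close to 0.75
```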

Let's take a concrete example of an error function. We want to minimize the sum square error of the neural net such that when given a 0 it should output a 1 and when given a 1 it should output a 0 (a logical NOT). This is what the error function (called the cost function) would look like, together with the error function of a single input:

$$

\begin{align}

C &= \sum_{(x,t_x) \in \{ (0,1),(1,0) \}} (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\

C_x &= (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\

\end{align}

$$

Let's focus on just the single input error function. The expected error would look like:

$$

\begin{align}

E[C_x] &= sig(w_0 x + b_0) \cdot (t_x - 1 \cdot sig(w_1 x + b_1))^2 + (1 - sig(w_0 x + b_0)) \cdot (t_x - 0 \cdot sig(w_1 x + b_1))^2 \\

&= sig(w_0 x + b_0) \cdot (t_x - sig(w_1 x + b_1))^2 + (1 - sig(w_0 x + b_0)) \cdot (t_x)^2 \\

\end{align}

$$

Remember that the expected value is the sum over all the possible values the stochastic function can take due to randomness, each multiplied by its respective probability. In the case of our neural net, the two possible values due to randomness are caused by the Bernoulli neuron "B" being either 1, with probability $sig(w_0 x + b_0)$, or 0, with probability $1 - sig(w_0 x + b_0)$. Now we can find the gradient of the expected error and minimize it, which includes optimizing the parameters that determine the probability of the Bernoulli neuron.
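We can sanity-check this closed-form expected error against a Monte Carlo estimate; the parameter values below are arbitrary, chosen only for illustration:

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

# arbitrary made-up parameters and a single training pair (x, t)
w0, b0, w1, b1 = 0.3, -0.1, 0.7, 0.2
x, t = 1.0, 0.0

p = sig(w0 * x + b0)                             # P(B = 1)
s = sig(w1 * x + b1)
expected = p * (t - s) ** 2 + (1 - p) * t ** 2   # closed-form E[C_x]

rng = np.random.default_rng(0)
B = (rng.random(200000) < p).astype(float)       # many samples of the Bernoulli neuron
samples = (t - B * s) ** 2                       # the cost of each sample
print(abs(samples.mean() - expected))            # very small
```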

In general it might not be tractable to expose all the possible values of a stochastic neural network, especially when multinoulli neurons are used, where there are many possible discrete values instead of just 0 or 1, and especially when there are multiple such neurons whose values must be considered together, creating a combinatorial explosion. To solve this problem, we can approximate the expected error by taking samples. For example, if you want to approximate the expected value of a coin, you can toss it a hundred times, count the number of times the coin gives the "1" face, and divide by 100. The same can be done with a stochastic neural network: run the network a hundred times and calculate the mean error for each training pair. The problem is that we don't want the expected error but the derivative of the expected error, and a derivative cannot be calculated from the constant you get after approximating the expected error. We need to sample the derivative of the expected error instead. This is what the derivative of the expected error looks like:

$$

\begin{align}

\frac{dE[C_x]}{dw} &= \frac{d}{dw} \sum_{v \in C_x} p(v) \cdot v \\

&= \sum_{v \in C_x} \frac{d}{dw} (p(v) \cdot v) \\

\end{align}

$$

The problem with this equation is that it no longer has the form of an expected value, so it can't be approximated by sampling; you'd have to compute the full summation, which is exactly what we're trying to avoid. The solution is to keep breaking down the algebra until we get something that can be approximated by sampling. This derivation has been rediscovered multiple times in the literature, so it goes by multiple names: REINFORCE, the score function estimator, and the likelihood ratio estimator. It takes advantage of the following identity involving logarithms: $\frac{d}{dx} f(x) = f(x) \cdot \frac{d}{dx} \log(f(x))$

$$

\begin{align}

\frac{dE[C_x]}{dw} &= \sum_{v \in C_x} \frac{d}{dw} (p(v) \cdot v) \\

&= \sum_{v \in C_x} \left( p(v) \cdot \frac{d}{dw} v + v \cdot \frac{d}{dw} p(v) \right) \\

&= \sum_{v \in C_x} \left( p(v) \cdot \frac{d}{dw} v + v \cdot p(v) \cdot \frac{d}{dw} \log(p(v)) \right) \\

&= \sum_{v \in C_x} p(v) \cdot \left( \frac{d}{dw} v + v \cdot \frac{d}{dw} \log(p(v)) \right) \\

&\approx \frac{1}{N} \sum_{i = 1}^{N} \left( \frac{d}{dw} \tilde{v}_i + \tilde{v}_i \cdot \frac{d}{dw} \log(p(\tilde{v}_i)) \right) \text{ where } \tilde{v}_i \sim C_x \\

\end{align}

$$

The last line approximates the derivative of the expected error using "N" samples. This is possible because the probability "p" was factored out inside the summation, giving the expression the same form as an expected value equation, which can hence be approximated in the same way. You might be asking how you can find the probability of each sampled value. As long as you have access to the values returned by the stochastic neurons, you can map each value to its respective probability and multiply the probabilities together if necessary. The following is the derivative of the above Bernoulli NOT gate:

$$

\begin{align}

C_x &= (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\

\frac{dE[C_x]}{dw_0} &\approx \frac{\sum_{i = 1}^{N} \left( \frac{d}{dw_0} (t_x - B_x \cdot sig(w_1 x + b_1))^2 \right) + (t_x - B_x \cdot sig(w_1 x + b_1))^2 \cdot \left(

\frac{d}{dw_0}

\left\{ \begin{array}{lr}

\log(sig(w_0 x + b_0)) & : B_x = 1 \\

\log(1 - sig(w_0 x + b_0)) & : B_x = 0 \\

\end{array} \right.

\right) }{N} \\

\end{align}

$$
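Putting it all together, here is a numpy sketch of training the Bernoulli NOT gate with the score function estimator, using N = 1 sample per update (the learning rate and iteration count are arbitrary choices). The gradient for "w1" and "b1" flows through the cost directly, while the gradient for "w0" and "b0" is the cost times the derivative of the log probability of the sampled value, exactly as in the equations above:

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w0, b0, w1, b1 = 0.0, 0.0, 0.0, 0.0
lr = 0.5
data = [(0.0, 1.0), (1.0, 0.0)]        # a logical NOT: input x, target t

for _ in range(5000):
    for x, t in data:
        p = sig(w0 * x + b0)           # P(B = 1)
        B = float(rng.random() < p)    # one sample of the Bernoulli neuron
        s = sig(w1 * x + b1)
        cost = (t - B * s) ** 2
        # ordinary (pathwise) gradient for the deterministic sigmoid neuron
        dc_ds = -2.0 * (t - B * s) * B
        gw1 = dc_ds * s * (1 - s) * x
        gb1 = dc_ds * s * (1 - s)
        # score function estimator for the Bernoulli neuron:
        # cost * d/dz log P(B), where z = w0*x + b0
        dlogp = (1 - p) if B == 1.0 else -p
        gw0 = cost * dlogp * x
        gb0 = cost * dlogp
        w0 -= lr * gw0; b0 -= lr * gb0
        w1 -= lr * gw1; b1 -= lr * gb1

for x, _ in data:
    print(x, sig(w0 * x + b0) * sig(w1 * x + b1))   # expected output p*s per input
```

After training, the expected output p·s should be close to 1 for x = 0 and close to 0 for x = 1, which is the NOT behaviour; the high variance of single-sample REINFORCE is tolerable on a problem this small.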

## Tuesday, January 31, 2017

### Applying softmax and categorical_crossentropy to 3D tensors in Theano

Theano is a great deep learning library but there are some things in it that need to be polished a bit. For example at the moment, you can only apply the softmax function in the tensor.nnet module to matrices (2D tensors). Same goes for the categorical_crossentropy function.

This is good for when you have a list of predictions, for example, when you have a bunch of images and you want your neural network to output a set of probabilities corresponding to each possible image label for each image. In this case you want to give your softmax function a matrix of values where each row corresponds to an image and each value in each row is a feature that is used to determine how to distribute the label probabilities.

But let's say that instead of just a single label we want to generate a sentence or a list of labels. In this case we need to provide a 3D tensor where the first dimension corresponds to sentences, the second dimension corresponds to word place holders in the sentences and the third dimension corresponds to the features that will be used to determine the probabilities of the possible words that fill in each place holder. In this case the softmax will not accept the 3D tensor.

Assuming that the probabilities to output are exclusively dependent on their corresponding features and that features are not shared among different "place holders", the solution is to reshape your 3D tensor into a 2D tensor, apply your softmax and then reshape the 2D tensor back into the original 3D tensor shape. Here's an example:

```
Original 3D tensor:   [ [ [111,112], [121,122] ], [ [211,212], [221,222] ] ]
Reshaped 2D tensor:   [ [111,112], [121,122], [211,212], [221,222] ]
Applied softmax:      [ [111',112'], [121',122'], [211',212'], [221',222'] ]
Reshaped back to 3D:  [ [ [111',112'], [121',122'] ], [ [211',212'], [221',222'] ] ]
```

It would be nice if this were done automatically behind the scenes by Theano. In the meantime, here is a snippet to help you:

```python
import theano
import theano.tensor as T

X = T.tensor3()
(d1,d2,d3) = X.shape
Y = T.nnet.softmax(X.reshape((d1*d2,d3))).reshape((d1,d2,d3))
```

Here is what happens step by step:

```python
print(X.reshape((d1*d2,d3)).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))
>>> [[ 1.  2.]
     [ 1.  3.]
     [ 1.  4.]
     [ 1.  5.]]
```

```python
print(T.nnet.softmax(X.reshape((d1*d2,d3))).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))
>>> array([[ 0.26894142,  0.73105858],
           [ 0.11920292,  0.88079708],
           [ 0.04742587,  0.95257413],
           [ 0.01798621,  0.98201379]])
```

```python
print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))
>>> [[[ 0.26894142  0.73105858]
      [ 0.11920292  0.88079708]]
     [[ 0.04742587  0.95257413]
      [ 0.01798621  0.98201379]]]
```

The categorical_crossentropy function can be used in the same way:

```python
import theano
import theano.tensor as T

X = T.tensor3()
S = T.imatrix()
(d1,d2,d3) = X.shape
(e1,e2) = S.shape
Y = T.nnet.categorical_crossentropy(
        T.nnet.softmax(X.reshape((d1*d2,d3))),
        S.reshape((e1*e2,))
    ).reshape((d1,d2))
```

And this is how it is used:

```python
print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]], S: [[0,1],[1,0]] }))
>>> array([[ 1.31326169,  0.12692801],
           [ 0.04858735,  4.01814993]])
```

Here "S" chooses which probability to apply the negative log to in each softmax. For example, the corresponding number in "S" for [1,2] is 0, so we choose the first probability that comes out of its softmax, which is 0.26894142, and apply the negative log to it, that is, -ln(0.26894142) = 1.31326169. Similarly, the corresponding number in "S" for [1,4] is 1, so we choose the second probability, which is 0.95257413, and compute -ln(0.95257413) = 0.04858735.
