Friday, December 13, 2013

Proof that the square roots of all non-square numbers are irrational

In a previous post I described the classical way to prove that the square root of 2 is irrational. But that proof doesn't generalize (at least not easily) to the square roots of other numbers. The proof below applies to any whole number and any root (cube roots, etc.).

This proof uses the Fundamental Theorem of Arithmetic, which states that every whole number greater than 1 can be factored into a product of prime powers in exactly one way. For example, 12 = 2^2 x 3^1; there is no other way to factorize 12 into powers of primes. Another example: 15 = 2^0 x 3^1 x 5^1, so the exponent of the first prime number must be 0, of the second 1 and of the third 1. If we had to include the fourth prime number we'd give it an exponent of 0 too, that is, 2^0 x 3^1 x 5^1 x 7^0. In other words, any number greater than 1 can be written as
n = 2^i1 x 3^i2 x 5^i3 x ... x Pn^in
where P1 is the first prime number (2), P2 is the second (3), and so on up to Pn, the largest prime needed, and i1 is the exponent of the first prime, i2 of the second, etc.
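These exponents can be found by trial division; here is a minimal sketch (the function name is my own):

```python
from collections import Counter

def prime_exponents(n):
    """Map each prime factor of n to its exponent, by trial division."""
    exponents = Counter()
    p = 2
    while p * p <= n:
        while n % p == 0:
            exponents[p] += 1
            n //= p
        p += 1
    if n > 1:
        exponents[n] += 1  # whatever is left over is itself prime
    return exponents

print(dict(prime_exponents(12)))  # {2: 2, 3: 1}
print(dict(prime_exponents(15)))  # {3: 1, 5: 1}
```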

Proof for the square root of 2
Onto the proof. We'll start from the proof for the square root of 2 and then generalize it.

We start exactly the same way we started in the other proof.
Assume sqrt(2) = a/b
2 = (a/b)^2
2 = a^2 / b^2
2 b^2 = a^2

Now we use the fundamental theorem of arithmetic on "a" and "b".

Let a = 2^i1 x 3^i2 x 5^i3 x ... x Pn^in
Let b = 2^j1 x 3^j2 x 5^j3 x ... x Pm^jm

Therefore,
2 (2^j1 x 3^j2 x 5^j3 x ... x Pm^jm)^2 = (2^i1 x 3^i2 x 5^i3 x ... x Pn^in)^2
2 (2^2j1 x 3^2j2 x 5^2j3 x ... x Pm^2jm) = 2^2i1 x 3^2i2 x 5^2i3 x ... x Pn^2in
2 x 2^2j1 x 3^2j2 x 5^2j3 x ... x Pm^2jm = 2^2i1 x 3^2i2 x 5^2i3 x ... x Pn^2in
2^(2j1+1) x 3^2j2 x 5^2j3 x ... x Pm^2jm = 2^2i1 x 3^2i2 x 5^2i3 x ... x Pn^2in

Now since the number on the left and the number on the right are equal, the fundamental theorem of arithmetic tells us that their prime factorizations must be identical: the same primes, each with the same exponent. Therefore,

m = n

and

2j1+1 = 2i1
2j2 = 2i2
2j3 = 2i3
...
2jm = 2in

So then 2j1+1 = 2i1. But the left hand side is an odd number whilst the right hand side is an even number. This cannot be the case, therefore "2b^2" and "a^2" cannot be equal, therefore "a/b" cannot equal "sqrt(2)", and therefore no such "a" and "b" exist.

What happens if the square root is actually rational?
Let's try that again with the square root of 4 this time.

Assume that sqrt(4) = a/b
4 b^2 = a^2
(2^2) (2^j1 x 3^j2 x 5^j3 x ... x Pm^jm)^2 = (2^i1 x 3^i2 x 5^i3 x ... x Pn^in)^2
(2^2)(2^2j1 x 3^2j2 x 5^2j3 x ... x Pm^2jm) = 2^2i1 x 3^2i2 x 5^2i3 x ... x Pn^2in
2^(2j1+2) x 3^2j2 x 5^2j3 x ... x Pm^2jm = 2^2i1 x 3^2i2 x 5^2i3 x ... x Pn^2in

By the fundamental theorem of arithmetic,
m = n

and

2j1+2 = 2i1
2j2 = 2i2
2j3 = 2i3
...
2jm = 2in

As you can see, all the exponents can be equal this time. 2j1+2 = 2i1 can be true because this time both sides are even numbers. In fact, with "a/b" in lowest terms, j1 = 0 and i1 = 1, giving sqrt(4) = 2/1.

You can also notice that since 4 is a composite number, we had to factor it as well in order to assimilate it into the factors of b.

Square root of 12?
Again, we use the proof on a composite number, this time with no rational square root.

Assume that sqrt(12) = a/b
12 b^2 = a^2
(2^2 x 3^1) (2^j1 x 3^j2 x 5^j3 x ... x Pm^jm)^2 = (2^i1 x 3^i2 x 5^i3 x ... x Pn^in)^2
(2^2 x 3^1)(2^2j1 x 3^2j2 x 5^2j3 x ... x Pm^2jm) = 2^2i1 x 3^2i2 x 5^2i3 x ... x Pn^2in
2^(2j1+2) x 3^(2j2+1) x 5^2j3 x ... x Pm^2jm = 2^2i1 x 3^2i2 x 5^2i3 x ... x Pn^2in

By the fundamental theorem of arithmetic,
m = n

and

2j1+2 = 2i1
2j2+1 = 2i2
2j3 = 2i3
...
2jm = 2in

This time, although 2j1+2 = 2i1 can be true, 2j2+1 = 2i2 cannot be, since the left hand side is odd and the right hand side is even.

Cube root of 2?
Let's try that again with a cube root instead.

Assume that 2^(1/3) = a/b
2 b^3 = a^3
2 (2^j1 x 3^j2 x 5^j3 x ... x Pm^jm)^3 = (2^i1 x 3^i2 x 5^i3 x ... x Pn^in)^3
2 (2^3j1 x 3^3j2 x 5^3j3 x ... x Pm^3jm) = 2^3i1 x 3^3i2 x 5^3i3 x ... x Pn^3in
2^(3j1+1) x 3^3j2 x 5^3j3 x ... x Pm^3jm = 2^3i1 x 3^3i2 x 5^3i3 x ... x Pn^3in

By the fundamental theorem of arithmetic,
m = n

and

3j1+1 = 3i1
3j2 = 3i2
3j3 = 3i3
...
3jm = 3in

Can 3j1+1 = 3i1? No: the right hand side is a multiple of 3, whilst the left hand side is one more than a multiple of 3, so the two can never be equal.

In general
Now let's try to generalize this to any number "q" and any root "r".

Assume that q^(1/r) = a/b
q b^r = a^r
(2^k1 x 3^k2 x 5^k3 x ... x Pl^kl) (2^j1 x 3^j2 x 5^j3 x ... x Pm^jm)^r = (2^i1 x 3^i2 x 5^i3 x ... x Pn^in)^r
(2^k1 x 3^k2 x 5^k3 x ... x Pl^kl) (2^rj1 x 3^rj2 x 5^rj3 x ... x Pm^rjm) = 2^ri1 x 3^ri2 x 5^ri3 x ... x Pn^rin
Now since we don't know whether "q" or "b" has more prime factors, the last factor on the left hand side will simply be called "Pz^y", where z = max(l,m) and "y" is the exponent of this last prime number.
2^(rj1+k1) x 3^(rj2+k2) x 5^(rj3+k3) x ... x Pz^y = 2^ri1 x 3^ri2 x 5^ri3 x ... x Pn^rin

By the fundamental theorem of arithmetic,
z = n

and

rj1+k1 = ri1
rj2+k2 = ri2
rj3+k3 = ri3
...
y = rin

For rj1+k1 = ri1 to be true, "k1" must be a multiple of "r", otherwise rj1+k1 cannot be a multiple of "r". The same goes for all the other exponents. So this means that every exponent of the prime factors of "q" must be a multiple of "r", that is
q = 2^rh1 x 3^rh2 x 5^rh3 x ... x Pl^rhl
But then this means
q = (2^h1 x 3^h2 x 5^h3 x ... x Pl^hl)^r
which further means that "q" is a whole number raised to the power of "r". Therefore, for a whole number to have a rational rth root, that number must be equal to another whole number raised to the power of "r". If no whole number raised to the rth power equals the number in question, then no fraction can equal its rth root either.
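This conclusion gives a direct test for whether the rth root of "q" is rational: factorize "q" and check that every exponent is a multiple of "r". A minimal sketch (the function name is my own):

```python
def has_rational_root(q, r):
    """True iff the rth root of whole number q is rational, i.e. every
    exponent in q's prime factorization is a multiple of r."""
    p = 2
    while p * p <= q:
        exponent = 0
        while q % p == 0:
            exponent += 1
            q //= p
        if exponent % r != 0:
            return False
        p += 1
    if q > 1 and 1 % r != 0:  # any leftover prime factor has exponent 1
        return False
    return True

print(has_rational_root(4, 2))   # True: sqrt(4) = 2
print(has_rational_root(12, 2))  # False: sqrt(12) is irrational
print(has_rational_root(27, 3))  # True: cube root of 27 is 3
```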

Friday, November 15, 2013

Tower of Hanoi

The Tower of Hanoi is a simple puzzle with an interesting pattern as a solution. You have 3 pegs with 3 discs of different sizes stacked on the first peg, smallest on top.

The puzzle is to move all the discs one by one to the last peg, with the two rules that every disc must rest on a peg and that no disc may be placed on a smaller disc. The puzzle is made harder by the addition of more discs.

The minimal number of moves needed to move all 3 discs is 7 (numbering the pegs 1 to 3):
  1. Move the smallest disc from peg 1 to peg 3.
  2. Move the middle disc from peg 1 to peg 2.
  3. Move the smallest disc from peg 3 to peg 2.
  4. Move the largest disc from peg 1 to peg 3.
  5. Move the smallest disc from peg 2 to peg 1.
  6. Move the middle disc from peg 2 to peg 3.
  7. Move the smallest disc from peg 1 to peg 3.

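The seven moves can also be generated with a short recursive sketch (the peg names 'A', 'B' and 'C' are my own):

```python
def hanoi(n, source, spare, target):
    """Yield the moves that transfer n discs from source to target."""
    if n == 1:
        yield (source, target)
    else:
        yield from hanoi(n - 1, source, target, spare)  # smaller discs out of the way
        yield (source, target)                          # largest disc to its place
        yield from hanoi(n - 1, spare, source, target)  # smaller discs back on top

moves = list(hanoi(3, 'A', 'B', 'C'))
print(len(moves))  # 7
print(moves)
```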
So how do we know that there is no shorter way to move them all to the other side? Let's ignore the number of discs and just focus on how to reach the goal.

There are three sub-goals required to reach the final goal:
  1. Move all the discs except the largest one out of the way, onto the spare peg, so that the largest disc can be placed on the destination peg.
  2. Move the largest disc to the destination peg.
  3. Move the rest of the discs onto the largest disc.

The crucial point of this puzzle is that the movement of the smaller discs is independent of the largest disc. This means that the number of steps needed to move the smaller discs to the middle peg is the same as the number of steps needed to solve the Tower of Hanoi with one disc fewer.

It also doesn't matter which peg is the destination peg, since the number of moves is the same regardless of where the discs need to go. All that matters is that all the discs move from one peg to another, with just one other peg to manoeuvre on.

So the formula for finding the smallest number of steps needed to move "n" discs is

steps(n) = steps(n-1) + 1 + steps(n-1)

where the first "steps(n-1)" is the number of steps to move the smaller discs to the other peg, the "1" is the number of steps to move the largest disc to the destination peg and the last "steps(n-1)" is the number of steps needed to move the smaller discs on top of the largest disc.

Let's see an example.

steps(3) =


The first thing we want to do is to move the 2 smaller discs off the largest disc so that the largest disc can be placed on the destination peg like this:

steps(3-1)


Then we need to move the largest disc to the destination peg like this:

+ 1


Finally we need to move the smaller discs to the destination peg as well like this:

+ steps(3-1)


How many steps will it take to move the smaller discs though? It will take the same number of steps as solving the Tower of Hanoi with 2 discs only like this:

steps(2) =


steps(2-1)


+ 1


+ steps(2-1)


Which is the same as moving the 2 smaller discs in the 3 disc version, except that the destination peg was the middle one rather than the last:

steps(3-1) =


steps(3-1-1)


+ 1


+ steps(3-1-1)


Anything other than this strategy will increase the number of steps needed to reach the goal, because it means that at some point a disc was placed on a peg where a larger disc needs to go, and hence that smaller disc will need to be moved out of the way again.

So we go back to the proposed formula:

steps(n) = steps(n-1) + 1 + steps(n-1)

The recursion of the formula can keep on going until it turns into the number of steps to move 1 disc which is:

steps(1) = 1

Now the formula can be simplified into:

steps(n) = 2 steps(n-1) + 1

Can we solve the recurrence relation into a simple closed form? Without loss of generality, let's try to decompose it iteratively for n=4:

steps(4) = 2 steps(3) + 1
steps(4) = 2(2 steps(2) + 1) + 1
steps(4) = 2(2(2 steps(1) + 1) + 1) + 1
steps(4) = 2(2(2(1) + 1) + 1) + 1
steps(4) = 2(2(2 + 1) + 1) + 1

Expanding the expression we get:

steps(4) = 2(2(2 + 1) + 1) + 1
steps(4) = 2(2(2) + 2 + 1) + 1
steps(4) = 2(2(2)) + 2(2) + 2 + 1

So it seems we have a sum of powers of 2, which is:

steps(4) = 2^3 + 2^2 + 2^1 + 2^0

A quick look at binary numbers (or just geometric series) lets you quickly notice that this is equivalent to:

steps(4) = 2^4 - 1

Thus, returning to "n" instead of "4",

steps(n) = 2^n - 1

So the number of steps needed to move "n" discs is "2^n - 1", which means that in the case of 3 discs it is "2^3 - 1", that is, 7.
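The recurrence and the closed form can be checked against each other with a quick sketch:

```python
def steps(n):
    """Minimal number of moves for n discs, by the recurrence."""
    return 1 if n == 1 else 2 * steps(n - 1) + 1

# The recurrence agrees with the closed form 2^n - 1 for every n tried.
for n in range(1, 11):
    assert steps(n) == 2**n - 1

print(steps(3))  # 7
```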

Monday, October 14, 2013

Python N-gram Map

I have developed a data structure in Python to store and query n-grams which is released as open source here. This blog post shall give examples on how to use it.

To add n-grams


Adding the n-grams (a,a,a), (a,a,b), (a,b,a), (b,a,a) which map to None

x = NGramMap()
x[('a','a','a')] = None
x[('a','a','b')] = None
x[('a','b','a')] = None
x[('b','a','a')] = None

To count n-gram frequency


Adding the n-grams (a,a,a), (a,a,b), (a,b,a), (b,a,a) several times which map to the number of times they were encountered.

ngrams = [ ('a','a','a'), ('a','a','b'), ('a','a','a'), ('a','a','b'), ('a','b','a'), ('b','a','a'), ('a','b','a'), ('b','a','a'), ('a','b','a'), ('a','b','a') ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for ngram in x.ngrams():
    print(ngram, x[ngram])
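If all you need are the frequency counts, without the pattern queries, the same counting can be done with the standard library's collections.Counter:

```python
from collections import Counter

ngrams = [ ('a','a','a'), ('a','a','b'), ('a','a','a'), ('a','a','b'),
           ('a','b','a'), ('b','a','a'), ('a','b','a'), ('b','a','a'),
           ('a','b','a'), ('a','b','a') ]

# Counter maps each distinct n-gram to the number of times it occurs.
counts = Counter(ngrams)
for ngram, count in counts.items():
    print(ngram, count)
```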

To find n-grams which contain particular elements


Finding, among the stored n-grams, those which contain both the elements 'a' and 'y' in any positions.

ngrams = [ ('a','x','y'), ('x','a','y'), ('a','x','y'), ('b','x','y'), ('c','x','z')  ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for ngram in x.ngrams_with_eles({ 'a', 'y' }):
    print(ngram, x[ngram])

To find n-grams which follow a particular pattern


Finding, among the stored n-grams, those which match the pattern (?, 'x', 'y'), where the set {0} marks position 0 as a wildcard.

ngrams = [ ('a','x','y'), ('x','a','y'), ('a','x','y'), ('b','x','y'), ('c','x','z')  ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for ngram in x.ngrams_by_pattern(( None, 'x', 'y' ), { 0 }):
    print(ngram, x[ngram])

To find elements which share similar contexts


You can find elements which occur in the same context in their n-grams, for example 'a' and 'b' share a context in the n-grams (a, x, y) and (b, x, y) as do 'p' and 'q' in the n-grams (x, p, y) and (x, q, y).

ngrams = [ ('a','x','y'), ('x','a','y'), ('a','x','y'), ('b','x','y'), ('c','x','z')  ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for element in x.elements():
    print(element)
    for ngram in x.ngrams_with_eles({ element }):
        ele_index = ngram.index(element)
        for ngram_ in x.ngrams_by_pattern(ngram, { ele_index }):
            if ngram_[ele_index] != element:
                print("\t", ngram_[ele_index])

Friday, September 6, 2013

Summary of research paper "Automatic Word Sense Discrimination" by Hinrich Schütze

This is a summary of the research paper http://delivery.acm.org/10.1145/980000/972724/p97-schutze.pdf. This paper describes a way to automatically identify different senses (meanings) for words with multiple senses and to disambiguate the sense of a word in context according to these identified senses.

Overview
We start by creating word vectors for each word by counting how often certain other words occur close to it. These other words are called the dimensions of the vector. This vector space is called the word space.

For example, if the word "judge" is to be projected into a word space with dimensions being "legal" and "clothes", then we go through a corpus and count the number of times the words "legal" and "clothes" occur close to "judge" and record these frequencies in a vector. This vector would represent the meaning of "judge" and we'd expect the count for "legal" to be higher than the count for "clothes". On the other hand the word vector for "robe" using the same dimensions would have a higher count for "clothes" than for "legal". The problem is that, in a corpus, an ambiguous word like "suit" would have both the legal sense and the clothes sense so its vector would not give a good indication of its meaning.

In order to disambiguate the sense of an ambiguous word like "suit", the sentence in which it occurs is used as a context to see what that particular instance of the word means. Words chosen from the sentence, called features, are used as cues to identify in which sense the ambiguous word is being used. For example, if the sentence with the word "suit" contains the words "law", "judge" and "statute", then we can be confident that the sense of "suit" in that context is the legal sense, since all of these words have a high count for the "legal" dimension in their word vector. In practice, the centroid of the word vectors of these features is calculated, and this centroid shows in which dimensions the feature vectors are generally pointing. This centroid is called the context vector of that instance of the ambiguous word.

Of course in practice there will be a lot of dimensions in the word space so senses will not be limited to a single dimension but to combinations of dimensions. This means that in order to establish what the senses of an ambiguous word are, the context vectors of each instance of the word must be clustered such that similar context vectors are clustered as sharing the same sense and each cluster will be a sense of the ambiguous word. The centroid of each cluster is called a sense vector and when an ambiguous word is to be disambiguated into a sense, the sense vector which is most similar to the word's context vector would be considered its sense.
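To make the overview concrete, here is a toy sketch (all the counts below are invented for illustration): word vectors over the two dimensions "legal" and "clothes", a context vector computed as the centroid of the feature vectors, and sense assignment by cosine similarity.

```python
from math import sqrt

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy word vectors over the dimensions ("legal", "clothes") -- invented counts.
word_vectors = {'law': [9, 1], 'judge': [8, 2], 'robe': [1, 7]}

# Context vector of one instance of "suit", from the features found near it.
context = centroid([word_vectors['law'], word_vectors['judge']])

# Assign the sense whose sense vector is most similar to the context vector.
sense_vectors = {'legal sense': [10, 1], 'clothes sense': [1, 10]}
best = max(sense_vectors, key=lambda s: cosine(context, sense_vectors[s]))
print(best)  # legal sense
```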

Nitty gritty
Since there must be a fixed number of dimensions in the word space and of feature words used for context vectors, a method to select them must be used. Two methods were tested: local and global selection. In local selection the feature and dimension words are exactly the same, and different dimensions/features are used for each ambiguous word; so the system will only disambiguate ambiguous words for which dimensions and features were already determined. In global selection the dimension and feature words are different from each other and are selected once, such that all ambiguous words are disambiguated using these same dimensions and features.

Local selection works by simply taking all words which are not stop words and which occur within 25 words of the ambiguous word, and accepting as dimensions/features only the 1000 most important ones. Importance is measured using one of two criteria: the number of times the dimension/feature occurs near the word in question, or the chi-squared test of how strongly the ambiguous word and the dimension/feature word correlate.

Global selection on the other hand uses the 20,000 most frequent non-stop words in the corpus as features and the 2,000 most frequent words as dimensions.

Once the dimensions and features are collected, they are arranged into a matrix such that the rows represent the dimensions and the columns the features. In this way the columns of the matrix are the word vectors of the different features. In order to disambiguate an ambiguous word, either these word vectors are used as is or they are reduced using Singular Value Decomposition (SVD) in order to reduce the size of the word vectors (number of rows) to 100 (which means that the word vectors will not correspond to frequencies of neighbouring words anymore but to abstract dimensions). Global selection was not tested with SVD.

Context vector elements are normalized and weighted by inverse document frequency.

Similarity between vectors is calculated using cosine similarity and clustering is done using the Buckshot algorithm.

Evaluation
In order to evaluate the system, pseudowords were created by taking two different words and replacing them with the same invented word. If we have the two sentences "The boy peeled the banana." and "The boy opened the door.", and the words "banana" and "door" are replaced with the pseudoword "banana/door" such that the sentences are now "The boy peeled the banana/door." and "The boy opened the banana/door.", then the pseudoword is an ambiguous word with two senses: the banana sense and the door sense. The system must now attempt to identify these two senses and determine when the word is used in the banana sense and when in the door sense. This is easy to evaluate since we already know how each instance should be disambiguated, and there is no need for prior manual identification of the senses.

The following is the average percentage of correctly disambiguated instances over 20 pseudowords for the different experimental conditions:

Condition                    | % correct
Local, frequency, no SVD     | 77.8
Local, frequency, SVD        | 82.9
Local, chi-squared, no SVD   | 72.1
Local, chi-squared, SVD      | 84.1
Global, frequency, SVD       | 89.7

Tuesday, August 13, 2013

Python functions for enumerating combinatorics

Here are some useful enumeration functions in Python for generating permutations and combinations.

Enumerating all permutations of a sequence

This function gives you all the different ways the elements of a sequence can be ordered. It takes each element from the sequence in turn and recursively finds all the permutations of the sequence without that element, prefixing each sub-permutation with the removed element. The base case is that a sequence of just one element can only be ordered as itself.
def permutations(initial_permutation):
 if len(initial_permutation) == 1:
  yield initial_permutation
 else:
  for i in range(len(initial_permutation)):
   ith_element = initial_permutation[i:i+1]
   without_ith_element = initial_permutation[:i]+initial_permutation[i+1:]
   for sub_perm in permutations(without_ith_element):
    yield ith_element + sub_perm
Output:
list(permutations("ABCD"))

['ABCD', 'ABDC', 'ACBD', 'ACDB', 'ADBC', 'ADCB', 'BACD', 'BADC', 'BCAD', 'BCDA', 'BDAC', 'BDCA', 'CABD', 'CADB', 'CBAD', 'CBDA', 'CDAB', 'CDBA', 'DABC', 'DACB', 'DBAC', 'DBCA', 'DCAB', 'DCBA']

Enumerating all unordered pairs in a sequence

This function gives you all the different ways an unordered pair of elements can be taken from a sequence. It is a special case of the next function. It simply runs two nested for loops, each of which selects the index of an element in such a way that the two selected indices are distinct.
def pairs(seq):
 for i in range(0, len(seq)-1):
  for j in range(i+1, len(seq)):
   yield (seq[i], seq[j])
Output:
list(pairs("ABCD"))

[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

Enumerating all unordered subsets of a given length in a sequence

This is the general case of the previous function, which enumerates tuples of a given length instead of pairs. It replaces the nested for loops with recursive for loops; with each recursive call the number of elements left to choose goes down by one, until the function has to choose 0 elements, at which point it yields an empty tuple.
def choices(seq, choice_len):
 if choice_len == 0:
  yield ()
 else:
  for i in range(0, len(seq)-(choice_len-1)):
   for sub_choice in choices(seq[i+1:], choice_len-1):
    yield (seq[i],) + sub_choice
Output:
list(choices("ABCD", 3))

[('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D'), ('B', 'C', 'D')]

Saturday, July 20, 2013

No Free Lunch

There's no free lunch: you always have to lose something in order to gain something. At least this is true in computer science, where there is an actual No Free Lunch theorem. In this post I'd like to highlight various instances in computer science where you have to lose something in order to gain something, and suggest some other fields where I think this also applies.

As you read these examples, keep in mind those games where you set the stats of your character, such as speed and strength, from a fixed total number of points which you can distribute as you see fit. This creates a no free lunch condition, because you cannot have a character with perfect stats; some stats must always be sacrificed in order to improve others.

Lossless compression

When you use WinZip in order to make a file smaller in such a way that you can extract the original file back again, you are using something called lossless compression. Can you design a compression algorithm which can make any file smaller and be able to decompress it back? No, because there are more large files than there are smaller ones.

Computer files consist of a sequence of ones and zeros called bits. Let's say that we have designed an algorithm which will compress files consisting of 2 bits. There can only be 4 such files which are:
00
01
10
11
Now let's say that our algorithm will somehow compress these files into 1 bit files. How many different 1 bit files are there? Just 2 which are:
0
1
So we want to compress each of the four 2 bit files into one of the two 1 bit files. This means that two of the large files must turn into the same compressed file, so when you come to decompress it, it is impossible to know which of the two possible files the original was. Information has been lost and the process is not reversible: it would be lossy compression, not lossless.

So what lossless compression algorithms do, inevitably, is to make some files smaller and others bigger than they were. It is the only way to compress some files; you must enlarge some others. A good algorithm would be designed to compress the most typical files whilst enlarging the less common ones, but what would be a typical file? This depends on the application, which is why good compression results when an algorithm is designed for compressing only certain types of files such as pictures or music.

Here is an example of how the 2 bit files can be compressed:
00: 0
01: 10
10: 110
11: 111
So the file "00" will be actually compressed whilst the rest will be either the same size or bigger. As you can see each file was compressed into a unique file so it is possible to reverse the process and obtain the original files.
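Since no codeword in this table is a prefix of another, a concatenation of codewords can be decoded unambiguously. A minimal sketch of compressing and decompressing a sequence of 2 bit blocks with it:

```python
# The prefix code from the table above.
code = {'00': '0', '01': '10', '10': '110', '11': '111'}
decode = {v: k for k, v in code.items()}

def compress(blocks):
    """Concatenate the codeword of each 2 bit block."""
    return ''.join(code[b] for b in blocks)

def decompress(bits):
    """Read bits left to right, emitting a block whenever a codeword completes."""
    blocks, current = [], ''
    for bit in bits:
        current += bit
        if current in decode:
            blocks.append(decode[current])
            current = ''
    return blocks

compressed = compress(['00', '00', '01'])
print(compressed)              # 0010
print(decompress(compressed))  # ['00', '00', '01']
```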

There's no free lunch in lossless compression, you've got to have some files which will be enlarged if you want some others to be compressed.

Hashing functions

A hashing function is an algorithm which transforms a piece of data (such as a person's name) into a number within a fixed range. This number is called a hash.

Here is how the name "john" can be hashed into the number 7. Start by associating each letter with a number (for example "a" can be associated with 1 and "b" with 2, etc.) and then taking each letter in a person's name and finding their associated numbers. These numbers would then be added together producing a single number. So "john" would be associated with the numbers [10,15,8,14] whose sum is 47. However we want the number to always be within a fixed range, such as between 0 and 9. We can do this by dividing the number by 10 and keeping the remainder which will always be between 0 and 9. So 47 divided by 10 gives 4 remainder 7, so the hash of the name "john" would be 7. This is of course just an example and hash functions can be much more complicated than this.
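That worked example can be written out directly in Python (the function name is my own):

```python
def name_hash(name, table_size=10):
    """Sum the alphabet positions of the letters (a=1, b=2, ...) and keep
    the remainder after dividing by the table size."""
    return sum(ord(c) - ord('a') + 1 for c in name.lower()) % table_size

print(name_hash('john'))  # 7, since 10 + 15 + 8 + 14 = 47 and 47 mod 10 = 7
```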

Hashing functions are used in order to create a hash table which is a table where every row is associated with a number and data is placed in a row according to its hash. The number of rows would be the range of the hashes. So "john" in the previous example would be placed in row 7. This speeds up searching for names since you can quickly find if a name is in the table by just computing its hash and checking if that particular row contains that particular name.

The problem is that different names can give the same hash which will create a collision in the hash table. There are ways around it but a good hashing function will minimize the number of collisions. So is there a hash function which will give the minimal number of collisions for any data?

Obviously not. Let's say that the range of the hashing function allows you to use a hash table with 1000 rows. This would mean that at most you can have 1000 different names without a collision, after which the next name will always result in a collision. Let's say that there is a list of 1000 names which when hashed will all be given different hashes. The hash function works well when the names are picked from that list. But there are more names than 1000. There are potentially infinite names which can be hashed. Since there are potentially infinite names and only 1000 can be placed in the hash table, then there are potentially infinite names which will all try to occupy the same row, regardless of which hashing function is used. This is simply because of the limited available space. So you can always find a list of names which will result in collisions. You can increase the range of the hashing function, but it must always be a finite range or it will not be a hashing function and cannot be used in a hash table.

There is no free lunch when it comes to hashing functions. You can only find a hashing function which works well with a particular subset of the data (usually the most typical data, just like in compression). There will always be some data which will result in a collision.

Machine learning / regression analysis

Both machine learning and regression analysis are processes of trying to guess a pattern from some examples. For example, suppose I say that I'm thinking of a sequence of numbers which starts 0,2,4,6,8,... What will the next number be? You might be quick to guess 10, but this isn't necessarily true. You simply recognized one pattern which the list follows, namely the sequence of even numbers. But this isn't the only pattern which starts with those numbers. With more examples it might look like this: 0,2,4,6,8,1,3,5,7,9,10,12,14,16,18,11,13,15,17,19,... in which case we might guess the next number to be 20, since the sequence alternates between five even and five odd numbers.

In fact, for any finite sequence of numbers that you can think of, there exists an equation which starts with that sequence. A method for finding such an equation is polynomial interpolation, where an equation which starts with the sequence 1,3,2,4 (for whole number values of x) is:
f(x) = 1((x-2)(x-3)(x-4))/((1-2)(1-3)(1-4)) + 3((x-1)(x-3)(x-4))/((2-1)(2-3)(2-4)) + 2((x-1)(x-2)(x-4))/((3-1)(3-2)(3-4)) + 4((x-1)(x-2)(x-3))/((4-1)(4-2)(4-3))
which simplifies to:
f(x) = (2x^3 - 15x^2 + 35x - 20)/2
which gives the following values:
f(1) = 1
f(2) = 3
f(3) = 2
f(4) = 4
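The same interpolation can be computed numerically from the Lagrange formula above; a minimal sketch:

```python
def lagrange(points):
    """Return the function interpolating the given (x, y) points, in Lagrange form."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

f = lagrange([(1, 1), (2, 3), (3, 2), (4, 4)])
print([round(f(x)) for x in (1, 2, 3, 4, 5)])  # [1, 3, 2, 4, 15]
```

Note that f(5) = 15, not 5: the interpolating polynomial fits the given examples and then continues however it likes, which is exactly the point being made here.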

And this can be done for any finite sequence. The equation might become more and more complex (even when simplified) but there will always be one. This means that for any example sequence you come up with, there are infinitely many different equations which start with that sequence and then all continue differently. So no finite sequence of numbers suffices to describe a single pattern: there are infinitely many patterns which fit the examples, not just the one you were thinking of initially.

In general it is not simply a sequence of numbers that we need to find a pattern for. Machine learning is used to learn how to do a task from a set of examples of that task. It could be Youtube trying to guess what sort of videos you like based on the ones you have already seen. It could be Amazon trying to guess what sort of products you tend to buy based on the ones you have already bought. It could be a natural language production program which given a number of good stories will try to produce a new original good story. It could be a program which tries to find what you should invest in that will not fail given the current economic situation and using past economic situations as examples. It could be an automatic car which tries to learn how to drive automatically by observing how people drive. All these can be thought of as more complex versions of the previous equation shown, with the difference that instead of an equation you have a full blown program.

But the same problem persists. Different learning methods will give different patterns which describe the examples, and the problem is choosing the learning method which will give you the pattern that you want. This is called an inductive bias: different learning methods tend towards different patterns, so a particular learning method will tend to give patterns which are suitable for particular uses whilst another method will give patterns which are suitable for different uses. This applies even to us humans, because we try to find patterns which describe our observations all the time, from a student learning how to do an exercise from the teacher's examples to a scientist looking for a theory which describes some natural process. There is no single best method of learning, since each can produce a pattern which fits the examples but fails on new observations.

There is no free lunch in learning. Each learning method can fail and so there is no universally good method.

Programming languages?

We now come to the first of my unfounded hypotheses. I think that programming languages also have no free lunch. Each programming language is designed to be efficient at creating certain programs but none is efficient at everything. The more general purpose a programming language is, the harder it is to find errors in the code, which is why Haskell was reduced in generality to produce Helium, which has better error messages. It seems a perfect programming language cannot be created: it will either be too unwieldy to be used by humans or too specialized to be useful in general. So there is no free lunch; there cannot be a programming language that is both fully general and easy to use, it must be domain specific (usually implicitly).

Data structures and algorithms?

Here is another of my unfounded hypotheses. Data structures are ways of representing data so as to be efficient for some particular use, such as searching. The problem is that there are always tradeoffs involved, and sometimes these tradeoffs are not obvious. For example, although a sorted list is fast to search through, it is slower to add new data to. An unsorted list is faster to add data to (just append it to the end) but slower to search through. Sorting algorithms also have tradeoffs. The common general-purpose sorting algorithms have a worst-case time complexity of O(n^2), except for merge sort, which has a worst-case time complexity of O(n log n) but at the expense of using extra memory proportional to the input. Heap sort does have a worst-case time complexity of O(n log n) (without big O notation, about 2n log n) and sorts in place, but it still has its disadvantages. On the other hand, when an algorithm is optimized to handle more and more special cases in order to avoid its worst case, another tradeoff appears: the size of the program needed to implement the algorithm. And different computers will be more efficient at executing certain algorithms than others, just as we humans are better at using the slow sorting algorithms than the fast ones when sorting by hand. It seems there cannot be a data structure or algorithm which is good in every situation and for every use; there will always be some situation it is not well suited for. So there is no free lunch: there cannot be a general algorithm or data structure which works well in every situation, it must be application specific.

Monday, June 10, 2013

Fitness function for multi objective optimization / Mapping vectors/tuples to numbers

When you want to let the computer discover how to best solve a problem, you can use a genetic algorithm to evolve a solution for a problem. For example, if you're trying to find which schedule of lessons will cause the least clashes (hopefully none), you create a population of possible schedules and let them evolve into better and better schedules. A more interesting example would be evolving a program which does something in particular, such as control a robot.

In genetic algorithms you have to supply a function called a fitness function which returns a number which quantifies how good a particular solution is at solving the problem. For example, in the case of the time table scheduler, a fitness function would be the number of clashes between lessons that are caused by a given schedule. The smaller this number, the better the schedule. If you're trying to evolve a program which generates prime numbers, then a fitness function would be the percentage of numbers generated which are actually prime numbers. The higher this number, the better the program.

The problems start when you want to maximize or minimize more than one thing, that is, when you have a number of separate fitnesses that you want to optimize together. An example of this is evolving your schedule such that both the number of lesson clashes (when students are expected to be in two different lessons at the same time) and the number of room clashes (when the same room is assigned to be used for more than one lesson at the same time) will be minimized. In the case of the program, you usually not only want to evolve a correct program, but also one which gives the correct output quickly. This is called multi objective optimization.

What is usually done is a weighted summation of the individual fitnesses. So you say that the correctness of a program is, say, twice as important as the speed of the program, so the fitnesses are combined as
fitness = 2 * percentage_correct_outputs + time_taken
Of course this is not correct since the correctness must go up whilst the time should go down. So a better combination might be
fitness = 2 * percentage_correct_outputs + 1/(time_taken + 1)
You might not think that this is the best fitness function but choosing a correct fitness function is part of the difficulty of using genetic algorithms so have fun finding a better one.

The problem with weighted summations is that the fitness will increase by simply improving one of the sub fitnesses. So in the previous example, you can get a good fitness by simply creating a program that does nothing and takes 0 seconds to do so. Such a program is much easier to find than a correct program, so the genetic algorithm tends to improve the execution time only and gets stuck, since a program which increases the correctness will be much slower and thus will reduce the fitness rather than increase it. A better way to combine the sub fitnesses is needed.

I found two ways to do this for two different needs. The first is for when the sub fitnesses are of equal importance, for example the time table scheduler minimizing the number of lesson and room clashes. The second is for when the sub fitnesses are prioritized such that the most important sub fitness should always be improved regardless of how the second is affected, for example the prime number generator being correct has the highest priority but between two equally correct programs, the faster one is preferred.

Equal priority fitness

If both sub fitnesses must be optimized together such that improving one without the other is not helpful, then we might add the two fitnesses together and subtract their absolute difference like this:

fitness = (sub1 + sub2) - |sub1 - sub2|

This way as one starts improving without the other, their difference will increase and the whole fitness will start becoming smaller. We can simplify this equation by calling the largest sub fitness "max" and the smallest sub fitness "min":

fitness = (max + min) - (max - min)
fitness = max + min - max + min
fitness = 2*min

So this fitness is equivalent to finding twice the minimum of the sub fitnesses, which makes sense since if you only care about increasing the minimum fitness, then both fitnesses will increase together. Of course multiplying by 2 is redundant since comparing a minimum with another minimum and comparing twice a minimum with twice another minimum will give the same result, so our final fitness function is:

fitness = min(sub1, sub2)

and you can add more sub fitnesses by just finding the minimum of all of them:
fitness = min(sub1, sub2, sub3, ..., subn)

So in our time table example, the fitness function would be:
fitness = min(-lesson_clashes, -room_clashes)

The negations are used so that, in order to increase the fitness, the numbers of clashes have to be decreased.
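The behaviour of this equal priority fitness can be sketched in a few lines (the clash counts here are hypothetical):

```python
# Equal priority fitness for the time table example: the overall fitness
# only improves when the worst sub fitness improves.
def schedule_fitness(lesson_clashes, room_clashes):
    return min(-lesson_clashes, -room_clashes)

balanced   = schedule_fitness(2, 2)  # -2
lopsided   = schedule_fitness(0, 2)  # still -2: fixing only lesson clashes doesn't help
both_fixed = schedule_fitness(0, 1)  # -1: improving the worst sub fitness helps
print(balanced, lopsided, both_fixed)  # -2 -2 -1
```

Notice that eliminating all lesson clashes while leaving room clashes untouched gains nothing, which is exactly the intended behaviour.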

Prioritized fitness

When one sub fitness is more important than the other, such as program correctness versus execution time, a different method is needed. What is needed is a function "f" which combines the two sub fitnesses ("sub_1" and "sub_2" where sub_1 has a greater priority than sub_2) such that the following conditions are met:

f(subA_1, subA_2) > f(subB_1, subB_2)
if subA_1 > subB_1 or
   subA_1 = subB_1 and subA_2 > subB_2

f(subA_1, subA_2) = f(subB_1, subB_2)
if subA_1 = subB_1 and subA_2 = subB_2

So for example f(2, 1) > f(1, 100), f(5, 20) > f(5, 1) and f(3, 3) = f(3, 3).

This is called a lexicographical ordering of the sub fitnesses. With strings this is natural: when sorting names, "John" comes before "Zach" because "J" comes before "Z", regardless of what the second letter is. But "James" comes before "John" because the first letter is the same, so we look at the second one, and "a" comes before "o".

The problem we're facing is with finding a function which when given a pair of numbers will return a number which can be used to sort a list of pairs of numbers in lexicographical order. This is similar to Cantor's mapping of pairs of numbers to the natural numbers, except that this time we want a particular mapping which is lexicographic.

A natural lexicographic ordering in numbers comes when we look at numbers with a decimal point such as 2.1 and 3.5. The whole number part will always determine the ordering, unless they are equal, in which case the fractional part is then used. So if we can map our pairs of sub fitnesses to real numbers in such a way that the high priority sub fitness becomes the whole number part and the low priority sub fitness becomes the fractional part, then we would have found our "f".

What I found is that you can do this by squashing the low priority number into a proper fraction (a number between 0 and 1) and then adding it to the high priority number. Unfortunately this method will only work if the high priority number is a whole number; the low priority number can be any real number, however. In order to squash the low priority number, you can use a sigmoid function, which given any number will return a number between 0 and 1 in such a way that ordering is preserved (sigmoid(a) > sigmoid(b) if and only if a > b). Other, perhaps simpler, options are the hyperbolic tangent or the arc tangent, rescaled so that their output is between 0 and 1 (y = (tanh(x)+1)/2 and y = (atan(x)+pi/2)/pi).

So the combined fitness would be:
fitness = sub1 + sigmoid(sub2)

The nice thing about this is that you can add more sub fitnesses in the following way, assuming that only the least important sub fitness may be a real number whilst the rest are integers:
fitness = sub1 + sigmoid(sub2 + sigmoid(sub3 + ... + sigmoid(subn)...))

So in our program example, the fitness function would be:
fitness = percentage_correct_outputs + sigmoid(time_taken)

It should be noted that this is a way to map vectors/tuples of integers to real numbers whilst preserving lexicographic ordering.
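Here is a minimal sketch of this mapping, using the rescaled arc tangent from above as the squashing function (the function names squash and lex_fitness are mine, not standard):

```python
import math

def squash(x):
    # Map any real number into (0, 1) while preserving order.
    return (math.atan(x) + math.pi/2) / math.pi

def lex_fitness(sub1, sub2):
    # sub1 must be an integer: it becomes the whole number part,
    # while squash(sub2) becomes the fractional part.
    return sub1 + squash(sub2)

# The conditions stated above hold:
assert lex_fitness(2, 1) > lex_fitness(1, 100)   # higher priority wins outright
assert lex_fitness(5, 20) > lex_fitness(5, 1)    # ties broken by the second value
assert lex_fitness(3, 3) == lex_fitness(3, 3)
```

The squash keeps the low priority value strictly inside the fractional part, so it can never overturn a difference in the high priority value.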

Combined

Now we can even use both of the combinations in order to combine sub fitnesses in complex ways where some are of equal priority whilst others are of different priority. For example, say we are evolving a time table schedule which minimizes lesson clashes, minimizes room clashes and minimizes density such that you avoid cramming all the lessons in one day. The lesson and room clashes are of equal priority but the density has a lower priority than the other two. So the fitness function might be:

fitness = min(-lesson_clashes, -room_clashes) - sigmoid(density)

This will give us a real number which has a whole number part representing the number of undesirable clashes and a fractional part representing the density, both of which must be minimized. Since fitness should be increased, minimization is done by using the negations and subtraction.
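A sketch of this combined fitness (the clash counts and density values are hypothetical, and squash is the rescaled arc tangent from the previous section):

```python
import math

def squash(x):
    # Map any real number into (0, 1) while preserving order.
    return (math.atan(x) + math.pi/2) / math.pi

def combined_fitness(lesson_clashes, room_clashes, density):
    # Clashes are equal priority (min); density is lower priority (fraction).
    return min(-lesson_clashes, -room_clashes) - squash(density)

# Fewer clashes always beat a lower density...
assert combined_fitness(1, 0, 9.0) > combined_fitness(2, 0, 1.0)
# ...but with equal clashes, the lower density schedule wins.
assert combined_fitness(1, 1, 1.0) > combined_fitness(1, 1, 2.0)
```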

Sunday, May 5, 2013

An equation for enumerating fractions

In a previous post I described Cantor's way of proving that some infinite lists are bigger than others. I said that this is done by showing that some infinite lists, which are called countably infinite, can be paired up with the natural numbers from 0 onward. I shall now describe an equation that gives the fraction that is paired up with a particular natural number.

The fractions are:
1/1 1/2 1/3 1/4 ...
2/1 2/2 2/3 2/4 ...
3/1 3/2 3/3 3/4 ...
4/1 4/2 4/3 4/4 ...
...
and can be paired up with the natural numbers by doing the following:
  1. Organize them in a grid like above, where the first row consists of all the fractions with a numerator of 1, the second row with a numerator of 2, etc.
  2. Start from the corner (1/1) and pair it with 0
  3. Go down each diagonal in turn, that is, [1/2, 2/1] would be the first diagonal, [1/3, 2/2, 3/1] would be the second, etc. and pair them with the next consecutive natural number
This will give us the following mapping:
[0:1/1, 1:1/2, 2:2/1, 3:1/3, 4:2/2, ...]

In order to find an equation which does this mapping, we first need to know in which diagonal a given natural number would be paired up. So 0 is paired up with the corner (diagonal 0), 1 is paired up with 1/2 which is in the second diagonal (diagonal 1), 2 with 2/1 which is also in the second (diagonal 1), 3 with 1/3 in the third (diagonal 2), etc. Knowing the diagonal will then let us work out which fraction within it is paired with a given natural number.

Let's start by determining which natural number would be paired up with the first fraction in each diagonal. The first fraction in diagonal 0 is paired up with 0, the first fraction in diagonal 1 is paired up with 1, in diagonal 2 with 3, in diagonal 3 with 6, etc. Notice that, since the size of each diagonal increases by 1 each time, the sequence of numbers just mentioned (0, 1, 3, 6, ...) consists of the partial sums of an arithmetic series (the triangular numbers). This means that the first fraction in diagonal 0 is paired up with 0, diagonal 1 with 0+1, diagonal 2 with 0+1+2, diagonal 3 with 0+1+2+3, etc. So in order to find which natural number n will be paired up with the first fraction in diagonal d, we use:

n = d(d+1)/2
d = 0 => n = 0
d = 1 => n = 1
d = 2 => n = 3
d = 3 => n = 6
d = 4 => n = 10

We can use this equation to find the diagonal d in which natural number n is paired up. This is done by making d the subject of the formula, solving the quadratic d^2 + d - 2n = 0 by completing the square: d = -0.5 ± sqrt(2n + 1/4)

This new equation gives the following:
n = 0 => d = 0, -1
n = 1 => d = 1, -2
n = 2 => d = 1.56, -2.56
n = 3 => d = 2, -3
n = 4 => d = 2.37, -3.37
n = 5 => d = 2.7, -3.7
n = 6 => d = 3, -4
n = 7 => d = 3.27, -4.27
n = 8 => d = 3.53, -4.53
n = 9 => d = 3.77, -4.77
n = 10 => d = 4, -5

As you can see, the answer that comes out when using the minus sign is nonsensical, whilst the whole number part of the answer that comes out when using the plus sign is the diagonal in which natural number n belongs. The plus part is what we want, and we can extract the whole number part by using the floor function:

d = floor(-0.5 + sqrt(2n + 1/4))
or
d = floor((sqrt(8n + 1) - 1)/2)

This gives us:
n = 0 => d = 0
n = 1 => d = 1
n = 2 => d = 1
n = 3 => d = 2
n = 4 => d = 2
n = 5 => d = 2
n = 6 => d = 3
n = 7 => d = 3
n = 8 => d = 3
n = 9 => d = 3
n = 10 => d = 4

So natural number 0 will be paired up with a fraction in diagonal 0, 1 will be paired up with a fraction in diagonal 1, 2 in diagonal 1, 3 in diagonal 2, 4 in diagonal 2, etc.
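The two equations can be sanity checked against each other with a short script (the function names are mine): every natural number must lie between the start of its own diagonal and the start of the next one.

```python
import math

def diagonal_of(n):
    # The diagonal in which natural number n is paired up.
    return math.floor((math.sqrt(8*n + 1) - 1) / 2)

def first_in_diagonal(d):
    # The natural number paired up with the first fraction in diagonal d.
    return d * (d + 1) // 2

# Check the two formulas agree for the first 1000 natural numbers.
for n in range(1000):
    d = diagonal_of(n)
    assert first_in_diagonal(d) <= n < first_in_diagonal(d + 1)
```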

Since the largest numerator and denominator in a given diagonal is 1 more than the position of the diagonal (diagonal 2 has a largest numerator, as well as a largest denominator, of 3, etc.), we now also know that the largest numerator and denominator in the diagonal in which the natural number n belongs is floor((sqrt(8n + 1) - 1)/2)+1. So the first fraction in that diagonal will be 1/(floor((sqrt(8n + 1) - 1)/2)+1) whilst the last fraction will be (floor((sqrt(8n + 1) - 1)/2)+1)/1.

Since we also know which natural number is paired up with the first fraction in a given diagonal, we can know which natural number is paired up with the first fraction in the diagonal in which n is paired.
  1. The natural number n which is paired up with the first fraction in diagonal d is: n = d(d+1)/2
  2. The diagonal d in which natural number n is paired up is: d = floor((sqrt(8n + 1) - 1)/2)
  3. So the natural number n which is paired up with the first fraction in the diagonal which natural number m is paired up is: n = floor((sqrt(8m + 1) - 1)/2)(floor((sqrt(8m + 1) - 1)/2) + 1)/2

This gives us:
m = 0 => n = 0
m = 1 => n = 1
m = 2 => n = 1
m = 3 => n = 3
m = 4 => n = 3
m = 5 => n = 3
m = 6 => n = 6
m = 7 => n = 6
m = 8 => n = 6
m = 9 => n = 6
m = 10 => n = 10

Observe that in a given diagonal such as [1/4, 2/3, 3/2, 4/1], you can find what any of the other fractions will be by knowing the first fraction and how far the fraction you want is from it (the numerator increases whilst the denominator decreases). For example, the 0th fraction in the diagonal is (1+0)/(4-0), the 1st fraction is (1+1)/(4-1), the 2nd fraction is (1+2)/(4-2), etc.

We know how far n is from the natural number paired with the first fraction in the diagonal in which n resides:
n - floor((sqrt(8n + 1) - 1)/2)(floor((sqrt(8n + 1) - 1)/2) + 1)/2

And we know the first fraction in the diagonal in which n resides:
1/(floor((sqrt(8n + 1) - 1)/2)+1)

So the fraction which will be paired up with natural number n is:

numerator: 1 + (n - floor((sqrt(8n + 1) - 1)/2)(floor((sqrt(8n + 1) - 1)/2) + 1)/2)
denominator: floor((sqrt(8n + 1) - 1)/2) + 1 - (n - floor((sqrt(8n + 1) - 1)/2)(floor((sqrt(8n + 1) - 1)/2) + 1)/2)

which I'm too lazy to simplify.
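The reverse direction simplifies much more nicely, though: from the derivation above, the fraction num/den lies in diagonal num + den - 2, at offset num - 1 from the first fraction in that diagonal. So mapping a fraction back to its natural number is (natOf is my name for this hypothetical helper):

```python
def natOf(num, den):
    # num/den lies in diagonal d = num + den - 2,
    # at offset num - 1 from the first fraction in that diagonal.
    d = num + den - 2
    return d * (d + 1) // 2 + (num - 1)

print(natOf(1, 1))  # 0
print(natOf(2, 3))  # 7
```

These values match the enumeration produced by the Python code below.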

The equation in Python 3 is as follows:
import math

for n in range(20):
    d = math.floor((math.sqrt(8*n + 1) - 1)/2) #the diagonal in which n lies
    first = d*(d + 1)//2 #the natural number paired with the first fraction in that diagonal
    print(n, ":", 1 + (n - first), "/", d + 1 - (n - first))

And here is the output:
0 : 1 / 1
1 : 1 / 2
2 : 2 / 1
3 : 1 / 3
4 : 2 / 2
5 : 3 / 1
6 : 1 / 4
7 : 2 / 3
8 : 3 / 2
9 : 4 / 1
10 : 1 / 5
11 : 2 / 4
12 : 3 / 3
13 : 4 / 2
14 : 5 / 1
15 : 1 / 6
16 : 2 / 5
17 : 3 / 4
18 : 4 / 3
19 : 5 / 2

So using this enumeration method, the fraction 100/173 will be paired with:
def findNat(num, den):
    n = 0
    while True:
        d = math.floor((math.sqrt(8*n + 1) - 1)/2)
        first = d*(d + 1)//2
        if 1 + (n - first) == num and d + 1 - (n - first) == den:
            return n
        n += 1
print(findNat(100, 173))

36955

Friday, April 12, 2013

The Monty Hall problem

This is an interesting mathematical problem in probability which has an intuitive answer. The only problem is that the intuitive answer is wrong, which is what I love about it. It's called the Monty Hall problem and it goes like this:

Imagine you're in a game show and presented with three doors.



Behind one of the doors is a prize. You are to guess behind which door it is. At this point we should all agree that the chance of guessing the right door would be 1/3. You decide to pick door number 1.



You think that you have sealed your fate, but in a surprising turn of events, the game show host says that in order to help you, since at least one of the two doors left must be wrong, the host is going to tell you that door number 3 is surely wrong.



So now you know that the prize is either behind the door you chose or the other remaining door, door number 2. The host now asks you if you'd rather stick with your first choice or choose the other door.

Did the host actually help you? Is there some advantage to knowing which one of the three doors is surely wrong? Intuitively you'd say that your chances of winning by sticking with your first choice or choosing door number 2 are both 1/2. It's either behind one or the other right? So your chances of winning are equal right? Right guys? Guys?

Most people will not be convinced that this intuition is false. So here's a program I've written in Python 3 which simulates this situation 100 times and counts the number of times you would have won if you stayed with the first choice and the number of times you would have won if you swapped your choice.

import random

stay_win = 0
swap_win = 0
doors = {"Door 1", "Door 2", "Door 3"}
for _ in range(100): #For 100 times do the following...
    #Select the correct door
    correct = random.choice(list(doors))
    #The contestant makes their choice
    first_choice = random.choice(list(doors))

    #The door which can be eliminated cannot be the correct door or the contestant's choice
    can_eliminate = doors - { correct, first_choice }
    #Select the eliminated door
    eliminated = random.choice(list(can_eliminate))

    #The contestant has only one door left to choose if they swap their door
    choice_left = doors - { first_choice, eliminated }
    #Make this unchosen, non-eliminated door the second choice for swapping the door
    second_choice = choice_left.pop()

    if first_choice == correct:
        stay_win += 1
    if second_choice == correct:
        swap_win += 1

    print("Correct:", correct, " | ", "First choice:", first_choice, " | ", "Eliminated:", eliminated, " | ", "Second choice:", second_choice, " | ", "Won if stay:", first_choice == correct)
print()
print("stay:", stay_win, "swap:", swap_win)

In order to let you scrutinize the result, all the individual situations will be outputted as well:
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: True
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: True
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: True
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False

stay: 29 swap: 71

What? That can't be right! You win roughly twice as often by swapping as by staying with your first choice. Let's try it again!

stay: 39 swap: 61

Um...

stay: 37 swap: 63

Hmm... Why would you win twice as often when you swap than when you stay? The only way to know this is by seeing what needs to happen in order to win by sticking with your first choice and to win by swapping.

Here is what the different possibilities are when the correct door is door number 1:


And here is what the different possibilities are when the correct door is door number 2:


And finally here is what the different possibilities are when the correct door is door number 3:


Did you get it? No? Well, look carefully at what needs to happen in order to win using the first choice. You need to guess the correct door on first try. What are the chances of you doing that? It's 1/3 of course. Could you ever win by choosing a wrong door on first try and stick to it? No.

Well then what do you need to do in order to win by swapping doors? Clearly, you will only lose if you choose the correct door on first try, but if it's the wrong door you've chosen on first try then you'll win. What are the chances of you choosing the wrong door on first try? 2/3, twice as many as choosing the correct one.
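Instead of relying on random simulation, the reasoning above can be checked exactly by enumerating all nine equally likely combinations of correct door and first choice:

```python
from itertools import product

doors = ["Door 1", "Door 2", "Door 3"]
stay_wins = swap_wins = total = 0
#Every (correct door, first choice) pair is equally likely
for correct, first_choice in product(doors, doors):
    total += 1
    if first_choice == correct:
        stay_wins += 1 #staying only wins when the first guess was right
    else:
        swap_wins += 1 #otherwise swapping always lands on the prize
print("stay:", stay_wins, "/", total) #stay: 3 / 9
print("swap:", swap_wins, "/", total) #swap: 6 / 9
```

Staying wins in 3 of the 9 cases (1/3) and swapping wins in the other 6 (2/3), exactly as argued.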

Clearly you're twice as likely to win by swapping than by staying. Obviously this doesn't mean that you'll always win, it just means that you're more likely to win. This illustrates a very important point which is that intuition isn't perfect. Sometimes it's just not enough to use intuition in order to predict what will happen, you will need to get down and dirty and actually do boring work in order to make a correct judgement. Never overestimate yourself.

Friday, March 15, 2013

Fractions that approximate pi

We all know that 22/7 is an approximation of pi, although some think that it is exactly pi. It is not, it is just close. But why do we use 22/7 when there are other fractions that approximate pi better? Let's see what some of these fractions are.

To find these fractions I've written a Python 3 program which calculates the errors of various fractions when compared to pi. It works in the following way:

First, it considers a range of denominators to use for fractions. For each denominator, the numerator which makes the fraction closest to pi is found (for example, with a denominator of 7, the closest fraction would be 22/7).

Each such fraction matches pi in some number of leading digits. We would like to find the closest fraction to pi which gets, say, the first 5 digits exactly like pi's, so the best fraction for every number of exact leading digits is found.

Approximations are only useful if they are easy to use and remember and so we will be considering different lengths of numbers when performing the above process. First we shall find the best fractions which consist of a denominator which is less than 10, then which is less than 100, then 1000, etc. in order to control the sizes of the fractions.

Here is the code:
import math
import collections
 
def closestFraction(denominator, target):
    #For a given denominator d and target t, the best numerator n satisfies n/d ≈ t, so n ≈ t*d, the closest whole number of which is round(t*d)
    numerator = round(denominator*target)
    fraction = numerator/denominator
    return (numerator, denominator, fraction)
 
def closestFractions(maxDenominatorRange, target):
    fractions = []
    fractionsFound = set()
    for denominator in range(1, maxDenominatorRange):
        (numerator, denominator, fraction) = closestFraction(denominator, target)
        if fraction not in fractionsFound: #avoid different forms of the same fraction due to non-simplification
            fractionsFound.add(fraction)
            fractions.append((numerator, denominator, fraction))
    return fractions
 
print("pi =", math.pi)
print()
for maxDenominatorRange in [10, 100, 1000, 10000, 100000]:
    #Group fractions by the number of first digits which are exactly like pi's
    numFirstDigitsDict = collections.defaultdict(list)
    for (numerator, denominator, fraction) in closestFractions(maxDenominatorRange, math.pi):
        numFirstDigits = 0
        while round(fraction, numFirstDigits) == round(math.pi, numFirstDigits):
            numFirstDigits += 1
        numFirstDigits -= 1
        numFirstDigitsDict[numFirstDigits].append((numerator, denominator, fraction))
 
    #Keep only the best fraction in each first digits group
    bestNumFirstDigitsDict = dict()
    for numFirstDigits in numFirstDigitsDict:
        bestNumFirstDigitsDict[numFirstDigits] = min(numFirstDigitsDict[numFirstDigits], key=lambda x:abs(x[2] - math.pi))
 
    print("==========")
    print("Fractions with denominator less than", maxDenominatorRange)
    for numFirstDigits in sorted(bestNumFirstDigitsDict): #print in ascending order of matching digits
        (numerator, denominator, fraction) = bestNumFirstDigitsDict[numFirstDigits]
        print("Best fraction with", numFirstDigits, "starting digits like pi's (rounded):", numerator, "/", denominator, "=", fraction)
    print()

And here are the results:

pi = 3.141592653589793

==========
Fractions with denominator less than 10
Best fraction with 0 starting digits like pi's (rounded): 19 / 6 = 3.1666666666666665
Best fraction with 1 starting digits like pi's (rounded): 25 / 8 = 3.125
Best fraction with 2 starting digits like pi's (rounded): 22 / 7 = 3.142857142857143

==========
Fractions with denominator less than 100
Best fraction with 0 starting digits like pi's (rounded): 167 / 53 = 3.150943396226415
Best fraction with 1 starting digits like pi's (rounded): 195 / 62 = 3.1451612903225805
Best fraction with 2 starting digits like pi's (rounded): 311 / 99 = 3.1414141414141414

==========
Fractions with denominator less than 1000
Best fraction with 0 starting digits like pi's (rounded): 167 / 53 = 3.150943396226415
Best fraction with 1 starting digits like pi's (rounded): 412 / 131 = 3.145038167938931
Best fraction with 2 starting digits like pi's (rounded): 2975 / 947 = 3.1414994720168954
Best fraction with 3 starting digits like pi's (rounded): 3085 / 982 = 3.1415478615071284
Best fraction with 4 starting digits like pi's (rounded): 2818 / 897 = 3.141583054626533
Best fraction with 6 starting digits like pi's (rounded): 355 / 113 = 3.1415929203539825

==========
Fractions with denominator less than 10000
Best fraction with 0 starting digits like pi's (rounded): 167 / 53 = 3.150943396226415
Best fraction with 1 starting digits like pi's (rounded): 412 / 131 = 3.145038167938931
Best fraction with 2 starting digits like pi's (rounded): 15541 / 4947 = 3.1414998989286436
Best fraction with 3 starting digits like pi's (rounded): 28497 / 9071 = 3.1415499944879284
Best fraction with 4 starting digits like pi's (rounded): 26669 / 8489 = 3.1415950053009776
Best fraction with 5 starting digits like pi's (rounded): 31218 / 9937 = 3.1415920297876623
Best fraction with 6 starting digits like pi's (rounded): 355 / 113 = 3.1415929203539825

==========
Fractions with denominator less than 100000
Best fraction with 0 starting digits like pi's (rounded): 167 / 53 = 3.150943396226415
Best fraction with 1 starting digits like pi's (rounded): 412 / 131 = 3.145038167938931
Best fraction with 2 starting digits like pi's (rounded): 15541 / 4947 = 3.1414998989286436
Best fraction with 3 starting digits like pi's (rounded): 28497 / 9071 = 3.1415499944879284
Best fraction with 4 starting digits like pi's (rounded): 294069 / 93605 = 3.14159500026708
Best fraction with 5 starting digits like pi's (rounded): 198379 / 63146 = 3.1415924999208182
Best fraction with 6 starting digits like pi's (rounded): 308429 / 98176 = 3.141592649934811
Best fraction with 7 starting digits like pi's (rounded): 209761 / 66769 = 3.141592655274154
Best fraction with 8 starting digits like pi's (rounded): 208341 / 66317 = 3.1415926534674368
Best fraction with 9 starting digits like pi's (rounded): 104348 / 33215 = 3.141592653921421
Best fraction with 10 starting digits like pi's (rounded): 312689 / 99532 = 3.1415926536189365

In my opinion, the best fraction is 355/113 which is easy to remember and is accurate up to 6 digits:
pi      = 3.141592653589793
355/113 = 3.1415929203539825

Tuesday, March 12, 2013

Monte Carlo method of finding the area of a circle and pi

You may know that the digits of pi look random, but did you know that you can use randomness to find pi? You can actually approximate the area of any circle using random points with a method called the Monte Carlo method. This method relies on something called "sampling", which works as follows:

Say you want to find the number of people in a country who are female. You don't have time to check every single person so instead you take a random sample of people, say 10% of the population, and check only those. You find that 40% of the people in this sample are female so you generalize this percentage to the whole population and say that approximately 40% of the country is female. If the sample was random, this would be a reasonable approximation and this sort of thing is done all the time in surveys.

Now what we want to do is to find the area of a circle such as this:



The Montecarlo method works in the following way:

Place that circle inside a square such as this:



Keep in mind that the area of a square is easy to find. Now throw a point inside the square at random, which would result in something like this:



Now if we repeat this process with many points and count how many of the points lie inside the circle, we can approximate what fraction of the area of the square is taken by the circle by treating the points as a sample of the total infinite number of points in the square.



For example, in the above diagram there are 17 points in total but only 14 of the points lie inside the circle. So we can say that approximately 14/17 of the area inside the square is taken by the circle.

Now let's assume that we have a decent number of points and therefore a good approximation of the fraction of the square's area taken up by the circle. How can we use this fraction to find the area of the circle? Just multiply the fraction by the area of the square, which is easy to compute.

So in the above example, the length of the side of the square is about 6cm, so the area of the square is 6x6 = 36cm^2, so the area of the circle is approximately 14/17 x 36 = 29.64cm^2.

In reality, since the circle has a radius which is half as big as the side of the square, its radius is 3cm, so its area is pi x 3^2 = 28.27cm^2. Not bad for a sample of 17 points which weren't exactly placed randomly.

How can this be used to find an approximation of pi? Just approximate the area of a circle of radius 1. The area would then be pi x 1^2 = pi.

Now let's use a computer to create lots of points. How can we check if a point is inside the circle using a computer? We can simply check whether the distance from the point to the center of the circle is less than or equal to the radius of the circle. If the distance to the center is greater than the radius, then the point cannot be inside the circle.



We find the distance of a point to the center by using Pythagoras' theorem as follows:



Here is the Python 3 program we'll be using:
import math, random

def isPointInCircle(x, y, Cx, Cy, radius):
    return math.sqrt((x - Cx)**2 + (y - Cy)**2) <= radius

def approximateCircleArea(radius, numberOfPoints):
    squareSide = radius*2
    Cx = radius
    Cy = radius

    pointsInside = 0
    for i in range(numberOfPoints):
        x = random.random()*squareSide
        y = random.random()*squareSide

        if (isPointInCircle(x, y, Cx, Cy, radius)):
            pointsInside = pointsInside + 1

    return pointsInside / numberOfPoints * squareSide**2
and we'll be finding what the area of a circle of radius 1 is for different numbers of points. Since the radius is 1, the area should approximate pi:
Number of points    Approximate area (should be 3.14159...)
10                  2.4
100                 3.0
1000                3.152
10000               3.14
10000               3.14896
100000              3.14112
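For completeness, here is a self-contained version of the program together with a driver loop of the sort that could have produced the table above (exact values vary from run to run since the points are random):

```python
import math
import random

def is_point_in_circle(x, y, cx, cy, radius):
    # Inside (or on) the circle if the distance to the centre is at most the radius.
    return math.sqrt((x - cx)**2 + (y - cy)**2) <= radius

def approximate_circle_area(radius, number_of_points):
    square_side = radius * 2
    cx = cy = radius  # circle centred inside the square
    points_inside = 0
    for _ in range(number_of_points):
        # Throw a random point inside the square.
        x = random.random() * square_side
        y = random.random() * square_side
        if is_point_in_circle(x, y, cx, cy, radius):
            points_inside += 1
    # Fraction of sampled points inside the circle, scaled by the square's area.
    return points_inside / number_of_points * square_side**2

for n in [10, 100, 1000, 10000, 100000]:
    print(n, approximate_circle_area(1, n))
```

With radius 1 the returned area approaches pi as the number of points grows, just as in the table.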

Thursday, February 7, 2013

Maximum Likelihood Estimate / Laplace's law

What is the maximum likelihood estimate in statistics? Let's use an example from natural language processing.

Let's say that we want to estimate the probability/likelihood of every word in the English language vocabulary. What this means is that you want to estimate the probability that a randomly chosen word happens to be a particular word. There are many uses for such an estimate, called a unigram, such as attempting to predict what the next word to be written by a user will be (in order to type it for the user). Morse code gives the most frequent letters (letters with a highest probability of being used) the shortest codes in order to make sending messages faster. Something similar can be used for words in order to speed up the searching of words and to compress text files.

How can we estimate such a probability? What is usually done is that a large amount of text is collected, called a corpus, and the words in the text are counted. By dividing the number of times each word occurs by the total number of words in the corpus we will have an estimate of the probability for each word. This assumes there is a large number of words and that the words are used naturally (not a single word repeated over and over for example).

So P(w) = F(w)/F(*),
where P(w) is the probability of using a word called "w", F(w) is the frequency or number of times that the word "w" is found in the corpus and F(*) is the total number of words in the corpus.

What is the problem with this estimation? The problem is that any words which are not found in the corpus are assumed to have a probability of zero, that is, that they don't exist. This is not realistic, as even a very large corpus would not have every possible word ever, and new words are invented all the time. What we need to do is give a little less probability to the words we have in the corpus in order to share it with the words which are not in the corpus. This is called smoothing.

As it is, the way we are estimating the probability gives all the probability to the words in the corpus, which is the maximum probability you can give them. For this reason, this method is called the maximum likelihood estimate.

Now how can we smooth out our probability to give the unknown words a little of the probability too? A simple way we can do this is by assuming that each unknown word has an equal probability which is less than the probabilities of the words in the corpus. We can do this easily by assuming that each unknown word has a frequency of 1 and that each known word has a frequency of one more than it actually is. By increasing the number of times that each word occurs in the corpus by 1, including the words which aren't there, our probabilities will change such that the unknown words will all have an equal probability which is less than any known word's probability. In effect, no word will have a probability of zero, since every word would now look like it occurs at least once.

We might be tempted to say that the new probability is "(F(w) + 1)/F(*)", but this would be wrong because when you add together the probabilities of each word the total should come to 1, which it will not using this equation. As it was before, if you add together all the probabilities in the form of "F(w)/F(*)", the total would come to 1. For example if our corpus consisted of just one sentence "the boy played with the girl", then the probabilities of each word would be 2/6 for "the", 1/6 for "boy", 1/6 for "played", 1/6 for "with" and 1/6 for "girl". 2/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1. If we use the wrong formula on the other hand, the probabilities would be 3/6 for "the", 2/6 for "boy", 2/6 for "played", 2/6 for "with" and 2/6 for "girl". 3/6 + 2/6 + 2/6 + 2/6 + 2/6 = 11/6 ≈ 1.8.

What we need to do is to add 1 to the denominator of the probability fraction for every different word we can find the probability of. What this means is that we need to know how many unknown words together with the known words there can be. This is called the vocabulary size and assuming that you know it you just need to add it to the denominator of the probability fraction like this:

P(w) = (F(w) + 1)/(F(*) + V),
where V is the vocabulary size or the total number of words including unknown ones.

This method is called Laplace's law, or rule of succession or add-one smoothing. It's not a very realistic estimation either however, since unknown words will not have equal probabilities and will not necessarily have a probability less than the rarest words in the corpus. Also, giving the unknown words a frequency of 1 might look like it's small, but when you consider that it is just one less than the smallest frequency in the corpus then it might be too large an estimate, especially considering that we might be including words in our vocabulary size which just don't exist (maybe no one uses them anymore).
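To make the difference concrete, here is a small sketch (not from the original post; the vocabulary size V below is a made-up value) comparing the maximum likelihood estimate with add-one smoothing on the toy corpus used earlier:

```python
from collections import Counter

# Toy corpus from the example above
corpus = "the boy played with the girl".split()
counts = Counter(corpus)
total = sum(counts.values())  # F(*) = 6

# Maximum likelihood estimate: P(w) = F(w) / F(*)
def p_mle(w):
    return counts[w] / total

# Add-one (Laplace) smoothing: P(w) = (F(w) + 1) / (F(*) + V)
V = 10  # hypothetical vocabulary size, including unseen words

def p_laplace(w):
    return (counts[w] + 1) / (total + V)

print(p_mle("the"))      # = 2/6
print(p_laplace("the"))  # = 3/16
print(p_laplace("dog"))  # an unseen word now gets 1/16 instead of 0
```

Note that the smoothed probabilities still sum to 1 over the whole vocabulary: the 5 known words contribute 11/16 and the 5 unseen words contribute 1/16 each.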

There are other smoothing methods which can be used, such as using synonyms of unknown words. By using the probability of synonyms which are found in the corpus you can find an estimation of the unknown word's probability.

Let's see another example of how maximum likelihood estimate and Laplace's law can be used. Let's say that we want to do something more interesting than finding the probabilities of unigrams (single words) and instead want to find the probabilities of bigrams, that is, pairs of words. The nice thing about bigrams is that you can use a word in order to predict what the next word might be in a sentence. This would be useful for making writing text messages faster by having your cell phone predict and write for you the next word you want to write. Using more words than just the previous word in order to predict the next word would make for more accurate guesses, but let's just consider bigrams instead of the full "ngrams".

The probability we want to find is that of the user writing a given word provided that the previous word he/she wrote is known. So for example, we want to know what the probability of the user writing "cream" is after he/she has written the word "ice". We can then do this for every word and sort them by their probability in order to present the user with a list of suggested next words. Using maximum likelihood estimate, the probability is found by taking a corpus and counting the number of times the word "ice cream" is found and dividing that with the number of times the word "ice" is found. That way we'll estimate the probability of the word "ice" being followed by the word "cream".

P(w1 w2) = F(w1 w2)/F(w1),
where w1 is the first word and w2 is the next word.

Using Laplace's law we need to add 1 to the numerator and add the total number of possible word pairs, known and unknown, to the denominator. The total number of possible word pairs can be calculated simply as the vocabulary size squared. This assumes that every possible word can follow every other possible word, including the same word. This isn't realistic at all, but it might do.

P(w1 w2) = (F(w1 w2) + 1)/(F(w1) + V^2)
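A sketch of the bigram version (the corpus and vocabulary size here are made-up values for illustration):

```python
from collections import Counter

corpus = "the boy played with the girl and the boy ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))  # consecutive word pairs

V = 20  # hypothetical vocabulary size

def p_next(w1, w2):
    # Laplace-smoothed estimate of P(w2 | w1):
    # (F(w1 w2) + 1) / (F(w1) + V^2)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V**2)

print(p_next("the", "boy"))  # "the boy" occurs twice, "the" occurs three times
```

An unseen pair such as ("the", "zebra") still gets a small non-zero probability of 1/(F("the") + V^2) rather than zero.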

Saturday, January 12, 2013

Summary of research paper "Discovering Word Senses from Text" by Patrick Pantel and Dekang Lin

This is a summary of the research paper http://webdocs.cs.ualberta.ca/~lindek/papers/kdd02.pdf. This paper describes a way to group together similar words according to the different meanings of each word, all by just using the contexts of the words. You should also check Pantel's PhD thesis on the algorithm used in this paper at http://www.patrickpantel.com/download/papers/2003/cbc.pdf. You might also be interested in seeing this Java implementation of the algorithm at https://github.com/fozziethebeat/S-Space/blob/master/src/main/java/edu/ucla/sspace/clustering/ClusteringByCommittee.java.

In order to judge the similarity between two words, the grammatical relations between those words and other words as used in a corpus were used. For example, US state names can be recognized as similar because they all share the following contexts:
___ appellate court
campaign in ___
___ capital
governor of ___
___ driver's license
illegal in ___
primary in ___
___'s sales tax
senator for ___

So, for example, by counting the number of times the verb "campaign" is grammatically related to each word in a corpus, that count can be used to find other words which are similar because they have similar counts. The grammatically related word is called a feature of the word. By finding all the different features of each word in a corpus, we can give each word a list of its features and the number of times that feature was encountered as a context of the word. For example, we can find the number of times the word "campaign" appears as a feature of "California", as well as the word "capital", "governor", etc. All these features when grouped together are called a feature vector of the word.

However, in this paper, it is not the counts, or frequencies, that are associated with the features. Using raw counts to judge the importance of a feature is misleading, as the feature might just be common in general and thus not useful for finding similar words through shared features. So instead the pointwise mutual information is used. This compares the probability of the word and the context co-occurring with the probability that they would do so by coincidence (independently). In this paper it is calculated as:

mi(w,c) = p(w,c) / (p(w) × p(c))
where "w" is the word, "c" is a context of "w", "p(w,c)" is the probability of finding "c" as a context of "w", "p(w)" is the probability of finding the word "w" in some context and "p(c)" is the probability of finding "c" as a context of some word. These probabilities are calculated as follows:

p(w,c) = f(w,c) / N
p(w) = sum(f(w,c) for c in contexts) / N
p(c) = sum(f(w,c) for w in words) / N
N = sum(sum(f(w,c) for w in words) for c in contexts)
where "f(w,c)" is the number of times "c" is a context of "w" and "N" is the total number of word-context combinations.

However, the pointwise mutual information gives bigger values to rare words and contexts, so it is made fairer by being multiplied by a discounting factor:

df(w,c) = f(w,c) / (f(w,c)+1) × min(f(w), f(c)) / (min(f(w), f(c)) + 1)
where
f(w) = sum(f(w,c) for c in contexts)
f(c) = sum(f(w,c) for w in words)
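The mutual information and discounting computations can be sketched over a tiny made-up co-occurrence table (the counts below are invented purely for illustration):

```python
f = {  # hypothetical frequency counts f(w, c)
    ("dog", "barked"): 10, ("dog", "ran"): 5,
    ("cat", "ran"): 4, ("cat", "meowed"): 8,
}
N = sum(f.values())  # total word-context combinations

def fw(w):  # f(w): total count of word w over all contexts
    return sum(v for (w2, c), v in f.items() if w2 == w)

def fc(c):  # f(c): total count of context c over all words
    return sum(v for (w, c2), v in f.items() if c2 == c)

def mi(w, c):
    # p(w,c) / (p(w) * p(c)), all estimated from the counts
    return (f.get((w, c), 0) / N) / ((fw(w) / N) * (fc(c) / N))

def discounted_mi(w, c):
    # Multiply by the discounting factor to penalise rare words/contexts.
    fwc = f.get((w, c), 0)
    m = min(fw(w), fc(c))
    return mi(w, c) * (fwc / (fwc + 1)) * (m / (m + 1))

print(mi("dog", "barked"), discounted_mi("dog", "barked"))
```

A value above 1 means the word and context co-occur more often than chance; a pair that never co-occurs gets 0.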

In order to measure the similarity between two words, the cosine coefficient is used:

sim(a,b) = sum(mi(a,c) × mi(b,c) for c in contexts) / sqrt(sum(mi(a,c)^2 for c in contexts) × sum(mi(b,c)^2 for c in contexts))
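The cosine coefficient over sparse feature vectors can be sketched like this (the vectors and values below are made up):

```python
import math

def cosine_sim(a, b):
    # a and b are sparse feature vectors: {context: mi value}.
    # Missing features count as 0.
    dot = sum(a[c] * b.get(c, 0) for c in a)
    norm = math.sqrt(sum(v * v for v in a.values()) *
                     sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

a = {"capital": 2.0, "governor of": 1.5}
b = {"capital": 1.0, "senator for": 2.5}
print(cosine_sim(a, b))
```

The result is 1 for vectors pointing in the same direction and 0 for vectors with no features in common.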

Words with different meanings, or senses, are called polysemous words. In order to cluster the words together according to their senses, using a different cluster for each sense, three phases are used.

In the first phase, the top 10 similar words for each word were found. In order to speed up the search, the feature vectors were indexed by their features, such that you can find all the feature vectors that have a given feature. The index is then used to find feature vectors that share features in common with the word we are finding similar words to. In this way you skip all the words which share nothing in common with the word of interest. In order to further speed things up, only some of the word's features are considered. The features which have a small value (pointwise mutual information) are ignored when looking for similar words. So only words which share these high valued features are retrieved from the index. So if we are looking for similar words to a word with the following feature vector:
{
  ("object of", "carry") : 5.118,
  ("object of", "bring") : 3.364,
  ("object of", "look at") : 2.753,
  ("object of", "have") : 1.353
}
and we only consider features with a value higher than 3.0 then we will use the index to find all the words which have a feature vector with the features
{
  ("object of", "carry") : 5.118,
  ("object of", "bring") : 3.364
}
This narrows down the number of words to check the similarity of (using the sim function) and so speeds up the search greatly. The top 10 most similar words are then retrieved. The fact that words which don't share high valued features are ignored did not seem to affect the result.
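The indexing trick in this phase can be sketched as follows (the feature vectors and cutoff below are made up; the real system indexes pointwise mutual information values):

```python
from collections import defaultdict

# Hypothetical feature vectors: word -> {feature: mi value}
vectors = {
    "luggage": {("object of", "carry"): 5.1, ("object of", "bring"): 3.4},
    "gift":    {("object of", "bring"): 4.0, ("object of", "have"): 1.2},
    "idea":    {("object of", "have"): 2.0},
}

threshold = 3.0

# Inverted index: high-valued feature -> set of words having it
index = defaultdict(set)
for word, feats in vectors.items():
    for feat, value in feats.items():
        if value > threshold:  # only index high-valued features
            index[feat].add(word)

def candidate_similar_words(word):
    # Words sharing at least one high-valued feature with `word`;
    # only these need their full similarity computed.
    cands = set()
    for feat, value in vectors[word].items():
        if value > threshold:
            cands |= index[feat]
    cands.discard(word)
    return cands

print(candidate_similar_words("luggage"))  # shares ("object of", "bring") with "gift"
```

Only the candidates returned here need to be scored with the sim function, skipping every word that shares no high-valued feature.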

In the second phase, a clustering algorithm called "clustering by committee" is applied to the words using the similarity lists of the first phase. The clustering algorithm tries to assign the words into small clusters called "committees". Words which don't fit into any committee are called "residue". The point of these small clusters is to obtain centroids from which to cluster all the words, just like the randomly chosen centroids for k-means clustering, but instead the centroids are chosen intelligently. These centroids would originate from committees of words all of which mean a particular word sense. In this way the committee would decide whether to allow a word in the cluster or not. The way each committee is chosen is such that it is composed of words which do not have multiple meanings as used in the corpus. This makes the centroid of its feature vectors tend to only have features which belong to a particular word sense rather than to multiple senses, which leads to purer clusters of word senses. For example, when clustering USA state names, state names which are also names of cities (such as New York) are not included in the committee and only state names which mean nothing else are included which would avoid clustering city names in with the state name cluster when full clustering is performed. This is the clustering algorithm:
def clustering_by_committee(words, simlists, committeeSimilarityThreshold, residueSimilarityThreshold):
    highest_scoring_clusters = []
    #STEP 1.1
    for word in words:
        #STEP 1.2
        clusters = averagelink_cluster(simlists[word]) #use average link hierarchical agglomerative clustering to cluster the words into similarity clusters such that clusters can only be merged if their similarity is below a minimum similarity threshold
        #STEP 1.3
        scores = dict()
        for cluster in clusters:
            scores[cluster] = size(cluster) * average_pairwise_similarity(cluster)
        #STEP 1.4
        highest_scoring_clusters.append(arg_max(scores)) #append the highest scoring cluster

    #STEP 2.1
    sort_descending(highest_scoring_clusters) #sort highest_scoring_clusters in descending order of score

    #STEP 3.1
    committees = []
    #STEP 3.2
    for cluster in highest_scoring_clusters:
        #STEP 3.3
        centroid = compute_centroid(cluster) #the centroid is computed by averaging the frequencies (not the pointwise mutual information) of the features of each word in the cluster. these average frequencies are used to compute the pointwise mutual information which will be used to create the centroid vector (just as the feature vectors were constructed).
        #STEP 3.4
        if similarity(centroid, any committee in committees) < committeeSimilarityThreshold: #a committee cannot be very similar to other committees
            committees.append(cluster)

    #STEP 4.1
    if committees == []:
        return committees

    residues = []
    #STEP 5.1
    for word in words:
        #STEP 5.2
        if similarity(word, any committee in committees) < residueSimilarityThreshold: #if a word is not sufficiently close to any committee then it is a residue and more committees are needed for this word to belong to. otherwise the word is "covered" by a committee.
            residues.append(word)

    #STEP 6.1
    if residues == []:
        return committees
    #STEP 6.2
    else:
        return committees + clustering_by_committee(residues, simlists, committeeSimilarityThreshold, residueSimilarityThreshold)
In the third phase, the committees discovered in the second phase are used to actually cluster words by their different senses. The words are not added to committees themselves, rather to clusters each of which is represented by its own committee. The similarity of a word to a cluster is the similarity between the word's feature vector and the centroid of the cluster's committee (which remains the same regardless of which words are added to the cluster). If hard clustering is used then words are only added to the most similar cluster, but if soft clustering is used, as is the case here, then the same word can be added to more than one cluster, one for each sense of the word. Here is the algorithm which assigns words to clusters:
def assign_word_to_clusters(word, committeeClusters, minWordSimilarity, maxClusterSimilarity):
    clusters = []
    similar_clusters = top_200_most_similar(committeeClusters, word) #find the top 200 clusters whose committee's centroid is most similar to the word
    while similar_clusters != []:
        best_cluster = get_most_similar_cluster(similar_clusters, word)
        if similarity(word, best_cluster) < minWordSimilarity:
            break #there is no cluster left that is similar enough to assign the word to
        elif similarity(best_cluster, any cluster in similar_clusters) < maxClusterSimilarity: #a word cannot be assigned to a cluster whose committee's centroid is similar to another cluster's committee's centroid
            best_cluster.add(word)
            remove_common_features(word, best_cluster) #remove the features from the word's feature vector which are also found in best_cluster's committee's centroid
        similar_clusters.remove(best_cluster)
The fact that the features in the word which are also found in the cluster's committee's centroid are removed before it is assigned to another cluster is important, as it allows the word to be assigned to the clusters of its less common senses. For example, the word "plant" would first be assigned to the "factory" sense cluster in a newspaper corpus because that is the meaning of the word most of the time in such a corpus. After it is assigned to the factory sense cluster, the features which have to do with this sense are removed (or their values are set to 0) from the feature vector of "plant". Without these features, the remaining features would be of a different sense, such as the living plant, and so the word can now easily be assigned to the living plant sense cluster since these features are no longer overshadowed.

So this system will output a list of clusters to which each sense of every word will belong. In order to evaluate it, the sense clusters are compared to WordNet synonym groups called synsets. This is how the comparison is made: First, the similarity measure proposed by Lin, which measures the similarity between two WordNet synsets, is used. This is described in another post here.
sim_synset(s1, s2) = 2 log P(most_specific_common_synset(s1, s2)) / (log P(s1) + log P(s2))
The similarity between a word and a synset is the maximum similarity between the synset and one of the synsets in which the word belongs:
sim_synset_word(s, w) = max(sim_synset(s, t) for t in synsets_of(w))
Let "top_words(c, k)" be the top "k" most similar words in a sense cluster "c" to the committee which the cluster was based on. The similarity between a synset and a sense cluster is the average similarity between the synset and "top_words(c, k)".
sim_synset_cluster(s, c, k) = sum(sim_synset_word(s, w) for w in top_words(c, k)) / k
We can check if a word belongs to a correct sense cluster by checking the similarity of the sense cluster to all the synsets of the word. If the maximum similarity found is greater than a threshold, then it is considered to be a correct sense cluster.
is_correct_sense_cluster(w, c, k, threshold) = max(sim_synset_cluster(s, c, k) for s in synsets_of(w)) >= threshold
"k" was set to 4 and the threshold was varied in the experiments. By checking for which synset is most similar to the cluster, we can find the WordNet sense that corresponds to the cluster. If more than one cluster corresponds to the same WordNet sense, then only one of them is considered correct. The precision of the system's assignment of a particular word is the percentage of clusters the word is assigned to that are a correct sense. The precision of the system in general is the average precision of all words. The recall of the system's assignment of a particular word is the percentage of correct clusters the word is assigned to out of all the actual correct senses of the word which are found in the corpus. Since the corpus was not sense tagged this is a tricky situation. So the next best thing was to use several clustering methods to extract as many correct sense clusters (as defined above) for the words as possible. The union of all these correct senses are used as the number of different senses each word has in the corpus. The overall recall of the system is the average recall of every word. The precision and recall scores are combined into an F measure:
F = 2 × precision × recall / (precision + recall)
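As a quick sketch of the combination (a hypothetical helper, not from the paper):

```python
def f_measure(precision, recall):
    # Harmonic mean of precision and recall; 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.6, 0.5))  # = 6/11, pulled toward the lower of the two
```

The harmonic mean rewards systems only when precision and recall are both reasonably high.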
In order to do the evaluation, 144 million words were parsed using Minipar in order to extract the grammatical relations between words. The words used for evaluation were the intersection of the words found in WordNet with those whose total pointwise mutual information with all their features in the corpus was greater than 250. 13403 words met the criteria, and the average number of features per word was 740.8.

As was already mentioned, the threshold described in "is_correct_sense_cluster" was varied during the experiment, and the F measure was computed at different thresholds for different clustering algorithms. The clustering algorithms were modified to just give the centroids of the clusters, and then every word with a similarity greater than 0.18 to a centroid was associated with that cluster, in order to allow the same word to be assigned to different clusters, one for each sense. The system described here is "CBC", which stands for "Clustering By Committee", and it outperformed all the other clustering methods at all thresholds.