The first step was to download English Wikipedia and Simple English Wikipedia articles (3,266,245 and 60,100 articles respectively). The following preprocessing tasks were then performed on them:
- Comments were removed
- HTML tags were removed
- Links were removed
- Tables and figures were removed (with associated text)
- Text was tokenized
- All words were converted to lower case
- Sentences were split
The last 3 steps were carried out using the Stanford NLP Package.
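The pipeline above can be sketched in a few lines of Python. This is only a rough stand-in: the paper used the Stanford NLP package for tokenization, lowercasing and sentence splitting, whereas the regexes and naive splitter here are illustrative approximations.

```python
import re

def preprocess(raw):
    """Rough sketch of the preprocessing pipeline (the paper used the
    Stanford NLP package for the last three steps)."""
    text = re.sub(r"<!--.*?-->", " ", raw, flags=re.DOTALL)   # strip comments
    text = re.sub(r"<[^>]+>", " ", text)                      # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)                 # strip links
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())      # naive sentence split
    # tokenize (split off punctuation) and lowercase
    return [[tok.lower() for tok in re.findall(r"\w+|[^\w\s]", s)]
            for s in sentences if s]

preprocess("Cats <b>sleep</b>. Dogs bark!")
# → [['cats', 'sleep', '.'], ['dogs', 'bark', '!']]
```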
The system consists of first learning simplification rules (finding pairs of difficult/simpler words) and then applying them by finding which is the best word to replace a difficult word. We shall now look at each of these stages.
To learn the simplification rules, a 10-token window (10 tokens on each side), which crosses sentence boundaries (confirmed through private communication with the authors), was used to construct a context vector for each token (excluding stop words, numbers and punctuation) in the English Wikipedia corpus. All numbers were represented by an artificial token such as "<number>", so that all instances of any number next to a token are counted together. Only frequencies greater than 2 were kept in the context vector.
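A minimal sketch of this context-vector construction, assuming a flat token stream (so windows cross sentence boundaries) and an illustrative stop-word list; the helper names are mine, not the paper's:

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "an", "of", "to", "and", "in", "is"}  # illustrative subset

def is_content(tok):
    # skip stop words, numbers and punctuation as target tokens
    return tok.isalpha() and tok not in STOP

def normalize(tok):
    # collapse every number into one artificial context token
    return "<number>" if tok.isdigit() else tok

def build_context_vectors(tokens, window=10, min_freq=3):
    """Count co-occurrences in a 10-token window on each side of every
    content token; keep only context frequencies greater than 2."""
    vectors = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if not is_content(tok):
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[tok][normalize(tokens[j])] += 1
    return {w: Counter({c: n for c, n in ctx.items() if n >= min_freq})
            for w, ctx in vectors.items()}
```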
In order to create a list of possible word substitutions, first a Cartesian product of all words in the English Wikipedia corpus was created which was then filtered in the following ways in order to keep only word pairs which are simplification replacements:
- All morphological variants were removed by checking their lemmas with MorphAdorner.
- In order to catch morphological variants that MorphAdorner does not handle, all word pairs where one word is the other followed by one of the suffixes "s", "es", "ed", "ly", "er" or "ing" were also removed.
- WordNet was used to remove all word pairs where the first word is not a synonym or hypernym of the second.
- In order to keep only word pairs where the second word is simpler than the first, lexical complexity is measured as the frequency of the word in the English Wikipedia corpus divided by its frequency in the Simple English Wikipedia corpus, multiplied by the length of the word. The reasoning is that the more often a word appears in Simple English Wikipedia relative to English Wikipedia, and the shorter the word is, the simpler it is, and so the smaller its complexity measure. All word pairs where the second word is harder than the first were removed.
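The complexity measure is simple enough to state directly in code. The frequency counts below are made up for illustration; only the formula comes from the paper:

```python
def complexity(word, freq_en, freq_simple):
    """Corpus-frequency ratio times word length: a word that is relatively
    more common in Simple English Wikipedia, and shorter, scores lower,
    i.e. is considered simpler. Frequencies are assumed non-zero."""
    return freq_en[word] / freq_simple[word] * len(word)

# Illustrative (made-up) counts:
freq_en = {"utilize": 900, "use": 3000}
freq_simple = {"utilize": 30, "use": 4000}
complexity("utilize", freq_en, freq_simple)  # 900/30 * 7 = 210.0
complexity("use", freq_en, freq_simple)      # 3000/4000 * 3 = 2.25
```

Since "use" gets the lower score, the pair (utilize, use) survives the filter; the reversed pair would be removed.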
Now we have the substitution list. To make it usable, each word pair's similarity in meaning is computed as the cosine similarity of the two words' context vectors. This makes it possible to choose, among the candidates for a difficult word, the replacement most similar to it in meaning.
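Since the context vectors are sparse (most words never co-occur), cosine similarity is conveniently computed over dicts of counts. A small self-contained sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse context vectors (dicts of counts)."""
    dot = sum(n * v.get(c, 0) for c, n in u.items())
    norm = (math.sqrt(sum(n * n for n in u.values())) *
            math.sqrt(sum(n * n for n in v.values())))
    return dot / norm if norm else 0.0

cosine({"feline": 2, "pet": 1}, {"feline": 2, "pet": 1})  # identical → 1.0
cosine({"feline": 1}, {"engine": 1})                      # disjoint  → 0.0
```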
Also, each word pair is expanded into a list of all mutually consistent morphological variants of the two words. All nouns are expanded into plural and singular versions (so that if the word to replace is plural, the simplified word will also be plural) and all verbs are expanded into all variants of tense and person. The variants keep the similarity score of the original pair.
In order to simplify a sentence (which must have at least 7 content words), every word in the sentence which is not inside quotation marks or parentheses is checked for possible simplification.
For every word in the sentence, its context vector inside the sentence is computed just as described above. The word is then checked for whether it is used precisely because it is difficult, for example when defining a difficult word. An example given in the paper is "The history of the Han ethnic group is closely tied to that of China", in which the word "Han" should not be replaced with "Chinese", as that would make the sentence lose its meaning. To detect these cases, the system checks whether the word is used in a context strongly typical of it: if the cosine similarity of the word's corpus context vector with its context vector in the sentence is greater than 0.1, the word is not simplified.
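This check can be sketched as follows, reusing a sparse-vector cosine; the function name and the toy vectors are mine, only the 0.1 threshold and the decision rule come from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse context vectors (dicts of counts)."""
    dot = sum(n * v.get(c, 0) for c, n in u.items())
    norm = (math.sqrt(sum(n * n for n in u.values())) *
            math.sqrt(sum(n * n for n in v.values())))
    return dot / norm if norm else 0.0

def should_skip(corpus_vec, sentence_vec, threshold=0.1):
    """Skip simplification when the word's in-sentence context is strongly
    typical of the word overall (it is likely used for its precise meaning,
    as with "Han" in a sentence about China)."""
    return cosine(corpus_vec, sentence_vec) > threshold

# "Han" appearing next to its most typical context words → skipped:
should_skip({"china": 5, "ethnic": 4}, {"china": 1, "ethnic": 1})  # True
# An atypical context → eligible for simplification:
should_skip({"china": 5}, {"music": 1})                            # False
```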
Next, among all word pairs in the substitution list whose first word is the word to simplify, only those whose second word is not already in the sentence are considered. Each remaining candidate is then checked for whether it has the same sense as the word it would replace, since the same word may have different meanings. To do this, a common context vector is created for the difficult and simple words by taking, for each context word, the smaller of its two values in the corpus context vectors of the pair. If the cosine similarity of this common context vector with the difficult word's context vector in the sentence is less than 0.01, the candidate is discarded. Of the remaining candidates, the one with the highest similarity to the difficult word is used.
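The whole candidate-selection step can be sketched as below. The function and variable names, the toy vectors, and the made-up similarity scores are illustrative; the element-wise minimum, the 0.01 threshold, and the highest-similarity rule come from the description above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse context vectors (dicts of counts)."""
    dot = sum(n * v.get(c, 0) for c, n in u.items())
    norm = (math.sqrt(sum(n * n for n in u.values())) *
            math.sqrt(sum(n * n for n in v.values())))
    return dot / norm if norm else 0.0

def common_vector(u, v):
    """Element-wise minimum of two corpus context vectors."""
    return {c: min(n, v[c]) for c, n in u.items() if c in v}

def pick_replacement(difficult, sentence_vec, candidates, corpus_vectors,
                     pair_similarity, sentence_tokens, sense_threshold=0.01):
    """Drop candidates already in the sentence, drop candidates whose sense
    does not match the in-sentence use, return the most similar survivor."""
    best, best_sim = None, -1.0
    for simple in candidates:
        if simple in sentence_tokens:        # already present in the sentence
            continue
        common = common_vector(corpus_vectors[difficult],
                               corpus_vectors[simple])
        if cosine(common, sentence_vec) < sense_threshold:
            continue                         # different sense in this context
        sim = pair_similarity[(difficult, simple)]
        if sim > best_sim:
            best, best_sim = simple, sim
    return best

corpus_vectors = {
    "intractable": {"problem": 4, "solve": 3, "illness": 2},
    "hard": {"problem": 3, "solve": 2, "rock": 5},
    "tough": {"meat": 4, "rock": 1},
}
pick_replacement("intractable", {"problem": 1, "solve": 1},
                 ["hard", "tough"], corpus_vectors,
                 {("intractable", "hard"): 0.6, ("intractable", "tough"): 0.7},
                 {"the", "problem", "is", "intractable"})
# "tough" shares no context with "intractable" (wrong sense), so "hard" wins
# despite its lower pair similarity.
```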
In order to evaluate the simplification system, a baseline was used which simply replaces words with their most frequent synonym (assumed to be simpler). Test sentences were found by searching the English Wikipedia corpus for all sentences for which the system determined that exactly one word could be simplified, that word had a more frequent synonym (so that the baseline could also simplify it), and the system and the baseline chose different replacements. These sentences were grouped by the substitution pair used to simplify them, and one random sentence from each group was used in the evaluation. 65 such sentences were found and were simplified both by the system and by the baseline, resulting in 130 complex/simplified sentence pairs.
Since the groups from which a random sentence was selected differed in size, some substitution pairs are applied more frequently than others. To see how the frequency of use of a pair affects the performance of the system, the sentence pairs were grouped into 3 "frequency bands" of roughly equal total frequency of use (group size): 46 in the "high" band, 44 in the "med" band and 40 in the "low" band.
For the evaluation, the complex/simplified sentence pairs (both system and baseline) were given to 3 human judges (native English speakers who are not the authors), who rated each simplified sentence on grammaticality (is it still grammatical?) as "bad", "ok" or "good", meaning preservation (does it still mean the same thing?) as "yes" or "no", and simplification (is it simpler?) as "yes" or "no". No annotator was given both the system and the baseline version of the same sentence, and a small portion of the pairs was given to more than one judge in order to check inter-annotator agreement.
Here is a table showing all the results. Under grammaticality, the percentage in brackets is the percentage of "ok" ratings and the percentage outside the brackets is the percentage of "good" ratings.
Grammaticality is about the same for the system and the baseline because the form of the replacement word is determined by a morphological engine rather than from context. Meaning preservation suffers in the low frequency band because there are fewer examples from which to build a quality context vector for determining when a particular word can be used in a particular context. On simplification, the system consistently outperforms the baseline.