Saturday, June 23, 2012

Summary of research paper "Readability Assessment for Text Simplification" by Sandra Aluisio, Lucia Specia, Caroline Gasperin and Carolina Scarton

This is a summary of the research paper http://aclweb.org/anthology-new/W/W10/W10-1001.pdf. The paper describes a system that classifies Portuguese text into one of three reading-difficulty levels: "rudimentary", "basic" and "advanced". These levels are defined by INAF, the National Indicator of Functional Literacy (Portuguese report at http://www.ibope.com.br/ipm/relatorios/relatorio_inaf_2009.pdf). The aim is to use the system to help authors detect parts of their writing that could be simplified. The system is designed for Portuguese rather than English.

The following features are extracted from the Portuguese text using various tools and resources described in the paper (list copied directly from the paper):
  1. Number of words
  2. Number of sentences
  3. Number of paragraphs
  4. Number of verbs
  5. Number of nouns
  6. Number of adjectives
  7. Number of adverbs
  8. Number of pronouns
  9. Average number of words per sentence
  10. Average number of sentences per paragraph
  11. Average number of syllables per word
  12. Flesch index for Portuguese
  13. Incidence of content words
  14. Incidence of functional words
  15. Raw Frequency of content words
  16. Minimal frequency of content words
  17. Average number of verb hypernyms
  18. Incidence of NPs
  19. Number of NP modifiers
  20. Number of words before the main verb
  21. Number of high level constituents
  22. Number of personal pronouns
  23. Type-token ratio
  24. Pronoun-NP ratio
  25. Number of “e” (and)
  26. Number of “ou” (or)
  27. Number of “se” (if)
  28. Number of negations
  29. Number of logic operators
  30. Number of connectives
  31. Number of positive additive connectives
  32. Number of negative additive connectives
  33. Number of positive temporal connectives
  34. Number of negative temporal connectives
  35. Number of positive causal connectives
  36. Number of negative causal connectives
  37. Number of positive logic connectives
  38. Number of negative logic connectives
  39. Verb ambiguity ratio
  40. Noun ambiguity ratio
  41. Adverb ambiguity ratio
  42. Adjective ambiguity ratio
  43. Incidence of clauses
  44. Incidence of adverbial phrases
  45. Incidence of apposition
  46. Incidence of passive voice
  47. Incidence of relative clauses
  48. Incidence of coordination
  49. Incidence of subordination
  50. Out-of-vocabulary words
  51. LM probability of unigrams
  52. LM perplexity of unigrams
  53. LM perplexity of unigrams, without line break
  54. LM probability of bigrams
  55. LM perplexity of bigrams
  56. LM perplexity of bigrams, without line break
  57. LM probability of trigrams
  58. LM perplexity of trigrams
  59. LM perplexity of trigrams, without line break
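
Several of the surface features at the top of the list (numbers of words, sentences and paragraphs, words per sentence, type-token ratio) can be approximated with very little machinery. The sketch below is my own illustration, not the paper's code, and uses naive regex-based splitting where the real system would use a proper tokenizer and sentence segmenter:

```python
import re

def basic_features(text):
    """Approximate a few of the surface features (counts and ratios)."""
    # Naive splits; a real system would use a tokenizer/sentence splitter.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "n_paragraphs": len(paragraphs),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }

feats = basic_features("Um texto simples. Outra frase aqui.")
print(feats["words_per_sentence"])  # → 3.0
```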

Features 1 to 42 form the "Coh-Metrix-PORT" feature group and were derived from the Coh-Metrix-PORT tool, a Portuguese adaptation of Coh-Metrix.

Features 43 to 49 form the "Syntactic" feature group and were added to detect syntactic constructions that are useful targets for automatic simplification.

Features 50 to 59 form the "Language model" feature group: probabilities and perplexities of the input text's n-grams under language models trained on general Portuguese, plus a count of words that are not in the system's vocabulary.
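
The intuition behind these features is that simpler text tends to be less surprising to a language model trained on ordinary text. As a rough illustration (my own sketch, not the paper's setup, which would use larger smoothed n-gram models), here is a unigram model with add-one smoothing and a perplexity function:

```python
import math
from collections import Counter

def unigram_lm(train_tokens):
    """Train a unigram model with add-one (Laplace) smoothing."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    def prob(tok):
        return (counts.get(tok, 0) + 1) / (total + vocab)
    return prob

def perplexity(prob, tokens):
    # Perplexity = exp of the average negative log-probability
    log_sum = sum(math.log(prob(t)) for t in tokens)
    return math.exp(-log_sum / len(tokens))

train = "o gato viu o rato".split()
prob = unigram_lm(train)
# Text made of frequent, in-vocabulary words gets lower perplexity
# than text full of out-of-vocabulary words.
assert perplexity(prob, "o gato".split()) < perplexity(prob, "paralelepípedo ornitorrinco".split())
```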

Features 1 to 3 and 9 to 11 are also grouped as the "Basic" feature group because they require no linguistic knowledge. Feature 12 forms a feature group of its own, called "Flesch".
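
For reference, the Flesch index for Brazilian Portuguese is usually given as the English Flesch Reading Ease formula shifted by +42 points (I am assuming that adaptation here; the function below is an illustration, not the paper's implementation, and it takes pre-computed counts as input):

```python
def flesch_pt(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease adapted for Brazilian Portuguese
    (assumed formula: English Flesch shifted by +42 points)."""
    asl = n_words / n_sentences   # average sentence length in words
    asw = n_syllables / n_words   # average syllables per word
    return 248.835 - 1.015 * asl - 84.6 * asw

# Shorter sentences and shorter words yield a higher (easier) score.
easy = flesch_pt(n_words=100, n_sentences=10, n_syllables=150)
hard = flesch_pt(n_words=100, n_sentences=4, n_syllables=280)
assert easy > hard
```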

Experiments were performed to check whether machine learning techniques can learn to predict the INAF difficulty levels, and to discover which features are most useful for complexity detection.

Three types of machine learning algorithms in Weka were evaluated:
  • "Standard classification" (SMO): a support vector machine classifier which treats the difficulty levels as unrelated categories
  • "Ordinal classification" (OrdinalClassClassifier): an ordinal classifier which assumes that the difficulty levels are ordered
  • "Regression" (SMO-reg): a regressor which treats the difficulty levels as points on a continuous scale
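
The ordinal setup is the least familiar of the three. Weka's OrdinalClassClassifier follows the Frank & Hall scheme: for ordered levels 0 < 1 < 2, it trains one binary model per cut-point ("is the level greater than k?") and combines their outputs. The sketch below illustrates that scheme with a deliberately trivial one-feature threshold rule standing in for the binary models (all names here are my own, not the paper's):

```python
def train_threshold(xs, ys):
    """A toy binary 'model': estimate P(y=1) with a single threshold."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    cut = (max(neg) + min(pos)) / 2 if pos and neg else 0.0
    return lambda x: 1.0 if x >= cut else 0.0

def ordinal_classify(xs, ys, levels=(0, 1, 2)):
    # One binary model per cut-point: "is the level greater than k?"
    models = [train_threshold(xs, [1 if y > k else 0 for y in ys])
              for k in levels[:-1]]
    def predict(x):
        probs = [m(x) for m in models]       # P(y > 0), P(y > 1), ...
        scores = [1 - probs[0]]              # P(y = 0)
        for i in range(1, len(probs)):
            scores.append(probs[i - 1] - probs[i])
        scores.append(probs[-1])             # P(y = max level)
        return scores.index(max(scores))
    return predict

# Feature: average words per sentence; 0=rudimentary, 1=basic, 2=advanced
xs = [8, 9, 15, 16, 25, 26]
ys = [0, 0, 1, 1, 2, 2]
predict = ordinal_classify(xs, ys)
print(predict(10), predict(18), predict(30))  # → 0 1 2
```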

Manually simplified corpora were used as training data for the algorithms. Each original text was simplified to each of the difficulty levels, making it possible to check whether the system predicts "advanced" for the original texts, "basic" for the lightly simplified texts and "rudimentary" for the heavily simplified texts.

To determine the importance of each feature, the absolute Pearson correlation between each feature's values and the texts' difficulty levels was computed. The best-scoring features, copied from the paper (1 = highly correlated, 0 = uncorrelated):
  • Words per sentence: 0.693
  • Incidence of apposition: 0.688
  • Incidence of clauses: 0.614
  • Flesch index: 0.580
  • Words before main verb: 0.516
  • Sentences per paragraph: 0.509
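
The feature-selection criterion above is just the absolute Pearson correlation between a feature column and the (numerically coded) difficulty levels. A minimal sketch of that computation, with made-up toy values rather than the paper's data:

```python
import math

def abs_pearson(xs, ys):
    """Absolute Pearson correlation between a feature and the level labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy))

# Toy example: words-per-sentence values vs. levels 0/1/2.
# A feature that grows with difficulty correlates strongly (close to 1).
feature = [8.0, 9.5, 14.0, 16.0, 24.0, 27.0]
levels = [0, 0, 1, 1, 2, 2]
print(round(abs_pearson(feature, levels), 3))
```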

To determine which machine learning algorithm best classifies the texts into the correct difficulty levels, the Pearson correlation between the predicted and true levels was computed. The experiments were also run with each feature sub-group alone, to see how important each group is. Here are the results, summarized from the paper:

Standard classification (Pearson correlation per feature group):
  • All: 0.84
  • Language model: 0.25
  • Basic: 0.76
  • Syntactic: 0.82
  • Coh-Metrix-PORT: 0.79
  • Flesch: 0.52

Ordinal classification (Pearson correlation per feature group):
  • All: 0.83
  • Language model: 0.49
  • Basic: 0.73
  • Syntactic: 0.81
  • Coh-Metrix-PORT: 0.80
  • Flesch: 0.56

Regression (Pearson correlation per feature group):
  • All: 0.8502
  • Language model: 0.6245
  • Basic: 0.7266
  • Syntactic: 0.8063
  • Coh-Metrix-PORT: 0.8051
  • Flesch: 0.5772

It is clear that using all the features together works better than any sub-group alone. Among the groups, the syntactic features are the most useful, followed by the Coh-Metrix-PORT features, while the language model group is the weakest.

The standard classification algorithm was chosen as the best because, although it is the simplest of the three, its results are comparable to those of the other algorithms.
