Automatic Text Difficulty Classifier - Assisting the Selection Of Adequate Reading Materials For European Portuguese Teaching

Pedro Curto, Nuno Mamede, Jorge Baptista

2015

Abstract

This paper describes a system that assists the selection of adequate reading materials to support the teaching of European Portuguese, especially as a second language, and highlights the key challenges in selecting linguistic features for text difficulty (readability) classification. The system uses existing Natural Language Processing (NLP) tools to extract linguistic features from texts, which are then used by an automatic readability classifier. Currently, 52 features are extracted, covering parts of speech (POS), syllables, words, chunks and phrases, averages and frequencies, and some additional features. A classifier was built from these features and a corpus previously annotated by readability level according to an official five-level language classification standard for Portuguese as a Second Language. In the five-level scenario (A1 to C1), the best-performing learning algorithm (LogitBoost) achieved an accuracy of 75.11% with a root mean square error (RMSE) of 0.269. In the three-level scenario (A, B and C), the best-performing learning algorithm (C4.5 grafted) achieved an accuracy of 81.44% with an RMSE of 0.346.
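The relation between the two evaluation scenarios can be illustrated with a minimal sketch. It assumes that the three-level labels are obtained by grouping the five levels by their leading letter (A1 and A2 into A, B1 and B2 into B, C1 into C); the abstract names both label sets but does not spell out this mapping. The function names and the toy labels below are hypothetical and are not taken from the paper's corpus or results.

```python
from typing import Sequence

# The five readability levels named in the abstract.
FIVE_LEVELS = ("A1", "A2", "B1", "B2", "C1")


def collapse_level(level: str) -> str:
    """Map a five-level label to a three-level label (assumed: by leading letter)."""
    if level not in FIVE_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    return level[0]


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of texts whose predicted level equals the annotated level."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels for illustration only (not data from the paper).
gold = ["A1", "A2", "B1", "B2", "C1"]
pred = ["A2", "A2", "B2", "B2", "C1"]

five_level_acc = accuracy(gold, pred)
three_level_acc = accuracy(
    [collapse_level(g) for g in gold],
    [collapse_level(p) for p in pred],
)
```

In this toy case, confusions between adjacent sublevels (A1 vs. A2, B1 vs. B2) count as errors in the five-level scenario but not in the three-level one. This is one reason a coarser scheme can yield higher accuracy, although the paper's two scenarios also differ in the learning algorithm used.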

Paper Citation


in Harvard Style

Curto P., Mamede N. and Baptista J. (2015). Automatic Text Difficulty Classifier - Assisting the Selection Of Adequate Reading Materials For European Portuguese Teaching. In Proceedings of the 7th International Conference on Computer Supported Education - Volume 1: CSEDU, ISBN 978-989-758-107-6, pages 36-44. DOI: 10.5220/0005428300360044


in Bibtex Style

@conference{csedu15,
author={Pedro Curto and Nuno Mamede and Jorge Baptista},
title={Automatic Text Difficulty Classifier - Assisting the Selection Of Adequate Reading Materials For European Portuguese Teaching},
booktitle={Proceedings of the 7th International Conference on Computer Supported Education - Volume 1: CSEDU},
year={2015},
pages={36-44},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005428300360044},
isbn={978-989-758-107-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Conference on Computer Supported Education - Volume 1: CSEDU
TI - Automatic Text Difficulty Classifier - Assisting the Selection Of Adequate Reading Materials For European Portuguese Teaching
SN - 978-989-758-107-6
AU - Curto P.
AU - Mamede N.
AU - Baptista J.
PY - 2015
SP - 36
EP - 44
DO - 10.5220/0005428300360044