On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?

Vojtěch Kovář, Miloš Jakubíček, Aleš Horák

Abstract

The paper discusses problems with state-of-the-art evaluation methods used in natural language processing (NLP). Usually, some form of gold standard data is used to evaluate NLP tasks, ranging from morphological annotation to semantic analysis. We discuss the problems with, and the validity of, this type of evaluation for various tasks, and illustrate them with examples. We then propose using application-driven evaluation wherever possible. Although it is more expensive, more complicated, and less precise, it is the only way to find out whether a particular tool is useful at all.
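The gold-standard methodology the abstract refers to can be sketched in a few lines: system output is scored by exact agreement against human-annotated reference data. The following is a minimal illustration for part-of-speech tagging; the tag sequences and the `tag_accuracy` helper are hypothetical and not taken from the paper.

```python
def tag_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold annotation."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align token-for-token")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Hypothetical gold annotations and system output for one sentence.
gold      = ["DET", "NOUN", "VERB", "DET", "ADJ",  "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "NOUN", "NOUN"]

print(tag_accuracy(gold, predicted))  # 5 of 6 tags agree
```

The paper's point is that a high score on such a comparison shows agreement with one annotation scheme, not necessarily usefulness in a downstream application.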

References

  1. Bies, A., Ferguson, M., Katz, K., MacIntyre, R., Tredinnick, V., Kim, G., Marcinkiewicz, M. A., and Schasberger, B. (1995). Bracketing guidelines for treebank II style Penn treebank project.
  2. Galliers, J. and Spärck Jones, K. (1993). Evaluating natural language processing systems. Technical Report UCAM-CL-TR-291, University of Cambridge, Computer Laboratory.
  3. Hajič, J. (2006). Complex corpus annotation: The Prague dependency treebank. Insight into the Slovak and Czech Corpus Linguistics, page 54.
  4. Hajič, J., Panevová, J., Buráňová, E., Urešová, Z., Bémová, A., Štěpánek, J., Pajas, P., and Kárník, J. (2005). Annotations at analytical level: Instructions for annotators.
  5. Katz-Brown, J., Petrov, S., McDonald, R., Och, F., Talbot, D., Ichikawa, H., Seno, M., and Kazawa, H. (2011). Training a parser for machine translation reordering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 183-192. Association for Computational Linguistics.
  6. Kilgarriff, A., Jakubíček, M., Kovář, V., Rychlý, P., and Suchomel, V. (2014a). Finding terms in corpora for many languages with the Sketch Engine. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 53-56, Gothenburg, Sweden. The Association for Computational Linguistics.
  7. Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79-86. Citeseer.
  8. Kovář, V. (2014). Automatic Syntactic Analysis for Real-World Applications. PhD thesis, Masaryk University, Faculty of Informatics.
  9. Manning, C. D. (2011). Part-of-speech tagging from 97% to 100%: Is it time for some linguistics? In Computational Linguistics and Intelligent Text Processing - 12th International Conference, CICLing 2011, pages 171-189. Springer, Berlin.
  10. Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.
  11. Mikulová, M. and Štěpánek, J. (2009). Annotation procedure in building the Prague Czech-English dependency treebank. In Slovko 2009, NLP, Corpus Linguistics, Corpus Based Grammar Research, pages 241-248, Bratislava, Slovakia. Slovenská akadémia vied.
  12. Miyao, Y., Sagae, K., Sætre, R., Matsuzaki, T., and Tsujii, J. (2009). Evaluating contributions of natural language parsers to protein-protein interaction extraction. Bioinformatics, 25(3):394-400.
  13. Mollá, D. and Hutchinson, B. (2003). Intrinsic versus extrinsic evaluations of parsing systems. In Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: Are Evaluation Methods, Metrics and Resources Reusable?, Evalinitiatives '03, pages 43-50, Stroudsburg, PA, USA. Association for Computational Linguistics.
  14. Nothman, J., Ringland, N., Radford, W., Murphy, T., and Curran, J. R. (2012). Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151-175.
  15. Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.
  16. Radziszewski, A. and Grác, M. (2013). Using low-cost annotation to train a reliable Czech shallow parser. In Proceedings of Text, Speech and Dialogue, 16th International Conference, volume 8082 of Lecture Notes in Computer Science, pages 575-1156, Berlin. Springer.
  17. Sampson, G. (2000). A proposal for improving the measurement of parse accuracy. International Journal of Corpus Linguistics, 5(1):53-68.
  18. Sampson, G. and Babarczy, A. (2008). Definitional and human constraints on structural annotation of English. Natural Language Engineering, 14(4):471-494.


Paper Citation


in Harvard Style

Kovář V., Jakubíček M. and Horák A. (2016). On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?. In Proceedings of the 8th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-172-4, pages 540-545. DOI: 10.5220/0005824805400545


in Bibtex Style

@conference{icaart16,
author={Vojtěch Kovář and Miloš Jakubíček and Aleš Horák},
title={On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?},
booktitle={Proceedings of the 8th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2016},
pages={540-545},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005824805400545},
isbn={978-989-758-172-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?
SN - 978-989-758-172-4
AU - Kovář V.
AU - Jakubíček M.
AU - Horák A.
PY - 2016
SP - 540
EP - 545
DO - 10.5220/0005824805400545