AUTOMATIC IDENTIFICATION OF BIBLICAL QUOTATIONS IN HEBREW-ARAMAIC DOCUMENTS

Yaakov Hacohen-Kerner, Nadav Schweitzer, Yaakov Shoham

2010

Abstract

Quotations in a text document carry important information about its content, its context, the sources the author uses, and the importance and impact of those sources. Automatic identification of quotations in documents is therefore an important task. Quotations in rabbinic literature are difficult to identify and extract for various reasons. The aim of this research is to automatically identify Biblical quotations in rabbinic documents written in Hebrew-Aramaic. We deal with various kinds of quotations: partial, missing and incorrect. We formulate 19 features to identify these quotations. These features are divided into seven feature sets: matches, best matches, sums of weights, weighted averages, weighted medians, common words, and quotation indicators. Several of the features are novel. Experiments on various combinations of these features were performed using four common machine learning methods. A combination of 17 features using J48 (an improved version of C4.5) achieves an accuracy of 91.2%, an improvement of about 8% over a baseline result.
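The abstract names the feature sets but does not define them, so the following is only a minimal sketch of what a match-style feature and a common-words feature could look like for a partial quotation. The function names, the transliterated toy verses, and the exact definitions (longest run of consecutive shared words, count of shared distinct words) are assumptions made for illustration, not the authors' formulation.

```python
from difflib import SequenceMatcher

def longest_match(candidate, verse):
    """Length, in words, of the longest run of consecutive words shared by both."""
    matcher = SequenceMatcher(None, candidate, verse, autojunk=False)
    return matcher.find_longest_match(0, len(candidate), 0, len(verse)).size

def common_words(candidate, verse):
    """Number of distinct words that appear in both word lists."""
    return len(set(candidate) & set(verse))

def features(candidate, verses):
    """Hypothetical feature vector for one candidate span (illustrative only)."""
    # Pick the verse with the longest consecutive-word match (ties: first verse).
    index = max(range(len(verses)), key=lambda i: longest_match(candidate, verses[i]))
    best = verses[index]
    return {
        "verse_index": index,
        "best_match": longest_match(candidate, best),
        "common_words": common_words(candidate, best),
    }

# Toy data: transliterated verses and a candidate that skips some words.
verses = [
    "bereshit bara elohim et hashamayim ve'et ha'aretz".split(),
    "vayomer elohim yehi or vayehi or".split(),
]
candidate = "amar bereshit bara elohim et ha'aretz".split()

print(features(candidate, verses))
```

In a setup like the one the abstract describes, vectors of this kind would be computed for each candidate span and passed to a classifier such as a decision tree; the learning step is omitted here because the paper's training data and remaining features are not given in this section.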


Paper Citation


in Harvard Style

Hacohen-Kerner Y., Schweitzer N. and Shoham Y. (2010). AUTOMATIC IDENTIFICATION OF BIBLICAL QUOTATIONS IN HEBREW-ARAMAIC DOCUMENTS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 320-325. DOI: 10.5220/0003106703200325


in BibTeX Style

@conference{kdir10,
author={Yaakov Hacohen-Kerner and Nadav Schweitzer and Yaakov Shoham},
title={AUTOMATIC IDENTIFICATION OF BIBLICAL QUOTATIONS IN HEBREW-ARAMAIC DOCUMENTS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={320--325},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003106703200325},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - AUTOMATIC IDENTIFICATION OF BIBLICAL QUOTATIONS IN HEBREW-ARAMAIC DOCUMENTS
SN - 978-989-8425-28-7
AU - Hacohen-Kerner Y.
AU - Schweitzer N.
AU - Shoham Y.
PY - 2010
SP - 320
EP - 325
DO - 10.5220/0003106703200325