Gemma Bel-Enguix, Veronica Dahl, M. Dolores Jimenez-lopez



We present, discuss and exemplify a fully implemented model of text mining that can be applied to spoken languages as well as to molecular biology languages. It is based on the model presented in (Zahariev et al., 2009), which is oriented to discovering DNA barcodes for sequences. The novelty of our methodology is the use of Constraint Based Reasoning to detect string repetitions through unification, by introducing a new general rule for matching. We claim that the same method can be successfully applied to mining natural language texts.


  1. Banko, M. and Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. In ACL '01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 26-33. Morristown, NJ, Association for Computational Linguistics.
  2. Basu, S., Burma, D., and Chaudhuri, P. (2003). Words in DNA sequences: Some case studies based on their frequency statistics. Journal of Mathematical Biology.
  3. Becket, B. (1988). Introduction to Cryptology. Blackwell.
  4. Dahl, V. and Voll, K. (2004). Concept formation rules: an executable cognitive model of knowledge construction. In Proceedings of the First International Workshop on Natural Language Understanding and Cognitive Sciences. INSTICC Press.
  5. Forsyth, R. (1999). New Directions in Text Categorization, pages 151-185. Springer, Berlin.
  6. Fruhwirth, T. (1993). User-defined constraint handling. In ICLP 93, Budapest. MIT Press.
  7. Fruhwirth, T. (1998). Theory and practice of constraint handling rules. Journal of Logic Programming, Special Issue on Constraint Logic Programming, 37(1-3):95-138.
  8. Ginter, F., Boberg, J., Jarvinen, J., and Salakoski, T. (2004). New techniques for disambiguation in natural language and their application to biological text. Journal of Machine Learning Research, 5:605-621.
  9. Hakkani-Tur, D. and Tur, G. (2007). Statistical sentence extraction for information distillation. In Acoustics, Speech and Signal Processing. ICASSP 2007, IEEE International Conference, volume 4.
  10. Mani, I. and Maybury, M. (1999). Advances in Automatic Text Summarization. MIT Press, Cambridge.
  11. Manning, C., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
  12. Manning, C. and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
  13. Menezes, A., Oorschot, P., and Vanstone, S. (1996). Handbook of Applied Cryptography. CRC Press.
  14. Mikheev, A. (2003). Text Segmentation. Oxford, Oxford University Publications.
  15. Parida, L. (2007). Pattern Discovery in Bioinformatics: Theory and Algorithms. Chapman & Hall/CRC.
  16. Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4):471-495.
  17. Zahariev, M., Dahl, V., Chen, W., and Levesque, A. (2009). Efficient algorithms for the discovery of DNA oligonucleotide barcodes for DNA sequences and groups of sequences.

Paper Citation

in Harvard Style

Bel-Enguix G., Dahl V. and Dolores Jimenez-lopez M. (2009). DNA AND NATURAL LANGUAGES - Text Mining. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009) ISBN 978-989-674-011-5, pages 140-145. DOI: 10.5220/0002292201400145

in Bibtex Style

author={Gemma Bel-Enguix and Veronica Dahl and M. Dolores Jimenez-lopez},
title={DNA AND NATURAL LANGUAGES - Text Mining},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},

in EndNote Style

JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
SN - 978-989-674-011-5
AU - Bel-Enguix G.
AU - Dahl V.
AU - Dolores Jimenez-lopez M.
PY - 2009
SP - 140
EP - 145
DO - 10.5220/0002292201400145