Learning Text Extraction Rules, without Ignoring Stop Words

João Cordeiro, Pavel Brazdil

2004

Abstract

Information Extraction (IE) from text and web documents has become an important application area of AI. As the number of web sites and documents has grown dramatically, users need easy, fast and flexible ways of generating systems that can carry out specific IE tasks. This can be achieved with the help of Machine Learning (ML) techniques. We have developed a system that exploits this strategy. After training, the system is capable of identifying certain relevant elements in the text and extracting the corresponding information. As input, the system takes a collection of text documents in a given domain that have been previously annotated by a user. These annotated documents are used to generate extraction rules. We describe a set of experiments in the domain of announcements (in Portuguese) concerning house and flat sales. We show that quite good results overall can be achieved using this methodology. In previous work, some authors argue that stop words should be eliminated before training. We re-examine this assumption and present evidence that stop words can be quite useful in some sub-tasks.
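To make the stop-word point concrete, the following is a minimal Python sketch of how removing stop words before training can change the context features seen around a target element. It is not the system described in the paper: the function `context_features`, the tiny `STOP_WORDS` set, the window size and the sample announcement are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical, deliberately tiny Portuguese stop-word list (illustration only).
STOP_WORDS: Set[str] = {"por", "em", "de", "a", "o"}

# Hypothetical house-sale announcement; the price is the token at index 4.
TOKENS: List[str] = "Vende-se apartamento T3 por 120000 euros em Lisboa".split()


def context_features(tokens: List[str], index: int, window: int = 2,
                     stop_words: Optional[Set[str]] = None) -> List[str]:
    """Return position-tagged context words around tokens[index].

    If stop_words is given, those words are dropped from the sequence
    before the window is taken, mimicking stop-word elimination as a
    preprocessing step. The target token itself is always kept.
    """
    kept: List[Tuple[int, str]] = [
        (i, tok) for i, tok in enumerate(tokens)
        if i == index or not stop_words or tok.lower() not in stop_words
    ]
    # Position of the target inside the (possibly filtered) sequence.
    pos = next(k for k, (i, _) in enumerate(kept) if i == index)
    features: List[str] = []
    for offset in range(-window, window + 1):
        k = pos + offset
        if offset != 0 and 0 <= k < len(kept):
            features.append(f"{offset:+d}:{kept[k][1].lower()}")
    return features


with_stop_words = context_features(TOKENS, 4)
without_stop_words = context_features(TOKENS, 4, stop_words=STOP_WORDS)
print(with_stop_words)
print(without_stop_words)
```

In this toy example the preposition "por" sits directly before the price and "em" directly before the location, so a rule learner that keeps stop words can use them as cues, whereas filtering them out replaces those cues with less specific neighbouring words.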



Paper Citation


in Harvard Style

Cordeiro J. and Brazdil P. (2004). Learning Text Extraction Rules, without Ignoring Stop Words. In Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2004) ISBN 972-8865-01-5, pages 128-138. DOI: 10.5220/0002681601280138


in Bibtex Style

@conference{pris04,
author={João Cordeiro and Pavel Brazdil},
title={Learning Text Extraction Rules, without Ignoring Stop Words},
booktitle={Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2004)},
year={2004},
pages={128-138},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002681601280138},
isbn={972-8865-01-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2004)
TI - Learning Text Extraction Rules, without Ignoring Stop Words
SN - 972-8865-01-5
AU - Cordeiro J.
AU - Brazdil P.
PY - 2004
SP - 128
EP - 138
DO - 10.5220/0002681601280138