Hieu Quang Le, Stefan Conrad



This paper studies the problem of classifying structured data sources on the Web. While prior work uses all features extracted from search interfaces, we further refine the feature set. In our research, each search interface is treated simply as a bag of words. We choose a subset of words suited to classifying web sources by applying our feature selection methods, which use new metrics and a novel, simple ranking scheme. Using an aggressive feature selection approach together with a Gaussian process classifier, we obtained high classification performance in an evaluation over real web data.
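As a rough illustration of the bag-of-words and feature selection steps described above, the following minimal sketch scores each word by its association with a target domain and keeps only the top-ranked words. It rests on several assumptions: the chi-square statistic is a standard metric from the text classification literature used here as a stand-in, not one of the paper's new metrics or its ranking scheme; the function names and the toy search-interface texts are hypothetical; and the Gaussian process classifier is omitted.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a search interface's text and keep alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def chi_square_scores(docs, labels, target):
    """Score every word by its chi-square association with the target class.

    Chi-square is a stand-in metric (assumption), not the paper's own.
    """
    n = len(docs)
    doc_sets = [set(tokenize(d)) for d in docs]  # bag-of-words, presence only
    n_pos = sum(1 for y in labels if y == target)
    df_all = Counter(w for s in doc_sets for w in s)
    df_pos = Counter(
        w for s, y in zip(doc_sets, labels) if y == target for w in s
    )
    scores = {}
    for w, df in df_all.items():
        a = df_pos[w]        # target-class interfaces containing w
        b = df - a           # other interfaces containing w
        c = n_pos - a        # target-class interfaces without w
        d = n - n_pos - b    # other interfaces without w
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[w] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring words; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Hypothetical toy data: text extracted from four search interfaces.
docs = [
    "title author isbn publisher",
    "author title keyword",
    "departure arrival airline passengers",
    "airline departure date",
]
labels = ["books", "books", "airfares", "airfares"]

scores = chi_square_scores(docs, labels, target="books")
selected = select_top_k(scores, k=4)
print(selected)
```

A small `k` relative to the vocabulary size corresponds to the "aggressive" setting: most words are discarded and only the most discriminative ones are passed on to the classifier.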



Paper Citation

in Harvard Style

Quang Le H. and Conrad S. (2009). CLASSIFYING STRUCTURED WEB SOURCES USING AGGRESSIVE FEATURE SELECTION. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 613-620. DOI: 10.5220/0001824706130620
