AUTOMATIC IDENTIFICATION OF SPECIFIC WEB DOCUMENTS BY USING CENTROID TECHNIQUE

Udomsit Sukakanya, Kriengkrai Porkaew

Abstract

To reduce the time needed to find specific information among the large volume of information on the Web, this paper proposes an implementation of automatic identification of specific Web documents using a centroid technique. The initial training set in this experiment consists of 4113 Thai e-Commerce Web documents. After the training process, the system obtains a centroid e-Commerce vector. To evaluate the system, six test sets were considered. Each test set contains 100 Web pages, including both known e-Commerce and non-e-Commerce Web pages. The average system performance is about 90%.
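
The centroid idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the term weighting (length-normalized term frequency), the cosine similarity measure, the threshold value of 0.3, and all function names are assumptions made for illustration, and the input is assumed to be already segmented into word tokens.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Build a unit-length term-frequency vector (assumed weighting scheme)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def centroid(vectors):
    """Average the training vectors term by term to get one centroid vector."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_ecommerce(tokens, centroid_vec, threshold=0.3):
    """Label a page as e-Commerce if it is close enough to the centroid.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return cosine(tf_vector(tokens), centroid_vec) >= threshold


# Toy training set standing in for the e-Commerce training documents.
training_docs = [
    ["buy", "cart", "price"],
    ["price", "order", "cart"],
]
ecommerce_centroid = centroid([tf_vector(doc) for doc in training_docs])

shop_page = ["cart", "price", "shipping"]
news_page = ["weather", "news", "sport"]
print(is_ecommerce(shop_page, ecommerce_centroid))
print(is_ecommerce(news_page, ecommerce_centroid))
```

In this sketch a single centroid represents the whole e-Commerce class, so classifying a new page costs one similarity computation regardless of the training set size.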



Paper Citation


in Harvard Style

Sukakanya U. and Porkaew K. (2005). AUTOMATIC IDENTIFICATION OF SPECIFIC WEB DOCUMENTS BY USING CENTROID TECHNIQUE. In Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 972-8865-20-1, pages 333-338. DOI: 10.5220/0001234903330338


in Bibtex Style

@conference{webist05,
author={Udomsit Sukakanya and Kriengkrai Porkaew},
title={AUTOMATIC IDENTIFICATION OF SPECIFIC WEB DOCUMENTS BY USING CENTROID TECHNIQUE},
booktitle={Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2005},
pages={333-338},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001234903330338},
isbn={972-8865-20-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - AUTOMATIC IDENTIFICATION OF SPECIFIC WEB DOCUMENTS BY USING CENTROID TECHNIQUE
SN - 972-8865-20-1
AU - Sukakanya U.
AU - Porkaew K.
PY - 2005
SP - 333
EP - 338
DO - 10.5220/0001234903330338