INDEXING BANGLA NEWSPAPER ARTICLES USING FUZZY AND CRISP CLUSTERING ALGORITHMS

A. K. M. Zahiduzzaman, Mohammad Nahyan Quasem, Faiyaz Ahmed, Rashedur M. Rahman

Abstract

This paper presents two document clustering techniques for grouping Bangla newspaper articles. The first is based on the traditional (crisp) c-means algorithm, and the second on its fuzzy counterpart, the fuzzy c-means algorithm. Both techniques measure the frequency of keywords in a particular type of article to calculate the significance of those keywords; the articles are then clustered based on that significance. We believe the findings from this research will help to index Bangla newspaper articles and thereby make information retrieval faster. One challenge is finding the salient features among the hundreds of features found in the documents; in addition, both clustering algorithms work well on lower-dimensional data. To address this, we use three dimensionality reduction techniques: Principal Component Analysis (PCA), Factor Analysis (FA) and Linear Discriminant Analysis (LDA). We present and analyze the performance of the traditional and fuzzy c-means algorithms with each of these dimensionality reduction techniques.
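
As an illustration of the idea summarized in the abstract, the following is a minimal sketch of fuzzy c-means applied to keyword-frequency vectors. It is not the authors' implementation. The keyword list, the toy articles (English stand-ins for Bangla text), the use of relative term frequency as the "significance" measure, the fuzzifier m = 2 and the seeding of centers with the first c points are all assumptions made for the example. A crisp assignment is obtained at the end by taking the cluster with the highest membership.

```python
import math

def term_frequencies(tokens, keywords):
    """Relative frequency of each keyword in one tokenised article."""
    total = len(tokens) or 1
    return [tokens.count(k) / total for k in keywords]

def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6):
    """Plain fuzzy c-means; returns (centers, memberships)."""
    # Assumption: seed the centers with the first c points (deterministic, not robust).
    centers = [list(p) for p in points[:c]]
    memberships = []
    for _ in range(iters):
        # Membership update: u_ij is proportional to d_ij^(-2/(m-1)), rows sum to 1.
        memberships = []
        for p in points:
            dists = [math.dist(p, v) for v in centers]
            if min(dists) == 0.0:
                # Point coincides with a center: give it full membership there.
                hit = dists.index(0.0)
                row = [1.0 if j == hit else 0.0 for j in range(c)]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                s = sum(inv)
                row = [w / s for w in inv]
            memberships.append(row)
        # Center update: mean of the points weighted by u_ij^m.
        new_centers = []
        for j in range(c):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            new_centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(len(points[0]))
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships

# Hypothetical keywords and toy articles, used only for illustration.
keywords = ["goal", "match", "vote", "minister"]
articles = [
    "match goal goal team match".split(),         # sports-like
    "minister vote vote party".split(),           # politics-like
    "goal match team goal".split(),               # sports-like
    "vote minister minister vote party".split(),  # politics-like
]

vectors = [term_frequencies(a, keywords) for a in articles]
centers, memberships = fuzzy_c_means(vectors, c=2)

# Crisp labels: the cluster with the highest membership for each article.
labels = [row.index(max(row)) for row in memberships]
print(labels)
```

In a fuller pipeline, a dimensionality reduction step such as PCA would be applied to the frequency vectors before clustering; it is omitted here to keep the sketch dependency-free.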



Paper Citation


in Harvard Style

Zahiduzzaman A. K. M., Quasem M. N., Ahmed F. and Rahman R. M. (2011). INDEXING BANGLA NEWSPAPER ARTICLES USING FUZZY AND CRISP CLUSTERING ALGORITHMS. In Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8425-53-9, pages 361-364. DOI: 10.5220/0003492603610364


in Bibtex Style

@conference{iceis11,
author={A. K. M. Zahiduzzaman and Mohammad Nahyan Quasem and Faiyaz Ahmed and Rashedur M. Rahman},
title={INDEXING BANGLA NEWSPAPER ARTICLES USING FUZZY AND CRISP CLUSTERING ALGORITHMS},
booktitle={Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2011},
pages={361-364},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003492603610364},
isbn={978-989-8425-53-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - INDEXING BANGLA NEWSPAPER ARTICLES USING FUZZY AND CRISP CLUSTERING ALGORITHMS
SN - 978-989-8425-53-9
AU - Zahiduzzaman, A. K. M.
AU - Quasem, Mohammad Nahyan
AU - Ahmed, Faiyaz
AU - Rahman, Rashedur M.
PY - 2011
SP - 361
EP - 364
DO - 10.5220/0003492603610364