EXTRACTING MOST FREQUENT CROATIAN ROOT WORDS USING DIGRAM COMPARISON AND LATENT SEMANTIC ANALYSIS

Zvonimir Radoš; Franjo Jović; Josip Job

doi:10.5220/0002551903700373

2005

Abstract

A method for extracting root words from Croatian-language text is presented. The method is knowledge-free and can be applied to any language. It draws on both morphological and semantic aspects of the language: the algorithm builds morpho-semantic groups of words and extracts a common root for every group. For morphological grouping, digram comparison is used to group words by their morphological similarity. Latent semantic analysis is then applied to split the morphological groups into semantic subgroups. A root word is extracted from every morpho-semantic group. When the algorithm was applied to Croatian-language text, the hundred most frequent root words it produced included 60 grammatically correct roots and 25 FAP (for all practical purposes) correct roots.
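The morphological half of the pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which similarity measure, threshold, grouping strategy, or root-extraction rule the authors use, so everything here is an assumption chosen for illustration: the Dice coefficient over character digram sets, a threshold of 0.5, a greedy single-pass grouping, and the longest common prefix as the "root". The function names are hypothetical, and the latent semantic analysis step that splits groups by meaning is omitted.

```python
from os.path import commonprefix  # character-wise common prefix of a list of strings


def digrams(word):
    """Return the set of adjacent character pairs (digrams) in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over digram sets (assumed measure, not from the paper)."""
    da, db = digrams(a), digrams(b)
    if not da or not db:
        return 0.0
    return 2 * len(da & db) / (len(da) + len(db))


def group_words(words, threshold=0.5):
    """Greedy grouping (assumed strategy): a word joins the first group whose
    first member is similar enough; otherwise it starts a new group."""
    groups = []
    for word in words:
        for group in groups:
            if dice_similarity(word, group[0]) >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


def common_root(group):
    """Use the longest common prefix of the group as its root (assumed rule)."""
    return commonprefix(group)


# Inflected forms of two Croatian nouns: "knjiga" (book) and "grad" (city).
words = ["knjiga", "knjige", "knjigu", "grad", "grada", "gradu"]
groups = group_words(words)
roots = [common_root(g) for g in groups]
print(groups)  # [['knjiga', 'knjige', 'knjigu'], ['grad', 'grada', 'gradu']]
print(roots)   # ['knjig', 'grad']
```

A purely morphological grouping like this one can merge unrelated words that happen to share many digrams. That is the case the semantic splitting step in the abstract is meant to handle.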

Paper Citation

in Harvard Style

Radoš Z., Jović F. and Job J. (2005). EXTRACTING MOST FREQUENT CROATIAN ROOT WORDS USING DIGRAM COMPARISON AND LATENT SEMANTIC ANALYSIS. In Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 972-8865-19-8, pages 370-373. DOI: 10.5220/0002551903700373

in Bibtex Style

@conference{iceis05,
author={Zvonimir Radoš and Franjo Jović and Josip Job},
title={EXTRACTING MOST FREQUENT CROATIAN ROOT WORDS USING DIGRAM COMPARISON AND LATENT SEMANTIC ANALYSIS},
booktitle={Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2005},
pages={370-373},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002551903700373},
isbn={972-8865-19-8},
}

in EndNote Style

TY - CONF
JO - Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - EXTRACTING MOST FREQUENT CROATIAN ROOT WORDS USING DIGRAM COMPARISON AND LATENT SEMANTIC ANALYSIS
SN - 972-8865-19-8
AU - Radoš Z.
AU - Jović F.
AU - Job J.
PY - 2005
SP - 370
EP - 373
DO - 10.5220/0002551903700373