SEMANTIC CLASSIFICATION OF UNKNOWN WORDS BASED ON GRAPH-BASED SEMI-SUPERVISED CLUSTERING

Fumiyo Fukumoto, Yoshimi Suzuki

2011

Abstract

This paper presents a method for semantic classification of unknown verbs, including polysemous ones, into Levin-style semantic classes. We propose a semi-supervised clustering method based on a graph-based unsupervised clustering technique. The underlying algorithm detects the spin configuration that minimizes the energy of a spin glass. Comparing global and local minima of the energy function, called the Hamiltonian, allows the detection of nodes that belong to more than one cluster. We extended the algorithm to use a small amount of labeled data to aid unsupervised learning, and applied it to cluster verbs, including polysemous ones. The distributional similarity between verbs, which is used to calculate the Hamiltonian, is computed from probability distributions over verb frames. On 110 polysemous test verbs with 10% labeled data, the method obtained an F-score of 0.577.
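The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Everything beyond the abstract is an assumption: the similarity is taken to be one minus the Jensen-Shannon divergence between frame distributions, the energy is a generic Potts-style Hamiltonian with a hypothetical resolution parameter `gamma`, and a simple greedy descent stands in for whatever optimizer the paper uses. Labeled verbs are handled by keeping their spins fixed. The detection of verbs that belong to several clusters by comparing global and local minima is not covered here.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def potts_energy(sim, spins, gamma=1.0):
    """H = -sum_{i<j} (sim[i][j] - gamma * mean_sim) * delta(spins[i], spins[j])."""
    n = len(sim)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_sim = sum(sim[i][j] for i, j in pairs) / len(pairs)
    return -sum(sim[i][j] - gamma * mean_sim
                for i, j in pairs if spins[i] == spins[j])

def minimize(sim, n_states, labeled, gamma=1.0):
    """Greedy single-spin descent; spins of labeled nodes are never changed."""
    n = len(sim)
    spins = [labeled.get(i, i % n_states) for i in range(n)]

    def energy_with(i, s):
        # Energy of the configuration obtained by setting spin i to state s.
        return potts_energy(sim, spins[:i] + [s] + spins[i + 1:], gamma)

    improved = True
    while improved:
        improved = False
        for i in range(n):
            if i in labeled:
                continue
            best = min(range(n_states), key=lambda s: energy_with(i, s))
            if energy_with(i, best) < potts_energy(sim, spins, gamma) - 1e-12:
                spins[i] = best
                improved = True
    return spins

# Toy data (invented for illustration): four verbs, each a distribution over three frames.
frames = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
]
sim = [[1.0 - js_divergence(p, q) for q in frames] for p in frames]

# Verbs 0 and 2 play the role of the labeled data; verbs 1 and 3 are "unknown".
result = minimize(sim, n_states=2, labeled={0: 0, 2: 1})
print(result)
```

In this toy setting, each unlabeled verb ends up sharing a spin state with the labeled verb whose frame distribution is closest to its own. Subtracting `gamma` times the mean similarity makes grouping dissimilar verbs cost energy, so the minimum is not simply one large cluster.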



Paper Citation


in Harvard Style

Fukumoto F. and Suzuki Y. (2011). SEMANTIC CLASSIFICATION OF UNKNOWN WORDS BASED ON GRAPH-BASED SEMI-SUPERVISED CLUSTERING. In Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2011) ISBN 978-989-8425-80-5, pages 37-46. DOI: 10.5220/0003633100370046


in Bibtex Style

@conference{keod11,
author={Fumiyo Fukumoto and Yoshimi Suzuki},
title={SEMANTIC CLASSIFICATION OF UNKNOWN WORDS BASED ON GRAPH-BASED SEMI-SUPERVISED CLUSTERING},
booktitle={Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2011)},
year={2011},
pages={37-46},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003633100370046},
isbn={978-989-8425-80-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2011)
TI - SEMANTIC CLASSIFICATION OF UNKNOWN WORDS BASED ON GRAPH-BASED SEMI-SUPERVISED CLUSTERING
SN - 978-989-8425-80-5
AU - Fukumoto F.
AU - Suzuki Y.
PY - 2011
SP - 37
EP - 46
DO - 10.5220/0003633100370046