Improving Word Association Measures in Repetitive Corpora with Context Similarity Weighting

Aleksi Sahala, Krister Lindén

2020

Abstract

Although word association measures are useful for deciphering the semantic nuances of long-extinct languages, they are very sensitive to excessively formulaic narrative patterns and to full or partial duplication caused by different copies, edits, or fragments of historical texts. This problem is apparent in the corpora of ancient Mesopotamian languages such as Sumerian and Akkadian. When word associations are measured, vocabulary from repetitive passages tends to dominate the top ranks and conceal more interesting and descriptive uses of the language. We propose an algorithmic way to reduce the impact of repetitiveness by weighting the co-occurrence probabilities by a factor based on their contextual similarity. We demonstrate that the proposed approach not only effectively reduces the impact of distortion in repetitive corpora, but also slightly improves the performance of several PMI-based association measures in word relatedness tasks in non-repetitive corpora. Additionally, we propose a normalization for PMI², a commonly used association measure, and show that the normalized variant can outperform the base measure in both repetitive and non-repetitive corpora.
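
For readers unfamiliar with the base measures named in the abstract, the following is a minimal sketch of standard PMI and PMI² computed from raw co-occurrence counts. It is not the authors' implementation: the context similarity weighting and the proposed normalization are described in the paper itself and are not reproduced here. The function names, parameter names and example counts are hypothetical and chosen only for illustration.

```python
import math


def pmi(count_ab, count_a, count_b, total):
    """Pointwise mutual information: log2(p(a,b) / (p(a) * p(b)))."""
    if count_ab == 0:
        return float("-inf")  # the words never co-occur
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab / (p_a * p_b))


def pmi2(count_ab, count_a, count_b, total):
    """PMI^2: log2(p(a,b)^2 / (p(a) * p(b))).

    Squaring the joint probability reduces PMI's bias toward rare words.
    """
    if count_ab == 0:
        return float("-inf")  # the words never co-occur
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab ** 2 / (p_a * p_b))


# Hypothetical counts: two words that each occur 10 times in 1000 windows
# and always occur together.
perfect_pmi = pmi(10, 10, 10, 1000)
perfect_pmi2 = pmi2(10, 10, 10, 1000)
```

With these counts, PMI gives log2(100), roughly 6.64, while PMI² gives 0, the upper bound it reaches when two words only ever occur together. In a repetitive corpus, duplicated passages inflate the joint count `count_ab`, which is the distortion the abstract describes.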

Paper Citation


in Harvard Style

Sahala A. and Lindén K. (2020). Improving Word Association Measures in Repetitive Corpora with Context Similarity Weighting. In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - Volume 1: KDIR; ISBN 978-989-758-474-9, SciTePress, pages 48-58. DOI: 10.5220/0010106800480058


in Bibtex Style

@conference{kdir20,
author={Aleksi Sahala and Krister Lindén},
title={Improving Word Association Measures in Repetitive Corpora with Context Similarity Weighting},
booktitle={Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - Volume 1: KDIR},
year={2020},
pages={48-58},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010106800480058},
isbn={978-989-758-474-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - Volume 1: KDIR
TI - Improving Word Association Measures in Repetitive Corpora with Context Similarity Weighting
SN - 978-989-758-474-9
AU - Sahala A.
AU - Lindén K.
PY - 2020
SP - 48
EP - 58
DO - 10.5220/0010106800480058
PB - SciTePress