Authors: Aleksi Sahala and Krister Lindén

Affiliation: University of Helsinki, Finland

Keyword(s): Collocation Extraction, Distributional Semantics, Computational Assyriology.

Abstract: Although word association measures are useful for deciphering the semantic nuances of long-extinct languages, they are very sensitive to excessively formulaic narrative patterns and to full or partial duplication caused by different copies, edits, or fragments of historical texts. This problem is apparent in the corpora of ancient Mesopotamian languages such as Sumerian and Akkadian. When word associations are measured, vocabulary from repetitive passages tends to dominate the top ranks and conceal more interesting and descriptive uses of the language. We propose an algorithmic way to reduce the impact of repetitiveness by weighting the co-occurrence probabilities by a factor based on their contextual similarity. We demonstrate that the proposed approach not only effectively reduces the impact of distortion in repetitive corpora, but also slightly improves the performance of several PMI-based association measures in word relatedness tasks in non-repetitive corpora. Additionally, we propose a normalization for PMI², a commonly used association measure, and show that the normalized variant can outperform the base measure in both repetitive and non-repetitive corpora.
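To make the distortion described in the abstract concrete, the following minimal sketch computes PMI and PMI² from sentence-level co-occurrence counts and shows how a duplicated passage inflates a pair's score. It is an illustration only, not the authors' method: the toy corpus, the function names `count_pairs` and `association`, and the `skip_repeated` flag are all hypothetical, and dropping exact repeats of a context is a crude stand-in for the graded, similarity-based weighting the paper proposes.

```python
import math
from collections import Counter
from itertools import combinations


def count_pairs(sentences, skip_repeated=False):
    """Count word types and unordered word pairs per sentence.

    skip_repeated=True ignores exact repeats of an already seen sentence.
    This is a hypothetical, simplified stand-in for context similarity
    weighting, not the weighting defined in the paper.
    """
    word_counts, pair_counts, seen = Counter(), Counter(), set()
    n = 0  # number of sentences (contexts) actually counted
    for tokens in sentences:
        key = tuple(tokens)
        if skip_repeated and key in seen:
            continue
        seen.add(key)
        n += 1
        types = sorted(set(tokens))
        word_counts.update(types)
        pair_counts.update(combinations(types, 2))
    return word_counts, pair_counts, n


def association(a, b, word_counts, pair_counts, n, power=1):
    """PMI (power=1) or PMI^2 (power=2): log2(p(a,b)^power / (p(a) p(b)))."""
    p_ab = pair_counts[tuple(sorted((a, b)))] / n
    if p_ab == 0:
        return float("-inf")
    p_a, p_b = word_counts[a] / n, word_counts[b] / n
    return math.log2(p_ab ** power / (p_a * p_b))


# Toy corpus (invented for illustration): one formulaic line copied four times.
corpus = [["king", "built", "temple"]] * 4 + [
    ["king", "defeated", "enemy"],
    ["priest", "entered", "temple"],
    ["enemy", "fled", "city"],
]

raw = count_pairs(corpus)
deduped = count_pairs(corpus, skip_repeated=True)

pmi_raw = association("king", "temple", *raw)
pmi_deduped = association("king", "temple", *deduped)
pmi2_deduped = association("king", "temple", *deduped, power=2)
print(pmi_raw, pmi_deduped, pmi2_deduped)
```

In this toy corpus the repeated line alone pushes the PMI of "king" and "temple" above the value obtained when each distinct context is counted once. A graded weight based on context similarity, as the abstract describes, would also cover partial duplicates that exact matching misses.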

CC BY-NC-ND 4.0

Paper citation in several formats:
Sahala, A. and Lindén, K. (2020). Improving Word Association Measures in Repetitive Corpora with Context Similarity Weighting. In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - KDIR; ISBN 978-989-758-474-9; ISSN 2184-3228, SciTePress, pages 48-58. DOI: 10.5220/0010106800480058

@conference{kdir20,
author={Aleksi Sahala and Krister Lindén},
title={Improving Word Association Measures in Repetitive Corpora with Context Similarity Weighting},
booktitle={Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - KDIR},
year={2020},
pages={48-58},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010106800480058},
isbn={978-989-758-474-9},
issn={2184-3228},
}

TY - CONF
JO - Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - KDIR
TI - Improving Word Association Measures in Repetitive Corpora with Context Similarity Weighting
SN - 978-989-758-474-9
IS - 2184-3228
AU - Sahala, A.
AU - Lindén, K.
PY - 2020
SP - 48
EP - 58
DO - 10.5220/0010106800480058
PB - SciTePress