Language Agnostic Data Augmentation for Text Classification

M. Sharmila Devi, V. Lakshmi Chaitanya, G. Sharanya, B. Himaja, A. Bhavya Rohitha, A. Sujitha

2025

Abstract

The lack of labeled data is especially problematic for low-resource languages, making it difficult to develop high-performance text classification models. Collecting diverse, large-scale annotated datasets, which are crucial for training generalizable models, is often expensive. One solution to this limitation is data augmentation, which uses synthetic training data to strengthen model performance. Nevertheless, most existing data augmentation methods are heavily language-dependent: they focus on English and operate only at the word or sentence level, through word replacement or paraphrasing, respectively. These approaches may not generalize across languages, which limits their applicability to low-resource settings. The LIDA approach, which operates on sentence embeddings and can therefore be applied regardless of language, is compared and contrasted with traditional augmentation methods. As a result, datasets can be augmented efficiently across a variety of language models without language-specific preprocessing. We evaluate LIDA on three distinct languages with LSTM- and BERT-based models on four different dataset fractions. We also conduct an ablation study to evaluate the impact of the different components of our approach on model performance. The empirical results indicate that LIDA can be viewed as a language-agnostic, scalable, and robust augmentation strategy for low-resource text classification scenarios. The source code of LIDA has been released on GitHub to facilitate further research and applications.
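
To illustrate what augmenting at the level of sentence embeddings might look like, the following minimal Python sketch creates extra training examples by adding small Gaussian noise to precomputed embedding vectors. It is not the LIDA method itself, whose components the abstract does not describe; the function name `augment_embeddings`, its parameters, and the noise-based perturbation are all assumptions chosen for illustration. The point it shows is that, once text is represented as vectors, augmentation needs no language-specific preprocessing.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def augment_embeddings(
    embeddings: Sequence[Vector],
    labels: Sequence[int],
    copies: int = 2,
    noise_scale: float = 0.01,
    seed: int = 0,
) -> Tuple[List[Vector], List[int]]:
    """Return the original embeddings plus `copies` noisy variants of each.

    Hypothetical illustration only: Gaussian perturbation stands in for
    whatever transformation an embedding-level augmenter actually applies.
    """
    if len(embeddings) != len(labels):
        raise ValueError("embeddings and labels must have the same length")

    rng = random.Random(seed)

    # Keep the originals first so the augmented set is a superset of the input.
    out_vectors = [list(vec) for vec in embeddings]
    out_labels = list(labels)

    for vec, label in zip(embeddings, labels):
        for _ in range(copies):
            # Perturb every dimension independently; the label is unchanged.
            noisy = [x + rng.gauss(0.0, noise_scale) for x in vec]
            out_vectors.append(noisy)
            out_labels.append(label)

    return out_vectors, out_labels


if __name__ == "__main__":
    # Toy 3-dimensional "sentence embeddings" with binary labels.
    vectors = [[0.10, 0.20, 0.30], [0.90, 0.80, 0.70]]
    classes = [0, 1]
    aug_vectors, aug_labels = augment_embeddings(vectors, classes, copies=2)
    print(len(aug_vectors), aug_labels)
```

Because the function only sees vectors, the same call works for embeddings produced from any language; the augmented vectors and labels would then be passed to a downstream classifier in place of the original training set.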


Paper Citation


in Harvard Style

Devi M., Chaitanya V., Sharanya G., Himaja B., Rohitha A. and Sujitha A. (2025). Language Agnostic Data Augmentation for Text Classification. In Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies - ICRDICCT`25; ISBN 978-989-758-777-1, SciTePress, pages 50-58. DOI: 10.5220/0013876800004919


in Bibtex Style

@conference{icrdicct`2525,
author={M. Devi and V. Chaitanya and G. Sharanya and B. Himaja and A. Rohitha and A. Sujitha},
title={Language Agnostic Data Augmentation for Text Classification},
booktitle={Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies - ICRDICCT`25},
year={2025},
pages={50-58},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013876800004919},
isbn={978-989-758-777-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies - ICRDICCT`25
TI - Language Agnostic Data Augmentation for Text Classification
SN - 978-989-758-777-1
AU - Devi M.
AU - Chaitanya V.
AU - Sharanya G.
AU - Himaja B.
AU - Rohitha A.
AU - Sujitha A.
PY - 2025
SP - 50
EP - 58
DO - 10.5220/0013876800004919
PB - SciTePress