Authors:
Nina Hosseini-Kivanani¹; Homa Asadi² and Christoph Schommer¹
Affiliations:
¹ Department of Computer Science, University of Luxembourg, Esch-sur-Alzette, Luxembourg; ² Faculty of Foreign Languages, University of Isfahan, Isfahan, Iran
Keyword(s):
Speaker Verification, Mel-Frequency Cepstral Coefficients (MFCCs), Vowel Formants, Deep Learning, Persian Language.
Abstract:
This paper investigates the impact of speaking rate variation on speaker verification using a hybrid feature approach that combines Mel-Frequency Cepstral Coefficients (MFCCs), their dynamic derivatives (delta and delta-delta), and vowel formants. To enhance system robustness, we also applied data augmentation techniques such as time-stretching, pitch-shifting, and noise addition. The dataset comprises recordings of Persian speakers at three distinct speaking rates: slow, normal, and fast. Our results show that the combined model integrating MFCCs, delta-delta features, and formant frequencies significantly outperforms the individual feature sets, achieving an accuracy of 75% with augmentation compared to 70% without. This highlights the benefit of leveraging both spectral and temporal features for speaker verification under varying speaking conditions. Furthermore, data augmentation improved the generalization of all models, particularly for the combined feature set, where precision, recall, and F1-score showed substantial gains. These findings underscore the importance of feature fusion and augmentation in developing robust speaker verification systems. Our study contributes to advancing speaker identification methodologies, particularly for real-world applications where variability in speaking rate and environmental conditions presents a challenge.
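The abstract does not say how the dynamic derivatives or the fused feature vector were computed. As a minimal sketch only, the block below uses the common regression-based delta formula and frame-wise concatenation. The function names `delta` and `fuse`, the window `width=2`, and the edge-clamping choice are illustrative assumptions, not the authors' implementation.

```python
def delta(frames, width=2):
    """Regression-based delta of per-frame coefficient vectors.

    Assumption: d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum n^2),
    with frame indices clamped at the sequence edges.
    """
    if not frames:
        return []
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, n_frames - 1)][k]  # clamp at the end
                prv = frames[max(t - n, 0)][k]             # clamp at the start
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def fuse(mfcc, formants):
    """Concatenate MFCC, delta, delta-delta, and formant values per frame.

    Assumption: both inputs are already aligned frame by frame.
    """
    d1 = delta(mfcc)   # first-order dynamics
    d2 = delta(d1)     # second-order dynamics (delta of the delta)
    return [m + a + b + f for m, a, b, f in zip(mfcc, d1, d2, formants)]
```

With, say, 13 MFCCs and 3 formant frequencies per frame, each fused vector would have 13 × 3 + 3 = 42 values; these sizes are illustrative and are not taken from the paper.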