Authors:
Ahmed Azab (1); Ahmed Zaky (2, 3); Tetsuji Ogawa (4) and Walid Gomaa (5, 1)
Affiliations:
(1) Computer Science and Engineering, Egypt-Japan University of Science and Technology, Alexandria, Egypt
(2) Computer Science and Information Technology Programs (CSIT), Egypt-Japan University of Science and Technology, Egypt
(3) Shoubra Faculty of Engineering, Benha University, Benha, Egypt
(4) Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan
(5) Faculty of Engineering, Alexandria University, Alexandria, Egypt
Keyword(s):
Natural Language Processing, Text-To-Speech, Egyptian Arabic.
Abstract:
This paper presents the development and evaluation of Masry, an end-to-end system designed to synthesize Egyptian Arabic speech. The proposed approach builds on the Tacotron family of speech synthesis models, namely Tacotron1 and Tacotron2, each paired with a vocoder: Griffin-Lim for Tacotron1 and HiFi-GAN for Tacotron2. By synthesizing waveforms from mel-spectrograms, Masry offers a comprehensive solution for generating natural and expressive Egyptian Arabic speech. To train and validate the system, we constructed a dataset of a male speaker reading standard written texts and news content in Egyptian Arabic. The data were recorded at a sampling rate of 44,100 Hz, which helps preserve fidelity and richness in the synthesized speech output. The performance of the system was carefully evaluated using several metrics, with a particular focus on the Mean Opinion Score (MOS). The experimental results demonstrated the superiority of Tacotron2 over Tacotron1, with a MOS of 4.48 compared to 3.64. This indicates that Tacotron2 captures and reproduces the nuances of Egyptian Arabic speech more effectively. In addition, the evaluation included word and character error rates (WER and CER). These metrics give a quantitative measure of the accuracy of the synthesized speech.
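As an illustration of the WER and CER metrics mentioned above, the following is a minimal sketch and not the authors' evaluation code. It assumes the synthesized audio has already been transcribed by some speech recognizer (the abstract does not say which), and that the error rate is the Levenshtein edit distance between the input text and that transcript, divided by the reference length. The function names `edit_distance`, `wer`, and `cer` are hypothetical, and the choice to count spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length.

    Assumption: spaces are counted as characters.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the bat sat")` counts one substituted word out of three reference words. Because the functions operate on Python strings, they apply unchanged to Arabic script, although any text normalization (such as diacritic removal) would need to be done beforehand.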