Adaptive Decision-level Fusion for Fongbe Phoneme Classification using Fuzzy Logic and Deep Belief Networks

Frejus A. A. Laleye, Eugene C. Ezin, Cina Motamed

2015

Abstract

In this paper, we compare three approaches to decision fusion in a phoneme classification problem. We deal specifically with decision-level fusion of Naive Bayes and Learning Vector Quantization (LVQ) classifiers that were trained and tested on features from three speech analysis techniques: Mel-frequency Cepstral Coefficients (MFCC), Relative Spectral Transform - Perceptual Linear Prediction (Rasta-PLP) and Perceptual Linear Prediction (PLP). Optimal decision making is performed with a non-parametric method and a parametric method. We compare the performance of both methods with a third, proposed approach based on fuzzy logic. The work addresses phoneme classification for Fongbe, an African language, and all experiments were performed on a Fongbe dataset. After classification and decision fusion, the best overall fusion performance on test data is obtained with the proposed fuzzy logic approach, whose classification accuracies are 95.54% for consonants and 83.97% for vowels, despite the lower execution time of Deep Belief Networks.
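
To make the idea of decision-level fusion concrete, here is a minimal sketch of one generic parametric rule: a weighted sum of the class scores produced by two classifiers. It is an illustration only, not the method evaluated in the paper. The function names, the example scores and the weights are all hypothetical, and the sketch assumes each classifier returns one non-negative score per class.

```python
from typing import List, Sequence


def normalize(scores: Sequence[float]) -> List[float]:
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return [s / total for s in scores]


def weighted_sum_fusion(
    posteriors: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Fuse per-classifier class scores with a weighted sum.

    posteriors[k][c] is the score classifier k assigns to class c.
    weights[k] is the (hypothetical) trust placed in classifier k.
    """
    if len(posteriors) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    n_classes = len(posteriors[0])
    if any(len(p) != n_classes for p in posteriors):
        raise ValueError("all classifiers must score the same classes")
    w = normalize(weights)
    fused = [0.0] * n_classes
    for scores, w_k in zip(posteriors, w):
        for c, p in enumerate(normalize(scores)):
            fused[c] += w_k * p
    return fused


def decide(fused: Sequence[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda c: fused[c])


# Hypothetical scores over three phoneme classes (assumed values).
naive_bayes_scores = [0.6, 0.3, 0.1]
lvq_scores = [0.2, 0.7, 0.1]
fused = weighted_sum_fusion([naive_bayes_scores, lvq_scores], [0.4, 0.6])
predicted_class = decide(fused)
```

With these assumed inputs the two classifiers disagree on their own, and the fused decision follows the more heavily weighted one. A non-parametric rule, such as a majority vote over the individual decisions, would replace `weighted_sum_fusion` without changing `decide`.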



Paper Citation


in Harvard Style

Laleye F., Ezin E. and Motamed C. (2015). Adaptive Decision-level Fusion for Fongbe Phoneme Classification using Fuzzy Logic and Deep Belief Networks. In Proceedings of the 12th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO, ISBN 978-989-758-122-9, pages 15-24. DOI: 10.5220/0005536100150024


in Bibtex Style

@conference{icinco15,
author={Frejus A. A. Laleye and Eugene C. Ezin and Cina Motamed},
title={Adaptive Decision-level Fusion for Fongbe Phoneme Classification using Fuzzy Logic and Deep Belief Networks},
booktitle={Proceedings of the 12th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO},
year={2015},
pages={15-24},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005536100150024},
isbn={978-989-758-122-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO
TI - Adaptive Decision-level Fusion for Fongbe Phoneme Classification using Fuzzy Logic and Deep Belief Networks
SN - 978-989-758-122-9
AU - Laleye F.
AU - Ezin E.
AU - Motamed C.
PY - 2015
SP - 15
EP - 24
DO - 10.5220/0005536100150024