A Machine-learning based Technique to Analyze the Dynamic Information for Visual Perception of Consonants

Wai Chee Yau¹, Dinesh Kant Kumar¹ and Hans Weghorn²

¹ School of Electrical and Computer Engineering, RMIT University, GPO Box 2476V, Melbourne, Victoria 3001, Australia
² Information Technology, BA-University of Cooperative Education, Stuttgart, Germany
Abstract. This paper proposes a machine-learning based technique to investi-
gate the significance of the dynamic information for visual perception of conso-
nants. The visual speech information can be described using static (facial appear-
ance) or dynamic (movement) features. The aim of this research is to determine
the saliency of dynamic information represented by the lower facial movement for
visual speech perception. The experimental results indicate that the facial move-
ment is distinguishable for nine English consonants with a success rate of 85%
using the proposed approach. The results suggest that time-varying information
of visual speech contained in lower facial movements is useful for machine recog-
nition of consonants and may be an essential cue for human perception of visual
speech.
1 Introduction
The advancements in computer-based speech recognition models in the past decades
have provided new insights into the understanding of human speech perception. Human
speech perception is bimodal and consists of the acoustic and visual modalities [5].
bimodal nature of human speech perception is clearly proven by the McGurk effect,
which demonstrates that when a person is presented with conflicting visual and audio
speech information, the perception of the sound may be different from both modalities
[12]. For example, when a person hears the sound /ba/ but sees a lip movement of
/ga/, the sound /da/ is perceived.
The acoustic domain is characterized by speech sounds whereas the visual com-
ponent is described using visual speech signals. The visual speech data refers to the
movements of the speech articulators such as lips, facial muscles, tongue and teeth. The
visual information from a speaker's face has long been known to influence the perception and
understanding of spoken language by humans with normal hearing [19,20]. The ability
of people with hearing impairment to comprehend speech by looking at the face of the
speaker is yet another clear demonstration of the significance of the visual information
in speech perception.
The visual analysis of speech by computers is useful for improving the understanding
of human speechreading skills. The insights gained regarding the nature of
visual speech signals may be beneficial for understanding humans’ cognitive abilities in
speech perception which encompasses modalities with different temporal, spatial and
sensing characteristics. Such machine-based analysis might be able to suggest which
visual aspects of speech events are significant in classification of utterances [4]. The
results of machine-vision analysis of speech are also useful for applications such as
automatic speech recognition in noisy environments.
The visual speech information can be dichotomized into the static and the dynamic
components. Campbell [3] reports that static features are important for visual speech
perception, since observers are able to recognize phonemes from pictures of faces.
Some of the machine-based visual speech recognition approaches proposed in the lit-
erature [13, 18] are based on static visual speech features extracted from mouth im-
ages. Nevertheless, the dynamic information is demonstrated to be important in human
visual speech perception by Rosenblum et al. [17], through experiments using point-light
displays where dots are placed on the lips, cheeks, chin, teeth and tongue tip of the
speaker. This paper provides a different viewpoint on the time-varying aspect of visual
speech perception through machine analysis. It investigates the significance
of dynamic visual speech information, namely lower facial movements, in perceiving
consonants. This paper uses only the lower facial movement and not the head movement of
the speaker because the most prominent visual speech information lies within the lower
face region [15]. A video processing technique is adopted to analyze the mouth video
and extract the lower facial movement for machine classification of the consonants. The
lower facial movements comprise the movements of the jaw, lips and teeth. The goal
of this research is to use a computer-vision based technique to evaluate the significance
of dynamic information encoded in the visible facial movements for visual perception
of consonants.
2 An Overview of The Proposed Technique
Spoken language consists of successions of sounds produced by the movements of the
speech articulators such as tongue, teeth, lips, velum and glottis in altering the shape of
the vocal tract. Figure 1 shows the organs in the human speech production system [24]. The
smallest units of speech sounds are known as phonemes. Phonemes can be categorized
into vowels or consonants depending on the relative sonority of the sounds [7]. The
articulation of each phoneme is associated with particular movements of the speech
articulators. Nonetheless, the movements of certain speech organs such as velum and
glottis are not visible from the frontal view of the speaker [6]. The speech articulators’
movements that can be modelled using a vision-based system are mostly limited to lip,
jaw and teeth motions.
This paper focuses on the recognition of consonants because consonants
are easier to 'see' and harder to 'hear' than vowels [8]. The articulations of vowels
are produced with an open vocal tract whereas the productions of consonants involve
constrictions at certain part of the vocal tract by the speech articulators. Thus, the facial
movements involved in pronunciation of consonants are more discernible. The visual
Fig.1. Diagram of Human Speech Production System.
information is crucial in disambiguating the consonants, especially in conditions where
the speech sounds are weak or in noisy environments.
The visible, facial movements associated with the articulations of different speech
sounds may be identical for certain consonants such as /p/ and /b/. Thus, the mapping of
speech sounds to facial movements is many-to-one. Table 1 shows a mapping of speech
sounds to visual movements based on an international audiovisual object-based video
representation standard known as MPEG-4.
This paper proposes a machine vision model to analyze the facial movements. The
proposed model consists of three stages, which are shown in Figure 2.
2.1 Segmentation of the Lower Facial Movements from Video
This paper proposes to segment the lower facial movements from the video data using
a spatio-temporal template (STT) technique [2]. STTs are grayscale images that show
where and when facial movements occur in the video. The pixel locations indicate
where movements occur, and the intensity values of the pixels of the STT vary
linearly with the recency of the motion. STTs are generated using an accumulative image
difference approach.
Accumulative image difference is applied to the video of the speaker by subtracting
the intensity values between successive frames to generate the difference of frames
(DOF). The DOF of the t-th frame is defined as

\[ \mathrm{DOF}_t(x, y) = \left| I_t(x, y) - I_{t-1}(x, y) \right| \tag{1} \]

where I_t(x, y) represents the intensity value of the pixel at coordinate (x, y) in the
t-th frame and a is the fixed threshold for binarisation of the DOF. B_t(x, y) represents
the binarised version of the DOF and is given by

\[ B_t(x, y) = \begin{cases} 1 & \text{if } \mathrm{DOF}_t(x, y) \geq a, \\ 0 & \text{otherwise} \end{cases} \tag{2} \]

Fig.2. Block diagram of the proposed technique.

The intensity value of the STT at pixel location (x, y) of the t-th frame is defined by

\[ \mathrm{STT}_t(x, y) = \max_{t=1}^{N-1} \left[ B_t(x, y) \times t \right] \tag{3} \]
where N is the total number of frames used to capture the lower facial movements. In
Eq. (3), the binarised version of the DOF is multiplied with a linear ramp of time to
implicitly encode the temporal information of the motion into the STT. Computing
the STT values for all pixel coordinates (x, y) of the image sequence using Eq. (3)
produces a grayscale image (the STT) that contains the spatial and temporal information
of the facial movements [23]. Figure 3 illustrates the STTs of nine consonants used in
the experiments.
This paper proposes the use of STT because STT is able to remove static elements
from the sequence of images and preserve the short duration facial movements. STT
is also invariant to the skin color of the speakers due to the image subtraction process
involved in the generation of STT.
The speed of phonation of the speaker might vary for each pronunciation of a phone.
The variation in the speed of utterance results in variation of the overall duration, and
there may be variation in the micro phases of the utterances. The modelling of the details
of such variations is very challenging. This paper suggests a model to approximate
such variations by normalizing the overall duration of the utterance. This is achieved
by normalizing the intensity values of the STT to between 0 and 1, which minimizes the
differences between STTs produced from different video recordings of the same phone.
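As a rough illustration, the following Python sketch generates a normalized STT from a mouth video using OpenCV and NumPy, applying Eqs. (1)-(3) frame by frame; the video path and the threshold a = 25 are illustrative assumptions, not values from the paper.

```python
# Sketch: build a normalized spatio-temporal template (STT) from a mouth video.
# The threshold `a` and the video path are assumed for illustration.
import cv2
import numpy as np

def compute_stt(video_path: str, a: int = 25) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.int16))
    cap.release()

    stt = np.zeros(frames[0].shape, dtype=np.float32)
    for t in range(1, len(frames)):
        dof = np.abs(frames[t] - frames[t - 1])        # Eq. (1): frame difference
        b = (dof >= a).astype(np.float32)              # Eq. (2): binarisation
        stt = np.maximum(stt, b * t)                   # Eq. (3): time-weighted maximum
    return stt / stt.max() if stt.max() > 0 else stt   # normalize intensities to [0, 1]
```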
Fig.3. Spatio-temporal templates of nine consonants that represent the different patterns of facial
movements.
The proposed technique uses the discrete stationary wavelet transform (SWT) to
reduce the small variations between the facial movements of the same consonant. While
the classical discrete wavelet transform (DWT) could be used for this purpose, the DWT
is translation variant [11]: a small shift of the image in the space domain yields very
different wavelet coefficients. The translation sensitivity of the DWT is caused by the
aliasing that occurs due to the downsampling of the image along rows and columns
[16]. The SWT restores translation invariance by omitting the downsampling step of the
DWT, at the cost of a redundant representation.
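A minimal sketch of this step, assuming the PyWavelets library (pywt) is used for the level-1 Haar SWT and that only the approximation sub-band is kept, as described above:

```python
# Sketch: level-1 stationary wavelet transform of an STT image with PyWavelets.
# Image sides must be even for swt2 at level 1, so the STT is cropped if needed.
import numpy as np
import pywt

def swt_approximation(stt: np.ndarray) -> np.ndarray:
    """Return the level-1 Haar SWT approximation image of a normalized STT."""
    h, w = stt.shape
    stt = stt[: h - h % 2, : w - w % 2]          # even dimensions required by swt2
    coeffs = pywt.swt2(stt, wavelet="haar", level=1)
    approx, (horiz, vert, diag) = coeffs[0]      # keep the approximation, drop details
    return approx
```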
2.2 Feature Extraction
This paper adopts Zernike moments as the rotation invariant features to represent the
SWT approximation of the STT. Zernike moments have been demonstrated to outperform
other image moments such as geometric moments and Legendre moments in
terms of sensitivity to noise, information redundancy and capability for image repre-
sentation [21]. Zernike moments are computed by projecting the image function onto
the orthogonal Zernike polynomial defined within a unit circle. The main advantage of
Zernike moments is the simple rotational property of the features. Rotational changes
of the speaker's mouth in the image result in a phase shift of the Zernike moments
[22]. The absolute values of the Zernike moments are invariant to rotational changes [9,
23]. This paper proposes to use the absolute values of the Zernike moments as rotation
invariant features to represent the SWT approximation image of the STT.
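As a hedged sketch of this feature extraction step, the `mahotas` library (an assumption; any Zernike implementation would do) can compute the moment magnitudes directly. With degree 12 the routine returns 49 magnitudes, matching the feature length used later in the experiments.

```python
# Sketch: rotation-invariant Zernike moment magnitudes of the SWT approximation image.
import numpy as np
import mahotas

def zernike_features(approx: np.ndarray, degree: int = 12) -> np.ndarray:
    """Magnitudes of Zernike moments computed inside the largest inscribed circle."""
    radius = min(approx.shape) // 2              # unit circle fitted to the image
    return mahotas.features.zernike_moments(approx, radius, degree=degree)
```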
2.3 Supervised Classifier - Artificial Neural Network
A number of possible classifiers may be suitable for such a machine speech recognition
model. The selection of the appropriate classifier would require statistical analysis of the
data that would also identify the features that are irrelevant. The supervised artificial neural
network (ANN) approach lends itself to identifying the separability of data even when
the statistical properties and the type of separability (linear or nonlinear) are not known.
While it may be suboptimal, it is an easy tool to implement as a first step.
This paper presents the use of ANN to classify the features into one of the con-
sonants. ANN has been selected because it can solve complicated problems where the
description for the data is not easy to compute. The other advantage of the use of ANN
is its fault tolerance and high computation rate due to the massive parallelism of its
structure [10]. A feedforward multilayer perceptron (MLP) ANN classifier with back
propagation (BP) learning algorithm is used in the proposed approach. MLP ANN was
selected due to its ability to work with complex data compared with a single layer net-
work. Due to the multilayer construction, such a network can be used to approximate
any continuous functional mapping [1]. The BP learning algorithm iteratively adjusts the
network weights to reduce the error between the network outputs and the target classes,
allowing the hidden layers to extract discriminative features of the data from the training events.
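A minimal sketch of such a classifier, using scikit-learn's MLPClassifier as a stand-in for the network described here; the layer sizes and training settings are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: a two-hidden-layer MLP trained by gradient-based backpropagation.
from sklearn.neural_network import MLPClassifier

def build_classifier() -> MLPClassifier:
    return MLPClassifier(
        hidden_layer_sizes=(40, 20),   # two hidden layers (sizes assumed)
        activation="logistic",         # sigmoid units, typical of BP-trained MLPs
        solver="adam",                 # gradient-based weight updates
        max_iter=2000,
        random_state=0,
    )
```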
3 Methodology
Experiments were conducted to test the repeatability of facial movement features during
articulations of consonants. The experiments were approved by the Human Experiments
Ethics Committee. Nine consonants highlighted in bold font in Table 1 were used in the
experiments. Each consonant represents one pattern of facial movement. The speaker
pronounces each consonant in isolation.
Table 1. Visual model of English consonants based on the MPEG-4 standard.
Cluster Number Phonemes
1 /p/,/b/,/m/
2 /f/,/v/
3 /T/,/D/
4 /t/,/d/
5 /k/,/g/
6 /S/, /dZ/, /tS/
7 /s/,/z/
8 /n/,/l/
9 /r/
The video data used in the experiments was recorded from a speaker using a cam-
era that focused on the mouth region of the speaker. The camera was kept stationary
throughout the experiments. The window size and view angle of the camera, back-
ground and illumination were kept constant during the recording. The video data was
stored as true color (.AVI) files with a frame rate of 30 frames per second. A total of 180 video
clips were recorded and one STT was generated from each AVI file. Examples of the
STT are shown in Figure 3.
Level-1 SWT using the Haar wavelet was applied to the STTs and the approximation
images were used for analysis. 49 Zernike moments were used as features to represent
the SWT approximation image of each STT. For further data analysis, the k-means clustering
algorithm was applied to the moment features to partition the feature space into nine
exclusive clusters using squared Euclidean distance. Figure 4 shows the silhouette plot
of the nine clusters representing the nine consonants.
Fig.4. Silhouette plot of the nine clusters generated using K-means algorithm.
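As a sketch of this preliminary analysis, assuming scikit-learn and a `features` array of shape (n_samples, 49) holding the Zernike feature vectors:

```python
# Sketch: k-means partitioning of the Zernike features into nine clusters,
# followed by per-sample silhouette values (as in the silhouette plot of Fig. 4).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def cluster_analysis(features: np.ndarray, n_clusters: int = 9):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    assignments = kmeans.fit_predict(features)                 # cluster index per sample
    silhouettes = silhouette_samples(features, assignments)    # one silhouette value per sample
    return assignments, silhouettes
```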
The next step of the experiments was to classify the facial movement features using an
artificial neural network (ANN), which assigned each feature vector to one of the
consonants. A multilayer perceptron (MLP) ANN with
backpropagation (BP) learning algorithm was used in the experiments. The architecture
of the ANN consisted of two hidden layers. In the experiments, features of 90 STTs
were used to train the ANN. The remaining 90 STTs that were not used in training
were presented to the ANN to test the ability of the trained ANN to recognize the facial
movement patterns. The experiments were repeated 10 times with different sets of testing
and training data through random sub-sampling of the data. The mean and variance of
the recognition rates for the 10 repetitions of the experiment were computed.
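A minimal sketch of this evaluation protocol, assuming `features` and `labels` hold the 49-dimensional Zernike features and consonant classes; the MLP settings mirror the illustrative sketch of Section 2.3 rather than the exact network used in the experiments.

```python
# Sketch: repeated random sub-sampling (10 repetitions, half the STTs for
# training and half for testing), reporting mean and std of the recognition rate.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def evaluate(features: np.ndarray, labels: np.ndarray, repetitions: int = 10):
    scores = []
    for seed in range(repetitions):
        x_tr, x_te, y_tr, y_te = train_test_split(
            features, labels, test_size=0.5, stratify=labels, random_state=seed)
        clf = MLPClassifier(hidden_layer_sizes=(40, 20), max_iter=2000,
                            random_state=seed)
        clf.fit(x_tr, y_tr)
        scores.append(clf.score(x_te, y_te))     # fraction of test STTs recognized
    return float(np.mean(scores)), float(np.std(scores))
```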
4 Results and Discussions
The accuracies of the neural network in recognizing the facial movement features of the
nine consonants are tabulated in Table 2. The mean classification rate of the experiments
is 84.7% with a standard deviation of 2.8%.
The results demonstrate that the patterns of facial movements during articulation
of English consonants are highly consistent. A 100% success rate was achieved using the
visual system to identify the consonant /m/ due to the distinct bilabial movements while
pronouncing /m/. The results suggest that facial movements can be useful as dynamic
cues for machine recognition of utterances.
Table 2. Mean Classification Accuracies for nine English Consonants.
Viseme Recognition Rate
/m/ 100%
/v/ 87%
/T/ 65%
/t/ 74%
/g/ 85%
/tS/ 91%
/s/ 93%
/n/ 74%
/r/ 93%
Figure 4 shows that clusters 1, 2, 3, 6, 7 and 9 formed through k-means cluster
analysis contain low or negative silhouette values. The low or negative silhouette values
indicate that the facial movement features are not distinctly grouped in one cluster, or
are assigned to the wrong clusters. The poor clustering results suggest that the features
might not be linearly separable. Based on this preliminary data analysis using the clustering
algorithm, this paper proposes to use a nonlinear classifier, the multilayer perceptron
(MLP) artificial neural network (ANN), to classify the facial movement features. The
satisfactory classification accuracies of the ANN demonstrate the ability of the ANN to
adapt and learn the patterns of the facial movements and achieve non-linear separation
of features.
The classification errors can be attributed to the inability of vision-based techniques
to capture the occluded movements of speech articulators such as glottis, velum and
tongue. For example, the tongue movement in the mouth cavity is either partially or
completely not visible (occluded by the teeth) in the video data during the pronunciation
of alveolar and dental sounds such as /t/, /n/ and /T/. The STTs of /t/, /n/ and /T/ do not
contain the information of the occluded tongue movements. This is a possible reason
for the higher error rates of 26% and 35% for these three consonants compared to
the average error rate of 15% for all consonants. The results suggest that the facial
movements of /t/, /n/ and /T/ are less distinguishable compared with other consonants.
The human perceptual analysis of visual speech using point-light displays reported
in [17] clearly indicates that the dynamic component of visual speech may be the most
salient informational form (versus the static face information). Our experimental results
using computer-based analysis provide evidence supporting the significance of time-varying
information for the visual perception of consonants. Our results indicate that the
dynamic information in the lower face region of the speaker is useful in perceiving con-
sonants. Nevertheless, the authors would like to point out that the proposed technique
has only been tested using discrete consonants. Speech sounds are often perceived in
context and not in isolation by humans. The future direction of this research is to exam-
ine the feasibility of using dynamic visual speech information to identify speech sounds
that are embedded in words.
5 Conclusion
This paper analyzes the significance of dynamic information of the lower facial move-
ments in the visual perception of consonants using a machine-learning based technique.
The proposed visual technique has been used to validate the results of perceptual analy-
sis that shows that time-varying information is important in human perception of visual
speech [17]. The outcome of the analysis using the proposed machine-learning technique
indicates that dynamic speech information contained in the lower facial movements
is useful in disambiguating consonants, thereby supporting the findings of the
study reported in [17].
The experimental results indicate that different patterns of facial movements can be
used to differentiate nine consonants with a mean accuracy of 84.7%. These results demon-
strate that facial movements are reliable in representing consonants and can be useful
in machine speech recognition. The proposed machine analysis provides a better understanding
of the cognitive processes involved in human speech perception by validating
the saliency of the dynamic visual speech information.
For future work, the authors intend to evaluate the reliability of facial movements
in other commonly spoken languages such as German and Mandarin. Also, the authors
intend to test on a larger vocabulary set covering words and phrases. Potential applica-
tions for the proposed technique include automated systems such as interfaces for users
with speech impairment to control computers and the control of heavy machinery in noisy
factories.
References
1. Bishop, C. M.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
2. Bobick, A. F., Davis, J. W.: The Recognition of Human Movement Using Temporal Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23 (2001) 257–267
3. Campbell, R.: The lateralisation of lipread sounds: A first look. Brain and Cognition, Vol. 5 (1986) 1–21
4. Campbell, R., Dodd, B., Burnham, D.: Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-visual Speech (1998)
5. Chen, T.: Audiovisual Speech Processing. IEEE Signal Processing Magazine, Vol. 18 (2001) 9–21
6. Hazen, T. J.: Visual Model Structures and Synchrony Constraints for Audio-Visual Speech Recognition. IEEE Transactions on Audio, Speech and Language Processing, Vol. 14, No. 3 (2006) 1082–1089
7. Jones, D.: An Outline of English Phonetics. W. Heffer and Sons Ltd (1969) 23
8. Kaplan, H., Bally, S. J., Garretson, C.: Speechreading: A Way to Improve Understanding. Gallaudet University Press (1999) 14–16
9. Khotanzad, A., Hong, Y. H.: Invariant Image Recognition by Zernike Moments. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12 (1990) 489–497
10. Kulkarni, A. D.: Artificial Neural Network for Image Understanding. Van Nostrand Reinhold (1994)
11. Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press (1998)
12. McGurk, H., MacDonald, J.: Hearing Lips and Seeing Voices. Nature, Vol. 264 (1976) 746–748
13. Petajan, E. D.: Automatic Lip-reading to Enhance Speech Recognition. In GLOBECOM'84, IEEE Global Telecommunications Conference (1984)
14. Potamianos, G., Neti, C., Gravier, G., Senior, A. W.: Recent Advances in Automatic Recognition of Audio-Visual Speech. In Proceedings of the IEEE, Vol. 91 (2003)
15. Potamianos, G., Neti, C.: Improved ROI and Within Frame Discriminant Features for Lipreading. In Proceedings of the International Conference on Image Processing (2001) 250–253
16. Simoncelli, E. P., Freeman, W. T., Adelson, E. H., Heeger, D. J.: Shiftable Multiscale Transforms. IEEE Transactions on Information Theory, Vol. 38 (1992) 587–607
17. Rosenblum, L. D., Saldaña, H. M.: Time-varying information for visual speech perception. In Hearing by Eye: Part 2, The Psychology of Speechreading and Audiovisual Speech, R. Campbell, B. Dodd, and D. Burnham, Eds. Erlbaum: Hillsdale, NJ (1998) 61–81
18. Stork, D. G., Hennecke, M. E.: Speechreading: An Overview of Image Processing, Feature Extraction, Sensory Integration and Pattern Recognition Techniques. In the 2nd International Conference on Automatic Face and Gesture Recognition (FG '96) (1996)
19. Summerfield, A. Q.: Some preliminaries to a comprehensive account of audio-visual speech perception. Hearing by Eye: The Psychology of Lipreading (1987)
20. Sumby, W. H., Pollack, I.: Visual contributions to speech intelligibility in noise. Journal of the Acoustical Society of America, Vol. 26 (1954) 212–215
21. Teh, C. H., Chin, R. T.: On Image Analysis by the Methods of Moments. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10 (1988) 496–513
22. Teague, M. R.: Image Analysis via the General Theory of Moments. Journal of the Optical Society of America, Vol. 70 (1980) 920–930
23. Yau, W. C., Kumar, D. K., Arjunan, S. P.: Visual Speech Recognition Method Using Translation, Scale and Rotation Invariant Features. In IEEE International Conference on Advanced Video and Signal Based Surveillance, Sydney, Australia (2006)
24. http://www.kt.tu-cottbus.de/speech-analysis/tech.html