
 
5 RESULTS AND DISCUSSION 
The vocal tract visualisation tool has been designed to operate in an MS Windows-based PC environment. The multi-window display and other user-interface features of the complete system are shown in Figure 5. As can be seen, the system's screen is
divided into four windows for displaying the vocal 
tract graphics, the sound intensity, the pitch and the 
first three formants of the speech signal. The system 
can operate in two main modes: (a) near real-time 
mode, whereby the speech signal is picked up by a 
microphone connected to the PC sound card (as in the case shown in Figure 5), and (b) non-real-time mode, whereby the speech signal is either recorded by the system or read from a stored audio file, and
its features are then displayed. The system also allows speech/sound signals to be saved. For vowel articulation, the user can compare the shape of his/her vocal tract to a reference trace (shown with a dashed line in Figure 5) for the correct tongue position, derived from the measurement data reported in (Miller & Mathews, 1963). The deviation from the reference trace is given in this case as the mean squared error (MSE) computed over all the estimated mid-sagittal distances.
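To make the deviation measure concrete, the following is a minimal sketch of how such an MSE can be computed, assuming the estimated and reference mid-sagittal distances are available as equal-length arrays sampled at the same points along the tract (function and variable names are illustrative, not taken from the system):

import numpy as np

def midsagittal_mse(estimated, reference):
    # Mean squared error between estimated and reference mid-sagittal
    # distances; both are assumed to be sampled on the same grid of
    # points along the vocal tract.
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean((estimated - reference) ** 2))

A lower value indicates a closer match between the user's estimated tract shape and the reference trace.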
Figure 6 shows the vocal tract profiles for 10 
American English vowels, as estimated by the 
system (dashed lines represent the reference traces for tongue position). For comparison and evaluation purposes, the deviations, in terms of MSE values, from the reference tongue-position data adopted from (Harshman et al., 1977) are also indicated. In general, the obtained results correlate well with the reference data. They were also found to agree well with the X-ray data and the PARAFAC analysis. Referring to the MSE values shown in
Figure 6, the system performs particularly well for all the 'front vowels' (/IY/, /EY/, /IH/, /EH/ and /AE/), with the MSE increasing as the vowel height decreases. With the exception of /AA/ and /UH/, the results show less accurate agreement with the reference data for the 'back vowels'. As the classification of vowels into front and back relates to whether the tongue is elevated towards the front or the back of the mouth, we believe the higher accuracy for the front vowels is attributable to the additional formant-based adjustments of the lips, jawbone and front sections of the vocal tract used in our approach.
On the other hand, the length of a vowel's vocalisation seems to affect the accuracy of the estimated area functions and hence the displayed vocal tract shape. Specifically, the system gives lower accuracy for longer vowels, such as /AO/, and for complex vowels that involve changes in the configuration of the mouth during sound production, such as /OW/. We believe this is because the system, in its current design, bases its estimation of the speech parameters on information extracted from the 2-3 middle frames of the analysed speech waveform.
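To illustrate this limitation, the following is a minimal sketch of middle-frame selection of the kind described above; the frame length, the absence of overlap and the exact number of frames retained are assumptions made for illustration only:

import numpy as np

def middle_frames(signal, frame_len=480, n_mid=3):
    # Split the waveform into non-overlapping frames (the frame
    # length is an illustrative assumption) and keep only the frames
    # around the temporal midpoint, from which the speech parameters
    # are then estimated.
    sig = np.asarray(signal, dtype=float)
    n_frames = len(sig) // frame_len
    frames = sig[:n_frames * frame_len].reshape(n_frames, frame_len)
    mid = n_frames // 2
    start = max(0, mid - n_mid // 2)
    return frames[start:start + n_mid]

Because only the central portion of the utterance is analysed, articulatory changes at the onset and offset of a long or complex vowel fall outside the analysis window, which is consistent with the lower accuracy observed for /AO/ and /OW/.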
6 CONCLUSIONS 
We have described the design and development of a computer-based system for the near real-time and non-real-time visualisation of the
vocal tract shape during vowel articulation. 
Compared with similar existing systems, our system uses
a new approach for estimating the vocal tract mid-
sagittal distances based on both the area functions 
and the first three formants as extracted from the 
acoustic speech signal. It also utilises a novel and 
simple technique for mapping the extracted 
information to corresponding mid-sagittal distances 
on the displayed graphics. The system is also 
capable of displaying the sound intensity, the pitch 
and the first three formants of the uttered speech. It 
extracts the required parameters directly from the 
acoustic speech signal using an AR speech 
production model and LP analysis. The reported preliminary experimental results have shown that, in general, the system reproduces the vocal tract shapes well for vowel articulation, with a real-time sensation. Work is well underway to optimise the algorithm used for extracting the required acoustic information and the mapping technique, so that dynamic descriptions of the vocal tract configuration can be obtained for long and complex vowels, as well as for vowel-consonant and consonant-vowel transitions. Enhancement of the system's
real-time capability and features, and facilitation of 
an integrated speech training aid for the hearing-
impaired are also being investigated. 
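For context, formant estimation from an AR speech model via LP analysis is typically carried out by solving the autocorrelation normal equations for the predictor coefficients and then converting the roots of the prediction polynomial to frequencies. The sketch below illustrates this standard textbook procedure rather than the system's exact implementation; the sampling rate, LP order and window are assumptions:

import numpy as np
from scipy.linalg import toeplitz

def lp_formants(frame, fs=10000, order=10, n_formants=3):
    # Window the frame and compute autocorrelations r[0..order].
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    # Autocorrelation method: solve the normal equations R a = r
    # for the AR (predictor) coefficients a[1..order].
    a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])
    # Roots of the prediction polynomial A(z) = 1 - sum_k a_k z^-k;
    # keep one root of each complex-conjugate pair.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    # A practical implementation would also discard wide-bandwidth
    # roots before picking formants; omitted here for brevity.
    return freqs[:n_formants]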
REFERENCES 
Choi, C.D., 1982. A Review on Development of Visual Speech Display Devices for Hearing Impaired Children. Commun. Disorders, 5, 38-44.
Bunnell, H.T., Yarrington, D.M. & Polikoff, J.B., 2000. STAR: articulation training for young children. In Intl. Conf. on Spoken Language Processing (INTERSPEECH 2000), 4, 85-88.
Mashie, J.J., 1995. Use of sensory aids for teaching speech to children who are deaf. In Spens, K-E. and Plant, G.