5 DEEP LEARNING IMPLEMENTATION
With the image data for each spoken letter extracted, we began implementing the machine learning models. We defined several smaller networks to test our concept and compare different setups. Table 2 provides a brief overview of the models we used, including the number of layers and nodes, the number of epochs, and the accuracy achieved.
We tested different combinations of layers and nodes and trained each model for a different number of epochs. The best-performing model was the 3D-CNN, which achieved an accuracy of 93% with just 5 epochs. The 3D-CNN + LSTM model also showed promising results, achieving an accuracy of 90% with 6 layers and 143 nodes.
Each model was implemented with Keras, an open-source deep learning library. The first LSTM model was a simple test that achieved great results in tests with just a single subject. It consists of a network with three layers: the first is a layer of 20 LSTM nodes, followed by a hidden layer of 10 densely connected nodes. Finally, as in every model, an output layer of three nodes returns the calculated probability of each letter.
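As an illustration, a minimal sketch of this model with the Keras Sequential API could look as follows; the sequence length, feature-vector size, activation functions, and training settings are placeholder assumptions rather than the exact configuration used:

# Minimal sketch of the three-layer LSTM model. N_FRAMES and N_FEATURES
# are hypothetical placeholders for the number of MRI frames per letter
# and the size of the extracted per-frame feature vector.
from tensorflow import keras
from tensorflow.keras import layers

N_FRAMES, N_FEATURES = 30, 128

lstm_model = keras.Sequential([
    keras.Input(shape=(N_FRAMES, N_FEATURES)),
    layers.LSTM(20),                        # layer of 20 LSTM nodes
    layers.Dense(10, activation="relu"),    # hidden layer of 10 dense nodes
    layers.Dense(3, activation="softmax"),  # probability for each of the 3 letters
])
lstm_model.compile(optimizer="adam",
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])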
Next, we tested a 3D-CNN network consisting of five layers:
• 60 3D-convolutional nodes
• 30 3D-convolutional nodes
• 20 3D-convolutional nodes
• 10 dense nodes
• 3 output nodes
The increased training time and the fast convergence to the final accuracy led us to reduce training to just five epochs.
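For illustration, a minimal Keras sketch of this five-layer architecture is given below; the clip dimensions, kernel sizes, activations, and the flattening step before the dense layer are assumptions, since only the layer widths are fixed above:

# Sketch of the five-layer 3D-CNN; input is a clip of MRI frames.
# The clip size (30 frames of 64x64 pixels) and kernel size are placeholders.
from tensorflow import keras
from tensorflow.keras import layers

cnn_model = keras.Sequential([
    keras.Input(shape=(30, 64, 64, 1)),
    layers.Conv3D(60, kernel_size=3, activation="relu"),  # 60 3D-convolutional nodes
    layers.Conv3D(30, kernel_size=3, activation="relu"),  # 30 3D-convolutional nodes
    layers.Conv3D(20, kernel_size=3, activation="relu"),  # 20 3D-convolutional nodes
    layers.Flatten(),
    layers.Dense(10, activation="relu"),                  # 10 dense nodes
    layers.Dense(3, activation="softmax"),                # 3 output nodes
])
cnn_model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
# cnn_model.fit(x_train, y_train, epochs=5)  # training was stopped after 5 epochs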
After this, we tried to combine the 3D-CNN with the LSTM network by adding another layer between layers 3 and 4 of the 3D-CNN model. This new layer consists of 20 LSTM nodes and worsened the result compared to the simpler network. This could have many reasons, such as an LSTM layer that is too small, too little prepared training data, or a poor integration of the new nodes.
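As an illustration, the sketch below shows one possible way to integrate such a layer: the output of the last 3D-convolutional layer is flattened per time step and fed into a 20-node LSTM before the dense layer. The reshaping step shown here is an assumption and only one possible integration:

# Sketch of the combined 3D-CNN + LSTM model: a 20-node LSTM between the
# last convolutional layer and the dense layer. How the 5D convolution
# output is turned into a sequence for the LSTM is an assumption.
from tensorflow import keras
from tensorflow.keras import layers

hybrid_model = keras.Sequential([
    keras.Input(shape=(30, 64, 64, 1)),                   # same placeholder clip size
    layers.Conv3D(60, kernel_size=3, activation="relu"),
    layers.Conv3D(30, kernel_size=3, activation="relu"),
    layers.Conv3D(20, kernel_size=3, activation="relu"),
    layers.TimeDistributed(layers.Flatten()),             # one feature vector per remaining time step
    layers.LSTM(20),                                      # the additional 20 LSTM nodes
    layers.Dense(10, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
hybrid_model.compile(optimizer="adam",
                     loss="categorical_crossentropy",
                     metrics=["accuracy"])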
6 DISCUSSION AND LIMITS
In this paper, we studied whether it is possible to recognize speech from MRI videos with the help of Deep Learning. Despite a positive result, there are some points that need to be critically addressed.
1. Currently, only a limited number of subjects from the dataset were examined, as data preparation is manual and very time-consuming. The model performs better with a lot of data from a few subjects than with data from many different subjects, because the variance naturally increases with the number of subjects. Adding more data from different subjects would make the model more realistic, but it also increases the amount of training data, which brings us to the next point.
2. The 3D-CNN + LSTM model requires high computing power. As the amount of training data grows, the execution time continues to increase, and powerful computers are needed.
3. Finally, it should be mentioned that only one detection approach was used in this paper, which is based on vectors. There are, however, other approaches such as contour tracking. This method defines outlines, or contours, of the tongue and vocal tract; it requires more complex data preparation but could be more precise because it includes all movements of the oral cavity.
7 CONCLUSION
The investigations carried out have shown that it is
possible to recognize individual letters by using a
classification method. The best results were achieved with the 3D-CNN model, which reached an accuracy of 93%. The combination of the 3D-CNN model and the LSTM model achieved 90%. The lower accuracy is due to too little data being available for this solution.
8 OUTLOOK
For the future, the first step is to massively increase the dataset to achieve better results. In addition, it is important to include more letters in the analysis.
Should this be successful, it would be conceivable to
analyse whole words. Above all, this project should serve as an inspiration for anyone interested in further research, helping to gain new insights into linguistics, language modelling, or clinical research.