Speech to Text Recognition for Videogame Controlling with
Convolutional Neural Networks
Joaquin Aguirre-Peralta, Marek Rivas-Zavala and Willy Ugarte
Universidad Peruana de Ciencias Aplicadas, Lima, Peru
Keywords:
Speech to Text, Machine Learning, Deep Learning, Gamification.
Abstract:
Disability has always been present throughout human history, and every nation is immersed in this reality. With communication and interaction through technology being more important than ever, people with disabilities are the most affected by this physical gap. There are still few tools that these people can use to interact more easily with different types of hardware; therefore, we want to provide them with a playful and medical tool that can adapt to their needs and allow them to interact a little more with the people around them. In this context, we decided to focus on people with motor disabilities of the upper limbs and, based on this, we propose the use of gamification in the area of NLP (Natural Language Processing), developing a videogame consisting of three voice-operated minigames. This work has four stages: analysis (benchmarking), design, development and validation. In the first stage, we elaborated a benchmarking of the models. In the second stage, we describe the implementation of CNNs, together with methods such as gamification and NLP, for problem solving. In the third stage, the minigames which compose the videogame and their characteristics are described. Finally, in the last stage, the videogame was validated with experts in physiotherapy. Our results show that, with the training performed, the prediction accuracy for words with noise improved from 43.49% to 74.50% and for words without noise from 63.87% to 96.36%.
1 INTRODUCTION
Although technology is constantly developing and more software, office automation, therapeutic tools, etc. are being created (Hersh and Mouroutsou, 2019), the tools that help people with motor disabilities of the upper limbs are not enough to cover the needs these people have, because the target audience is too small and not sufficiently lucrative for companies to create such tools on a large scale. These tools matter to people with disabilities in many aspects. On the employment side, by using them, people with disabilities could compete on equal terms and have a comparable value as workers. On the legal aspect, these people can improve and increase their interaction with other people if they are given the right tools to improve their sociability (Nevarez-Toledo and Cedeno-Panezo, 2019). On the medical aspect, this kind of tool can help doctors carry out analyses, therapies and rehabilitation more efficiently when dealing with people with disabilities (Garnica et al., 2020; Sundaram et al., 2019). Many solutions have emerged
over time, proposing new tools with different methods to help these people. Studies such as (Kandemir and Kose, 2022) created a videogame for people of different ages and with different disabilities, which promotes physical activity and improves attention, emotion and sensorimotor coordination. Studies such as (Hwang et al., 2021) designed and implemented a human-computer interface specifically for users with motor disabilities, which works using eye-tracking and lip-movement technology; that project aims to promote employment, education, social life and leisure for these people. The method used for the development of the neural network that supports the videogame is speech-to-text. This is an interdisciplinary subfield of computer science and computational linguistics that allows the recognition and translation of spoken language to text by computers (O'Shaughnessy, 2008). To carry out this project we have used a Convolutional Neural Network (CNN), trained with voice datasets created by us, based on the wav2vec2-large-xlsr-53-spanish model for the Spanish language. We have used
the training results of the neural network (available at https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-spanish) and converted these data into the main input of the videogame we created, so that it is able to understand speech through the speech-to-text method and the player can successfully interact with the game.
Our main contributions are as follows:
We developed a convolutional neural network ar-
chitecture focused on speech-to-text to be able to
recognize words, convert them to text and be able
to use them in a videogame.
We developed 3 turn-based local multiplayer
mini-games within the videogame that can pro-
cess the data provided by the convolutional neural
network.
We present an analysis of the most promising re-
sults showing the feasibility of our project.
This paper is organized as follows. In Section 2, similar solutions implemented in different fields for people with disabilities are discussed. In Section 3, we explain the definitions of the technologies related to convolutional neural networks, the details of the architecture chosen for the solution, and terms such as NLP, CNN, speech-to-text and gamification. Section 4 details the experiments and their results, and Section 5 concludes the paper.
2 RELATED WORKS
Next, we briefly discuss the different types of solutions used in serious games similar to this project that rely on gamification and NLP.
In (Kandemir and Kose, 2022), the authors propose three types of human-computer interaction games based on physical activity with a Kinect sensor, to improve attention, emotion and sensorimotor coordination. Unlike them, who seek to increase the participants' attention and their willingness to continue with therapy sessions and tests, we also seek, through gamification, to improve the interaction of people with upper-limb motor disabilities with the people around them.
In (Hwang et al., 2021), the authors proposed a nine-point fixation experiment to assess the efficacy of gaze data smoothing. To do this, they conducted an on-screen keyboard-based Chinese text input experiment with four different keyboard sizes to assess the influence of keyboard size on text input rate. Their work aims to facilitate the interaction of disabled users with their computers for various communication applications using eye tracking. In the same way, we want to improve interaction for users with upper-limb motor impairment, but using natural language processing, speech recognition and speech-to-text conversion.
The article (Mavropoulos et al., 2021) attempts to unify advanced IoT functionalities with verbal human-computer interaction to improve the effectiveness of treatments/medications, reduce operating expenses and optimize personnel management, among other benefits, in order to limit the interaction between caregiver and patient to only what is necessary. In contrast to our work, which aims to facilitate human-computer interaction for people with motor disabilities of the upper limbs, they use this technology to give hospital/clinic patients a sense of independence that helps them mentally towards a speedy recovery.
The study (Anjos et al., 2020) compares the classification of sibilant consonants with neural networks and, by gamifying this technique, seeks to keep children from losing interest and motivation in doing their speech exercises. In their work, they use CNNs to convert matrices into data samples that, through gamification methods, are transformed into serious games to help children who cannot pronounce sibilant consonants correctly. For our part, we use stereo-channel "wav" inputs that we process to obtain data for the training of the neural network, which later becomes the input for the correct operation of our game, helping people with motor disabilities of the upper limbs to increase their interaction with others.
In (Song et al., 2006), the authors create a Unity-based system that generates 3D cartoons from a text input, with the ability to automatically configure the environment without the input having to be a script. Although both works use NLP principles, in contrast to ours, that study uses NLP and graphic methods to create 3D cartoons. Instead, our work uses NLP and speech-to-text to understand human language, interpret and process it, and finally convert it into text used as input for the main functionality of the game.
3 VIDEOGAME CONTROLLING
FROM SPEECH TO TEXT
RECOGNITION
3.1 Preliminary Concepts
Machine learning developments have given us very significant advances that help improve aspects of our lives and make everyday tasks easier, such as pattern recognition, predictive behavior, facial recognition and, in the case of this work, word recognition, better known as speech-to-text. The model chosen for this work was wav2vec2-large-xlsr-53-spanish, which is based on Convolutional Neural Networks (CNN).
Convolutional Neural Networks. According to Keiron O'Shea and Ryan Nash, a CNN is a type of Artificial Neural Network (ANN) specialized in recognizing patterns in images. Like other ANNs, its neurons self-optimize and update their weights, and the last layer holds the loss functions associated with the classes. CNNs are made up of three types of layers, namely convolutional layers, pooling layers and fully connected layers, which must appear together for the architecture to be considered a CNN. Figure 1 shows a CNN architecture (O'Shea and Nash, 2015).
Figure 1: A simple CNN architecture, composed of only
five layers (O’Shea and Nash, 2015).
Natural Language Processing (NLP). According to Ann Neethu Mathew et al., NLP is focused on how to train a machine to learn and process human language using computational and linguistic technology (Jung et al., 2019).
Gamification. According to Oriol Borrás Gené, gamification is the use of mechanics, techniques or elements related to game design in a context outside of games, with the goal of increasing user engagement and of solving, or helping to solve, a problem. Figure 3 shows some of the characteristics that gamification incorporates.
Figure 2: Image of a simplified NLP process (source: https://monkeylearn.com/blog/what-is-natural-language-processing/).
Figure 3: Characteristics that make up gamification (source: https://blog.efmdglobal.org/2021/09/08/redefining-failure-teaching-digital-skills-with-gamification/).
3.1.1 Wav2vec2.0 Structure
Figure 4 below shows that wav2vec2.0 is made up of several convolutional and self-attention layers. This is a common type of structure recently used in the end-to-end ASR family of models. The job of the convolutional layers is to reduce the speech sample X and generate a more compressed latent representation Z. In detail, X is the raw audio signal sampled at 16 kHz, and Z is produced with a stride of 20 ms and a receptive field of 25 ms. The self-attention layers are responsible for creating the contextualized representations C and capturing high-level content from Z. These self-attention layers have a high level of context-dependency modeling, which allows the model to make the correct decisions during the subsequent contrastive training given the masked Z (Yi et al., 2020).
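To make the inference path concrete, the following is a minimal sketch of transcribing one audio file with the pretrained wav2vec2 checkpoint mentioned above, via the Hugging Face transformers library; the file name and the use of librosa for resampling are illustrative assumptions, not the exact code of this project.

```python
# Minimal inference sketch with the "jonatasgrosman/wav2vec2-large-xlsr-53-spanish"
# checkpoint; the WAV file name is a placeholder.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-spanish"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load and resample the audio to the 16 kHz rate expected by the feature encoder.
speech, _ = librosa.load("michi_example.wav", sr=16_000, mono=True)

# The convolutional encoder turns the raw waveform X into latent frames Z;
# the self-attention layers build the contextual representations C.
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy CTC decoding: pick the most likely token per frame, then collapse repeats.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```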
3.2 Method
In this section we will describe our contributions.
Section 3.2.1 presents the workflow of our project,
Figure 4: Left: the structure of wav2vec2.0 and the corresponding self-training criteria. It contains a stack of convolution layers and self-attention layers; Right: two decoding branches applying wav2vec2.0 to ASR tasks with an extra projection or decoder, trained with CTC or cross-entropy loss respectively (Yi et al., 2020).
which allows us to interact with the videogame from the trained results of the implemented CNN. Section 3.2.2 shows our data preprocessing method to obtain audio that is useful when training the convolutional neural network. Section 3.2.3 explains the configuration and training procedure of the CNN with the data obtained. Section 3.2.4 explains how the neural network outputs are linked to the videogame's speech controller to create the human-computer voice interaction.
3.2.1 Workflow
The objective of our proposed model is to convert audio into text so that it works as the data input for the correct operation of the videogame. We created our own dataset from the voices of 62 volunteers, containing more than 11,718 audios with durations of up to 3 seconds. The words found in this dataset represent all the inputs and combinations that we needed for the game to work in its entirety.
Figure 5: Physical diagram of our project.
Figure 5 shows the workflow of our project, which starts in the "training" section. Our neural network is stored on the shared file platform "Google Drive" and coded in Python on the "Google Colab" platform. The network training results were saved in JSON format to a GitHub repository. Finally, the videogame was developed with the Unity game engine, which connects to the repository and uses the JSON as input to function.
3.2.2 Data Preprocessing
The dataset consisted of 27 words in Spanish, which are divided into:
Words that designate games or actions such as
“michi”, “memoria”, “damas”, “salir”, “pausa”,
“continuar”, “reiniciar” and “cerrar”.
Letter names such as “a”, “b”, “c”, “f”, “g”, “h”,
“i”, “j”, “k”, “l” and “m”.
In addition to the names of the first eight numbers
“1”, “2”, “3”, “4”, “5”, “6”, “7” and “8”.
A small number of the audios from the people who helped us had pronunciation problems, too much noise or bad recordings; therefore, those audios were not used in the training or test phases.
3.2.3 Neural Network Model and Training
As seen in Section 3.1.1, wav2vec is based on multiple convolutional layers. The training process consisted of providing the model with noisy audio of the keywords, with the intention that the model learns to identify and separate the words from the noise.
We took the 27 pre-recorded words and combined each recording with 7 different noises, obtaining 7 noisy variants of the same audio, thus teaching the model to identify only the words. This process is known as augmentation (Laptev et al., 2020). We listened to and validated
the correct pronunciation of every audio, marking just 8.06% of the data obtained as not valid before proceeding with the training and testing. The training was carried out on the Google Colab platform; we used audios from only 57 people in the training phase, training and saving the models each time. In the end, the model was retrained with 10,773 audios.
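As a concrete illustration of the augmentation step described above, the sketch below mixes one clean recording with several noise clips; the folder layout, file names and mixing gain are assumptions for illustration, not the exact procedure used in the project.

```python
# Minimal noise-augmentation sketch: combine each clean word recording with
# noise clips to obtain noisy variants of the same audio.
import numpy as np
import soundfile as sf

def mix_with_noise(clean_path, noise_path, out_path, noise_gain=0.3):
    clean, sr = sf.read(clean_path)
    noise, _ = sf.read(noise_path)
    # Work in mono to keep the mixing simple.
    if clean.ndim > 1:
        clean = clean.mean(axis=1)
    if noise.ndim > 1:
        noise = noise.mean(axis=1)
    # Tile the noise so it is at least as long as the clean recording, then trim.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    mixed = clean + noise_gain * noise[: len(clean)]
    # Normalize to avoid clipping after the addition.
    mixed = mixed / max(1.0, float(np.max(np.abs(mixed))))
    sf.write(out_path, mixed, sr)

# Combining one clean audio with 7 noises yields the 7 noisy variants described above.
for i in range(1, 8):
    mix_with_noise("clean/michi_01.wav", f"noise/noise_{i}.wav", f"augmented/michi_01_n{i}.wav")
```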
3.2.4 Video Game Input and Human-Computer
Interaction
After the neural network was trained using the previously described dataset and predicted the words with a high percentage (74.50% in the case of noise and 96.36% in the case of no noise), the results were saved in a JSON file and exported to the GitHub repository. These are then used by the game's Speech Controller, which is responsible for parsing the JSON extracted from the web so that the Unity engine can interpret the data. With this, the user can use their microphone to say a word from the catalog of trained words and interact with the videogame by voice.
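Since the exact JSON schema consumed by the game's Speech Controller is not spelled out here, the following is only a hypothetical sketch of how recognized words could be serialized to a JSON file before being pushed to the GitHub repository; the field names are illustrative.

```python
# Hypothetical export of recognition results to JSON; the schema is an assumption.
import json

predictions = [
    {"file": "audio_001.wav", "word": "michi"},
    {"file": "audio_002.wav", "word": "pausa"},
]

with open("speech_results.json", "w", encoding="utf-8") as fp:
    json.dump({"results": predictions}, fp, ensure_ascii=False, indent=2)
# The resulting file can then be committed to the repository that the game's
# Speech Controller downloads and parses at runtime.
```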
4 EXPERIMENTS
In this section, we detail the hardware and software specifications used in the project, with their respective configurations, so that the experiments can be replicated. We also show the results obtained and discuss their interpretation under our evaluation metrics.
4.1 Experimental Protocol
Development environment: the neural network was programmed and the models were trained in Google Colaboratory, which provides an Intel Xeon CPU at 2.20 GHz with 13 GB of RAM and a Tesla K80 accelerator with 12 GB of GDDR5 VRAM. For the videogame, Unity 2021.3.2f1 with the C# programming language was used to program the three minigames with voice recognition.
As mentioned in Section 3.2.3, 10,773 audios were used as the dataset for the different phases of training our neural network model.
We separated the training into phases to keep track of the progress of each phase. Each phase trained, on average, between 945 and 1,512 audios; this limit exists because the free version of Google Colab only allows training for a limited time and amount, and training also depended on the number of audios available at that time.
In the first scenario, we have 4 audio groups, 1 for training and 3 for testing:
Group A. The original data collected, mixed with 7 different noises, for a total of 10,773 audios.
Group B. The original data collected, without noise; it was not used in training and consists of 1,539 audios.
Group C. Data collected from 5 other people, mixed with the 7 noises, for a total of 945 audios.
Group D. The same data as Group C, but without the 7 noises; it consists of 135 audios.
On the other hand, the videogame was created in Unity 3D with full mouse functionality, which later had to be transformed into voice commands so that its use by people with motor disabilities could be validated. This environment loads the results from the GitHub repository to which the trained models' results were uploaded, and these serve as input for the operation of the videogame.
The neural network code hosted in Google Drive/Google Colaboratory can be found at the following links: https://goo.su/nia1a2 and https://goo.su/5MONA. Meanwhile, the videogame repository with all the validated and complete features is on GitHub: https://github.com/Xerathox/GamiHelper.
4.2 Results
4.2.1 Experiment Data Sets
As previously mentioned in Section 4.1, the dataset consisted of 11,718 audios in total, which went through the preprocessing described in Section 3.2.2. Of these, 8.06% were replaced by synthetic audios. In order to reproduce the training process, access to the Python code file in Google Drive was provided in Section 4.1. First, the necessary files for the training must be uploaded to your own Google Drive: the audios and a "csv" file with a column called "file", in which the path of each training audio in your Google Drive is specified, and another column called "text", which specifies the word it represents. Second, you must change the paths in the training script to match the path of the training "csv" file and the path where the model will be saved after training, and also load a previously saved model.
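As a concrete illustration, the sketch below builds such a "csv" file with the "file" and "text" columns; the Drive folder path and the file-naming convention (the spoken word as a prefix of the file name) are assumptions for illustration.

```python
# Sketch of building the training "csv" expected by the training script.
import csv
from pathlib import Path

AUDIO_DIR = Path("/content/drive/MyDrive/speech_dataset")  # assumed Drive folder

with open(AUDIO_DIR / "train.csv", "w", newline="", encoding="utf-8") as fp:
    writer = csv.writer(fp)
    writer.writerow(["file", "text"])        # column names described above
    for wav in sorted(AUDIO_DIR.glob("*.wav")):
        word = wav.stem.split("_")[0]        # e.g. "michi_03.wav" -> "michi" (assumed naming)
        writer.writerow([str(wav), word])
```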
4.2.2 Experiment Configuration
In the comparison with our work we have the follow-
ing models:
DeepSpeech. The first hyper-parameter setup used 10 epochs and 10 layers (a sketch of how the reported WER values can be computed follows these results).
With our Group B test we had a WER of 100%
(0% accuracy).
With our Group C test we had a WER of 100%
(0% accuracy).
With our Group D test we had a WER of 100%
(0% accuracy).
In the second hyper-parameter configuration we
had 20 epochs and 20 layers.
With our Group B test we had a WER of 96.30%
(accuracy of 3.70%).
With our Group C test we had a WER of 96.30%
(accuracy of 3.70%).
With our Group D test we had a WER of 100%
(accuracy of 0%).
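For reference, the WER values reported above (with accuracy taken as 1 − WER) can be computed as in the following sketch, which uses the jiwer library; the reference and hypothesis lists are illustrative.

```python
# Word error rate over a toy test set; accuracy is reported as 1 - WER, as above.
from jiwer import wer

references = ["michi", "pausa", "memoria", "salir"]   # ground-truth words
hypotheses = ["michi", "causa", "memoria", "salir"]   # model transcriptions

error_rate = wer(references, hypotheses)
accuracy = 1.0 - error_rate
print(f"WER: {error_rate:.2%}  accuracy: {accuracy:.2%}")
```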
TensorFlow Model from Scratch. The hyper-parameter configuration in this case concerned the layers (a sketch of this stack follows the list):
InputLayer: the input layer.
Resizing: resizes the input image (spectrogram) to 32x32 pixels.
Normalization: a normalization layer with default parameters.
Conv2D: a 2D convolution layer with 32 filters, kernel size of 3 and "relu" activation.
Conv2D: a 2D convolution layer with 96 filters, kernel size of 3 and "relu" activation.
MaxPooling2D: a max-pooling layer with default parameters.
Dropout: a dropout layer with a rate of 0.25.
Flatten: a flatten layer with default parameters.
Dense: a dense layer with 256 units and "relu" activation.
Dropout: a dropout layer with a rate of 0.5.
Dense: a dense layer with as many units as there are labels.
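For reference, the following Keras sketch assembles the layer stack listed above; the input spectrogram shape is an assumption for illustration, the number of labels follows the 27 trained words, and the Normalization layer would normally be adapted to the training data first.

```python
# Keras sketch of the layer stack listed above (TensorFlow 2.x assumed).
import tensorflow as tf
from tensorflow.keras import layers, models

num_labels = 27                 # one label per trained word
input_shape = (124, 129, 1)     # assumed (time, frequency, channel) spectrogram shape

model = models.Sequential([
    layers.Input(shape=input_shape),
    layers.Resizing(32, 32),                   # downsample the spectrogram to 32x32
    layers.Normalization(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.Conv2D(96, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(num_labels),                  # logits, one per label
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
```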
This is a special case because this model uses validation data. Two tests were carried out. Each time the model is trained, a different model can be obtained, which can give different accuracies on the same test; this means that if the same steps are followed to recreate the experiment, it may not give the same accuracy as what we present below. For Case 1, in 5 training and testing attempts we obtained 19.26%, 13.33%, 15.56%, 11.11% and 14.81%, for an average accuracy of 14.81%. For Case 2, we obtained 16.51%, 17.99%, 13.76%, 15.56% and 18.10% in 5 training and testing attempts, for an average accuracy of 16.38%.
This is better visualized in Table 1.
Table 1: TensorFlow model test results on average.
Case | Train #Audios | Validation #Audios | Test #Audios | Accuracy on average | Try 1 | Try 2 | Try 3 | Try 4 | Try 5
Case 1 (Noisy Data) | 9,828 | 945 | 945 | 14.81% | 19.26% | 13.33% | 15.56% | 11.11% | 14.81%
Case 2 (Clean Data) | 1,404 | 135 | 135 | 16.38% | 16.51% | 17.99% | 13.76% | 15.56% | 18.10%
Figure 6a shows a heat map pertaining to the accuracy of this model when trained and tested with noisy data, and Figure 6b shows another heat map when trained and tested with data without noise.
(a) For noisy audio and 17.99% of accuracy.
(b) For clean audio and 19.26% of accuracy.
Figure 6: Heatmaps of Tensorflow models trained and
tested with audios.
Table 2: Azure Cognitive Services model.
Weight | Group B error rate | Group B accuracy | Group C error rate | Group C accuracy | Group D error rate | Group D accuracy
70% | 4.22% | 95.78% | 13.33% | 86.67% | 2.96% | 97.04%
30% | 4.35% | 95.65% | 15.87% | 84.13% | 3.70% | 96.30%
10% | 5.13% | 94.87% | 18.52% | 81.48% | 4.44% | 95.56%
Azure Cognitive Services Model. The hyper-parameter configured in this case was the weight given to the new training data. It was tested for training weights of 70%, 30% and 10%, as shown in Table 2.
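For completeness, recognition with Azure Cognitive Services can be invoked from Python roughly as in the sketch below, using the azure-cognitiveservices-speech SDK; the subscription key, region, locale and file name are placeholders, and the custom-data weighting evaluated above is configured on the Azure side rather than in this client code.

```python
# Minimal one-shot recognition sketch with the Azure Speech SDK.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "es-ES"   # the input language can be fixed explicitly

audio_config = speechsdk.audio.AudioConfig(filename="michi_example.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
```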
4.2.3 Baselines
We re-trained the wav2vec2 model mentioned above with our data from Group A in 14 different versions, at 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25 and 30 epochs, to gather evidence of how many epochs yield the best learning. Table 3 shows the results of each test. Due to
Table 3: Groups B, C and D tested with each epochs.
Model Group B Group C Group D
Standard 63.87% 43.49% 65.93%
5 epochs 95.06% 70.26% 98.52%
6 epochs 94.87% 71.22% 97.78%
7 epochs 95.32% 72.49% 98.52%
8 epochs 85.06% 60.74% 87.41%
9 epochs 96.30% 72.80% 98.52%
10 epochs 96.30% 73.44% 98.52%
11 epochs 95.71% 73.44% 97.78%
12 epochs 81.48% 57.14% 80.74%
13 epochs 96.36% 74.50% 97.78%
14 epochs 82.20% 57.04% 83.70%
15 epochs 80.12% 55.24% 79.26%
20 epochs 88.30% 64.55% 89.63%
25 epochs 77.58% 57.99% 75.56%
30 epochs 77.45% 59.05% 76.30%
what is presented in Table 3, we can conclude that the value of the hyper-parameter "number of epochs" with which the wav2vec 2.0 model presented the best learning was 13 epochs, as it gives the best precision in Groups B and C. Group D has less weight as it has fewer audios.
4.3 Discussion
Observations for the DeepSpeech model:
Very uneven results were achieved; the implemented neural network returned accuracies of only 0% or 100%.
When training with a lot of data but little training and few convolution layers, the results tend to be 0% accuracy; however, with little data and a lot of training, the model tends to return 100% accuracy, which makes it an unstable model.
We could have modified the hyper-parameters or increased the epochs and layers in order to obtain more accurate results; however, we did not have much time when the models were being built and trained for these comparisons.
Observations for the TensorFlow model from scratch:
A lot of data is needed to train a speech-to-text model from scratch.
By using the free version, this model was limited and only a specific number of outputs (labels) could be produced.
The model sometimes could not differentiate speech from noise; it compares the waveform of the spectrogram, and that interference makes it much more difficult to find the word to which each sound belongs.
The variation in audio duration, between 1 and 4 seconds, made learning very difficult, as the model seems to compare the audio waveform of the spectrogram very simplistically. If it detects that an audio contains the sounds of two audios in different positions, it can consider them different words even though they are not, reducing the accuracy percentage.
Observations for the model in Azure Cognitive Services:
It allows setting the language of the audio input.
It is a very powerful model, as evidenced by its high accuracy rate.
It is a paid model, in contrast to wav2vec2, which is free.
Main Conclusions.
Regarding the recommendations of the interviewees:
The game has an easy learning curve and is quite intuitive, even at first use.
The design is simple but attractive and colorful.
It could be aimed at other target audiences, such as children and the elderly.
More games could be added.
The on-screen text indicating how to play within the mini-games could be more detailed, to guide the player a little more when starting the game.
Some mentioned that the lighting looked unclear.
A minority criticized the choice of colors, as better color combinations could be chosen for the game.
20% did not understand how to get started, as they were only told that the game worked by voice commands.
Regarding the wav2vec2 model:
It uses the torch library, which we could not run under Windows; the model could only be trained on virtual machines, other operating systems or Google Colab.
This model does a good job of isolating the speech from the noise, and it does not lose accuracy due to the size of the audio, unlike the TensorFlow model, which for example needed the durations of the audios to be very close to each other.
The language of the input cannot be specified; it is determined internally by the model, which can cause precision problems for a syllable, word or phrase that sounds the same in two different languages, for example "i" (Spanish) and "e" (English), or "ji" (Spanish) and "he" (English).
5 CONCLUSIONS
By implementing the W2V2 convolutional neural network model that transforms speech to text, and by increasing the accuracy percentage of the trained words, the computer can effectively understand the keywords provided to it so that they can be used, in this case, as input for the videogame.
In the future, in terms of the videogame, we would like to increase the catalog of games, add difficulty levels and improve some visual aspects such as the color palette or the level of detail of the on-screen instructions. On the other hand, since speech-to-text is such a versatile topic, similar techniques could be applied in other areas, such as voice-controlled office tools or cell phone applications that make phone functionalities accessible to people with other disabilities. As long as a good speech-to-text model is available, or one is implemented and trained long and accurately enough, any model is valid. Furthermore, the user experience can be improved greatly with animation models (Silva et al., 2022) or story creation (Fernández-Samillán et al., 2021).
REFERENCES
Anjos, I., Marques, N. C., Grilo, M., Guimarães, I., Magalhães, J., and Cavaco, S. (2020). Sibilant consonants classification comparison with multi- and single-class neural networks. Expert Syst. J. Knowl. Eng., 37(6).
Fernández-Samillán, D., Guizado-Díaz, C., and Ugarte, W. (2021). Story creation algorithm using Q-learning in a 2D action RPG video game. In IEEE FRUCT.
Garnica, C. C., Archundia-Sierra, E., and Beltrán, B. (2020). Prototype of a recommendation system of educative resources for students with visual and hearing disability. Res. Comput. Sci., 149(4):81–91.
Hersh, M. A. and Mouroutsou, S. (2019). Learning technol-
ogy and disability - overcoming barriers to inclusion:
Evidence from a multicountry study. Br. J. Educ. Tech-
nol., 50(6):3329–3344.
Hwang, I., Tsai, Y., Zeng, B., Lin, C., Shiue, H., and Chang,
G. (2021). Integration of eye tracking and lip motion
for hands-free computer access. Univers. Access Inf.
Soc., 20(2):405–416.
Jung, S., Son, M., Kim, C., Rew, J., and Hwang, E. (2019).
Video-based learning assistant scheme for sustainable
education. New Rev. Hypermedia Multim., 25(3):161–
181.
Kandemir, H. and Kose, H. (2022). Development of adap-
tive human-computer interaction games to evaluate at-
tention. Robotica, 40(1):56–76.
Laptev, A., Korostik, R., Svischev, A., Andrusenko, A.,
Medennikov, I., and Rybin, S. (2020). You do not need
more data: Improving end-to-end speech recognition
by text-to-speech data augmentation. In CISP-BMEI,
pages 439–444. IEEE.
Mavropoulos, T., Symeonidis, S., Tsanousa, A., Gian-
nakeris, P., Rousi, M., Kamateri, E., Meditskos, G.,
Ioannidis, K., Vrochidis, S., and Kompatsiaris, I.
(2021). Smart integration of sensors, computer vision
and knowledge representation for intelligent monitor-
ing and verbal human-computer interaction. J. Intell.
Inf. Syst., 57(2):321–345.
Nevarez-Toledo, M. and Cedeno-Panezo, M. (2019). Appli-
cation of neurosignals in the control of robotic pros-
thesis for the inclusion of people with physical dis-
abilities. In INCISCOS, pages 83–89.
O’Shaughnessy, D. D. (2008). Invited paper: Automatic
speech recognition: History, methods and challenges.
Pattern Recognit., 41(10):2965–2979.
O’Shea, K. and Nash, R. (2015). An introduction to convo-
lutional neural networks. CoRR, abs/1511.08458.
Silva, S., Sugahara, S., and Ugarte, W. (2022). Neuranima-
tion: Reactive character animations with deep neural
networks. In VISIGRAPP (1: GRAPP).
Song, I., Jung, M., and Cho, S. (2006). Automatic gen-
eration of funny cartoons diary for everyday mobile
life. In Australian Conference on Artificial Intelli-
gence, volume 4304 of Lecture Notes in Computer
Science, pages 443–452. Springer.
Sundaram, D., Sarode, A., and George, K. (2019). Vision-
based trainable robotic arm for individuals with motor
disability. In UEMCON, pages 312–315. IEEE.
Yi, C., Wang, J., Cheng, N., Zhou, S., and Xu, B. (2020).
Applying wav2vec2.0 to speech recognition in various
low-resource languages. CoRR, abs/2012.12121.