Model for Real-Time Subtitling from Spanish to Quechua Based on
Cascade Speech Translation
Abraham Alvarez-Crespo, Diego Miranda-Salazar and Willy Ugarte
Universidad Peruana de Ciencias Aplicadas, Lima, Peru
Keywords:
Speech Recognition, Machine Translation, Quechua Revitalization.
Abstract:
The linguistic identity of many indigenous peoples has become relevant in recent years. The speed with which many of these languages are being lost confronts many countries of the world with a serious reduction of their cultural heritage. In South America, the critical state of vulnerability of many native languages is alarming; even languages such as Quechua, widely spoken in the region, face an early disappearance due to a low rate of inter-generational transmission. Aware of this problem, we propose a speech-to-text translation model that could be used as a resource for language revitalization, taking advantage of cinema to promote and retain the language. For its realization, we developed a cascade architecture that combines a Spanish speech recognition module and our proposal of a Quechua machine translation module, fine-tuned from a Turkish NMT model and a parallel public dataset. Additionally, we developed a subtitling algorithm to be joined with our solution into a real-time subtitling desktop application for clips of films. In the tests we carried out, we obtained better BLEU and chrF scores than previous proposals, in addition to the validation of the translation returned in the subtitles by the Quechua speakers consulted.
1 INTRODUCTION
Numerous nations and international organizations
have recently stepped up their efforts to safeguard,
preserve, and promote the thousands of indigenous
languages that risk extinction by the turn of the cen-
tury. Proof of this can be found in UNESCO’s an-
nouncement of an exclusive decade for the rescue of
several of these. However, the conflict’s diverse na-
ture deteriorates in some regions. The risk scenario
for numerous languages on the American continent
was already covered in (Nordhoff and Hammarström, 2012). Due to this, many South American countries have started taking steps to conserve the linguistic diversity of their citizens. These efforts have not, however, had a significant influence on lowering vulnerability in
nations like Peru. Furthermore, a recent national survey (Encuesta Nacional: Percepciones y actitudes sobre diversidad cultural y discriminación étnico-racial, IPSOS, 2018) reveals how poorly these languages are promoted and revitalized, with a populace that is familiar with fewer than a fifth of the 47 indigenous languages spoken in their nation and that is increasingly inclined to give up their native languages in favor of Spanish. As a result, the
situation is severe for many minority indigenous lan-
guages, even for one of the largest in the region, Quechua, which has close to 10 million speakers in the Andean regions of countries such as Colombia, Ecuador, Bolivia, Argentina and Peru (Rios, 2011). Despite its size, it faces a significant disappearance problem due to its low rate of intergenerational transmission.
This makes the preservation of this and other lan-
guages important. A significant loss of its cultural
and historical value results from the population aban-
doning its native language. Therefore, it is essen-
tial to develop methods that allow the revitalization
of Quechua. For this research, we will focus on the
standard version of Southern Quechua, specifically
the Quechua Chanka variety. We will take advantage of the power of computational translation models
to break the state of linguistic isolation to which the
Quechua population has been exposed in such impor-
tant aspects as education, health, or entertainment.
The task is challenging, though, because only a limited number of studies have addressed languages of a similar nature. Quechua is an agglutinative language, meaning that it relies heavily on morphemes added as suffixes (Rios, 2011), and with its subject-object-verb (SOV) grammatical ordering it is at a disadvantage compared to more widely studied languages like Spanish or English. Because of this, our approach is built on a less explored area of research and with a lower threshold for success. Another point is the low-resource condition of many indigenous languages, Quechua included (Mager et al., 2018), which refers to languages whose parallel corpora contain fewer than 0.5 million parallel sentences (LeCun et al., 2015). This characteristic has a negative effect on the results and on the capability to explore contemporary solutions that require large amounts of data.
There have been small-scale non-computational and computational initiatives in the past aimed at revitalizing Quechua. Regarding the first group, there are various educational or social programs run by the Peruvian government to integrate the Quechua community (Hornberger and Coronel-Molina, 2004; Sumida Huaman, 2020). However, they fail to provide written or spoken resources in Quechua, to organize plans with precise goals, or to implement policies that encourage inclusion rather than segregation. Regarding the second group, recent years have seen the growth of datasets and of a limited number of machine learning techniques addressing Quechua speech synthesis, text translation, and speech recognition (Oncevay, 2021; Chen and Abdul-Mageed, 2022; Edgar et al., 2012; Ortega and Pillaipakkamnatt, 2018).
To address the task, we incorporate natural language processing (NLP) technologies for Quechua subtitling of cinematic content as educational tools that support the language's preservation. Our proposal first converts the video's dialogues into waveform representations. Then a speech recognition (SR) model based on Wav2Vec (Baevski et al., 2020) analyzes them to determine the content and convert it into written representations. The text is then transformed numerically with the SentencePiece tokenizer, and our proposed machine translation (MT) model, based on the Transformer-base architecture (Vaswani et al., 2017), translates each dialogue; the result is then applied to the subtitling of the original movie.
Our main contributions are as follows:
• We design an integrated cascade architecture for speech-to-text (S2T) translation between the Spanish and Quechua languages.
• We train a new machine translation model using the transfer learning method.
• We develop a desktop application to subtitle clips of movies to Quechua.
This paper is organized as follows. Section 2 analyzes the state of the art considered for this work. Section 3 first introduces the technologies used for the development of the proposed solution and then describes the whole process of deploying the software. Section 4 presents the experimental protocol, the results obtained, and the discussion. Finally, Section 5 concludes.
2 RELATED WORKS
The challenge of converting acoustic voice signals into text or even audio is intricate and multifaceted. The main method has been the cascaded model built upon neural machine translation (NMT) and automatic speech recognition (ASR) systems. As a consequence, for a long time, advancement in the literature has been driven by general advances in ASR and NMT models, as well as by a transition from the loosely coupled cascade's most fundamental form to one with a tighter coupling, although end-to-end trainable encoder-decoder models and other new modeling techniques have lately generated a number of changes in the field's perspective (Jia et al., 2019). Despite this, the literature's effectiveness and strength are correlated with dominant languages; the field has not been studied or developed enough for many indigenous languages.
In (Nayak et al., 2020), the authors take an application-focused approach to the field. Its application to film dubbing is discussed, and the alignment of source and target dialogue lengths suggested by another work is addressed. Its contribution lies in its encoding module, composed of attention modules that perform the temporal alignment of words with the time sequence of the video and recurrent networks that capture audio sequences. In our work, in contrast, we make use of the part of the speech recognition module that records the temporality of the dialogues and subsequently permits the subtitling. By doing this, we can completely ignore the video processing and concentrate on the audio.
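As an illustration of this design choice, the following is a minimal sketch of how timestamped dialogue segments produced by the recognizer could be rendered as SRT subtitle entries; the segment structure and the example values are assumptions for illustration, not the paper's exact implementation.

```python
# Hedged sketch: turn timestamped (translated) dialogue segments into SRT blocks.
# The segment format (text plus start/end seconds) is an illustrative assumption.
def to_srt_time(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[dict]) -> str:
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

# Example: one translated dialogue keeping the timing reported by the recognizer.
print(to_srt([{"start": 0.0, "end": 2.5, "text": "Allinllachu"}]))
```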
In (Tjandra et al., 2019), it is suggested to train speech-to-speech translation systems without language supervision using complex transformer networks. The suggestion alludes to works where attention-based networks bypass a phase of the traditional cascade to perform direct translations with fewer modifications and, thus, a greater awareness of non-lexical voice metadata. Their contribution concentrates on making such systems more effective while focusing on words that cannot be translated linguistically. For this, a transformer network that performs direct translation is trained using unsupervised discrete representations of the vocal signals of the two languages.
In (Kano et al., 2020), the authors examine end-to-end translation for languages with different syntactic structures. The experiment highlights the extra effort needed to translate words in a different order and how it affects the outcomes compared to pairs of syntactically related languages. The work's recommendation is to create an encoder-decoder architecture for ASR and NMT systems trained with curriculum learning, in order to take into account the syntactic distinctions between the two studied languages (en-jp). In our case, we apply transfer learning to a model that was previously trained on pairs of syntactically different languages.
In (Oncevay, 2021), the authors propose neural machine translation modules for indigenous languages of South America. The proposal covers the tokenization and translation procedure. Through testing and fine-tuning, SentencePiece and the Transformer-base model are utilized to identify the ideal parameters for many languages. They do not, however, account for the syntactic differences from the model that was previously trained on en-es. In our case, we employ a pretrained model for language pairs that are syntactically closer to es-quy.
3 CASCADE SPEECH
TRANSLATION FROM
SPANISH TO QUECHUA
3.1 Preliminary Concepts
We have reviewed speech translation (ST) systems in order to design the proposed solution. These systems use deep learning algorithms to take speech as a starting point and, depending on the final representation, perform speech-to-text (S2T) or speech-to-speech (S2S) translation. In recent years, the state of the art has alternated between reducing the number of cascade levels and adding paralinguistic information to the translation process. For instance, traditional ST models examined in (Baldridge, 2004) used to divide the translation task into speech recognition (SR), machine translation (MT), and text-to-speech (TTS), while newer approaches try to avoid one side of the text conversion (Tjandra et al., 2019) or even train the models only with parallel audio corpora (Jia et al., 2019). However, this is possible only for the most widely spoken languages. The least studied languages still need to improve each level of the cascade and, with even higher priority, increase data collection.
Deep Learning: The concept refers to models made up of several processing layers that learn representations of data at various levels of abstraction. With this technique, the state of the art in areas important for building voice translation models, such as speech recognition, machine translation, or speech synthesis, has greatly advanced. By employing the backpropagation technique to suggest changes to a machine's internal parameters, which are used to calculate the representation in each layer from the representation in the previous layer, deep learning may uncover detailed structures in massive data sets (LeCun et al., 2015).
Convolutional Neural Network (CNN): One or more pairs of convolutional and max-pooling layers make up a CNN. A convolution layer applies a series of filters that replicate themselves throughout the whole input space and process discrete local portions of the input. By extracting the maximum filter activation from various points inside a given window, a max-pooling layer produces a lower-resolution version of the convolution layer activations (Abdel-Hamid et al., 2012). This increases tolerance to small variations in the positions of an object's component parts and provides translation invariance. Higher layers handle more complex portions of the input using broader filters that operate on lower-resolution inputs, and the top fully connected layers aggregate inputs from all positions in order to classify the overall input.
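As a minimal sketch of the convolution and max-pooling pair described above, the following PyTorch snippet processes a spectrogram-like input; all shapes and hyperparameters are illustrative assumptions, not the values used in the paper.

```python
# Minimal sketch of a convolution + max-pooling pair over a spectrogram-like
# input of shape (batch, channels, frequency bands, time frames).
import torch
import torch.nn as nn

features = torch.randn(1, 1, 40, 100)          # e.g. 40 filterbank bands x 100 frames

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),  # shared local filters
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                # lower-resolution, shift-tolerant activations
)

out = conv_block(features)
print(out.shape)                                # torch.Size([1, 32, 20, 50])
```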
Speech Recognition: The idea alludes to AI’s ca-
pacity to identify speech-related vocalizations in hu-
man speech. Replicate voice signals as a string of
words using computational models, in other words.
Speech Recognition (SR) evolved and has been made
possible by the use of statistical and neural model-
ing techniques, which have made it possible to rec-
ognize and comprehend lexical speech data in a vari-
ety of languages and practical settings. The emphasis
of these systems has also been shifted to include fea-
tures of the speaker’s identity in addition to interpret-
ing speech content. In any case, Figure 1 illustrates
the general methodology used for speech recognition,
which starts with feature extraction by making use
Model for Real-Time Subtitling from Spanish to Quechua Based on Cascade Speech Translation
839
(a) A pair of CNN convolution layers and max-pooling layers are dis-
played in the diagram, where weights shared by all convolution layer
bands are indicated by the same line style (Abdel-Hamid et al., 2012).
(b) General Architecture of Speech recognition.
Figure 1: Architectures of CNN and Speech recognition.
of feature extraction methods such as Wavelets, etc.
Once they’re obtained, a decoder is employed to de-
tect the speech (Haridas et al., 2018)
Machine Translation: The phrase alludes to the capacity of computers to automatically translate human languages. For a long time it was based on simple rule systems; today there exist systems with probabilistic or neural models that are more effective thanks to new parallel and non-parallel data sources. As shown in Figure 2, one solution for machine translation is a neural encoder-decoder framework. The encoder network converts each input token of the source-language phrase into a low-dimensional real-valued vector (word embedding) and then encodes the sequence of vectors into distributed semantic representations, from which the decoder network generates the output token by token from left to right (Zhang and Zong, 2020).
Transformer Model: The Transformer is a model architecture that foregoes recurrence in favor of drawing global relationships between input and output via an attention mechanism; it allows substantially more parallelization and can reach new levels of translation quality after only a few hours of training. The Transformer is the first transduction model to generate representations of its input and output using only self-attention rather than sequence-aligned RNNs or convolution (Vaswani et al., 2017).

Figure 2 illustrates the encoder-decoder structure of most neural sequence transduction models: the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn), and given z, the decoder produces an output sequence of symbols (y1, ..., yn) one element at a time. This fundamental architecture is followed by the Transformer (see Figure 2), which utilizes stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
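To make this encoder-decoder mapping concrete, the minimal PyTorch sketch below passes a toy source sequence (x1, ..., xn) and a shifted target prefix through nn.Transformer and projects the result to target-vocabulary logits; all sizes are illustrative assumptions, and positional encodings and masking are omitted for brevity.

```python
# Minimal sketch of the encoder-decoder mapping described above.
import torch
import torch.nn as nn

d_model, vocab_src, vocab_tgt = 512, 32000, 32000         # illustrative sizes
src_embed = nn.Embedding(vocab_src, d_model)
tgt_embed = nn.Embedding(vocab_tgt, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             batch_first=True)
generator = nn.Linear(d_model, vocab_tgt)                  # projects to the target vocabulary

src = torch.randint(0, vocab_src, (1, 10))                 # token ids (x1, ..., xn)
tgt = torch.randint(0, vocab_tgt, (1, 7))                  # shifted target prefix
out = transformer(src_embed(src), tgt_embed(tgt))          # z is computed internally by the encoder
logits = generator(out)                                    # one distribution per target position
```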
3.2 Method
Our work’s key contribution is the cascade structur-
ing of NLP systems for speech-to-text (S2T) trans-
lation between Spanish and Quechua (shown in Fig-
ure 3) and the application of learning transfer to pre-
viously trained machine translation models. The first
was with the Spanish (sp) - Finnish (fi) language pair,
and the second was a multilingual model for trans-
lating from Italic languages(itc), Spanish included, to
Turkish (tr).
Cascade Architecture: The cascade approach
to Speech Translation (ST) is based on a pipeline
that concatenates an Automatic Speech Recognition
(ASR) system followed by a Machine Translation
(MT) system. This setup generates efficiency in solv-
ing morphological problems.
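A minimal sketch of this cascade, assuming the Hugging Face transformers library and publicly available checkpoints, is shown below; the model names are assumptions for illustration, and the fine-tuned es-quy model described later would replace the translation checkpoint.

```python
# Hedged sketch of the ASR -> MT cascade using transformers pipelines.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="jonatasgrosman/wav2vec2-large-xlsr-53-spanish")  # assumed public Spanish ASR checkpoint
mt = pipeline("translation",
              model="Helsinki-NLP/opus-mt-es-fi")  # placeholder parent model; the fine-tuned es->quy model would go here

def speech_to_quechua(audio_path: str) -> str:
    spanish_text = asr(audio_path)["text"]           # ASR stage: audio -> Spanish text
    return mt(spanish_text)[0]["translation_text"]   # MT stage: Spanish text -> Quechua text
```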
Speech Recognition: In this manner, we split the work, starting with a Spanish speech recognition module that converts voice inputs into a string of words while maintaining the lexical richness of the original inputs. We used a Spanish wav2vec supervised automatic speech recognition model for this task. As shown in Figure 3, the audio is converted into a two-dimensional waveform (a signal representative of the audio). Then, this source enters a convolutional network (CNN) to be converted into a latent representation. Finally, a supervised transformer network that forecasts a result using linear projections contextualizes the latent data.
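The following sketch illustrates this recognition stage with a wav2vec 2.0 checkpoint and greedy CTC decoding; the checkpoint name and the audio file are assumptions for illustration.

```python
# Hedged sketch of the recognition stage: waveform -> wav2vec 2.0
# (CNN feature encoder + transformer context network) -> CTC decoding.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_name = "jonatasgrosman/wav2vec2-large-xlsr-53-spanish"   # assumed Spanish checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

speech, sr = librosa.load("dialogue.wav", sr=16_000)           # waveform at the expected 16 kHz
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                            # contextualized frame-level predictions
ids = torch.argmax(logits, dim=-1)
transcript = processor.batch_decode(ids)[0]                    # greedy CTC decoding to Spanish text
print(transcript)
```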
Machine Translation: Once a dialogue has been recognized, it must be translated into Quechua. Therefore, an attention-based neural machine translation module is applied to the predicted text. SentencePiece is utilized to tokenize the Quechua datasets for this task. After this, we applied transfer learning to the Transformer-base model (Vaswani et al., 2017) with the default configuration in Marian NMT (Junczys-Dowmunt et al., 2018) to fine-tune the network, in order to obtain the contextual representation that is mapped to the most likely Quechua output. The fundamental principle of transfer learning is to use the parent model's parameters as the starting point for our proposal rather than beginning from scratch and randomly initializing the parameters (Zoph et al., 2016; Nguyen and Chiang, 2017).

(a) A neural machine translation encoder-decoder framework. The encoder converts the input sequences into distributed semantic representations, which the decoder uses to generate the output sequence (Zhang and Zong, 2020). (b) Transformer model architecture (Vaswani et al., 2017).
Figure 2: Architecture models.

Figure 3: Wav2Vec Architecture: Speech Recognition model.
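A minimal sketch of this transfer-learning setup, assuming a Helsinki-NLP parent checkpoint as the starting point, is shown below; the checkpoint name and example sentences are illustrative, and the Quechua reference is a placeholder.

```python
# Hedged sketch: load the parent Marian model's weights as the starting point
# (instead of random initialization) and compute a fine-tuning loss on one pair.
from transformers import MarianMTModel, MarianTokenizer

parent = "Helsinki-NLP/opus-mt-es-fi"                  # parent model pretrained on a related pair
tokenizer = MarianTokenizer.from_pretrained(parent)    # SentencePiece-based tokenizer
model = MarianMTModel.from_pretrained(parent)          # parameters reused as the starting point

inputs = tokenizer(["Hola, ¿cómo estás?"], return_tensors="pt", padding=True)
labels = tokenizer(text_target=["<quechua reference>"],   # placeholder target sentence
                   return_tensors="pt", padding=True).input_ids
loss = model(**inputs, labels=labels).loss             # fine-tuning minimizes this loss on es-quy pairs
```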
4 EXPERIMENTS
In this section we go over the experiments carried out for our project. The datasets were obtained from several domains and freely available sources. The public parallel corpus of Spanish (es) and Quechua Ayacucho (quy) offered by AmericasNLP (available at https://github.com/AmericasNLP/americasnlp2021) has proven to be the most pertinent (Agic and Vulic, 2019).
4.1 Experimental Protocol
So that the experiments can be reproduced by others, we describe the setup and procedures used.

Development Environment: The experiments were developed on our own computers, with an Intel Core i5 processor, 32 GB of RAM, and an NVIDIA GeForce RTX 3060 GPU. We used Python to create the models, along with PyTorch as the deep learning framework running on the NVIDIA GPU. A number of additional libraries were also used, including the sentencepiece tokenizer, the evaluate library for the evaluation metrics, and transformers for the transfer learning application.
Dataset: The datasets used for the experiments came from open sources. They are all parallel corpora for machine translation. The data consist of Spanish text that has been translated into the southern Quechua standard (Quechua Chanka), a variant spoken in several parts of Peru. The AmericasNLP Shared Task included the two primary training datasets. The first, JW300 (Agic and Vulic, 2019), is made up of Jehovah's Witnesses writings and is available in OPUS, while the second contains official dictionaries from the Ministry of Education and other sources. We also cleaned the datasets because they were noisy. Lines are filtered according to several heuristics: removal of URLs, of sentence or book-appendix numbering, of sentences that have more symbols or numbers than actual words, and of pairs where one side is more than five times longer or shorter than the other, among others. The result of this process is shown in Table 1.
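A minimal sketch of such cleaning heuristics is shown below; the regular expressions and thresholds are illustrative assumptions rather than the exact rules used.

```python
# Hedged sketch of the corpus-cleaning heuristics described above.
import re

URL = re.compile(r"https?://\S+|www\.\S+")
NUMBERING = re.compile(r"^\s*[\divxlc]+[.)]\s*", re.IGNORECASE)   # sentence/appendix numbering

def keep_pair(es: str, quy: str, max_ratio: float = 5.0) -> bool:
    """Decide whether a parallel sentence pair survives filtering."""
    for side in (es, quy):
        if URL.search(side):                                      # drop lines containing URLs
            return False
        words = re.findall(r"[^\W\d_]+", side, re.UNICODE)        # alphabetic tokens
        symbols = re.findall(r"[0-9]|[^\w\s]", side)               # digits and punctuation/symbols
        if len(symbols) > len(words):                             # more symbols/numbers than words
            return False
    len_es, len_quy = len(es.split()), len(quy.split())
    if min(len_es, len_quy) == 0:
        return False
    return max(len_es, len_quy) / min(len_es, len_quy) <= max_ratio   # length-ratio filter

def clean_line(text: str) -> str:
    """Strip URLs and leading numbering from a line that is kept."""
    return NUMBERING.sub("", URL.sub("", text)).strip()
```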
Figure 4: Proposed S2T Translation Architecture.

Table 1: Statistics and cleaning for all parallel corpora.
Corpus   S (orig.)   S (clean)   % clean
JW300    125,008     103,293     -17.4%
DICT       9,643       5,925     -38.6%

Table 2: Number of parallel sentences per split.
Corpus   Train    Test
JW300    92,964   10,329
DICT      5,333      593

Table 3: Testing Results.
Corpus   Model                 BLEU    chrF
JW300    (a) TL Bilingual      10.48   49.63
JW300    (b) TL Multilingual   12.78   51.20
JW300    (c) Baseline          10.04   44.60
DICT     (d) TL Bilingual       3.48   30.22
DICT     (e) Baseline           9.55   40.33

Models Training: All model experiments were run using PyTorch in a physical GPU environment with 32 GB of RAM, for 8 epochs using batches of 16 sentence pairs per dataset, with the Adam optimizer minimizing the loss value, taking an average of 80 minutes per epoch with the most extensive dataset. JW300 training typically took 9 hours, whereas DICT training only required 3 hours.
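A hedged sketch of a fine-tuning configuration consistent with this setup (8 epochs, batches of 16, Adam-based optimization, runs tracked with Wandb) is shown below; the remaining argument values and the prepared dataset objects are assumptions for illustration.

```python
# Hedged sketch of a training configuration matching the setup described above.
from transformers import (MarianMTModel, MarianTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

parent = "Helsinki-NLP/opus-mt-es-fi"                  # parent checkpoint, as in the earlier sketch
tokenizer = MarianTokenizer.from_pretrained(parent)
model = MarianMTModel.from_pretrained(parent)

args = Seq2SeqTrainingArguments(
    output_dir="es-quy-finetuned",
    num_train_epochs=8,                                # 8 epochs per dataset
    per_device_train_batch_size=16,                    # batches of 16 sentence pairs
    learning_rate=2e-5,                                # illustrative; AdamW is the default optimizer
    predict_with_generate=True,
    report_to="wandb",                                 # runs tracked with Wandb
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,                       # tokenized es-quy pairs, assumed prepared elsewhere
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```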
Evaluation: To qualify the performance of the models, we evaluated them with the BLEU (Papineni et al., 2002) and chrF (Popovic, 2015) scores. The goal is to reach the highest score to validate an efficient translation. We used Wandb to get a full review of the executions and their particular configurations.
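A minimal sketch of this evaluation step with the evaluate library, using the sacrebleu (BLEU) and chrF metrics, follows; the sentences are placeholders.

```python
# Hedged sketch of scoring model outputs with BLEU (sacrebleu) and chrF.
import evaluate

bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

predictions = ["<model output in Quechua>"]
references = [["<reference translation in Quechua>"]]   # one list of references per prediction

print(bleu.compute(predictions=predictions, references=references)["score"])
print(chrf.compute(predictions=predictions, references=references)["score"])
```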
Code Availability: The entire code and datasets for the proposed solution and experiments are publicly available at the following repository: https://github.com/Malvodio/CinemaSimi.
4.2 Results
To compare our results, we set as baseline the model with the best scores obtained in (Oncevay, 2021), because, to the best of our knowledge, it achieves the highest performance in evaluation metrics for Quechua machine translation and also uses the datasets provided by the AmericasNLP Shared Task. This baseline was built with 6 self-attention layers (Vaswani et al., 2017) and 8 heads, first pretrained with a Spanish-English parallel corpus and then fine-tuned with 10 indigenous language pairs to produce a multilingual MT model. For our models, we leverage publicly accessible pretrained models from Huggingface (Wolf et al., 2020) as provided by Helsinki-NLP (Tiedemann and Thottingal, 2020). The pretrained MT models released by Helsinki-NLP are trained on OPUS, an open-source parallel corpus (Tiedemann, 2012). Underlying these models is the Transformer architecture of the Marian-NMT framework implementation (Junczys-Dowmunt et al., 2018). Each model also has 6 self-attention layers in the encoder and decoder, and each layer has 8 attention heads. The models we specifically use were pretrained with OPUS Spanish-Finnish data and with data from Spanish and other Italic languages to Turkish. We chose these models because their source language is Spanish, so they have good Spanish subword embeddings. For the Quechua side, we followed the considerations of (Ortega and Pillaipakkamnatt, 2018), who point out that the task needs a similarly agglutinative language with a similar word order; in our investigation, Finnish and Turkish fit this requirement. Following recommended practices, we used uniform sampling on our datasets to mitigate the under-fitting caused by the low-resource condition of our language pair. For the experiments, each dataset was divided into 90% training data and 10% testing data. The volumes are shown in Table 2.
The results are shown in Table 3; as can be seen, our models obtain better scores than the selected baseline. We also include a graph that shows the reduction of the loss of the model during the training process (see Figure 5a), while Figures 5c and 5b show the gain in the BLEU and chrF scores by epoch during the validation process.

(a) Progression of loss reduction per epoch during training. (b) Evolution of the test chrF score per epoch. (c) Evolution of the test BLEU score per epoch.
Figure 5: Evolution of metrics per epoch.
4.3 Discussion
One of the most interesting outcome on the perfor-
mance of the models was the impact of the dataset.
The lower scores on the experiments over the DICT
dataset were not expected. The results show that, the
training end with a high underfitting. We wonder if
this was caused for the reduced number of sentences.
And obviuous the extremely lower resource state of
this dataset don’t let the model learn over a whole do-
main of the language. But we think that the reason
why the model can’t even learn in its same domain
could be the low average words per sentence, less than
a third of the JW300 average.
On the other hand, the translation scores obtained over the JW300 dataset with both proposed models, the multilingual (itc - tr) and the bilingual (es - fi), are better than those of the chosen base model, as shown in Table 3. These positive results could indicate that the strategy used to approach this work was successful: concentrating our efforts on fine-tuning existing solutions that have similarities with the addressed problem works well enough to sustain our hypothesis. Being aware of the limitations of Quechua and other indigenous languages allows us to optimize learning on models that have already effectively learned similar aspects of these languages. The base model also applies transfer learning, but between a pair of languages (es - en) that are syntactically and typologically distinct from Quechua; this approach might not let that model learn the nature of the language well. Another interesting point is the difference between the bilingual and multilingual model results over JW300. We think the lower result of the former is caused by the syntactic ordering type of these languages: both Finnish and Turkish are agglutinative languages, but only the latter has the same syntactic ordering as Quechua.
Quechua translation scores are still well below what is expected for solutions involving widely spoken languages. This keeps the task very challenging and highlights the need for a more complete and extensive corpus of Quechua. Finally, the difference between the BLEU and chrF scores lies in the way they measure translation quality: while the former is a word-level standard, the latter allows us to assess the character level, which is well worth considering given the agglutinative nature of Quechua.
5 CONCLUSION
We conclude that the results of our best model validate the proposed approach, even despite the "low resource" limitation of the language. Using a transfer learning strategy allowed us to retrain models that share similarities with the Quechua speech translation task instead of developing them from scratch. There is a chance to improve the tokenization with a cleaner corpus, and defining a strategy for out-of-domain values could also improve the results. In future works, we hope to be able to close the translation cycle by adding a speech synthesis module to the proposed architecture, so that we can obtain the dubbing of a movie. We also want to explore the translation of other Quechua variants, using robotic interfaces (Burga-Gutierrez et al., 2020) or generating text in Quechua (de Rivero et al., 2021).
REFERENCES
Abdel-Hamid, O., Mohamed, A., Jiang, H., and Penn, G.
(2012). Applying convolutional neural networks con-
cepts to hybrid NN-HMM model for speech recogni-
tion. In IEEE ICASSP.
Agic, Z. and Vulic, I. (2019). JW300: A wide-coverage
parallel corpus for low-resource languages. In ACL.
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020).
wav2vec 2.0: A framework for self-supervised learn-
ing of speech representations. In NeurIPS.
Baldridge, J. (2004). Verbmobil: Foundations of Speech-to-
Speech Translation. Nat. Lang. Eng., 10(2).
Burga-Gutierrez, E., Vasquez-Chauca, B., and Ugarte, W.
(2020). Comparative analysis of question answering
models for HRI tasks with NAO in spanish. In SIM-
Big.
Chen, W. and Abdul-Mageed, M. (2022). Improv-
ing neural machine translation of indigenous lan-
guages with multilingual transfer learning. CoRR,
abs/2205.06993.
de Rivero, M., Tirado, C., and Ugarte, W. (2021). For-
malstyler: GPT based model for formal style trans-
fer based on formality and meaning preservation. In
KDIR.
Edgar, J., Muñoz, V., Antonio, J., Tello, C., Alexander, R., and Castro Mamani, R. (2012). Let's speak quechua: The implementation of a text-to-speech system for the incas' language. In IberSPEECH.
Haridas, A. V., Marimuthu, R., and Sivakumar, V. G.
(2018). A critical review and analysis on techniques
of speech recognition: The road ahead. Int. J. Knowl.
Based Intell. Eng. Syst., 22(1).
Hornberger, N. H. and Coronel-Molina, S. M. (2004).
Quechua language shift, maintenance, and revitaliza-
tion in the andes: the case for language planning.
International Journal of the Sociology of Language,
2004(167).
Jia, Y., Weiss, R. J., Biadsy, F., Macherey, W., Johnson, M.,
Chen, Z., and Wu, Y. (2019). Direct speech-to-speech
translation with a sequence-to-sequence model. In
ISCA INTERSPEECH.
Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T.,
Hoang, H., Heafield, K., Neckermann, T., Seide, F.,
Germann, U., Aji, A. F., Bogoychev, N., Martins, A.
F. T., and Birch, A. (2018). Marian: Fast neural ma-
chine translation in C++. In ACL.
Kano, T., Sakti, S., and Nakamura, S. (2020). End-to-
end speech translation with transcoding by multi-task
learning for distant language pairs. IEEE ACM Trans.
Audio Speech Lang. Process., 28.
LeCun, Y., Bengio, Y., and Hinton, G. E. (2015). Deep
learning. Nat., 521(7553).
Mager, M., Gutierrez-Vasques, X., Sierra, G., and Meza-Ruíz, I. V. (2018). Challenges of language technologies for the indigenous languages of the americas. In ACL COLING.
Nayak, S., Baumann, T., Bhattacharya, S., Karakanta, A.,
Negri, M., and Turchi, M. (2020). See me speaking?
differentiating on whether words are spoken on screen
or off to optimize machine dubbing. In ACM ICMI
Companion.
Nguyen, T. Q. and Chiang, D. (2017). Transfer learning
across low-resource, related languages for neural ma-
chine translation. In AFNLP IJCNLP(2).
Nordhoff, S. and Hammarström, H. (2012). Glottolog/langdoc: Increasing the visibility of grey literature for low-density languages. In ELRA LREC.
Oncevay, A. (2021). Peru is multilingual, its machine trans-
lation should be too. In Workshop on NLP for Indige-
nous Languages of the Americas. ACL.
Ortega, J. and Pillaipakkamnatt, K. (2018). Using mor-
phemes from agglutinative languages like quechua
and finnish to aid in low-resource translation. In
AMTA LoResMT@AMTA.
Papineni, K., Roukos, S., Ward, T., and Zhu, W. (2002).
Bleu: a method for automatic evaluation of machine
translation. In ACL.
Popovic, M. (2015). chrf: character n-gram f-score for au-
tomatic MT evaluation. In ACL WMT@EMNLP.
Rios, A. (2011). Spell checking an agglutinative language:
Quechua. In 5th Language and Technology Confer-
ence: Human Language Technologies as a Challenge
for Computer Science and Linguistics. Fundacja Uni-
wersytetu im. A. Mickiewicza.
Sumida Huaman, E. (2020). Small indigenous schools: In-
digenous resurgence and education in the americas.
Anthropology & Education Quarterly, 51(3).
Tiedemann, J. (2012). Parallel data, tools and interfaces in
OPUS. In ELRA LREC.
Tiedemann, J. and Thottingal, S. (2020). OPUS-MT - build-
ing open translation services for the world. In EAMT.
Tjandra, A., Sakti, S., and Nakamura, S. (2019). Speech-
to-speech translation between untranscribed unknown
languages. In IEEE ASRU.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention is all you need. In NIPS.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue,
C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtow-
icz, M., Davison, J., Shleifer, S., von Platen, P., Ma,
C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger,
S., Drame, M., Lhoest, Q., and Rush, A. M. (2020).
Transformers: State-of-the-art natural language pro-
cessing. In ACL EMNLP (Demos).
Zhang, J. and Zong, C. (2020). Neural machine trans-
lation: Challenges, progress and future. CoRR,
abs/2004.05809.
Zoph, B., Yuret, D., May, J., and Knight, K. (2016). Trans-
fer learning for low-resource neural machine transla-
tion. In ACL EMNLP.