Speech Recognition for Indigenous Language Using Self-Supervised
Learning and Natural Language Processing
Satoshi Tamura, Tomohiro Hattori, Yusuke Kato and Naoki Noguchi
Gifu University, 1-1 Yanagido, Gifu, Japan
Keywords:
Speech Recognition, Self-Supervised Learning, Neural Machine Translation, Transformer, HuBERT,
Indigenous Language, Under-Resourced Language.
Abstract:
This paper proposes a new concept for building a speech recognition system for an indigenous under-resourced
language, by using a speech recognizer for a major language as well as neural machine translation and a
text autoencoder. Developing recognizers for minor languages suffers from the lack of training speech data.
Our method uses natural language processing techniques and text data to compensate for the lack of speech data.
We focus on a model based on self-supervised learning, and utilize its sub-module as a feature extractor. We
develop the recognizer sub-module for indigenous languages by building translation and autoencoder models.
We conduct evaluation experiments for every system and for our paradigm as a whole. We consequently find that our
scheme can build the recognizer successfully, and improve the performance compared to past work.
1 INTRODUCTION
Automatic Speech Recognition (ASR) is a technique
to transcribe human speech. In recent years, speech
recognition technology has been widely used in many
environments, improving work efficiency and enhancing
quality of life. Voice assistants and voice input are
often employed on smartphones and smart speakers.
ASR is also used to take minutes in online meetings
and on-site conferences. Speech translation based on
ASR is expected to make international communication
richer and easier. ASR has made remarkable progress
over the past decade, with the rapid development of
Deep Learning (DL). Many researchers have attempted
to build DL models, resulting in significant improvements
in recognition accuracy. Several languages with large
populations, such as English, Mandarin, and Spanish, now
have high-performance ASR technology; DL requires huge
data sets, and for these languages it is relatively easy
to collect them. Developers are also interested in providing
such techniques for languages spoken in emerging markets,
such as Hindi, Bahasa Indonesia, and so on.
It is known that there are more than 7,000 languages
on this planet. However, only a few languages have been
well studied for DL-based ASR, while most indigenous
languages, having small populations and used in limited
areas, have not yet been studied, due to the lack of
spoken data. To discuss this issue, UNESCO, ELRA
(European Language Resources Association) and several
societies jointly organized a conference on Language
Technologies for All (LT4All) in 2019 (UNESCO, 2019).
We need to strongly encourage researchers and developers
to build ASR systems for these minor languages, and in
order to do so, we should develop and improve DL
techniques in under-resourced conditions.
Recently, Self-Supervised Learning (SSL) has attracted
much attention in the DL field. SSL allows us to utilize
unlabeled or low-quality data, and to obtain effective
feature representations for subsequent tasks. In ASR
research, several SSL techniques such as HuBERT (Hsu
et al., 2021) and wav2vec 2.0 (Baevski et al., 2020)
are often employed. Focusing on the HuBERT model, we
have developed a speech recognition scheme for
under-resourced languages (Hattori and Tamura, 2023).
In our previous work, we first prepared an English ASR
system consisting of a pre-trained HuBERT and a shallow
DL model for recognition. We then applied fine-tuning
to the system using Japanese speech data, to obtain a
recognizer for Japanese.
This paper proposes a new paradigm of ASR for
indigenous languages, enhancing recognition performance.
In our work, we employ several Natural Language
Processing (NLP) techniques, namely Neural Machine
Translation (NMT) and a text autoencoder.
[Figure 1: Our proposed concept to build an ASR method for an indigenous language. Four panels: (a) English ASR: English speech signal → pre-trained HuBERT-based encoder (CNN encoder and transformer) → feature representation → transformer English decoder → English transcription. (b) Japanese-English NMT: Japanese text → transformer Japanese text encoder → feature representation → transformer English decoder → English text. (c) Japanese autoencoder: Japanese text → transformer Japanese encoder → feature representation → transformer Japanese decoder → Japanese text. (d) Japanese ASR: Japanese speech signal → pre-trained HuBERT-based encoder → feature representation → transformer Japanese decoder → Japanese transcription. Japanese serves as the indigenous language and English as the major language; bold indicates model training or fine-tuning.]
We assume that the HuBERT-based English ASR can be
divided into two sub-modules: a feature extraction
module based on HuBERT and a recognition module. By
using the recognition module as an English decoder and
adding a new Japanese encoder, we build a
Japanese-English NMT system. Subsequently, we take the
Japanese encoder and prepare a new Japanese decoder to
build a text autoencoder. Finally, we can develop a
Japanese ASR system by employing the HuBERT feature
extractor and the Japanese decoder. The novelty of this
work is that we can build an ASR system using little or
ideally no speech data of a target indigenous language,
thanks to state-of-the-art SSL and NLP technology. In
addition, using NLP techniques is expected to enhance
the semantic feature representation, improving the ASR
performance. As far as the authors know, there is no
other work on indigenous ASR incorporating HuBERT, NMT
and a text autoencoder.
The rest of this paper is organized as follows. Section 2 explains our concept in detail. Experiments are
reported in Section 3. Section 4 concludes this work.
2 METHODOLOGY
Figure 1 illustrates our proposed scheme to build an
ASR system for an indigenous language. In this paper,
we use Japanese as the indigenous minor language and
English as the major language. Though Japanese has a
large population, it is chosen in this work because we
can compare our scheme with conventional DL-based ASR
methods, as large-size Japanese corpora are available,
and the authors can easily conduct subjective evaluation
of the results.

Table 1: Training and fine-tuning settings. RAdam denotes the rectified Adam optimizer (Liu et al., 2019).

                  (a) E. ASR   (b) J-E NMT   (c) J. AE   (d) J. ASR
  # epochs        40           50            50          40
  Batch size      8            256           256         8
  Optimizer       RAdam        Adam          Adam        RAdam
  Learning rate   1e-4 (all models)
  Loss function   Cross entropy (all models)
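As a minimal sketch of the settings in Table 1 (the tiny stand-in model and the random batch are hypothetical placeholders, not the paper's actual model or data pipeline):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an ASR model; the real model is the
# HuBERT encoder plus transformer decoder described in Section 2.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 100))

optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)  # Table 1
criterion = nn.CrossEntropyLoss()                           # Table 1

for epoch in range(40):                      # 40 epochs for the ASR systems
    # Dummy batch: 8 utterances (batch size 8), 50 frames of
    # 80-dim features, with per-frame character targets.
    feats = torch.randn(8, 50, 80)
    targets = torch.randint(0, 100, (8, 50))
    logits = model(feats)                    # (batch, time, vocab)
    loss = criterion(logits.transpose(1, 2), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```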
(a) First of all, a DL-based SSL model is prepared,
trained using a large amount of speech data from
the major language. The model is used as a feature
extractor for speech data; we employ a pre-trained
English HuBERT model in this work. We then build an
English ASR system using this model followed by a
transformer-based decoder. We believe that a feature
vector extracted by the first part includes contextual
or semantic information, and that the second model can
act as a speech recognizer to generate English sentences.
(b) Suppose we have a text encoder for an indigenous
language, in this case Japanese. Using this encoder
and the decoder introduced above, we build an NMT
system from the indigenous language (Japanese) to the
major language (English). When training the system,
all the parameters in the English decoder are fixed,
while the model parameters in the Japanese encoder are
optimized using Japanese-English parallel text data.
Table 2: Examples of English ASR results in Figure 1 (a). Words in *asterisks* mark recognition errors (underlined in the original).

  Reference:  THE WHOLE CAMP WAS COLLECTED BEFORE A RUDE CABIN ON THE OUTER EDGE OF THE CLEARING
  Recognized: THE WHOLE CAMP WAS COLLECTED BEFORE A *ROT* CABIN ON THE OUTER EDGE OF THE CLEARING

  Reference:  INTO THE LAND BEYOND THE SYRIAN DESERT BUT NEITHER OF THEM DREAMED THAT THE SCATTERED AND DISUNITED TRIBES OF ARABIA WOULD EVER COMBINE OR BECOME A SERIOUS DANGER
  Recognized: INTO THE LAND BEYOND THE SYRIAN DESERT BUT *EITHER* OF THEM DREAMED THAT THE SCATTERED AND DISUNITED TRIBES OF ARABIA WOULD EVER COMBINE OR BECOME A SERIOUS DANGER
(c) We further introduce a new text decoder for the
indigenous language, and build a text autoencoder for
the minor language, consisting of the above encoder
and this decoder. We apply model training, using text
data written in the indigenous language, only to the
decoder.
(d) Finally, we use the HuBERT encoder as a feature
extractor together with the text decoder for the
indigenous language, to build an ASR system for the
minor language. English phonemes are said to largely
cover Japanese vowels and consonants; therefore, the
English feature extractor is expected to work for
Japanese speech data as well. Note that in this paper,
to improve the performance, we apply fine-tuning not
only to the decoder but also to a part of the encoder.
A minimal sketch of this module recombination appears
after this list.
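To make the recombination concrete, the following sketch shows how the four systems in Figure 1 share modules, written with generic PyTorch components; the layer counts, dimensions, and the `text_stack` helper are illustrative assumptions, not the exact implementation.

```python
import torch.nn as nn

D = 768  # shared feature dimension (HuBERT-base output size)

def text_stack(num_layers: int) -> nn.Module:
    """Stand-in for a transformer text encoder/decoder stack."""
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

hubert_encoder = nn.Identity()     # placeholder for the pre-trained HuBERT
english_decoder = text_stack(2)    # (a) trained as part of English ASR
japanese_encoder = text_stack(2)   # (b) trained for Japanese-English NMT
japanese_decoder = text_stack(2)   # (c) trained in the text autoencoder

# (b) Freeze the English decoder while training the Japanese encoder.
for p in english_decoder.parameters():
    p.requires_grad = False

# (d) The final Japanese ASR chains the HuBERT feature extractor
# with the Japanese text decoder obtained in (c).
japanese_asr = nn.Sequential(hubert_encoder, japanese_decoder)
```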
The advantage of this scheme is that, ideally, we do
not need any speech data for the indigenous language;
alternatively, a small amount of data may be sufficient
for fine-tuning to finalize the ASR model. As mentioned
above, it is hard to collect speech data for such a
minor language with a small population. In our scheme
we utilize a pre-trained SSL-based feature extractor,
originally built for a different language, because the
human speech production system is language independent.
Furthermore, for indigenous languages it is relatively
easier to collect text data than speech data; we can
obtain text data from official government documents,
textbooks, news sites and internet articles such as
Wikipedia.
3 EXPERIMENT
We conducted experiments to evaluate the effectiveness
of our proposed approach. First, we report preliminary
experimental results about training data size for an
indigenous language. Second, we examine our NMT and
autoencoder performance to check the Japanese encoder
and decoder. Finally, we evaluate our Japanese ASR.
Table 1 shows model training and fine-tuning settings
for the following experiments.
3.1 Preliminary Experiments
3.1.1 Machine Translation
In our previous work, we investigated the influence of
training data size and model complexity in NMT. We used
the MultiUN and Wikipedia data sets provided in OPUS
(Tiedemann, 2012) to obtain parallel sentences, and
chose German-English sentence pairs as a training data
set. Though German has a large population, in this
experiment German was treated as an indigenous language,
while English was a major language. We employed a
pre-trained NMT model provided by OpenNMT (Klein et al.,
2017), based on a tiny transformer in which the encoder
and decoder had six layers each. A transformer model was
then explored with different settings, such as the number
of layers in the encoder and decoder parts, and the
number of training sentences.
It turns out that, with a small data set, we can build
an NMT model that achieves roughly the same performance
as the pre-trained model by adjusting the
hyperparameters; the encoder for the indigenous language
should be made smaller to maintain translation
performance, while the decoder should remain large
because it directly affects performance. It is well
known that the larger the training data set becomes, the
better the NMT performance is. On the other hand, it is
sometimes hard to obtain larger data sets. According to
our preliminary results, in this work we decided to use
10,000 sentences in the following experiments, which is
quite small compared to the data sets used in existing
works.
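As an illustration of this asymmetric design (an assumption for exposition, not the exact OpenNMT configuration), a transformer with a small encoder and a larger decoder can be declared as:

```python
import torch.nn as nn

# Asymmetric NMT transformer: a small encoder for the
# under-resourced source language, a larger decoder for the
# major target language. Layer counts are illustrative.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=2,   # small encoder for the indigenous side
    num_decoder_layers=6,   # large decoder for the major language
    batch_first=True,
)
```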
3.1.2 English Speech Recognition
Next, we tested the English ASR shown in Figure 1 (a).
We adopted an English HuBERT model provided by Facebook,
which was trained using 960 hours of spoken data from
LibriSpeech (Panayotov et al., 2015). The HuBERT model
consisted of a CNN encoder and a 12-layer transformer.
As an English text decoder, we employed a two-layer
transformer, each layer having 12 attention heads. When
building the ASR system, we fixed the eight transformer
layers on the input side of the encoder, while
fine-tuning the remaining four layers on the output
side. The decoder was trained from scratch. We used the
LibriSpeech train-clean-100 data set to re-train the
model, and a one-hour subset of LibriLightLimited for
evaluation.
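As a minimal sketch of the partial freezing just described, assuming the HuggingFace `transformers` HuBERT implementation (the module paths and checkpoint name are our assumptions; the paper's own code may differ):

```python
from transformers import HubertModel

# Load a pre-trained English HuBERT (trained on 960 h of LibriSpeech).
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

# Freeze the CNN feature extractor and the first eight transformer
# layers; only the last four layers will be fine-tuned.
for p in hubert.feature_extractor.parameters():
    p.requires_grad = False
for layer in hubert.encoder.layers[:8]:
    for p in layer.parameters():
        p.requires_grad = False
```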
Table 3: Examples of input, reference and translated sentences in Figure 1 (b).

  Input:      かれにあいたいの
  Reference:  i want to see him
  Translated: i meet him

  Input:      わたしのむすこがちょうどぴあのをはじめたんです
  Reference:  my son started playing piano
  Translated: my son piano

  Input:      よかったらつかってください
  Reference:  please use it
  Translated: use
Table 4: Examples of Japanese autoencoder results in Figure 1 (c).

  Reference:     かれにあいたいの
  Reconstructed: かれはあううすす

  Reference:     わたしのむすこがちょうどぴあのをはじめたんです
  Reconstructed: わたしのののぴあのはちかかがはじめたです

  Reference:     よかったらつかってください
  Reconstructed: よかかかつかうういい
Table 2 shows recognition result examples. The English
ASR achieved a word error rate of 11.04%. We thus
confirmed that the feature encoder and the English text
decoder were properly prepared for the following
experiments.
3.2 Experiments in NLP
3.2.1 Japanese-English Machine Translation
First, we checked our machine translation model,
depicted in Figure 1 (b). We utilized the JESC corpus
(Pryzant et al., 2018), consisting of Japanese-English
subtitle pairs. From the corpus, we randomly selected
10,000 pairs for model training, 1,000 for validation,
and 1,000 for evaluation. As explained in Section 2,
only the Japanese encoder was trained, while the decoder
was derived from the English ASR system. The
architecture of the Japanese encoder was the same as
that of the English decoder. Note that our Japanese
encoder only accepted Japanese hiragana characters.
Japanese characters were first converted to IDs,
followed by an embedding process to obtain a
768-dimensional vector for each character; this size was
the same as the input/output size of the English
decoder.
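As a minimal sketch of this character embedding step (the tiny vocabulary and the `char_to_id` mapping are hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical hiragana vocabulary; index 0 is reserved for padding.
vocab = ["<pad>", "か", "れ", "に", "あ", "い", "た", "の"]
char_to_id = {ch: i for i, ch in enumerate(vocab)}

# Each character ID is mapped to a 768-dimensional vector, matching
# the input/output size of the English decoder.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=768)

ids = torch.tensor([[char_to_id[ch] for ch in "かれにあいたいの"]])
vectors = embedding(ids)          # shape: (1, 8, 768)
```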
Table 3 shows examples of translation results; the BLEU
score was 0.20. Although this is not sufficient, it can
be seen that our model correctly translates several
words that probably appeared in the training data set.
It is generally expected that words which did not appear
in the training data set can hardly be translated
correctly, because the model does not know those terms.
We conclude that the translation system was trained
successfully, and can translate Japanese sentences into
English to some extent.
3.2.2 Japanese Text Autoencoder
Second, we investigated our Japanese text autoencoder,
illustrated in Figure 1 (c). We used the same data set
as in the NMT experiment above; only the Japanese
sentences were used this time. The encoder was the same
as the Japanese encoder in the NMT system, and was fixed
throughout this experiment. The decoder had the same
model architecture, and was optimized using the Japanese
sentences.
Table 4 shows the results; the BLEU score was 0.24.
Similar to the NMT results, some words could be
reconstructed correctly, while other parts could not.
Although the output sentences seem inappropriate,
perhaps due to the lack of vocabulary in the data set,
it can be said that semantic information may still be
retained in our model.

In the NMT and autoencoder experiments, we observed
rather low BLEU performance, mainly due to the lack of
vocabulary. The scores certainly need to be improved;
however, it was unknown whether such systems could
contribute to our final goal of building a better ASR
system. We therefore moved on to the next ASR experiment
using the above NLP models.
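For reference, here is a minimal sketch of how sentence-level outputs like those in Tables 3 and 4 might be scored with BLEU on the 0-1 scale, assuming NLTK (the paper does not state its exact scoring tool, so this is an assumption):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Word-level references and hypotheses taken from Table 3 (illustrative).
references = [[["i", "want", "to", "see", "him"]],
              [["please", "use", "it"]]]
hypotheses = [["i", "meet", "him"],
              ["use"]]

# Smoothing avoids zero scores when higher-order n-grams are absent.
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")
```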
3.3 Experiments in ASR
3.3.1 Our Proposed Method
Finally, we evaluated our Japanese ASR system, shown in
Figure 1 (d). For fine-tuning, we prepared 5,880
Japanese spoken sentences from the Common Voice 7.0
Japanese data set (Ardila et al., 2019). As the
recognition model, the English HuBERT-based feature
extractor was chosen, together with the Japanese text
decoder. The whole decoder and the four layers on the
output side of the encoder were then fine-tuned. We also
prepared test data, i.e., 1,928 Japanese spoken
sentences, from the same data set.
Table 5: Examples of recognition results of our proposed method in Figure 1 (d).

  Reference:  がめんがこうこくだらけでみにくてしょうがない
              (The screen is full of ads, making it difficult to see.)
  Recognized: がめんがこうぽくだらけでみにくてしょうがない

  Reference:  こまっているひとはほっておけないせいかく
              (I have a personality that cannot leave people in need.)
  Recognized: こまっているひとはほうっておうけないせいかく

  Reference:  かれらはゆうびんはいたついんにわいろをわたしなんとかそのてがみをてにはいれました
              (They bribed the postman and managed to get the letter.)
  Recognized: かれらはゆうびんはいったついんにわいろをわたしなんとかそんてがめをてにいでました

  Reference:  おなじないようのちしきでもじょうしきとかがくとではありかたがちがっている
              (Even for the same content knowledge, common sense and science have different ways of being.)
  Recognized: おなじないようのちしきでもじょうしきとかがくとではありかたがちがっている
Table 6: Examples of recognition results of the competitive method.

  Reference:  りかはにがてだがどっぷらーこうかだけはおぼえてる
              (I'm not good at science, but I only remember the Doppler effect.)
  Recognized: でぃかーはにがてだがどっぷらーこうかだけはおぼえてる

  Reference:  なぜならさいこうのゆうじんはかれしかいないか
              (Because he is the best friend I have.)
  Recognized: なぜならさいこうのゆうじんはかでしかいないか

  Reference:  ふたりはたけーなといった
              (Two said it was expensive.)
  Recognized: ふたりはたけいなとをいった

  Reference:  どうめいのすーぱーでもおみせごとにしなぞろえにとくちょうがある
              (Even within the same chain of supermarkets, each store has its own unique selection of products.)
  Recognized: どうめいのすーぱーでもおみせごとにしなぞくちょうがある
Table 7: Character error rates of Japanese ASR systems.

  Model      CER [%]
  Proposed   20.18
  Baseline   27.97
Table 5 shows examples of Japanese recognition results.
The overall character error rate (CER) was 20.18%. The
recognition results seem acceptable; they are not
perfect, but in many cases the meaning can easily be
guessed.
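As a minimal sketch, CER is the Levenshtein (edit) distance between the recognized and reference character sequences, normalized by the reference length:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n] / m

# First example from Table 5: one substitution in 22 characters.
print(cer("がめんがこうこくだらけでみにくてしょうがない",
          "がめんがこうぽくだらけでみにくてしょうがない"))
```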
3.3.2 Competitive Baseline Method
For comparison, we also built another Japanese ASR model
as a baseline, similar to (Hattori and Tamura, 2023). We
prepared a model consisting of the same architecture as
our proposed scheme: the English HuBERT-based feature
extractor and the Japanese recognizer. As in our
previous work, we fine-tuned the four output-side layers
of the encoder, and trained the decoder from scratch.
Table 6 shows recognition examples. The baseline
obtained a character error rate of 27.97%, which means
that our scheme achieved a 7.79-point absolute
improvement, or a 27.85% relative error reduction
((27.97 − 20.18)/27.97 ≈ 0.2785), over the baseline.
Table 7 summarizes both error rates. The performance
still needs to be improved for practical use. However,
as mentioned, it is hard to collect speech data for a
minor language, and the lack of such data causes low ASR
accuracy. Our scheme could compensate for this by
employing NLP technology and text data. Consequently, we
believe these results demonstrate the effectiveness of
our proposed concept: building a better indigenous ASR
system using an SSL-based ASR system for a major
language together with NLP techniques.
4 CONCLUSION
This paper proposed a novel framework to build an ASR
system for under-resourced languages by utilizing an
SSL-based ASR system for a major language, as well as
NLP technology such as NMT and a text autoencoder.
First, we built an English ASR system, a
Japanese-English NMT system and a Japanese text
autoencoder, checked their performance, and confirmed
that they had been developed well. Second, we built a
Japanese ASR system using sub-modules of the above
systems. We conducted evaluation experiments, and found
that we could successfully develop the system with
acceptable accuracy.
As future work, we need to improve the models to achieve
higher BLEU scores using larger data sets. Investigating
the relationship between data size and performance is
also useful, since collecting a larger database for
indigenous languages requires higher costs. We will also
compare our scheme with state-of-the-art NLP and ASR
systems, so that we can know the upper limit of
performance, which is useful for future improvement.
Applying the proposed technique to real indigenous
languages is, of course, also included in our future
tasks.
REFERENCES
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., and Weber, G. (2019). Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.
Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
Hattori, T. and Tamura, S. (2023). Speech recognition for minority languages using HuBERT and model adaptation. In International Conference on Pattern Recognition Applications and Methods (ICPRAM), pages 350–355.
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M. (2017). OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. (2019). On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265.
Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Pryzant, R., Chung, Y., Jurafsky, D., and Britz, D. (2018). JESC: Japanese-English subtitle corpus. arXiv preprint arXiv:1710.10639.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In International Conference on Language Resources and Evaluation (LREC), pages 2214–2218.
UNESCO (2019). LT4All. https://lt4all.org/en/index.html.