A Blended Attention-CTC Network Architecture for Amharic Text-image Recognition

Birhanu Hailu Belay¹,³, Tewodros Habtegebrial, Marcus Liwicki, Gebeyehu Belay and Didier Stricker¹,⁴

Technical University of Kaiserslautern, Kaiserslautern, Germany

Lulea University of Technology, Lulea, Sweden

Bahir Dar Institute of Technology, Bahir Dar, Ethiopia

DFKI, Augmented Vision Department, Kaiserslautern, Germany

Keywords: Amharic Script, Blended Attention-CTC, BLSTM, CNN, Encoder-decoder, Network Architecture, OCR, Pattern Recognition.

Abstract:

In this paper, we propose a blended Attention-Connectionist Temporal Classification (CTC) network architecture for text-image recognition of a unique script, Amharic. Amharic is an indigenous Ethiopic script that uses 34 consonant characters, each with 7 vowel variants, and 50 labialized characters that are derived, with a small change, from the 34 consonant characters. The change involves modifying the structure of a character by adding a straight line, or by shortening and/or elongating one of its main legs, including the addition of small diacritics to the right, left, top or bottom of the character. Such a small change affects the orthographic identity of a character and results in shape similarity among characters, which makes the script an interesting but challenging task for OCR research. Motivated by the recent success of the attention mechanism in neural machine translation tasks, we propose an attention-based CTC approach that blends the attention mechanism directly within the CTC network. The proposed model consists of an encoder module, an attention module and a transcription module in a unified framework. The efficacy of the proposed model on the Amharic language shows that the attention mechanism allows learning powerful representations by integrating information from different time steps. Our method outperforms state-of-the-art methods and achieves character error rates of 1.04% and 0.93% on ADOCR test datasets.

1 INTRODUCTION

Amharic is an official working language of the Federal Democratic Republic of Ethiopia and is the second most widely spoken Semitic language in the world after Arabic. It is spoken by more than 100 million people in the country and is also widely spoken in other countries such as Eritrea, USA, Israel, Somalia and Djibouti (Meshesha and Jawahar, 2007; Amh, ; Mekuria and Mekuria, 2018).

In Amharic script, there are about 317 different characters, including 238 core characters, 50 labialized characters, 9 punctuation marks and 20 numerals, which are written and read, like English, from left to right (Meshesha and Jawahar, 2007; Belay et al., 2019a; Belay et al., 2019b). There are many documents, containing religious and academic content, written in Amharic script and dating back to the 12th century (Meyer, 2006). These documents have been stored in different places, such as Ethiopian Orthodox Tewahdo Churches and public and academic libraries, in the form of hardcover books. Through a digitization campaign, many of these manuscripts have been collected from different sources. However, they are still preserved only in manual catalogs and/or as scanned copies in microfilm format (Wion, 2006).

The shape and structural formation of sample basic Amharic characters, with their unique features, are depicted in Figure 1.

Numerous works in the areas of Optical Character Recognition (OCR) and Document Image Analysis (DIA) have been carried out and widely used for decades to digitize various historical and modern documents (Breuel et al., 2013; Maitra et al., 2015; Mondal et al., 2017; Martínek et al., 2020). Researchers have achieved high recognition accuracy, and most scripts now have commercial off-the-shelf OCR applications. However, OCR often gives good recognition results only for specific use cases. Moreover, there are multiple indigenous scripts, like Amharic, which are underrepresented in the areas of Natural Language Processing (NLP) and DIA (Belay et al., 2019b).

Belay, B., Habtegebrial, T., Liwicki, M., Belay, G. and Stricker, D.
A Blended Attention-CTC Network Architecture for Amharic Text-image Recognition.
DOI: 10.5220/0010284204350441
In Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2021), pages 435-441
ISBN: 978-989-758-486-2

Figure 1: Shape formation of sample basic Amharic characters (Belay et al., 2020), showing the orders of consonant-vowel variants (34 × 7). Characters in the first column are consonants and the others are derived variants. Vowels are derived by adding diacritics and/or removing part of the consonant, and the orthographic identity of each character varies across rows, as marked in red.

Even though OCR research for Amharic script started in 1997 (Alemu, 1997), it is still in its infancy and remains an open area of research. Since then, attempts have been made to develop Amharic OCR (Meshesha and Jawahar, 2007; Alemu, 1997; Cowell and Hussain, 2003; Assabie and Bigun, 2009) using different statistical machine learning techniques. Recently, following the success of deep learning, other attempts have also been made to develop models for Amharic OCR and have achieved relatively promising results (Belay et al., 2019a; Belay et al., 2019b; Belay et al., 2018; Reta et al., 2018; Gondere et al., 2019).

In the literature, attempts at Amharic OCR have neither shown results on a large dataset nor considered all possible characters used in the Amharic writing system. A recently published work (Belay et al., 2019b) introduced an Amharic OCR database called ADOCR. We took a sample text-line image from the ADOCR database; its word formation and the character arrangement in a sample word are illustrated in Figure 2.

[Figure 2 labels: Amharic characters (ድ, ረ, ለ, ማ, ግ); corresponding English letters.]

Figure 2: Sample Amharic text-line image from the dataset. The word marked by the red box is composed of five individual Amharic characters, and the corresponding sound of each character is described with English letters in red.

Convolutional and recurrent networks have been used in Amharic OCR. In this paper, we aim to push the limits of Amharic OCR models by introducing an attention mechanism. We therefore propose an attention-based CTC network called Blended Attention CTC (BACTC) for Amharic text-line image recognition. BACTC has CNN-LSTM layers as an encoder, followed by attention and CTC layers, which are used to select the most important features of the encoded inputs and to perform transcription, respectively.

The rest of the paper is organized as follows. Section 2 reviews related work. Our proposed model is presented in Section 3, and Section 4 presents the details of the datasets. The last two sections present the experimental results and conclusions, respectively.

2 RELATED WORK

Previous research on Amharic OCR focused on statistical machine learning techniques. Most of these techniques are segmentation-based, character-level OCR models (Meshesha and Jawahar, 2007; Belay et al., 2019a; Cowell and Hussain, 2003; Belay et al., 2018). The only exception is the work of Assabie (Assabie and Bigun, 2009), who proposed a segmentation-free OCR based on an HMM model for offline handwritten Amharic word recognition. Recently published works (Addis et al., 2018) and (Belay et al., 2019b) proposed a Bidirectional LSTM (BLSTM) network architecture with CTC for Amharic text-line image recognition. An end-to-end learning approach that uses CNN, LSTM and CTC in a unified framework (Belay et al., 2020) has also been proposed for Amharic OCR and achieved better recognition performance.

Following the first Amharic OCR research, attempted by Worku in 1997 (Alemu, 1997), which was only able to recognize characters written with the Washera font at 12 point type, other research works have been carried out, including typewritten (Teferi, 1999), machine-printed (Meshesha and Jawahar, 2007; Belay et al., 2018), Amharic document image recognition and retrieval (Meshesha, 2008), Ethiopic number (Reta et al., 2018) and handwritten (Gondere et al., 2019) recognition.

Based on recurrent neural networks, several OCR techniques have been studied and have demonstrated groundbreaking performance for multiple Latin and non-Latin scripts. Examples include BLSTM with CTC for Amharic text-image recognition (Belay et al., 2019b), a Convolutional Recurrent Neural Network (CRNN) for Japanese handwritten recognition (Ly et al., 2017), segmentation-free Chinese handwritten text recognition (Messina and Louradour, 2015), a hybrid Convolutional-LSTM for text-image recognition (Breuel, 2017), Multidimensional LSTM for Chinese handwritten recognition (Wu et al., 2017), and Connectionist Temporal Classification (CTC) combined with Bidirectional LSTM for unconstrained online handwriting recognition (Graves et al., 2008).

Attention-based networks have been intensively applied to NLP tasks and have produced successful results in neural machine translation (Bahdanau et al., 2014; Luong et al., 2015; Ghader and Monz, 2017) and speech recognition (Das et al., 2019; Watanabe et al., 2017). Most work with the attention mechanism so far has focused on neural machine translation. However, researchers have recently applied attention in other research areas, and it has become a popular choice for many researchers in the area of OCR.

The attention mechanism is now also widely applied to recognizing handwritten text (Poulos and Valle, 2017; Chowdhury and Vig, 2018), characters in the wild (Lee and Osindero, 2016; Huang et al., 2019; Huang et al., 2016), and handwritten mathematical expressions (Zhang et al., 2018; Li et al., 2020). The attention mechanism assists the network in learning the correct alignment between the input image pixels and the target characters. In addition, it improves the ability of the network to extract the most relevant features for each part of the output sequence. Inspired by the success of attention in sequence-to-sequence translation, we continue to focus on OCR tasks and integrate the capability of the attention mechanism into the CTC network so as to utilize the benefits of both techniques.

3 OVERVIEW OF THE PROPOSED MODEL

The proposed blended attention-CTC model, as shown in Figure 3, consists of three modules. The encoder module takes the input features $x$ and maps them to a higher-level feature representation $h^{enc} = (h_1^{enc}, \ldots, h_T^{enc})$. The attention module takes the output features $h^{enc}$ of the encoder module and computes the context vector from the hidden features. The output of the attention module, the attention context information, is passed to the Soft-max layer in order to produce a probability distribution, $P(y_1, \ldots, y_{t-1} \mid x)$, over the given input sequence $x$. Finally, the probability distribution $P$ goes to the CTC-decoder for transcription.

The proposed method integrates attention mechanism, LSTM and CTC layers. The intuition for the use of the attention layer is to infer a more powerful hidden representation through a weighted context vector ($C_{vec}$). The attention-based weighting offers a powerful way to aggregate inputs from different time steps. The weighted context vector is computed using Equation (1). The training objective of the proposed model follows the same CTC training objective explained in (Graves et al., 2008).

[Figure 3 diagram labels: Input text-image, CNN-layers, Encoder-LSTM, Attention layer, softmax-1, Softmax-2, CTC-decoder, Output text (የአያቴን ታሪኮች ይዤላችሁ).]

Figure 3: The proposed blended attention-CTC model. The alignment score ($a_i$) of the encoder hidden state ($h_i^{enc}$) at each time step is computed using the scoring function described in (3), by propagating $h_i^{enc}$ through a fully connected network (FC). The attention distribution, called attention weights ($s_i$), is computed by running all alignment scores through a soft-max (Soft-max-1) layer. To compute the alignment vector, we multiply each $h_i^{enc}$ by its corresponding soft-maxed score ($s_i$). The sum of all alignment vectors then produces the context vector ($C_{vec}$), which carries the aggregated information. Once $C_{vec}$ is obtained, it passes through the second soft-max layer (Soft-max-2) to produce a probability distribution over the n possible characters in the ground-truth (GT). The output of Soft-max-2 is a sequence of T time steps over (n + 1) characters, which is then decoded using the CTC-decoder.

$$C_{vec} = \sum_{i=1}^{T} s_i \, h_i^{enc} \qquad (1)$$

where $s_i$ is the attention weight of each annotation $h_i^{enc}$, computed by soft-maxing its corresponding attention score using Equation (2),

$$s_i = \frac{\exp(a_i)}{\sum_{k=1}^{T} \exp(a_k)} \qquad (2)$$

where $a_i$ is the alignment score of $h_i^{enc}$ at each time step, and it can be computed using Equation (3).

$$a_i = f(h_i^{enc}), \quad \text{for } i = 1, \ldots, T \qquad (3)$$

The function $f$ in Equation (3) is a feed-forward neural network with a tanh activation function. The intuition behind this scoring function is to let the model learn the alignment weights together with the translation while training all layers of the model.
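The computation in Equations (1) to (3) can be sketched in a few lines of plain Python. This is only an illustration, not the authors' Keras implementation: the scoring network is assumed here to be a single tanh unit with hypothetical weights `w` and bias `b`, and the encoder states `h_enc` are made-up numbers.

```python
import math

def alignment_score(h, w, b):
    """Eq. (3): a_i = f(h_i), with f assumed to be one tanh unit."""
    return math.tanh(sum(wj * hj for wj, hj in zip(w, h)) + b)

def softmax(scores):
    """Eq. (2): normalise the alignment scores into attention weights."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(h_enc, w, b):
    """Eq. (1): weighted sum of the encoder states."""
    scores = [alignment_score(h, w, b) for h in h_enc]
    weights = softmax(scores)
    dim = len(h_enc[0])
    c_vec = [sum(s * h[d] for s, h in zip(weights, h_enc)) for d in range(dim)]
    return weights, c_vec

# Three hypothetical encoder states of dimension 2.
h_enc = [[0.1, 0.4], [0.9, -0.2], [0.3, 0.3]]
w, b = [0.5, -1.0], 0.0  # hypothetical scoring parameters
weights, c_vec = context_vector(h_enc, w, b)
print(weights, c_vec)
```

The attention weights sum to one, so each component of the context vector is a convex combination of the corresponding components of the encoder states.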

During training of the blended attention-CTC model, a CTC loss function ($l_{CTC}$) is used to train the network end-to-end. For training data $D$, the function $l_{CTC}$ is defined as in Equation (4).

$$l_{CTC} = -\log \prod_{(x,z) \in D} p(z \mid x) \qquad (4)$$

where $x = (x_1, \ldots, x_T)$ is the input sequence with length $T$, and $z = (z_1, \ldots, z_C)$ is the corresponding output sequence in the ground truth, with $C < T$ in every pair of $x$ and $z$. The probability $p(z \mid x)$ is obtained from the paths $\pi$ that contain the output labels, where the probability of a path is computed by multiplying the probabilities of its labels over all time steps $t$, as shown in Equation (5).

$$p(\pi \mid x) = \prod_{t=1}^{T} p(\pi_t, t \mid x) \qquad (5)$$

where $t$ is the time step and $\pi_t$ is the label of path $\pi$ at $t$. A target label sequence is obtained from a path $\pi$ by the mapping reduction function $B$, illustrated with an example in (Belay et al., 2019b), which converts the sequence of Soft-max outputs for each frame to a label sequence by removing repeated labels and blank tokens from the sequence of characters ($C$) with the highest score ($i$) generated using Equation (6).
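As a small illustration of the path probability in Equation (5) and the reduction function B, the sketch below enumerates all paths for a toy Soft-max output by brute force. It follows the standard CTC formulation of (Graves et al., 2008), in which p(z|x) is the sum of p(π|x) over every path that B maps to z. The blank index `BLANK = 0` and the probability table `probs` are assumptions made for this example, and practical implementations use the forward-backward algorithm rather than enumeration.

```python
import math
from itertools import product

BLANK = 0  # index assumed for the CTC blank token in this sketch

def collapse(path, blank=BLANK):
    """Mapping B: merge repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return tuple(out)

def path_probability(path, probs):
    """Eq. (5): product over time steps of the path's label probabilities."""
    p = 1.0
    for t, label in enumerate(path):
        p *= probs[t][label]
    return p

def label_probability(target, probs):
    """p(z|x): sum of p(pi|x) over every path pi with B(pi) = z."""
    n_labels = len(probs[0])
    total = 0.0
    for path in product(range(n_labels), repeat=len(probs)):
        if collapse(path) == tuple(target):
            total += path_probability(path, probs)
    return total

# Toy Soft-max output: T = 3 time steps over (blank, label 1, label 2).
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.7, 0.1],
         [0.5, 0.1, 0.4]]

z = (1, 2)
p_z = label_probability(z, probs)
loss = -math.log(p_z)  # Eq. (4) for a single training pair
print(p_z, loss)
```

Note that `collapse((1, 0, 1))` yields `(1, 1)`: a blank between two identical labels keeps them separate, which is how CTC represents doubled characters.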

4 DATASET

To the best of our knowledge, the ADOCR database (Belay et al., 2019b) is the only publicly available Amharic OCR dataset with benchmark experimental results. Therefore, to train and evaluate the proposed model, we use the ADOCR database, which is freely available at http://www.dfki.uni-kl.de/~belay/.

The original Amharic OCR database is composed of 337,337 Amharic text-line images, each with multiple word instances, collected from different sources. The database contains 40,929 printed text-line images written with the Power Geez font, and 197,484 and 98,924 text-line images synthetically generated with the Power Geez and Visual Geez fonts, respectively.

In addition, the ADOCR database contains 280 unique Amharic characters and punctuation marks, all of which occur in both the training and test samples. All images are 48 by 128 pixel gray-scale images, and the maximum string length of the ground-truth text is 32 characters.

The images are then normalized to 32 by 128 pixels, as stated in (Belay et al., 2020). Sample Amharic text-line images from the ADOCR database are given in Figure 4.

Figure 4: Sample Amharic text-line images from the ADOCR database: (a) Printed text-line images written with the Power Geez font. (b) Synthetic text-line images generated with the Power Geez font. (c) Synthetic text-line images generated with the Visual Geez font.

5 EXPERIMENTS

We train and test our model on the ADOCR database

which contains 318,706 training and 18,631 test sam-

ples. Training details and experimental results are

presented below.

5.1 Training

Our model is trained with the ADOCR database. Similar to (Belay et al., 2020), images are scaled to 32 by 128 pixels to reduce computation. Since the ADOCR dataset has no explicitly stated validation data, we randomly selected 7% of the training samples for validation. In this study, we first implement and train the attention-based encoder-decoder network proposed by Bahdanau (Bahdanau et al., 2014), and then formulate the blended attention-CTC network. In the latter model, the main contribution of this paper, we directly take advantage of the attention mechanism and the CTC network as an integrated framework trained in an end-to-end fashion.

When training both models, we use two bidirectional LSTMs, each with 128 hidden units and a dropout rate of 0.25, on top of seven convolutional layers that are stacked serially with the ReLU activation function, as an encoder. In the attention-based encoder-decoder model, the decoder has a unidirectional LSTM with 128 hidden units, while in the blended attention-CTC approach, the decoder LSTM is removed from the previous model and the CTC objective function, blended with the attention mechanism, is put in its place.

Figure 5: Training and validation losses of model training with different network settings: (a) CTC loss with an LSTM-CTC model (Belay et al., 2019b). (b) CTC loss with a CNN-LSTM-CTC model (Belay et al., 2020). (c) CE loss of the attention-based encoder-decoder model. (d) CTC loss of the proposed blended attention-CTC model.

The convolutional part of the encoder module is composed of seven convolutional layers, which have a kernel size of 3 × 3, except for the top layer, which has a 2 × 2 kernel, and four max-pooling layers with a pool size of 2 for the first pooling layer and 2 × 1 for the remaining pooling layers. Strides are fixed to one, and 'same' padding is used in all convolutional layers. The numbers of feature maps are 64, 128, 256, 256, 512, 512, 512 from the bottom to the top layer.
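To see what these pooling sizes imply for the feature-map size, the following sketch traces a 32 by 128 input through the four pooling layers. It rests on assumptions the text does not state: the first pool size of 2 is read as 2 × 2, the 2 × 1 pools are read as (height, width), and each pooling layer uses a stride equal to its pool size. Because the convolutions use stride one and 'same' padding, they leave the spatial size unchanged and are omitted here.

```python
def pooled_shape(height, width, pools):
    """Apply non-overlapping max-pooling sizes (pool_h, pool_w) in order."""
    for pool_h, pool_w in pools:
        height //= pool_h
        width //= pool_w
    return height, width

# Assumed reading of the pooling configuration described above.
pools = [(2, 2), (2, 1), (2, 1), (2, 1)]
final_shape = pooled_shape(32, 128, pools)
print(final_shape)  # (2, 64)
```

Under these assumptions the width, which runs along the text line, is halved only once, so the encoder keeps a fine horizontal resolution for the recurrent layers.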

We use a batch size of 128 with the Adam optimizer, and the two models, Attention-based Encoder-Decoder (AED) and BACTC, are trained for 10 and 15 epochs respectively. The attention-based encoder-decoder model minimizes a categorical cross-entropy (CE) loss, while the blended attention-CTC model minimizes the CTC-based loss described in Equation (4). Once the label probabilities are obtained from the trained models, we use best path decoding (Graves, 2008) to generate the character ($C_t$) that has the maximum score at each time step $t$.

$$C_t = \arg\max_{i} \; p(i, t \mid x), \quad \text{for } t = 1, 2, \ldots, T \qquad (6)$$
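Best path decoding can be sketched as follows: take the highest-scoring label at each time step, as in Equation (6), then merge repeats and drop blanks. The blank index, the four-symbol alphabet and the probability frames below are hypothetical values chosen for illustration, not outputs of the trained model.

```python
BLANK = 0  # index assumed for the CTC blank token in this sketch

def best_path_decode(probs, alphabet, blank=BLANK):
    """Greedy CTC decoding over a list of per-frame probability lists."""
    # Eq. (6): index of the maximum score at each time step.
    best = [frame.index(max(frame)) for frame in probs]
    decoded = []
    prev = None
    for idx in best:
        # Keep a label only if it is not a repeat and not the blank.
        if idx != prev and idx != blank:
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)

alphabet = ["", "ሰ", "ላ", "ም"]  # position 0 stands for the blank
probs = [
    [0.10, 0.70, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.20, 0.60],
    [0.30, 0.10, 0.10, 0.50],
]
text = best_path_decode(probs, alphabet)
print(text)  # ሰላም
```

Here the per-frame argmax sequence is 1, 1, 0, 2, 3, 3, which collapses to the three characters of the output.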

We implement both models with the Keras Application Program Interface (API) on a TensorFlow backend. The training and validation losses of the proposed model and of other models trained with different network settings are depicted in Figure 5.

5.2 Results

The performance of the proposed blended attention-CTC model is evaluated on three test datasets and compared with state-of-the-art approaches. Table 1 presents the details of the experimental results in terms of Character Error Rate (CER). The proposed model improves the recognition performance by 0.67–7.50% over the original paper (Belay et al., 2019b) and by 0.16–0.52% over the recently published paper that uses the same test datasets (Belay et al., 2020). CER is computed, using Equation (7), by counting the number of characters inserted, substituted, and deleted in each sequence and then dividing by the total number of characters in the ground truth (Belay et al., 2019b).

$$CER(P, T) = \left( \frac{1}{q} \sum_{n \in P,\, m \in T} D(n, m) \right) \times 100 \qquad (7)$$

where q is the total number of target character labels

in the ground truth, P and T are the predicted and

ground-truth labels, and D(n,m) is the edit distance

between sequences n and m.
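A minimal sketch of this metric is given below, assuming that each predicted sequence is paired with its own ground-truth sequence and that D is the Levenshtein distance. The function names and the sample strings are illustrative only.

```python
def edit_distance(pred, truth):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(truth) + 1))
    for i, pc in enumerate(pred, start=1):
        curr = [i]
        for j, tc in enumerate(truth, start=1):
            cost = 0 if pc == tc else 1
            curr.append(min(prev[j] + 1,          # delete from the prediction
                            curr[j - 1] + 1,      # insert into the prediction
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(predictions, truths):
    """Eq. (7): total edit distance over total ground-truth length, in percent."""
    total_edits = sum(edit_distance(p, t) for p, t in zip(predictions, truths))
    q = sum(len(t) for t in truths)
    return 100.0 * total_edits / q

predictions = ["abcd", "xyz"]
truths = ["abed", "xyz"]
print(round(cer(predictions, truths), 2))  # 14.29
```

In this example one substitution occurs among seven ground-truth characters, which gives a CER of about 14.29%.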

Figure 6: Sample text-line image with the corresponding predicted and ground-truth texts. Characters in the predicted text marked with a red rectangle are wrongly predicted by both the Attention-based Encoder-Decoder (AED) and the Blended Attention-CTC (BACTC) models.

We also implemented the attention-based encoder-decoder model, without the CTC network, and evaluated its performance on the three ADOCR test datasets. The blended attention-CTC model outperforms the attention-based encoder-decoder model by 24.09%.


Table 1: Comparison of test results (CER).

Method                        #test-set   Image-type   Font-type     CER (%)
Addis (Addis et al., 2018)*   12 pages    Printed      -             2.12
Belay (Belay et al., 2019b)   2,907       Printed      Power Geez    8.54
Belay (Belay et al., 2019b)   9,245       Synthetic    Power Geez    4.24
Belay (Belay et al., 2019b)   6,479       Synthetic    Visual Geez   2.28
Belay (Belay et al., 2020)    2,907       Printed      Power Geez    1.56
Belay (Belay et al., 2020)    9,245       Synthetic    Power Geez    3.73
Belay (Belay et al., 2020)    6,479       Synthetic    Visual Geez   1.05
Ours                          2,907       Printed      Power Geez    1.04
Ours                          9,245       Synthetic    Power Geez    3.57
Ours                          6,479       Synthetic    Visual Geez   0.93

* Denotes methods tested on different datasets.

According to our empirical results, the performance of the attention-based encoder-decoder model implemented without CTC degrades as the sequence length increases. In most cases, the first 4 to 6 characters are correctly predicted, while the remaining errors show no pattern. Such character errors are not observed in the blended attention-CTC model. In summary, the proposed blended attention-CTC model outperforms all the state-of-the-art models on the ADOCR test datasets.

6 CONCLUSION

In this paper, we have introduced a blended attention-CTC network called BACTC for Amharic text-line image recognition. The proposed method consists of a Bidirectional LSTM, stacked on top of CNN layers, as an encoder and a CTC layer as a decoder. To enhance the hidden layer feature representation, the attention mechanism is embedded between the LSTM and CTC network layers without changing the CTC objective function or the training process. The encoder, attention and CTC modules are all trained jointly, end-to-end.

We evaluated our model on both synthetically generated and printed Amharic text-line images, and a significant improvement is achieved on all three ADOCR test datasets compared with state-of-the-art model results. Thus, we can conclude that the blended attention-CTC network is more effective for Amharic text-image recognition than the widely used attention-based encoder-decoder and CNN-LSTM-CTC based networks. This work can potentially be extended and applied to handwritten Amharic text recognition.

REFERENCES

Will Amharic be AU's lingua franca? https://www.press.et/english/?p=2654#. Accessed: 2020-01-14.

Addis, D., Liu, C.-M., and Ta, V.-D. (2018). Printed Ethiopic script recognition by using LSTM networks. In 2018 International Conference on System Science and Engineering (ICSSE), pages 1–6. IEEE.

Alemu, W. (1997). The application of OCR techniques to the Amharic script. An MSc thesis at Addis Ababa University Faculty of Informatics.

Assabie, Y. and Bigun, J. (2009). HMM-based handwritten Amharic word recognition with feature concatenation. In 2009 10th International Conference on Document Analysis and Recognition, pages 961–965. IEEE.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Belay, B., Habtegebrial, T., Liwicki, M., Belay, G., and Stricker, D. (2019a). Factored convolutional neural network for Amharic character image recognition. In 2019 IEEE International Conference on Image Processing (ICIP), pages 2906–2910. IEEE.

Belay, B., Habtegebrial, T., Liwicki, M., Gebeyehu, and Stricker, D. (2019b). Amharic text image recognition: Database, algorithm, and analysis. In 2019 The 15th International Conference on Document Analysis and Recognition (ICDAR 2019). IEEE.

Belay, B., Habtegebrial, T., Meshesha, M., Liwicki, M., Belay, G., and Stricker, D. (2020). Amharic OCR: An end-to-end learning. Applied Sciences, 10(3):1117.

Belay, B., Habtegebrial, T., and Stricker, D. (2018). Amharic character image recognition. In 2018 IEEE 18th International Conference on Communication Technology (ICCT), pages 1179–1182. IEEE.

Breuel, T. M. (2017). High performance text recognition using a hybrid convolutional-LSTM implementation. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 11–16. IEEE.

Breuel, T. M., Ul-Hasan, A., Al-Azawi, M. A., and Shafait, F. (2013). High-performance OCR for printed English and Fraktur using LSTM networks. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 683–687. IEEE.

Chowdhury, A. and Vig, L. (2018). An efficient end-to-end neural model for handwritten text recognition. arXiv preprint arXiv:1807.07965.

Cowell, J. and Hussain, F. (2003). Amharic character recognition using a fast signature based algorithm. In Information Visualization, 2003. IV 2003. Proceedings. Seventh International Conference on, pages 384–389. IEEE.

Das, A., Li, J., Ye, G., Zhao, R., and Gong, Y. (2019). Advancing acoustic-to-word CTC model with attention and mixed-units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):1880–1892.

Ghader, H. and Monz, C. (2017). What does attention in neural machine translation pay attention to? arXiv preprint arXiv:1710.03348.

Gondere, M. S., Schmidt-Thieme, L., Boltena, A. S., and Jomaa, H. S. (2019). Handwritten Amharic character recognition using a convolutional neural network. arXiv preprint arXiv:1909.12943.

Graves, A. (2008). Supervised sequence labelling with recurrent neural networks [Ph.D. dissertation]. Technical University of Munich, Germany.

Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 577–584.

Huang, W., He, D., Yang, X., Zhou, Z., Kifer, D., and Giles, C. L. (2016). Detecting arbitrary oriented text in the wild with a visual attention model. In Proceedings of the 24th ACM International Conference on Multimedia, pages 551–555.

Huang, Y., Luo, C., Jin, L., Lin, Q., and Zhou, W. (2019). Attention after attention: Reading text in the wild with cross attention. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 274–280. IEEE.

Lee, C.-Y. and Osindero, S. (2016). Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2231–2239.

Li, Z., Jin, L., Lai, S., and Zhu, Y. (2020). Improving attention-based handwritten mathematical expression recognition with scale augmentation and drop attention. In 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 175–180. IEEE.

Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Ly, N.-T., Nguyen, C.-T., Nguyen, K.-C., and Nakagawa, M. (2017). Deep convolutional recurrent network for segmentation-free offline handwritten Japanese text recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 7, pages 5–9. IEEE.

Maitra, D. S., Bhattacharya, U., and Parui, S. K. (2015). CNN based common approach to handwritten character recognition of multiple scripts. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 1021–1025. IEEE.

Martínek, J., Lenc, L., and Král, P. (2020). Building an efficient OCR system for historical documents with little training data.

Mekuria, G. T. and Mekuria, G. T. (2018). Amharic text document summarization using parser. International Journal of Pure and Applied Mathematics, 118(24).

Meshesha, M. (2008). Recognition and retrieval from document image collections. PhD thesis, IIIT Hyderabad, India.

Meshesha, M. and Jawahar, C. (2007). Optical character recognition of Amharic documents. African Journal of Information & Communication Technology, 3(2).

Messina, R. and Louradour, J. (2015). Segmentation-free handwritten Chinese text recognition with LSTM-RNN. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 171–175. IEEE.

Meyer, R. (2006). Amharic as lingua franca in Ethiopia. Lissan: Journal of African Languages and Linguistics, 20(1/2):117–132.

Mondal, M., Mondal, P., Saha, N., and Chattopadhyay, P. (2017). Automatic number plate recognition using CNN based self synthesized feature learning. In Calcutta Conference (CALCON), 2017 IEEE, pages 378–381. IEEE.

Poulos, J. and Valle, R. (2017). Character-based handwritten text transcription with attention networks. arXiv preprint arXiv:1712.04046.

Reta, B. Y., Rana, D., and Bhalerao, G. V. (2018). Amharic handwritten character recognition using combined features and support vector machine. In 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), pages 265–270. IEEE.

Teferi, D. (1999). Optical character recognition of typewritten Amharic text. Master's thesis, School of Information Studies for Africa, Addis Ababa.

Watanabe, S., Hori, T., Kim, S., Hershey, J. R., and Hayashi, T. (2017). Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253.

Wion, A. (2006). The national archives and library of Ethiopia: six years of Ethio-French cooperation (2001-2006).

Wu, Y.-C., Yin, F., Chen, Z., and Liu, C.-L. (2017). Handwritten Chinese text recognition using separable multi-dimensional recurrent neural network. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 79–84. IEEE.

Zhang, J., Du, J., and Dai, L. (2018). Track, attend, and parse (TAP): An end-to-end framework for online handwritten mathematical expression recognition. IEEE Transactions on Multimedia, 21(1):221–233.
