Transfer Learning for Handwriting Recognition on Historical Documents

Adeline Granet, Emmanuel Morin, Harold Mouchère, Solen Quiniou and Christian Viard-Gaudin

LS2N, UMR CNRS 6004, Université de Nantes, France

Keywords:

Handwriting Recognition, Historical Document, Transfer Learning, Deep Neural Network, Unlabeled Data.

Abstract:

In this work, we investigate handwriting recognition on new historical handwritten documents using transfer

learning. Establishing a manual ground-truth of a new collection of handwritten documents is time consuming

but needed to train and to test recognition systems. We want to implement a recognition system without

performing this annotation step. Our research deals with transfer learning from heterogeneous datasets with a

ground-truth and sharing common properties with a new dataset that has no ground-truth. The main difﬁculties

of transfer learning lie in changes in the writing style, the vocabulary, and the named entities over centuries

and datasets. In our experiment, we show how a CNN-BLSTM-CTC neural network behaves, for the task

of transcribing handwritten titles of plays of the Italian Comedy, when trained on combinations of various

datasets such as RIMES, George Washington, and Los Esposalles. We show that the choice of the training

datasets and the merging methods are decisive for the results of the transfer learning task.

1 INTRODUCTION

Historical documents are increasingly digitized to

preserve them and to ease their accessibility and dissemination.

Thus, information retrieval within historical

documents is a real challenge. Moreover, the quantity

of images is so large that manual information mining

remains a time-consuming task. Over the last decade,

historical data have become a principal target for clas-

siﬁcation (Cloppet et al., 2016), line detection (Mur-

dock et al., 2015), and keyword spotting (Puigcerver

et al., 2015). Standard end-to-end text recognition

systems consist of three steps (Fischer et al., 2009):

manual labelling of data to create a ground truth; spe-

ciﬁc pre-processing operations such as denoising doc-

uments and segmenting them into blocks, lines, or

words; training of a dedicated recognizer using the

alignment between text images and manual labels.

In (Lladós et al., 2012), the authors study the

problem of handwriting recognition (HWR) without

training data for historical documents. Using a non

dedicated dataset is ill-advised because there may be

several problems caused by signiﬁcant differences in

terms of period and geographical area, which often

affect the script style. Finding a training dataset for

keyword spotting or handwriting recognition, meet-

ing the desired distinctive characteristics, is a com-

plicated task. However, some studies attempt to use

spotting systems trained on modern data for historical documents,

such as (Frinken et al., 2010), which mixes differ-

ent resources to transcribe another resource based on

its target vocabulary: this approach is called Trans-

fer Learning. We are especially interested in Trans-

ductive Transfer Learning which focuses on domain

adaptation (Pan and Yang, 2010) by using various

source and target data for the same given task. We

want to use this method and to push it even further

by multiplying the number of annotated data used as

sources and by adding parameter transfer.

In this paper, we are studying a new resource (ﬁ-

nancial records of the Italian Comedy) using a min-

imum amount of information and without a ground-

truth. This prevents the direct use of traditional meth-

ods of HWR (see section 2). That is why we want to

build a recognition system able to transfer knowledge

on unknown data, without annotating more data. To overcome the lack of data, Transductive Transfer Learning seems to be a good alternative. The chosen recognition system is a BLSTM-CTC system (Bidirectional Long Short-Term Memory and Connectionist Temporal Classification) including a Fully Convolutional Network, among the current state-of-the-art systems; this avoids the need for a separate, dedicated feature extraction step. Three datasets are used during the training

step of the network, each dataset sharing at least one

feature with our Italian Comedy data.

The rest of this paper is organized as follows. We

present the state of the art on handwriting recogni-

tion systems in Section 2. Then, we describe our

HWR system, its structure, and its relevant post-preprocessing steps, in Section 3. The dataset we use is presented in Section 4 and we report our experiments and results in Sections 5 and 6.

Granet, A., Morin, E., Mouchère, H., Quiniou, S. and Viard-Gaudin, C.

Transfer Learning for Handwriting Recognition on Historical Documents.

DOI: 10.5220/0006598804320439

In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 432-439

ISBN: 978-989-758-276-9

2 STANDARD HWR SYSTEMS

At the end of the 90’s, Hidden Markov Models

(HMMs) had become a reference due to their abil-

ity to learn sequentially. Moreover, they could in-

tegrate knowledge in the form of lexicons and lan-

guage models to better label sequences (Bunke et al.,

1995). Rapidly, they were strengthened by neural net-

works and hybrid neuro-markovian systems further

improved the local and global representation of char-

acters (Koerich et al., 2002).

At the same time, neural networks evolved with

new types of neurons, namely recurrent neurons.

Contrary to a simple neuron, a recurrent neuron al-

lows a connection to itself. Thus, recurrent neu-

ral networks (RNN) store more information from

all inputs during training (Senior and Robinson,

1998). Then, the Long Short-Term Memory (LSTM)

block (Hochreiter and Schmidhuber, 1997) appeared

to solve the problem of the vanishing gradient. This

block allows the training of the network to converge.

Recently, multi-dimensional neural networks with

LSTM (MDLSTM) have outperformed traditional

networks. The ﬁrst one was a bidirectional recur-

rent network with LSTM (BLSTM); it became a refer-

ence for HWR systems (Fischer et al., 2009). Nowadays, multidimensional recurrent neural networks

have won competitions such as (Grosicki and El Abed,

2009). In (Graves and Schmidhuber, 2009), the mul-

tidimensional part of MDLSTM is made of four par-

allel layers across each direction on raw images. All

the context can thus be used without any restriction.

In HWR, all the previously introduced networks

cannot automatically align the input sequence with its

labels. So, (Graves, 2012) presented a connection-

ist temporal classiﬁcation (CTC) layer which com-

putes a sequence of labels to avoid the segmentation

of the input sequence into characters or words. An-

other important step in HWR is feature extraction.

Although traditional methods such as HOG (Tera-

sawa and Tanaka, 2009) have proven their efﬁciency,

new methods integrating convolutional neural net-

works (CNN) have begun to replace them (Suryani

et al., 2016). Other ﬁelds like Visual Recognition

and Description also use methods based on CNN and

LSTM (Donahue et al., 2015). We can distinguish

two different approaches: the ﬁrst one is a simple

CNN and includes at least one layer of fully connected

neurons at the end of the network, and the second one

mainly uses convolution and max pooling layers.

Multilingual systems were developed in parallel

with MDLSTM. Some use identical configura-

tions on independent training datasets (Voigtlaen-

der et al., 2016), others dedicate one speciﬁc layer

for each language and for each task in a recurrent

neural network (Moysset et al., 2014) but few sys-

tems are trained on multilingual datasets at the same

time (Kozielski et al., 2014). Regarding monolingual

or multilingual HWR systems, it is common to use n-

gram language models at the word or character level,

and a dictionary closed on the training set to improve

the decoding step. In (Oprean et al., 2013), the au-

thors use Wikipedia to create a dynamic dictionary

for each word detected as out-of-vocabulary.

3 FCN-BLSTM-CTC RECOGNITION SYSTEM

In our HWR systems, the concept of Transductive

Transfer Learning is performed through the training

and the validation of the weights of the neural network

on a set of three datasets. Then, the saved weights are

used either to perform tests directly on the new dataset

or to perform ﬁne-tuning by initializing the new net-

work. Our neural networks are made up of two parts:

feature extraction and handwriting recognition.

The feature extraction part is directly integrated

in the HWR system with a fully convolutional neural

network (FCN). A graphical representation of the two

architectures that we are using in our experiments is

presented in Figure 1. The neural network takes an

input image with a ﬁxed-height of 120 pixels and a

variable width denoted t. The common part of both

networks is composed of three layers of convolution

with a kernel size of 5x5 and same-padding (corre-

sponding to the gray part of Figure 1). Then, the ﬁrst

network, called CNN 32, has 3 layers of convolution

with 32 ﬁlters, while the second one, called CNN 128,

has 3 layers of convolution with a ﬁlter number which

doubles: 64 and 128 ﬁlters. Each of the convolutional

layers is followed by several max-pooling layers. Finally, the last shape obtained has dimensions 32x1x(t/20) for CNN 32, and 128x1x(t/20) for CNN 128, as shown in Figure 1. This first part is directly connected to the second part of the network, i.e. the handwriting recognition system.
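The reduction from a 120 x t input to a 1 x t/20 feature map can be checked with a small shape trace. This is only a sketch, not the authors' implementation: the pooling factors are assumptions inferred from the successive shapes listed in Figure 1, the per-stage channel counts are illustrative (only the final 32 and 128 are stated in the text), and the name `trace_shapes` is hypothetical.

```python
# Hypothetical shape trace for the FCN feature extractor.
# Assumptions: same-padded convolutions keep height and width unchanged,
# and the (height, width) pooling factors below are inferred from Figure 1.
POOLS = [(2, 1), (2, 1), (3, 2), (2, 2), (5, 5)]

# Illustrative channel counts per pooling stage (assumed, not from the paper).
CNN_32 = [8, 16, 32, 32, 32]
CNN_128 = [8, 16, 32, 64, 128]


def trace_shapes(width, filters):
    """Return the (height, width, channels) shape after each pooling stage."""
    height = 120  # fixed input height stated in the text
    shapes = []
    for (pool_h, pool_w), channels in zip(POOLS, filters):
        height //= pool_h
        width //= pool_w
        shapes.append((height, width, channels))
    return shapes


if __name__ == "__main__":
    # For an image of width t = 400, the last shape should be 1 x t/20 x C.
    print(trace_shapes(400, CNN_32)[-1])
    print(trace_shapes(400, CNN_128)[-1])
```

With these assumed factors, the height collapses to 1 and the width is divided by 20, so the BLSTM receives one feature vector of size 32 or 128 per time step.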

The second part of our neural networks respects

the initial structure of the BLSTM neural network

proposed in (Graves, 2012). The network is com-

posed of two hidden layers, forward and backward.

Both of them are made up of 100 LSTM blocks.

Through the training, these layers are independent,

one using the information following the time, and the


Figure 1: Graphical representation of the two architectures of the convolutional neural network part used for feature extraction. The blue version is called CNN 32, and the pink one is called CNN 128. Shapes shown in the figure: input (120 x t); common part (120 x t x 8), (60 x t x 16), (30 x t x 32); CNN 32 (10 x t/2 x 32), (5 x t/4 x 32), (1 x t/20 x 32); CNN 128 (10 x t/2 x 64), (5 x t/4 x 128), (1 x t/20 x 128).

other from the future to the past. LSTM blocks con-

trol the inﬂuence of long-term information across the

network; it is interesting for long images such as lines

of text. Then, the weighted sum of both hidden lay-

ers is provided to the output layer built with 75 soft-

max neurons. These correspond to the 52 lowercase and uppercase

letters, 10 digits, and some punctuation symbols.

At each time step, one output neuron represents one

character and an additional neuron acts as a “joker”,

called “blank” label. Finally, the CTC is applied to the

output in order to label the sequence. The provided

output can be decoded by several algorithms, such as

the Token Passing Algorithm, or the Preﬁx Search De-

coding that can include a language model. In our ex-

periments, Best Path Decoding is used: at each time step, the most active node is selected, which finally gives the most probable path (Graves, 2012). To obtain the final sequence, each run of consecutive identical labels is collapsed to its first occurrence, and the blank labels are then removed. With this method, we stay in the framework where we have no prior knowledge of the data.
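The decoding rule described above can be illustrated on a toy output matrix. This is a minimal sketch of Best Path Decoding under stated assumptions: the three-symbol alphabet, the `_` blank symbol, and the probability values are invented for the example, and the function name is hypothetical.

```python
# Minimal sketch of CTC Best Path Decoding on a toy output matrix.
BLANK = "_"  # hypothetical symbol standing for the CTC "blank" label


def best_path_decode(probs, alphabet):
    """Decode probs, where probs[t][k] is the output for alphabet[k] at step t."""
    # 1. Select the most active output node at each time step.
    path = [alphabet[max(range(len(row)), key=row.__getitem__)] for row in probs]
    # 2. Keep only the first label of each run of identical labels,
    #    and 3. drop the blank labels.
    decoded = []
    previous = None
    for label in path:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)


if __name__ == "__main__":
    alphabet = ["a", "b", BLANK]
    probs = [
        [0.8, 0.1, 0.1],  # a
        [0.7, 0.2, 0.1],  # a, repeated and therefore collapsed
        [0.1, 0.1, 0.8],  # blank
        [0.6, 0.3, 0.1],  # a, kept because a blank separates it from the run
        [0.2, 0.7, 0.1],  # b
    ]
    print(best_path_decode(probs, alphabet))
```

The blank label is what allows a genuinely doubled letter to survive the collapsing step, which is why it is removed only after the runs have been merged.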

4 CASE STUDY: RECORDS OF THE ITALIAN COMEDY

This paper deals with transcription and information

extraction issues from historical documents with few

or no annotated data. Our research aims at providing a

handwriting recognition solution for documents of the

Italian Comedy, from the 18th century. They are provided by the BnF (Bibliothèque nationale de France)

as part of the ANR project CIRESFI. This data con-

sists of more than 28,000 pages of ﬁnancial records

covering one century. Several evolutions were noticed

within this dataset. The ﬁrst one is related to the lan-

guage which switches from Italian (with several di-

alects) to French. The second one is a change in the

structure preserving the quantity of information. For

one day, we can ﬁnd the date, titles of the plays, rev-

enues, expenses, actor names, and also some notes (as

shown in Figure 2) but this layout fluctuates over the

decades. Further work is in progress on the detec-

tion and segmentation of these ﬁelds for each page.

http://gallica.bnf.fr/accueil/

Figure 2: Example of a ﬁnancial daily record for the Italian

Comedy with identiﬁcation ﬁelds.

In this part of the project, our study focuses on the

title ﬁeld. This can be explained by the large collec-

tion of play titles of the Italian Comedy. The title ﬁeld

contains the list of plays that have been performed

that day. Sometimes, this list gives more information,

such as if it was a premiere, if it was played in special

places, or in front of the king’s court. An example

of such complementary information is shown in Fig-

ure 2. The title field explains that this was the first performance of “Sophie ou le mariage caché” (“Sophie or the secret marriage”) which was a comedy in three acts, preceded by “Arlequin toujours Arlequin” (“Arlequin always Arlequin”). Sometimes, actor names replace the names of their characters in the title. Thus, our collection of titles cannot be considered as a ground truth but as a source of information.

The writing style is also an issue. At the beginning

of the century, Italian actors wrote the records them-

selves. From the mid-century, there was only one

writer for thirty years. Furthermore, there are differ-

ences compared to contemporary writing: special characters, such as the long form of ’s’ (Figure 3(a)); evolution of the spelling, such as using ’i’ or ’j’ interchangeably (Figure 3(b)); abbreviations such as “&c.” for “etc.” or symbols such as ’◦’ (Figures 3(c) and 3(d)).

Thanks to a participative annotation website (http://recital.univ-nantes.fr/), we

were able to collect information on the position of

ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods


Figure 3: Special characters and abbreviations in the Italian Comedy documents. Panels: (a) ”Rose”, (b) ”Invisible”.

the ﬁelds in the pages as well as their transcription.

The Seam Carving algorithm proposed by (Arvanitopoulos and Süsstrunk, 2014) has been applied

to segment the blocks into title lines. Finally, we

have manually validated the collected transcriptions

of each block, each line, as well as the segmented

lines themselves. Thus, we collected 971 validated

lines with their transcriptions. To create the datasets

for training, validation and test, we took care to

distribute the lines of titles in both French and Ital-

ian among the various datasets. Moreover, title lines

coming from the same page were not separated.

The aim of our work is not to provide another

HWR method. Thus, we use CNN-BLSTM neural

networks with CTC to produce knowledge transfer at

two levels: from modern data to historical documents

based on (Frinken et al., 2010) and from at least one

language to another one (or more). We want to solve

two problems: the ﬁrst one is related to the language

which is mainly in French; the second one is the old

style of our data. This is why we use two types of data

sources: the ﬁrst one is data in French and the other

one is data from the same period.

5 EXPERIMENTAL SETUP

5.1 Datasets

To apply transductive transfer learning, we need to

carefully choose the data used to train our systems.

To our knowledge, there are no available annotated

data meeting all our criteria: from the 18th century, in French and Italian, with a closed vocabulary on the Italian Comedy. Thus, each selected dataset has at least one characteristic in common with our data.

The descriptions of the datasets are shown in Table 1.

George Washington (GW). The GW dataset (Fischer et al., 2012) contains 20 pages of letters from George Washington to his associates during the 18th century. The writing style is very similar from letter

to letter. The images are binarized and normalized.

To be comparable to the state-of-the-art, the original

partition of the dataset in train, validation and test is

used in our work. In fact, as four partitions of the data

exist, we randomly selected one of these partitions.

Los Esposalles (ESP). The ESP dataset (Romero

et al., 2013) consists of 173 pages of old Spanish mar-

riage records. The whole pages are provided with

the segmentations and transcriptions of their words.

There is only one writer for these pages. The raw images are used with the height normalized to 120 pixels (like the GW images).

RIMES (RM). The RM dataset (Augustin et al.,

2006) is a French database used in several ICDAR

competitions. It is composed of 12,723 pages of ad-

ministrative letters written by 1,300 volunteers. The

gray scale images are used with the same height nor-

malization as the others. The ofﬁcial split for the IC-

DAR 2011 competition (Grosicki and El-Abed, 2011)

provides 12,111 lines (including 11,333 lines for the

training set) and 66,979 words (including 51,739

words for the training set). We keep this distribution.

Italian Comedy (CI). The CI dataset is a French and

Italian dataset describing play titles. It is composed of

151 play titles. The images were also normalized to a height of

120 pixels and are in grayscale.

From all datasets, we have removed characters in

the ground truth such as ’#’, ’/’ or ’$’ and replaced

them by a “joker” character because they could not

appear in the Italian Comedy. The accented characters and the ligatures in the RIMES transcriptions are replaced by their simple form. Thus, we consider that the ’é’ character is one form of ’e’, just as ’s’ has a long form and a short form in the 18th century.
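The label normalization described above could look like the following. This is a hedged sketch rather than the authors' code: the joker symbol `?`, the set of removed characters beyond the three examples given in the text, and the ligature mapping are all assumptions made for illustration.

```python
import unicodedata

JOKER = "?"  # hypothetical placeholder; the paper does not name the symbol
REMOVED = set("#/$")  # the three examples given in the text
# Assumed mapping of ligatures ("tied letters") to their simple form.
LIGATURES = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"}


def normalize_transcription(text):
    """Map a ground-truth string onto the reduced character set."""
    out = []
    for ch in text:
        if ch in REMOVED:
            out.append(JOKER)
        elif ch in LIGATURES:
            out.append(LIGATURES[ch])
        else:
            # Decompose the character and drop combining accents: 'é' -> 'e'.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    print(normalize_transcription("Sophie ou le mariage caché"))
```

Reducing the label set this way keeps the output layer small and lets accented and unaccented forms share training examples across the datasets.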

5.2 System Training

We have already detailed the architecture of our sys-

tem in section 3 as well as some parameters. We car-

ried out our experiments in successive stages. The

ﬁrst experiment allowed us to optimize the architec-

ture of our system by evaluating it on two labeled

datasets. With the best parameters obtained, we carried out

another experiment in order to select the best

pairing of datasets and to test the word recogni-

tion task on the Italian Comedy data. These results

will be used for the third and ﬁnal experiment which

involves directly testing the transfer learning and the

ﬁne-tuning on our Italian Comedy data.

In transfer learning, the generalization capability

of the classiﬁer must be maximized: the classical

technique of early stopping is used in order to se-

lect the best network. To avoid overﬁtting, training


Table 1: The selected datasets. In the upper part, each common point with the Italian Comedy is shown in bold. In the lower part, the data distribution on the training, validation and test sets is given as well as the name of the associated dataset.

Dataset        George Washington   Los Esposalles   RIMES          Italian Comedy
Language       English             Spanish          French         French (and Italian)
Period         18th century        18th century     21st century   18th century
Pixel value    Binarized           Grayscale        Grayscale      Grayscale
Words
  Train        2,402 (GW)          45,102 (ESP)     51,739 (RM)    -
  Validation   1,199               5,637            7,464          -
  Test         1,292               5,637            7,776          -
Lines
  Train        325 (GW)            -                11,333 (RM)    582 (CI)
  Validation   168                 -                1,332          195
  Test         163                 -                778            194

continues until the Negative Log-Likelihood (NLL) computed by the CTC has not decreased for 20 epochs on any of the validation datasets. One set of weights of the neural network is saved for each validation dataset, and only when the NLL on that dataset drops. Training is performed jointly on all the parts of the network: feature extraction with the FCN, handwriting recognition with the BLSTM, and data labeling with the CTC.
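The stopping rule with several validation sets can be sketched as follows. This is an illustrative reading of the description above, not the authors' training loop: the function name is hypothetical, the NLL values are passed in as precomputed numbers instead of being measured on a network, and the simulated history is invented.

```python
def train_with_early_stopping(nll_per_epoch, patience=20):
    """Simulate early stopping over several validation sets.

    nll_per_epoch: iterable of dicts {validation set name: NLL}, one per epoch.
    Returns (epoch of the saved weights per validation set, last epoch run).
    """
    best_nll = {}
    best_epoch = {}  # epoch whose weights would be saved for each set
    last_improvement = 0
    epoch = 0
    for epoch, nlls in enumerate(nll_per_epoch, start=1):
        for name, nll in nlls.items():
            if nll < best_nll.get(name, float("inf")):
                best_nll[name] = nll
                best_epoch[name] = epoch  # weights would be backed up here
                last_improvement = epoch
        # Stop once no validation set has improved for `patience` epochs.
        if epoch - last_improvement >= patience:
            break
    return best_epoch, epoch


if __name__ == "__main__":
    # Invented NLL values, with a short patience to keep the example small.
    history = [
        {"GW": 5.0, "RM": 4.0},
        {"GW": 4.0, "RM": 4.5},
        {"GW": 4.2, "RM": 4.6},
        {"GW": 4.3, "RM": 4.7},
        {"GW": 1.0, "RM": 1.0},  # never reached
    ]
    print(train_with_early_stopping(history, patience=2))
```

Keeping one checkpoint per validation set means the weights used for a given test set are those that generalized best on the matching validation data.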

Our ﬁrst experiments on GW demonstrated that

the training is performed more efﬁciently on the line

images when it is gradually done. Hence, all images

in the training set are sorted in an ascending order ac-

cording to their label length. In this way, the training

step goes from isolated characters to words, and ﬁ-

nally to long lines. First, experiments exclusively run

on the word datasets. Then, we extend them to the

lines. This allows an increase in the performance of

the BLSTM-CTC across all datasets.
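The ordering of the training set by label length amounts to a simple sort. The snippet below is a minimal sketch with invented sample identifiers and a hypothetical function name; it only illustrates the ordering, not the training itself.

```python
def curriculum_order(samples):
    """Sort (image_id, label) pairs by ascending label length."""
    return sorted(samples, key=lambda sample: len(sample[1]))


if __name__ == "__main__":
    # Invented identifiers; labels reuse strings that appear in this paper.
    samples = [
        ("img_3", "Arlequin toujours Arlequin"),
        ("img_1", "a"),
        ("img_2", "Rose"),
    ]
    for image_id, label in curriculum_order(samples):
        print(image_id, len(label))
```

Because Python's sort is stable, samples with equal label length keep their original relative order.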

To evaluate the performance of our system, we

used the recognition rate at the character level (CRR)

and at the word level (WRR). The CRR is deﬁned by

$$\mathrm{CRR} = \frac{N - (S + D + I)}{N}$$

with N the number of characters in the reference im-

ages, S the number of character substitutions, D the

number of character deletions, and I the number of

character insertions. The WRR is computed similarly

on words. These two measures are case sensitive and treat the space as a character within a line. For the decoding step, no dictionary or language model was used on the three labeled datasets. Through our experiments, we aim at defining a simple system capable of performing the transfer learning.
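The two measures can be computed from an edit distance between the reference and the recognized sequence. The following is a minimal sketch under the usual assumption that S + D + I is the minimum number of edits (Levenshtein distance); the function names are hypothetical, and the word-level variant simply splits on whitespace.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def recognition_rate(reference, hypothesis):
    """(N - (S + D + I)) / N, with N the length of the non-empty reference."""
    n = len(reference)
    return (n - edit_distance(reference, hypothesis)) / n


def crr(reference, hypothesis):
    """Character recognition rate; case sensitive, spaces count as characters."""
    return recognition_rate(list(reference), list(hypothesis))


def wrr(reference, hypothesis):
    """Word recognition rate, computed on whitespace-separated words."""
    return recognition_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    print(crr("abcd", "abxd"))
    print(wrr("le mariage cache", "le mariage cacha"))
```

Note that, with this definition, a hypothesis containing many insertions can yield a negative rate, since the number of errors is not bounded by N.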

6 EXPERIMENTAL RESULTS

Previous experiments showed that the training phase is more efficient when it is performed gradually. When training from scratch on the lines of GW, after 100 epochs, the cost stays high and the recognition rate is about 10.7%. Nonetheless, if the network is fine-tuned with the weights from the previous training on the word images, the recognition rate reaches 77.3% in 33 epochs. So, this gradual training has been used for the following experiments.

6.1 Optimizing the System Architecture

This part presents the optimization of the system ar-

chitecture. Indeed, the parameters of each separate

part of our system (CNN and BLSTM-CTC) have an

impact on the quality of the transcriptions, as well

as on the transfer phase. Those experiments aim at

evaluating the various parameters that can inﬂuence

our system, i.e. the CNN size in terms of number of

extracted features and number of cells in the hidden

LSTM layers.

The GW dataset is used more often for word spotting (Rath and Manmatha, 2007; Fischer et al., 2012; Frinken et al., 2012) than for HWR or word recognition (Lavrenko et al., 2004). The latter uses an HMM with cross-validation and 19 pages to train the system. It achieves a CRR of 56.8% without the out-of-vocabulary words. With only 15 pages to train our system, we achieve a CRR of 28.7%. In order to improve our results and deal with the limited amount of data in GW, we have paired it with the more extensive dataset RM, for each experiment.

Table 2 presents the results obtained by four systems trained on the RM ∪ GW training set and tested on the RM and GW test sets. The systems differ in the number of features (32 or 128) and in the number of cells in their two LSTM layers (50 or 100).

When the number of extracted features is set to 32,

we can see that increasing the number of cells leads

to a better rate on the characters. When the number

of features is increased to 128, the character recogni-


Table 2: Results for several systems with the same training set, RM ∪ GW.

Features   Cells   Test   CRR    WRR
32         50      GW     40.5   22.2
                   RM     63.5   34.8
32         100     GW     43.9   22.2
                   RM     67.6   39.9
128        100     GW     48.5   25.5
                   RM     70.1   41.7

tion rate on GW evolves from 40.5% to 48.5%. The same observation can be made on RM, where the character recognition rate increases by 6.6 points. However, the increase in the word recognition rate is not as large on the two test sets. Finally, these results

allow us to conclude that the best conﬁguration is ob-

tained with 128 feature outputs from the CNN and

with 100 LSTM cells, which was originally deﬁned

by (Graves and Schmidhuber, 2009). These settings

will be kept for the rest of our experiments. We can

also conclude that the addition of a large dataset helps

to improve the results.

6.2 Optimizing the Word Datasets

Now that the architecture of the system has been set

up, we have to deﬁne the datasets that will be used

for training and testing our system. Among the three

datasets, two must be dedicated to training and the

third one must play the role of our CI data since we

only have a limited amount of CI data with a ground-

truth. The RIMES dataset must be part of the training

dataset because it is the only French dataset that

we have here.

Table 3: Results obtained on word images for systems with 128 features and 100 cells for the LSTM layers.

Id   Train      Test   CRR    WRR
     RM ∪ GW    GW     48.5   25.5
                RM     70.1   41.7
                ESP     9.0    0.3
     RM ∪ ESP   GW      6.3    0.3
                RM     71.1   42.0
                ESP    91.1   75.9

Among studies on the RIMES dataset, (Pham

et al., 2014) use a deep recurrent neural network com-

posed of MDLSTM and CNN. The authors vary the

number of LSTM cells from 30 to 200. They obtain an 84.9% character recognition rate with 50 cells, and an 84.2% character recognition rate with 100 cells. Progressively, our system (presented in Table 3) tends to reach this state of the art. The high recognition rate obtained with RM and ESP on the words pushes us to select them for the last series of experiments on CI. These first experiments on transfer learning at the word level show that it is a difficult task.

6.3 Experimenting Transfer Learning

In the previous experiments, we were in a phase of

training and testing on labeled data. The last exper-

iments deal with the learning transfer process on the

line images, especially those of CI. It is interesting to

evaluate the impact of adding, or not adding,

target data during the training. Table 4 presents the re-

sult of three different experiments which include tests

on CI. Each of them uses the saved weights from an

experiment (indicated by an ID in the train column).

Table 4: Results for systems with 128 features, 100 cells for the LSTM layers, and fine-tuned with the saved weights from the different experiments.

Id Train Test CRR

) ∪ ∪ ESP 10.6

) ∪ CI 25.5

) ∪ CI 28.7

Our use case corresponds to the first line. With the help of annotated data, we could achieve the results obtained on lines 2 and 3. At first, we add RM while keeping ESP and using the saved weights from the last experiment E on the words presented in Table 3. In the experiments, we found out that fine-tuning is only useful when the resources used to set up the weights are still present because the system forgets. Then, these saved weights are used to perform fine-tuning on CI. This specialization of the network on the target data during the learning increases the character recognition rate by about 15 points. Finally, in order to observe the impact of the “space” character during the learning, the weights saved when learning on E are chosen to perform fine-tuning on CI. It shows a 3.2-point increase of the character recognition rate with respect to the rate obtained when the weights saved on E were used.

Figures 4 and 5 show the curves obtained when fine-tuning the learning on CI, from RM ∪ ESP versus RM ∪ ESP. In addition to improving the character recognition rate, we note that the learning is faster, 38 epochs against 27 epochs, when using instead of RM. Furthermore, the initial validation NLL is twice as low with RM and it quickly reaches a stability level, when the “space” character was learned before. With the fine-tuning on RM, the


Figure 4: Fine-tuning of CI on E: evolution of Log-Likelihood by epoch

Figure 5: Fine-tuning of CI on E: evolution of Log-Likelihood by epoch

character recognition rate on the validation set grows gradually. We can notice a sharp fall at the beginning of the learning curve: it seems that the system forgets part of what it has learned in order to better specialize to CI. With RM, the character recognition rate curve behaves as if the training simply continued as before the fine-tuning. It is also interesting to note that, here, the breakpoint on the validation NLL is just after the best value reached for the character recognition rate. Furthermore, we can see that, with just one iteration on a small set of target data, the character recognition rate already exceeds that of the first experiment we had conducted by testing directly on CI. Thus, a very simple system architecture without any language model helps to successfully achieve the learning transfer.

7 CONCLUSION

We performed experiments on transductive transfer

learning for the task of handwriting recognition on a

new historical dataset. Firstly, we optimized the parameters of the system to obtain the simplest and best-performing system. Then, we defined the best pairing of datasets to carry out the transfer learning. Moreover, it is necessary to pay attention to the balance of the representation of words in a dataset because this has a strong impact on the system. The last experiment with the CI data shows that even if both measures of recognition rate are low, there is progress.

Our experiments allow us to conclude that HWR

systems quickly specialize on learning data. We have

found out that the addition of one or more resources

makes it possible to improve character recognition.

For now, it is still necessary to add a small amount

of target data in the learning, to achieve a minimum

recognition rate. In the long term, we want to build a

system based on capitalization of resources in order to

avoid those phases of annotation and manual valida-

tion. Our aim remains to be able to carry out transfer

learning on data without prior costly knowledge.

These results will guide our future experiments.

Besides the addition of resources and their balance

control, we can explore several orientations. We

will focus on solutions which do not require costly

ground-truth. For example, unsupervised training can take advantage of the thousands of available unannotated pages. This can be done by using an autoencoder for the feature extraction layers instead of our FCN. Another low-cost resource is dictionaries, which can be collected from topic-related corpora, but this solution requires adding language-model post-processing of the network outputs. Finally, we can also set up a deeper network with several LSTM layers, as is often done in the state of the art. However, increasing the system size runs against the results presented


here to obtain a good recognition rate on data that are

not used during learning. There is still work to be done

to succeed in the recognition of new digitized documents

from multilingual and multi-period resources.

REFERENCES

Arvanitopoulos, N. and Süsstrunk, S. (2014). Seam carving for text line extraction on color and grayscale historical manuscripts. In ICFHR, pages 726–731.

Augustin, E., Brodin, J.-M., Carré, M., Geoffrois, E., Grosicki, E., and Preteux, F. (2006). RIMES evaluation campaign for handwritten mail processing. In ICFHR.

Bunke, H., Roth, M., and Schukat-Talamazzini, E. G.

(1995). Off-line cursive handwriting recognition us-

ing hidden markov models. PR, 28(9):1399–1413.

Cloppet, F., Eglin, V., Kieu, V., Stutzmann, D., and Vin-

cent, N. (2016). ICFHR 2016 competition on the clas-

siﬁcation of medieval handwritings in latin script. In

ICFHR, pages 590–595.

Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634.

Fischer, A., Keller, A., Frinken, V., and Bunke, H. (2012). Lexicon-free handwritten word spotting using character HMMs. PRL, 33(7):934–942.

Fischer, A., Wüthrich, M., Liwicki, M., Frinken, V., Bunke, H., Viehhauser, G., and Stolz, M. (2009). Automatic transcription of handwritten medieval documents. In VSMM, pages 137–142.

Frinken, V., Fischer, A., Bunke, H., and Manmatha, R. (2010). Adapting BLSTM neural network based keyword spotting trained on modern data to historical documents. In ICFHR, pages 352–357.

Frinken, V., Fischer, A., Manmatha, R., and Bunke, H. (2012). A novel word spotting method based on recurrent neural networks. IEEE Trans. on PAMI, 34(2):211–224.

Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Springer.

Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, pages 545–552.

Grosicki, E. and El Abed, H. (2009). ICDAR 2009 handwriting recognition competition. In ICDAR, pages 1398–1402.

Grosicki, E. and El-Abed, H. (2011). ICDAR 2011: French handwriting recognition competition. In ICDAR, pages 1459–1463.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Koerich, A. L., Leydier, Y., Sabourin, R., and Suen, C. Y. (2002). A hybrid large vocabulary handwritten word recognition system using neural networks with hidden Markov models. In ICFHR, pages 99–104.

Kozielski, M., Doetsch, P., Hamdani, M., and Ney, H. (2014). Multilingual off-line handwriting recognition in real-world images. In DAS, pages 121–125.

Lavrenko, V., Rath, T. M., and Manmatha, R. (2004). Holistic word recognition for handwritten historical documents. In DIAL, pages 278–287.

Lladós, J., Rusiñol, M., Fornés, A., Fernández, D., and Dutta, A. (2012). On the influence of word representations for handwritten word spotting in historical documents. IJPRAI, 26(05):1263002–1–25.

Moysset, B., Bluche, T., Knibbe, M., Benzeghiba, M. F., Messina, R., Louradour, J., and Kermorvant, C. (2014). The A2iA multi-lingual text recognition system at the second MAURDOR evaluation. In ICFHR, pages 297–302.

Murdock, M., Reid, S., Hamilton, B., and Reese, J. (2015). ICDAR 2015 competition on text line detection in historical documents. In ICDAR, pages 1171–1175.

Oprean, C., Likforman-Sulem, L., Popescu, A., and Mokbel, C. (2013). Using the web to create dynamic dictionaries in handwritten out-of-vocabulary word recognition. In ICDAR, pages 989–993.

Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Trans. on KDE, 22(10):1345–1359.

Pham, V., Bluche, T., Kermorvant, C., and Louradour, J. (2014). Dropout improves recurrent neural networks for handwriting recognition. In ICFHR, pages 285–290.

Puigcerver, J., Toselli, A. H., and Vidal, E. (2015). ICDAR 2015 competition on keyword spotting for handwritten documents. In ICDAR, pages 1176–1180.

Rath, T. M. and Manmatha, R. (2007). Word spotting for historical documents. IJDAR, 9(2):139–152.

Romero, V., Fornés, A., Serrano, N., Sánchez, J. A., Toselli, A. H., Frinken, V., Vidal, E., and Lladós, J. (2013). The ESPOSALLES database: An ancient marriage license corpus for off-line HWR. PR, 46(6):1658–1669.

Senior, A. W. and Robinson, A. J. (1998). An off-line cursive handwriting recognition system. IEEE Trans. on PAMI, 20(3):309–321.

Suryani, D., Doetsch, P., and Ney, H. (2016). On the benefits of convolutional neural network combinations in offline handwriting recognition. In ICFHR, pages 193–198.

Terasawa, K. and Tanaka, Y. (2009). Slit style HOG feature for document image word spotting. In ICDAR, pages 116–120.

Voigtlaender, P., Doetsch, P., and Ney, H. (2016). Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In ICFHR, pages 228–233.
