Impact of Training LSTM-RNN with Fuzzy Ground Truth
Martin Jenckel¹,², Sourabh Sarvotham Parkala², Syed Saqib Bukhari¹ and Andreas Dengel¹,²
¹German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
²TU Kaiserslautern, Kaiserslautern, Germany
Keywords: Document Analysis, OCR, LSTM, Fuzzy Ground Truth.
Abstract:
Most machine learning algorithms follow the supervised learning approach and therefore require annotated training data. The large amount of training data required to train state of the art deep neural networks has changed the methods of acquiring the required annotations. User annotations or completely synthetic annotations are becoming more and more prevalent, replacing careful manual annotation by experts. In the field of OCR, recent work has shown that synthetic ground truth acquired through clustering with minimal manual annotation yields good results when combined with bidirectional LSTM-RNN. Similarly, we propose a change to standard LSTM training to handle imperfect manual annotation. When annotating historical documents or low quality scans, deciding on the correct annotation is difficult, especially for non-experts. Providing all possible annotations in such cases, instead of just one, is what we call fuzzy ground truth. Finally, we show that training an LSTM-RNN on fuzzy ground truth achieves a performance similar to training on conventional ground truth.
1 INTRODUCTION
With the rise of neural networks and deep learning,
collecting large amounts of training data and their an-
notation became one of the major challenges in ma-
chine learning. Instead of handcrafting annotations for data sets with a couple of thousand data points, the data is often collected from social media sources like Twitter, Flickr or YouTube (Wang et al., 2011; Althoff et al., 2013; You et al., 2015), and user-generated tags, keywords or comments are used in place of handcrafted annotations. This allows for data sets with millions of data points at comparably low effort.
While this works great for images and videos, it is
rather difficult to employ it for other data types like
historical documents.
For images and videos most people can describe
what they see fairly well, but for historical documents
it can be difficult to read characters like the examples shown in Figure 1 and to provide accurate annotations. This is even more so if the language of the document is old, no longer used, or has changed significantly compared to its modern form. Additionally, historical documents often show various degradations, making it even harder to identify the characters without the right expertise. In practice this makes annotating historical documents time-consuming and therefore also expensive, with a single page often costing 100 USD or more to transcribe.

Figure 1: Examples of historical documents with degraded or otherwise hard to identify characters.
In this paper we propose a new way of annotating
historical documents for the purpose of LSTM-RNN
based Optical Character Recognition (OCR) systems.
Annotating ambiguous characters, whether ambiguous through high levels of degradation or through the character shapes themselves, can often be done by identifying the correct word, the surrounding words, or the overall context of the sentence and text. Without this linguistic expertise it becomes difficult to make the correct decision. We propose that, instead of removing useful information from the annotation by possibly giving the wrong annotation, the transcription should contain all possibilities.
Figure 2: In combination with LSTM-RNN the CTC loss function considers all possible alignments between network outputs and ground truth. Compared to the classic case (left), fuzzy ground truth (right) effectively increases the number of possible alignments (red).

LSTM-RNN based OCR systems can use "fuzzy ground truth" without a significant loss in overall accuracy. This allows for transcribing historical documents without extensive knowledge about the language and requires only basic knowledge about the character shapes. At the same time it reduces the overall annotation time, because ambiguous characters do not have to be resolved using additional tools and time.
In Section 2 we explain how LSTM-RNNs can handle fuzzy annotation compared to normal annotation. Section 3 describes the experimental setup and introduces the data set. Section 4 presents the results, and in Section 5 we conclude and give an outlook on further research topics.
2 Long Short-Term Memory Recurrent Neural Networks
A Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is an RNN consisting of LSTM cells. They were introduced to solve the problem of vanishing gradients during gradient descent training of classical RNNs (Werbos, 1990; Hochreiter and Schmidhuber, 1997). LSTM-RNNs excel at sequence learning tasks and can align unsegmented data with their corresponding ground truth when combined with the Connectionist Temporal Classification (CTC) loss (Graves et al., 2006). These properties make them especially useful for OCR, where they produce state of the art results (Ul-Hasan et al., 2013; Karayil et al., 2015; Simistira et al., 2015).
An LSTM-cell has three gating mechanisms that reg-
ulate the impact of the input via the input gate, the
previous cell state via the forget gate and the output
via the output gate. Especially the forget gate and the cell's internal state allow it to "remember" information through multiple time steps and similarly "remember" errors backwards through time during gradient descent optimization.
The state equations for each LSTM-cell are as follows:

$f_t = \sigma(w_{xf} \cdot x_t + w_{hf} \cdot h_{t-1} + b_f)$   (1)
$i_t = \sigma(w_{xi} \cdot x_t + w_{hi} \cdot h_{t-1} + b_i)$   (2)
$v_t = \tanh(w_{xv} \cdot x_t + w_{hv} \cdot h_{t-1} + b_v)$   (3)
$o_t = \sigma(w_{xo} \cdot x_t + w_{ho} \cdot h_{t-1} + b_o)$   (4)
$C_t = f_t \cdot C_{t-1} + i_t \cdot v_t$   (5)
$y_t = o_t \cdot \tanh(C_t)$   (6)

where
$f$, $i$ and $o$ are the forget, input and output gate, $C$ is the cell state and $y$ the cell output, while $t$ iterates over all time steps,
$b_j$, $j \in \{f, i, o, v\}$ are the bias units for the forget, input and output gate and the input squashing,
$w_{ij}$ are the weight connections between $i$ and $j$,
$\sigma$ is the logistic sigmoid function given as $\sigma(x) = \frac{1}{1 + \exp(-x)}$,
and $\tanh$ is the hyperbolic tangent function $\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$.
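To make the notation concrete, the following NumPy sketch implements a single forward step of equations (1)-(6); the dictionary-based weight layout is purely illustrative and does not correspond to any particular framework.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w, b):
    # One LSTM time step following equations (1)-(6).
    # w and b are dicts of weight matrices and bias vectors,
    # e.g. w['xf'] maps the input to the forget gate.
    f_t = sigmoid(w['xf'] @ x_t + w['hf'] @ h_prev + b['f'])  # (1) forget gate
    i_t = sigmoid(w['xi'] @ x_t + w['hi'] @ h_prev + b['i'])  # (2) input gate
    v_t = np.tanh(w['xv'] @ x_t + w['hv'] @ h_prev + b['v'])  # (3) input squashing
    o_t = sigmoid(w['xo'] @ x_t + w['ho'] @ h_prev + b['o'])  # (4) output gate
    c_t = f_t * c_prev + i_t * v_t                            # (5) cell state
    y_t = o_t * np.tanh(c_t)                                  # (6) cell output
    return y_t, c_t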
Storing information through multiple time steps al-
lows the LSTM to learn temporal relations in the data.
In terms of OCR this means it inherently builds a sta-
tistical language model based on temporal relations of
characters and words.
The most common architecture for LSTM-RNNs is
bidirectional LSTM (BILSTM). Two LSTM-RNNs
consisting of an input layer and a hidden layer go
through the sequence from both sides simultaneously
and feed into a single Softmax layer. Since the num-
ber of LSTM-cells in the input layer has to be fixed,
all input sequences need the same input size at every
time step.
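A minimal sketch of such a bidirectional architecture in PyTorch is given below; it is not the OCRopus implementation used in our experiments, and the layer sizes are placeholders.

import torch.nn as nn

class BiLSTMRecognizer(nn.Module):
    # Bidirectional LSTM over the columns of a text line image,
    # followed by a single (log-)softmax output layer.
    def __init__(self, input_height=48, hidden_size=100, n_classes=100):
        super().__init__()
        self.lstm = nn.LSTM(input_height, hidden_size, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, n_classes)

    def forward(self, x):
        # x: (seq_len, batch, input_height), one feature vector per pixel column
        h, _ = self.lstm(x)                     # outputs of both directions, concatenated
        return self.out(h).log_softmax(dim=-1)  # per-time-step class scores

The per-time-step outputs of such a network can then be fed into the CTC loss described in the next section.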
2.1 Connectionist Temporal
Classification Loss
The Connectionist Temporal Classification (CTC) loss allows RNNs to be trained on unsegmented data (Graves et al., 2006). Similar to other loss functions, the idea is to maximize the likelihood that the output of the network aligns with the given ground truth. For this, all possible alignments between network output and ground truth have to be considered. This is done by aligning each ground truth character with a sequence of network outputs, while keeping the temporal order consistent. This can be understood as extending the ground truth to the same length as the input sequence by adding "blank" characters that refer to no output, or by duplicating the same character multiple times, which is equivalent to multiple consecutive inputs referring to the same character in the transcription. For example, "...aa.b." and ".aabb...", with the "blank" label ".", are two possible extensions of the transcription "ab" to a sequence length of 8. The likelihood of a labeling l is therefore the sum over the probabilities of all possible extensions T(l) to the length of the network output, given the network output x:
$p(l|x) = \sum_{\pi \in T(l)} p(\pi|x)$   (7)

By using a forward-backward algorithm one can then calculate $p(l|x)$ as:

$p(l|x) = \sum_{n} \alpha_t(l_n)\,\beta_t(l_n)$   (8)

with $\alpha_t(l_n)$ and $\beta_t(l_n)$ being the forward and backward variables for the $n$-th symbol $l_n$ in the labeling $l$ at network output $t$. The LSTM-RNN can then be trained by maximizing the log-likelihood over all possible extensions. A more detailed mathematical discourse can be found in (Graves et al., 2006) and (Graves, 2012).
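For intuition, the sketch below evaluates equation (7) for a tiny example by brute-force enumeration of all network output sequences that collapse to the target labeling; practical implementations use the forward-backward recursion instead, and all names here are illustrative.

import numpy as np
from itertools import groupby, product

def ctc_likelihood_bruteforce(outputs, label, blank=0):
    # p(l|x): sum of p(pi|x) over all paths pi that collapse to `label`
    # (equation 7). Only feasible for very short sequences.
    # outputs: (T, n_classes) matrix of per-time-step probabilities.
    T, n_classes = outputs.shape
    total = 0.0
    for pi in product(range(n_classes), repeat=T):
        # collapse repeats, then drop blanks, e.g. "..aa.b." -> "ab"
        collapsed = [k for k, _ in groupby(pi) if k != blank]
        if collapsed == list(label):
            total += np.prod([outputs[t, pi[t]] for t in range(T)])
    return total

# Example: likelihood of the labeling "ab" (classes 1 and 2, blank = 0)
# for 8 uniformly distributed output frames over 3 classes:
# outputs = np.full((8, 3), 1.0 / 3)
# ctc_likelihood_bruteforce(outputs, [1, 2])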
2.2 Training with Fuzzy Ground Truth
We propose using fuzzy ground truth for ambiguous characters rather than absolute ground truth. This can be especially useful when people without language expertise transcribe historical documents with degraded or otherwise ambiguous characters. Fuzzy ground truth means providing the CTC loss function with multiple options $l_n^1$ and $l_n^2$ for a specific element of the transcription $l_n$. Training on a wrong annotation has two negative effects: it does not reinforce the correct label, and it reinforces the wrong label instead. This means the network weights would be updated to produce the wrong output. By providing fuzzy ground truth we can prevent these wrong weight updates.
The LSTM-RNN will not be able to learn which of the
options is the correct label from the fuzzy annotation
alone, since both will be correct in terms of CTC-loss.
The intrinsic statistical language modeling of LSTM-
RNNs however will generate a bias towards the more
likely outcome.
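As a simple illustration of what fuzzy ground truth looks like on the data side (the concrete representation shown here is only an example, not a fixed file format), each position of a transcription can carry a set of allowed options:

# Fuzzy transcription of a word read as "fuit", where the second character
# is ambiguous between "u" and "n": every position lists all accepted options.
fuzzy_transcription = [{"f"}, {"u", "n"}, {"i"}, {"t"}]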
For the CTC-loss, multiple ground truth options represent additional possible extensions of the labeling (Figure 2). The forward and backward variables are defined as:

$\alpha_t(l_n) = y^t_{l_n} \cdot \begin{cases} \sum_{i=n-1}^{n} \alpha_{t-1}(i) & l_n \in \{b, l_{n-2}\} \\ \sum_{i=n-2}^{n} \alpha_{t-1}(i) & \text{else} \end{cases}$   (9)

$\beta_t(l_n) = \begin{cases} \sum_{i=n}^{n+1} \beta_{t+1}(i)\, y^t_{l_i} & l_n \in \{b, l_{n+2}\} \\ \sum_{i=n}^{n+2} \beta_{t+1}(i)\, y^t_{l_i} & \text{else} \end{cases}$   (10)
From equation 9 it becomes clear that the forward variables $\alpha_t(l_n^1)$ and $\alpha_t(l_n^2)$ only differ in the network output $y^t_{l_n^i}$. The following forward variables $\alpha_{t+1}(l_n)$ therefore get the same input from $\alpha_t(l_n^1)$ and $\alpha_t(l_n^2)$ up to the factor $y^t_{l_n^i}$. This means we can instead write

$\alpha_t(l_n^{1/2}) = y^t_{l_n^{1/2}} \cdot \begin{cases} \sum_{i=n-1}^{n} \alpha_{t-1}(i) & l_n \in \{b, l_{n-2}\} \\ \sum_{i=n-2}^{n} \alpha_{t-1}(i) & \text{else} \end{cases}$   (11)

with $y^t_{l_n^{1/2}} = y^t_{l_n^1} + y^t_{l_n^2}$.

$\beta_t(l_n^1)$ and $\beta_t(l_n^2)$ do not depend on $y^t_{l_n^i}$ and are therefore the same. For any $\beta_t(l_n)$ we can therefore also replace every $y^t_{l_i}$ with the sum $(y^t_{l_i^1} + y^t_{l_i^2})$.
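The sketch below illustrates the resulting modification of the CTC forward pass: every transcription position carries a set of allowed labels whose network outputs are summed, as in equation (11). It is a simplified illustration (probabilities are not rescaled to log-space, and the repeated-symbol condition is approximated by requiring disjoint option sets), not the implementation used for the experiments in this paper.

import numpy as np

def fuzzy_ctc_forward(outputs, fuzzy_label, blank=0):
    # CTC forward pass where each label position is a set of allowed class
    # indices; their per-time-step probabilities are summed (equation 11).
    # outputs: (T, n_classes) per-time-step probabilities.
    T = outputs.shape[0]
    # extended labeling with blanks interleaved: [b, L1, b, L2, ..., b]
    ext = [{blank}]
    for options in fuzzy_label:
        ext.append(set(options))
        ext.append({blank})
    S = len(ext)

    def emit(t, s):
        # summed output probability of all options at extended position s
        return sum(outputs[t, k] for k in ext[s])

    alpha = np.zeros((T, S))
    alpha[0, 0] = emit(0, 0)
    alpha[0, 1] = emit(0, 1)
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # the skip transition is only allowed between different non-blank
            # symbols; for option sets we require them to be disjoint
            if s >= 2 and ext[s] != {blank} and not (ext[s] & ext[s - 2]):
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * emit(t, s)
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]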
Table 1: Common confusions for historical documents in Latin script. All confusions are both ways.

    p   l   l   l   l   l   l   t   a
    P   t   1   r   i   /   r
    n   n   1   c   a   a   b   f
    m   u   i   e   o   u   h
3 EXPERIMENTAL SETUP
Due to the high cost of manual transcriptions, rather
than reannotating our data, we decided to syntheti-
cally generate fuzzy ground truth in two different sce-
narios based on existing annotations.
3.1 Scenario I
In the first scenario we consider annotations by an expert and compare them with synthetic non-expert and fuzzy ground truth annotations. In this experiment the expert's annotation has an accuracy of 100% and serves as a baseline. The non-expert's annotation has a 50% chance of correctly choosing between two options for randomly selected characters. For the fuzzy ground truth, both the wrong and the correct transcription options are given. The fuzzy ground truth will therefore contain roughly twice as many fuzzy characters as the worst case non-expert ground truth contains errors. While the ambiguous characters are selected randomly, the possible confusions are selected to be realistic; for a full list of confusions see Table 1. All confusions occur at a rate of 5% and are symmetric.
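A sketch of how such synthetic non-expert and fuzzy annotations can be generated from a perfect transcription is given below; the confusion dictionary is only a small illustrative subset of Table 1 and the function name is a placeholder.

import random

# illustrative subset of the confusion pairs in Table 1
CONFUSIONS = {"l": "t", "t": "l", "n": "m", "m": "n", "c": "e", "e": "c"}

def make_annotations(line, p_confuse=0.05, seed=None):
    # Returns a (non_expert, fuzzy) pair for one ground truth line.
    # With probability p_confuse a confusable character becomes ambiguous;
    # the non-expert then picks one of the two options at random (50% wrong),
    # while the fuzzy annotation keeps both options.
    rng = random.Random(seed)
    non_expert, fuzzy = [], []
    for ch in line:
        alt = CONFUSIONS.get(ch)
        if alt is not None and rng.random() < p_confuse:
            non_expert.append(rng.choice([ch, alt]))  # possibly the wrong label
            fuzzy.append({ch, alt})                   # keeps both options
        else:
            non_expert.append(ch)
            fuzzy.append({ch})
    return "".join(non_expert), fuzzy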
3.2 Scenario II
A second scenario for fuzzy ground truth arises from
the work of (Jenckel et al., 2016). In their setup
they combine clustering with BILSTM and manage
to improve the ground truth generated via clustering
through LSTM training.
K-Means clustering results in hard bounded clusters
assigning each point to the closest cluster center. To
generate fuzzy ground truth Gaussian mixture model
(GMM) clustering is used instead. It is a probabilis-
tic approach and well established within the machine
learning community (Reynolds, 2015). GMM clustering assigns a probability $p(y = k_i \mid x)$ that a data point $x$ belongs to cluster $k_i$. This will allow us to naturally generate fuzzy ground truth based on the posterior probabilities. The GMM algorithm optimizes the weighted sum of Gaussians $g(x|\mu, \sigma)$ with mean $\mu$ and covariance $\sigma$:

$p(x|\lambda) = \sum_{i} w_i\, g(x|\mu_i, \sigma_i)$   (12)
using an Expectation-Maximization (EM) algorithm.
The algorithm is initialized with the results of a k-
Means clustering.
Before clustering, the text lines were segmented into individual characters using a connected-component based algorithm and then resized to a uniform size of 32x32 pixels while keeping the aspect ratio constant. In both cases the algorithm follows the description given in (Jenckel et al., 2016). To reduce the high dimensionality of the raw features for the GMM, a PCA was added to the pipeline.
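A possible realization of this pipeline with scikit-learn is sketched below; the function and variable names are placeholders, and the posterior threshold of 0.05 corresponds to the value used for the experiments in Section 4.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fuzzy_labels_from_gmm(char_images, cluster_to_char, n_components=10,
                          n_clusters=200, threshold=0.05, seed=0):
    # Cluster segmented character images with PCA + GMM and return, for every
    # character, the set of labels of all clusters whose posterior probability
    # exceeds `threshold`. char_images: (n_samples, 32*32) flattened images;
    # cluster_to_char: manually assigned label for each cluster index.
    feats = PCA(n_components=n_components).fit_transform(char_images)
    gmm = GaussianMixture(n_components=n_clusters, init_params='kmeans',
                          random_state=seed).fit(feats)
    posteriors = gmm.predict_proba(feats)  # p(k_i | x) for every sample
    fuzzy = []
    for probs in posteriors:
        options = {cluster_to_char[k] for k in np.flatnonzero(probs > threshold)}
        fuzzy.append(options)
    return fuzzy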
3.3 OCRopus
As a base for the LSTM implementation we used OCRopus¹, a complete OCR software package. Its implementation has been shown to produce state of the art results while being easy to use (Breuel et al., 2013). Besides the LSTM framework, it provides us with a text line normalization method based on Gaussian filtering and affine transformation as described in (Yousefi et al., 2015).
3.4 Data
The data set consists of 100 binarized pages and the corresponding transcription of a Latin version of the 15th century novel "Narrenschiff"². One of the main challenges when working with historical documents is the comparably small size of the data sets. In order to train on a maximum of training samples, only 2 pages were used as the test data set, leaving 3263 text lines with a total of 83780 characters for training. The test set consists of 103 lines with 2876 characters.
3.5 Parameters
In both scenarios some free parameters had to be set.
For the clustering the number of components used in
PCA as well as the number of cluster centers in the
GMM had to be chosen. For the PCA the top 10
components were used. Setting the number of clus-
ter centers for the GMM clustering to k = 200 fol-
lows the same strategy as described in (Jenckel et al.,
2016). The idea is to largely overestimate the number
of clusters during clustering. During the following an-
notation of the clusters, similar clusters provided with the same label are indirectly merged.

¹ https://github.com/tmbdev/ocropy
² http://kallimachos.de/kallimachos/index.php
Table 2: Results for scenario 3.1. Comparison of the Character Error Rates (CER) for the LSTM-RNN training with perfect ground truth from an expert, worst case ground truth from a non-expert and fuzzy ground truth containing both transcription options.

              T0    T1    T2
Expert         0    2.9   2.9
Non Expert    3.3   4.2   3.8
Fuzzy         6.7   2.7   3.0
For the LSTM implementation in OCRopus some pa-
rameters have to be set as well. OCRopus only sup-
ports a single hidden layer whose size is set to 100.
Another parameter is the size of the input layer which
is equivalent to the height of the text lines. As re-
ported in (Jenckel et al., 2016), 48 is a reasonable
value. Learning rate and momentum were kept at the default values of 1e-4 and 0.9.
4 RESULTS
For evaluation of the two proposed scenarios from Sections 3.1 and 3.2, we report the Character Error Rate (CER) of the model with the lowest CER on seen and unseen data. All CER are calculated using the Levenshtein distance. For seen data we use a 3 page subset of the training data called T1. The best model on T1 is then evaluated on our 2 page test set T2. The CER on the training data set T0 describes the error introduced to the ground truth by wrong annotations. The expert's annotation serves as a baseline and is considered to be perfect (0% error). For fuzzy ground truth we instead give the percentage of characters with multiple annotations.
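For reference, the sketch below shows how the CER can be computed from the Levenshtein distance; the helper names are placeholders.

def levenshtein(a, b):
    # edit distance between strings a and b
    # (insertions, deletions and substitutions all cost 1)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(predictions, references):
    # Character Error Rate: total edit distance divided by the
    # total number of reference characters.
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    chars = sum(len(r) for r in references)
    return edits / chars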
The results for the first scenario are shown in Table 2. While the ground truth from an expert leads to state of the art CER for this data set, the non-expert's ground truth with 3.3% total error leads to 4.2% CER after training and 3.8% on the test set. This is in line with the results from (Jenckel et al., 2016). Even though the error in the ground truth was increased by 3.3%, the resulting error compared to perfect ground truth only rises by about 1%. The LSTM trained with fuzzy ground truth, however, achieves almost the same accuracy as the 100% accurate ground truth from the expert, even though 6.5% of the data was fuzzy. The differences between the model trained with the expert's transcription and the one trained with the fuzzy transcription are within the normal variance.
In the second scenario we used the pipeline described
in (Jenckel et al., 2016) but replaced the hard bounded
k-Means clustering with the probabilistic GMM clustering.

Table 3: Results for scenario 3.2. Comparison of the Character Error Rates (CER) for LSTM-RNN training after GMM clustering. In the case of fuzzy ground truth the CER on T0 is the rate of characters with multiple non-identical transcriptions.

            T0    T1    T2
1st        32.9  24.5  22.1
2nd        32.4  26.6  27.3
3rd        33.5  27.7  28.9
Fuzzy       6.5  24.3  23.1

After segmentation, clustering and manual an-
notation our accuracy was rather low with a CER of
27.4%. For the fuzzy ground truth we assigned to each character the labels of all clusters with a posterior probability higher than 0.05. For comparison we also trained LSTM models on ground truth generated by assigning to every character the label of the most probable cluster, the label of the second most probable cluster, and the label of the third most probable cluster (if available). We evaluated on the same data sets T1 and T2. The results are shown in Table 3. Due to the high CER after clustering, the final CER are still high. However, LSTM training increased the overall performance significantly, cutting the CER by one third. Choosing the second most probable cluster is the better choice for some characters, slightly improving the ground truth; however, it is also worse for other data points and leads to a worse overall performance. Training on fuzzy ground truth performed at the same level as taking the most probable cluster for each data point, while outperforming the two alternative annotations. Similar to the previous scenario, the rate of fuzzy characters in the ground truth was 6.5%.
5 CONCLUSION AND OUTLOOK
In this paper we have proposed a time-saving change in annotating historical documents using fuzzy ground truth. We also showed, at least on synthetic data, that non-experts' annotations can lead to results similar to those of experts. We further explained how LSTM-RNN with CTC can handle fuzzy ground truth and showed that training an LSTM with fuzzy ground truth, where at least one of the transcription options is the correct ground truth, does not reduce the performance significantly. We conclude that the LSTM-RNN with CTC can successfully select the correct option from the provided possibilities through statistical language modeling.
With regard to the poor results when applying GMM clustering to the segmented data, we conclude that GMM is not a good choice for this type of clustering.
However, it provided us with a natural way of generating fuzzy ground truth in the context of the "anyOCR" pipeline proposed in (Jenckel et al., 2016).
The scope of this paper only covered the feasibility
of non-expert annotations which was shown through
the use of synthetic data. Therefore we plan further
evaluation with real non-expert annotations. While
we have shown that training on fuzzy ground truth can
be beneficial in the area of historical documents, fur-
ther analysis on different scripts and document types
is needed as well.
In the future we plan to further explore the possibili-
ties of using fuzzy ground truth when training LSTM
networks, such as ground truth options with different segmentations like "m" and "in". Another future fo-
cus will be on automatically generating fuzzy ground
truth. While the proposed method reduces the need
for language experts it is still costly to annotate the
data by hand.
ACKNOWLEDGEMENTS
This work was partially funded by the BMBF (Ger-
man Federal Ministry of Education and Research),
project Kallimachos (01UG1415C).
REFERENCES
Althoff, T., Borth, D., Hees, J., and Dengel, A. (2013).
Analysis and forecasting of trending topics in online
media streams. In Proceedings of the 21st ACM inter-
national conference on Multimedia, pages 907–916.
ACM.
Graves, A. (2012). Supervised sequence labelling. In Super-
vised Sequence Labelling with Recurrent Neural Net-
works, pages 5–13. Springer.
Graves, A., Fernández, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML'06.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. In Neural Computation, pages 1735–1780.
Jenckel, M., Bukhari, S. S., and Dengel, A. (2016). anyOCR: A sequence learning based OCR system for unlabeled historical documents. In 23rd International Conference on Pattern Recognition (ICPR'16), Mexico.
Karayil, T., Ul-Hasan, A., and Breuel, T. M. (2015). A
Segmentation-Free Approach for Printed Devanagari
Script Recognition. In ICDAR, Tunisia.
Reynolds, D. (2015). Gaussian mixture models. In Ency-
clopedia of biometrics, pages 827–832. Springer.
Simistira, F., Ul-Hasan, A., Papavassiliou, V., Gatos, B.,
Katsouros, V., and Liwicki, M. (2015). Recognition of
Historical Greek Polytonic Scripts Using LSTM Net-
works. In ICDAR, Tunisia.
Breuel, T. M., Ul-Hasan, A., Al Azawi, M., and Shafait, F. (2013). High Performance OCR for Printed English and Fraktur using LSTM Networks. In ICDAR, Washington D.C., USA.
Ul-Hasan, A., Ahmed, S. B., Rashid, S. F., Shafait, F., and Breuel, T. M. (2013). Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks. In ICDAR'13, USA.
Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2011). Action recognition by dense trajectories. In CVPR, pages 3169–3176. IEEE.
Werbos, P. (1990). Backpropagation through time: what it does and how to do it. In Proceedings of the IEEE, volume 78.
You, Q., Luo, J., Jin, H., and Yang, J. (2015). Robust im-
age sentiment analysis using progressively trained and
domain transferred deep networks. In CoRR, volume
abs/1509.06041.
Yousefi, M. R., Soheili, M. R., Breuel, T. M., and Stricker,
D. (2015). A Comparison of 1D and 2D LSTM Ar-
chitectures for Recognition of Handwritten Arabic. In
DRR-XXI, USA.