
 
context-dependent error correction. Correcting such errors requires powerful language processing tools. Examples of such attempts are presented in (Meknavin et al., 1998; Tong and Evans, 1996), where sequences of parts of speech are evaluated for their likelihood of occurrence and unlikely sequences are marked as possible errors.
3 A STATISTICAL APPROACH FOR SOLVING THE OCR GAPS PROBLEM
Unlike most research, which focuses on improving the character detection rate, this paper addresses a different aspect: the recovery of text that cannot be recognized at all, either because it is too damaged or because it is simply missing. We tackle the reconstruction of damaged documents by predicting the most plausible word sets that could fill in the areas where the original words could not be recognized. From now on, these missing areas will be referred to as “gaps”. Every gap has one property that influences the accuracy of the recovery process more than any other: its dimension, usually expressed as a number of characters or words if we treat the text under analysis as a continuous stream.
The solution that we propose in this paper is intended for the recovery of text chunks representing pieces of phrases from the original document, and it is based on two assumptions. The first is related to intra-document similarity: we assume that a model of the document can be built from the existing text and that the missing text also respects this model. We consider the document model to have two components: the style model, representing the structure of the text, and the language model, describing the vocabulary used by the author, the n-grams built from these words, and the frequencies of those n-grams. These two models are combined to identify the word sets that could fit in the gaps. Two heuristics have been developed to allow us to benefit from the style model. The language model alone, however, is not sufficient: words that do not appear elsewhere in the document may occur in the gaps, and such words cannot be discovered using the document’s language model, since they are simply missing from it. This problem leads us to the Google corpus and to the second assumption: the corpus is large enough to subsume most of the language models of the documents posted on the Internet, and at the same time any word that does not appear in this corpus should not be considered a candidate to fill in the gaps.
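To make the language-model component concrete, the following is a minimal sketch of how such a model could be collected from the correctly recognized text; the tokenization and the function names are our own illustration, not part of the original system.

    from collections import Counter

    def build_language_model(text, max_n=5):
        # Collect n-gram frequency counts (n = 1..max_n) from the
        # recognized text. A naive whitespace tokenizer is assumed
        # here; the paper does not specify the actual tokenization.
        tokens = text.lower().split()
        model = {n: Counter() for n in range(1, max_n + 1)}
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                model[n][tuple(tokens[i:i + n])] += 1
        return model

    lm = build_language_model("the quick brown fox jumps over the lazy dog")
    print(lm[1][("the",)])          # unigram count: 2
    print(lm[2][("the", "quick")])  # bigram count: 1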
Taking these two assumptions to be true, our solution starts from the identified gaps and follows a few steps to identify the missing words. First, the style model of the document is used to estimate the dimension of the gap. For this purpose we consider two heuristics: the estimated character count and the estimated word count. The estimated character count is a numeric value determined from the margins and indentation of the recovered document format, from the correctly identified characters in the gap’s vicinity, and from statistical information about the document under analysis (the mean and deviation of the number of characters per phrase). This value is used to determine a minimum and a maximum number of characters that could fill in the gap. The estimated word count is also a numeric value; it uses the estimated character count together with the mean and deviation of the number of characters per word and of the number of words per phrase observed in the document. This value is used to determine a range for the number of words needed to fill in the gap.
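As an illustration only (the exact formulas are not given in the paper), the word-count range could be derived from the character-count range roughly as follows; the use of one deviation around the mean word length is our assumption.

    def estimate_word_count_range(min_chars, max_chars,
                                  mean_chars_per_word, dev_chars_per_word):
        # Shortest plausible words give the upper bound on the word
        # count, longest plausible words the lower bound; the +1
        # accounts for the space separating consecutive words.
        shortest = max(1.0, mean_chars_per_word - dev_chars_per_word)
        longest = mean_chars_per_word + dev_chars_per_word
        min_words = max(1, int(min_chars // (longest + 1)))
        max_words = max(min_words, int(max_chars // (shortest + 1)))
        return min_words, max_words

    # Example: a gap of 18-26 characters in a document whose words
    # average 4.7 characters with a deviation of 1.9.
    print(estimate_word_count_range(18, 26, 4.7, 1.9))  # (2, 6)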
Having estimated the number of words we are looking for, we can start using the language model. At this point, several heuristics can be applied. First, gaps do not usually start or end exactly at the whitespace characters that delimit distinct words, so the document can be scanned for partial words at the beginning and at the end of each gap. Using both the n-grams corpus and the words correctly identified before and after the gap, such word fragments can be expanded into the whole words they belong to.
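As a rough sketch of this fragment-expansion step (the concrete matching procedure is not detailed in the paper), a fragment at either edge of a gap could be expanded by prefix or suffix matching against the corpus vocabulary; the interface below is hypothetical.

    def complete_fragment(fragment, vocabulary, at_gap_start=True):
        # at_gap_start=True : the fragment precedes the gap, so it
        #                     is a word prefix ("recogn" -> ...).
        # at_gap_start=False: the fragment follows the gap, so it
        #                     is a word suffix ("tion" -> ...).
        if at_gap_start:
            return [w for w in vocabulary if w.startswith(fragment)]
        return [w for w in vocabulary if w.endswith(fragment)]

    vocab = {"recognized", "recognition", "recovery", "document"}
    print(complete_fragment("recogn", vocab))        # prefix matches
    print(complete_fragment("tion", vocab, False))   # suffix matches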
Since the maximum n-gram size in the corpus is five, the detection starts from the four words immediately preceding the gap in order to identify the first word missing from it. We consider these four words to be the first four words of a 5-gram and try to identify the most probable word to follow this combination. The same method is applied to the four words immediately following the gap in order to determine the last word missing from it: these words are considered the last four words of a 5-gram, and we try to detect the most probable word to precede this combination.
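A minimal sketch of this boundary prediction follows, assuming the 5-gram corpus is available as a mapping from five-word tuples to occurrence counts (loading the actual Google Web 5-gram data is out of scope here, and a real implementation would use an indexed lookup rather than a full scan).

    def predict_next_word(context4, fivegram_counts):
        # Rank the fifth words of all 5-grams whose first four
        # words match the context just before the gap.
        candidates = {g[4]: c for g, c in fivegram_counts.items()
                      if g[:4] == tuple(context4)}
        return max(candidates, key=candidates.get) if candidates else None

    def predict_prev_word(context4, fivegram_counts):
        # Symmetric case: rank the first words of all 5-grams whose
        # last four words match the context just after the gap.
        candidates = {g[0]: c for g, c in fivegram_counts.items()
                      if g[1:] == tuple(context4)}
        return max(candidates, key=candidates.get) if candidates else None

    # Toy counts standing in for the 5-grams corpus.
    counts = {("the", "gap", "in", "the", "text"): 12,
              ("the", "gap", "in", "the", "page"): 3}
    print(predict_next_word(["the", "gap", "in", "the"], counts))  # text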