Historical Document Processing: A Survey of Techniques,
Tools, and Trends
James Philips and Nasseh Tabrizi
Department of Computer Science, East Carolina University, Greenville, North Carolina, U.S.A.
Keywords: Historical Document Processing, Archival Data, Handwriting Recognition, Optical Character Recognition,
Digital Humanities.
Abstract: Historical Document Processing (HDP) is the process of digitizing written material from the past for future
use by historians and other scholars. It incorporates algorithms and software tools from computer vision,
document analysis and recognition, natural language processing, and machine learning to convert images of
ancient manuscripts and early printed texts into a digital format usable in data mining and information retrieval
systems. As libraries and other cultural heritage institutions have scanned their historical document archives,
the need to transcribe the full text from these collections has become acute. Since HDP encompasses multiple
sub-domains of computer science, knowledge relevant to its purpose is scattered across numerous journals
and conference proceedings. This paper surveys the major phases of HDP, discussing standard algorithms,
tools, and datasets, and finally suggests directions for further research.
1 INTRODUCTION
Historical Document Processing (HDP) is the process
of digitizing written and printed material from the
past for future use by historians. Digitizing historical
documents preserves them by ensuring a digital
version will persist even if the original document is
destroyed or damaged. Because many historical
documents reside in physical libraries and archives, access to
them is often restricted. Digitization of these historical
documents thus expands scholars’ access to archival
collections as the images are published online and
even allows them to engage these texts in new ways
through digital interfaces (Chandna et al 2016;
Tabrizi 2008). HDP incorporates algorithms and
software tools from various subfields of computer
science to convert images of ancient manuscripts and
early printed texts into a digital format usable in data
mining and information retrieval systems. Drawing
on techniques and tools from computer vision,
document analysis and recognition, natural language
processing, and machine learning, HDP is a hybrid
field. This paper surveys the major phases of HDP,
discussing techniques, tools, and trends. After an
explanation of the authors’ research methodology,
digitization challenges, techniques, standard
algorithms, tools, and datasets are discussed, and the
paper finally concludes with suggestions for further
research.
2 METHODOLOGY
2.1 Research Rationale
This paper examines the evolution of the techniques,
tools, and trends within the HDP field over the past
twenty-two years (1998-2020). The authors believe
this extended scope is warranted: No prior study was
found that summarized the HDP workflow for both
handwritten archival documents and printed texts.
Prior studies have focused on one dimension of the
problem, such as layout analysis, image binarization,
or actual transcription. Very few discussed aspects of
a full historical document processing workflow.
2.2 Article Selection Criteria
This research focuses on historical documents written
in Latin, medieval and early modern European
vernaculars, and English, reflecting the current state of
the HDP field: most of the work on historical archival
documents has focused on western scripts and
manuscripts. From an initial collection of more than 300
articles, 50 were selected for this survey. This
survey emphasizes the computer science dimension
of HDP, especially machine learning methodologies,
software tools, and research datasets. The authors
envision other computer scientists, digital humanists,
and software developers interested in HDP and
cultural heritage as their primary audience.
3 TECHNIQUES AND TOOLS
3.1 Archival Document Types and
Digitization Challenges
Historical documents, broadly defined, include any
handwritten or mechanically produced document
from the human past. Many have been preserved in
the archives of museums and libraries, which have
pursued extensive digitization efforts to preserve
these invaluable cultural heritage artifacts. An
enduring goal within the field of document image
analysis has been achieving highly accurate tools for
automatic layout analysis and transcription (Baechler
and Ingold 2010).
A typical HDP workflow proceeds through
several sequential phases.
Figure 1: The steps in a conventional HDP workflow for
handwritten and printed documents.
After image acquisition, the document image is
pre-processed and handwritten text recognition
(HTR) or optical character recognition (OCR) is
performed. This phase yields a transcription of the
document’s text. This transcription is the input to
natural language processing and information retrieval
tasks.
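As a minimal, hypothetical illustration of this workflow, the sketch below chains the phases together in Python; the fixed threshold, the projection-based line segmentation, and the stubbed recognition function are illustrative placeholders for the methods surveyed in Section 3.2, not the tooling of any cited study.

# Hypothetical sketch of the workflow in Figure 1; the threshold, the crude
# line segmentation, and the stubbed recognizer are placeholders only.
import numpy as np

def binarize(gray, threshold=128):
    return gray < threshold            # True where ink (dark pixels)

def segment_lines(ink):
    row_ink = ink.sum(axis=1)          # ink pixels per row
    lines, start = [], None
    for i, count in enumerate(row_ink):
        if count > 0 and start is None:
            start = i                  # a text line begins
        elif count == 0 and start is not None:
            lines.append(ink[start:i]) # a text line ends
            start = None
    return lines

def recognize(line_img):
    return "<line transcription>"      # placeholder for an HTR/OCR engine

def process_page(gray_image):
    lines = segment_lines(binarize(gray_image))
    return "\n".join(recognize(line) for line in lines)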
Prior to the 15th century, the majority of historical
documents were texts produced by hand. After
Gutenberg's printing press, published works were
produced on printing presses while private
documents continued to be written by hand. This
dichotomy in document types, beginning in the Early
Modern era, means that diverse document types must
be handled differently during the HDP process.
The eclectic nature of handwritten documents
challenges automatic software tools. Medieval
manuscripts are often more legible, and the clearer
inter-character segmentation of minuscule script makes it
easier to train machine learning-based classifiers than
the continuous cursive of early modern handwritten
texts. However, significant challenges in medieval
documents are their complex layouts and intricate
artwork (Simistira et al 2016). Continuous cursive
script in Early Modern documents is thus challenging
during the HTR phase, while medieval documents
present greater challenges during layout analysis.
Other challenges with historical documents include
bleed-through from the opposite side of the page,
illegible handwriting, and poor image resolution.
The earliest printed texts, known as incunabula,
have posed the most difficulties for accurate, digital
transcription of printed works (Rydberg-Cox 2009).
Their fonts differ vastly from modern typefaces, and
modern OCR software produces poor recognition
results. The extensive use of textual ligatures also
poses difficulties since they declined in use as
printing standardized. After 1500 greater uniformity
came to printed books, and by the early 19th century,
the mass production of printed texts led to books that
modern layout analysis and OCR tools could reliably
and consistently digitize at scale, as seen in the
digitization efforts of the Internet Archive and Google
Books in partnership with libraries (Bamman and
Smith 2012). This opens up possibilities for
Information Retrieval in archival “Big Data.”
3.2 Techniques
3.2.1 Pre-processing Phase
The pre-processing phase normally includes
binarization/thresholding of the document image,
adjustment for skew, layout analysis, and text-line
segmentation. Numerous binarization methods have been
proposed, including those of Su et al. 2010, Ben
Messaoud et al. 2012, and Roe and Mello 2013.
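As an illustration of the kind of thresholding these studies address, the sketch below contrasts a global (Otsu) and a local (Sauvola) method using scikit-image; the input file name is hypothetical, and the cited studies propose more specialized methods for degraded historical pages.

# Illustrative global vs. local thresholding with scikit-image; the input
# file name is hypothetical and the cited studies go well beyond this.
from skimage import io
from skimage.filters import threshold_otsu, threshold_sauvola

gray = io.imread("manuscript_page.png", as_gray=True)
# Global threshold: a single value for the whole page.
binary_otsu = gray > threshold_otsu(gray)
# Local threshold: adapts to stains, uneven lighting, and bleed-through.
binary_sauvola = gray > threshold_sauvola(gray, window_size=25)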
Dewarping and skew-reduction methods have been
proposed in studies including Bukhari et al. 2011, with
performance analysis conducted by Rahnemoonfar
and Plale 2013. Recent work has also examined the use
of neural networks to restore degraded historical
documents (Raha & Chanda 2019). Layout analysis is
one of the most challenging aspects of HDP. Because
of their complex page layouts, medieval documents in
particular have been the focus of many studies proposing
layout analysis tools, algorithms, and benchmark
datasets. Baechler and Ingold proposed a
layout model for medieval documents. Using
manuscript images from the E-codices project, they
modeled a medieval manuscript page as several
“layers”: document text, marginal comments,
degradation, and decoration. Overlapping polygonal
boxes are used to identify the constituent layers and
are represented in software via XML.
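The sketch below shows, in simplified form, how such a layered page model might be serialized as XML polygons; the element and layer names are illustrative and do not reproduce Baechler and Ingold's actual schema.

# Simplified, hypothetical serialization of a layered page model as XML;
# element and layer names are illustrative, not the published schema.
import xml.etree.ElementTree as ET

page = ET.Element("Page", {"source": "e-codices_sample.jpg"})
layers = {
    "MainText": "120,200 980,200 980,1450 120,1450",
    "MarginalNote": "1010,300 1180,300 1180,520 1010,520",
    "Decoration": "90,90 400,90 400,180 90,180",
}
for layer, points in layers.items():
    region = ET.SubElement(page, "Region", {"layer": layer})
    ET.SubElement(region, "Polygon", {"points": points})

print(ET.tostring(page, encoding="unicode"))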
Gatos et al 2014 developed a layout analysis and
line segmentation software module designed to
produce input to HTR tools. Their work was
incorporated into the Transcriptorium project’s
Transkribus software.
Pintus, Yang, and Rushmeier likewise explore
layout analysis and text-line extraction with an
emphasis on medieval manuscripts. Pintus et al. 2015
address the problem of initially estimating text-line
height. They segment the text regions coarsely and
apply an SVM classifier to produce a refined text-line
identification. They note that their method is not
adversely affected by skewed texts and usually does
not require any alignment correction.
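The sketch below illustrates the general idea of refining coarse text-line candidates with an SVM using scikit-learn; the features and labels are invented for illustration and do not reproduce Pintus et al.'s pipeline.

# Hypothetical refinement of coarse text-line candidates with an SVM;
# the feature choice and labels are invented for illustration.
import numpy as np
from sklearn.svm import SVC

# Each candidate region: [height_px, ink_density, normalized_vertical_pos]
X_train = np.array([[42, 0.31, 0.20], [40, 0.28, 0.35],
                    [12, 0.05, 0.10], [90, 0.45, 0.80]])
y_train = np.array([1, 1, 0, 0])           # 1 = genuine text line, 0 = noise

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

candidates = np.array([[41, 0.30, 0.50], [15, 0.04, 0.90]])
print(clf.predict(candidates))              # keep only regions predicted as 1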
Yang et al (2017) extend their work on text-height
estimation and layout analysis to an automated
system that can work on a per-page basis rather than
per manuscript. They propose three algorithms, one
for text-line extraction, one for text block extraction,
and one for identifying “special components.” These
use semi-supervised machine learning techniques and
focus on medieval manuscripts originally produced
by professional scribes. Their results demonstrate that
the desideratum of automatic, algorithmic layout
analysis with high precision, recall, and accuracy is
drawing nearer to reality.
3.2.2 Handwritten Text Recognition
Due to the inherent challenges of HTR for historical
documents, some studies (Rath and Manmatha 2006;
Fischer et al. 2012) explored keyword spotting as an
alternative to producing a complete transcription.
Early keyword spotting techniques approached the task
as an image similarity problem: clusters of word images
are created and compared for similarity using pairwise
distances. Fischer et al. explored several data-driven
techniques for both keyword spotting and complete
transcription (Fischer et al. 2009, 2012, 2014). One
problem with word-based template matching is that
the system can only recognize a word for which it has
a reference image; rare, out-of-vocabulary words
cannot be recognized.

As a solution, the HisDoc project applied
character-based recognition with Hidden Markov
Models (HMMs) to keyword spotting.
For their keyword spotting analysis, they compared
the character-based system with a baseline Dynamic
Time Warping (DTW) system. Using mean average
precision as their evaluation metric, they found that
the HMMs outperformed the DTW system under both
local and global thresholds on the George
Washington and Parzival datasets (GW: 79.28/62.08
vs. 54.08/43.95; Parzival: 88.15/85.53 vs. 36.85/39.22).
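For reference, the idea behind a DTW word-spotting baseline is sketched below: word images are reduced to per-column feature sequences and aligned by dynamic programming, with smaller distances indicating likelier keyword matches. The feature choice here is a generic illustration, not the HisDoc baseline configuration.

# Generic DTW distance between two word images represented as sequences of
# per-column features (ink count and center of mass); illustrative only.
import numpy as np

def column_features(binary_word_img):
    # binary_word_img: boolean array, True = background, False = ink.
    ink = (~binary_word_img).astype(float)
    counts = ink.sum(axis=0)                       # ink pixels per column
    rows = np.arange(ink.shape[0])[:, None]
    centers = (ink * rows).sum(axis=0) / np.maximum(counts, 1)
    return np.stack([counts, centers], axis=1)     # one vector per column

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]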
The HisDoc project also compared HMM and neural
network performance on the University of Bern's
Historical Document Database (IAM-HistDB) for
producing full transcriptions. They used a
Bidirectional Long Short-Term Memory (BLSTM)
architecture that mitigates the vanishing gradient
problem of other recurrent network designs. Each of
the nine geometric features used for training
corresponds to an individual node in the input layer
of the network, and the output nodes correspond to
the individual characters of the character set; the
probability of a word is computed from the character
probabilities. According to Fischer, Naji et al. 2014,
word error rates were significantly better for the
neural network architecture than for the HMM system
on all three sets of historical document images:
St. Gall 6.2% vs. 10.6%, Parzival 6.7% vs. 15.5%,
and George Washington 18.1% vs. 24.1%.
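A hedged PyTorch sketch of a recognizer of this general shape is given below; the nine-dimensional input mirrors the geometric features mentioned above, but the layer sizes and character-set size are illustrative assumptions rather than the HisDoc configuration, and decoding with CTC or a lexicon is omitted.

# Illustrative BLSTM sequence recognizer with a CTC-style output layer;
# hidden size and character-set size are assumptions, not HisDoc's values.
import torch
import torch.nn as nn

class BLSTMRecognizer(nn.Module):
    def __init__(self, n_features=9, n_chars=80, hidden=100):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, num_layers=1,
                             bidirectional=True, batch_first=True)
        # +1 output for the CTC "blank" symbol
        self.out = nn.Linear(2 * hidden, n_chars + 1)

    def forward(self, x):                   # x: (batch, time, n_features)
        h, _ = self.blstm(x)
        return self.out(h).log_softmax(-1)  # per-frame character log-probs

model = BLSTMRecognizer()
frames = torch.randn(1, 120, 9)             # one text line, 120 feature frames
log_probs = model(frames)                    # decode with CTC and/or a lexicon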
Neural networks continue to be the ascendant
technique within the field for HTR. Granell et al.
2018 examined the use of convolutional recurrent
neural networks for late medieval documents. The
convolutional layers perform automatic feature
extraction, which obviates the need for handcrafted
geometric or graph-based features such as those used
by HisDoc. For deep neural network architectures to
be competitive with other techniques in time
efficiency, they require significant computational
power, typically obtained by using a GPU
rather than a CPU. Working with the Rodrigo dataset,
they achieved their best results using a convolutional
neural network supplemented with a 10-gram
character language-model. Their word error rate was
14%.
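To illustrate how a character language model can rescore recognition hypotheses, the toy sketch below uses a 3-gram model with add-one smoothing rather than the 10-gram model reported by Granell et al.; the training string and hypotheses are stand-ins.

# Toy character n-gram scorer for rescoring recognition hypotheses; a 3-gram
# with add-one smoothing is used for brevity, and the corpus is a stand-in.
from collections import Counter
import math

def train_char_ngrams(text, n=3):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return grams, sum(grams.values()), n

def score(hypothesis, model):
    grams, total, n = model
    logp = 0.0
    for i in range(len(hypothesis) - n + 1):
        g = hypothesis[i:i + n]
        logp += math.log((grams[g] + 1) / (total + len(grams)))
    return logp

lm = train_char_ngrams("en el anno del sennor de mill e quinientos")
hypotheses = ["del sennor", "dcl scnnor"]           # e.g. from the optical model
print(max(hypotheses, key=lambda h: score(h, lm)))  # prefer the likelier reading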
3.2.3 Historical Optical Character Recognition
As with HTR, historical OCR can be accomplished
with several techniques. However, neural network-
based methods have become more prominent in the
software libraries and literature recently. Since
printed texts in western languages rarely use scripts
with interconnected letters, segmentation-based
approaches that are not practical for HTR are feasible
for OCR. Nevertheless, historical OCR is
drastically more difficult than modern OCR
(Springmann and Lüdeling 2017). One challenge is
the vast variability of early typography. Historical
printings were not laid out with modern, digital precision,
and a plethora of early fonts were utilized across
Europe (Christy et al 2017). A multitude of typeface
families exist, including Gothic script, Antiqua, and
Fraktur. Although printing techniques standardized in
the early 19th century, printed documents from the
15th-19th centuries are too idiosyncratic for OCR machine
learning classifiers trained using modern, digital
fonts. Among the most difficult historical texts for
OCR are incunabula due to their extensive use of
ligatures, typographical abbreviations derived from
medieval manuscripts that do not always have a
corresponding equivalent in Unicode, and
unpredictable word-hyphenation across lines
(Rydberg-Cox 2009). The model-training limitations
of commercial software such as ABBYY FineReader
mean that researchers must resort to open source
alternatives such as Tesseract or OCRopus
(Springmann et al. 2014). Tesseract's classifier can be
trained either with synthetic data (digital fonts that
resemble historical ones) or with images of character
glyphs cropped from actual historical text images.
Tesseract and OCRopus both offer neural network
classifiers. Although high accuracies are achievable
with neural networks, some of the same caveats from
their use for HTR apply: these classifiers require
substantial training data, with the corollary that
extensive ground truth must be created manually,
and training is computationally intensive on
CPUs (Springmann et al 2014).
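As an example of the open-source route, the sketch below calls Tesseract through the pytesseract wrapper with its stock Fraktur model ("frk"), where installed; the image path is hypothetical, and serious historical work typically requires training a custom model as described above.

# Calling Tesseract through the pytesseract wrapper with the stock Fraktur
# model ("frk"); the image path is hypothetical, and real projects usually
# train a custom model on historical glyph images as discussed above.
import pytesseract
from PIL import Image

page = Image.open("1542_printed_page.png")
text = pytesseract.image_to_string(page, lang="frk")
print(text)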
3.2.4 Software Tools and Datasets
Figure 2: A taxonomy of HDP datasets based on use case
and time-period.
Several software tools and datasets (Figure 2) exist
for researchers and practitioners pursuing historical
document processing. For historical OCR, these
include the ABBYY FineReader, Tesseract, OCRopus, and
AnyOCR tools and, primarily, the IMPACT dataset of
early modern European printed texts. Few generic
tools exist for historical HTR tasks, but researchers
do have access to the IAM-HistDB and Rodrigo
datasets. These variously contain images of full
manuscript pages, individual words and characters,
and corresponding ground truth for medieval Latin
and early German and Spanish manuscripts. The
IAM-HistDB also contains the Washington dataset
for historical cursive handwriting recognition. In
addition to software and datasets for the transcription
phase of historical document processing, the Aletheia
tool and the IMPACT and DIVA-HisDB datasets can
be used for researching layout analysis and other pre-
processing tasks. The rest of this section surveys the
characteristics of the available datasets and discusses
training, testing, and evaluation methodologies.
Few options exist for researchers seeking to work
with medieval manuscript transcription. Two
medieval datasets are included in the IAM-HistDB.
The St. Gall dataset features images of a ninth century
Latin manuscript written in Carolingian script by a
single scribe. Fischer et al. used the images together with
corresponding page transcriptions previously published
in J.P. Migne's Patrologia Latina to create the
dataset (Fischer et al. 2011). In addition to page
images and transcription, the dataset includes
extensive ground-truth: text-lines and individual
word images have been binarized, normalized, and
annotated with line-level transcription. Originally
developed by the HisDoc project, the dataset has
since been used in further research.
While Latin was the dominant ecclesiastical and
scholarly language of Europe during the medieval
period, some literature was produced in the
vernacular languages. Two datasets exist for
researchers investigating HTR in those vernacular
texts, specifically the Old German and Old Spanish
dialects. Included with the IAM-HistDB, the Parzival
dataset contains manuscript pages of an Arthurian
epic poem written in Old German from the 13th and
15th centuries. The 47 Parzival images are drawn
from three different manuscripts produced by three
scribes using Gothic minuscule script in multi-
column layouts. Like the St. Gall set, the Parzival
collection includes page images and transcription
along with ground truth annotation. Text-lines and
single word images have been binarized, normalized,
and annotated with a full line-level transcription.
Known as the Rodrigo corpus, the Old Spanish
dataset is larger than either the St. Gall or Parzival
datasets at 853 pages. To create it for HTR and line
extraction research, researchers based at the
Universitat Politecnica de Valencia used digitized
images of an Old Spanish historical chronicle, the
“Historia de España del arzobispo Don Rodrigo”
(Serrano et al 2011). The manuscript dates from 1545
and thus coincides with the early era of printing press
technology. Although the creators of
the dataset published results of running a hybrid
HMM-based image classifier with a language model,
Granell et al have used the dataset with deep neural
networks (Granell et al 2018).
The Washington dataset is the third dataset
included in the IAM HistDB. Drawn from the George
Washington papers at the US Library of Congress, its
script is continuous cursive in the English language.
First used by Rath and Manmatha, the dataset was
supplemented by the HisDoc project with individual word and
text-line images and corresponding ground truth
transcriptions for each line and word (Fischer et al
2010). The Washington dataset is especially valuable
for cursive HTR in historical documents.
The previously described IAM-HistDB datasets
dealt exclusively with historical HTR. As a
benchmark for evaluating pre-processing
performance on medieval documents, the HisDoc
project created the DIVA-HisDB. This dataset
contains 150 page images from three different
manuscripts with accompanying ground truth for
binarization, layout analysis, and line segmentation
(Simistira et al 2016). Two of the manuscripts, from
the 11th century, are written in Carolingian script; the
third, from the 14th century, is written in Chancery script.
All three manuscripts have a single column of text
surrounded by extensive marginal annotation. Some
pages have decorative initial characters. The layouts
are highly complex. The ground truth concentrates on
identifying spatial and color-based features. Like the
IMPACT dataset, the ground truth is encoded in the
PAGE XML format. The dataset is freely available on
the HisDoc project website.
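The sketch below shows, in general terms, how line-level ground truth in a PAGE-style XML file might be read; the element names follow the overall PAGE structure (TextLine, Coords, TextEquiv/Unicode), but namespaces and attributes vary by schema version, so the code matches tags by local name and should be adapted to the actual files.

# Generic reader for line-level ground truth in a PAGE-style XML file;
# PAGE uses versioned namespaces, so tags are matched by local name here.
import xml.etree.ElementTree as ET

def local(tag):
    return tag.rsplit("}", 1)[-1]          # strip any "{namespace}" prefix

def read_lines(path):
    lines = []
    for elem in ET.parse(path).iter():
        if local(elem.tag) == "TextLine":
            coords = text = None
            for child in elem.iter():
                if local(child.tag) == "Coords":
                    coords = child.get("points")
                elif local(child.tag) == "Unicode":
                    text = child.text
            lines.append((coords, text))
    return lines

# e.g. read_lines("page_0001.xml") -> [(polygon point string, transcription), ...]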
While most of the HTR and OCR datasets
discussed in this section have focused on Latin
languages or Latin script, a dataset has been created
for HTR and OCR of historical polytonic (i.e., with
multiple accents) Greek texts. Introduced by Gatos et
al., the dataset was developed for research on word
and character recognition as well as line and word
segmentation (Gatos et al 2015). It features 399 pages
of both handwritten and printed Greek text, mostly
from the nineteenth and twentieth centuries.
3.2.5 Methodologies for Evaluation
Several metrics are used to evaluate the performance
of a historical document processing system. For
handwritten text recognition systems that use image
similarity, precision and recall are two important
performance measures: precision measures how many of
the retrieved results are actually relevant, while recall
measures how many of all the relevant results in the
dataset were actually retrieved. For machine learning systems,
transcription performance is evaluated using the
character error rate, word error rate, or sometimes
both if a language model is utilized to enhance the
recognition results. Layout analysis performance is
assessed using the line error rate and segmentation
error rate (Bosch et al 2014).
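A minimal sketch of how these error rates are typically computed from edit distance is given below; it assumes plain reference and hypothesis strings and is not tied to any particular evaluation toolkit.

# Character and word error rates as normalized Levenshtein distances between
# a reference transcription and a recognizer hypothesis.
def edit_distance(ref, hyp):
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(cer("in principio", "in prinicpio"), wer("in principio", "in prinicpio"))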
3.2.6 Software Systems
Cultural heritage practitioners seeking production-
ready tools for their own historical document
preservation projects have two software systems
available that provide a full suite of tools for pre-
processing, machine learning training, and
transcription. These two tools are DIVA-Services
(Würsch et al 2017) and the Transkribus platform
from the EU-sponsored READ project (Kahle et al
2017).
DIVA-Services and Transkribus offer similar
feature sets to the cultural heritage community.
However, they should not be seen as direct
competitors. As a cross-platform software service,
Transkribus is likely the better solution for archivists
seeking an integrated HDP toolchain that requires
minimal or no custom software to be developed.
Since it offers multiple tools for each step in the HDP
process and supports standard formats such as PAGE,
it is ideally suited for archivists who need a reliable
service for a historical document transcription project
that allows support for machine learning training on
new datasets. Due to the platform’s hybrid open
source-closed source nature and lack of tool
modularity (users cannot substitute their own libraries
directly for a Transkribus one), users who need more
flexibility and alignment with open source values
may find DIVA-Services more suited to their needs.
Since DIVA-Services provides separate API calls for
each discrete step in the HDP workflow, this service
is more suitable for computer science researchers and
archivists who need to integrate existing methods
alongside custom software. DIVA-Services and
Transkribus thus offer complementary approaches
that meet the different use cases of members of the
cultural heritage community.
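As a rough illustration of the DIVA-Services style of integration, the snippet below posts a page image to a hypothetical RESTful endpoint with the requests library; the host, route, and response fields are invented for illustration and should be replaced with those in the actual DIVA-Services documentation.

# Hypothetical call to a DIVA-Services-style RESTful endpoint; the host,
# route, and response fields are invented and do not reflect the real API.
import requests

with open("page_0001.jpg", "rb") as f:
    resp = requests.post("https://diva.example.org/binarization",
                         files={"image": f}, timeout=60)
resp.raise_for_status()
result = resp.json()
print(result)   # e.g. a link to the binarized output image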
4 RECENT TRENDS
Within the past decade, several research projects have
advanced the field of historical document processing
through the creation of datasets, the exploration of
improved techniques, and the application of existing
tools to digital archival document preservation
efforts. The HisDoc family of projects has made
significant contributions to algorithms, tools, and
datasets for medieval manuscripts. The inaugural
HisDoc project lasted from 2009 to 2013 and
concurrently studied three phases of HDP: layout
analysis, HTR, and document indexing and
information retrieval (Fischer, Naji et al. 2014).
While much of their research focused on medieval
documents and scripts, their goal was to create
“generic methods for historical manuscript
processing that can principally be applied to any
script and language” (83).
HisDoc 2.0 was conceived as a direct extension of
the original HisDoc project. Concentrated at the
University of Fribourg, the focus of this project was
advancing digital paleography for archival
documents (Garz et al 2015). The HisDoc 2.0
researchers recognized that historical manuscripts are
complex creations and require multi-faceted solutions
from computer science. Because they were written by
multiple scribes and have inconsistent layouts, many documents do
not conform to the ideal characteristics explored
during the first HisDoc project. With HisDoc 2.0, the
researchers investigated combining text localization,
script discrimination, and scribal recognition into a
unified system that could be utilized on historical
documents of varying genres and time periods. The
HisDoc 2.0 project made several contributions to the
field. One was DivaServices, a web service offering
historical document processing algorithms with a
RESTful (representational state transfer) API to
circumvent the problem many developers and
practitioners face with the installation of complicated
software tools, libraries, and dependencies (Würsch
et al 2016). Another contribution was the DivaDesk
digital workspace, GUI-based software that makes
computer science algorithms for ground truth
creation, layout analysis, and other common tasks
accessible for humanities scholars (Eichenberger et al
2014). The project explored ground truth creation,
text region and layout analysis with neural networks,
and aspects of scribal identification. Finally, the
project produced and released the Diva-HisDB
dataset.
The IMPACT project was a European Union-
funded initiative to develop expertise and
infrastructure for libraries digitizing the textual
heritage of Europe. Despite the rapid rate of text
digitization by European libraries, the availability of
full-text transcriptions was not keeping pace. With
many libraries solving the same digitization
challenges, solutions to problems were being
duplicated, leading to inefficient use of time and
resources. Moreover, existing OCR software
produced unsatisfactory accuracy for historical
printed books. Through the formation of a pan-
European consortium of libraries, the IMPACT
project consolidated digitization expertise and
developed tools, resources, and best practices to
surmount the challenges of digitization on such an
extensive scale. The project lasted from 2008 to 2012.
Among its achievements were the monumental
creation of the IMPACT dataset of historical
document images with ground truth for text and
layout analysis, the development of software tools for
layout analysis, ground truth creation, and optical
character recognition post-correction, the proposal of
the PAGE format, and the exploration of techniques
for OCR, layout analysis, and image correction
(Papadopoulos 2013; Pletschacher &
Antonacopoulos 2010; Vobl et al 2014).
The Early Modern OCR Project (eMOP) was an
effort by researchers at Texas A & M University to
produce transcriptions of the Early English Books
Online and 18th Century Collections Online
databases. Containing nearly 45 million pages
collectively, these two commercial databases are
essential tools for historians studying the literature of
the 15th through the 18th centuries. The project
produced accurate transcriptions paired with the
corresponding text images and made them available for
crowd-sourced post-correction on the 18thConnect
website using the TypeWright tool; it developed a
true “Big Data” infrastructure to take advantage of
high-performance computing resources for both OCR
and image post-processing. Another important
contribution was the pioneering work on a historical
font database (Heil and Samuelson 2013).
5 CONCLUSIONS
Historical Document Processing transforms scanned
documents from the past into digital transcriptions for
the future. After pre-processing through binarization,
layout analysis, and line segmentation, the images of
individual lines are converted into digital text through
either HTR or OCR. Within the past decade, first
conventional machine learning techniques using
handcrafted features and, more recently, neural
network-driven methodologies have become viable
solutions for producing accurate transcriptions of
historical texts, from medieval manuscripts and
fifteenth-century incunabula through early modern
printed works. Projects such as IMPACT,
Transcriptorium, eMOP, and HisDoc have made
significant contributions to advancing the scholarship
of the field and creating vital datasets and software
tools. The combined expertise of computer scientists,
digital humanists, historians, and archivists will be
necessary to meet the challenge of HDP for the future.
As archives continue to be digitized, the volume and
variety of archival data and the velocity of its creation
clearly indicate that this is a “Big Data” challenge.
Accurate transcriptions are a prerequisite for
meaningful information retrieval in archival
documents. The creation of robust tools and
infrastructure for this new phase of historical
document processing will be the mandate of all those
who wish to preserve humanity’s historical textual
heritage in the digital age.
ACKNOWLEDGEMENTS
This research is supported in part by REU grant
#1560037 from the National Science Foundation.
REFERENCES
Baechler, M., & Ingold, R. (2010). Medieval manuscript
layout model. Proceedings of the 10th ACM
Symposium on Document Engineering - DocEng ’10,
275.
Bamman, D., & Smith, D. (2012). Extracting two thousand
years of latin from a million book library. Journal on
Computing and Cultural Heritage, 5(1), 1–13.
Ben Messaoud, I., Amiri, H., El Abed, H., & Märgner, V.
(2012). Binarization effects on results of text-line
segmentation methods applied on historical documents.
2012 11th International Conference on Information
Science, Signal Processing and Their Applications
(ISSPA), 1092–1097.
Bosch, V., Toselli, A. H., & Vidal, E. (2014).
Semiautomatic text baseline detection in large
historical handwritten documents. 2014 14th
International Conference on Frontiers in Handwriting
Recognition, 690–695.
Breuel, T. M., Ul-Hasan, A., Al-Azawi, M. A., & Shafait,
F. (2013). High-performance ocr for printed english and
fraktur using lstm networks. 2013 12th International
Conference on Document Analysis and Recognition,
683–687.
Bukhari, S. S., Kadi, A., Jouneh, M. A., Mir, F. M., &
Dengel, A. (2017). Anyocr: An open-source ocr system
for historical archives. 2017 14th IAPR International
Conference on Document Analysis and Recognition
(ICDAR), 305–310.
Bukhari, S. S., Shafait, F., & Breuel, T. M. (2012). An
image based performance evaluation method for page
dewarping algorithms using sift features. In M.
Iwamura & F. Shafait (Eds.), Camera-Based Document
Analysis and Recognition (pp. 138–149). Springer.
Chandna, S., Rindone, F., Dachsbacher, C., & Stotzka, R.
(2016). Quantitative exploration of large medieval
manuscripts data for the codicological research. 2016
IEEE 6th Symposium on Large Data Analysis and
Visualization (LDAV), 20–28.
Christy, M., Gupta, A., Grumbach, E., Mandell, L., Furuta,
R., & Gutierrez-Osuna, R. (2018). Mass digitization of
early modern texts with optical character recognition.
Journal on Computing and Cultural Heritage, 11(1), 1–
25.
Clausner, C., Pletschacher, S., & Antonacopoulos, A.
(2011). Aletheia—An advanced document layout and
text ground-truthing system for production
environments. 2011 International Conference on
Document Analysis and Recognition, 48–52.
Fischer, A., Baechler, M., Garz, A., Liwicki, M., & Ingold,
R. (2014). A combined system for text line extraction
and handwriting recognition in historical documents.
2014 11th IAPR International Workshop on Document
Analysis Systems, 71–75.
Fischer, A., Bunke, H., Naji, N., Savoy, J., Baechler, M., &
Ingold, R. (2012a). The hisdoc project. Automatic
analysis, recognition, and retrieval of handwritten
historical documents for digital libraries.
Fischer, A., Frinken, V., Fornés, A., & Bunke, H. (2011).
Transcription alignment of Latin manuscripts using
hidden Markov models. Proceedings of the 2011
Workshop on Historical Document Imaging and
Processing - HIP ’11, 29.
Fischer, A., Indermühle, E., Bunke, H., Viehhauser, G., &
Stolz, M. (2010). Ground truth creation for handwriting
recognition in historical documents. Proceedings of the
9th IAPR International Workshop on Document
Analysis Systems, 3–10.
Fischer, A., Indermuhle, E., Frinken, V., & Bunke, H.
(2011). Hmm-based alignment of inaccurate
transcriptions for historical documents. 2011
International Conference on Document Analysis and
Recognition, 53–57.
Fischer, A., Keller, A., Frinken, V., & Bunke, H. (2012).
Lexicon-free handwritten word spotting using character
HMMs. Pattern Recognition Letters, 33(7), 934–942.
Fischer, A., Riesen, K., & Bunke, H. (2010). Graph
similarity features for hmm-based handwriting
recognition in historical documents. 2010 12th
International Conference on Frontiers in Handwriting
Recognition, 253–258.
Fischer, A., Wuthrich, M., Liwicki, M., Frinken, V., Bunke,
H., Viehhauser, G., & Stolz, M. (2009). Automatic
transcription of handwritten medieval documents. 2009
15th International Conference on Virtual Systems and
Multimedia, 137–142.
Frinken, V., Fischer, A., Baumgartner, M., & Bunke, H.
(2014). Keyword spotting for self-training of BLSTM
NN based handwriting recognition systems. Elsevier.
Frinken, V., Fischer, A., & Martínez-Hinarejos, C.-D.
(2013). Handwriting recognition in historical
documents using very large vocabularies. Proceedings
of the 2nd International Workshop on Historical
Document Imaging and Processing - HIP ’13, 67.
Gatos, B., Louloudis, G., & Stamatopoulos, N. (2014).
Segmentation of historical handwritten documents into
text zones and text lines. 2014 14th International
Conference on Frontiers in Handwriting Recognition,
464–469.
Granell, E., Chammas, E., Likforman-Sulem, L., Martínez-
Hinarejos, C.-D., Mokbel, C., & Cîrstea, B.-I. (2018).
Transcription of spanish historical handwritten
documents with deep neural networks. Journal of
Imaging, 4(1), 15.
Heil, J., & Samuelson, T. (2013). Book history in the early
modern ocr project, or, bringing balance to the force.
Journal for Early Modern Cultural Studies, 13(4), 90–
103.
Jenckel, M., Bukhari, S. S., & Dengel, A. (2016). Anyocr:
A sequence learning based ocr system for unlabeled
historical documents. 2016 23rd International
Conference on Pattern Recognition (ICPR), 4035–
4040.
Kahle, P., Colutto, S., Hackl, G., & Muhlberger, G. (2017).
Transkribus—A service platform for transcription,
recognition and retrieval of historical documents. 2017
14th IAPR International Conference on Document
Analysis and Recognition (ICDAR), 19–24.
Le Bourgeois, F., & Emptoz, H. (2007). Debora: Digital
access to books of the renaissance. International
Journal of Document Analysis and Recognition
(IJDAR), 9(2–4), 193–221.
Mas, J., Rodriguez, J. A., Karatzas, D., Sanchez, G., &
Llados, J. (2008). Histosketch: A semi-automatic
annotation tool for archival documents. 2008 The
Eighth IAPR International Workshop on Document
Analysis Systems, 517–524.
Meyer, E. T., & Eccles, K. (2016). The impacts of digital
collections: Early english books online & house of
commons parliamentary papers (SSRN Scholarly
Paper ID 2740299). Social Science Research Network.
Papadopoulos, C., Pletschacher, S., Clausner, C., &
Antonacopoulos, A. (2013). The IMPACT dataset of
historical document images. Proceedings of the 2nd
International Workshop on Historical Document
Imaging and Processing - HIP ’13, 123.
Pintus, R., Yang, Y., & Rushmeier, H. (2015). Athena:
Automatic text height extraction for the analysis of text
lines in old handwritten manuscripts. Journal on
Computing and Cultural Heritage, 8(1), 1–25.
Pletschacher, S., & Antonacopoulos, A. (2010). The page
(Page analysis and ground-truth elements) format
framework. 2010 20th International Conference on
Pattern Recognition, 257–260.
Raha, P., & Chanda, B. (2019). Restoration of historical
document images using convolutional neural networks.
2019 IEEE Region 10 Symposium (TENSYMP), 56–
61.
Rahnemoonfar, M., & Plale, B. (2013). Automatic
performance evaluation of dewarping methods in large
scale digitization of historical documents. Proceedings
of the 13th ACM/IEEE-CS Joint Conference on Digital
Libraries - JCDL ’13, 331.
Rath, T. M., & Manmatha, R. (2007). Word spotting for
historical documents. International Journal of
Document Analysis and Recognition (IJDAR), 9(2),
139–152.
Roe, E., & Mello, C. A. B. (2013). Binarization of color
historical document images using local image
equalization and xdog. 2013 12th International
Conference on Document Analysis and Recognition,
205–209.
Rydberg-Cox, J. A. (2009). Digitizing latin incunabula:
Challenges, methods, and possibilities. Digital
Humanities Quarterly, 003(1).
Sastry, P. N., & Krishnan, R. (2012). A data acquisition and
analysis system for palm leaf documents in Telugu.
Proceeding of the Workshop on Document Analysis and
Recognition, 139–146.
Serrano, N., Castro, F., & Juan, A. (2010, May). The
rodrigo database. Proceedings of the Seventh
International Conference on Language Resources and
Evaluation (LREC’10). LREC 2010, Valletta, Malta.
Shafait, F. (2009). Document image analysis with
OCRopus. 2009 IEEE 13th International Multitopic
Conference, 1–6.
Simistira, F., Seuret, M., Eichenberger, N., Garz, A.,
Liwicki, M., & Ingold, R. (2016). Diva-hisdb: A
precisely annotated large dataset of challenging
medieval manuscripts. 2016 15th International
Conference on Frontiers in Handwriting Recognition
(ICFHR), 471–476.
Springmann, U., & Lüdeling, A. (2017). OCR of historical
printings with an application to building diachronic
corpora: A case study using the RIDGES herbal corpus.
Digital Humanities Quarterly, 011(2).
Springmann, U., Najock, D., Morgenroth, H., Schmid, H.,
Gotscharek, A., & Fink, F. (2014). OCR of historical
printings of Latin texts: Problems, prospects, progress.
Proceedings of the First International Conference on
Digital Access to Textual Cultural Heritage, 71–75.
Su, B., Lu, S., & Tan, C. L. (2010). Binarization of
historical document images using the local maximum
and minimum. Proceedings of the 8th IAPR
International Workshop on Document Analysis Systems
- DAS ’10, 159–166.
Tabrizi, M. H. N. (2008). Digital archiving and data mining
of historic document. 2008 International Conference on
Advanced Computer Theory and Engineering, 19–23.
Ul-Hasan, A., Bukhari, S. S., & Dengel, A. (2016).
Ocroract: A sequence learning ocr system trained on
isolated characters. 2016 12th IAPR Workshop on
Document Analysis Systems (DAS), 174–179.
Vobl, T., Gotscharek, A., Reffle, U., Ringlstetter, C., &
Schulz, K. U. (2014). PoCoTo—An open source system
for efficient interactive postcorrection of OCRed
historical texts. Proceedings of the First International
Conference on Digital Access to Textual Cultural
Heritage, 57–61.
Wei, H., Chen, K., Nicolaou, A., Liwicki, M., & Ingold, R.
(2014). Investigation of feature selection for historical
document layout analysis. 2014 4th International
Conference on Image Processing Theory, Tools and
Applications (IPTA), 1–6.
Würsch, M., Ingold, R., & Liwicki, M. (2016).
Divaservices—A restful web service for document
image analysis methods. Digital Scholarship in the
Humanities, fqw051.
Yang, Y., Pintus, R., Gobbetti, E., & Rushmeier, H. (2017).
Automatic single page-based algorithms for medieval
manuscript analysis. Journal on Computing and
Cultural Heritage, 10(2), 1–22.