Contract Metadata Identification in Czech Scanned Documents
Hien Thi Ha¹, Aleš Horák¹ᵃ and Minh Tuan Bui²
¹ NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
² Le Quy Don Technical University, Vietnam
Keywords:
Information Extraction, Scanned Documents, Document Metadata, Contract Metadata Extraction, Czech.
Abstract:
Although nowadays digital-born documents are generally prevalent, exchange of business documents often
consists in processing their scanned image form as a general human-readable format with one-to-one corre-
spondence to paper documents. Bulk processing of such scanned documents then requires human intervention
to extract and enter the main document metadata. In this paper, we present the design and evaluation of a
contract processing module in the OCRMiner system. The information extraction process combines
layout properties with text analysis as input to rule-based extraction with confidence score propagation.
The first results, evaluated on public Czech contract documents, reach an item extraction accuracy of
almost 88%.
1 INTRODUCTION
A contract is a legally binding document that recog-
nizes and governs the rights and duties of the parties
to an agreement (Ryan, 2006). Organizations such
as companies, institutions, or governmental offices
must monitor and handle contracts for a wide range
of tasks (Milosevic et al., 2004). These include
checking whether the obligations binding on a party, e.g. payments,
are fulfilled, tracking taxation duties
of valuable contracts, or notifying about the effects of
legislation amendments. An important part of such tasks can
be automated by extracting contract metadata such
as parties involved, dates, or legislation references.
However, these pieces of information are mostly entered
into management systems manually, which is costly and
time-consuming.
In a previous work (Ha et al., 2018), the OCR-
Miner system designed to process scanned invoices
based on the combination of layout and text analysis
was presented. In the current work, we adapt the sys-
tem to extract metadata elements from contracts based
on a small development set. We also offer an evalua-
tion with detailed analysis of errors.
The next section gives an overview of the state of the
art in the legal document processing domain. Section 3
presents a description of the system components
with the adaptation to contractual documents.
In Section 4, we offer a detailed evaluation of the system
with a Czech contract dataset.
ᵃ https://orcid.org/0000-0001-6348-109X
2 RELATED WORKS
Research in legal document content classification has
recently focused on extracting and classifying clauses,
particularly deontic clauses (obligations, prohibitions,
and permissions). (Neill et al., 2017) classify deontic
clauses using an ensemble of bidirectional long short-term
memory networks (BiLSTMs) with Google News
embeddings as input. They trained specific
legal domain word and phrase embeddings and com-
pared the result with other neural and non-neural clas-
sifiers. In a similar task, (Chalkidis et al., 2018) use
word embeddings and part-of-speech (POS) tag em-
beddings trained on an English contract dataset and
pre-trained token shape embeddings. The network is
also based on BiLSTM but in a hierarchical architec-
ture along with self-attention mechanism to improve
training time and accuracy of the classifiers.
In terms of information extraction, (Kwok and
Nguyen, 2006) proposed a general template based
framework to extract data from PDF contracts. A pat-
tern for each contract data item in a contract type in-
cludes data tag, number of words and location (page,
paragraph, line, and word numbers). A document
type, which is determined by the beginning and ending
patterns, identifies a pattern matrix and a list of
contract data tags. There is no specific example of either
a contract data pattern or a document type pattern
to illustrate the idea in the paper.
Ha, H., Horák, A. and Bui, M.
Contract Metadata Identification in Czech Scanned Documents.
DOI: 10.5220/0010243807950802
In Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART 2021) - Volume 2, pages 795-802
ISBN: 978-989-758-484-8
Copyright © 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
In (Winter and Rinderle-Ma, 2018) and (Dragoni et al., 2016),
natural language processing (NLP)
techniques are used to detect constraints and their
relations, or rules, in legal documents. In the former,
constraints are detected by modal verbs (shall,
should, must). These constraints are grouped either by
term frequencies or by related subjects based on sentence
structure or external information. In each group,
the similarity between each pair of constraints is computed
from the cosine distance of the constraints' word vectors
to detect redundant, subsumed, and conflicting constraints.
(Dragoni et al., 2016) use NLP tools to
extract rules from legal text. First, they identify deontic
components (prohibition, permission, obligation)
using a lightweight deontic ontology. Then, these
components are combined to create rules using a
pattern-based model.
The most related works are (Chalkidis et al., 2017;
Chalkidis and Androutsopoulos, 2017). In these
works, the authors treat the extraction of contract
elements such as contract title, clause headings, parties,
dates, values, or legislation references as a sequence
labeling task, similar to e.g. named entity
recognition (NER). A separate sliding window classifier is
used for each element type to classify each token of
pre-defined extraction zones as positive if it is a part
of a contract element and negative otherwise. In the
former work, they use Logistic Regression or linear
Support Vector Machine (SVM) models. The
features involve word embeddings and POS tag embeddings,
both pre-trained on a contract dataset, plus
hand-crafted features. With the same approach, but
using BiLSTM-based models instead of linear ones
and with the hand-crafted features replaced by
token-shape embeddings, the latter work improves on the
previous result. Their best macro-averaged F1 score is
0.88 using a relaxed match. For the contract parties,
only the organization name is extracted. The extraction
zones, which extend up to 20 tokens before and after
a specified keyword, are explicitly marked in each
training and test contract. The system also needs a
large amount of annotated data for training the models.
3 METHODOLOGY
The OCRMiner system pipeline is illustrated in Figure 1.
Modules specific for invoice analysis and information
extraction were introduced and evaluated
in (Ha et al., 2018). Each piece of information is
extracted based on a weighted combination of layout
and text analysis. The text analysis involves a series
of annotations to detect keywords and data types
based on either patterns or learning models. Firstly,
the contract image is recognized by an OCR tool¹ to
obtain words and word positions (bounding boxes).
Then, the physical layout including hierarchical elements
(lines and blocks) of the pages, block positions
in the page and relative positions with neighboring
blocks, is built by the layout analysis module.
Figure 1: The processing pipeline.
From this point, annotations are added by annotation
modules. They involve the title, keywords, structural
data types, named entities, and parts of addresses. For
example, the title text is characterized by the
biggest font size, usually center alignment, and
the presence of the keyword contract (‘smlouva’ or its
variants in Czech). The first two features are based on
layout attributes. For the last one, the text lines are
parsed to obtain words and their index forms (lemmata)
before searching for the title keyword. Each
characteristic increases the confidence score of the
item detection. Finally, the candidate with the highest
confidence score is marked as the title. If more than
one candidate has the same confidence score, then the
first candidate in the reading order, i.e. the one closest
to the top of the page, is selected.
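The title selection described above can be sketched as follows. This is only an illustration under stated assumptions: the `TextLine` structure, the equal weight of one point per characteristic, and the function names are hypothetical and are not taken from OCRMiner.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextLine:
    text: str
    lemmas: List[str]   # index forms of the words, e.g. from a lemmatizer
    font_size: float
    centered: bool
    top: float          # vertical position; smaller means closer to the page top


TITLE_LEMMA = "smlouva"  # 'contract'


def title_score(line: TextLine, max_font_size: float) -> int:
    """Add one point per satisfied title characteristic (assumed equal weights)."""
    score = 0
    if line.font_size >= max_font_size:
        score += 1
    if line.centered:
        score += 1
    if TITLE_LEMMA in line.lemmas:
        score += 1
    return score


def pick_title(lines: List[TextLine]) -> Optional[TextLine]:
    """Return the best-scoring line; ties go to the line closest to the page top."""
    if not lines:
        return None
    max_font = max(line.font_size for line in lines)
    return min(lines, key=lambda l: (-title_score(l, max_font), l.top))
```

Sorting by the pair (negative score, vertical position) implements both the highest-score rule and the reading-order tie-break in a single comparison.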
Keyword annotation looks for markers of the desired
data, for example contract number (‘smlouva
číslo’), date (‘dne’), address (‘se sídlem’), etc.
The list of keywords is prepared based on the most
frequent words and word bigrams of the contract
dataset adapted using the development set. The keyword
search takes into account possible small OCR
errors, i.e. it allows flexible similarity matching
(see (Ha, 2019) for details). The data annotation
module searches for structural data types such as a
date, VAT number, or legislation reference using regular
expressions. In each contract, entity mentions
(e.g. an organization (ORG), a person (PER), or a location
(LOC)) play an important role, especially in
contract party detection. OCRMiner currently uses
a named entity recognition module based on the Slavic
BERT model for 4 languages (Bulgarian, Czech, Polish,
and Russian) (Arkhipov et al., 2019), which extends
the multilingual BERT model by adding a CRF
layer tuned for Slavic languages using Wikipedia and
news articles. To improve address recognition, an
extra module based on the global address parser Libpostal
(Barrentine et al., 2020) is used to detect parts
of addresses, such as road/street name, postcode, city,
state, or country.
¹ The open source OCR system Tesseract (Smith, 2007;
Smith et al., 2020) is currently used in OCRMiner.
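To illustrate keyword search that tolerates small OCR errors, the following minimal sketch slides a window over the tokens of a text line and compares it with a keyword using the similarity ratio of Python's standard `difflib`. It is not the matching algorithm of (Ha, 2019); the threshold of 0.8 and the function name are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def find_keyword(line: str, keyword: str,
                 threshold: float = 0.8) -> Optional[Tuple[int, float]]:
    """Return (token index, similarity) of the best window matching `keyword`.

    The window has as many tokens as the keyword, so a bigram keyword such as
    'smlouva číslo' is compared with every pair of neighbouring tokens.
    Returns None when no window reaches the (assumed) similarity threshold.
    """
    tokens = line.lower().split()
    size = len(keyword.split())
    best = None
    for i in range(len(tokens) - size + 1):
        window = " ".join(tokens[i:i + size])
        ratio = SequenceMatcher(None, window, keyword.lower()).ratio()
        if ratio >= threshold and (best is None or ratio > best[1]):
            best = (i, ratio)
    return best
```

With this sketch, an OCR output such as "Smlouva čislo 12/2020" (with a misread diacritic) still matches the keyword ‘smlouva číslo’ at token position 0, while an unrelated line returns no match.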
After the annotations, each block is assigned a
block type in the logical structure analysis based on
the information gained in the preceding steps using a
set of logical rules. These rules are human readable
and easy to edit. The reasoning here mimics the hu-
man decisions based on visual inspection of the doc-
ument.
The information extraction module concludes the
processing to present the identified pieces of information.
For each extracted item, the module first looks
for the item “anchor” in the text, i.e. the corresponding
keywords or blocks. Then, in the surroundings
of the keyword position, the algorithm searches for
the appropriate data type, e.g. a “date” for the invoice
date item. The surroundings are limited to the text next
to the keyword on the same text line, the text line
to the right, or the text line below it. The exact position of the
item value is decided by a score weighting function
based on the criteria that the block/line contains the
data type and does not contain other keywords. Some
types of data can be found without keywords, such as
ORG(anization), PER(son), VAT number, or legislation
references. Contract parties are extracted only
from blocks identified as the block type “party”,
i.e. a block containing at least one keyword from the
group of organization, address, contact person, company
id, VAT number, or bank information, or at least
two named entity entries in the corresponding classes
(PER, ORG, LOC, CITY, COUNTRY, VAT NUMBER).
Before parsing a party’s information in a block,
text blocks that may belong to the same party but that
are separated either by physical distance or by covered
lines in the block are joined together using logical
rules. The principle here is that if consecutive
blocks contain non-overlapping parts of a party’s information,
then they should be merged together. Each
extracted party is assigned a confidence score corresponding
to the amount of identified labeled information
(ORG, PER, VAT number, company id, or role)
in the block.
Table 1: Text statistics of the evaluation contract dataset.
                 dev        test       total
documents         10         102         112
pages             36         589         625
blocks           430       8,451       8,881
lines         16,587   2,426,298   2,442,885
words        147,154   4,911,953   5,059,107
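The party block rule, the merging principle, and the party confidence score described above can be sketched as follows. The `Block` structure, the label names, and the choice of counting distinct labels as the confidence score are assumptions for illustration; the actual OCRMiner rules are not given in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

PARTY_KEYWORD_GROUPS = {"organization", "address", "contact_person",
                        "company_id", "vat_number", "bank"}
PARTY_ENTITY_CLASSES = {"PER", "ORG", "LOC", "CITY", "COUNTRY", "VAT_NUMBER"}
SCORED_LABELS = ("ORG", "PER", "VAT_NUMBER", "COMPANY_ID", "ROLE")


@dataclass
class Block:
    keyword_groups: Set[str] = field(default_factory=set)
    entities: List[Tuple[str, str]] = field(default_factory=list)  # (label, text)


def is_party_block(block: Block) -> bool:
    """At least one party keyword, or at least two entities of the party classes."""
    if block.keyword_groups & PARTY_KEYWORD_GROUPS:
        return True
    hits = sum(1 for label, _ in block.entities if label in PARTY_ENTITY_CLASSES)
    return hits >= 2


def can_merge(a: Block, b: Block) -> bool:
    """Consecutive blocks are merge candidates if their labels do not overlap."""
    return not ({l for l, _ in a.entities} & {l for l, _ in b.entities})


def merge(a: Block, b: Block) -> Block:
    return Block(a.keyword_groups | b.keyword_groups, a.entities + b.entities)


def party_confidence(block: Block) -> int:
    """Assumed scoring: one point per distinct scored label found in the block."""
    labels = {label for label, _ in block.entities}
    return sum(1 for label in SCORED_LABELS if label in labels)
```

In this sketch, a block holding an organization name and a company id can be merged with a following block that holds only a representative and a VAT number, and the merged party then receives a higher confidence score than either block alone.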
4 EXPERIMENTS
4.1 Dataset
The dataset used for development and evaluation of
the contract analysis module of OCRMiner comes
from the official state registry of Czech public contracts².
The data obtained from the website include
contract texts (in PDF) and metadata files (in XML).
The registry contains not only contracts but also appendices,
price lists, invoices, etc. Therefore, a 2-step
filter is applied to select contracts only. The first
step automatically filters out documents based on the
filename and the text content. The filename usually
reflects the content, so files with names containing
obj (“objednávka”, order), ceník or cenová
nabídka (price list), or příloha (appendix) have been
removed. The remaining files have then been converted
into OCR text. If the text does not contain the keyword
‘smlouva’ (contract), then the document is also
filtered out. The second step involves a manual check.
Finally, 112 contracts were selected randomly
to be annotated (by one annotator) as the gold standard
data for the thorough evaluation. Ten documents are used
as a development set and the remaining ones form the
test set. Text statistics of the final datasets are listed
in Table 1.
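The automatic first step of the filter can be sketched as below. The diacritics-insensitive comparison and the treatment of underscores and hyphens as spaces are assumptions added for the example, and the plain substring test for ‘smlouva’ is a simplification that ignores inflected forms.

```python
import unicodedata

# Filename fragments taken from the description above, stored without diacritics.
EXCLUDED_NAME_PARTS = ("obj", "cenik", "cenova nabidka", "priloha")
REQUIRED_TEXT_KEYWORD = "smlouva"


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and treat '_' and '-' as spaces (assumed)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.replace("_", " ").replace("-", " ")


def passes_filename_filter(filename: str) -> bool:
    name = normalize(filename)
    return not any(part in name for part in EXCLUDED_NAME_PARTS)


def passes_text_filter(ocr_text: str) -> bool:
    return REQUIRED_TEXT_KEYWORD in normalize(ocr_text)


def keep_document(filename: str, ocr_text: str) -> bool:
    """A document survives the first step only if both checks pass."""
    return passes_filename_filter(filename) and passes_text_filter(ocr_text)
```

Documents kept by `keep_document` would still go through the manual check of the second step.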
Although the contract metadata are available, a
further step is still needed to prepare the gold standard
data for evaluation. First, the metadata do
not contain all the information that is to be extracted,
such as a representative person or the role of a contract
party. Second, since the registry metadata were entered
manually through the available forms, they are
in different formats compared to the contract text, especially
the dates and addresses. Third, some pieces
of information appear in the metadata but not in the
contract text. For example, contract numbers in some
cases are not stated in the original contract but in the
metadata only. Moreover, in many contracts, private
information is covered, such as an account number or
contact details. So, after converting the registry metadata
file into the desired format, the data are manually
examined before becoming the ground truth for the
evaluation.
² https://smlouvy.gov.cz/
Table 2: Identified items in the contract texts.
Item             in dev  in test  Example
title                 9      102  “Smlouva o poskytování služeb” (supply of services contract)
contract type        10      100  “poskytování služeb” (supply of services)
legislation          33      547  “§ 1746 a násl. zákona č. 89/2012 Sb., občanský zákoník” (§ 1746 et seq. Act No. 89/2012 Coll., Civil Code)
contract number       7       58  “VODA/ZA20-4023”
contract date         8       78  “10.1.2020”
company name         13      175  “TESCO SW a.s.”
representative       13      164  “Josefem Tesaříkem” (by Josef Tesařík)
address              21      194  “tř. Kosmonautů 1288/1, Hodolany, Olomouc, PSČ 779 00”
vat number           10      102  “CZ699000785”
company id           19      191  “25892533”
bank name             6       56  “Česká spořitelna, a.s.”
account number        4       49  “1303699319/0800”
role                 19      194  “poskytovatel” (supplier)
4.2 Information to Extract
The detected and extracted pieces of the contract in-
formation are summarized in Table 2. Specifically,
the contract date is the closest date that all parties have
signed the contract. Usually, it appears at the end of
the contract, before the signatures. If the signature
dates are different then the later one is extracted. A
contract party is a group of information, involving or-
ganization, address, company id, VAT number, a com-
pany representative, a party role in the contract, bank
name, or an account number. The party role is of-
ten stated at the beginning of the party text block, e.g.
zhotovitel’/contractor and objednatel’/customer, or
after the keyword d
´
ale jen’/hereinafter. A full exam-
ple of information extracted from the first page of a
contract is illustrated in Figure 4 in the Appendix.
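As a small illustration of the contract date rule (take the latest of the signature dates), the sketch below parses dates written in the day.month.year form of Table 2 and returns the latest one. The regular expression and the function names are assumptions; the date patterns actually used in OCRMiner are not listed in this paper.

```python
import re
from datetime import date
from typing import List, Optional

# Dates such as "10.1.2020" or "15. 1. 2020" (day.month.year); assumed pattern.
DATE_RE = re.compile(r"\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b")


def parse_dates(text: str) -> List[date]:
    """Collect all valid calendar dates found in the text."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # e.g. an OCR misread producing month 13
    return found


def contract_date(signature_area_text: str) -> Optional[date]:
    """Return the latest signature date, or None if no date was recognized."""
    dates = parse_dates(signature_area_text)
    return max(dates) if dates else None
```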
4.3 Results
Within the evaluation process, each piece of extracted
information is evaluated as a match, a partial match,
or a mismatch. For all fields except organization and
address, a match means an exact match. For these
two exceptions, a match may ignore a piece of
the gold standard information which is not crucial for
the company or address identification.
For example, the following occurrences of an organization's
full official name are considered a match:
ground truth: Czech Airlines Technics, a.s.
extracted: Czech Airlines Technics
Unlike the previous use case of invoice
information extraction, where the parties can always
be classified into a seller and a buyer, in contracts the
number of parties is not predetermined. Therefore,
the extraction process needs to take each text block
containing an organization’s information as possible
contract party information. In the evaluation phase,
each gold standard contract party is compared to each
extracted party. The result is recorded for the extracted party
sharing the most information with the gold standard party.
This means that the evaluation does not search for each
individual piece of contract party information in the extracted
data as a whole and return a match if such a piece is
found. The evaluation is strict in the sense that
even if a sought piece of information is extracted, but
in a different party block, the result is a mismatch.
Some works use a relaxed match, i.e. if the
extracted information matches the ground truth above a
threshold, e.g. 80%, then it is considered a (relaxed)
match. The importance of the missing piece
is ignored there. To give an example, if the contract
date ground truth is “1.12.2019” and the extracted
date is “31.12.2019”, that makes only a 10% difference.
However, in the context, the second date was
meant as the payment due date, so it should be
considered a mismatch instead of a match.
Due to such complications, the evaluation is first pre-processed
automatically using approximate comparisons
based on the Levenshtein distance, then examined
manually.
Table 3: Test set evaluation results.
Result           Items   Percentage
Match            1,631       81.14%
Partial match      137        6.82%
Mismatch           242       12.04%
Total            2,010      100.00%
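The automatic pre-processing step and the party alignment can be sketched as follows. The Levenshtein distance is the standard dynamic programming formulation; normalizing it by the longer string and representing a party as a dictionary of fields are assumptions made for this example.

```python
from typing import Dict, List, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def relative_difference(gold: str, extracted: str) -> float:
    """Edit distance divided by the length of the longer string (assumed)."""
    longest = max(len(gold), len(extracted))
    return levenshtein(gold, extracted) / longest if longest else 0.0


def best_matching_party(gold_party: Dict[str, str],
                        extracted_parties: List[Dict[str, str]]
                        ) -> Optional[Dict[str, str]]:
    """Pick the extracted party sharing the most field values with the gold one."""
    def shared(party: Dict[str, str]) -> int:
        return sum(1 for key, value in gold_party.items()
                   if party.get(key) == value)
    return max(extracted_parties, key=shared, default=None)
```

For the date example above, `relative_difference("1.12.2019", "31.12.2019")` is 0.1, i.e. the 10% difference that a relaxed match would accept, which is why the approximate comparison is followed by a manual examination.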
The evaluation results of the OCRMiner contract
module with the test set are presented in Table 3. Al-
together, almost 88% of gold standard information
was extracted, with 81% in the exact expected form
and approx. 7% with minor differences. Just 12% of
items were not identified or identified wrongly.
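The percentages quoted above follow directly from the item counts in Table 3, as the following short check shows (the variable names are chosen for the example only).

```python
# Item counts from Table 3 (test set).
counts = {"match": 1631, "partial match": 137, "mismatch": 242}

total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Items extracted either exactly or with minor differences.
extracted_share = round(
    100 * (counts["match"] + counts["partial match"]) / total, 2)

print(total)            # 2010
print(shares)           # {'match': 81.14, 'partial match': 6.82, 'mismatch': 12.04}
print(extracted_share)  # 87.96
```

The sum of exact and partial matches, 1,768 of 2,010 items, gives the 87.96% reported as "almost 88%".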
A detailed evaluation of the individual item types
is illustrated in Figure 2. Addresses and contract types
have the highest accuracy of 94.3% and 93% respec-
tively. In contrast, contract dates and party roles dis-
play the highest number of mismatches with 30.8%
and 26.3%. The legislation reference field contains
the highest number of minor errors (partial matches)
of 15.7%.
In the following section, a detailed error analysis
of 50 contracts in the test set identifies and explains
the causes of both minor and major mismatches.
4.4 Error Analysis
In the OCRMiner extraction pipeline, the data to extract
are identified by keywords, data format, or text
position. If a keyword is found, then the extraction
module looks for the appropriate data item around
the keyword based on the visual layout, especially in
relation to the keyword position. Non-keyword data
are detected by a pattern (e.g. the VAT number) or a
pre-trained model (e.g. organization or person name).
Therefore, the error causes are classified into the following
categories: OCR errors, keyword errors (there is
either no keyword in the text or a new keyword which
did not appear in the development set), layout errors,
named entity recognition (NER) errors, block misidentification
(extracted in another block), and others.
A layout error means the keyword is found but the
data text line is not found in the expected relative
position, either due to a typing error or the layout
match criteria. In the detection of a company name,
a keyword is often elided, thus the extraction relies
on the NER annotation or the company name’s ending.
However, the dataset originates in the public sector
where many parties are public organizations of a
specific area (e.g. city, village, etc.). In consequence,
NER recognizes only part of the organization name as
a location instead of the whole chunk as an organization.
Examples are ‘Město Hostinné’ (Hostinné town)
or ‘Služby města Náměště nad Oslavou’ (Town services
of Náměšť nad Oslavou). As mentioned above,
in the evaluation of parties the comparison is made for the
whole group instead of searching for each piece of
information separately, which causes a mismatch when a
piece of information is correctly extracted but assigned
to a different block. Furthermore, as
described in Section 3, the contract title is extracted using 3
criteria involving the font size, the central alignment, and
a keyword. However, in some cases, the title can be
left aligned, or the biggest font is a part of text in the
logo or another line. The combination of these errors
falls into the title category. Legislation references are
identified by flexible patterns consisting of 3 parts:
the section mark (§), paragraph ID (‘odst. X’), and
the act or law ID. Not all 3 parts are obligatory.
The act or law is identified by number, e.g. 89/2012,
or name (‘občanský zákoník’/Civil Code). Full examples are:
§ 1746 odst. 2 zákona č. 89/2012 Sb., občanský zákoník (§ 1746 par. 2 of Act No. 89/2012 Coll., Civil Code)
§ 2586 a násl. zák. č. 89/2012 Sb., občanského zákoníku (§ 2586 et seq. Act No. 89/2012 Coll., Civil Code)
§ 92a zákona o dani (§ 92a of the Tax Act)
č. 340/2015 Sb. (No. 340/2015 Coll.)
These patterns are based on findings in the development
set with extra flexibility; however, some test set
cases still remained uncovered, such as more than one
section in a legislation reference (§27 a 31 zákona č.
134/2016 Sb.), or the connection word ‘násl.’/seq. written
in the full form (‘následujících’/sequentes).
Table 4: Error analysis of partial matches.
Error type     Items    In %
OCR error         18   39.13
NER                9   19.57
Multi-lines        5   10.87
Pattern            9   19.57
Other              5   10.87
Total             46  100.00
Table 5: Error analysis of mismatches.
Error type          Items    In %
In another block        7    6.31
OCR error              31   27.93
Keyword                27   24.32
Layout                 10    9.01
NER                    12   10.81
Pattern                 7    6.31
Title                   6    5.41
Other                  11    9.91
Total                 111  100.00
Figure 2: Evaluation of each field: a) by item, and b) by percentage.
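The three-part structure of the legislation reference patterns discussed in this section can be approximated by a regular expression such as the one below. This is a simplified sketch built only from the four examples given above; the list of act names and all identifiers are assumptions, and the actual OCRMiner patterns are more flexible.

```python
import re

SECTION = r"§\s*\d+[a-z]?(?:\s+a\s+násl\.)?"      # section mark, e.g. "§ 92a"
PARAGRAPH = r"odst\.\s*\d+"                        # paragraph ID, e.g. "odst. 2"
ACT_WORD = r"(?:zákona|zák\.)"                     # "of Act"
ACT_NUMBER = r"č\.\s*\d+/\d{4}\s+Sb\."             # e.g. "č. 89/2012 Sb."
ACT_NAME = r"(?:občansk\w+\s+zákoník\w*|o\s+dani)" # assumed short list of names

# Variant 1: the act is given by number; section and paragraph are optional.
WITH_NUMBER = (rf"(?:{SECTION}\s+)?(?:{PARAGRAPH}\s+)?(?:{ACT_WORD}\s+)?"
               rf"{ACT_NUMBER}(?:,\s*{ACT_NAME})?")
# Variant 2: the act is given by name only; the section mark is then required.
WITH_NAME = rf"{SECTION}\s+(?:{PARAGRAPH}\s+)?{ACT_WORD}\s+{ACT_NAME}"

LEGISLATION_RE = re.compile(rf"{WITH_NUMBER}|{WITH_NAME}")


def is_full_reference(text: str) -> bool:
    """True if the whole string is covered by one legislation pattern."""
    return LEGISLATION_RE.fullmatch(text) is not None
```

The sketch covers the four full examples, while the two uncovered cases mentioned above (two sections in one reference, and ‘následujících’ written in full) are not matched as a whole, which mirrors how such cases end up as partial matches or mismatches.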
The error analysis of each category is summarized
in Tables 4 and 5. Among the partial matches, almost
40% of errors are due to OCR errors, usually in
characters sharing similar shapes, e.g. 4-A, Z-7, or O-0.
The cases where NER and patterns did not detect
the full organization name or the full legislation reference caused
the same number of errors (19.57% each), leaving 10.87%
for multiple lines and 10.87% for other reasons.
Among the mismatches, OCR errors caused more
than a quarter of the errors, followed by keyword
errors with another 24.32%. As we can see in Figure 2,
the accuracy of the contract date item is low.
In the analysed contracts, the dates usually appear
stamped or hand-written when signing the contract
(see Figure 3 for an example). Thus, most of the date
errors happen because the OCR engine could not recognize
the hand-written characters correctly. In addition,
the cover of confidential information sometimes
overlapped text in the surrounding areas, which caused
more OCR errors. 10.8% of the errors were caused by NER
recognizing a part of an organization name as a location.
Layout errors account for 9%, and 6% of items
were extracted in a wrong block. The reason was either
a specific layout design or distances between
pieces of information in the group created by a covering
black line (as we can see in the example in Figure 4 in
the Appendix). Pattern errors also account for 6%,
followed by title errors (5.41%) and 9.91% for other reasons.
Figure 3: A contract date example.
5 CONCLUSIONS
In the paper, we have presented the first version of the
OCRMiner system module for information extraction
from scanned contract documents. The design and the
architecture of the module have been described in detail.
A new dataset for the evaluation of contract information
extraction has been built and used in a thorough
evaluation of the contract analysis modules. The
evaluation results show that the system is able to identify
almost 88% of the contract metadata correctly. A
detailed error analysis describes and classifies the reasons
for the current mismatches. Although some modules
(e.g. keyword detection) are language dependent,
the pipeline is easily adaptable to other languages.
With the presented analysis, the current test set
can be seen as an extended development set for evaluating
the system on a new and larger test set, which would confirm
its generalization capabilities. This version also serves
as a strong baseline for further work where we plan to
employ state-of-the-art NLP techniques such as
pre-trained BERT model tuning on the contract dataset or
grammar induction techniques for the layout and content
analysis.
ACKNOWLEDGEMENTS
This work has been partly supported by Konica Mi-
nolta Business Solution Czech within the OCR Miner
project, the Ministry of Education of CR within the
LINDAT-CLARIAH-CZ project LM2018101 and the
Specific University Research of Masaryk University.
REFERENCES
Arkhipov, M., Trofimova, M., Kuratov, Y., and Sorokin,
A. (2019). Tuning multilingual transformers for
language-specific named entity recognition. In Proceedings
of the 7th Workshop on Balto-Slavic Natural
Language Processing, pages 89–93, Florence, Italy.
Association for Computational Linguistics.
Barrentine, A. et al. (2020). Libpostal. https://github.com/
openvenues/libpostal.
Chalkidis, I. and Androutsopoulos, I. (2017). A deep learn-
ing approach to contract element extraction. In JU-
RIX, pages 155–164.
Chalkidis, I., Androutsopoulos, I., and Michos, A. (2017).
Extracting contract elements. In Proceedings of the
16th edition of the International Conference on Artificial
Intelligence and Law, pages 19–28.
Chalkidis, I., Androutsopoulos, I., and Michos, A. (2018).
Obligation and prohibition extraction using hierarchi-
cal RNNs. arXiv preprint arXiv:1805.03871.
Dragoni, M., Villata, S., Rizzi, W., and Governatori, G.
(2016). Combining NLP approaches for rule extrac-
tion from legal documents.
Ha, H. T. (2019). Approximate string matching for detect-
ing keywords in scanned business documents. In Pro-
ceedings of Recent Advances in Slavonic Natural Lan-
guage Processing, RASLAN 2019, pages 49–54.
Ha, H. T., Nevěřilová, Z., Horák, A., et al. (2018). Recognition
of OCR Invoice Metadata Block Types. In
Text, Speech, and Dialogue. TSD 2018, pages 304–312.
Springer, Cham.
Kwok, T. and Nguyen, T. (2006). An automatic method
to extract data from an electronic contract composed
of a number of documents in PDF format. In The 8th
IEEE International Conference on E-Commerce Tech-
nology and The 3rd IEEE International Conference on
Enterprise Computing, E-Commerce, and E-Services
(CEC/EEE’06), pages 33–33. IEEE.
Milosevic, Z., Gibson, S., Linington, P. F., Cole, J., and
Kulkarni, S. (2004). On design and implementation
of a contract monitoring facility. In Proceedings. First
IEEE International Workshop on Electronic Contract-
ing, 2004., pages 62–70. IEEE.
Neill, J. O., Buitelaar, P., Robin, C., and Brien, L. O. (2017).
Classifying sentential modality in legal language: a
use case in financial regulations, acts and directives.
In Proceedings of the 16th edition of the International
Conference on Artificial Intelligence and Law, pages
159–168.
Ryan, F. (2006). Round Hall nutshells Contract Law.
Thomson Round Hall.
Smith, R. (2007). An overview of the Tesseract OCR en-
gine. In Document Analysis and Recognition, Ninth
International Conference on, volume 2, pages 629–
633. IEEE.
Smith, R. et al. (2020). Tesseract OCR. https://github.com/
tesseract-ocr/tesseract.
Winter, K. and Rinderle-Ma, S. (2018). Detecting constraints
and their relations from regulatory documents
using NLP techniques. In OTM Confederated International
Conferences "On the Move to Meaningful Internet
Systems", pages 261–278. Springer.
APPENDIX
An example of a scanned contract with the extracted
metadata information is presented on the next page in
Figure 4.
Figure 4: A contract example: extracted information is in the red boxes.