Towards Automation of Regulatory Compliance Checking in the Product
Design Phase
Malte Ramonat¹, Andreas W. Müller² and Alexander Fay¹
¹ Institute of Automation, Helmut Schmidt University Hamburg, Holstenhofweg 85, Hamburg, Germany
² Data & Analytics Governance, Schaeffler AG, Herzogenaurach, Germany
Keywords:
Regulatory Compliance Checking, Table Extraction, Formula Extraction, Ontology Design.
Abstract:
The process of checking if a designed product is compliant with standards is time-consuming and error-prone.
This paper presents an approach for the automation of compliance checking using tables and formulae of
standards as information sources. An ontology is created to enable comparisons between parameter values
specified in standards in the form of a PDF document and parameter values of a designed product saved in a
3D PDF document. The extraction of regulatory information from PDF documents is discussed and software
tools for information extraction are compared.
1 INTRODUCTION
Technical standards provide knowledge for maintain-
ing a certain quality level for products or services.
Especially during the product development process,
standards specify a leeway for the product design.
The Portable Document Format (PDF) has become
the most common digital format in which standards
are provided by standards committees. PDF docu-
ments are widely used throughout the product life
cycle due to their portability and platform indepen-
dence. Information within PDF documents is either
unstructured or semi-structured, i.e. it is human-
readable but not interpretable for machines (Khusro
et al., 2015).
This lack of machine-interpretability makes the
process of verifying compliance with requirements
from regulations such as technical standards
labor-intensive. In the process of regulatory com-
pliance checking (RCC) during the product design
phase, the engineer first needs to find the standards
relevant for the designed product. Then, the engineer
compares a required value from a specific section of
a standard to a value of the designed product and pos-
sibly makes adjustments to the product design. This
process of RCC has to be repeated multiple times dur-
ing the product design phase. Thus, manual RCC is
time-intensive and error-prone (Manoharan, 2019). If
technical dependencies between different engineering
departments occur for the designed product, errors in
RCC of one department can have a big impact on
the whole development process (Jager, 2011). More-
over, RCC is not necessarily only conducted by the
company designing the product but also by certifica-
tion organisations. In the certification process multi-
ple feedback loops between certification organisation
and the manufacturer are necessary if the designed
product is not fit for certification. Each such feed-
back loop increases the manual effort for RCC for
both involved parties. Due to the amount of manual
labor put into RCC, both manufacturers and certification
organisations depend significantly on the regulatory
knowledge of the employees involved. An
automation of the RCC process is highly desirable, as
it would not only accelerate the design and certifica-
tion of new products but also reduce the risk of er-
rors during the process and decrease the dependency
of a company on the knowledge of single employees.
In order to automate the RCC, standard documents
need to be made machine-interpretable. Ontologies
can be used to enable machine-interpretability of re-
quirements stated in the standards. An approach is
presented in this paper demonstrating how ontologies
can be connected to requirements stated in standard
PDF documents.
The remainder of this paper is organized as fol-
lows: In Section 2 a scenario for the application is
presented. Section 3 shows an analysis of publica-
tions related to automation of RCC. In Section 4 the
proposed approach is outlined with its main parts be-
ing described in Subsections 4.1 and 4.2. Section 5
completes the paper with a conclusion and outlook.
Ramonat, M., Müller, A. and Fay, A.
Towards Automation of Regulatory Compliance Checking in the Product Design Phase.
DOI: 10.5220/0010644500003064
In Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2021) - Volume 2: KEOD, pages 136-143
ISBN: 978-989-758-533-3; ISSN: 2184-3228
Copyright © 2021 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
2 APPLICATION SCENARIO
It is common practice for engineers in product design
to reuse existing concepts, e.g. in form of 3D CAD
models, to create complex designs. Depending on
the respective product and its designated fields of use,
such supplied models must meet various standards-
based quality criteria. In a typical workflow an engi-
neer needs to know all the relevant criteria before he
or she can choose to apply a reusable concept in a par-
ticular context. Alternatively, he or she must look up,
find and correctly interpret the necessary information
in the potentially very large knowledge corpus of the
relevant standards documents.
In the example shown in Figure 1 the engineer
needs to check whether the dimensioning and surface
properties of the part described in the 3D PDF file are
compliant with underlying standards.
Figure 1: Example for a 3D PDF with sample properties.
The engineer can currently only gather the infor-
mation necessary for checking the compliance with
these standards from PDF documents. In the refer-
enced standards the necessary dimensioning and sur-
face information are also given in human-readable
form of tables and formulae. Hence, the engineer has
to read and understand the table shown in Figure 2,
and may even have to further perform calculations ac-
cording to the applicable formulae (see Figure 3). In
all cases the relevant attributes and their values must
be properly identified in and taken from the 3D PDF
product design and matched with the correct attributes
in the correct tables and formulae.
In order to provide the engineer with a time-saving
automatic checking of these criteria by means of an
RCC pipeline, the involved values, tables, and formu-
lae must all be semantically well-described.
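The manual check described above can be pictured with a minimal sketch. It assumes a dimensioning table whose rows are nominal-size bands and whose columns hold permitted deviations. The class name, field names, band limits and deviation values are invented for illustration and are not taken from any standard or from Figures 1 to 3.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ToleranceRow:
    """One row of a hypothetical dimensioning table (all values in mm)."""
    size_over: float    # lower limit of the nominal-size band (exclusive)
    size_up_to: float   # upper limit of the nominal-size band (inclusive)
    lower_dev: float    # permitted lower deviation from the nominal size
    upper_dev: float    # permitted upper deviation from the nominal size


# Invented sample rows; a real standard would supply these values.
TABLE = [
    ToleranceRow(0.0, 10.0, -0.1, 0.1),
    ToleranceRow(10.0, 50.0, -0.2, 0.2),
    ToleranceRow(50.0, 120.0, -0.3, 0.3),
]


def find_row(nominal: float, table: Sequence[ToleranceRow]) -> Optional[ToleranceRow]:
    """Return the table row whose size band contains the nominal size."""
    for row in table:
        if row.size_over < nominal <= row.size_up_to:
            return row
    return None


def is_compliant(nominal: float, actual: float, table: Sequence[ToleranceRow]) -> bool:
    """Check whether the actual dimension lies within the permitted deviations."""
    row = find_row(nominal, table)
    if row is None:
        raise ValueError(f"no table row covers nominal size {nominal}")
    deviation = actual - nominal
    return row.lower_dev <= deviation <= row.upper_dev
```

Each of the two steps, finding the correct row and comparing the value, corresponds to a lookup the engineer currently performs by reading the PDF documents.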
Figure 2: Example section of a dimensioning table with
complex header structures and sample values.
Figure 3: Example section of a profiles table with complex
cell contents and sample values.
3 RELATED WORK
Various efforts have been made towards the automation
of the RCC process. Zhang and El-Gohary
(2015) propose a method for automatic RCC in the
construction sector by using ontology-based logic
clauses. Requirements are automatically extracted
from a PDF document using natural language pro-
cessing. The extracted information is formalised into
logic clauses that can be used for reasoning. The pub-
lication is followed by (Zhang and El-Gohary, 2017)
in which a reasoning schema for RCC is presented.
Both publications focus on textual requirements.
In (Beach, 2013) a framework for RCC in the con-
struction sector is described, in which domain experts
create and maintain requirements from standards. Re-
quirements are extracted from regulation documents
and are enriched with metadata by domain experts.
The requirements are then transferred into a rule engine,
which can be used for RCC. This publication focuses on
textual requirements. Due to the manual
extraction and maintenance of requirements the pro-
posed framework is labor-intensive.
In (Manoharan, 2019) the digitisation of require-
ments from standards and their integration into the
product design process is examined. Information
from standards is manually extracted and uploaded
to a graph database, which can be accessed by CAD
tools. For the presented method, the requirements had
to be manually extracted from the standard and converted
into the JSON format. The
authors criticise the manual effort for the information
extraction and upload into the database and advise
against using this method in general.
In a follow-up publication, Loibl et al. (2020) propose
a procedure for the transformation of standards’
contents into a machine-executable form. Standards
are classified according to their eligibility for a con-
version into a machine-executable format. It is stated
that standards primarily containing tables, formulae
and text in form of property specifications are espe-
cially suited for machine-executability because they
possess a high level of unambiguousness. To vali-
date their approach, a data-graph from the standard
is transformed into a table which is formalized into a
machine-executable format. The information is then
transferred to a graph database and provided by web services.
The steps described above are carried out manually,
making this approach labor-intensive. Formulae
and tables in the original PDF document are
not addressed.
The majority of previous efforts discuss the au-
tomation of the RCC process for the construction sec-
tor. Automated RCC for the product design phase
should be studied more profoundly due to possible
time savings and error prevention. Previous publications
mostly focused on requirements in textual form.
Tables and formulae are less ambiguous than
textual requirements and should therefore be utilised
in the automation of RCC. To the best of our knowl-
edge no authors have discussed an automation of RCC
for the product design phase using tables and formulae
as the source of requirements from standards.
4 APPROACH TO AUTOMATIC
COMPLIANCE CHECKING
The automation of the RCC process can be achieved
in two different ways. The first way is to create
standards in a machine-interpretable format, e.g. in
the machine-readable Extensible Markup Language
(XML) format. This is state-of-the-art in the stan-
dardisation process. In such standard documents, tags
are used to add semantics to different sections of the
document. However, the tags used are not detailed
enough to make the standard machine-interpretable
(Loibl et al., 2020). Efforts are currently under way to
change the creation process of standards so that
machine-interpretable standards can be created in the future.
However, the timescale until these efforts will be im-
plemented is unclear. This is due to the complex de-
velopment process of a semantic tag set and also be-
cause of the amount of manual effort required for the
semantic tagging of standards’ contents. It is also un-
clear if the envisioned tag set will grant a suitable
level of accuracy for companies as the responsibility
of tag set creation lies with standards committees.
The second way for the automation of RCC is a
post-processing of the existing standards, which have
been provided as PDF documents. Following this
way, results can be obtained faster because less co-
ordination between organisations is necessary. Ad-
ditionally this post-processing can be applied to all
existing standards, whereas the first way can only be
applied to newly created standards. A post-processing
method could also be applied more easily to internal
standards of companies.
In the following sections, the second way is followed,
i.e. a post-processing approach to enable automatic RCC
using PDF files of standards is
proposed. Thus, information is extracted especially
from tables and formulae of standards in form of PDF
files by means of appropriate software tools. The ex-
tracted information is mapped into an ontology. The
information from the product description in form of a
3D PDF file is also extracted and mapped into the on-
tology. Thus, within the ontology, compliance checks
can be executed.
4.1 Information Model for Regulatory
Compliance Checking
In order to check if the properties of the product
description match the required values in a standard,
both values have to be made comparable. This can
be achieved by loading the information from both
sources into the same information model (IM). In sub-
section 4.1.1 a suitable type of IM is chosen based on
established requirements. A modeling language for
the chosen type of IM is selected. In subsection 4.1.2
a prototypical implementation for the IM is described.
4.1.1 Information Model Type and Modeling
Language Selection
The type of IM used for implementation has to meet
certain requirements. These IM requirements (IMR)
are derived from the application scenario and from a
KEOD 2021 - 13th International Conference on Knowledge Engineering and Ontology Development
literature review. They are described hereinafter in
order of their importance to the application scenario.
IMR1 (Machine-interpretability): In order to
automate RCC, the IM needs to be machine-
interpretable. To achieve this, the IM should store
information and also provide means to enrich it with
contextual meaning, i.e. semantics (Bettini, 2010).
IMR2 (Value comparison): The IM must provide
means for the automatic comparison of values stated
in tables or formulae within a standard to parameter
values presented in the product description. Semantic
structures of tables and formulae as well as means to
make both values comparable must be supported to
achieve this.
IMR3 (Compatibility): The IM must be accessible
by programs so that compliance checks can be applied
automatically. For this purpose, the IM must provide
non-proprietary interfaces so that other programs can
query the database from outside.
IMR4 (General applicability): In order for the IM
to be applicable to other application scenarios, it
needs to be reusable and extendable (Glawe et al.,
2015).
For the implementation of the IM, relational
databases and ontologies are compared. Both rela-
tional databases and ontologies enable value com-
parision and can be accessed by other programs.
Machine-interpretability is only supported by ontolo-
gies because their semantic triple structure allows the
modelling of complex semantics. Additionally, on-
tologies provide more flexibility compared to rela-
tional databases as they can easily be extended and
reused (Loibl et al., 2020). Therefore, ontologies have
been chosen for the implementation.
The Web Ontology Language (OWL) is suitable to
model an ontology for the application scenario due to
its high level of formalisation and because of its compatibility
with various software tools. Also, an ontology
built with OWL can easily be connected to other on-
tologies due to the open world assumption of OWL.
4.1.2 Information Model Formalisation
An ontology has been created using OWL to for-
malise information from both tables and formulae.
Their structures have been represented such that the
information extracted from standards can be mapped
to the ontology. An excerpt of the ontology is de-
picted in Figure 4 reflecting the structure of the table
in Figure 2.
Firstly, terminological components, i.e. T-Box el-
ements, are built to formalise the general structure of
tables. Classes are depicted in white. Object proper-
ties and datatype properties are shown in blue and red,
respectively.
Figure 4: Ontology for allocation of parameter values from tables.
The table structure is reflected by classes
for columns, subcolumns, rows and subrows as well
as for data-cells and header-cells. Secondly, asser-
tional components, i.e. A-Box elements, are added.
Individuals are created for columns and subcolumns
as well as for rows and subrows; they are depicted in
purple. They are connected to cell-entry-individuals,
which hold parameter values as datatype properties.
Individuals for header-cells holding parameter names
as datatype properties are also connected to the cell-
entry-individuals. This way, names and values of pa-
rameters are connected. Names and values of param-
eters of the product description in the 3D PDF file
are included in the ontology as datatype properties of
3D-PDF-parameter-individuals. Means of compari-
son are provided by linking the individuals of header-
cells and 3D-PDF-parameters to individuals of the
class "Parameter". The individuals of the three different
classes share the same value for the datatype
property "hasName". A comparison between table
values and 3D PDF values can thus be conducted by
SPARQL queries. In order for the value comparison
to work, it is important that the same naming conven-
tion is followed for parameter names of both the stan-
dard and the product description (Hildebrandt, 2020).
If this is not the case, a workaround using matching
tables has to be used.
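The name-based comparison described above can be illustrated with a minimal sketch. It is not the ontology of Figure 4: plain Python tuples stand in for OWL triples, the join is made directly on the shared "hasName" value rather than through individuals of the class "Parameter", and every identifier other than "hasName" (for example `CellEntry`, `PdfParameter`, `hasHeaderCell`, `hasValue`) as well as all sample values are assumptions made for illustration. The optional `matching` dictionary plays the role of the matching table used when naming conventions differ.

```python
# Triples are (subject, predicate, object). All identifiers and values are
# hypothetical stand-ins for individuals and properties of the ontology.
TRIPLES = [
    ("header_d", "type", "HeaderCell"),
    ("header_d", "hasName", "d"),
    ("cell_1", "type", "CellEntry"),
    ("cell_1", "hasHeaderCell", "header_d"),
    ("cell_1", "hasValue", 20.0),
    ("header_ra", "type", "HeaderCell"),
    ("header_ra", "hasName", "Ra"),
    ("cell_2", "type", "CellEntry"),
    ("cell_2", "hasHeaderCell", "header_ra"),
    ("cell_2", "hasValue", 1.6),
    ("pdf_param_1", "type", "PdfParameter"),
    ("pdf_param_1", "hasName", "d"),
    ("pdf_param_1", "hasValue", 20.0),
    ("pdf_param_2", "type", "PdfParameter"),
    ("pdf_param_2", "hasName", "roughness"),
    ("pdf_param_2", "hasValue", 3.2),
]

# Hypothetical matching table: product-description name -> standard name.
MATCHING = {"roughness": "Ra"}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the triple list."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def individuals(triples, cls):
    """All subjects asserted to be of the given class."""
    return [s for s, p, o in triples if p == "type" and o == cls]


def table_values(triples):
    """Map each parameter name from a header-cell to the value of its cell entry."""
    result = {}
    for cell in individuals(triples, "CellEntry"):
        for header in objects(triples, cell, "hasHeaderCell"):
            for name in objects(triples, header, "hasName"):
                for value in objects(triples, cell, "hasValue"):
                    result[name] = value
    return result


def product_values(triples, matching=None):
    """Map each 3D PDF parameter name (after optional renaming) to its value."""
    matching = matching or {}
    result = {}
    for param in individuals(triples, "PdfParameter"):
        for name in objects(triples, param, "hasName"):
            for value in objects(triples, param, "hasValue"):
                result[matching.get(name, name)] = value
    return result


def paired_values(triples, matching=None):
    """Pair (required, actual) values for every name found in both sources."""
    required = table_values(triples)
    actual = product_values(triples, matching)
    return {n: (required[n], actual[n]) for n in required if n in actual}
```

Without the matching table only the parameter "d" can be paired; with it, the differently named roughness parameter is paired as well. The sketch deliberately returns value pairs instead of a verdict, because the comparison rule (equality, limit or range) depends on the requirement in question.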
4.2 Extraction of Information from
Standards
In order to transfer the required values stated in ta-
bles and formulae of standards into the created ontol-
ogy, they have to be extracted by means of a software
tool. In Subsection 4.2.1 different software tools for
content extraction are compared based on established
requirements. After a suitable software has been cho-
sen, its output is analysed in Subsection 4.2.2.
4.2.1 Selection of Extraction Software
The extraction software has to meet certain require-
ments to be suitable for the application scenario.
These software requirements (SR) described here-
inafter result from a literature review and an analysis
of the application scenario.
SR1 (Recognition and extraction of content): Pre-
ceding the extraction, the relevant information in the
standard has to be identified. Tables and formulae
have to be recognised. Contents of tables and for-
mulae such as numerical values and formula symbols
have to be recognised. For table recognition the con-
tent of individual table cells should be distinguished
(Khusro et al., 2015). The distinction between header-
and data-cells is also desirable because header-cells
can add semantics to data-cells of the same column
(Yildiz et al., 2005). Formulae should be recognised,
even if their shape is changed due to mathematical op-
erators like fractions or sums. Indexes and exponents
of formula elements as well as mathematical opera-
tors should be recognised (Chan and Yeung, 2000).
In addition to the recognition of table and formula
elements they should also be extracted.
SR2 (High quality results): The recognition of
tables and formulae should yield high quality results.
Precision and recall can be used for result evaluation
(Wei et al., 2006). Precision is defined as the num-
ber of correctly recognised objects in relation to the
total number of recognised objects in the document,
whereas recall is defined as the number of correctly
recognised objects in relation to the total number of
objects in the document. Simple and complex tables
with connected cells should be recognised and ex-
tracted with comparable levels of precision and recall
(Yildiz et al., 2005). The same applies for the differ-
ent kinds of formulae.
SR3 (Broad range of applicability): To provide
a broad range of applicability, the software needs to
be usable for different types of document layouts and
should run on various operating systems (Pitale and
Sharma, 2011).
SR4 (Mature and standalone software): For a
smooth extraction of standard content, a standalone
software, which is not dependent on results of other
software, should be used. Using a mature, off-the-
shelf software would be desirable because this re-
duces the setup time. A mature software is also likely
to be less error-prone.
SR5 (Extraction of metadata): Metadata such as
the name of the document and the author or the issu-
ing date are important for the contextualisation of the
standard’s content and should thus be extracted (Pitale
and Sharma, 2011).
SR6 (Compatibility to the information model): In
Section 4.1 OWL has been chosen for the formalisa-
tion of the ontology. The software should therefore
extract the structure and content of tables and formu-
lae and present them in a format which can be mapped
into OWL.
SR7 (Availability): The software tool needs to be
available so that it can be used to implement an ex-
traction.
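The precision and recall measures defined in SR2 can be written down as a short sketch. It assumes that recognised and actual objects (for example table cells) are identified by hashable labels; the function name and the sample labels are illustrative only.

```python
def precision_recall(recognised, actual):
    """Return (precision, recall) for a set of recognised objects.

    precision = correctly recognised objects / all recognised objects
    recall    = correctly recognised objects / all objects in the document
    """
    recognised = set(recognised)
    actual = set(actual)
    correct = recognised & actual
    precision = len(correct) / len(recognised) if recognised else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    return precision, recall


# Illustrative labels: five cells exist, four objects were recognised,
# three of which are real cells.
ACTUAL_CELLS = {"c1", "c2", "c3", "c4", "c5"}
RECOGNISED_CELLS = {"c1", "c2", "c3", "noise"}
```

For the sample sets, three of four recognised objects are correct and three of five existing cells are found. Returning 0.0 for an empty set is a convention chosen here to avoid a division by zero.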
The requirements SR1, SR4, SR6 and SR7 are
classified as critical for the application scenario. In
the following comparison, a software tool is chosen
that meets all critical requirements and also meets the
most non-critical requirements.
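The selection rule stated above can be sketched as follows. The sketch assumes that a partly fulfilled rating counts as meeting a requirement and that non-critical requirements are simply counted; the tool names and ratings are hypothetical and do not reproduce Table 1.

```python
CRITICAL = ("SR1", "SR4", "SR6", "SR7")
NON_CRITICAL = ("SR2", "SR3", "SR5")
MET = {"yes", "partly"}  # assumption: partly fulfilled counts as met

# Hypothetical ratings: "yes", "partly", "no" or "unknown" per requirement.
TOOLS = {
    "ToolA": {"SR1": "yes", "SR2": "no", "SR3": "yes", "SR4": "yes",
              "SR5": "no", "SR6": "yes", "SR7": "partly"},
    "ToolB": {"SR1": "partly", "SR2": "yes", "SR3": "yes", "SR4": "yes",
              "SR5": "no", "SR6": "yes", "SR7": "yes"},
    "ToolC": {"SR1": "yes", "SR2": "yes", "SR3": "yes", "SR4": "no",
              "SR5": "yes", "SR6": "yes", "SR7": "yes"},
}


def meets_critical(ratings):
    """True if every critical requirement is at least partly fulfilled."""
    return all(ratings[req] in MET for req in CRITICAL)


def non_critical_score(ratings):
    """Number of non-critical requirements that are at least partly fulfilled."""
    return sum(ratings[req] in MET for req in NON_CRITICAL)


def select_tool(tools):
    """Pick the tool that meets all critical and the most non-critical requirements."""
    candidates = [name for name, ratings in tools.items() if meets_critical(ratings)]
    if not candidates:
        return None
    return max(candidates, key=lambda name: non_critical_score(tools[name]))
```

In the sample data, ToolC is excluded despite its high overall rating because it fails the critical requirement SR4, which mirrors the filtering step applied before the non-critical requirements are considered.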
Previous efforts and commercial software tools
have been searched for in a literature review and on-
line research. The software tools best fitting to the
application scenario are listed in Table 1. Each soft-
ware tool is evaluated regarding the fulfillment of the
above-mentioned requirements. A distinction is made
between fulfilled [X], partly fulfilled [(X)] and un-
fulfilled [X] requirements as well as insufficient in-
formation [?] for evaluation. The availability of the
software is rated with [X] if the software is avail-
able free of charge, [(X)] if it is a commercial soft-
ware and [X] if it cannot be acquired. The software
tools are categorised by tools for both table and for-
mula extraction, only table extraction and only for-
mula extraction and are arranged by their availability.
The requirements have been evaluated on the basis
of the software capabilities described by the software
creators and in related publications such as (Khusro
et al., 2015), (Perez-Arriaga et al., 2016) or (Con-
stantin et al., 2013).
It becomes apparent that no extraction software
meets all requirements. Of the 21 software tools
shown in Table 1, six are not available. Twelve soft-
ware tools do not meet other critical requirements and
thus cannot be used. The software tools SectLabel,
ABBYY FineReader and Nitro Pro do not recognise
information in tables or formulae with satisfactory ac-
curacy. Pdf2xml detects the information but does not
extract it. PDF-Extract, pdf2table, PDFFigures2, pdf-
table-extract, TableSeer and iText are dependent on
results of other software and therefore are no stan-
dalone software. The software tools Sumatra PDF and
i2OCR generate output in a format which is not com-
patible with the created OWL ontology described in
Section 4.1.
The critical requirements are met by PDFX,
Adobe Acrobat Pro and InftyReader. PDFX is an
online tool which converts PDF documents into the
XML format. The XML format is suitable for the ap-
proach due to its compatibility to OWL and because
Table 1: Comparison of previous efforts and commercial software for table and formula extraction.
                                        Requirements
Software Tools                          SR1  SR2  SR3  SR4  SR5  SR6  SR7
Tables & formulae
PDFX (Constantin et al., 2013)          (X)  X    X    (X)  X    X    X
PDF-Extract (Berg et al., 2012)         (X)  X    X    X    X    X    X
SectLabel (Luong et al., 2012)          X    ?    X    X    X    X    X
Adobe Acrobat Pro (Adobe, 2021)         (X)  X    X    X    X    X    (X)
ABBYY FineReader (ABBYY, 2021)          X    X    X    X    X    (X)  (X)
Nitro Pro (Nitro Software, Inc., 2021)  X    X    X    X    ?    (X)  (X)
(Wei et al., 2006)                      (X)  X    (X)  X    X    X    X
Tables
Pdf2table (Yildiz et al., 2005)         X    X    (X)  X    X    X    X
PDFFigures2 (Clark and Divvala, 2016)   (X)  X    X    X    X    X    X
pdf-table-extract (Lee et al., 2014)    X    ?    X    X    ?    X    X
Pdf2xml (Déjean and Meunier, 2006)      X    ?    X    X    X    X    X
TableSeer (Liu, 2009)                   X    ?    ?    X    X    X    X
Sumatra PDF (Sumatra, 2021)             (X)  X    X    X    X    X    X
iText (iText Group nv, 2021)            (X)  X    ?    X    X    X    (X)
XONTO (Oro and Ruffolo, 2008)           X    ?    X    X    X    X    X
PDF-TREX (Oro and Ruffolo, 2009)        X    ?    X    X    X    X    X
TAO (Perez-Arriaga et al., 2016)        X    ?    X    X    X    (X)  X
Formulae
i2OCR (Sciweavers LLC, 2021)            (X)  ?    X    X    X    X    X
InftyReader (Suzuki, 2004)              X    (X)  X    X    ?    X    (X)
(Garain and Chaudhuri, 2005)            X    ?    X    X    X    X    X
MaxTract (Baker et al., 2012)           X    ?    X    X    ?    (X)  X
it is both human-readable and machine-interpretable
(Schmidberger and Fay, 2007). Because standards are
confidential documents of high value, an online con-
version tool like PDFX cannot be used due to pos-
sible security leakages. The InftyReader meets the
critical requirements and is also applicable to differ-
ent document types. It converts formula content into
MathML output. The MathML format is compati-
ble to OWL and enables structuring of mathematical
equations in a detailed and hierarchical way (Schmid-
berger and Fay, 2007). Thus, it is suited for the pro-
posed approach. The InftyReader is limited to the ex-
traction of formulae and does not detect tables. There-
fore, formulae within tables cannot be extracted. For
many standards such as the one presented in Figure
3, formulae are, however, depicted within tables mak-
ing the InftyReader not usable. Adobe Acrobat Pro
is chosen for both formula and table extraction as it
meets all critical requirements. It can be used to con-
vert PDF documents into the XML format which can
be mapped into OWL. Adobe Acrobat Pro works for
digitally created PDF documents but does not recog-
nise a table or formula if it is added as a figure in the
original file. Because of this, Adobe Acrobat Pro does
not work for scanned documents. This means that the
extraction cannot be generalised to all kinds of PDF
documents. It has to be noted that Adobe Acrobat Pro
is chosen due to the absence of a satisfactory alter-
native. Additionally, none of the software tools can
distinguish between header- and data-cells of tables.
This proves that the technical infrastructure for a fully
functional method for post-processing of PDF docu-
ments has yet to be implemented.
4.2.2 Analysis of Software Output
Adobe Acrobat Pro has been used to convert the table
shown in Figure 2 into the XML format. An excerpt
of the output is shown in Figure 5.
Figure 5: XML Output of Adobe Acrobat Pro.
Tags are assigned to table rows, columns and to
the table itself. Header-cells and data-cells are not
tagged as such. Multi-row-cells of tables are con-
verted into multiple cells. The content of multi-row-
cells is saved in the uppermost cell, all other cells are
blank. Multi-column-cells are considered as single
cells. Thus, the number of columns per row can vary.
This is problematic for the mapping of information
from tables with complex header structures such as
the table shown in Figure 2 because headers of differ-
ent rows can in some cases not be assigned to the same
column. Formulae are converted into plain text with-
out tags for formula components. Formulae inserted
into the original document as a figure are not con-
verted. Because of the lack of tags and the poor qual-
ity of the formula output, the mapping of the XML
output into the ontology still presents a challenge.
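The two table effects described above, blank cells below a multi-row cell and a varying number of columns per row, can be made visible with a small sketch. It uses Python's standard `xml.etree.ElementTree` module; the tag names `Table`, `TR` and `TD` and the sample content are assumptions for illustration and do not reproduce the output shown in Figure 5.

```python
import xml.etree.ElementTree as ET

# Hypothetical export: the first header row has a multi-column cell stored as
# one cell, and the last row has a blank cell left behind by a multi-row cell.
SAMPLE_XML = """
<Table>
  <TR><TD>Size</TD><TD>Deviation</TD></TR>
  <TR><TD>d</TD><TD>upper</TD><TD>lower</TD></TR>
  <TR><TD>10</TD><TD>0.1</TD><TD>-0.1</TD></TR>
  <TR><TD></TD><TD>0.2</TD><TD>-0.2</TD></TR>
</Table>
"""


def read_rows(xml_text):
    """Read the table into a list of rows, each a list of cell strings."""
    root = ET.fromstring(xml_text.strip())
    return [[(cell.text or "").strip() for cell in row.findall("TD")]
            for row in root.iter("TR")]


def fill_multirow_cells(rows):
    """Copy the value from the cell above into every blank cell.

    Caveat: a cell that is genuinely empty in the standard is overwritten too,
    so this heuristic needs a review step before values are trusted.
    """
    filled = []
    for row in rows:
        previous = filled[-1] if filled else []
        filled.append([
            previous[i] if cell == "" and i < len(previous) else cell
            for i, cell in enumerate(row)
        ])
    return filled


def rows_with_fewer_columns(rows):
    """Indices of rows that have fewer cells than the widest row."""
    width = max((len(row) for row in rows), default=0)
    return [i for i, row in enumerate(rows) if len(row) < width]
```

Rows reported by `rows_with_fewer_columns` are those in which a header may not be assignable to a single column; resolving them requires information about cell spans that the export, as described above, does not provide.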
5 CONCLUSION AND OUTLOOK
In this paper it is shown that the automation of RCC is
highly desirable because it leads to time savings and
fewer errors. An analysis of the related work shows
that the automation of the RCC on the basis of ta-
ble and formula information has not yet been exam-
ined. This research gap has been addressed by this
approach, specifically for the comparison of param-
eter values from a product description to values of a
standard. In the approach for automatic RCC an on-
tology for table and formula information has been for-
malised. Next, information from a standard and from
the product description has been extracted. For this
step Adobe Acrobat Pro has been chosen. The XML
output of Adobe Acrobat Pro has been analysed with
regard to its suitability for information mapping into
the created ontology.
The use of Adobe Acrobat Pro restricts the ap-
proach to the extraction of standards’ requirements
from digitally created PDF documents. This limits
the applicability of the approach. The mapping of
the Adobe Acrobat Pro XML output into the ontology
presents a challenge because tags for detailed formula
and table content are missing. Table header- and data-cells
cannot yet be reliably transferred into the ontology.
Formula calculations still need to be introduced.
Previous efforts have shown that there is no easy-to-use
way to implement the calculation of complex formulae
in ontologies (Hildebrandt, 2017). Formula
calculation needs to be implemented in a different
manner.
Future work will focus on the mapping of the
XML output into the created ontology. A mapping
algorithm needs to be implemented to enable an au-
tomation of the RCC process. The created ontology
needs to be refined so that it is more generally ap-
plicable. Furthermore, an extension of the ontology
adding more detailed descriptions to the parameter
names is planned. Additional research will be devoted
to finding a more suitable extraction software as well
as to the integration of formula calculations into the
approach.
ACKNOWLEDGEMENTS
This publication resulted from a project between the
Schaeffler AG and the Helmut Schmidt University
Hamburg. The authors wish to thank the colleagues
from Schaeffler Engineering Standards (in alphabetical
order: Sören Clodius, Markus Franke, Stefan
Gatersleben, Dietmar Lochner, Hans Reichelsdorfer)
for providing and discussing application scenarios.
REFERENCES
ABBYY (2021). Finereader pdf. Retrieved March 6, 2021
from https://pdf.abbyy.com/?redirect-from=old-fr-ce.
Adobe (2021). Acrobat pro. Retrieved March 6, 2021 from
https://acrobat.adobe.com/us/en/acrobat.html.
Baker, J. B., Sexton, A. P., and Sorge, V. (2012). Max-
tract: Converting pdf to latex, mathml and text. In
Intelligent Computer Mathematics, pages 422–426.
Springer Berlin Heidelberg, Berlin, Heidelberg.
Beach, T. H., et al. (2013). Towards automated compliance
checking in the construction industry. In Database
and Expert Systems Applications, pages 366–380.
Springer Berlin Heidelberg, Berlin, Heidelberg.
Berg, Ø. R., Oepen, S., and Read, J. (2012). To-
wards high-quality text stream extraction from pdf:
Technical background to the acl 2012 contributed
task. In 50th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 98–
103, Stroudsburg, PA. Association for Compu-
tational Linguistics (ACL). Software available at
https://github.com/oyvindberg/PDFExtract.
Bettini, C., et al. (2010). A survey of context modelling and
reasoning techniques. Pervasive and Mobile Comput-
ing, 6(2):161–180.
Chan, K. and Yeung, D. (2000). Mathematical expression
recognition: a survey. International Journal on Doc-
ument Analysis and Recognition, 3(1):3–15.
Clark, C. and Divvala, S. (2016). Pdffigures
2.0. In JCDL’16, pages 143–152, Pis-
cataway, NJ. IEEE. Software available at
https://github.com/allenai/pdffigures2.
Constantin, A., Pettifer, S., and Voronkov, A. (2013). Pdfx:
fully-automated pdf-to-xml conversion of scientific
literature. In Proceedings of the 2013 ACM symposium
on Document engineering - DocEng ’13, page 177,
New York, New York, USA. ACM Press. Software
available at http://pdfx.cs.man.ac.uk/.
Déjean, H. and Meunier, J. (2006). A system for converting
pdf documents into structured xml format. In Docu-
ment analysis systems VII, Lecture notes in computer
science, pages 129–140. Springer, Berlin. Software
available at https://sourceforge.net/projects/pdf2xml/.
Garain, U. and Chaudhuri, B. B. (2005). A corpus for ocr
research on mathematical expressions. International
Journal of Document Analysis and Recognition (IJ-
DAR), 7(4):241–259.
Glawe, M., et al. (2015). Knowledge-based engineering of
automation systems using ontologies and engineering
data. In Proceedings of the 7th International Joint
Conference on Knowledge Discovery, Knowledge En-
gineering and Knowledge Management (IC3K 2015),
pages 291–300.
Hildebrandt, C., et al. (2017). Reasoning on engineering
knowledge: Applications and desired features. In
European Semantic Web Conference, volume 10250
of Lecture notes in computer science, pages 65–78,
Cham. Springer International Publishing.
Hildebrandt, C., et al. (2020). Ontology building for cyber–physical
systems: Application in the manufacturing
domain. IEEE Transactions on Automation Science
and Engineering, 17(3):1266–1282.
iText Group nv (2021). itext. Retrieved March 6, 2021 from
https://itextpdf.com/.
Jager, T., et al. (2011). Mining technical dependencies
throughout engineering process knowledge. In
ETFA2011, pages 1–7. IEEE.
Khusro, S., Latif, A., and Ullah, I. (2015). On methods
and tools of table detection, extraction and annotation
in pdf documents. Journal of Information Science,
41(1):41–57.
Lee, C., Bzdak, J., and Lannon, B. (2014). pdf-
table-extract. Retrieved March 6, 2021 from
https://github.com/ashima/pdf-table-extract.
Liu, Y. (2009). Tableseer: Automatic Table Extraction,
Search and Understanding. Dissertation, The
Pennsylvania State University. Software available at
https://sourceforge.net/projects/tableseer/.
Loibl, A., Manoharan, T., and Nagarajah, A. (2020).
Procedure for the transfer of standards into
machine-actionability. Journal of Advanced Me-
chanical Design, Systems, and Manufacturing,
14(2):JAMDSM0022–JAMDSM0022.
Luong, M., Nguyen, T. D., and Kan, M. (2012).
Logical structure recovery in scholarly articles
with rich document features. In Multimedia
Storage and Retrieval Innovations for Digi-
tal Library Systems, pages 270–292. Software
available at https://github.com/knmnyn/ParsCit/
tree/master/bin/sectLabel.
Manoharan, T., et al. (2019). Approach for a machine-interpretable
provision of standard contents using
welded constructions as an example. Proceedings of
the Design Society: International Conference on En-
gineering Design, 1(1):2477–2486.
Nitro Software, Inc. (2021). Nitro pro. Retrieved March 6,
2021 from https://www.gonitro.com/.
Oro, E. and Ruffolo, M. (2008). Xonto: An ontology-based
system for semantic information extraction from pdf
documents. In 2008 20th IEEE International Confer-
ence on Tools with Artificial Intelligence, pages 118–
125. IEEE.
Oro, E. and Ruffolo, M. (2009). Pdf-trex: An approach for
recognizing and extracting tables from pdf documents.
In 2009 10th International Conference on Document
Analysis and Recognition, pages 906–910. IEEE.
Perez-Arriaga, M. O., Estrada, T., and Abad-Mota, S.
(2016). Tao: System for table detection and extraction
from pdf documents. Proceedings of the Twenty-Ninth
International Florida Artificial Intelligence Research
Society Conference, pages 591–596.
Pitale, S. and Sharma, T. (2011). Information extrac-
tion tools for portable document format. Interna-
tional Journal of Computer Technology 2011, Vol
2(6):2047–2051.
Schmidberger, T. and Fay, A. (2007). A rule format for in-
dustrial plant information reasoning. In 2007 IEEE
Conference on Emerging Technologies & Factory Automation
(ETFA 2007), pages 360–367. IEEE.
Sciweavers LLC (2021). i2ocr. Retrieved March 6, 2021
from https://www.i2ocr.com/.
Sumatra (2021). Sumatra pdf reader. Retrieved March
6, 2021 from https://www.sumatrapdfreader.org/free-
pdf-reader.
Suzuki, M., et al. (2004). An integrated ocr software
for mathematical documents and its output
with accessibility. In Computers Helping Peo-
ple with Special Needs, volume 3118 of Lec-
ture notes in computer science, pages 648–655.
Springer, Berlin and Heidelberg. Software available at
http://www.inftyreader.org/.
Wei, X., Croft, B., and McCallum, A. (2006). Table ex-
traction for answer retrieval. Information Retrieval,
9(5):589–611.
Yildiz, B., Kaiser, K., and Miksch, S. (2005). pdf2table:
A method to extract table information from pdf
files. IICAI, pages 1773–1785. Software available at
http://ieg.ifs.tuwien.ac.at/projects/pdf2table.
Zhang, J. and El-Gohary, N. M. (2015). Automated infor-
mation transformation for automated regulatory com-
pliance checking in construction. Journal of Comput-
ing in Civil Engineering, 29(4).
Zhang, J. and El-Gohary, N. M. (2017). Semantic-based
logic representation and reasoning for automated reg-
ulatory compliance checking. Journal of Computing
in Civil Engineering, 31(1):04016037.