Automated Data Extraction from PDF Documents:
Application to Large Sets of Educational Tests
Karina Wiechork 1,2 a and Andrea Schwertner Charão 1 b
1 Department of Languages and Computer Systems, Federal University of Santa Maria, Santa Maria, Brazil
2 Information Technology Coordination, Federal Institute of Education Science and Technology Farroupilha, Frederico Westphalen, Brazil
a https://orcid.org/0000-0003-2427-1385
b https://orcid.org/0000-0003-3695-8547
Keywords:
Dataset Collection, Ground Truth, Performance Evaluation, PDF Extraction Tools.
Abstract:
The massive production of documents in Portable Document Format (PDF) has motivated research on automated extraction of the data contained in these files. This work focuses mainly on extraction from natively digital PDF documents made available in large repositories of educational exams. For this purpose, the educational tests applied in Enade were collected automatically using scripts developed with Scrapy. The files used for the evaluation comprise 343 tests, with 11,196 objective and discursive questions, and 396 answer keys, from which 14,475 alternatives of objective questions were extracted. The ground truth for the tests was built with the Aletheia tool. For the extractions, existing tools that extract data from PDF files were used: Excalibur and Tabula for tabular data (answer keys), CyberPDF and PDFMiner for textual content (questions), and Aletheia and ExamClipper for regions of interest (question cutouts). The results of the extractions point out some limitations related to the diversity of layouts in each year of application. The extracted data provide useful information for a wide variety of purposes, including academic research and support for students and teachers.
1 INTRODUCTION
With the development of information technology and the widespread use of the Internet, a large number of electronic documents are stored as PDF files (Fang Yuan and Bo Lu, 2005).
PDF is one of the most widely used document formats for storing text-based data. This file format was designed by Adobe in 1993 to represent a document independently of the platform used, preserving its layout both on screen and in print.
While this is an efficient way to store the visual
representation of a document, the resulting structure
is difficult to work with if the aim is to extract specific
parts of the text in a structured manner (Budhiraja,
2018).
One of the great advances of the digital era has been the ability to store vast amounts of documents electronically (Øyvind Raddum Berg, 2011).
The substitution of physical document storage by electronic storage provides advantages such as cost reduction, easy storage and sharing, faster searches and queries, and documents that do not deteriorate. Digital documents also follow a structural standardization, for example in paragraphs, sections, titles and figures, which can be useful for detecting regions and extracting information at scale, in an automated or semi-automated manner.
This research carries out an exploratory analysis of tools used for data extraction from natively digital PDF documents, with the objective of assessing their effectiveness and limitations. Retrieving relevant information from the questions of these tests is a difficult task, since the layout is not geometrically simple.
Extracting information from PDF files is an important task, since the extracted questions are valuable knowledge assets for research and provide useful and timely information to several users who may benefit from them. They can serve, for example, as study material for students preparing for tests, courses or public service examinations, helping them to learn and retain new knowledge.
This information can also help course coordinators to analyze the effectiveness of Pedagogical
Course Projects, mapping students' knowledge and identifying gaps from the results of the questions in reports. The extracted questions can also become interesting material for teachers to use in the classroom, both to facilitate understanding and as exercises. A teacher, for instance, can maintain a database of questions and answers and generate new tests from it.
For this work, the extractions were carried out on the Enade (National Student Performance Test) tests applied from 2004 to 2019. The set of downloaded files consists of 386 tests and 396 answer keys, totaling 782 files; however, only tests with at least two applications were used. The numbers of tests, pages for extraction and questions are 343, 6,834 and 11,196 respectively, while the total number of alternatives in the 396 objective answer keys is approximately 14,475. Our dataset for PDF extraction therefore totals 739 files: 343 tests and 396 answer keys.
Since it is possible to extract data from this type of test, it should also be possible to extract data from other tests by reusing the tools employed in this work.
The remainder of this paper is organized as follows. Section 2 discusses some basic topics and related work involving experiments on extracting data from PDF files. Section 3 presents the dataset used and extracted, together with the methodology used to obtain these data. Section 4 details the experiments and the results we obtained. Finally, Section 5 concludes this article with a brief summary and suggestions for future research.
2 BACKGROUND AND RELATED
WORK
2.1 Enade
In Brazil, the National Institute of Educational Studies and Research Anísio Teixeira (INEP) is responsible for applying Enade. The results of the tests yield several indicators, among them the course concept, which ranges from 0 to 5. Based on the analysis of the data obtained by the application of Enade, it is possible to analyze the performance of both institutions and students, and then calculate quality indicators that may support decisions to improve the teaching process.
Enade assesses the performance of graduates of undergraduate courses in relation to the syllabus foreseen in the curricular guidelines of the courses, the development of competencies and skills necessary for deepening general and professional formation, and the students' level of familiarity with the Brazilian and world reality (INEP, 2020).
Applied by Inep since 2004, Enade is part of the National Higher Education Assessment System (Sinaes), which also comprises the assessment of undergraduate courses and institutional assessment. Together, they form the evaluative tripod that makes it possible to know the quality of Brazilian higher education courses and institutions. The results of Enade are inputs for calculating the Higher Education Quality Indicators (gov.br, 2021).
The test consists of 40 questions: 10 questions make up the general formation part and 30 the specific formation part of the area, and both parts contain discursive and multiple-choice questions. The general formation part accounts for 25% of the test and the specific formation part for 75%, as shown in Table 1.
Table 1: Values for each part of the test.

                  General Formation   Specific Formation
Discursive        2                   3
Multiple Choice   8                   27
Weight            25%                 75%
2.2 Ground Truth
For the automatic evaluation of results of any segmentation/recognition system, ground truth information plays a significant role. However, producing it is an error-prone and time-consuming task (Alaei et al., 2011).
In document image understanding, public datasets
with ground truth are an important part of scientific
work. They are not only helpful for developing new
methods, but also provide a way of comparing perfor-
mance. Generating these datasets, however, is time
consuming and cost-intensive work, requiring a lot of
manual effort (Strecker et al., 2009).
To assist in the task of creating the ground truth for each page with questions in the Enade tests, the Aletheia software was used. Aletheia is developed by the PRIMA research group (Pattern Recognition & Image Analysis Research Lab) at the University of Salford, Manchester. The fact that it has been widely adopted in similar studies, is maintained by a research group, is kept up to date and offers several working options were contributing factors for its use in this work.
Figure 1 shows a page in Aletheia with the regions of interest marked. The workflow of Aletheia consists of an input step, which takes the page,
and output, where the segments are classified and
saved in eXtensible Markup Language (XML).
Figure 1: Example of input (left) and output (right) with 12
marked regions belonging to 4 different types: image (light
blue), table (brown), text (dark blue) and separator (pink).
About 6,834 pages of the PDF tests were collected. For each page of the Enade tests that contains questions, a ground truth file is available, together with the corresponding XML file built with the Aletheia tool. The files are in XML format and contain the coordinates of the regions in a hierarchical structure.
In terms of presentation, each page with test content is composed of 3 parts: 1) the identified ground truth; 2) the original page in JPG format; 3) an XML description of the contained attributes, built with Aletheia according to the selected regions. These files contain detailed data that can meet other extraction requirements and be used in other research. In this case, they were used for comparison with the output of the textual extraction tools, PDFMiner and CyberPDF, and as input to the metric counting script, which will be explained in Section 3.2.
2.3 PDF Extraction Tools
In this section, works by other authors related to this research are analyzed and described. (Constantin et al., 2013) present PDFX, a system designed to reconstruct the logical structure of academic PDF articles. The output is an XML or HyperText Markup Language (HTML) document that describes the logical structure of the input article in terms of title, sections, tables, references, etc. When using HTML output, the figures are also extracted and made available at the end of the file, but not in reading order.
In the same direction, (Ramakrishnan et al., 2012) develop a layout-aware tool for extracting text from PDF (LA-PDFText), whose objective is to accurately extract text from scientific articles. The scope of the system is restricted to extracting the textual content of research articles. In (Hadjar et al., 2004), the authors describe Xed (eXtracting electronic documents), an approach that extracts all objects from a PDF document, including text, images and graphics. The extracted objects are output in SVG (Scalable Vector Graphics) format.
A number of works in the field of extracting tables from PDF files, such as (Hassan and Baumgartner, 2007), are available. (Liu et al., 2007) present a system capable of extracting tables and table metadata from PDF documents; for this purpose, PDFBox is used to extract raw text, which is later processed to identify tables. The Tabula tool (Manuel Aristarán, Mike Tigas, Jeremy B. Merrill, Jason Das, David Frackman and Travis Swicegood, 2018) allows users to select tables for extracting tabular data from PDF documents. Excalibur is a web tool for extracting tabular data from text-based PDFs, not from scanned documents (Excalibur, 2018). Table extraction was useful in this work for extracting the answer keys of objective questions.
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on obtaining and analyzing text data (Yusuke Shinyama, 2014). In (Parizi et al., 2018), the authors propose CyberPDF, an automatic coordinate-based PDF batch extraction tool with which users query one representative PDF document and quickly extract the same data from a series of files in batch.
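As a point of reference for how PDFMiner is typically driven (a minimal sketch using the pdfminer.six high-level API, not the exact configuration used in this work; the file name is a placeholder), the raw text of a single test page can be obtained as follows:

from pdfminer.high_level import extract_text

# Hypothetical file name: extract the raw text of the third page (index 2)
# of one Enade test for later processing.
text = extract_text("enade_2019_computing.pdf", page_numbers=[2])
print(text[:200])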
Other approaches focus on evaluating PDF extraction tools. In the research by (Bast and Korzen, 2017), the authors assess 14 PDF extraction tools to determine the quality and scope of their functionality, based on a benchmark they built from parallel TeX and PDF data. They used 12,098 scientific articles and, for each article, the benchmark contains a ground truth file in addition to the related PDF file. In (Lipinski et al., 2013), the authors evaluate the performance of tools for extracting metadata from scientific articles. Their comparative study is a guide for developers who want to integrate the most appropriate and effective metadata extraction tool into their software.
Specific approaches to extract figures and captions
from PDF are being proposed. This is the case of the
work by (Choudhury et al., 2013), where the authors
were concerned with extracting figures and associated
captions from PDF documents. In the work of (Li et al., 2018), the authors present PDFigCapX, a system for extracting figures and associated captions from scientific publications.
Although several sophisticated and even complex
approaches have been proposed, they are still limited
in many ways (Strecker et al., 2009).
In the research of (Lima and Cruz, 2019), the authors propose an approach to detect and extract data from unstructured data sources available online and spread across multiple web pages, storing the data in a Data Warehouse properly designed for this purpose. Almost all files are published in PDF and there are files with different layouts. For this process, the authors use pre-existing tools.
Through these works, it is possible to identify some of the tools used for extracting data from PDF files. This study of related work supported the exploratory approach adopted in this research.
3 DATASET AND
METHODOLOGY
In this section, we present the dataset collection and the methodology used to conduct our experiments with Excalibur and Tabula, for extracting answer keys; CyberPDF and PDFMiner, for extracting textual content; and Aletheia and ExamClipper, for extracting regions of interest. There are generic tools that process images and allow regions to be cut out; Aletheia fits this category and allows cutouts of regions of interest from PDF files, so we used it to run an experiment on some of the tests.
3.1 Dataset Collection
The dataset for this research consists of Enade tests and answer keys from the years 2004 to 2019. To download these tests and answer keys from the INEP website in an automated way, the Scrapy tool was applied. All automation scripts are available at: https://github.com/karinawie/scrapy. The use of these scripts was essential, since a large set of documents had to be collected.
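As an illustration of this step, the sketch below shows a minimal Scrapy spider in the spirit of the published scripts; the entry URL, the CSS selector and the file handling are assumptions, not the code actually stored in the repository.

import scrapy


class EnadeSpider(scrapy.Spider):
    # Minimal sketch of a spider that collects Enade PDF links.
    # The start URL and the selector below are placeholders; the real
    # INEP page structure must be inspected before use.
    name = "enade_pdfs"
    start_urls = ["https://www.gov.br/inep/"]  # hypothetical entry page

    def parse(self, response):
        # Follow every link that points to a PDF file (test or answer key).
        for href in response.css("a::attr(href)").getall():
            if href.lower().endswith(".pdf"):
                yield {"file_urls": [response.urljoin(href)]}

With Scrapy's FilesPipeline enabled (ITEM_PIPELINES and FILES_STORE in settings.py), items carrying a file_urls field are downloaded automatically.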
As a result, 386 tests and 396 answer keys with objective alternatives were collected. The difference between the number of tests and answer keys occurs because, in some years, the same test was applied to similar courses but with different answer keys.
However, the dataset used totals 343 tests and 396 answer keys, i.e., 739 files. Not all tests were used: due to the large volume and the time required to carry out the experiments, it was decided to remove the tests with fewer than two applications. The dataset, with the 739 downloaded files, is available at: https://github.com/karinawie/PDFExtraction/tree/master/dataset.
Regarding the quantitative data contained in the dataset, the numbers of tests, pages for extraction and questions are 343, 6,834 and 11,196 respectively, while the total number of alternatives extracted from the 396 answer-key files is approximately 14,475. In this count, blank pages, test covers and the pages of the test perception questionnaire were not included. In all tests, the General Formation questions were counted only once when the layout pattern was the same for all tests of the year evaluated. In a few years, more than one layout pattern was identified, in which case more than one ground truth was generated for the General Formation part. Answer keys are counted only for objective questions, not for discursive ones.
Table 2 presents an overview of the quantitative data on pages and questions used in the extractions of the tests.
Table 2: Overview of the data used in the tests.
Year   Number of pages   Number of questions
2004 151 430
2005 391 884
2006 171 340
2007 188 440
2008 453 939
2009 328 595
2010 236 550
2011 546 956
2012 292 520
2013 342 520
2014 884 1222
2015 439 580
2016 351 520
2017 905 1270
2018 537 610
2019 620 820
Total 6,834 11,196
3.2 Methodology
This section presents the methodology used to per-
form the data extractions, in addition to the set of met-
rics and criteria established for the evaluations applied
in the experiments.
After obtaining our dataset, a comparative evaluation of 6 PDF extraction tools was performed. According to the needs of this research, we performed data extractions for 3 categories: tabular data, textual content, and regions of interest as images. In each category there are 2 tools that extract the same kind of content. After the extraction, the output of each tool is compared with that of the other tool in the same category. For
this purpose, a set of criteria was established that allows an assessment of the extraction tools by comparing their results with the ground truth.
Performance evaluation is necessary to compare
and select the most suitable methods for a given appli-
cation. Different algorithms have different deficien-
cies considering all the metrics of evaluation. Ground
truth contains sufficient and detailed data in several
aspects and it is necessary to use it as a reference to
evaluate the results of the experiments (Fang et al.,
2012).
To quantify accuracy when analyzing the performance of the tools, the metrics listed in Table 3 are used. When evaluating a tool, each of its output files is compared with the equivalent ground truth file and then with the competing tool. The following evaluation criteria are measured.
To compare the metrics of the question extractions with the ground truth, a metric counting script was created; it is available at https://github.com/karinawie/XML aletheia. The ground truth was built with Aletheia, and the resulting XML files were used as input to the script. The script counts the metrics for each question according to the year and the test area and exports this information to a text file. It analyzes each XML file, detecting the beginning of a question using a regular expression: "QUESTION", "Question", "DISCURSIVE QUESTION" or "Discursive Question", followed by a two-digit number. With the help of this script, the quantity of each region in each test was recorded in a spreadsheet in an automated way.
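The fragment below is a minimal sketch of this counting logic, assuming the Aletheia ground truth follows the PAGE XML convention of storing region text in Unicode elements; it is not the published script.

import re
import xml.etree.ElementTree as ET

# Pattern described above: "QUESTION"/"Question", optionally preceded by
# "DISCURSIVE"/"Discursive", followed by a two-digit number.
QUESTION_RE = re.compile(
    r"(?:DISCURSIVE\s+|Discursive\s+)?(?:QUESTION|Question)\s*\d{2}")


def count_questions(xml_path):
    # Count question starts in one ground-truth XML file.
    tree = ET.parse(xml_path)
    count = 0
    for node in tree.iter():
        tag = node.tag.rsplit("}", 1)[-1]  # strip the XML namespace, if any
        if tag == "Unicode" and node.text:
            count += len(QUESTION_RE.findall(node.text))
    return count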
Table 3: Metric notations.
Notation Meaning
1C one column on the page
2C two columns on the page
MC mixed columns on page
1QP one question per page/column
1QV one question that starts on one
page/column and ends on another
VQP multiple questions in one
page/column
QF questions with figure/graph
QT questions with tables
- not available in the selected test set
N tool does not recognize
Several approaches are currently available for extracting data from PDF. To compare the tools, they must largely share the same general objectives; for this reason, Excalibur is compared with Tabula, CyberPDF with PDFMiner, and Aletheia with ExamClipper, since each pair provides resources for similar extractions. ExamClipper is software under development by a research group at the Federal University of Santa Maria, used to extract regions of interest. Aletheia was chosen because, among its several options, it provides an extraction mode similar to ExamClipper. For this, the XML files created during the development of the ground truth, with their respective images, were reused to perform the extractions. Obviously this is not an entirely fair comparison, although the output of the two tools and the type of extraction are similar and fit the same extraction category.
For extracting the answer keys, the Excalibur and Tabula tools were used; both address the same objective, the extraction of tabular PDF data. These tools were selected to extract the answers of the objective questions.
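Since Excalibur is a web front end built on the Camelot library, a command-line approximation of this answer-key extraction can be sketched with Camelot (tabula-py offers an analogous read_pdf call for Tabula); the file name and page below are placeholders.

import camelot  # camelot-py, the library behind the Excalibur web tool

# Hypothetical file: extract the answer-key table from the first page of
# one Enade answer PDF and save it as CSV for later comparison.
tables = camelot.read_pdf("enade_2019_answers.pdf", pages="1")
if tables.n:
    tables[0].to_csv("enade_2019_answers.csv")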
To calculate the performance of the tools, a simple rule of three was used, in which only the complete extractions of questions are counted. In the following formula, the value of "total questions to extract" is equivalent to the total number of questions identified in the ground truth. The simple arithmetic mean over each year was then computed to obtain the average for each metric. Finally, the mean was applied again to obtain the average value over the entire dataset for each tool.
extraction rate (%) = (questions extracted by tool × 100) / total questions to extract
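As a worked illustration of this calculation, with invented numbers rather than results from the paper:

def extraction_rate(extracted, total):
    # Simple rule of three: percentage of fully extracted questions.
    return extracted * 100 / total


# Hypothetical numbers: three tests of the same year, for one metric.
per_test = [extraction_rate(36, 40), extraction_rate(28, 40), extraction_rate(40, 40)]
year_mean = sum(per_test) / len(per_test)
print(round(year_mean, 1))  # 86.7; the same averaging is then repeated over the years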
4 PERFORMANCE EVALUATION
AND RESULTS
4.1 Performance Evaluation
To verify the performance of the selected tools, experiments were performed on the dataset. The evaluation criteria introduced in Section 3.2 are easily interpretable, but measuring them is not trivial.
Starting with the general formation questions: for all tests of the same year, these questions are always the same, so the experiments were carried out only once on them for each year. Counting was then performed only on the area-specific questions. This approach reduced the amount of computation required and, therefore, the time needed to perform the analysis. Initially the work would cover 14,386 questions; in the end, 11,226 objective and discursive questions were used for extraction.
Text extraction plays an important role for data
processing workflows in digital libraries. Complex
file formats make the extraction process error prone
and make it very difficult to verify the accuracy of the
extraction components.
Based on digital preservation and information retrieval scenarios, three quality requirements for the effectiveness of text extraction tools are identified: 1) is a certain text snippet correctly extracted from a document, 2) does the extracted text appear in the right order relative to other elements, and 3) is the structure of the text preserved (Duretec et al., 2017).
The tools were executed to obtain the final output, which was then compared with the results of their competitor. Both results were then compared with the ground truth.
For Excalibur and Tabula, which extract tabular data and were both used to extract the answer keys, the evaluation covered only the objective answers; discursive answers are not included in the count.
The main objective of the evaluation is to analyze
each tool, comparing its output files with the ground
truth files, using the set of established metrics and cri-
teria. This was more difficult than expected, especially the comparison of tool outputs. Next, we present the results of these experiments.
4.2 Results
This section presents the results obtained from exper-
iments carried out using the extraction tools. For each
tool, a concise result is provided, according to the
criteria addressed. The full results are available at:
https://github.com/karinawie/PDFExtraction.
The ground truth information for each page of the tests was recorded in a spreadsheet in order to compare it with the information extracted by the extraction tools. The analysis of the experimental results demonstrates the effectiveness of the proposed measures and provides valuable information on the performance and characteristics of the evaluated tools.
Table 4 provides an overview of the evaluation results for each of the PDF extraction tools with respect to the average time, in seconds, required to extract the data from a single PDF file. The average time was computed over 5 identical tests for all tools, considering only the extraction time and not the time to load the tests into the tools. The ExamClipper tool took approximately 4 minutes to complete this task, and the Aletheia tool took about 3 minutes. Note that the results of Aletheia are biased, since the regions were previously configured in the ground truth; even so, its time was included.
Table 5 shows the results of the Excalibur and Tabula tools, together with their performance in extracting the tabular data, which in this
Table 4: Overview of the results of the evaluation process of extracting information from PDFs.

Tool          Time (s)
Excalibur     20
Tabula        16
PDFMiner      16
CyberPDF      22
ExamClipper   240
Aletheia      180
work corresponds to the objective answer keys. For these tools, only the QT metric was calculated, since the answers are presented in tables.
Table 5: Overview of the results of the tabular data extraction tools.

Metric   Excalibur   Tabula
1QP      N           N
1QV      N           N
VQP      N           N
QF       N           N
QT       99.4        97.7
Figure 2 shows the ground truth (yellow) compared to Excalibur (blue) and Tabula (red). It can be observed that, between 2005 and 2007, the Tabula tool had a slight difficulty in extracting the alternatives of the answer keys.
Figure 2: Excalibur and Tabula extraction results detailed
by year and compared to ground truth.
Figure 3 shows that the Excalibur and Tabula tools achieve very similar extraction quantities, although Excalibur presents a better performance in data extraction. Regarding extraction time, Tabula was slightly faster.
The metrics 1C, 2C, MC, 1QP, 1QV, VQP, QF and QT in Table 6 are presented as the simple arithmetic mean of the percentages obtained for the years 2004-2019. According to this table, the metrics for two columns (2C) have a relatively low recovery rate,
Figure 3: General comparison of Excalibur and Tabula ex-
traction, the higher the result, the more efficient the tool is.
as the tools do not identify that the page contains two columns, so the extraction ends up being performed as a single column (1C). The same applies to the mixed column (MC) metric.
Table 6: Overview of the results of the textual data extraction tools.

           CyberPDF                PDFMiner
Metric   1C      2C      MC      1C      2C      MC
1QP      74.1    15.4    -       76.7    39.5    -
1QV      68.1    0       -       88.8    62.5    -
VQP      80.4    18.6    41.9    74.6    55.5    33.6
QF       N       N       N       N       N       N
QT       N       N       N       N       N       N
Table 7 shows the results of the extractions performed by Aletheia and ExamClipper, which identify regions and extract the cutouts from the PDFs. It was decided to carry out the extraction with the Aletheia tool using the same XML files configured for the ground truth. The disadvantage of this choice is that the comparison with ExamClipper does not match the procedure used in the other extractions, since the XML had received manual adjustments and corrections in the segmentations, for example, joining the lines of a question into the same region. This ended up favoring 100% extraction for all the metrics evaluated; as already noted, this is not an entirely fair comparison, although the output of the two tools and the type of extraction are similar and fit the same category. The advantage is that all questions were extracted and are available on GitHub: https://github.com/karinawie/PDFExtraction/tree/master/extractions/aletheia.
The values for ExamClipper were not obtained with manual selection of each region, but with the automatic detection that the tool provides in its cutout interface.
Table 7: Overview of the extractions performed by Aletheia and ExamClipper.

           Aletheia                ExamClipper
Metric   1C      2C      MC      1C      2C      MC
1QP      100     100     -       85.8    61.1    -
1QV      100     100     -       70.2    55.0    -
VQP      100     100     100     64.7    48.2    36.9
QF       100     100     100     68.3    37.1    56.6
QT       100     100     -       70.8    48.9    -
5 CONCLUSIONS
This article presented a performance evaluation of the PDF extraction tools Excalibur, Tabula, CyberPDF, PDFMiner, Aletheia and ExamClipper. We ran Excalibur and Tabula on the 396 answer keys, and CyberPDF, PDFMiner, Aletheia and ExamClipper on the 343 tests.
With the settings used in this work, it was possible to evaluate the extraction tools. Based on the extracted data, the Excalibur tool recognizes more tables than Tabula, although it takes a few more seconds per extraction. PDFMiner is able to automatically identify multiple questions under all the stipulated metrics, while CyberPDF cannot automatically identify questions that start on one page/column and end on another, nor questions laid out in two columns. The PDFMiner tool also extracts more quickly. Although the Aletheia extractions are biased and its results are all at 100%, it was possible to obtain all the extractions of the questions used in this research. ExamClipper offers the option of manually adjusting the regions for cutouts, which takes longer; if this had been applied, its extractions would also have reached 100%.
These results can change within certain limits, for example, by manually adjusting some of the regions that the tools select, changing input settings, among other actions. The reported results reflect the automatic identifications that the tools perform without manual interference, except for Aletheia. If the tests used a standard layout for all courses in all years, the extractions would be more efficient, at least with the CyberPDF tool, which uses the coordinates of one file as a template for the other files.
As a suggestion for future work, we intend to carry out experiments with other extraction tools not covered in this study. The extracted information is a valuable knowledge asset for research, providing useful, informative and timely information for several users who may benefit from it. It can serve as study material for students who intend
to prepare for other tests, helping them to learn and retain new knowledge. It can also become interesting material to be used in the classroom by teachers, both to facilitate understanding and as exercises. A teacher, for instance, may maintain a database of questions and answers from which to generate new tests. Another important aspect is the possibility of building a question database from the extracted questions. Making these extracted questions available in databases or systems was not the objective of this work, but it is a suggestion for future work.
REFERENCES
Alaei, A., Nagabhushan, P., and Pal, U. (2011). A bench-
mark kannada handwritten document dataset and its
segmentation. In 2011 International Conference on
Document Analysis and Recognition, pages 141–145.
Bast, H. and Korzen, C. (2017). A Benchmark and
Evaluation for Text Extraction from PDF. In 2017
ACM/IEEE Joint Conference on Digital Libraries
(JCDL), pages 1–10.
Budhiraja, S. S. (2018). Extracting Specific Text From Doc-
uments Using Machine Learning Algorithms. Thesis
of computer science, Lakehead University, Canada.
Choudhury, S. R., Mitra, P., Kirk, A., Szep, S., Pellegrino,
D., Jones, S., and Giles, C. L. (2013). Figure meta-
data extraction from digital documents. In 2013 12th
International Conference on Document Analysis and
Recognition, pages 135–139.
Constantin, A., Pettifer, S., and Voronkov, A. (2013).
PDFX: Fully-Automated PDF-to-XML Conversion of
Scientific Literature. In Proceedings of the 2013 ACM
Symposium on Document Engineering, DocEng ’13,
page 177–180, New York, NY, USA. Association for
Computing Machinery.
Duretec, K., Rauber, A., and Becker, C. (2017). A text ex-
traction software benchmark based on a synthesized
dataset. In Proceedings of the 17th ACM/IEEE Joint
Conference on Digital Libraries, JCDL ’17, page
109–118. IEEE Press.
Excalibur (2018). Excalibur: Pdf table extraction for hu-
mans. Accessed: 2020-11-29.
Fang, J., Tao, X., Tang, Z., Qiu, R., and Liu, Y. (2012).
Dataset, ground-truth and performance metrics for ta-
ble detection evaluation. In 2012 10th IAPR Inter-
national Workshop on Document Analysis Systems,
pages 445–449.
Fang Yuan and Bo Lu (2005). A new method of information
extraction from PDF files. In 2005 International Con-
ference on Machine Learning and Cybernetics, vol-
ume 3, pages 1738–1742 Vol. 3.
gov.br (2021). Exame Nacional de Desempenho dos Estu-
dantes (Enade). Accessed: 2020-01-16.
Hadjar, K., Rigamonti, M., Lalanne, D., and Ingold, R.
(2004). Xed: a new tool for extracting hidden struc-
tures from electronic documents. In First Interna-
tional Workshop on Document Image Analysis for Li-
braries, 2004. Proceedings., pages 212–224.
Hassan, T. and Baumgartner, R. (2007). Table recognition
and understanding from pdf files. In Ninth Interna-
tional Conference on Document Analysis and Recog-
nition (ICDAR 2007), volume 2, pages 1143–1147.
INEP (2020). Exame Nacional de Desempenho dos Estu-
dantes (Enade). Accessed: 2020-10-07.
Li, P., Jiang, X., and Shatkay, H. (2018). Extracting figures
and captions from scientific publications. In Proceed-
ings of the 27th ACM International Conference on In-
formation and Knowledge Management, CIKM ’18,
page 1595–1598, New York, NY, USA. Association
for Computing Machinery.
Lima, R. and Cruz, E. F. (2019). Extraction and multi-
dimensional analysis of data from unstructured data
sources: A case study. In ICEIS.
Lipinski, M., Yao, K., Breitinger, C., Beel, J., and Gipp, B.
(2013). Evaluation of header metadata extraction ap-
proaches and tools for scientific pdf documents. JCDL
’13, page 385–386, New York, NY, USA. Association
for Computing Machinery.
Liu, Y., Bai, K., Mitra, P., and Giles, C. L. (2007). Table-
seer: Automatic table metadata extraction and search-
ing in digital libraries. In Proceedings of the 7th
ACM/IEEE-CS Joint Conference on Digital Libraries,
JCDL ’07, page 91–100, New York, NY, USA. Asso-
ciation for Computing Machinery.
Manuel Aristarán, Mike Tigas, Jeremy B. Merrill, Jason Das, David Frackman and Travis Swicegood (2018). Tabula is a tool for liberating data tables locked inside PDF files. Accessed: 2020-07-20.
Parizi, R. M., Guo, L., Bian, Y., Azmoodeh, A., De-
hghantanha, A., and Choo, K. R. (2018). Cyber-
pdf: Smart and secure coordinate-based automated
health pdf data batch extraction. In 2018 IEEE/ACM
International Conference on Connected Health: Ap-
plications, Systems and Engineering Technologies
(CHASE), pages 106–111.
Ramakrishnan, C., Patnia, A., Hovy, E., and Burns, G. A.
(2012). Layout-aware text extraction from full-text
PDF of scientific articles. Source Code for Biology
and Medicine, 7(1):7.
Strecker, T., v. Beusekom, J., Albayrak, S., and Breuel,
T. M. (2009). Automated ground truth data genera-
tion for newspaper document images. In 2009 10th
International Conference on Document Analysis and
Recognition, pages 1275–1279.
Yusuke Shinyama (2014). Python pdf parser and analyzer.
Accessed: 2020-05-21.
Øyvind Raddum Berg (2011). High precision text extraction from PDF documents. Thesis in informatics, University of Oslo.