Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation

Nuno Moniz; Fátima Rodrigues

Topics: Information Extraction; Structured Data Analysis and Statistical Methods

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2012) - KDIR, pages 123-131, 2012, Barcelona, Spain

Authors: Nuno Moniz and Fátima Rodrigues

Affiliation: Institute of Engineering and Polytechnic of Porto, Portugal

Keyword(s): Information Retrieval, Text Extraction, PDF.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Information Extraction ; Knowledge Discovery and Information Retrieval ; Knowledge-Based Systems ; Structured Data Analysis and Statistical Methods ; Symbolic Systems

Abstract: This paper presents an approach for text processing of PDF documents with a well-defined layout structure. The approach exploits the font structure of PDF documents through perceptual grouping. It consists of extracting text objects from the content stream of each document and grouping them according to a set criterion, also making use of geometric-based regions to achieve the correct reading order. The developed approach processes the PDF documents using logical and structural rules to extract the entities present in them, and returns an optimized XML representation of each PDF document that is suitable for re-use, for example in text categorization. The system was trained and tested with Portuguese Legislation PDF documents extracted from the electronic Republic's Diary (Diário da República Electrónico). Evaluation results indicate that the approach performs well on these documents.
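
The pipeline outlined in the abstract (order text objects by geometric region, group them by font, emit XML) can be illustrated with a minimal sketch. This is not the authors' implementation: the `TextObject` fields, the single `column_split` threshold standing in for geometric regions, the "same font name and size" grouping criterion, the XML element names, and the sample values are all assumptions made for illustration, and the text objects are supplied by hand rather than parsed from a real PDF content stream.

```python
from dataclasses import dataclass
from itertools import groupby
import xml.etree.ElementTree as ET


@dataclass
class TextObject:
    """Hypothetical stand-in for a text object taken from a content stream."""
    text: str
    font: str
    size: float
    x: float  # left edge, in page units
    y: float  # distance from the top of the page


def reading_order(objects, column_split):
    """Sort objects column by column (left column first), top to bottom."""
    return sorted(objects, key=lambda o: (o.x >= column_split, o.y, o.x))


def group_by_font(objects):
    """Merge consecutive objects that share the same font name and size."""
    groups = []
    for (font, size), run in groupby(objects, key=lambda o: (o.font, o.size)):
        groups.append((font, size, " ".join(o.text for o in run)))
    return groups


def to_xml(groups):
    """Serialize the font groups as a flat XML document."""
    root = ET.Element("document")
    for font, size, text in groups:
        block = ET.SubElement(root, "block", font=font, size=str(size))
        block.text = text
    return ET.tostring(root, encoding="unicode")


# Invented sample data for a two-column page; not taken from the paper.
objects = [
    TextObject("Artigo 2.º", "Times-Bold", 10.0, 320.0, 50.0),
    TextObject("Decreto-Lei", "Times-Bold", 12.0, 40.0, 50.0),
    TextObject("n.º 1/2012", "Times-Bold", 12.0, 110.0, 50.0),
    TextObject("O presente diploma", "Times-Roman", 10.0, 40.0, 70.0),
]
xml_output = to_xml(group_by_font(reading_order(objects, column_split=300.0)))
print(xml_output)
```

In this sketch the two bold 12-point fragments in the left column merge into one block, the body text forms a second block, and the right-column heading comes last, which is the kind of reading-order correction the abstract attributes to geometric regions.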

CC BY-NC-ND 4.0

Paper citation in several formats:

Moniz, N. and Rodrigues, F. (2012). Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2012) - KDIR; ISBN 978-989-8565-29-7; ISSN 2184-3228, SciTePress, pages 123-131. DOI: 10.5220/0004103501230131

@conference{kdir12,
author={Nuno Moniz and Fátima Rodrigues},
title={Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2012) - KDIR},
year={2012},
pages={123-131},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004103501230131},
isbn={978-989-8565-29-7},
issn={2184-3228},
}

TY - CONF

JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2012) - KDIR
TI - Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation
SN - 978-989-8565-29-7
IS - 2184-3228
AU - Moniz, N.
AU - Rodrigues, F.
PY - 2012
SP - 123
EP - 131
DO - 10.5220/0004103501230131
PB - SciTePress