Paper

Authors: Andrei Puha; Octavian Rinciog and Vlad Posea

Affiliation: Politehnica University of Bucharest, Romania

ISBN: 978-989-758-318-6

Keyword(s): Open Data, Image Processing, OCR.

Related Ontology Subjects/Areas/Topics: Data Engineering; Data Management and Quality; Information Quality; Open Data

Abstract: Open data published by public institutions is one of the most important resources available online. Using this public information, decision makers can improve the lives of citizens. Unfortunately, such data is most often published as files, some of which, such as scanned PDF files, are not easily machine-processable. In this paper we present an algorithm that enhances the available open data knowledge by efficiently extracting tabular data from scanned PDF documents. The proposed workflow consists of several distinct steps: first, the PDF documents are converted into images; next, the images are preprocessed using image processing techniques. The final steps are adaptive binarization of the images, recognition of the table structure, Optical Character Recognition (OCR) on each cell of the detected tables, and export of the results as CSV. Tests on several low-quality scanned PDF documents showed that our method performs comparably to dedicated commercial OCR software. We have integrated the algorithm as a service in our platform that converts open data into Linked Open Data.
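
The later steps of the workflow (adaptive binarization, table structure recognition, per-cell OCR, CSV export) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a mean-based adaptive threshold, detects ruling lines by row and column projection, and works on a tiny synthetic grayscale grid instead of a rendered PDF page. All function names, parameters (`window`, `offset`, `min_fill`) and the `count_ink` stand-in for an OCR engine are hypothetical.

```python
import csv
import io


def adaptive_binarize(gray, window=3, offset=10):
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset."""
    height, width = len(gray), len(gray[0])
    r = window // 2
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            neighbours = [
                gray[j][i]
                for j in range(max(0, y - r), min(height, y + r + 1))
                for i in range(max(0, x - r), min(width, x + r + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            binary[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return binary


def find_rulings(binary, axis, min_fill=0.6):
    """Return (start, end) runs of rows (axis=0) or columns (axis=1)
    whose share of ink pixels is at least min_fill."""
    lines = binary if axis == 0 else list(zip(*binary))
    runs = []
    for idx, line in enumerate(lines):
        if sum(line) / len(line) >= min_fill:
            if runs and runs[-1][1] == idx - 1:
                runs[-1] = (runs[-1][0], idx)  # extend a ruling thicker than 1 px
            else:
                runs.append((idx, idx))
    return runs


def extract_cells(binary):
    """Crop the regions enclosed by consecutive horizontal and vertical rulings."""
    rows = find_rulings(binary, axis=0)
    cols = find_rulings(binary, axis=1)
    table = []
    for (_, top), (bottom, _) in zip(rows, rows[1:]):
        table.append([
            [line[left + 1:right] for line in binary[top + 1:bottom]]
            for (_, left), (right, _) in zip(cols, cols[1:])
        ])
    return table


def table_to_csv(cells, recognize):
    """Apply the recognize callable to every cell and serialise the grid as CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in cells:
        writer.writerow([recognize(cell) for cell in row])
    return buffer.getvalue()


def count_ink(cell):
    """Hypothetical stand-in for an OCR engine: reports the number of ink pixels."""
    return str(sum(map(sum, cell)))


def make_demo_image(height=9, width=13, paper=200, ink=30):
    """Synthetic grayscale scan: a 2x2 ruled table with one dark mark in the top-left cell."""
    gray = [[paper] * width for _ in range(height)]
    for y in (1, 4, 7):          # horizontal rulings
        for x in range(1, 12):
            gray[y][x] = ink
    for x in (1, 6, 11):         # vertical rulings
        for y in range(1, 8):
            gray[y][x] = ink
    gray[3][3] = ink             # a single "character" pixel
    return gray


if __name__ == "__main__":
    binary = adaptive_binarize(make_demo_image())
    print(table_to_csv(extract_cells(binary), count_ink))
```

In a real pipeline the `recognize` callable would wrap an actual OCR engine and the input would be a page image rendered from the PDF; passing the recognizer as a parameter keeps the table-structure logic independent of that choice.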

CC BY-NC-ND 4.0

Paper citation in several formats:
Puha, A.; Rinciog, O. and Posea, V. (2018). Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images. In Proceedings of the 7th International Conference on Data Science, Technology and Applications - Volume 1: DATA, ISBN 978-989-758-318-6, pages 220-228. DOI: 10.5220/0006862402200228

@conference{data18,
author={Andrei Puha and Octavian Rinciog and Vlad Posea},
title={Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images},
booktitle={Proceedings of the 7th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2018},
pages={220-228},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006862402200228},
isbn={978-989-758-318-6},
}

TY - CONF
JO - Proceedings of the 7th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images
SN - 978-989-758-318-6
AU - Puha, A.
AU - Rinciog, O.
AU - Posea, V.
PY - 2018
SP - 220
EP - 228
DO - 10.5220/0006862402200228
