Authors: Andrei Puha, Octavian Rinciog and Vlad Posea

Affiliation: Politehnica University of Bucharest, Romania

Keyword(s): Open Data, Image Processing, OCR.

Related Ontology Subjects/Areas/Topics: Data Engineering; Data Management and Quality; Information Quality; Open Data

Abstract: Open data published by public institutions are among the most important resources available online. Using this public information, decision makers can improve the lives of citizens. Unfortunately, these open data are most often published as files, some of which, such as scanned PDF files, are not easily processable. In this paper we present an algorithm that enhances open data knowledge by efficiently extracting tabular data from scanned PDF documents. The proposed workflow consists of several distinct steps: first, the PDF documents are converted into images; next, the images are preprocessed using specific processing techniques. The final steps are running an adaptive binarization of the images, recognizing the structure of the tables, applying Optical Character Recognition (OCR) to each cell of the detected tables, and exporting them as CSV. Tests on several low-quality scanned PDF documents showed that our methodology performs comparably to dedicated paid OCR software. We have integrated this algorithm as a service in our platform that converts open data into Linked Open Data.
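
The later steps of the workflow in the abstract (adaptive binarization, table structure recognition, per-cell OCR, CSV export) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the mean-based local threshold, the projection-based detection of ruling lines, and all function names (`adaptive_binarize`, `ruling_positions`, `cell_spans`, `table_to_csv`) are assumptions made for illustration. The `ocr_cell` argument is a hypothetical callable standing in for a real OCR engine, and PDF-to-image conversion and preprocessing are left out.

```python
import csv
import io


def adaptive_binarize(gray, window=3, offset=10):
    """Assumed mean adaptive threshold: a pixel is ink (1) when it is
    darker than the mean of its local window minus `offset`."""
    h, w = len(gray), len(gray[0])
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [gray[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            out[y][x] = 1 if gray[y][x] < mean - offset else 0
    return out


def ruling_positions(binary, axis, min_fill=0.9):
    """Indices of rows (axis=0) or columns (axis=1) that are almost
    entirely ink, treated here as table ruling lines."""
    h, w = len(binary), len(binary[0])
    if axis == 0:
        return [y for y in range(h) if sum(binary[y]) >= min_fill * w]
    return [x for x in range(w)
            if sum(binary[y][x] for y in range(h)) >= min_fill * h]


def cell_spans(lines):
    """Turn ruling-line indices into half-open (start, end) spans of the
    gaps between them; adjacent indices (thick lines) yield no span."""
    return [(a + 1, b) for a, b in zip(lines, lines[1:]) if b - a > 1]


def table_to_csv(gray, ocr_cell):
    """Binarize, find the cell grid, run `ocr_cell` on each cell crop,
    and return the table as CSV text."""
    binary = adaptive_binarize(gray)
    row_spans = cell_spans(ruling_positions(binary, axis=0))
    col_spans = cell_spans(ruling_positions(binary, axis=1))
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for y0, y1 in row_spans:
        writer.writerow([ocr_cell([row[x0:x1] for row in binary[y0:y1]])
                         for x0, x1 in col_spans])
    return buf.getvalue()
```

In practice `ocr_cell` would wrap an OCR engine that receives the cropped cell image; the sketch only fixes the order of the steps and the shape of the data passed between them.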

CC BY-NC-ND 4.0

Paper citation in several formats:
Puha, A.; Rinciog, O. and Posea, V. (2018). Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images. In Proceedings of the 7th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-318-6; ISSN 2184-285X, SciTePress, pages 220-228. DOI: 10.5220/0006862402200228

@conference{data18,
author={Andrei Puha and Octavian Rinciog and Vlad Posea},
title={Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images},
booktitle={Proceedings of the 7th International Conference on Data Science, Technology and Applications - DATA},
year={2018},
pages={220-228},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006862402200228},
isbn={978-989-758-318-6},
issn={2184-285X},
}

TY - CONF
JO - Proceedings of the 7th International Conference on Data Science, Technology and Applications - DATA
TI - Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images
SN - 978-989-758-318-6
IS - 2184-285X
AU - Puha, A.
AU - Rinciog, O.
AU - Posea, V.
PY - 2018
SP - 220
EP - 228
DO - 10.5220/0006862402200228
PB - SciTePress