Authors: Chems Eddine Neche ; Yolande Belaíd and Abdel Belaíd

Affiliation: Vandoeuvre-Lès-Nancy, F-54506, France

Keyword(s): Word Embeddings, Stream Flow Segmentation, Pages Pair Classification, Continuity, Rupture.

Abstract: Page stream segmentation into single documents is a very common task in companies and administrations that process incoming mail. It is not straightforward, because document boundaries are not always obvious and it is not always easy to find common features between the pages of the same document. In this paper, we compare existing segmentation models and propose a new one, named AGRU, based on GRUs (Gated Recurrent Units) and an attention mechanism. This model uses the text content of the previous page and the current page to determine whether both pages belong to the same document. Thanks to its attention mechanism, the model is able to recognize words that characterize the first page of a document. Training and evaluation are carried out on two datasets: Tobacco-800 and READ-Corpus. The former is a public dataset on which our model reaches an F1 score of 90%, and the latter is a private dataset on which our model reaches an F1 score of 96%.
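
The pairwise formulation in the abstract (does the current page belong to the same document as the previous one?) implies a simple post-processing step: a sequence of continuity/rupture decisions is turned into document segments. The following is a minimal sketch of that step only, assuming the per-pair decisions are already available as booleans. The function name `segment_stream` and the boolean encoding are illustrative assumptions, not taken from the paper, and the classifier itself (AGRU) is not reproduced here.

```python
from typing import List


def segment_stream(same_document: List[bool]) -> List[List[int]]:
    """Group page indices of a stream into documents.

    same_document[i] is True when page i+1 continues the document of
    page i (continuity) and False when page i+1 starts a new document
    (rupture). The stream is assumed to contain at least one page, so
    n decisions describe n+1 pages. (Hypothetical helper, for
    illustration only.)
    """
    documents = [[0]]  # page 0 always opens the first document
    for i, continues in enumerate(same_document):
        page = i + 1
        if continues:
            documents[-1].append(page)  # continuity: extend current document
        else:
            documents.append([page])    # rupture: open a new document
    return documents


# Six pages, with ruptures before page 2 and before page 5.
print(segment_stream([True, False, True, True, False]))
```

With these inputs, pages 0 and 1 form the first document, pages 2 to 4 the second, and page 5 the third.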

CC BY-NC-ND 4.0

Paper citation in several formats:
Neche, C.; Belaíd, Y. and Belaíd, A. (2020). Use of Language Models for Document Stream Segmentation. In Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods - ICPRAM; ISBN 978-989-758-397-1; ISSN 2184-4313, SciTePress, pages 220-227. DOI: 10.5220/0009146402200227

@conference{icpram20,
author={Chems Eddine Neche and Yolande Belaíd and Abdel Belaíd},
title={Use of Language Models for Document Stream Segmentation},
booktitle={Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods - ICPRAM},
year={2020},
pages={220-227},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009146402200227},
isbn={978-989-758-397-1},
issn={2184-4313},
}

TY - CONF

JO - Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods - ICPRAM
TI - Use of Language Models for Document Stream Segmentation
SN - 978-989-758-397-1
IS - 2184-4313
AU - Neche, C.
AU - Belaíd, Y.
AU - Belaíd, A.
PY - 2020
SP - 220
EP - 227
DO - 10.5220/0009146402200227
PB - SciTePress