PaperClip: Automated Dossier Reorganizing

Wessel Stoop, Iris Hendrickx, Tom van Ees

Abstract

We investigate the creation of a robust algorithm for document identification and page ordering in a digital mail room in the banking sector. PaperClip is a system that takes files containing pages of various documents as input, and returns multiple files, each containing all the pages of one document in the correct order. PaperClip performs (1) document type classification and (2) page number classification on each page, and then (3) merges the results. We experimented with various algorithms and methods for these three steps, and we performed an elaborate evaluation to measure different aspects of the methods. The best performing setup achieved a cut F-score of 86% and a V-measure of 0.91. This is high enough to fulfill the business needs of the banking sector.
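
As a rough illustration of the three-step idea and of the V-measure used in the evaluation, the sketch below merges per-page predictions into ordered documents and scores a page-to-document grouping. It is not the authors' implementation: the function names, the tuple layout, and the assumption that each input file holds at most one document per type are all hypothetical simplifications. The V-measure follows the standard definition (harmonic mean of homogeneity and completeness, beta = 1).

```python
import math
from collections import Counter, defaultdict


def merge_pages(predictions):
    """Group pages by predicted document type, ordered by predicted page number.

    `predictions` is a list of (page_id, doc_type, page_number) tuples.
    Hypothetical simplification: at most one document per type per input file.
    """
    documents = defaultdict(list)
    for page_id, doc_type, page_number in predictions:
        documents[doc_type].append((page_number, page_id))
    return {
        doc_type: [page_id for _, page_id in sorted(pages)]
        for doc_type, pages in documents.items()
    }


def _entropy(labels):
    """H(labels) in nats."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(labels, given):
    """H(labels | given) in nats, estimated from co-occurrence counts."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(
        count / n * math.log(count / given_counts[g])
        for (g, _), count in joint.items()
    )


def v_measure(true_docs, predicted_docs):
    """V-measure with beta = 1 for two parallel lists of per-page labels."""
    h_true, h_pred = _entropy(true_docs), _entropy(predicted_docs)
    homogeneity = (
        1.0 if h_true == 0
        else 1 - _conditional_entropy(true_docs, predicted_docs) / h_true
    )
    completeness = (
        1.0 if h_pred == 0
        else 1 - _conditional_entropy(predicted_docs, true_docs) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Made-up page predictions, as steps (1) and (2) might produce them.
example = [
    ("p1", "mortgage", 2),
    ("p2", "payslip", 1),
    ("p3", "mortgage", 1),
    ("p4", "mortgage", 3),
]
merged = merge_pages(example)
```

Here `merged` maps each predicted document type to its page identifiers in reading order. A real merge step would also need to separate several documents of the same type within one file, which this sketch deliberately ignores.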

Paper Citation


in Harvard Style

Stoop W., Hendrickx I. and van Ees T. (2017). PaperClip: Automated Dossier Reorganizing. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-222-6, pages 471-478. DOI: 10.5220/0006195904710478


in Bibtex Style

@conference{icpram17,
author={Wessel Stoop and Iris Hendrickx and Tom van Ees},
title={PaperClip: Automated Dossier Reorganizing},
booktitle={Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2017},
pages={471-478},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006195904710478},
isbn={978-989-758-222-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - PaperClip: Automated Dossier Reorganizing
SN - 978-989-758-222-6
AU - Stoop W.
AU - Hendrickx I.
AU - van Ees T.
PY - 2017
SP - 471
EP - 478
DO - 10.5220/0006195904710478