DOCUMENT CLASSIFICATION - Combining Structure and Content

Samaneh Chagheri, Sylvie Calabretto, Catherine Roussey, Cyril Dumoulin

2011

Abstract

Technical documentation, such as user manuals and manufacturing documents, is now an important part of industrial production. Indeed, given the complexity of modern products, they can neither be manufactured nor used without such documents. The growing volume of these documents stored in electronic format therefore calls for an automatic classification system that assigns them to predefined classes and allows information to be retrieved quickly. These documents are also strongly structured and contain elements such as tables and schemas. Traditional document classification, however, typically considers only the document text and ignores its structural elements. In this paper, we propose a method that uses structural elements to build the document feature vector for classification. Each feature in this vector is a combination of a term and the structure in which it occurs. The document structure is represented by the tags of the XML document. The SVM algorithm is used for both learning and classification.
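
As an illustration of a feature that combines a term with the structure, the following minimal Python sketch pairs each term with the tag of the XML element that directly contains it. It is not the authors' implementation: the function name `structured_features`, the `tag:term` naming scheme, the simple tokenizer, the raw-count weighting, and the sample document are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def structured_features(xml_text):
    """Count 'tag:term' features, pairing each term with its enclosing tag.

    Assumption: raw counts are used as weights; the abstract does not
    state which weighting scheme the paper applies.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # Text directly inside this element: its leading text plus the
        # tail text that follows each of its children.
        text = element.text or ""
        for child in element:
            text += " " + (child.tail or "")
        # Assumed tokenizer: lowercase alphanumeric runs.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            counts[f"{element.tag}:{term}"] += 1
    return counts


# Hypothetical technical document, invented for this example.
doc = (
    "<manual><title>Pump maintenance</title>"
    "<table><cell>pump pressure</cell></table></manual>"
)
features = structured_features(doc)
for name in sorted(features):
    print(name, features[name])
```

In this sketch the term "pump" yields two distinct features, `title:pump` and `cell:pump`, so a classifier can weight the same word differently depending on where it appears. The resulting counts could then be mapped onto a fixed vocabulary and passed to an SVM implementation.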

References

  1. Aïtelhadj, A., Mezghiche, M., & Souam, F. (2009). Classification de Structures Arborescentes: Cas de Documents XML. CORIA 2009, 301-317.
  2. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 273-297.
  3. Dalamagas, T., Cheng, T., Winkel, K.-J., & Sellis, T. (2005). Clustering XML Documents Using Structural Summaries. EDBT, 547-556.
  4. Doucet, A., & Ahonen-Myka, H. (2002). Naive Clustering of a Large XML Document Collection. INEX Workshop 2002, 81-87.
  5. Ghosh, S., & Mitra, P. (2008). Combining Content and Structure Similarity for XML Document. ICPR, 1-4.
  6. Joachims, T. (1999). Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 169-184.
  7. Salton, G. (1968). Search and retrieval experiments in real-time information retrieval. (C. University, Ed.) 1082-1093.
  8. Vercoustre, A.-M., Fegas, M., Lechevallier, Y., & Despeyroux, T. (2006). Classification de documents XML à partir d'une représentation linéaire des arbres de ces documents. EGC 2006.
  9. Wisniewski, G., Denoyer, L., & Gallinari, P. (2005). Classification automatique de documents structurés. Application au corpus d'arbres étiquetés de type XML. CORIA 2005 Grenoble, 52-66.
  10. Wu, J., & Tang, J. (2008). A bottom-up approach for XML documents classification. (ACM, Ed.) ACM International Conference Proceeding Series; Vol. 299, 131-137.
  11. Yan, H., Jin, D., Li, L., Liu, B., & Hao, Y. (2008). Feature Matrix Extraction and Classification of XML Pages. APWeb 2008 Workshops, 210-219.
  12. Yang, J., & Wang, S. (2010). Extended VSM for XML Document Classification Using Frequent Subtrees. INEX 2009, 441-448.
  13. Yang, J., & Zhang, F. (2008). XML Document Classification Using Extended VSM. INEX 2007, 234-244.
  14. Yi, J., & Sundaresan, N. (2000). A classifier for semistructured documents. KDD 2000, 340-344.


Paper Citation


in Harvard Style

Chagheri S., Calabretto S., Roussey C. and Dumoulin C. (2011). DOCUMENT CLASSIFICATION - Combining Structure and Content. In Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8425-53-9, pages 95-100. DOI: 10.5220/0003505100950100


in Bibtex Style

@conference{iceis11,
author={Samaneh Chagheri and Sylvie Calabretto and Catherine Roussey and Cyril Dumoulin},
title={DOCUMENT CLASSIFICATION - Combining Structure and Content},
booktitle={Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2011},
pages={95-100},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003505100950100},
isbn={978-989-8425-53-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - DOCUMENT CLASSIFICATION - Combining Structure and Content
SN - 978-989-8425-53-9
AU - Chagheri S.
AU - Calabretto S.
AU - Roussey C.
AU - Dumoulin C.
PY - 2011
SP - 95
EP - 100
DO - 10.5220/0003505100950100

