Towards a Bio-inspired Approach to Match Heterogeneous Documents

Nourelhouda Yahi, Hacene Belhadef, Mathieu Roche, Amer Draa

Abstract

Matching heterogeneous text documents that come from different sources means matching the data extracted from these documents, which is generally structured as vectors. Matching accuracy depends directly on what these vectors contain, so the best features must be selected. In this paper, we present a new approach that uses a quantum-inspired genetic algorithm to select the minimum set of features representing the semantics of a set of text documents. Among the Vs that characterize big data, we focus on the ‘Variety’ criterion: we use three semantically similar document sets from different sources and retrieve the best features describing the semantics of the corpus. In the matching phase, our approach shows a significant improvement over the classic ‘Bag-of-words’ approach.
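To make the idea of quantum-inspired feature selection concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract gives no algorithmic details, so the angle encoding of qubits, the simplified rotation rule, the coverage-based `fitness` function, and every name and parameter below (`observe`, `rotate`, `qiga_select`, `penalty`, `delta`) are assumptions made for illustration only.

```python
import math
import random


def observe(thetas, rng):
    """Collapse each qubit angle to a bit: P(bit = 1) = sin(theta)^2."""
    return [1 if rng.random() < math.sin(t) ** 2 else 0 for t in thetas]


def fitness(mask, docs, vocab, penalty=0.1):
    """Hypothetical objective: documents covered minus a cost per selected term."""
    selected = {term for term, bit in zip(vocab, mask) if bit}
    covered = sum(1 for doc in docs if selected & doc)
    return covered - penalty * len(selected)


def rotate(thetas, mask, best, delta=0.05 * math.pi):
    """Simplified rotation gate: move each angle towards the best-known bit."""
    lo, hi = 0.05, math.pi / 2 - 0.05  # keep some chance of either outcome
    out = []
    for t, bit, target in zip(thetas, mask, best):
        if bit != target:
            t += delta if target == 1 else -delta
        out.append(min(hi, max(lo, t)))
    return out


def qiga_select(docs, pop_size=10, generations=60, seed=0):
    """Return (selected terms, score) for a list of documents given as term sets."""
    rng = random.Random(seed)
    vocab = sorted(set().union(*docs))
    # Every qubit starts in an equal superposition (theta = pi/4, P(1) = 0.5).
    population = [[math.pi / 4] * len(vocab) for _ in range(pop_size)]
    best_mask, best_score = None, float("-inf")
    for _ in range(generations):
        masks = [observe(ind, rng) for ind in population]
        for mask in masks:
            score = fitness(mask, docs, vocab)
            if score > best_score:
                best_mask, best_score = mask, score
        # Pull every individual towards the best feature subset seen so far.
        population = [rotate(ind, m, best_mask) for ind, m in zip(population, masks)]
    return [t for t, b in zip(vocab, best_mask) if b], best_score


if __name__ == "__main__":
    # Toy corpus (invented for this sketch): each document is a set of terms.
    docs = [{"vote", "election"}, {"election", "minister"},
            {"budget", "tax"}, {"tax", "vote"}]
    print(qiga_select(docs))
```

In this sketch the size penalty is what pushes the search towards a small feature set; a real system would presumably replace the toy coverage objective with a measure of how well the selected terms preserve document semantics.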



Paper Citation


in Harvard Style

Yahi N., Belhadef H., Roche M. and Draa A. (2017). Towards a Bio-inspired Approach to Match Heterogeneous Documents. In Proceedings of the 13th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-246-2, pages 276-283. DOI: 10.5220/0006294002760283


in Bibtex Style

@conference{webist17,
author={Nourelhouda Yahi and Hacene Belhadef and Mathieu Roche and Amer Draa},
title={Towards a Bio-inspired Approach to Match Heterogeneous Documents},
booktitle={Proceedings of the 13th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2017},
pages={276-283},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006294002760283},
isbn={978-989-758-246-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - Towards a Bio-inspired Approach to Match Heterogeneous Documents
SN - 978-989-758-246-2
AU - Yahi N.
AU - Belhadef H.
AU - Roche M.
AU - Draa A.
PY - 2017
SP - 276
EP - 283
DO - 10.5220/0006294002760283