Study of the Parallel Techniques for Dimensionality Reduction and Its Impact on Performance of the Text Processing Algorithms

Marcin Pietron, Maciej Wielgosz, Pawel Russek, Kazimierz Wiatr

Abstract

The presented algorithms employ the Vector Space Model (VSM) and its enhancements, such as TF-IDF (Term Frequency-Inverse Document Frequency). The vector space model suffers from the curse of dimensionality, so various dimensionality reduction algorithms are used. This paper deals with two of the most common ones: Latent Semantic Indexing (LSI) and Random Projection (RP). The size of the document corpus has a substantial impact on processing time, so the authors introduce GPU-based acceleration of these techniques. A dedicated test set-up was created and a series of experiments was conducted, which revealed important properties of the algorithms and their accuracy. The experiments show that random projection outperforms LSI in computing speed at the expense of result quality.
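To make the pipeline named in the abstract concrete, the following is a minimal CPU-only sketch of TF-IDF weighting followed by a Gaussian random projection. It is not the authors' GPU implementation: the function names (`tfidf_matrix`, `random_projection`, `cosine`), the whitespace tokenizer, the toy documents, and the target dimension `k=4` are all assumptions chosen for illustration, and LSI (which requires a truncated SVD) is left out.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of term j in doc i.

    Assumes every document is non-empty and uses naive whitespace tokenization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized for term in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append(
            [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, rows


def random_projection(rows, k, seed=0):
    """Project d-dimensional rows onto k dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    d = len(rows[0])
    # Entries are N(0, 1) scaled by 1/sqrt(k) so distances are roughly preserved
    # in expectation (hypothetical parameter choice for this sketch).
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)] for _ in range(d)]
    return [
        [sum(row[i] * proj[i][j] for i in range(d)) for j in range(k)]
        for row in rows
    ]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus (illustrative only, not the corpus used in the paper).
docs = [
    "gpu acceleration of text processing",
    "text processing on a gpu",
    "random projection reduces dimensionality",
]
vocab, tfidf = tfidf_matrix(docs)
reduced = random_projection(tfidf, k=4)
```

Comparing `cosine` on rows of `tfidf` with `cosine` on the matching rows of `reduced` gives a rough feel for how much similarity structure the projection keeps. With so small a `k` the distortion can be large, which is in line with the speed-versus-quality trade-off the abstract reports.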



Paper Citation


in Harvard Style

Pietron M., Wielgosz M., Russek P. and Wiatr K. (2016). Study of the Parallel Techniques for Dimensionality Reduction and Its Impact on Performance of the Text Processing Algorithms. In Proceedings of the 8th International Conference on Agents and Artificial Intelligence - Volume 1: PUaNLP, (ORG/PUANLP 2016) ISBN 978-989-758-172-4, pages 315-322. DOI: 10.5220/0005756903150322


in Bibtex Style

@conference{puanlp16,
author={Marcin Pietron and Maciej Wielgosz and Pawel Russek and Kazimierz Wiatr},
title={Study of the Parallel Techniques for Dimensionality Reduction and Its Impact on Performance of the Text Processing Algorithms},
booktitle={Proceedings of the 8th International Conference on Agents and Artificial Intelligence - Volume 1: PUaNLP, (ORG/PUANLP 2016)},
year={2016},
pages={315-322},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005756903150322},
isbn={978-989-758-172-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Conference on Agents and Artificial Intelligence - Volume 1: PUaNLP, (ORG/PUANLP 2016)
TI - Study of the Parallel Techniques for Dimensionality Reduction and Its Impact on Performance of the Text Processing Algorithms
SN - 978-989-758-172-4
AU - Pietron M.
AU - Wielgosz M.
AU - Russek P.
AU - Wiatr K.
PY - 2016
SP - 315
EP - 322
DO - 10.5220/0005756903150322