fgssjoin: A GPU-based Algorithm for Set Similarity Joins

Rafael D. Quirino, Sidney R. Junior, Leonardo A. Ribeiro, Wellington S. Martins

Abstract

Set similarity join is a core operation for text data integration, cleaning, and mining. Most state-of-the-art solutions rely on inherently sequential, CPU-based algorithms. In this paper, we propose a parallel algorithm for the set similarity join problem that harnesses the power of GPU systems through filtering techniques and divide-and-conquer strategies and scales well with data size. Experiments show substantial speedups over the fastest algorithms in the literature.
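
For readers unfamiliar with the problem, the following is a minimal sequential sketch of what a set similarity join computes. It is not the fgssjoin algorithm and does not use a GPU. It assumes Jaccard similarity as the similarity function and a simple length filter as one example of a filtering technique; the abstract names neither, and the function names `jaccard` and `similarity_join` are hypothetical.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def similarity_join(records, threshold):
    """Return (i, j, sim) for every pair i < j whose Jaccard similarity >= threshold."""
    sets = [frozenset(r) for r in records]
    results = []
    for i, j in combinations(range(len(sets)), 2):
        small, large = sorted((len(sets[i]), len(sets[j])))
        # Length filter (assumed here for illustration): Jaccard can never
        # exceed |small| / |large|, so such pairs are skipped without
        # computing the intersection.
        if large > 0 and small / large < threshold:
            continue
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            results.append((i, j, sim))
    return results


# Toy input: each record is a list of tokens.
records = [
    "set similarity join on gpu".split(),
    "set similarity join on cpu".split(),
    "data cleaning and mining".split(),
]
pairs = similarity_join(records, threshold=0.6)
```

On this toy input, only the first two records share enough tokens (4 of 6 distinct tokens) to pass a threshold of 0.6. The nested pair loop is quadratic in the number of records, which is the cost that filtering and parallelization aim to reduce.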


Paper Citation


in Harvard Style

Quirino R., Junior S., Ribeiro L. and Martins W. (2017). fgssjoin: A GPU-based Algorithm for Set Similarity Joins. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 152-161. DOI: 10.5220/0006339001520161


in Bibtex Style

@conference{iceis17,
author={Rafael D. Quirino and Sidney R. Junior and Leonardo A. Ribeiro and Wellington S. Martins},
title={fgssjoin: A GPU-based Algorithm for Set Similarity Joins},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={152-161},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006339001520161},
isbn={978-989-758-247-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - fgssjoin: A GPU-based Algorithm for Set Similarity Joins
SN - 978-989-758-247-9
AU - Quirino R.
AU - Junior S.
AU - Ribeiro L.
AU - Martins W.
PY - 2017
SP - 152
EP - 161
DO - 10.5220/0006339001520161