fgssjoin: A GPU-based Algorithm for Set Similarity Joins

Rafael D. Quirino, Sidney R. Junior, Leonardo A. Ribeiro, Wellington S. Martins


Set similarity join is a core operation for text data integration, cleaning, and mining. Most state-of-the-art solutions rely on inherently sequential, CPU-based algorithms. In this paper, we propose a parallel algorithm for the set similarity join problem that harnesses the power of GPU systems through filtering techniques and divide-and-conquer strategies, and that scales well with data size. Experiments show substantial speedups over the fastest algorithms in the literature.
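
To make the problem statement concrete, the following is a minimal sequential sketch of a set similarity join in Python. It is not the GPU algorithm proposed here. It assumes Jaccard similarity and a user-chosen threshold, neither of which is fixed by the text above, and it uses a simple length filter as one illustrative example of a filtering technique. The names `jaccard`, `naive_set_similarity_join`, and `records` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (assumed measure; 1.0 for two empty sets)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def naive_set_similarity_join(sets, threshold):
    """Return index pairs (i, j), i < j, whose Jaccard similarity is >= threshold.

    Quadratic CPU baseline for illustration only, not the paper's method.
    """
    result = []
    for (i, a), (j, b) in combinations(enumerate(sets), 2):
        # Length filter: Jaccard(a, b) <= min(|a|, |b|) / max(|a|, |b|),
        # so pairs whose sizes differ too much can be skipped without
        # computing the intersection.
        small, large = sorted((len(a), len(b)))
        if large and small < threshold * large:
            continue
        if jaccard(a, b) >= threshold:
            result.append((i, j))
    return result


# Hypothetical toy input: each record is a set of tokens.
records = [
    {"gpu", "set", "similarity", "join"},
    {"gpu", "set", "similarity", "joins"},
    {"cpu", "sequential", "algorithm"},
]
pairs = naive_set_similarity_join(records, 0.5)
```

Every candidate pair in this baseline is verified independently of the others, which suggests why the problem is a natural fit for data-parallel hardware once filtering has pruned the candidate space.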


