A Parallel Bit-map based Framework for Classification Algorithms

Amila De Silva and Shehan Perera

Department of Computer Science & Engineering, University of Moratuwa, Katubedda, Sri Lanka

Keywords:

Data Mining, Classification, Bitmaps, Bit-Slices, GPU.

Abstract:

Bitmaps are gaining popularity in Data Mining applications that use GPUs, since the memory organisation and design of a GPU demand regular and simple structures. However, the absence of a common framework has limited the benefits of Bitmaps and GPUs mostly to Frequent Itemset Mining (FIM) algorithms. In this paper we present a framework based on Bitmap techniques that speeds up Classification Algorithms on GPUs. The proposed framework, which uses both the CPU and the GPU for algorithm execution, delegates compute-intensive operations to the GPU. We implement two Classification Algorithms, Naïve Bayes and Decision Trees, using the framework, both of which outperform their CPU counterparts, with average speedups of 30 and 3 respectively.

1 INTRODUCTION

Long before the advent of Data Mining, Bitmaps were used in analytical queries. Bitmap-based techniques have been used to evaluate long predicate statements, and different varieties of Bitmap indices were proposed as early as 1997 (O'Neil and Quass, 1997). In certain studies, such as (Sinha and Winslett, 2007), the authors have proposed methods of storing data in Bitmaps that enable efficient querying of scientific data, which mainly consists of floating-point numbers. Bitmaps have long been used in different applications, but mainly as a complementary index rather than as a complete data structure. One instance where Bitmaps have been considered from the perspective of a data structure is (Fang et al., 2009), where Bitmaps were applied to a Frequent Itemset Mining (FIM) (Chee et al., 2018) algorithm.

The organization of Bitmaps and the processing done with them make them a natural candidate for FIM algorithms. In FIM algorithms such as Apriori (Agrawal and Srikant, 1994), individual itemsets are mapped to distinct Bitmaps so that new itemsets can be generated easily by intersecting Bitmaps. Showing that Bitmaps are not limited to FIM algorithms, the authors of (Favre and Bentayeb, 2005) have implemented a Decision Tree algorithm which uses Bitmap indices residing in a database to obtain the counts needed to build the tree.

Similar to Bitmaps, another technique increasingly used in Data Mining is processing with GPUs. GPUs are being considered for Data Mining algorithms because of their ability to execute algorithms in parallel and because of the repetitive nature of Data Mining workloads. In (Fang et al., 2009; Silvestri and Orlando, 2012; Chon et al., 2018), Bitmaps have been used on GPUs to accelerate FIM algorithms, but the type of processing they perform can be extended to other algorithms.

In this study we explore the use of Bitmap processing for Classification Algorithms. We propose a Bitmap-based CPU-GPU hybrid framework for Classification Algorithms, which uses two Bitmap representations, Bitmaps and BitSlices. We also propose a batching technique which limits the number of kernel invocations and improves performance significantly. To test our hypothesis, we implement two algorithms, Naïve Bayes and Decision Tree, using both representations and show that significant speedups can be obtained with the proposed techniques. In the experiments we perform with real-world datasets, we obtain average speedups of 30 and 3 for Naïve Bayes and Decision Trees respectively.

2 RELATED WORK

In this section, we briefly review related work on Data Mining frameworks which use GPUs, distributed algorithms proposed for Classification, and the use of Bitmaps in Data Mining algorithms.

De Silva, A. and Perera, S.

A Parallel Bit-map based Framework for Classiﬁcation Algorithms.

DOI: 10.5220/0007931202590266

In Proceedings of the 8th International Conference on Data Science, Technology and Applications (DATA 2019), pages 259-266

ISBN: 978-989-758-377-3


2.1 GPU Frameworks for Data Mining Applications

In the work of Böhm et al. (Böhm et al., 2009), a framework which uses an optimised index to speed up similarity join operations has been proposed. By expressing two clustering algorithms, DBSCAN and K-Means, in terms of similarity joins, they obtained significant speedups over CPU counterparts. Their technique has so far been applied only to Clustering Algorithms, even though they claim that it can be applied to other algorithms.

GPUMiner (Fang et al., 2008), the framework proposed by Fang et al., is much closer to our work, in the sense that it uses Bitmaps for all algorithms supported by the framework. It uses both horizontal and columnar data layouts and facilitates multiple algorithms falling into different categories. In GPUMiner, Bitmaps are used for different types of operations. In Apriori, Bitmaps represent unique itemsets and are used while computing support and generating candidates. From the perspective of the Apriori algorithm this is a core operation, since it is the most compute-intensive step. In clustering algorithms, however, Bitmaps are used for tracking the identity of different data points. This can only be considered a supporting task, because it is not directly involved in calculating distance, which is the compute-intensive operation in Clustering Algorithms.

Another study that proposes a framework for Data Mining algorithms is (Gainaru and Slusanschi, 2011), where the framework provides core functionality such as handling data transfers and scheduling, while giving the flexibility to extend it or make algorithm-specific changes. With their approach of unifying data transfers, they have been able to use optimization techniques applied to one algorithm to improve another. However, they have not been able to surpass the speedups that can be obtained with Bitmaps.

Jian et al. (Jian et al., 2013) propose three main techniques which improve processing on GPUs. These techniques address three recurring operations in Data Mining applications. Their solution to one such problem, processing high-dimensional data, is to follow column-wise processing. This approach enables the GPU to apply sequential addressing reduction (Harris, 2016), which is the same technique we use in our study.

2.2 Parallel Data Mining Algorithms

In (Fang et al., 2009) the authors present two efficient GPU-based Apriori algorithms that use Bitmaps, one running solely on Bitmaps and another using a Trie for candidate generation. The support counting part of both algorithms is delegated to the GPU, and counting the support of a single itemset is handled by a single thread block. For support counting, both variants rely on the Bitmap representation. Two other studies where GPUs are used in FIM algorithms are (Silvestri and Orlando, 2012) and (Chon et al., 2018). In (Silvestri and Orlando, 2012), the algorithm defers using the GPU until all the frequent itemsets fit into device memory, which prevents frequent data transfers between the host and the device. With this technique, they have observed significant speedups. In (Chon et al., 2018) the authors explore the possibility of using multiple GPUs while exploiting the ability to compute a partial sum in each thread block. In both studies, Bitmaps are used for storing Frequent Itemsets.

Recently, Andrade et al. (Andrade et al., 2013) implemented the Naïve Bayes algorithm on GPUs. Their implementation uses a compact data structure indexed by terms, which helps minimize memory consumption. The compact structure also lets them perform model building in parallel, allowing them to achieve a 35 times speedup over sequential CPU execution.

One of the earliest methods for building a Decision Tree in parallel was proposed in (Shafer et al., 1996), where records are distributed among multiple processors. The algorithm, SPRINT, is an improvement over SLIQ (Mehta et al., 1996) and has adopted many characteristics from SLIQ. SPRINT proposes a parallel tree building technique which distributes computation by delegating each node to a different processor.

However, these algorithms are designed for multiprocessor systems, where each processor has access to dedicated memory and hard disks. At a conceptual level, the data organisation and processing followed in our framework for Decision Trees is similar to the methods used in SPRINT (Shafer et al., 1996).

Techniques proposed in (Shafer et al., 1996) have been adopted in CudaTree (Liao et al., 2013), which is a GPU-based implementation. In addition to the characteristics borrowed from SPRINT, CudaTree blends task parallelism and data parallelism by switching between two different modes of tree building.

There are a couple of algorithms proposed for building Random Forests on GPUs. The algorithm proposed in (Amado et al., 2001) exploits task parallelism by tasking each core with building a single tree. The authors claim that the algorithm works best when many trees are built in parallel.


Figure 1: Dataset represented using Bitmaps.

3 DESIGN AND IMPLEMENTATION

This section describes the design and implementation of our framework. We discuss in depth the two Bitmap variants supported by the framework, Bitmaps and BitSlices, highlighting the areas in which each can be optimally used. We also discuss the two algorithms implemented using the framework, Naïve Bayes and Decision Tree, detailing the types of processing needed for each algorithm and showing how our framework provides them. We then discuss how batching is implemented and how it reduces the running time of algorithms.

3.1 BitSlice and Bitmap Representations

The BitSlice and Bitmap representations discussed here are widely known index schemes available in the literature. However, in the scope of our work, rather than using them as index schemes, we use them to store the actual underlying data. Before converting to either format, data is first arranged into a column-major format.

The Bitmap representation is similar to the Value-List indices proposed in (O'Neil and Quass, 1997). Let Dataset $D$ be represented as a collection of Attributes $\{A_1, A_2, \ldots, A_n\}$, where each Attribute has $|R|$ elements, and let the cardinalities of the attributes (the number of distinct values in each attribute) be $\{C_1, C_2, \ldots, C_n\}$. We can then define the Bitmap and BitSlice representations as below.

Figure 2: Dataset represented using BitSlices.

The Bitmap representation is a set $\{B_1, B_2, \ldots, B_n\}$, where $B_i$ is the set of Bitmaps corresponding to Attribute $A_i$. $B_i$ can be expressed as a set of Bitmaps $\{b_{i,1}, b_{i,2}, b_{i,3}, \ldots, b_{i,m}\}$, where each $b_{i,j}$ is a vector of bits and $m = C_i$. The size of each Bitmap is equal to the number of records in the Dataset, i.e. $|b_{i,j}| = |R|$. Assuming that the distinct values in $A_i$ can be expressed by the set $\{a_{i,1}, a_{i,2}, a_{i,3}, \ldots, a_{i,m}\}$, the $k$-th bit in $b_{i,j}$ is set to one only if the $k$-th value in $A_i$ is equal to $a_{i,j}$. This way the bit for the $k$-th value is set to 1 in exactly one Bitmap. Loosely speaking, $b_{i,j}$ gives the locations at which $a_{i,j}$ appears in the dataset. Fig. 1 gives a graphical illustration of the Bitmap representation. As depicted in the figure, one attribute has 3 distinct values, hence its 3 Bitmaps. Similarly, another attribute has $m$ distinct values, which are shown by its $m$ Bitmaps.
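The Bitmap definition above can be illustrated with a minimal CPU-side sketch. It is not the framework's code: the function name `build_bitmaps` and the sample column are hypothetical, and bits are kept as plain Python lists for readability rather than packed words.

```python
def build_bitmaps(column):
    """Return {value: bitmap}; each bitmap is a list of 0/1 with one entry per row."""
    bitmaps = {}
    for k, value in enumerate(column):
        if value not in bitmaps:
            # First time this distinct value is seen: allocate an all-zero bitmap.
            bitmaps[value] = [0] * len(column)
        # Row k holds `value`, so bit k is set in that value's bitmap only.
        bitmaps[value][k] = 1
    return bitmaps


column = ["red", "green", "red", "blue", "green", "red"]
bitmaps = build_bitmaps(column)

# One bitmap per distinct value, each as long as the column.
for value, bitmap in bitmaps.items():
    print(value, bitmap)
```

Each row contributes a 1 to exactly one bitmap, which is the property the definition relies on when bitmaps of two attributes are later intersected.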

The BitSlice representation of Dataset $D$ can be defined by making a slight modification to the previous definition. Assuming each value of attribute $A_i$ can be represented as a binary number with $N+1$ bits, the bit-sliced representation of $A_i$ is an ordered list of bitmaps $b_{i,N}, b_{i,N-1}, \ldots, b_{i,1}, b_{i,0}$, where these Bitmaps are called the BitSlices. If $A_i[k]$ denotes the $k$-th element in Attribute $A_i$, and $b_{i,j}[k]$ denotes the bit for row $k$ in bit-slice $b_{i,j}$, then the values of $b_{i,j}[k]$ are chosen so that

$$A_i[k] = \sum_{j=0}^{N} b_{i,j}[k] \times 2^{j} \qquad (1)$$

Note that we determine $N$ in advance so that the highest-order bit-slice $b_{i,N}$ is non-empty. Usually $N$ is selected so that $N = \lfloor \log_2(\max(A_i)) \rfloor$. The BitSlice representation of the Dataset $D$ is the set $\{B_1, \ldots, B_n\}$, where $B_i$ is the BitSlice representation of Attribute $A_i$. Fig. 2 illustrates the Dataset represented in BitSlices. Note that all the values in one column are represented by three BitSlices, which means that the maximum value that column can hold is 7. Similarly, a column represented by $m$ BitSlices can hold a maximum value of $2^{m} - 1$.
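As a small illustration of Equation (1), the following sketch slices an integer column into bit-slices and reconstructs the values from them. The function name `build_bitslices` and the sample data are hypothetical, and bits are again kept as plain lists rather than packed words.

```python
def build_bitslices(column):
    """Return slices where slices[j][k] is bit j of column[k], for j = 0..N."""
    # N + 1 bits are enough for the largest value; at least one slice is kept.
    n_bits = max(1, max(column).bit_length())
    return [[(value >> j) & 1 for value in column] for j in range(n_bits)]


column = [5, 0, 7, 2, 3]
slices = build_bitslices(column)

# Equation (1): each value is the sum of its slice bits weighted by 2^j.
reconstructed = [
    sum(slices[j][k] << j for j in range(len(slices)))
    for k in range(len(column))
]
print(len(slices), reconstructed)
```

With a maximum value of 7, three slices are produced, matching the three-BitSlice case described in the text.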

Figure 3: Bitmap intersection & counting on GPU.

Even though we define a single Bitmap as a vector of bits, when implementing it programmatically the bits are grouped into chunks of 64 and are usually stored as an array of unsigned long (ulong) values. Bitmap intersection then reduces to performing a bitwise AND between two ulong arrays.
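The packing described above can be sketched on the CPU as follows. This is an illustrative assumption about layout (bit k of the vector goes to bit k mod 64 of word k div 64); the helper names `pack` and `intersect_count` are hypothetical and Python integers stand in for ulong values.

```python
WORD_BITS = 64


def pack(bits):
    """Pack a list of 0/1 bits into 64-bit words."""
    words = [0] * ((len(bits) + WORD_BITS - 1) // WORD_BITS)
    for k, bit in enumerate(bits):
        if bit:
            words[k // WORD_BITS] |= 1 << (k % WORD_BITS)
    return words


def intersect_count(words_a, words_b):
    """Bitwise AND the two word arrays and count the set bits of the result."""
    return sum(bin(a & b).count("1") for a, b in zip(words_a, words_b))


bits_a = [1 if k % 2 == 0 else 0 for k in range(100)]  # rows 0, 2, 4, ...
bits_b = [1 if k % 3 == 0 else 0 for k in range(100)]  # rows 0, 3, 6, ...
count = intersect_count(pack(bits_a), pack(bits_b))
print(len(pack(bits_a)), count)
```

The 100-row bitmaps occupy two words each, and the intersection count equals the number of rows divisible by both 2 and 3.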

3.2 Bitmap and BitSlice Processing

When data is represented with BitSlices or Bitmaps, applying filters and searching for data elements requires Bitmap manipulation. Since the data is encoded, obtaining a count from a Bitmap structure is not straightforward. The framework provides a range of core algorithms which manipulate the underlying Bitmap structure and return the result of a query. Co-OccurrenceCount is one such core algorithm, which counts the co-occurrences of two values across two columns. Since co-occurrence counting is the most basic processing in the framework, we first show how this operation is performed with Bitmaps and BitSlices. Co-occurrence counting is frequently used when populating contingency tables. While implementing both Naïve Bayes and Decision Tree we used a contingency table to perform computations.

Co-OccurrenceCount is implemented as a kernel, and the entire Dataset is transferred to device memory before the kernel is first invoked. Since the GPU transfers back only the result of a computation, there are no major data transfers from GPU to CPU.

The basic unit in either of these representations is a Bitmap, which is kept as a ulong vector. An intersection produces another Bitmap, and the number of 1 bits in it can be obtained using the population count (popcount) intrinsic available in CUDA. Since Bitmaps and BitSlices are representationally similar, we explain Bitmap processing in detail and then briefly discuss BitSlices.

3.2.1 Processing with Bitmaps

With the Bitmap representation each attribute is a

collection of Bitmaps, so counting co-occurrence be-

tween two attributes would involve intersecting and

counting Bitmaps.

In sequential addressing reduction (Harris, 2016), every GPU core works on the same intersection and count operation, regardless of the GPU multiprocessor it belongs to. Each thread is in charge of an interleaved portion of the Bitmap, in such a way that threads with consecutive indexes work on consecutive parts of the Bitmap. Fig. 3 provides a visualisation of the Bitmap intersection. Bitmap 1 and Bitmap 2 are two ulong arrays residing in the GPU's main memory. In Cycle 1, the blocks process the set of elements located at the left of the Bitmaps. Block 1 processes the elements demarcated by broken lines while Block 2 processes the elements demarcated by dotted lines. Each thread in a block picks an index and reads the two elements located at that position in the two Bitmaps. Intersection and counting happen in each thread, and the counts are aggregated at block level when writing to shared memory. At the end of Cycle 1, Block 1 has written the aggregation of the 4 leftmost elements of the Bitmap and Block 2 has similarly written the aggregation of the next 4.

In Cycle 2, both blocks pick a different portion of the same Bitmaps. Block 2 picks the last 4 elements on the right, marked with dotted lines, and Block 1 picks the 4 elements before those, marked with broken lines. The value n1 provided by Block 1 at the end of Cycle 2 is the aggregation of all elements processed by Block 1. Similarly, n2 is the aggregation of all elements processed by Block 2. If there are n blocks, an array of n values is written to global memory, each holding the aggregation of all elements processed by one block. Summing up this array gives the result for the entire Bitmap. This is usually done by running a summing kernel with the array of partial sums as its input.
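The two-cycle walk-through above can be mimicked with a sequential simulation. This is only a sketch under simplifying assumptions: threads are simulated one after another, the shared-memory reduction is replaced by a plain addition, and the function name `co_occurrence_partial_sums` is hypothetical.

```python
def co_occurrence_partial_sums(bitmap1, bitmap2, num_blocks, block_size):
    """Simulate the kernel's access pattern and return one partial count per block."""
    grid_size = num_blocks * block_size  # total number of simulated threads
    partial = [0] * num_blocks
    for block in range(num_blocks):
        for tid in range(block_size):
            i = block * block_size + tid  # first word handled by this thread
            my_sum = 0
            while i < len(bitmap1):
                my_sum += bin(bitmap1[i] & bitmap2[i]).count("1")
                i += grid_size  # same thread, next cycle
            partial[block] += my_sum  # stands in for the block-level reduction
    return partial


bitmap1 = [0xFF] * 16  # 16 words with 8 bits set in each
bitmap2 = [0x0F] * 16  # 16 words with 4 bits set in each
partial = co_occurrence_partial_sums(bitmap1, bitmap2, num_blocks=2, block_size=4)
total = sum(partial)  # the final summing pass over the partial sums
print(partial, total)
```

With 2 blocks of 4 threads and 16 words, every thread runs two cycles, which mirrors the layout of Fig. 3: each block ends up with the count for 8 words, and the final sum covers the whole Bitmap.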

The algorithm BitmapCo-OccuranceCountGPU shows the code for this Bitmap intersection kernel. Here col1 and col2 are the two columns (attributes), represented as Bitmaps, that take part in the intersection. Each column can be thought of as an array of Bitmaps. Since we represent only categorical data with Bitmaps, each category value has its own Bitmap. index1 and index2 are the indices of the first and second category values respectively. We also pass length, which gives the number of ulong words in a Bitmap.

BitmapCo-OccuranceCountGPU (col1, col2, length, index1, index2, output)
    tid <- threadIdx.x
    i <- blockIdx.x * blockSize + threadIdx.x      // first word handled by this thread
    gridSize <- blockSize * gridDim.x              // total number of threads in the grid
    sdata <- initialize shared memory
    mySum <- 0
    bitwise <- ~0                                  // all bits set
    while i < length:
        bitwise <- col1[index1][i] & col2[index2][i]
        mySum <- mySum + __popcll(bitwise)         // count the 1 bits of the intersection
        i <- i + gridSize
    endwhile
    sdata[tid] <- mySum
    ....                                           // block-wise aggregation (omitted)
    if tid == 0:
        output[blockIdx.x] <- mySum                // one partial sum per block
    endif

We first initialise internal state variables from the block and thread configuration. The variables threadIdx and blockIdx are set by the CUDA runtime based on the launch parameters we specify when invoking the kernel. Since the kernel is executed by every thread, each thread needs to select a non-overlapping portion of the Bitmap; this is why the variable i is determined using blockIdx and threadIdx. Once the variables are initialised, the Bitmap intersection is performed inside the while loop. A single thread may work on multiple portions in different iterations. To facilitate this, we keep increasing i by gridSize (i.e. the total number of threads in the grid). The part which performs block-wise aggregation is omitted from the listing. The kernel writes to global memory only once, when it finishes; at all other times it only performs reads. Consecutive threads read from adjacent memory locations, so the accesses are coalesced. When writing to shared memory, a sequential addressing method is followed to avoid bank conflicts.

3.2.2 Processing with BitSlices

Processing with BitSlices is very similar to processing Bitmaps, the main difference being that all the BitSlices belonging to the two attributes have to be loaded, as opposed to reading just the two particular Bitmaps.

BitSliceCo-OccuranceCountGPU (col1, col2, length, val1, val2, col1_bitmaps, col2_bitmaps, output)
    Initialising ...
    while i < length:
        bitwise <- ~0                                  // start with all bits set
        for k <- 0 to col1_bitmaps - 1:
            if val1 & (1 << k):
                bitwise <- bitwise & col1[k][i]        // bit k of val1 is 1: slice k must be set
            else:
                bitwise <- bitwise & ~col1[k][i]       // bit k of val1 is 0: slice k must be clear
            endif
        endfor
        for k <- 0 to col2_bitmaps - 1:
            if val2 & (1 << k):
                bitwise <- bitwise & col2[k][i]
            else:
                bitwise <- bitwise & ~col2[k][i]
            endif
        endfor
        i <- i + gridSize
        mySum <- mySum + __popcll(bitwise)
    endwhile
    sdata[tid] <- mySum
    ....                                               // block-wise aggregation (omitted)
    if tid == 0:
        output[blockIdx.x] <- mySum
    endif

Intersection with BitSlices is done in a fashion similar to Algorithm 4.2 of (O'Neil and Quass, 1997). The algorithm BitSliceCo-OccuranceCountGPU shows the BitSlice intersection kernel running on the GPU. Compared with BitmapCo-OccuranceCountGPU, we additionally pass the number of BitSlices in each column (col1_bitmaps and col2_bitmaps), and instead of Bitmap indices we pass the category values themselves (val1 and val2); output is again a pointer to an array residing in global memory. BitSlice intersection looks somewhat more complex than the Bitmap kernel, since a loop runs over the BitSlices to select the rows holding the given value. In the Bitmap kernel we did not have to pass the category values explicitly, but for BitSlices we need to, since the intersection is driven by the bits of those values.

As with Bitmaps, once the intersecting and counting is done, a block-level reduction takes place, followed by a global reduction. The only noticeable difference between the two representations is that in the Bitmap representation a Bitmap is readily available for each attribute value, whereas with BitSlices it has to be generated in the kernel as the computation happens.
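How a Bitmap for a given value is generated from BitSlices on the fly can be sketched on the CPU as below. The helper names (`equals_mask`, `co_occurrence`, `to_slices`) are hypothetical. The `valid` mask is an assumption added for this sketch and is not part of the listing in the text: it clears the padding bits of the last word, which would otherwise match whenever a value has zero bits.

```python
WORD_MASK = (1 << 64) - 1  # keeps Python integers within one 64-bit word


def equals_mask(slices, value, i):
    """Word i of the bitmap marking rows equal to `value`; slices[k][i] is word i of slice k."""
    bitwise = WORD_MASK
    for k, slice_words in enumerate(slices):
        if value & (1 << k):
            bitwise &= slice_words[i]               # bit k of value is 1
        else:
            bitwise &= ~slice_words[i] & WORD_MASK  # bit k of value is 0
    return bitwise


def co_occurrence(slices1, val1, slices2, val2, valid):
    """Count rows where column 1 equals val1 and column 2 equals val2."""
    total = 0
    for i in range(len(valid)):
        word = equals_mask(slices1, val1, i) & equals_mask(slices2, val2, i)
        total += bin(word & valid[i]).count("1")    # ignore padding bits
    return total


def to_slices(column, n_bits):
    """Single-word helper for columns of up to 64 rows: row k maps to bit k."""
    return [[sum(((v >> j) & 1) << k for k, v in enumerate(column))]
            for j in range(n_bits)]


col1 = [1, 3, 1, 0, 1, 2]
col2 = [0, 1, 0, 0, 1, 0]
valid = [(1 << len(col1)) - 1]  # only the first six bits correspond to rows
count = co_occurrence(to_slices(col1, 2), 1, to_slices(col2, 1), 0, valid)
print(count)
```

Rows 0 and 2 hold the pair (1, 0), so the count is 2. Dropping the `valid` mask would overcount patterns such as (0, 0), because complemented padding bits look like matches.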

3.2.3 Batching Operations

The arithmetic intensity of a Bitmap intersection is usually low. To produce the intersection of two Bitmaps, two global reads have to be made. This makes the kernel bandwidth-sensitive, since a large proportion of its time is spent transferring Bitmaps from global memory. With batching, we read four Bitmaps to produce four resulting Bitmaps. In the traditional reduction phase, which follows the counting operation, the count is kept in an integer. In batched mode we use an int4 instead, which allows us to keep four pattern counts at once. However, before the results are transferred to the host, the four counts need to be mapped to the corresponding integers.
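One plausible reading of the batching idea is sketched below: two Bitmaps are read from each column and all four pairings are counted in the same pass. This pairing scheme is an assumption made for illustration, since the text states only that four reads yield four results; the function name `batched_counts` is hypothetical and a four-element list stands in for the int4 accumulator.

```python
def popcount(word):
    return bin(word).count("1")


def batched_counts(a0, a1, b0, b1):
    """Count four patterns in one pass over four bitmaps (lists of words)."""
    counts = [0, 0, 0, 0]  # plays the role of the int4 accumulator
    for i in range(len(a0)):
        x0, x1, y0, y1 = a0[i], a1[i], b0[i], b1[i]  # four reads per position
        counts[0] += popcount(x0 & y0)
        counts[1] += popcount(x0 & y1)
        counts[2] += popcount(x1 & y0)
        counts[3] += popcount(x1 & y1)
    return counts


a0, a1 = [0b1100], [0b0011]  # two values of the first attribute
b0, b1 = [0b1010], [0b0101]  # two values of the second attribute
counts = batched_counts(a0, a1, b0, b1)
print(counts)
```

Under this reading, four counts cost four reads per position, whereas four separate non-batched passes would cost eight, which is consistent with the lower memory traffic the text attributes to batching.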


Figure 4: Execution times for Naïve Bayes with different datasets (USCensus, PokerHand, KDDCUP). Running time (µs) for Standard-CPU, BitSlices-CPU, BitSlices-GPU Batched, Bitmaps-CPU and Bitmaps-GPU Batched.

3.2.4 Implementing Algorithms on the Framework

Both Naïve Bayes and Decision Trees have been implemented using counts returned by the Co-OccurrenceCount kernels. For both algorithms we used the implementation provided by Weka (Witten et al., 2011). When an algorithm starts, it initializes a two-dimensional matrix in device memory, transfers the Dataset to the device and starts invoking kernels. The counts are written to this matrix. The 2D matrix has att_values × class_values cells, where att_values and class_values represent the cardinality of the test attribute and the class attribute respectively. When using non-batched kernels, each kernel invocation counts a single pattern, so as many invocations are required as there are cells. The batched mode can count four patterns at once, so, depending on the Dataset, as few as one fourth of the invocations are needed.
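The relationship between the contingency table and the number of kernel invocations can be illustrated with a small CPU-side sketch. Here the cells are filled by a direct scan rather than by co-occurrence kernels, the sample data and the name `contingency_table` are hypothetical, and the batched figure is only the lower bound implied by counting four patterns per invocation.

```python
import math


def contingency_table(attribute, labels):
    """Return the sorted attribute values, class values and the count matrix."""
    att_values = sorted(set(attribute))
    class_values = sorted(set(labels))
    table = [[0] * len(class_values) for _ in att_values]
    for a, c in zip(attribute, labels):
        table[att_values.index(a)][class_values.index(c)] += 1
    return att_values, class_values, table


attribute = ["sunny", "rain", "sunny", "overcast", "rain", "sunny"]
labels = ["no", "yes", "no", "yes", "no", "yes"]
att_values, class_values, table = contingency_table(attribute, labels)

cells = len(att_values) * len(class_values)
non_batched_invocations = cells              # one pattern per invocation
batched_invocations = math.ceil(cells / 4)   # lower bound: four patterns per invocation
print(table, non_batched_invocations, batched_invocations)
```

In the framework each cell would be produced by a co-occurrence count between one attribute value and one class value; the sketch only shows how the table size determines the invocation counts.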

4 PERFORMANCE EVALUATION

To evaluate the performance of the framework and to assess the correctness of the algorithms, we present experimental results. Since our primary target is to measure the performance gains obtained by the Bitmap and BitSlice variants on GPUs, results are compared with CPU variants which use the same representations.

For the experiments we used three Datasets available in the UCI machine learning repository (Dua and Graff, 2017): USCensus (US Census Data, 1990), a categorical dataset with 68 attributes and two million rows; PokerHand (Poker Hand, 1990), another categorical dataset with 11 attributes and one million rows; and the KDDCup99 dataset (KDD Cup, 1999). Most attributes in KDDCup were real-valued, so we had to split them into ranges and create discrete categories.

Figure 5: Speedup over Standard-CPU on different Datasets. Values as recovered from the bar labels:

                         USCensus   PokerHand   KDDCUP
BitSlices-GPU               39.41       18.54    11.51
BitSlices-GPU Batched       88.43       37.7     22.52
Bitmaps-GPU                 50.44       18.75    14.77
Bitmaps-GPU Batched         88.29       47.9     40.39

In the following subsections we present experi-

ments performed and the criteria used in evaluating

algorithms.

4.1 Experimental Setup

All experiments were performed on a computer with an Intel Core i7-2600 CPU at 3.40 GHz with Hyper-Threading, 16 GB of main memory, and a GeForce GTX 480 graphics card. The GPU consists of 15 SIMD multiprocessors, each of which has 32 cores running at 1.4 GHz. The GPU memory is 1.5 GB with a peak bandwidth of 177 GB/s.

The goal of the following tests is to assess the performance of the two Naïve Bayes implementations for GPU (Bitmap-based and BitSlice-based) with respect to the CPU implementations. We performed the tests by using different datasets and recording the execution time of each variant. The accuracy of the models was verified by comparing the estimator values set during each execution. For the experiments we used 7 different implementations, which are summarized below.

• Standard-CPU - Unmodified implementation provided by Weka.
• BitSlices-CPU - The variant that uses BitSlices and runs on the CPU.
• Bitmaps-CPU - An implementation running on the CPU using the Bitmap representation.
• BitSlices-GPU - The variant using BitSlices, which runs on the GPU.
• BitSlices-GPU Batched - The same variant as above, which performs operations in batches.
• Bitmaps-GPU - An implementation running on the GPU which uses Bitmaps.
• Bitmaps-GPU Batched - The Bitmap variant running on the GPU, performing operations in batches.

Figure 6: Execution times for Decision Tree with different datasets (USCensus, PokerHand, KDDCUP). Running time (µs) for Standard-CPU, BitSlices-CPU, BitSlices-GPU Batched, Bitmaps-CPU and Bitmaps-GPU Batched.

Fig. 4 shows the comparison across the different datasets, in which a clear difference between the CPU and GPU variants can be seen. When measuring time, we measured only the time taken to build the model. Transfer times were excluded because a transfer is done only once for multiple executions and can be considered a one-time operation. The graphs show the average time over six iterations.

Fig. 5 shows the speedups of the GPU variants over Standard-CPU. This highlights the difference between batched and non-batched modes. In all cases, the batched variants report roughly twice the speedup of their non-batched counterparts. Further, an interesting observation can be made with the USCensus dataset. When comparing non-batched executions for BitSlices and Bitmaps, the Bitmap variant gives a better speedup than the BitSlice variant on all three Datasets. But no such difference can be observed between the batched variants for USCensus: both BitSlice-Batched and Bitmap-Batched show similar speedups. Looking into the Dataset, we found that many attributes have fewer than 4 distinct values. In non-batched mode, the Bitmap-based algorithm reads two Bitmaps to produce the result of a single intersection, but to produce the result for the same intersection the BitSlice algorithm reads 4 BitSlices (two per attribute, since values below 4 fit in two bits). In batched mode, both variants read 4 Bitmaps to produce 4 results. Since memory accesses are more uniform in batched mode, the BitSlice and Bitmap algorithms run equally fast.

4.2 Results for Decision Trees

For the Decision Tree algorithm, we ran a subset of the above tests using the same three datasets. The same steps followed for Naïve Bayes were used when running the experiments and verifying model accuracy. The results obtained for Decision Trees are shown in Fig. 6. We did not execute the non-batched modes for Decision Trees, since the batched mode itself was not giving a considerable speedup. While implementing Decision Trees, we had to handle the additional complexity of maintaining partitions, and the approach we took to handling partitions is what kept the performance gains small. The GPU variants still finish faster than the CPU ones, but the speedups are very modest.

5 CONCLUSIONS

In this paper we focused on using Bitmap techniques for classification algorithms. We showed that FIM algorithms use Bitmaps on GPUs to perform computations in parallel. We then showed that, by separating the model building phase from the phase which iterates through the Dataset, we can use Bitmap-based structures to speed up algorithm execution. The framework we proposed represents data with Bitmaps and provides kernels to manipulate the Bitmap-based structures. We also proposed a batching technique which enables multiple counting operations to be performed in a single kernel. By implementing two algorithms, Naïve Bayes and Decision Trees, we showed that the proposed model can be used to implement Classification algorithms. With the Datasets used in our experiments, we achieved maximum speedups of 80 for Naïve Bayes and 19 for Decision Trees against the CPU implementation. With this we show that Bitmaps can be used on GPUs to speed up Classification Algorithms. The results for Decision Trees, even though promising, are not as significant as those for Naïve Bayes. We feel that Decision Trees can be improved further, since the approach we followed incurs frequent memory transfers between CPU and GPU. The Bitmap representation provided the best speedup in all cases, giving an indication that it can be used to speed up the processing of categorical datasets. BitSlices can be considered a generic representation, since they can hold both categorical and numerical data and do not significantly reduce speed compared with the Bitmap representation. We believe the work discussed in this paper provides the groundwork for building a Bitmap-based framework for Data Mining algorithms in general.

REFERENCES

(1990). Poker hand data set. URL: https://archive.ics.uci.edu/ml/datasets/Poker+Hand. Online; Accessed 16 March 2019.

(1990). US census data (1990) data set. URL: https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990). Online; Accessed 16 March 2019.

(1999). KDD cup (1999). KDD cup 99 intrusion detection datasets. URL: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. Online; Accessed 16 March 2019.

Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Proc. of 20th Intl. Conf. on VLDB, pages 487–499.

Amado, N., Gama, J., and Silva, F. M. A. (2001). Parallel implementation of decision tree learning algorithms. In Proceedings of the 10th Portuguese Conference on Artificial Intelligence on Progress in Artificial Intelligence, Knowledge Extraction, Multi-agent Systems, Logic Programming and Constraint Solving, EPIA '01, pages 6–13, London, UK. Springer-Verlag.

Andrade, G., Viegas, F., Ramos, G. S., Almeida, J., Rocha, L., Gonçalves, M., and Ferreira, R. (2013). GPU-NB: A fast CUDA-based implementation of naïve Bayes. In 2013 25th International Symposium on Computer Architecture and High Performance Computing, pages 168–175.

Böhm, C., Noll, R., Plant, C., Wackersreuther, B., and Zherdin, A. (2009). Data Mining Using Graphics Processing Units, pages 63–90. Springer Berlin Heidelberg, Berlin, Heidelberg.

Chee, C.-H., Jaafar, J., Aziz, I. A., Hasan, M. H., and Yeoh, W. (2018). Algorithms for frequent itemset mining: a literature review. Artificial Intelligence Review.

Chon, K.-W., Hwang, S.-H., and Kim, M.-S. (2018). GMiner: A fast GPU-based frequent itemset mining method for large-scale data. Information Sciences, 439-440:19–38.

Dua, D. and Graff, C. (2017). UCI machine learning repository.

Fang, W., Lau, K. K., Lu, M., Xiao, X., Lam, C. K., Yang, P. Y., He, B., Luo, Q., S, P. V., and Yang, K. (2008). Parallel data mining on graphics processors. Technical report.

Fang, W., Lu, M., Xiao, X., He, B., and Luo, Q. (2009). Frequent itemset mining on graphics processors. In Proceedings of the Fifth International Workshop on Data Management on New Hardware, DaMoN '09, pages 34–42, New York, NY, USA. ACM.

Favre, C. and Bentayeb, F. (2005). Bitmap index-based decision trees. In Proceedings of the 15th International Conference on Foundations of Intelligent Systems, ISMIS'05, pages 65–73, Berlin, Heidelberg. Springer-Verlag.

Gainaru, A. and Slusanschi, E. (2011). Framework for mapping data mining applications on GPUs. In 2011 10th International Symposium on Parallel and Distributed Computing, pages 71–78.

Harris, M. (2016). Optimizing parallel reduction in CUDA. URL: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf. Online; Accessed 16-Feb-2019.

Jian, L., Wang, C., Liu, Y., Liang, S., Yi, W., and Shi, Y. (2013). Parallel data mining techniques on graphics processing unit with compute unified device architecture (CUDA). J. Supercomput., 64(3):942–967.

Liao, Y., Rubinsteyn, A., Power, R., and Li, J. (2013). Learning random forests on the GPU.

Mehta, M., Agrawal, R., and Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. In Apers, P., Bouzeghoub, M., and Gardarin, G., editors, Advances in Database Technology (EDBT '96), pages 18–32, Berlin, Heidelberg. Springer Berlin Heidelberg.

O'Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, SIGMOD '97, pages 38–49, New York, NY, USA. ACM.

Shafer, J. C., Agrawal, R., and Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. In Proceedings of the 22th International Conference on Very Large Data Bases, VLDB '96, pages 544–555, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Silvestri, C. and Orlando, S. (2012). gpuDCI: Exploiting GPUs in frequent itemset mining. In 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pages 416–425.

Sinha, R. R. and Winslett, M. (2007). Multi-resolution bitmap indexes for scientific data. ACM Trans. Database Syst., 32(3).

Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition.
