A Novel Multi-View Partitioning and Ensemble-Based Cancer
Classification Using Gene Expression Data
Kavitha K R, Kashyap G and Anjima K S
Department of Computer Science and Applications, Amrita School of Computing, Amrita Vishwa Vidyapeetham,
Amritapuri, India
Keywords:
Gene Expression Data, Multi-View Partitioning, Ensemble Learning, Machine Learning.
Abstract:
In this research, we propose an ensemble-based multi-view classification framework to analyze high-
dimensional gene expression data, targeting the specific application of colon tumor classification. We incorporate
state-of-the-art techniques that address the heterogeneity, dimensionality, and classification-performance
challenges of medical datasets. The methodology starts with clustering the gene expression data
into distinct feature subsets (views). Using a Feature Selection and Projection (FSP) algorithm called attribute
bagging, these subsets are spread out over several views: V1, V2, V3, V4, and V5, thereby capturing a very
broad range of data representations. Each view is independently classified with a specialized classifier
chosen to exploit the particular properties inherent in that view: Random Forest, XGBoost, SVM,
Multi-Layer Perceptron (MLP), or an LSTM network. The predictions
from these classifiers (Yp1, Yp2, Yp3, Yp4, Yp5) are combined using a weighted ensemble approach
based on majority voting, producing a unified prediction (Ypred). This strategy ensures robustness and mini-
mizes the impact of individual model biases. Finally, the accuracy of the ensemble is evaluated, demonstrating
the effectiveness of the proposed approach in achieving reliable and precise tumor classification. By using
this architecture, we are able to achieve enhanced classification performance with the strengths of ensemble
methods and multi-view learning. This scalable and accurate framework is highly pertinent for biomedical
data analysis and supports diagnostic decision-making processes.
1 INTRODUCTION
Gene expression data, in which gene activity is mea-
sured for various biological conditions, are essential
resources in biomedical research. They help identify biomarkers, reveal how genes interact with one
another, and provide insights into the diagnosis of diseases such as cancer. High dimensionality,
however, makes samples difficult to classify successfully, because the number of features greatly
outnumbers the available samples (Ben Brahim and Limam, 2018). As challenges in analyzing
high-dimensional gene expression data arose, machine learning likewise helped simplify complicated
biological processes and make accurate predictions for drug discovery and ADMET (Bhavitha et al.,
2023). High dimensionality also often causes overfitting and poor generalization in machine learning
models, which calls for robust feature selection and classification strategies.
This paper addresses the challenges mentioned
above by introducing a novel ensemble-based multi-
view classification framework tailored for the Colon
Tumor dataset. As can be seen in the code accom-
panying this paper and as supported by the archi-
tectural design, the proposed methodology incorpo-
rates clustering and Feature Selection and Projection
(FSP) techniques to handle high dimensionality effec-
tively. The FSP algorithm, implemented as attribute
bagging, partitions the gene expression data into
smaller, manageable feature subsets or "views" (Singh
and Kumar, 2024). Each view is then processed in-
dependently by diverse classifiers such as Random
Forest, XGBoost, Support Vector Machines (SVM),
Multi-Layer Perceptron (MLP), and Long Short-Term
Memory (LSTM) networks, each chosen to exploit
unique patterns and characteristics in the data. En-
semble machine learning approaches(Sreejesh Kumar
et al., 2021), such as those using Random Forest and
SVM for virtual screening, emphasize the need to in-
tegrate diverse classifiers in order to improve predic-
tive accuracy for biological datasets.
Figure 1: Architectural Diagram for Multi-view Ensembler

Attribute bagging ensures that feature subsets are
diverse, hence enhancing the generalization ability of
the ensemble. In addition, the outputs of these clas-
sifiers are combined through weighted ensemble and
majority voting techniques, leading to robust and re-
liable predictions(Xu et al., 2024). The multi-view
approach reduces the computational complexity as-
sociated with analyzing high-dimensional data and
improves classification accuracy(Singh and Kumar,
2024) by leveraging the strengths of individual mod-
els and ensuring their complementarity.
This research tackles dimensionality, redundancy, and
overfitting head-on, thus providing an efficient and
accurate framework to classify gene expression data.
The results of this paper establish the promise of en-
semble multi-view learning in biological applications,
thereby paving the way for cancer diagnosis advance-
ment and precision medicine. A potential foundation
is thus established through the proposed methodology
for further investigation using ensemble techniques
on other high-dimensional and complex datasets.
This paper introduces a new ensemble-based
multi-view classification framework for addressing
the challenges of high-dimensional gene expression
data, especially in colon tumor classification. The ar-
chitecture, shown in Figure 1, is designed to effec-
tively address the challenges of dimensionality and
sparsity of the data while improving the classification
performance through an ensemble learning approach.
The process starts with the input dataset, such as the colon tumor gene expression dataset, which
undergoes clustering to group similar features according to their inherent patterns. From each
cluster, a subset of features is chosen to form small, manageable feature subsets. The FSP algorithm,
implemented through attribute bagging, guarantees that these subsets, or "views", are diverse,
representative of the original data, and resistant to redundancy and noise. Techniques such as
dictionary learning (Menon
et al., 2023) and sparse coding for minimal document
representation also emphasize reducing dimensional-
ity and redundancy, a principle which is repeated in
clustering and feature selection strategies for high-
dimensional datasets.
Then each of these views is passed into a sepa-
rate classifier specifically engineered to maximize the
extraction of certain types of patterns specific to that
subset of features. In this architecture, machine and
deep learning models such as Random Forest, XG-
Boost, SVM, MLP, and LSTM are used to make pre-
dictions. These various classifiers generate a predic-
tion for each view independently as Yp1, Yp2, Yp3,
Yp4 and Yp5.
The final predictions of each classifier are com-
bined using a weighted ensemble technique with
majority voting, ensuring robust and reliable clas-
sification results. The ensemble method takes ad-
vantage of the complementary strengths of individ-
ual classifiers, thus negating their individual weak-
nesses(Ben Brahim and Limam, 2018). The over-
all prediction Ypred is then compared against ground
truth labels to calculate accuracy and assess the
model’s performance.
The solution not only enhances classification ac-
curacy but also brings improvement in scalability
and efficiency over handling high-dimensional gene-
expression data. The integration of feature clustering,
attribute bagging, and a diverse set of classifiers is
achieved within this framework, making the method-
ology provided complete and effective for colon tu-
mor classification while serving as a base for much
wider applications in the realm of cancer diagnosis
and precision medicine.
2 LITERATURE REVIEW
Ritika et al. (Singh and Kumar, 2024) propose an ensemble-based approach to multi-view learning
that aims to improve classification performance. Multi-view learning can be understood
as the process in which different subsets of features,
or ”views”, obtained from the same data capture var-
ious aspects of the data. This technique has gained
much attention in recent research because
it allows dividing the feature set and exploiting the
advantage of each view. The Feature Set Partitioning,
a variation of attribute bagging that divides the feature set into several separate subsets, is one of the
most popular techniques for view creation. This ap-
proach ensures that all views focus on different pat-
terns or relationships in the data, thus increasing di-
versity within the classifiers.
Support vector machines are widely used classi-
fiers in multi-view learning because they are designed
to handle high-dimensional data, which is common in
medical and biological datasets(Kumar and Yadav,
2023). SVM classifiers are utilized in multiple stud-
ies for every view and then their outputs are com-
bined through ensemble methods like majority vot-
ing or performance weighting. Ensemble methods seek to combine the outputs of multiple individual
classifiers to maximize predictive accuracy while avoiding overfitting or bias from over-reliance on
a particular model. No-
tably, the weighted ensemble approach gives greater
weights to views that have proven superior perfor-
mance during evaluation. Thus, more robust views
will exert a stronger influence on the final decision.
Recent research has shown the benefits of combining
multi-view learning with weighted ensemble methods
in the sense that these techniques can significantly
boost accuracy in complex classification tasks.
That work builds on these ideas by creating multiple views with the FSP method and making
per-view predictions with SVM classifiers. The aggrega-
tion then uses a weighted ensemble technique, thus
further improving the predictive abilities and robust-
ness of such a model, especially in environments re-
lated to high-dimensional data, such as for medical
diagnosis or genomic studies.
Yuhong et al. (Xu et al., 2024) offer a new method called Classifier Ensemble based on Multiview
Optimization (CEMVO) to tackle the problem of classifying high-dimensional imbalanced data.
The strategy systematically integrates feature optimization and sample rebalancing. Optimized
Subview Generation (OSG) uses weighted random
forests to extract multiple discriminative subviews,
whereas the Selective Ensemble of Optimized Sub-
views (SEOS) refines them into a strong ensemble. To
address the class imbalance, the Synthetic Minority
Over-sampling Technique (SMOTE) is implemented,
resulting in a balanced dataset for training the clas-
sifiers. Experimental results indicate that CEMVO
delivers superior performance, achieving higher clas-
sification accuracy, particularly for minority classes,
when compared to existing methods. Nonetheless, the
approach can involve significant computational costs
and may face overfitting issues during the synthetic
sample generation, suggesting potential areas for fur-
ther research and enhancement.
Adithya et al. (Kumar and Yadav, 2023) provide an overview of feature set partitioning methods
used in multi-view ensemble learning (MEL). MEL increases the classification perfor-
mance by making different classifiers work on sep-
arate subsets of features, known as views, created
using techniques for FSP. FSP divides high dimen-
sional datasets into multiple, non-overlapping subsets
of features. This helps reduce the curse of dimension-
ality and thereby improves generalization in models
of machine learning. A variety of FSP approaches are discussed, including random split, attribute
clustering, and optimization-based approaches such as genetic algorithms and particle swarm optimization.
Each method offers distinct benefits in terms of com-
putation efficiency, feature diversity, and predictive
power. It further focuses on the difficulties that occur
with view construction, namely maintaining a
balance between view complementarity and consen-
sus. Additionally, it highlights the requirement to op-
timize FSP strategies in achieving robust performance
over a variety of datasets. Using comparisons, this
review sheds insight into how recent breakthroughs
with FSP methods can help improve classification ac-
curacy, robustness, and scalability for a wide range
of machine learning tasks that are complex, high-
dimensional.
Afet et al. (Ben Brahim and Limam, 2018) present an ensemble feature selection method de-
signed to address the challenges of high-dimensional
data. The approach aims to improve both classifi-
cation accuracy and the consistency of feature selec-
tion by integrating various feature selectors. Different
aggregation techniques, such as Weighted Mean Ag-
gregation (WMA), Robust Rank Aggregate (RRA),
and a novel Reliability Assessment-Based Aggrega-
tion (RAA), are employed to merge feature subsets
obtained from homogeneous and heterogeneous en-
sembles. The methodology attempts to produce a
stable subset of features that works well over di-
verse datasets. Experimental results show that this method often outperforms conventional feature
selection techniques, especially on small-sample-size datasets, in terms of retaining or even improving
predictive accuracy along with the stability of feature selection. Although the method brings
improvements, it is sensitive to the choice of base learners, and optimal performance requires
careful parameter tuning.
Vipin et al. (Kumar and Minz, 2016) propose a method called Optimal Feature Set Partitioning
(OFSP), focused on enhancing per-
formance in multi-view ensemble learning (MEL) for
data classification, where the problem of dimension-
ality is solved as the feature set is subdivided into rel-
evant and irrelevant subsets. Only the more signifi-
cant features are used while performing classification.
This approach uses the forward selection technique
along with the reduct-based strategy to come up with
optimal feature partitions that then are used in training
classifiers such as K-Nearest Neighbors, Naive Bayes, and Support Vector Machines. The results demonstrate
improvements in classification accuracy and reductions in computational complexity and execution
time across a diverse range of high-dimensional datasets.
This method assumes that the features used are sufficient for learning and uses a fixed
number of partitions, which may limit its adaptabil-
ity to larger and noisier datasets with varying feature
relevance.
Vipin et al. (Kumar and Minz, 2015) introduce a new supervised method for partitioning feature
sets for the classification of high-dimensional data, utilizing the advantages of multi-view ensem-
ble learning. By dividing the feature space into mul-
tiple disjoint subsets or "views", the method reduces
the curse of dimensionality, where each subset is pro-
cessed by an individual classifier. A combination of
outputs from these classifiers forms a robust ensemble
model for a final classification decision. In summary,
the multi-viewing approach enhances the ability to
generalize the model with features focusing on differ-
ent aspects of data and reduces overfitting in complex
datasets for a more accurate model. The experiment
shows better performance than the traditional single-
view classifiers, especially in applications like bioin-
formatics, image recognition, and text mining, which
involve high-dimensional data.
Moreover, this method is flexible enough to han-
dle any type of data, so that the feature space can be
controlled more granularly. However, the method is
not without its drawbacks. The computational cost is
quite high since managing and training multiple clas-
sifiers can be resource-consuming. Further, the effectiveness of the partitioning strategy and the
classifiers requires careful tuning and selection, which may differ from one dataset to another. This
technique is nevertheless promising for enhancing classification quality on extensive and complicated data.
High-dimensional gene expression data has driven the development of feature selection methods
that can sustain high classification performance. A hybrid ensemble-based feature selection method,
EFS-SU, combines filter-based techniques like Pear-
son’s correlation, Spearman’s rank, and Mutual In-
formation with Symmetric Uncertainty to select non-
redundant and relevant subsets of genes that con-
siderably increase the accuracy of SVM classifiers,
reaching 100% accuracy on the Leukemia dataset(R
et al., 2021). Similarly, the Minimum Redundancy
Maximum Relevance (mRmR) algorithm along with
a Random Forest classifier effectively balances rele-
vance and redundancy, yielding classification metrics superior to those of SVM and kNN classifiers, hence
showing how the algorithm may be utilized in the de-
termination of major biomarkers in cancer research(R
et al., 2024). In addition, the Boruta algorithm proves
its robustness as a wrapper-based technique for fea-
ture selection, using shadow features and Random Forest evaluations to preserve key features, with an
impressive classification accuracy of as high as 92.3%
for colon tumor datasets using SVM, despite its com-
putational intensity (Kavitha et al., 2022). These stud-
ies collectively underscore the critical role of fea-
ture selection in bioinformatics, each contributing
uniquely to optimizing machine learning workflows
for gene expression data analysis.
3 METHODOLOGY
The program starts by loading the dataset file and preparing it for analysis. The target variable,
taken to be the last column of the dataset, is inspected to determine whether it is categorical or
continuous. A categorical target is encoded into numerical values; a continuous target with a large
number of unique values is discretized into categories. Such transformation guarantees that the
target is correctly prepared for machine learning tasks. With the tar-
get variable transformed appropriately, the next task
is feature selection, in which appropriate features are chosen from the dataset. The program de-
termines the significance of each feature by assessing
its relevance to the target variable. Features that have
no relevance or are not adding much value to the task
of prediction are eliminated. It also identifies clusters
of highly correlated features, ensuring that only the
most relevant features from each cluster are retained
for further analysis. This feature selection process re-
duces noise and computational complexity while re-
taining the necessary information for model training.
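For concreteness, a minimal sketch of the loading and target-transformation step is given below; the file name, the 10-unique-value cutoff, and the two-bin discretization are illustrative assumptions rather than the authors' exact code.

```python
# Sketch of dataset loading and target preparation (assumed file name and cutoffs).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("colon_tumor.csv")        # hypothetical dataset file
X, y = df.iloc[:, :-1], df.iloc[:, -1]     # target assumed to be the last column

if y.dtype == object:                      # categorical target -> integer codes
    y = pd.Series(LabelEncoder().fit_transform(y), index=y.index)
elif y.nunique() > 10:                     # continuous target -> discrete bins
    y = pd.qcut(y, q=2, labels=False)      # illustrative two-class binning

X = X.fillna(X.mean())                     # simple imputation for missing values
```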
Having identified the appropriate features, the sys-
tem then generates multiple views of data. This is
achieved through a division of the selected features
into different small subgroups, each representing one
view. Each view contains a random selection of features, so the views differ from one another.
This process, termed attribute bagging, ensures that models developed from different views capture
diverse aspects of the data, making the resultant predictions more robust and generalized.
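A minimal sketch of this view-generation step is shown below. It assumes the views are disjoint random splits of the selected features, which matches the per-view feature counts reported later in Table 2; X is the feature matrix from the preprocessing sketch above.

```python
import numpy as np

def attribute_bagging(feature_names, n_views=5, seed=0):
    """Spread the selected features at random across n_views disjoint subsets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(feature_names))
    return [list(part) for part in np.array_split(shuffled, n_views)]

views = attribute_bagging(X.columns, n_views=5)   # views V1..V5
X_views = [X[v] for v in views]                   # one feature subset per view
```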
With the data prepared and split into views, the program trains different machine learning models
using the chosen algorithms: Random Forest, XGBoost, SVM, MLP, and LSTM. All of these models
are trained on the input data and evaluated on how correctly they predict, and the accuracy score
for each model is calculated.
During this phase, any exceptions raised while training the models are caught with exception
handling techniques.
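The per-view training loop might look like the sketch below; hyperparameters are library defaults, X_train/X_test and y_train/y_test are assumed train/test splits, and the LSTM view is trained separately (see Section 3.7).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# One classifier per view, mirroring the V1..V4 assignment described above.
models = [RandomForestClassifier(), XGBClassifier(),
          SVC(kernel="linear"), MLPClassifier(hidden_layer_sizes=(128,))]

accuracies, predictions = [], []
for model, view in zip(models, views[:4]):        # V5 (LSTM) handled separately
    try:
        model.fit(X_train[view], y_train)
        y_pred = model.predict(X_test[view])
        predictions.append(y_pred)
        accuracies.append(accuracy_score(y_test, y_pred))
    except Exception as exc:                      # per-model exception handling
        print(f"Skipping {type(model).__name__}: {exc}")
```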
To handle the class imbalance issue in the dataset,
the developed program applies a technique known
as SMOTE (Synthetic Minority Over-sampling Tech-
nique). The basic idea behind this technique is to gen-
erate synthetic samples for the minority class, so that the models are not biased in favor of the
majority class and can detect minority-class instances more reliably.
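Using the imbalanced-learn library, the rebalancing step can be sketched as follows; applying SMOTE to the training split only is an assumption consistent with standard practice.

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority class on the training split only.
X_train_bal, y_train_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
```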
Once the models have been trained and validated, the system combines their predictions using
ensemble learning methods. In weighted voting, the individual predictions are aggregated with each
model weighted by its accuracy, so better-performing models exert more influence on the final result.
In majority voting, the final result is the class predicted most often across the models.
Finally, the program evaluates the ensemble predictions and calculates the accuracy for both the
weighted and majority voting methods. This gives an overall assessment of how well the ensemble
performs and whether combining multiple algorithms yields more precise predictions. Throughout
the process, the program executes each step in sequence, from data preparation through model
training to prediction combination, to maximize the chances of obtaining correct and reliable results.
3.1 Algorithm
1. Initially, load the colon tumor gene expression
dataset. Perform preprocessing steps like normal-
ization and handling missing values.
2. Apply a clustering algorithm (e.g., K-means) to
group similar features into clusters, ensuring re-
dundancy reduction and correlation preservation
within each cluster.
3. Generate feature subsets by selecting features
from each cluster using attribute bagging to cre-
ate diverse views (V1, V2, ..., Vn).
4. Each feature subset (view) is used to train a spe-
cific classifier. For example, V1 is assigned to a
Random Forest, V2 to XGBoost, V3 to SVM, V4
to MLP, and V5 to LSTM. Each classifier learns
independently to capture unique patterns within
its assigned feature subset.
5. Each classifier independently predicts the output for its view, producing predictions (Yp1,
Yp2, ..., Ypn) (Xu et al., 2024).
6. Combine the predictions using a weighted major-
ity voting approach, where the weights are based
on the classifier’s performance on the validation
dataset(Singh and Kumar, 2024). This results in
the final prediction Ypred.
7. Compare Ypred with ground truth labels to compute metrics like accuracy, precision, recall,
and F1-score, ensuring the robustness of the classification model (a minimal metrics sketch
follows below).
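A minimal sketch of step 7 using scikit-learn metrics is shown below; the weighted averaging mode is an assumption for the two-class setting.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

def evaluate(y_true, y_pred):
    """Compute the four metrics listed in step 7."""
    return {"accuracy":  accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, average="weighted"),
            "recall":    recall_score(y_true, y_pred, average="weighted"),
            "f1":        f1_score(y_true, y_pred, average="weighted")}
```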
3.2 Clustering Algorithm (K-Means)
Clustering(Nidheesh et al., 2017) is an unsupervised
learning method of machine learning that groups sim-
ilar data points together based on some common char-
acteristics or features. In Algorithm 1, clustering is
performed in order to identify the highly correlated
groups of features in the dataset, which could be use-
ful for dimensionality reduction and better feature se-
lection(Kumar and Yadav, 2023).
The algorithm first computes the correlation ma-
trix of the features in the dataset. It then applies a correlation threshold to decide which features
are highly correlated. From the result, an adjacency matrix is constructed in which a correlation
surpassing the threshold is treated as an edge between features. On this adjacency matrix, connected
components are computed to group significantly correlated features into clusters. These clusters are
groups of highly correlated features that may carry redundant information.
Once the clusters are generated, each cluster is in-
spected to pick the most relevant feature. The algo-
rithm computes a score for each feature in a cluster
based on how much it contributes to the differenti-
ation between different classes. The feature with the
highest score in each cluster is picked, and the rest are
discarded. This reduces the number of features while
retaining the most informative ones.
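The correlation-threshold clustering and per-cluster feature selection described above can be sketched as follows; the threshold value and the use of the ANOVA F-score as the class-separation measure are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.feature_selection import f_classif

def correlation_cluster_select(X, y, threshold=0.8):
    corr = np.abs(np.corrcoef(X.values.T))           # feature-feature correlations
    adjacency = corr > threshold                     # edge if |corr| > threshold
    n_clusters, labels = connected_components(adjacency, directed=False)
    scores, _ = f_classif(X, y)                      # class-separation score per feature
    # Keep the highest-scoring feature of each connected-component cluster.
    return [X.columns[labels == c][np.argmax(scores[labels == c])]
            for c in range(n_clusters)]
```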
The clustering process helps in organizing the fea-
ture space and can lead to better performance by fo-
cusing on key features, reducing noise, and avoiding
multicollinearity. The resulting clusters enable more
efficient model training by ensuring that the input data
is both relevant and diverse.
3.3 Random Forest Classifier
Random forest (Díaz-Uriarte and Alvarez de Andrés, 2006) is an ensemble learning algorithm that makes
predictions with the average of multiple decision
trees, which enhances the classifier’s precision. In Al-
gorithm 1, a random forest model is trained on one view of the data. Every decision tree in the
forest puts out an independent prediction, and the final output is determined by a majority vote
over all of them.
The strength of Random Forest lies in its ability to
reduce overfitting and provide robust predictions by
aggregating the outputs of many trees. It is partic-
ularly well-suited for high-dimensional data like the
one in the dataset, where it can capture complex rela-
tionships between features.
3.4 XGBoost Classifier
XGBoost (Extreme Gradient Boosting)(Deng et al.,
2022) is a powerful gradient boosting framework that
uses an ensemble of weak learners (typically deci-
sion trees) and builds them sequentially. In Algo-
rithm 1, XGBoost improves the prediction accuracy
by focusing on the errors of the previous models and
trying to correct them. This is achieved by adding
trees that minimize the error using a process called
boosting. XGBoost is famous for its efficiency, flex-
ibility, and performance, especially in working with
large datasets. It handles missing values, regulariza-
tion, and can be fine-tuned for better accuracy.
3.5 Support Vector Machines (SVM)
Support Vector Machines(Guyon et al., 2002) are su-
pervised learning models in classification and regres-
sion tasks. They work by finding the optimal hyper-
plane that separates data into different classes with
maximum margin. In the context of Algorithm 1, the SVM uses a linear kernel to classify the data. The
main benefit of SVM is that it tries to find a deci-
sion boundary which maximizes the margin between
classes, which gives a better generalization. However,
SVM does not work very well with extremely large
datasets or datasets that are very noisy.
3.6 Multilayer Perceptron (MLP)
Multilayer Perceptron (MLP)(Skabar et al., 2006) is
an artificial neural network for classification. It has
an input layer, one or more hidden layers, and an out-
put layer. The layers are fully connected. Algorithm 1 uses an MLP with a hidden layer of size 128
for training on the data. MLP uses backpropaga-
tion to adjust the weights in the connections between
neurons to reduce the error. This algorithm is very
flexible and can model complex patterns in data, but
it requires careful tuning and is prone to overfitting if
not properly regularized.
3.7 Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM)(Aburass et al.,
2024) is a type of recurrent neural network (RNN)
that is particularly effective at learning and predict-
ing sequences of data. LSTMs are implemented to
address the vanishing gradient problem, which some-
times arises when learning long sequences with stan-
dard RNNs. In Algorithm 1, the LSTM model is used
for classification where the dataset is considered a se-
quence. It has an LSTM layer followed by a dropout
layer (in order to avoid overfitting) and then a dense
layer with softmax activation to provide class proba-
bilities. LSTMs are particularly useful when dealing
with time-series data or when the sequential relation-
ships between data points are important, though they
can also be applied to other types of data.
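A Keras sketch of such a model is given below; the unit count, dropout rate, and optimizer are assumptions, and each sample's feature vector is fed to the LSTM as a length-n sequence of single values.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

n_features = 200                               # e.g., one view at 1000 clusters
model = Sequential([
    LSTM(64, input_shape=(n_features, 1)),     # features read as a sequence
    Dropout(0.5),                              # dropout to reduce overfitting
    Dense(2, activation="softmax"),            # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_view.values.reshape(-1, n_features, 1), y_train, epochs=50)
```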
3.8 Ensemble Techniques
Ensemble methods are a combination of predictions
from multiple models in order to increase the overall
accuracy, robustness, and generalization of the
model. The idea is that combining the outputs of
several models will result in better predictions than
relying on a single model. In Algorithm 1, two
ensemble techniques are applied: Weighted Voting
and Majority Voting.
3.8.1 Weighted Ensembler
Weighted voting(Zhang et al., 2014) is an ensem-
ble technique in which the weight assigned to each
model’s prediction is based on its performance (accu-
racy)(Xu et al., 2024). The more accurate a model, the
higher its weight in the final prediction. In Algorithm
1, the accuracy of each model trained on the views is
used to compute the weights. The weight for every
model is determined by dividing its accuracy by the
sum of all accuracies, being guaranteed that the total
sum of weights will be equal to 1.
For every sample, a weighted sum of the models' predictions is computed. In essence, models
with better accuracy have more influence on the decision. The combined output is thresholded at
0.5, and these
thresholded values form the final ensemble prediction. This works particularly well when the models
vary significantly in terms of their degree of accu-
racy. This approach gives better-performing models
the ability to influence the outcome of the decision.
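For binary 0/1 labels, the rule reduces to the sketch below; normalizing the accuracies into weights and thresholding the weighted mean at 0.5 follows the description above.

```python
import numpy as np

def weighted_vote(predictions, accuracies):
    """predictions: (n_models, n_samples) array of 0/1 labels."""
    preds = np.asarray(predictions, dtype=float)
    weights = np.asarray(accuracies) / np.sum(accuracies)  # weights sum to 1
    return (weights @ preds >= 0.5).astype(int)            # threshold at 0.5
```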
3.8.2 Majority Voting
Majority voting(Aydın and Aslan, 2019) is another
ensemble method, whereby the final prediction results
from taking a vote of the ensemble. Each model puts
forward a prediction, and the class label that receives
the most votes is assigned as the final prediction re-
sult. In Algorithm 1, this is implemented through the
mode function, which calculates the most frequent
prediction for each example across all the models’
predictions.
The majority voting technique is simpler to apply than weighted voting and does not take the
accuracy levels of individual models into account (Singh and Kumar, 2024). It works well when the
compared models are roughly equally reliable, and it scales nicely to datasets containing thousands
of samples. The procedure is also somewhat resistant to noise: when many different kinds of models
are combined, the noisy predictions of individual models do not dominate the overall prediction.
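For completeness, a mode-based majority vote can be sketched as below, with SciPy's stats.mode standing in for the mode function mentioned above.

```python
import numpy as np
from scipy import stats

def majority_vote(predictions):
    """predictions: (n_models, n_samples) array of class labels."""
    preds = np.asarray(predictions)
    return stats.mode(preds, axis=0, keepdims=False).mode  # per-sample majority
```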
Both techniques rely on the strengths of various
models and minimize the dangers of overfitting or
bias generated by a single model. By combining their
predictions, the ensemble model should be more ac-
curate and stable.
4 RESULT ANALYSIS
4.1 Dataset Description
We are analyzing a dataset pertaining to colon tumors.
Colon cancer, which develops within the large intes-
tine, has the ability to spread to other regions of the
body. The colon serves as the concluding part of the
digestive system. While colon cancer can affect in-
dividuals across all age groups, it is more prevalent
among older adults. The class labels associated with
this dataset are represented in the column headings of
a 2001 × 62 matrix.
Table 1 summarizes the dataset.
Table 1: Dataset Description
Dataset     | Samples Count | Genes Count | Class Labels
Colon Tumor | 62            | 2000        | 2 (Yes, No)
4.2 Experiment Analysis
Table 2 summarizes the cluster and feature distribu-
tion across different views of the dataset. The data
is partitioned into views based on correlation thresh-
olds, which effectively group features into clusters
that are then analyzed for their contribution to classi-
fication. The table provides an overview of the num-
ber of clusters and the corresponding number of fea-
tures for each view, demonstrating the distribution of
data across different feature sets. This distribution
reflects how features are grouped and highlights the
diversity of clusters. Such an arrangement is critical
for understanding redundancy among features and en-
suring meaningful patterns are preserved across the
views. The multi-view clustering approach used in this analysis succeeds because it addresses the
nature of the high-dimensional space and retains the most relevant features for downstream
classification tasks.
Table 2: Cluster and Feature Distribution Across Views
No. of Clusters | No. of Features per View (5 Views)
248             | 49
500             | 100
1000            | 200
1500            | 300
1800            | 360
The experimental results, presented in Tables 3
and 4, provide significant insights into the perfor-
mance of various classifiers and ensemble meth-
ods across different numbers of clusters. The re-
sults not only highlight the effectiveness of the pro-
posed techniques but also reveal the challenges asso-
ciated with achieving optimal cluster counts during
the correlation-based clustering process.
One of the main observations was that exactly 250 clusters could not be obtained; 248 clusters were
produced instead. This is because correlation-based clustering
has inherent limitations. The threshold sensitivity, for
example, determines the number of clusters. In this
case, a threshold of 0.824 was used, and the algorithm
settled for 248 clusters as it adapted to the natural
structure of the data. Despite efforts to break up larger
clusters to achieve the target, the splitting mechanism
could not ensure exactly 250 clusters. This limitation
emphasizes the need for further refinement of cluster-
ing approaches to more closely match predefined tar-
gets without compromising the integrity of the data.
Table 3: Performance Metrics Across Different Algorithms and Ensembles
No. of Clusters | RF Acc. | XGBoost Acc. | SVM Acc. | MLP Acc. | LSTM Acc. | Weighted Ensemble Acc. | Majority Voting Acc.
248  | 63.16% | 68.42% | 78.95% | 73.68% | 47.37% | 63.16% | 63.16%
500  | 84.21% | 63.16% | 89.47% | 78.95% | 47.37% | 78.95% | 78.95%
1000 | 84.21% | 78.95% | 78.95% | 84.21% | 47.37% | 84.21% | 84.21%
1500 | 68.42% | 73.68% | 84.21% | 68.42% | 47.37% | 73.68% | 73.68%
1800 | 47.37% | 52.63% | 47.37% | 47.37% | 47.37% | 47.37% | 47.37%
Table 4: Performance Metrics for Different Ensemble Methods
No. of Clusters | Weighted Ensemble (Accuracy / F1 / Precision / Recall) | Majority Voting (Accuracy / F1 / Precision / Recall)
248  | 63.16% / 0.58 / 0.79 / 0.63 | 63.16% / 0.58 / 0.79 / 0.63
500  | 78.95% / 0.78 / 0.85 / 0.79 | 78.95% / 0.78 / 0.85 / 0.79
1000 | 84.21% / 0.83 / 0.88 / 0.84 | 84.21% / 0.83 / 0.88 / 0.84
1500 | 73.68% / 0.72 / 0.83 / 0.74 | 73.68% / 0.72 / 0.83 / 0.74
1800 | 47.37% / 0.30 / 0.22 / 0.47 | 47.37% / 0.30 / 0.22 / 0.47
Table 3 presents the performance of individual classifiers, namely RF, XGBoost, SVM, MLP,
and LSTM, as well as ensemble methods, including
weighted ensemble and majority voting. The anal-
ysis showed that SVM achieved its peak accuracy of 89.47% at 500 clusters, reflecting its ability
to balance precision with generalization at this moderate level of clustering. At 1000 clusters, RF
and MLP, along with both ensemble methods, remained consistent and achieved 84.21% accuracy.
These results indicate that, at moderate cluster counts, models can exploit the existing grouping
of features without over-segmenting. However, when
the number of clusters was increased to 1800, there
was an evident drop in performance by all models.
This drop, manifesting in decreased accuracy and re-
liability, points to over-segmentation as a potentially
negative phenomenon that dilutes meaningful relationships within the data and makes classification
more difficult.
Table 4 offers a deeper examination of the ensem-
ble methods by analyzing their performance across
four key metrics: accuracy, F1 score, precision, and
recall. Both weighted ensemble and majority vot-
ing demonstrated strong and consistent performance,
particularly at 500 and 1000 clusters. The weighted
ensemble, in particular, slightly outperformed major-
ity voting, with higher values for F1 score, precision,
and recall, showcasing its ability to effectively com-
bine predictions from multiple classifiers. However,
with a cluster count of 1800 and above, the perfor-
mance metrics dramatically dropped for both ensem-
ble techniques. For instance, F1 scores fell to 0.30,
and precision to 0.22, clearly indicating that it be-
comes increasingly difficult to maintain performance
as the feature space becomes too fragmented.
The inability to obtain more than 2000 clusters also points to intrinsic constraints of the data.
Factors such as the feature dimensionality, patterns of
correlations, and the natural connectivity of the data
all introduce constraints on how many clusters might
be reasonably obtained. When the correlation thresh-
old is dropped or splitting mechanisms are employed,
the fundamental structure of the data often dictates the
number of clusters found at termination. Forcing more clusters would produce overly sparse or
even meaningless groups, leaving the clustering result far less interpretable and useful.
All these experiments show the difficult trade-offs involved in clustering and classification. Although
the ensemble methods, especially the weighted one,
proved robust enough when using moderate cluster
counts, dramatic declines in performance at higher
cluster levels suggest the need for careful tuning of
clustering parameters. Future improvements could
be in making the clustering threshold more dynami-
cally adjustable or using alternative clustering tech-
niques, such as k-means or DBSCAN, that allow bet-
ter control over the number of clusters. Moreover,
refinement of the splitting mechanism to target clus-
ter counts without sacrificing data quality can help in
overcoming some of these limitations. The method-
ology leads to a deeper understanding of how clustering granularity affects classifier performance
and serves as a stepping stone to more advanced ensemble-based classification algorithms.
5 CONCLUSION
In summary, this work illustrates the effectiveness of
a multi-view clustering approach combined with en-
semble classification for high-dimensional gene ex-
pression data analysis, especially in the case of Colon
Tumor classification. The best results were achieved
by the configuration of 1000 clusters and 200 fea-
tures per view. Both the Weighted Voting and Ma-
jority Voting ensemble methods obtained an accuracy
of 84.21% with robust supporting metrics such as a
precision of 0.88, recall of 0.84, and an F1-score of
0.83. These results emphasize the robustness of en-
semble techniques in aggregating predictions from
diverse classifiers to enhance performance, making
them highly suitable for biomedical applications.
This research provides a powerful framework for
handling high-dimensional datasets, using clustering
and ensemble methods to reduce feature redundancy
while retaining meaningful patterns. The methodol-
ogy will be extended in future work to other datasets
beyond Colon Tumor to test generalizability and
adaptability across different biological and biomedi-
cal challenges. Alternative clustering techniques and
ensemble strategies may further optimize the classi-
fication accuracy and computational efficiency of the
approach. These directions will further strengthen the
potential of machine learning for advancing precision
medicine and bioinformatics.
REFERENCES
Aburass, S., Dorgham, O., and Shaqsi, J. A. (2024). A
hybrid machine learning model for classifying gene
mutations in cancer using lstm, bilstm, cnn, gru, and
glove. Systems and Soft Computing, 6:200110.
Aydın, F. and Aslan, Z. (2019). The construction of a
majority-voting ensemble based on the interrelation
and amount of information of features. The Computer
Journal, 63(11):1756–1774.
Ben Brahim, A. and Limam, M. (2018). Ensemble feature
selection for high dimensional data: a new method and
a comparative study. Advances in Data Analysis and
Classification, 12(4):937–952.
Bhavitha, Lekshmi Prasad, P., Ani, R., and Deepa, O.
(2023). Machine learning based admet prediction in
drug discovery. In 2023 4th IEEE Global Conference
for Advancement in Technology (GCAT), pages 1–9.
Deng, X., Li, M., Deng, S., and Wang, L. (2022). Hy-
brid gene selection approach using xgboost and multi-
objective genetic algorithm for cancer classification.
Medical & Biological Engineering & Computing,
60(3):663–681.
Díaz-Uriarte, R. and Alvarez de Andrés, S. (2006). Gene
selection and classification of microarray data using
random forest. BMC Bioinformatics, 7(1):3.
Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002).
Gene selection for cancer classification using support
vector machines. Machine Learning, 46(1):389–422.
Kavitha, K. R., Sajith, S., and Variar, N. H. (2022). An effi-
cient boruta-based feature selection and classification
of gene expression data. In 2022 IEEE 3rd Global
Conference for Advancement in Technology (GCAT),
pages 1–6.
Kumar, A. and Yadav, J. (2023). A review of feature set
partitioning methods for multi-view ensemble learn-
ing. Information Fusion, 100:101959.
Kumar, V. and Minz, S. (2015). Multi-view ensemble learn-
ing: A supervised feature set partitioning for high di-
mensional data classification. In Proceedings of the
Third International Symposium on Women in Comput-
ing and Informatics, WCI ’15, page 31–37, New York,
NY, USA. Association for Computing Machinery.
Kumar, V. and Minz, S. (2016). Multi-view ensemble
learning: an optimal feature set partitioning for high-
dimensional data classification. Knowledge and Infor-
mation Systems, 49(1):1–59.
Menon, R. R., Gayathri, S., and Amina, A. (2023). Rep-
resentation of documents using minimal dictionary of
embeddings. Volume 2023-June, pages 1897–1903.
Nidheesh, N., Abdul Nazeer, K., and Ameer, P. (2017). An
enhanced deterministic k-means clustering algorithm
for cancer subtype prediction from gene expression
data. Computers in Biology and Medicine, 91:213–
221.
R, K. K., Kumar, R. A., and C, M. M. (2024). A maximum
relevance minimum redundancy and random forest
based feature selection and classification of gene ex-
pression data. In 2024 5th International Conference
for Emerging Technology (INCET), pages 1–5.
R, K. K., S, A. S., and Rasheed, R. (2021). Ensemble-based
feature selection using symmetric uncertainty and svm
classification. In 2021 2nd Global Conference for Ad-
vancement in Technology (GCAT), pages 1–6.
Singh, R. and Kumar, V. (2024). Ensemble multi-view
feature set partitioning method for effective multi-
view learning. Knowledge and Information Systems,
66(8):4957–5001.
Skabar, A., Wollersheim, D., and Whitfort, T. (2006).
Multi-label classification of gene function using mlps.
In The 2006 IEEE International Joint Conference on
Neural Network Proceedings, pages 2234–2240.
Sreejesh Kumar, V. S., Aparna, K., Ani, R., and Deepa,
O. (2021). Ensemble machine learning approaches in
molecular fingerprint based virtual screening. In 2021
2nd Global Conference for Advancement in Technol-
ogy (GCAT), pages 1–6.
Xu, Y., Yu, Z., and Chen, C. L. P. (2024). Classifier
ensemble based on multiview optimization for high-
dimensional imbalanced data classification. IEEE
Transactions on Neural Networks and Learning Sys-
tems, 35(1):870–883.
Zhang, Y., Zhang, H., Cai, J., and Yang, B. (2014). A
weighted voting classifier based on differential evolu-
tion. Abstract and Applied Analysis, 2014(1):376950.