Predictive Modelling for Diabetes Mellitus with Respect to Basic Medical
History
Patrick Purta, Aryan Mishra, Vishal Reddy Vadde, Ruthvika Bojjala, Gopichand Jagarlamudi and
Bonaventure Chidube Molokwu
Department of Computer Science, College of Engineering and Computer Science, California State University,
Sacramento, U.S.A.
Keywords:
Diabetes Mellitus, SMOTE, SVM-SMOTE, SMOTE-ENN, Predictive Modeling, Risk Assessment, Clinical
Decision Support, Medical History.
Abstract:
In our work herein, we observed how three (3) common oversampling techniques - SMOTE, SMOTE-ENN,
and SVM-SMOTE - affect the performance of Machine Learning (ML) models applied towards predicting
diabetes risk with reference to the Pima-Indian (Akimel O’odham) Diabetes dataset. Our aim was to figure
out whether using these methods to mitigate class imbalance in a medical dataset might cause the ML models to overfit - in other words, to do very well on the training data but lose fitness and accuracy on new data.
Our project began from a simple question: “Can oversampling fix class imbalances, with respect to a given
dataset, without hurting the model’s ability to generalize?” Previous studies have shown that oversampling can
help balance target-classes within a dataset, but these studies do not always address the risk of overfitting. To
answer this, we paired each oversampling technique with three (3) ensemble methods - Extra Trees, Gradient Boosting, and Random Forest - and compared their performances via cross-validation objective functions.
Our results reveal that, although each method improves the results or metrics on the training data, the resultant models tend to
under-perform slightly on unseen test or sample data. This suggests that while oversampling is a useful strat-
egy, it must be applied with caution to avoid overfitting. These insights are important for refining predictive
models, especially in healthcare contexts where reliable performance is critical.
1 INTRODUCTION
Working on this project has been like untangling a knot in a routine we all know too well: balancing the scales when it comes to data. In healthcare, especially with something as crucial as diabetes diagnosis, having a dataset where the positive cases are much fewer than the negatives is common, and it often leaves our models struggling to learn the right patterns. Imagine trying to hear a soft whisper in a loud room; the less frequent instances get drowned out.
We started exploring this issue because we noticed
something interesting with oversampling techniques
like Synthetic Minority Over-sampling Technique
(SMOTE), SMOTE-ENN (SMOTE + Edited Nearest
Neighbors), and SVM-SMOTE (Support Vector Ma-
chine SMOTE). These methods essentially “create”
more examples from the minority class so that every
voice in the dataset gets a chance to be heard. How-
ever, while these techniques boosted performance
during training, our models - built with Extra Trees,
Gradient Boosting, and Random Forest - sometimes
became too focused on the training data, showing
signs of overfitting when it came time to predict new
cases.
This raised a pressing question for our team: “Can
we use oversampling to balance the dataset without
inadvertently making the models overconfident with
reference to the training examples?” With the Pima-
Indian Diabetes dataset (Nnamoko and Korkontzelos,
2020) as our test case, we set out to find an answer
because, in healthcare, even the smallest error(s) can
yield big consequences. The goal is not just to record
high objective-function scores during training, but to
build tools that work reliably when it really counts.
To address this issue, we employed a thorough, hands-on approach. We applied each oversampling technique separately to our Training data; thereafter, we carefully evaluated how different ensemble models performed using a cross-validation strategy. By comparing Training and Testing results,
we aimed to better understand where our models were getting it right and where they were losing their grip.
At its heart, our project is more than just statis-
tics - it is about making a realistic difference in how
we use technology in healthcare. By delving deeper
into the balance between correcting data imbalances
and avoiding overfitting, we do hope that our work
can guide the development of more robust as well as
trustworthy diagnostic tools that put patient-care first.
2 RELATED LITERATURE
In our effort to improve how machines predict dia-
betes, we dug into a lot of previous work to see how
others have handled similar challenges. This review
granted us insights into the past approaches and/or
methodologies.
2.1 Background of the Work
Researchers have long struggled with imbalanced
data, especially in healthcare, where the number
of positive cases (diabetes diagnosis) is often much
smaller than the negative cases. One of the earliest
breakthroughs came with the introduction of SMOTE
(Chawla et al., 2002), which set the stage for creat-
ing synthetic examples to balance the dataset. Since
then, numerous projects have applied these methods
- especially to datasets like the Pima-Indian
Diabetes dataset - to better understand how boosting
the minority class can help the overall model. How-
ever, these projects also revealed a common down-
side: while oversampling improves training results,
it sometimes leads the models to perform less effec-
tively when faced with new data.
2.2 Building a Strong Knowledge Base
A number of articles have reinforced the idea that
oversampling techniques such as SMOTE, SMOTE-
ENN, and SVM-SMOTE can help level the playing
field during model training - by providing more ex-
amples from the under-represented class. For example, SVM-RBF (Support Vector Machine with a Radial Basis Function kernel), Decision Tree, Naive Bayes, and RIPPER learning algorithms were coupled with SMOTE to improve classification performance on the Pima-Indian dataset. Studies that com-
pared these techniques have shown that although they
improve the class balance during training, they can
also make a model too “comfortable” with the train-
ing data, resulting in overfitting (Santos et al., 2018).
In other words, the model learns the training data so
well that it struggles to adapt when shown data it has
not seen before. Researchers (Poornima and R., 2024)
have also experimented with ensemble methods - e.g.
Extra-Trees, Random Forest (Olisah et al., 2022), Gra-
dient Boosting, etc. - highlighting both the benefits
and the challenges of combining these oversampling
techniques with robust classifiers (Zhang et al., 2024).
2.3 Theoretical Support and
Methodological Framework
What ties all these together is the well-known trade-
off between reducing bias and increasing variance.
Oversampling helps reduce bias by making sure the
model gets trained on sufficient examples of the mi-
nority class, but it can also lead to higher variance -
meaning that the model might not perform well on
unseen data. Several works in the literature support using cross-validation as a way to tackle this problem. Cross-validation (CV) helps check whether a model's good performance on the Training data carries over when it encounters new data. Many studies (Santos et al.,
2018) have used this approach to ensure that the ben-
efits of oversampling are not lost due to overfitting.
This blend of theory and practice provides the frame-
work for our work herein - giving us guidance on how
to set up our experiments and interpret our results.
By examining the aforementioned past efforts and work - from the early development of SMOTE to recent studies on ensemble methods - we have built
a solid foundation for our work herein. Our review
of related literature not only highlights the strengths
and weaknesses of existing methods, but also identifies areas where improvements can be made. Thus,
this background supports our goal towards striking
an ideal balance between fixing data imbalances and
keeping our models adaptable in real-world health-
care domains (Mooney, 2018; Lugat, 2021).
3 PROPOSED FRAMEWORK
AND METHODOLOGY
3.1 Formalism with Respect to the
Problem Statement
Given a medical dataset, $D = \{(x_i, y_i)\}$ for $i = 1 \dots n$, with input features $x_i \in \mathbb{R}^d$ and binary outcomes $y_i \in \{0, 1\}$, where class imbalance exists such that:

$$\sum_{i=1}^{n} y_i \;\ll\; \sum_{i=1}^{n} (1 - y_i),$$
i.e. the number of diabetic cases ($y_i = 1$) is much smaller than the number of non-diabetic cases ($y_i = 0$). The goal is to develop a robust classification function, $f(x): \mathbb{R}^d \to \{0, 1\}$, such that $f(x)$ accurately predicts the presence of diabetes (Outcome = 1) using ensemble classifiers, while reducing the effects of class imbalance via oversampling techniques. This leads to the following optimization objective:

$$\min_{f \in \mathcal{F}} \; \mathcal{L}(f(x), y)$$

subject to the Training set being balanced via an oversampling method, $O \in$ {SMOTE, SMOTE-ENN, SVM-SMOTE}.
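For concreteness - taking the commonly reported split of this dataset, about 268 positive versus 500 negative instances, as an assumption - the imbalance condition reads:

$$\sum_{i=1}^{n} y_i = 268 \;\ll\; 500 = \sum_{i=1}^{n} (1 - y_i), \qquad n = 768.$$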
3.2 Methodology Overview
1. Data Preparation: We began by cleaning the dataset - replacing zero-values in medically critical fields (like glucose level, blood pressure, BMI, etc.) with NaN-values, and then imputing the missing values using the KNN-Imputer method. This helps ensure the model is not biased via invalid or misleading inputs (a minimal sketch of this preprocessing appears after this list).
2. Train-Test Split: The dataset was split into train-
ing and testing sets using stratified sampling in a
bid to preserve the proportion of diabetic and non-
diabetic cases in both sets.
3. Oversampling the Minority Class: We tested three (3) widely-used oversampling techniques:
   - SMOTE: generates synthetic samples for the minority class based on k-nearest neighbors.
   - SMOTE-ENN: combines SMOTE with Edited Nearest Neighbors to also clean noisy samples.
   - SVM-SMOTE (Demidova and Klyueva, 2017): a variant of SMOTE that uses an SVM to better define the border of the minority class.
4. Feature Scaling and Standardization: After impu-
tation, we applied a Quantile Transformer to stan-
dardize the data and reduce skewness. This step is crucial before training, especially when using models that are sensitive to the scale of the feature values.
5. Model Training: For each oversampling technique, we trained three (3) models - Random Forest, Extra Trees, and Gradient Boosting - on the resampled Training data.
6. Model Evaluation: Finally, we evaluated each
model using multiple metrics such as Accuracy,
Precision, Recall, F1-score, AUC (Area Under the
Curve), and MCC (Matthews Correlation Coeffi-
cient). This helped us understand not just how ac-
curate the models were, but how well they handled
both classes (Kaliappan et al., 2024).
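The preprocessing steps above (Steps 1, 2, and 4) can be summarized in a short Python sketch using scikit-learn. This is a minimal illustration only: the file name, column names, test size, and random seed are assumptions based on the common Kaggle/UCI release of the dataset, not our exact configuration; oversampling and training (Steps 3, 5, and 6) are sketched in Section 3.3 and Section 5.

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import QuantileTransformer

    # Load the dataset (file name and column names assumed from the common release).
    df = pd.read_csv("diabetes.csv")
    zero_invalid = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

    # Step 1: treat invalid zero entries as missing and impute them with the KNN-Imputer.
    df[zero_invalid] = df[zero_invalid].replace(0, np.nan)
    X = df.drop(columns=["Outcome"])
    y = df["Outcome"]
    X = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(X), columns=X.columns)

    # Step 2: stratified split preserving the diabetic/non-diabetic ratio (test size assumed).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Step 4: reduce skewness with a Quantile Transformer, fitted on the Training data only.
    qt = QuantileTransformer(output_distribution="normal")
    X_train = qt.fit_transform(X_train)
    X_test = qt.transform(X_test)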
3.3 Formal Algorithm
The formal algorithm for the Oversampling and
Model-Training Pipeline begins with the input of a
dataset, D, an oversampler, O, and a classifier, C,
with the aim of producing a trained model, M, as
the output. Initially, the dataset, D, is split into fea-
tures, X, and labels, y. Any missing values in the
features are handled via a KNN-Imputer. Subse-
quently, the features, X, are standardized via a Quan-
tile Transformer to ensure a uniformly scaled distri-
bution. The dataset is then further divided into Train-
ing and Testing sets: X_train, X_test, y_train,
y_test. The oversampler, O, is applied to the
Training data to generate: X_train_oversampled
and y_train_oversampled. After oversampling, the
classifier, C, is trained on the oversampled Training
data. Finally, the trained model, M, is returned.
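As a concrete illustration, the oversample-and-train step of this pipeline could be written as a small Python function. This is a sketch under the hyperparameters listed in Table 2; the function name is ours, and X_train/y_train are assumed to come from the preprocessing sketch in Section 3.2.

    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier

    def train_with_oversampling(X_train, y_train, oversampler, classifier):
        # Apply the oversampler, O, to the Training data only, then fit the classifier, C.
        X_res, y_res = oversampler.fit_resample(X_train, y_train)
        classifier.fit(X_res, y_res)
        return classifier  # the trained model, M

    # Example usage; SVMSMOTE, SMOTEENN, ExtraTreesClassifier, and
    # GradientBoostingClassifier plug into the same two arguments.
    model = train_with_oversampling(
        X_train, y_train,
        oversampler=SMOTE(k_neighbors=5, random_state=42),
        classifier=RandomForestClassifier(n_estimators=100, random_state=42))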
4 PROPOSED SYSTEM
ARCHITECTURE AND SETUP
The framework of our proposed system is illustrated
via Figure 1; while Table 1 represents the descrip-
tion of our benchmark dataset. Table 2 showcases
a handful of the hyperparameter configurations for
our experimental setup. Table 3 highlights the ob-
jective functions (or performance metrics) employed
herein to evaluate how well each benchmark model
performed - especially in terms of identifying diabetic
patients (the minority class). Table 4 lists the baseline
models we have employed toward benchmarking and
comparative analyses.
Table 1: Description of Dataset.

Property        | Description
Name            | Pima-Indian Diabetes Dataset
Source          | UCI Machine Learning Repository / Kaggle
Instances       | 768
Features        | 8 input features + 1 binary output (Outcome)
Target Variable | Outcome (1 = diabetic, 0 = non-diabetic)
Missing Values  | Handled using KNN-Imputer
5 EXPERIMENT AND RESULTS
We used the Pima-Indian diabetes dataset, a widely
recognized dataset in medical and healthcare re-
Figure 1: Architectural overview of our proposed system.
Table 2: Hyperparameters Configuration/Setting.

Component           | Parameter           | Value(s) Tested
SMOTE               | k_neighbors         | 5
SVM-SMOTE           | k_neighbors         | 5
Random Forest       | n_estimators        | 100
Extra Trees         | n_estimators        | 100
Gradient Boosting   | learning_rate       | 0.1
Gradient Boosting   | n_estimators        | 100
KNN-Imputer         | n_neighbors         | 5
QuantileTransformer | output_distribution | 'normal'
Table 3: Objective Functions (Performance Metrics).

Metric    | Purpose
Accuracy  | Measures overall correctness
Precision | Ratio of True Positives to predicted Positives
Recall    | Sensitivity to positive cases
F1-score  | Harmonic mean of Precision and Recall
ROC-AUC   | Area under the ROC curve
MCC       | Balanced measure for imbalanced datasets
search, to thoroughly test the effectiveness of three (3) robust ML algorithms: Random Forest, Extra Trees, and Gradient Boosting. Each of these algorithms was carefully selected based on their known strengths in handling classification tasks, especially with medical datasets. To ensure our models effectively manage data imbalance - a common issue in medical diagnosis - we employed multiple sampling techniques (SMOTE, SVM-SMOTE, and SMOTE-ENN) as well as a “No Sampling” baseline; the experimental grid is sketched below.

Table 4: Baseline Models for Comparison.

Model               | Description
Logistic Regression | Simple linear classifier
Naive Bayes         | Probabilistic classifier
Decision Tree       | Base model for tree ensembles
Dummy Classifier    | Majority class predictor (baseline)
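The sketch below shows how such a grid of sampler-classifier pairs can be organized, recording a cross-validated Training score and a held-out Test score for each pair. It is illustrative only: the fold count, random seeds, and the exact cross-validation loop we used are not reproduced here, and X_train/X_test/y_train/y_test are assumed to come from the preprocessing sketch in Section 3.2; F1-score and MCC are computed analogously.

    from imblearn.combine import SMOTEENN
    from imblearn.over_sampling import SMOTE, SVMSMOTE
    from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_score

    samplers = {"No Sampling": None,
                "SMOTE": SMOTE(k_neighbors=5, random_state=42),
                "SVM-SMOTE": SVMSMOTE(k_neighbors=5, random_state=42),
                "SMOTE-ENN": SMOTEENN(random_state=42)}
    models = {"Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
              "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=42),
              "Grad-Boost": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                                       random_state=42)}

    for s_name, sampler in samplers.items():
        # Resample the Training split only; the Test split is left untouched.
        if sampler is None:
            X_res, y_res = X_train, y_train
        else:
            X_res, y_res = sampler.fit_resample(X_train, y_train)
        for m_name, clf in models.items():
            train_score = cross_val_score(clf, X_res, y_res, cv=5, scoring="accuracy").mean()
            clf.fit(X_res, y_res)
            test_score = accuracy_score(y_test, clf.predict(X_test))
            print(f"{s_name:12s} {m_name:14s} train={train_score:.3f} test={test_score:.3f}")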
Table 5: Accuracy scores (Train/Test).

Model         | No Sampling | SMOTE       | SVM-SMOTE   | SMOTE-ENN
Random Forest | 0.759/0.721 | 0.819/0.766 | 0.814/0.753 | 0.961/0.721
Extra Trees   | 0.756/0.714 | 0.823/0.734 | 0.833/0.734 | 0.960/0.727
Grad-Boost    | 0.754/0.760 | 0.785/0.740 | 0.788/0.760 | 0.941/0.766
Table 6: F1-scores (Train/Test).

Model         | No Sampling | SMOTE       | SVM-SMOTE   | SMOTE-ENN
Random Forest | 0.755/0.716 | 0.820/0.750 | 0.804/0.726 | 0.961/0.726
Extra Trees   | 0.748/0.710 | 0.822/0.735 | 0.833/0.736 | 0.960/0.733
Grad-Boost    | 0.752/0.754 | 0.785/0.745 | 0.787/0.764 | 0.941/0.771
Table 7: Matthews Correlation Coefficient, MCC, scores (Train/Test).

Model         | No Sampling | SMOTE       | SVM-SMOTE   | SMOTE-ENN
Random Forest | 0.456/0.420 | 0.631/0.457 | 0.606/0.490 | 0.893/0.427
Extra Trees   | 0.452/0.420 | 0.639/0.462 | 0.644/0.401 | 0.926/0.463
Grad-Boost    | 0.467/0.455 | 0.571/0.390 | 0.583/0.468 | 0.886/0.508
6 DISCUSSION
Table 8 showcases the experiment results with re-
spect to our Extra-Trees model, and Table 9 represents
the experiment results with respect to our Gradient-
Boosting model.
Table 8: Extra-Trees model (Train/Test).

Sampling Method | Accuracy (Train/Test) | F1-score (Train/Test) | Interpretation
No Sampling     | 0.756/0.714           | 0.748/0.710           | Minor overfitting, 5% variance
SMOTE           | 0.823/0.734           | 0.822/0.735           | Better scores but overfitting by 7-10%
SVM-SMOTE       | 0.833/0.734           | 0.833/0.736           | Better scores but overfitting by 7-10%
SMOTE-ENN       | 0.960/0.727           | 0.960/0.733           | Major overfitting, up to 25%
Table 9: Gradient-Boosting model (Train/Test).

Sampling Method | Accuracy (Train/Test) | F1-score (Train/Test) | Interpretation
No Sampling     | 0.754/0.760           | 0.752/0.754           | No overfitting, 1% variance
SMOTE           | 0.785/0.740           | 0.785/0.745           | Minor overfitting, 1-5% variance
SVM-SMOTE       | 0.788/0.760           | 0.787/0.764           | Minor overfitting, 1-5% variance
SMOTE-ENN       | 0.941/0.766           | 0.941/0.771           | Major overfitting, up to 20%
6.1 Performance Analysis
The experimental results reveal that the performance of the ensemble classifiers - Random Forest, Extra Trees, and Gradient Boosting - varied significantly depending on the oversampling technique applied. Random Forest achieved some of the highest training scores across all oversampling methods, with training Accuracy reaching as high as 0.961 and
training F1-score also at 0.961 when using SMOTE-
ENN. However, these impressive training metrics did
not carry over to the Test set, where the corresponding
Accuracy and F1-score dropped to 0.721 and 0.726,
respectively. This large disparity suggests significant
overfitting, where the model learns the Training data
too closely and fails to generalize well to new, un-
seen data. A similar pattern was observed in the
Extra-Trees classifier, where training Accuracy and
F1-score under SMOTE-ENN were both 0.960, but
Test set values fell to 0.727 and 0.733, respectively.
This reinforces the conclusion that aggressive over-
sampling techniques like SMOTE-ENN can lead to
overly optimistic training performance while compro-
mising real-world applicability.
In contrast, Gradient Boosting demonstrated a
more balanced performance between the training and
testing phases. The differences between the train-
ing and test metrics were much narrower, espe-
cially with SMOTE and SVM-SMOTE. For example,
when paired with SVM-SMOTE, Gradient Boosting
achieved a training Accuracy of 0.788 and a test Ac-
curacy of 0.760, with F1-scores of 0.787 and 0.764,
respectively. This indicates only a minor degree of
overfitting, suggesting that Gradient Boosting may be
better at generalizing from the Training data while
still benefiting from oversampling. These findings
are further supported by the Matthews Correlation
Coefficient (MCC) score, which reflects the quality
of binary classifications in imbalanced datasets. Although Random Forest and Extra-Trees yielded very high MCC values on the Training data (up to 0.926), their test MCC scores were notably lower. In con-
trast, Gradient Boosting with SMOTE-ENN achieved
the highest test MCC at 0.508, indicating a stronger
ability to maintain balanced performance across both
classes.
6.2 Interpretation of Results
The primary performance indicators used to evaluate
the models were Accuracy, F1-score, and Matthews
Correlation Coefficient (MCC); which together pro-
vide a multidimensional view of the classification
quality. Accuracy captures the overall correctness of
the model, while F1-score accounts for the trade-off
between Precision and Recall, making it particularly
useful in imbalanced medical datasets. MCC offers
a single summary metric that reflects the balance of both True and False Positives as well as True and False Negatives - making it especially relevant when the dataset is skewed.
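For reference, all three indicators are available in scikit-learn; the minimal sketch below uses hypothetical label arrays purely to show the calls.

    from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

    # Hypothetical ground truth and predictions, for illustration only.
    y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

    print("Accuracy:", accuracy_score(y_true, y_pred))    # overall correctness
    print("F1-score:", f1_score(y_true, y_pred))          # Precision-Recall trade-off
    print("MCC:", matthews_corrcoef(y_true, y_pred))      # balanced measure on skewed data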
While training Accuracy for all models exceeded 90% when SMOTE-ENN was applied, such high values were not replicated on the Test set, where most Accuracy values hovered between 71% and 76%. This discrepancy underscores
the presence of overfitting, particularly in models like
Random Forest and Extra Trees when trained with
SMOTE-ENN. F1-scores on the Test set ranged be-
tween 0.71 and 0.77, with Gradient Boosting paired
with SMOTE-ENN yielding the highest test F1-score
of 0.771. This suggests that the combination allowed
for the most effective balance between detecting true
diabetic cases and avoiding False Positives. Mean-
while, the MCC results offered a more nuanced in-
terpretation of model balance. Gradient Boosting
with SMOTE-ENN once again led with a test MCC
of 0.508, indicating strong classification consistency.
Random Forest with SVM-SMOTE also performed
well with a test MCC of 0.490.
These results suggest that while oversampling
methods can enhance model performance by mitigat-
ing class imbalance, they must be applied with cau-
tion to avoid overfitting. Among the classifiers tested,
Gradient Boosting demonstrated the most robust and
reliable generalization, especially when combined
with SMOTE-ENN or SVM-SMOTE. These combi-
nations improved the model’s ability to identify dia-
betic cases effectively without significantly compro-
mising performance on the majority class, which is
essential in high-stake medical prediction tasks.
6.3 Implications, Benefits, and
Contribution to Humanity
The results from this work have several important real-world implications, viz:
(a) Early and accurate detection of diabetes: Our
models can help identify positive diabetes cases
earlier, thereby enabling timely interventions that
can prevent serious complications, improve pa-
tient outcomes, and significantly reduce health-
care costs.
(b) Deployment in real-world scenarios: Thanks to the robust performance achieved after advanced preprocessing (especially with SVM-SMOTE and SMOTE-ENN), the models can be confidently deployed in critical fields such as healthcare or finance, where imbalanced data is common and high-stake decisions are made.
(c) Broader applicability: The techniques and work-
flow we demonstrated herein, combining strong
ensemble models with smart preprocessing tech-
niques, can be applied to other imbalanced prob-
lems like cancer detection, fraud detection, or pre-
dicting rare diseases.
7 CONCLUSIONS AND FUTURE WORK
7.1 Limitations
Despite the promising outcomes herein, our research
has several limitations. Primarily, our evaluation was
limited to the Pima-Indian dataset which predom-
inantly includes data from women of (Indigenous)
Pima-Indian heritage; thus potentially restricting the
model’s applicability to broader populations and gen-
ders. Additionally, advanced Deep Learning archi-
tectures and a wider range of preprocessing methods
were not explored.
7.2 Summary of Model
Our developed models are effective analytical tools capable of ingesting and processing medical and clinical data. Each model takes tabular numeric data as input, where each row is an instance of a patient's medical profile and each column represents features such as pregnancy count, glucose levels, insulin levels, etc. In the initial phase of data trans-
formation, missing and/or invalid values were identi-
fied in the dataset with respect to features like Glu-
cose level, BloodPressure, SkinThickness, Insulin,
and BMI. These features contained invalid zero en-
tries, which were later transformed into NaN-values,
and imputed using KNN-Imputer to maintain data in-
tegrity.
The preprocessing stage consists of several steps to prepare the data for modeling. Firstly, we tackled missing values by imputing them; then we used EDA (Exploratory Data Analysis), plotting histograms to show that the distribution of features had been corrected and confirming that no zero-entries remained. Feature selection is done with the help of SelectKBest and the ANOVA F-test (f_classif), so that we can choose the most relevant features, i.e. those which correlate most strongly with the target, for our model. The dataset is then split into Training and Testing sets via train_test_split.
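A minimal sketch of this selection step, assuming the imputed feature matrix X and labels y from the preprocessing sketch in Section 3.2 (the value of k is illustrative, not necessarily the one used in our experiments):

    from sklearn.feature_selection import SelectKBest, f_classif

    # Score each feature against the target with the ANOVA F-test and keep the top k.
    selector = SelectKBest(score_func=f_classif, k=6)
    X_selected = selector.fit_transform(X, y)
    print(selector.get_support(indices=True))  # indices of the retained features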
Data oversampling is performed using SMOTE,
SVMSMOTE, or SMOTE-ENN, depending on which
sampling method the user chooses to include in the
model. The Testing data should not have any knowl-
edge of the Training data. Our oversampling methods
create synthetic data points via interpolation between
two (2) original data points with respect to a given
dataset. If the Test set includes these synthetic data
points, then it has some knowledge of the Training
data, and this may cause bias in the resultant model’s
learning (Zhang et al., 2024). Therefore, oversam-
pling is performed only on the Training data. Sub-
sequently, the features are standardized using Quan-
tileTransformer with a normal-distribution output to
reduce skewness and ensure that the model sees uni-
formly distributed input features.
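This ordering - resample only the Training split, then fit the transformer on the Training split and merely apply it to the Test split - can be expressed in a few lines. The sketch below assumes the X_train/X_test split from Section 3.2 and shows SMOTE only; SVM-SMOTE and SMOTE-ENN are applied the same way.

    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import QuantileTransformer

    # Synthetic points are generated from the Training split only, so the Test set
    # never contains oversampled (interpolated) examples.
    X_train_res, y_train_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_train, y_train)

    # The transformer is fitted on the (resampled) Training data and only applied to the Test data.
    qt = QuantileTransformer(output_distribution="normal")
    X_train_res = qt.fit_transform(X_train_res)
    X_test_std = qt.transform(X_test)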
Upon completion of preprocessing, model train-
ing is done robustly with the help of k-fold cross
validation so that the performance of the model can
be evaluated effectively. To properly evaluate the
model’s generalization capability, key performance
metrics such as Accuracy and F1-score are calculated
for both Training data and Testing data which we
have explained in detail in the previous section. The
model yields its results with respect to standard ob-
jective functions/metrics for classification tasks. The
primary output is whether a patient is diabetic or not
diabetic. This research is not limited to medical data only; it can be adapted to any type of numeric data requiring a classification task.
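As noted above, evaluation relies on k-fold cross-validation while keeping synthetic samples away from held-out data. One leak-free way to combine the two - a sketch, not necessarily our exact procedure - is to place the sampler inside an imbalanced-learn Pipeline, so that resampling happens only within each training fold; X_train/y_train are again assumed from the preprocessing sketch in Section 3.2.

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_validate

    # The sampler is applied inside each training fold only, never to the validation fold.
    pipe = Pipeline([
        ("smote", SMOTE(random_state=42)),
        ("clf", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)),
    ])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_validate(pipe, X_train, y_train, cv=cv, scoring=["accuracy", "f1"])
    print(scores["test_accuracy"].mean(), scores["test_f1"].mean())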
7.3 Future Work
Future research efforts will substantially broaden the
scope of our investigation. We plan to incorporate
more diverse and extensive datasets, encompassing
varied populations to enhance the generalizability and
robustness of the predictive models herein. Addition-
ally, further exploration into advanced Deep Learn-
ing methods such as Convolutional Neural Networks
(CNNs), Recurrent Neural Networks (RNNs), and hy-
brid Deep Learning models is anticipated. These so-
phisticated architectures could uncover deeper pat-
terns within complex medical data, potentially of-
fering substantial improvements in predictive perfor-
mance. Moreover, detailed preprocessing techniques, including advanced feature engineering and dimensionality reduction methods such as Principal Component
Analysis (PCA) and t-distributed Stochastic Neighbor
Embedding (t-SNE), will be rigorously examined to
optimize the feature set for enhanced predictive Ac-
curacy scores.
Addressing the interpretability of these models
will also be an essential aspect of our future work,
aiming to produce models that are not only accurate
but also easily interpretable by healthcare profession-
als. In addition to technical advancements, we intend
to deploy our model on accessible and user-friendly
platforms, such as interactive web applications or mo-
bile apps. This approach aims to facilitate real-time
diabetes risk assessment tools for healthcare profes-
sionals and patients alike, promoting proactive health
management.
Collaborations with clinical institutions and
healthcare providers will also be pursued to validate
our model further through extensive real-world clini-
cal trials and implementations, ensuring practical rel-
evance and efficiency in varied clinical domains.
ACKNOWLEDGMENT
The authors would like to thank the Department
of Computer Science at California State University,
Sacramento, for providing the necessary resources
and guidance throughout the course of this project.
We also extend our gratitude to the faculty members
and mentors who offered valuable insights and sup-
port during the research and experimentation phases.
Special thanks to the creators and managers of the
Pima-Indian Diabetes Dataset for making the dataset
publicly available, which served as a critical foun-
dation for our study. Finally, we would like to ac-
knowledge the contributions of all team members
whose dedication, collaboration, and shared problem-
solving efforts were instrumental in the successful
completion of this project.
REFERENCES
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,
W. P. (2002). Smote: Synthetic minority over-
sampling technique. Journal of Artificial Intelligence
Research, 16:321–357.
Demidova, L. and Klyueva, I. (2017). Svm classification:
Optimization with the smote algorithm for the class
imbalance problem. In 2017 6th Mediterranean Con-
ference on Embedded Computing (MECO), pages 1–4.
Kaliappan, J., Kumar, I. J. S., Sundaravelan, S., Anesh,
T., Rithik, R. R., Singh, Y., Vera-Garcia, D. V.,
Himeur, Y., Mansoor, W., Atalla, S., and Srinivasan,
K. (2024). Analyzing classification and feature selec-
tion strategies for diabetes prediction across diverse
diabetes datasets. Frontiers in Artificial Intelligence,
7:1421751.
Lugat, V. (2021). Pima indians diabetes - eda and prediction (0.906). https://www.kaggle.com/code/vincentlugat/pima-indians-diabetes-eda-prediction-0-906. Accessed: 2025-05-18.
Mooney, P. T. (2018). Predict diabetes from medical records. https://www.kaggle.com/code/paultimothymooney/predict-diabetes-from-medical-records. Accessed: 2025-05-18.
Nnamoko, N. and Korkontzelos, I. (2020). Efficient
treatment of outliers and class imbalance for dia-
betes prediction. Artificial intelligence in medicine,
104:101815.
Olisah, C. C., Smith, L. N., and Smith, M. L. (2022).
Diabetes mellitus prediction and diagnosis from a
data preprocessing and machine learning perspec-
tive. Computer methods and programs in biomedicine,
220:106773.
Poornima, V. and R., R. (2024). A hybrid model for predic-
tion of diabetes using machine learning classification
algorithms and random projection. Wirel. Pers. Com-
mun., 139:1437–1449.
Santos, M. S., Soares, J. P., Abreu, P. H., Araújo, H., and
Santos, J. A. M. (2018). Cross-validation for imbal-
anced datasets: Avoiding overoptimistic and overfit-
ting approaches [research frontier]. IEEE Computa-
tional Intelligence Magazine, 13:59–76.
Zhang, Z., Ahmed, K. A., Hasan, M. R., Gedeon, T., and
Hossain, M. Z. (2024). A deep learning approach to
diabetes diagnosis. In Asian Conference on Intelli-
gent Information and Database Systems, pages 87–99.
Springer.