patients in rural Bali, Indonesia, using the extended
health belief model. They found that demographic
factors such as age, education, employment, and
traditional beliefs, as well as clinical factors like
alcohol use, medication, and symptom duration,
alongside diabetes knowledge, explained 71.8% of
the variance in healthy behaviors. In a sample of
550,000 Chinese adults, Zhou et al. (Li, et al., 2020)
assessed the relationship between hereditary risk and
maintaining a healthy lifestyle. The results showed
that, even among genetically predisposed individuals,
leading a healthy lifestyle significantly lowered the
incidence of diabetes. Thirumurugan et al.
(Wondmkun, Obesity, et al., 2020) analyzed
the importance of feature selection in diabetes
prediction using machine learning, highlighting
factors such as obesity and glucose levels. They
demonstrated the substantial predictive capacity of
Random Forest classifiers. Shulman et al. (Asril,
Tabuchi, et al., 2020) focused on the relationship
between obesity, insulin resistance, and type 2
diabetes, discussing the physiological mechanisms
that contribute to insulin resistance. Singh et al.
developed an ensemble machine learning system
using classifiers like KNN and Random Forest to
enhance prediction accuracy for diabetes. Khan et al.
used support vector machines (SVM) and feature
extraction techniques like SIFT and SURF to classify
diabetic retinopathy images, achieving a sensitivity of
94%. Donini et al. (Donini, Monterio, et al., 2016)
employed multimodal multiple kernel learning for
Alzheimer’s detection, an approach that could be
adapted for diabetes-related research, while Gonen et
al. (Gonen, Alpaydin, et al., 2013) proposed localized
algorithms for multiple kernel learning that may
enhance prediction in heterogeneous medical
datasets. Freitas et al. (Mishra, Fasshauer, et al.,
2028) introduced a stabilized RBF-FD method with
hybrid kernels, offering potential improvements for
machine learning algorithms used in diabetes
prediction. Finally, Tiwari et al. performed a
comparative study using deep learning models such as
LSTM, showing that LSTM captured temporal
dependencies well and achieved an accuracy of 85%
in diabetes detection on the Pima Indian Diabetes
Database.
3 DATASET AND REVIEW
3.1 Gathering Dataset
This research uses datasets from Kaggle, a well-known
platform for machine learning and data science
competitions, to predict two diseases: diabetes and
stroke. The dataset used in each disease prediction
model is described in the sections that follow.
3.2 Data Preprocessing
Data preparation applies several techniques to ensure
that the data is appropriate and of good quality for
modeling and analysis. Handling missing data is an
essential step and is usually done with fillna(), which
replaces missing entries with a statistical parameter
such as the mean, median, or mode, so that data
integrity is maintained. Alternatively, dropna() yields
a clean dataset by removing rows or columns that
contain missing values. Another important step is data
type conversion, which aligns data formats with model
requirements and is made easier by astype(). Using
LabelEncoder(), categorical variables such as "sex" or
"smoker" (smoker vs. nonsmoker, for instance) are
encoded as numeric values that algorithms can
process. StandardScaler() standardizes features by
rescaling them to zero mean and unit variance.
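The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the toy records and the column names ("age", "bmi", "smoker", "hypertension") are assumptions made for the example, and the choice of median versus mean imputation per column is likewise illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy records; column names are illustrative, not the paper's schema.
df = pd.DataFrame({
    "age": [67.0, 45.0, np.nan, 52.0, 38.0],
    "bmi": [36.6, np.nan, 28.1, 31.4, 24.9],
    "smoker": ["yes", "no", "no", "yes", None],
    "hypertension": ["0", "1", "0", "0", "1"],
})

# dropna(): remove rows whose categorical value is missing.
df = df.dropna(subset=["smoker"]).reset_index(drop=True)

# fillna(): impute the remaining numeric gaps with a column statistic.
df["age"] = df["age"].fillna(df["age"].median())
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# astype(): convert a flag stored as text to an integer type.
df["hypertension"] = df["hypertension"].astype(int)

# LabelEncoder(): map category labels to integers (classes sorted alphabetically).
encoder = LabelEncoder()
df["smoker"] = encoder.fit_transform(df["smoker"])

# StandardScaler(): rescale numeric features to zero mean and unit variance.
numeric_cols = ["age", "bmi"]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df)
```

In a modeling workflow, the scaler and encoder would typically be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into preprocessing.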
3.3 Dataset Regarding Stroke Prediction
The dataset used to predict strokes was obtained from
Kaggle and contains 40,028 records with 11 attributes.
These features, considered critical for evaluating
stroke risk factors, include
demographic details such as age, gender, and marital
status, as well as medical history like hypertension
and heart disease. The dataset also includes lifestyle
factors such as smoking habits and work type, which
provide deeper insights into individual risk profiles.
Furthermore, the attributes encompass key
physiological parameters like average glucose levels
and body mass index (BMI), which are strongly
associated with stroke occurrence.
Table 1: Features of the dataset and their descriptions.
Features Descri