sufficient evidence for comparing different models. A major influence on performance is the choice of degrees of freedom in each algorithm: the higher the degrees of freedom, the higher the variance and the lower the bias of the reducible test error. To assess the performance of different statistical learning models, error measures such as the loss function can be used to identify a good parameter setting (Iniesta et al., 2016). In general, the performance of a supervised machine learning algorithm is measured by mean squared error or R-squared. For algorithms that estimate binary outcomes, however, accuracy, recall, precision, and F-score play the significant roles (James et al., 2023).
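To make these measures concrete, the following minimal sketch (not drawn from the cited works; the arrays are illustrative toy values) computes mean squared error and R-squared for a regression output, and accuracy, precision, recall, and F-score for a binary classification output, using scikit-learn:

```python
from sklearn.metrics import (mean_squared_error, r2_score, accuracy_score,
                             precision_score, recall_score, f1_score)

# Toy regression example: continuous targets and predictions.
y_true_reg = [3.0, 2.5, 4.0, 5.1]
y_pred_reg = [2.8, 2.7, 3.6, 5.0]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("R^2:", r2_score(y_true_reg, y_pred_reg))

# Toy binary-classification example: 1 = positive class, 0 = negative class.
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 0, 1]
print("Accuracy :", accuracy_score(y_true_cls, y_pred_cls))
print("Precision:", precision_score(y_true_cls, y_pred_cls))
print("Recall   :", recall_score(y_true_cls, y_pred_cls))
print("F-score  :", f1_score(y_true_cls, y_pred_cls))
```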
The main purpose of supervised machine learning algorithms is to produce a description of the outcome of interest (James et al., 2023). In the real world, data is everywhere, and humans have long selected what seemed reasonable for research. There is broad consensus that the fifth industrial revolution, building on the fourth, will see humans and machines working together harmoniously to contribute to different aspects of the real world (Jain et al., 2020). Data, as the resource that feeds supervised machine learning, is precious and treasured. Data is widely used in medical research; recently, AlphaFold, a novel machine learning algorithm trained on the current Protein Data Bank, has estimated protein structures with high accuracy (Jiang et al., 2020).
However, since supervised machine learning is data-driven, data selection is an essential step: data cleaning, noise labeling, class imbalance handling, data transformation, data valuation, and data homogeneity all play a more or less crucial role in raising the upper limit of prediction achievable by machine learning (Jumper et al., 2021). Moreover, just as humans introduce bias, machines produce bias during the sub-tasks of an algorithm. Scientists therefore seek to identify and label the possible sources of bias in order to reduce it (Mahapatra et al., 2022). High-quality data refers to parameters that are sufficiently well-defined and appropriately interpreted, as well as relevant predictors and outcomes that are rigorously collected. Prior to analysis, data preprocessing requires data cleaning, data reduction, and data transformation (Noble et al., 2022).
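As a hedged illustration of these preprocessing steps (the dataset and column names below are hypothetical, not from the cited studies), a typical pipeline might drop incomplete records, scale numeric features, and encode categorical ones:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Hypothetical raw student records with a missing value and mixed types.
raw = pd.DataFrame({
    "study_hours": [2.0, 3.5, None, 1.0],
    "sleep_hours": [7.0, 6.5, 8.0, 5.5],
    "stress_level": ["low", "high", "medium", "high"],
})

# Data cleaning: remove rows with missing entries.
clean = raw.dropna()

# Data transformation: scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["study_hours", "sleep_hours"]),
    ("cat", OneHotEncoder(), ["stress_level"]),
])
X = preprocess.fit_transform(clean)
print(X.shape)  # rows kept after cleaning x transformed feature count
```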
2 METHODOLOGY
Linear regression establishes a relationship between
independent variables, such as students' study hours,
sleep hours, physical activity hours, and
extracurricular hours, and a dependent variable, GPA.
It aims to find the best-fitting line in a multi-
dimensional space by minimizing the sum of squared
residuals. When dealing with categorical features like
stress level, one-hot encoding is used to convert them
into numerical representations. Linear regression is
widely applied in predictive modeling and trend
analysis due to its simplicity and interpretability.
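A minimal sketch of this setup follows; the feature names and toy values are hypothetical placeholders for the student dataset described above, combining one-hot encoding of the categorical stress level with an ordinary least-squares fit:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Hypothetical student records: numeric hours plus a categorical stress level.
X = pd.DataFrame({
    "study_hours":       [2, 4, 6, 8, 3, 5],
    "sleep_hours":       [8, 7, 6, 5, 7, 6],
    "physical_activity": [1, 2, 1, 0, 3, 2],
    "extracurricular":   [0, 1, 2, 1, 0, 2],
    "stress_level":      ["low", "medium", "high", "high", "low", "medium"],
})
y = [2.8, 3.1, 3.4, 3.0, 3.2, 3.3]  # toy GPA values

# One-hot encode stress_level, pass numeric columns through, then fit least squares.
encode = ColumnTransformer(
    [("stress", OneHotEncoder(), ["stress_level"])],
    remainder="passthrough",
)
model = make_pipeline(encode, LinearRegression())
model.fit(X, y)
print(model.predict(X.head(1)))  # predicted GPA for the first student
```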
Polynomial regression extends linear regression
by incorporating polynomial terms of the input
features, allowing for the modeling of nonlinear
relationships. By increasing the degree of the
polynomial, the model gains more flexibility in
capturing complex data patterns. However, higher-
degree polynomials may lead to overfitting, making
regularization techniques or cross-validation essential
to balance model complexity and generalization
ability.
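The sketch below illustrates this idea on synthetic data (the degrees and sample values are arbitrary assumptions): polynomial features are generated from a single input, and cross-validation scores are compared across degrees to guard against overfitting.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic nonlinear data: y is quadratic in x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + rng.normal(scale=0.3, size=60)

# Compare polynomial degrees with 5-fold cross-validation (R^2 by default).
for degree in (1, 2, 5, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5)
    print(f"degree {degree:2d}: mean CV R^2 = {scores.mean():.3f}")
```

In this sketch the quadratic model typically scores best under cross-validation, while very high degrees fit the training folds closely but generalize worse.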
Logistic regression is a classification algorithm
that predicts the probability of a binary outcome by
applying the sigmoid function and mapping linear
combinations of features to probability values
between 0 and 1. The model is commonly used for
problems such as disease prediction, spam detection,
and credit risk assessment. Although logistic
regression assumes a linear decision boundary, it can
be extended to nonlinear problems using polynomial
features or kernel methods.
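A minimal sketch of the sigmoid mapping and a scikit-learn fit follows; the pass/fail labels and study-hour values are illustrative assumptions, not results from the dataset analyzed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sigmoid function: maps a linear combination z = w.x + b to a probability in (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5: the decision boundary
print(sigmoid(3.0))  # close to 1: a confident positive prediction

# Toy binary problem: predict pass (1) / fail (0) from study hours.
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict_proba([[4.5]]))  # probabilities of fail vs. pass at 4.5 hours
```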
SVM classifies data by finding the optimal
hyperplane that maximizes the margin between
different classes. It uses support vectors—data points
closest to the decision boundary—to define the
optimal separation. SVM is applicable to both linear
and nonlinear classification problems by employing
different kernel functions, such as polynomial, radial
basis function (RBF), and sigmoid kernels. Especially
in high-dimensional spaces, SVM is effective and
accurate and has been used in image recognition, text
classification, and bioinformatics.
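As a hedged sketch of kernelized SVM classification (the moon-shaped synthetic data stands in for a nonlinear problem and is not the dataset used in this study), an RBF kernel can separate classes that no linear hyperplane could:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Nonlinearly separable toy data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF-kernel SVM: the margin is defined by the support vectors.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("support vectors:", clf.n_support_)      # per-class support vector counts
print("test accuracy  :", clf.score(X_test, y_test))
```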
A decision tree follows a hierarchical structure
where each node represents a decision based on
specific rules derived from the data. It splits the
dataset iteratively at each level, aiming to maximize
information gain or minimize impurity (measured by
criteria such as Gini index or entropy). Decision trees
can be easily interpreted and can handle both numerical
and categorical data, but they are prone to overfitting
when the tree is too deep. Pruning techniques and
ensemble methods can help improve performance.
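The following sketch (using a standard scikit-learn toy dataset rather than the dataset of this paper) fits a depth-limited decision tree with the Gini criterion and prints its split rules, illustrating how limiting depth and applying cost-complexity pruning act as simple guards against overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Standard toy dataset: numerical features, three classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gini-impurity splits; max_depth limits tree growth to curb overfitting,
# and ccp_alpha enables cost-complexity pruning.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, ccp_alpha=0.01)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # human-readable split rules, node by node
```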