combined with correlation analysis and descriptive
statistics, to explore the factors influencing GPA. This
research gives senior students a basis for adjusting
their time allocation to improve their GPA.
2 METHODOLOGY
2.1 Data Source and Description
The data used in this study come from the Kaggle
open data platform. The dataset, titled “Daily
Lifestyle and Academic Performance of Students”,
contains data from 2,000 students collected via a
Google Form survey (researched by Sumit Kumar,
ranging from 1 August 2023 to 24 October 2024)
(Kaggle, 2024). It includes information on study
hours, extracurricular activities, sleep, socializing,
physical activity, stress levels, and CGPA. The data
cover an academic year from August 2023 to May
2024 and reflect student lifestyles primarily from
India.
2.2 Indicator Selection and Description
In this study, the selected indicators are divided into
quantitative and qualitative variables. The
quantitative independent variables include: study
hours per day, extracurricular hours per day, sleep
hours per day, social hours per day, and physical
activity hours per day. All these quantitative variables
undergo normalization to scale the data within a
specific range, ensuring comparability. Moreover, the
qualitative independent variable is Stress Level
(students' stress level), which is processed through
label encoding. The categories are encoded as: Low =
1, Moderate = 2, and High = 3. Meanwhile, the
dependent variable is GPA, which represents students'
academic performance and is used directly without
any processing. The dataset also contains an
identifier column, “Student_ID”, which is unrelated to
this research and is therefore removed.
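
The preprocessing described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the column names and sample rows are hypothetical, and min-max scaling to [0, 1] is an assumed choice, since the text only states that the quantitative variables are normalized to a specific range.

```python
# Illustrative sketch only: column names and sample rows are hypothetical.
STRESS_CODES = {"Low": 1, "Moderate": 2, "High": 3}  # encoding given in the text
NUMERIC_COLS = ["Study_Hours_Per_Day", "Sleep_Hours_Per_Day"]  # assumed names

rows = [
    {"Student_ID": 1, "Study_Hours_Per_Day": 5.0, "Sleep_Hours_Per_Day": 6.0,
     "Stress_Level": "Low", "GPA": 2.8},
    {"Student_ID": 2, "Study_Hours_Per_Day": 7.5, "Sleep_Hours_Per_Day": 8.0,
     "Stress_Level": "Moderate", "GPA": 3.1},
    {"Student_ID": 3, "Study_Hours_Per_Day": 10.0, "Sleep_Hours_Per_Day": 7.0,
     "Stress_Level": "High", "GPA": 3.6},
]


def min_max(values):
    """Scale values linearly to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(rows, numeric_cols):
    """Drop the ID column, normalize numeric columns, encode stress level."""
    # Copy each record without the identifier; the input rows stay unchanged.
    cleaned = [{k: v for k, v in r.items() if k != "Student_ID"} for r in rows]
    for col in numeric_cols:
        scaled = min_max([r[col] for r in cleaned])
        for record, value in zip(cleaned, scaled):
            record[col] = value
    for record in cleaned:
        record["Stress_Level"] = STRESS_CODES[record["Stress_Level"]]
    return cleaned  # GPA is left untouched, as stated in the text


cleaned = preprocess(rows, NUMERIC_COLS)
```

Label encoding with 1, 2, 3 treats stress level as ordinal with equal spacing between categories, which is the assumption implied by the encoding in the text.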
2.3 Methodology Introduction
This research utilizes multiple linear regression
(MLR) as the primary statistical method to analyze
the factors affecting students' academic performance
(GPA). Multiple linear regression is a widely used
analytical technique that models the relationship
between a single dependent variable and multiple
independent variables. The general mathematical
representation of the regression model is formulated
as follows:
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon \tag{1} $$
where $y$ represents the dependent variable, which
in this study corresponds to students' GPA.
$x_1, x_2, \ldots, x_n$ denote the independent variables,
including study hours, extracurricular activities, sleep
duration, social interactions, physical activity, and
stress level. $\beta_0$ is the intercept term, representing
the expected value of GPA when all independent
variables are zero. $\beta_1, \beta_2, \ldots, \beta_n$ are the
regression coefficients, which quantify the impact of
each independent variable on GPA. These coefficients
indicate the magnitude and direction of influence that
each predictor variable has on academic performance.
$\epsilon$ is the error term, which accounts for variations in
GPA that cannot be explained by the included
independent variables. It is assumed to follow a
normal distribution.
To estimate the regression coefficients, this study
employs the least squares estimation method, which
minimizes the sum of squared differences between the
observed values and the predicted values of GPA. The
goodness-of-fit of the model is evaluated using the
R-squared ($R^2$) statistic, which measures the
proportion of variance in GPA that is explained by the
independent variables included in the model. A higher
$R^2$ value suggests stronger explanatory power of the
model.
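
As a minimal sketch of least squares estimation and the R-squared statistic, the example below solves the normal equations in plain Python. The function names (`solve`, `fit_ols`, `r_squared`) and the synthetic data are illustrative assumptions, not the study's code; in practice a numerical or statistics library would normally be used for this step.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ols(X, y):
    """Return [b0, b1, ..., bn] minimizing the sum of squared residuals."""
    Xd = [[1.0] + list(row) for row in X]  # prepend intercept column
    p = len(Xd[0])
    XtX = [[sum(r[i] * r[j] for r in Xd) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(Xd, y)) for i in range(p)]
    return solve(XtX, Xty)  # normal equations: (X'X) b = X'y


def r_squared(X, y, beta):
    """Proportion of variance in y explained by the fitted model."""
    preds = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data generated from y = 1 + 2*x1 + 0.5*x2 (no noise).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1.0, 3.0, 1.5, 3.5, 5.5]
beta = fit_ols(X, y)
r2 = r_squared(X, y, beta)
```

Because the synthetic data contain no noise, the fit recovers the generating coefficients up to floating-point error and R-squared is essentially 1; with real survey data the residual term would be nonzero.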
Before constructing the regression model, this
study conducts a correlation analysis to assess the
strength and direction of the relationships between
GPA and each independent variable. Pearson’s
correlation coefficient (r) is computed for this
purpose. The coefficient r ranges from -1 to 1 and is
interpreted as follows. When r > 0, the correlation is
positive: as the independent variable increases, GPA
tends to increase. When r < 0, the correlation is
negative: as the independent variable increases, GPA
tends to decrease. When r ≈ 0, there is no
meaningful linear relationship between the variables.
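
The computation of Pearson's r can be illustrated with a short sketch. The function name `pearson_r` and the sample values are hypothetical and serve only to show the sign conventions described above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: GPA rises with study hours and falls with social hours.
study_hours = [4.0, 5.0, 6.0, 7.0]
social_hours = [5.0, 4.0, 3.0, 2.0]
gpa = [2.6, 2.9, 3.2, 3.5]

r_study = pearson_r(study_hours, gpa)    # positive correlation
r_social = pearson_r(social_hours, gpa)  # negative correlation
```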
Moreover, descriptive statistics serve as the
foundation for GPA analysis by summarizing the
dataset, identifying patterns, and ensuring data quality
before performing deeper statistical modeling. By
using these methods, researchers can better interpret
how lifestyle factors influence academic performance
and set the stage for more complex analytical
techniques like correlation analysis and multiple
linear regression.
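
A minimal sketch of such a descriptive summary is given below, using Python's standard `statistics` module. The helper name `describe` and the sample GPA values are assumptions for illustration; the text does not specify which summary measures were reported.

```python
import statistics


def describe(values):
    """Summarize one variable with common descriptive statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Hypothetical GPA values used only to demonstrate the summary.
gpa_sample = [2.5, 3.0, 3.5]
summary = describe(gpa_sample)
```

Applying the same helper to each variable gives a quick check on ranges and spread before correlation analysis and regression.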
To enhance the accuracy and interpretability of
the regression model, data preprocessing is
implemented before the analysis. This step includes
handling missing data, identifying and addressing
outliers, and applying normalization techniques to
continuous variables where necessary.
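
The missing-data and outlier steps can be sketched as follows. The text does not state which rules were applied, so dropping incomplete records and flagging values outside 1.5 times the interquartile range are assumed choices, and the function names and sample values are hypothetical.

```python
import statistics


def drop_missing(rows):
    """Keep only records in which every field has a value (assumed policy)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the IQR rule (assumed outlier criterion)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return True for each value lying outside the IQR fences."""
    lo, hi = iqr_bounds(values, k)
    return [v < lo or v > hi for v in values]


# Hypothetical records: one has a missing sleep value.
records = [
    {"Sleep_Hours_Per_Day": 7.0, "GPA": 3.1},
    {"Sleep_Hours_Per_Day": None, "GPA": 2.9},
]
complete = drop_missing(records)

# Hypothetical sleep hours with one implausible entry.
sleep_hours = [6, 7, 7, 8, 8, 9, 30]
flags = flag_outliers(sleep_hours)
```

Whether flagged values are removed, capped, or kept is a separate decision that depends on whether they reflect entry errors or genuine behavior.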