generate a set of decision trees with controlled
variance (Fawagreh et al., 2014). Since its development, the method has been applied in numerous fields, including agriculture, ecology, land cover classification, remote sensing, wetland classification, bioinformatics, biological and genetic association studies, genomics, and quantitative structure-activity relationship modelling (Tyralis et al., 2019). In the field of management in particular, Random Forest is widely used in areas such as financial management and risk management (Zhao, 2022).
However, despite considerable prior effort, most previous studies examine a single factor and its influence on employee performance, usually with qualitative methods. There is therefore still room for a comprehensive study that considers multiple factors and compares their influences on performance, especially one using machine learning methods. RF provides variable importance measures for assessing the significance of predictors, while Polynomial Regression is designed to capture non-linear relationships. RF and Polynomial Regression are therefore considered suitable models for identifying the significant factors and exploring the relationship between multiple factors and employee performance.
Using RF and Polynomial Regression to analyse the effects of various factors on worker performance, this paper endeavours to study: 1) the factors that influence employee performance and their relationships with it; and 2) ways to leverage these factors to improve employee performance. First, RF is used to identify the most important features; Polynomial Regression is then employed to explore the correlations between these features and employee performance.
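To make the second stage concrete, the following is a minimal sketch of fitting a polynomial regression on the RF-selected features with scikit-learn; the polynomial degree and the variable names are illustrative assumptions rather than the settings used in this study.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def fit_polynomial(X_top, y, degree=2):
    """Fit a polynomial regression on the RF-selected top features.

    X_top : the most important features identified by RF
    y     : employee performance scores
    degree: polynomial degree (2 is an illustrative choice)
    """
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_top, y)
    return model
```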
2 METHOD
2.1 Dataset Preparation
The dataset used is provided by Richard and Carla on the platform “Kaggle” (Richard, 2020). This dataset was chosen because it covers various dimensions of factors that may influence employee performance, including employees’ engagement, satisfaction, and salary. The original dataset contains 35 columns and 312 rows.
Then, 22 columns are deleted because they are unrelated to the objectives of this study, overlap with other columns in meaning, or may introduce bias (e.g., racial or gender bias). This leaves 13 columns, comprising 5 numeric features and 8 categorical features.
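As an illustration of this selection step (the file name and the specific columns below are assumptions based on the public Kaggle dataset; the paper does not list the exact 13 retained columns):

```python
import pandas as pd

# Load the raw dataset; the file name is an assumption.
df = pd.read_csv("HRDataset.csv")

# Hypothetical feature lists: 5 numeric and 8 categorical columns,
# with the target, PerformanceScore, counted among the categorical ones.
numeric_features = ["Salary", "EngagementSurvey", "EmpSatisfaction",
                    "SpecialProjectsCount", "Absences"]
categorical_features = ["Department", "Position", "MaritalDesc",
                        "EmploymentStatus", "RecruitmentSource",
                        "ManagerName", "TermReason"]
target = "PerformanceScore"

# Keep the 13 relevant columns; IDs, dates, redundant fields, and
# bias-prone fields (e.g. race or sex) are dropped.
df = df[numeric_features + categorical_features + [target]]
```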
After that, the dataset was preprocessed to guarantee data quality and model compliance before the analysis was performed. Separate pipelines were built for the numerical and categorical features. For the numerical features, a two-step sequential pipeline was developed: missing values were first imputed with the median to ensure reliable handling of missing data, and the data was then standardized with StandardScaler, centring it around 0 with a standard deviation of 1. A different pipeline processed the categorical features: missing values were first imputed with the most frequent value to preserve data integrity, and the categorical variables were then transformed into a binary format with OneHotEncoder, which allows the models to interpret them correctly and to handle unknown categories gracefully. Finally, a ColumnTransformer was instantiated to combine the numerical and categorical pipelines, applying the appropriate transformation to each specified feature in the dataset.
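The preprocessing just described can be expressed with scikit-learn as follows (a sketch reusing the feature lists above; the step names and the handle_unknown argument are assumptions consistent with the description):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Numerical pipeline: median imputation, then zero-mean/unit-variance scaling.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical pipeline: most-frequent imputation, then one-hot encoding;
# handle_unknown="ignore" lets the encoder cope gracefully with categories
# unseen during training.
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each pipeline to its own subset of columns.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])
```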
2.2 Random Forest
For this investigation, the preprocessed data was first split into a training set and a testing set, containing 80% and 20% of the data respectively. The class imbalance in the dataset was then addressed by combining the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN). By ensuring a fair representation of the target variable classes in the training data, this resampling technique improved the model’s capacity to generalize. Next, GridSearchCV was used to optimize the Random Forest classifier’s performance through hyperparameter tuning. The parameter grid covered the number of estimators, the maximum tree depth, the minimum number of samples required to split a node, and the minimum number of samples per leaf. By exhaustively evaluating the parameter combinations, GridSearchCV determined the configuration that maximized the predictive power of the model.
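This stage can be sketched with scikit-learn and imbalanced-learn as follows; the candidate grid values, the five-fold cross-validation, and the random seeds are illustrative assumptions, since the paper specifies only which hyperparameters were tuned.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from imblearn.combine import SMOTEENN

X, y = df.drop(columns=[target]), df[target]

# 80/20 train-test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the preprocessing transformer on the training data only, then
# rebalance the classes with combined SMOTE over-sampling and ENN cleaning.
X_train_pre = preprocessor.fit_transform(X_train)
X_train_res, y_train_res = SMOTEENN(random_state=42).fit_resample(
    X_train_pre, y_train)

# Grid over the four tuned hyperparameters; candidate values are assumptions.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train_res, y_train_res)

# Importances from the best estimator, paired with the transformed
# feature names for plotting.
importances = search.best_estimator_.feature_importances_
feature_names = preprocessor.get_feature_names_out()
```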
Furthermore, the feature importances were extracted
from the best-performing Random Forest classifier.
The top ten most significant features influencing
employee performance were displayed in a horizontal