the datasets from season 2015-16 to season 2019-20.
This is because the EPL has a cycle of relegation and
promotion: the teams that finish in the bottom three
of the league table at the end of the campaign are
relegated to the Championship, the second tier of
English football. In this sense, only the Big 6 teams
can stay in the Premier League for a relatively long
time, because their strength keeps them from being
relegated, which provides sustainable and stable data
for training. After obtaining enough training data,
the model then studied the entire Premier League
match results season by season, even though some
teams played for only one season, which makes
prediction considerably more difficult. As a result, the
“RandomForestClassifier” predicted 84.15% of the
matches correctly, which comes close to the initial
expectation of this research.
3.3 Details of Training
In the experiment, the AdaBoost classifier is designed
to predict the outcomes of EPL matches in the same
way as the baseline method does. The training dataset
is processed and split using a custom preprocessing
function, where 90% of the data is used for training
and 10% for testing. The model is trained using the
`.fit(X_train, y_train)` method on the training data.
Once trained, the AdaBoost classifier is used to
predict the outcomes of specific EPL matches
between teams like Liverpool and Manchester City by
converting match details into feature vectors. The
model's predictions are then compared against the
actual outcomes to evaluate its performance.
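The training procedure described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the experiment: the synthetic data from `make_classification` stands in for the preprocessed EPL features (the custom preprocessing function is not shown here), the three classes are assumed to represent home win, draw, and away win, and the hyperparameters are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the preprocessed match features.
# Three classes are assumed to mean home win / draw / away win.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_classes=3,
    random_state=0,
)

# Hold out 10% of the data for testing, train on the remaining 90%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Fit the boosted ensemble on the training split only.
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Compare predictions against the held-out outcomes.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Held-out accuracy: {test_accuracy:.2%}")
```

Predicting a single fixture would follow the same pattern: the match details are encoded into one feature vector with the same columns as `X_train` and passed to `model.predict`.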
As a result, the “AdaBoostClassifier” predicted
86.65% of the matches correctly, which satisfies the
initial expectation of this research. The detailed
training results are shown in Figure 3 below. The
RandomForestClassifier baseline reaches a prediction
accuracy of 84.15% across the EPL and 91.93% for
Big 6 teams. The research method is slightly better
than the baseline: it reaches a prediction accuracy of
86.65% across the EPL and 93.12% when Big 6 teams
play against each other.
Figure 3: Prediction results of the “Random Forest
Classifier” and the “AdaBoost Classifier”
(Picture credit: Original)
3.4 Model Evaluation
Initially, the expectation of this research was to
create a machine learning model that can properly
clean up the datasets and predict over 85% of the
matches correctly. The chosen model,
“AdaBoostClassifier”, satisfied this basic requirement
with an overall accuracy of 86.65%. However, this
accuracy is calculated by the “accuracy()” function
in the model, and the result in practice is somewhat
worse than this figure suggests. When predicting all
match results of the 2020-2021 EPL season, the model
was correct for 309 out of 380 games, which gives a
practical accuracy of 81.32%. This difference may be
caused by the referee factor and by the inherent
uncertainty of football itself. Since the referee is
excluded from the training dataset from the start, the
model assumes that all referees are fair and that only
the teams’ performance and tactics determine the
outcomes. Because the model cannot reach its
theoretical accuracy in practice, many aspects can
still be improved.
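The practical accuracy quoted above follows directly from the match counts. The short sketch below reproduces the arithmetic; the helper name `accuracy` is chosen here for illustration and is not the “accuracy()” function of the model.

```python
def accuracy(correct: int, total: int) -> float:
    """Return the fraction of correctly predicted matches."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Figures reported for the 2020-2021 EPL season.
practical = accuracy(309, 380)

# Accuracy reported on the held-out test split.
reported = 0.8665

# Difference expressed in percentage points.
gap_points = (reported - practical) * 100

print(f"Practical accuracy: {practical:.2%}")  # 81.32%
print(f"Gap to reported accuracy: {gap_points:.2f} percentage points")  # 5.33
```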
4 CONCLUSIONS
The research successfully demonstrated the
effectiveness of the AdaBoost classifier in predicting
English Premier League match outcomes, achieving
an accuracy of 86.65%, which slightly outperforms
the baseline Random Forest model's 84.15%. While
the model met the initial goal of over 85% accuracy,
practical application showed a slightly lower
performance of 81.32% when tested on all matches in
the 2020-2021 EPL season. This discrepancy is
attributed to the exclusion of certain variables, such
as referee decisions and other external factors, during
the data preprocessing stage. Future work can focus
on incorporating these additional variables to further
improve prediction accuracy and enhance the model's
robustness in handling the complexities of football
matches. Additionally, experimenting with different
machine learning algorithms and optimizing the
AdaBoost model could provide further advancements
in sports analytics.
REFERENCES
Anfilets, S., Bezobrazov, S., Golovko, V., Sachenko, A.,
Komar, M., Dolny, R., Kasyanik, V., Bykovyy, P.,
Mikhno, E., & Osolinskyi, O. (2020). DEEP