
tated - especially for the blind. Moreover, our architecture adopts a modular and scalable design, so that it can readily incorporate future extensions, more advanced language models, or newer voice-synthesis tools.
Our methodology comprises a four-stage pipeline: (1) object detection using YOLOv9 to identify players, the ball, referees, and goals; (2) player tracking with ByteTrack; (3) rule-based event detection; and (4) template-based natural language generation, with speech output synthesized via gTTS.
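For concreteness, the sketch below chains the four stages over a video stream. It is illustrative rather than a faithful reproduction of our implementation: the weights file, the detect_events rules, and the render_commentary templates are hypothetical placeholders, and the ByteTrack wrapper from the supervision library stands in for our tracker.

# Illustrative sketch of the four-stage pipeline (all names are placeholders).
import cv2
from ultralytics import YOLO   # YOLOv9 checkpoints load through this API
import supervision as sv       # provides a ByteTrack wrapper
from gtts import gTTS

model = YOLO("yolov9c.pt")     # assumed checkpoint; ours would be fine-tuned on football classes
tracker = sv.ByteTrack()

def detect_events(tracks):
    # Stage 3 placeholder: map tracked positions to events such as passes or shots.
    return []

def render_commentary(event):
    # Stage 4 placeholder: fill a fixed template with event attributes.
    return f"{event['type']} by player {event['player_id']}"

cap = cv2.VideoCapture("match.mp4")
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    detections = sv.Detections.from_ultralytics(model(frame)[0])  # stage 1: detection
    tracks = tracker.update_with_detections(detections)           # stage 2: tracking
    for event in detect_events(tracks):                           # stage 3: event rules
        gTTS(render_commentary(event)).save("commentary.mp3")     # stage 4: NLG + speech
cap.release()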
The unique aspect of this work lies in the seamless integration of real-time visual perception and language generation. Unlike prior approaches that isolate either analytics or commentary, our proposed system unifies these components into a coherent pipeline that enables intelligent and explainable football commentary generation from raw video input. Moreover, this work not only advances the technical capabilities of sports AI but also contributes meaningfully to the accessibility and automation of sports media.
2 REVIEW OF LITERATURE
Over the past decade, the blending of AI and multimedia processing has had a tremendous effect on real-time sports analysis. Football (soccer) has attracted the most attention among sports because of its standardized play, global fan base, and the complexity of its spatial and temporal interactions. Advances in Deep Learning (DL), Computer Vision (CV), and NLP techniques have been jointly harnessed to automate football-match comprehension, annotation, and commentary. The following subsections present a historical review of selected work and the theoretical background that form the basis of our proposed YOLOv9-based Football Commentary System.
2.1 Related Work and Datasets
The availability of well-annotated and large-scale
datasets is one of the most crucial enablers for intel-
ligent sports analytics. SoccerNet (Giancola et al.,
2018) is a pioneering benchmark for action detec-
tion in soccer broadcasts, with over 500 fully anno-
tated games and precise timestamps for key events
like goals, cards, and substitutions. This dataset has
been instrumental in pushing the development of tem-
poral event detection models from unstructured video
data.
The Metrica Sports Open Data project (Metrica Sports, 2025; Du and Wang, 2021) releases synchronized ball and player tracking data from professional games, with accurate 2D positional information per frame, which allows training multi-object tracking (MOT) algorithms under realistic game conditions.
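As a brief illustration of what frame-synchronized positional data enables, the sketch below estimates one player's speed from consecutive 2D positions; the file name, column names, and frame rate are assumptions for illustration, not Metrica's exact schema.

# Per-player speed from frame-synchronized tracking data (illustrative schema).
import pandas as pd

FPS = 25                                  # assumed tracking frame rate
df = pd.read_csv("home_tracking.csv")     # one row per frame (hypothetical file)
dx = df["player7_x"].diff()               # pitch coordinates in metres (assumed)
dy = df["player7_y"].diff()
speed_mps = (dx**2 + dy**2) ** 0.5 * FPS  # per-frame displacement scaled to m/s
print(speed_mps.describe())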
Earlier research by Lu et al. (2013) focused on ball tracking in broadcast videos using hand-designed detection and motion cues, addressing difficulties such as motion blur, occlusion, and sudden scene changes caused by camera motion, problems that remain relevant in today's real-time systems. Together, these datasets and methods form the groundwork for developing complete systems that integrate event detection, object tracking, and semantic interpretation of football games.
2.2 Object Detection, Tracking, and
Event Recognition
Object detection is the foundation for higher-level semantic analysis of football video. Our detection framework is built on the YOLO family of models, specifically YOLOv9. As introduced by Redmon et al. (2016), YOLO reframed detection as a single regression task, enabling real-time operation. YOLOv9 advances its predecessors through programmable gradient information (PGI) and the GELAN architecture, which significantly improve detection accuracy for small, rapidly moving objects, a capability crucial for identifying the ball and players amidst dense formations.
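A minimal detection call is sketched below to make the single-pass formulation concrete; the checkpoint name and class set are assumptions standing in for our fine-tuned model, while the remaining calls follow the public ultralytics API.

# Single forward pass: one regression over the frame yields all boxes at once.
from ultralytics import YOLO

model = YOLO("yolov9-football.pt")         # assumed checkpoint fine-tuned on
                                           # {player, ball, referee, goal}
result = model("frame.jpg", conf=0.25)[0]  # confidence threshold is illustrative
for box in result.boxes:
    label = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{label}: ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}) conf={float(box.conf):.2f}")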
To ensure identity consistency across frames, our system incorporates Deep SORT (Simple Online and Realtime Tracking with a Deep Association Metric) (Wojke et al., 2017). Whereas standard SORT relies on Kalman filtering and the Hungarian algorithm for motion-based prediction and assignment, Deep SORT adds a CNN-based appearance descriptor for reliable re-identification under occlusion, fast motion, and camera transitions, all of which occur frequently in football footage.
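The motion-based association step shared by SORT-style trackers can be made concrete with a small self-contained sketch: Kalman-predicted track boxes are matched to new detections by solving an assignment problem over an IoU cost matrix. Deep SORT additionally blends an appearance-distance term into this cost, which we omit here for brevity.

# Hungarian matching of predicted track boxes to fresh detections via IoU cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (union + 1e-9)

def associate(predicted, detected, iou_min=0.3):
    # Cost of pairing track i with detection j is 1 - IoU(i, j).
    cost = np.array([[1.0 - iou(p, d) for d in detected] for p in predicted])
    rows, cols = linear_sum_assignment(cost)
    # Pairs below the IoU gate are treated as lost tracks or new tracks instead.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_min]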
The combination of YOLOv9 and Deep SORT ensures temporally stable tracking, assigning consistent IDs to players, referees, and the ball. This stable output is essential for downstream components such as event recognition and team assignment. To derive useful gameplay context, we integrate event detection following the method of Sha et al. (2020), who introduced a CNN model that combines spatial and temporal features to identify important events such as passes, goals, fouls, and shots.
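As a simple example of the rule-based stage in our pipeline, the sketch below flags a goal when the tracked ball centre crosses the goal line inside the goal mouth between consecutive frames; the calibrated pitch coordinates are assumptions, and production rules would add temporal smoothing and possession context.

# Rule-based goal check on calibrated pitch coordinates (assumed 105 x 68 m pitch).
GOAL_LINE_X = 105.0            # assumed x-coordinate of the right goal line (metres)
GOAL_MOUTH_Y = (30.34, 37.66)  # 7.32 m goal mouth centred at y = 34

def is_goal(prev_ball, ball):
    # Fires when the ball centre crosses the line within the goal mouth.
    if prev_ball is None or ball is None:
        return False
    crossed = prev_ball[0] < GOAL_LINE_X <= ball[0]
    in_mouth = GOAL_MOUTH_Y[0] <= ball[1] <= GOAL_MOUTH_Y[1]
    return crossed and in_mouth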
This integrated framework, which includes object
detection, multi-object tracking, and spatio-temporal