Classifying Soccer Ball-on-Goal Position Through Kicker Shooting Action

Javier Torón Artiles, Daniel Hernández-Sosa, Oliverio J. Santana, Javier Lorenzo-Navarro and David Freire-Obregón
SIANI, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
Keywords:
Computer Vision, Soccer, Free Kick, Human Action Recognition, Dataset.
Abstract:
This research addresses whether the ball’s direction after a soccer free-kick can be accurately predicted solely
by observing the shooter’s kicking technique. To investigate this, we meticulously curated a dataset of soccer
players executing free kicks and conducted manual temporal segmentation to identify the moment of the kick
precisely. Our approach involves utilizing neural networks to develop a model that integrates Human Action
Recognition (HAR) embeddings with contextual information, predicting the ball-on-goal position (BoGP)
based on two temporal states: the kicker’s run-up and the instant of the kick. The study encompasses a
performance evaluation for eleven distinct HAR backbones, shedding light on their effectiveness in BoGP
estimation during free-kick situations. An extra tabular metadata input is introduced, leading to a measurable
model enhancement without introducing bias. The promising results reveal 69.1% accuracy when considering
two primary BoGP classes: right and left. This underscores the model’s proficiency in predicting the ball’s
destination towards the goal with high accuracy, offering promising implications for understanding free-kick
dynamics in soccer.
1 INTRODUCTION
In the 2021/22 season, the top 20 revenue-generating clubs collectively generated revenues of €9.2 billion, marking a 13% increase from the previous season and nearly reaching the pre-pandemic levels of 2018/19. This resurgence was driven by the return of fans to stadiums, resulting in a significant increase in matchday revenue, which rose from €111 million to €1.4 billion. The revenue composition of clubs in 2021/22 returned to pre-pandemic levels, with 15% from matchday activities, 44% from broadcasting, and 41% from commercial sources (Deloitte, 2023).
Furthermore, the data indicates that the 2022 FIFA World Cup, held in Qatar, garnered the highest viewership in the tournament's history, with over five billion spectators tuning in through diverse platforms, surpassing more than half of the global population (FIFA, 2022).
ORCID iDs: Javier Torón Artiles, https://orcid.org/0009-0000-5082-310X; Daniel Hernández-Sosa, https://orcid.org/0000-0003-3022-7698; Oliverio J. Santana, https://orcid.org/0000-0001-7511-5783; Javier Lorenzo-Navarro, https://orcid.org/0000-0002-2834-2067; David Freire-Obregón, https://orcid.org/0000-0003-2378-4277
This remarkable financial performance, as well as the widespread global viewership of soccer events, underscores the tremendous potential and impact of soccer as a mass sport. Furthermore, the evolution of soccer continues beyond these outstanding statistics. The introduction of technology into the sport is emerging as a
pivotal factor, shaping both its on-field dynamics and
off-field engagement. According to Microsoft, dur-
ing a match, players navigate the entire field at high
speed, necessitating the deployment of up to 16 fixed
cameras for optical tracking positioned around the
perimeter of each stadium, capturing a staggering 3.5
million data points per game (Microsoft, 2023). This
data is subsequently processed through the Media-
coach platform, making it accessible to clubs and fans
through match broadcasts and digital content. Mi-
crosoft also remarks that the data strategy is designed
to give clubs invaluable insights for adapting training
schedules, scrutinizing opponents, and preparing for
match days.
In this context, the integration of technology into
soccer has brought about a significant transformation
in how the sport is played, assessed, and enjoyed.
Several studies and technological innovations have
highlighted the potential of technology to enhance
Figure 1: Free-Kick BoGP Classification. Our proposal involves a thorough analysis of free-kick actions by integrating
data from various sources, including free-kick metadata and HAR embeddings. Critically, our classifier combines contextual
information with the two-stream action recognition embeddings to make accurate predictions regarding the ball’s placement
concerning the goal. It is important to note that these experiments relied solely on visual observations of the kicker during the
shot without factoring in any ball trajectory data.
various aspects of soccer. Notably, some studies in-
troduced a visual analytic system that combines video
recordings with abstract visualizations of trajectory
data, enabling analysts to delve deep into ball, player,
or team behavior (Stein et al., 2018; Kamble et al.,
2019; He, 2022). Furthermore, some comprehensive
datasets have been introduced to facilitate the local-
ization of crucial events within extended soccer video
footage (Giancola et al., 2018; Deliège et al., 2021).
In addition, an automatic method was proposed to lo-
calize sports fields in broadcast images, eliminating
the need for manual annotation or specialized cam-
eras (Homayounfar et al., 2017). Lastly, some ana-
lytic systems were developed to visually represent the
spatiotemporal evolution of team formations, aiding
analysts in understanding and tracking the dynamic
aspects of soccer strategies (Wu et al., 2019; Li et al.,
2023). These technological advancements have no-
tably transformed sports analysis and enhanced the
fan experience in soccer, revealing new insights and
engagement opportunities. Nevertheless, unexplored
possibilities persist. While previous studies have en-
riched our understanding of the game, untapped areas
exist where technology can drive substantial advance-
ments in soccer. For instance, incorporating predic-
tive analytics in free-kick actions could lead to the
creation of advanced algorithms that account for fac-
tors like goal distance, angle, kicker skills, defensive
wall positioning, and even the goalkeeper’s historical
performance in stopping free kicks.
This work represents a significant step in ad-
vancing our understanding of ball-on-goal position
(BoGP) in the context of free kicks directed toward
the opponent’s goal. Utilizing HAR backbones, we
have crafted a BoGP classifier, benchmarking our
models against a novel and extensive collection of
free-kicks. To accomplish this, we have gathered
and processed free-kick footage from various sources
on the Internet. Building upon this dataset, multiple
models that integrated contextual information and uti-
lized pre-trained HAR encoders (commonly referred
to as backbones) were tested to predict the final des-
tination of the kicked ball into the goal. Notably, our
methodology incorporates two crucial stages as inputs
to the model: the running and the kicking stages, both
depicted in Figure 1.
The significance of this approach lies in the fact
that it captures the dynamic nature of a free-kick, al-
lowing our classifier to consider the player’s approach
and the moment of impact. This nuanced perspective
is pivotal for a more accurate and comprehensive un-
derstanding of BoGP in free kicks. Furthermore, we
conducted two distinct analyses. The first analysis in-
volved categorizing the goal into three classes (left,
center, and right), providing a fine-grained BoGP as-
sessment. The second analysis simplified the catego-
rization into two classes (left and right), allowing for
a broader perspective on BoGP accuracy. This dual
approach enabled a deeper exploration of free-kick
complexities; please refer to Figure 1.
Our contributions can be summarized as follows:
- We introduce a novel soccer free-kick dataset comprising 603 short clips from actual matches. This dataset has been curated from online sources and is readily accessible to the public.
- Through a series of experiments, we empirically showcase the feasibility of addressing the BoGP challenge by employing a classifier that combines contextual data with a two-stream approach. Each stream offers a distinct embedding path, encompassing the running stage and the kicking stage of the free-kick process.
- Within the scope of this study, we conduct a comparative analysis of eleven different HAR backbone architectures, assessing their respective performance in BoGP classification.
- An in-depth error analysis study was undertaken to evaluate how the various classes influence the performance of the top-performing model.
The subsequent sections of this paper are struc-
tured as follows. Section 2 discusses previous related
work. Section 3 outlines the proposed pipeline. Sec-
tion 4 details the experimental setup and presents the
results. Section 5 offers an analysis of errors. Lastly,
Section 6 draws our conclusions.
2 RELATED WORK
Sports analysis has consistently captured the commu-
nity’s attention, leading to a substantial surge in pub-
lished research over the past decade. In this sporting
domain, technology has become an integral and trans-
formative force, significantly shaping our understand-
ing of sports, as well as how athletes train and com-
pete. This section offers a comprehensive examina-
tion of two specific elements addressed in this study:
datasets in sports and their computational applications.
The available sports video datasets can be cate-
gorized into two main groups: still-image and video-
sequence datasets. The first group encompasses
datasets primarily designed for image classification.
For instance, the UIUC Sports Event Dataset com-
prises 1,579 images spanning eight sports event cat-
egories (Li and Li Fei-Fei, 2007). Each category may
contain subsets of images ranging from 180 to 205,
categorized as easy or medium based on human sub-
ject judgments. Another noteworthy collection is the
Leeds Sports Pose Dataset (Johnson and Everingham,
2010), featuring 2,000 pose-annotated images of athletes gathered from the Internet. Each image includes annotations for 14 joint locations. More recently, ultra-distance running competitions have also been captured in the wild (Penate-Sanchez et al., 2020).
In contrast, the video-sequence datasets offer time
series information about the actions occurring within
the scene. These sequences are typically captured us-
ing stationary cameras. Sequences from individual
sports provide a suitable context for activity recog-
nition, while sequences from team sports can be used
for player tracking and event detection. In this con-
text, many sports datasets have been assembled from
international competitions to advance research in au-
tomatic quality assessment for sports. Some of the
most recent datasets include the MTL-AQA diving
dataset (Parmar and Morris, 2019b), the UNLV AQA-
7 dataset, which includes diving, gymnastic vaulting,
skiing, snowboarding, and trampoline (Parmar and
Morris, 2019a), and the Fis-V skating dataset (Xu
et al., 2020). These datasets have been collected in
controlled, non-obstructed environments, with excep-
tions like the UNLV AQA-7 snowboarding and skiing
subsets, gathered in quiet conditions with a dark sky
(night) and snowy ground.
The semantic structure of sports video content can
be categorized into four layers: raw video, object,
event, and semantic layers (Shih, 2018). The foun-
dation of this pyramid consists of raw video input,
from which objects are identified in the higher layers.
Specifically, critical objects featured in video clips are
recognized through object extraction, such as players
(Guo et al., 2020) and object tracking, including the
ball (Wang et al., 2019) and players (Lee et al., 2020).
The event layer signifies the actions of critical objects.
Various actions, combined with scene information,
generate event labels that depict the related actions
and interactions among multiple objects. Research in
areas like action recognition (Freire-Obregón et al., 2022), re-identification (Akan and Varli, 2023; Freire-Obregón et al., 2023), facial expression recognition
(Brick et al., 2018; Santana et al., 2023), trajectory
prediction (Teranishi et al., 2020), and highlight de-
tection (Gao et al., 2020) falls within the scope of
this layer. The topmost layer, the semantic layer, is
responsible for summarizing the semantic content of
the footage (Cioppa et al., 2018). As our objective is
BoGP, we seek to classify the outcome of a free-kick
action. Furthermore, the mentioned collections pre-
dominantly feature professional athletes. In this con-
text, our work does not address the team dimension,
as it specifically focuses on a particular action. Never-
theless, several pivotal individuals are visible during
this action, including the kicker, the referee, the other
players, especially those forming the defensive wall,
and the goalkeeper.
3 DESCRIPTION OF THE
PROPOSAL
This paper introduces and assesses a sequential
pipeline consisting of two core modules, where video
Figure 2: Context Removal. For every frame at time t, the process entails isolating the kicker’s bounding box, which is then
superimposed onto a stable background derived from the mean of τ frames.
Figure 3: Video Pre-Processing Module. The initial
video material undergoes a pre-processing phase wherein
the kicker is separated from a dynamic background. Follow-
ing this, two sets of frames are manually chosen to delineate
the running and kicking stages. The remaining frames are
excluded.
pre-processing is performed manually before entering the pipeline. The core modules are a stage-embeddings extraction module and a classifier, preceded by this manual video pre-processing step. Figures 4 and 5 depict visual representations of these modules, while Figure 3 illustrates the executed video pre-processing. The following subsections comprehensively describe the video pre-processing step and each module.
3.1 Context Constraint
In order to optimize the quality of the embeddings
generated by the backbone, it is imperative to ensure
that the input footage provided to the action recogni-
tion networks is devoid of extraneous elements, as in-
dicated in a prior study (Freire-Obregón et al., 2022).
Within the context of the dataset utilized for the ex-
periments detailed in this research, as described in
Section 4.1, these extraneous elements encompass un-
related players, staff, supporters, and referees. Given
their lack of relevance to the purpose of this work,
an initial pre-processing phase is conducted to refine
the raw input data by isolating the primary subject,
i.e., the kicker. This task is accomplished by lever-
aging ByteTrack (Zhang et al., 2021), a multi-object
tracking network that can precisely track the kicker
within each video footage, see Figure 2. Following
this, a context-constrained pre-processing technique
is applied to establish an ideal setting for conducting
the experiments.
In the context of acquiring context-constrained video frames for a specific kicker k at a given time t within a specified time interval [0, T], the bounding box BB_k(t) plays a crucial role. This bounding box outlines the area occupied by kicker k within the frame recorded at time t. To facilitate this process, two primary factors are considered: the bounding box area of the kicker, BB_k(t), and the average number of frames required, τ, to establish a static background against which the isolated kicker k appears in the pre-processed video frame. The resulting pre-processed frame F_k(t) is generated through the following equation:

F_k(t) = BB_k(t) ⊕ B̄_τ
Here, the operation involves aligning and superim-
posing the bounding box of kicker k onto the aver-
age of the selected τ frames. This sequence of pre-
processed frames constitutes the new video footage,
with the kicker as the sole moving element.
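A minimal sketch of this compositing step, assuming the per-frame kicker bounding boxes have already been produced by the tracker (the array layouts and names below are illustrative, not the authors' implementation):

```python
import numpy as np

def context_constrain(frames, boxes, tau):
    """Isolate the kicker against a static background.

    frames: (T, H, W, 3) uint8 array of video frames.
    boxes:  per-frame kicker bounding boxes (x1, y1, x2, y2) from the tracker.
    tau:    number of frames averaged to build the static background.
    """
    # Static background: pixel-wise mean over tau frames.
    background = frames[:tau].mean(axis=0).astype(frames.dtype)

    out = np.empty_like(frames)
    for t, (x1, y1, x2, y2) in enumerate(boxes):
        frame = background.copy()
        # Superimpose the kicker's bounding-box crop onto the background,
        # leaving the kicker as the sole moving element.
        frame[y1:y2, x1:x2] = frames[t, y1:y2, x1:x2]
        out[t] = frame
    return out
```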
Lastly, as depicted in Figure 3, the resultant
footage is temporally segmented. This manual seg-
mentation identifies two distinct moments aligned
with the kicker’s actions: the running stage and the
kicking stage. Any elements in the video, such as
the free-kick outcome or the kicker’s reaction, have
been excluded from the analyzed stages. This study
focuses exclusively on the running stage (the phase in
which the kicker approaches the ball) and the kick-
ing stage (comprising the 16 frames before and the 16
frames after the ball is kicked).
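For illustration, the two stage windows could be cut from an annotated clip as follows (a sketch; kick_idx stands for the manually annotated kick frame, and treating the running stage as everything preceding the kicking window is our assumption):

```python
def split_stages(frames, kick_idx, half_window=16):
    """Return (running_stage, kicking_stage) from an annotated clip."""
    # Kicking stage: the 16 frames before and 16 frames after ball contact.
    kicking = frames[kick_idx - half_window : kick_idx + half_window]
    # Running stage: the approach phase preceding the kicking window
    # (assumed here to start at the beginning of the trimmed clip).
    running = frames[: kick_idx - half_window]
    return running, kicking
```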
3.2 Stage-Embeddings Extraction
Figure 4: Embeddings Extraction Module. Each stage footage undergoes downsampling, dividing it into n smaller clips. A pre-trained human-action model is then applied to extract features from these clips. These features are combined using a pooling technique, resulting in a final tensor that serves as input to the classifier. This work examines two pooling methods, average pooling and max pooling.

The preprocessed input footage for each stage, consisting of m frames, undergoes a twofold procedure. Initially, the footage is downsampled, which results in its division into n video clips, represented as v_1, ..., v_n, where each clip comprises a sequence of q consecutive frames that encapsulate a snapshot of the activity, see Figure 4.
partial overlap, spaced one frame apart from the pre-
ceding one. These video clips traverse a pre-trained
HAR encoder (backbone), producing r-dimensional
feature vectors. It is worth noting that these encoder
models have undergone prior training on the Kinet-
ics 400 dataset, which encompasses a broad spectrum
of 400 action categories (Kay et al., 2017). Follow-
ing the acquisition of feature vectors for all n video
clips, a pooling layer aggregates the contributions from all clips into a single embedding. In this regard, we have evaluated both average and max pooling layers, as seen in Section 4.
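A sketch of this downsample-encode-pool procedure, assuming backbone is any callable that maps a clip of q frames to an r-dimensional feature vector:

```python
import torch

@torch.no_grad()
def stage_embedding(frames, backbone, q, pooling="average"):
    """Compute one stage embedding from pre-processed footage.

    frames:   (T, C, H, W) float tensor for one stage (running or kicking).
    backbone: callable mapping a (1, C, q, H, W) clip to a (1, r) feature.
    q:        frames per clip for the chosen backbone (see Tables 1 and 2).
    """
    # Overlapping clips spaced one frame apart: clip i covers frames [i, i+q).
    clips = [frames[i:i + q] for i in range(frames.shape[0] - q + 1)]
    # One r-dimensional feature vector per clip, stacked into (n, r).
    feats = torch.cat(
        [backbone(c.permute(1, 0, 2, 3).unsqueeze(0)) for c in clips], dim=0)
    # Pool the n clip features into a single r-dimensional stage embedding.
    return feats.mean(dim=0) if pooling == "average" else feats.max(dim=0).values
```

In practice, backbone could be one of the Kinetics-400 pre-trained models exposed through PyTorchVideo's torch.hub entry points, e.g. torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True), with its classification head removed.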
We have chosen eleven backbones to test our approach to the BoGP problem. Some backbones are more complex (SlowFast and I3D) than others (the X3D instances and C2D). This section offers
an overview of the HAR models considered for this
study. The C2D (Convolutional 2D) model, designed
for video action classification (Simonyan and Zisser-
man, 2014), exploits the power of 2D Convolutional
Neural Networks (CNN) for spatial feature extraction
from video frames. Its architecture comprises con-
volutional layers, pooling layers, and fully connected
layers. Convolutional layers extract spatial features
while pooling layers reduce dimensionality to prevent
overfitting. The C2D model processes each frame in-
dependently, employing CNNs to extract spatial fea-
tures, which are combined to capture temporal action
dynamics.
In contrast to the C2D model, the SlowFast model
is conceived based on the principle that different
video segments possess diverse temporal resolutions
and contain crucial information for action recognition
(Feichtenhofer et al., 2018). For example, some ac-
tions occur swiftly and necessitate high temporal res-
olution for detection, while others unfold more slowly
and can be recognized with a lower temporal resolu-
tion. To address this variability, the SlowFast model
adopts a dual-pathway approach, comprising fast and
slow pathways that operate on video data at varying
temporal resolutions.
Similarly, Slow adopts a two-stream architecture
to capture both short-term and long-term temporal
dynamics in videos (Feichtenhofer et al., 2021). Its
slow pathway processes high-resolution frames but at
a lower frame rate, similar to the C2D model. Addi-
tionally, Slow incorporates a temporal-downsampling
layer to capture longer-term temporal dynamics. The
Inflated 3D ConvNet (I3D) model is designed to han-
dle short video clips as 3D spatiotemporal volumes,
enabling the capture of both appearance and mo-
tion cues using a two-stream approach (Carreira and
Zisserman, 2017). In this design, the first stream
deals with RGB images, utilizing weights that are
pre-trained on extensive image classification datasets.
Simultaneously, the second stream processes optical
flow images and undergoes fine-tuning in conjunction
with the RGB stream.
A revised variant of the I3D model, I3D NLN, in-
corporates non-local operations to enhance spatiotem-
poral dependency modeling in videos (Wang et al.,
2017). I3D NLN retains the two-stream architecture
involving RGB and optical flow streams, processing
3D spatiotemporal volumes. In contrast to the Incep-
tion module, I3D NLN employs non-local blocks ca-
pable of learning long-range dependencies across fea-
ture map positions. By computing weighted sums of
input features from all positions based on the sim-
ilarity between these positions in the feature maps,
I3D NLN captures global context information and im-
proves the modeling of temporal dynamics.
Finally, we have leveraged four X3D model varia-
tions, distinguished by their sizes: extra small (X3D-
XS), small (X3D-S), medium (X3D-M), and large
(X3D-L). Each expansion incrementally transforms
X2D from a compact spatial network to a spatiotem-
poral X3D network (Feichtenhofer, 2020) by modi-
fying temporal (frame rate and sampling rate), spa-
tial (footage resolution), width (network depth), and depth dimensions (number of layers and units). X3D-XS results from five expansion steps, followed by X3D-S, which includes one backward contraction step after the seventh expansion. X3D-M and X3D-L are generated by the eighth and tenth expansions, respectively. X3D-M augments the spatial resolution by elevating the spatial sampling resolution of the input video. At the same time, X3D-L expands the spatial resolution and network depth by increasing the number of layers in each residual stage.

Figure 5: The proposed classifier. Features from the HAR backbones for running and kicking stages are processed alongside free-kick metadata, combining information from various sources to contribute to the model's decision-making process. The features extracted from the HAR backbone offer a fine-grained understanding of the kicker's movements. At the same time, free-kick metadata provides valuable context, influencing the classification outcome, particularly in diverse free-kick scenarios.
3.3 Classifier
The proposed classifier involves feature extraction
from the identical HAR backbone for both the run-
ning and kicking stages, as well as the inclusion of
free-kick metadata, see Figure 5.
This three-input approach combines information
from various sources, each contributing unique and
complementary insights to the model’s decision-
making process. The features extracted from the
HAR backbone offer a fine-grained understanding of
the kicker’s movements and actions during the free
kick. Simultaneously, free-kick metadata provides
valuable context and situational information that can
significantly influence the classification outcome, es-
pecially when dealing with various free-kick scenar-
ios. In this regard, the free-kick metadata encom-
passes four distinct input variables, each contributing
specific information to the model’s decision-making
process. These variables include pitch side, free-kick
side, free-kick distance, and kicker foot. The pitch
side variable operates as a binary indicator, distin-
guishing between left and right. In contrast, the free-
kick side variable offers a more detailed classification,
representing three distinct values related to the shoot-
ing point: left to the goal, center to the goal, and right
to the goal. Similarly, free-kick distance, another bi-
nary variable, provides insight into whether the free
kick occurs near or far from the penalty box. Lastly,
the kicker foot variable, also binary, characterizes the
preferred kicking foot as either left or right.
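Encoded naively, the metadata path thus receives a short fixed-length vector. One possible encoding, with the three-valued free-kick side one-hot encoded and the binary variables as single bits (the field names and ordering are illustrative assumptions):

```python
import numpy as np

FK_SIDE = {"left": (1, 0, 0), "center": (0, 1, 0), "right": (0, 0, 1)}

def encode_metadata(pitch_side, fk_side, near_box, left_footed):
    """Pack the four annotated variables into a fixed-length input vector."""
    return np.array([
        1.0 if pitch_side == "right" else 0.0,  # pitch side (binary)
        *FK_SIDE[fk_side],                      # free-kick side (three values)
        1.0 if near_box else 0.0,               # distance: near the penalty box?
        1.0 if left_footed else 0.0,            # preferred kicking foot
    ], dtype=np.float32)

# Example: right-footed kicker, shot from the left, far from the penalty box.
print(encode_metadata("left", "left", near_box=False, left_footed=False))
```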
As a result, the model receives three distinct in-
puts, each of which undergoes processing via dedi-
cated fully connected layers with varying units (16
and 128) based on the nature of the input. The run-
ning and kicking paths also include batch normaliza-
tion layers. Subsequently, all paths are concatenated,
followed by two fully connected layers (128 and 64
units, respectively), separated by a batch normaliza-
tion layer. Finally, the model’s output, denoting the
ball’s position on the goal, is determined by either a
Sigmoid or a Softmax layer, depending on whether
the output comprises two or three classes.
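A PyTorch sketch of this three-input head, following the layer sizes reported in Figure 5; the activation functions, the metadata input width, and the exact placement of batch normalization are assumptions (r denotes the backbone embedding dimension):

```python
import torch
import torch.nn as nn

class BoGPClassifier(nn.Module):
    def __init__(self, r: int, meta_dim: int = 6, n_classes: int = 2):
        super().__init__()
        # One dedicated path per input: Dense/128u (+ BatchNorm) for each
        # embedding stream, Dense/16u for the free-kick metadata.
        self.run_path = nn.Sequential(nn.Linear(r, 128), nn.ReLU(), nn.BatchNorm1d(128))
        self.kick_path = nn.Sequential(nn.Linear(r, 128), nn.ReLU(), nn.BatchNorm1d(128))
        self.meta_path = nn.Sequential(nn.Linear(meta_dim, 16), nn.ReLU())
        # Fused head: Dense/128u -> BatchNorm -> Dense/64u -> output layer.
        self.head = nn.Sequential(
            nn.Linear(128 + 128 + 16, 128), nn.ReLU(), nn.BatchNorm1d(128),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1 if n_classes == 2 else n_classes),
        )

    def forward(self, run_emb, kick_emb, meta):
        x = torch.cat([self.run_path(run_emb),
                       self.kick_path(kick_emb),
                       self.meta_path(meta)], dim=1)
        logits = self.head(x)
        # Sigmoid for the two-class setting, Softmax for three classes.
        if logits.shape[1] == 1:
            return torch.sigmoid(logits)
        return torch.softmax(logits, dim=1)
```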
In the conventional classification framework, the
primary objective is to assign a sample to its appro-
priate class. In this context, we have conducted two
experiments on the ball’s positioning within the goal.
The first experiment considers three distinct classes
(left, right, and center), while the second experiment
operates as a binary classifier, explicitly distinguish-
ing between left and right placements. Consequently,
we employ the categorical cross-entropy loss function
for the first experiment:
Loss
1
=
C
i=1
y
i
log(p
i
) (1)
Where C is the number of classes, y
i
is the true
probability distribution (one-hot encoded vector) of
the ground truth class, and p
i
is the predicted proba-
bility for class i. For the second experiment, the con-
sidered loss function to tackle the problem is the bi-
nary cross-entropy:
Loss
2
=
1
N
N
i=1
(y
i
log(p
i
) + (1 y
i
)log(1 p
i
))
(2)
Where p
i
is the i-th scalar value in the model out-
put, y
i
is the corresponding target value, and N is the
number of scalar values in the model output.
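As a worked check of Equations 1 and 2 in plain numpy (toy values, one-hot targets):

```python
import numpy as np

def categorical_ce(y_true, p):
    """Eq. 1: -sum_i y_i * log(p_i) over C classes (one-hot y_true)."""
    return -np.sum(y_true * np.log(p))

def binary_ce(y_true, p):
    """Eq. 2: mean binary cross-entropy over the N model outputs."""
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Three-class case: ground truth "left", prediction leaning "left".
print(categorical_ce(np.array([1, 0, 0]), np.array([0.7, 0.2, 0.1])))  # ~0.36
# Two-class case: ground truth 1 ("right"), predicted probability 0.8.
print(binary_ce(np.array([1.0]), np.array([0.8])))                     # ~0.22
```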
Figure 6: Free-kick Dataset Sequences. The video dataset used in this study was gathered from the Internet without any
imposed usage restrictions. Due to this unrestricted collection approach, the dataset exhibits notable pose, scale, and lighting
conditions variability. Video clips were carefully edited to retain frames from just before the running stage until the moment
of the free-kick outcome.
4 EXPERIMENTAL SETUP
This section is divided into three subsections re-
lated to the dataset acquisition, experimental setup,
and achieved results of the designed experiments.
The first subsection provides technical details regard-
ing the dataset, including its acquisition and data-
cleaning processes. The second subsection outlines
the technical aspects of our proposal, such as the data
split. Finally, the third subsection summarizes the
achieved results.
4.1 Dataset
To our knowledge, there is no publicly available soc-
cer free-kick dataset. Our data collection approach
hinges on generality, intending to construct robust de-
tection models for practical use. This compilation of
videos was sourced from the Internet without any us-
age restrictions, resulting in considerable variations in
pose, scale, and lighting conditions, see Figure 6. The
data collection process encompasses three steps:
1. Web Scraping: an extensive search was conducted to gather relevant videos using keywords like ”free-kick soccer”, ”free-kick compilation”, and the names of various soccer players well known for frequently shooting free kicks.
2. Shot Labeling: labeling involves carefully edit-
ing each video clip. These clips are trimmed to
cover the period from just before the kicker ini-
tiates the run to the occurrence of the shot out-
come. This stage results in a subset of 603 free-
kick clips.
3. Manual Annotation: each free-kick clip is man-
ually reviewed and annotated. Annotations en-
compass various variables, including pitch side,
free-kick side, free-kick distance, kicker foot (left
or right), kick outcome, barrier configuration,
gender, goalkeeper zone, and the specific frame in
which the ball is kicked. The resolution of these
clips is 1920 × 1080 pixels.
Despite the initial inclusion of 603 free-kick clips
in the dataset, several factors reduced this number. A
critical consideration was the camera viewpoint, as
it played a substantial role in the selection process.
To maintain shooting action stability, clips where the
camera perspective was positioned behind the goal-
keeper or the kicker were excluded. As described in
Section 3.1, the remaining 584 videos underwent peo-
ple detection using ByteTrack. Unfortunately, some
videos exhibited low image quality, resulting in sub-
par detection performance. As a consequence, the
dataset was further reduced to 539 clips.
Subsequently, the duration of the videos became
a focal point, as clips that were excessively short
in length were found to be inadequate for extract-
ing meaningful information. For instance, videos
commencing precisely as the player initiated the kick
(without a preceding running stage) were omitted
from consideration due to the need for a minimum
frame count to extract pertinent information. All clips
containing fewer than 32 frames were accordingly ex-
cluded, ultimately reducing the dataset to 451 clips.
Table 1: Comparative Performance Analysis of HAR Architectures for BoGP Estimation when Considering Three
Classes. This table compares different backbone architectures used to detect BoGP during free-kick shots. The first column
lists the backbone models, while the second column specifies the number of frames the model utilizes for generating HAR
embeddings. The table also reports the applied pooling method and four performance metrics: accuracy, precision, recall, and F1-Score.
Backbone #Frames Pooling Accuracy Precision Recall F1-Score
C2D (Simonyan and Zisserman, 2014) 8 Average 52.9% 49.4% 43.1% 46.1%
I3D (Carreira and Zisserman, 2017) 8 Average 51.4% 42.7% 39.6% 41.1%
I3D NLN (Wang et al., 2017) 8 Average 51.9% 44.6% 41.2% 42.8%
Slow4x16 (Feichtenhofer et al., 2021) 4 Average 55.0% 49.4% 44.6% 46.9%
Slow8x8 (Feichtenhofer et al., 2021) 8 Average 55.3% 46.1% 41.5% 43.7%
SlowFast4x16 (Feichtenhofer et al., 2018) 32 Max 55.0% 47.1% 43.9% 45.4%
SlowFast8x8 (Feichtenhofer et al., 2018) 32 Average 53.4% 47.4% 45.2% 46.2%
X3D-XS (Feichtenhofer, 2020) 4 Max 51.2% 46.3% 43.9% 45.1%
X3D-S (Feichtenhofer, 2020) 4 Max 53.4% 44.9% 43.5% 44.2%
X3D-M (Feichtenhofer, 2020) 13 Max 53.6% 47.9% 43.0% 45.3%
X3D-L (Feichtenhofer, 2020) 16 Average 57.2% 50.0% 48.5% 49.3%
The problem’s intrinsic nature also emerged as a
significant determining factor during clip selection.
Specifically, any clips in which the kick did not suc-
cessfully reach the goal, such as instances where the
ball failed to surpass the defensive barrier, were omit-
ted. In such cases, it was infeasible to ascertain the
target location within the goal, rendering these clips
inapplicable. Therefore, a refined subset of 418 clips
was designated for inclusion in this study.
4.2 Experimental Setup
The results presented in this section refer to the av-
erage accuracy on five repetitions of 10-fold cross-
validation for each experiment. Significantly, the
class distribution within the dataset is characterized
as follows: 187 free-kick shots are directed towards
the left side of the goal, 181 are aimed at the right
side, and 50 target the center area of the goal. The
class distribution exhibits a notable imbalance, par-
ticularly in the case of the center-side shots. We have
implemented a class weighting strategy during the
model training phase to address this issue. The adjust-
ment of class weights in the training process serves
to amplify the model’s sensitivity to minority classes,
effectively mitigating the inherent challenge of dis-
parate class distributions. This approach serves as a
valuable mechanism to rectify any potential bias aris-
ing from the overrepresentation of majority classes,
thereby ensuring equitable model performance across
all classes.
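One conventional realization of this setup combines inverse-frequency class weights (scikit-learn's "balanced" heuristic, applied here to the 187/181/50 left/right/center split) with repeated stratified 10-fold cross-validation; whether the authors used exactly this heuristic is not stated:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 187 + [1] * 181 + [2] * 50)  # left, right, center

# Inverse-frequency weights: the minority "center" class gets the largest weight.
weights = compute_class_weight("balanced", classes=np.array([0, 1, 2]), y=y)
print(dict(zip(["left", "right", "center"], weights.round(2))))
# -> roughly {'left': 0.75, 'right': 0.77, 'center': 2.79}

# Five repetitions of 10-fold cross-validation, stratified by class.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
for train_idx, test_idx in cv.split(np.zeros(len(y)), y):
    pass  # train the classifier on train_idx, evaluate on test_idx
```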
4.3 Results
Table 1 presents a comparative performance analysis
of various HAR backbone architectures utilized to es-
timate the BoGP during free-kick shots, specifically
when considering three different target classes: left,
center, and right. The table highlights the number
of frames used for each embedding backbone (de-
noted as q in Section 3.2), the pooling method em-
ployed, and key performance metrics including accu-
racy, precision, recall, and F1-Score. The presented
HAR backbone architectures encompass a range of
models described in Section 3.2. Each model is eval-
uated based on the aforementioned metrics, providing
valuable insights into their effectiveness in BoGP es-
timation during free-kick situations.
A noteworthy observation pertains to the choice of
pooling layers for the HAR embeddings (see Figure
4). The data presented in Table 1 reveals an intrigu-
ing trend: lighter models, exemplified by the X3D in-
stances, tend to favor the utilization of the MaxPool
layer, while heavier models typically demonstrate a
preference for the AveragePool layer. This distinction
in pooling layer selection reflects these models’ di-
verse architectural considerations and requirements,
underscoring the need to suit the pooling method to
the specific characteristics and demands of a given
HAR model.
The table prominently illustrates the distinct per-
formance levels exhibited by various models. X3D-L,
in particular, stands out as the top performer, boast-
ing the highest accuracy (57.2%), precision (50.0%),
recall (48.5%), and F1-Score (49.3%). Following
closely in classification performance are the SlowFast
and Slow instances, although they lag by a margin of
Table 2: Comparative Performance Analysis of HAR Architectures for Soccer Player Free-Kick Shoot Zone Estimation
when Considering Two Classes. This table compares different backbone architectures used to detect soccer player shoot
zones during free-kick shots. The first column lists the backbone models, while the second column specifies the number of
frames the model utilizes for generating HAR embeddings. The table also reports the pooling method applied and four performance metrics: accuracy, precision, recall, and F1-Score. These metrics offer valuable insights into the effectiveness of each backbone architecture for this specific task.
Backbone #Frames Pooling Accuracy Precision Recall F1-Score
C2D (Simonyan and Zisserman, 2014) 8 Max 67.4% 56.5% 60.2% 58.3%
I3D (Carreira and Zisserman, 2017) 8 Average 63.1% 51.3% 56.4% 53.7%
I3D NLN (Wang et al., 2017) 8 Max 62.8% 52.6% 51.7% 52.2%
Slow4x16 (Feichtenhofer et al., 2021) 4 Average 66.9% 60.0% 68.3% 63.9%
Slow8x8 (Feichtenhofer et al., 2021) 8 Max 65.8% 57.6% 67.7% 62.2%
SlowFast4x16 (Feichtenhofer et al., 2018) 32 Average 69.1% 57.7% 76.1% 65.7%
SlowFast8x8 (Feichtenhofer et al., 2018) 32 Max 63.6% 56.6% 69.9% 62.5%
X3D-XS (Feichtenhofer, 2020) 4 Max 61.9% 47.2% 48.3% 47.7%
X3D-S (Feichtenhofer, 2020) 4 Average 64.4% 50.9% 74.3% 60.4%
X3D-M (Feichtenhofer, 2020) 13 Max 66.0% 58.4% 51.4% 54.7%
X3D-L (Feichtenhofer, 2020) 16 Max 65.8% 59.7% 56.6% 58.1%
Figure 7: Three-class SlowFast4x16 confusion matrix.
Figure 8: Two-class SlowFast4x16 confusion matrix.
2.2% in accuracy. It is worth noting that the overall
performance in the context of three-class classifica-
tion remains relatively modest, as evidenced by the
F1-Score, though exceeding that of a random classi-
fier. Section 5 provides a comprehensive error analy-
sis.
To complete our evaluation, Table 2 presents a
comparative performance analysis of various HAR
backbone architectures used in soccer player free-
kick shoot zone estimation when considering two
classes: left and right. Once again, the models are
evaluated in this scenario based on details about the
number of frames used, the pooling method applied,
and the four key performance metrics: accuracy, pre-
cision, recall, and F1-Score. Comparing this table
with the previously discussed Table 1, we observe an
interesting transition regarding the number of classes
considered. The simplification of the classification
task has a notable impact on model performance.
Despite the reduced complexity of the classification
problem, there are variations in the performance of
the backbone architectures, indicating that the choice
of backbone remains critical. Performance-wise, sev-
eral observations can be made. For instance, Slow-
Fast4x16 exhibits the highest accuracy (69.1%) in this
two-class classification scenario, outperforming other
models. Additionally, Slow4x16 achieves a remark-
able 60.0% precision, indicating its ability to accu-
rately classify instances. The F1-Score, which com-
bines precision and recall, is also noteworthy, with
SlowFast4x16 leading the way with a score of 65.7%.
These metrics provide valuable insights into the effec-
tiveness of the backbone architectures for the specific
task of soccer player free-kick shoot zone estimation.
In contrast to the outcomes in the three-class scenario,
the utilization of MaxPool and AveragePool layers is
evenly distributed in this table.
The architecture of the classifier described in Sec-
tion 3.3 poses an intriguing question: how does the
Label: Left - Prediction: Center | Label: Right - Prediction: Center
Figure 9: SlowFast4x16 Misclassified Clips. These frames represent the ultimate phase of two distinct samples. It is
important to note that the proposed model exclusively examines the actions of the kicker, meaning it does not consider any
frames beyond the 16 post-kicking frames, and the background remains static. Consequently, the frames presented in this
figure were never seen by the models; they are included solely to exemplify the intricacies associated with the center class.
Notably, the classifier erroneously categorizes these clips as center when labeled as right and left, respectively.
incorporation of free-kick metadata impact perfor-
mance? Upon calculating the mean accuracy across
all scrutinized models, the obtained outcome indi-
cates that without consideration for free-kick meta-
data, the accuracy diminishes by 3%, and the F1-
Score experiences a 4% decline. This signifies that
metadata enhances contextual information regarding
free-kick embeddings, yet it does not introduce bias
to the proposed model.
In summary, as shown in this table, the transi-
tion from a three-class to a two-class problem empha-
sizes the consequences of simplifying the classifica-
tion task. It underscores the performance differences
among various HAR backbone architectures and their
potential suitability for specific sports action recogni-
tion tasks. However, these findings have raised sev-
eral questions, including the influence of the center
class on classification, the distribution of error pre-
dictions, and the examination of confusion matrices
for the top-performing models. These questions will
be addressed in the following section.
5 ERROR ANALYSIS
In this section, our primary focus is on the top-
performing model, which employs the SlowFast4x16
backbone. It is crucial to comprehensively analyze its
performance under scenarios involving two and three
classes. As a case in point, Figures 7 and 8 visually
represent the confusion matrices for both experimen-
tal settings.
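For reference, row-normalized matrices like those in Figures 7 and 8 can be computed from per-clip predictions with scikit-learn (toy labels shown):

```python
from sklearn.metrics import confusion_matrix

labels = ["left", "center", "right"]
y_true = ["left", "left", "center", "right", "right", "center"]  # toy labels
y_pred = ["left", "right", "left", "right", "left", "right"]     # toy output
# normalize="true" makes each row sum to 1, matching the per-class
# percentages discussed in the text.
print(confusion_matrix(y_true, y_pred, labels=labels, normalize="true"))
```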
Our analysis presents the confusion matrix for our
classification model, designed to categorize free-kick
soccer actions into one of three classes: left, center,
or right. As illustrated in Figure 7, this matrix pro-
vides valuable insight into the model’s performance
and ability to classify BoGP correctly. The diagonal
elements of the matrix represent instances where the
model’s predictions align with the actual classes. For
instance, the model achieved an accuracy of approxi-
mately 60.4% in identifying left shots, 12.0% for cen-
ter, and 61.3% for right. These values indicate the
model’s proficiency in correctly classifying shots into
their respective categories. However, the off-diagonal
elements reveal cases of misclassification. Notably,
there is some confusion between the center and the
other two classes. The model often misclassifies cen-
ter shots as left (46.0%) or right (42.0%), suggesting
improvement in distinguishing center shots from the
others. Additionally, left shots are occasionally mis-
classified as right (35.8%), and right shots are occa-
sionally mislabeled as left (37.6%) or center (1.1%).
Our analysis suggests that the classifier faces chal-
lenges in accurately distinguishing the center cate-
gory, as illustrated in Figure 9. The intricacies of this
classification become apparent, even for human anno-
tators, as the camera perspective can sometimes ob-
scure the goal’s position. This issue is compounded
by the limited number of center samples, coupled
with the wide range of camera angles in the dataset.
Consequently, achieving a fine-grained classification
for center may not be practically feasible given these
constraints.
The confusion matrix shown in Figure 8 suggests
a notable accuracy in classifying instances, particu-
larly on the diagonal elements. The top-left quadrant
indicates a correct classification rate of 64.2% for the
left category, while the bottom-right quadrant signi-
fies an 86.5% accuracy in classifying the right cate-
gory. However, there is some misclassification, as ev-
idenced by the off-diagonal elements, with 35.8% of
left instances being erroneously classified as right and
13.5% of right instances being misclassified as left.
6 CONCLUSIONS
In conclusion, this study extensively examined the
performance of various HAR backbone architectures
in estimating the BoGP during free-kick shots. The
investigation covered three-class (left, center, and
right) and two-class (left and right) classification sce-
narios, providing valuable insights into the effective-
ness of different models.
X3D-L emerged as the top performer for the three-
class classification with notable accuracy, precision,
recall, and F1-Score. However, the overall perfor-
mance in this context remained modest, prompting a
comprehensive error analysis in Section 5. In con-
trast, the two-class scenario revealed a transition in
the number of classes and demonstrated that despite
the reduced complexity, the choice of backbone archi-
tecture remains critical. SlowFast4x16 exhibited the
highest accuracy and noteworthy precision and F1-
Score, highlighting its effectiveness in soccer player
free-kick shoot zone estimation. The inclusion of free-kick metadata in the analysis showcased its impact on performance, revealing a 3% accuracy drop and a 4% decline in F1-Score when it is not considered.
Importantly, this decline signifies the role of meta-
data in enhancing contextual information without in-
troducing bias to the model.
The focus on the top-performing model, Slow-
Fast4x16, included a detailed examination of confu-
sion matrices for the three-class and the two-class sce-
narios. While the model demonstrated proficiency in
classifying instances, challenges were identified, par-
ticularly in distinguishing the center category. The
limited number of samples and diverse camera angles
posed practical challenges in achieving fine-grained
classification for center.
These findings highlight the complexity of sports
action recognition tasks, emphasizing the need for
careful consideration of the model architecture and of the influence of task simplification. Further questions
were raised, including the impact of the center class
on classification, the distribution of error predictions,
and the exploration of confusion matrices for top-
performing models, providing avenues for future re-
search and improvement.
ACKNOWLEDGEMENTS
This work is partially funded by the Spanish Ministry
of Science and Innovation under project PID2021-
122402OB-C22 and by the ACIISI-Gobierno de Ca-
narias and European FEDER funds under project
ULPGC Facilities Net and Grant EIS 2021 04.
REFERENCES
Akan, S. and Varli, S. (2023). Reidentifying soccer play-
ers in broadcast videos using body feature alignment
based on pose. In Proceedings of the 2023 4th Inter-
national Conference on Computing, Networks and In-
ternet of Things, page 440–444, New York, NY, USA.
Association for Computing Machinery.
Brick, N. E., McElhinney, M. J., and Metcalfe, R. S. (2018).
The effects of facial expression and relaxation cues
on movement economy, physiological, and perceptual
responses during running. Psychology of Sport and
Exercise, 34:20–28.
Carreira, J. and Zisserman, A. (2017). Quo Vadis, Action
Recognition? A New Model and the Kinetics Dataset.
In 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 4724–4733.
Cioppa, A., Deliège, A., and Van Droogenbroeck, M.
(2018). A bottom-up approach based on semantics for
the interpretation of the main camera stream in soccer
games. In 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW),
pages 1846–184609.
Deliège, A., Cioppa, A., Giancola, S., Seikavandi, M. J.,
Dueholm, J. V., Nasrollahi, K., Ghanem, B., Moes-
lund, T. B., and Droogenbroeck, M. V. (2021).
Soccernet-v2 : A dataset and benchmarks for holis-
tic understanding of broadcast soccer videos. In The
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops.
Deloitte (2023). Deloitte football money league 2023. Ac-
cessed on November 3, 2023.
Feichtenhofer, C. (2020). X3D: Expanding Architectures
for Efficient Video Recognition. 2020 IEEE/CVF
Conf. on Computer Vision and Pattern Recognition
(CVPR), pages 200–210.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2018).
Slowfast networks for video recognition. 2019
IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 6201–6210.
Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R. B.,
and He, K. (2021). A large-scale study on unsuper-
vised spatiotemporal representation learning. 2021
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 3298–3308.
FIFA (2022). 2019-2022 revenue. Accessed on November
3, 2023.
Freire-Obregón, D., Lorenzo-Navarro, J., Santana, O. J., Hernández-Sosa, D., and Castrillón-Santana, M.
(2022). Towards cumulative race time regression
in sports: I3D ConvNet transfer learning in ultra-
distance running events. In International Conference
on Pattern Recognition (ICPR), pages 805–811.
Freire-Obregón, D., Lorenzo-Navarro, J., Santana, O. J., Hernández-Sosa, D., and Castrillón-Santana, M.
(2023). A Large-Scale Re-identification Analysis in
Sporting Scenarios: the Betrayal of Reaching a Crit-
ical Point. In International Joint Conference on Bio-
metrics (IJCB).
Gao, X., Liu, X., Yang, T., Deng, G., Peng, H., Zhang, Q.,
Li, H., and Liu, J. (2020). Automatic key moment ex-
traction and highlights generation based on compre-
hensive soccer video understanding. In 2020 IEEE
International Conference on Multimedia Expo Work-
shops (ICMEW), pages 1–6.
Giancola, S., Amine, M., Dghaily, T., and Ghanem, B.
(2018). Soccernet: A scalable dataset for action spot-
ting in soccer videos. In 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition Work-
shops (CVPRW), pages 1792–1810.
Guo, T., Tao, K., Hu, Q., and Shen, Y. (2020). Detection
of ice hockey players and teams via a two-phase cas-
caded cnn model. IEEE Access, 8:195062–195073.
He, X. (2022). Application of deep learning in video
target tracking of soccer players. Soft Computing,
26(20):10971–10979.
Homayounfar, N., Fidler, S., and Urtasun, R. (2017). Sports
field localization via deep structured models. In 2017
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 4012–4020.
Johnson, S. and Everingham, M. (2010). Clustered pose
and nonlinear appearance models for human pose es-
timation. In Proc. BMVC, pages 12.1–12.11.
Kamble, P., Keskar, A., and Bhurchandi, K. (2019). A deep
learning ball tracking system in soccer videos. Opto-
Electronics Review, 27(1):58–69.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C.,
Vijayanarasimhan, S., Viola, F., Green, T., Back, T.,
Natsev, P., Suleyman, M., and Zisserman, A. (2017).
The Kinetics Human Action Video Dataset. CoRR.
Lee, J., Moon, S., Nam, D.-W., Lee, J., Oh, A. R., and Yoo,
W. (2020). A study on sports player tracking based
on video using deep learning. In 2020 International
Conference on Information and Communication Tech-
nology Convergence (ICTC), pages 1161–1163.
Li, L. and Li Fei-Fei (2007). What, where and who? classi-
fying events by scene and object recognition. In 2007
IEEE 11th International Conference on Computer Vi-
sion, pages 1–8.
Li, L., Zhang, T., Kang, Z., and Zhang, W.-H. (2023).
Design and implementation of a soccer ball de-
tection system with multiple cameras. ArXiv,
abs/2302.00123.
Microsoft (2023). Shaping the future of the game. Accessed
on November 3, 2023.
Parmar, P. and Morris, B. (2019a). Action quality assess-
ment across multiple actions. In IEEE Winter Con-
ference on Applications of Computer Vision, WACV
2019, Waikoloa Village, HI, USA, January 7-11, 2019,
pages 1468–1476. IEEE.
Parmar, P. and Morris, B. T. (2019b). What and how well
you performed? A multitask learning approach to ac-
tion quality assessment. In IEEE Conference on Com-
puter Vision and Pattern Recognition, CVPR 2019,
Long Beach, CA, USA, June 16-20, 2019, pages 304–
313. Computer Vision Foundation / IEEE.
Penate-Sanchez, A., Freire-Obregón, D., Lorenzo-Melián, A., Lorenzo-Navarro, J., and Castrillón-Santana, M.
(2020). TGC20ReId: A dataset for sport event re-
identification in the wild. Pattern Recognition Letters,
138:355–361.
Santana, O. J., Freire-Obregón, D., Hernández-Sosa, D., Lorenzo-Navarro, J., Sánchez-Nielsen, E., and Castrillón-Santana, M. (2023). Facial expression analysis
in a wild sporting environment. Multimedia Tools and
Applications, 82(8):11395–11415.
Shih, H.-C. (2018). A survey of content-aware video anal-
ysis for sports. IEEE Transactions on Circuits and
Systems for Video Technology, 28(5):1212–1231.
Simonyan, K. and Zisserman, A. (2014). Two-stream con-
volutional networks for action recognition in videos.
ArXiv, abs/1406.2199.
Stein, M., Janetzko, H., Lamprecht, A., Breitkreutz, T.,
Zimmermann, P., Goldlücke, B., Schreck, T., An-
drienko, G., Grossniklaus, M., and Keim, D. A.
(2018). Bring it to the pitch: Combining video and
movement data to enhance team sport analysis. IEEE
Transactions on Visualization and Computer Graph-
ics, 24(1):13–22.
Teranishi, M., Fujii, K., and Takeda, K. (2020). Trajectory
prediction with imitation learning reflecting defensive
evaluation in team sports. In 2020 IEEE 9th Global
Conference on Consumer Electronics (GCCE), pages
124–125.
Wang, S., Xu, Y., Zheng, Y., Zhu, M., Yao, H., and Xiao,
Z. (2019). Tracking a golf ball with high-speed stereo
vision system. IEEE Transactions on Instrumentation
and Measurement, 68(8):2742–2754.
Wang, X., Girshick, R. B., Gupta, A. K., and He, K. (2017).
Non-local neural networks. 2018 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 7794–7803.
Wu, Y., Xie, X., Wang, J., Deng, D., Liang, H., Zhang, H.,
Cheng, S., and Chen, W. (2019). Forvizor: Visualiz-
ing spatio-temporal team formations in soccer. IEEE
Transactions on Visualization and Computer Graph-
ics, 25(1):65–75.
Xu, C., Fu, Y., Zhang, B., Chen, Z., Jiang, Y.-G., and
Xue, X. (2020). Learning to score figure skating sport
videos. IEEE Transactions on Circuits and Systems
for Video Technology, 30(12):4578–4590.
Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., Liu,
W., and Wang, X. (2021). ByteTrack: Multi-Object
Tracking by Associating Every Detection Box. In Eu-
ropean Conference on Computer Vision.