Deep Learning-Based Models for Performing Multi-Instance Multi-Label
Event Classification in Gameplay Footage
Etienne Julia, Marcelo Zanchetta do Nascimento, Matheus Prado Prandini Faria and Rita Maria Silva Julia
Computer Science Department, Federal University of Uberlândia, Uberlândia, Minas Gerais, Brazil
ORCIDs: https://orcid.org/0000-0003-3750-4264, https://orcid.org/0000-0003-3537-0178, https://orcid.org/0000-0003-1468-9243, https://orcid.org/0000-0001-5181-5451
Keywords:
MIML Event Classification, Gameplay Footage, Super Mario, Frame-Based and Chunk-Based Data
Representations, Deep Extractor Neural Networks, Fine-Tuned Backbone, Deep Classifier Neural Networks.
Abstract:
In dynamic environments, such as videos, events are one of the key pieces of information for improving the performance of autonomous agents, since, broadly speaking, they represent the dynamic changes and interactions that happen in the environment. Video games stand out among the most suitable domains for investigating the effectiveness of machine learning techniques. Among the challenging activities explored in such research, one that stands out is endowing automatic game systems with the ability to identify, in game footage, the events that other players, interacting with them, provoke in the game environment. Thus, the main contribution of this work is the implementation of deep learning models to perform MIML game event classification in gameplay footage, which are composed of: a data generator script to automatically produce multi-labeled frames from game footage (where the labels correspond to game events); a pre-processing method to make the frames generated by the script suitable to be used in the training datasets; a fine-tuned MobileNetV2 to perform feature extraction (trained from the pre-processed frames); an algorithm to produce MIML samples from the pre-processed frames (each sample corresponds to a set of frames named chunk); and a deep neural network (NN) to perform classification of game events, which is trained from the chunks. In this investigation, Super Mario Bros is used as a case study.
1 INTRODUCTION
In machine learning (ML), the ability of autonomous
agents to retrieve relevant information about the envi-
ronment they operate in is an indispensable requisite,
since it can be very useful in improving their decision-
making process (Neto and Julia, 2018). One of the most relevant pieces of information to be retrieved from the environment in which agents in general operate refers to the ongoing events. Despite the complexity that the definition of an event can assume in artificial intelligence (AI) (Ravanbakhsh et al., 2017), (Yu et al., 2020), in short, events can be seen as flags that depict the most important things happening in the environment.
In areas like personal assistance robots, camera monitoring networks, and autonomous cars, the ability to react instantly to events happening in the environment is crucial. For this reason, AI research carried out in the domain of online event detection systems can be very valuable (Geest et al., 2016).
Indeed, such a challenge has motivated recent AI
research on developing online event detection systems
for real-world videos, be it sports, traffic, or everyday
activities (Gao et al., 2017), (Geest et al., 2016) and
(Xu et al., 2018).
For agents operating in dynamic environments,
like videos, the events, which describe the dynamic
changes and interactions that happen in the envi-
ronment, represent fundamental information to guide
their decision-making. In such situations, as the scenario the agents operate in can change drastically, the occurrence of events can also be highly variable.
The kind of classification activity performed by studies that aim to retrieve events from videos varies with the problem they address. Such variation depends on the following factors: the way the video scenes are presented to the system, and the number of events intended to be retrieved from such scenes.
In this way, the systems implemented in such studies can be divided into the following groups: 1) single instance, single label (SISL), where a scene is presented to the system through a single frame from which at most one label can be retrieved (knowing that, in this work, a label corresponds to a game event); 2) single instance, multi-label (SIML), where a scene is presented through a single frame from which more than one label can be retrieved; 3) multi-instance, single label (MISL), where a group of scenes is presented to the system through a set of frames from which at most one label can be retrieved; 4) multi-instance, multi-label (MIML), where a group of scenes is presented through a set of frames from which more than one label can be retrieved.
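As an illustration only, the four regimes can be sketched as Python data structures (the class and field names below are ours, introduced for illustration, and are not part of any framework):

from dataclasses import dataclass
from typing import List, Set

import numpy as np


@dataclass
class SISLSample:              # single instance, single label
    frame: np.ndarray          # one H x W x 3 frame
    label: int                 # at most one event


@dataclass
class SIMLSample:              # single instance, multi-label
    frame: np.ndarray
    labels: Set[int]           # possibly several events in one frame


@dataclass
class MISLSample:              # multi-instance, single label
    frames: List[np.ndarray]   # a group of frames (a chunk)
    label: int


@dataclass
class MIMLSample:              # multi-instance, multi-label (this work)
    frames: List[np.ndarray]   # e.g. a 7-frame chunk
    labels: Set[int]           # all events occurring in the chunk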
Video games have proven to be a very appropriate case study for investigating the effectiveness of machine learning (ML) techniques for several reasons, among which stand out: economically, they belong to the group of products that generate the highest profits in the entertainment industry (Global Data, 2021); in education, they have been shown to be a friendly, efficient, and attractive tool; technically, they present a very high difficulty level, which represents a great challenge to any ML approach used to deal with them (Faria et al., 2022).
Research aimed at producing player agents capable of recovering, from game videos, the events provoked in the game environment by the actions executed by the opponents can be extremely useful. For example, in the construction of player systems aimed at improving the cognitive abilities of their users, the results of the analysis of these events can be used to map the users' profile; this profile, in turn, can be used to train the player system to adjust its decision-making process so as to lead the game to situations that stimulate the development of the users' cognitive abilities.
As argued in (Faria et al., 2022), even though events can frequently be retrieved directly from the game system flags (known as game logs), the following obstacles can make such information retrieval unfeasible: firstly, game engines are frequently unavailable due to the companies' privacy concerns (Luo et al., 2019); secondly, studies aiming at retrieving events from gameplay footage, which is not a real-time game, cannot count on game information provided by the game engines, even if the game platform they deal with counts on an open game engine.
The great applicability of systems capable of recovering events in video games, associated with this obstacle to accessing game logs, has been motivating new studies aimed at creating automatic systems with such an ability.
Taking into consideration the aforementioned ar-
guments, the authors of this paper, by way of inves-
tigation, propose two new models to perform game
event MIML classification (that is, multi-label classi-
fication of game events) in gameplay footage of Su-
per Mario Bros (Karakovskiy and Togelius, 2012).
The authors' main motivation for using Super Mario Bros as a case study is the set of results presented in (Prena and Sherry, 2018), showing that such a game is quite suitable for the development of cognitive skills, especially for users with some type of weakness in the learning process, which fits well with the authors' intended goals in their future works (Section 6). As examples of event retrieval in Super Mario Bros, the occurrences of the EventJump and EventLand events indicate that Mario has just taken off and has just landed, respectively.
In order to cope with the intended objectives, each
model proposed in this work uses a chunk-based rep-
resentation for the game scenes (where chunk refers
to a set of frames) and counts on distinct and spe-
cific deep neural networks (NNs) to perform fea-
ture extraction and multi-label classification. More
specifically, one of the models uses a new fine-tuned MobileNetV2 (produced here) as a feature extractor and the recurrent neural network (RNN) long short-term memory (LSTM) to perform the MIML classification, whereas the other one uses the same fine-tuned MobileNetV2 as backbone and the deep MIML network (Feng and Zhou, 2017) to execute the MIML classification.
Such architectures were defined taking into consideration the results obtained in the studies carried out in (Faria et al., 2022) and (Feng and Zhou, 2017). More specifically, in (Faria et al., 2022) the authors investigated various architectures to perform SISL event classification in footage of Super Mario and showed that: 1) concerning the game scenes, the chunk-based representation performs better than the frame-based representation; 2) using two distinct deep neural networks to perform feature extraction and game event classification provides much better results than using just one deep neural network to execute both tasks; 3) the best-performing architecture was made up of the standard MobileNetV2 (Sandler et al., 2018), as a feature extractor (or backbone), and the LSTM (Hochreiter and Schmidhuber, 1997), as a classifier network.
Concerning the second state-of-the-art work
which inspired the choice of the architectures pro-
posed in this paper, in (Feng and Zhou, 2017) the
authors introduced the deep MIML neural network to
perform MIML classification in the domain of images
and texts (they did not cope with footage).
Then, in short, the main contributions of this work are:
- Implementation of two new models to perform MIML classification of game events in footage of Super Mario Bros, thus extending the framework proposed in (Faria et al., 2022), which only coped with single event classification;
- Implementation of an automatic data generator script, which automatically generates multi-labeled frames from Super Mario Bros game footage (where each frame contains zero or more game events);
- Proposition of a pre-processing method to make the frames generated by the script suitable to be used in the training datasets;
- Production of a fine-tuned version of the standard MobileNetV2 used in the preliminary studies carried out in (Faria et al., 2022), by retraining it from pre-processed frames produced by the script;
- Proposition of an algorithm to produce MIML samples from the pre-processed frames (each sample corresponds to a set of frames named chunk);
- Investigation of an alternative classifier deep NN to deal with the MIML problem in gameplay footage of Super Mario Bros: in addition to testing the LSTM (which performed well in the single label classification scenarios investigated in (Faria et al., 2022)), the present work also tested the deep MIML neural network proposed in (Feng and Zhou, 2017) (in the latter, the performance of the deep MIML coping with footage was not investigated). Both classifier NNs were trained from the MIML samples (chunks) just mentioned.
The experiments demonstrated the significant performance gain obtained in the models with the use of the fine-tuned MobileNetV2 instead of the standard MobileNetV2. In addition, regarding the classifiers, they showed that the LSTM-based model presented slightly better results than the DeepMIML-based one in terms of mean average precision and F1-score.
2 THEORETICAL FOUNDATION
This section will present a general view of the main
theoretical topics related to this work.
2.1 Multi-Instance Multi-Label
Framework
The MIML framework, proposed in (Zhou et al., 2012), is defined by each data example being comprised of multiple instances associated with multiple labels. Traditionally in ML, each instance of data is associated with a single label or, alternatively, each instance with multiple labels. However, for problems involving complicated examples with multiple semantic meanings, using more than one instance to represent them can allow patterns inherent to the data to become clearer. In these situations, the MIML framework can be a more natural and appropriate way to represent the data.
An important idea in multi-instance learning is that of sub-concepts. For example, a broad concept like "Africa" can hardly be described by just one aspect; it requires a group of cultural and environmental identifiers that make such a description possible. In this case, the broad concept "Africa" could be identified, for example, from images that combine low-level concepts such as a grassland environment, lions, and trees. It is important to note that these low-level concepts, called sub-concepts, are not necessarily sufficient alone to describe the broad concept "Africa"; however, when combined, they become capable of doing so.
2.1.1 Multi-Label Evaluation Metrics
In traditional supervised learning, as mentioned in the
previous subsection, a single instance is associated
with a single label. This allows for an accuracy met-
ric (the percentage of examples correctly classified)
to often be a good enough indicator of performance
(Zhou et al., 2012). However, in multi-label situations, the goal is not to identify a single label, but rather to correctly classify as many labels as possible from a group of labels. Then, for example, it is better for a model to identify 4 out of 5 labels correctly than 2 out of the same 5 labels. This has to be taken into account by a multi-label evaluation metric for it to be an effective performance indicator.
The evaluation metrics that were relevant to this
work are explained below:
- Hamming loss: a loss metric that represents the fraction of labels that are incorrectly predicted, be it a correct label that is missed or a wrong label that is predicted. The lower the value of hamming_loss(h), the better the performance of the classifier h; hamming_loss(h) = 0 corresponds to a perfect prediction.
- Mean average precision: the average precision (AP) relates class precision and class recall. Precision (P) indicates how many predictions were correct, and recall (R) indicates how many of all instances of a class were predicted by the model. The average precision corresponds to the area under the precision × recall curve. Since the average precision refers to a single class, the mean average precision (mAP) corresponds to the average of the APs over all classes.
- F1 score: the F1 score also relates class precision and class recall. However, it corresponds to the harmonic mean between the two: F1 = 2PR / (P + R). Like the AP, the F1 score references a single class, so in a multi-label context, this metric corresponds to the average F1 score over all classes.
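For concreteness, the three metrics can be computed with scikit-learn as in the minimal sketch below; the toy matrices and the 0.5 decision threshold are illustrative assumptions, not values from this work:

import numpy as np
from sklearn.metrics import hamming_loss, average_precision_score, f1_score

# y_true: (n_samples, n_labels) binary indicator matrix of ground-truth events
# y_score: (n_samples, n_labels) predicted per-label scores in [0, 1]
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.6, 0.9, 0.4]])
y_pred = (y_score >= 0.5).astype(int)  # hard predictions at a 0.5 threshold

# Hamming loss: fraction of label slots predicted incorrectly (lower is better)
hl = hamming_loss(y_true, y_pred)

# mAP: average precision computed per class, then averaged over all classes
mAP = average_precision_score(y_true, y_score, average="macro")

# F1: harmonic mean of precision and recall per class, averaged over classes
f1 = f1_score(y_true, y_pred, average="macro")

print(f"hamming_loss={hl:.3f}  mAP={mAP:.3f}  F1={f1:.3f}")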
2.2 Deep MIML Network
As mentioned in Section 2.1, many real-world applications greatly benefit from being represented in a MIML context. The deep MIML network (Feng and Zhou, 2017) is a model that adds an automatic instance generator and a sub-concept learning structure to the traditional neural network formation. The structure of the deep MIML network (Figure 1) consists of a feature-based instance generator module (usually a CNN), followed by a sub-concept layer, which is essentially a classifier that matches the scores between instances and sub-concepts for every label. This approach allows for an instance-label relation discovery that works very well in a MIML framework. The role of the sub-concept layer is thus to automatically abstract a set of sub-concepts (Subsection 2.1) present in the received instances (images) that will be useful in helping to identify the labels that occur in these instances. It means that the deep MIML, in addition to performing an automatic extraction of features, also performs an automatic extraction of sub-concepts that help in the classification task.
Figure 1: Representation of a deep MIML network, image
from (Feng and Zhou, 2017).
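A minimal PyTorch sketch of the sub-concept layer idea is given below; the dimensions, the max-pooling choices and the class name are our assumptions for illustration, not the exact configuration of (Feng and Zhou, 2017):

import torch
import torch.nn as nn


class SubConceptLayer(nn.Module):
    """Sketch of a Deep MIML-style sub-concept layer.

    For each of the N instances in a bag, a fully connected layer scores
    K sub-concepts for each of the L labels; max pooling over sub-concepts
    and then over instances yields one score per label for the whole bag.
    """

    def __init__(self, feat_dim: int, n_labels: int, n_subconcepts: int):
        super().__init__()
        self.n_labels = n_labels
        self.n_subconcepts = n_subconcepts
        self.scoring = nn.Linear(feat_dim, n_labels * n_subconcepts)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (batch, n_instances, feat_dim) instance features from a backbone
        b, n, _ = bag.shape
        scores = self.scoring(bag)                        # (b, n, L*K)
        scores = scores.view(b, n, self.n_labels, self.n_subconcepts)
        scores = scores.max(dim=-1).values                # pool over sub-concepts
        scores = scores.max(dim=1).values                 # pool over instances
        return torch.sigmoid(scores)                      # (b, L) label scores


# Example: a 7-frame chunk, 1280-dim features per frame, 12 game events
layer = SubConceptLayer(feat_dim=1280, n_labels=12, n_subconcepts=4)
probs = layer(torch.randn(1, 7, 1280))                    # -> shape (1, 12)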
2.3 Super Mario Bros
The game Super Mario Bros was first released in 1985
by the company Nintendo. In it, the player plays as a
plumber called Mario and has to traverse a series of
stages to reach and save a princess. Even though it
is a very simple premise, each stage presents a differ-
ent set of obstacles and enemies that require precise
actions from the player in order to advance. These
actions are: walking, running, jumping and throwing
a fireball. All of these actions trigger different game events, which represent the interactions the player has with the environment through their actions. It is noteworthy that the same action can trigger more than one game event in the game scenes. The main game events related to Super Mario Bros are listed in Section 4.1.
From a visual standpoint, the game has very simple graphics in a "pixelated" style. The game also presents a 2D (two-dimensional) point of view, which lacks the object depth that is present in 3D games (Roettl and Terlutter, 2018), as can be seen in Figure 2.
Figure 2: Super Mario Bros on left (2D), Super Mario
Galaxy on right (3D).
3 RELATED WORKS
This section is structured as follows: Subsection 3.1 presents in more detail the two state-of-the-art works whose results inspired the definition of the architectures of the models proposed in this paper, whereas Subsection 3.2 summarizes other state-of-the-art works related to SISL or MIML classification (in games or other domains involving images).
3.1 Basic Related Works
This subsection summarizes the state-of-the-art works
whose architectures serve as a basis for the models
proposed here. The work (Faria et al., 2022) is a
precursor of the present proposal in which the au-
thors investigated 26 DL-based models to identify
events in gameplay footage depicting Super Mario
runs (Mario AI Framework (Karakovskiy and To-
gelius, 2012)). Their main objective was to find the
most adequate architecture and game scene represen-
tation to perform SISL classification of game events
in footage of Super Mario. One of the investigated
models was similar to the single standard AlexNet-
based model proposed in (Luo et al., 2018), where
the AlexNet architecture performed both the feature
extraction and the classification. The remaining mod-
els combined a CNN automatic feature extractor (Mo-
bileNetV2, ResNet50V2, VGG16 or AlexNet) - to
produce a concise and adequate feature-based repre-
sentation for the game scenes - and a NN game event
classifier (long short-term memory (LSTM), gated re-
current unit (GRU) or fully-connected layers (FCL)
- to identify the game event occurring in the feature-
based representation produced by the CNN extractor.
Concerning the game scene representation at the input of the CNN extractor, two kinds of data were investigated: individual frames and chunks (sets of consecutive frames). The major contribution of (Faria et al., 2022) consists in showing that, concerning the problem of event classification in gameplay footage: firstly, the chunk-based representation performs better than the frame-based representation; secondly, the models which use distinct deep neural networks to perform feature extraction and event classification provide much better results than those in which a single CNN carries out both activities; finally, the model that presented the best performance combines the standard MobileNetV2 (Sandler et al., 2018), as backbone, and the LSTM (Hochreiter and Schmidhuber, 1997), as classifier network (this best model uses a chunk-based representation for the game scenes). These results motivated the authors of this paper to propose a new model that extends and improves the best model produced in (Faria et al., 2022) as follows: proposing a new MobileNetV2+LSTM-based model able to face the more complex problem of performing MIML classification of events in gameplay footage; implementing an automatic data generator script to produce the frames from which the training dataset is constructed; proposing a method to build multi-labeled datasets to train the new models; and, finally, replacing the standard MobileNetV2 used in the best model of (Faria et al., 2022) with a fine-tuned version of MobileNetV2, which is produced by training the standard CNN from the frames generated by the automatic data generator script.
The other state-of-the-art work closely related to the architectures proposed in this paper is (Feng and Zhou, 2017), where the authors, based on the Multi-Instance Multi-Label framework proposed in (Zhou et al., 2012) (which is widely used to structure classification problems across many research fields), created the deep MIML network. In (Feng and Zhou, 2017), the deep MIML (summarized in Section 2.2) was trained from numerous and varied real datasets, so as to be able to perform MIML classification of events occurring in images and texts (in such work, the authors did not cope with footage). Accordingly, the second model proposed in this paper aims to investigate the performance of the deep MIML in dealing with MIML classification of events occurring in game footage. For this, the authors replace the LSTM of the first proposed model with the deep MIML.
3.2 Other Related Works
This subsection summarizes other related works involving SISL or MIML classification in games or domains involving images. In (Song et al., 2018), the authors explored the usage of CNNs on multi-label image classification problems. They proposed a deep multi-modal CNN designed for MIML learning, whose main appeal was combining the benefits of the MIML framework with the image classification capabilities of CNNs in order to solve the problem of weakly labeled images in classical datasets, like ImageNet, in which images presenting multiple object classes are labeled with just a single label. Differently from the approaches implemented in this paper, in (Song et al., 2018) the authors use just a single deep neural network to perform both the feature extraction and the MIML classification, and they do not face the challenge of performing MIML classification in footage. The work (Luo et al., 2018) proposed a DL-based approach to deal with SISL game event classification in footage of Gwario (a Super Mario clone). Also differing from this work, in (Luo et al., 2018): a single CNN (AlexNet) was used for both feature extraction and classification; the authors only investigated frame-based representations for the game scenes; and, finally, the models were trained from a manually labeled Gwario game dataset.
4 ARCHITECTURE OF THE
MODELS
This section details the models implemented in this work to tackle MIML event classification in gameplay footage of Super Mario Bros. These proposed approaches are based on the association of a backbone CNN and a classifying deep NN, where the former must automatically extract the most relevant features corresponding to the game scenes (represented through chunks), and the latter must identify the game events occurring in such scenes, as shown in Figure 3.
Figure 3: General architecture.
The models were implemented with the PyTorch API.
The approaches were created so as to extend the results obtained in the following state-of-the-art works: 1) in (Faria et al., 2022) (Section 3.1), the authors investigated various models to perform SISL game event detection in footage of Super Mario Bros (differently from the present paper, such work did not cope with MIML classification). The investigated models differed from each other in the following aspects: first, in the way of representing the game scenes (frames or chunks); second, in the composition of the architecture: one of the architectures was composed of a single deep network performing both feature extraction and single event classification, while an alternative architecture was composed of two distinct networks, where a CNN performed feature extraction and another network (a deep RNN or a multi-layer NN) performed the single label classification. The winning model was made of a MobileNetV2 feature extractor combined with an LSTM classifier, and used chunks to represent the game scenes; 2) in (Feng and Zhou, 2017), the authors proposed the DeepMIML 2D sub-concept layer to perform multi-label classification in images and texts (distinctly from this work, they did not investigate the performance of DeepMIML executing MIML classification in videos).
Then, in order to cope with the objective of performing MIML classification in game footage of Super Mario Bros, and taking into consideration the results obtained in such state-of-the-art works, both models proposed in this work, in addition to using chunks to represent the game scenes, count on two distinct NNs to perform feature extraction and MIML classification. The architecture of these models can be summarized as follows:
- Automatic data generator (script): both models count on a script whose purpose is to generate the labeled data (frames) that will be used to build the training datasets. For this, the script uses the Mario AI Framework (Karakovskiy and Togelius, 2012) to produce game footage from real games. Then, from such footage, the script generates the following elements: the frames representing the game scenes and a set of .csv files storing the labels corresponding to these frames;
- CNN backbone: in both models, a new fine-tuned MobileNetV2 was implemented with the purpose of replacing and enhancing the standard MobileNetV2 backbone used in (Faria et al., 2022). Such a new version was produced by retraining the standard MobileNetV2 from a dataset composed of the frames produced by the automatic data generator (script);
- Multi-label classifiers: by way of investigation, one of the models uses the LSTM, which proved to be the best single label classifier in (Faria et al., 2022), whereas the other one uses the deep MIML proposed in (Feng and Zhou, 2017), in such a way as to evaluate to which extent it performs well in making MIML classification in videos.
Such MIML classification from gameplay footage is explained in detail in the following subsections.
4.1 A Script to Automatically Generate
Labeled Frames from Game
Footage
As this work copes with MIML classification, it has to
count on a significant quantity of multi-labeled sam-
ples to train the deep classifier NNs (Section 4.3.2).
Then, it was necessary to implement a script that auto-
matically retrieved data from gameplay footage. For
this, the script, through adequate resources provided
by the Mario AI Framework (Karakovskiy and To-
gelius, 2012), produced the gameplay footage from
which the following elements were extracted: the data corresponding to frames that represent the game scenes, and some .csv files containing the labels (game events) associated with these frames. It is interesting to remember (Section 2.3) that the game events appear as a consequence of the user actions and that
the same action can generate more than one game
event.
The footage is composed of 75 videos, each one containing the recording of the same group of 15 levels of Super Mario Bros played by a different competitor picked from a set of 5 players, among which 4 are humans and one is the automatic player available in the Mario AI Framework.
The script also had to define both the frame rate
of the footage - which represents how many times the
game scene refreshes per second of the game - and the
set of events that is adequate to label the data samples
(frames).
Concerning the frame rate, the footage was captured at 30 frames per second. This means that the time interval between two consecutive frames is only about 0.03 seconds, which very frequently makes the script generate frames devoid of events (in fact, as the time interval between two consecutive frames f1 and f2 is very small, usually there is no time for a user action, which would introduce a new event in f2, to be executed between both frames). It is interesting to note that, in this case, reducing the frame rate in order to reduce the number of frames with no event is not a good alternative, since it would cause an even more serious problem: too little information captured about the game scenes, which would compromise the training of the models.
Then, the frames generated by the script can
present zero or more events.
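By way of illustration, the sketch below shows one way such frames and .csv labels could be paired downstream; the file names and the (frame_index, event) column layout are hypothetical assumptions, not the script's actual output format:

import csv
from collections import defaultdict

import cv2  # OpenCV, assumed available for reading the footage

# Hypothetical layout: one video plus one .csv file (no header row) whose
# rows are (frame_index, event_name); a frame may appear in zero or more rows.
events_by_frame = defaultdict(set)
with open("run_01_events.csv", newline="") as f:
    for frame_idx, event in csv.reader(f):
        events_by_frame[int(frame_idx)].add(event)

cap = cv2.VideoCapture("run_01.mp4")  # footage captured at 30 fps
samples = []
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each sample pairs a frame with its (possibly empty) set of event labels
    samples.append((frame, events_by_frame.get(idx, set())))
    idx += 1
cap.release()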
In order to label the frames, the script used the same 12 game events pointed out by the framework itself as being fundamental to model the dynamics of a match (such events stand out among the 270 variables generated by the Mario AI Framework to describe a given game scene, which include game events, coordinates, and others) (Faria et al., 2022).
The game situations corresponding to these 12 game events are:
- EventBump: Mario bumps his head on a block after jumping;
- EventCollect: Mario collects a coin;
- EventFallKill: an enemy dies by falling out of the scene;
- EventFireKill: Mario kills an enemy by shooting fire at it;
- EventHurt: Mario takes damage after being hit by an enemy;
- EventJump: Mario performs a jump;
- EventKick: Mario kicks a Koopa shell;
- EventLand: Mario lands on the ground after having jumped;
- EventLose: Mario loses the game level, which can be caused by the player dying or the time running out;
- EventShellKill: Mario kills an enemy by kicking a Koopa shell at it;
- EventStompKill: Mario kills an enemy by jumping on top of it;
- EventWin: the player completes the current level (illustrated as an example in Figure 4).
Figure 4: Example of an EventWin frame.
4.2 Data Pre-Processing
The models were trained from datasets built from
pre-processed frames that were generated by the au-
tomatic script. The purpose of the pre-processing is to refine the raw samples that were produced by the script, making them more suitable to enhance the quality of the training of the proposed models. In
short, the following pre-processing stages were im-
plemented: 1) Definition of an adequate capture win-
dow: normally, whenever the user executes an action,
the Super Mario AI Framework registers all the events
associated with that action in the frame that depicts
the game scenario at the exact moment in which such
an action was executed. However, for some kinds of
events, such an annotation is not adequate, since, vi-
sually, the effect of such execution (that is, the visu-
alization of the game events) will only be noticed in a
later frame. Due to this, the first pre-processing stage
has as its purpose to define an adequate custom cap-
ture window for each kind of event label. For exam-
ple, specifically for the event EventBump, instead of
annotating its occurrence in the frame f where it was
fired off (as the framework does), here, through this
pre-processing stage, it will be registered in the frame that succeeds f; 2) Frame resizing: with the purpose of reducing the feature extraction runtime, in this second pre-processing step, each frame was resized from
its original size to 224×224 pixels, and a pixel-wise
normalization was applied. All three RGB color chan-
nels were kept.
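A minimal sketch of this resizing and normalization step, written with torchvision, is shown below; the normalization constants are the usual ImageNet statistics paired with MobileNetV2, an assumption on our part, since the paper does not state the exact values used:

import numpy as np
from torchvision import transforms

# Stand-in raw frame (H x W x 3, uint8), as the generator script might emit
frame = np.random.randint(0, 256, (240, 256, 3), dtype=np.uint8)

preprocess = transforms.Compose([
    transforms.ToPILImage(),                  # accept an H x W x 3 array
    transforms.Resize((224, 224)),            # resize to the backbone input size
    transforms.ToTensor(),                    # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # pixel-wise normalization
                         std=[0.229, 0.224, 0.225]),   # (assumed ImageNet stats)
])

frame_tensor = preprocess(frame)              # -> torch.Size([3, 224, 224])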
4.3 Training Datasets
In order to train the models proposed in this work, two
datasets had to be created: a frame-based dataset and
a chunk-based dataset.
For this, firstly, once the process of generating training samples had been completed by the script (Subsection 4.1), all labeled raw frames with at least one event were pre-processed (Section 4.2) and stored in a primary dataset. Next, as such a dataset was originally very unbalanced, due to the fact that some events naturally occur much more often than others in the footage from which the data were retrieved, it had to be submitted to an adequate balancing strategy. As, due to intrinsic characteristics of the Super Mario Bros game, about 98% of the frames with at least one event are single-labeled instances (that is, they present just one event), it was possible to use a simple oversampling balancing in which the frames belonging to the under-represented classes were randomly duplicated until a more even distribution was reached. At the end, this pre-processed and balanced primary dataset was composed of 10,000 examples, among which only 218 were multi-labeled frames (that is, frames with more than one event). As shown in the following subsections, this pre-processed and balanced primary dataset represents the basis for creating the training datasets.
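A minimal sketch of this simple oversampling strategy is given below; the sample representation and the balancing target (matching the most frequent class) are our assumptions:

import random
from collections import Counter


def oversample(samples):
    """Randomly duplicate frames of under-represented event classes.

    Assumes each sample is a (frame, labels) pair, where labels is a set of
    events; since ~98% of event-bearing frames carry exactly one label,
    duplicating per-class pools yields a roughly even distribution.
    """
    counts = Counter(label for _, labels in samples for label in labels)
    target = max(counts.values())             # size of the largest class
    balanced = list(samples)
    for event, count in counts.items():
        pool = [s for s in samples if event in s[1]]
        balanced.extend(random.choices(pool, k=target - count))
    return balanced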
4.3.1 Frame Dataset
In order to obtain a more effective feature extraction method, the standard MobileNetV2 pre-trained on ImageNet (Sandler et al., 2018) had to be retrained. To speed up such retraining, it was performed on a Frame Dataset exclusively made up of the 9,782 single-labeled frames present in the pre-processed and balanced primary dataset mentioned above. The idea, in this case, is to improve the backbone NN by retraining it on images specifically corresponding to the problem it will cope with, that is, Super Mario Bros images.
4.3.2 A Special Algorithm to Create the
Multi-Labeled Samples of the Chunk
Dataset
In order to build the Chunk Dataset to train the MIML classifiers of game events of both models, it was necessary to create an algorithm to produce a significant quantity of multi-labeled samples, since, due to special characteristics of the Super Mario Bros game, a very large part of the frames generated by the script are devoid of events (Section 4.1) or present just a single event (as discussed in the beginning of Section 4.3).
Each multi-labeled sample (or chunk) of the Chunk Dataset produced by such an algorithm corresponds to a sequence of seven pre-processed frames, where each one contains zero or more game events, as detailed in the following.
Each chunk Cf of the Chunk Dataset was generated through the application of the following algorithm to one of the 10,000 frames f which make up the pre-processed and balanced primary dataset (this means that 10,000 chunks were generated, since all the frames of the primary dataset have at least one game event (Section 4.3)):
1. Take a frame f from the pre-processed and balanced primary dataset;
2. Find f in the sequence of raw frames produced by the generator script; in the same sequence, take the three frames that precede f and the three frames that follow it (it is interesting to note that such frames can be devoid of events);
3. Apply the same pre-processing described in Section 4.2 to all the frames taken in the previous step that present no event (which had not yet been pre-processed, since they do not belong to the primary pre-processed and balanced dataset). Note that all the remaining raw frames taken in the previous step (that is, those with at least one event) have already been pre-processed, since they belong to the primary pre-processed and balanced dataset;
4. Build the chunk Cf by grouping, in the same order they appear in the sequence produced by the generator script, the three frames that precede f, f itself, and the three frames that follow f (all these 7 frames must already be in the pre-processed form). It is interesting to point out that such a strategy of generating chunk-based data by grouping consecutive frames retrieved from the footage by the script is very useful for two reasons: first, it produces a kind of synthesis of a temporal fraction of the game scenes, which contributes to improving the perception of the game by the deep classifier NNs and, consequently, their performance; secondly, it allows for a significant increase in the number of multi-labeled instances in the Chunk Dataset (as detailed in the following), which is very useful to enhance the multi-label classifier training.
5. Label the chunk Cf with the set of labels occurring in the 7 frames that compose it.
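The steps above can be sketched in Python as follows; this is a simplified illustration, assuming aligned lists of raw frames and label sets, and for brevity it applies the pre-processing uniformly, whereas in the actual pipeline the event-bearing frames have already been pre-processed:

def build_chunk(raw_frames, raw_labels, f_idx, preprocess, window=3):
    """Sketch of the chunk-building algorithm of Section 4.3.2.

    raw_frames: ordered sequence of raw frames produced by the generator script
    raw_labels: label sets aligned with raw_frames (possibly empty sets)
    f_idx: index in raw_frames of a frame f from the balanced primary dataset
    Returns the chunk centred on f and the union of the labels of its frames.
    """
    # Steps 1-2: take f plus the three frames before and after it
    indices = range(max(f_idx - window, 0),
                    min(f_idx + window + 1, len(raw_frames)))

    # Steps 3-4: pre-process and group the (up to) 7 frames in footage order
    chunk = [preprocess(raw_frames[i]) for i in indices]

    # Step 5: label the chunk with every event occurring in its frames
    labels = set().union(*(raw_labels[i] for i in indices))
    return chunk, labels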
The algorithm proposed here to generate multi-labeled samples proved to be quite effective: despite the fact that a very large part of the frames with which it operated were devoid of events or had just a single event, the algorithm managed to produce a very expressive quantity of multi-labeled samples to train the classifiers. In fact, among the 10,000 chunks that make up the Chunk Dataset, 3,791 correspond to multi-labeled examples (with two or more labels), thus enabling the training of the MIML classifiers (which would have been impossible with the pre-processed and balanced primary dataset, which counts on just 218 multi-labeled samples, as explained at the beginning of Section 4.3).
Table 1 summarizes information regarding the Frame and Chunk Datasets.
Table 1: Summarization of the Training Datasets.

                 Total Instances   Single Label Instances   Multi Label Instances
Frame Dataset    9782              9782                     0
Chunk Dataset    10000             6209                     3791
4.4 Construction of the Fine-Tuned
Backbone
Before presenting how the fine-tuned backbone NN used here was produced, it is important to remember that the architecture of both models proposed in this paper to perform MIML classification of game events in footage of Super Mario Bros is inspired by the best model obtained in the preliminary studies carried out in (Faria et al., 2022), in which the feature extraction and the classification processes are performed by distinct deep NNs, and the standard MobileNetV2 proved its effectiveness in performing feature extraction for SISL classification of game events in footage of Super Mario Bros. For this reason, the training of the feature extractor and of the classifier NNs had to be executed separately, as explained in the following.
The fine-tuned MobileNetV2 backbone proposed
here was created in two steps: 1) Retraining, on the
Frame Dataset presented in Section 4.3.1, of the stan-
dard MobileNetV2 (Sandler et al., 2018). The mo-
tivation, in this case, is the fact that the ImageNet
dataset, from which the standard MobileNetV2 was
trained, is composed of real-world images that differ
a lot from the visual aspects of Super Mario’s game
scenes. Thus, the retraining of MobileNetV2 from the
Frame Dataset, whose instances correspond to frames
reporting the real visual aspects of the game scenes,
will allow for producing a more specialized backbone
to deal with Super Mario Bros. Such retraining was
performed with a 5-fold cross-validation method for
up to 100 epochs, or until there was no improvement
on the loss for five consecutive epochs. The optimizer
used was Adam (Kingma and Ba, 2015), the loss pa-
rameter was categorical cross-entropy, and the learn-
ing rate was equal to 0.0005, following what was used
in (Faria et al., 2022); 2) Shortening of the architec-
ture of the specialized backbone obtained in step 1 by
cutting off its fully connected layer, responsible for
classification (in such a way as to keep only the ex-
tractor layers).
This way, such a shortened architecture (from this point on referred to simply as the fine-tuned MobileNetV2) is able to produce, at its extractor output layer, the feature-based representation of each chunk that must be presented at the input layer of the deep classifier NN, with the purpose of training it, as presented in the next section.
Finally, it is interesting to point out that the fine-tuned MobileNetV2 model produced here has the same number of features (at the extractor output layer) and parameters as the extractor part of the pre-trained MobileNetV2 (Sandler et al., 2018), that is, 1280 features and 2,257,984 parameters.
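Both steps can be sketched in PyTorch as below; this is a sketch under stated assumptions (a recent torchvision, and the training loop omitted for brevity), not the authors' exact code:

import torch
import torch.nn as nn
from torchvision import models

# Step 1 (sketch): start from the ImageNet-pretrained MobileNetV2 and retrain
# it on the Frame Dataset as a 12-way single-label classifier (Adam,
# lr = 0.0005, categorical cross-entropy, as stated in the text).
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.last_channel, 12)  # 12 game events

optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy on single labels
# ... standard training loop over the Frame Dataset goes here ...

# Step 2 (sketch): cut off the fully connected classifier, keeping only the
# convolutional extractor plus global pooling; each 224x224 frame is then
# mapped to a 1280-dimensional feature vector.
backbone = nn.Sequential(
    model.features,
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
features = backbone(torch.randn(1, 3, 224, 224))  # -> torch.Size([1, 1280])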
4.5 Classifiers
In order to find an adequate algorithm to perform classification in gameplay footage, this work investigated two distinct deep NN classifiers: the sub-concept layer of the DeepMIML (Feng and Zhou, 2017) and the LSTM (Hochreiter and Schmidhuber, 1997).
The motivation for investigating the sub-concept layer of the DeepMIML as the classifier NN in one of the proposed models is the following: as summarized in Subsection 2.2, it corresponds to the classifier portion of the DeepMIML NN architecture, which was conceived to operate on MIML problems (Feng and Zhou, 2017). The effectiveness of such a classifier layer in automatically generating the sub-concepts that help to identify each label makes it quite suitable for dealing with MIML problems. Further, the fact that the present work uses multiple frames with multiple labels (chunks) as the basis for representing the game scenes at the input of the NN classifiers places the problem it deals with in the class of MIML problems (Zhou et al., 2012), which naturally justifies the hypothesis that the DeepMIML can be an interesting approach to include among the classifier deep NNs investigated herein.
With respect to the LSTM model, it was selected
for the following reasons: firstly, because it proved
to be very effective as an NN classifier in the winning model obtained in the preliminary studies related to single event classification in Super Mario Bros gameplay footage carried out in (Faria et al., 2022). Such a good result is due to the fact that, in the LSTM, the intermediary neurons are replaced with memory cells that allow relevant information to persist in the NN across the processing of different instances.
The second reason that motivated the authors of this
paper to investigate the LSTM in one of their mod-
els is that in the very work that proposed the Deep
MIML (Feng and Zhou, 2017), in order to evaluate
it, the authors carried out experiments in which they
compared its performance in executing MIML clas-
sification in images with that of LSTM (differently
from this work, they did not investigate the perfor-
mance of LSTM and Deep MIML operating as MIML
classifiers in footage).
Both classifiers were trained in the following way:
firstly, each training chunk was retrieved from the
chunk dataset (Section 4.3.2) and presented at the
input layer of the fine-tuned MobileNetV2; next,
the feature-based representation generated, for each
chunk, at the output layer of such backbone, was pre-
sented at the input layer of the NN classifier in order
to be classified.
This way, in both models, the fine-tuned Mo-
bileNetV2 model provides the feature-based repre-
sentation for every chunk from the Chunk Dataset
that will be used to train the classifiers, as mentioned
in Section 4.4. More specifically, concerning the
deep MIML-based model, it is interesting to point out
that the fine-tuned MobileNetV2 model plays the role
of the feature-based-instance generator module men-
tioned in Section 2.2.
Further, the output layers of both classifiers repre-
sent the 12 game events (labels) mentioned in Section
4.1.
Both classifiers were also trained according to a 5-fold cross-validation method, with a 0.0005 learning rate used for training and testing. The chosen optimizer was Adadelta (Zeiler, 2012).
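A minimal sketch of the LSTM-based classifier and one training step is given below; the hidden size, the batch size and the sigmoid/BCE output over the 12 events are our assumptions, while the optimizer and learning rate follow the text:

import torch
import torch.nn as nn


class ChunkLSTMClassifier(nn.Module):
    """Sketch of the LSTM-based MIML classifier: it consumes the sequence of
    7 per-frame feature vectors that the fine-tuned MobileNetV2 produces for
    a chunk, and emits one score per game event."""

    def __init__(self, feat_dim=1280, hidden=256, n_events=12):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_events)

    def forward(self, chunk_feats):             # (batch, 7, 1280)
        _, (h_n, _) = self.lstm(chunk_feats)
        return self.head(h_n[-1])               # (batch, 12) event logits


model = ChunkLSTMClassifier()
optimizer = torch.optim.Adadelta(model.parameters(), lr=0.0005)
criterion = nn.BCEWithLogitsLoss()              # independent per-event targets

feats = torch.randn(8, 7, 1280)                 # one batch of chunk features
targets = torch.randint(0, 2, (8, 12)).float()  # multi-hot event labels
loss = criterion(model(feats), targets)
loss.backward()
optimizer.step()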
5 EXPERIMENTS AND RESULTS
This section presents the experiments carried out to
evaluate the performance of both models proposed in
this paper. Such an evaluation will be done through
the following two experiments: the first one aims to compare the efficacy of the fine-tuned backbone produced here with that of the standard MobileNetV2. The second evaluative experiment has as its purpose to compare the performance of the two MIML classifiers of game events investigated in this paper, that is, the LSTM and the DeepMIML.
5.1 Evaluation of the Feature Extractor
This subsection has as its purpose to make a com-
parative evaluation between the performance of the
fine-tuned MobileNetV2 model produced here (Sec-
tion 4.4) and the standard MobileNetV2 model.
Such a comparison was made through the same evaluative parameters used in the state-of-the-art work (Faria et al., 2022) to compare the performance of backbone NNs, that is, the Mean Accuracy and the Mean Loss. Table 2 shows that the fine-tuned MobileNetV2 model proposed in this work (mean accuracy 90.81% and mean loss 0.283) performs better than the standard MobileNetV2 model (mean accuracy 74.69% and mean loss 0.702). This makes sense, considering that ImageNet, from which the standard MobileNetV2 model was trained, is a dataset comprised of real-world images, whereas the Frame Dataset used to train the fine-tuned MobileNetV2 model was composed of images related to the problem faced here.
These good results obtained in this first experiment confirm the following contributions of this work: 1) the data generator script and the pre-processing method proposed were successful, since the labeled frames produced by them allowed for building an appropriate Frame Dataset to train the fine-tuned MobileNetV2 model; 2) the fine-tuned MobileNetV2 model, in fact, improves on the performance of the standard MobileNetV2, which led to its choice to perform feature extraction in both proposed models.
Table 2: Mean Accuracy and Loss Results.

                         Mean Accuracy   Mean Loss
MobileNetV2              74.69%          0.702
MobileNetV2-FineTuned    90.81%          0.283
5.2 Evaluation of the Classifiers
As explained before, this work proposes two distinct models to perform MIML event classification in game footage: 1) a first one, combining the fine-
tuned MobileNetV2 model as a feature extractor and
the sub-concept layer of the DeepMIML as a clas-
sifier, named fine-tuned-MobileNetV2 + DeepMIML-
SubConcept-Layer; and 2) a second model, combin-
ing the fine-tuned MobileNetV2 model as a feature
extractor and LSTM as a classifier, named fine-tuned-
MobileNetV2 + LSTM.
This section presents the results obtained in the second experiment, conceived to compare the performance of the LSTM-based and the DeepMIML-based models.
This experiment was carried out based on the same parameters used in the state-of-the-art work (Feng and Zhou, 2017) to compare the performance of MIML classifiers, that is: the Mean Average Precision, the F1-Score and the Hamming Loss (Section 2.1.1).
The results obtained in the experiments are pre-
sented in Table 3.
Table 3: Classifier Results.

                              mAP      Hamming Loss   F1
DeepMIML Sub-Concept Layer    87.92%   0.014          0.91
LSTM                          89.33%   0.011          0.93
The results show that the LSTM classifier performs slightly better than the DeepMIML on all evaluated metrics. In fact, the Mean Average Precision, the Hamming Loss and the F1-Score produced by the former and the latter were, respectively: 89.33%, 0.011, 0.93, and 87.92%, 0.014, 0.91. It is interesting to note that the same slight performance superiority of the LSTM model over the DeepMIML model observed here, dealing with MIML classification in game footage, was also observed in (Feng and Zhou, 2017), where the authors compared the performance of models based on the LSTM and on MIML classifiers coping with problems related to MIML classification in images.
The satisfactory results produced by both mod-
els in this second experiment, in addition to cor-
roborating the gains obtained with the data gener-
ator script and the pre-processing method proposed
here (since, as shown in section 4.3.2, the Chunk
Dataset used to train the classifier NNs was also pro-
duced from them), also confirm the following contri-
butions: 1) The efficiency of the algorithm that pro-
duce MIML samples (named as chunks) from the pre-
processed frames (since the Chunk Dataset is made
up of these chunks); 2) The effectiveness of both deep
NNs trained to perform MIML classification of game
events in footage: the LSTM and the deep MIML.
6 CONCLUSIONS AND FUTURE
WORKS
In order to extend the current state of the art in the
domain of performing MIML classification of game
events in gameplay footage (using Super Mario Bros
game as a case study), this work proposed two dis-
tinct deep learning models, in which the instances are
represented by chunks and the feature extraction is
made by a fine-tuned MobileNetV2 model produced
in this study. With respect to the deep NN classifier, the first model uses the sub-concept layer of the DeepMIML (so far, only tested in the domain of images and texts (Feng and Zhou, 2017)), whereas the second one uses the LSTM model (only investigated in (Faria et al., 2022) as a MISL approach to perform single label classification in footage of Super Mario). In order to produce such models, in this work
Mario. In order to produce such models, in this work
the following elements had to be implemented: 1) A
data generator script to automatically produce multi-
labeled frames from game footage (where the la-
bels correspond to game events); 2) A pre-processing
method to make the frames generated by the script
suitable to be used in the training datasets; 3) An
algorithm to produce MIML samples from the pre-
processed frames (each sample corresponds to a set
of frames, named chunk), and will be used to train
the deep classifier NN). The experiments confirmed
the higher performance reached by the fine-tuned MobileNetV2 compared to the standard MobileNetV2, and also showed that both models satisfactorily succeed in the task of performing MIML classification of game events in footage, even though the fine-tuned-MobileNetV2 + LSTM model performed a little better than the fine-tuned-MobileNetV2 + DeepMIML-SubConcept-Layer model.
Taking into consideration the fact that video games have also been widely used in the fields of medicine and psychology as valuable tools to study behavior, improve motor skills (especially hand-eye coordination), and assist with the treatment of developmental disorders (Squire, 2003), (Janarthanan, 2012), (Kozlov and Johansen, 2010), (Boyle et al., 2011), as future work, the authors intend to use the models produced herein in the implementation of a Super Mario Bros player agent endowed with the ability to stimulate the cognitive and/or motor skills of patients with some kind of syndrome (such as Down syndrome), in the following way: the events identified in the gameplay footage will represent the basis to perceive the actions performed by the game users and, through them, to map their possible cognitive weaknesses. From this information, the agent must adapt its decision-making engine to lead the game into situ-
ations that stimulate the cognitive development of its
users. The authors also aim to explore deep learn-
ing models based on transformers (Khan et al., 2022)
to extract high-level representations from gameplay
footage.
ACKNOWLEDGMENTS
To CAPES, for financial support.
REFERENCES
Boyle, E., Connolly, T. M., and Hainey, T. (2011). The role
of psychology in understanding the impact of com-
puter games. Entertainment Computing, 2(2):69–74.
Serious Games Development and Applications.
Faria, M. P. P., Julia, E. S., Nascimento, M. Z. d., and Ju-
lia, R. M. S. (2022). Investigating the performance
of various deep neural networks-based approaches de-
signed to identify game events in gameplay footage.
volume 5, New York, NY, USA. Association for Com-
puting Machinery.
Feng, J. and Zhou, Z.-H. (2017). Deep MIML network. In AAAI.
Gao, J., Yang, Z., and Nevatia, R. (2017). RED: rein-
forced encoder-decoder networks for action anticipa-
tion. CoRR, abs/1707.04818.
Geest, R. D., Gavves, E., Ghodrati, A., Li, Z., Snoek, C.,
and Tuytelaars, T. (2016). Online action detection.
volume abs/1604.06506.
Global Data (2021). Video games market set
to become a 300bn-plus industry by 2025.
https://www.globaldata.com/video-games-market-
set-to-become-a-300bn-plus-industry-by-2025.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Computation, 9(8):1735–1780.
Janarthanan, V. (2012). Serious video games: Games for
education and health. In 2012 Ninth International
Conference on Information Technology - New Gener-
ations.
Karakovskiy, S. and Togelius, J. (2012). The Mario AI benchmark and competitions. IEEE Transactions
on Computational Intelligence and AI in Games,
4(1):55–67.
Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S.,
and Shah, M. (2022). Transformers in vision: A sur-
vey. ACM computing surveys (CSUR), 54(10s):1–41.
Kingma, D. P. and Ba, J. (2015). Adam: A method for
stochastic optimization. In 3rd International Confer-
ence on Learning Representations, ICLR 2015, San
Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings.
Kozlov, M. D. and Johansen, M. K. (2010). Real behavior
in virtual environments: Psychology experiments in
a simple virtual-reality paradigm using video games.
Cyberpsychology, Behavior, and Social Networking.
Luo, Z., Guzdial, M., Liao, N., and Riedl, M. (2018). Player
experience extraction from gameplay video. CoRR,
abs/1809.06201.
Luo, Z., Guzdial, M., and Riedl, M. (2019). Making cnns
for video parsing accessible. CoRR, abs/1906.11877.
Neto, H. C. and Julia, R. M. S. (2018). Ace-rl-checkers:
decision-making adaptability through integration of
automatic case elicitation, reinforcement learning, and
sequential pattern mining. Knowledge and Informa-
tion Systems, 57(3):603–634.
Prena, K. and Sherry, J. (2018). Parental perspectives on
video game genre preferences and motivations of chil-
dren with down syndrome. Journal of Enabling Tech-
nologies, 12:00–00.
Ravanbakhsh, M., Nabi, M., Sangineto, E., Marcenaro, L.,
Regazzoni, C., and Sebe, N. (2017). Abnormal event
detection in videos using generative adversarial nets.
Roettl, J. and Terlutter, R. (2018). The same video game in
2D, 3D or virtual reality – how does technology impact
game evaluation and brand placements? PLOS ONE,
13:1–24.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR).
Song, L., Liu, J., Qian, B., Sun, M., Yang, K., Sun, M., and
Abbas, S. (2018). A deep multi-modal cnn for multi-
instance multi-label image classification. IEEE Trans-
actions on Image Processing, 27(12):6025–6038.
Squire, K. (2003). Video games in education. International
Journal of Intelligent Simulations and Gaming, 2:49–
62.
Xu, M., Gao, M., Chen, Y., Davis, L. S., and Crandall, D. J.
(2018). Temporal recurrent networks for online action
detection.
Yu, M., Bambacus, M., Cervone, G., Clarke, K., Duffy, D.,
Huang, Q., Li, J., Li, W., Li, Z., Liu, Q., Resch, B.,
Yang, J., and Yang, C. (2020). Spatiotemporal event
detection: a review. International Journal of Digital
Earth, 13(12):1339–1365.
Zeiler, M. D. (2012). Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Zhou, Z.-H., Zhang, M.-L., Huang, S.-J., and Li, Y.-F.
(2012). Multi-instance multi-label learning. Artificial
Intelligence, 176(1):2291–2320.