Automatically Generating Image Segmentation Datasets for Video Games
David Gregory LeBlanc and Greg Lee
Acadia University, Wolfville, Canada
Keywords:
Image Segmentation, Computer Vision, Machine Learning, Deep Reinforcement Learning, Video Games.
Abstract:
Image segmentation is applied to images fed as input to deep reinforcement learning agents as a way of
highlighting key features and removing non-key features. If a segmented image is of lower resolution than
its source, the problem is further simplified. However, the process of creating a dataset for the training of
an image segmenting network is long and costly if done manually. This paper proposes a methodology for
automatically generating an arbitrarily large image segmentation dataset with a specifiable segmentation res-
olution. A convolutional neural network trained for image segmentation using this automatically generated
dataset had higher accuracy than a network using a manually labelled training set. Furthermore, an image seg-
menting network trained on a dataset generated in this manner gave superior performance to an autoencoder
in reducing dimensionality while preserving key features. The method proposed was tested on Super Mario
Bros. for the Nintendo Entertainment System (NES), but the techniques could apply to any image segmenta-
tion problem where it is possible to simulate the placement of key objects.
1 INTRODUCTION
The motivation for this work originates from a desire
to discover a method of simplifying and generaliz-
ing state inputs to deep reinforcement learning agents,
especially in video game domains. There are many
video games, typically within the same genre, which
convey similar information with their graphics, such
as how most platforming games have the concept of
an enemy object, but utilize different sprites to dis-
play enemies. If the state fed to an agent uses a game-
independent encoding, then the agent should be capa-
ble of producing a more general solution than it would
with a game-dependent encoding, such as pixel data.
Super Mario Bros. (Nintendo, 1985) was chosen
as a subject for the proposed dataset creation method
because it is an NES game: one console generation
ahead of the Atari 2600, where state-of-the-art deep
reinforcement learning agents can already reliably
outperform humans using game-agnostic techniques.
Furthermore, Super Mario Bros. is part of the plat-
former genre, and there are many other well known
games that fall within the same category on the NES
to which the same segmentation scheme could be ap-
plied such as Mega Man (Capcom, 1987), Castlevania
(Konami, 1986), and Adventure Island (Hudson Soft,
1986).
The ability to specify a segmentation resolution
was another key requirement in this work, as a lower
segmentation resolution lowers the memory overhead
for a deep reinforcement learning (deep-RL) agent.
NES games have twice the number of pixels per frame
compared to any Atari 2600 game, such as Pong
(Atari, 1972), and later console generations increase
this gap further. Some problems are only possible
to solve with deep reinforcement learning if large
batches are used (Baker et al., 2020), and this may
include these more sophisticated video game environ-
ments. Thus, it is important that a state encoding is
capable of lowering the memory requirements so that
these batches, as well as the experience replay mech-
anisms from which they are sampled (Schaul et al.,
2016), are computationally affordable.
2 RELATED WORK
There are two branches in the state of the art for
state representation in the deep-RL video game do-
main. Firstly, there are those which outperform hu-
mans in modern games such as OpenAI Five in Dota
2 (Berner et al., 2019) which utilize a great deal
of game-specific information in their state represen-
tation. For instance, OpenAI Five utilizes roughly
16,000 inputs to its agent, many of which are mean-
ingless outside of the game of Dota 2 (e.g. “is Roshan
definitely dead?”, “time since seen enemy courier...”).
In contrast, Agent57 (Badia et al., 2020) utilizes
very little game specific information in its state repre-
sentation, and is capable of outperforming humans in
57 different Atari 2600 games (a much older domain
compared to Dota 2). Its state representation consists
only of pixel-data from the games being played with
some game-agnostic image preprocessing (e.g. con-
version to greyscale).
Image segmentation has been utilized in the do-
main of autonomous vehicles to simplify informa-
tion used to make driving decisions (Papadeas et al.,
2021). Popular datasets, used to create these seg-
mentation models, such as the Cityscapes Dataset, are
manually annotated, and can take as long as 1.5 hours
per sample to create (Cordts et al., 2016).
There are some techniques that have been ex-
plored for automating the creation of segmentation
datasets, such as in the domain of hand segmenta-
tion (Bojja et al., 2018). Lasso-type tools, such as
those found in Adobe Photoshop or GIMP, improve
the speed at which a human may segment an image.
3 APPROACH
Algorithm 1 details the proposed automatic data gen-
eration algorithm. If the segmentation resolution (the
size of the grid of segments) is lower than the source
resolution, Algorithm 2 is used to create the segmen-
tation grid. The state of each object is described by
its position in the scene, and the sprite that it is using.
Each sprite should have its own bounding box defined
for the purposes of Algorithm 2.
Static objects are those whose states do not change
past the initialisation of the scene. Semi-static objects
are those with special rules that outline a small num-
ber of states they can be in. Dynamic objects are those
which can appear anywhere in the scene, so long as
they do not overlap with another object. The distinc-
tion between semi-static and dynamic objects is made
because there are many objects in video games that
follow rules that are simple to simulate (semi-static),
and other objects that are difficult to simulate accu-
rately (dynamic objects). An example of a dynamic
object is a player character, where there are many pos-
sible states that the character can be in that depend
on the states of other objects in the environment. By
contrast, a semi-static object could be an animated but
immobile piece of terrain.
Applying this to Super Mario Bros., the main la-
bels are listed below in order of priority, with player
being the highest priority label and ground being the
lowest priority label:
Algorithm 1: The data generation algorithm for the platformer autolabeller.
1: Labels: A list of possible labels ordered by priority of segmentation
2: S: The background of the environment being simulated
3: O: Objects, which are static, semi-static, or dynamic. Each object has an associated label from Labels and a number of sprites which it can display.
4: N_needed: Number of samples needed
5: N_current: Number of samples created so far
6: Grid_{x,y}: Label located at (x, y) in segmentation grid
7: C_x, C_y: Camera x and y positions in S
8: Res_h: Horizontal image resolution
9: Res_v: Vertical image resolution
10: Screen: region(C_x, C_y, C_x + Res_h, C_y + Res_v)
11: Initialise S, create all O_static, O_semistatic
12: Define all valid C_x, C_y for camera
13: while N_current < N_needed do
14:   Assign random new valid C_x, C_y
15:   Destroy all O_dynamic
16:   Instantiate a random valid number of O_dynamic
17:   Place O_dynamic instances randomly within Screen, do not allow overlapping
18:   Randomize sprite used for each O_dynamic
19:   Update state of all O_semistatic
20:   Grid ← update_grid(Grid) {See Algorithm 2}
21:   Save Grid, pixel data of Screen as a sample
22:   N_current ← N_current + 1
23: end while
Algorithm 2: The grid update function for the platformer autolabeller.
1: CellW ← Res_h / width(Grid)
2: CellH ← Res_v / height(Grid)
3: for CellX ← C_x, CellX < Res_h, CellX ← CellX + CellW do
4:   for CellY ← C_y, CellY < Res_v, CellY ← CellY + CellH do
5:     CellRegion ← region(CellX, CellY, CellX + CellW, CellY + CellH)
6:     Find Label, s.t. Label is the label with highest priority associated with objects in CellRegion according to Labels
7:     Grid_{CellX,CellY} ← Label
8:   end for
9: end for
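For concreteness, a minimal Python sketch of Algorithm 2 is given below. The object fields it assumes (a world-space bounding box and an integer priority, with 0 reserved for the none label) are illustrative and not part of the paper's notation.

import numpy as np

def update_grid(objects, cam_x, cam_y, res_h, res_v, grid_size=15):
    """Label each grid cell with the highest-priority overlapping object.

    Assumes each object carries a world-space bounding box obj.bbox =
    (x0, y0, x1, y1) and an integer obj.priority, where 0 is the
    implicit 'none' label.
    """
    cell_w = res_h / grid_size
    cell_h = res_v / grid_size
    grid = np.zeros((grid_size, grid_size), dtype=np.int32)
    for gy in range(grid_size):
        for gx in range(grid_size):
            # Bounds of this cell in world coordinates.
            x0 = cam_x + gx * cell_w
            y0 = cam_y + gy * cell_h
            x1, y1 = x0 + cell_w, y0 + cell_h
            for obj in objects:
                ox0, oy0, ox1, oy1 = obj.bbox
                overlaps = ox0 < x1 and ox1 > x0 and oy0 < y1 and oy1 > y0
                if overlaps and obj.priority > grid[gy, gx]:
                    grid[gy, gx] = obj.priority
    return grid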
Player: the object the player controls (Mario,
Luigi). This is a dynamic object, and there is al-
ways exactly one per generated sample.
Enemy: any hostile object that can be defeated
normally (that is, by jumping on them, or with the
use of a power-up). These are dynamic objects,
and there are up to two of these in a given sam-
ple (there are typically fewer than two enemies at
a given time in normal gameplay).
Hazard: objects that harm the player on contact
and cannot be defeated normally (e.g. Bowser,
Piranha Plants).
Ground: objects that may be stood on. These are
a mixture of semi-static objects (e.g. moving plat-
forms that follow a set path) and static objects
(e.g. bricks).
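For concreteness, one possible integer encoding of this priority scheme is sketched below; the paper fixes only the ordering and the label range [0,4], so the specific values are an assumption.

from enum import IntEnum

class Label(IntEnum):
    # Assumed encoding: a higher value means a higher segmentation priority.
    NONE = 0    # no key feature in the region
    GROUND = 1  # standable terrain
    HAZARD = 2  # harmful, cannot be defeated normally
    ENEMY = 3   # harmful, can be defeated normally
    PLAYER = 4  # the player character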
This labelling scheme was designed in such a way
that it could apply to any platformer game, although
some may require some additional labels (perhaps
for power-ups, friendly projectiles, or interactable ob-
jects). Due to the automatic nature of the algorithm,
it is relatively straightforward to edit an object's clas-
sification. The main source of work for implementing
this algorithm lies in importing the relevant resources
for an object or scene.
Any region of the screen which does not fall into
one of the identified labels (e.g. score indicators or
background objects) is given a none label to indicate
that there are no key features in that region. The
player’s location is a component in all decision mak-
ing in platfomer games, hence its position at the top
of the priority list. Enemies and hazards both take
priority over the ground labels, as it is generally more
advantageous to avoid hazards or enemies than to be
cognisant of whether there is terrain in that same re-
gion. This is especially relevant in Super Mario Bros.
where enemies may be used as a form of terrain if they
are jumped on. Note that in Algorithm 1, dynamic
objects are placed randomly within a room with ran-
dom sprites, and this can produce screenshots that
are improbable or impossible to reproduce within the
game being simulated. Similarly, it is possible to mix
and match resources from multiple games within a
single dataset.
For example, the enemy sprites could be a mixture
of Super Mario Bros. and Castlevania sprites. The
reasoning behind keeping this behaviour is twofold:
perfectly simulating the source game slows down the
process of creating the simulation, and it may be that
the unconventional placement of sprites leads to the
generation of models that are more general. For in-
stance, it is very rare in Super Mario Bros. for the
player character to be at the very top of the screen, but
in a game with more verticality, such as Mega Man,
such scenarios are common.
For the experiments in this paper, the automatic
labelling algorithm was implemented in GameMaker
using assets from Super Mario Bros. The first world,
consisting of four levels, and the first level of world
2 were simulated. The automatic labeller could gen-
erate samples of the implemented levels at a rate of
165 samples per second. The process of adding a new
level to the simulation consisted of importing the level
as a background to a new room, labelling all the ter-
rain (ground labels), and adding in any special objects
(e.g. moving platforms). This process would take
roughly 1 hour per level added. Some elements of
the game were not added to the simulation; the GUI,
consisting of the white text and flashing coin sprite,
was only partially simulated. The coin would flash
as it would in normal gameplay, but all text elements
were left static in the automatically created sets.
In addition to the automatically created datasets,
a manually created dataset was also generated. The
manual labelling software used was created in Python
using Tkinter (Lundh, 1999) specifically for the pur-
poses of this research. The created software fea-
tures hotkeys to switch between label types and im-
ages to increase the speed of labelling. After some
practice using the software, an expert user could pro-
duce one sample every 15 seconds on average (or
0.07 samples per second). Note that this number is
based on the segmentation resolution being 15x15;
a higher segmentation resolution would result in a
slower labelling speed. The 15x15 segmentation res-
olution was chosen for both automatically generated
and manually generated datasets because, when di-
viding the resolution of Super Mario Bros. this way,
each cell of the segmentation grid corresponds to a
roughly 16x16 region of the source image (16x14.9
after overscan), and many sprites in Super Mario
Bros. are composed of 16x16 tiles.
The manual dataset and the small automatically
generated dataset consist of 3,734 samples each,
while the large automatic dataset consists of 1 million
samples. Furthermore, the large automatic dataset
took approximately 2 hours to generate unsupervised
in one session, the small automatic dataset took less
than one minute, while the manual dataset took ap-
proximately 24 hours to generate by hand, spread
across 5 sessions. In addition to the time spent la-
belling the manual dataset, an extra hour was spent
playing the game to generate sufficiently diverse
gameplay footage whose frames formed the images
to be labelled. This is necessary to prevent state bias
in a resultant model.
In theory, it is possible to generate the images
for manual labelling at a rate matching the game’s
framerate (60 frames per second in the case of Su-
per Mario Bros.), but in practice, many frames have
to be discarded as the gameplay contains irrelevant
images such as game-over screens and menus. In ad-
dition, normal gameplay produces a skewed dataset;
a skilled player may get the player character into a
powered up state, which has its own sprite, and never
reach the other player states that may be seen in the
game. Conversely, an unskilled player would produce
gameplay with few frames in the powered up state.
In addition to the videos whose frames were uti-
lized for the manually labelled dataset, other game-
play videos were recorded for testing purposes. That
is, another set of videos were created with frames
that did not appear in the datasets to act as a test set.
Since the intention for the creation of the test videos
is to benchmark models created from both the manu-
ally and automatically generated datasets, whose seg-
menting methodologies are different, no ground truth
segmentation is given for frames in the test videos.
Due to limitations in the recording environment, the
test videos all suffer from some compression arti-
facts, meaning that there is noise present in the test
videos that would not have been seen in the automatic
dataset. Since the manual dataset was derived from
similar videos, the manual dataset has the advantage
of sharing noise characteristics with the test videos.
To test the viability of the automatically created
dataset, three segmenting models were trained: one
using the full automatic dataset (large automatic seg-
menting model), another with the manual dataset
(manual segmenting model), and finally with one
trained on the small automatic dataset (small auto-
matic segmenting model). All used images from their
respective datasets as input to predict the segmenta-
tion given by the grid in the dataset. They were eval-
uated using test sets generated from the same dataset
they were trained on (although samples from the test
set were not given to the model during the training
phase), as well as frames from the test video.
The small automatic and manual segmenting mod-
els were subjected to 10-fold cross-validation to account
for their small datasets. In those cases, 187 samples
(5% of the total number of samples) of each dataset
were removed prior to the cross-validation process, so
that each model could be evaluated on a fixed test set.
The models with the highest macro-averaged F1-score
on their respective test sets were selected for further
evaluation using the test videos.
In addition to the segmenting networks, an autoen-
coder was trained on just the image components of the
automatic dataset. Unless otherwise stated in the ta-
bles, all non I/O layers of the neural networks made
use of a ReLU activation function, and all networks
were trained using a 1e-4 learning rate with an Adam
optimizer. Additionally, early stopping was applied
to all training sessions with a patience of 10 epochs
and a minimum delta of 1e-4. Mean Squared Error,
or MSE, was used as the loss function for all investi-
gated models.
Each segmenting model had its own model archi-
tecture and hyperparameters optimized for validation
accuracy on its respective datasets using a combina-
tion of a Hyperband tuner (Li et al., 2018) and hand-
tuning. Table 1 lists the values that were altered dur-
ing the tuning process. Values that do not appear in
the table were not tuned. Table 2, Table 3, and Table
4 give the model architectures for the small automatic
/ manual segmenting models, automatic segmenting
model, and autoencoder, respectively.
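As an illustration of the search described here, below is a minimal sketch of a Hyperband search over part of the Table 1 space using the keras-tuner package (pooling and dropout omitted for brevity); the input shape, output grid, objective, and epoch budget are assumptions, and the actual search also involved hand-tuning.

import keras_tuner as kt
from tensorflow.keras import layers, models, optimizers

def build_model(hp):
    # Search space loosely follows Table 1.
    model = models.Sequential([layers.Input(shape=(224, 240, 1))])
    for i in range(hp.Int("conv_layers", 2, 3)):
        model.add(layers.Conv2D(
            filters=hp.Choice(f"filters_{i}", [32, 64]),
            kernel_size=hp.Choice(f"kernel_{i}", [1, 2, 3, 4, 8, 16]),
            strides=hp.Choice(f"stride_{i}", [1, 2, 4, 16]),
            padding="same", activation="relu"))
    model.add(layers.Flatten())
    for j in range(hp.Int("dense_layers", 1, 3)):
        model.add(layers.Dense(
            hp.Int(f"units_{j}", 32, 512, step=16), activation="relu"))
    model.add(layers.Dense(15 * 15))   # one output per grid cell
    model.add(layers.Reshape((15, 15)))
    model.compile(
        optimizer=optimizers.Adam(hp.Choice("lr", [1e-3, 1e-4, 1e-5])),
        loss="mse")
    return model

tuner = kt.Hyperband(build_model, objective="val_loss", max_epochs=30)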
As mentioned above, there were some changes
to the networks during the model design process
that were not determined by the tuner: the manual
and small automatic segmenting models received max
pooling and dropout layers to help in overcoming the
small size of the source dataset and avoid overfit-
ting. While tuning the automatic segmenting model,
hyperparameters were tuned in batches using multi-
ple datasets, with smaller ranges for larger datasets.
During this process, datasets of sizes ranging from
100,000 samples up to the final 1,000,000-sample
dataset were created. All networks utilized the same
preprocessing pipeline:
Crop 8 pixels from each side of the image (over-
scan) to simulate how Super Mario Bros. would
be displayed in an emulator like in Gym Retro
(Nichol et al., 2018).
Convert the image to greyscale to reduce dimen-
sionality; the RGB channels are combined into
one greyscale channel.
Normalize all pixel values to the range [0,1] to
reduce absolute distance between similar colors.
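A minimal sketch of this pipeline, assuming a 256x240 RGB NES frame stored as a numpy array (the simple channel average used for greyscale conversion is an assumption):

import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    # Crop overscan, convert to greyscale, and normalize to [0, 1].
    frame = frame[8:-8, 8:-8]                 # drop 8 pixels per side
    grey = frame.mean(axis=-1)                # combine RGB channels
    return (grey / 255.0).astype(np.float32)  # normalize pixel values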
When evaluating the segmenting models, each
cell in the predicted segmentation grid has its value
rounded to the nearest integer (label) and clipped to
the range of possible labels, [0,4]. This way, per-class
accuracy can be calculated. To evaluate the autoen-
coder, reconstructions of frames from the test video
were reviewed by a human expert familiar with the
rules and appearance of Super Mario Bros. All ex-
periments were performed on a PC utilizing 32 GB of
RAM, a 3.5 GHz 8 core processor, and an NVIDIA
GeForce RTX 3070 GPU with 8GB of VRAM.
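The prediction post-processing described above might be sketched as follows; the variable names are illustrative:

import numpy as np

def to_label_grid(pred: np.ndarray) -> np.ndarray:
    # Round each predicted cell to the nearest label, clip to [0, 4].
    return np.clip(np.rint(pred), 0, 4).astype(np.int32)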
Table 1: The parameters that were altered during the tuning
process for the automatic and manual segmenting models.
Parameter Considered Values
# Convolutional layers 2, 3
Filters 32, 64
Stride 16, 4, 2, 1
Kernel Size 16x16, 8x8, 3x3, 4x4, 2x2, 1x1
Max pooling Yes, No
# Dense Layers 1, 2, 3
Dense Units [32,512], steps of 16
Learning Rate 1e-3, 1e-4, 1e-5
Dropout value 0.0, 0.1
Table 2: Neural network architecture for the manual and
small automatic segmentation networks.
Layer Type Details
Input -
Conv2D 64 filters, stride 4, 16x16 kernel
Max pooling -
Conv2D 64 filters, stride 2, 8x8 kernel
Max pooling -
Flatten -
Dense 512 units
Dropout 0.1 chance
Dense 512 units
Dropout 0.1 chance
Dense No activation, n_outputs units
Reshape Reshape to 2D for output
Table 3: Neural network architecture for the segmenting
model trained on the full automatically generated set.
Layer Type Details
Input -
Conv2D 32 filters, stride 4, 4x4 kernel
Conv2D 64 filters, stride 2, 2x2 kernel
Conv2D 64 filters, stride 1, 1x1 kernel
Flatten -
Dense 128 units
Dense No activation, n_outputs units
Reshape Reshape to 2D for output
Table 4: Neural network architecture for the autoencoder
trained on the automatically generated set.
Layer Type Details
Input -
Conv2D 16 filters, stride 4, 16x16 kernel
Conv2D 16 filters, stride 2, 8x8 kernel
Dense No activation, n_outputs units
Conv2D-Transpose 16 filters, stride 2, 8x8 kernel
Conv2D-Transpose 16 filters, stride 4, 16x16 kernel
Conv2D 1 filter, stride 1, 3x3 kernel
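To illustrate how these tables translate to code, below is a minimal Keras sketch of the Table 3 architecture together with the training setup described above (Adam at 1e-4, MSE loss, early stopping with patience 10 and minimum delta 1e-4); the 224x240 greyscale input shape and 15x15 output grid are assumptions based on the preprocessing and segmentation resolution discussed earlier.

import tensorflow as tf
from tensorflow.keras import layers, models

GRID = 15  # segmentation resolution

model = models.Sequential([
    layers.Input(shape=(224, 240, 1)),                   # greyscale frame
    layers.Conv2D(32, 4, strides=4, activation="relu"),
    layers.Conv2D(64, 2, strides=2, activation="relu"),
    layers.Conv2D(64, 1, strides=1, activation="relu"),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(GRID * GRID),                           # no activation
    layers.Reshape((GRID, GRID)),                        # 2D label grid
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse")
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, min_delta=1e-4)
# model.fit(images, grids, validation_split=0.1, callbacks=[early_stop])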
4 EXPERIMENTS
4.1 Segmentation Comparison
The manual and the automatic segmenting models
gave near-perfect accuracy in labelling the ground and
non-key features (none label), with 92-99% accuracy
across the different models. Given that the two most
common labels are ground and none, this result is ex-
pected. These two labels are also the two least vari-
able; configurations of ground and empty space do
not change for a given place in a level, simulated or
not, with the exception of moving platforms, which
comprise a small portion of total ground labels.
The largest performance difference between the
segmenting models is that the large automatic seg-
menting model demonstrates much higher accuracy
in correctly labelling player, enemy, and hazard ob-
jects. The smaller models tend to over-predict ground
labels, likely because ground is both the most com-
mon label and the lowest-valued non-zero label in
the set.
With the information from the confusion matrices,
it was calculated that the manual segmenting model
achieved a macro-averaged F1-score of 0.52±0.00 on
its test data (41,850 predictions), the large automatic
segmenting model achieved a macro-averaged F1-score
of 0.88 on its test set (921,600 predictions), and the
small automatic segmenting model achieved a macro-
averaged F1-score of 0.48±0.01 (41,850 predictions).
By all metrics, the small automatic segmenting model
performs slightly worse than the manual model, though
this gap is small compared to the gap between either
of them and the large automatic model. This is most
likely because the manually created data is much more
tightly correlated than the automatically generated
data, as it is created through gameplay, which follows
stricter rules than the automatic sample generation
process. For example, in all samples in the manually
generated set, Mario (labelled player) tends to be close
to the ground due to the in-game gravity; by contrast,
the automatically generated sets placed Mario at any
onscreen position with equal probability.
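For reference, the macro-averaged F1-score over flattened segmentation grids can be computed as in the sketch below (variable names are illustrative):

import numpy as np
from sklearn.metrics import f1_score

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Grids of shape (n_samples, 15, 15); every cell is one prediction.
    return f1_score(y_true.ravel(), y_pred.ravel(), average="macro")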
The large automatic segmenting model achieves
higher accuracy than the manual segmenting model
on the dynamic and semi-static objects (hazards, en-
emies, and the player) even though those same ele-
ments were more variable in the automatically gen-
erated set than the manually generated set. This sug-
gests that the higher variability may be overcome with
a sufficient number of samples. This is of course
very feasible given the multiple orders of magnitude
in time advantage the automatic approach has over the
manual approach.
Most misclassifications by the models are between
numerically adjacent classes, which might be reme-
died by encoding the segmentation grids with a one-hot
encoding instead of a unique integer per label, as
sketched below.
Even on samples from the test set, like the compari-
son shown in Figure 2, the large automatic segment-
ing model produced more reasonable predictions of
the encoding. This is despite the test images contain-
ing elements that were not simulated in the training
data. For example, the numbers in the GUI at the top
of the screen are different from what they are any-
where in the training set. Indeed, the GUI region in
the test image is erroneously categorised as ground
in several cells of the prediction in Figure 2, which
may account for some of the false positives predicted
in the ground class.
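A sketch of the one-hot alternative, assuming the integer labels in [0,4] used throughout:

import numpy as np

NUM_LABELS = 5  # none, ground, hazard, enemy, player (order assumed)

def one_hot_grid(grid: np.ndarray) -> np.ndarray:
    # Convert a (15, 15) integer label grid to shape (15, 15, 5).
    return np.eye(NUM_LABELS, dtype=np.float32)[grid]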
In Figure 2 the differences between the labelling
schemes of the datasets can be seen. The manually la-
belled dataset featured closer-fitting segmentations to
most objects, thus the thinner labels for the pipes and
player. The predictions for the automatic segmenting
model could be made closer fitting by adjusting the
bounding boxes of the relevant sprites.
The results from the automatic segmenting model
on the test video demonstrate that a model is capable
of overcoming the noise introduced by the video com-
pression artifacts. This could be useful in situations
where it is not possible to capture noise-free footage.
In summary, the large automatic segmenting
model outperformed the manual segmenting model in
terms of per-class accuracy and summary F1-score,
and neither the imperfect simulation performed to cre-
ate the automatically generated set nor the test video’s
compression artifacts prevented the automatic seg-
menting model from generalizing to footage from the
actual game. Much of the large automatic segmenting
model's success may be due to the larger training set.
However, the sample creation rate of the proposed
algorithm is higher than that of the manual approach
(165 samples per second compared to 0.07 samples
per second, a difference of more than three orders of
magnitude), and this gap would only widen at a higher
segmentation resolution, since manual labelling slows
as the resolution increases. That is, the large automatic
model's performance advantage is a direct consequence
of the increased sample creation rate that the automatic
approach offers over the manual approach.
4.2 Dimensionality Reduction and Key
Feature Preservation
Figure 3 shows the reconstruction of an image from
the test set as created by the autoencoder.
(a) The confusion matrix for the manual model. 187 samples from
the original dataset were removed from the training set to
form the test set for this confusion matrix.
(b) The confusion matrix for the large automatic model. An
additional 4096 samples were generated to form the test set.
(c) The confusion matrix for the small automatic model.
Like the manual model, 187 samples from the set were re-
moved to form the test set.
Figure 1: Confusion matrices for the segmenting models.
Figure 2: The predicted segmentation for the segmenting models on an image from a test video. From left to right: the image
being used as input, the manual model’s prediction, the small automatic model’s prediction, and the large automatic model’s
prediction. The bottom source image depicts Mario between two pipes (ground) and a Piranha Plant (hazard).
Figure 3: A reconstruction of a screenshot from Super
Mario Bros. created by the autoencoder (above) and the
source (below).
Compared to the predicted encodings created by the
automatic segmenting model (Figure 2), the recon-
struction shows a decrease in key-feature preserva-
tion. For example, in all test frames
reviewed (a two-minute, 60 fps test video showing recon-
structions alongside the source footage), neither the
player character nor any enemies that were present in
the source frames could be distinguished in the recon-
struction by a human proficient in playing the game.
Furthermore, some of the reconstructions did not con-
tain all of the terrain that was present in the source
image. In the case of the reconstruction shown in Fig-
ure 3, the topmost bricks are completely missing in
the reconstruction.
In contrast, when shown the source frames of the
same video alongside predictions
made by the automatic segmenting model (Figure 2),
the human expert could identify most key labels and
their associated objects in both images throughout the
video. The automatic segmenting model utilized a
15x15 segmentation resolution, and the autoencoder
utilized a 15x15 latent space. That is, given the same
space to encode key features from the source frames,
the automatic segmenting model preserved more of
the key features than the autoencoder as judged by a
human expert, to the extent that the autoencoder did
not preserve any enemies or player characters.
5 CONCLUSION AND FUTURE
WORK
The proposed algorithm for automatically generating
an image segmentation dataset is capable of producing a
dataset that, when used to train a segmenting network,
leads to a more effective network compared to one
that is trained on a manually labelled dataset created
over a longer period of time.
It is understood that there are some additional
costs to using this algorithm over the manual ap-
proach:
Sufficient expertise in the environment is required
to craft a sufficiently realistic simulation (e.g. ob-
jects need to be classified as static, semi-static, or
dynamic with proper behaviours).
Assets from the simulated environment must be
available, or else close replicas need to be pro-
duced.
However, the automatic labelling approach offers
a number of advantages over the manual method:
Given the greater rate of sample creation com-
pared to the manual approach, it is easier to re-
generate the dataset with altered parameters such
as new labels or new objects.
Automatic labelling leads to perfect consistency
in the labelling process; human error is contained
within the parameter setting process.
One human expert may have control over the
dataset’s parameters, rather than having a hu-
man expert train a number of less experienced
workers.
The automatic labelling approach allows the com-
bination of assets which otherwise would not be
seen together, potentially leading to a dataset that
could create a more general model.
In the autoencoder experiment, with a matched latent
space size and segmentation resolution of 15x15, the
automatic segmenting model outperformed the auto-
encoder at maintaining key features. That result sug-
gests that being able to specify the segmentation res-
olution of the dataset is a useful tool when seeking to
minimize the amount of space used to summarize key
features.
It may be possible to generate a dataset by treat-
ing all game objects as dynamic with the proposed
algorithm, or in other words, placing all game assets
randomly within a screen with no adherence to game
rules. Such an approach was not tested for this pa-
per, as it was assumed that a more realistic dataset
should be used to create more accurate segmentation
models. However, this may be worthy of additional
experimentation.
One of the areas deemed most important in further
evaluating the overarching segmentation approach is
to create deep reinforcement learning agents that use
a segmented encoding as state input, powered by a
model trained on a dataset created with the proposed
methods. This may reveal whether a segmented state
input is a useful component in creating agents capa-
ble of exceeding human performance across a broader
state space, perhaps one that even spans multiple en-
vironments. The utilization of a low segmentation
resolution in the dataset could increase the number
of samples that could be stored in an experience re-
play mechanism, as well as the number of samples
that could be used in a batch.
Another potential avenue of future work is apply-
ing a similar algorithm to a 3D environment, in par-
ticular automatically creating a traffic dataset for the
purposes of training an autonomous driving agent.
In conclusion, the automatic labelling approach is
an effective way of lowering the time cost of dataset
generation over manual methods. Generating data in
this way enables rapid experimentation with image
segmentation parameters, and as such it should be
used to determine the effectiveness of segmentation
as input to deep reinforcement learning agents.
REFERENCES
Atari (1972). Pong. Atari 2600.
Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P.,
Vitvitskyi, A., Guo, D., and Blundell, C. (2020).
Agent57: Outperforming the Atari human bench-
mark. arXiv:2003.13350 [cs, stat].
Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell,
G., McGrew, B., and Mordatch, I. (2020). Emer-
gent tool use from multi-agent autocurricula.
arXiv:1909.07528.
Berner, C., Brockman, G., Chan, B., Cheung, V., Denni-
son, C., Farhi, D., Fischer, Q., Hashme, S., Hesse,
C., Józefowicz, R., Gray, S., Olsson, C., Pachocki,
J., Petrov, M., Salimans, T., Schlatter, J., Schneider,
J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and
Zhang, S. (2019). Dota 2 with large scale deep rein-
forcement learning. arXiv:1912.06680 [cs].
Bojja, A. K., Mueller, F., Malireddi, S. R., Oberweger, M.,
Lepetit, V., Theobalt, C., Yi, K. M., and Tagliasac-
chi, A. (2018). Handseg: An automatically labeled
dataset for hand segmentation from depth images.
arXiv:1711.05944 [cs].
Capcom (1987). Mega Man. Nintendo Entertainment Sys-
tem.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele,
B. (2016). The cityscapes dataset for semantic urban
scene understanding.
Hudson Soft (1986). Adventure Island. Nintendo Entertain-
ment System.
Konami (1986). Castlevania. Nintendo Entertainment Sys-
tem.
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A.,
and Talwalkar, A. (2018). Hyperband: A novel
bandit-based approach to hyperparameter optimiza-
tion. arXiv:1603.06560 [cs, stat].
Lundh, F. (1999). Tkinter.
Nichol, A., Pfau, V., Hesse, C., Klimov, O., and Schulman,
J. (2018). Gotta learn fast: A new benchmark for gen-
eralization in RL. arXiv preprint arXiv:1804.03720.
Nintendo (1985). Super Mario Bros. Nintendo Entertain-
ment System.
Papadeas, I., Tsochatzidis, L., Amanatiadis, A., and
Pratikakis, I. (2021). Real-time semantic image seg-
mentation with deep learning for autonomous driving:
A survey. Applied Sciences, 11(19):8802.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016).
Prioritized experience replay. arXiv:1511.05952 [cs].