Better the Phish You Know: Evaluating Personalization in Anti-Phishing Learning Games

Rene Roepke¹,∗, Vincent Drury²,∗, Ulrike Meyer and Ulrik Schroeder

Learning Technologies Research Group, RWTH Aachen University, Ahornsstr. 55, 52074 Aachen, Germany

IT-Security Research Group, RWTH Aachen University, Mies-v.-d.-Rohe-Str. 15, 52074 Aachen, Germany

Keywords:

Anti-Phishing Education, Game-based Learning, Personalization, User Study, Gameplay Analysis.

Abstract:

Anti-phishing learning games present a motivating, interactive approach to user education and thus, various

games have been developed and studied in the past. A common trend among these games is a limited use

of game mechanics and no consideration of learners using methods of personalization. In this paper, we

compare an anti-phishing learning game with its personalized version in the scope of a longitudinal user study

with 89 participants. For personalization, the player’s familiarity with different services is used to provide

personalized content in the form of URLs in the game. To further understand the effects of personalization,

we analyze game log data and evaluate how players interact with personalized learning game content. While

the comparison of both game versions did not yield signiﬁcant differences in the participants’ performance in

URL tests, the in-game analysis conﬁrmed that players interact differently when confronted with URLs based

on services they are not familiar with compared to those they use or know. These differences when handling

unknown URLs in the in-game analysis might indicate that personalization could be leveraged to improve

awareness and the knowledge transfer to the real world.

1 INTRODUCTION

A common threat to Internet users worldwide is

phishing, “a scalable act of deception whereby im-

personation is used to obtain information from a tar-

get” (Lastdrager, 2014). Current trend reports ob-

serve high numbers of newly created phishing web-

sites (APWG, 2021) as well as clicks on phishing

links (Kaspersky, 2021). While phishers employ a

diverse repertoire of attack vectors, including email,

instant messaging, and even voice phishing (Aler-

oud and Zhou, 2017), these trend reports indicate

that links to phishing websites still present an im-

minent threat to users. Teaching users to recognize

potentially malicious URLs, and therefore phishing

websites and malicious links, can help alleviate the

problem. Therefore, researchers have explored differ-

ent approaches to user education, ranging from tra-

ditional awareness campaigns to user training using

simulated phishing attacks or game-based learning.

∗ These authors contributed equally.

While various anti-phishing learning games have been proposed in the past, existing games commonly rely on limited game mechanics and fail to consider the learner by means of personalization (Roepke et al., 2020a). With phishing being an imminent threat, users will be presented with various phishing messages claiming to be from services they know as well as from services they do not know. Depending on the case, users can apply different strategies to recognize phishing and protect themselves. As existing anti-phishing learning games do not yet consider the learners’ familiarity with different services, they fail to reflect this situation, which presents a research gap in the field of game-based anti-phishing education.

Considering the learners’ familiarity with services in anti-phishing learning games can enable new approaches for elaborated feedback or adaptive gameplay to support the learning experience. Furthermore, the use of more relevant services or a more realistic decision strategy, which considers the learners’ familiarity with a service, might have a positive impact on their awareness in a real-world attack. To achieve this, personalization needs to be implemented, e.g., using the conceptual approach and framework for personalization of anti-phishing learning games as presented in (Roepke et al., 2021b). Consequently, these implementations need to be compared with traditional, non-personalized games to better understand the advantages and benefits of personalization. An exploratory analysis of gameplay using detailed event log data would allow even more insights into personalization and its effects on players.

Roepke, R., Drury, V., Meyer, U. and Schroeder, U.
Better the Phish You Know: Evaluating Personalization in Anti-Phishing Learning Games.
DOI: 10.5220/0011042100003182
In Proceedings of the 14th International Conference on Computer Supported Education (CSEDU 2022) - Volume 2, pages 458-466
ISBN: 978-989-758-562-3; ISSN: 2184-5026

In this paper, we present a comparative user study (N = 89) of an existing learning game and its personalized version in a pre-/post-test design. The game prototype was previously presented in (Roepke et al., 2021a) and personalized using the personalization framework presented in (Roepke et al., 2021b). Additional longitudinal tests (N = 36) as well as an in-game analysis of the participants’ gameplay (N = 49) allow further insights into how personalization affects the participants’ performance and behavior. While the results of our comparison of the game’s personalized and non-personalized version in the post-test are inconclusive, in that personalization did not outperform the traditional version of the game, the analysis of in-game behavior using game log data revealed differences in players’ actions. As expected, the results show that players’ classification accuracy differs for different levels of familiarity, i.e. players show difficulties when classifying URLs of unknown services. We therefore demonstrate that there are definite advantages to using the personalized version, and propose several possible avenues for future research.

2 RELATED WORK

This paper describes a comparative study of a per-

sonalized anti-phishing learning game with a non-

personalized version and explores the effects of game

content personalization. In prior studies on game-

based anti-phishing education, different approaches

have been evaluated and the effectiveness of games

for anti-phishing education has been shown for dif-

ferent user groups (Sheng et al., 2007; Canova et al.,

2015; Drury et al., 2022). However, existing games have been criticized because their design may limit the potential learning outcomes and does not consider the learners and their familiarity with the learning content, e.g., by personalizing the presented URLs which have to be classified as either malicious or benign within the game (Roepke et al., 2020a). So far, personalization of anti-phishing learning games has not been explored or even implemented for possible evaluation in user studies. There are, however, other types of

anti-phishing educational material that have explored

personalization or customization. In particular, re-

searchers have taken a look at spear phishing, a more

sophisticated type of phishing that is tailored towards

a recipient, and whether customized training can help

prevent it. In (Kumaraguru et al., 2008), the em-

bedded training against spear phishing was explored,

showing that customized content led to an advantage

when detecting spear phishing attacks compared to

regular educational material.

Beyond anti-phishing learning games, personal-

ization of games has been the subject of different

research projects (Law and Rust-Kickmeier, 2008;

Kickmeier-Rust and Albert, 2010). Here, adaptiv-

ity on a micro or macro level has been implemented

to provide personalized storytelling or dynamic dif-

ﬁculty adjustment, i.e. adaptive gameplay which

matches difﬁculty to players’ skill level. Personaliza-

tion through adaptivity focuses more on sequencing

and structuring of learning content and less on actual

adaptation of the content itself. As we did not ﬁnd

any projects using game content personalization, the

respective research area still has a potential to be ex-

plored. However, outside the educational domain, re-

search on game content generation (Dey and Konert,

2016) may provide interesting approaches.

Since neither implementation nor evaluation of

personalized anti-phishing learning games has been

done prior to this work and existing games fail to

consider individual learners (Roepke et al., 2020b),

we identify an untapped potential for the personaliza-

tion of anti-phishing learning games to provide a more

suitable game-based learning environment which sup-

ports learners in their different learning contexts. Fur-

thermore, a comparison to existing non-personalized

games could yield meaningful insights regarding the

effectiveness of personalized games. With recent work introducing a concept (Roepke et al., 2020b) as well as an implementation of a personalization framework for anti-phishing learning games (Roepke et al., 2021b), the natural next step is to conduct a study comparing personalized and non-personalized versions of a game. In addition, an exploratory analysis

of gameplay may provide insights into players’ be-

havior when dealing with different learning content.

3 STUDY SETUP

For the comparison of a personalized and a non-

personalized version of a learning game, we chose

a between-group design in a pre-/post-test setup in-

cluding an additional longitudinal test. The study was

performed in two batches: the ﬁrst group played the

non-personalized game in November 2020, and an-

other group of participants played the personalized

version in May 2021. While the games serve as the in-

dependent variables, the performance and conﬁdence

in pre-, post- and longitudinal tests serve as depen-

dent variables. This allows for a comparison of the


effect of personalization as well as the exploration

of in-game behavior in the personalized game. Ad-

ditionally, the results of the longitudinal study were

analyzed to gain insights into knowledge retention as

well as several self-reported characteristics of the par-

ticipants after playing either one of the games. Our

study was therefore designed to answer the following

research questions (RQs):

1. How does personalization affect the participants’

performance/conﬁdence in classifying URLs?

2. How do the participants’ performances change in

pre-, post- and longitudinal test?

3. How does personalization (i.e. familiarity with

services) affect in-game behavior?

3.1 Games and Personalization

For the main intervention in this study we used the

learning game prototype “All sorts of Phish” pre-

sented in (Roepke et al., 2021a). In the following, we

refer to it as the analysis game. The analysis game

teaches the basics of the URL structure and differ-

ent manipulation techniques used to create malicious

URLs and deceive users in phishing attacks. The URL

structure is explained by introducing three main parts:

subdomain, registrable domain, and path. For each

part, different manipulation techniques are presented

to understand how phishers create malicious and de-

ceiving, but also valid URLs.
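The three-part URL structure described above can be illustrated with a minimal Python sketch. It is not taken from the game's code base. The function name `split_url` and the example path `/login` are hypothetical, and the registrable domain is approximated as the last two host labels, which is an assumption that fails for multi-label suffixes such as `co.uk` and for IP address hosts.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into subdomain, registrable domain, and path.

    Assumption: the registrable domain is approximated as the last two
    host labels. A real implementation would consult the Public Suffix
    List and treat IP address hosts separately.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    return {
        "subdomain": ".".join(labels[:-2]),
        "registrable_domain": ".".join(labels[-2:]),
        "path": parts.path,
    }


# Host taken from a "Subdomain" phishing URL of the study; the path is made up.
example = split_url("https://www.commerzbank.de-account.support/login")
print(example)
```

In this example the familiar name only appears in the subdomain, while the registrable domain is `de-account.support`, which is the part a player would have to inspect to recognize the manipulation.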

The game utilizes a sorting mechanic where play-

ers have to analyze and classify given URLs into dif-

ferent categories by sorting them into different buck-

ets (see Figure 1). Each bucket represents a spe-

ciﬁc URL category derived from applied manipula-

tion techniques for phishing URLs. The categories

are based on the URL structure and indicate where

the original domain or deceptive keyword is present.

As such, the considered categories are: “IP”, “Ran-

dom”, “RegDomain”, “Subdomain”, and “Path”. Fur-

thermore, buckets for benign URLs (“No-Phish”) and

for discarding unknown URLs (“No idea”) are avail-

able. The more elaborate sorting mechanic extends

the state-of-the-art as most games rely on a binary de-

cision scheme in which players only classify URLs as

benign or malicious. The extended classiﬁcation al-

lows for more insights into the decision process and

can reveal players’ misconceptions (e.g., by analyzing

classiﬁcation outcomes for different URL categories;

see Section 5.2).

https://gitlab.com/learntech-rwth/erbse/analysis-game, online, accessed 2022-02-18

Figure 1: Level of “All sorts of Phish”. Players have to classify given URLs, which are hidden behind coins.

In our approach to extend the current state of the art, we adapted the analysis game by utilizing a personalization framework to provide personalized learning game content (Roepke et al., 2021b). The new version of the game is referred to as the personalized game.

The framework ﬁrst provides a selection interface for

players to select services they either use, know but

do not use, or do not know at all from a set of ser-

vices (e.g. “PayPal”, “eBay”). The players’ selec-

tion is then used to compute a learner model, an ab-

stract representation of the learner’s characteristics

(Bull, 2004), in this case realized as information about

the players’ familiarity with different services. Next,

URLs for all URL categories are created using a URL generator which applies different manipulation techniques to base URLs of a given service (e.g. manipulation of the registrable domain, subdomain or path).

The learner model is used as input to the generator

such that a set of URLs for different types of service

can be created (i.e. services that players use, know

but do not use, or do not know). Generated URLs are

then embedded into the game to provide a personalized version of the game for individual players. The game purposefully includes a number of less well-known services to understand how participants handle such unknown services. The current version of the game presents

known and unknown URLs at a 4:1 ratio. Due to ran-

domness implemented in the rules used by the URL

generator, returning players will encounter different

URLs compared to previous gameplay sessions. Be-

yond personalization, the game fully supports event

logging of all in-game actions, timings and results.
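To make the role of the learner model more concrete, the following Python sketch shows one possible way to mix familiar and unknown services at a 4:1 ratio. It is an illustration under assumptions, not the framework's implementation: the `Familiarity` enum, the function `pick_services`, the dictionary representation of the learner model, and the service name "ExampleShop" are all hypothetical, and we assume that "known" in the 4:1 ratio covers both used and known services.

```python
import random
from enum import Enum


class Familiarity(Enum):
    USED = "used"
    KNOWN = "known"
    UNKNOWN = "unknown"


def pick_services(learner_model, n_urls, rng, familiar_to_unknown=(4, 1)):
    """Pick one service per URL slot, mixing familiar and unknown services.

    learner_model maps a service name to a Familiarity value (hypothetical
    representation). Used and known services both count as familiar here.
    Assumes that both groups contain at least one service.
    """
    familiar = [s for s, f in learner_model.items() if f is not Familiarity.UNKNOWN]
    unknown = [s for s, f in learner_model.items() if f is Familiarity.UNKNOWN]
    k, u = familiar_to_unknown
    n_unknown = n_urls * u // (k + u)
    n_familiar = n_urls - n_unknown
    picks = [rng.choice(familiar) for _ in range(n_familiar)]
    picks += [rng.choice(unknown) for _ in range(n_unknown)]
    rng.shuffle(picks)  # avoid presenting all unknown services at the end
    return picks


model = {
    "PayPal": Familiarity.USED,
    "eBay": Familiarity.KNOWN,
    "ExampleShop": Familiarity.UNKNOWN,  # made-up service name
}
picks = pick_services(model, n_urls=10, rng=random.Random(0))
print(picks)
```

Each selected service would then be handed to the URL generator, which applies one of the manipulation techniques to the service's base URL.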

By utilizing the personalization framework, we

are able to create a personalized version for each par-

ticipant of our study and thus, we are able to compare

the personalized and non-personalized versions of the

analysis game as well as explore in-game behavior of

players of the personalized game using event log data.


3.2 Procedure

The study was conducted as a remote lab study us-

ing video conferencing software and a web browser

on participants’ devices. It was structured into ﬁve

phases: (1) For the introduction, participants were

briefed about the topic of the study and presented

with a deﬁnition of phishing. (2) Next, participants

were presented with the pre-test part of the survey.

(3) After ﬁnishing the pre-test, the survey software di-

rected participants to either the analysis game or the

personalized game. (4) After playing either one of

the games, participants returned to the survey for the

post-test. (5) When all participants ﬁnished the sur-

vey, a debrieﬁng informed the participants about the

overall goal of the study and answered open questions

before closing the session.

Participants were asked to start the survey and pro-

ceed at their own pace, as no further instructions were

necessary. In case of questions or if technical sup-

port was needed, participants were able to immedi-

ately contact the instructors and receive help without

disrupting other participants in continuing the study.

The participants were not told that different games

would be tested, nor did they know which group they

were assigned to.

For the longitudinal test, all participants were contacted three months after the original study and invited to take part within a four-week time frame.

The longitudinal study did not require additional ex-

pert support and only contained a two-part survey, as

described in the next section.

3.3 Apparatus and Materials

The following questionnaires were used in the differ-

ent parts of the study:

• URL Test: A test consisting of 20 (pre) and 30

(post, longitudinal) URLs to be classiﬁed as ei-

ther benign or phishing URLs. It also includes

a question regarding the participants’ conﬁdence

in their decision for each URL using a 6-point

Likert scale. The test was included to measure

the overall effect of the interventions, including

the comparison of URLs of familiar and unknown

services. For the post-test as well as longitudi-

nal testing, ten additional URLs were provided

to check for potential learning bias. A list of all

URLs used in the longitudinal test can be found

in Table 1, while the URLs used in pre- and post-

test can be found in (Drury et al., 2022).

• Recognition of Services (post/longitudinal): A questionnaire listing all services that were used to create URLs of the URL tests for participants to select for each service whether they (a) use it, (b) do not use it, but know it, or (c) whether it is unknown to them. This test was included to be able to analyze the effect of familiarity with a service on classification performance and confidence in the URL tests.

• Demographics: Questionnaire which is used to collect demographic data, including age, gender and educational background.

• Behavioral Change Questionnaire: Consists of nine items about participants’ behavior towards phishing after participating in the pre-/post-test part of the study (see Table 2). Dividing the items into four categories (with Cronbach’s α reliability) provides insights into self-reported application of knowledge (App, α = .861), interest in learning more about security using games (Int, α = .629), behavioral change (BC, α = .830) and the perception of phishing as a threat (PT, α = .782). The items use a 6-point Likert scale (1 = “strongly disagree” to 6 = “strongly agree”).

Table 1: URLs of URL Test in longitudinal test; for URLs used also in pre- and post-test, see (Drury et al., 2022).

URL                                                     Category
https://www.facebook.com/login/device-based/re...       Benign
https://www.dropbox.com/login                           Benign
https://www.twitch.tv/                                  Benign
https://www.45m64or.ru/NZYJolaEiBSOSOC...               Random
https://mobile-support.de/en/auth/login?client id=0...  RegDomain
https://meine.deutssche-bank.de/?client id=HyB...       RegDomain
https://www.fodus.de/ajax/login/                        RegDomain
https://idealo.de%76%73%6C%38%6A%6D%31...               RegDomain
https://www.commerzbank.de-account.support/...          Subdomain
https://login.live.com.id.online/de/login.exe?to=%...   Subdomain

Table 2: Behavioral Change Questionnaire.

Item  Statement
App1  I have been using the things I learned in the game during the past months.
App2  Since playing the learning game, I have been checking the URLs of websites before I click on them.
App3  Since playing the learning game, I have been checking the URLs of websites before I enter personal data (e.g., account credentials).
Int1  Playing the learning game has raised my interest in phishing or other IT security topics.
Int2  I would like to learn more about phishing or other IT security topics by playing learning games.
BC1   Since playing the learning game, I have become more aware of phishing attacks.
BC2   After playing the learning game, I adapted my behavior in dealing with URLs.
PT1   After playing the learning game, I feel like I can protect myself against phishing attacks.
PT2   After playing the learning game, I feel less likely to fall for phishing attacks.

The URLs used in the pre-, post- and longitudinal tests were generated by collecting benign login URLs from popular websites in our country of origin (according to Alexa and Tranco). Then, different manipulation techniques were applied to these benign URLs. We differentiate these manipulation techniques by which part of the URL contains the original target domain or a deceptive keyword: a subdomain, the registrable domain, the path, or none (random URLs). We further differentiate URLs that contain an IP address as host from other URLs with a deceptive part in the path. In all, 13 phishing and 7 benign URLs were created for the pre-test, with 7 phishing and 3 benign URLs added in the post- and longitudinal tests respectively to control for learning bias of the pre-test URLs. While all participants were shown the same URLs as part of the URL test, the order was randomized to avoid learning bias between the URLs.

3.4 Participants

The study was conducted with 89 participants (analysis game: N = 40, personalized game: N = 49), who were recruited online by posting information about the study in different social network groups of universities and distributing it via university mailing lists. Recruitment advertised the study for people with a general interest in playfully learning about IT security, regular online activities and little to no prior knowledge in IT security and Computer Science. Due to the duration of the study, a financial incentive of 15 EUR was offered to each participant. For participants of the longitudinal testing three months later, a lottery of 4 × 10 EUR was offered. Both recruiting and financial incentives may have introduced a potential selection bias.

Among the participants, 55.06% identiﬁed as fe-

male and 44.94% as male. Most participants were

between 20 and 29 years of age (76.40%), followed

by participants aged 30 or more (16.85%). The anal-

ysis of the participants’ level of education revealed

that most participants were students with either Bach-

elor’s degree or high school diploma (82.02%). Other

participants reported to have completed a Master’s de-

gree (15.73%), or vocational training (2.25%).

For the longitudinal test three months after the first part of our study, we experienced a dropout of 59.55%, leaving only 36 participants (analysis game: N = 17, personalized game: N = 19). This limits the evaluation of longitudinal effects and calls for reproduction with a larger participant sample.

https://www.alexa.com/topsites/countries online,

accessed 2022-02-18

https://tranco-list.eu/ online, accessed 2022-02-18

4 RESULTS

In this section, we attempt to answer the RQs deﬁned

in Section 3 using a series of analyses and statisti-

cal tests. We ﬁrst present results of the pre-, post-

and longitudinal tests, before analyzing in-game data

of the personalized game. For each test, we consider

two groups depending on which game the participants

played: the analysis game group and the personalized

game group. Note that longitudinal tests were evaluated only on the reduced set of participants who completed the additional survey.

Performance scores are calculated as the number of correctly classified URLs divided by the total number of URLs. Similarly, the confidence levels were computed as the mean confidence over all URLs. Depending on the hypotheses used to answer our research questions, one-tailed t-tests or ANOVA were conducted with a significance level α = .05. Parametric Student’s or Welch’s t-tests were used if no deviation from normality was detected in preliminary data screening. Otherwise, non-parametric testing was performed, e.g., Wilcoxon signed-rank test. Effect sizes are provided using either Cohen’s d, rank-biserial correlation coefficient r, or partial η², depending on the computed statistical test.
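The two per-participant measures defined above can be sketched in a few lines of Python. The function names and the example answers are hypothetical and only serve to illustrate the definitions; they are not the analysis scripts used in the study.

```python
from statistics import mean


def performance_score(answers, truth):
    """Share of URLs whose classification matches the ground truth."""
    correct = sum(a == t for a, t in zip(answers, truth))
    return correct / len(truth)


def confidence_level(ratings):
    """Mean of the per-URL confidence ratings (6-point Likert scale)."""
    return mean(ratings)


# Made-up answers of one participant on four URLs.
truth = ["phish", "benign", "phish", "phish"]
answers = ["phish", "benign", "benign", "phish"]
score = performance_score(answers, truth)  # 3 of 4 URLs correct
conf = confidence_level([5, 6, 2, 4])
print(score, conf)
```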

4.1 Survey Results

Before evaluating our research questions in detail, we check for a potential learning bias on URLs that were present in the pre-test (see Table 3). We therefore compare $M_{\text{post-pre}}$ to $M_{\text{post-new}}$, as well as $M_{\text{long-pre}}$ to $M_{\text{long-new}}$, by performing one-tailed Student’s t-tests with the hypothesis that means for URLs that were also used in the pre-test are higher than for the new URLs in the post- and longitudinal test. As neither of the two tests is significant (p > .725), and means are in fact higher for new URLs in most cases, we argue that learning bias is negligible for our sample.

Next, we analyze the overall effectiveness of the games. Both games were generally effective, in that a one-tailed comparison of pre- and post-test scores (using only URLs that were also present in the pre-test) gives significant results for improvements: Student’s t-test for the analysis game with t(39) = 6.404, p < .001, d = 1.013 and Wilcoxon signed-rank test for the personalized game with W(48) = 775, p < .001, r = .717 (as a deviation from normality was detected; Shapiro-Wilk, p = .033).
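As a plausibility check of the reported effect size for the analysis game, the following Python sketch recomputes Cohen's d from the t statistic. It assumes that d is the standardized mean of the paired differences (often written d_z), for which d = t / sqrt(n) holds in a paired t-test; the function name is hypothetical.

```python
import math


def cohens_d_paired(t_value: float, n: int) -> float:
    """Cohen's d for paired samples, derived from the t statistic.

    Assumes d is defined as mean difference / SD of differences,
    so that d = t / sqrt(n).
    """
    return t_value / math.sqrt(n)


# Analysis game: t(39) = 6.404, i.e. n = 39 + 1 = 40 participants.
d = cohens_d_paired(6.404, 40)
print(round(d, 3))
```

Under this assumption, the rounded result agrees with the reported d = 1.013.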

In response to RQ-1, we begin by comparing the post-test results on all 30 post-test URLs of players of the two games, i.e. the analysis game (N = 40) and the personalized game (N = 49). Taking a look at the mean test results (see Table 3) reveals that personalization did not lead to increased performance or confidence. Even though the analysis game group performed better on average, we did not find this difference to be significant using a two-tailed Welch’s t-test (t(85.891) = .797, p = .428, d = .169, with no deviation from normality: Shapiro-Wilk, p > .035). Similar results could be observed for confidence levels: Here, the Shapiro-Wilk test was significant (p < .001), and a Mann-Whitney test returns no significant results (U(85.157) = 995.5, p = .901, r = .016).

Table 3: Means (M) and standard deviations (SD) for performance and confidence in pre- and post-test including means on partial URL sets for new URLs in post-test (post-new) as well as base URLs used in pre- and post-test (post-pre).

Performance (relative score)
Game          N   M_pre (SD)    M_post-pre (SD)  M_post (SD)   M_post-new (SD)
Analysis      40  .695 (.098)   .828 (.115)      .840 (.095)   .853 (.140)
Personalized  49  .726 (.114)   .811 (.110)      .823 (.104)   .855 (.123)

Confidence (range: 1-6)
Game          N   M_pre (SD)    M_post-pre (SD)  M_post (SD)   M_post-new (SD)
Analysis      40  4.065 (.637)  5.034 (.468)     5.086 (.461)  5.065 (.764)
Personalized  49  4.114 (.747)  4.948 (.655)     5.016 (.658)  5.259 (.478)

Table 4: Performance in longitudinal test (long), pre- and post-test scores (pre and post-pre) as well as means of partial URL sets for new URLs in longitudinal test (long-new) and base URLs used in pre- and longitudinal test (long-pre).

Game          N   M_pre (SD)   M_post-pre (SD)  M_long-pre (SD)  M_long-new (SD)  M_long (SD)
Analysis      17  .679 (.095)  .865 (.077)      .812 (.070)      .782 (.119)      .802 (.061)
Personalized  19  .679 (.121)  .800 (.118)      .776 (.112)      .826 (.115)      .793 (.103)

As it might be possible that the personalization had an effect on the classification results of different levels of familiarity in the tests, we next perform a repeated-measures ANOVA comparing the three levels of familiarity, with the games as between-groups factor. Note that the group sizes are reduced to N = 34 and N = 39 in this test, as some participants did not select any services as unknown, known or used. Mauchly’s test for sphericity is significant (p < .001), and Greenhouse-Geisser corrections are applied (ε = .728). Here, we do not observe significant differences between the two games either: F(1, 71) = .084, p = .772, partial η² = .001. We do, however, find significant differences between the levels of familiarity: F(1.455, 103.308) = 10.204, p < .001, partial η² = .126. Post-hoc tests (Holm) confirm that URLs of unknown services are classified significantly less accurately than those of known and used services in both games (p ≤ .001 in both cases), with no significant differences between known and used (p = .525). In all, our study setup did not yield any significant differences in performance scores and confidence levels between the personalized game group and analysis game group.
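The degrees of freedom and effect sizes reported for this ANOVA can be cross-checked with a short Python sketch. It relies on two standard relations: the Greenhouse-Geisser correction multiplies both degrees of freedom by ε, and partial η² can be recovered from F as F·df_effect / (F·df_effect + df_error). The function names are hypothetical, and small deviations from the reported degrees of freedom are expected because ε is only given to three decimals.

```python
def greenhouse_geisser_df(levels, n_subjects, n_groups, epsilon):
    """Corrected degrees of freedom for a within-subject factor."""
    df_effect = levels - 1
    df_error = (levels - 1) * (n_subjects - n_groups)
    return epsilon * df_effect, epsilon * df_error


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Three familiarity levels, 34 + 39 = 73 participants in two groups, epsilon = .728
df1, df2 = greenhouse_geisser_df(levels=3, n_subjects=73, n_groups=2, epsilon=0.728)
eta_familiarity = partial_eta_squared(10.204, 1.455, 103.308)
eta_game = partial_eta_squared(0.084, 1, 71)
print(df1, df2, eta_familiarity, eta_game)
```

The recomputed values are in line with the reported F(1.455, 103.308), partial η² = .126 for familiarity, and partial η² = .001 for the game factor.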

For RQ-2, we are interested in the long-term effect of the two versions of the learning game. Due to a low response rate for longitudinal testing, participant samples are smaller for both groups (analysis game: N = 17, personalized game: N = 19). Data exploration seems to indicate a decline in performance between post- and longitudinal test, with the pre-test score remaining the lowest (see Table 4).

Table 5: Behavioral Change Questionnaire results with item group reliabilities (Cronbach’s alpha).

Game       M_App (SD)     M_Int (SD)     M_BC (SD)      M_PT (SD)
Analysis   3.509 (1.285)  4.059 (0.966)  3.853 (1.412)  4.176 (0.557)
Personal.  4.071 (1.275)  4.684 (1.121)  4.105 (1.174)  4.368 (1.141)

To test for significance of the mean differences, we perform a repeated-measures ANOVA, using the three tests (pre, post, longitudinal) as repeated measures and the games as between-subject factor. Mauchly’s test for sphericity is not significant, and the ANOVA (F(2, 68) = 28.432, p < .001, partial η² = .455) confirms that there are significant differences. Post-hoc tests (Holm) show that pre-test performance is significantly lower than both post- and longitudinal-test performances (p < .001 in both cases), while the differences between post- and longitudinal tests are not significant (p = .074).
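For readers unfamiliar with the Holm procedure used in the post-hoc tests, the following Python sketch shows how Holm-adjusted p-values are obtained from raw p-values. The raw values in the example are made up for illustration and are not the study's unadjusted results; the function name is hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each value
    is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # hypothetical raw p-values of three pairwise tests
adj = holm_adjust(raw)
print(adj)
```

With three pairwise comparisons (pre vs. post, pre vs. longitudinal, post vs. longitudinal), an adjusted p-value below α = .05 is then interpreted as significant.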

Finally, we take an exploratory look at the results

of the self-reported behavioral changes questionnaire

of the longitudinal test (see Table 5). As explained in

Section 3.3 we split the items of the behavioral change

questionnaire into four constructs: whether lessons

from the game were applied after playing (Appli-

cation), how interested participants are in security-

related learning games (Interest), whether participants

changed their everyday behavior after playing the

games (Behavior Change), and to what extent the

participants perceive phishing as a threat (Perceived

Threat). As expected of a self-reported measure,

where we expect a certain amount of bias, the overall

results are rather positive (see Table 5). Comparing

the mean values, we can observe minor differences

between the two groups in all constructs. In partic-

ular, the means of the personalized game group are

higher in all four constructs. As for differences be-

tween the four constructs, it seems that participants

were less likely to have applied the learned knowledge

and changed their behavior, as the mean scores are

lower than the results for “Interest” and “Perceived

Threat”. Due to the small sample size, we refrain

from further statistical testing, but the observed dif-

ference calls for more thorough testing in the future.


4.2 In-game Results

To answer RQ-3, we perform an exploratory analysis

of the game log data of the personalized game. The

personalized game gives more insight into the play-

ers’ interactions with different services during game-

play, as this information is not available for the origi-

nal analysis game. Python scripts were used to parse

the in-game log data and extract different event se-

quences, including timing information as well as the

outcomes of classiﬁcation events. In the following,

mean values are ﬁrst computed per player and then

analyzed, e.g., as the average of all players.
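The two-step aggregation described above (first per player, then across players) can be sketched as follows. The log format used here, a list of `(player_id, familiarity, correct)` tuples, is a hypothetical simplification of the game's event log, and the function names are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def per_player_accuracy(events):
    """Mean classification accuracy per (player, familiarity) pair.

    events: iterable of (player_id, familiarity, correct) tuples
    (hypothetical log format).
    """
    outcomes = defaultdict(list)
    for player, familiarity, correct in events:
        outcomes[(player, familiarity)].append(1.0 if correct else 0.0)
    return {key: mean(vals) for key, vals in outcomes.items()}


def mean_over_players(per_player):
    """Average the per-player means for each familiarity level."""
    by_familiarity = defaultdict(list)
    for (_, familiarity), accuracy in per_player.items():
        by_familiarity[familiarity].append(accuracy)
    return {fam: mean(vals) for fam, vals in by_familiarity.items()}


events = [
    ("p1", "used", True), ("p1", "used", True),
    ("p1", "used", False), ("p1", "used", True),
    ("p2", "used", False), ("p2", "used", True),
]
result = mean_over_players(per_player_accuracy(events))
print(result)
```

Averaging per player first gives every player the same weight, regardless of how many URLs they classified. In the example, pooling all six events would yield 4/6, whereas the mean of the per-player means is (.75 + .5) / 2.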

We start by taking a look at the sorting outcomes

and time needed for the classiﬁcation of URLs of

different levels of familiarity (see Table 6). We ob-

serve notable differences in relative classiﬁcation out-

comes, with URLs of unknown services being classi-

ﬁed with the least accuracy with a mean difference of

.068 to known and .083 to used services.

Next, we assess the differences in correct classification outcomes per familiarity level per URL category to gain a better understanding of which categories contribute to this difference. As there is a large number of comparisons for all possible levels of familiarity and categories, we focus on the rates of misclassification (phishing URLs as benign, or benign as phishing URLs) per familiarity level per URL category present in the game (see Table 7). The table also includes the number of valid (and missing) values per category per familiarity, as some players did not classify any URLs of certain combinations, e.g., Path URLs of unknown services. There are only minor differences between the familiarity levels for the URL categories “Path”, “IP”, and “Random” (mean differences ≤ .02), which were generally detected very well. URLs of the categories “RegDomain” (mean differences ≤ .137) and “No-Phish” (mean differences ≤ .044) show notable differences, with the highest rates of mistakes for unknown services. Interestingly, the classification accuracy for URLs of the “Subdomain” category is highest for unknown services (mean differences ≤ .017). Note, however, that the large number of possible familiarity and category combinations leads to a higher probability of these differences happening by chance.
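The bounds on the mean differences quoted above can be recomputed from the means reported in Table 7. The following sketch does this; the dictionary layout and the function name are our own choices for illustration, while the numbers are taken directly from the table.

```python
from itertools import combinations

# Mean misclassification rates from Table 7 (category -> familiarity -> mean).
table7 = {
    "IP": {"unknown": .011, "known": .006, "used": .019},
    "No-Phish": {"unknown": .221, "known": .178, "used": .177},
    "Path": {"unknown": .000, "known": .000, "used": .000},
    "Random": {"unknown": .022, "known": .008, "used": .002},
    "RegDomain": {"unknown": .246, "known": .109, "used": .153},
    "Subdomain": {"unknown": .096, "known": .113, "used": .109},
}


def max_pairwise_difference(means):
    """Largest absolute difference between any two familiarity levels."""
    return max(abs(a - b) for a, b in combinations(means.values(), 2))


for category, means in table7.items():
    print(f"{category:<10} {max_pairwise_difference(means):.3f}")
```

Each printed value is the largest gap between any two familiarity levels within a category, which is how the "mean differences ≤ x" statements in the text can be read.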

In all, the detailed analysis of the personalized game seems to indicate that URLs of unknown services are classified with less accuracy than URLs of services of the other familiarity levels, i.e., used or known, which can mainly be attributed to the URL categories “RegDomain” and “No-Phish”.

Table 6: In-game means and standard deviations.

| Familiarity | Correct | Incorrect | Unclassified | Time (sec) |
|---|---|---|---|---|
| Used | .680 (.170) | .186 (.108) | .133 (.117) | 4.13 (1.39) |
| Known | .665 (.180) | .192 (.142) | .143 (.115) | 4.11 (1.69) |
| Unknown | .597 (.221) | .250 (.187) | .154 (.187) | 4.27 (1.62) |

Table 7: Mean of misclassifications per type per familiarity.

| Category | Familiarity | N (Missing) | Mean |
|---|---|---|---|
| IP | unknown | 44 (5) | .011 |
| IP | known | 48 (1) | .006 |
| IP | used | 48 (1) | .019 |
| No-Phish | unknown | 46 (3) | .221 |
| No-Phish | known | 47 (2) | .178 |
| No-Phish | used | 49 (0) | .177 |
| Path | unknown | 27 (22) | .000 |
| Path | known | 24 (25) | .000 |
| Path | used | 35 (14) | .000 |
| Random | unknown | 47 (2) | .022 |
| Random | known | 49 (0) | .008 |
| Random | used | 49 (0) | .002 |
| RegDomain | unknown | 40 (9) | .246 |
| RegDomain | known | 44 (5) | .109 |
| RegDomain | used | 44 (5) | .153 |
| Subdomain | unknown | 40 (9) | .096 |
| Subdomain | known | 40 (9) | .113 |
| Subdomain | used | 39 (10) | .109 |

5 DISCUSSION

In the previous section, the results of our user study and in-game analysis were described in response to the RQs presented in Section 3. While there were no significant differences in the participants’ performance and confidence between the two games (RQ-1), we found significant differences between the levels of familiarity with services. In particular, URLs of unknown services were classified significantly less accurately than those of known and used services. For RQ-2, longitudinal testing revealed an overall improvement of the participants’ performance, since performance means of both the post-test and the longitudinal test are significantly higher than the participants’ pre-test performance. Differences based on levels of familiarity were also confirmed in the in-game log analysis in RQ-3. In the following, we discuss issues and open questions regarding the overall setup and results of our user study and analysis of in-game behavior.

5.1 Study Setup

Our study setup uses a pre-/post and longitudinal between-group design comparing two versions of the anti-phishing learning game “All sorts of Phish” (Roepke et al., 2021a). Participation in the pre- and post-test was independent of the longitudinal test, which led to a high dropout rate of 59.55% and only 36 participants (compared to 89 participants at first). The question arises whether only already interested participants agreed to take part in the longitudinal test, which would introduce additional bias, in particular to the results of the behavioral change questionnaire. As such, results cannot be generalized and we recommend repeating the study with a larger sample to strengthen the evidence base. Furthermore, we would be interested in evaluating with even longer time spans to see how the participants’ performance changes and whether regular repetitions might be needed in order for the knowledge to remain present.
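The reported dropout rate follows directly from the two participant counts given above. As a small worked check (the function name is ours):

```python
def dropout_rate(initial, remaining):
    """Share of the initial participants who did not take part in the follow-up."""
    return (initial - remaining) / initial


rate = dropout_rate(initial=89, remaining=36)  # 53 of 89 participants dropped out
print(f"{rate:.2%}")
```

This yields 53/89, i.e., the 59.55% stated in the text.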

As the selected game only focuses on teaching essential knowledge about URLs and possible manipulation techniques used for phishing, a limitation of the game as well as the complete study is that we cannot make any assumptions about the participants’ overall awareness and real-world performance with regard to phishing attacks. Here, we do not claim that the game or its personalized version raises situational awareness or helps avoid phishing attacks in real-world settings. For this, we would recommend additional educational resources to teach how and when phishers lure potential victims into disclosing personal information, or redesigning the game to include necessary information and approaches to raise awareness. Whether personalization has an effect on awareness might be an interesting question for future work.

5.2 Study Results

As described in Section 4, we found that while there are no significant differences between the personalized and non-personalized versions of the game, familiarity with a service did have an effect on the classification outcome in our study (RQ-1). We note that the results of our comparison do not mean that personalization does not have an effect at all, as the URLs that appear in the analysis game were customized and selected to have a high chance of being known by participants. As such, the only difference between the two versions that we can be sure of is the inclusion of the service familiarity selection interface in the personalized game. In particular, it is possible that fixing the ratio of unknown services in the game to different values (currently 20%), or integrating explicit instructions to deal with URLs of unknown services, might have an impact on the learning outcome or awareness.
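To make the idea of a configurable ratio of unknown services concrete, the following sketch selects services for a game session with a fixed share of unknown ones. It is not the selection logic of the game, which is not described in this section; the function `pick_services`, its parameters, and the placeholder service names are hypothetical.

```python
import random


def pick_services(familiarity, n_services, unknown_ratio=0.2, rng=None):
    """Pick `n_services` services with a fixed share of unknown ones.

    `familiarity` maps a service name to "used", "known", or "unknown".
    """
    rng = rng or random.Random()
    unknown = [s for s, level in familiarity.items() if level == "unknown"]
    familiar = [s for s, level in familiarity.items() if level != "unknown"]

    n_unknown = round(n_services * unknown_ratio)
    n_familiar = n_services - n_unknown
    if n_unknown > len(unknown) or n_familiar > len(familiar):
        raise ValueError("not enough services for the requested ratio")

    chosen = rng.sample(unknown, n_unknown) + rng.sample(familiar, n_familiar)
    rng.shuffle(chosen)  # avoid always presenting unknown services first
    return chosen


# Placeholder services and familiarity levels, as a player might select them.
familiarity = {
    "service_a": "used", "service_b": "used", "service_c": "used",
    "service_d": "known", "service_e": "known", "service_f": "known",
    "service_g": "unknown", "service_h": "unknown", "service_i": "unknown",
}
session = pick_services(familiarity, n_services=5, unknown_ratio=0.2,
                        rng=random.Random(0))
print(session)
```

With `unknown_ratio=0.2` and five services, exactly one unknown service is selected. Exposing the ratio as a parameter would allow a future study to compare several values instead of the current 20%.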

When analyzing the longitudinal test, we found that while the mean performance scores decreased compared to the post-test immediately after playing the game, the scores were still higher than in the pre-test (RQ-2). Even though the sample size was small, we found significant differences between pre- and longitudinal tests, which implies that the knowledge conveyed in the games was retained, at least partly, by the participants of the longitudinal test. In the self-reported behavioral change questionnaire, we found that players of the personalized game had higher mean values than players of the non-personalized analysis game. We note, however, that these results rely on self-reported data from a custom questionnaire, designed to be used in this study setup. Thus, our findings should only be seen as a first indicator that there might be differences when including personalization in the games; they are far from conclusive evidence. It is possible that personalization makes the game more appealing and its learning content more transferable to the real-world contexts in which users have to deal with potential phishing attacks that imitate services they know and use. Future work might explore whether simply making personalization options more prominent already leads to a more immersive or relevant gaming experience. In addition, we suggest evaluating the questionnaire used here on a larger sample and with domain experts to strengthen its quality and suitability for future studies.

For RQ-3, the analysis of in-game data of the personalized game showed that there are some URL categories with a larger difference in accuracy when classifying URLs of unknown services. Though we argue that it makes sense that the “RegDomain” and “No-Phish” categories have a high impact, as these URLs can be ambiguous if the original domain is unknown, we also note the interesting finding that the classification of URLs in the “Subdomain” category was performed with a higher accuracy for URLs of unknown services. As the difference for the “Subdomain” category is small compared to the other categories, it is, however, also possible that the difference is due to chance. A general problem with the analysis of in-game data is that players might have different strategies when playing the game, e.g., first opening a large number of coins and only classifying the easiest ones. These strategies might have affected the analysis outcomes; in particular, some differences might have been inflated by a small number of players.
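One way to probe whether a difference is driven by a small number of players is to recompute it with each player left out once. The sketch below illustrates this idea on made-up per-player rates; it was not part of the analysis reported here, and the function name and data are hypothetical.

```python
from statistics import mean


def leave_one_out_differences(rates_a, rates_b):
    """Mean difference (a - b), recomputed with each player left out once.

    `rates_a` and `rates_b` map player ids to per-player rates for two
    familiarity levels. Each mapping needs at least two players.
    """
    players = sorted(set(rates_a) | set(rates_b))
    result = {}
    for left_out in players:
        a = [v for p, v in rates_a.items() if p != left_out]
        b = [v for p, v in rates_b.items() if p != left_out]
        result[left_out] = mean(a) - mean(b)
    return result


# Made-up misclassification rates: player p3 struggles with unknown services.
unknown = {"p1": 0.1, "p2": 0.1, "p3": 0.9}
used = {"p1": 0.1, "p2": 0.1, "p3": 0.1}

full_difference = mean(unknown.values()) - mean(used.values())
loo = leave_one_out_differences(unknown, used)
print(round(full_difference, 3), {p: round(d, 3) for p, d in loo.items()})
```

In this toy example, the overall difference disappears once p3 is left out, which would indicate that a single player accounts for it. If the difference stays similar regardless of which player is omitted, it is less likely to stem from individual playing strategies.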

In all, we found that service familiarity has several effects on the participants’ classification abilities. While our study setup and the current version of the games did not exhaust the available methods for personalization, we argue that content personalization, which has not been explored in much detail in other domains either, is a worthwhile pursuit. Future work opportunities include the redesign of the personalized game to support adaptive gameplay in which players’ actions guide the continuation of the game, as well as the inclusion of contextual information in the games and researching the effect on situational awareness, in particular in a personalized game that closely reflects the players’ real-world environments. Further future work lies in the reproduction of our results with larger participant samples and possibly lower dropout rates in longitudinal testing to strengthen the evidence when answering questions regarding long-term effects.

6 CONCLUSION

In this paper, we present the results of a comparative user study of an anti-phishing learning game and its personalized version as well as an analysis of in-game behavior to understand how personalization influences the participants’ gameplay and performance. We find that users interact differently when confronted with URLs based on services they are not familiar with, both during gameplay and in the URL tests of our user study. While we did not find significant differences in the classification performance of participants of the personalized and non-personalized versions of the game, we find some indications that personalization might have positive effects on the players’ awareness. Our work therefore motivates further analyses of learning games with personalized content and how it affects players during and after playing the game. Furthermore, we performed longitudinal testing three months after the game was played and found that while the participants’ performance seems to drop compared to the post-test, it is still significantly higher than in the pre-test. These results indicate that general knowledge about the URL structure and possible manipulation techniques can help users detect malicious URLs even several months after the intervention.

ACKNOWLEDGEMENTS

This research was supported by the research training group “Human Centered Systems Security” sponsored by the state of North Rhine-Westphalia.

REFERENCES

Aleroud, A. and Zhou, L. (2017). Phishing environments, techniques, and countermeasures: A survey. Computers & Security, 68:160–196.

APWG (2021). APWG Phishing Activity Trends Report, 3rd Quarter 2021. Technical report, Anti-Phishing Working Group.

Bull, S. (2004). Supporting learning with open learner models. Planning, 29(14):1.

Canova, G., Volkamer, M., Bergmann, C., and Reinheimer, B. (2015). NoPhish app evaluation: Lab and retention study. In NDSS Workshop on Usable Security 2015, USEC ’15, San Diego, California. Internet Society.

Dey, R. and Konert, J. (2016). Content Generation for Serious Games. In Dörner, R., Göbel, S., Kickmeier-Rust, M., Masuch, M., and Zweig, K., editors, Entertainment Computing and Serious Games: International GI-Dagstuhl Seminar 15283, Revised Selected Papers, pages 174–188. Springer, Cham.

Drury, V., Roepke, R., Schroeder, U., and Meyer, U. (2022). Analyzing and Creating Malicious URLs: A Comparative Study on Anti-Phishing Learning Games. In Usable Security and Privacy Symposium 2022, USEC ’22, pages 1–13, San Diego, USA. IEEE. [in publication].

Kaspersky (2021). Spam and phishing in Q3 2021. Technical report, Kaspersky.

Kickmeier-Rust, M. D. and Albert, D. (2010). Micro-adaptivity: Protecting immersion in didactically adaptive digital educational games. Journal of Computer Assisted Learning, 26(2):95–105.

Kumaraguru, P., Sheng, S., Acquisti, A., Cranor, L. F., and Hong, J. (2008). Lessons from a real world evaluation of anti-phishing training. In 2008 eCrime Researchers Summit, pages 1–12.

Lastdrager, E. E. (2014). Achieving a consensual definition of phishing based on a systematic review of the literature. Crime Science, 3(1):1–10.

Law, E. L.-C. and Rust-Kickmeier, M. (2008). 80Days: Immersive Digital Educational Games with Adaptive Storytelling. In Proceedings of the 1st International Workshop on Story-Telling and Educational Games, STEG ’08, pages 56–62, Maastricht, Netherlands. CEUR.

Roepke, R., Drury, V., Meyer, U., and Schroeder, U. (2021a). Exploring Different Game Mechanics for Anti-Phishing Learning Games. In Games and Learning Alliance, GaLA ’21, Cham. Springer.

Roepke, R., Drury, V., Schroeder, U., and Meyer, U. (2021b). A Modular Architecture for Personalized Learning Content in Anti-Phishing Learning Games. In Software Engineering 2021 Satellite Events, SE-SE ’21, Braunschweig, Germany. CEUR.

Roepke, R., Koehler, K., Drury, V., Schroeder, U., Wolf, M. R., and Meyer, U. (2020a). A Pond Full of Phishing Games - Analysis of Learning Games for Anti-Phishing Education. In Hatzivasilis, G. and Ioannidis, S., editors, Model-Driven Simulation and Training Environments for Cybersecurity, Lecture Notes in Computer Science, pages 41–60, Cham. Springer.

Roepke, R., Schroeder, U., Drury, V., and Meyer, U. (2020b). Towards Personalized Game-Based Learning in Anti-Phishing Education. In 20th International Conference on Advanced Learning Technologies, ICALT ’20, pages 65–66, Tartu, Estonia. IEEE.

Sheng, S., Magnien, B., Kumaraguru, P., Acquisti, A., Cranor, L. F., Hong, J., and Nunge, E. (2007). Anti-Phishing Phil: The Design and Evaluation of a Game That Teaches People Not to Fall for Phish. In Proceedings of the 3rd Symposium on Usable Privacy and Security, SOUPS ’07, pages 88–99, New York, USA. ACM.

CSEDU 2022 - 14th International Conference on Computer Supported Education
