Temporal Constraints in Online Dating Fraud Classification
Harrison Bullock and Matthew Edwards
https://orcid.org/0000-0001-8099-0646
Department of Computer Science, University of Bristol, Bristol, U.K.
Keywords:
Concept Drift, Time Decay, Fraud, Online Dating, Uncertainty Sampling.
Abstract:
A number of automated systems attempt to combat online fraud through the application of classifiers created
using machine learning techniques. However, online fraud is a moving target, and cybercriminals alter their
strategies over time, causing a gradual decay in the effectiveness of classifiers designed to detect them. In
this paper, we demonstrate the existence of this concept drift in an online dating fraud classification problem.
Working with a dataset of real and fraudulent dating site profiles spread over 6 years, we measure the extent to
which dating fraud classification performance may be expected to decay, finding substantial decay in classifier
F1 over time, amounting to a decrease of more than 0.2 F1 by the end of our evaluation period. We also
evaluate strategies for keeping fraud classification performance robust over time, suggesting mitigations that
may be deployed in practice.
1 INTRODUCTION
Concept drift is a problem in machine learning clas-
sification, in which classifiers become less accu-
rate over time as the underlying data’s distribution
changes. Any subsequent fall in performance is
known as time decay. Concept drift often goes unex-
amined in classifier design due to temporal and spa-
tial biases in classifier evaluations. Temporal bias ex-
ists when a dataset is temporally inconsistent, which
means that the training and test sets are not chrono-
logically ordered, while spatial bias occurs when a
dataset is unrealistically balanced relative to the real
occurrence rates (Pendlebury et al., 2019).
In this paper, we investigate the presence and ef-
fect of concept drift for a novel application, namely,
an online dating fraud classification task. Online dat-
ing fraud, also referred to as romance scamming, is
a form of fraud in which a criminal entices a target
into an online romantic relationship using a false pro-
file, and then uses this relationship to extract money
from the target. In the US, the Federal Trade Commis-
sion reported losses of $547m in 2021 from romance
scams, up 80% from 2020 (FTC, 2022), a rapid in-
crease highlighting the urgent need for work tackling
this crime. Recent works have attempted to combat
this problem through a variety of classification ap-
proaches (Suarez-Tangil et al., 2020; Al-Rousan et al.,
2020; He et al., 2021; Shen et al., 2022). However, no
previous work on dating fraud classification has ad-
dressed the possibility of concept drift in deployment,
the speed or scale with which it may occur, or how it
may be mitigated.
In this work, we address this gap and characterise
the temporal constraints relevant to dating fraud clas-
sification using a large dataset of over 100,000 dat-
ing profiles, including over 6,000 scam profiles, in an
evaluation window spread over 6 years of data. Using
the TESSERACT Python library (Pendlebury et al.,
2019), an example classifier for these scams is first
built and subsequently evaluated for concept drift un-
der constraints that correct for temporal and spatial
bias. Following that, two plausible mitigation strate-
gies classification with rejection and uncertainty
sampling are then evaluated as solutions to make clas-
sification models more robust to concept drift and re-
duce the time decay in classifier performance.
2 BACKGROUND
2.1 Concept Drift
Concept drift, also known as concept shift or dataset
shift, occurs when the relationship between the in-
put and target variables changes between the training
dataset for a model and its deployment scenario. For
example, certain features of an Android application
may be reliably associated with a label of ‘malware’
in a training dataset, but when running in the wild, a
model trained on such data may perform poorly. This
may occur because these features are no longer partic-
ularly associated with malware, as malware authors
have moved on from the techniques that produced
such a pattern. In other words, the concept of what
a safe Android application is has changed since the
model was trained. Concept drift has been described
as “the great elephant in the room for machine learn-
ing” (Webb et al., 2017), as it can considerably affect
the accuracy and reliability of applied machine learn-
ing models when deployed, meaning reported perfor-
mance figures from research results may be less trust-
worthy than expected.
Pendlebury et al. have created an open-source
evaluation tool for concept drift called TESSER-
ACT (Pendlebury et al., 2019), which includes a
Python library. Their paper examined the presence
of temporal and spatial bias, looking at three different
classifiers for Android malware detection (a support
vector classifier, a random forest, and a neural net-
work). The earlier published evaluations of these
classifiers were believed to be temporally and spatially
biased. Pendlebury et al. define three constraints
that must be enforced for more realistic evaluations
(referred to as space-time aware evaluation), which
are (Pendlebury et al., 2019):
C1. Temporal Training Consistency, under which in-
stances in the training dataset must temporally
precede (i.e. chronologically come before) the in-
stances in the testing dataset.
C2. Temporal Testing Windows Consistency, under
which all instances in a testing window must be
from the same time slot. TESSERACT splits the
testing set into slots of fixed size. The exam-
ple provided is that a testing dataset of two years
could be split into slots of one month. This con-
straint states that each testing window should be
consistent, with all instances from the same time
slot. The user chooses the interval, but it should
contain a substantial number of instances in each
testing window. Pendlebury et al. suggest at least
1,000 instances in each window.
C3. Realistic Label Ratio in Testing, under which the
average percentage of class labels in each cate-
gory in a testing dataset should be close to the es-
timated distribution that would be seen in the real
world.
The same work also defines a time-aware performance
metric, Area Under Time (AUT), and an algorithm
that optimises classifier performance by adjusting the
class ratio of the training dataset. AUT is calcu-
lated as the area under a curve of point estimates of
performance scores (such as F1 scores) over time,
where each point estimate is for a different testing
slot (Pendlebury et al., 2019). We adopt AUT as
our primary evaluation metric in the experiments de-
scribed later in the paper.
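To make the metric concrete, the following is a minimal sketch of an AUT computation over per-window F1 scores, written independently of the TESSERACT implementation; the function name and example scores are illustrative only:

def area_under_time(scores):
    """Compute AUT from per-testing-window scores (e.g. F1).

    With N windows, AUT is the trapezoidal area under the score curve,
    normalised by (N - 1) so that a classifier scoring 1.0 in every
    window achieves an AUT of 1.0.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("AUT needs at least two testing windows")
    area = sum((scores[k] + scores[k + 1]) / 2 for k in range(n - 1))
    return area / (n - 1)

# Example: F1 point estimates for four successive testing windows.
print(area_under_time([0.77, 0.70, 0.62, 0.51]))  # approx. 0.653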
Two techniques that may be applied to mitigate
the effects of concept drift are Classification with re-
jection and Uncertainty sampling. Classification with
rejection is a mitigation in which lower confidence
predictions are rejected (Bartlett and Wegkamp, 2008;
Barbero et al., 2020). Observations with a conditional
probability close to 50% (in a binary classification
problem) are the most challenging instances to clas-
sify. Therefore, a reject option can be used to express
doubt over these more uncertain examples (Bartlett
and Wegkamp, 2008). When these examples are re-
jected, they can be quarantined and manually classi-
fied.
Uncertainty sampling is a technique under which,
rather than refusing to label, class labels are requested
for uncertain instances. These instances are found by
using the prediction probabilities of an existing model
and are then used for retraining the classifier (Ku-
bat, 2017). The technique was originally proposed
as a methodology for situations where large quanti-
ties of labelled data are difficult to obtain. However,
it can also be used to mitigate the effects of concept
drift (Pendlebury et al., 2019).
2.2 Online Dating Fraud
Online dating is becoming more popular, and this in-
creased popularity has attracted criminal attention.
Online dating fraud started to attract research inter-
est in the 2010s (Rege, 2009; Whitty and Buchanan,
2012), but Huang et al. (2015) were the first to quan-
titatively study how romance scammers operated on-
line, using data from an undisclosed Chinese dating
site between 2012 and 2013. They found there were
four types of scammers, including a category they re-
ferred to as Swindlers, who establish a long-distance
relationship online, and after a certain amount of time,
request money from the victim. This form of romance
fraud is the one that most resembles that described by
Whitty & Buchanan.
Edwards et al. (2018) discussed indicators of dat-
ing fraud profiles such as reused profile elements and
common geographic origins, but it was Suarez-Tangil
et al. (2020) who first described an ensemble classi-
fier for automatically detecting profiles likely to be
romance scammers, using only passively-accessible
static profile elements. Since then, a variety of ap-
proaches have been attempted. Al-Rousan et al.
(2020) focused on the detection of celebrity images
used in some scam profiles, describing a system us-
ing reverse image search mechanisms to reveal such
impersonation. He et al. (2021) described DatingSec,
which built upon the approach of Suarez-Tangil et
al. by additionally examining dynamic behaviour and
textual messaging features within data from the Chi-
nese dating app Momo, with promising results. Most
recently, Shen et al. (2022) have proposed a detection
approach grounded in a user trust model, which also
integrates both static and dynamic features to identify
the accounts used in online dating fraud.
3 BASELINE CLASSIFIER
3.1 Dataset
The data and methods used to extract and process the
dating fraud datasets were heavily influenced by pre-
vious work by Suarez-Tangil et al. (2020). Our study
aims to extend previous work on classifying this fraud-
ulent activity in order to evaluate and mitigate the effects of
concept drift. It is not intended to be a heavy re-
working of the feature selection or classifier models
in these domains. It is also not intended to be a criti-
cism of previous work. The quality of previous work
has provided an in-depth understanding of classifica-
tion in these domains and has allowed particular pro-
cesses to be replicated.
The data was scraped from the websites
https://datingnmore.com and https://scamdigger.com using
a slight modification of the method used by Suarez-
Tangil et al. (2020). 96,960 real dating profiles and
6,074 scam profiles were scraped and stored in JSON
format. To reduce the costs of training and evaluation,
only the demographic data from profiles were used in
these experiments. The data was cleaned following
the same process as described by the original authors,
with slight modifications for fields that have altered
format in the online data source.
There were rows of data that either did not have a
username or were duplicates. This duplication arose
from the original cleaning process, as particular
fields in the scam reports contain several options. By
way of example, the location given for a scam may
originally have been “New York, USA or Amsterdam,
Netherlands”, and this would create two instances in
the original cleaning process, a variant profile for each
location. We dropped these near-duplicate variants in
a process that consolidated the dataset and meant that
profiles with multiple entries in a field were not given
greater weight within the dataset. When doing this,
the first of the variant field values was kept in the
dataset. This did mean that some information was
lost, as, in this example, only the profile with New
York as the location would remain in the dataset.
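As an illustration, the consolidation step might look like the following pandas sketch, assuming a DataFrame scam_profiles in which variant rows share a username; the column names are assumptions, not the actual field names in the dataset:

import pandas as pd

def consolidate(scam_profiles: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with no username at all.
    scam_profiles = scam_profiles.dropna(subset=["username"])
    # Keep only the first variant row per username, so profiles with
    # multiple options in a field (e.g. two reported locations) are
    # not given greater weight in the dataset.
    return scam_profiles.drop_duplicates(subset="username", keep="first")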
The presence of timestamps is crucial when exam-
ining concept drift. Without them, we cannot evaluate
classifiers under the relevant constraints. Timestamps
were provided within the scam set, as the scamdig-
ger.com website has two fields that reflect the month
and the year that a scam was reported. For real pro-
files, however, there is no reported date (due to their
very nature of being genuine). As an appropriate
comparison date, the date a profile was last active was
scraped and used to create the timestamp field for real
profiles, this being how the real dating site user chose
to present themselves at a given date.
Figure 1 shows the real and scam profile counts
across different years. The visualisation highlights
several imbalances. Firstly, there is a lack of real pro-
files in earlier years, between 2012 and 2015. Con-
versely, there has been a relative lack of reported scam
profiles in recent years. An imbalance can also be
seen between the real and scam profiles, where ap-
proximately 6% of the profiles were scams in the orig-
inal downloaded data. This proportion is less than the
10% estimated in Sift’s research (Beldo, 2022).
Figure 1: Count of real and scam dating profiles.
3.2 Baseline Classifier
Suarez-Tangil et al. used an ensemble classifier. The
different classifiers within the ensemble used different
data sources, which fell under three categories: de-
mographics, images, and description (Suarez-Tangil
et al., 2020). We focus on a classifier using only the
demographic data in this study, with the hope that the
variety of fields used by the demographic classifier
would better enable us to typify any concept drift in
the fraudulent or real profile data. The original study
combined a random forest (RF) and naive Bayes (NB)
classifier to handle demographic data, since an RF
classifier does not work with missing values, but the
NB model can appropriately deal with them. Miss-
Temporal Constraints in Online Dating Fraud Classification
537
ing data is common in dating sites where users will
have ‘incomplete’ profiles because they have decided
not to fill in certain sections. After initial compar-
isons, we instead opted to use a histogram gradient
boosting classifier (HGBC), which is also capable of
handling missing values, and achieves good perfor-
mance (scikit-learn developers, 2022).
The HGBC was trained initially on a dataset with-
out any temporal or spatial bias constraints. The
dataset was split into training and test sets, and the
model was fitted on the training data using a grid search
with k-fold cross-validation to select specific
hyperparameters. These were the learning rate, the
maximum number of leaf nodes, and the maximum
number of iterations. K-fold cross-validation splits
the training set into k partitions (here, k = 10),
and trains the model on all but one partition. It is
then tested on the remaining fold, and this operation
repeats k times, leaving one partition out to test each
time. Scores are calculated as the average of the rel-
evant performance metric from these tests. A grid
search repeats the 10-fold cross-validation but uses a
different hyperparameter combination each time. The
best-performing combinations were used for training
the model, and then this model was scored on the
test dataset. The results in Table 1, while underper-
forming relative to Suarez-Tangil et al.'s full ensem-
ble model, show performance similar to that of their
individual demographics classifier, with an F1 score
of 0.77. The HGBC classifier has a lower recall than
precision, and just 70% of scam profiles in the test
dataset have been correctly identified.
Table 1: Performance metrics for the baseline HGBC clas-
sifier.
Precision Recall F1 Accuracy
0.85 0.70 0.77 0.98
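The following sketch illustrates the training procedure described above using scikit-learn; the synthetic data stands in for the cleaned demographic features, and the hyperparameter grid values are illustrative rather than those actually searched:

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the demographic feature matrix, with roughly
# 6% positive (scam) instances to mirror the class imbalance.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.94], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 10-fold cross-validated grid search over the three hyperparameters
# tuned in this study (learning rate, maximum leaf nodes, maximum
# iterations), optimising F1 on the positive class.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_leaf_nodes": [15, 31, 63],
    "max_iter": [100, 200, 400],
}
search = GridSearchCV(HistGradientBoostingClassifier(),  # handles missing values natively
                      param_grid, cv=10, scoring="f1", n_jobs=-1)
search.fit(X_train, y_train)

print(classification_report(y_test, search.best_estimator_.predict(X_test)))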
This result is reasonably encouraging, and this model will be
referred to as the baseline classifier. However, in de-
ployment in the real world, for how long can such
a performance result be trusted? This question is at
the heart of this paper’s investigation and will be ad-
dressed in the following sections.
4 CONCEPT DRIFT EVALUATION
A key concern when evaluating a classifier under tem-
poral constraints is how to split the data into training
and testing windows. The classifier learns from the training
window and is then evaluated on each testing
window in turn.
Table 2: Minimum outcomes of different training and testing window lengths.
Training time (months)  Test window (months)  Min. testing window sample size  Min. positive cases  Min. positive ratio (%)
12 1 303 0 0.00
12 3 1574 0 0.00
12 4 2099 77 1.78
12 6 3528 153 1.90
18 1 303 0 0.00
18 3 1574 0 0.00
18 4 1044 0 0.00
18 6 3528 153 1.90
24 1 303 0 0.00
24 3 1574 0 0.00
24 4 2099 77 1.78
24 6 3528 153 1.90
Different training and testing window length com-
binations are examined in Table 2. When deciding
the testing window length, a rule of thumb of at least
1,000 samples in a split (Pendlebury et al., 2019) is
enforced. The minimum testing window sample size
column gives information on whether this occurs in
all windows for each combination. Scam profiles
need to be present in all windows, so the minimum
positive cases column indicates whether this holds. The min-
imum positive ratio is how low the ratio of scam pro-
files to genuine profiles could be. This ratio matters
because the spatial constraint later adjusts it to be
closer to the in-the-wild ratio. Based on this re-
view, 18 months for the training set and 6 months for
the testing windows were deemed appropriate for this
task. These intervals ensured no fewer than 1,000 pro-
files and a reasonable number of scam profiles in each
testing window.
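A minimal sketch of this split is given below, assuming the profiles are held in a pandas DataFrame with a datetime 'timestamp' column; the 18-month training window and 6-month testing windows mirror the choices above, while the function and column names are illustrative:

import pandas as pd

def time_aware_split(profiles: pd.DataFrame,
                     train_months: int = 18, test_months: int = 6):
    profiles = profiles.sort_values("timestamp")
    start = profiles["timestamp"].min()
    train_end = start + pd.DateOffset(months=train_months)
    # C1: every training instance temporally precedes every test instance.
    train = profiles[profiles["timestamp"] < train_end]

    # C2: the remaining data is sliced into fixed-length testing windows.
    test_windows = []
    window_start = train_end
    last = profiles["timestamp"].max()
    while window_start <= last:
        window_end = window_start + pd.DateOffset(months=test_months)
        mask = ((profiles["timestamp"] >= window_start) &
                (profiles["timestamp"] < window_end))
        test_windows.append(profiles[mask])
        window_start = window_end
    return train, test_windows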
With the training and testing window sizes de-
cided, the HGBC classifier was trained with the first
18 months of data (starting from 2015). We then
evaluated the dataset under the constraints of tempo-
ral training consistency and temporal testing windows
consistency (C1 & C2). Figure 2 indicates that con-
cept drift is present. There is time decay, as the classi-
fier’s performance reduces in subsequent six-monthly
periods, with the F1 scores dropping lower than the
previously reported figure of 0.77. The classifier’s
ability to correctly identify the scam profiles dimin-
ishes, scoring an AUT of 0.63. The recall and F1
scores fall to 0.45 and 0.51 in the ninth testing pe-
riod, four years after the initial training. This result is
crucial as it answers one of the critical questions: is
there concept drift present in the online dating fraud
classifier? The answer is yes: substantial drops in
performance can be seen over time.
Figure 2: Evaluation of online dating fraud classification
with constraints C1 & C2.
The size of the dataset used for training reduces
under this evaluation methodology, so the baseline
classifier has the advantage of learning from more
data: it used 70,174 samples, whereas the con-
strained classifier learned from 26,017 samples from
the first 18 months of the dataset (from 2015). The
performance of a classifier with no temporal con-
straints but with a similar amount of data can be
compared to the baseline classifier. The test scores
were similar to the baseline classifier when training
an HGBC model with a randomised split of around
26,000 instances. The F1 score for this comparison
classifier, trained on a smaller amount of data, was
0.76. This result suggests that around 26,000 instances
still constitute a sufficient quantity of training data.
The previous results in Figure 2 did not include
the spatial constraint of a realistic label ratio in test-
ing. We now add this constraint, forcing the
ratio to be between 7.5% and 12.5% in the
testing windows. This range includes the 10% es-
timate of scam profiles from Sift's research (Beldo,
2022). A caveat of using this estimate is that, while it
was the best available research on the in-the-wild ratio,
it is just one estimate, and it describes the wider
online dating population rather than the
datingnmore.com population specifically. This constraint is
implemented by changing the sample size and down-
sampling the scam or real profiles until one of the
bounds of the range is met. To clarify, if 20% of
a testing window were scam profiles, they would be
downsampled until the proportion was 12.5%. If 2%
of the window were scam profiles, then the real pro-
files would be downsampled until the ratio is 7.5%
(this is the scenario encountered for most of the test-
ing windows for the online dating fraud dataset). The
AUT improved, rather than decreased, to 0.67 when
imposing this constraint. The difference can be ex-
plained by the lower number of real profiles tested for
many windows (with some downsampled to force the
sample ratio to 7.5%). Fewer false
positives are reported, increasing the classifier's preci-
sion by decreasing the denominator of the calculation
(all else being equal; the true positives
do not change, as the number of scam profiles remains
consistent in this scenario). This increase in precision
leads to a higher F1 score and AUT metric. The recall
is not affected in most windows, as this metric only
considers scam profiles and they are not downsam-
pled, apart from in the first testing window. Figure 3
demonstrates the revised impact under all three con-
straints, still presenting evidence of a substantial drop
in performance over time.
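A sketch of this downsampling step is shown below, assuming each testing window is a pandas DataFrame with a boolean 'scam' column; the 7.5% and 12.5% bounds match those used above, while the function and column names are illustrative:

import pandas as pd

def enforce_label_ratio(window: pd.DataFrame,
                        low: float = 0.075, high: float = 0.125,
                        seed: int = 0) -> pd.DataFrame:
    scams = window[window["scam"]]
    reals = window[~window["scam"]]
    ratio = len(scams) / len(window)
    if ratio > high:
        # Too many scam profiles: downsample scams to the upper bound.
        scams = scams.sample(n=int(high / (1 - high) * len(reals)),
                             random_state=seed)
    elif ratio < low:
        # Too few scam profiles: downsample real profiles so the scam
        # proportion rises to the lower bound (the common case here).
        reals = reals.sample(n=int(len(scams) * (1 - low) / low),
                             random_state=seed)
    return pd.concat([scams, reals])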
Figure 3: Evaluation of online dating fraud classification
with all three constraints (C1, C2 & C3).
To understand the dataset differences underlying
this performance drop, we made both visual and au-
tomated comparisons of the two most distant
time windows: the training window dataset (the first 18
months of data) and the ninth testing window dataset,
which is the last six months of the data in 2020. An
HGBC model was trained to distinguish between pro-
files from the two time windows but achieved only
poor performance (0.16 F1 for distinguishing the cor-
rect window for any profile, 0.32 F1 for distinguish-
ing the correct window for only scam profiles). The
differences between the profiles were not strongly evi-
dent in the distribution of demographic features: mar-
ital status, ethnicity and other factors appear to have
similar distributions in both the training window and
the final test window. One feature which did show
some variation was the ‘occupation’ field, in which
female scam profiles became relatively more likely to
report ‘self-employed’, ‘student’ or ‘military’ occu-
pations, while male scam profiles became more likely
to report occupations in ‘construction’. These dif-
ferences, together with possible alterations in the co-
occurrence patterns of other more stable demographic
features, could explain why the classifier performance
degraded. However, it is important to note that these
changes in the underlying data are small and difficult
to detect, and so any plan for mitigating the impact of
concept drift on classification will likely need to do
more than monitor cohort statistics.
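The automated comparison can be sketched as a simple 'window classifier', where low cross-validated F1 indicates that profiles from the two windows are hard to tell apart; the feature matrices are assumed to be pre-encoded, and the function name is illustrative:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def window_separability(X_train_window, X_final_window):
    # Label 0 for profiles from the training window, 1 for the final
    # testing window, then try to predict which window a profile is from.
    X = np.vstack([X_train_window, X_final_window])
    y = np.concatenate([np.zeros(len(X_train_window), dtype=int),
                        np.ones(len(X_final_window), dtype=int)])
    clf = HistGradientBoostingClassifier()
    # Low F1 here means the demographic features give little signal
    # about which window a profile belongs to.
    return cross_val_score(clf, X, y, cv=5, scoring="f1").mean()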
5 CONCEPT DRIFT MITIGATION
Classification with Rejection: This technique can
mitigate the effects of concept drift. It looks at the clas-
sifier's probability prediction for each testing example
and will reject those that fall below a chosen thresh-
old. Samples for which the HGBC prediction prob-
abilities fall below the rejection threshold are placed
into ‘quarantine’, offloading decisions on these sam-
ples to manual labelling. A higher rejection thresh-
old means more predictions are rejected. The trade-
off is that there is a higher cost to label the examples
that have been quarantined since it is a manual task.
Figure 4 displays the results when rejecting instances
with a predicted probability below 80%.
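A minimal sketch of the rejection step, assuming a fitted scikit-learn classifier clf exposing predict_proba; the 80% threshold matches the setting used in Figure 4, and the function name is illustrative:

import numpy as np

def predict_with_rejection(clf, X_test, threshold=0.80):
    proba = clf.predict_proba(X_test)
    confidence = proba.max(axis=1)  # probability of the predicted class
    accepted = confidence >= threshold
    predictions = clf.classes_[proba.argmax(axis=1)]
    # Accepted samples receive an automatic label; the remainder are
    # quarantined (their indices returned) for manual labelling.
    quarantined = np.where(~accepted)[0]
    return predictions[accepted], quarantined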
Figure 4: Evaluation under classification with rejection
(80%) (C1, C2 & C3).
Classification with rejection increases the AUT
metric to 0.71 but does not stop time decay. The ninth
testing period has an F1 score of 0.56, which com-
pares to 0.77 for the baseline classifier. The perfor-
mance is not a large improvement from the results
seen in the evaluation of the classifier without any
mitigation. It is interesting to note that only 1,785
testing instances out of 37,651 were rejected based
on this 80% threshold. The choice of 80% as the re-
jection threshold was arbitrary, and a higher rejection
threshold could improve the AUT performance of the
model, but with the trade-off of higher labelling costs:
rejection means the classifier is producing fewer
decisions, increasing the manual workload. Model
probability predictions also do not always translate
to expected outcomes; if the rejection threshold is set
to 95%, the AUT changes to 0.72. However, if it is set
to 99%, the AUT is 0.62, worse than at 80%. There
is reason to believe that model probabilities can be
skewed towards high values (Jordaney et al., 2017),
suggesting that the gains from confidence-based re-
jection may be limited.
Uncertainty Sampling: This technique extends beyond rejection,
making the classifier more
robust to concept drift by retraining the
classifier with a subset of the most uncertain exam-
ples. A proportion parameter is used, and there is
a subtle difference between this method and classi-
fication with rejection. Classification with rejection
examines each example’s prediction scores and quar-
antines it if it falls below α%, where α is some pre-
determined threshold. With uncertainty sampling, the
predictions are first sorted by their prediction
confidence. A subset containing the β% most
uncertain instances is then used to re-
train the model. If β were set to 100%, this would be an example
of complete incremental learning, where the model
would learn from all the labelled data in each period.
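The retraining loop can be sketched as follows for a single testing window, assuming NumPy arrays and a scikit-learn-style estimator; in deployment the labels for the sampled instances would come from manual labelling, and all names are illustrative:

import numpy as np

def uncertainty_sample_and_retrain(clf, X_pool, y_pool,
                                   X_window, y_window, beta=0.20):
    # Confidence of the predicted class for each instance in the window.
    confidence = clf.predict_proba(X_window).max(axis=1)
    n_select = int(beta * len(X_window))
    # Indices of the beta% least confident (most uncertain) instances.
    uncertain = np.argsort(confidence)[:n_select]
    # Add their (manually obtained) labels to the training pool and refit.
    X_pool = np.vstack([X_pool, X_window[uncertain]])
    y_pool = np.concatenate([y_pool, y_window[uncertain]])
    clf.fit(X_pool, y_pool)
    return clf, X_pool, y_pool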
Under this method, the AUT metric improves to
0.77 when retraining with the 20% most uncertain
predictions. The visualisation in Figure 5 shows that
uncertainty sampling makes the classifier more robust
against a falling F1 score performance over time; its
trend is flat and close to the baseline level. Because the
model learns from samples in each period,
it has a better chance of recognising
future scam profiles. The classifier now maintains
its performance at a level consistent with the baseline
classifier, which had an F1 score of 0.77. The 20%
was an arbitrary selection, but different subset sizes
were also tested. 5% can still achieve an AUT score
of 0.75, with the benefit that it requires a quarter of
the labelling effort compared to a level of 20%. This de-
cision regarding the appropriate level of sampling is
best determined by the available resource.
Figure 5: Evaluation with uncertainty sampling (20%) (C1,
C2 & C3).
The AUT measures for our evaluations are sum-
marised in Table 3. Both the imposition of con-
straint C3 and each of our mitigation strategies in-
creased performance evaluated over time, with uncer-
tainty sampling displaying the highest AUT and the
least deviation from the performance established in
the baseline. However, it must be acknowledged that
uncertainty sampling in a deployment setting would
require a regular manual review of a sample of cases.
This poses a potential trade-off for online dating sites
and similar platforms aiming to screen out fraud us-
ing automated means, and highlights the need for such
platforms to invest in reliable ground-truth-recording
mechanisms through both user fraud reporting and
manual review processes.
Table 3: Summarised AUT metrics for different constraints
and mitigation methods.
Method AUT
C1 & C2 0.63
C1, C2 & C3 0.67
C1–3 with classification with rejection (80%) 0.71
C1–3 with uncertainty sampling (20%) 0.77
6 CONCLUSION
Our central research question was how to assess con-
cept drift and mitigate its effects in the domain of online
dating fraud. This is a serious problem, with growing
numbers of victims, and we ground our investigation
in a large real-world dataset. We find that substantial
declines in classifier performance can be seen across
the period covered by our testing windows. Our base-
line classifier is not intended to demonstrate state-of-
the-art performance levels, but rather to exemplify
how performance may decay over time in this do-
main, using features common to many current mod-
els (Suarez-Tangil et al., 2020; He et al., 2021; Shen
et al., 2022). We see that a classifier naively assessed
as performing at 0.77 F1 performs at 0.51 in the most
recent testing window when controlling for temporal
biases.
Similarly to Singh et al. (2012) in the domain of
malware, we find that the underlying shifts in the dis-
tribution of dating profile features are not easy to de-
tect or explain, highlighting that monitoring new data
may be insufficient protection against concept drift. We
evaluated two mitigation techniques and discovered
that classification with rejection does slow the decay
in performance over time, but does not halt it. Un-
certainty sampling, which involves the regular intro-
duction of new labelled data, is far more effective but
may pose operational concerns.
The practical takeaways from our work can be
summarised with two main considerations. Firstly,
online dating platforms need to be aware of this risk
wherever they may be deploying automated solutions
to prevent romance fraud, and should consider the
use of uncertainty sampling to guide their retrain-
ing methodology. Secondly, and more broadly, we
hope to demonstrate that concept drift is a measurable
problem for security and online safety classification
systems, beyond the specific domains in which it has
previously been established, and argue for the need
for temporal constraints to be more widely adopted as
checks on the robustness of detection and prevention
models. To give what support we might for this aim,
the code for this project is made publicly available as
a GitHub repository, to enable replication and future comparisons:
https://github.com/hbu90/Online-dating-fraud-classification-and-dataset-shift.
One requirement for temporal robustness checks
is the availability of a large, longitudinal dataset la-
belled for classification purposes. As part of our in-
vestigation we also attempted to investigate concept
drift in pet scams (Price and Edwards, 2020), but the
comparatively short period of time for which data was
available made the extent of any drift difficult to es-
tablish reliably. Other domains in which concept drift
might be an operational concern could also be suffering
from the lack of suitable data, meaning researchers
and developers willing to perform robustness checks
are not able to do so. Reliable access to well-designed
security datasets remains a crucial hurdle for many
technological developments in online safety and se-
curity.
REFERENCES
Al-Rousan, S., Abuhussein, A., Alsubaei, F., Kahveci, O.,
Farra, H., and Shiva, S. (2020). Social-guard: detect-
ing scammers in online dating. In 2020 IEEE Interna-
tional Conference on Electro Information Technology
(EIT), pages 416–422. IEEE.
Barbero, F., Pendlebury, F., Pierazzi, F., and Cavallaro, L.
(2020). Transcending TRANSCEND: Revisiting mal-
ware classification with conformal evaluation. arXiv
preprint arXiv:2010.03856.
Bartlett, P. L. and Wegkamp, M. H. (2008). Classification
with a reject option using a hinge loss. Journal of
Machine Learning Research, 9(8).
Beldo, S. (2022). What percentage of dating profiles
are fake? https://blog.sift.com/what-percentage-of-
dating-profiles-are-fake/, accessed 2022-10-18.
Edwards, M., Suarez-Tangil, G., Peersman, C., Stringhini,
G., Rashid, A., and Whitty, M. (2018). The geography
of online dating fraud. In Workshop on Technology
and Consumer Protection. IEEE.
FTC (2022). FTC data show romance scams
hit record high; $547 million reported lost in
2021. https://www.ftc.gov/news-events/news/press-
releases/2022/02/ftc-data-show-romance-scams-
hit-record-high-547-million-reported-lost-2021,
accessed 2022-10-18.
He, X., Gong, Q., Chen, Y., Zhang, Y., Wang, X., and Fu, X.
(2021). DatingSec: Detecting malicious accounts in
dating apps using a content-based attention network.
IEEE Transactions on Dependable and Secure Com-
puting, 18(5):2193–2208.
Huang, J., Stringhini, G., and Yong, P. (2015). Quit play-
ing games with my heart: Understanding online dat-
ing scams. In International Conference on Detection
of Intrusions and Malware, and Vulnerability Assess-
ment, pages 216–236. Springer.
Jordaney, R., Sharad, K., Dash, S. K., Wang, Z., Papini, D.,
Nouretdinov, I., and Cavallaro, L. (2017). Transcend:
Detecting concept drift in malware classification mod-
els. In 26th USENIX Security Symposium (USENIX
Security 17), pages 625–642.
Kubat, M. (2017). An Introduction to Machine Learning,
volume 2. Springer.
Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., and
Cavallaro, L. (2019). TESSERACT: Eliminating ex-
perimental bias in malware classification across space
and time. In 28th USENIX Security Symposium
(USENIX Security 19), pages 729–746.
Price, B. and Edwards, M. (2020). Resource networks of pet
scam websites. In Proceedings of the Symposium on
Electronic Crime Research (eCrime). Anti-Phishing
Working Group.
Rege, A. (2009). What’s love got to do with it? Exploring
online dating scams and identity fraud. International
Journal of Cyber Criminology, 3(2).
scikit-learn developers (2022). HistGradientBoostingClassifier.
https://scikit-learn.org/stable/modules/generated/
sklearn.ensemble.HistGradientBoostingClassifier.
html, accessed 2022-10-18.
Shen, X., Lv, W., Qiu, J., Kaur, A., Xiao, F., and Xia, F.
(2022). Trust-aware detection of malicious users in
dating social networks. IEEE Transactions on Com-
putational Social Systems.
Singh, A., Walenstein, A., and Lakhotia, A. (2012). Track-
ing concept drift in malware families. In Proceedings
of the 5th ACM workshop on Security and Artificial
Intelligence, pages 81–92.
Suarez-Tangil, G., Edwards, M., Peersman, C., Stringhini,
G., Rashid, A., and Whitty, M. (2020). Automati-
cally dismantling online dating fraud. IEEE Transac-
tions on Information Forensics and Security, 15:1128–
1137.
Webb, G. I., Lee, L. K., Petitjean, F., and Goethals, B.
(2017). Understanding concept drift. arXiv preprint
arXiv:1704.00362.
Whitty, M. T. and Buchanan, T. (2012). The online romance
scam: A serious cybercrime. CyberPsychology, Be-
havior, and Social Networking, 15(3):181–183.