The WikiWooW Dataset: Harnessing Semantic Similarity and Clickstream-Data for Serendipitous Hyperlinked-Paths Mining in Wikipedia

Cosimo Palma and Bence Molnár

University of Pisa, University of Naples "L’Orientale", Italy

University of Pécs, Hungary

Keywords:

Serendipity, Interestingness, Wikipedia, DBpedia, Clickstream-Data, Semantic Similarity, Knowledge Graphs.

Abstract:

This paper introduces WikiWooW, a dataset generator designed for distilling a formal model of serendipity for Wikipedia entity pairs. The task, foundational to mining serendipitous hyperlinked paths, builds upon cognitive theory and exploits sub-components of serendipity: graph centrality, popularity, clickstream, and corpus-based and knowledge-based similarity. Two proof-of-concept experiments were conducted on two different datasets. The first uses a single Wikipedia entity linked through the DBpedia dbo:wikiPageWikiLink property to 413 other entities. These pairs are looked up in Wikimedia clickstream data and scored for interestingness according to a principled mathematical model, which is validated against Amazon Mechanical Turk and author annotations. The second dataset contains 146 random Wikipedia entity pairs annotated by 10 postgraduate students following detailed guidelines. Average serendipity scores are then correlated with dataset features using the original model and four alternatives. The proposed dataset generator aims to support Serendipity Mining for Computational Creativity, particularly Knowledge-based Automatic Story Generation, where serendipity matters more than similarity-based interestingness metrics. First results, despite their limitations, confirm the principles initially deduced for modelling serendipity, suggesting that serendipity can be modeled through comprehensive parameter optimization.

1 INTRODUCTION

From a cognitive perspective, the most salient connections, known as hard beliefs, are both easily activated and highly resistant to change. Their surprisal level is low.

Recent findings in psychology confirm that surprise is elicited by unexpected (schema-discrepant) events and that its intensity is determined by the degree of schema discrepancy (Reisenzein et al., 2019). Intuitively, humans find interesting those facts that are neither obvious nor random.

The typical data structure for representing facts is the Knowledge Graph, which consists of semantic relationships, i.e. typically unweighted, labelled edges between entity nodes (Hogan et al., 2022). Encoding the strength of a link, whether in terms of surprisal or of any other measure, requires numerical values. Enhancing knowledge graphs with weighted relationships can enable more nuanced analytical approaches, including centrality measures and community detection algorithms that account for relationship strength (Ristoski and Paulheim, 2016), as well as other network analysis methods, such as spectral clustering and information diffusion models, that currently find limited application in semantic networks (Bojchevski and Günnemann, 2020).

https://orcid.org/0000-0002-8161-9782

The implementation of weighting in Linked Open

Data (LOD) is attempted in (Hees, 2018) by gamify-

ing data acquisition tasks, thus building the necessary

ground truth for validating whether Linked Data can

effectively model human associative thinking.

DBpedia-NYD addresses the lack of large-scale

benchmarks for assessing the different approaches to

the automatic computation of semantic relatedness in

DBpedia links by providing a synthetic silver stan-

dard benchmark with symmetric/asymmetric similar-

ity values from web search data (Paulheim, 2013).

However, despite these contributions, enriching DBpedia relationships with serendipity-based weights is a research direction that still eludes the attention of the scientific community (see Section 1.1).

Palma, C. and Molnár, B.

The WikiWooW Dataset: Harnessing Semantic Similarity and Clickstream-Data for Serendipitous Hyperlinked-Paths Mining in Wikipedia.

DOI: 10.5220/0013747000004000

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2025) - Volume 1: KDIR, pages 415-425

1.1 Related Work

The concept of interestingness varies by discipline. In (Hilderman and Hamilton, 1999), a comprehensive survey of measures in Knowledge Discovery (KD) classifies them by representation (dataset format: classification rules, summaries, association rules), foundation (probabilistic, distance-based, syntactic, utilitarian), scope (single rule or rule set) and class (objective or subjective).

Subjective measures involve the user's background knowledge (bias, constraints, beliefs, expectations, interactive feedback) and are integrated into the mining process.

Objective measures rely solely on data without user input. In this regard, particularly relevant is Silberschatz and Tuzhilin's Interestingness, which measures how soft beliefs change with new evidence.

Serendipity is difficult to define and model computationally (Kotkov et al., 2016; McCay-Peet and Toms, 2015). In recommendation systems it is generally calculated as a ratio between unexpectedness and relevance or usefulness. All these serendipity metrics include user preference data and are tailored to individuals; for reasons of space, they are not discussed further here. They generally stem from the assessment of sub-components that align with the components used in this paper, which represents an attempt at modeling serendipity in a principled, holistic fashion without individual user subjectivity.

Formal definitions of serendipity for recommendation systems assume that users have goals (e.g., acquiring items), whereas web browsing is often erratic. For this reason, manual evaluation followed established principles from psychology and Information Retrieval. Silvia's appraisal theory (Silvia, 2009) links interest to novelty, complexity and comprehensibility, which drive curiosity and information-seeking. Belkin's "anomalous states of knowledge" (ASK) (Belkin, 2014) views interestingness as the degree to which information resolves a knowledge gap. Schmidhuber's theory (Schmidhuber, 2010) suggests that interest arises when data balances novelty and comprehensibility, so that understanding improves but remains incomplete.

Knowledge Discovery in databases (aka Knowledge

Mining) identiﬁes previously undiscovered, potentially

valuable patterns in extensive databases using diverse tech-

niques and algorithms.

1.2 Problem Statement and Paper’s

Contribution

A serendipity model that goes beyond similarity-based recommendation systems in databases appears to be missing from the literature. This is in good part due to the Knowledge Graph data structure of the Web, which was not originally intended to be weighted, though its expressivity allows such a setting (see Section 7.1). Formulating serendipity for Wikipedia entities remains an open problem: interestingness uses association rules based on subjective user preferences tuned on similarity. Serendipity itself, mainly used in recommendation systems, adds unexpectedness to interestingness but remains user-dependent. Leveraging a knowledge graph for calculating serendipity means detaching from the subjective dimension of user-based data to distill an objective measure based on axioms grounded in the cognitive sciences. These axioms are transformed into logical clauses, which are in turn rendered mathematically by means of T-norm conversion.

The proposed measure for Serendipity diverges

from typical literature deﬁnitions. Renouncing

subjectivity, it captures curiosity towards the un-

known rather than positive response to novel dis-

covery. The resource presented in this article was

principally built to investigate correlations between

automatically-mined serendipity sub-components and

human serendipity scoring. However, as further ex-

pounded in the "Future Work" section [7], it can

serve a variety of further purposes. The paper’s main

contribution is an exhaustive method for distilling a

serendipity formula for Wikipedia entity pairs, appli-

cable (with due caveats) for building the weighted

semantic network foreshadowed in the introduction.

Subsequently, retrieving serendipitous paths represents a distinct challenge, as the cognitive principles determining path interestingness do not fully align with those for simple entity associations (Palma, 2023). This research direction is central to Computational Creativity, which also emphasizes metrics like novelty, usefulness and unexpectedness (Chhun et al., 2022).

The codebase for dataset generation and analysis is freely accessible at https://github.com/Glottocrisio/WikiWooW

Objective is in our ﬁnal experiment setting equated

to inter-subjective. Consensus on serendipity among users

will be used as a heuristic of objectivity, which will be fur-

ther materialized in the distilled mathematical formula.

It has been implemented in Python 3.9, requiring 2000

lines of code. Processor: Intel(R) Core(TM) i7-6600U CPU

@ 2.60 GHz. Run time for the ﬁrst dataset: 1h47min; sec-

ond dataset: 2h19min.

KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval


2 A NOVEL MODEL OF

SERENDIPITY

An entity in a Cross-domain Knowledge Graph like

DBpedia can be classiﬁed by measures simpler to

model mathematically than serendipity, such as popu-

larity. Intuitively, more page accesses indicate higher

popularity. However, if two entities have equal views (clickstream), the one with fewer incoming links should be deemed more popular, since access to it is less immediate. Among node centrality measures, we selected PageRank centrality (Page et al., 1999) for its off-the-shelf availability via the Wikifier API (Brank et al., 2017), for its consideration of page findability, and because its values range from 0 to 1, as do those of DBpedia Relatedness, DBpedia Similarity and Cosine Similarity. This heuristic is modelled as follows:

$$P(i) \simeq \frac{\mathrm{clickstream}(i)}{\mathrm{PageRank}(i)}$$

where $P$ is Popularity and PageRank is the Centrality measure. Clickstream and Popularity of a node are usually highly correlated. Among relative measures, similarity is by far the best known in the literature.
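The popularity heuristic can be illustrated with a minimal Python sketch. The function name and the numeric inputs are hypothetical and chosen only for illustration; they are not taken from the WikiWooW codebase.

```python
def popularity(clickstream: float, pagerank: float) -> float:
    """Popularity heuristic: page accesses divided by PageRank centrality."""
    if pagerank <= 0:
        # Guard added for this sketch; the source does not discuss this case.
        raise ValueError("PageRank must be positive")
    return clickstream / pagerank


# Two hypothetical pages with equal views but different centrality.
hub = popularity(clickstream=10_000, pagerank=0.5)
niche = popularity(clickstream=10_000, pagerank=0.1)
```

With equal views, the less central page (`niche`) receives the higher popularity value, which matches the intuition that reaching it is less immediate.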

Two similarity types exist: corpus-based and knowledge-based (Mihalcea et al., 2006). Capturing the gap between them delivers the degree of interestingness. Knowledge-based similarity uses Relatedness:

$$\mathrm{relatedness}(a, b) = \frac{\log(\max(|A|,|B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|,|B|))}$$

where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia (Milne and Witten, 2008). Corpus similarity of two entities is calculated by Pointwise Mutual Information or Latent Semantic Analysis of the related concept word lists, based on large corpora (Mihalcea et al., 2006).
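As a rough illustration, the link-based relatedness formula can be computed from two sets of incoming links. The function and argument names below are hypothetical, and the handling of an empty intersection is an assumption of this sketch, since the logarithm of zero is undefined.

```python
import math


def relatedness(links_a: set, links_b: set, total_articles: int) -> float:
    """Link-based measure of Milne and Witten (2008), as given in the text.

    links_a, links_b: sets of articles linking to a and to b.
    total_articles:   |W|, the number of articles in Wikipedia.
    """
    common = links_a & links_b
    if not common:
        # Assumption of this sketch: no shared in-links yields infinity.
        return math.inf
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    numerator = math.log(larger) - math.log(len(common))
    denominator = math.log(total_articles) - math.log(smaller)
    return numerator / denominator
```

As written, the value is 0 when the two link sets coincide and grows as their overlap shrinks.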

Having decomposed serendipity into modelable components, we proceed toward a general model as in (Palma, 2024):

$$I_{n1,n2} \sim P_{n1,n2} \times \frac{\left| S_{n1,n2} - \mathrm{DBpediaRel}_{n1,n2} \right|}{\log(\mathrm{clickstream}_{n1,n2})} \qquad (1)$$

where:

$$S_{n1,n2} \sim \frac{\mathrm{CosineSim}_{n1,n2} + \mathrm{DBpediaSim}_{n1,n2}}{2}$$

"Clickstream" sometimes synonymous with "click-

path" (sequence of hyperlinks visitors follow). Here, it de-

notes amount of accesses between pages (entity pair) or to

a single page (single entity).

https://wikifier.org/.

We use "I" instead of the more intuitive "S" for

"Serendipity" to differentiate it from the Similarity ("S")

measures.

The distributional similarity between two entities

is obtained through the simple average between co-

sine similarity (corpus-based) and DBpedia similar-

ity, whereas the joint popularity has been modelled in

order to capture the following set of constraints:

1. The overall serendipity increases if both entities

have a high popularity;

2. The overall serendipity increases if one entity is

considerably more popular than the other.

The first formulation corresponds to "affect-driven" interestingness (Hidi, 2006), while the second is "gap-driven" (Belkin, 2014). The first constraint assumes that popular entities drive attention and curiosity. A high degree of interestingness is also identified in the association of two entities with a large gap in popularity, for two reasons: first, a gap, as pointed out before, is always interesting; second, what is unexpected is also considered interesting. These two conditions have been integrated to model popularity as:

$$P_{n1,n2} \sim \ln\left(P_{n1+n2}^{\,n1-n2}\right)$$
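The general model of Equation 1 can be sketched in Python, reading it as joint popularity multiplied by the absolute gap between distributional similarity and DBpedia Relatedness, divided by the logarithm of the pair clickstream. This reading, the function name and the guard on small clickstream values are assumptions of the sketch; the joint popularity is passed in as a precomputed value rather than derived here.

```python
import math


def interestingness(joint_popularity: float, cosine_sim: float,
                    dbpedia_sim: float, dbpedia_rel: float,
                    clickstream: float) -> float:
    """Sketch of Equation 1 for one entity pair (hypothetical helper)."""
    # Distributional similarity: simple average of the two similarities.
    similarity = (cosine_sim + dbpedia_sim) / 2
    # Gap between distributional similarity and knowledge-based relatedness.
    gap = abs(similarity - dbpedia_rel)
    if clickstream <= 1:
        # Assumption: log(clickstream) must be positive to act as a divisor.
        raise ValueError("clickstream must be greater than 1")
    return joint_popularity * gap / math.log(clickstream)
```

Under this reading, a higher pair clickstream lowers the score when all other inputs are held fixed.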

3 ALTERNATIVE APPROACHES

FOR THE COMPUTATION OF

SERENDIPITY IN WikiWooW

Since an entity (node) can be either popular or unpopular, and a link can be characterised by corpus-based or knowledge-based similarity, we now want to find out how many combinations of those elements can occur, taking into account the following constraints:

• Entities can be either popular or unpopular (abso-

lute measures);

• The relationship between two entities can be la-

belled according to both deﬁnitions of similarity

(relative measures).

Accordingly, our problem can be modelled as a permutation with repetition: if each node can be of two types and the relationship of four types (namely, all possible combinations of the two similarities, of which one is knowledge-based and the other corpus-based), we obtain 2 × 4 × 2 = 16 possibilities, among which the following five are the only ones showing the previously mentioned gap/contradiction principle:

1. Popular Entity (-) high corpus- AND knowledge-

based similarity (-) Unpopular Entity;

2. Popular Entity (-) high corpus- BUT NOT

knowledge-based similarity (-) Unpopular Entity;

3. Popular Entity (-) high knowledge- BUT NOT

corpus-based similarity (-) Unpopular Entity (e.g.

Trivia);


4. Popular Entity (-) high corpus- BUT NOT

knowledge-based similarity (-) Popular Entity;

5. Popular Entity (-) high knowledge- BUT NOT corpus-based similarity (-) Popular Entity.

Though deﬁning popularity thresholds is inherently

fuzzy, we established workable boundaries through

trial and error. Based also on annotator consultation,

we classiﬁed entities with fewer than 6,000 monthly

pageviews as unpopular and those exceeding 12,000

as popular.
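The count of 2 × 4 × 2 = 16 combinations and the pageview thresholds can be checked with a short Python sketch. All names are hypothetical; the label "undetermined" for the band between the two thresholds is an assumption, since the text does not name that band.

```python
from itertools import product

UNPOPULAR_MAX = 6_000   # fewer monthly pageviews than this: unpopular
POPULAR_MIN = 12_000    # more monthly pageviews than this: popular


def popularity_class(monthly_pageviews: int) -> str:
    """Classify an entity by monthly pageviews using the two thresholds."""
    if monthly_pageviews > POPULAR_MIN:
        return "popular"
    if monthly_pageviews < UNPOPULAR_MAX:
        return "unpopular"
    return "undetermined"  # assumption: the in-between band is left unlabeled


NODE_TYPES = ("popular", "unpopular")
# Each link type is a pair: (high corpus-based?, high knowledge-based?).
LINK_TYPES = tuple(product((True, False), repeat=2))

# Every (node, link, node) configuration: 2 x 4 x 2 of them.
combinations = list(product(NODE_TYPES, LINK_TYPES, NODE_TYPES))
```

The list `combinations` enumerates every configuration, from which the five gap-bearing cases listed above would then be selected.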

3.1 Serendipity Models Using

Lukasiewicz Operators

The literature presents several approaches to convert

logical operators into mathematical ones. To produce

the candidate modelings to be tested against the hu-

man evaluation, we refer only to the basic operations

as listed in Lyrics (Marra et al., 2019) (see Figure 1).

Figure 1: Logic operation/T-norm conversion table from

Lyrics (Marra et al., 2019).

To showcase how to apply the T-norm, in the following we adopt only Lukasiewicz operators.

The variables used are defined as follows:

• $P(e)$: Popularity of entity $e$ (normalized to [0, 1]);

• $C(e_1, e_2)$: Corpus-based similarity;

• $K(e_1, e_2)$: Knowledge-based similarity;

• $S(e_1, e_2)$: Serendipity of the relationship.

We can express a basic serendipity function as:

$$S(e_1, e_2) = \mathrm{PopularityContrast}(e_1, e_2) \cdot \mathrm{SimilarityAsymmetry}(e_1, e_2) \qquad (2)$$

where:

$$\mathrm{PopularityContrast}(e_1, e_2) = \begin{cases} \min(2,\, P(e_1) + P(e_2)), & \text{if } P(e_1) > \tau \wedge P(e_2) > \tau \\ \max(0,\, P(e_1) - P(e_2)), & \text{otherwise} \end{cases} \qquad (3)$$

and:

$$\mathrm{SimilarityAsymmetry}(e_1, e_2) = \max(0,\, C(e_1, e_2) + K(e_1, e_2) - 1) \qquad (4)$$

An example of an unexpected conceptual relation between popular entities with an already known knowledge-based relationship is the one relating Casanova and Goldoni. It is well known that they were both active in eighteenth-century Venice, but few know that they are linked through Zanetta Farussi, the mother of Giacomo Casanova and one of the actresses of Carlo Goldoni. This piece of information can be fully exploited to conceive a story, and would be retrieved by this taxonomy.
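Equations 2 to 4 translate almost directly into code. The following Python sketch uses hypothetical function names, and the default threshold `tau=0.5` is an assumption made only so that the example runs; the text does not fix a value for τ on the normalized scale.

```python
def popularity_contrast(p1: float, p2: float, tau: float) -> float:
    """Equation 3: bounded sum if both entities exceed tau, else bounded difference."""
    if p1 > tau and p2 > tau:
        return min(2.0, p1 + p2)
    return max(0.0, p1 - p2)


def similarity_asymmetry(corpus_sim: float, knowledge_sim: float) -> float:
    """Equation 4: bounded combination of the two similarities."""
    return max(0.0, corpus_sim + knowledge_sim - 1.0)


def serendipity(p1: float, p2: float, corpus_sim: float,
                knowledge_sim: float, tau: float = 0.5) -> float:
    """Equation 2: product of the two components (tau default is an assumption)."""
    return popularity_contrast(p1, p2, tau) * similarity_asymmetry(
        corpus_sim, knowledge_sim
    )
```

Note that the second branch of Equation 3, as written, depends on argument order: it returns 0 whenever the first entity is the less popular one.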

3.2 Further Alternative Serendipity

Models

The following alternative mathematical models of serendipity have been computed on the basis of the features extracted through the WikiWooW project, with the goal of identifying the one that most closely matches the human annotations and evaluations (see Section 5).

Notation:

• H(·, ·): Harmonic mean function;

• A(·, ·): Arithmetic mean function;

• R: Resultant length in circular statistics.

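Model 2: Logarithmic Popularity Contrast with Similarity Divergence.

$$I^{(2)}_{\mathrm{pop}}(e_1, e_2) = \log\left(1 + \frac{\max(P(e_1), P(e_2))}{\min(P(e_1), P(e_2)) + 1}\right) \qquad (5)$$

$$I^{(2)}_{\mathrm{sim}}(e_1, e_2) = |S_{\cos}(e_1, e_2) - S_{\mathrm{dbp}}(e_1, e_2)| + |S_{\cos}(e_1, e_2) - R_{\mathrm{dbp}}(e_1, e_2)| \qquad (6)$$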
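The two components of Model 2 can be sketched as follows, reading Equation 5 as the logarithm of one plus the ratio of the larger popularity to the smaller popularity plus one, and Equation 6 as the sum of the absolute gaps between Cosine Similarity and each of the two DBpedia measures. The function names are hypothetical, and how the two components are combined into one score is not stated here, so the sketch stops at the components.

```python
import math


def log_popularity_contrast(p1: float, p2: float) -> float:
    """Equation 5: log(1 + max(P) / (min(P) + 1))."""
    return math.log(1 + max(p1, p2) / (min(p1, p2) + 1))


def similarity_divergence(s_cos: float, s_dbp: float, r_dbp: float) -> float:
    """Equation 6: |S_cos - S_dbp| + |S_cos - R_dbp|."""
    return abs(s_cos - s_dbp) + abs(s_cos - r_dbp)
```

The added 1 in the denominator keeps the ratio finite when the less popular entity has a popularity of zero.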

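Model 3: Entropy-Based Popularity with KL Divergence Proxy.

$$I^{(3)}_{\mathrm{pop}}(e_1, e_2) = -p_1 \log(p_1) - p_2 \log(p_2) \qquad (7)$$

$$\text{where } p_i = \frac{P(e_i)}{P(e_1) + P(e_2)} \qquad (8)$$

$$I^{(3)}_{\mathrm{sim}}(e_1, e_2) = s'_{\cos} \log\left(\frac{s'_{\cos}}{\bar{s}}\right) + s'_{\mathrm{dbp}} \log\left(\frac{s'_{\mathrm{dbp}}}{\bar{s}}\right) \qquad (9)$$

$$s' = s + 0.1, \qquad \bar{s} = \frac{s'_{\cos} + s'_{\mathrm{dbp}} + s'_{\mathrm{rel}}}{3} \qquad (10)$$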
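The entropy-based popularity term of Model 3 (Equations 7 and 8) can be sketched in Python as the Shannon entropy of the two popularity shares. The function name is hypothetical, and treating a zero share as contributing nothing (the usual 0 · log 0 = 0 convention) is an assumption of this sketch.

```python
import math


def entropy_popularity(pop1: float, pop2: float) -> float:
    """Shannon entropy of the popularity shares of two entities."""
    total = pop1 + pop2
    if total <= 0:
        raise ValueError("at least one popularity value must be positive")
    entropy = 0.0
    for pop in (pop1, pop2):
        share = pop / total
        if share > 0:  # convention: a zero share contributes nothing
            entropy -= share * math.log(share)
    return entropy
```

This term peaks at log 2 when both entities are equally popular and falls to 0 when one entity holds all the popularity, so it rewards balance rather than contrast.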

The alternative models presented in this subsection

have been brainstormed and mathematically rendered with

support of Artiﬁcial Intelligence. Human intervention en-

compassed prompting, output analysis, selection, clean-up

and correction.


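Model 4: Harmonic Mean Contrast with Weighted Variance.

$$I^{(4)}_{\mathrm{pop}}(e_1, e_2) = \log\left(\frac{A(P(e_1), P(e_2))}{H(P(e_1), P(e_2))} + 1\right) \qquad (11)$$

$$I^{(4)}_{\mathrm{sim}}(e_1, e_2) = \sum_{i=1}^{3} (s_i - \bar{s})^2 \cdot (1 + |s_i - \bar{s}|) \qquad (12)$$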

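Model 5: Geometric Dispersion with Circular Variance.

$$I^{(5)}_{\mathrm{pop}}(e_1, e_2) = \log(D) \cdot \log(G) \qquad (13)$$

$$D = \frac{\max(p_1, p_2)}{\min(p_1, p_2)}, \qquad G = \sqrt{p_1 \cdot p_2} \qquad (14)$$

$$p_1 = P(e_1, e_2) + 1, \qquad p_2 = |P(e_1, e_2)| + 1 \qquad (15)$$

$$I^{(5)}_{\mathrm{sim}}(e_1, e_2) = 1 - R + 0.1 \qquad (16)$$

$$R = \sqrt{\bar{c}^2 + \bar{s}^2}, \qquad \theta_i = 2\pi s_i \qquad (17)$$

$$\bar{c} = \frac{1}{3}\sum_{i=1}^{3} \cos(\theta_i), \qquad \bar{s} = \frac{1}{3}\sum_{i=1}^{3} \sin(\theta_i) \qquad (18)$$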
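The similarity term of Model 5 can be sketched in Python under the standard circular-statistics reading: each similarity value is mapped to an angle, R is the mean resultant length of those angles, and the term is 1 − R + 0.1. The function name is hypothetical, and averaging the cosines and sines over the number of inputs is an assumption consistent with the notation "resultant length in circular statistics".

```python
import math


def circular_dispersion(similarities: list) -> float:
    """1 - R + 0.1, where R is the mean resultant length of the angles 2*pi*s."""
    angles = [2 * math.pi * s for s in similarities]
    n = len(angles)
    c_bar = sum(math.cos(a) for a in angles) / n
    s_bar = sum(math.sin(a) for a in angles) / n
    resultant = math.hypot(c_bar, s_bar)  # sqrt(c_bar**2 + s_bar**2)
    return 1 - resultant + 0.1
```

Identical similarity values give the minimum of 0.1, while values spread evenly around the circle approach 1.1. One consequence of the angular mapping is that similarities of 0 and 1 land on the same angle and are therefore treated as identical.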

4 EXPERIMENT 1: FEATURES

IMPORTANCE ASSESSMENT

We chose an a priori serendipity formulation to test the correlation between subjectivity and objectivity/intersubjectivity via annotations in this and subsequent experiments. Following the theoretical background, WikiWooW was designed with the following features: Entity1; Entity2; ClickstreamEnt1Ent2; PopularityEnt1; PopularityEnt2 (calculated using PageView and PageRank); PopularityDiff; CosineSimilarityEnt1Ent2; DBpediaSimilarityEnt1Ent2; DBpediaRelatednessEnt1Ent2; InterestingnessEnt1Ent2 (all 5 proposed models); Serendipity Ground Truth Values.

The clickstream data on single Wikipedia entities is collected by means of MediaWikiAPI (https://pypi.org/project/mediawikiapi/). To fetch the clickstream data related to entity couples, we instead exploit the Wikimedia Clickstream Data Dumps (https://dumps.wikimedia.org/other/clickstream/readme.html; the dataset uses English entities accessed in November 2018), as done in the WikiNav project (https://wikinav.toolforge.org/). Similarity measures use Sematch (Zhu and Iglesias, 2017) (https://pypi.org/project/sematch/). Initially, the "Ground Truth Values" were manual interestingness annotations by Amazon Mechanical Turk (AMT) workers, guided by the questions: Would you be interested in deepening the connection between these Wikipedia entities? Does this connection spark your curiosity? Of 413 entity pairs, all except two were labeled interesting, creating a drastic dataset imbalance. To alleviate this, we randomly selected 1000 Wikipedia entity pairs, shuffling the original 413 among them. The second annotation formulation does not state explicitly that the entity pairs are linked. This raised the number of pairs labeled non-interesting to 31 (19 of them original), which is encouraging for larger-scale analysis but insufficient for evaluation: individuals with different interests and conceptions of "interestingness" are unlikely to agree so extensively.

For this reason, and given the exploratory nature of the research, serendipity values were balanced using a median threshold for "non-interesting" values.

The related literature proposes, among others, Principal Component Analysis (PCA) and SHapley Additive exPlanations (SHAP) for feature importance assessment (FIA).

Principal Component Analysis (PCA) (Jolliffe and Cadima, 2016) identifies key patterns by projecting the original data onto a new set of axes, known as principal components. These components are hierarchically arranged based on their ability to capture data variability, with the first component accounting for the most variance. Its unsupervised nature and variance-maximization objective make it unsuitable for identifying the features most relevant to prediction tasks; it should be reserved for dimensionality reduction.

SHAP (SHapley Additive exPlanations) (Lund-

berg and Lee, 2017) draws inspiration from cooper-

ative game theory to quantify the contribution of each

feature to a model’s output.

However, given the characteristics of our dataset and its generally superior performance on FIA compared to the above-mentioned alternatives, we selected the cforest function (Strobl et al., 2007), which provides unbiased variable selection within classification trees. When implemented with subsampling without replacement, it produces reliable importance measures that are robust across predictor variables with different measurement scales or numbers of categories.

The feature importance values in Table 1 show that our first attempt at serendipity modeling (Equation 1) scores higher than all other features, suggesting that unifying the sub-components in a single expression might lead to a satisfactory formulation.

Three workers label each entity pair; ﬁnal value aver-

ages their conﬁdence ratings.

Still directly linked through dbo:wikiPageWikiLink.


Table 1: Random Forest - Feature Importances.

Feature                  Importance
Serendipity Model [1]    0.231
Clickstream              0.178
DBpediaSimilarity        0.127
DBpediaRelatedness       0.121
CosineSimilarity         0.118
PopularityDiff           0.112

In the following section, we apply another approach to combine the features into an equation that best expresses serendipity as it emerges from the human annotation.

4.1 Symbolic Regression Analysis

Discovering interpretable mathematical expressions

directly from data is a task that has traditionally been

managed using genetic programming (Vanneschi and

Poli, 2012). Recently, however, there has been an

increasing interest in employing a deep learning ap-

proach for this purpose (Makke and Chawla, 2023).

The following expression has been found by using

the Python library gplearn, and tries to capture the

weighting of features to result in the ground truth.

Results from other libraries, showing similarly convo-

luted results, have been omitted for reasons of space.

I(s

, s

) =

−0.166

−

(0.729−X

+ X

0.729−X

0.729−

clickstream

n1,n2

0.811

0.647

+ X



−0.606 ×clickstream

n1,n2



−X

where:

= CosineSim

n1,n2

= DBpediaSim

n1,n2

= DBpediaRel

n1,n2

Despite hyperparameter optimization efforts, the

model’s performance remained unchanged, yielding

outputs that exhibited either excessive simplicity or

unnecessary complexity (as observed in the output

proposed above).

5 EXPERIMENT 2: MORE

ANNOTATORS, BETTER

GUIDELINES

In a last attempt to reconcile subjective evaluations into an apparent inter-subjectivity, a more thorough evaluation was designed, featuring the drafting of guidelines and the selection of reliable evaluators.

Low inter-annotator agreement on serendipity, measured on a dataset of 60 entity pairs (10% of the final dataset envisioned in case of successful evaluation), led to the creation of a smaller dataset of 146 entity pairs, extracted from the Wikimedia Clickstream Data Dump, where the final serendipity value for each entity couple was calculated as the average of all scores.

Table 2: Overall Inter-Annotator Agreement Metrics.

Metric                    Value   Interpretation
Fleiss’ Kappa             0.38    Fair agreement
Cronbach’s Alpha          0.14    Poor reliability
Krippendorff’s Alpha      0.09    Slight agreement
Avg. MSE                  0.35    –
Avg. Percent Agreement    0.43    –
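For readers who wish to recompute the first metric in Table 2, Fleiss' kappa can be sketched in pure Python. The function below is the textbook formula rather than the authors' script; it expects one row per entity pair holding the number of annotators who chose each score category, with the same number of annotators on every row.

```python
def fleiss_kappa(ratings: list) -> float:
    """Fleiss' kappa; ratings[i][j] = annotators assigning item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])  # assumed identical for every item
    n_categories = len(ratings[0])

    # Share of all assignments that fell into each category.
    category_share = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item.
    item_agreement = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(item_agreement) / n_items
    expected = sum(share * share for share in category_share)
    return (observed - expected) / (1 - expected)
```

With the three score categories used in this experiment (0, 0.5 and 1) and ten annotators, each row would hold three counts summing to ten.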

In order to investigate the link between entity-pair popularity and serendipity, one third of the dataset consists of pairs of very popular entities (above 12,000 page views), one third of pairs of unpopular entities (below 6,000), and one third of pairs of mixed popularity (one entity popular and the other not).

The following guidelines instruct annotators on

assessing entity pairs from English Wikipedia for

serendipity, deﬁned as: "the property of an entity pair

whose relationship is simultaneously unexpected and

relevant or interesting." We deﬁne "close" relation-

ships as non-trivial direct connections between en-

tities. Trivial relationships include generic connec-

tions like "related to," "same as," "are things," or "are

persons." Non-trivial connections include "child of,"

"successor of," "grown in," "coeval with," and other

speciﬁc, meaningful relationships. Annotators evalu-

ate entity pairs (e.g., ’Mark Antony’; ’Alexander the

Great’) using a Google Form with yes/no questions,

resulting in scores of 1 (serendipitous), 0.5 (unex-

pected but irrelevant), or 0 (too obvious).

The ten evaluators are Master of Education students from the Faculty of Humanities and Social Sciences at the University of Pécs, Hungary, each majoring in English language and culture and selected based on their level of English proficiency.

Example 1: ’Classical antiquity’; ’Alexander the Great’.

Score: 0 (too obvious).

• Q1: Could there be a close relationship? → Yes

• Q2: Do you know the relationship’s nature? → Yes (general knowledge)

• Result: Too obvious

Example 2: ’Chanakya (TV series)’; ’Alexander the

Great’.

Score: 0.5 or 1.

A relationship seems plausible but its nature is un-

known. The score depends on curiosity: lack of in-

terest yields 0.5 (irrelevant), while genuine curiosity

yields 1 (serendipitous).

Example 3: ’Mark Antony’; ’Alexander the Great’.

Score: 1 (serendipitous).

Despite different historical periods, a connection ap-

pears likely but remains unclear, generating curiosity.

The annotation process captures perceived related-

ness (Q1), subjective knowledge (Q2), and serendip-

ity scores, enabling analysis of how perceived con-

nections and prior knowledge inﬂuence serendipity

judgments and providing empirical data for model re-

ﬁnement.
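The scoring logic illustrated by the three examples can be summarised in a small Python sketch. The function and parameter names are hypothetical, the `curious` flag stands in for the annotator's stated interest, and the case in which Q1 is answered "No" is not covered by the examples, so the sketch leaves it undefined.

```python
from typing import Optional


def serendipity_score(close_relationship: bool, knows_nature: bool,
                      curious: bool) -> Optional[float]:
    """Map one annotator's answers to 0, 0.5 or 1, following the examples."""
    if not close_relationship:
        return None  # assumption: not specified by the worked examples
    if knows_nature:
        return 0.0   # too obvious
    return 1.0 if curious else 0.5  # serendipitous vs. unexpected but irrelevant


def average_score(scores: list) -> float:
    """Final value for an entity pair: mean of the defined annotator scores."""
    valid = [score for score in scores if score is not None]
    return sum(valid) / len(valid)
```

For instance, Example 1 corresponds to `serendipity_score(True, True, False)`, and Example 3 to `serendipity_score(True, False, True)`.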

6 VISUALIZATION AND

EVALUATION OF RESULTS

To investigate serendipity behavior across popularity

combinations (as shown in section 2), we categorized

entity couples into three groups: popular-popular,

unpopular-unpopular, and popular-unpopular pairs.

Table 3: Average Serendipity and Knowledge Metrics by Entity Popularity.

Category     PR     SK     Ser.
Overall      0.60   0.40   0.50
Popular      0.59   0.44   0.57
Unpopular    0.60   0.39   0.39
Pop-Unpop    0.62   0.38   0.49

PR: Perceived Relatedness; SK: Subjective Knowledge; Ser.: Serendipity.
Serendipity: 0.6 = Moderate; 0.7 = High; 0.8+ = Very High

Table 3 shows that pairs of popular entities exhibit a higher average serendipity (0.572) than pairs of unpopular entities (0.386), supporting our principled assumption from Section 2. Furthermore, results confirm that high clickstream correlates with decreased serendipity scores: all highly serendipitous pairs (scores higher than 0.8) consistently associate with low clickstream values. This observation, which aligns with our initial axioms, suggests potential optimization strategies for computationally expensive algorithms that retrieve serendipitous couples.

The considerable variance of the serendipity values around the mean (0.5) validates our choice to calculate serendipity as an average. However, this variance also indicates that objective serendipity measurement remains infeasible within our problem formulation, at least for datasets as small as the ones adopted for the experiments.

Although deeper analysis is required to identify the

best-performing model against ground truth, any re-

sulting model will be far from deﬁnitive, having been

derived through induction rather than the desirable

deductive approach. Nevertheless, these results pro-

vide foundational value for serendipity computation

in other computational creativity settings, as they

emerge from logical, explicit principles readily adapt-

able to diverse applications.

Figure 3 reveals key differences between popular and mixed entity couples: in the former, perceived relatedness correlates highly with serendipity, while in the latter it correlates only mildly. Interestingly, for popular pairs serendipity also shows a mild correlation with subjective knowledge and clickstream, which conversely anti-correlate with it for pairs of mixed popularity, where the best predictors are DBpedia Relatedness and the mathematical model of Equation 1. These are among the worst predictors for popular couples, where the relevance of clickstream for serendipity annotation is also reflected in the alternative serendipity models tuned with it, which perform better than the same models without clickstream. For the same reason, the opposite behavior is expected for couples of mixed popularity.

The plots also show that high serendipity is indeed correlated with considerable differentials between entity popularity values, as well as between Cosine Similarity (together with DBpedia Relatedness) and DBpedia Similarity, an observation already postulated in the principled approach.
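The predictor ranking in Figure 3 rests on Pearson correlation between each feature and the averaged serendipity score. A minimal pure-Python sketch of that coefficient is given below; the function name is hypothetical and the code is the textbook formula, not the authors' analysis script.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)
```

Applied per popularity group, for example to the clickstream column against the serendipity column, the coefficient would be positive for a predictor that rises with serendipity and negative for one that anti-correlates with it.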

Figure 4 illustrates model performance across

serendipity ranges: all models perform adequately

for low serendipity values; however, only Model 2

demonstrates encouraging results for high serendip-

ity cases. Medium serendipity values prove entirely

unpredictable across all models, suggesting inherent

complexity in this range.

7 FUTURE WORK

Several avenues for future research emerge from this

study. First, multivariate analysis can be employed to

examine relationships between multiple variables simultaneously. The model itself can even be enhanced by incorporating additional features into the equation, while expanding the number of testers will improve annotation reliability. The extensive multilingual coverage of constantly maintained clickstream data enables future cross-cultural studies assessing interestingness variance across languages, with significant implications for Computational Creativity. Additionally, diachronic analysis comparing data dumps across years would reveal user interest shifts over time.

Figure 2: Parallel coordinate plot of serendipity and the extracted features together with the human annotated subcomponents. From left to right: Clickstream, PopularityEnt1, PopularityEnt2, PopularityDiff, Cosine Similarity, DBpedia Similarity, DBpedia Relatedness, Perceived Relatedness, Subjective Knowledge, Serendipity.

Figure 3: Serendipity predictors based on Pearson correlation. On the left, values computed on entity couples with mixed popularity. The AI-generated alternative serendipity models have not been tuned with clickstream data. On the right, values computed on entity couples with high popularity; clickstream (*Click) included in serendipity models.

Figure 4: Parallel coordinate plot of serendipity and the alternative models. From left to right: the five serendipity models, Perceived Relatedness, Subjective Knowledge, Serendipity.

7.1 Implementing Weighting of Knowledge Graph Relations Through SW-Technologies

To transform traditional knowledge graphs into

weighted networks, numerical values can be incorpo-

rated directly into the graph structure using Seman-

tic Web (SW) technologies such as RDF-star (RDF*)

(Hartig et al., 2021), which enables the annotation of

RDF triples with additional metadata.

For instance, the DBpedia property dbo:wikiPageWikiLink could be enhanced with
narrative interestingness scores by creating qualified statements that attach
numerical weights to each link. Using RDF-star syntax, a weighted link could be
expressed as

<< :Entity1 dbo:wikiPageWikiLink :Entity2 >> :weight 0.8 .

enabling SPARQL-star queries to filter and rank relationships based on their
numerical significance (prefix declarations omitted):

SELECT ?s ?p ?o ?w WHERE {
  << ?s ?p ?o >> :weight ?w .
  FILTER ( ?w > 0.7 )
}
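As a minimal sketch of how such weighted statements might be produced from a table of scored entity pairs, the following Python function writes Turtle-star lines in the shape shown above. The `:` namespace (`http://example.org/wikiwoow#`), the function name, and the restriction of weights to [0, 1] are assumptions made for illustration; entity names containing characters such as parentheses would additionally need escaping, which is not handled here.

```python
from typing import Iterable, Tuple

# Hypothetical prefixes: the ":" namespace is a placeholder, not part of the paper.
PREFIXES = (
    "@prefix dbo: <http://dbpedia.org/ontology/> .\n"
    "@prefix dbr: <http://dbpedia.org/resource/> .\n"
    "@prefix : <http://example.org/wikiwoow#> .\n"
)


def to_turtle_star(links: Iterable[Tuple[str, str, float]]) -> str:
    """Serialize (source, target, weight) rows as RDF-star quoted triples."""
    lines = [PREFIXES]
    for source, target, weight in links:
        # Assumption: weights are normalized scores in [0, 1].
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight out of [0, 1]: {weight}")
        lines.append(
            f"<< dbr:{source} dbo:wikiPageWikiLink dbr:{target} >> "
            f":weight {weight:.2f} ."
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_turtle_star([("Pisa", "Galileo_Galilei", 0.8)]))
```

Note that a quoted triple of this form annotates the link without asserting it, so the plain dbo:wikiPageWikiLink triples would still need to be present in the graph.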

Achieving the data coverage necessary for a large knowledge base such as Wikipedia
requires a collective effort, which can only thrive on enhanced User Experience and
even gamification strategies.

KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval

7.2 Of Colors and Gadgets

Links are fundamental to Wikipedia and web development, though Dimitrov et al.
(2017) found that only about 4% of Wikipedia links receive more than 10 monthly
clicks. To understand WikiWooW’s potential applications and benefits, we must first
examine the various link types, colors, and the concept of "orphan articles" on
Wikipedia.

Wikipedia employs three distinct link cate-

gories.

Internal links (wikilinks) connect pages

within the same project, while interwiki links connect

to different Wikimedia projects using preﬁxes (e.g.,

"de" for German Wikipedia). External links, marked

with an icon, can direct to any web page. This article

focuses exclusively on internal links, which are the

sole links considered in our dataset.

Related to link connectivity is the issue of "orphan articles": Wikipedia pages
lacking incoming links from other main namespace pages. Although these pages remain
reachable via internal search and external services, Wikipedia’s guidelines advocate
integrating them through links from related pages. Such integration not only
increases readership but also attracts contributors who can enhance content quality.

To facilitate navigation and user experience,

Wikipedia implements a color-coding system for

links.

Blue indicates unvisited existing pages, pur-

ple shows visited existing pages, red marks non-

existent unvisited pages, and light maroon denotes

non-existent visited pages. While these colors vary

by skin and can be customized through user scripts

and CSS, the fundamental principle persists: blue for

existing articles, red for non-existent ones.

Building on this customization capability, user

scripts commonly modify default link colors, with

community-approved scripts called "gadgets" being

widely adopted. For instance, the "Disambiguation-

Links" gadget highlights disambiguation pages in dif-

ferent colors.

Following this model, WikiWooW’s values could similarly be used to identify and
color-code interesting links based on their clickstream data, graph- or
cosine-similarity, or even serendipity. This approach would provide users with a
more objective discovery experience while assisting editors in finding less popular
articles requiring improvement.

https://en.wikipedia.org/wiki/Help:Link.

https://en.wikipedia.org/wiki/Wikipedia:Orphan.

https://en.wikipedia.org/wiki/Help:Link_color.

https://en.wikipedia.org/wiki/MediaWiki:Gadget-DisambiguationLinks.css.
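The color-coding idea described above can be sketched as a simple bucketing of link scores into CSS classes. This is only an illustration: the thresholds, class names, and colors below are hypothetical and not derived from the paper, and an actual gadget would be written as a user script and stylesheet rather than in Python.

```python
from typing import List, Tuple

# Hypothetical buckets: (lower threshold, CSS class name, color).
# Ordered from highest to lowest threshold.
BUCKETS: List[Tuple[float, str, str]] = [
    (0.7, "wikiwoow-high", "#d9480f"),
    (0.4, "wikiwoow-medium", "#e8a317"),
    (0.0, "wikiwoow-low", "#0645ad"),
]


def css_class(score: float) -> str:
    """Return the class of the first bucket whose threshold the score reaches."""
    for threshold, name, _color in BUCKETS:
        if score >= threshold:
            return name
    raise ValueError(f"score below the lowest bucket: {score}")


def stylesheet() -> str:
    """Emit one CSS rule per bucket, coloring anchors of that class."""
    return "\n".join(
        f"a.{name} {{ color: {color}; }}" for _threshold, name, color in BUCKETS
    )


if __name__ == "__main__":
    print(css_class(0.85))
    print(stylesheet())
```

Any of the dataset features (clickstream, graph- or cosine-similarity, serendipity) could play the role of `score`, provided it is first scaled to a common range.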

However, the effectiveness of such color-coding depends on user behavior patterns.
Dimitrov et al. (2016) found that readers primarily click links in prominent
locations: lead sections, right sidebars with infoboxes, and left body areas, while
generally avoiding right-side regions. Consequently, WikiWooW’s model could
counteract this spatial bias by identifying and marking valuable links in
underutilized regions, though initial clickstream data would inevitably reflect
these existing patterns.

Ultimately, implementing a user script with WikiWooW’s model would enable testing
whether users find the suggested links engaging within Wikipedia’s native
environment. Even if no explicit feedback API is provided, the impact of the
proposed user-experience enhancement can be assessed implicitly through shifts in
the clickstreams between entities, which are already collected monthly in the
Wikimedia Clickstream Data Dump.
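The implicit assessment described above amounts to comparing the click count of an entity pair across two monthly dumps. The sketch below assumes the tab-separated clickstream layout with four columns (prev, curr, type, n) and keeps only rows of type "link"; the function names and the toy numbers are invented for illustration.

```python
from typing import Dict, Optional, Tuple

Pair = Tuple[str, str]


def load_link_counts(tsv_text: str) -> Dict[Pair, int]:
    """Parse clickstream rows (prev, curr, type, n), keeping only type 'link'."""
    counts: Dict[Pair, int] = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        prev, curr, link_type, n = line.split("\t")
        if link_type == "link":
            counts[(prev, curr)] = counts.get((prev, curr), 0) + int(n)
    return counts


def relative_shift(
    before: Dict[Pair, int], after: Dict[Pair, int], pair: Pair
) -> Optional[float]:
    """Relative change in clicks for a pair; None if there is no baseline."""
    base = before.get(pair, 0)
    if base == 0:
        return None
    return (after.get(pair, 0) - base) / base


if __name__ == "__main__":
    # Toy data with invented counts, not real clickstream values.
    month_a = load_link_counts("Pisa\tGalileo_Galilei\tlink\t100\n")
    month_b = load_link_counts("Pisa\tGalileo_Galilei\tlink\t150\n")
    print(relative_shift(month_a, month_b, ("Pisa", "Galileo_Galilei")))
```

A real evaluation would also need a control set of unmarked links, since month-to-month variation occurs independently of any intervention.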

8 LIMITATIONS, CHALLENGES AND FINAL REMARKS

If attempts to model serendipity in a principled fashion continue to prove
unsatisfactory, a machine learning approach represents the logical next step for
modeling the complex interactions among the established features, though careful
attention must be paid to avoiding overfitting.

Since the harvested values rely primarily on Sematch, an application that, although
still maintained, was developed nearly a decade ago, a thorough assessment of the
value retrieval methods is necessary to purge outdated information from the dataset.
Corpus-based similarity could be better captured using word embeddings from
Wikipedia2vec, which temporally aligns with Wikimedia clickstream data. To enhance
precision, clickstream data should be computed as monthly averages rather than from
our current single-month snapshot (November 2018). Furthermore, its normalization
could explore alternatives that capture the remarkable fluctuations observed in the
data better than logarithms do. For clickstream-agnostic Knowledge Graph analysis,
the equation should gradually de-emphasize clickstream in favor of graph centrality.
Adding centrality as an explicit dataset feature (currently implicit in popularity,
see Equation 2) represents the immediate next step for observing its behavior
against annotations. With improved dataset quality and size following these
guidelines, even symbolic regression should yield better results.
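To make the two clickstream suggestions concrete, the sketch below averages a pair's counts over several months and compares a logarithmic normalization with a rank-based one. The exact normalization used in the dataset is not restated here, so `log(1 + n)` scaled by its maximum is an assumption, and the rank-based variant is merely one possible alternative for heavy-tailed counts.

```python
import math
from statistics import mean
from typing import List, Sequence


def monthly_average(counts: Sequence[int]) -> float:
    """Average click count of one entity pair over several monthly dumps."""
    return mean(counts)


def log_normalize(values: Sequence[float]) -> List[float]:
    """Scale log(1 + v) by the largest log value, giving results in [0, 1]."""
    logs = [math.log1p(v) for v in values]
    top = max(logs)
    return [x / top if top > 0 else 0.0 for x in logs]


def rank_normalize(values: Sequence[float]) -> List[float]:
    """Map each value to the fraction of values less than or equal to it."""
    n = len(values)
    return [sum(1 for u in values if u <= v) / n for v in values]


if __name__ == "__main__":
    # Invented counts spanning several orders of magnitude.
    averaged = [monthly_average(m) for m in ([10, 12, 14], [900, 1100], [100000])]
    print(log_normalize(averaged))
    print(rank_normalize(averaged))
```

The rank-based version discards magnitudes entirely, which removes the influence of extreme values but also any information about how far apart two pairs are.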

Figure 3 already demonstrates how clickstream retains its relevance in predicting
serendipity only for popular-popular entity pairs. Exploring clickstream-centrality
correlations could yield satisfactory popularity approximations using centrality
alone, enabling analysis on any graph without clickstream requirements.
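One way such a correlation check might look is sketched below: a centrality value is computed for each node of a small link graph and correlated with a popularity value using the Pearson coefficient. In-degree centrality is used purely as a stand-in, since the choice of centrality measure is left open here, and the toy graph and popularity figures are invented.

```python
from math import sqrt
from typing import Dict, Iterable, Sequence, Tuple


def in_degree_centrality(
    edges: Iterable[Tuple[str, str]], nodes: Sequence[str]
) -> Dict[str, float]:
    """Incoming-link count per node, divided by (number of nodes - 1)."""
    counts = {node: 0 for node in nodes}
    for _source, target in edges:
        counts[target] += 1
    denom = max(len(counts) - 1, 1)
    return {node: c / denom for node, c in counts.items()}


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


if __name__ == "__main__":
    nodes = ["A", "B", "C"]
    edges = [("A", "B"), ("C", "B"), ("A", "C")]
    popularity = {"A": 10.0, "B": 30.0, "C": 20.0}  # invented values
    centrality = in_degree_centrality(edges, nodes)
    print(pearson([centrality[n] for n in nodes], [popularity[n] for n in nodes]))
```

A high coefficient on real data would support approximating popularity from graph structure alone; a low one would indicate that clickstream carries information that centrality does not.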

We have shown how all basic initial assumptions, based on intuition and consolidated
by cognitive and information theory, were validated by the experimental data.
Although a model of serendipity that clearly outperforms the others has not yet been
found, our experimental results show that a combination of the identified
sub-components is a proxy for measuring serendipity. Beyond these specific
directions, the comprehensive nature of our dataset positions it as a valuable
resource for broader research in computational creativity, offering multiple
possibilities for exploration that constitute a significant contribution in itself.

AUTHORS CONTRIBUTION

Cosimo Palma: Conceptualization, Methodology,

Codebase development, Formal Analysis, Investiga-

tion, Data Curation, Writing (paper original draft,

review and editing, annotators guidelines original

draft, review and editing), Information Visualization,

Project administration.

Bence Molnár: Methodology, Investigation, Re-

cruitment, Coaching and Supervision of annotators,

Writing (section 7.2 original draft, paper review and

editing, Annotators guidelines review and editing),

Project administration.

ACKNOWLEDGEMENTS

This work could not have been carried out without

the support of several scholars. First of all, Dr. Maria

Pia Di Buono, who actively participated throughout

the ideation and design of the second experiment; her

suggestions for the annotator guidelines were particu-

larly instrumental in improving the quality of data col-

lection. The work has also beneﬁted from insightful

discussions with Dr. Emanuele Marconato, Philipp

Bous, Prof. Dr. Carlo Strapparava, Sebastien Al-

bouze, Dr. Vassilis Tzouvaras, and Dr. Victor De

Boer. We extend our sincere gratitude to the data

annotators and to the anonymous reviewers, whose

constructive feedback signiﬁcantly contributed to en-

hancing this paper.

REFERENCES

Belkin, N. (2014). Anomalous states of knowledge as a

basis for information retrieval. Canadian Journal of

Information Science, pages 133–143.

Bojchevski, A. and Günnemann, S. (2020). Adversarial at-

tacks on node embeddings via graph poisoning. In In-

ternational Conference on Machine Learning, pages

695–704. PMLR.

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and

Yakhnenko, O. (2013). Translating embeddings for

modeling multi-relational data. In Advances in Neural

Information Processing Systems, pages 2787–2795.

Brank, J., Leban, G., and Grobelnik, M. (2017). Annotating

documents with relevant wikipedia concepts. In Pro-

ceedings of the Slovenian Conference on Data Mining

and Data Warehouses (SiKDD 2017).

Cheng, G., Gunaratna, K., Thalhammer, A., Paulheim, H.,

Voigt, M., and García, R. (2015). Sematch: Semantic

entity search from knowledge graph. In Cheng, G.,

Gunaratna, K., Thalhammer, A., Paulheim, H., Voigt,

M., and García, R., editors, Joint Proceedings of the

1st International Workshop on Summarizing and Pre-

senting Entities and Ontologies and the 3rd Interna-

tional Workshop on Human Semantic Web Interfaces

(SumPre 2015, HSWI 2015), Portoroz, Slovenia.

Chhun, C., Colombo, P., Suchanek, F. M., and Clavel, C.

(2022). Of human criteria and automatic metrics: A

benchmark of the evaluation of story generation.

Diedrich, J., Benedek, M., Jauk, E., and Neubauer, A.

(2015). Are creative ideas novel and useful? Psychol-

ogy of Aesthetics, Creativity, and the Arts, 9:35–40.

Dimitrov, D., Singer, P., Lemmerich, F., and Strohmaier,

M. (2016). Visual positions of links and clicks on

wikipedia. In Proceedings of the 25th International

Conference Companion on World Wide Web, WWW

’16 Companion, page 27–28, Republic and Canton of

Geneva, CHE. International World Wide Web Confer-

ences Steering Committee.

Dimitrov, D., Singer, P., Lemmerich, F., and Strohmaier, M.

(2017). What makes a link successful on wikipedia?

In Proceedings of the 26th International Conference

on World Wide Web, WWW ’17, page 917–926, Re-

public and Canton of Geneva, CHE. International

World Wide Web Conferences Steering Committee.

Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N.,

Murphy, K., Strohmann, T., Sun, S., and Zhang, W.

(2014). Knowledge vault: A web-scale approach to

probabilistic knowledge fusion. In Proceedings of

the 20th ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining, pages 601–

610. ACM.

Freitas, A. A. (1998). On objective measures of rule surprisingness. In
Carbonell, J. G., Siekmann, J., Goos, G., Hartmanis, J., Van Leeuwen, J.,
Zytkow, J. M., and Quafafou, M., editors, Principles of Data Mining and
Knowledge Discovery, volume 1510, pages 1–9. Springer Berlin Heidelberg,
Berlin, Heidelberg. Series Title: Lecture Notes in Computer Science.

Guo, Q., Zhuang, F., Qin, C., Zhu, H., Xie, X., Xiong, H., and He, Q. (2020).
A survey on knowledge graph-based recommender systems. volume 34, pages
3549–3568. IEEE.

Hartig, O., Champin, P.-A., Kellogg, G., and Seaborne, A. (2021). RDF-star and
SPARQL-star. W3C Community Group Final Report, W3C.

Hees, J. (2018). Simulating Human Associations with Linked Data. Doctoral
thesis, Technische Universität Kaiserslautern.

Hidi, S. (2006). Interest: A unique motivational variable.

Educational Research Review, 1(2):69–82.

Hilderman, R. J. and Hamilton, H. J. (1999). Knowledge

Discovery and Interestingness Measures: A Survey.

Computer Science, page 28.

Hoffart, J., Suchanek, F. M., Berberich, K., and Weikum,

G. (2013). Yago2: A spatially and temporally en-

hanced knowledge base from wikipedia. Artiﬁcial In-

telligence, 194:28–61.

Hogan, A., Blomqvist, E., Cochez, M., d’Amato, C.,

de Melo, G., Gutierrez, C., Gayo, J. E. L., Kirrane,

S., Neumaier, S., Polleres, A., Navigli, R., Ngomo,

A.-C. N., Rashid, S. M., Rula, A., Schmelzeisen, L.,

Sequeda, J., Staab, S., and Zimmermann, A. (2022).

Knowledge Graphs. ACM Comput. Surv., 54(4):1–37.

arXiv:2003.02320 [cs].

Itti, L. and Baldi, P. (2006). Bayesian surprise attracts hu-

man attention. In Advances in neural information pro-

cessing systems, pages 547–554.

Jolliffe, I. T. and Cadima, J. (2016). Principal compo-

nent analysis: a review and recent developments.

Philosophical Transactions of the Royal Society A:

Mathematical, Physical and Engineering Sciences,

374(2065).

Kotkov, D., Wang, S., and Veijalainen, J. (2016). A survey

of serendipity in recommender systems. Knowledge-

Based Systems, 111.

Lundberg, S. M. and Lee, S.-I. (2017). A uniﬁed ap-

proach to interpreting model predictions. In Guyon, I.,

Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R.,

Vishwanathan, S., and Garnett, R., editors, Advances

in Neural Information Processing Systems 30, pages

4765–4774. Curran Associates, Inc.

Makke, N. and Chawla, S. (2023). Interpretable scientiﬁc

discovery with symbolic regression: A review.

Marra, G., Giannini, F., Diligenti, M., and Gori, M. (2019).

Lyrics: a general interface layer to integrate logic in-

ference and deep learning. In Joint European Confer-

ence on Machine Learning and Knowledge Discovery

in Databases, pages 283–298. Springer.

McCay-Peet, L. and Toms, E. G. (2015). Investigating

serendipity: How it unfolds and what may inﬂuence

it. Journal of the Association for Information Science

and Technology, 66(7):1463–1476.

McGarry, K. (2005). A survey of interestingness measures for knowledge
discovery. The Knowledge Engineering Review, 20(1):39–61.

Mihalcea, R., Corley, C., and Strapparava, C. (2006).

Corpus-based and knowledge-based measures of text

semantic similarity. In AAAI, volume 6.

Milne, D. and Witten, I. H. (2008). Learning to link with

wikipedia. In Proceedings of the 17th ACM Con-

ference on Information and Knowledge Management.

ACM.

Nickel, M., Murphy, K., Tresp, V., and Gabrilovich, E.

(2016). A review of relational machine learning

for knowledge graphs. Proceedings of the IEEE,

104(1):11–33.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999).

The pagerank citation ranking: Bringing order to the

web. Technical Report 1999-66, Stanford InfoLab.

Previous number = SIDL-WP-1999-0120.

Palma, C. (2023). Modelling interestingness: Stories as L-

Systems and Magic Squares. In Text2Story@ECIR,

Dublin (Republic of Ireland).

Palma, C. (2024). Modelling interestingness: a workflow for surprisal-based
knowledge mining in narrative semantic networks. In SEMMES’24: Semantic Methods
for Events and Stories, co-located with the 21st Extended Semantic Web
Conference (ESWC2024).

Paulheim, H. (2013). DBpediaNYD - a silver standard benchmark dataset for
semantic relatedness in DBpedia. In NLP-DBPEDIA@ISWC.

Reisenzein, R., Horstmann, G., and Schützwohl, A. (2019).

The Cognitive-Evolutionary Model of Surprise: A Re-

view of the Evidence. Topics in Cognitive Science,

11(1):50–74.

Ristoski, P. and Paulheim, H. (2016). Semantic web in data

mining and knowledge discovery: A comprehensive

survey. Journal of Web Semantics, 36:1–22.

Schmidhuber, J. (2010). Formal theory of creativity, fun,

and intrinsic motivation (1990–2010). IEEE Trans-

actions on Autonomous Mental Development, 2:230–

247.

Silvia, P. J. (2009). Looking past pleasure: Anger, confu-

sion, disgust, pride, surprise, and other unusual aes-

thetic emotions. Psychology of Aesthetics, Creativity,

and the Arts, 3(1):48–51.

Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T.

(2007). Bias in random forest variable importance

measures: Illustrations, sources and a solution. BMC

Bioinformatics, 8(1).

Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., and

Bouchard, G. (2016). Complex embeddings for sim-

ple link prediction. In International Conference on

Machine Learning, pages 2071–2080.

Vanneschi, L. and Poli, R. (2012). Genetic programming

— introduction, applications, theory and open issues.

In Rozenberg, G., Bäck, T., and Kok, J. N., editors,

Handbook of Natural Computing. Springer, Berlin,

Heidelberg.

Zhu, G. and Iglesias, C. A. (2017). Computing semantic

similarity of concepts in knowledge graphs. IEEE

Transactions on Knowledge and Data Engineering,

29(1):72–85.
